I have data where each row looks like this:
49, 127 97 136;
CSV.read nicely imports it to a DataFrame like this:
49 127 97 136;
But I don't want the semicolon. Can this be imported in a better way?
Thanks!
A better solution would be to work completely within Julia, but just to get things done, maybe something like this:
julia> using DataFrames, CSV
julia> CSV.read(`sed 's/[;,]//g' testdata`, DataFrame; header=false)
1×4 DataFrame
 Row │ Column1  Column2  Column3  Column4
     │ Int64    Int64    Int64    Int64
─────┼────────────────────────────────────
   1 │      49      127       97      136
can do the trick. It uses sed from Linux to replace the unwanted characters (Linux's tr could also be used).
testdata contains the textual data in the post (i.e. 49, 127 97 136;)
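The same cleanup can also be done without shelling out. A minimal pure-Julia sketch (the testdata file name and its single-line content are taken from the post; passing delim=' ' explicitly is an assumption to avoid relying on delimiter auto-detection):

```julia
using DataFrames, CSV

# Recreate the one-line input file from the post
write("testdata", "49, 127 97 136;\n")

# Strip the unwanted commas and semicolons in Julia itself,
# then parse the cleaned text from an in-memory buffer
cleaned = replace(read("testdata", String), r"[;,]" => "")
df = CSV.read(IOBuffer(cleaned), DataFrame; header=false, delim=' ')
```

This keeps everything portable (no dependency on sed or tr being installed) at the cost of reading the whole file into memory first.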
This is ambiguous because it doesn't show how the DataFrame is splitting the text into columns, and we don't have a minimal text file to reproduce it. Is this aggregate string column really what you want?
julia> write("sample.txt",
"""
49, 127 97 136;
249, 27 1497 1312;
""")
35
julia> using DataFrames, CSV
julia> sample = CSV.read("sample.txt", DataFrame; header=false)
2×2 DataFrame
 Row │ Column1  Column2
     │ Int64    String15
─────┼─────────────────────────
   1 │      49  127 97 136;
   2 │     249  27 1497 1312;
The problem is most people would expect integer columns, which is how it's parsed after Dan removed the , and ;, but that would require multiple column delimiters. CSVs traditionally delimit columns with commas (',') and rows with newlines ('\n', "\r\n", '\r'). Part of the problem here is that CSV.read doesn't support a custom row delimiter (which is odd because CSV.write has newline= and DelimitedFiles.readdlm has eol::AbstractChar=), but more flexible parsers still often expect 1 column delimiter like CSV.read's delim= does. There are parsers that support regex patterns for multiple delimiters, but CSV.read does not. It's possible to parse rows/lines yourself (EDIT: see next comment by technocrat), whether as UTF-8 Strings or raw bytes, then push! the results to a DataFrame. However, established libraries tend to provide optimizations, like CSV.read's multithreading, that are annoying to fully reimplement.
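As a rough sketch of that manual approach (assuming the two-line sample.txt from above, and assuming every row has exactly four integer fields): split each line on a small set of delimiters and push! the parsed integers onto a pre-typed DataFrame.

```julia
using DataFrames

# Recreate the two-line sample file from earlier in the thread
write("sample.txt", "49, 127 97 136;\n249, 27 1497 1312;\n")

# Pre-typed, empty DataFrame with four Int columns
df = DataFrame([Int[] for _ in 1:4], [:Column1, :Column2, :Column3, :Column4])

# Treat commas, semicolons, and whitespace all as field separators
for line in eachline("sample.txt")
    push!(df, parse.(Int, split(line, r"[,;\s]+"; keepempty=false)))
end
```

This loses CSV.read's multithreading and type inference, but it handles the mixed delimiters directly.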
If that string column is really intended, the string processing as stated is simple:
julia> sample.Column2 .= chop.(sample.Column2); sample
2×2 DataFrame
 Row │ Column1  Column2
     │ Int64    String15
─────┼────────────────────────
   1 │      49  127 97 136
   2 │     249  27 1497 1312
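If integer columns are the real goal, a follow-up sketch is to split the string column ourselves and spread it into new columns (this assumes every row has the same number of fields; the x1, x2, x3 names come from DataFrames' :auto column naming):

```julia
using DataFrames, CSV

# Recreate the two-column DataFrame from the earlier CSV.read call
write("sample.txt", "49, 127 97 136;\n249, 27 1497 1312;\n")
sample = CSV.read("sample.txt", DataFrame; header=false)

# chop the trailing ';', split on whitespace, and parse each piece
parts = [parse.(Int, split(chop(s))) for s in sample.Column2]

# One row per original row: stack the parsed vectors into a matrix,
# then glue the integer columns onto Column1
wide = hcat(DataFrame(Column1 = sample.Column1),
            DataFrame(permutedims(reduce(hcat, parts)), :auto))
```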
The trailing ; is not the only way this problem will present with real-world data (i.e., data that has passed through a spreadsheet). It's worth taking a whack at solving the problem in greater generality. Here is an approach modeled on awk's record/field handling.
using Test
function parse_record(record; terminator='\n')
    s = record isa AbstractString ? record : join(record, " ")
    # Strip the configurable record terminator (if one is provided)
    if terminator !== nothing
        s = replace(s, terminator => "")
    end
    # 1. Normalize comma thousands-separators to underscores before splitting
    s = replace(s, r"(?<=\d),(?=\d)" => "_")
    # 2. Split into fields using remaining commas or whitespace
    fields = split(s, r"[,\s]+", keepempty=false)
    # 3. Parse each field
    map(fields) do f
        clean_f = replace(f, "_" => "")
        n = tryparse(Int, clean_f)
        n !== nothing && return n
        n = tryparse(Float64, clean_f)
        n !== nothing && return n
        error("Cannot parse $(repr(f)) as a number")
    end
end
parse_record (generic function with 1 method)
# ── Test harness
@testset "parse_record test harness" begin
    # Updated to use default '\n' terminator
    @test parse_record("49, 127 97 136\n") == [49, 127, 97, 136]
    @test parse_record([49, 127, 97, "136\n"]) == [49, 127, 97, 136]
    # Explicitly test the keyword argument ';'
    @test parse_record("49, 127 97 136;", terminator=';') == [49, 127, 97, 136]
    # Test an arbitrary string as a terminator
    @test parse_record("10, 20, 30|EOF", terminator="|EOF") == [10, 20, 30]
    @test parse_record(" 42 7 ") == [42, 7]
    @test parse_record("-42") == [-42]
    @test parse_record("1_000_000") == [1_000_000]
    @test parse_record("1e5") == [100_000.0]
    @test parse_record("2.5E-2") == [0.025]
    @test parse_record("1e5, -42, 3.14") == [1e5, -42, 3.14]
    @test parse_record("1,2,3") == [123]
    @test parse_record("1,234") == [1234]
    @test parse_record("1 , 2 , 3") == [1, 2, 3]
    @test parse_record("99") == [99]
    @test parse_record("1,000") == [1000]
    @test parse_record("1, 2, 3") == [1, 2, 3]
    @test parse_record("1,000, 2,000,000") == [1000, 2_000_000]
    @test parse_record(['4', '9']) == [4, 9]
    @test_throws ErrorException("Cannot parse \"abc\" as a number") parse_record("1 abc 3")
end
Test Summary:             | Pass  Total  Time
parse_record test harness |   19     19  0.2s