CSV.read with two delimiters

I have data where each row looks like this:
49, 127 97 136;

CSV.read nicely imports it to a DataFrame like this:
49 127 97 136;
But I don’t want the semicolon. Can this be imported in a better way?

Thanks!

A better solution would be to work completely within Julia, but just to get things done, maybe something like this:

julia> using DataFrames, CSV

julia> CSV.read(`sed 's/[;,]//g' testdata`, DataFrame; header=false)
1×4 DataFrame
 Row │ Column1  Column2  Column3  Column4 
     │ Int64    Int64    Int64    Int64   
─────┼────────────────────────────────────
   1 │      49      127       97      136

can do the trick. It uses Linux’s sed to delete the unwanted characters (tr could also be used).

testdata contains the textual data in the post (i.e. 49, 127 97 136;)
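Staying entirely within Julia is also straightforward: read the file into a string, strip the unwanted characters with `replace`, and hand the result to `CSV.read` through an `IOBuffer`. A minimal sketch (the string literal stands in for `read("testdata", String)`):

```julia
using CSV, DataFrames

# Stand-in for: raw = read("testdata", String)
raw = "49, 127 97 136;\n"

# Delete semicolons and commas, leaving space-separated numbers
cleaned = replace(raw, r"[;,]" => "")

# Parse with space as the delimiter, collapsing repeated spaces
df = CSV.read(IOBuffer(cleaned), DataFrame;
              header=false, delim=' ', ignorerepeated=true)
```

This avoids the external process entirely, at the cost of holding the whole file in memory as a string.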


This is ambiguous because it doesn’t show how the DataFrame is splitting the text into columns, and we don’t have a minimal text file to reproduce it. Is this aggregate string column really what you want?

julia> write("sample.txt",
       """
       49, 127 97 136;
       249, 27 1497 1312;
       """)
35

julia> using DataFrames, CSV

julia> sample = CSV.read("sample.txt", DataFrame; header=false)
2×2 DataFrame
 Row │ Column1  Column2
     │ Int64    String15
─────┼─────────────────────────
   1 │      49   127 97 136;
   2 │     249   27 1497 1312;

The problem is most people would expect integer columns, which is how it’s parsed after Dan removed the , and ;, but that would require multiple column delimiters. CSVs traditionally delimit columns with commas (',') and rows with newlines ('\n', "\r\n", '\r').

Part of the problem here is that CSV.read doesn’t support a custom row delimiter (which is odd, because CSV.write has newline= and DelimitedFiles.readdlm has eol::AbstractChar=), and even more flexible parsers often still expect a single column delimiter, as CSV.read’s delim= does. There are parsers that support regex patterns for multiple delimiters, but CSV.read does not.

It’s possible to parse rows/lines yourself (EDIT: see next comment by technocrat), whether as UTF-8 String-s or direct bytes, then push! the results to a DataFrame. However, established libraries tend to provide optimizations, like CSV.read’s multithreading, that are annoying to fully reimplement.
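Parsing the rows yourself can be quite short: split each line on a regex covering all the delimiters and push! into a pre-built DataFrame. A sketch, assuming every row holds exactly four integers (the `lines` vector stands in for `eachline("sample.txt")`):

```julia
using DataFrames

# Stand-in for: lines = eachline("sample.txt")
lines = ["49, 127 97 136;", "249, 27 1497 1312;"]

# Pre-build an empty DataFrame with the expected integer columns
df = DataFrame(Column1=Int[], Column2=Int[], Column3=Int[], Column4=Int[])

for line in lines
    # Commas, semicolons, and whitespace all act as delimiters here
    fields = split(line, r"[,;\s]+"; keepempty=false)
    push!(df, parse.(Int, fields))
end
```

This single-threaded loop gives up CSV.read’s optimizations, but for small files it is the simplest route to integer columns.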

If that string column is really intended, the string processing as stated is simple:

julia> sample.Column2 .= chop.(sample.Column2); sample
2×2 DataFrame
 Row │ Column1  Column2
     │ Int64    String15
─────┼────────────────────────
   1 │      49   127 97 136
   2 │     249   27 1497 1312
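If integer columns were the real goal after all, the aggregate string column can still be split and parsed after the fact. A sketch along those lines, rebuilding the sample in-memory and assuming every row has the same number of values (column names Column2–Column4 are my choice, not CSV.read’s output):

```julia
using CSV, DataFrames

# Rebuild the sample from the post in memory
sample = CSV.read(IOBuffer("49, 127 97 136;\n249, 27 1497 1312;\n"),
                  DataFrame; header=false)

# chop drops the trailing ';', split handles any surrounding whitespace
parts = [parse.(Int, split(chop(s))) for s in sample.Column2]

# Stack the per-row vectors into a matrix, one row per original line
ints = DataFrame(permutedims(reduce(hcat, parts)),
                 [:Column2, :Column3, :Column4])
wide = hcat(sample[:, [:Column1]], ints)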


The trailing ; is not the only way this problem will present with real-world data (i.e., data that has passed through a spreadsheet). It’s worth taking a whack at solving the problem in greater generality. Here is an awk-inspired approach.

using Test

function parse_record(record; terminator='\n')
    s = record isa AbstractString ? record : join(record, " ")

    # Strip the configurable record terminator (if one is provided)
    if terminator !== nothing
        s = replace(s, terminator => "")
    end

    # 1. Normalize comma thousands-separators to underscores before splitting
    s = replace(s, r"(?<=\d),(?=\d)" => "_")

    # 2. Split into fields using remaining commas or whitespace
    fields = split(s, r"[,\s]+", keepempty=false)

    # 3. Parse each field
    map(fields) do f
        clean_f = replace(f, "_" => "")

        n = tryparse(Int, clean_f)
        n !== nothing && return n
        n = tryparse(Float64, clean_f)
        n !== nothing && return n
        error("Cannot parse $(repr(f)) as a number")
    end
end

# ── Test harness
@testset "parse_record test harness" begin
    # Updated to use default '\n' terminator
    @test parse_record("49, 127 97 136\n") == [49, 127, 97, 136]
    @test parse_record([49, 127, 97, "136\n"]) == [49, 127, 97, 136]

    # Explicitly test the keyword argument ';'
    @test parse_record("49, 127 97 136;", terminator=';') == [49, 127, 97, 136]

    # Test an arbitrary string as a terminator
    @test parse_record("10, 20, 30|EOF", terminator="|EOF") == [10, 20, 30]

    @test parse_record("  42  7  ") == [42, 7]
    @test parse_record("-42") == [-42]
    @test parse_record("1_000_000") == [1_000_000]
    @test parse_record("1e5") == [100_000.0]
    @test parse_record("2.5E-2") == [0.025]
    @test parse_record("1e5, -42, 3.14") == [1e5, -42, 3.14]

    # Commas between digits are thousands separators, not delimiters
    @test parse_record("1,2,3") == [123]
    @test parse_record("1,234") == [1234]
    @test parse_record("1  ,  2  ,  3") == [1, 2, 3]
    @test parse_record("99") == [99]
    @test parse_record("1,000") == [1000]
    @test parse_record("1, 2, 3") == [1, 2, 3]
    @test parse_record("1,000, 2,000,000") == [1000, 2_000_000]
    @test parse_record(['4', '9']) == [4, 9]

    @test_throws ErrorException("Cannot parse \"abc\" as a number") parse_record("1 abc 3")
end
Test Summary:             | Pass  Total  Time
parse_record test harness |   19     19  0.2s
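To connect this back to the original question, each line of the file can be fed through a record parser and the results stacked into a DataFrame. A sketch, with a condensed Int-only variant (named `parse_record_ints` so it doesn’t clash with the full version above; it skips the thousands-separator handling):

```julia
using DataFrames

# Condensed, hypothetical Int-only variant of parse_record (';' terminator)
parse_record_ints(s; terminator=';') =
    parse.(Int, split(replace(s, terminator => ""), r"[,\s]+"; keepempty=false))

# Stand-in for: lines = collect(eachline("sample.txt"))
lines = ["49, 127 97 136;", "249, 27 1497 1312;"]

rows = parse_record_ints.(lines)

# One matrix row per record; :auto names the columns x1, x2, ...
df = DataFrame(permutedims(reduce(hcat, rows)), :auto)
```

This assumes every record parses to the same number of fields; ragged rows would need per-column handling before the `reduce(hcat, …)` step.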