Read text file containing proper CSV data chunks

I have a text file that contains comma-separated entries, and I am interested only in the rows that begin with the character A:

file = """
C comment-1
B,9,5.5,20:30:33
A,2,1.5,20:31:15
C comment-2
A,0,0.5,22:57:00
C comment-3
"""
str = filter(x -> first(x) == 'A', readlines(IOBuffer(file)))

The above command extracts the rows of interest.

How can we streamline this using CSV.jl to read the filtered rows of interest?

1 Like

Maybe this?

julia> CSV.read(IOBuffer(join(str, "\n")), DataFrame, header=false)
2Γ—4 DataFrame
 Row β”‚ Column1  Column2  Column3  Column4  
     β”‚ String1  Int64    Float64  Time     
─────┼─────────────────────────────────────
   1 β”‚ A              2      1.5  20:31:15
   2 β”‚ A              0      0.5  22:57:00
1 Like

Thanks @lmiq, that was nice!

I would also be interested in a β€œrow-by-row” CSV.read solution, if such a possibility exists.

I implemented something like that in FamaFrenchData.jl

Thanks @tbeason. After a quick look at your code, it appears to read all the lines in memory, identify the blocks of interest, then perform CSV.read by block followed by a merge.

What I was asking for was a row-by-row CSV.jl reading with a filter to process only the lines of interest.

1 Like

You can use CSV.rows but it’s less efficient and convenient than CSV.read.

There aren’t many examples of using CSV.Rows, do you know of any that could be a source of inspiration?

No. :-/

You can do it by hand, by pushing to the DataFrame:

julia> function read_df(file)
           df = nothing
           for line in readlines(IOBuffer(file))
       	if line[1] == 'A'
                   if isnothing(df)
       		    df = CSV.read(IOBuffer(line), DataFrame, header=false)
                   else
                       push!(df, CSV.read(IOBuffer(line), DataFrame, header=false)[1,:])
                   end
               end
           end
           return df
       end
read_df (generic function with 2 methods)

julia> read_df(file)
2Γ—4 DataFrame
 Row β”‚ Column1  Column2  Column3  Column4  
     β”‚ String1  Int64    Float64  Time     
─────┼─────────────────────────────────────
   1 β”‚ A              2      1.5  20:31:15
   2 β”‚ A              0      0.5  22:57:00

(sorry for the badly indented code, my terminal went crazy)

2 Likes

Thanks @lmiq, it looks good but not as simple as one might have hoped.

The following is a bit shorter but still not ideal:

cnames = [:Code, :Counter, :Value, :Time]
types = [String[], Int[], Float64[], Time[]] 
df = DataFrame([name => type for (name,type) in zip(cnames,types)])
for line in readlines(IOBuffer(file))
    line[1] == 'A' && append!(df, CSV.read(IOBuffer(line), DataFrame, header=cnames))
end

Ah I see, I misunderstood. Sorry for the misdirection.

CSV.jl does have a comment argument. Perhaps it could be faster to read all the data, except the comment lines, and then use subset to filter out all the lines that do not start with A? Obviously the memory footprint will be a little larger than a line-by-line method but by reading the entire file you can leverage the multithreaded and optimized capabilities of CSV.jl.

1 Like

This works great but only filters β€˜C’, what one would need is comment!='A'