Thanks @tbeason. After a quick look at your code, it appears to read all the lines into memory, identify the blocks of interest, then run CSV.read on each block and merge the results.
What I was asking for was row-by-row reading with CSV.jl, using a filter to process only the lines of interest.
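For reference, something like the following is roughly what that block-based approach looks like (my reconstruction, not the actual code; the read_blocks name, the 'A' prefix, and file holding the raw text are all assumptions for illustration):

using CSV, DataFrames

# Reconstruction sketch: gather contiguous runs of lines starting with 'A',
# parse each run with a single CSV.read, then vertically concatenate.
function read_blocks(file)
    blocks = Vector{Vector{String}}()
    inblock = false
    for line in readlines(IOBuffer(file))
        if startswith(line, "A")
            inblock || push!(blocks, String[])  # a new block starts here
            push!(blocks[end], line)
            inblock = true
        else
            inblock = false
        end
    end
    isempty(blocks) && return nothing
    return reduce(vcat, (CSV.read(IOBuffer(join(b, '\n')), DataFrame, header=false) for b in blocks))
end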
You can do it by hand, by pushing to the DataFrame:
julia> function read_df(file)
           df = nothing
           for line in readlines(IOBuffer(file))
               if line[1] == 'A'
                   if isnothing(df)
                       df = CSV.read(IOBuffer(line), DataFrame, header=false)
                   else
                       push!(df, CSV.read(IOBuffer(line), DataFrame, header=false)[1, :])
                   end
               end
           end
           return df
       end
read_df (generic function with 2 methods)
julia> read_df(file)
2×4 DataFrame
 Row │ Column1  Column2  Column3  Column4
     │ String1  Int64    Float64  Time
─────┼─────────────────────────────────────
   1 │ A              2      1.5  20:31:15
   2 │ A              0      0.5  22:57:00
Thanks @lmiq, it looks good but not as simple as one might have hoped.
The following is a bit shorter but still not ideal:
using CSV, DataFrames, Dates

# Build an empty DataFrame with the desired schema, then append the
# matching rows one by one.
cnames = [:Code, :Counter, :Value, :Time]
types = [String[], Int[], Float64[], Time[]]
df = DataFrame([name => type for (name, type) in zip(cnames, types)])
for line in readlines(IOBuffer(file))
    line[1] == 'A' && append!(df, CSV.read(IOBuffer(line), DataFrame, header=cnames))
end
Ah I see, I misunderstood. Sorry for the misdirection.
CSV.jl does have a comment argument. Perhaps it could be faster to read all the data except the comment lines, and then use subset to drop the rows that do not start with A? The memory footprint will obviously be somewhat larger than with a line-by-line method, but by reading the entire file you can leverage CSV.jl's multithreaded and optimized parsing.
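Something along these lines, as a minimal sketch (it assumes comment lines start with '#', that file holds the raw text, and that with header=false the code letter lands in a column named Column1, as in the output above):

using CSV, DataFrames

# Parse the whole file in a single pass, skipping comment lines,
# then keep only the rows whose first column is "A".
df = CSV.read(IOBuffer(file), DataFrame; header=false, comment="#")
subset!(df, :Column1 => ByRow(==("A")))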