Thanks @tbeason. After a quick look at your code, it appears to read all the lines into memory, identify the blocks of interest, then run CSV.read on each block and merge the results.
What I was asking for was row-by-row reading with CSV.jl, using a filter to process only the lines of interest.
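For reference, something like the following is roughly what that block-based approach looks like (my reconstruction, not the actual code; the read_blocks name, the 'A' prefix, and file holding the raw text are all assumptions for illustration):

using CSV, DataFrames

# Reconstruction sketch: gather contiguous runs of lines starting with 'A',
# parse each run with a single CSV.read, then vertically concatenate.
function read_blocks(file)
    blocks = Vector{Vector{String}}()
    inblock = false
    for line in readlines(IOBuffer(file))
        if startswith(line, "A")
            inblock || push!(blocks, String[])  # a new block starts here
            push!(blocks[end], line)
            inblock = true
        else
            inblock = false
        end
    end
    isempty(blocks) && return nothing
    return reduce(vcat, (CSV.read(IOBuffer(join(b, '\n')), DataFrame, header=false) for b in blocks))
end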
You can do it by hand, by pushing to the DataFrame:
julia> function read_df(file)
           df = nothing
           for line in readlines(IOBuffer(file))
               if line[1] == 'A'
                   if isnothing(df)
                       df = CSV.read(IOBuffer(line), DataFrame, header=false)
                   else
                       push!(df, CSV.read(IOBuffer(line), DataFrame, header=false)[1, :])
                   end
               end
           end
           return df
       end
read_df (generic function with 2 methods)
julia> read_df(file)
2×4 DataFrame
 Row │ Column1  Column2  Column3  Column4
     │ String1  Int64    Float64  Time
─────┼─────────────────────────────────────
   1 │ A              2      1.5  20:31:15
   2 │ A              0      0.5  22:57:00
Thanks @lmiq, it looks good but not as simple as one might have hoped.
The following is a bit shorter but still not ideal:
using CSV, DataFrames, Dates

# Build an empty DataFrame with the desired schema, then append the
# matching rows one by one.
cnames = [:Code, :Counter, :Value, :Time]
types = [String[], Int[], Float64[], Time[]]
df = DataFrame([name => type for (name, type) in zip(cnames, types)])
for line in readlines(IOBuffer(file))
    line[1] == 'A' && append!(df, CSV.read(IOBuffer(line), DataFrame, header=cnames))
end
Ah I see, I misunderstood. Sorry for the misdirection.
CSV.jl does have a comment argument. Perhaps it could be faster to read all the data except the comment lines, and then use subset to drop the rows that do not start with A? The memory footprint will obviously be somewhat larger than with a line-by-line method, but by reading the entire file you can leverage CSV.jl's multithreaded and optimized parsing.
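Something along these lines, as a minimal sketch (it assumes comment lines start with '#', that file holds the raw text, and that with header=false the code letter lands in a column named Column1, as in the output above):

using CSV, DataFrames

# Parse the whole file in a single pass, skipping comment lines,
# then keep only the rows whose first column is "A".
df = CSV.read(IOBuffer(file), DataFrame; header=false, comment="#")
subset!(df, :Column1 => ByRow(==("A")))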