CSV.write() to Unix Pipe (e.g., lz4 or bzip2)

I am wondering whether CSV.write() can write through a pipe to a compressor other than gzip.


julia> using DataFrames, CSV

julia> textbz2= open("test.txt.bz2", "w")
IOStream(<file test.txt.bz2>)

julia> open( `bzip2 -c`, "w", textbz2 ) do fo; println(fo, "hello"); end#do

julia> close(textbz2)

julia> x1= vcat(99,collect(1:2:9)); df= DataFrame( n1=x1, n2=x1.^2, n3=string.(collect('a':'f')), n4=sin.(x1) );

julia> csvbz2= open("test.csv.bz2", "w")
IOStream(<file test.csv.bz2>)

julia> open( `bzip2 -c`, "w", csvbz2 ) do fo; CSV.write(fo, df); end#do
ERROR: MethodError: no method matching seek(::Base.Process, ::Int64)
Closest candidates are:
  seek(::IOStream, ::Integer) at iostream.jl:100
  seek(::Base.GenericIOBuffer, ::Integer) at iobuffer.jl:241
  seek(::Base.Libc.FILE, ::Integer) at libc.jl:96
  ...
Stacktrace:
 [1] seekstart(::Base.Process) at ./iostream.jl:126
 [2] with(::getfield(CSV, Symbol("##53#55")){Char,Char,String,Nothing,Bool,Tables.Schema{(:n1, :n2, :n3, :n4),Tuple{Int64,Int64,String,Float64}},Tables.RowIterator{NamedTuple{(:n1, :n2, :n3, :n4),Tuple{Array{Int64,1},Array{Int64,1},Array{String,1},Array{Float64,1}}}},UInt8,UInt8,UInt8,NTuple{4,Symbol},Int64}, ::Base.Process, ::Bool) at /Users/ivo/.julia/packages/CSV/uLyo0/src/write.jl:21
 [3] #write#52(::Char, ::Char, ::Nothing, ::Nothing, ::Char, ::String, ::Nothing, ::Bool, ::Bool, ::Array{String,1}, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(CSV.write), ::Tables.Schema{(:n1, :n2, :n3, :n4),Tuple{Int64,Int64,String,Float64}}, ::Tables.RowIterator{NamedTuple{(:n1, :n2, :n3, :n4),Tuple{Array{Int64,1},Array{Int64,1},Array{String,1},Array{Float64,1}}}}, ::Base.Process) at /Users/ivo/.julia/packages/CSV/uLyo0/src/write.jl:139
 [4] write(::Tables.Schema{(:n1, :n2, :n3, :n4),Tuple{Int64,Int64,String,Float64}}, ::Tables.RowIterator{NamedTuple{(:n1, :n2, :n3, :n4),Tuple{Array{Int64,1},Array{Int64,1},Array{String,1},Array{Float64,1}}}}, ::Base.Process) at /Users/ivo/.julia/packages/CSV/uLyo0/src/write.jl:135
 [5] #write#51(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Base.Process, ::DataFrame) at /Users/ivo/.julia/packages/CSV/uLyo0/src/write.jl:111
 [6] write(::Base.Process, ::DataFrame) at /Users/ivo/.julia/packages/CSV/uLyo0/src/write.jl:109
 [7] (::getfield(Main, Symbol("##5#6")))(::Base.Process) at ./REPL[7]:1
 [8] open(::getfield(Main, Symbol("##5#6")), ::Cmd, ::String, ::IOStream) at ./process.jl:617
 [9] top-level scope at none:0

Why would a CSV writer (not reader) need to seek? Or am I making a different mistake altogether and just hitting a misleading error message?

Can’t you see from the stacktrace where the error is thrown and figure out why it is doing the seek?

I am staring at the complete error stacktrace now (and I have updated my original post above to fix a typo and to include the stacktrace). I am guessing that my open is wrong?! Does it return a Process instead of an IOStream?

Alas, it is not clear at all to me how to open this the correct way. I also tried variations, like julia> fo= open( `bzip2 -c > ab.csv.gz`, "w" ), but this fails, too, as does

julia> fo= open( pipeline(`bzip2 -c`, "ab.csv.gz"), "w" )
Process(`bzip2 -c`, ProcessRunning)

julia> CSV.write( fo, df )
ERROR: MethodError: no method matching seek(::Base.Process, ::Int64)
Closest candidates are:
  seek(::IOStream, ::Integer) at iostream.jl:100
  seek(::Base.GenericIOBuffer, ::Integer) at iobuffer.jl:241
  seek(::Base.Libc.FILE, ::Integer) at libc.jl:96
  ...
Stacktrace:
 [1] seekstart(::Base.Process) at ./iostream.jl:126
 [2] with(::getfield(CSV, Symbol("##53#55")){Char,Char,String,Nothing,Bool,Tables.Schema{(:n1, :n2, :n3, :n4),Tuple{Int64,Int64,String,Float64}},Tables.RowIterator{NamedTuple{(:n1, :n2, :n3, :n4),Tuple{Array{Int64,1},Array{Int64,1},Array{String,1},Array{Float64,1}}}},UInt8,UInt8,UInt8,NTuple{4,Symbol},Int64}, ::Base.Process, ::Bool) at /Users/ivo/.julia/packages/CSV/uLyo0/src/write.jl:21
 [3] #write#52(::Char, ::Char, ::Nothing, ::Nothing, ::Char, ::String, ::Nothing, ::Bool, ::Bool, ::Array{String,1}, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(CSV.write), ::Tables.Schema{(:n1, :n2, :n3, :n4),Tuple{Int64,Int64,String,Float64}}, ::Tables.RowIterator{NamedTuple{(:n1, :n2, :n3, :n4),Tuple{Array{Int64,1},Array{Int64,1},Array{String,1},Array{Float64,1}}}}, ::Base.Process) at /Users/ivo/.julia/packages/CSV/uLyo0/src/write.jl:139
 [4] write(::Tables.Schema{(:n1, :n2, :n3, :n4),Tuple{Int64,Int64,String,Float64}}, ::Tables.RowIterator{NamedTuple{(:n1, :n2, :n3, :n4),Tuple{Array{Int64,1},Array{Int64,1},Array{String,1},Array{Float64,1}}}}, ::Base.Process) at /Users/ivo/.julia/packages/CSV/uLyo0/src/write.jl:135
 [5] #write#51(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Base.Process, ::DataFrame) at /Users/ivo/.julia/packages/CSV/uLyo0/src/write.jl:111
 [6] write(::Base.Process, ::DataFrame) at /Users/ivo/.julia/packages/CSV/uLyo0/src/write.jl:109
 [7] top-level scope at none:0

Apologies… how would I do this? Somehow, I need to change the Process into an IOStream.

What I meant was that you can see from the stacktrace where the error is thrown from (CSV/src/write.jl:21).

So, looking at that line, we can see that it tries to seek to the start of the io if the append keyword is false. This is why it fails: there is no seek method for a Base.Process.
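To make the difference concrete, here is a minimal Base-only sketch (no CSV.jl involved) that checks which stream types have a seek method. The variable names are just for illustration. The MethodError above shows that the Process check came out false in the Julia version used in this thread; I have not verified other versions.

```julia
# Base only; no packages needed.
buf = IOBuffer()
write(buf, "hello")
seekstart(buf)              # works: IOBuffer has a seek method
text = read(buf, String)    # reads the buffer back from the start

# hasmethod only asks whether a matching method exists for these argument types.
buffer_can_seek  = hasmethod(seek, Tuple{IOBuffer, Int})
process_can_seek = hasmethod(seek, Tuple{Base.Process, Int})

println("IOBuffer has seek: ", buffer_can_seek)
println("Process has seek:  ", process_can_seek)
```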

From the docs to CSV.write there is

* `append=false`: whether to append writing to an existing file/IO, if `true`, it will not write column names by default

so you could try passing append=true, which skips the seek:

julia> open( `bzip2 -c`, "w", csvbz2 ) do fo; CSV.write(fo, df; append=true); end#do
Process(`bzip2 -c`, ProcessExited(0))

This won’t get you all the way, because with append=true the column headers are not written, but a PR to avoid the seekstart on a Process seems easy enough.
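Until that is fixed, one possible workaround is to write the header line by hand and then let CSV.write append the rows. This is only a sketch: `write_csv_via_pipe` is a hypothetical helper name, the header is joined with plain commas (so column names that need quoting or escaping are not handled), and it assumes `bzip2` is on the PATH.

```julia
using CSV, DataFrames

# Hypothetical helper: stream `df` as CSV through an external compressor into `path`.
function write_csv_via_pipe(path::AbstractString, df::DataFrame; cmd::Cmd = `bzip2 -c`)
    open(path, "w") do out
        # With mode "w", `fo` is the process's stdin and `out` receives its stdout.
        open(cmd, "w", out) do fo
            # Header written manually; names are NOT quoted or escaped here.
            println(fo, join(names(df), ','))
            # append=true avoids the seekstart that fails on a Process.
            CSV.write(fo, df; append=true)
        end
    end
    return path
end

df = DataFrame(n1 = [1, 3, 5], n2 = ["a", "b", "c"])
write_csv_via_pipe("test.csv.bz2", df)
```

The result can be checked from the shell with `bzip2 -dc test.csv.bz2`. Swapping in another filter such as `lz4 -c` should only require changing `cmd`, assuming that tool reads stdin and writes stdout in the same way.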

Thanks, Kristoffer. This looks like a bug to me. Don’t laugh… I do not know how to file a PR, as I have barely used git myself. But this is easy to describe as an issue, which I am happy to file.

Just open a bug report saying that seekstart shouldn’t be called on Process and it should be an easy fix for the package author.


It’d be nice if there was an official IO interface, so users could count on things like seekstart being defined or not. Base.Process says it’s an IO, but it seems like seeking is not a required part of the IO interface.
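In the absence of such an interface, one way a writer could protect itself is to rewind only when a seek method exists for the stream. A rough sketch follows; `maybe_seekstart` is a hypothetical name and this is not necessarily how CSV.jl handles it. Note that `applicable` only checks that a method exists, so a stream whose seek method exists but fails at runtime would still throw.

```julia
# Hypothetical guard: rewind the stream only if a seek method exists for it.
function maybe_seekstart(io::IO)
    applicable(seek, io, 0) && seekstart(io)
    return io
end

buf = IOBuffer()
write(buf, "a,b\n1,2\n")
maybe_seekstart(buf)            # IOBuffer supports seek, so this rewinds
contents = read(buf, String)    # reads everything written above
```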

Anyway, yes, this is an easy fix in CSV.jl, so I’ll push a fix soon.
