CSV.jl type stability

I want to convert the rows of a CSV file into custom types in my own code. I have defined the Base.convert methods I need:

Base.convert(::Type{MyCustomType}, row::CSV.Row) = begin
    MyCustomType(
        a = row.a, 
        b = row.b, 
        ...
    )
end

However, I notice this is fairly slow. I am a Julia beginner, but to me this seems like a type stability issue (i.e. the types of row.a and row.b are not known at compile time; a minimal check is sketched below, after the questions).

So my questions are:

1.) Is this a type stability issue?
2.) How can I make it type stable to increase performance?
3.) Will I need to build a custom parser that hardcodes the format/types of the data I am using for good performance? Is there a metaprogramming package that can do that for me?
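
A quick way to confirm this kind of suspicion is @code_warntype; here is a minimal sketch, with a stripped-down stand-in struct and a placeholder file path:

using CSV, InteractiveUtils

struct MyToyType                         # stand-in for the real type, just two columns
    a::Int64
    b::Float64
end
Base.convert(::Type{MyToyType}, row::CSV.Row) = MyToyType(row.a, row.b)

row = first(CSV.File("data.csv"))        # "data.csv" is a placeholder
@code_warntype convert(MyToyType, row)   # row.a and row.b show up as ::Any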

the CSV data is stored as columns, so you probably don’t want to iterate row by row:

try StructArrays.jl (GitHub - JuliaArrays/StructArrays.jl: Efficient implementation of struct arrays in Julia)
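
For example, something along these lines (just a sketch; it assumes the file fits in memory and has columns a and b, and do_something is a placeholder):

using CSV, Tables, StructArrays

tbl  = Tables.columntable(CSV.File("file.csv"))  # NamedTuple of concretely typed column vectors
rows = StructArray(tbl)                          # presents the columns as rows, no copying

for r in rows
    # r is a NamedTuple whose field types the compiler knows
    do_something(r.a, r.b)
end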

@jling my workflow is to yield rows through iteration because I am working with many very large files. The rows are split across these files, and sometimes even an individual file is too big for RAM. So I will be streaming data through the whole application, which seems easier to do on a row-by-row basis.

On the topic of type stability: is the function above type stable?

why are they in CSV then?? you should use Arrow.jl

They get downloaded from the API in .csv.gz format. And I don’t want to save the data locally except for a small cache.

The individual files have been fitting in RAM on my laptop, but if I move to using some small EC2 instances for some of this stuff then I may run into issues. I haven’t done that yet, however.

Regardless I am not looking for a new solution right now. Just trying to understand the type behavior of this function.

well, then the first thing you should do after downloading is to convert them to .arrow; CSV is just not suitable for this in multiple ways.
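
Something along these lines (a sketch; the paths are placeholders):

using CSV, Arrow

Arrow.write("data.arrow", CSV.File("data.csv"))  # one-time conversion; Arrow.write accepts any Tables.jl source
tbl = Arrow.Table("data.arrow")                  # memory-mapped, concretely typed columns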

For row iteration speed, you want TypedTables.jl

DataFrames are inherently type unstable: each column can be anything, and its type can be changed from one moment to the next:

d = DataFrame(foo=["A","B","C"])
d.foo = [1,2,3] #perfectly valid

If you know something that the compiler doesn’t (like that the data files always have a Float64, an Int64, and a String…), you can add some type declarations; but if one of these files is borked and has some invalid characters, you’ll have to be able to catch that error.
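
For instance (a sketch; the record and field names are made up):

row = (a = "oops", b = 2.5)       # imagine a borked record
try
    a = row.a::Int64              # TypeError here: the declaration catches the bad value
catch err
    @error "bad record" err row
end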

Maybe, but these are apparently very big, and he has to read the whole thing once either way… if he just calculates what he’s interested in, then he’s done. If he has to reread this CSV file over and over, caching it as .arrow would make sense, but if it’s read-once + calculate stats, then the arrow file doesn’t make sense.

oh whoops, I see you aren’t reading it into a DataFrame… you’re just reading rows… yes, those are type unstable too. It’s valid to have a CSV like this:

a,b,c
1,2,3
1.2,3.3,4.4

The format of the files I am working with is standard, and I know the types of each column. I am not too worried about there being a messed-up entry because:

a. This is for experimentation/research, not production. A messed-up value should throw an error that I don’t really need to recover from (I just want a good error message).

b. The data is industry grade and should be fairly reliable in that sense.

I see CSV allows you to specify types while reading, but that still wouldn’t make the above type stable because, as you said, someone could change it at any time.

Is there a CSV parser that takes the input types, unsafely parses the raw bytes into those types, and then just gives back tuples or some other type-stable construct for rows?

that doesn’t make any difference; the final layer dealing with parsing and bytes is still unstable, you just have a function barrier that shields it. You should be able to achieve this with the existing CSV.jl anyway.
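
Roughly, a function barrier looks like this (the struct and names are just illustrative):

struct Record                    # illustrative record type
    a::Int64
    b::Float64
    c::String
end

build(a::Int64, b::Float64, c::AbstractString) = Record(a, b, String(c))

function load(rows)
    out = Record[]               # concretely typed output container
    for row in rows
        # row.a etc. are not inferrable here, so this call is dispatched at runtime...
        push!(out, build(row.a, row.b, row.c))
    end
    return out                   # ...but the body of build() compiles to type-stable code
end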

btw, it would be easier if you could provide a sample file and a “slow” version of what you’re doing

I cannot post the data unfortunately because it is proprietary in nature.

I don’t see why it would be unstable? Can you elaborate?

when you read bytes from a file (an IOStream, or worse, over the network), there is no telling whether the bytes can actually be parsed into some type; the read may fail, or you may hit EOF. There’s always going to be some low-level code that is not fundamentally stable, but that’s fine.
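
Concretely:

tryparse(Int, "123")   # 123
tryparse(Int, "12x")   # nothing   -> the inferred return type is Union{Nothing, Int64}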

surely you can post a CSV with the same number of columns and the same numerical types? Just change the names of the columns and the actual values…

8 columns, 10-100 million rows, types are Int64, Float64, and String. An individual column is always just one type.

you want to post the code you’re running that you think is “unstable” and “slower than you’d expected”

if I were you, I wouldn’t think this is enough for people to do free work for me: not only is the work free, but they also need to generate data for it.

I don’t need help optimizing the specific code right now. I just want to have a general theoretical discussion on parsing and its implications for type stable code and performance.

I by no means expect any individual to help me, or even to engage in a discussion they have no interest in.

Type stability is, more or less, a compile-time property…

if I call abc(d,e) and the compiler can’t determine from context that d and e are always a Float64 and an Int64, then it can’t just call the abc(d::Float64, e::Int64) method directly; it will have to do dynamic dispatch (meaning: look at runtime to see what the types are, and then look up that method in the method table).
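
A tiny illustration, reusing that abc example:

abc(d::Float64, e::Int64) = d + e

vals = Any[1.5, 2]                     # concrete types hidden behind Any
abc(vals[1], vals[2])                  # dynamic dispatch: method looked up at runtime
abc(vals[1]::Float64, vals[2]::Int64)  # assertions let the compiler pick the method statically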

I think what you might do is something like:

for row in CSV.Rows(file, types=[Int64, Float64, String])
    push!(myvals, MyCustomType(a = row.a::Int64, b = row.b::Float64, c = row.c::String))
end

Which will allow the compiler to know the types of the values and which constructor to call…
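
(That assumes MyCustomType has a keyword constructor, e.g. via Base.@kwdef; it also helps to put the loop inside a function so that myvals gets a concrete element type. A sketch, assuming the columns are Int64, Float64, String:)

using CSV

Base.@kwdef struct MyCustomType
    a::Int64
    b::Float64
    c::String
end

function load(file)
    myvals = MyCustomType[]                       # concretely typed container
    for row in CSV.Rows(file, types=[Int64, Float64, String])
        # CSV may hand back a lazy/inline string type, hence AbstractString + copy
        push!(myvals, MyCustomType(a = row.a::Int64,
                                   b = row.b::Float64,
                                   c = String(row.c::AbstractString)))
    end
    return myvals
end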

It seems your use case is simple and well-defined, so you could try using a more primitive approach here:

struct MyCustomType
    col1::Int
    col2::Float64
    col3::String 
end

function MyCustomType(line::AbstractString)
    fields = split(line, ",")
    col1 = parse(Int, fields[1])
    col2 = parse(Float64, fields[2])
    col3 = String(fields[3])
    return MyCustomType(col1, col2, col3)
end

function read_file(filename)
    for line in eachline(filename)
        myCustomRecord = MyCustomType(line)
        process(myCustomRecord)
    end
end

function process(myCustomRecord::MyCustomType)
    println(myCustomRecord)
end

Let’s try it out:

function create_testfile(filename, numrows)
    open(filename, "w") do io
        for i in 1:numrows
            col1 = rand(1:1000)
            col2 = rand(Float64)
            col3 = string("Animal ", rand(1:1000))
            println(io, col1, ",", col2, ",", col3)
        end
    end
end

filename = "MyTestFile.csv"
numrows = 10
create_testfile(filename, numrows)
read_file(filename)

But you don’t show how these methods are used. It’s easy to read a CSV file into an eltype-stable array of rows and then stably convert each element to MyCustomType. Try:

using CSV, Tables

CSV.File("file.csv") |> rowtable .|> MyCustomType

This results in a plain Vector of MyCustomTypes. Alternatively, convert to a StructArray instead of rowtable: this can be both more convenient and more performant.
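
For instance (a sketch; it assumes the struct’s fields line up, in order and type, with the CSV columns):

using CSV, Tables, StructArrays

cols = Tables.columntable(CSV.File("file.csv"; stringtype=String))  # NamedTuple of typed column vectors
sa   = StructArray{MyCustomType}(Tuple(cols))                       # element type MyCustomType, storage stays columnar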

I think we should add Base.splat(), since rowtable yields NamedTuples and a plain positional constructor then needs the fields splatted in:

CSV.File("file.csv") |> rowtable .|> Base.splat(MyCustomType)