CSV.jl type stability

I want to convert the rows of a CSV file into custom types in my own code. I have defined the Base.convert methods I need:

Base.convert(::Type{MyCustomType}, row::CSV.Row) = begin
    MyCustomType(
        a = row.a, 
        b = row.b, 
        ...
    )
end

However, I notice this is fairly slow. I am a Julia beginner, but to me this seems like a type stability issue (i.e. the types of row.a and row.b are not known at compile time; a minimal check is sketched below, after the questions).

So my questions are:

1.) Is this a type stability issue?
2.) How can I make it type stable to increase performance?
3.) Will I need to build a custom parser that hardcodes the format/types of the data I am using for good performance? Is there a metaprogramming package that can do that for me?
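
A quick way to confirm this kind of suspicion is @code_warntype; here is a minimal sketch, with a stripped-down stand-in struct and a placeholder file path:

using CSV, InteractiveUtils

struct MyToyType                         # stand-in for the real type, just two columns
    a::Int64
    b::Float64
end
Base.convert(::Type{MyToyType}, row::CSV.Row) = MyToyType(row.a, row.b)

row = first(CSV.File("data.csv"))        # "data.csv" is a placeholder
@code_warntype convert(MyToyType, row)   # row.a and row.b show up as ::Any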

the CSV data is stored as columns, so you probably don’t want to iterate row by row:

try StructArrays.jl (GitHub - JuliaArrays/StructArrays.jl: Efficient implementation of struct arrays in Julia)
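
For example, something along these lines (just a sketch; it assumes the file fits in memory and has columns a and b, and do_something is a placeholder):

using CSV, Tables, StructArrays

tbl  = Tables.columntable(CSV.File("file.csv"))  # NamedTuple of concretely typed column vectors
rows = StructArray(tbl)                          # presents the columns as rows, no copying

for r in rows
    # r is a NamedTuple whose field types the compiler knows
    do_something(r.a, r.b)
end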

@jling my workflow is to yield rows through iteration because I am working with many very large files. The rows are split across these files, and sometimes even an individual file is too big for RAM. So I will be streaming data through the whole application, which seems easier to do on a row-by-row basis.

On the topic of type stability: is the function above type stable?

why are they in CSV then?? you should use Arrow.jl

They get downloaded from the API in .csv.gz format. And I don’t want to save the data locally except for a small cache.

The individual files have been fitting in RAM on my laptop, but if I move to using some small EC2 instances for some of this stuff then I may run into issues. I haven’t done that yet, however.

Regardless I am not looking for a new solution right now. Just trying to understand the type behavior of this function.

well, then the first thing you should do after downloading is to convert them to .arrow; CSV is just not suitable for this in multiple ways.
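
Something along these lines (a sketch; the paths are placeholders):

using CSV, Arrow

Arrow.write("data.arrow", CSV.File("data.csv"))  # one-time conversion; Arrow.write accepts any Tables.jl source
tbl = Arrow.Table("data.arrow")                  # memory-mapped, concretely typed columns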

For row iteration speed, you want TypedTables.jl

DataFrames are inherently type unstable: each column can be anything, and its type can be changed from one moment to the next:

d = DataFrame(foo=["A","B","C"])
d.foo = [1,2,3] #perfectly valid

If you know something that the compiler doesn’t (like that the data files always have a Float64, an Int64, and a String…), you can add some type declarations; but if one of these files is borked and has some invalid characters, you’ll have to be able to catch that error.
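
For instance (a sketch; the record and field names are made up):

row = (a = "oops", b = 2.5)       # imagine a borked record
try
    a = row.a::Int64              # TypeError here: the declaration catches the bad value
catch err
    @error "bad record" err row
end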

Maybe, but these are apparently very big, and he has to read the whole thing once either way… if he just calculates what he’s interested in, then he’s done. If he has to reread this CSV file over and over, caching it as .arrow would make sense, but if it’s read-once + calculate stats, then the arrow file doesn’t make sense.

oh whoops, I see you aren’t reading it into a DataFrame… you’re just reading rows… yes, those are type unstable too. It’s valid to have a CSV like this:

a,b,c
1,2,3
1.2,3.3,4.4

The format of the files I am working with is standard, and I know the types of each column. I am not too worried about there being a messed-up entry because:

a. This is for experimentation/research, not production. A messed-up value should throw an error that I don’t really need to recover from (I just want a good error message).

b. The data is industry grade and should be fairly reliable in that sense.

I see CSV allows you to specify types while reading, but that still wouldn’t make the above type stable because, as you said, someone could change it at any time.

Is there a CSV parser that takes the input types, unsafely parses the raw bytes into those types, and then just gives back tuples or some other type-stable construct for rows?

that doesn’t make any difference; the final layer dealing with parsing and bytes is still unstable, you just have a function barrier that shields it. You should be able to achieve this with the existing CSV.jl anyway.
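
Roughly, a function barrier looks like this (the struct and names are just illustrative):

struct Record                    # illustrative record type
    a::Int64
    b::Float64
    c::String
end

build(a::Int64, b::Float64, c::AbstractString) = Record(a, b, String(c))

function load(rows)
    out = Record[]               # concretely typed output container
    for row in rows
        # row.a etc. are not inferrable here, so this call is dispatched at runtime...
        push!(out, build(row.a, row.b, row.c))
    end
    return out                   # ...but the body of build() compiles to type-stable code
end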

btw, it would be easier if you could provide a sample file and a “slow” version of what you’re doing

I cannot post the data unfortunately because it is proprietary in nature.

I don’t see why it would be unstable? Can you elaborate?

when you read bytes from a file (an IOStream, or worse, over the network), there is no telling whether the bytes can actually be parsed into some type; the read may fail, or you may hit EOF. There’s always going to be some low-level code that is not fundamentally stable, but that’s fine.
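
Concretely:

tryparse(Int, "123")   # 123
tryparse(Int, "12x")   # nothing   -> the inferred return type is Union{Nothing, Int64}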

surely you can post a CSV with the same number of columns and the same numerical types? Just change the names of the columns and the actual values…

8 columns, 10-100 million rows, types are Int64, Float64, and String. An individual column is always just one type.

you want to post the code you’re running that you think is “unstable” and “slower than you’d expected”

if I were you, I wouldn’t think this is enough for people to do free work for me: not only is the work free, but they also need to generate data for it.

I don’t need help optimizing the specific code right now. I just want to have a general theoretical discussion on parsing and its implications for type stable code and performance.

I by no means expect any individual to help me, or even to engage in a discussion they have no interest in.

Type stability is, more or less, a compile-time property…

if I call abc(d,e) and the compiler can’t determine from context that d and e are always a Float64 and an Int64, then it can’t just call the abc(d::Float64, e::Int64) method directly; it will have to do dynamic dispatch (meaning: look at runtime to see what the types are, and then look up that method in the method table).
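
A tiny illustration, reusing that abc example:

abc(d::Float64, e::Int64) = d + e

vals = Any[1.5, 2]                     # concrete types hidden behind Any
abc(vals[1], vals[2])                  # dynamic dispatch: method looked up at runtime
abc(vals[1]::Float64, vals[2]::Int64)  # assertions let the compiler pick the method statically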

I think what you might do is something like:

for row in CSV.Rows(file, types=[Int64, Float64, String])
    push!(myvals, MyCustomType(a = row.a::Int64, b = row.b::Float64, c = row.c::String))
end

Which will allow the compiler to know the types of the values and which constructor to call…
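
(That assumes MyCustomType has a keyword constructor, e.g. via Base.@kwdef; it also helps to put the loop inside a function so that myvals gets a concrete element type. A sketch, assuming the columns are Int64, Float64, String:)

using CSV

Base.@kwdef struct MyCustomType
    a::Int64
    b::Float64
    c::String
end

function load(file)
    myvals = MyCustomType[]                       # concretely typed container
    for row in CSV.Rows(file, types=[Int64, Float64, String])
        # CSV may hand back a lazy/inline string type, hence AbstractString + copy
        push!(myvals, MyCustomType(a = row.a::Int64,
                                   b = row.b::Float64,
                                   c = String(row.c::AbstractString)))
    end
    return myvals
end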

It seems your use case is simple and well-defined, so you could try using a more primitive approach here:

struct MyCustomType
    col1::Int
    col2::Float64
    col3::String 
end

function MyCustomType(line::AbstractString)
    fields = split(line, ",")
    col1 = parse(Int, fields[1])
    col2 = parse(Float64, fields[2])
    col3 = String(fields[3])
    return MyCustomType(col1, col2, col3)
end

function read_file(filename)
    for line in eachline(filename)
        myCustomRecord = MyCustomType(line)
        process(myCustomRecord)
    end
end

function process(myCustomRecord::MyCustomType)
    println(myCustomRecord)
end

Let’s try it out:

function create_testfile(filename, numrows)
    open(filename, "w") do io
        for i in 1:numrows
            col1 = rand(1:1000)
            col2 = rand(Float64)
            col3 = string("Animal ", rand(1:1000))
            println(io, col1, ",", col2, ",", col3)
        end
    end
end

filename = "MyTestFile.csv"
numrows = 10
create_testfile(filename, numrows)
read_file(filename)

But you don’t show how these methods are used. It’s easy to read a CSV file into an eltype-stable array of rows and then stably convert each element to MyCustomType. Try:

using CSV, Tables

CSV.File("file.csv") |> rowtable .|> MyCustomType

This results in a plain Vector of MyCustomTypes. Alternatively, convert to a StructArray instead of rowtable: this can be both more convenient and more performant.
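
For instance (a sketch; it assumes the struct’s fields line up, in order and type, with the CSV columns):

using CSV, Tables, StructArrays

cols = Tables.columntable(CSV.File("file.csv"; stringtype=String))  # NamedTuple of typed column vectors
sa   = StructArray{MyCustomType}(Tuple(cols))                       # element type MyCustomType, storage stays columnar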

I think we should add Base.splat(), since rowtable yields NamedTuples and a plain positional constructor then needs the fields splatted in:

CSV.File("file.csv") |> rowtable .|> Base.splat(MyCustomType)