If I understand correctly, rowtable is going to consume the entire file and make a vector of named tuples, but this could be enormous. Better to consume one row at a time and build the special type one row at a time, as in my loop. It’s possible I’m misunderstanding, though.
@dlakelan, you are right, but it is still very interesting code.
As for the CSV.Rows approach, it seems to be much less efficient in this case than Greg Plowman’s eachline parsing (I’ve tested with 1 million rows, each containing 4 Ints, 3 Float64s and 1 String). Are you seeing the same thing?
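For anyone following along, here is a minimal sketch of what eachline-style manual parsing could look like for that row layout (4 Ints, 3 Float64s, 1 String). It is not Greg Plowman’s actual code: the Record type, its field names, and the parse_record and read_records helpers are made up for illustration, and it assumes plain comma-separated lines with no quoted fields or missing values.

```julia
# Hypothetical row type matching the layout tested above:
# 4 Ints, 3 Float64s and 1 String per row.
struct Record
    a::Int
    b::Int
    c::Int
    d::Int
    x::Float64
    y::Float64
    z::Float64
    label::String
end

# Parse one comma-separated line into a Record.
# Assumes no quoted fields, no embedded commas, and no missing values.
function parse_record(line::AbstractString)
    f = split(line, ',')
    length(f) == 8 || throw(ArgumentError("expected 8 fields, got $(length(f))"))
    return Record(
        parse(Int, f[1]), parse(Int, f[2]), parse(Int, f[3]), parse(Int, f[4]),
        parse(Float64, f[5]), parse(Float64, f[6]), parse(Float64, f[7]),
        String(f[8]),
    )
end

# Read records one line at a time from any IO (a file handle, an IOBuffer, ...).
function read_records(io::IO; header::Bool = true)
    records = Record[]
    for (i, line) in enumerate(eachline(io))
        header && i == 1 && continue   # skip the header row
        isempty(line) && continue      # tolerate blank lines
        push!(records, parse_record(line))
    end
    return records
end

# Usage with an in-memory buffer standing in for a file:
data = """
a,b,c,d,x,y,z,label
1,2,3,4,0.5,1.5,2.5,first
5,6,7,8,3.5,4.5,5.5,second
"""
records = read_records(IOBuffer(data))
```

For a real file the same function can be called as open(read_records, path). Only the current line and the growing vector of records are held in memory, so no intermediate table of the whole file is built.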
Thank you for all the suggestions everyone! I will play around with all of them.
@rafael.guerra I am actually seeing the opposite: the manual parsing is taking longer than CSV.File with type annotations. In fact, the manual parsing is taking longer than CSV.File with no type annotations.
Hmm, it actually seems the slowness is coming from the GZip package that I am using to decompress the file, rather than from the conversions themselves. Let me try with versions of the file that are already decompressed on disk.
OK, so now I am seeing that the manual reading and the type-annotated reading are comparable, and that reading without type annotations is about half as fast.
The manual reading is still slightly slower (by about 10%) once both it and the type-annotated CSV.File have already been compiled. On the first run, manual reading is 50% faster because of its significantly lower compile time.
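A rough way to separate the first-run (compilation) cost from the steady-state cost is to time the same call twice in a fresh session. This is only a sketch: parse_ints is a hypothetical stand-in for whichever reader is being measured, and for serious comparisons a benchmarking package would give more reliable numbers.

```julia
# Hypothetical stand-in for a reader; any function can be timed the same way.
parse_ints(lines) = [parse(Int, l) for l in lines]

lines = string.(1:100_000)

# First call in a session: typically includes JIT compilation of parse_ints.
t_first = @elapsed parse_ints(lines)

# Second call: the compiled method is reused, so this is closer to run time alone.
t_second = @elapsed parse_ints(lines)

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```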
If you load all the data in memory anyway in the form of those objects, then an extra copy of the same data won’t hurt performance much.
And when performance is needed, column-oriented storage is often better. For example, with StructArrays there’s a very efficient solution: CSV.File(...) |> columntable |> StructArray{MyCustomType}
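To illustrate the last step of that pipeline, here is a small sketch that needs only StructArrays. The Point type and the column values are made up, and the NamedTuple of vectors stands in for what columntable would return from a CSV.File; the sketch assumes the column names match the struct’s field names.

```julia
using StructArrays

# Hypothetical row type; field names are assumed to match the column names.
struct Point
    id::Int
    x::Float64
    label::String
end

# A NamedTuple of vectors, i.e. the shape that Tables.columntable returns.
cols = (id = [1, 2, 3], x = [0.1, 0.2, 0.3], label = ["a", "b", "c"])

# Wrap the columns as an array of Point without building a Vector{Point}.
points = StructArray{Point}(cols)

p = points[2]    # a single Point is constructed only on access
xs = points.x    # the underlying Float64 column
```

The data stays in column form, and indexing gives back values of the custom type, so code written against an array of Point should still work.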