I have finally gotten some code I translated from R working. It is by no means optimized, and it is my first Julia code, so I didn’t expect great performance, but it is significantly slower than I expected. In my first runs I have noticed that a lot of my time is GC time. Are there methods to reduce this? I am running in an environment with plenty of memory available, so garbage collecting in the middle of an operation really isn’t necessary.
92.123407 seconds (382.46 M allocations: 12.969 GiB, 64.78% gc time)
1999638×9 DataFrames.DataFrame.
It’s hard to tell without knowing what the code does, but in general you can reduce GC time by reducing the number of allocations, which generally implies using type-stable code. In the context of data frames, this notably implies applying functions to a whole column rather than iterating over them, or passing them as vectors to a function which will loop over them.
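To make that concrete, here is a minimal sketch of the function-barrier pattern described above (the column name :x and the function names are made up for illustration):

using DataFrames

# Slow: indexing a DataFrame cell by cell is type-unstable, because the
# compiler cannot know the column's element type, so every access allocates.
function cell_sum(df::DataFrame)
    s = 0.0
    for i in 1:nrow(df)
        s += df[i, :x]
    end
    return s
end

# Fast: extract the column once and pass it through a "function barrier";
# inside col_sum the vector has a concrete type, so the loop does not allocate.
function col_sum(v::AbstractVector)
    s = zero(eltype(v))
    for x in v
        s += x
    end
    return s
end

barrier_sum(df::DataFrame) = col_sum(df[:x])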
Extra allocations can be indicative of type-stability issues. Did you check @code_warntype?
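For example (parse_field and its arguments here are placeholders for whatever runs in your hot loop):

@code_warntype parse_field(source, 1, 1)
# anything inferred as ::Any or a wide ::Union{...} in the output is a type instability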
I’d have to really dig down with @code_warntype; it appears to only work on one function at a time. It is a great pointer though, I did not realize that existed.
As for the code: it is reading in some large files, and I suspect the allocations come from reading them a field at a time and streaming to a data frame using the DataStreams API. I might try to run the profiler on it and see if that gives any insight. It could also be that I am using Missings on 0.6; I might upgrade my environment to 0.7-dev and see if that helps.
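For reference, the profiling pattern I have in mind is roughly this (load_files is a hypothetical stand-in for my top-level call; on 0.6 the Profile module ships in Base):

@profile load_files()   # collect stack samples while the code runs
Profile.print()         # per-line sample counts show where time (and GC pressure) is going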
An update…
I wrote code that calls the DataStreams methods directly on my objects and just drops the results, and then compared that to streaming the data into a data frame (benchmarks below). It seems as though about two-thirds of a second of my time goes to streaming into the data frame. As an FYI, this data has ~55,000 rows and 24 columns, with each field up to 350 characters long.
As a next step I am going to modify my file reading code to stream whole columns instead of a field at a time and see what effect that has on performance.
Main> @benchmark stream_to_thevoid()
BenchmarkTools.Trial:
memory estimate: 1.40 GiB
allocs estimate: 14714627
--------------
minimum time: 2.531 s (9.83% GC)
median time: 2.544 s (9.89% GC)
mean time: 2.544 s (9.89% GC)
maximum time: 2.556 s (9.96% GC)
--------------
samples: 2
evals/sample: 1
Main> @benchmark stream_to_df()
BenchmarkTools.Trial:
memory estimate: 280.94 MiB
allocs estimate: 5706479
--------------
minimum time: 627.397 ms (22.18% GC)
median time: 686.483 ms (27.38% GC)
mean time: 715.479 ms (30.52% GC)
maximum time: 832.426 ms (40.66% GC)
--------------
samples: 7
evals/sample: 1
And another update, this time using column-based streaming…
The column-based streaming method I developed significantly lowers GC time and data frame creation time. This trial, based on the same small sample file as above, brings the overall time and GC time down to just under what it took above to just read the data in and dump it.
Main> @benchmark stream_to_df()
BenchmarkTools.Trial:
memory estimate: 390.63 MiB
allocs estimate: 9537896
--------------
minimum time: 2.320 s (3.38% GC)
median time: 2.432 s (7.78% GC)
mean time: 2.398 s (6.61% GC)
maximum time: 2.440 s (8.53% GC)
--------------
samples: 3
evals/sample: 1
Using the column-based streaming method has also improved loading a much larger amount of data, cutting it to roughly two-thirds of its old time.
62.887358 seconds (325.86 M allocations: 11.366 GiB, 69.74% gc time)
2035011×9 DataFrames.DataFrame. Omitted printing of 4 columns
One significant slowdown that I hope I can figure out how to get past is having to call strip after turning a byte array into a string:
strip(String(buf))
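If the fields are padded with ASCII spaces, one possible workaround (just a sketch, not what FWF.jl does today) is to trim at the byte level before constructing the String, skipping the intermediate string that strip allocates:

# Trim ASCII-space padding at the byte level, then build the String once,
# avoiding the extra allocation from strip(String(buf)).
function trimmed_string(buf::Vector{UInt8})
    lo, hi = 1, length(buf)
    while lo <= hi && buf[lo] == UInt8(' ')
        lo += 1
    end
    while hi >= lo && buf[hi] == UInt8(' ')
        hi -= 1
    end
    return String(buf[lo:hi])
end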
I’m not sure what you mean exactly by “column-based streaming method”, but if you care about performance then don’t use Missing on Julia 0.6. Use DataArrays instead, and readtable instead of CSV.read.
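For example (the file path is a placeholder):

using DataFrames              # readtable comes from DataFrames on 0.6
df = readtable("data.csv")    # columns are DataArrays, with NA instead of missing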
Just as some more background… I have been using a project I wrote in R as an opportunity to learn Julia. As part of it I need to read in ~30 GB of fixed-width data files and convert them to a usable format. To do this I wrote a fixed-width file parser that uses DataStreams to convert the files into DataFrames. The parser is here: GitHub - RandomString123/FWF.jl: Fixed width file parsing in Julia
The “column-based streaming method” means implementing support for ::Type{Data.Column} in the DataStreams API. Using Data.Field is nice for as-needed parsing, but it seems to really slow down with a large number of fields.
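Roughly, the difference in the interface looks like this (MySource is a stand-in for the FWF source type, the bodies are elided, and the exact signatures vary a bit between DataStreams versions):

using DataStreams

struct MySource end  # stand-in for the FWF source type

# Field-based: called once per cell, so per-call overhead scales with rows × columns.
function Data.streamfrom(s::MySource, ::Type{Data.Field}, ::Type{String}, row::Int, col::Int)
    # parse and return the single fixed-width field at (row, col)
end

# Column-based: called once per column, returning the whole column at once,
# which amortizes the per-call overhead and generates far less garbage.
function Data.streamfrom(s::MySource, ::Type{Data.Column}, ::Type{Vector{String}}, col::Int)
    # parse and return a Vector{String} for the entire column
end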
And for what it is worth… I have gotten the time to read the file and stream it into a data frame down to a reasonable-ish level. I will take your suggestion and try DataArrays for 0.6 testing and leave Missings for 0.7; that might make up the rest of the performance gap. As for processing, I think there is a lot of efficiency still to be gained, since I currently process things in an “R” way that is not optimal in Julia.