That is not a lot of RAM. On the other hand, if the file contains mostly numbers rather than strings, the data can take much less space in parsed form than it does as CSV text.
Ah, ok, then indeed I had not understood your problem very well. The statement below made it sound like you had a very small number of duplicates (100 rows total) but if the number of unique elements stays roughly the same for each new file, it must be that almost every row is a duplicate. In that case, your original dictionary solution should (if correctly implemented) be faster than the new solution, since the vcat/grouping does a lot of repeat work.
I couldn't believe the original solution was so slow.
In it, first I created a dict:
SDict = Dict{String, Array{Float64,1}}()
then on every row i of every file, I did something like:
keyStr = df.col1[i] * string(df.col2[i]) * string(df.col3[i]) ...
if haskey(SDict, keyStr)
    SDict[keyStr][1] += df.col41[i]
    ...
else
    SDict[keyStr] = [df.col41[i], df.col42[i], ...]
end
There is almost nothing else in the code. I checked with --track-allocation and indeed the key-creation statement allocates a lot of memory, but I don't know if that's the sole cause.
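For reference, the accumulation loop described above can be written as a self-contained, runnable sketch. The rows, column layout, and key columns here are illustrative (a plain vector of tuples stands in for the DataFrame):

```julia
# Minimal sketch of the dictionary-accumulation approach described above.
# Columns 1-3 form the key; the remaining columns are accumulated values.
rows = [
    ("a", 1, 2, 10.0, 1.0),
    ("a", 1, 2, 5.0, 2.0),   # duplicate key: values get summed into the entry
    ("b", 3, 4, 7.0, 3.0),
]

SDict = Dict{String, Vector{Float64}}()
for r in rows
    keyStr = r[1] * "," * string(r[2]) * "," * string(r[3])
    if haskey(SDict, keyStr)
        SDict[keyStr][1] += r[4]
        SDict[keyStr][2] += r[5]
    else
        SDict[keyStr] = [r[4], r[5]]
    end
end
# SDict["a,1,2"] == [15.0, 3.0]
```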
Your thinking is correct, there must be something implementation-specific that made it so slow. If you have another implementation now that works and is fast enough, that's all that matters, really, but in case you wanted to keep hacking at your original code, here are some ideas:
- Don't parse the key columns and then re-serialize them to strings. Read the entire row as a string, locate the 40th comma, and use that substring as a key.
- Moreover, concatenating keys that way could be buggy. Keys "12" and "34" concatenate the same way as "123" and "4".
- Stream the data instead of first loading it all and then processing it.
- If you do
SDict[keyStr][1] += ...; SDict[keyStr][2] += ...;
repeatedly, each statement leads to a dictionary lookup. Better to do a single lookup, e.g.: entry = SDict[keyStr]; entry[1] += ...; entry[2] += ...;
- In addition to --track-allocation, use the profiler, or comment code out, to locate what causes the slowness.
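The single-lookup idea in the list above can also be expressed with Julia's get!(f, dict, key), which either returns the existing entry or inserts the function's result, in one lookup. A small sketch (the helper name addrow! is illustrative, not from the original code):

```julia
# One dictionary lookup per row instead of one per accumulated column.
SDict = Dict{String, Vector{Float64}}()

function addrow!(d::Dict{String, Vector{Float64}}, key::String, vals::Vector{Float64})
    entry = get!(() -> zeros(length(vals)), d, key)  # fetch or insert in a single lookup
    entry .+= vals                                   # accumulate in place
    return entry
end

addrow!(SDict, "a,1,2", [10.0, 1.0])
addrow!(SDict, "a,1,2", [5.0, 2.0])
# SDict["a,1,2"] == [15.0, 3.0]
```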
Thanks for the great ideas.
Don't parse the key columns and then re-serialize them to strings. Read the entire row as a string, locate the 40th comma, and use that substring as a key.
Great idea not to parse every column. Since my datasets have different numbers of attributes, I should have parsed the column names before looping through rows.
Moreover, concatenating keys that way could be buggy. Keys "12" and "34" concatenate the same way as "123" and "4".
This was not an issue, as I actually had "," in between each string.
Stream the data instead of first loading it all and then processing it.
What exactly do you mean by this? I did df = CSV.read
To not use CSV.read, or to use a custom sink (see the CSV.read documentation).
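As a sketch of the streaming suggestion, assuming the CSV.jl package: CSV.Rows iterates the input lazily, one row at a time, and by default yields each field as a string, which also sidesteps parsing key columns only to re-serialize them. The column names and inline data below are illustrative:

```julia
using CSV  # assumes CSV.jl is installed

# Streaming instead of CSV.read: no full DataFrame is materialized.
csvdata = IOBuffer("""
col1,col2,col41,col42
a,1,10.0,1.0
a,1,5.0,2.0
""")

SDict = Dict{String, Vector{Float64}}()
for row in CSV.Rows(csvdata)
    key = String(row.col1) * "," * String(row.col2)  # key columns stay as strings
    entry = get!(() -> zeros(2), SDict, key)
    entry[1] += parse(Float64, row.col41)            # only value columns are parsed
    entry[2] += parse(Float64, row.col42)
end
# after the loop, SDict["a,1"] == [15.0, 3.0]
```

With a real file you would pass the file path to CSV.Rows instead of an IOBuffer; the reusebuffer=true keyword can further cut allocations if each row is fully processed before the next is read.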