Suggestions needed on analysing a large number of CSV files

That is not a lot of RAM :slight_smile: On the other hand, if the files contain mostly numbers rather than strings, the data can sometimes take much less space once parsed than it does as CSV text.

Ah, ok, then indeed I had not understood your problem very well. The statement below made it sound like you had a very small number of duplicates (100 rows total), but if the number of unique elements stays roughly the same for each new file, almost every row must be a duplicate. In that case, your original dictionary solution should (if correctly implemented) be faster than the new solution, since the vcat/grouping approach does a lot of repeated work.

I couldn’t believe the original solution was so slow.

In it, first I created a dict:
SDict = Dict{String, Array{Float64,1}}()
then on every row i of every file, I did something like:

keyStr = df.col1[i] * string(df.col2[i]) * string(df.col3[i]) ...   # concatenate the key columns into one string
if haskey(SDict, keyStr)
    SDict[keyStr][1] += df.col41[i]   # key already seen: accumulate the numeric columns
    ...
else
    SDict[keyStr] = [df.col41[i], df.col42[i], ...]   # first occurrence: store the numeric columns
end

There is almost nothing else in the code. I checked with --track-allocation and indeed the key creation statement allocates a lot of memory, but I don’t know if that’s the sole cause.

Your thinking is correct, there must be something implementation-specific that made it so slow. If you have another implementation now that works and is fast enough, that’s all that matters really, but in case you wanted to keep hacking at your original code, here are some ideas:

  • Don’t parse the key columns and then re-serialize them to strings. Read the entire row as a string, locate the 40th comma, and use that substring as a key (see the first sketch after this list).
  • Moreover, concatenating keys that way could be buggy. Keys “12” and “34” concatenate the same way as “123” and “4”.
  • Stream the data instead of first loading it all and then processing it.
  • If you do SDict[keyStr][1] += ...; SDict[keyStr][2] += ...; repeatedly, each statement does a separate dictionary lookup. Better to do a single lookup, e.g. entry = SDict[keyStr]; entry[1] += ...; entry[2] += ... (see the get! sketch after this list).
  • In addition to --track-allocation, use the profiler, or comment code out, to locate what causes the slowness (a small profiling snippet follows below).
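
Something like this is what I had in mind for the first point (key_of is just a hypothetical helper, not your code; it assumes the 40 key columns come first and contain no quoted commas). Keeping the commas inside the key also sidesteps the “12”/“34” vs. “123”/“4” ambiguity:

# Hypothetical helper: use the raw text up to the 40th comma as the key.
function key_of(line::AbstractString, nkeycols::Int = 40)
    pos = 0
    for _ in 1:nkeycols
        next = findnext(isequal(','), line, pos + 1)
        next === nothing && return SubString(line, 1)   # fewer columns than expected: use the whole line
        pos = next
    end
    return SubString(line, 1, pos - 1)   # text before the 40th comma, separators included
end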
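
And a rough sketch of the single-lookup point, using get! so each row costs one hash lookup instead of one per updated column (addrow! is a made-up name):

# Hypothetical accumulator: fetch or create the entry with a single lookup.
function addrow!(SDict::Dict{String, Vector{Float64}}, keyStr::AbstractString, rowvals)
    entry = get!(() -> zeros(length(rowvals)), SDict, keyStr)   # one lookup, zero-initialized on miss
    entry .+= rowvals                                           # update all value columns in place
    return entry
end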
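
For the profiling, the built-in Profile stdlib is usually enough to see where the time goes; something like this, where process_files stands in for whatever your top-level function is:

using Profile

process_files(files) = nothing           # stand-in for your actual processing function

@profile process_files(["file1.csv"])    # run the workload under the profiler
Profile.print()                          # print where the time was spent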

Thanks for the great ideas.

Don’t parse the key columns and then re-serialize them to strings. Read the entire row as a string, locate the 40th comma, and use that substring as a key.

Great idea to not parse every column. As my datasets have different numbers of attributes, I should have parsed the column names first, before looping through the rows.

Moreover, concatenating keys that way could be buggy. Keys “12” and “34” concatenate the same way as “123” and “4”.

This was not an issue, as I actually had a “,” between each string.

Stream the data instead of first loading it all and then processing it.

What exactly do you mean by this? I did df = CSV.read

I meant either not using CSV.read at all (process each row as you read it, instead of loading the whole file first), or using a custom sink (see the CSV.read docs).
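
For example, something like this (just a sketch with made-up file and column names): CSV.Rows iterates the file row by row instead of materializing a whole DataFrame, and it yields the fields as strings by default, so the value columns are parsed explicitly:

using CSV

totals = Dict{String, Vector{Float64}}()
for row in CSV.Rows("somefile.csv"; reusebuffer = true)    # lazy, row-by-row reading
    key = string(row.col1, ',', row.col2, ',', row.col3)   # "," between fields avoids key collisions
    vals = get!(() -> zeros(2), totals, key)               # one lookup: fetch or create the entry
    vals[1] += parse(Float64, row.col41)
    vals[2] += parse(Float64, row.col42)
end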