Suggestions needed on analysing a large number of CSV files

That is not a lot of RAM :slight_smile: On the other hand, if the files contain mostly numbers rather than strings, the data can sometimes take much less space once parsed than it does as CSV text.

Ah, ok, then indeed I had not understood your problem very well. The statement below made it sound like you had a very small number of duplicates (100 rows total), but if the number of unique elements stays roughly the same for each new file, almost every row must be a duplicate. In that case, your original dictionary solution should (if correctly implemented) be faster than the new solution, since the vcat/grouping approach does a lot of repeated work.

I couldn’t believe the original solution was so slow.

In it, first I created a dict:
SDict = Dict{String, Array{Float64,1}}()
then on every row i of every file, I did something like:

keyStr = df.col1[i] * string(df.col2[i]) * string(df.col3[i]) ...   # concatenate the key columns into one string
if haskey(SDict, keyStr)
    SDict[keyStr][1] += df.col41[i]   # key already seen: accumulate the numeric columns
    ...
else
    SDict[keyStr] = [df.col41[i], df.col42[i], ...]   # first occurrence: store the numeric columns
end

There is almost nothing else in the code. I checked with --track-allocation and indeed the key creation statement allocates a lot of memory, but I don’t know if that’s the sole cause.

Your thinking is correct, there must be something implementation-specific that made it so slow. If you have another implementation now that works and is fast enough, that’s all that matters really, but in case you wanted to keep hacking at your original code, here are some ideas:

  • Don’t parse the key columns and then re-serialize them to strings. Read the entire row as a string, locate the 40th comma, and use that substring as a key (see the first sketch after this list).
  • Moreover, concatenating keys that way could be buggy. Keys “12” and “34” concatenate the same way as “123” and “4”.
  • Stream the data instead of first loading it all and then processing it.
  • If you do SDict[keyStr][1] += ...; SDict[keyStr][2] += ...; repeatedly, each statement does a separate dictionary lookup. Better to do a single lookup, e.g. entry = SDict[keyStr]; entry[1] += ...; entry[2] += ... (see the get! sketch after this list).
  • In addition to --track-allocation, use the profiler, or comment code out, to locate what causes the slowness (a small profiling snippet follows below).
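
Something like this is what I had in mind for the first point (key_of is just a hypothetical helper, not your code; it assumes the 40 key columns come first and contain no quoted commas). Keeping the commas inside the key also sidesteps the “12”/“34” vs. “123”/“4” ambiguity:

# Hypothetical helper: use the raw text up to the 40th comma as the key.
function key_of(line::AbstractString, nkeycols::Int = 40)
    pos = 0
    for _ in 1:nkeycols
        next = findnext(isequal(','), line, pos + 1)
        next === nothing && return SubString(line, 1)   # fewer columns than expected: use the whole line
        pos = next
    end
    return SubString(line, 1, pos - 1)   # text before the 40th comma, separators included
end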
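
And a rough sketch of the single-lookup point, using get! so each row costs one hash lookup instead of one per updated column (addrow! is a made-up name):

# Hypothetical accumulator: fetch or create the entry with a single lookup.
function addrow!(SDict::Dict{String, Vector{Float64}}, keyStr::AbstractString, rowvals)
    entry = get!(() -> zeros(length(rowvals)), SDict, keyStr)   # one lookup, zero-initialized on miss
    entry .+= rowvals                                           # update all value columns in place
    return entry
end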
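
For the profiling, the built-in Profile stdlib is usually enough to see where the time goes; something like this, where process_files stands in for whatever your top-level function is:

using Profile

process_files(files) = nothing           # stand-in for your actual processing function

@profile process_files(["file1.csv"])    # run the workload under the profiler
Profile.print()                          # print where the time was spent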

Thanks for the great ideas.

Don’t parse the key columns and then re-serialize them to strings. Read the entire row as a string, locate the 40th comma, and use that substring as a key.

Great idea to not parse every column. As my datasets have different numbers of attributes, I should have parsed the column names first, before looping through the rows.

Moreover, concatenating keys that way could be buggy. Keys “12” and “34” concatenate the same way as “123” and “4”.

This was not an issue, as I actually had a “,” between each string.

Stream the data instead of first loading it all and then processing it.

What exactly do you mean by this? I did df = CSV.read

I meant either not using CSV.read at all (process each row as you read it, instead of loading the whole file first), or using a custom sink (see the CSV.read docs).
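
For example, something like this (just a sketch with made-up file and column names): CSV.Rows iterates the file row by row instead of materializing a whole DataFrame, and it yields the fields as strings by default, so the value columns are parsed explicitly:

using CSV

totals = Dict{String, Vector{Float64}}()
for row in CSV.Rows("somefile.csv"; reusebuffer = true)    # lazy, row-by-row reading
    key = string(row.col1, ',', row.col2, ',', row.col3)   # "," between fields avoids key collisions
    vals = get!(() -> zeros(2), totals, key)               # one lookup: fetch or create the entry
    vals[1] += parse(Float64, row.col41)
    vals[2] += parse(Float64, row.col42)
end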