Suggestions needed on analysing large number of csv files


#1

I have 100 csv files. They have the same column names. Of the columns, about 40 are attributes, and 10 are results. I need to go through all the rows of these files and sum up the results that correspond to the same attribute sets. I currently load the numbers into a Dict{String, Tuple}, where the String is the combination of the 40 attributes and the Tuple holds the sum of the 10 results. It appears very slow. I believe forming the String keys on every row requires a lot of memory allocations. What can be better ways to handle this?


#2

Why not have a Dict{Tuple, Tuple} and avoid forming the string?


#3

So it’s less costly to form a tuple on every row than form a string?


#4

A bigger problem (if you have many duplicates) could be that tuples are immutable, so every time you add to it, you’ll be allocating a new tuple. Try using an array or custom mutable type instead.


#5

I would assume so, but you should benchmark. But in any case, some tools (eg CSV.jl) would give you (named) tuples in any case.

Also, comparing values by their string representation is a code smell.


#6

I can’t easily form the keys with Tuple as I need to form each key in steps based on conditions, but I tried to form the keys as Array{String,1}. It did not improve performance over keys as String. It looks memory allocation is unavoidable either way. Perhaps I can try to use part of a DataFrameRow as keys. But I guess it will need to be copied into the Dict, making a memory allocation anyhow.


#7

Are they actually strings, or unparsed numbers, dates, etc? My suggestion would help in the latter case.


#8

Some are numbers and some are strings


#9

Could you provide some more details? What is “very slow” and “lot of memory allocations” in terms of seconds and bytes? How many rows are there in total? How many unique attribute sets are there? Could you paste a sample row? Could you try to narrow down what exactly in your code allocates a lot of memory, and paste that code snippet here?

It could be creating the keys that’s the problem, or it could be adding up the tuples, or something else entirely. It’s hard to tell. As for the tuple sums, did you try turning them into an array as I suggested earlier? Here’s an example illustrating how that can reduce allocations:

julia> test1() = (a = zeros(Int, 10); b = rand(1:9, 10); for n = 1:1_000_000; a .+= b; end; a);

julia> test2() = (a = Tuple(zeros(Int, 10)); b = Tuple(rand(1:9, 10)); for n = 1:1_000_000; a = a .+ b; end; a);

julia> @btime test1();
  6.178 ms (2 allocations: 320 bytes)

julia> @btime test2();
  266.343 ms (3000006 allocations: 427.25 MiB)

#10

@bennedich, I did change the Tuple into Array, and it improved a bit - thank you for the suggestion. Since the number of duplicates (about 100) is far smaller than the total number of rows (over 100,000 x 100), improvement is limited.

Each of the 100 files can have over 100,000 rows. The attributes in a row are like:
AU_BC_H3M,A_Sin,0,1:2,-0.0842,-0.014,AU_BC_H3M,A_Sin,0,2:5,-0.014,0.0992,AU_BC_M3C,A_Fin,1,2:4,-0.0082,-0.0028,AU_BC_M2M,A_Din,0,3:4,-0.001,0.006
and the results are like:
60.0,3942.0,28100.0,156.0,14.0,1.981142e6,1.55052e6,1.55052e6,3.74,190.0,15.0,1.051024e7,93.0,81491.0,43.0,433936.0,281.0

I don’t have benchmark data, but the code runs on single CPU for 5 hours on my MacBook Pro just looping through the rows of the 100 files, forming keys and loading a Dict, summing up on duplicate keys, and then writing out a summary file.


#11

Here is a different tack:

using TextParse, Query, DataFrames

filenames = readdir() |> @filter(endswith(_, ".csv")) |> collect

data, colnames, rows = csvread(filenames)

df = DataFrame(collect(data), Symbol.(colnames))

df |> 
  @groupby({_.col1, _.col2, _col3}) |> 
  @map({col1=key(_).col1, col2=key(_).col2, col3=key(_).col3, col4=sum(_.col4), col5=sum(_.col5)}) |> 
  DataFrame

TextParse.jl allows you to read multiple CSV files that have the same structure in in one go, you just have to pass a vector of filenames to csvread. Here I pass it all the filenames that have a *.csv extension in the current directory.

The conversion from the return value of csvread to a DataFrame is a bit cumbersome… Unfortunately CSVFiles.jl doesn’t support the multiple files case yet, normally that is the “nice” interface that sits between the relatively low level TextParse.jl and a DataFrame.

And then I’m using Query.jl to group and then aggregate. I’m only grouping by three cols, you would have to add the other 40 there manually, I’m afraid… I’m also only aggregatin col4 and col5, but you should be able to add the other ones fairly easily as well.

I haven’t tested the code because I don’t have the data, so I have no idea whether this performs well or works at all. But at least in theory this might work :slight_smile: And reports back whether this does work would be great!


#12

@davidanthoff Thanks for suggesting a new approach. The grouping feature of Query.jl sounds cool. It saves the effort for creating keys, I guess, causing less memory allocations. I would like to try it. The only question is if/how I can load in files progressively rather than all at once (I am afraid my RAM is not enough for all the files).


#13

For that data, I’d actually just keep each key as a single string, completely disregarding the fact that it consists of multiple columns, since that’s how it’s stored on disk already, and you don’t need to differentiate between the columns. Parsing all these fields and creating key objects will be a waste of time.

Is this a one-time job, or something that you’ll need to do repeatedly?

It’s quite a bit of data. What’s the total file size, does it all fit in memory?

One thing you could try is to concatenate all files and sort the resulting mega-file with a simple bash command. Then all duplicates will end up next to each other and can be easily detected, either from bash or Julia. Sorting has higher time complexity than iterating and adding to a dictionary, but the constant factor is very low (at least if done in-memory). This would also give you an idea of the potential to speed up your Julia implementation: if the above experiment finishes in a couple of minutes for example, a 5 hour runtime in Julia sounds completely unreasonable.


#14

Since I don’t think I can always read in all the files into RAM, I ended up loading in the files one by one, “vcat” the new one onto the previously processed, and call “by” from DataFrames to sum up results on duplicate attributes. It now takes about 10 minutes to process all 100 files, big reduction from 5 hours!

Thanks to all for the ideas.


#15

Congrats on coming up with an efficient solution! Would that approach work though when the data doesn’t all fit in memory? Does it stream or swap/map to disk?


#16

One file is roughly less than 100MB, so it fits in memory, After it’s processed, it’s discarded from RAM


#17

Hmm, but wouldn’t this still keep everything in memory? Or are you reducing the data during the processing step? Perhaps I missed something in your question.


#18

Have you tried to read them all in one go with TextParse.jl? Depending on what is in these files, this does not strike me as a super large amount of data, maybe it would just fit into main memory? How much of that do you have?


#19

No, I haven’t. I only have 4GB RAM on the VBox, so it’s not enough for 100x50MB.


#20

No, I don’t think they stay in memory. At every step, I do
df = CSV.read(filename)
then I do roughly

df1 = vcat(dfo, df)
dfo = by(df1, groupVec, col5 => sum, ...)

The duplicates then are summed up together. dfo stays roughly the size of one single df.
And then df should be gc’ed out on the next step