What's the best way to work with millions of rows of data?

Hi how’s it going?

I made a prototype of an app using the DataFrames.jl package. The data was about 100,000 rows. I need to scale this up to millions, possibly hundreds of millions of rows. I don’t know much about how to use chunks and/or multiple threads, other than that these giant data use cases are where Julia excels. I was wondering if someone here could point me in the right direction, to a learning resource or particular packages that you think would be useful.

Thank you!

For most of what you’d want to do with millions of rows, as long as you don’t also have a huge number (≫ 10) of columns, DataFrames.jl should be perfectly adequate. The groupby is a little slower than it could be, but it isn’t that slow, and just about everything else you could want to do should be quite fast.
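
For a sense of scale, a grouped aggregation over a million rows is typically quick in DataFrames.jl. A rough sketch (the column names are made up for illustration):

using DataFrames, Statistics

# A million rows, a thousand groups.
df = DataFrame(id = rand(1:1000, 1_000_000), value = randn(1_000_000))

# Split-apply-combine: mean of `value` within each `id` group.
gdf = groupby(df, :id)
result = combine(gdf, :value => mean => :mean_value)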

Because a DataFrame is basically just a hash map of whatever AbstractVector type you want to put in it, in principle there are all sorts of ways you could use DataFrames.jl with parallel processing. For example, you could use DistributedArrays.jl vectors as your columns, or you could iterate over it with pmap or @threads. DataFrames.jl doesn’t give you any tools to help you load data in a distributed environment, but once it’s loaded you can pretty much do whatever you want. The one really big limitation is that it would be non-trivial to parallelize groupby and join, which is potentially really important.
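
Here’s a rough sketch of the @threads route, doing independent work per group (it assumes Julia was started with multiple threads, e.g. julia -t 4, and the columns are invented):

using DataFrames
using Base.Threads

df = DataFrame(id = rand(1:100, 1_000_000), value = randn(1_000_000))
gdf = groupby(df, :id)

# Each thread handles whole groups; results go into a pre-allocated vector.
results = Vector{Float64}(undef, length(gdf))
@threads for i in 1:length(gdf)
    sub = gdf[i]                 # a SubDataFrame view, no copy
    results[i] = sum(sub.value)  # any per-group computation goes here
end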

Once you start getting to hundreds of millions of rows, you may want to start considering alternatives to DataFrames.jl (again, in principle you could still do it, but DataFrames.jl doesn’t provide many tools for this). The alternative is JuliaDB, which is worth checking out. Development on JuliaDB has been a lot slower than on DataFrames.jl, but I’m pretty sure it’s quite usable. (Note that most of the development happens in other repos, so lack of activity in the JuliaDB repo itself is not indicative of much.)
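
If you do look at JuliaDB, the out-of-core workflow is roughly the following (a sketch from memory, so double-check the keyword names and the directory layout against the JuliaDB docs):

using Distributed
addprocs(4)
@everywhere using JuliaDB

# Load a directory of CSV files into a distributed table; `output` writes a
# binary cache to disk so later loads are fast and larger-than-memory data works.
tbl = loadtable("data_dir/"; output = "data_bin/", chunks = 8)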


Ok sounds great thank you. Yeah I have way more than 10 columns so I’m going to look into JuliaDB.

https://github.com/joshday/OnlineStats.jl might be worth checking out too.
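
OnlineStats.jl computes statistics in a single streaming pass, so the whole dataset never has to be in memory at once. A tiny example:

using OnlineStats

# Each observation updates the estimates in O(1) memory.
o = Series(Mean(), Variance(), Extrema())
for x in randn(100_000)   # stand-in for values streamed from a file
    fit!(o, x)
end
value(o)                  # (mean, variance, extrema) without materializing the data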

I found that the best way to approach these types of problems (or the ones that I’ve faced) is to think about it in terms of tasks and separate the tasks into two bins: those which absolutely need all of the data to be completed vs those which can operate on a subset. The more you can work only on subsets of a huge dataset, the better.
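
Concretely, for the “operate on a subset” bin, you can often avoid loading the columns (or rows) a task doesn’t need, e.g. with CSV.jl’s select and limit keywords (the file and column names below are made up):

using CSV, DataFrames

# Read only two columns and, while prototyping, only the first million rows.
df = CSV.File("huge.csv"; select = [:id, :value], limit = 1_000_000) |> DataFrame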

Ok, thank you. If I were to use Distributed and pmap, how would I set it up to read in the huge CSV file? I’ve tried creating a function and using @everywhere to make sure all workers have that function along with the other packages, but I’m still getting errors like “method too new to be called from this world context”.

Say you have the path to a CSV file stored in a string variable test. What’s wrong with the code below?

using Distributed

addprocs(length(Sys.cpu_info()))

@everywhere using DataFrames, CSV

@everywhere function readData(inputData::String)
    df = DataFrame(CSV.File(inputData))
    return df
end

pmap(readData,test)

Are you sticking with one CPU, or many? If just one, multithreading (@threads, etc) might be easier than Distributed. You can use ThreadPools.jl to simplify some of the overhead (though @threads may be all you need, assuming the jobs are pretty uniform). But if you’re going multi-CPU, Distributed is your friend.
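
With threads on a single machine, a chunked reduction over a big column can look roughly like this (a sketch; start Julia with e.g. julia -t 8):

using DataFrames
using Base.Threads

df = DataFrame(x = randn(10_000_000))

# Partition the row range into contiguous chunks, one reduction per thread.
chunks = collect(Iterators.partition(1:nrow(df), 1_000_000))
partial = Vector{Float64}(undef, length(chunks))
@threads for i in eachindex(chunks)
    partial[i] = sum(@view df.x[chunks[i]])
end
total = sum(partial)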

Reading a file in parallel, though? I haven’t seen this. I guess if you could do an arbitrary seek to your start position, you could have a pool of processes take the file 100k lines at a time or something, then combine the outputs when done. It would require the OS to allow multiple read access. If you do get that one running, I’d love to hear the technique.
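
One way to approximate it with CSV.jl keywords rather than raw seeks (purely a hypothetical sketch, and not a true parallel seek, since each worker still has to skip past the earlier rows; the file name and chunk count are made up):

using Distributed
addprocs(4)
@everywhere using CSV, DataFrames

# Each worker parses a different 100k-row window of the same file.
nrows_per_chunk = 100_000
windows = [(2 + i * nrows_per_chunk, nrows_per_chunk) for i in 0:9]  # row 1 is the header

parts = pmap(windows) do (start, len)
    CSV.File("huge.csv"; skipto = start, limit = len) |> DataFrame
end
df = reduce(vcat, parts)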

Before bothering with all that distributed stuff, I would make sure DataFrames.jl and CSV.jl aren’t already fast enough for you. A few hundred million rows isn’t that big. CSV.jl does multithreaded parsing by default on Julia 1.3, so you don’t need to do anything.
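
In other words, with a recent CSV.jl it can be as simple as this (file name invented):

# Start Julia with threads, e.g. `julia -t 8` or JULIA_NUM_THREADS=8, then:
using CSV, DataFrames

df = CSV.read("huge.csv", DataFrame)   # chunks of the file are parsed on multiple threads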


I could be wrong here but I believe the author of CSV.jl was trying to get it to do multithreading natively, as long as you had started Julia with multiple threads.

Beaten by 1 sec by @aaowens, but obviously “what he said” ^^

It is quite possible that, even though you may not be able to read it all into memory at once (you should check this; it is possible that you can if you have enough RAM), iterating over the entire file might not take too long at all.
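
If it turns out not to fit, CSV.Rows streams the file one row at a time with a small reused buffer, so a single pass over a larger-than-memory file is still straightforward (the column name below is made up):

using CSV

# One streaming pass over a file that may not fit in memory.
function column_total(path)
    total = 0.0
    for row in CSV.Rows(path; types = Dict(:value => Float64))
        total += something(row.value, 0.0)   # treat missing values as 0
    end
    return total
end

column_total("huge.csv")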