Processing CSVs in parallel

#1

I’m basically trying to serialize CSVs in parallel. The parallel stuff seems to have changed a bit from Julia’s more sophomoric (to be as polite as I can) versions, but I could never get it working back then either. Let’s say I had a list of files [1.csv, 2.csv, 3.csv] where I just wanted to do a readtable() and a writetable() using DataFrames. What would be the most efficient way to achieve that? It seems pmap or @parallel could both work, so I’m wondering which approach is better and why. Are there any good resources for this type of thing? Many thanks in advance.
Chase CB


#2

Whenever you’re performing operations that include I/O to/from a single disk, the bottleneck is the I/O, and that can’t be sped up by parallelizing.

If this is a limiting factor in a production application, use faster disks (PCIe SSDs and/or RAID).


#3

Sorry, I should have been clearer: there is an intermediate data processing step where parallelization might pay some dividends, but I get what you are saying about I/O.


#4

Have you profiled your script’s execution yet? Unless the time spent on data processing is on the same order as the I/O time, parallelizing will only buy you a minuscule speed-up.

If it’s on the same order, I would read the files into memory serially, process in parallel, and write the results back serially, but the speed-up will only be ~25-50% at best.
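That read-serially / process-in-parallel / write-serially pattern could be sketched roughly like this (the `transform` function and the `out_` output naming are hypothetical stand-ins; `readtable`/`writetable` are the pre-CSV.jl DataFrames API used elsewhere in this thread):

```julia
# Sketch: serial read, parallel process, serial write (Julia 0.6-era API).
addprocs(4)                    # spawn 4 workers; on Julia ≥ 0.7, `using Distributed` first

@everywhere using DataFrames

# Hypothetical CPU-bound step; must be defined on every worker.
@everywhere function transform(df::DataFrame)
    # ... heavy per-DataFrame work here ...
    return df
end

files = ["1.csv", "2.csv", "3.csv"]
dfs = [readtable(f) for f in files]    # serial read on the master process
results = pmap(transform, dfs)         # parallel process across the workers
for (f, df) in zip(files, results)
    writetable("out_" * f, df)         # serial write (hypothetical output names)
end
```

One caveat with this layout: pmap ships each DataFrame to a worker and the result back, so for ~100 MB files the serialization overhead can eat much of the gain. Passing filenames and letting each worker do its own I/O avoids that.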


#5

Yes, I did use the profiler, actually, and the data processing takes roughly 12x the I/O time.


#6

Can you post a MWE of your current code? If processing takes that long, parallelizing it will likely help, but your approach will depend on the size of the files you’re working with and the type of operations you’re performing on the data.


#7

I used pmap to do similar stuff, and it works great.
The I/O goes through the network, though.


#8

The files are roughly 100 MB each, and there are many of them.


#9
using DataFrames

files = ["2000a.csv", "2000b.csv", "2000c.csv", "2000d.csv"]

const conversions = Dict(       ## `const` is only valid at top level, not inside a function
    :date   => c -> Date.(c, "mm/dd/yyyy"),    ## broadcast Date over the column
    :value  => c -> [float(s) for s in c],
    :date2  => c -> Date.(c, "yyyy-mm-dd"),
    :value2 => c -> [float(s) for s in c],
)

function process(df::DataFrame)
    for (name, col) in eachcol(df)
        if haskey(conversions, name)
            df[name] = conversions[name](col)
        end
    end
    return df
end

for file in files ## this loop is what I would like to make parallel
    df = readtable(file)
    df = process(df)
    writetable("path", df)  ## note: writetable takes the output filename first
end
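One way to parallelize that loop (a sketch, untested against this exact setup): map pmap over the filenames rather than the DataFrames, so each worker reads, processes, and writes its own file and no DataFrame ever crosses a process boundary. The `process_file` wrapper and the `out_` output naming are hypothetical; `process` is the function from the snippet above and, like everything workers call, must be defined under `@everywhere`:

```julia
addprocs()          # one worker per core; on Julia ≥ 0.7, `using Distributed` first

@everywhere using DataFrames

# Each worker handles one file end to end: read, process, write.
# `process` (and the `conversions` Dict it uses) must also be
# defined with @everywhere so the workers can see them.
@everywhere function process_file(file)     # hypothetical wrapper name
    df = readtable(file)
    df = process(df)
    writetable("out_" * file, df)           # hypothetical output naming
    return file                             # return something small, not the DataFrame
end

files = ["2000a.csv", "2000b.csv", "2000c.csv", "2000d.csv"]
pmap(process_file, files)
```

Returning only the filename keeps the result serialization cheap; returning the processed DataFrame would ship ~100 MB back to the master per file for no benefit.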