Processing CSVs in parallel

#1

I’m basically trying to serialize CSVs in parallel. The parallel stuff seems to have changed a bit from Julia’s more sophomoric (to be as polite as I can) versions, but I could never get it working back then either. Let’s say I had a list of files [1.csv, 2.csv, 3.csv] where I just wanted to do a readtable() and a writetable() using DataFrames. What would be the most efficient way to achieve that? It seems pmap or @parallel could both work, so I’m wondering which approach is better and why. Are there any good resources for this type of thing? Many thanks in advance.
Chase CB


#2

Whenever you’re performing operations that include I/O to/from a single disk, the bottleneck is the I/O, and that can’t be sped up by parallelizing.

If this is a limiting factor in a production application, use faster disks (PCIe SSDs and/or RAID).


#3

Sorry, I should have been clearer: there is an intermediate data processing step where parallelization might pay some dividends, but I get what you are saying about I/O.


#4

Have you profiled your script’s execution yet? Unless the time spent on data processing is on the same order as the I/O time, parallelizing will only buy you a minuscule speed-up.

If it’s on the same order, I would read the files into memory serially, process in parallel, and write the results back serially, but the speed-up will only be ~25-50% at best.
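That read-serially / process-in-parallel / write-serially pattern could be sketched roughly like this (the `transform` function and the `out_` output naming are hypothetical stand-ins; `readtable`/`writetable` are the pre-CSV.jl DataFrames API used elsewhere in this thread):

```julia
# Sketch: serial read, parallel process, serial write (Julia 0.6-era API).
addprocs(4)                    # spawn 4 workers; on Julia ≥ 0.7, `using Distributed` first

@everywhere using DataFrames

# Hypothetical CPU-bound step; must be defined on every worker.
@everywhere function transform(df::DataFrame)
    # ... heavy per-DataFrame work here ...
    return df
end

files = ["1.csv", "2.csv", "3.csv"]
dfs = [readtable(f) for f in files]    # serial read on the master process
results = pmap(transform, dfs)         # parallel process across the workers
for (f, df) in zip(files, results)
    writetable("out_" * f, df)         # serial write (hypothetical output names)
end
```

One caveat with this layout: pmap ships each DataFrame to a worker and the result back, so for ~100 MB files the serialization overhead can eat much of the gain. Passing filenames and letting each worker do its own I/O avoids that.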


#5

Yes, I did use the profiler, actually, and the data processing takes roughly 12x the I/O time.


#6

Can you post a MWE of your current code? If processing takes that long, parallelizing it will likely help, but your approach will depend on the size of the files you’re working with and the type of operations you’re performing on the data.


#7

I used pmap to do similar stuff, and it works great.
The I/O goes through the network, though.


#8

The files are roughly 100 MB each, and there are many of them.


#9
using DataFrames

files = ["2000a.csv", "2000b.csv", "2000c.csv", "2000d.csv"]

const conversions = Dict(       ## `const` is only valid at top level, not inside a function
    :date   => c -> Date.(c, "mm/dd/yyyy"),    ## broadcast Date over the column
    :value  => c -> [float(s) for s in c],
    :date2  => c -> Date.(c, "yyyy-mm-dd"),
    :value2 => c -> [float(s) for s in c],
)

function process(df::DataFrame)
    for (name, col) in eachcol(df)
        if haskey(conversions, name)
            df[name] = conversions[name](col)
        end
    end
    return df
end

for file in files ## this loop is what I would like to make parallel
    df = readtable(file)
    df = process(df)
    writetable("path", df)  ## note: writetable takes the output filename first
end
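One way to parallelize that loop (a sketch, untested against this exact setup): map pmap over the filenames rather than the DataFrames, so each worker reads, processes, and writes its own file and no DataFrame ever crosses a process boundary. The `process_file` wrapper and the `out_` output naming are hypothetical; `process` is the function from the snippet above and, like everything workers call, must be defined under `@everywhere`:

```julia
addprocs()          # one worker per core; on Julia ≥ 0.7, `using Distributed` first

@everywhere using DataFrames

# Each worker handles one file end to end: read, process, write.
# `process` (and the `conversions` Dict it uses) must also be
# defined with @everywhere so the workers can see them.
@everywhere function process_file(file)     # hypothetical wrapper name
    df = readtable(file)
    df = process(df)
    writetable("out_" * file, df)           # hypothetical output naming
    return file                             # return something small, not the DataFrame
end

files = ["2000a.csv", "2000b.csv", "2000c.csv", "2000d.csv"]
pmap(process_file, files)
```

Returning only the filename keeps the result serialization cheap; returning the processed DataFrame would ship ~100 MB back to the master per file for no benefit.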