Benchmarking ways to write/load DataFrames and IndexedTables to disk

Looks about right. I am writing out 100m records in about 500 seconds, roughly 100 times the size of your 1m benchmarks.

I have added a performance comparison with Python's pandas. The results are as follows.

Figure 1: Reading and saving performance comparison (Julia/Python/R)

Figure 2: Saving performance comparison (Julia/Python/R)

System information:

  • SSD disk
  • Python: Conda.jl + Python 2.7

I also tested pandas under Python 3, from a plain Python environment. The read/write speed is measured as follows.

import pandas as pd, time
csv_file = "e:/tmp/df_fwrite.csv"
t0 = time.time(); df = pd.read_csv(csv_file); print("read:", time.time() - t0)
t0 = time.time(); df.to_csv(csv_file + ".out"); print("write:", time.time() - t0)

It seems that:
(1) R (with the fst package) is the fastest;
(2) Julia (Feather.jl) read and write speeds are balanced;
(3) Python (pandas) read speed is OK, but write speed is very slow.


The Julia code below is modified from @xiaodai's code.

1. Benchmark the running time

################################################################################
# setup
################################################################################
using Revise, RCall, Plots, StatPlots
# using DataBenchs
using uCSV, Feather, BenchmarkTools, CSV
using TextParse, FileIO, JLD, IterableTables
using JuliaDB, IndexedTables, JLD2, DataFrames
outpath = "e:/tmp/"

srand(1);
N = 1_000_000;  K = 100;  # match the benchmark calls below; pool/pool1 are globals used inside bench_df_write_read
# @time df = DataBench.createSynDataFrame(N, K); #31 #40
pool = "id".*dec.(1:K,3);
pool1 = "id".*dec.(1:N÷K,10);
nums = round.(rand(100).*100, 4);

############################ python package test
using PyCall
@pyimport pandas as pd
function test_pandas_read()
    csv_file = "e:/tmp/df_fwrite.csv"
    pd.read_csv( csv_file )
end

function test_pandas_save( frm )
    file_pandas_write = outpath * "pandas_fwrite.csv"
    frm[:to_csv]( file_pandas_write )
end

""" Create `N` row, 9 columns, several col with 1:K integer
    outpath: dir; exclslow: defalut is true, only test 3 libraries, else 7
"""
function bench_df_write_read(N,K, outpath, exclslow = true)
    df = DataFrame(
        id1 = rand(pool,N),
        id2 = rand(pool,N),
        id3 = rand(pool1,N),
        id4 = rand(1:K,N),
        id5 = rand(1:K,N),
        id6 = rand(1:(N÷K),N),  # ÷ keeps id6 an integer column (N/K would sample Float64s)
        v1 = rand(1:5,N),
        v2 = rand(1:5,N),
        v3 = rand(nums,N));

    FileIO.save(outpath*"df_fileio.csv", df)
    csv_file = "e:/tmp/df_fwrite.csv"
    frm = pd.read_csv( csv_file )  # will be used for testting Pandas
    test_pandas_save(frm)
    test_pandas_read()
    if exclslow
        writeres = [
            @benchmark(Feather.write(outpath*"df.feather", $df)), # 138
            # @benchmark(CSV.write(outpath*"df.csv", $df)), #569.749011
            @benchmark(FileIO.save(outpath*"df_fileio.csv", $df)), # 209.438193 seconds (1.20 G allocations: 47.704 GiB, 5.91% gc time)
            # @benchmark(uCSV.write(outpath*"df_u.csv", $df)), #528.785193 seconds (3.60 G allocations: 157.952 GiB, 8.43% gc time)
            @benchmark(FileIO.save(outpath*"df.jld","df", $df)), #215.839709 seconds (1.16 k allocations: 6.706 GiB, 2.50% gc time)
            # @benchmark(JLD2.@save(outpath*"df.jld2", $df)), #765.809597 seconds (2.70 G allocations: 58.094 GiB, 19.22% gc time)
            #,@benchmark(JuliaDB.save($dfit,outpath*randstring(8)))
            @benchmark( test_pandas_save($frm) )  # pandas writing
            ]
        readres = [
            @benchmark(Feather.read(outpath*"df.feather"))
            ,@benchmark(DataFrame(FileIO.load(outpath*"df_fileio.csv")))
            ,@benchmark(FileIO.load(outpath*"df.jld"))
            ,@benchmark( test_pandas_read() )
        ]
        return (writeres, readres)
    else
        dfit = table(df)

        return [
            @benchmark(Feather.write(outpath*"df.feather", $df)) , # 138
            @benchmark(CSV.write(outpath*"df.csv", $df)), #569.749011
            @benchmark(FileIO.save(outpath*"df_fileio.csv", $df)), # 209.438193 seconds (1.20 G allocations: 47.704 GiB, 5.91% gc time)
            @benchmark(uCSV.write(outpath*"df_u.csv", $df)), #528.785193 seconds (3.60 G allocations: 157.952 GiB, 8.43% gc time)
            @benchmark(FileIO.save(outpath*"df.jld","df", $df)), #215.839709 seconds (1.16 k allocations: 6.706 GiB, 2.50% gc time)
            @benchmark(JLD2.@save(outpath*"df.jld2", $df)), #765.809597 seconds (2.70 G allocations: 58.094 GiB, 19.22% gc time)
            @benchmark(JuliaDB.save($dfit,outpath*randstring(8))),
            @benchmark( test_pandas_save($frm) )
            ]
    end
end

function rreadwrite(outpath)
    r = R"""
    memory.limit(2^31-1)
    library(fst)
    library(feather)
    library(data.table)
    df = feather::read_feather(file.path($outpath,"df.feather"))

    pt = proc.time()
    system.time(write_fst(df,file.path($outpath,"df_default.fst")))[3]
    fstt = proc.time() - pt

    #system.time(write_fst(df,file.path($outpath,"df_0.fst"), 0))[3]
    #system.time(write_fst(df,file.path($outpath,"df_100.fst"), 100))[3]

    pt = proc.time()
    system.time(read_fst(file.path($outpath,"df_default.fst")))[3]
    fsttr= proc.time() - pt

    # system.time(read_fst(file.path($outpath,"df_0.fst")))[3],
    # system.time(read_fst(file.path($outpath,"df_100.fst")))[3],

    # multi threaded read write
    pt = proc.time()
    fwrite(df, file.path($outpath, "df_fwrite.csv"))
    fwritet = proc.time() - pt

    pt = proc.time()
    #system.time(fread("df_fwrite.csv"))[3]
    system.time(fread(file.path($outpath, "df_fwrite.csv")))[3]
    dtr = proc.time() - pt

    list(
        fstt[3],
        fsttr[3],
        fwritet[3],
        dtr[3]
    )
    """
    [Float64(r[i]) for i=1:length(r)]
end

function plot_bench_df_read_write(julres, rres, N, exclslow=true)
    if exclslow
        x = ["Feather.jl",          "TextParse.jl\n FileIO.jl",         "JLD.jl\n FileIO.jl"]
        rx = ["R\n fst (default)","R\n data.table"]

        x = vcat(repeat(x, outer=2), repeat(rx, outer=2))
        rw = ["write","read"]

        group = vcat(repeat(rw, inner=3), repeat(rw, inner=2))

        julwrite = (x->(x.times ./ 1e9) |> mean).(julres[1])
        julread = (x->(x.times ./ 1e9) |> mean).(julres[2])
        y = vcat(julwrite, julread, rres)

        groupedbar(
        x
        , y
        , group = group
        , bar_position = :dodge
        , title = "DataFrame read/write to disk performance ($(Int(N/1_000_000))m)"
        , ylabel = "seconds")
        savefig("e:/read_write_df_bench $(Int(N/1_000_000))m.png")
    else
        x = ["Feather.jl","CSV.jl", "TextParse.jl\n FileIO.jl","uCSV.jl","JLD.jl\n FileIO.jl","JLD2.jl","IndexedTables.jl", "R\n fst (default)","R\n fwrite"]
        y = [(x->(x.times ./ 1e9) |> mean).(res[1:7])]
        groupedbar(
            x
            , y
            , group = group
            , bar_position = :dodge
            , title = "DataFrame write to disk performance ($(Int(N/1_000_000))m)"
            , ylabel = "seconds")
        savefig("e:/read_write_df_bench $(Int(N/1_000_000))m.png")
    end
end

################################################################################
# benchmark write
# ZJ: from my own testing, writing 1m rows is sufficient to assess the
# relativities
################################################################################
println("starting testing")
if true  # full suite (exclslow = false)
    julres = bench_df_write_read(1_000_000, 100, outpath, false )
    rres=rreadwrite(outpath)
    res1m = (julres, rres)
    FileIO.save("e:/df_write_bench_1m_more.jld", "res1m", res1m)
end
if true  # fast libraries plus pandas (exclslow defaults to true)
    julres = bench_df_write_read(1_000_000, 100, outpath )
    rres=rreadwrite(outpath)
    res1m = (julres, rres)
    FileIO.save("e:/df_write_bench_1m.jld", "res1m", res1m)
end

2. Show the figure

using BenchmarkTools
using GR
using StatPlots
using FileIO
using Plots

function plot_bench_df_read_write(julres, rres, N, exclslow=true)
    if exclslow
        x = ["Feather.jl",          "TextParse.jl\n FileIO.jl",         "JLD.jl\n FileIO.jl", "Python\n Pandas"]
        rx = ["R\n fst (default)","R\n data.table"]

        x = vcat(repeat(x, outer=2), repeat(rx, outer=2))
        rw = ["write","read"]

        group = vcat(repeat(rw, inner=4), repeat(rw, inner=2))

        julwrite = (x->(x.times ./ 1e9) |> mean).(julres[1])  # BenchmarkTools times are in ns
        julread = (x->(x.times ./ 1e9) |> mean).(julres[2])
        y = vcat(julwrite, julread, rres[[1,3,2,4]])  # reorder rres to (fst write, dt write, fst read, dt read)

        groupedbar(
        x
        , y
        , group = group
        , bar_position = :dodge
        , title = "DataFrame read/write to disk performance ($(Int(N/1_000_000))m)"
        , ylabel = "seconds")
        #GR.savefig("e:/read_write_df_bench $(Int(N/1_000_000))m.png")
    else
        x = ["Feather.jl","CSV.jl", "TextParse.jl\n FileIO.jl","uCSV.jl","JLD.jl\n FileIO.jl","JLD2.jl","IndexedTables.jl", "Python\n Pandas", "R\n fst (default)","R\n fwrite"]
        julwrite = (x->(x.times ./ 1e9) |> mean).(julres)
        rwrite = [rres[1], rres[3]]
        
        rm = ["write"]
        group = vcat(repeat(rm, inner=10))
        y = vcat(julwrite, rwrite)
        groupedbar(
            x
            , y
            , group = group
            , bar_position = :dodge
            , title = "DataFrame write to disk performance ($(Int(N/1_000_000))m)"
            , ylabel = "seconds")
        #GR.savefig("read_write_df_bench $(Int(N/1_000_000))m.png")        
    end
end

gr()  # select the GR backend for Plots (GR is already loaded above)

# show read and write
data = FileIO.load("e:/df_write_bench_1m.jld")
res1m = data["res1m"]
(julres, rres) = res1m
plot_bench_df_read_write(julres, rres, 1_000_000)

#  show writing test
data = FileIO.load("e:/df_write_bench_1m_more.jld")
res1m = data["res1m"]
(julres, rres) = res1m

plot_bench_df_read_write(julres, rres, 1_000_000, false)

Thanks, this stuff is super helpful!

Here are some conclusions from this re CSV files:

  • we are in really good shape in terms of CSV reading performance. @shashi’s TextParse.jl is simply awesome. If I read this correctly, it beats both the Python/pandas story (narrowly) and the R/data.table story (pretty clearly). There is another twist to this: right now CSVFiles.jl introduces an entirely unnecessary overhead, and I’m super close to getting rid of it. Even with that overhead we are faster than the Python and R stories, and that should get even better going forward. One interesting test to add would be a pure TextParse.jl run that skips the CSVFiles/FileIO integration; once I fix the thing in CSVFiles.jl, we should essentially see that performance from CSVFiles.jl going forward. The test to add would just be csvread("filename.csv") (a sketch follows this list).
  • For CSV writing, CSVFiles.jl currently takes the second spot. It beats the Python/pandas story (clearly), but not the R/data.table story. We of course know why that is: data.table’s CSV writing is multi-threaded, ours is not. I would love to see another benchmark added to this pile: fwrite with nThread=1. That would tell us whether fwrite is faster because of multi-threading or whether it has additional optimizations.
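
A minimal sketch of that pure-TextParse.jl test, assuming the CSV written earlier in this thread is still on disk:

using TextParse, BenchmarkTools
csv_file = "e:/tmp/df_fwrite.csv"
# parse directly with TextParse, skipping the CSVFiles/FileIO layer
@benchmark TextParse.csvread($csv_file)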

Oh, one more question: which version of DataFrames.jl is used for these comparisons? v0.10 or v0.11?

Thank you so much for your suggestions. I will try to add more if possible.
In fact, I am just a researcher in urban transportation and not familiar with multi-threaded / multi-core programming. If the default package can use multiple cores automatically, that will be very useful for users like me who use the Julia language like Matlab, as a tool for research, engineering, or teaching. Here, I just showed some information based on which I selected packages for a research project.
By the way, thanks for your great work on julia-vscode. I like it very much!


The version of DataFrames.jl is v0.11.5.

Totally agreed, and I’m not suggesting we replace the normal fwrite test with one that uses only one core. The “fair” comparison is to give each package its best shot; I’d just like to see the single-core performance of fwrite for my own curiosity. If that is comparable to what we have with CSVFiles.jl, it would suggest we are doing as well as we can right now and just have to wait for better threading in Julia to improve the situation. But if single-core fwrite beats us in a significant way, then we should dig in and see whether we can improve the performance of CSVFiles.jl today.
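
For reference, a sketch of that single-core run via RCall; nThread is data.table's thread-count argument, and the output file name here is just for illustration:

using RCall
R"""
library(data.table)
df <- fread(file.path("e:/tmp", "df_fwrite.csv"))
# nThread = 1 forces single-threaded writing, for a like-for-like comparison
system.time(fwrite(df, file.path("e:/tmp", "df_fwrite_1thread.csv"), nThread = 1))[3]
"""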

What compression options are being used by the different packages?

fst is using default compression parameters. I don’t think either Feather or JLD has compression options. This reminds me: we should add R’s rds files to the mix, with compression on and off.
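
A sketch of what that rds test could look like via RCall; saveRDS's compress argument toggles compression, and the .rds file names are just for illustration:

using RCall
R"""
df <- feather::read_feather(file.path("e:/tmp", "df.feather"))
system.time(saveRDS(df, file.path("e:/tmp", "df_comp.rds"), compress = TRUE))[3]
system.time(saveRDS(df, file.path("e:/tmp", "df_nocomp.rds"), compress = FALSE))[3]
system.time(readRDS(file.path("e:/tmp", "df_comp.rds")))[3]
"""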

@davidanthoff Totally agree! Some other packages, such as Python’s Dask (Dask — Dask documentation), can even use multiple cores automatically in a cluster environment. Comparison results are only meaningful to package developers when the same computational resources are used.
I saw there is a data benchmark project by @xiaodai, which would be very useful (https://github.com/xiaodaigh/DataBench.jl). Currently, I mainly use Python for my daily work and am only a beginner in Julia. Maybe @xiaodai could give more useful results to help the development of the Julia packages.

Dask, JuliaDB.jl, SAS, and Spark on a single machine (I know most of them can be distributed) belong to the medium-data tool set; they are not just in-memory tools like pandas and data.table. I think they should get their own benchmarks, but they can be benchmarked in this suite as well. Spark uses Parquet and HDFS, so at some point we should add those to the benchmarks.


If you are not limiting your study to CSV, JuliaDB can also save in a binary format: you could benchmark JuliaDB.save and JuliaDB.load for completeness.
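
Something like this sketch, reusing the table built in the benchmark code above (dfit = table(df) and outpath as defined there, and assuming JuliaDB.load reads back what JuliaDB.save wrote):

using JuliaDB, BenchmarkTools
# binary round-trip of the IndexedTables table created earlier
@benchmark JuliaDB.save($dfit, outpath * "df_juliadb")
@benchmark JuliaDB.load(outpath * "df_juliadb")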

To clarify: what good benchmarks can really do is find out exactly what assumptions the algorithms are making and how that affects their performance. It can be very helpful for us to know not only that there is a difference, but what we would need to do to overcome it. A multi-threaded versus single-core comparison may tell the user what is fastest, but it is not helpful to us without the single-core comparison (are we doing something fundamentally different? Or do we just need to multithread?). Even as a beginner in Julia, you can always help by benchmarking to find cases where algorithms are doing well and where they are not, and what that means about the implementation. Anyway, if you do this, you’ll quite quickly find yourself knowing more about the implementations than anyone else.


Let me just second what @ChrisRackauckas wrote. These kinds of benchmarks really, really help me prioritize things in the universe of packages I maintain. My main constraint is time, so I’m incredibly grateful to folks who create and run these benchmarks; it is a fantastic way to help package creators with their work!


@davidanthoff I would like to add more tests following your advice and the community's. It would also be a good way to learn the Julia language.

Updated with @zhangliye’s code. R’s feather implementation is quite a bit faster than Julia’s; this can probably be improved. Also, data.table’s fwrite is very fast and is competitive with fst.

@davidanthoff Looks like CSV.jl has a reasonably fast reader, on par with pandas and data.table in this case (reading 1m rows with 9 columns of mixed string, float, and integer types). I am interested in testing this on a largish real-world dataset, e.g. Fannie Mae, to see how it stacks up; the last time I tried it, it didn’t compare so favourably.
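
For anyone who wants to reproduce that timing, a minimal sketch (assuming CSV.read with just a file path, as in the CSV.jl version of the time, and the file used throughout this thread):

using CSV, BenchmarkTools
@benchmark CSV.read("e:/tmp/df_fwrite.csv")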

Hello all, as I noted on the other thread, I’m in the process of completely rewriting Feather.jl to use my new Arrow.jl, which is a back-end for serializing and deserializing any Apache Arrow formatted data (which Feather is). I don’t actually expect it to be any faster than the old version of Feather (the reason for rewriting was more about cleaning up code and expanding functionality), but I do expect it to be very performant, particularly for non-nullable bits types. I expect things to change significantly in 0.7, as Arrow relies strongly on reinterpret, the behavior of which has been drastically altered in 0.7. Once there are some 0.7 release candidates, I will spend some time trying to optimize Feather.jl if I have time. At that point I’ll try to post it as a benchmark with the others.


I’ve done some performance testing on the new Feather.jl (my fork, still a PR). I read a 20 million row data frame of mostly “worst case scenario” data (strings that may be missing), about 5.6 GB. The new Feather took about 19 seconds; Python feather (keep in mind that’s pretty much all C++) took about 14 seconds. A big chunk of the Julia time was due to the inefficiency of Union types in 0.6, so hopefully we’ll be able to beat Python easily in 0.7. I haven’t thoroughly tested the “best case scenario” (non-missing bits types) yet, but it is pretty much as fast as it could possibly be.

The new Feather will also take advantage of random access capability (memory mapping with lazy loading), so most of the time it will not be necessary for you to actually load in all the data.

So, there’s light at the end of the tunnel for deserialization!

By the way, in the benchmarks you show above, are you absolutely sure that R is actually deserializing all the data, and not using some sort of memory-mapping or lazy-loading scheme?


Just looked at your awesome Arrow.jl package. Wonder how it’s doing in v1?