Upload of DataFrame data to databases

Hi,

for evaluation of a possible projet usage I did a comparison of DataFrames.jl to Pandas, with side-by-side examples and timings:

Overall, DataFrames.jl performs very well in my experiments, great work!

One functionality I could not find out-of-the-box is for writing the content of a DataFrame to a database (e.g. PostgreSQL), analogue to Pandas df.to_sql().

A simple implementation of the database upload would be (taken mostly from LibPQ.jl documentation):

using DataFrames
using LibPQ
using IterTools

function insert_by_copy!(con:: LibPQ.Connection, tablename:: AbstractString, df:: DataFrame)
    row_strings = imap(eachrow(df)) do row
        join((ismissing(x) ? "" : x for x in row), ",")*"\n"
    end
    copyin = LibPQ.CopyIn("COPY $tablename FROM STDIN (FORMAT CSV);", row_strings)
    execute(con, copyin)
end

Note that this does not cover all cases - notably the column order must be the same for the DataFrame and Table and there must not be “,” in strings (and probably more edge cases I am not aware of yet).

Using the COPY command the performance is much better than using SQL Inserts, therefore this simple function outperforms Pandas df.to_sql() (but you can do the same trick for Pandas, too).

Is such a functionality already available somewhere?
If not, where would be the best point to add it? DataFrames.jl, LibPQ.jl or in a separate package?
Maybe the CSV.jl package could be used for improving the upload functionality and making it more general?

Best Regards
Benjamin

2 Likes

Thanks for the kind words. The comparison is very interesting. The fact that our sorting implementation is slower than Pandas is expected since we should use radix sort for integers. Regarding filter, a new filter(col => fun, df) syntax has just been added to master, it will be much faster than the current syntax.

I can’t help you regarding databases, hopefully others will comment. At least I can point you at Tables.jl, which is the general interface for tabular data in Julia, that LibPQ.jl already uses.

Just a small suggestion: using df.col instead of df[!, :col] makes the code much nicer to read (and closer to Pandas).

Thanks for your suggestion! I got a deprecation warning for df[:column] and somehow mixed it up with df.column - agreed that the latter syntax is nicer for simple column access.

I did some improvements to the PostgreSQL upload functionality and generalized it (as suggested by @nalimilan) for the Tables .jl interface:

Imho it would best fit into LibPQ.jl, I’ll make a PR.

2 Likes

@lunghben this is great, did it ever make it into LibPQ.jl?

Already done and waiting for merge:

Due to the dependency on CSV.jl, this function should only be included into LibPQ.jl documentation and tests, but not into the library itself.

Ah nice, that makes sense.