To evaluate DataFrames.jl for possible use in a project, I compared it to Pandas, with side-by-side examples and timings:
Overall, DataFrames.jl performs very well in my experiments, great work!
One feature I could not find out of the box is writing the contents of a DataFrame to a database (e.g. PostgreSQL), analogous to Pandas df.to_sql().
A simple implementation of the database upload would be (taken mostly from the LibPQ.jl documentation):
using DataFrames
using LibPQ
using IterTools

function insert_by_copy!(con::LibPQ.Connection, tablename::AbstractString, df::DataFrame)
    # Lazily turn each row into one CSV line; missing values become empty fields (NULL).
    row_strings = imap(eachrow(df)) do row
        join((ismissing(x) ? "" : x for x in row), ",") * "\n"
    end
    # Stream the lines to the server with a single COPY statement.
    copyin = LibPQ.CopyIn("COPY $tablename FROM STDIN (FORMAT CSV);", row_strings)
    execute(con, copyin)
end
Note that this does not cover all cases: the column order must be the same in the DataFrame and the table, and strings must not contain “,” (there are probably more edge cases I am not aware of yet).
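The two limitations noted above could probably be loosened a little. The following is a minimal sketch that I have not run against a real database: the helper names `csv_field`, `csv_row` and `copy_statement` are made up for illustration, and it assumes that table and column names are plain identifiers that need no quoting. It quotes string fields according to the CSV rules that `FORMAT CSV` understands, and it lists the columns explicitly in the COPY statement so that the DataFrame column order no longer has to match the table.

```julia
# Hypothetical helpers (not part of DataFrames.jl or LibPQ.jl).

# Format one value as a CSV field for COPY ... (FORMAT CSV).
# An unquoted empty field is read as NULL, so `missing` becomes "".
csv_field(x::Missing) = ""

# Strings and characters are always quoted; embedded double quotes are doubled.
csv_field(x::Union{AbstractString,AbstractChar}) =
    "\"" * replace(string(x), "\"" => "\"\"") * "\""

# Everything else uses its default string representation.
csv_field(x) = string(x)

# Join the values of one row into a single CSV line.
csv_row(values) = join((csv_field(v) for v in values), ",") * "\n"

# Build a COPY statement with an explicit column list.
# Assumption: table and column names are plain identifiers.
function copy_statement(tablename::AbstractString, columns)
    collist = join(string.(columns), ", ")
    return "COPY $tablename ($collist) FROM STDIN (FORMAT CSV);"
end
```

In `insert_by_copy!`, `csv_row(row)` would take the place of the `join(...)` expression and `copy_statement(tablename, names(df))` the place of the hard-coded COPY string. Because strings are always quoted, an empty string is sent as `""` and stays distinct from NULL.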
Because the COPY command is much faster than SQL INSERT statements, this simple function outperforms Pandas df.to_sql() (although the same trick works for Pandas, too).
Is such functionality already available somewhere?
If not, where would be the best place to add it: DataFrames.jl, LibPQ.jl, or a separate package?
Maybe the CSV.jl package could be used to improve the upload function and make it more general?
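One way this might look, as an untested sketch: `CSV.write` accepts an `IO` target and takes care of quoting, so the whole DataFrame can be serialized into a buffer and handed to `LibPQ.CopyIn` as a single chunk. The names `csv_payload`, `copy_sql` and `insert_by_copy_csv!` are hypothetical. The sketch assumes that the table columns are named exactly like the DataFrame columns, that no identifier quoting is needed, and that the data fits in memory as one string. It lets `CSV.write` emit its default header line and asks PostgreSQL to skip that line with the `HEADER` option of COPY.

```julia
using CSV
using DataFrames
using LibPQ

# Serialize the DataFrame to CSV text (header line first) using CSV.jl.
function csv_payload(df::DataFrame)
    io = IOBuffer()
    CSV.write(io, df)
    return String(take!(io))
end

# COPY statement with an explicit column list, so column order does not matter.
# HEADER tells COPY to ignore the first input line.
function copy_sql(tablename::AbstractString, df::DataFrame)
    collist = join(string.(names(df)), ", ")
    return "COPY $tablename ($collist) FROM STDIN (FORMAT CSV, HEADER);"
end

# Hypothetical variant of insert_by_copy! that delegates CSV formatting to CSV.jl.
function insert_by_copy_csv!(con::LibPQ.Connection, tablename::AbstractString, df::DataFrame)
    # CopyIn takes an iterable of data chunks; here the whole payload is one chunk.
    copyin = LibPQ.CopyIn(copy_sql(tablename, df), [csv_payload(df)])
    return execute(con, copyin)
end
```

For large tables a row-by-row iterator would be preferable to one big string. It is also worth checking how empty strings and `missing` values come out of `CSV.write`, since COPY in CSV mode reads an unquoted empty field as NULL.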
Thanks for the kind words. The comparison is very interesting. It is expected that our sorting implementation is slower than Pandas, since we should be using radix sort for integers. Regarding filter, a new filter(col => fun, df) syntax has just been added to master; it will be much faster than the current syntax.
I can’t help you regarding databases; hopefully others will comment. At least I can point you to Tables.jl, the general interface for tabular data in Julia, which LibPQ.jl already uses.
Just a small suggestion: using df.col instead of df[!, :col] makes the code much nicer to read (and closer to Pandas).
Thanks for your suggestion! I got a deprecation warning for df[:column] and somehow mixed it up with df.column. I agree that the latter syntax is nicer for simple column access.