Upload of DataFrame data to databases

lungben · April 3, 2020, 11:15am

Hi,

To evaluate DataFrames.jl for possible use in a project, I compared it to Pandas, with side-by-side examples and timings:

https://github.com/lungben/julia_experiments/blob/master/notebooks/data_frame_comparison_julia.ipynb

Overall, DataFrames.jl performs very well in my experiments, great work!

One piece of functionality I could not find out of the box is writing the contents of a DataFrame to a database (e.g. PostgreSQL), analogous to Pandas df.to_sql().

A simple implementation of the database upload would be (taken mostly from the LibPQ.jl documentation):

using DataFrames
using LibPQ
using IterTools

# Upload the rows of `df` to the table `tablename` with a single COPY command.
function insert_by_copy!(con::LibPQ.Connection, tablename::AbstractString, df::DataFrame)
    # Lazily turn each row into one CSV line; missing becomes an empty field (NULL)
    row_strings = imap(eachrow(df)) do row
        join((ismissing(x) ? "" : x for x in row), ",") * "\n"
    end
    copyin = LibPQ.CopyIn("COPY $tablename FROM STDIN (FORMAT CSV);", row_strings)
    execute(con, copyin)
end

Note that this does not cover all cases: notably, the column order must be the same in the DataFrame and the table, and strings must not contain “,” (and there are probably more edge cases I am not aware of yet).
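To make the two limitations above concrete, here is a minimal sketch of how they could be worked around. The helper names csv_field, csv_row and copy_statement are hypothetical (they are not part of LibPQ.jl or DataFrames.jl), and the sketch assumes PostgreSQL's default CSV conventions: double quotes as the quote character, and an unquoted empty field read as NULL.

```julia
# Hypothetical helpers for illustration; not part of LibPQ.jl or DataFrames.jl.

# Render one value as a CSV field for COPY ... (FORMAT CSV).
# `missing` becomes an unquoted empty field, which PostgreSQL reads as NULL.
csv_field(::Missing) = ""

function csv_field(x)
    s = string(x)
    # Quote empty strings (to keep them distinct from NULL) and any field
    # containing a delimiter, a quote character or a line break.
    needs_quotes = isempty(s) || any(c -> c in ('"', ',', '\r', '\n'), s)
    # Inside a quoted field, a double quote is escaped by doubling it.
    return needs_quotes ? "\"" * replace(s, "\"" => "\"\"") * "\"" : s
end

# Join the fields of one row into a single CSV line.
csv_row(values) = join((csv_field(v) for v in values), ",") * "\n"

# Build a COPY statement with an explicit column list, so the column order
# of the target table no longer has to match the DataFrame.
# Assumes the column names are plain identifiers that need no quoting.
function copy_statement(tablename::AbstractString, columns)
    column_list = join((string(c) for c in columns), ", ")
    return "COPY $tablename ($column_list) FROM STDIN (FORMAT CSV);"
end
```

In insert_by_copy! above, csv_row(row) would take the place of the join(...) line, and copy_statement(tablename, names(df)) would replace the hard-coded COPY string. I have not benchmarked the extra quoting work.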

With the COPY command, performance is much better than with SQL INSERT statements, so this simple function outperforms Pandas df.to_sql() (though you can use the same trick with Pandas, too).

Is such functionality already available somewhere?
If not, where would be the best place to add it? DataFrames.jl, LibPQ.jl, or a separate package?
Maybe the CSV.jl package could be used for improving the upload functionality and making it more general?

Best Regards
Benjamin

nalimilan · April 3, 2020, 4:51pm

Thanks for the kind words. The comparison is very interesting. The fact that our sorting implementation is slower than Pandas is expected, since we should be using radix sort for integers. Regarding filter, a new filter(col => fun, df) syntax has just been added to master; it will be much faster than the current syntax.

I can’t help you regarding databases; hopefully others will comment. At least I can point you to Tables.jl, the general interface for tabular data in Julia, which LibPQ.jl already uses.

Just a small suggestion: using df.col instead of df[!, :col] makes the code much nicer to read (and closer to Pandas).
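As a small illustration of both points, here is a sketch on a made-up DataFrame (the column names id and score are only for the example). It assumes a DataFrames.jl version that already includes the new filter method.

```julia
using DataFrames

# Toy data, made up for this example
df = DataFrame(id = [1, 2, 3, 4], score = [0.5, 1.5, 2.5, 3.5])

# df.score and df[!, :score] both return the stored column itself (no copy)
same_column = df.score === df[!, :score]

# Column => function form of filter: the predicate is called with one
# value of the selected column per row, not with a whole row object
high = filter(:score => s -> s > 1.0, df)
```

Here high keeps only the rows whose score exceeds 1.0.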

lungben · April 3, 2020, 5:39pm

Thanks for your suggestion! I got a deprecation warning for df[:column] and somehow mixed it up with df.column - agreed that the latter syntax is nicer for simple column access.

lungben · April 4, 2020, 7:50pm

I made some improvements to the PostgreSQL upload functionality and generalized it (as suggested by @nalimilan) to the Tables.jl interface:

https://github.com/lungben/julia_experiments/blob/master/notebooks/df_to_db.ipynb

IMHO it would fit best in LibPQ.jl; I’ll make a PR.

mmiller · October 16, 2020, 3:15am

@lungben this is great, did it ever make it into LibPQ.jl?

lungben · October 16, 2020, 6:43am

Already done and waiting for merge:

https://github.com/invenia/LibPQ.jl/pull/186

Because of the dependency on CSV.jl, this function should only be included in the LibPQ.jl documentation and tests, not in the library itself.

mmiller · October 19, 2020, 12:18am

Ah nice, that makes sense.
