Best practices for local exploratory data analysis

danvinci · January 13, 2022, 2:45am

Hi!

I’m trying to come up with a way of handling a relatively large volume of data for my laptop (~2GB worth of CSVs), and I’m not 100% sure I’m bumping into RAM limitations, Pluto limitations, limitations of available libraries, or simply lacking knowledge about best practices.

I have a bunch (20+) of CSVs files that I’d ideally like to load into 3 tables in a DB-like object that I can later query in all sorts of ways.

I started with DataFrames, but I had the necessity of appending several to one another, and doing so would’ve eaten all my ram as I had to load all data from CSVs first.
An alternative would have been to do everything in a step-wise fashion, but at some point in time I’m loading in memory more data than I can accommodate. If managing that, would then prevent any sort of analysis as those results would have to be sunk to a dataframe as well (?).

I’ve looked into JuliaDB, but trying to install it results in several other packages in my environment being downgraded, and as I’m trying to become proficient in a set of tools/libraries that I can count on, JuliaDBthe doesn’t seem as solid building block now (not in active development).

SQLite worked when it came to loading data into it, but querying results in long-running cells (at least in Pluto) when querying the same data that is otherwise handled without issues when read directly from the CSV.

What else is out there, when it comes to libraries and/or resources?
I saw the poll involving the future development of DataFrames, and I think what I’m looking for is out-of-memory processing: I’d expect everything to be on-disk, except the query results.

ImreSamu · January 13, 2022, 4:04am

My best practices - working with extra-large databases ( > TB )

creating a pre-aggregated tables ( views ) - with the minimal columns …
adding indexes
creating sample data / table - and analyze them
writing the performance critical part ( or data cleaning part ) in pure SQL.

so try to limit the data size ( records / columns ) as early as you can.

What I like in the SQLite.jl

I can write a Julia function - and register with SQLite.register() so I can use in the SQL statements - and push the business logic to the SQL side …

The other interesting OLAP tools you can try - but not an easy tools:

Clickhouse + ClickHouse.jl
https://duckdb.org/ ( probably a julia interface exists )

EDIT:

you can tune the SQLite queries ; example:

        PRAGMA journal_mode=PERSIST;
        PRAGMA page_size=80000;
        PRAGMA synchronous=OFF;

you an find more info in the web:

nalimilan · January 13, 2022, 8:08am

Could you show more details about what you are doing and the amount of RAM you have? 2GB certainly isn’t considered as “large” these days (your description sounds like you have 2TB :-).

danvinci · January 13, 2022, 10:28pm

Thanks, I’ll look into these resources - I’m pretty sure I’m overlooking something.

danvinci · January 13, 2022, 10:46pm

Yep, that’s why I tentatively defined it as ‘‘relatively large’’

I’m working on a 2019 MBP with 8GB di RAM, the latest macOS version with Julia 1.7.

With open browser tabs + Pluto in Chrome, at any given point I have probably around ~3GB of RAM available for the Julia runtime and whatever RAM gets eaten up by the task at hand.

I have a bunch of .csv files (some passing 400MB in size) that I load up in a DataFrame and start querying for getting a sense of it, keeping track of the data I need, and how the files are supposed to be appended/merged together - either using several dataframes objects or progressively loading up an on-disk DB file (SQLite).

The idea would be then to query the DB, but I’m not getting past the first attempt at returning anything from it.

What I do is the following:

read a csv file from disk and sink into a dataframe (completes in ~40sec)
loading the contents of the dataframe into SQLite

    SQLite.load!(df_name, db_file, table_name);

and then trying to read back the same data I just loaded

    DBInterface.execute(db_file, "SELECT * from table_name") |> DataFrame

also tried ‘SQLite.execute’ but seems to behave differently, throwing this error:

ArgumentError: 'Int32' iterates 'Int32' values, which doesn't satisfy the Tables.jl `AbstractRow` interface

The cell in Pluto runs for 10+mins and RAM consumption goes up until my laptop freezes, and at that point I have to kill the process.

I’m aware I’m requesting everything that’s in that table, maybe the data is being handled different compared to when dealing with a dataframe?

EDIT: it finally worked with DB.Interface when sinking the result of the query into a DB, but took x5 the time it takes for loading the same data from a CSV.

Is there a way to take a sneak peek at the query results without having to sink everything into a dataframe, thus eating up RAM?

ImreSamu · January 14, 2022, 5:25am

just use SQL : LIMIT \ WHERE:

DBInterface.execute(db, "SELECT * FROM Invoice LIMIT 2000") |> DataFrame

And after : list only the “important” column names … ( less column → less memory )

DBInterface.execute(db, "SELECT InvoiceId, CustomerId, BillingCity, BillingState, Total FROM Invoice LIMIT 4000") |> DataFrame

usually - I am using a simple tool - to check table/column names :

https://sqlitebrowser.org/

You can select random records :

Select random row from a sqlite table - Stack Overflow

lawless-m · January 14, 2022, 8:15am

I ran into this problem myself because I have 11Gb of CSV files to process.

All of the solutions I found needed the data in RAM.

So I spent a couple of weeks coding up an on-disk B-Tree I can use to index into the CSV files.

https://github.com/lawless-m/Index1024.jl

I appreciate that is not exactly well documented for someone else to get to grips with.

As a bonus, this is the code that I use it in

https://github.com/lawless-m/ONSUD.jl

That’s the best I can do

nilshg · January 14, 2022, 8:56am

Honestly 2GB worth of CSVs should not be a problem to handle on an 8GB laptop with bog standard CSV.jl + DataFrames.jl.

ImreSamu · January 14, 2022, 6:09pm

IMHO: It depends on the data and the dataframe customization

sometimes you can expect: 4x memory ( with the default settings Int64 )
and if you keep 2 copy of the data ( in the memory - because transforming )
you need more …

so in linux: it is better to add an extra >20GB memory swapfile

A simple: 10col + ( header and 10000 lines ) example

200061 csv bytes on the disk
801136 julia dataframe bytes on the memory

A simple 1 digit number → Int64 ( 8 byte ) in the memory - with the default settings.

$ ls -l  jcoltest.csv 
-rw-r--r-- 1 juliauser fuse 200061 Jan 14 18:35 jcoltest.csv

$ cat  jcoltest.csv | wc -l
10001

$ head jcoltest.csv 
jcol1,jcol2,jcol3,jcol4,jcol5,jcol6,jcol7,jcol8,jcol9,jcol10
0,1,0,1,0,1,0,1,0,1
0,1,0,1,0,1,0,1,0,1
0,1,0,1,0,1,0,1,0,1
0,1,0,1,0,1,0,1,0,1
0,1,0,1,0,1,0,1,0,1
0,1,0,1,0,1,0,1,0,1
0,1,0,1,0,1,0,1,0,1
0,1,0,1,0,1,0,1,0,1
0,1,0,1,0,1,0,1,0,1

julia> using CSV
julia> using DataFrames
julia> df = DataFrame(CSV.File("jcoltest.csv"))
10000×10 DataFrame
│ Row   │ jcol1 │ jcol2 │ jcol3 │ jcol4 │ jcol5 │ jcol6 │ jcol7 │ jcol8 │ jcol9 │ jcol10 │
│       │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64  │
├───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼────────┤
│ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1      │
│ 2     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1      │
│ 3     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1      │
│ 4     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1      │
│ 5     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1      │
...
│ 9995  │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1      │
│ 9996  │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1      │
│ 9997  │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1      │
│ 9998  │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1      │
│ 9999  │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1      │
│ 10000 │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1      │

julia> Base.summarysize(df)
801136

ImreSamu · January 14, 2022, 6:25pm

and with proper type settings ( Int8) it is less …

julia> df2 = DataFrame(CSV.File("jcoltest.csv",types=[Int8, Int8, Int8, Int8, Int8, Int8, Int8, Int8, Int8, Int8], strict=true) )

julia> Base.summarysize(df2)
101136

danvinci · January 14, 2022, 6:28pm

Thanks to everyone for the input!

I managed to load everything into SQLite, reading each CSV and writing to the DB file.

On-disk footprint went from 2GB to ~5GB, but now I should be able to work directly through SQL.

lawless-m · January 17, 2022, 8:16am

If all the types are the same

julia> df2 = DataFrame(CSV.File("jcoltest.csv",types=Int8, strict=true) )

lawless-m · January 18, 2022, 4:08pm

Prompted by this post, I decided to write an extended example of using my Index1024 code.

https://lawless-m.github.io/Index1024.jl/dev/example1/

The example was code I was writing anyway. It is always instructive to write such things and I can recommend it to anyone.

I got some more tests out of it and found some code was broken.

I also found out that with 1_265_038 keys and (data=zero(UInt64), aux=zero(UInt64)) my Disk Based B-Tree is only 20x slower than using a Julia Dictionary for 1 random lookup, which I think is pretty good - considering it also only needs 9.31 KiB, allocs estimate: 247 to do the lookup.

Keith_Myerscough1 · October 15, 2024, 8:32pm

I know this post is a few years old, but I was wondering if it’s still relevant. I am also looking to process some CSVs into some form of on-disk database. The data’s only small, but growing over time and I want to add new data to it incrementally, while also being able to poll the data for analysis purposes. SQLite.jl would probably do the job, but it’s – quite literally – unjulian to have to write SQL queries and execute them. Would JLD2.jl or JDF.jl also be suitable?

Jake · October 15, 2024, 11:40pm

Take a look at FunSQL.jl and SQLCollections.jl.

I have used FunSQL and it works well. I have been meaning to try SQLCollections but have not done that yet. SQLCollections tries to make the code more Julian.

aplavin · October 16, 2024, 11:55am

SQLCollections.jl (discourse thread) is specifically designed for querying SQL databases with the exact same code one can use for regular Julian collections. Supports a large but fundamentally limited subset of SQL functionality.
Useful for code consistency in general, and for actually writing functions that can query both kinds of data interchangeably.

It is based on FunSQL.jl as a backend which is a great package! FunSQL gives you access to 100% of SQL but requires using special syntax – different from regular Julian data manipulation. Still more composable than SQL

For lightweight databases, try SQLite or DuckDB, depending on the workload. The latter is convenient for querying large CSV files without loading into memory, especially so in synergy with SQLCollections. Maybe you’ll be fine with simply querying CSV files as they are, without even loading into a database?

Topic		Replies	Views
How do you store your data before and after processing with Julia? Data	35	7197	March 1, 2021
Struggling with Julia and large datasets General Usage question , big-data	67	11100	October 17, 2024
[ANN] DataFrameDBs.jl Data package , announcement	60	4050	May 2, 2020
JuliaDB Questions/Issues New to Julia package	13	2534	July 3, 2019
What's the latest and greatest in data in Julia Data	29	2131	August 15, 2024

Best practices for local exploratory data analysis

Related topics