Best practices for local exploratory data analysis

Hi!

I’m trying to come up with a way of handling a relatively large volume of data for my laptop (~2GB worth of CSVs), and I’m not 100% sure I’m bumping into RAM limitations, Pluto limitations, limitations of available libraries, or simply lacking knowledge about best practices.

I have a bunch (20+) of CSVs files that I’d ideally like to load into 3 tables in a DB-like object that I can later query in all sorts of ways.

I started with DataFrames, but I had the necessity of appending several to one another, and doing so would’ve eaten all my ram as I had to load all data from CSVs first.
An alternative would have been to do everything in a step-wise fashion, but at some point in time I’m loading in memory more data than I can accommodate. If managing that, would then prevent any sort of analysis as those results would have to be sunk to a dataframe as well (?).

I’ve looked into JuliaDB, but trying to install it results in several other packages in my environment being downgraded, and as I’m trying to become proficient in a set of tools/libraries that I can count on, JuliaDBthe doesn’t seem as solid building block now (not in active development).

SQLite worked when it came to loading data into it, but querying results in long-running cells (at least in Pluto) when querying the same data that is otherwise handled without issues when read directly from the CSV.

What else is out there, when it comes to libraries and/or resources?
I saw the poll involving the future development of DataFrames, and I think what I’m looking for is out-of-memory processing: I’d expect everything to be on-disk, except the query results.

1 Like

My best practices - working with extra-large databases ( > TB )

  • creating a pre-aggregated tables ( views ) - with the minimal columns …
  • adding indexes
  • creating sample data / table - and analyze them
  • writing the performance critical part ( or data cleaning part ) in pure SQL.

so try to limit the data size ( records / columns ) as early as you can.

What I like in the SQLite.jl

  • I can write a Julia function - and register with SQLite.register() so I can use in the SQL statements - and push the business logic to the SQL side …

The other interesting OLAP tools you can try - but not an easy tools:

EDIT:

  • you can tune the SQLite queries ; example:
        PRAGMA journal_mode=PERSIST;
        PRAGMA page_size=80000;
        PRAGMA synchronous=OFF;

you an find more info in the web:

Could you show more details about what you are doing and the amount of RAM you have? 2GB certainly isn’t considered as “large” these days (your description sounds like you have 2TB :-).

1 Like

Thanks, I’ll look into these resources - I’m pretty sure I’m overlooking something.

Yep, that’s why I tentatively defined it as ‘‘relatively large’’ :sweat_smile:

I’m working on a 2019 MBP with 8GB di RAM, the latest macOS version with Julia 1.7.

With open browser tabs + Pluto in Chrome, at any given point I have probably around ~3GB of RAM available for the Julia runtime and whatever RAM gets eaten up by the task at hand.

I have a bunch of .csv files (some passing 400MB in size) that I load up in a DataFrame and start querying for getting a sense of it, keeping track of the data I need, and how the files are supposed to be appended/merged together - either using several dataframes objects or progressively loading up an on-disk DB file (SQLite).

The idea would be then to query the DB, but I’m not getting past the first attempt at returning anything from it.

What I do is the following:

  • read a csv file from disk and sink into a dataframe (completes in ~40sec)
  • loading the contents of the dataframe into SQLite
    SQLite.load!(df_name, db_file, table_name); 
  • and then trying to read back the same data I just loaded
    DBInterface.execute(db_file, "SELECT * from table_name") |> DataFrame

also tried ‘SQLite.execute’ but seems to behave differently, throwing this error:

ArgumentError: 'Int32' iterates 'Int32' values, which doesn't satisfy the Tables.jl `AbstractRow` interface

The cell in Pluto runs for 10+mins and RAM consumption goes up until my laptop freezes, and at that point I have to kill the process.

I’m aware I’m requesting everything that’s in that table, maybe the data is being handled different compared to when dealing with a dataframe?

EDIT: it finally worked with DB.Interface when sinking the result of the query into a DB, but took x5 the time it takes for loading the same data from a CSV.

Is there a way to take a sneak peek at the query results without having to sink everything into a dataframe, thus eating up RAM?

just use SQL : LIMIT \ WHERE:

  • DBInterface.execute(db, "SELECT * FROM Invoice LIMIT 2000") |> DataFrame

And after : list only the “important” column names … ( less column → less memory )

  • DBInterface.execute(db, "SELECT InvoiceId, CustomerId, BillingCity, BillingState, Total FROM Invoice LIMIT 4000") |> DataFrame

usually - I am using a simple tool - to check table/column names :

You can select random records :

I ran into this problem myself because I have 11Gb of CSV files to process.

All of the solutions I found needed the data in RAM.

So I spent a couple of weeks coding up an on-disk B-Tree I can use to index into the CSV files.

I appreciate that is not exactly well documented for someone else to get to grips with.

As a bonus, this is the code that I use it in

That’s the best I can do :slight_smile:

4 Likes

Honestly 2GB worth of CSVs should not be a problem to handle on an 8GB laptop with bog standard CSV.jl + DataFrames.jl.

1 Like

IMHO: It depends on the data and the dataframe customization

  • sometimes you can expect: 4x memory ( with the default settings Int64 )
  • and if you keep 2 copy of the data ( in the memory - because transforming )
    you need more …

so in linux: it is better to add an extra >20GB memory swapfile

A simple: 10col + ( header and 10000 lines ) example

  • 200061 csv bytes on the disk
  • 801136 julia dataframe bytes on the memory

A simple 1 digit number → Int64 ( 8 byte ) in the memory - with the default settings.

$ ls -l  jcoltest.csv 
-rw-r--r-- 1 juliauser fuse 200061 Jan 14 18:35 jcoltest.csv

$ cat  jcoltest.csv | wc -l
10001

$ head jcoltest.csv 
jcol1,jcol2,jcol3,jcol4,jcol5,jcol6,jcol7,jcol8,jcol9,jcol10
0,1,0,1,0,1,0,1,0,1
0,1,0,1,0,1,0,1,0,1
0,1,0,1,0,1,0,1,0,1
0,1,0,1,0,1,0,1,0,1
0,1,0,1,0,1,0,1,0,1
0,1,0,1,0,1,0,1,0,1
0,1,0,1,0,1,0,1,0,1
0,1,0,1,0,1,0,1,0,1
0,1,0,1,0,1,0,1,0,1
julia> using CSV
julia> using DataFrames
julia> df = DataFrame(CSV.File("jcoltest.csv"))
10000×10 DataFrame
│ Row   │ jcol1 │ jcol2 │ jcol3 │ jcol4 │ jcol5 │ jcol6 │ jcol7 │ jcol8 │ jcol9 │ jcol10 │
│       │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64  │
├───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼────────┤
│ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1      │
│ 2     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1      │
│ 3     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1      │
│ 4     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1      │
│ 5     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1      │
...
│ 9995  │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1      │
│ 9996  │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1      │
│ 9997  │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1      │
│ 9998  │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1      │
│ 9999  │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1      │
│ 10000 │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1     │ 0     │ 1      │

julia> Base.summarysize(df)
801136

and with proper type settings ( Int8) it is less …

julia> df2 = DataFrame(CSV.File("jcoltest.csv",types=[Int8, Int8, Int8, Int8, Int8, Int8, Int8, Int8, Int8, Int8], strict=true) )

julia> Base.summarysize(df2)
101136

Thanks to everyone for the input!

I managed to load everything into SQLite, reading each CSV and writing to the DB file.

On-disk footprint went from 2GB to ~5GB, but now I should be able to work directly through SQL.

1 Like

If all the types are the same

julia> df2 = DataFrame(CSV.File("jcoltest.csv",types=Int8, strict=true) )
2 Likes

Prompted by this post, I decided to write an extended example of using my Index1024 code.

https://lawless-m.github.io/Index1024.jl/dev/example1/

The example was code I was writing anyway. It is always instructive to write such things and I can recommend it to anyone.

I got some more tests out of it and found some code was broken.

I also found out that with 1_265_038 keys and (data=zero(UInt64), aux=zero(UInt64)) my Disk Based B-Tree is only 20x slower than using a Julia Dictionary for 1 random lookup, which I think is pretty good - considering it also only needs 9.31 KiB, allocs estimate: 247 to do the lookup.

1 Like