[ANN] DataFrameDBs.jl

I am going to fork the repo and see if I can learn Julia while contributing to your library.

I think this database system could add tremendous value if it could pre-cache certain files in memory using a predictive classifier, such as gradient boosting.

I don't know if anything like this has ever existed: leveraging machine learning to lower memory/CPU costs on servers.


I did my first-ever pull request on your repo, so I am very excited :slight_smile:. Please let me know how I can contribute; I am not as smart as you guys, but I definitely think I can mimic your idiomatic Julia code.

Welcome, @randiaz95!
I don't know Julia very well; this is only my second attempt to write something in it, so I don't think my code is the gold standard for Julia code :slight_smile:
If you have any questions, please ask.

If you have experience with this, then that's good. I have some experience developing databases and high-load projects in C++, but I have practically never worked with data science algorithms.

I still got an error:

ERROR: MethodError: no method matching Channel(::getfield(DataFrameDBs, Symbol("##23#24")){Float64,Int64,Base.TTY}; spawn=true)
Closest candidates are:
  Channel(::Function; ctype, csize, taskref) at channels.jl:100 got unsupported keyword argument "spawn"
  Channel(::Any) at channels.jl:50 got unsupported keyword argument "spawn"
Stacktrace:
 [1] kwerr(::NamedTuple{(:spawn,),Tuple{Bool}}, ::Type, ::Function) at ./error.jl:125
 [2] (::getfield(Core, Symbol("#kw#Type")))(::NamedTuple{(:spawn,),Tuple{Bool}}, ::Type{Channel}, ::Function) at ./none:0
 [3] #write_progress_channel#22(::Bool, ::Float64, ::typeof(DataFrameDBs.write_progress_channel), ::Int64, ::Base.TTY) at /home/yifanliu/.julia/packages/DataFrameDBs/LxszG/src/tables/progress.jl:61
 [4] write_progress_channel(::Int64, ::Base.TTY) at /home/yifanliu/.julia/packages/DataFrameDBs/LxszG/src/tables/progress.jl:61 (repeats 2 times)
 [5] #insert#39(::Bool, ::typeof(insert), ::DFTable, ::CSV.Rows{false,Parsers.Options{false,true,false,Missing,UInt8,Nothing}}) at /home/yifanliu/.julia/packages/DataFrameDBs/LxszG/src/io/columns.jl:135
 [6] #insert at ./none:0 [inlined]
 [7] #create_table#17(::CSV.Rows{false,Parsers.Options{false,true,false,Missing,UInt8,Nothing}}, ::Int64, ::Bool, ::typeof(create_table), ::String) at /home/yifanliu/.julia/packages/DataFrameDBs/LxszG/src/tables/creators.jl:87
 [8] (::getfield(DataFrameDBs, Symbol("#kw##create_table")))(::NamedTuple{(:from, :show_progress),Tuple{CSV.Rows{false,Parsers.Options{false,true,false,Missing,UInt8,Nothing}},Bool}}, ::typeof(create_table), ::String) at ./none:0
 [9] top-level scope at REPL[3]:1

Fixed in master.

It worked!

I tested your package with a 33.5 GB CSV data set on a laptop with 16 GB of RAM. I got the results below:

Time: 1:03:35.9681 written: 225.83 MRows (59.18 KRows/sec), uncompressed size: 47.57 GB, compressed size: 9.0 GB, compression ratio: 5.29

The uncompressed size seems to be much larger than the original data. I would like to know if there is any limit on the size of data I can work with using your package.

I can't speak specifically to your file or this package, but it is common for a DataFrame to be bigger than the CSV file; there is simply more information stored with the DataFrame. There are ways to address it depending on the data, such as using CategoricalArrays for columns with few unique values.
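For illustration, a minimal sketch of that approach on a plain in-memory DataFrame (the column and its values here are made up):

using DataFrames, CategoricalArrays

df = DataFrame(state = ["NY", "CA", "NY", "TX", "NY"])
df.state = categorical(df.state)   # integer codes plus a small pool of unique values
levels(df.state)                   # ["CA", "NY", "TX"]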


Great! Have you tried running queries on this dataset?

About the uncompressed size: this is the size the files would occupy without compression. The main overhead comes from the lengths of the strings. Since a block of strings is stored as one continuous array of bytes, the length of each string must be written alongside the block. If the strings are short, the lengths add considerable overhead. But with compression this is not very important.
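As a rough back-of-the-envelope illustration (the 4-byte length prefix is an assumption for the sake of the example, not the package's documented format):

n = 1_000_000   # strings in a block
avg_len = 4     # average string payload, in bytes
len_prefix = 4  # assumed bytes spent storing each string's length
payload  = n * avg_len          # 4_000_000 bytes of characters
overhead = n * len_prefix       # another 4_000_000 bytes just for the lengths
(payload + overhead) / payload  # 2.0: the uncompressed size doubles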

Theoretically, there are no restrictions on the data size, at least until you try to materialize a query result that doesn't fit in RAM. When I implement aggregation, its restrictions will be about the same: the aggregation result should fit in RAM.
A similar architecture, which I implemented in C++ as an internal database for the company I work for, now contains 320,000,000,000 rows in 100 tables, takes up 2 TB of disk space, and runs on a server with 96 GB of RAM. At the same time, it rarely consumes more than 20 GB, and only on large aggregations and joins.

I will try the query functions later this week. Thank you for this amazing package!

Thank you! And please tell me about any problems you run into.


Have you benchmarked the performance of this package against JuliaDB, by chance? I ask because I'm a fairly regular user of JuliaDB, as I work a lot with U.S. Census Bureau data sets that usually come in very large .csv files (millions of rows × hundreds of columns). My current workflow is to save them as IndexedTables and then do the querying/filtering/aggregating with JuliaDBMeta. It's worked really well for me so far, but I'd love to see how this compares.

I haven't worked with JuliaDB, so I can only reason theoretically. As I understand it, an IndexedTable is an in-memory table that is fully loaded into memory. If you have enough memory, it is probably faster than DataFrameDBs. On the other hand, with DataFrameDBs you can load only the columns that you need, or perform filtering and calculations without loading entire columns into memory. Given the compression, this can greatly reduce data loading time.
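For example, a sketch using the access pattern that appears later in this thread (assuming test is a DFTable and best_bid one of its columns):

col = test.best_bid     # lazy column reference, nothing is read yet
materialize(col[1:10])  # should only read the data needed for rows 1:10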


Some questions:

  1. The package name DataFrameDBs looks weird; maybe something more informative?

  2. What does the reuse_row=true keyword do for CSV.Rows?

  3. It would be better if show_progress=true could tell how much time is needed instead of how much time has passed.

  4. When there is a missing value in a column, how do I convert it from String to Int64?

  5. When I run the code

c_best_bid = parse.(Int64, test.best_bid)
materialize(c_best_bid[1:10])

I got the error message (test.best_bid has no missing values):

ERROR: ArgumentError: invalid base 10 digit '.' in "15.6"
Stacktrace:
 [1] parse at ./parse.jl:240 [inlined]
 [2] _broadcast_getindex_evalf at ./broadcast.jl:625 [inlined]
 [3] _broadcast_getindex at ./broadcast.jl:608 [inlined]
 [4] getindex at ./broadcast.jl:558 [inlined]
 [5] macro expansion at ./broadcast.jl:888 [inlined]
 [6] macro expansion at ./simdloop.jl:77 [inlined]
 [7] copyto! at ./broadcast.jl:887 [inlined]
 [8] copyto! at ./broadcast.jl:842 [inlined]
 [9] materialize! at ./broadcast.jl:801 [inlined]
 [10] eval_on_range(::NamedTuple{(:best_bid_raw,),Tuple{DataFrameDBs.FlatStringsVectors.FlatStringsVector{Union{Missing, String}}}}, ::DataFrameDBs.BroadcastExecutor{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(parse),Tuple{Base.RefValue{Type{Int64}},Array{Union{Missing, String},1}}},NamedTuple{(:best_bid_raw,),Tuple{Array{Union{Missing, String},1}}},Array{Int64,1}}, ::Base.LogicalIndex{Int64,Array{Bool,1}}) at /home/yifanliu/.julia/packages/DataFrameDBs/A2bCW/src/tables/broadcast.jl:130
 [11] _proj_elem_eval_on_range at /home/yifanliu/.julia/packages/DataFrameDBs/A2bCW/src/tables/projection.jl:128 [inlined]
 [12] _proj_eval_on_range at /home/yifanliu/.julia/packages/DataFrameDBs/A2bCW/src/tables/projection.jl:136 [inlined]
 [13] eval_on_range at /home/yifanliu/.julia/packages/DataFrameDBs/A2bCW/src/tables/projection.jl:152 [inlined]
 [14] iterate(::DataFrameDBs.BlocksIterator{DataFrameDBs.DataReader,NamedTuple{(:best_bid_raw,),Tuple{DataFrameDBs.BlockStream}},NamedTuple{(:best_bid_raw,),Tuple{DataFrameDBs.FlatStringsVectors.FlatStringsVector{Union{Missing, String}}}},DataFrameDBs.ProjectionExecutor{NamedTuple{(:a,),Tuple{DataFrameDBs.BroadcastExecutor{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(parse),Tuple{Base.RefValue{Type{Int64}},Array{Union{Missing, String},1}}},NamedTuple{(:best_bid_raw,),Tuple{Array{Union{Missing, String},1}}},Array{Int64,1}}}}},DataFrameDBs.SelectionExecutor{Tuple{DataFrameDBs.RangeToProcess{UnitRange{Int64}}}},Tuple{},Tuple{Symbol}}, ::Nothing) at /home/yifanliu/.julia/packages/DataFrameDBs/A2bCW/src/io/blocksiterator.jl:117
 [15] iterate(::DataFrameDBs.BlocksIterator{DataFrameDBs.DataReader,NamedTuple{(:best_bid_raw,),Tuple{DataFrameDBs.BlockStream}},NamedTuple{(:best_bid_raw,),Tuple{DataFrameDBs.FlatStringsVectors.FlatStringsVector{Union{Missing, String}}}},DataFrameDBs.ProjectionExecutor{NamedTuple{(:a,),Tuple{DataFrameDBs.BroadcastExecutor{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(parse),Tuple{Base.RefValue{Type{Int64}},Array{Union{Missing, String},1}}},NamedTuple{(:best_bid_raw,),Tuple{Array{Union{Missing, String},1}}},Array{Int64,1}}}}},DataFrameDBs.SelectionExecutor{Tuple{DataFrameDBs.RangeToProcess{UnitRange{Int64}}}},Tuple{},Tuple{Symbol}}) at /home/yifanliu/.julia/packages/DataFrameDBs/A2bCW/src/io/blocksiterator.jl:99
 [16] materialize(::DFColumn{Int64}) at /home/yifanliu/.julia/packages/DataFrameDBs/A2bCW/src/tables/materialization.jl:48
 [17] top-level scope at REPL[31]:1
  1. It is not registered yet, so I am ready to consider any suggestions for the name.

  2. It's my mistake :frowning: I meant reusebuffer, but I made a mistake when I tested the import myself, and CSV didn't warn that this parameter means nothing to it. Of course it must be reusebuffer = true.

  3. The problem is that, in general, I don't know the total amount of data in the import source, so I can't estimate the remaining time.

  4. You can broadcast a conversion function over the column (a lazy column variant is sketched after this list):

conv_miss_int = (v)->ismissing(v) ? missing : parse(Int64, v)
new_column = conv_miss_int.(string_column)

or, if you want to replace missing with a default value:

conv_miss_int = (v)->ismissing(v) ? 0 #=default value=# : parse(Int64, v)
new_column = conv_miss_int.(string_column)

  5. It looks like there is a "15.6" entry in the column that can't be parsed as Int64. Perhaps you should use Float64?
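The same broadcast also works lazily on a DFColumn, as with parse.(Int64, test.best_bid) earlier in the thread; given the "15.6" entries, Float64 is presumably the safer target type:

conv_miss_float = (v)->ismissing(v) ? missing : parse(Float64, v)
c_best_bid = conv_miss_float.(test.best_bid)  # lazy broadcast over the column
materialize(c_best_bid[1:10])                 # evaluates only the first ten rows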

Thanks. More questions:

  1. Converting a column's data type seems not easy: a new column needs to be created and then inserted back. Is it possible to have something like parse!(Int64, v)?

  2. When I run view = test[:secid .== "6505", :], I got the error message:

ERROR: BoundsError: attempt to access View of table test/
Projection: secid=>col(secid)::Union{Missing, String}; date=>col(date)::Union{Missing, String}; symbol=>col(symbol)::Union{Missing, String}; symbol_flag=>col(symbol_flag)::Union{Missing, String}; exdate=>col(exdate)::Union{Missing, String}; last_date=>col(last_date)::Union{Missing, String}; cp_flag=>col(cp_flag)::Union{Missing, String}; strike_price=>col(strike_price)::Union{Missing, String}; best_bid=>col(best_bid)::Union{Missing, String}; best_offer=>col(best_offer)::Union{Missing, String}; volume=>col(volume)::Union{Missing, String}; open_interest=>col(open_interest)::Union{Missing, String}; impl_volatility=>col(impl_volatility)::Union{Missing, String}; delta=>col(delta)::Union{Missing, String}; gamma=>col(gamma)::Union{Missing, String}; vega=>col(vega)::Union{Missing, String}; theta=>col(theta)::Union{Missing, String}; optionid=>col(optionid)::Union{Missing, String}; cfadj=>col(cfadj)::Union{Missing, String}; am_settlement=>col(am_settlement)::Union{Missing, String}; contract_size=>col(contract_size)::Union{Missing, String}; ss_flag=>col(ss_flag)::Union{Missing, String}; forward_price=>col(forward_price)::Union{Missing, String}; expiry_indicator=>col(expiry_indicator)::Union{Missing, String}; root=>col(root)::Union{Missing, String}; suffix=>col(suffix)::Union{Missing, String}
Selection: 

  at index [false]

My table test looks like this:

27×6 DataFrames.DataFrame
│ Row │ column           │ type                   │ rows         │ uncompressed size │ compressed size │ compression ratio │
│     │ Symbol           │ String                 │ String       │ String            │ String          │ Float64           │
├─────┼──────────────────┼────────────────────────┼──────────────┼───────────────────┼─────────────────┼───────────────────┤
│ 1   │ secid            │ Union{Missing, String} │ 225.83 MRows │ 2.1 GB            │ 9.28 MB         │ 232.06            │
│ 2   │ date             │ Union{Missing, String} │ 225.83 MRows │ 2.94 GB           │ 24.94 MB        │ 120.9             │
│ 3   │ symbol           │ Union{Missing, String} │ 225.83 MRows │ 4.34 GB           │ 297.23 MB       │ 14.96             │
│ 4   │ symbol_flag      │ Union{Missing, String} │ 225.83 MRows │ 1.05 GB           │ 4.37 MB         │ 246.19            │
│ 5   │ exdate           │ Union{Missing, String} │ 225.83 MRows │ 2.94 GB           │ 36.07 MB        │ 83.58             │
⋮
│ 22  │ ss_flag          │ Union{Missing, String} │ 225.83 MRows │ 1.05 GB           │ 4.75 MB         │ 226.62            │
│ 23  │ forward_price    │ Union{Missing, String} │ 225.83 MRows │ 2.74 GB           │ 103.63 MB       │ 27.09             │
│ 24  │ expiry_indicator │ Union{Missing, String} │ 225.83 MRows │ 916.7 MB          │ 8.02 MB         │ 114.35            │
│ 25  │ root             │ Union{Missing, String} │ 225.83 MRows │ 861.49 MB         │ 3.51 MB         │ 245.69            │
│ 26  │ suffix           │ Union{Missing, String} │ 225.83 MRows │ 861.49 MB         │ 3.51 MB         │ 245.69            │
│ 27  │ Table total      │                        │ 225.83 MRows │ 47.57 GB          │ 9.0 GB          │ 5.29              │

  3. It would be great to see some statistics-by-group examples.

  4. Is it possible to run rolling statistics, like a rolling mean or a rolling standard deviation?

  5. Is it possible to run regressions with the columns?

  6. Is the speed of importing data totally dependent on the CSV package, or could there be further optimization in your package?

  1. I'm thinking about how to do this. The conversion functions themselves are not part of the package, but a workaround over a re-insert is possible. Something like map!(function, column).

  2. Use view = test[test.secid .== "6505", :] or view = test[:secid => (v) -> v == "6505", :]. It's similar to DataFrames and other array-like structures. :secid .== "6505" is a broadcast over a Symbol and a String, so test[:secid .== "6505", :] is the equivalent of test[false, :] (see the snippet after this list).

  3 - 5. Yes, it is possible, and this is the next big block of work that I plan to start as soon as I have a little time. I think that integrating OnlineStats.jl will allow all of this.

  6. The CSV package plays a crucial role, but I will try to speed up the import as much as possible.
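To see the pitfall from item 2 in isolation, plain Julia at the REPL gives the same result:

julia> :secid .== "6505"   # broadcasting two scalars yields a single Bool
false

By contrast, test.secid .== "6505" broadcasts over the column's values, producing the element-wise condition the filter needs.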


In addition to item 2: if a column has type Union{Missing, String}, it is better to filter it as test[:secid => (v) -> Bool(v == "6505"), :], because DataFrameDBs requires the filtering function to have return type Bool, while the return type of v == "6505" when v is Union{Missing, String} is Union{Missing, Bool}. If the column actually contains missings, use test[:secid => (v) -> Bool(!ismissing(v) && v == "6505"), :].
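A minimal sketch putting that advice together (test being the same DFTable as above; note that !ismissing(v) && v == "6505" already short-circuits to false on missing, so the expression evaluates to a plain Bool):

# Predicate that returns Bool even for Union{Missing, String} input:
pred = v -> !ismissing(v) && v == "6505"   # missing short-circuits to false
view = test[:secid => pred, :]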

After using it for a little while, do you think this package is useful? Does it make sense to develop it further?


It would be great to have joins, and also the ability to reshape (stack/melt/pivot) larger-than-memory datasets.

I have now almost completely switched to developing Dash, so the development of DataFrameDBs has slowed down a lot. But I'll get back to it as soon as I can. Joins are an interesting task, but not an easy one. Probably it will first be implemented in memory (i.e. the result of the join will have to fit entirely in memory); then I can try to implement the join using temporary files.
