[ANN] DataFrameDBs.jl

Thanks, @Yifan_Liu!
I had an"optimization day" yesterday, and here is the result

>clickhouse-client --max_threads=1
SELECT count()
FROM ecomm
WHERE (price > 100) AND (price < 200)

┌──count()─┐
│ 24898078 │
└──────────┘

1 rows in set. Elapsed: 0.699 sec. Processed 109.95 million rows, 879.61 MB (157.32 million rows/s., 1.26 GB/s.)

julia> @time size(t[:price => (p)->p > 100 && p < 200, :])
Time: 0:00:0.7438 readed: 109.95 MRows (147.82 MRows/sec)
  1.013283 seconds (519.70 k allocations: 29.687 MiB, 0.75% gc time)
(24898078, 9)

SELECT sum((category_id + user_id) / category_id)
FROM ecomm
WHERE (price > 100) AND (price < 200)

┌─sum(divide(plus(category_id, user_id), category_id))─┐
│                                    24898078.00147663 │
└──────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 2.555 sec. Processed 109.95 million rows, 398.37 MB (43.03 million rows/s., 155.90 MB/s.)

julia> view = t[:price => (p)->p > 100 && p < 200, :]
julia> col = @. (view.category_id + view.user_id)/view.category_id
julia> @time sum(col)
Time: 0:00:3.2785 readed: 109.95 MRows (33.54 MRows/sec)
  3.425695 seconds (50.06 M allocations: 1.505 GiB, 3.49% gc time)
2.489807800147663e7

Julia is an amazing language :slight_smile:

There is still a performance problem with string columns due to the materialization of strings (unsafe_string). I'm still thinking about what to do with it.
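A tiny illustration of the cost (not DataFrameDBs code, just plain Julia): every unsafe_string call heap-allocates a fresh String, which adds up over millions of rows.

s = "hello"
GC.@preserve s begin
    p = pointer(s)                   # Ptr{UInt8} into the string's bytes
    @allocated unsafe_string(p, 5)   # > 0 bytes: a brand-new String is allocated
end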

I think I should implement aggregation (probably by integrating OnlineStats.jl) before registering it. Without aggregations, IMHO, it's not very useful for real tasks.

1 Like

I will try to explain it in my poor English; if something is not clear, please ask.

The block structure is a tradeoff between memory consumption and performance.

Of course, an in-memory database has the highest performance, but for 2 billion rows you might simply not have enough memory to materialize even one column.

On the other hand, you could theoretically read one value from the file at a time and perform the required calculations on it. In this case memory consumption is minimal, but performance is low, due to poor cache locality, the inability to use SIMD, etc. In addition, file compression is not possible in this case.

A block structure is something in the middle. You only need enough memory to process a single block; i.e., even for 10 billion rows in the database, you need enough memory to store only 65,536 rows (in the case of DataFrameDBs) to perform calculations and iterations.
Since each block is large enough, good cache locality and SIMD optimizations are possible.
Blocks also compress well when written to a file. Each block is compressed separately before being written, and when reading, a single compressed block is loaded from the file and unpacked into memory. Compression is very important not only for saving disk space, but also for performance: IO operations are slow, and the fewer bytes you have to read from disk, the higher the performance. The cost of unpacking with LZ4 is fully recouped by the reduced disk read time.
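To make the idea concrete, here is a minimal sketch of block-wise processing in plain Julia (illustrative only, not DataFrameDBs internals; the function and buffer names are mine):

const BLOCK_SIZE = 2^16   # default block size in DataFrameDBs (rows)

# Count values in (lo, hi) by streaming fixed-size blocks from an IO source,
# so at most one block of data is resident in memory at any time.
function count_in_range(io::IO, nrows::Integer, lo, hi)
    buf = Vector{Float64}(undef, BLOCK_SIZE)
    total = 0
    remaining = nrows
    while remaining > 0
        n = min(remaining, BLOCK_SIZE)
        read!(io, resize!(buf, n))       # refill the single reusable buffer
        c = 0
        @inbounds @simd for i in 1:n     # tight, SIMD-friendly loop over one block
            c += (buf[i] > lo) & (buf[i] < hi)
        end
        total += c
        remaining -= n
    end
    return total
end

# Demo: write a million Float64s to an in-memory "file", then scan it block-wise.
io = IOBuffer()
data = rand(1_000_000) .* 300
write(io, data)
seekstart(io)
println(count_in_range(io, length(data), 100, 200))   # ≈ one third of the rows

A real implementation would also decompress each block right after reading it; the memory bound stays the same.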

5 Likes

So theoretically, depending on hardware and data size, 2^16 may or may not be the “optimal” chunk size. But it is definitely better than putting everything in one chunk or one row per chunk.

1 Like

Yes. The optimal chunk size depends on the hardware and the queries you plan to run. In DataFrameDBs you can specify block_size at table creation; 2^16 is the default. I took it from ClickHouse's defaults, and maybe it's too small: ClickHouse runs each request in several threads and can execute several requests at the same time, so the memory limit per thread is small. DataFrameDBs runs locally in one thread, so the memory limit per block can be bigger.
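For example (a hedged sketch: the block_size keyword is the one mentioned above, and the rest of the create_table call mirrors the CSV import example later in this thread; the exact signature is assumed):

using DataFrameDBs, CSV

# Hypothetical: pick a block size larger than the 2^16 default for a
# single-threaded local workload.
t = create_table("ecomm", from = CSV.Rows("ecomm.csv", reuse_row=true),
                 block_size = 2^18, show_progress = true)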

This looks very cool! Please reach out if you need any help hooking up to OnlineStats!

1 Like

Does this package try to be something like JuliaDB?

To be honest, I don’t really understand the JuliaDB philosophy :frowning:

Rather, it's something like ClickHouse without the network interface, in pure Julia. I.e., a persistent relational DB with optimizations for running queries directly on the disk storage.

4 Likes

I tried to import a large data set using the code:

using DataFrameDBs, CSV

t = create_table("trial", from = CSV.Rows("opprcd2018.csv", reuse_row=true), show_progress=true)

but get the error message:

ERROR: MethodError: no method matching findfirst(::Char, ::SubString{String})
Closest candidates are:
  findfirst(::Function, ::Union{AbstractString, AbstractArray}) at array.jl:1742
  findfirst(::Function, ::Any) at array.jl:1735
  findfirst(::AbstractString, ::AbstractString) at strings/search.jl:104

Can you show me the full error trace?

findfirst(ch::AbstractChar, string::AbstractString)

  Find the first occurrence of character ch in string.

  │ Julia 1.3
  │
  │  This method requires at least Julia 1.3.

Do you have version 1.2?
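For what it's worth, the predicate form of findfirst has worked since Julia 1.0, so a version-portable workaround is:

s = SubString("hello,world", 1, 11)

findfirst(isequal(','), s)   # works on Julia 1.0+, returns 6
# findfirst(',', s)          # this convenience method requires Julia 1.3+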

(Just a remark)
If you do get a really nice API, which seems like where this is going, it might be great to decouple the interface from DataFrames.jl specifically and be able to serve all Tables.jl-compatible containers.

What do you have in mind? Currently, any container that supports Tables.schema and Tables.rows can be imported into DataFrameDBs. DataFrame is used as the default materialization of views. Making DataFrameDBs itself support the Tables interface is one of my TODOs.
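For example, a plain named tuple of vectors is already a valid Tables.jl source, so something like this should work (a hedged sketch; the create_table call mirrors the CSV example above):

using DataFrameDBs, Tables

# A named tuple of vectors is a valid Tables.jl "column table":
src = (id = collect(1:1_000), price = rand(1_000) .* 300)
Tables.schema(Tables.rows(src))   # names (:id, :price), types (Int64, Float64)

t = create_table("demo", from = src, show_progress = false)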

1 Like

Awesome!

Just a little nitpick: the past tense of “to read” is “read”, not “readed”. Otherwise, great work!

1 Like

There are many such special cases in English.

If studying Julia also advances my English, that will be an excellent result. :slight_smile:

4 Likes

Yes, I am using Julia 1.2.

If it's critical, then I can try to make it compatible with 1.2.

That would be great!

OK. I originally wrote it on 1.3 and didn't track which features are supported in 1.2. findfirst is a small thing, and if there are no other critical features that require 1.3, then I will soon add support for 1.2.

Now master should work with 1.2.
If you test it with your data sets and let me know about any problems, that would be very cool, and it would greatly help further development. I apologize in advance for any errors that may occur; the package is quite complex, and there may be many undiscovered problems.