[ANN] DataFrameDBs.jl

Thanks, @Yifan_Liu!
I had an"optimization day" yesterday, and here is the result

>clickhouse-client --max_threads=1
SELECT count()
FROM ecomm
WHERE (price > 100) AND (price < 200)

┌──count()─┐
│ 24898078 │
└──────────┘

1 rows in set. Elapsed: 0.699 sec. Processed 109.95 million rows, 879.61 MB (157.32 million rows/s., 1.26 GB/s.)

julia> @time size(t[:price => (p)->p > 100 && p < 200, :])
Time: 0:00:0.7438 readed: 109.95 MRows (147.82 MRows/sec)
  1.013283 seconds (519.70 k allocations: 29.687 MiB, 0.75% gc time)
(24898078, 9)

SELECT sum((category_id + user_id) / category_id)
FROM ecomm
WHERE (price > 100) AND (price < 200)

┌─sum(divide(plus(category_id, user_id), category_id))─┐
│                                    24898078.00147663 │
└──────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 2.555 sec. Processed 109.95 million rows, 398.37 MB (43.03 million rows/s., 155.90 MB/s.)

julia> view = t[:price => (p)->p > 100 && p < 200, :]
julia> col = @. (view.category_id + view.user_id)/view.category_id
julia> @time sum(col)
Time: 0:00:3.2785 readed: 109.95 MRows (33.54 MRows/sec)
  3.425695 seconds (50.06 M allocations: 1.505 GiB, 3.49% gc time)
2.489807800147663e7

Julia is an amazing language :slight_smile:

There is still a performance problem with string columns due to the materialization of strings (unsafe_string). I'm still thinking about what to do with it.
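A tiny illustration of the cost (not DataFrameDBs code, just plain Julia): every unsafe_string call heap-allocates a fresh String, which adds up over millions of rows.

s = "hello"
GC.@preserve s begin
    p = pointer(s)                   # Ptr{UInt8} into the string's bytes
    @allocated unsafe_string(p, 5)   # > 0 bytes: a brand-new String is allocated
end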

I think I should implement aggregation (probably by integrating OnlineStats.jl) before registering it. Without aggregations, IMHO, it's not very useful for real tasks.

1 Like

I will try to explain it in my poor English; if something is not clear, please ask.

The block structure is a tradeoff between memory consumption and performance.

Of course, an in-memory database has the highest performance, but for 2 billion rows you might simply not have enough memory to materialize even one column.

On the other hand, you could theoretically read one value from the file at a time and perform the required calculations on it. In this case memory consumption is minimal, but performance is low, due to poor cache locality, the inability to use SIMD, etc. In addition, file compression is not possible in this case.

A block structure is something in the middle. You only need enough memory to process a single block; i.e., even for 10 billion rows in the database, you need enough memory to store only 65,536 rows (in the case of DataFrameDBs) to perform calculations and iterations.
Since each block is large enough, good cache locality and SIMD optimizations are possible.
Blocks also compress well when written to a file. Each block is compressed separately before being written, and when reading, a single compressed block is loaded from the file and unpacked into memory. Compression is very important not only for saving disk space, but also for performance: IO operations are slow, and the fewer bytes you have to read from disk, the higher the performance. The cost of unpacking with LZ4 is fully recouped by the reduced disk read time.
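To make the idea concrete, here is a minimal sketch of block-wise processing in plain Julia (illustrative only, not DataFrameDBs internals; the function and buffer names are mine):

const BLOCK_SIZE = 2^16   # default block size in DataFrameDBs (rows)

# Count values in (lo, hi) by streaming fixed-size blocks from an IO source,
# so at most one block of data is resident in memory at any time.
function count_in_range(io::IO, nrows::Integer, lo, hi)
    buf = Vector{Float64}(undef, BLOCK_SIZE)
    total = 0
    remaining = nrows
    while remaining > 0
        n = min(remaining, BLOCK_SIZE)
        read!(io, resize!(buf, n))       # refill the single reusable buffer
        c = 0
        @inbounds @simd for i in 1:n     # tight, SIMD-friendly loop over one block
            c += (buf[i] > lo) & (buf[i] < hi)
        end
        total += c
        remaining -= n
    end
    return total
end

# Demo: write a million Float64s to an in-memory "file", then scan it block-wise.
io = IOBuffer()
data = rand(1_000_000) .* 300
write(io, data)
seekstart(io)
println(count_in_range(io, length(data), 100, 200))   # ≈ one third of the rows

A real implementation would also decompress each block right after reading it; the memory bound stays the same.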

5 Likes

So theoretically, depending on hardware and data size, 2^16 may or may not be the “optimal” chunk size. But it is definitely better than putting everything in one chunk or one row per chunk.

1 Like

Yes. The optimal chunk size depends on the hardware and the queries you plan to run. In DataFrameDBs you can specify block_size at table creation; 2^16 is the default. I took it from ClickHouse's defaults, and maybe it's too small: ClickHouse runs each request in several threads and can execute several requests at the same time, so the memory limit per thread is small. DataFrameDBs runs locally in one thread, so the memory limit per block can be bigger.
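For example (a hedged sketch: the block_size keyword is the one mentioned above, and the rest of the create_table call mirrors the CSV import example later in this thread; the exact signature is assumed):

using DataFrameDBs, CSV

# Hypothetical: pick a block size larger than the 2^16 default for a
# single-threaded local workload.
t = create_table("ecomm", from = CSV.Rows("ecomm.csv", reuse_row=true),
                 block_size = 2^18, show_progress = true)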

This looks very cool! Please reach out if you need any help hooking up to OnlineStats!

1 Like

Does this package try to be something like JuliaDB?

To be honest, I don’t really understand the JuliaDB philosophy :frowning:

Rather, it's something like ClickHouse without the network interface, in pure Julia. I.e., a persistent relational DB with optimizations for running queries directly on the disk storage.

4 Likes

I tried to import a large data set using the code:

using DataFrameDBs, CSV

t = create_table("trial", from = CSV.Rows("opprcd2018.csv", reuse_row=true), show_progress=true)

but get the error message:

ERROR: MethodError: no method matching findfirst(::Char, ::SubString{String})
Closest candidates are:
  findfirst(::Function, ::Union{AbstractString, AbstractArray}) at array.jl:1742
  findfirst(::Function, ::Any) at array.jl:1735
  findfirst(::AbstractString, ::AbstractString) at strings/search.jl:104

Can you show me the full error trace?

findfirst(ch::AbstractChar, string::AbstractString)

  Find the first occurrence of character ch in string.

  │ Julia 1.3
  │
  │  This method requires at least Julia 1.3.

Do you have version 1.2?
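For what it's worth, the predicate form of findfirst has worked since Julia 1.0, so a version-portable workaround is:

s = SubString("hello,world", 1, 11)

findfirst(isequal(','), s)   # works on Julia 1.0+, returns 6
# findfirst(',', s)          # this convenience method requires Julia 1.3+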

(Just a remark)
If you do get a really nice API, which seems like where this is going, it might be great to decouple the interface from DataFrames.jl specifically and be able to serve all Tables.jl-compatible containers.

What do you have in mind? Currently, any container that supports Tables.schema and Tables.rows can be imported into DataFrameDBs. DataFrame is used as the default materialization of views. Making DataFrameDBs itself support the Tables interface is one of my TODOs.
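For example, a plain named tuple of vectors is already a valid Tables.jl source, so something like this should work (a hedged sketch; the create_table call mirrors the CSV example above):

using DataFrameDBs, Tables

# A named tuple of vectors is a valid Tables.jl "column table":
src = (id = collect(1:1_000), price = rand(1_000) .* 300)
Tables.schema(Tables.rows(src))   # names (:id, :price), types (Int64, Float64)

t = create_table("demo", from = src, show_progress = false)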

1 Like

Awesome!

Just a little nitpick: the past tense of “to read” is “read”, not “readed”. Otherwise, great work!

1 Like

There are many such special cases in English.

If studying Julia also advances my English, that will be an excellent result. :slight_smile:

4 Likes

Yes, I am using Julia 1.2.

If it's critical, then I can try to make it compatible with 1.2.

That would be great!

OK. I originally wrote it on 1.3 and didn't track which features are supported in 1.2. findfirst is a small thing, and if there are no other critical features that require 1.3, then I will soon add support for 1.2.

Now master should work with 1.2.
If you test it with your data sets and let me know about any problems, that would be very cool, and it would greatly help further development. I apologize in advance for any errors that may occur; the package is quite complex, and there may be many undiscovered problems.