Hi all!
Julia is my hobby, and I've had some free time over the last 4 weeks, so here are the first results of my experiments: GitHub - waralex/DataFrameDBs.jl
It is a prototype of a columnar, persistent, type-stable and space-efficient database in pure Julia.
Some examples on this dataset, imported into DataFrameDBs:
julia> using DataFrameDBs
julia> t = open_table("ecommerce")
DFTable path: ecommerce
10×6 DataFrames.DataFrame
│ Row │ column        │ type           │ rows         │ uncompressed size │ compressed size │ compression ratio │
│     │ Symbol        │ String         │ String       │ String            │ String          │ Float64           │
├─────┼───────────────┼────────────────┼──────────────┼───────────────────┼─────────────────┼───────────────────┤
│ 1   │ event_time    │ Dates.DateTime │ 109.95 MRows │ 838.86 MB         │ 43.81 MB        │ 19.15             │
│ 2   │ event_type    │ String         │ 109.95 MRows │ 845.2 MB          │ 43.02 MB        │ 19.65             │
│ 3   │ product_id    │ Int64          │ 109.95 MRows │ 838.86 MB         │ 403.31 MB       │ 2.08              │
│ 4   │ category_id   │ Int64          │ 109.95 MRows │ 838.86 MB         │ 298.06 MB       │ 2.81              │
│ 5   │ category_code │ String         │ 109.95 MRows │ 1.97 GB           │ 467.32 MB       │ 4.31              │
│ 6   │ brand         │ String         │ 109.95 MRows │ 956.34 MB         │ 418.3 MB        │ 2.29              │
│ 7   │ price         │ Float64        │ 109.95 MRows │ 838.86 MB         │ 475.22 MB       │ 1.77              │
│ 8   │ user_id       │ Int64          │ 109.95 MRows │ 838.86 MB         │ 424.92 MB       │ 1.97              │
│ 9   │ user_session  │ String         │ 109.95 MRows │ 4.1 GB            │ 2.79 GB         │ 1.47              │
│ 10  │ Table total   │                │ 109.95 MRows │ 11.92 GB          │ 5.3 GB          │ 2.25              │
This dataset occupies 5.3 GB of disk space; the source CSV files occupy 14 GB.
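For reference, the compression ratio column in the table above is simply the uncompressed size divided by the compressed size, e.g. for event_time:

838.86 / 43.81   # ≈ 19.15, the ratio reported for event_time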
Let's do some operations on it:
julia> turnon_progress!(t) #turn on displaying progress for all reads from disk
julia> view = t[1:100:end, :] #It's a lazy view
View of table ecommerce
Projection: event_time=>col(event_time)::Dates.DateTime; event_type=>col(event_type)::String; ...
Selection: 1:100:109950701
julia> view2 = view[350 .> view.price .> 300, [:brand, :price, :event_type]] #It's a lazy view too
View of table ecommerce
Projection: brand=>col(brand)::String; price=>col(price)::Float64; event_type=>col(event_type)::String
Selection: 1:100:109950701 |> &(>(350, col(price)::Float64)::Bool, >(col(price)::Float64, 300)::Bool)::Bool
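Conceptually, such a view is just a recorded recipe: it stores what to read and which filter to apply, and nothing touches the disk until the view is counted, iterated, or materialized. A minimal plain-Julia sketch of the idea (illustrative only, not the package's actual internals; all names here are hypothetical):

# What a lazy view needs to remember, conceptually:
struct LazyViewSketch
    table::String                 # table directory on disk
    rows::StepRange{Int,Int}      # row selection, e.g. 1:100:109950701
    columns::Vector{Symbol}       # projection
    predicate::Function           # filter, evaluated block by block on read
end

v = LazyViewSketch("ecommerce", 1:100:109950701,
                   [:brand, :price, :event_type],
                   price -> 350 > price > 300)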
julia> size(view2) #Reads only the :price column, in blocks of ~65000 rows, and counts the rows that satisfy the condition
Time: 0:00:00 readed: 109.95 MRows (169.65 MRows/sec)
(43239, 3)
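That block-wise count looks conceptually like the following plain-Julia sketch (illustrative only, with a random stand-in for the :price column; not DataFrameDBs' code):

# Count matching rows while keeping at most one block in memory:
count_in_range(col, lo, hi; block = 65536) =
    sum(count(p -> lo < p < hi, b) for b in Iterators.partition(col, block))

prices = rand(250.0:0.01:400.0, 1_000_000)   # stand-in for the :price column
count_in_range(prices, 300, 350)             # scans one block at a time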
julia> materialize(view2) #Materialize into a DataFrame only the rows that satisfy the condition
Time: 0:00:01 readed: 109.95 MRows (64.2 MRows/sec)
43239×3 DataFrames.DataFrame
│ Row │ brand  │ price   │ event_type │
│     │ String │ Float64 │ String     │
├─────┼────────┼─────────┼────────────┤
│ 1   │        │ 321.73  │ view       │
│ 2   │ xiaomi │ 348.53  │ view       │
│ 3   │ sv     │ 308.63  │ view       │
│ 4   │ xiaomi │ 348.53  │ view       │
│ 5   │ intel  │ 348.79  │ view       │
│ 6   │ irobot │ 319.45  │ view       │
....
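The result of materialize is a regular in-memory DataFrames.DataFrame, so from here on the usual ecosystem applies; for instance (assuming the Statistics stdlib):

using Statistics
df = materialize(view2)   # a plain DataFrames.DataFrame in memory
mean(df.price)            # standard tooling works on it from here on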
julia> brand = view2.brand #Lazy iterable column of view2
DFColumn{String}
julia> unique(brand) #find the unique brands under the condition of view2
Time: 0:00:01 readed: 109.95 MRows (81.66 MRows/sec)
542-element Array{Any,1}:
""
"xiaomi"
"sv"
"intel"
"irobot"
"pioneer"
"dauscher"
...
julia> length.(brand) #Lazy column of lengths of brands
DFColumn{Int64}
julia> sum(length.(brand)) #sum of lengths of brands
Time: 0:00:01 readed: 109.95 MRows (86.65 MRows/sec)
204850
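The broadcast itself is recorded rather than executed; only the reduction (sum, unique, ...) drives a scan over the data. In plain-Julia terms the behaviour is like that of a generator (illustrative analogy only, with a stand-in vector for the lazy column):

brands = ["xiaomi", "sv", "intel"]          # stand-in for the lazy column
lazy_lengths = (length(b) for b in brands)  # nothing is computed yet
sum(lazy_lengths)                           # the reduction triggers the iteration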
Physically, a table is a directory, and each column is stored in a separate file, divided into blocks of 65536 elements. Each block is compressed with LZ4. When iterating over a view or a column, one block at a time is read from the columns required to evaluate the selection condition. If that block contains matching rows, the corresponding block is read from the columns required for the projection, and any broadcasts are applied to it. So only the data needed for a given request is ever read into memory.
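A minimal sketch of this storage scheme, assuming CodecLz4.jl for the LZ4 codec (the length-prefixed file layout here is illustrative, not the package's actual on-disk format):

using CodecLz4   # LZ4 codec; assumed here just for this sketch

const BLOCK = 65536

# Write an Int64 column as independently LZ4-compressed blocks,
# each prefixed with its compressed byte length.
function write_column(path, col::Vector{Int64})
    open(path, "w") do io
        for start in 1:BLOCK:length(col)
            block = col[start:min(start + BLOCK - 1, end)]
            bytes = transcode(LZ4FrameCompressor, collect(reinterpret(UInt8, block)))
            write(io, Int64(length(bytes)))
            write(io, bytes)
        end
    end
end

# Stream the column back block by block, so at most one decompressed
# block lives in memory during a scan.
function scan_column(f, path)
    open(path, "r") do io
        while !eof(io)
            len = read(io, Int64)
            bytes = transcode(LZ4FrameDecompressor, read(io, len))
            f(reinterpret(Int64, bytes))   # e.g. count matches, accumulate a sum, ...
        end
    end
end

With this layout, a block-wise count or sum is just a scan_column(block -> ..., path) call, which matches the access pattern described above.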
If this package is of interest to the community, it can be extended to support aggregates, joins, additional storable types, etc.
Best regards, Alexandr