[ANN] New and Improved JuliaDB

JuliaDB has had a series of big improvements over the past months. In addition to better performance and an API refresh, there are a few particular features that we would like to bring to your attention:

  1. JuliaDB is now closely integrated with OnlineStats. The algorithms in OnlineStats naturally lend themselves to working with very large distributed data sets. As a result, you can now calculate descriptive statistics and perform online statistical learning on distributed data sets within JuliaDB quickly and easily.

  2. JuliaDB now supports machine learning workflows with helpful utility functions that extract feature matrices out of raw input data. JuliaDB automatically detects continuous/categorical variables to create one-hot representations and standardized data. This allows immediate use of JuliaDB tables in machine learning algorithms with no additional data wrangling.

  3. JuliaDB leverages OnlineStats to visualize datasets of unlimited size using a broad selection of descriptive statistics, computed with single-pass distributed algorithms. The system builds fixed-size summaries of infinite data streams as the data comes in.
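
For readers wondering why single-pass algorithms lend themselves to distributed data, here is a minimal sketch in plain Julia. It is not the OnlineStats or JuliaDB implementation: `RunningStats`, `update!`, `merge_stats`, `sample_variance`, and `summarize` are hypothetical names chosen for illustration, and the code assumes Julia 1.x. The idea is that each chunk of data is reduced to a fixed-size summary in one pass, and summaries from separate chunks (or workers) can then be merged without revisiting the data.

```julia
# Hypothetical fixed-size summary: count, mean, and sum of squared deviations.
mutable struct RunningStats
    n::Int
    mean::Float64
    m2::Float64
end
RunningStats() = RunningStats(0, 0.0, 0.0)

# Welford's single-pass update: consume one observation, keep O(1) state.
function update!(s::RunningStats, x::Real)
    s.n += 1
    delta = x - s.mean
    s.mean += delta / s.n
    s.m2 += delta * (x - s.mean)
    return s
end

# Combine two summaries that were computed independently,
# e.g. on separate chunks or workers.
function merge_stats(a::RunningStats, b::RunningStats)
    n = a.n + b.n
    n == 0 && return RunningStats()
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta^2 * a.n * b.n / n
    return RunningStats(n, mean, m2)
end

sample_variance(s::RunningStats) = s.n > 1 ? s.m2 / (s.n - 1) : NaN

# Summarize one chunk of data in a single pass.
summarize(chunk) = foldl(update!, chunk; init = RunningStats())

data = collect(1.0:10.0)
left = summarize(data[1:4])      # pretend this chunk lives on worker 1
right = summarize(data[5:10])    # and this one on worker 2
combined = merge_stats(left, right)
```

The merged summary should match what a single pass over all of `data` would produce, which is what makes this kind of statistic a natural fit for a distributed reduce.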

This post would go well with the upcoming YouTube tutorial, "Intro to JuliaDB, a package for working with large persistent data sets", which I will definitely watch.

Shameless self-promotion: I’ve also tried to put together a small WIP tutorial on JuliaDB (focused on data manipulation and visualization for researchers - not focused at all on the machine learning side).

It’s based on an R dplyr tutorial (which somebody has already ported to DataFrames here). I’ve added something on visualizations, which was not present in the original dplyr tutorial but which I think is a relevant topic. Feel free to leave feedback in the “Issues” of the repositories if you wish (so as not to divert this thread too much from the original topic).

@joshday I unfortunately have not included OnlineStats integration for two reasons:

  • The recipes PR is not merged yet
  • I couldn’t get it to work smoothly with missing data (which I had in my example dataset)
    Still, if you manage to circumvent these issues and want to display some of your really cool work, feel free to modify the notebook and open a PR.

@piever Thanks for the pointer to your tutorial! I’ll respond here partially because dealing with missing data via OnlineStats is a newer feature I’d like to advertise.

  1. You can filter and transform data with (as an example):

    s = series(Mean(), Variance(); filter = isfinite, transform = abs)
    reduce(s, table; select = :mycolumn)
    

    which updates the series with abs(data[i]) only if isfinite(data[i]) == true. By some combination of filtering and transforming you should be able to handle any of the ways missing data can be represented in Julia.

  2. While the plot recipe PR is not in JuliaDB yet, all the functionality is currently available through OnlineStats.Partition and OnlineStats.IndexedPartition. The PR adds a simpler syntax for using these two types.
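
To make the filter-then-transform order from point 1 concrete, here is a minimal sketch in plain Julia. It does not call OnlineStats: `filtered_mean` and its `keep` keyword are hypothetical names for illustration, and the code assumes Julia 1.x, where `missing` is built into Base (unlike the Julia 0.6 setup discussed in this thread). Each value is first tested with the filter, and only values that pass are transformed and accumulated.

```julia
# Hypothetical helper: running mean over the values that pass `keep`,
# applying `transform` to each value before it is accumulated.
function filtered_mean(data; keep = x -> true, transform = identity)
    total = 0.0
    n = 0
    for x in data
        keep(x) || continue      # the filter sees the raw value
        total += transform(x)    # the transform is applied only to kept values
        n += 1
    end
    return n == 0 ? NaN : total / n
end

values_nan = [1.0, -2.0, NaN, 3.0]
values_missing = [1.0, missing, -2.0, 3.0]

# Missing data encoded as NaN: keep finite values only.
m1 = filtered_mean(values_nan; keep = isfinite, transform = abs)

# Missing data encoded as `missing`: keep non-missing values only.
m2 = filtered_mean(values_missing; keep = !ismissing, transform = abs)
```

Because the filter runs before the transform, the transform never has to handle the missing-value representation itself.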

Thanks for the clarification, I’ll definitely explore the JuliaDB-OnlineStats integration a bit more.

We will definitely be covering these topics! Stay tuned.

A compare-and-contrast with SAS would be nice.

Looks really cool, thanks for this great work.

I actually couldn’t get it to work with either type of missing data. I think some signatures may be too strict (or correct typing is enforced too early in the pipeline), but it would be really cool to get this to work.

using OnlineStats
using DataValues
s = series(Mean(), Variance(); filter = !isnull, transform = get)
y = cumsum(randn(10^6)) + 100randn(10^6)
ym = DataValueArray(y)
ym[2] = DataValue()
fit!(s, ym)

gives:


MethodError: no method matching fit!(::OnlineStats.AugmentedSeries{0,OnlineStats.Series{0,Tuple{OnlineStats.Mean,OnlineStats.Variance},OnlineStatsBase.EqualWeight},Base.##57#58{Base.#isnull},Base.#get,Base.#identity}, ::DataValues.DataValue{Float64})
Closest candidates are:
  fit!(::OnlineStats.HyperLogLog, ::Any, ::Float64) at /home/pietro/.julia/v0.6/OnlineStats/src/stats/stats.jl:314
  fit!(::StatsBase.StatisticalModel, ::Any...) at /home/pietro/.julia/v0.6/StatsBase/src/statmodels.jl:104
  fit!(::OnlineStats.CountMap{T}, ::T, ::Float64) where T at /home/pietro/.julia/v0.6/OnlineStats/src/stats/stats.jl:122
  ...

Stacktrace:
 [1] fit!(::OnlineStats.AugmentedSeries{0,OnlineStats.Series{0,Tuple{OnlineStats.Mean,OnlineStats.Variance},OnlineStatsBase.EqualWeight},Base.##57#58{Base.#isnull},Base.#get,Base.#identity}, ::DataValues.DataValueArray{Float64,1}) at /home/pietro/.julia/v0.6/OnlineStats/src/series.jl:177
 [2] include_string(::String, ::String) at ./loading.jl:522

and:

using Missings
s = series(Mean(), Variance(); filter = !ismissing)
y = cumsum(randn(10^6)) + 100randn(10^6)
ym = allowmissing(y)
ym[2] = missing
fit!(s, ym)

gives:

MethodError: no method matching fit!(::OnlineStats.AugmentedSeries{0,OnlineStats.Series{0,Tuple{OnlineStats.Mean,OnlineStats.Variance},OnlineStatsBase.EqualWeight},Base.##57#58{Missings.#ismissing},Base.#identity,Base.#identity}, ::Missings.Missing)
Closest candidates are:
  fit!(::OnlineStats.HyperLogLog, ::Any, ::Float64) at /home/pietro/.julia/v0.6/OnlineStats/src/stats/stats.jl:314
  fit!(::StatsBase.StatisticalModel, ::Any...) at /home/pietro/.julia/v0.6/StatsBase/src/statmodels.jl:104
  fit!(::OnlineStats.CountMap{T}, ::T, ::Float64) where T at /home/pietro/.julia/v0.6/OnlineStats/src/stats/stats.jl:122
  ...

Stacktrace:
 [1] fit!(::OnlineStats.AugmentedSeries{0,OnlineStats.Series{0,Tuple{OnlineStats.Mean,OnlineStats.Variance},OnlineStatsBase.EqualWeight},Base.##57#58{Missings.#ismissing},Base.#identity,Base.#identity}, ::Array{Union{Float64, Missings.Missing},1}) at /home/pietro/.julia/v0.6/OnlineStats/src/series.jl:177
 [2] include_string(::String, ::String) at ./loading.jl:522

whereas if everything is Float64, it filters just fine:

s = series(Mean(), Variance(); filter = isfinite)
y = cumsum(randn(10^6)) + 100randn(10^6)
y[2] = NaN
fit!(s, y) # gives correct finite result

Worth opening an issue or am I doing something silly?

There may also be extra typing issues: filtering requires an AugmentedSeries, whereas the plot recipe is defined only for Series:

y = cumsum(randn(10^6)) + 100randn(10^6)
o = Partition(Hist(50))
s = series(y, o, filter = isfinite)
plot(s, xlab = "Nobs")

gives:

No user recipe defined for OnlineStats.AugmentedSeries{0,OnlineStats.Series{0,Tuple{OnlineStats.Partition{0,OnlineStats.Hist{OnlineStats.AdaptiveBins{Float64}}}},OnlineStatsBase.EqualWeight},Base.#isfinite,Base.#identity,Base.#identity}

All of your examples work for me… but that’s because I forgot to tag an important change in OnlineStatsBase that allows DataValues/Missings as input. There’s a pending PR in Metadata for OnlineStats that lets you plot AbstractSeries as well.

Everything works smoothly after Pkg.checkout, and the syntax will get even better after the transition to missing, since one will no longer need transform = get. Once again, kudos for the amazing work!

I would also like to draw attention to the out-of-core functionality (http://juliadb.org/latest/manual/out-of-core.html). It’s limited yet useful for big datasets, especially in combination with OnlineStats.

Cheers!

Is there a timeline for updating JuliaDB for v0.7? (Not urging anything, I am grateful for free software, just looking for information). I considered making a PR, but there are already multiple ones.

I know @shashi has been working on it, but I won’t speak for him on the timeline. The first step is IndexedTables (https://github.com/JuliaComputing/IndexedTables.jl/pull/182), which has a CI-passing PR.
