First part benchmarking medium-data tools

I compared and contrasted JuliaDB with Dask and disk.frame.

Probably more to come, but JuliaDB needs to mature before it can be a contender in these benchmarks.


I don’t know much about JuliaDB, but could it be that these benchmarks include some Julia precompilation time that slows down the results?

Instead of using the @time macro, you should use the @btime macro from the BenchmarkTools package.

In general, you should use the BenchmarkTools package for benchmarking.
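To make the compilation-time point concrete, here is a minimal sketch using only Base Julia. The function `sum_of_squares` and the random `data` vector are hypothetical stand-ins, not the code from the benchmark post. It times the same call twice with `@elapsed`: the first timing includes JIT compilation, the second does not.

```julia
# Hypothetical workload; a stand-in for whatever is being benchmarked.
sum_of_squares(xs) = sum(x -> x^2, xs)

data = rand(1_000_000)

# First call: the timing includes JIT compilation for Vector{Float64}.
t_first = @elapsed sum_of_squares(data)

# Second call: the compiled method is reused, so this is closer to
# the steady-state runtime.
t_second = @elapsed sum_of_squares(data)

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

With BenchmarkTools installed, `@btime sum_of_squares($data)` runs the call many times and reports the minimum, so compilation is excluded; the `$` interpolates the global so that access to it isn't part of the measurement.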

Given how much benchmarking @xiaodai has done for CSV reading etc., I imagine he’s well aware of this :slight_smile:. In my (limited) personal experience, JuliaDB is awesome in principle, but I had the same impressions as the ones summarised in the post and was always better off using DataFrames, though that may just reflect the nature of the data I was playing with.

As an aside, I think as a community we should be wary of just telling people to look at @btime because it looks better. It’s certainly relevant in most cases where functions are called multiple times, but for things like loading files and transforming them to a given format, (new) users who do such operations once or twice in a session will very much experience the compilation time and be disappointed that their experience doesn’t match the promised speedy results. But then that’s the TTFP (time to first plot) issue, which has already been discussed at length…

Couldn’t have put it better myself