Looking forward to your blog post. I represent the proportion of people who have never heard of mmap before.
Have you looked at JuliaDB.jl?
Thanks!! I thought JuliaDB was for connecting to databases. Didn’t realise it had persistent data storage capabilities. Looks very close to what I need. Will do the research.
Also, there will soon be OnlineStats integration in JuliaDB (https://github.com/JuliaComputing/JuliaDB.jl/pull/75), which would help with building algorithms on top of it. Take a look at SparseRegression for an example.
It’s a bit tricky, as there’s a “JuliaDB” organisation for connecting to databases, and then there’s the unrelated “JuliaDB.jl” package…
@xiaodai, Julia has some amazing tools for big data. One example is the ability to do lazy transformations of large arrays. For example, let’s imagine you have a 10TB 4D array stored as an NRRD file, and you want to take the square root of each element and swap dimensions 3 and 4. This could easily take a couple of hours using other tools, and would involve writing out another disk file in the process. In Julia it only takes a few microseconds and can be done “in memory”:
```julia
using FileIO, MappedArrays
A = load("bigfile.nrrd")
C = PermutedDimsArray(mappedarray(sqrt, A), (1,2,4,3))
```
That’s because all the operations here are lazy (“virtual”) and are computed on-demand. You can pass these lazy arrays to visualization code, etc, and as long as it’s all been written against our generic AbstractArray interface it should all Just Work.
Of course Julia also supports eager computation (which would be permutedims(sqrt.(A), (1,2,4,3))), but for big data lazy is very nice.
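Here is a tiny self-contained sketch of the same laziness on a small in-memory array (the array and its size are made up for illustration; no 10TB file needed):

```julia
using MappedArrays

A = rand(2, 2, 3, 4)                    # stand-in for the big on-disk array
B = mappedarray(sqrt, A)                # lazy: no sqrt has been computed yet
C = PermutedDimsArray(B, (1, 2, 4, 3))  # lazy: no data has been moved

# Elements are only computed when you actually index into C:
C[1, 1, 2, 3] == sqrt(A[1, 1, 3, 2])    # true
size(C)                                 # (2, 2, 4, 3)
```

Because C is still an AbstractArray, anything written against the generic interface (reductions, visualization, etc.) accepts it directly.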
I hope to be able to learn more about these and be able to introduce this to the masses. It’s not something that I’ve seen before, and the syntax looks a bit different to the type of programming I am used to, e.g. R’s data.frame and data.table.
It’s also worth mentioning packages wrapping SQL engines, like SQLite. I know SAS users often rely on proc sql because it’s faster than the standard data step, so that should make sense to them. Of course that requires writing SQL instructions.
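As a rough sketch of what that looks like (assuming a recent SQLite.jl, whose query API goes through DBInterface; the table and column names here are made up):

```julia
using SQLite, DBInterface

# An in-memory database for illustration; pass a filename to store on disk.
db = SQLite.DB()
DBInterface.execute(db, "CREATE TABLE sales (region TEXT, amount REAL)")
DBInterface.execute(db, "INSERT INTO sales VALUES ('east', 100.0), ('west', 250.0)")

# The SQL engine does the aggregation, much like proc sql would in SAS:
for row in DBInterface.execute(db,
        "SELECT region, sum(amount) AS total FROM sales GROUP BY region ORDER BY region")
    println(row.region, " => ", row.total)
end
```

The point is that the heavy lifting happens inside the engine, so the data never has to fit in Julia’s memory at once.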
I think @davidanthoff has also been working on a SQL backend to Query.jl, which would essentially allow you to run the same query against a data frame or against a SQL database depending on your needs.
I don’t think it’s been mentioned in the thread, but the term you’re looking for is out-of-core.
JuliaDB does out-of-core through Dagger.jl, and SQL databases do this as well, like @nalimilan says.
But one of the important things with Julia is distinguishing between the representation of data and the API. Using generic functions with dispatch, the same API can apply to many different “backends” which handle the data differently. So you may want to look at interfaces like these (Query.jl, DataStreams.jl, IterableTables.jl, etc.) to mix and match backends depending on the circumstance, while using the same code.
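A toy illustration of that separation (the function name is made up; nothing here is backend-specific):

```julia
using MappedArrays

# One generic algorithm, written once against the AbstractArray API:
demean(A::AbstractArray) = A .- sum(A) / length(A)

# It runs unchanged on different representations of the data:
demean([1.0, 2.0, 3.0])                 # plain in-memory Array
demean(mappedarray(abs, [-1.0, 2.0]))   # lazy mapped view, nothing materialized
```

Swapping in a memory-mapped vector or a distributed array would look exactly the same at the call site; dispatch picks the right machinery underneath.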
I did a short writeup here:
It does not go into much detail, but the libraries I made public are much better documented. Hope you find this useful.
FWIW, once data is ingested into a binary format and mmapped, I find that I can process a 100 GB dataset in a few minutes on a reasonably recent computer (even a laptop) with an SSD. The key is almost-linear access; random access is of course much worse.
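For the curious, a minimal sketch of that workflow with the standard library’s Mmap (the file name and size here are made up, and the file is tiny compared to the real case):

```julia
using Mmap

# Ingest: write a binary file of Float64s (stand-in for the 100 GB dataset).
n = 1_000_000
open("data.bin", "w") do io
    write(io, rand(n))
end

# Memory-map it: the OS pages bytes in on demand; nothing is read up front.
A = open(io -> Mmap.mmap(io, Vector{Float64}, n), "data.bin")

# Linear scans (sums, means, filters) stream through the file efficiently.
m = sum(A) / n
```

Random indexing into A also works, but each miss costs a page fault, which is why the almost-linear access pattern matters so much.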
Is there a minimal subset of the dataset available anywhere for trying out your code?
I will create one soon if that would help.
I just wrote a very similar blog post: https://medium.com/@sdanisch/drawing-2-7-billion-points-in-10s-ecc8c85ca8fa
Not sure how on-topic this is, but it’s at least disk-based!
Interesting writeup, thanks! Regarding write: AFAICT there is no simple write(::IO, ::T) where isbits(T) even on master, so I submitted a PR, but since you know much more about the internals, maybe you could suggest an improvement or make another PR that does this.
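Until something like that lands, one workaround that already works on released Julia is to go through arrays, since write and read! handle arrays of isbits elements as raw bytes (the Point type below is just an example):

```julia
struct Point   # a plain isbits type
    x::Float64
    y::Float64
end

pts = [Point(1.0, 2.0), Point(3.0, 4.0)]

# Round-trip the raw bytes via a temporary file:
path = tempname()
open(io -> write(io, pts), path, "w")

back = Vector{Point}(undef, length(pts))
open(io -> read!(io, back), path)
```

It’s clumsier than a scalar write(::IO, ::T), but for bulk storage you usually want whole arrays anyway.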
If anyone is interested, Feather.jl is already quite useful for working with memory mapped data via this PR. I already use it that way quite routinely (also feather is a really wonderful format). I really should talk to @quinnj about getting that merged, but I’ve been happily using my fork and have mostly forgotten about it.
I did look at Feather.jl, and found two problems with it:
- I need to know the data size in advance (which requires another pass),
- AFAICT types are restricted to what Feather supports (is this correct?)
Yes, that is certainly true. If you have need of custom datatypes, Feather is definitely not for you. In those cases I use JLD, but I rarely have much need to store large amounts of data of custom types.
For writing you mean? Yes, that seems to be a limitation as well. At least in my case I usually “write once, read millions of times”.
Custom types are helpful especially when I have large amounts of data: I can compress the representation and still work with it seamlessly. E.g. using 16-bit integers for dates saves 75%, see
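For example, a minimal sketch of that trick (the epoch below is arbitrary, just for illustration):

```julia
using Dates

const EPOCH = Date(2000, 1, 1)   # arbitrary reference date

# Int16 day-offsets cover roughly ±89 years around the epoch in 2 bytes,
# instead of the 8 bytes a Date takes internally: a 75% saving.
encode(d::Date) = Int16(Dates.value(d - EPOCH))
decode(i::Integer) = EPOCH + Day(i)

d = Date(2017, 6, 15)
decode(encode(d)) == d   # round-trips exactly
```

Columns of such Int16 codes can then be stored and mmapped like any other bits-type array, while decode makes them usable as real Dates.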
I really like fstpackage.org: it provides random access to rows and columns. The interface is in C++, so it’s not restricted to R.