Disk-based data manipulation framework needed

data

#1

SAS is big in the corporate world. I work in the finance industry and SAS is still quite big. To be honest, using SAS is a pain. SAS can’t even syntax highlight its own language properly. But it’s still there because it has one trick – disk-based data manipulation and associated algorithms.

Ten years ago when I introduced R to my workplace, people were skeptical – you can’t load a large dataset in R and manipulate it like in SAS. That’s because in R the dataset needs to be loaded into memory and at that time the largest laptop only had 4G of RAM. Today, 32G RAM laptops are becoming the norm but still I can’t load really large (50G) datasets into RAM.

I think Julia and R can replace SAS by implementing disk-based data manipulation as a first-class citizen. But can most algorithms still work once the data is disk-based? If not, then Julia still can’t replace SAS, because most of SAS’s algorithms (e.g. proc glm) work off disk-based data.


#2

I am not sure I understand what you are saying. If you are advocating the development of libraries that have functionality like SAS, the best way to do that is to start working on one.

If you are asking about disk-based data access: it is very easy to do for large data using mmap. I am in the process of working on a project that involves this, and will do a blog post soon, but the principle is very simple: map a file to an array, and from then on just access your data with [].
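For readers who have not used mmap before, a minimal sketch with Julia's Mmap standard library might look like the following. The temporary file and the length of 1000 are made up for illustration; a real dataset would need its length and element type recorded somewhere.

```julia
using Mmap

# Create a small binary file of raw Float64 values (no header).
path, io = mktemp()
write(io, collect(1.0:1000.0))
close(io)

# Map the file to a vector; pages are read from disk only when touched.
io = open(path, "r+")
A = Mmap.mmap(io, Vector{Float64}, (1000,))

first_and_last = (A[1], A[end])   # ordinary indexing with []

A[2] = -1.0       # writes go to the mapped pages...
Mmap.sync!(A)     # ...and are flushed to the file here
close(io)
```

The same indexing code works whether the file is a few kilobytes or larger than RAM, since the operating system decides which pages stay in memory.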


#3

Nice. I have started working on some functions that work on feather files stored manually as chunks. Hopefully it will turn into a package later on. I will look into mmap; it seems pretty cool.
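As a rough illustration of the chunked approach, here is a sketch that uses the Serialization standard library as a stand-in for feather files. The helper names `write_chunks` and `chunked_mean` and the file layout are hypothetical, not taken from any package.

```julia
using Serialization

# Split a collection into fixed-size chunk files (hypothetical layout).
function write_chunks(dir, values, chunksize)
    paths = String[]
    for (i, part) in enumerate(Iterators.partition(values, chunksize))
        path = joinpath(dir, "chunk_$i.bin")
        serialize(path, collect(part))
        push!(paths, path)
    end
    return paths
end

# Reduce one chunk at a time, so only one chunk is in memory at once.
function chunked_mean(paths)
    total, n = 0.0, 0
    for path in paths
        chunk = deserialize(path)
        total += sum(chunk)
        n += length(chunk)
    end
    return total / n
end

dir = mktempdir()
paths = write_chunks(dir, 1.0:10.0, 4)
m = chunked_mean(paths)
```

This only works directly for statistics that can be combined from per-chunk pieces (sums, counts, minima and so on); something like a median would need a different algorithm.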


#4

Looking forward to your blog post. I represent the proportion of people who have never heard of mmap before.


#5

Have you looked at JuliaDB.jl?


#6

Thanks!! I thought JuliaDB was for connection to databases. Didn’t realise it had persistent data storage capability. Looks very close to what I need. Will do the research.


#7

Also, there will soon be OnlineStats integration in JuliaDB (https://github.com/JuliaComputing/JuliaDB.jl/pull/75) which would help building algorithms on top of it. Take a look at SparseRegression, for an example.


#8

It’s a bit tricky, as there’s a “JuliaDB” organisation for connecting to databases, and then there’s the unrelated “JuliaDB.jl” package…


#9

@xiaodai, Julia has some amazing tools for big data. One example is the ability to do lazy transformations of large arrays. For example, let’s imagine you have a 10TB 4d array stored as an NRRD file, and you want to take the square root of each element and swap dimensions 3 and 4. This could easily take a couple of hours using other tools, and would involve writing out another disk file in the process. In Julia it only takes a few microseconds and can be done “in memory”:

using FileIO, MappedArrays
A = load("bigfile.nrrd")
C = PermutedDimsArray(mappedarray(sqrt, A), (1,2,4,3))

That’s because all the operations here are lazy (“virtual”) and are computed on-demand. You can pass these lazy arrays to visualization code, etc, and as long as it’s all been written against our generic AbstractArray interface it should all Just Work.

Of course Julia also supports eager computation (which would be permutedims(sqrt.(A), (1,2,4,3))), but for big data lazy is very nice.
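To see why the lazy version costs almost nothing up front, here is a small sketch that needs no packages. `LazyMap` is a hypothetical, read-only stand-in for what `mappedarray` provides (it is not the MappedArrays implementation), and a tiny in-memory array stands in for the big file.

```julia
# Hypothetical minimal lazy wrapper: applies f on each access, stores nothing.
struct LazyMap{T,N,F,A<:AbstractArray{<:Any,N}} <: AbstractArray{T,N}
    f::F
    data::A
end

function LazyMap(f, data::AbstractArray{S,N}) where {S,N}
    T = typeof(f(first(data)))   # assumes data is non-empty and f is type-stable
    return LazyMap{T,N,typeof(f),typeof(data)}(f, data)
end

# The two methods a read-only AbstractArray needs.
Base.size(m::LazyMap) = size(m.data)
Base.getindex(m::LazyMap{T,N}, I::Vararg{Int,N}) where {T,N} = m.f(m.data[I...])

A = reshape(collect(1.0:24.0), 1, 2, 3, 4)

# Lazy: no element is computed until it is indexed.
C = PermutedDimsArray(LazyMap(sqrt, A), (1, 2, 4, 3))

# Eager: allocates and fills two full-size temporaries.
eager = permutedims(sqrt.(A), (1, 2, 4, 3))
```

Constructing `C` only wraps `A` in two thin structs, which is why the cost does not depend on the array size; the square roots are paid for when elements are actually read.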


#10

I hope to be able to learn more about these and be able to introduce this to the masses. It’s not something that I’ve seen and the syntax looks a bit different to the type of programming I am used to, e.g. R data.frame, data.table.


#11

It’s also worth mentioning packages wrapping SQL engines, like SQLite. I know SAS users often rely on proc sql because it’s faster than the standard data step, so that should make sense to them. Of course that requires writing SQL instructions.

I think @davidanthoff has also been working on a SQL backend to Query.jl, which would essentially allow you to run the same query against a data frame or against a SQL database depending on your needs.


#12

I don’t think it’s been mentioned in the thread, but the term you’re looking for is out-of-core.

JuliaDB does out-of-core through Dagger.jl, and SQL databases do this as well, as @nalimilan says.

But one of the important things with Julia is distinguishing between the representation of data and the API. Using generic functions with dispatch, the same API can apply to many different “backends” which handle the data differently. So you may want to look at interfaces like this (Query.jl, DataStreams.jl, IterableTables.jl, etc) to mix the choices depending on the circumstance, but using the same code.
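A toy sketch of that idea, with made-up names (`Backend`, `InMemory`, `OnDisk`, `eachvalue`, `streaming_mean`) that do not come from any of the packages above: the algorithm is written once against a small interface, and dispatch picks how the data is actually read.

```julia
abstract type Backend end

struct InMemory <: Backend
    values::Vector{Float64}
end

struct OnDisk <: Backend
    path::String   # text file with one number per line (assumed format)
end

# The "interface": call f on every value, however the backend stores it.
eachvalue(f, b::InMemory) = foreach(f, b.values)

function eachvalue(f, b::OnDisk)
    open(b.path) do io
        for line in eachline(io)   # streams the file line by line
            f(parse(Float64, line))
        end
    end
end

# The algorithm is written once and never looks at the storage.
function streaming_mean(b::Backend)
    total = 0.0
    n = 0
    eachvalue(b) do x
        total += x
        n += 1
    end
    return total / n
end

path, io = mktemp()
write(io, "1.0\n2.0\n3.0\n4.0\n")
close(io)

mem_mean = streaming_mean(InMemory([1.0, 2.0, 3.0, 4.0]))
disk_mean = streaming_mean(OnDisk(path))
```

Adding another backend (a database table, a memory-mapped file) would mean adding one `eachvalue` method, with `streaming_mean` left untouched.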


#13

I did a short writeup here:
https://tpapp.github.io/post/large-ragged-dataset-julia/
Does not go into much detail, but the libraries I made public are much better documented. Hope you find this useful.

FWIW, once data is ingested into a binary format and mmapped, I find that I can process a 100 GB dataset in a few minutes with a reasonably recent computer (even a laptop) with an SSD. The key is almost-linear access; random access is of course much worse.


#14

There isn’t a minimal subset of the dataset anywhere for trying out your code?


#15

I will create one soon if that would help.


#16

I just wrote a very similar blog post: https://medium.com/@sdanisch/drawing-2-7-billion-points-in-10s-ecc8c85ca8fa
:slight_smile: Not sure how on topic this is, but it’s at least disk-based!


#17

Interesting writeup, thanks! Regarding write: AFAICT there is no simple write(::IO, ::T) where isbits(T) even in master, so I submitted a PR:
https://github.com/JuliaLang/julia/pull/24234
but since you know much more about the internals, maybe you could suggest an improvement or make another PR that does this.
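For context, here is a sketch of writing and reading isbits values as raw bytes using calls that I believe exist in current Julia (`write` on a `Ref` or an `Array` of an isbits type, and `read!` into a preallocated array). The `Obs` type is made up for illustration, and this says nothing about what the PR itself ended up doing.

```julia
# A made-up isbits record type; both fields are 8 bytes, so there is no padding.
struct Obs
    id::Int64
    value::Float64
end

path, io = mktemp()
write(io, Ref(Obs(1, 2.5)))              # a single value, via Ref
write(io, [Obs(2, 3.5), Obs(3, 4.5)])    # a whole array in one call
close(io)

# Read the raw bytes back into a preallocated vector of the same type.
obs = open(path) do io
    read!(io, Vector{Obs}(undef, 3))
end

bytes_on_disk = filesize(path)
```

The file contains nothing but the in-memory layout of the structs, so it is only readable by code that knows the type, and it is not portable across machines with a different byte order.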


#18

If anyone is interested, Feather.jl is already quite useful for working with memory mapped data via this PR. I already use it that way quite routinely (also feather is a really wonderful format). I really should talk to @quinnj about getting that merged, but I’ve been happily using my fork and have mostly forgotten about it.


#19

I did look at Feather.jl, and found two problems with it:

  1. I need to know the data size in advance (which requires another pass),
  2. AFAICT types are restricted to what Feather supports (is this correct?)

#20

Yes, that is certainly true. If you have need of custom datatypes, Feather is definitely not for you. In those cases I use JLD, but I rarely have much need to store large amounts of data of custom types.

For writing you mean? Yes, that seems to be a limitation as well. At least in my case I usually “write once, read millions of times”.