Reading large Excel files (with UTF-8 entries) for caching and later processing?

There seem to be many options for reading Excel files into Julia and caching them for later processing. Can someone recommend a few good ways? Would ExcelFiles.jl and JuliaDB.jl be a good first try? Or is something like Query.jl and DataFrames going to be performant enough?

Context: I’ll read on the order of 15 million rows of data comprising 10-15 columns, some of which contain UTF-8 text entries, and I need to “cache” them on disk or similar so I can run queries and calculations on them later.

Thanks for any advice/experiences you can share.

I would just read into whatever format is preferred, and then dump that with Serialization.serialize, with the understanding that this needs to be redone whenever a new Julia version comes out. But I assume this is not a problem as you have the original data. You can automate the whole thing with a Makefile.
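A minimal sketch of that read-once-then-serialize pattern, using only the Serialization standard library. The helper name `cached`, the file name `table.jls`, and the `load_table` stand-in for the real Excel import are all made up for illustration; the real loader would return whatever table type you settle on.

```julia
using Serialization

# Hypothetical helper: if `cachefile` exists, deserialize it; otherwise call
# `loader()` (e.g. the slow Excel import) and serialize the result for next time.
function cached(loader::Function, cachefile::AbstractString)
    if isfile(cachefile)
        return open(deserialize, cachefile)
    end
    data = loader()
    open(io -> serialize(io, data), cachefile, "w")
    return data
end

# Stand-in for the real Excel import: a NamedTuple of columns with UTF-8 text.
load_table() = (id = [1, 2, 3], name = ["Åsa", "José", "李"])

cachefile = joinpath(mktempdir(), "table.jls")
first_run = cached(load_table, cachefile)   # runs the loader and writes the cache
second_run = cached(load_table, cachefile)  # reads the cache instead of reloading
```

Deleting the cache file (or letting a Makefile rule do it when the Excel file changes) forces a fresh import on the next call.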

Thanks, Tamas. That sounds like a simple solution. :slight_smile: I’ll try it out.

The added benefits of something like JuliaDB are not super-clear to me then, if simply relying on serialization covers many use cases. Anyway, I’ll experiment.

AFAIK the forte of JuliaDB is not saving/loading data, but working with large datasets. That said, the fact that it has not been updated for 1.0 would make me skeptical about investing in it for daily work.

If JLD2 is fast enough for your needs, I’d use it instead of serialize as it will be more robust across Julia versions.

Might be more robust across Julia versions. Look at JLD.jl which isn’t getting any love from the authors in the 0.6 → 1.0 transition.

Sorry for being negative here, but I’m a bit annoyed by this.

Before Julia 1.0, packages are expected to break across releases, but after 1.0 that will no longer be the case. So JLD2 will definitely be much more robust than serialize.

Ok, thanks for all replies. I’ll eval serialize/deserialize as well as JLD2 for my case and report back here what I end up using and why. JLD2 does sound intriguing if there is no real performance penalty. Feather might be an alternative if I need to interop with R at some point.

Pre-1.0, the serialize format changed incompatibly on minor version changes (which were really much like major version changes: 0.3.x → 0.4.x → 0.5.x → 0.6.x → 0.7.0). However, will there be a guarantee that only backwards-compatible changes will be made to the serialize format until v2.0?

There are many use cases where serialize can be used (keeping the Julia version as part of the file name or path where it is stored), and where it’s much more efficient than using something like JLD2.
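As a rough sketch of the version-in-the-file-name idea, assuming a made-up helper `versioned_cache_path` and a `.jls` extension (neither is an established convention):

```julia
using Serialization

# Hypothetical helper: embed the running Julia version in the cache file name,
# so a cache written by a different Julia version is never picked up by accident.
versioned_cache_path(dir, name) = joinpath(dir, "$(name)-julia-$(VERSION).jls")

path = versioned_cache_path(mktempdir(), "table")

# Write and read back some UTF-8 text entries as a round-trip check.
open(io -> serialize(io, ["å", "ß", "日本語"]), path, "w")
restored = open(deserialize, path)
```

After a Julia upgrade the path changes, so the old cache is simply not found and the data gets re-imported from the original Excel file.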