Reading large Excel files (with UTF-8 entries) for caching and later processing?

robertfeldt · September 19, 2018, 10:02am

There seem to be many options for reading Excel files into Julia and caching them for later processing. Can someone recommend a few good ways? Would ExcelFiles.jl and JuliaDB.jl be a good first try? Or is something like Query.jl and DataFrames going to be performant enough?

Context: I’ll read on the order of 15 million rows of data comprising 10-15 columns, some of which contain UTF-8 text entries, and I need to “cache” them on disk or similar so I can run queries and calculations on them later.

Thanks for any advice/experiences you can share.

Tamas_Papp · September 19, 2018, 10:49am

I would just read into whatever format is preferred, and then dump that with Serialization.serialize, with the understanding that this needs to be redone whenever a new Julia version comes out. But I assume this is not a problem as you have the original data. You can automate the whole thing with a Makefile.
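A minimal sketch of that approach, using only the Serialization standard library. The `data` value and the cache location are hypothetical stand-ins for whatever the Excel import actually produces; nothing here is specific to a particular Excel reader.

```julia
using Serialization

# Hypothetical stand-in for the table read from the Excel file,
# including a column with UTF-8 text entries.
data = (id = [1, 2, 3], label = ["å", "日本語", "plain"])

# Hypothetical cache location; in practice this would sit next to the source data.
cache_path = joinpath(mktempdir(), "table.jls")

# Write the cache once, after the slow Excel import.
open(cache_path, "w") do io
    serialize(io, data)
end

# Later sessions load the cache instead of re-reading the spreadsheet.
restored = open(deserialize, cache_path)
```

The cache should be treated as disposable: if it fails to load after a Julia upgrade, delete it and rebuild from the original file.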

robertfeldt · September 19, 2018, 11:41am

Thanks, Tamas. That sounds like a simple solution. I’ll try it out.

The added benefits of something like JuliaDB are not super-clear to me then, if simply relying on serialization covers many use cases. Anyway, I’ll experiment.

Tamas_Papp · September 19, 2018, 11:52am

AFAIK the forte of JuliaDB is not saving/loading data, but working with large datasets. That said, the fact that it has not been updated for 1.0 would make me skeptical about investing in it for daily work.

nalimilan · September 19, 2018, 11:58am

If JLD2 is fast enough for your needs, I’d use it instead of serialize as it will be more robust across Julia versions.

carstenbauer · September 19, 2018, 12:01pm

Might be more robust across Julia versions. Look at JLD.jl which isn’t getting any love from the authors in the 0.6 → 1.0 transition.

Sorry for being negative here, but I’m a bit annoyed by this.

nalimilan · September 19, 2018, 12:28pm

Before Julia 1.0 packages are expected to break across releases, but after 1.0 that will no longer be the case. So JLD2 will definitely be much more robust than serialize.

robertfeldt · September 19, 2018, 1:21pm

Ok, thanks for all replies. I’ll eval serialize/deserialize as well as JLD2 for my case and report back here what I end up using and why. JLD2 does sound intriguing if there is no real performance penalty. Feather might be an alternative if I need to interop with R at some point.

ScottPJones · September 20, 2018, 12:25pm

Pre 1.0, the serialize format changed incompatibly on minor version changes (which were really much like major version changes, 0.3.x → 0.4.x → 0.5.x → 0.6.x → 0.7.0). However, will there be a guarantee that only backwards-compatible changes will be made to the serialize format until v2.0?

There are many use cases where serialize can be used (keeping the Julia version as part of the file name or path where it is stored), and where it’s much more efficient than using something like JLD2.
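One way to sketch the version-in-the-path idea, assuming the original data is still available to rebuild from. The helper names `versioned_cache_path` and `load_or_build` are made up for illustration and are not part of any package.

```julia
using Serialization

# Hypothetical helper: embed the running Julia version in the cache file name,
# so a cache written by one Julia version is never read by another.
versioned_cache_path(dir, name) =
    joinpath(dir, string(name, "-julia-", VERSION, ".jls"))

# Hypothetical helper: load the cache for this Julia version if it exists,
# otherwise call `build()` (e.g. the slow Excel import) and cache its result.
function load_or_build(build, dir, name)
    path = versioned_cache_path(dir, name)
    if isfile(path)
        return open(deserialize, path)
    end
    value = build()
    open(io -> serialize(io, value), path, "w")
    return value
end

# Usage with a stand-in for the real import step.
cache_dir = mktempdir()
rows = load_or_build(cache_dir, "rows") do
    ["first row", "second row"]  # placeholder for reading the Excel file
end
```

After a Julia upgrade the file name changes, so the old cache is simply ignored and a fresh one is built on the next call.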
