How is the data ecosystem right now for large datasets?

Of course you will always be stuck having to deal with dates and strings in some way, but you are still free to choose how to represent this data in memory or on disk. For example, in Julia, dates and times are backed by integers (try `DateTime().instant.periods.value`), and indeed, at the end of the day everything is an integer, but the usual approach is to keep them wrapped in `DateTime` objects while they sit in a dataframe. This is the approach I'm starting to question. Perhaps instead of storing these in a `Vector{DateTime}` we should store them as a `Vector{Int}` with metadata telling us to convert to `DateTime` only when appropriate.
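To make that concrete, here is a small sketch of the round trip in Julia: pulling the underlying `Int64` out of a `DateTime` and reconstructing the `DateTime` from it later. (`Dates.UTM` is the standard-library wrapper for a millisecond instant; everything else here is just illustration.)

```julia
using Dates

dt = DateTime(2024, 6, 1, 12, 30)

# The integer hiding inside: milliseconds since Julia's epoch.
ms = dt.instant.periods.value
@assert ms == Dates.value(dt)   # same thing via the public accessor

# The "Vector{Int} plus metadata" idea: store raw integers...
raw = [Dates.value(DateTime(2024, 6, d)) for d in 1:3]

# ...and convert back to DateTime only when a human needs to see them.
dts = DateTime.(Dates.UTM.(raw))
@assert dts[1] == DateTime(2024, 6, 1)
```

The point is that the conversion is cheap and lossless in both directions, so nothing is sacrificed by keeping the storage layer integer-only.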

This might sound silly (and I'm certainly not committed to this idea; I've just been tossing it around), but when one considers that ultimately all the data has to go into some sort of analysis that only understands integers and floats anyway, one wonders whether `DateTime` is appropriate as a wrapper for stored data or whether it is merely an interface for presenting data to humans. A similar argument can be made for strings, since they almost always represent objects that can be mapped to the integers.
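The string case can be sketched the same way. Below is a hypothetical minimal "string pool" (the `encode!` helper and the variable names are mine, not any particular library's API): the stored column is a `Vector{Int}` of codes, and the string labels live off to the side as metadata, consulted only for display.

```julia
# Label table (the "metadata") and the string-to-code lookup.
labels = String[]
code_of = Dict{String,Int}()

# Assign each distinct string the next integer code, reusing existing codes.
encode!(s::AbstractString) = get!(code_of, s) do
    push!(labels, s)
    length(labels)
end

# The column as it would be stored: integers only.
codes = [encode!(s) for s in ["red", "blue", "red", "green"]]
@assert codes == [1, 2, 1, 3]

# Decoding happens only at the human-facing boundary.
@assert labels[codes[3]] == "red"
```

This is essentially what pooled/categorical array types do under the hood; the question raised here is whether the integer view, rather than the string view, should be the primary one.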

Take the datestamps for events as an example: how would we have dealt with time if we had encountered it in HEP? It would be a float (or perhaps an integer, because of precision issues). There would never be any question about it, because everyone knows that time is represented by real numbers. Perhaps it would behoove us not to forget this fact even when someone hands us a date.
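The parenthetical about precision is worth making concrete. A `Float64` has a 53-bit significand, so millisecond-scale epoch timestamps survive a round trip through a float, but nanosecond-scale ones generally do not (the specific timestamp values below are arbitrary, chosen only to land in a plausible modern range):

```julia
# Millisecond timestamp (~2023 CE): well under 2^53, exactly representable.
ms = 1_700_000_000_001
@assert Float64(ms) == ms

# Nanosecond timestamp (~2023 CE): above 2^53, so Float64 must round it.
ns = 1_700_000_000_000_000_001
@assert Float64(ns) != ns
```

Which is why, at nanosecond resolution, "time is a real number" in principle still means "time is an integer" in storage.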