Composite types versus lookup tables

I’ve been fluctuating between holding my data in database-style tables versus composite types. Things like joining in the DB version is very useful, but things tend to stay as floats and strings. Method dispatch and custom types are very useful in the composite type system. At the end I think it falls on if things have a one-to-one relationship or one/many-to-many: If everything is unique and has just one unique thing to be connected to, then a collection of such things would be best described as a composite type with two fields, one for each “thing”. If one thing can be related to multiple things then it seems like a DB table would be best.

Of course, nothing stoops me from mixing the two. It is however hard to know where to cut it, what should not be a composite type and instead be a DB-table entry, and what should be a composite type. I think I’m down the right track with the one-to-one versus one/many-to-many distinction “described” above, but I would love to hear what you have to say if any of this rambling sounds familiar…

Thanks!

Yes, this is a common design problem.

For data analysis, I try to start with a tables format (eg DataFrame) and work there unless there is a good reason to define structs. Good reasons include:

  1. ability to manage different representations with a single verb (ie multiple dispatch)
  2. unit testing of complex operations
  3. optimization/benchmarking

But as you said, you can freely mix the two. I usually define simple functions for the sole purpose of converting from one representation to another, to get the benefit of both worlds.

May I ask why DataFrames? Can you easily join DataFrames? I find JuliaDB api really really convenient. I kinda wish we had their api without necessarily all the other stuff (parallel, online stats, etc)…

No particular reason against JuliaDB, I am just OK with DataFrames.

But how do you do a join though? The reason I’m asking is that it’s mostly the joins that I find ultra useful when working with the table representation. Where an entry from one table relates to multiple entries in another table, so joining those two tables results in a product that moves further down the analysis pipeline.

I’ll mention that one of the things that pushed me away from the table representation was the way the data entered the pipeline. If before the initial state of the data was .csv and auxiliary files, now it’s fed more manually. When it’s entered entry by entry like that, I find it more useful to have types – types that by definition have the correct structure and values.

Thanks btw for this conversation, it’s super useful to know I’m not crazy :slight_smile:

Personally, I typically combine DataFrames.jl with Query.jl which lets you join (alternative).

I have to start using Query…

Also DataFramesMeta has useful capabilities.