I have a working C++ project that reads in several different CSV files, some with header info and some more like tables. Most of the data is record-like, with “stn_id” in column 1 followed by other column data, e.g. GPS coordinates and other field measurements at those locations.
In C++, I read the simple ones into maps (dictionaries) of key-value pairs. A “station” object has up to 20 parameters, so the stations are stored in a multimap (“stn_id” → object pointer). Along the way I use the Standard Template Library to manipulate this multimap with the typical collection-based functions.
To my point: I am trying to rewrite this project in Julia and I’m having trouble deciding between arrays, structs, tables, and data frames. I want to take advantage of vector operations as much as possible and avoid continual for-loop indexing. What’s the go-to approach for combining various data based on a common key and then adding more columns of derived data? This is pretty simple in a spreadsheet, for instance.
One other issue I will have to overcome is duplicate station data. Repeat measurements happen a lot, and a standard dictionary can’t handle duplicate keys, which is why I used multimaps in C++. Is there a Julia solution to this?
I’ve spent a few days using data frames, and I like the way you can reference the columns by their column names. Arrays are not so friendly, especially if you refactor the code and the column order changes.
If you’re talking about analyzing CSV files, I’d most likely be looking at CSV.jl and DataFrames.jl, with DataFramesMeta.jl to handle filtering, joins, and other database-type operations. If you want to do something fairly complicated, it can be useful to load all the data into SQLite tables and manipulate it with SQL to get it into the analyzable form you need.
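As a rough sketch of the join-then-derive workflow with DataFrames.jl: the two small tables below are made up and stand in for what `CSV.read(path, DataFrame)` would return, the column names are hypothetical, and the subtraction is only a placeholder for a real reduction.

```julia
using DataFrames

# Hypothetical station tables; in practice each would come from a CSV file.
coords = DataFrame(stn_id   = ["A1", "A2", "A3"],
                   easting  = [500100.0, 500250.0, 500400.0],
                   northing = [5100000.0, 5100120.0, 5100300.0])
gravity = DataFrame(stn_id      = ["A1", "A2", "A2"],   # A2 was measured twice
                    obs_gravity = [980612.4, 980610.9, 980611.1])

# Join on the shared key; every gravity row picks up its station coordinates.
# Duplicate keys are fine: both A2 readings survive the join.
joined = innerjoin(gravity, coords, on = :stn_id)

# Derived column via broadcasting, no explicit loop.
# 980000.0 is an arbitrary reference value, not a real correction.
joined.anomaly = joined.obs_gravity .- 980000.0
```

Stations with coordinates but no readings (A3 here) drop out of an inner join; `leftjoin(coords, gravity, on = :stn_id)` would keep them with `missing` readings instead.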
I would stick to the Tables.jl interface and would try to find out if any existing implementation fits your needs. Your problem seems to be geospatial? We have done a lot of work in this area, and so I would check the existing ecosystems to see if you can contribute missing features:
You picked up on that - more geophysical, but in the same family for sure. If the geospatial community likes Tables then I’d better take a look. Thanks for the link.
Yes, take a look at the data structures in Meshes.jl which are behind the geospatial ecosystem described in the video. We adopted the view that meshes + tables = geospatial data. This includes geophysical data.
In Bouguer gravity processing, there is a correction called the terrain correction. Around 1995, I wrote a mesh generator based on Knuth’s Delaunay triangulation algorithm. The idea was to assemble the DEM data for 40 km around a survey area, generate a topo mesh, and compute the terrain correction term for each triangle. Sum up all the terms for a complete correction value, and do that for each station position. On my computer back then, I would start the program at the end of the day and it would take all night to run. The mesh would typically have 2-4 million elements, and there might be 1000 stations.
We’ve been doing geometric processing with meshes containing millions of elements and doing geostatistics with thousands of stations as well. If you have expertise with Delaunay triangulation, consider joining the boat, there is a lot to contribute there.
This is all good stuff. I appreciate the feedback. Wandering through the maze of packages that manipulate base data structures is daunting. It’s good to weigh ease of use against performance/overhead from time to time. And improvements change old ideas.
If you are doing “traditional” data cleaning with heterogeneous data, i.e. something you might do in dplyr or Stata, then DataFramesMeta.jl is a great choice.
Unlike dplyr it can also be used to construct more robust, replicable pipelines. It’s easy to take a complicated DataFramesMeta.jl chain and put it into a function that you apply to many different data sets.
One solution to my common keys problem is to take a DataFrame and then use groupby on the key column, with further processing on each group. Don’t have the full picture as yet.
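For what it’s worth, here is a minimal sketch of that groupby idea in plain DataFrames.jl, assuming a made-up table of repeat readings (the `reading` column and the station IDs are hypothetical):

```julia
using DataFrames, Statistics

# Hypothetical repeat readings; station A2 appears three times.
obs = DataFrame(stn_id  = ["A1", "A2", "A2", "A3", "A2"],
                reading = [10.0, 12.0, 14.0, 9.0, 13.0])

# Collapse to one row per station: mean reading and number of repeats.
per_station = combine(groupby(obs, :stn_id),
                      :reading => mean => :mean_reading,
                      nrow => :n_repeats)

# Or keep every row and attach the station mean as a new column.
obs_with_mean = transform(groupby(obs, :stn_id),
                          :reading => mean => :stn_mean)
```

`combine` shrinks the table to one row per key, while `transform` keeps the original rows, which is closer to adding a derived column in a spreadsheet.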
Here is a tutorial for doing that with DataFramesMeta.jl. DataFramesMeta.jl uses all the same functions as in DataFrames, but with a slightly easier syntax.
While trying to rewrite a C++ program, I ran into difficulties with classes. I got so used to writing methods for classes and then calling object.method() with no arguments that I forgot how to do it any other way. I also find that for loops are inevitable when it comes to looping over an array of objects.
I’m trying to rethink the data structures for Julia. It seems that structs are kinda complicated. Sure, making one is easy, but using it in an array of many is not. Is this true? So it boils down to choosing between a list of objects (a vector of structs) and a matrix. A matrix is fine if all the data is Float64, but I deal with rows of data that have IDs.
OK, so DataFrames are still an option, though less fundamental. I guess I’m kvetching over what the Julia idiom would be for row data that typically has string IDs in column 1 and numerical data that gains more columns as processing continues.
I could mention here that I chose a TOML config file as input and it created a dictionary, which I’m fine with. But I started adding more and more fields to it, and passing it to every function in the processing. Does that sound reasonable? Thanks for listening.
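One alternative to passing the raw dictionary everywhere, sketched under assumptions: convert it once into a typed struct right after parsing. The keys, the `Config` type, and the `radius_m` helper below are all hypothetical; `TOML` is the standard-library parser.

```julia
using TOML

# Hypothetical settings; the keys are made up for illustration.
toml_text = """
survey_name = "demo"
search_radius_km = 40.0
density = 2.67
"""
cfg = TOML.parse(toml_text)   # Dict{String, Any}

# Typed container with defaults; a misspelled or unknown key fails here,
# at construction, instead of deep inside the processing.
Base.@kwdef struct Config
    survey_name::String
    search_radius_km::Float64 = 40.0
    density::Float64 = 2.67
end

# Turn the string keys into keyword arguments once, at the boundary.
config = Config(; [Symbol(k) => v for (k, v) in cfg]...)

# Downstream functions take the struct (or only the fields they need).
radius_m(c::Config) = c.search_radius_km * 1000
```

The dictionary is still fine as the input format; the struct just documents which fields exist and what types they have.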
I ran across this comment on a Julia tutorial website:
Note Unlike in Python and some other dynamic languages, dictionaries are rarely the right approach (ie. often referred to as “the devil’s datastructure”).
Sure, I wouldn’t want to generate a huge dictionary with a million entries, and I wouldn’t want to perform linear algebra with them, but come on, what are you supposed to do with mixed data? Again, I come from an OOP world where a class contains data and methods. This is basically a struct with functions. A dictionary is a nice way to combine data and pass it around. Nobody said anything about inverting a matrix of dictionaries.
A vector (list) of objects is a basic thing, and so should a vector of structs be. Maybe a dictionary of vectors is OK too.
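Both ideas can be sketched in base Julia with no packages. The `Station` type and its fields are hypothetical, and a dictionary of vectors is used here as a stand-in for the C++ multimap, not as the one official way to do it:

```julia
# Hypothetical record type; the field names are illustrative only.
struct Station
    id::String
    easting::Float64
    northing::Float64
    gravity::Float64
end

stations = [
    Station("A1", 500100.0, 5100000.0, 980612.4),
    Station("A2", 500250.0, 5100120.0, 980610.9),
    Station("A2", 500250.0, 5100120.0, 980611.1),  # repeat measurement
]

# Comprehensions and broadcasting work directly on a Vector of structs.
ids = [s.id for s in stations]
gravities = getfield.(stations, :gravity)

# Multimap stand-in: each key maps to a vector of all records with that id.
by_id = Dict{String, Vector{Station}}()
for s in stations
    push!(get!(by_id, s.id, Station[]), s)
end

# Functions take the data as an argument instead of object.method().
mean_gravity(recs::Vector{Station}) = sum(r.gravity for r in recs) / length(recs)
```

With this, `mean_gravity(by_id["A2"])` averages the two repeat readings for station A2. Plain loops like the one above are also compiled in Julia, so there is no need to avoid them for speed.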
After more reading and testing code, I’m getting a better picture. Not quite there yet. But I did find a project called JuliaGeo which I think I can contribute to. I could add a few more transformations like Lambert Conformal Conic and a strange one that seems to be used only in New Brunswick - that’s in eastern Canada. I had to convert a lot of DEM data from their topo database to UTM. The project is a little short on ellipsoids too, but then who uses an Airy 1830 ellipsoid anymore, except for some historical surveys in Britain.
I got a lot of good material out of a text called “Map Projections - A Working Manual” by John P. Snyder, US Geological Survey Professional Paper 1395, fourth printing 1997.