When should I choose a struct, mutable struct, Dict, named tuple or DataFrame?

When should I use a struct, mutable struct, Dict, named tuple or DataFrame in Julia? Mutability and performance must be factors in deciding which data type to use. But when should I use each type?


Hello, welcome to the Julia community.

To be upfront: this question is pretty hard to answer comprehensively. If you have a specific application in mind, it would be better to present your problem and ideas and ask for opinions here.
In most cases, there is not a single correct way to do things¹.

If you meant this as a general question, here are some of my considerations (no guarantee of correctness, objectivity, and especially not completeness):

It all depends on what you want to do with it. Tuples are great and efficient if you want to structure a fixed (smallish) number of objects. Bonus points if the types rarely change. NamedTuples are the convenience version, with access by field name. StaticArrays can lift a lot of the benefits of Tuples to the Array space.
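To make the tuple versus named tuple distinction concrete, here is a minimal sketch using only Base Julia. The values and variable names are made up for illustration.

```julia
# A plain tuple: fixed length, and the element types are part of its type.
p = (1.0, 2.0, 3.0)
x, y, z = p                      # destructuring by position

# A named tuple: the same idea, but fields can be accessed by name.
q = (x = 1.0, y = 2.0, z = 3.0)

println(q.y)                     # access by name
println(q[2] == q.y)             # positional access still works
println(p isa NTuple{3, Float64})
```

Both are immutable, so "changing" a field means building a new tuple.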

(Immutable) Structs should be preferred over tuples for abstraction reasons or dispatch. That is, if it makes sense that your collection of a fixed number of things equates to some atomic concept in whatever you plan to do, making a struct makes sense. The other reason for structs is dispatch. So if you need to specialize the behavior of something on exactly your structure, then creating (at least wrapper) types makes sense.
The choice between immutable and mutable structs is more of a gradient than a clear-cut distinction. You can make single fields of an immutable struct mutable via Ref, so if you group a lot of truly immutable things with a single mutable one, you need not make the entire structure mutable. Likewise, if you expect changes to be rare, you can handle mutations of immutable structs as copy-and-change if you adapt your handling accordingly. If changes are frequent, you should use mutable structs or at least mutable fields (mutable objects in an immutable struct are still mutable). You can even abstract away your decision entirely with Setfield.jl or its successors.
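A rough sketch of the two middle-ground options mentioned here, a single Ref field and copy-and-change. The type names `Sensor` and `Pose` and the helper `with_azimuth` are hypothetical and only serve as illustration.

```julia
# Immutable struct with one mutable slot, provided by a Ref.
struct Sensor
    name::String
    gain::Base.RefValue{Float64}   # concrete type of Ref(1.0)
end

s = Sensor("mic-1", Ref(1.0))
s.gain[] = 2.5                     # allowed: mutates the Ref's contents, not the struct

# Copy-and-change for a fully immutable struct.
struct Pose
    azimuth::Float64
    elevation::Float64
end

# Hypothetical helper: build a new Pose with one field replaced.
with_azimuth(p::Pose, az) = Pose(az, p.elevation)

p1 = Pose(10.0, 5.0)
p2 = with_azimuth(p1, 20.0)        # p1 itself is unchanged
```

Writing helpers like `with_azimuth` by hand gets tedious for many fields, which is the gap packages like Setfield.jl aim to fill.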

Whether you need a dictionary is, uhm, well, something you will usually know yourself. Dictionaries are mappings, that’s it. Performance depends on the implementation. Normal dictionaries have the usual performance implications of hashmaps, i.e. the need to hash and compare things. If hashing is expensive (or you typically store few things), you can use hashing-free implementations like LittleDict (from OrderedCollections.jl), but determining the performance cutoff in your specific application is usually only possible empirically.
Care must be taken if you want to use mutable things as keys. Other languages disallow this entirely. Julia lets you use whatever you fancy, but mutating a key after insertion will break normal dictionaries. There are ways around that if you care.
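A small sketch of the mutable-key pitfall, using a Vector as the key. This assumes the standard `Dict` from Base; the values are arbitrary.

```julia
key = [1, 2]
d = Dict(key => "first")
println(haskey(d, key))      # true: the key was hashed when it was inserted

push!(key, 3)                # mutate the key in place, so its hash changes
println(haskey(d, [1, 2]))   # false: the stored key no longer equals [1, 2]
println(haskey(d, key))      # usually false too: the entry sits in the slot for the old hash

# An immutable key (here a tuple) cannot run into this problem.
d2 = Dict((1, 2) => "first")
println(d2[(1, 2)])
```

If the keys are conceptually fixed, tuples or immutable structs are the simpler choice.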

I’ll keep DataFrames short because this is already long enough. DataFrames are for tabular data, and they are pretty mature and efficient for that purpose. Be aware that tabular here means columns are first class and rows are not (rows are not named). There’s probably a better structure for cases with named rows and columns.

[¹]: This is true for many languages, but many of them have strong support for one approach or the other, while in Julia, often enough, different approaches work equally well.

Take away: It depends. If you want to do anything in particular, just ask :slight_smile:

Edit 2: I didn’t even get to mention Arrays, gaah, the arrays. There are literally dozens of different array-like things with different foci: StaticArrays, ComponentArrays, ElasticArrays, regular Arrays. It really depends on your use case. In general, Arrays provide a convenient abstraction for discernible dimensions in data and a way to allow a varying number of elements (plain arrays can only grow if they are one-dimensional, but other implementations can grow even in higher dimensions, e.g. ElasticArrays).
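As a quick sketch of the growth point for plain arrays, using only Base functions:

```julia
# A one-dimensional Array (Vector) can grow in place.
v = Float64[]
push!(v, 1.0)
append!(v, [2.0, 3.0])

# A plain Matrix cannot grow in place; hcat allocates a new, larger matrix.
m = zeros(2, 2)
m2 = hcat(m, ones(2))

println(length(v))
println(size(m), " -> ", size(m2))
```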


@FPGro nice response
@davidwallis echoing above – best to ask about which to use for a specific need/task/algorithm

Half the question is more a Data Structures 101 question than a Julia question. The reasons why you should use a Dict vs DataFrame vs struct (or a Vector of structs) in Julia are not so different from the reasons you would make the same decision in other languages.

Is this a homework question, or do you have a use case?


Thanks for the answers. So, perhaps I can give examples from the thing I’m working on at the moment, which is a Dash web app for processing microphone array data. Certainly not homework (that was a few decades ago for me). But my background is physics, not computer science. I’ve been writing mathematical code for decades (Matlab, Python, Yorick, IDL, Lisp and others), but not really much experience with a full-blown application. And I agree that the question could perhaps apply to other languages.

The first example is that I have a function that does the main processing, and this needs to package the results in a structured data type. There are quite a lot of fields, with input conditions, arrays with intermediate data, and results. I originally defined a struct and called the constructor in the last line of the function. But this isn’t very readable with so many fields and a long list of parameters. I now return a named tuple. The object is of course immutable because it’s just the results of the processing.

Another example is that I will have to write a data structure to hold the parameters of a microphone array. This will have the (x,y,z) position of the reference microphone, arrays of (x,y,z) positions of the rest of the microphones in an individual array relative to the reference, and the direction (azimuth, elevation) in which the array is pointing. I will have an array of these types, because the entire system has multiple microphone arrays. So, for the elements of the array would it be best to use a struct, named tuple, something else? Or doesn’t it really matter?

The next example is something to hold the application data for the web app. So it is essentially the data structure for the backend processor, with a bunch of functions that do the various levels of processing and update the application data object. This will be all the inputs and outputs to the calculations. Guessing this should be mutable because every time the user clicks something on the GUI it changes the application data. I have this as a Dict() at the moment, defined in the backend module.

The final example is perhaps the most important in terms of speed. I imagine I will often refer to a point in space (x,y,z). This might be used in the inner loops of the processing so it could impact performance. I could have a simple 3-element array and decide that x = a[1] etc. But I lose the ability to refer to each axis by name (x, y or z). Similarly for a tuple. It could also be a struct, a named tuple, a Dict() (probably a terrible idea) or something else. I also might need the coordinates in different types (integers, floats or similar).

I don’t think the choice really matters too much in these examples. But in writing this code it made me think about what data structures I should be using, whether or not I am making the right choices, and what would a computer scientist do? This might end up as quite a big application so I want to make the right choices at the start. And more generally, I wondered if it would be possible to make a sort of dichotomous key that someone could follow in order to make the best choice. Perhaps there is a text book with such a key?

Thanks for any input. Sorry if it’s a bit basic for this forum.


Perhaps another point is this. When I write a function that returns a big package of results, I could use a struct or named tuple. A struct is rigidly defined with a struct declaration, and any functions that accept or return these types must conform to the definition. A named tuple can be constructed in a more ad-hoc way, and the definition need not be predefined. Functions that use the named tuples can use a kind of duck typing. E.g. I might have functions that need an object with sample rate and frequency. Any named tuple with these properties will do. But using structs makes the code more rigid and C/Fortran like. On the other hand, dispatch presumably isn’t possible with the named tuples, so this gives structs an advantage.

What are the tradeoffs with regard to performance, style, maintainability?

Regarding your last question, you can do multiple dispatch on named tuples, it’s just cumbersome to write the type (it is rather verbose). Also, it is too easy to write type-unstable code with named tuples, because of their flexibility.

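baz(x) = x / 2  # hypothetical stand-in for some other function that returns a float

function foo(x, y)
    if y < 0
        return (a = 1,)    # note the trailing comma: (a = 1) would just be the Int 1
    else
        a = baz(x)         # returns a float, for example 0.5
        return (a = a,)    # here the field a is a Float64, above it is an Int
    end
end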

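In this example, a type instability was introduced unintentionally: the field a is an Int in one branch and a float in the other, so the two branches return different named tuple types. With a struct whose field has a concrete declared type, the constructor converts the value to that type, and the result is type stable.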
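For comparison, a minimal sketch of the struct variant. The names `Result` and `foo_struct` are hypothetical, and `baz` is again only a stand-in for some function returning a float.

```julia
struct Result
    a::Float64
end

baz(x) = x / 2   # hypothetical stand-in, returns a Float64 for Int input

function foo_struct(x, y)
    if y < 0
        return Result(1)        # the Int 1 is converted to 1.0 by the constructor
    else
        return Result(baz(x))
    end
end

println(foo_struct(1, -1))
println(foo_struct(1, 1))
```

Both branches now return a `Result`, so the return type no longer depends on the value of `y`. In the REPL, `@code_warntype foo_struct(1, 1)` is one way to check this.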

In my experience, named tuples are nice at the prototyping stage, but closer to the final stage of application development I prefer to switch to structs, to avoid various issues.
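To illustrate both the duck typing asked about earlier and the verbosity of dispatching on a named tuple type, here is a small sketch. The function names `period_samples` and `describe`, the alias `RateFreq`, and the field values are all assumptions made up for this example.

```julia
# Duck typing: works for anything that has these two properties.
period_samples(sig) = sig.sample_rate / sig.frequency

nt = (sample_rate = 48_000.0, frequency = 1_000.0, label = "tone")
println(period_samples(nt))

# Dispatch on one specific named tuple type is possible, but verbose.
const RateFreq = NamedTuple{(:sample_rate, :frequency), Tuple{Float64, Float64}}

describe(::RateFreq) = "exactly sample_rate and frequency, both Float64"
describe(::NamedTuple) = "some other named tuple"

println(describe((sample_rate = 48_000.0, frequency = 1_000.0)))
println(describe(nt))
```

Note that the specialized method only matches that exact set of names and field types, which is part of why a struct tends to be more comfortable once dispatch matters.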

Generally, you should use whatever suits you better. Too much thinking in advance may lead to overengineering and development paralysis. Fail faster: write code that works, identify bottlenecks and optimize them away, and repeat until you are satisfied. If this is your first big application, just embrace the idea that the first implementation will be wrong and postpone optimizations to a later time.


Check the Parameters package, which will make that much more readable.


Oh I like that. Thank you.

Thanks. I’ll just press on and get it running.


There is NamedArrays.jl (davidavdav/NamedArrays.jl on GitHub), a Julia type that implements a drop-in replacement of Array with named dimensions, but my feeling is that defining an (immutable) struct would be optimal in this case. The data are not going to change for any given point in space, nor is the dimensionality of the space.


Although this is hard, I can only emphasize

I’ve been there and I’ve messed up my first non-small project sufficiently. It’s definitely an experience to learn from. And it’s incredibly hard to foresee what will work out eventually, and what’s a good and what’s a bad optimization, until you’re knee-deep in the stuff. So delay most of that thought until you’re there.

A lot of those details can often be circumvented by thinking less about what you use and more about how you use it. Expressing your high-level operations in terms of low-level operations that you can specialize on a particular data structure may go a long way. Many things can also be written generically, so that they just work for most inputs, irrespective of the concrete type. Just be sure not to put unnecessary type restrictions everywhere (again, I made that mistake myself). Example: if you need to iterate over something, you usually don’t need to know the particular structure if you just keep to the usual iteration functions. Use eachindex instead of 1:... and so on. I guarantee that you’ll get a feeling for this eventually ^^
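A minimal sketch of what such generic code can look like. The function name `sum_squares` is made up; the point is only that there is no type annotation on the argument and that `eachindex` is used rather than `1:length(xs)`.

```julia
# Works for any indexable collection of numbers: Vector, range, tuple, view, ...
function sum_squares(xs)
    total = zero(eltype(xs))
    for i in eachindex(xs)       # valid indices, whatever the container uses
        total += xs[i]^2
    end
    return total
end

println(sum_squares([1, 2, 3]))
println(sum_squares(1:3))
println(sum_squares((1.0, 2.0)))
println(sum_squares(view([1, 2, 3, 4], 2:3)))
```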

On your case: a struct for the position looks fine. You could use a StaticArray, that’s basically what those are for, but that doesn’t come with the named dimensions.

A NamedTuple also looks fine for returning many things in a way that is “self-explanatory” on the other side. But returning a whole lot of things from a huge function may be a sign that you should split it into smaller pieces.
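For the position case, a plain parametric struct could look roughly like the sketch below. The name `Position`, the promoting constructor and the helper `distance2` are assumptions for illustration, not something proposed in the thread.

```julia
# One struct definition covers integer and floating-point coordinates.
struct Position{T<:Real}
    x::T
    y::T
    z::T
end

# Added convenience (an assumption): promote mixed inputs to a common type.
Position(x::Real, y::Real, z::Real) = Position(promote(x, y, z)...)

# Squared distance, written against field names rather than indices.
distance2(p::Position, q::Position) =
    (p.x - q.x)^2 + (p.y - q.y)^2 + (p.z - q.z)^2

p = Position(0, 0, 0)        # Position{Int64}
q = Position(1, 2.0, 2)      # mixed inputs are promoted to Float64
println(distance2(p, q))
println(isbitstype(Position{Float64}))
```

Because the struct is immutable and its fields are plain numbers, a `Vector{Position{Float64}}` can store the points inline, which tends to matter in inner loops.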

Good luck with your application.


With FieldVector from StaticArrays you get both:

using StaticArrays

struct Point{T} <: FieldVector{3,T}
    x::T
    y::T
    z::T
end

And all the arithmetic works for Point.
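A short usage sketch of this, assuming the StaticArrays package is installed. The definition is repeated so the snippet runs on its own, and the coordinate values are arbitrary.

```julia
using StaticArrays
using LinearAlgebra   # for norm

struct Point{T} <: FieldVector{3,T}
    x::T
    y::T
    z::T
end

a = Point(1.0, 2.0, 3.0)
b = Point(4.0, 6.0, 3.0)

d = b - a            # element-wise difference
println(d.x)         # access by name
println(d[2])        # access by index
println(norm(d))     # generic LinearAlgebra functions work as well
```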


I’m also new and had to go through similar problems for my code. What I ended up using (after lots of failing fast =]) for coordinates was a Vector of StaticArrays. This improved performance dramatically for me. I put that in a struct, so to get the y coordinate of point 7 I use body.points[7][2]. A struct for each point can be a bit nicer, but most of my operations happen on all 3 coordinates at once (say building a vector body.points[8]-body.points[7]).
For parameters I use:

Base.@kwdef mutable struct Solver
    Δt :: Float64 = 0.3
    ...
end

Which allows for defaults and kwargs.
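A small sketch of how such a definition is used. The type name `SolverSketch` and the field `max_iter` are hypothetical and not taken from the post above.

```julia
Base.@kwdef mutable struct SolverSketch
    Δt::Float64 = 0.3
    max_iter::Int = 100        # hypothetical extra field
end

s1 = SolverSketch()            # all defaults
s2 = SolverSketch(Δt = 0.1)    # override a single field by keyword
s1.max_iter = 500              # mutable, so fields can be updated later

println(s1)
println(s2)
```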
Hope this helps!
PS: don’t worry about a problem being “a bit basic for this forum”. People here are really helpful.


Thank you all for your suggestions and help. Great to be part of the Julia community.


That probably should be
struct Point{T} <: FieldVector{3,T}


Yes, that. Fixed.