NamedTuple with 10_000 elements?

bbejanov · November 23, 2021, 9:30pm

Hi everyone,

I have basically a matrix with named columns. Currently my implementation is similar to this:

struct SimData{T, C <: AbstractMatrix{T}} <: AbstractMatrix{T}
    columns::NamedTuple
    values::C

    # inner constructor enforces constraints.
    function SimData(names::NTuple{N,Symbol}, values::AbstractMatrix) where {N}
        if N != size(values, 2)
            throw(ArgumentError("Number of names and columns don't match: $N ≠ $(size(values, 2))."))
        end
        columns = NamedTuple{names}([view(values, :, i) for i in 1:N ])
        new{eltype(values),typeof(values)}(columns, values)
    end
end

julia> sd = SimData((:a,:b), ones(3,2));

julia> sd.columns.a .= 3:5;

julia> sd.values
3×2 Matrix{Float64}:
 3.0  1.0
 4.0  1.0
 5.0  1.0

I chose a NamedTuple because it’s light-weight and fast. It has served me well, but now I have a new requirement and I’m trying to decide whether to keep the NamedTuple as the type of columns or switch it to Dict or something else.

The new requirement is that I sometimes have a large number of columns (of the order of 10_000 columns). The compilation time of a NamedTuple with N=10_000 is a concern here. The alternative I can think of is a Dict, but that’s a heavier data structure and the look up times would be longer (I do that a lot) and I lose the convenient dot-notation to access the columns by name. Also, I iterate over the columns by name and I need them in the given order (which NamedTuple preserves naturally), so I would need an ordered Dict, or some other mechanism to iterate in the correct order, which adds even more complexity.

My question is: Do you think I’m crazy to even consider a NamedTuple with 10_000 elements? Or is it okay to do that? If not okay, could you think of something leaner and faster than OrderedDict that would do the job here?

Thanks a lot in advance for your help!
Boyan

goerch · November 23, 2021, 9:41pm

Hm. Named tuple with 10000 columns sounds like it is sparse. Are you trying to reinvent MUMPS here?

bbejanov · November 23, 2021, 9:56pm

Sorry, I’m not familiar with MUMPS.

I don’t have any missing columns, so no, it’s not sparse.

pdeffebach · November 23, 2021, 9:58pm

You can use an OrderedDict from OrderedCollections.jl.

goerch · November 23, 2021, 9:59pm

Sorry, I’m not meaning to be offensive, but this is a quite unorthodox requirement. Is your data model normalized?

bbejanov · November 23, 2021, 10:10pm

This is a data structure that I use for simulations of a structural model and the columns correspond to model variables. That’s why they are named and the names are Symbols. But now I have a model that is automatically generated (not written by hand), so that’s how I end up with 10000 variables. I hope this helps to clarify my question a bit.

Thanks for trying to help me!

goerch · November 23, 2021, 10:15pm

Sorry again for being insistent: are these 10000 variables(without missing values) independent of each other? Cool problem, then.

bbejanov · November 23, 2021, 10:21pm

Not at all, I’m happy to clarify as much as I can.

The variables are related to each other by the equations of the model. Some are external input, the rest are computed to satisfy the equations for the given external data.

pdeffebach · November 23, 2021, 10:21pm

is this for an economics model? Modern macro models can definitely have this many parameters.

bbejanov · November 23, 2021, 10:22pm

Yes, exactly!

lmiq · November 23, 2021, 10:22pm

If you want to access the values by name (symbol), probably a dict will be faster.

If you just want to associate the name with the values, maybe a vector of named tuples is better?

i. e.

julia> pars = [ (:a=>1), (:b=>2) ]
2-element Vector{Pair{Symbol, Int64}}:
 :a => 1
 :b => 2

pdeffebach · November 23, 2021, 10:24pm

There are also many packages for named arrays in Julia. Though I always forget which one is best for which instance. There are threads about this you can search for more information.

goerch · November 23, 2021, 10:25pm

Interesting. I believe the best way to proceed would be to confront us with some artificial benchmarks then?

pdeffebach · November 23, 2021, 10:26pm

A 10,000 element NamedTuple is definitely not the way to go though. The compiler will give up and you will lose lots of performance.

lmiq · November 23, 2021, 10:29pm

example:

julia> function create_dict(n)
           d = Dict{Int,Int}()
           for i in 1:n
               d[i] = i
           end
           d
       end
create_dict (generic function with 1 method)

julia> @btime create_dict(10000);
  177.355 μs (24 allocations: 364.61 KiB)

julia> create_tuple(n) = ntuple(i -> i => i, n)
create_tuple (generic function with 1 method)

julia> @btime create_tuple(10000);
  284.041 μs (10004 allocations: 703.39 KiB)

and accessing the values:

julia> @btime ntup[:15]
  51.812 μs (1 allocation: 32 bytes)
15 => 15

julia> @btime d[:15]
  16.737 ns (0 allocations: 0 bytes)
15

bbejanov · November 23, 2021, 10:34pm

It looks like an OrderedDict may be the way to go.

I’ll run some benchmarks - both of the instantiation time (which in the case of NamedTuple will require compilation), but also of access time once the instance is created. I’ll get back with these sometime tomorrow.

Thanks everyone, much appreciated!

bbejanov · November 24, 2021, 4:42pm

Benchmark results (code and output are at the end).

As I expected the creation times using OrderedDict is much faster for large n than using NamedTuple.

I am surprised to see that random access time is the same with OrderedDict and NamedTuple.

I am very surprised to see the sequential access times with NamedTuple. Something happens at 1000 columns - it looks very strange! I don’t know what to make of it. I ran the benchmarks several times and I get the same thing every time.

Anyhow, my concern that using a Dict to solve my large n case would make my small n case slower has been resolved – benchmarks show that OrderedDict is faster in all cases.

Code:


struct SDataNT{T <: Number, C <: AbstractMatrix{T}} <: AbstractMatrix{T}
    columns::NamedTuple
    values::C

    function SDataNT(names::NTuple{N, Symbol}, values::AbstractMatrix) where {N}
        if N != size(values, 2)
            error("Number of columns mismatch.")
        end
        columns = NamedTuple{names}([view(values, :, i) for i in 1:N ])
        new{eltype(values),typeof(values)}(columns, values)
    end
end

Base.propertynames(sd::SDataNT) = keys(getfield(sd, :columns))
Base.getproperty(sd::SDataNT, n::Symbol) = n in fieldnames(SDataNT) ? getfield(sd, n) : getfield(getfield(sd, :columns), n)

using OrderedCollections

struct SDataOD{T <: Number, C <: AbstractMatrix{T}} <: AbstractMatrix{T}
    columns::OrderedDict
    values::C

    function SDataOD(names::NTuple{N, Symbol}, values::AbstractMatrix) where {N}
        if N != size(values, 2)
            error("Number of columns mismatch.")
        end
        columns = OrderedDict(names[i] => view(values, :, i) for i = 1:N)
        new{eltype(values),typeof(values)}(columns, values)
    end
end

Base.propertynames(sd::SDataOD) = keys(getfield(sd, :columns))
Base.getproperty(sd::SDataOD, n::Symbol) = n in fieldnames(SDataOD) ? getfield(sd, n) : getindex(getfield(sd, :columns), n)

function test_times(SD::Type, sym="a")
    sdd = Dict()
    println("Times for ", SD)
    
    nreps = 100_000 # access time replications
    rnum = rand(nreps)   # random numbers used in random access test
    sizes = (10,100,200,500,1_000,2_000,5_000,10_000)

    println()
    println("Instantiation times:")
    for ncols in sizes
        print("  ncols = ", ncols, ":\t")
        @time sdd[ncols] = SD(Tuple(Symbol(sym, i) for i=1:ncols), rand(10, ncols))
    end

    println()
    println("Random access times:")
    for ncols in sizes
        print("  ncols = ", ncols, ":\t")
        sd = sdd[ncols]
        @time for i = 1:nreps
            nc = Symbol(sym, ceil(Int, rnum[i]*ncols))
            vec = sd.:($nc)
            vec[1] = i
        end
    end

    println()
    println("Sequential access times:")
    for ncols in sizes
        print("  ncols = ", ncols, ":\t")
        sd = sdd[ncols]
        @time begin
            i = 1
            while i <= nreps
                for (col, vec) in pairs(sd.columns)
                    vec[1] = i
                    (i += 1) > nreps && break
                end
            end
        end
    end

    return
end

println("="^80)
test_times(SDataOD)
println("="^80)
test_times(SDataNT)

Results:

================================================================================
Times for SDataOD

Instantiation times:
  ncols = 10:	  0.319266 seconds (611.11 k allocations: 35.257 MiB, 99.95% compilation time)
  ncols = 100:	  0.223908 seconds (303.18 k allocations: 17.603 MiB, 99.86% compilation time)
  ncols = 200:	  0.402427 seconds (304.43 k allocations: 17.700 MiB, 99.90% compilation time)
  ncols = 500:	  1.181648 seconds (306.23 k allocations: 17.847 MiB, 0.83% gc time, 99.91% compilation time)
  ncols = 1000:	  2.232119 seconds (310.73 k allocations: 18.210 MiB, 99.88% compilation time)
  ncols = 2000:	  3.622623 seconds (319.72 k allocations: 18.822 MiB, 99.93% compilation time)
  ncols = 5000:	  0.176305 seconds (346.73 k allocations: 21.042 MiB, 10.21% gc time, 96.63% compilation time)
  ncols = 10000:	  0.115145 seconds (391.73 k allocations: 24.349 MiB, 93.01% compilation time)

Random access times:
  ncols = 10:	  0.368723 seconds (959.44 k allocations: 43.206 MiB, 15.93% compilation time)
  ncols = 100:	  0.336634 seconds (899.49 k allocations: 39.665 MiB, 3.52% gc time)
  ncols = 200:	  0.335097 seconds (899.49 k allocations: 39.665 MiB, 1.67% gc time)
  ncols = 500:	  0.376298 seconds (899.49 k allocations: 39.665 MiB, 6.88% gc time)
  ncols = 1000:	  0.420171 seconds (948.59 k allocations: 40.414 MiB, 15.37% gc time)
  ncols = 2000:	  0.392851 seconds (974.25 k allocations: 40.806 MiB, 12.04% gc time)
  ncols = 5000:	  0.345910 seconds (989.49 k allocations: 41.038 MiB, 0.45% gc time)
  ncols = 10000:	  0.352303 seconds (994.58 k allocations: 41.116 MiB, 0.44% gc time)

Sequential access times:
  ncols = 10:	  0.108961 seconds (941.80 k allocations: 46.246 MiB, 2.60% gc time, 37.38% compilation time)
  ncols = 100:	  0.046738 seconds (902.49 k allocations: 44.365 MiB, 5.85% gc time)
  ncols = 200:	  0.043418 seconds (900.99 k allocations: 44.304 MiB, 6.37% gc time)
  ncols = 500:	  0.039709 seconds (900.09 k allocations: 44.267 MiB, 4.00% gc time)
  ncols = 1000:	  0.039261 seconds (948.79 k allocations: 45.003 MiB, 3.93% gc time)
  ncols = 2000:	  0.038185 seconds (974.14 k allocations: 45.386 MiB, 3.95% gc time)
  ncols = 5000:	  0.038183 seconds (989.35 k allocations: 45.615 MiB, 3.97% gc time)
  ncols = 10000:	  0.038023 seconds (994.42 k allocations: 45.692 MiB, 4.00% gc time)
================================================================================
Times for SDataNT

Instantiation times:
  ncols = 10:	  0.108898 seconds (204.78 k allocations: 12.240 MiB, 99.32% compilation time)
  ncols = 100:	  0.458683 seconds (132.13 k allocations: 8.028 MiB, 99.81% compilation time)
  ncols = 200:	  1.183346 seconds (143.31 k allocations: 8.536 MiB, 99.80% compilation time)
  ncols = 500:	  4.157052 seconds (175.42 k allocations: 9.921 MiB, 0.55% gc time, 99.95% compilation time)
  ncols = 1000:	 10.889727 seconds (237.81 k allocations: 12.411 MiB, 99.97% compilation time)
  ncols = 2000:	 12.057584 seconds (381.40 k allocations: 17.658 MiB, 99.96% compilation time)
  ncols = 5000:	 37.980823 seconds (813.49 k allocations: 33.828 MiB, 0.06% gc time, 99.97% compilation time)
  ncols = 10000:	110.425426 seconds (1.53 M allocations: 60.316 MiB, 0.02% gc time, 99.99% compilation time)

Random access times:
  ncols = 10:	  0.392019 seconds (908.97 k allocations: 40.297 MiB, 6.26% gc time, 17.29% compilation time)
  ncols = 100:	  0.314703 seconds (899.49 k allocations: 39.665 MiB, 1.19% gc time)
  ncols = 200:	  0.383834 seconds (899.49 k allocations: 39.665 MiB, 11.32% gc time)
  ncols = 500:	  0.528141 seconds (899.49 k allocations: 39.665 MiB, 23.48% gc time)
  ncols = 1000:	  0.415638 seconds (948.54 k allocations: 40.413 MiB, 12.81% gc time)
  ncols = 2000:	  0.385840 seconds (974.15 k allocations: 40.804 MiB, 0.48% gc time)
  ncols = 5000:	  0.479783 seconds (989.47 k allocations: 41.038 MiB, 0.47% gc time)
  ncols = 10000:	  0.630030 seconds (994.47 k allocations: 41.114 MiB, 0.29% gc time)

Sequential access times:
  ncols = 10:	  0.169690 seconds (1.16 M allocations: 98.965 MiB, 6.02% gc time, 35.88% compilation time)
  ncols = 100:	  0.795865 seconds (1.13 M allocations: 445.769 MiB, 3.07% gc time, 76.08% compilation time)
  ncols = 200:	  1.712942 seconds (1.13 M allocations: 824.477 MiB, 1.96% gc time, 85.08% compilation time)
  ncols = 500:	  4.712930 seconds (1.13 M allocations: 1.926 GiB, 1.20% gc time, 89.49% compilation time)
  ncols = 1000:	625.152897 seconds (1.18 M allocations: 3.786 GiB, 0.02% gc time, 99.85% compilation time)
  ncols = 2000:	  0.147846 seconds (1.12 M allocations: 52.995 MiB, 2.84% gc time, 22.34% compilation time)
  ncols = 5000:	  0.299168 seconds (1.16 M allocations: 53.862 MiB, 1.51% gc time, 33.90% compilation time)
  ncols = 10000:	  0.420844 seconds (1.21 M allocations: 54.137 MiB, 1.18% gc time, 18.02% compilation time)

Topic		Replies	Views
Using large NTuples makes Julia hang General Usage hdf5 , memory-allocation , data_structures , ntuple	23	1598	February 10, 2023
I want to replicate with Named Tuples what I do with Dictionaries but I can't General Usage dictionary , namedtuple	9	2109	November 9, 2021
Why is the NamedTuple slower? When/How would it be faster? Is it still allocated on the stack? Performance	6	250	February 14, 2025
Looping through NamedTuple is slow General Usage	7	249	July 8, 2024
Performance of NamedArrays vs Dictionaries of Tuples General Usage	2	1853	August 28, 2020

NamedTuple with 10_000 elements?

Related topics