Julia equivalent of C++ map

I’m partway through translating one of my C++ modules to Julia. The part I haven’t done yet uses a map<int,double> and has to iterate it from smallest to largest index. A Dict wouldn’t work, because the elements are stored in hash order. Here’s an object of this class after a test run:

$2 = {bucket = std::map with 26 elements = {[-77] = 0,  
   [-76] = 7.0678450453239403e-24, [-75] = 1.4146047593286919e-23,  
   [-74] = 2.8333572431958879e-23, [-73] = 5.6833432416834089e-23, [-72] = 0,  
   [-71] = 2.2956685041615732e-22, [-70] = 4.7003608677487724e-22,  
   [-69] = 9.8530744367994676e-22, [-67] = 4.6190213642120732e-21,  
   [-65] = 1.8340575803522825e-20, [-64] = 3.6806385893800465e-20,  
   [-63] = 7.3666474004393324e-20, [-62] = 1.4735579758999915e-19,  
   [-53] = 7.5450657282317029e-17, [-45] = 1.9315443981147606e-14,  
   [-38] = 2.4724470271325055e-12, [-30] = 6.3292450599891027e-10,  
   [-22] = 1.6202930869341321e-07, [-15] = 2.0740340371071444e-05,  
   [-7] = 0.0053093431485301268, [4] = 10.872598936094619,  
   [5] = 22.998593185090815, [9] = 348.27528152227217,  
   [10] = 727.42713809693339, [12] = 2986.9210783585077},  
 static cnt = 2321088}

bucket[-3] does not exist, but if the program attempts to set bucket[-3] to something, it will be inserted. How do I create such an object?

OrderedDict from DataStructures.jl

2 Likes

DefaultDict in DataStructures.jl

julia> using DataStructures

julia> b = DefaultDict{Int, Float64}(2.0)
DefaultDict{Int64,Float64,Float64} with 0 entries

julia> b[3]
2.0

julia> b
DefaultDict{Int64,Float64,Float64} with 1 entry:
  3 => 2.0

Nice - is it not more ideomatic to use get!(mydict, -3, 2.) though?

1 Like

You want to use the do syntax if there is a setup cost for the default value.

get!(mydict, -3) do
    Int[]
end

If there is a default value, I would just use a DefaultDict and keep the nice getindex syntax:

push!(mydict[2], 3)

vs

push!(get!(() -> Int[], mydict, 2), 3)
1 Like

OrderedDict / DefaultOrderedDict are not correct since they refer to insertion order. What you want is SortedDict:

julia> d = SortedDict{Int64,Float64}();

julia> d[8] = 34;

julia> d[-3] = 12;

julia> d
SortedDict{Int64,Float64,Base.Order.ForwardOrdering} with 2 entries:
  -3 => 12.0
  8 => 34.0

As for “if the program attempts to set bucket[-3] to something, it will be inserted”, this is done by simply setting d[-3] = something (as in the example above).

Be aware that I’ve had mixed experience with DataStructures.jl when it comes to performance. In particular, for SortedDict you might be better off using a regular Dict (or Vector of tuples if you don’t need the dictionary properties) and sort the objects when you need them in order.

2 Likes

How does the performance of SortedDict compare to that of C++'s map?

My comment regarding SortedDict was more general: I don’t know how it compares to C++, but regardless of language it has different access complexities than a regular dictionary since it requires the elements to be sorted at all times. I don’t know about OP’s case, but most applications I’ve seen don’t have this requirement – rather they do batches of insertions / queries, followed by ordered iteration (often at the very end). If that is the case, it is often more efficient to use a different data structure, such as an unsorted dictionary or simply a vector. Sample timings below.

Case 1: Duplicates. Use a dictionary.

julia> k = rand(1:1000, 100_000); v = rand(100_000);

julia> @btime (a = SortedDict{Int64,Float64}(); for n = 1:100_000 a[$k[n]] = get(a, $k[n], 0) + $v[n]; end);
  9.327 ms (30 allocations: 145.81 KiB)

julia> @btime (b = Dict{Int64,Float64}(); for n = 1:100_000 b[$k[n]] = get(b, $k[n], 0) + $v[n]; end; b = sort(b));
  1.695 ms (52 allocations: 187.34 KiB)

Case 2: No duplicates. Use a vector.

julia> k = rand(Int64, 100_000); v = rand(100_000);

julia> @btime (a = SortedDict{Int64,Float64}(); for n = 1:100_000 a[$k[n]] = $v[n]; end);
  23.487 ms (51 allocations: 12.03 MiB)

julia> @btime (c = Vector{Tuple{Int64,Float64}}(undef, 100_000); for n = 1:100_000 c[n] = ($k[n],$v[n]) end; sort!(c));
  7.490 ms (4 allocations: 2.29 MiB)

# Alternatively:
julia> @btime c = sort!(collect(zip(k, v)));
  7.517 ms (5 allocations: 2.29 MiB)
1 Like

I’m not that concerned about the efficiency of SortedDict. The number of items in the SortedDict is normally on the same order of magnitude as the number of bits in whatever float type is being added. The use case for this class is that the number of floats being added is not known in advance, or is known to be more than fits in a comfortable amount of memory, and each takes more time to generate than the SortedDict overhead. (The pairwiseSum! function is for arrays that are known to fit in memory.)

I’m having trouble defining the constructor:

module ManySum
using DataStructures
export pairwiseSum!
export Manysum

function pairwiseSum!(a::Array{T})::T where T<:AbstractFloat
  i=1
  n=length(a)
  while i<n
    for j=1:2*i:n-i
      a[j]+=a[j+i]
    end
    i*=2
  end
  if n>0
    a[1]
  else
    0
  end
end

struct Manysum{T <: AbstractFloat}
  bucket ::SortedDict{Int,T}
  function Manysum()::T
    msum=new{T}()
    bucket=SortedDict{Int,T}()
    msum
  end
end

end # module

This loads, but when I try “ms=Manysum()”, I get “T not defined”. I’ve also tried it with “function Manysum()” without “::T”. What’s the right syntax?

Two reflections on what you wrote above:

  1. Julia already uses pairwise summation by default, so you might not have to implement it yourself. There’s also KahanSummation.jl if you need more precision.

  2. If you know the range of the keys beforehand, it might be easier to just use an array instead of a dictionary/map. E.g. if all keys are between -128 and 128, create an array with 257 entries, and use the key as index into this array.

As for Manysum, try this syntax:

struct Manysum{T <: AbstractFloat}
  bucket::SortedDict{Int,T}

  Manysum{T}() where T = new{T}(SortedDict{Int,T}())
end

Thanks! That worked. I’ll add the rest of the functions.

sum_kbn adds up an array; that’s not what Manysum is for. Since Manysum is intended to be a library function, it can’t know the range in advance, except that it’s within the range of the float type being used.

Given a float type, how do I find the maximum possible exponent? In C++, this is DBL_MAX_EXP for the double type.

I meant that the built-in sum already does what pairwiseSum! does, but in a more efficient way, and without destroying the array:

julia> v = rand(100_000_000);

julia> sum(v)
5.000039762225862e7

julia> pairwiseSum!(v)
5.000039762225862e7

julia> @btime sum($v);
  44.303 ms (0 allocations: 0 bytes)

julia> @btime pairwiseSum!($v);
  442.701 ms (0 allocations: 0 bytes)

The use case for this class is that the number of floats being added is not known in advance, or is known to be more than fits in a comfortable amount of memory // sum_kbn adds up an array; that’s not what Manysum is for

To be honest, I am not convinced about these cases. First, if you have so many floats that they barely fit
in memory, performance will likely be an important factor, in which case SortedDict is not a good choice.
If you know the range of exponents (which it sounds like you can figure out, if not at compile time then at least at run time), using an array with one slot per exponent should perform much better.

Second, sum_kbn doesn’t just work on arrays, it works on any iterable. You don’t need to know the number of elements in advance, and it doesn’t allocate any memory. Example:

julia> @btime sum_kbn(rand() for n = 1:1_000_000 if mod(n,2) == 0)
  3.452 ms (0 allocations: 0 bytes)
250002.82428255636

The use case I can see for this is if the items you want to add are not suitable to be iterated over. In such case, I would consider creating your own version of the Kahan algorithm, it sounds simpler and much more efficient than Manysum; the code (available here) is really short.

Even if the functionality is already in Julia, I’m trying to learn how to write a module, and it’s still a good exercise in module-making.

The reason I clobber the array is that the algorithm in Wikipedia, which does not clobber it, has recursive function call overhead. I wrote the C++ version as part of Bezitopo; all the uses of it, such as computing a function by its Taylor series, compute terms into an array or vector, sum them, and then destroy the array or vector.

Ok, well in case you’re interested, here’s the original implementation of pairwise summation in Julia. It avoids the recursive overhead by having a large base case (naive sum for < 128 entries). In practice your implementation will perform much worse (see benchmark above) since it doesn’t traverse the array strictly in order, it has more of a random-access pattern, which means that you’ll have a lot of cache misses, and can likely not take advantage of SIMD (vector instructions).