Julia equivalent of C++ map

phma · December 31, 2018, 11:03am

I’m partway through translating one of my C++ modules to Julia. The part I haven’t done yet uses a map<int,double> and has to iterate it from smallest to largest index. A Dict wouldn’t work, because the elements are stored in hash order. Here’s an object of this class after a test run:

$2 = {bucket = std::map with 26 elements = {[-77] = 0,
   [-76] = 7.0678450453239403e-24, [-75] = 1.4146047593286919e-23,
   [-74] = 2.8333572431958879e-23, [-73] = 5.6833432416834089e-23, [-72] = 0,
   [-71] = 2.2956685041615732e-22, [-70] = 4.7003608677487724e-22,
   [-69] = 9.8530744367994676e-22, [-67] = 4.6190213642120732e-21,
   [-65] = 1.8340575803522825e-20, [-64] = 3.6806385893800465e-20,
   [-63] = 7.3666474004393324e-20, [-62] = 1.4735579758999915e-19,
   [-53] = 7.5450657282317029e-17, [-45] = 1.9315443981147606e-14,
   [-38] = 2.4724470271325055e-12, [-30] = 6.3292450599891027e-10,
   [-22] = 1.6202930869341321e-07, [-15] = 2.0740340371071444e-05,
   [-7] = 0.0053093431485301268, [4] = 10.872598936094619,
   [5] = 22.998593185090815, [9] = 348.27528152227217,
   [10] = 727.42713809693339, [12] = 2986.9210783585077},
static cnt = 2321088}

bucket[-3] does not exist, but if the program attempts to set bucket[-3] to something, it will be inserted. How do I create such an object?

mkborregaard · December 31, 2018, 11:05am

OrderedDict from DataStructures.jl

kristoffer.carlsson · December 31, 2018, 12:54pm

DefaultDict in DataStructures.jl

julia> using DataStructures

julia> b = DefaultDict{Int, Float64}(2.0)
DefaultDict{Int64,Float64,Float64} with 0 entries

julia> b[3]
2.0

julia> b
DefaultDict{Int64,Float64,Float64} with 1 entry:
  3 => 2.0

mkborregaard · December 31, 2018, 1:53pm

Nice - is it not more ideomatic to use get!(mydict, -3, 2.) though?

kristoffer.carlsson · December 31, 2018, 2:22pm

You want to use the do syntax if there is a setup cost for the default value.

get!(mydict, -3) do
    Int[]
end

If there is a default value, I would just use a DefaultDict and keep the nice getindex syntax:

push!(mydict[2], 3)

vs

push!(get!(() -> Int[], mydict, 2), 3)

bennedich · December 31, 2018, 10:38pm

OrderedDict / DefaultOrderedDict are not correct since they refer to insertion order. What you want is SortedDict:

julia> d = SortedDict{Int64,Float64}();

julia> d[8] = 34;

julia> d[-3] = 12;

julia> d
SortedDict{Int64,Float64,Base.Order.ForwardOrdering} with 2 entries:
  -3 => 12.0
  8 => 34.0

As for “if the program attempts to set bucket[-3] to something, it will be inserted”, this is done by simply setting d[-3] = something (as in the example above).

Be aware that I’ve had mixed experience with DataStructures.jl when it comes to performance. In particular, for SortedDict you might be better off using a regular Dict (or Vector of tuples if you don’t need the dictionary properties) and sort the objects when you need them in order.

dpsanders · December 31, 2018, 11:41pm

How does the performance of SortedDict compare to that of C++'s map?

bennedich · January 1, 2019, 1:42am

My comment regarding SortedDict was more general: I don’t know how it compares to C++, but regardless of language it has different access complexities than a regular dictionary since it requires the elements to be sorted at all times. I don’t know about OP’s case, but most applications I’ve seen don’t have this requirement – rather they do batches of insertions / queries, followed by ordered iteration (often at the very end). If that is the case, it is often more efficient to use a different data structure, such as an unsorted dictionary or simply a vector. Sample timings below.

Case 1: Duplicates. Use a dictionary.

julia> k = rand(1:1000, 100_000); v = rand(100_000);

julia> @btime (a = SortedDict{Int64,Float64}(); for n = 1:100_000 a[$k[n]] = get(a, $k[n], 0) + $v[n]; end);
  9.327 ms (30 allocations: 145.81 KiB)

julia> @btime (b = Dict{Int64,Float64}(); for n = 1:100_000 b[$k[n]] = get(b, $k[n], 0) + $v[n]; end; b = sort(b));
  1.695 ms (52 allocations: 187.34 KiB)

Case 2: No duplicates. Use a vector.

julia> k = rand(Int64, 100_000); v = rand(100_000);

julia> @btime (a = SortedDict{Int64,Float64}(); for n = 1:100_000 a[$k[n]] = $v[n]; end);
  23.487 ms (51 allocations: 12.03 MiB)

julia> @btime (c = Vector{Tuple{Int64,Float64}}(undef, 100_000); for n = 1:100_000 c[n] = ($k[n],$v[n]) end; sort!(c));
  7.490 ms (4 allocations: 2.29 MiB)

# Alternatively:
julia> @btime c = sort!(collect(zip(k, v)));
  7.517 ms (5 allocations: 2.29 MiB)

phma · January 1, 2019, 9:27am

I’m not that concerned about the efficiency of SortedDict. The number of items in the SortedDict is normally on the same order of magnitude as the number of bits in whatever float type is being added. The use case for this class is that the number of floats being added is not known in advance, or is known to be more than fits in a comfortable amount of memory, and each takes more time to generate than the SortedDict overhead. (The pairwiseSum! function is for arrays that are known to fit in memory.)

I’m having trouble defining the constructor:

module ManySum
using DataStructures
export pairwiseSum!
export Manysum

function pairwiseSum!(a::Array{T})::T where T<:AbstractFloat
  i=1
  n=length(a)
  while i<n
    for j=1:2*i:n-i
      a[j]+=a[j+i]
    end
    i*=2
  end
  if n>0
    a[1]
  else
    0
  end
end

struct Manysum{T <: AbstractFloat}
  bucket ::SortedDict{Int,T}
  function Manysum()::T
    msum=new{T}()
    bucket=SortedDict{Int,T}()
    msum
  end
end

end # module

This loads, but when I try “ms=Manysum()”, I get “T not defined”. I’ve also tried it with “function Manysum()” without “::T”. What’s the right syntax?

bennedich · January 1, 2019, 11:01am

Two reflections on what you wrote above:

Julia already uses pairwise summation by default, so you might not have to implement it yourself. There’s also KahanSummation.jl if you need more precision.
If you know the range of the keys beforehand, it might be easier to just use an array instead of a dictionary/map. E.g. if all keys are between -128 and 128, create an array with 257 entries, and use the key as index into this array.

As for Manysum, try this syntax:

struct Manysum{T <: AbstractFloat}
  bucket::SortedDict{Int,T}

  Manysum{T}() where T = new{T}(SortedDict{Int,T}())
end

phma · January 2, 2019, 6:21am

Thanks! That worked. I’ll add the rest of the functions.

sum_kbn adds up an array; that’s not what Manysum is for. Since Manysum is intended to be a library function, it can’t know the range in advance, except that it’s within the range of the float type being used.

Given a float type, how do I find the maximum possible exponent? In C++, this is DBL_MAX_EXP for the double type.

bennedich · January 2, 2019, 4:12pm

I meant that the built-in sum already does what pairwiseSum! does, but in a more efficient way, and without destroying the array:

julia> v = rand(100_000_000);

julia> sum(v)
5.000039762225862e7

julia> pairwiseSum!(v)
5.000039762225862e7

julia> @btime sum($v);
  44.303 ms (0 allocations: 0 bytes)

julia> @btime pairwiseSum!($v);
  442.701 ms (0 allocations: 0 bytes)

The use case for this class is that the number of floats being added is not known in advance, or is known to be more than fits in a comfortable amount of memory // sum_kbn adds up an array; that’s not what Manysum is for

To be honest, I am not convinced about these cases. First, if you have so many floats that they barely fit
in memory, performance will likely be an important factor, in which case SortedDict is not a good choice.
If you know the range of exponents (which it sounds like you can figure out, if not at compile time then at least at run time), using an array with one slot per exponent should perform much better.

Second, sum_kbn doesn’t just work on arrays, it works on any iterable. You don’t need to know the number of elements in advance, and it doesn’t allocate any memory. Example:

julia> @btime sum_kbn(rand() for n = 1:1_000_000 if mod(n,2) == 0)
  3.452 ms (0 allocations: 0 bytes)
250002.82428255636

The use case I can see for this is if the items you want to add are not suitable to be iterated over. In such case, I would consider creating your own version of the Kahan algorithm, it sounds simpler and much more efficient than Manysum; the code (available here) is really short.

phma · January 3, 2019, 4:25am

Even if the functionality is already in Julia, I’m trying to learn how to write a module, and it’s still a good exercise in module-making.

The reason I clobber the array is that the algorithm in Wikipedia, which does not clobber it, has recursive function call overhead. I wrote the C++ version as part of Bezitopo; all the uses of it, such as computing a function by its Taylor series, compute terms into an array or vector, sum them, and then destroy the array or vector.

bennedich · January 3, 2019, 9:06am

Ok, well in case you’re interested, here’s the original implementation of pairwise summation in Julia. It avoids the recursive overhead by having a large base case (naive sum for < 128 entries). In practice your implementation will perform much worse (see benchmark above) since it doesn’t traverse the array strictly in order, it has more of a random-access pattern, which means that you’ll have a lot of cache misses, and can likely not take advantage of SIMD (vector instructions).

Topic		Replies	Views
Sort indices based on value New to Julia	13	418	December 8, 2023
Reimagining C++ project in Julia: arrays and structs vs data frames New to Julia data_structures , c	23	2184	December 5, 2021
Dictionary-like data structure representing a bijective map? General Usage question	1	916	August 21, 2020
Java to julia General Usage question	11	1242	August 29, 2019
Benchmarking Dict against C++ std::unordered_map Internals & Design performance , dictionary	6	733	July 3, 2022

Julia equivalent of C++ map

Related topics