How to replace multiple values in a large array with the number of occurrences of each value?

Hi!
I have an array and I want to create a new array (or replace the values) with the number of occurrences of each value in the input array. For example:

using StatsBase

blobs
10×10 Array{Int64,2}:
 1  1  0  2  0  2  0  2  0  2
 1  1  0  2  2  0  2  2  2  2
 0  1  0  0  0  0  0  2  0  0
 1  1  0  0  2  0  2  2  2  2
 0  1  0  0  2  2  2  2  2  0
 1  0  0  2  2  0  0  2  2  0
 0  0  0  0  2  2  2  0  2  0
 0  0  2  0  0  2  2  0  2  2
 0  0  2  0  2  2  2  0  0  0
 0  0  0  2  2  0  2  0  3  0

blobs_occ = countmap(vec(blobs))
Dict{Int64,Int64} with 4 entries:
  0 => 49
  2 => 41
  3 => 1
  1 => 9

blob_counts = zeros(size(blobs));

for (key, value) in blobs_occ
       C=findall(blobs.==key)
       blob_counts[C] .= value
end

blob_counts
10×10 Array{Float64,2}:
  9.0   9.0  49.0  41.0  49.0  41.0  49.0  41.0  49.0  41.0
  9.0   9.0  49.0  41.0  41.0  49.0  41.0  41.0  41.0  41.0
 49.0   9.0  49.0  49.0  49.0  49.0  49.0  41.0  49.0  49.0
  9.0   9.0  49.0  49.0  41.0  49.0  41.0  41.0  41.0  41.0
 49.0   9.0  49.0  49.0  41.0  41.0  41.0  41.0  41.0  49.0
  9.0  49.0  49.0  41.0  41.0  49.0  49.0  41.0  41.0  49.0
 49.0  49.0  49.0  49.0  41.0  41.0  41.0  49.0  41.0  49.0
 49.0  49.0  41.0  49.0  49.0  41.0  41.0  49.0  41.0  41.0
 49.0  49.0  41.0  49.0  41.0  41.0  41.0  49.0  49.0  49.0
 49.0  49.0  49.0  41.0  41.0  49.0  41.0  49.0   1.0  49.0

This works well with small arrays, but with a larger array, e.g. 10000x10000, it could take many hours. Is there a way to do this faster?
Thanks in advance, Bjoern

Using map! seems moderately faster: map!(x -> blobs_occ[x], blob_counts, blobs).
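
For anyone who wants to try the map! idea without installing anything, here is a minimal sketch using only Base. Note that `count_values` is a hypothetical stand-in I wrote for StatsBase's countmap, and the 3×3 `blobs` is made-up toy data, not the array from the question.

```julia
# Hypothetical Base-only stand-in for StatsBase.countmap (an assumption, not the real API).
function count_values(A)
    counts = Dict{eltype(A),Int}()
    for x in A
        counts[x] = get(counts, x, 0) + 1
    end
    return counts
end

blobs = [1 1 0; 0 2 2; 2 2 0]      # toy data
blobs_occ = count_values(blobs)

# similar(blobs, Int) keeps the counts as integers instead of Float64.
blob_counts = similar(blobs, Int)

# A single pass over the array: each element is looked up once in the Dict.
map!(x -> blobs_occ[x], blob_counts, blobs)
```

The difference from the original loop is that the array is traversed once in total, instead of once per distinct key with a findall each time.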


However, if the keys are “dense” like in your example, you can instead store the counts in a Vector, which makes lookups considerably faster:

minkey, maxkey = extrema(keys(blobs_occ))
occ = zeros(Int, maxkey-minkey+1)
for (k, v) in blobs_occ
    occ[k+1-minkey] = v
end
map!(x -> occ[x+1-minkey], blob_counts, blobs)
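
As a variation on the same idea, one could skip the Dict entirely and count straight into the Vector. This is only a sketch under the assumption that the values are integers in a reasonably small range; `dense_counts` is a hypothetical helper name, not something from StatsBase.

```julia
# Hypothetical helper: counts are stored at index (value - lo + 1).
# Assumes integer values whose range (hi - lo + 1) fits comfortably in memory.
function dense_counts(A::AbstractArray{<:Integer})
    lo, hi = extrema(A)
    occ = zeros(Int, hi - lo + 1)
    for x in A
        occ[x - lo + 1] += 1            # first pass: tally each value
    end
    out = similar(A, Int)
    map!(x -> occ[x - lo + 1], out, A)  # second pass: look the tallies up
    return out
end

dense_counts([0 3; 3 5])
```

If the values are sparse (say a few keys spread over a range of billions), the Vector gets too large and the Dict version is the safer choice.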

You need to iterate the array keeping the column-major layout in mind. map! would do this for you. You may also loop over the array by yourself:

julia> using StatsBase

julia> blobs = rand(1:10, 10_000, 10_000);

julia> blobs_occ = countmap(vec(blobs))
Dict{Int64,Int64} with 10 entries:
  7  => 9999281
  4  => 10004145
  9  => 9998303
  10 => 10000867
  2  => 10003930
  3  => 10000831
  5  => 10001701
  8  => 9993679
  6  => 9996181
  1  => 10001082

julia> blob_counts = similar(blobs);

julia> function elcount!(blob_counts, blobs, blobs_occ)
           for (ind, el) in enumerate(blobs)
               blob_counts[ind] = blobs_occ[el]
           end
           return blob_counts
       end

julia> @time elcount!(blob_counts, blobs, blobs_occ);
  1.499894 seconds

Or more succinctly:

blobs_occ = countmap(vec(blobs))
[blobs_occ[value] for value in blobs]

Edit: and the same in one line, though less readable I think:

getindex.(Ref(countmap(vec(blobs))), blobs)
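
In case the Ref in the one-liner is puzzling: it wraps the dictionary so that broadcasting treats it as a single value and only iterates over the array. A small sketch with a hand-written Dict (toy data, standing in for the countmap result) shows that both forms give the same array:

```julia
occ = Dict(0 => 3, 1 => 2, 2 => 4)   # toy counts, standing in for countmap's output
blobs = [1 1 0; 0 2 2; 2 2 0]

# Ref(occ) makes broadcasting treat the Dict as a scalar,
# so only `blobs` is traversed element by element.
via_broadcast = getindex.(Ref(occ), blobs)

# The comprehension keeps the shape of `blobs` as well.
via_comprehension = [occ[v] for v in blobs]
```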

@rfourquet, your solution is amazingly fast but not the easiest to read.

On another note, how can we @btime a simple but slower in-place double-loop solution as below?

function elcount6!(blobs, blobs_occ)
    @inbounds for j in axes(blobs,2), i in axes(blobs,1)
        blobs[i,j] = blobs_occ[blobs[i,j]];
    end
    return blobs
end

Thanks in advance.

Using @btime to time functions over large arrays is not difficult. Remember to prefix the arguments in the function call with $. This interpolates their values into the benchmark, so BenchmarkTools does not time the overhead of accessing them as global variables (it is the time required by however they are used within the function that matters).

using BenchmarkTools

# to see the timing info and the returned values
@btime elcount6!($blobs, $blobs_occ)

# to see the timing info only (suppressing the values)
@btime elcount6!($blobs, $blobs_occ);

@JeffreySarnoff, thanks for your time and answer, but it issues error messages:

function elcount6!(blobs, blobs_occ)
    @inbounds for j in axes(blobs,2), i in axes(blobs,1)
        blobs[i,j] = blobs_occ[blobs[i,j]];
    end
    return blobs
end

blobs = rand(0:9,10_000,10_000);
blobs_occ = countmap(vec(blobs))
@btime elcount6!($blobs, $blobs_occ)

julia> @btime elcount6!($blobs, $blobs_occ)
ERROR: KeyError: key 10002175 not found
Stacktrace:
 [1] getindex at .\dict.jl:467 [inlined]
 ...

Some setup seems to be required to @btime functions that modify their arguments in place, but I have not yet figured out how to do it for the function above.

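Indeed, a fresh copy of blobs is required for every evaluation: after the first run the array holds counts rather than the original values, so the next lookup fails with a KeyError. You can do it like this: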

function elcount!(blobs, blobs_occ)
	@inbounds for i in eachindex(blobs)
		blobs[i] = blobs_occ[blobs[i]];
	end
	return blobs
end

blobs = rand(0:9,10_000,10_000);
blobs_occ = countmap(vec(blobs));

@btime elcount!(b, $blobs_occ) setup=(b=copy(blobs)) evals=1;
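
To see why the copy matters, here is a small Base-only sketch that reproduces the failure without BenchmarkTools. The six-element vector and the hand-written Dict are toy data chosen for illustration; the second call stands in for what a repeated benchmark evaluation does.

```julia
function elcount!(blobs, blobs_occ)
    for i in eachindex(blobs)
        blobs[i] = blobs_occ[blobs[i]]
    end
    return blobs
end

blobs = [0, 1, 1, 1, 1, 1]           # toy data
blobs_occ = Dict(0 => 1, 1 => 5)     # counts written by hand for this example

# First pass on a fresh copy works: values are replaced by their counts.
first_pass = elcount!(copy(blobs), blobs_occ)

# A second pass on already-overwritten data uses counts as keys.
# The count 5 is not a key of blobs_occ, so the lookup throws.
second_pass_failed = try
    elcount!(copy(first_pass), blobs_occ)
    false
catch err
    err isa KeyError
end
```

That is the same situation as in the error above: the key in the message is a count from an earlier evaluation, not one of the original values.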

@sijo, much appreciated.
The result shows 0 allocations and 0 bytes. Is that correct?

@btime elcount6!(b, $blobs_occ) setup=(b=copy(blobs)) evals=1;
  910 ms (0 allocations: 0 bytes)

Isn’t that beautiful? :blush: It’s expected since the dictionary of counts is created outside of the benchmarked code. The code really has nothing to do except finding and assigning existing values.
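
If you want to check this without BenchmarkTools, something like the sketch below should work with Base's @allocated. The `measure` wrapper and the small random test vector are my own assumptions for the example; the copy is made before `measure` is entered, so it is not part of what gets counted.

```julia
function elcount!(blobs, blobs_occ)
    for i in eachindex(blobs)
        blobs[i] = blobs_occ[blobs[i]]
    end
    return blobs
end

# Hypothetical wrapper: measuring inside a function avoids counting
# overhead from non-constant global variables.
measure(b, occ) = @allocated elcount!(b, occ)

data = rand(0:9, 1_000)
occ = Dict(k => count(==(k), data) for k in 0:9)   # counts built outside the measurement

measure(copy(data), occ)            # warm-up call, so compilation is not measured
allocs = measure(copy(data), occ)   # bytes allocated by the in-place pass itself
```

I would expect `allocs` to come out as zero, or at least far below the size of the array, since the loop only reads from the Dict and writes into memory that already exists.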
