I would worry less about getting rid of loops (which are perfectly fine in Julia) and more about making your complexity better. Right now, your computational cost scales proportionally to length(v) * sum(length, u), which is pretty bad if the vectors are long. You can do much better.
For example, make a single pass over u first to make a dictionary mapping values to indices:
d = Dict{eltype(eltype(u)),Int}()   # maps each value x to the index i with x ∈ u[i]
for i in eachindex(u), x in u[i]
    d[x] = i
end
which has nearly linear complexity in the two lengths. You might also want to consider using the dictionary (or similar) in the first place to store your mapping (so that you never create u at all).
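As a rough sketch of the lookup step (assuming every value of v actually appears in some u[i]; d, u, and v are the names from above):

result = [d[x] for x in v]           # one hash lookup per element of v
# or, if some values of v might be missing from u:
# result = [get(d, x, 0) for x in v]

This gives roughly length(v) + sum(length, u) expected work overall.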
(And you can do even better if you have more knowledge of u. e.g. if you know that the elements of u cover all of the values of v, and that these are small integers, as in your example above, then you can just store an array mapping values to indices, rather than a dictionary.)
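A minimal sketch of that array-based variant (assuming the values are integers in 1:n with n known, e.g. n = 5 as in the example; the names n and lookup are illustrative):

n = 5                                 # or maximum(maximum, u) if not known in advance
lookup = zeros(Int, n)                # lookup[x] == i will mean x ∈ u[i]
for i in eachindex(u), x in u[i]
    lookup[x] = i
end
result = [lookup[x] for x in v]       # array indexing instead of hashing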
This option uses the fact that u is a partition of the values of v, and that the size 5 of the value set of v is encoded in the initialization. But it feels like this knowledge is implicit in the OP's setup.
This is cute, but still has poor time complexity proportional to length(v) * sum(length, u), as well as poor space complexity due to allocating a temporary matrix of size length(v) * length(u).
It still has poor time complexity proportional to length(v) * sum(length, u).
Also has the same poor time complexity.
Realize that nearly all of the solutions posted so far (excluding mine) correspond to three nested loops, hence the multiplicative complexity. Squeezing the code into a minimal number of library calls is fine if you want to minimize code size, but it's too easy to fall into the myth of "library calls = fast", or to think only about code size and not about computational complexity.
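For concreteness, the findfirst/broadcast-style solutions are roughly equivalent to something like this once the library calls are unfolded (a schematic reconstruction, not anyone's exact code):

result = similar(v, Int)
for (k, x) in enumerate(v)            # loop over v
    for (i, ui) in enumerate(u)       # loop over u
        if x in ui                    # hidden inner loop over u[i]
            result[k] = i
            break
        end
    end
end
# worst-case work ≈ length(v) * sum(length, u)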
In general terms I would say yes.
In my experience (I often try to write my own scripts that replace or improve on library calls), and at my level of knowledge, the library calls work better.
But I wonder whether, in this specific case, where u is something like a partition of unique(v), the space or time complexity is really the same as in the general worst case.
Hopefully, the "nearly" refers to my solution too, which has the same low complexity. In fact, due to the implementation of Dicts, and assuming, as it seems, that u is a partition of 1:n (n = 5 here), my solution would keep a multiplicative advantage at larger scales (both in memory and speed). stevengj already mentions this possibility at the end of his solution, so perhaps I've added nothing novel.
"Library calls are faster" is an unreliable heuristic.
First, you should always try to have some sense of what the library calls are doing: at least know the expected complexity. (e.g. findfirst(predicate, array) or x in array are both sequential searches, so they have an average cost that scales with the array length multiplied by the cost of the predicate.)
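As an illustrative sketch (not from the thread) of how the data structure, not the library call itself, determines the scaling:

a = collect(1:10^6)
s = Set(a)
10^6 in a              # sequential scan: cost grows with length(a)
10^6 in s              # hash-table lookup: expected O(1), independent of length(a)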
Second, as I wrote in another blog post: There is a tension between two general principles in computing: on the one hand, re-using highly optimized code is good for performance; on the other hand, optimized code that is specialized for your problem can usually beat general-purpose functions.
That being said, there is absolutely nothing wrong with writing code to minimize code length or maximize clarity (hopefully both) rather than optimizing speed. But you should be aware that you are doing so, especially if your code has suboptimal complexity/scaling (not just a suboptimal constant factor, which requires a lot more expertise to deal with).
Right, your solution also does one linear-time pass over u and one linear-time pass over v.