Replacing components with indices

I am wondering if the following can be done in a better way (without for loop, perhaps): given a vector and a vector of vectors, e.g.

v = [1, 2, 3, 3, 4, 4, 5]
u = [[1,2], [3, 5], [4]]

I'd like to get

[1, 1, 2, 2, 3, 3, 2]

by mapping the components in v to the indices of u. I came up with two ways as follows:

for i in eachindex(u)
    v[findall(x -> x in u[i], v)] .= i
end

and

cond = Iterators.flatmap(enumerate(u) |> collect) do (i, x)
    map(reverse, i .=> x)   # turn each i => xj pair around into xj => i
end
for c in cond
    replace!(v, c)
end

The reason why I have a question is that I wonder whether I could do something like

for i in eachindex(u)
    replace!(v, u[i] .=> i)
end

since replace(v, 1 => 0) (for example) already works, and I thought some extension of it might avoid the for loop.
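
For what it's worth, a minimal sketch of that idea: replace! takes the pairs as varargs rather than as a single vector, so splatting the broadcasted pairs does run (with the replacement-order caveat discussed further down the thread):

for i in eachindex(u)
    replace!(v, (u[i] .=> i)...)   # splat the Vector{Pair} into varargs
end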

I would worry less about getting rid of loops (which are perfectly fine in Julia) and more about making your complexity better. Right now, your computational cost scales proportionally to length(v) * sum(length, u), which is pretty bad if the vectors are long. You can do much better.

For example, make a single pass over u first to make a dictionary mapping values to indices:

d = Dict{eltype(eltype(u)),Int}()   # maps each value to the index of the subvector containing it
for i in eachindex(u), x in u[i]
    d[x] = i
end

then you can just do

julia> v .= get.(Ref(d), v, v)
7-element Vector{Int64}:
 1
 1
 2
 2
 3
 3
 2

which has nearly linear complexity in the two lengths. You might also want to consider using the dictionary (or similar) in the first place to store your mapping (so that you never create u at all).

(And you can do even better if you have more knowledge of u. E.g. if you know that the elements of u cover all of the values of v, and that these are small integers, as in your example above, then you can just store an array mapping values to indices, rather than a dictionary.)
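
A minimal sketch of that array variant, assuming (as in the example above) that the values of v are exactly the integers 1:5 and that u covers all of them:

lut = Vector{Int}(undef, 5)   # lut[value] = index of the subvector containing that value
for i in eachindex(u), x in u[i]
    lut[x] = i
end
v .= lut[v]                   # one pass over v; each lookup is a plain array access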


Another option:

foldl((r,(i,e)) -> (r[e] .= i; r), pairs(u); init=fill(0,5))[v]

This option uses the fact that u is a partition of the values of v; the size 5 of the value set of v is encoded in the initialization. But it feels like this knowledge is implicit in the OP's setup.
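
A sketch of the same fold with the 5 computed instead of hard-coded, assuming non-empty subvectors of small positive integers:

lut = foldl((r, (i, e)) -> (r[e] .= i; r), pairs(u); init=zeros(Int, maximum(maximum, u)))
lut[v]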

in.(v, permutedims(u))*axes(u,1)

Shorter:
in.(v,u') * axes(u,1)


This is cute, but still has poor time complexity proportional to length(v) * sum(length, u), as well as poor space complexity due to allocating a temporary matrix of size length(v) * length(u).

indexinin(v,u)=[findfirst(eu->in(e, eu), u) for e in v]

This variant seems to be a little more efficient:


function indexinin!(v,u)
    for i in eachindex(v)
        v[i]=findfirst(eu->in(v[i], eu), u)
    end
end

Such a strategy would have the problem that the changes you apply can also change the part of the vector that has already been modified.
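
For instance, here is a hypothetical ordering in which the second pass clobbers a value that the first pass already wrote:

v = [1, 3]
u = [[3], [1]]
for i in eachindex(u)
    replace!(v, (u[i] .=> i)...)
end
v   # [2, 2], but the correct answer is [2, 1]: the 1 written by the first pass was replaced again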

Unless you use kwarg count=1 :grinning:

foreach(e->replace!(v, e=>findfirst(eu->in(e, eu), u),count=1),v)

It still has poor time complexity proportional to length(v) * sum(length, u).

Also has the same poor time complexity.

Realize that nearly all of the solutions posted so far (excluding mine) correspond to three nested loops, hence the multiplicative complexity. Squeezing the code into a minimal number of library calls is fine if you want to minimize code size, but it’s too easy to fall into the myth of “library calls = fast”, or to think only about code size and not about computational complexity.
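
To see this concretely, something like in.(v, permutedims(u)) * axes(u,1) does roughly the same work as this triple loop (a sketch):

result = zeros(Int, length(v))
for (k, x) in pairs(v)        # loop 1: over every element of v
    for (i, ui) in pairs(u)   # loop 2: over every subvector of u
        if x in ui            # loop 3: sequential search inside u[i]
            result[k] += i
        end
    end
end

That is length(v) * sum(length, u) work, rather than length(v) + sum(length, u).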

In general terms I would say yes.
In my experience (I often try to independently write scripts that replace/improve library calls), and with my level of knowledge, the library calls usually work better.
But I wonder whether, in this specific case, with u being something like a partition of unique(v), the evaluation of space or time complexity really differs from the general worst case.

Hopefully, the “nearly” refers to my solution too, which has the same low complexity. In fact, given how Dicts are implemented, and assuming, as it seems, that u is a partition of 1:n (n = 5 here), my solution would keep a multiplicative advantage at larger scales (in both memory and speed). stevengj already mentions this possibility at the end of his solution, so perhaps I’ve added nothing novel.

I wanted to compare this with the script that uses findfirst, but I can’t measure it with @benchmark f(…) setup=(…)

function richcomplexity!(u,vv)
    d = zeros(Int, length(unique(vv)))   # array lookup table instead of a Dict
    for i in eachindex(u), x in u[i]
        d[x] = i
    end
    vv .= getindex.(Ref(d), vv)
end

I tried with larger vectors than the ones provided in the OP.

bigv = rand(1:100, 10^3)

uu = unique(bigv)

p = sort(rand(uu, 10))                 # 10 random cut points; assumes uu covers all of 1:100, so its values are valid indices
pl = [1; p]
pr = [p .- 1; length(uu)]
bigu = getindex.([uu], (:).(pl, pr))   # slice uu into consecutive chunks, i.e. a partition of its values
julia> function richcomplexity(u,v)
           d = Dict{eltype(eltype(u)),Int}()
           for i in eachindex(u), x in u[i]
               d[x] = i
           end
           v .= get.(Ref(d), v, v)
       end
richcomplexity (generic function with 1 method)


julia> @benchmark richcomplexity($bigu,cv)  setup=(cv=copy($bigv))     
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  4.833 μs … 870.450 μs  ┊ GC (min … max): 0.00% … 98.16%
 Time  (median):     5.400 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.108 μs ±  24.920 μs  ┊ GC (mean ± σ):  8.84% ±  2.96%

  █▇▅▄▄▄▁▁▁▁▁               ▂▂▁                               ▁
  █████████████▇██▇▇▇▇▆▇▇▇▇▆█████▆▆▆▆▆▇▇▅▅▅▅▄▄▅▅▂▃▃▃▄▅▅▅▃▂▅▃▂ █
  4.83 μs      Histogram: log(frequency) by time      31.8 μs <

 Memory estimate: 6.36 KiB, allocs estimate: 10.

julia> function indexinin!(v,u)
           for i in eachindex(v)
               v[i]=findfirst(eu->in(v[i], eu), u)
           end
       end
indexinin! (generic function with 1 method)

julia> @benchmark indexinin!(cv,$bigu) setup=(cv=copy($bigv))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  28.000 μs …  1.277 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     29.800 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   32.776 μs ± 20.407 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇██▆▄▁ ▁      ▅▂▁  ▁▁▁▁▁▁▁                                  ▂
  ████████▆█▆▇▇▆██████████████▇▇▆▄▅▄▄▄▃▄▄▃▃▁▄▄▄▄▃▃▄▄▄▄▁▅▁▃▄▃▆ █
  28 μs        Histogram: log(frequency) by time        78 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

It seems to be as @stevengj said.

“Library calls are faster” is an unreliable heuristic.

  1. First, you should always try to have some sense of what the library calls are doing, at least knowing the expected complexity (e.g. findfirst(predicate, array) and x in array are both sequential searches, so each has an average cost that scales with the array length multiplied by the cost of the predicate); see the sketch after this list.
  2. Second, as I wrote in another blog post: There is a tension between two general principles in computing: on the one hand, re-using highly optimized code is good for performance; on the other hand, optimized code that is specialized for your problem can usually beat general-purpose functions.
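
As a tiny illustration of the first point, here is a sketch (using BenchmarkTools and a Set; not code from this thread) contrasting a sequential membership test with a hash-based one:

using BenchmarkTools

xs = collect(1:100_000)
s = Set(xs)              # hash-based membership

@btime 99_999 in $xs     # sequential search: cost grows with length(xs)
@btime 99_999 in $s      # hash lookup: roughly constant expected cost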

That being said, there is absolutely nothing wrong with writing code to minimize code length or maximize clarity (hopefully both) rather than optimizing speed. But you should be aware that you are doing so, especially if your code has suboptimal complexity/scaling (not just a suboptimal constant factor, which requires a lot more expertise to deal with).

Right, your solution also does one linear-time pass over u and one linear-time pass over v. :+1:
