How can I make this more elegant/performant?

Based on past experience, I have a feeling that I’ve horrendously over-engineered a solution to what seems like a simple problem. I have a vector of codes (numbers represented as strings) that I need to ‘collapse’ down. For example, given the following vector:

codes = ["11", "13", "112", "11213", "113", "213", "21345", "21376", "7567", "75679"]

The function should produce this result:

["11", "13", "213", "7567"]

Basically, it should find the shortest codes in the vector and then see if there are any other codes that start with the same and eliminate them. So “112”, “11213”, and “113” all fall under the “11” category and, as such, should be removed from the vector. “21345” and “21376” fall under the “213” category so they should be removed, and so on.

The Frankenstein recursive function that I’ve written to do this looks like this:

function collapsecodes(v::Vector{String}, minlength::Int = 0)
    m = minlength == 0 ? minimum(length.(v)) : minlength
    s = filter(x -> length(x) == m, v)
    diff = setdiff(v, s)
    c = vcat([filter(x -> length(x) >= m && x[1:m] == s[i], diff) for i in 1:length(s)]...)
    a = setdiff(v, c)
    return length(setdiff(v, a)) == 0 ? a : collapsecodes(a, m+1)
end

# REPL

julia> @time collapsecodes(codes)
  0.085379 seconds (167.69 k allocations: 7.989 MiB)
4-element Array{String,1}:
 "11"
 "13"
 "213"
 "7567"

I have a feeling one of you amazing Julians can put together a solution that is 10 times more elegant and 10 times more performant :stuck_out_tongue_winking_eye:

i don’t know if its faster but I’m sure is shorter:

codes = ["11", "13", "112", "11213", "113", "213", "21345", "21376", "7567", "75679"]
function collapsecodes2(codes::Vector{String}) #out of place,creates a new vector
    _codes = sort(codes)  
    min_len  = minimum(length,_codes) #using first argument as a function
    return min_codes = unique(x->first(x,min_len),_codes)
end

function collapsecodes2!(codes::Vector{String}) #inplace, modifies the input, should be faster
    sort!(codes)
    min_len  = minimum(length,codes)
    return min_codes = unique!(x->first(x,min_len),codes)
end

benchamarking:

using BenchmarkTools
julia> @benchmark collapsecodes($codes)
BenchmarkTools.Trial:
  memory estimate:  10.56 KiB
  allocs estimate:  134
  --------------
  minimum time:     9.900 μs (0.00% GC)
  median time:      11.499 μs (0.00% GC)
  mean time:        12.470 μs (0.00% GC)
  maximum time:     60.501 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark collapsecodes2($codes)
BenchmarkTools.Trial:
  memory estimate:  944 bytes
  allocs estimate:  17
  --------------
  minimum time:     938.095 ns (0.00% GC)
  median time:      1.014 μs (0.00% GC)
  mean time:        1.232 μs (10.85% GC)
  maximum time:     355.852 μs (99.19% GC)
  --------------
  samples:          10000
  evals/sample:     21

EDIT: added the presorting, as suggested by @jling

2 Likes

Here’s a simple faster version:

julia> function collapsecodes(codes)
           "" in codes && return [""] # use the empty string as a sentinel value below
           result = copy(codes)
           sort!(result, by=length)
           for i in eachindex(result)
               for j in i + 1 : lastindex(result)
                   if !isempty(result[i]) && startswith(result[j], result[i])
                       result[j] = ""
                   end
               end
           end
           filter!(!isempty, result)
           result
       end

collapsecodes (generic function with 1 method)

julia> @btime collapsecodes(codes)
  424.985 ns (2 allocations: 240 bytes)
4-element Array{String,1}:
 "11"
 "13"
 "213"
 "7567"
3 Likes

this won’t work, the min_len is 2 in this example, if you swap the order or "213", "21345",, you will get "21345" because you only looking at first 2 chars of each string. Pre-sorting by length should fix this.

2 Likes

This won’t work if you need to compare prefixes of other lengths than the minimum, e.g. if you add 214, which AFAIU should have its own category (or maybe create the 21 category, this is not clear from the question).

1 Like

@tkluck this is awesome! I’m struggling a bit to understand exactly what’s going on here though, what is the purpose of the first line: "" in codes && return [""]? You added the comment that the empty string is used as a sentinel value but it seems as though this line is just a check to make sure that there are no empty strings in the array that’s passed to the function…??

e.g. if you add 214, which AFAIU should have its own category (or maybe create the 21 category, this is not clear from the question).

Nope, shouldn’t create a new category. The function should just find the ‘parent’ category, which is basically the shortest-length code. So in this case, “214” would stand alone and should be returned in the result.

Bravo, and thanks to all of you :clap: :clap: This community never ceases to amaze me :slightly_smiling_face:

It’s faster to do

result[j] = <sentinel>
...
filter!(!equals(<sentinel>), result)

than to delete eagerly

deleteat!(result, j)

as it avoids moving the tail elements every time. Typically, you’d use something like nothing as a sentinel value. In your specific case, if any string is empty, it is a prefix for all others(!) and so [""] is the correct result. That means we can short-circuit that case and use "" as the sentinel in other cases. It doesn’t make a big difference; just avoids some type juggling to allow result to also contain nothing. The alternative would have been something like (untested):

function collapsecodes(codes)
    result = collect(Union{eltype(codes), Nothing}, codes)
    sort!(result, by=length)
    for i in eachindex(result)
        isnothing(result[i]) && continue
        for j in i + 1 : lastindex(result)
            if startswith(result[j], result[i])
                result[j] = nothing
            end
        end
    end
    filter!(!isnothing, result)
end

Makes sense?

3 Likes

Yes, thanks!