Super slow string performance

Hi,
while solving a coding competition problem with julia I experienced extreme performance issues with the String type.
This program:

function deleteCount(target::String, input::String)
    if length(target) > length(input)
        return "IMPOSSIBLE"
    end
    offset::Int = 0
    for i = 1:length(target)
        while target[i] != input[i+offset]
            offset += 1
            if length(target) > length(input) - offset
                return "IMPOSSIBLE"
            end
        end
    end
    return length(input) - length(target)
end

function processTest(io::IO=stdin)
    numCases = parse(Int, readline(io))
    for i = 1:numCases
        target = readline(io)
        actual = readline(io)
        println("Case #", i, ": ", deleteCount(target, actual))
    end
end
processTest()

gets a timeout on the test set with over 20s execution time,
while the following seemingly similar version of it

function deleteCount(target::Vector{Char}, input::Vector{Char})
    if length(target) > length(input)
        return "IMPOSSIBLE"
    end
    offset::Int = 0
    for i = 1:length(target)
        while target[i] != input[i+offset]
            offset += 1
            if length(target) > length(input) - offset
                return "IMPOSSIBLE"
            end
        end
    end
    return length(input) - length(target)
end

function processTest(io::IO=stdin)
    numCases = parse(Int, readline(io))
    for i = 1:numCases
        target = readline(io)
        actual = readline(io)
        println("Case #", i, ": ", deleteCount(collect(target), collect(actual)))
    end
end
processTest()

finishes in next to no time.

Why are Strings so slow compared to char arrays?

1 Like

The main difference is that strings are UTF-8-encoded, which is a variable-length encoding. This means that length of a string is actually an O(n) operation. Since you’re calling length twice in each loop iteration, that means your code is suddenly O(n^2) instead of O(n).

From the way you are indexing the strings, you also seem to assume that all characters you are reading in are ASCII, since you always increase your offset by 1. This will error if the input contains any non-ASCII characters, typically you’ll want to use nextind or just a for loop over eachindex(::String) instead.

Julia does allow you to index strings by single-byte code units, which might be what you want. You can use codeunit(s, i) to get the ith code unit and codeunits(s) to get the total number of code units. Those will both be O(1) operations.

6 Likes

Since your strings are not changed in the loop, and thus the lengths constant, then it seems you can change to:

    len_target = length(target); len_input = length(input)
    for i = 1:len_target
        while target[i] != input[i+offset]
            offset += 1
            if len_target > len_input - offset
                return "IMPOSSIBLE"
            end
        end
    end

I would like the Julia compiler to be able to make that transformation, but it’s probably too non-obvious for compilers (and sometimes humans!). Not sure what they do for other languages, e.g. C++.