Topological sort (performance)


#1

Hi.

I needed to implement a topological sort, so I found some implementations on the Internet here: https://rosettacode.org/wiki/Topological_sort

I initially needed it in Python and experimented with Julia just out of curiosity. Here is a Python code:

try:
    from functools import reduce  # Python 3: reduce lives in functools
except ImportError:
    pass  # Python 2: reduce is a builtin
 
data = {
    'des_system_lib':   set('std synopsys std_cell_lib des_system_lib dw02 dw01 ramlib ieee'.split()),
    'dw01':             set('ieee dw01 dware gtech'.split()),
    'dw02':             set('ieee dw02 dware'.split()),
    'dw03':             set('std synopsys dware dw03 dw02 dw01 ieee gtech'.split()),
    'dw04':             set('dw04 ieee dw01 dware gtech'.split()),
    'dw05':             set('dw05 ieee dware'.split()),
    'dw06':             set('dw06 ieee dware'.split()),
    'dw07':             set('ieee dware'.split()),
    'dware':            set('ieee dware'.split()),
    'gtech':            set('ieee gtech'.split()),
    'ramlib':           set('std ieee'.split()),
    'std_cell_lib':     set('ieee std_cell_lib'.split()),
    'synopsys':         set(),
    }
 
def toposort2(data):
    for k, v in data.items():
        v.discard(k) # Ignore self dependencies
    extra_items_in_deps = reduce(set.union, data.values()) - set(data.keys())
    data.update({item:set() for item in extra_items_in_deps})
    while True:
        ordered = set(item for item,dep in data.items() if not dep)
        if not ordered:
            break
        for el in sorted(ordered):
            yield el
        data = {item: (dep - ordered) for item,dep in data.items()
                if item not in ordered}
    assert not data, "A cyclic dependency exists amongst %r" % data

and I got the following performance:

%timeit list(toposort2(data))
10000 loops, best of 3: 20.2 µs per loop
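For reference, the generator's output order is deterministic (ties within a round are broken alphabetically by the `sorted` call), so it can be sanity-checked on a small input. Here is a self-contained sketch of the same algorithm with a trimmed dependency dict of my own:

```python
from functools import reduce

def toposort2(data):
    # Same algorithm as above: repeatedly emit every item whose
    # dependency set has become empty, then strip those items out.
    for k, v in data.items():
        v.discard(k)  # ignore self-dependencies
    extra = reduce(set.union, data.values()) - set(data.keys())
    data.update({item: set() for item in extra})
    while True:
        ordered = set(item for item, dep in data.items() if not dep)
        if not ordered:
            break
        yield from sorted(ordered)
        data = {item: dep - ordered for item, dep in data.items()
                if item not in ordered}
    assert not data, "A cyclic dependency exists amongst %r" % data

deps = {
    'dw01':  set('ieee dw01 dware gtech'.split()),
    'dware': set('ieee dware'.split()),
    'gtech': set('ieee gtech'.split()),
}
print(list(toposort2(deps)))  # ['ieee', 'dware', 'gtech', 'dw01']
```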

Here is a Julia translation from the site:

function toposort2(data::Dict{T,Set{T}}) where T
    for (k, v) in data
        delete!(v, k)
    end
    extraitems = setdiff(reduce(∪, values(data)), keys(data))
    for item in extraitems
        data[item] = Set{T}()
    end
    rst = Vector{T}()
    while true
        ordered = Set(item for (item, dep) in data if isempty(dep))
        if isempty(ordered) break end
        append!(rst, ordered)
        data = Dict{T,Set{T}}(item => setdiff(dep, ordered) for (item, dep) in data if item ∉ ordered)
    end
    @assert isempty(data) "a cyclic dependency exists amongst $(keys(data))"
    return rst
end

And this is the benchmark I got:

data = Dict{String,Set{String}}(
    "des_system_lib" => Set(split("std synopsys std_cell_lib des_system_lib dw02 dw01 ramlib ieee")),
    "dw01" =>           Set(split("ieee dw01 dware gtech")),
    "dw02" =>           Set(split("ieee dw02 dware")),
    "dw03" =>           Set(split("std synopsys dware dw03 dw02 dw01 ieee gtech")),
    "dw04" =>           Set(split("dw04 ieee dw01 dware gtech")),
    "dw05" =>           Set(split("dw05 ieee dware")),
    "dw06" =>           Set(split("dw06 ieee dware")),
    "dw07" =>           Set(split("ieee dware")),
    "dware" =>          Set(split("ieee dware")),
    "gtech" =>          Set(split("ieee gtech")),
    "ramlib" =>         Set(split("std ieee")),
    "std_cell_lib" =>   Set(split("ieee std_cell_lib")),
    "synopsys" =>       Set("")
    )

using BenchmarkTools
@benchmark toposort2(data)

BenchmarkTools.Trial: 
  memory estimate:  31.05 KiB
  allocs estimate:  447
  --------------
  minimum time:     15.935 μs (0.00% GC)
  median time:      16.964 μs (0.00% GC)
  mean time:        24.255 μs (26.77% GC)
  maximum time:     35.626 ms (99.92% GC)
  --------------
  samples:          10000
  evals/sample:     1

Less than a 25% improvement on a purely algorithmic task, where Julia is advertised as having a significant margin over Python. Not that impressive at all, and the Julia version is not even lazy, unlike the Python code.

I have modified the Julia version to use channels for lazy generation, to recreate the Python version more precisely:

function toposort3(data::Dict{T,Set{T}}) where T
    for (k, v) in data
        delete!(v, k)
    end
    extraitems = setdiff(reduce(∪, values(data)), keys(data))
    for item in extraitems
        data[item] = Set{T}()
    end
    rst = Vector{T}()
    Channel(ctype=T) do c
        while true
            ordered = Set(item for (item, dep) in data if isempty(dep))
            if isempty(ordered) break end
            for el in ordered push!(c, el) end
            data = Dict{T,Set{T}}(item => setdiff(dep, ordered) for (item, dep) in data if item ∉ ordered)
        end
        @assert isempty(data) "a cyclic dependency exists amongst $(keys(data))"
    end
end

Now the performance is:

@benchmark collect(toposort3(data))

BenchmarkTools.Trial: 
  memory estimate:  42.02 KiB
  allocs estimate:  603
  --------------
  minimum time:     68.883 μs (0.00% GC)
  median time:      72.482 μs (0.00% GC)
  mean time:        87.621 μs (9.52% GC)
  maximum time:     35.695 ms (99.61% GC)
  --------------
  samples:          10000
  evals/sample:     1

Either Python is not as slow as I was led to believe, or the Julia implementation is missing something important here. I am doing this just out of curiosity, so I don’t want to waste your time with these code listings. I understand that performance isn’t all there is to Julia; still, it may be hard to convince someone to learn a new language if the benefit isn’t readily apparent.

Thank you for your comments!


#2

This looks pretty quick, but may need more than one pass:

function xlookup(data)
  k = Dict();
  for (i, x) in enumerate(data)
    k[x.first] = i;
    end;
  return(k);
  end;

function xtop(data)
  k = xlookup(data);
  for x in data
    i = k[x.first];
    maxh = i-1;
    for n in x.second
      n==x.first && continue;
      if(!haskey(k, n))  continue; end; ##println("Key missing! ", n);
      h = k[n];
      if(h > maxh)  maxh = h; end;
      end;
    k[x.first] = maxh+1;
    end;
  return k;
  end;

@time xtop(data) - 0.000036 sec, 11 allocations


#3

Sorry, your code does not seem to do a topological sort. However, my question wasn’t about HOW to do it. I was mostly wondering why the vanilla Python code above (simple loops, dict and set manipulations, without any numpy, numba, etc.) was not slower, and strictly speaking was even faster, than ‘identical’ Julia code. Have I misinterpreted the benchmark results?


#4

It did give a valid set of topological ranks in this instance, though. Sorted by rank:

Any["dware" Set(["ieee", "dware"]) 3]
Any["gtech" Set(["ieee", "gtech"]) 4]
Any["dw01" Set(["ieee", "dw01", "dware", "gtech"]) 5]
Any["dw04" Set(["ieee", "dw01", "dware", "gtech", "dw04"]) 6]
Any["synopsys" Set(String[]) 6]
Any["dw02" Set(["ieee", "dw02", "dware"]) 7]
Any["dw03" Set(["synopsys", "dw03", "dw02", "dw01", "dware", "ieee", "gtech", "std"]) 8]
Any["std_cell_lib" Set(["ieee", "std_cell_lib"]) 8]
Any["dw05" Set(["dw05", "ieee", "dware"]) 9]
Any["dw07" Set(["ieee", "dware"]) 11]
Any["ramlib" Set(["ieee", "std"]) 12]
Any["des_system_lib" Set(["ramlib", "synopsys", "des_system_lib", "dw02", "ieee", "dw01", "std_cell_lib", "std"]) 13]
Any["dw06" Set(["ieee", "dw06", "dware"]) 13]

I guess you could port it to Python and benchmark again: random code vs. random code.


#5

Your innermost loop is inside a call to setdiff, for which the corresponding dep - ordered Python code is a built-in function that is implemented in well-optimized C code. (The individual elements are also strings, for which Python is again calling highly optimized C for hashing, comparison, etc.) Moreover, you are allocating new objects continually inside your loops, which is not something that will come close to cpu-bound performance in any language. Under these circumstances, the overhead of the Python interpreter is not limiting.

If you want to exceed the Python speed here, you probably need to re-think the algorithm and data structures.
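To make that concrete, one such rethink is Kahn's algorithm with a precomputed in-degree table and an explicit queue, so the inner loop only decrements counters instead of rebuilding dictionaries and sets every round. This sketch (the code and the name `toposort_kahn` are mine, not from the thread) keeps the same input convention, a dict mapping each item to its set of dependencies:

```python
from collections import deque

def toposort_kahn(data):
    # Build successor lists and in-degree counts once, up front.
    data = {k: set(v) - {k} for k, v in data.items()}  # drop self-deps
    nodes = set(data) | set().union(*data.values())
    indeg = {n: 0 for n in nodes}   # number of unmet dependencies
    succ = {n: [] for n in nodes}   # items that depend on n
    for item, deps in data.items():
        for d in deps:
            succ[d].append(item)
            indeg[item] += 1
    queue = deque(n for n in nodes if indeg[n] == 0)
    out = []
    while queue:
        n = queue.popleft()
        out.append(n)
        for m in succ[n]:
            indeg[m] -= 1           # no new sets or dicts allocated here
            if indeg[m] == 0:
                queue.append(m)
    assert len(out) == len(nodes), "a cyclic dependency exists"
    return out
```

The hot loop touches only integer counters and preallocated lists, which is the kind of structure that plays to Julia's (or C's) strengths far more than repeated dictionary comprehensions.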


#6

Try it with less obscure input:

data2 = Dict{Char,Set{Char}}( 'f' => Set(['g', 'h', 'i']),
  'd' => Set(['e']),
  'e' => Set(['f', 'g', 'i', 'k']),
  'j' => Set(['l', 'k']),
  'h' => Set(['i', 'k']),
  'i' => Set(['j', 'k']),
  'k' => Set(['l']),
  'c' => Set(['f', 'j', 'e']),
  'a' => Set(['b']),
  'g' => Set(['h', 'i']),
  'b' => Set(['c']))

toposort2(data2)

12-element Array{Char,1}:
 'l'
 'k'
 'j'
 'i'
 'h'
 'g'
 'f'
 'e'
 'd'
 'c'
 'b'
 'a'

xtop(data2)

Dict{Any,Any} with 11 entries:
  'f' => 11
  'd' => 4
  'e' => 12
  'j' => 8
  'h' => 8
  'i' => 9
  'k' => 7
  'c' => 13
  'a' => 12
  'g' => 10
  'b' => 14

#7

The innermost loop in Julia:

data = Dict{T,Set{T}}(item => setdiff(dep, ordered) for (item, dep) in data if item ∉ ordered)

in Python:

data = {item: (dep - ordered) for item,dep in data.items()
                if item not in ordered}

Practically identical.
I have modified the algorithm to a destructuring form so as to reduce allocations:

function toposort_destructuring(data::Dict{T,Set{T}}) where T
    data = deepcopy(data)
    for (k, v) in data
        delete!(v, k)
    end
    extraitems = setdiff(reduce(∪, values(data)), keys(data))
    for item in extraitems
        data[item] = Set{T}()
    end
    rst = Vector{T}()
    while (tail = findfirst( isempty, data)) != nothing
        push!(rst, tail)
        delete!(data, tail)
        for (_, dep) in data
                delete!(dep, tail)
        end
    end
    @assert isempty(data) "a cyclic dependency exists amongst $(keys(data))"
    return rst
end

The benchmark was even worse:

@benchmark toposort_destructuring(data)

BenchmarkTools.Trial: 
  memory estimate:  19.53 KiB
  allocs estimate:  183
  --------------
  minimum time:     18.505 μs (0.00% GC)
  median time:      19.534 μs (0.00% GC)
  mean time:        23.121 μs (11.39% GC)
  maximum time:     46.126 ms (99.91% GC)
  --------------
  samples:          207866
  evals/sample:     1

With a bigger input set, however, the destructuring algorithm outperformed the original, but Python was faster still. Take a look:

Julia:

str=collect("" * a * b for a in 'a':'z' for b in 'a':'z')
lst = zip(str, Iterators.drop(str,1))
tst = collect((a, Set((b,))) for (a, b) in lst)
dict = Dict(tst)

@benchmark toposort2(dict)

BenchmarkTools.Trial: 
  memory estimate:  164.59 MiB
  allocs estimate:  2307031
  --------------
  minimum time:     99.260 ms (20.26% GC)
  median time:      103.084 ms (22.06% GC)
  mean time:        103.281 ms (21.54% GC)
  maximum time:     114.402 ms (23.58% GC)
  --------------
  samples:          49
  evals/sample:     1

@benchmark toposort_destructuring(dict)

BenchmarkTools.Trial: 
  memory estimate:  8.08 MiB
  allocs estimate:  10504
  --------------
  minimum time:     18.658 ms (0.00% GC)
  median time:      20.467 ms (0.00% GC)
  mean time:        20.688 ms (3.79% GC)
  maximum time:     26.442 ms (10.46% GC)
  --------------
  samples:          242
  evals/sample:     1

Python:

import string
str = ["" + a + b for a in string.ascii_lowercase for b in string.ascii_lowercase]
lst = zip(str, str[1:])
lss = [(a, set((b,))) for (a, b) in lst]
dct = dict(lss)

%timeit list(toposort2(dct) )

10 loops, best of 3: 55.8 ms per loop
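For completeness, a rough Python port of the destructuring variant might look like this (my translation with my naming; it was not benchmarked in this thread):

```python
def toposort_destructuring_py(data):
    # Work on a copy so the caller's dict is untouched,
    # mirroring the deepcopy in the Julia version.
    data = {k: set(v) - {k} for k, v in data.items()}
    extra = set().union(*data.values()) - set(data)
    data.update({item: set() for item in extra})
    result = []
    while True:
        # Find any item with no remaining dependencies.
        tail = next((k for k, v in data.items() if not v), None)
        if tail is None:
            break
        result.append(tail)
        del data[tail]
        for dep in data.values():
            dep.discard(tail)  # mutate in place instead of rebuilding
    assert not data, "a cyclic dependency exists amongst %r" % set(data)
    return result
```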

With laziness added to the equation, Julia’s performance would be even poorer. Python, on the other hand, gets laziness essentially for free, as its generators are lazy by nature. How to effectively implement lazy computations in Julia is still a question for me, as both Channels and ResumableFunctions have horrible performance for this purpose. I have seen the Lazy.jl module, but its documentation is brief and obscure to someone not already familiar with it.
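One detail worth separating from raw speed: the Python generator really is lazy, so each round of work runs only as elements are pulled. A small standalone sketch (my simplified toy, with a `work` list added purely to make the rounds observable):

```python
work = []  # records each round that actually executes

def lazy_rounds(data):
    # Structured like toposort2: each while-iteration is one "round".
    data = {k: set(v) for k, v in data.items()}
    while True:
        ordered = sorted(k for k, v in data.items() if not v)
        if not ordered:
            break
        work.append(len(data))  # side effect: this round ran
        yield from ordered
        data = {k: v - set(ordered) for k, v in data.items()
                if k not in ordered}

g = lazy_rounds({'a': set(), 'b': {'a'}, 'c': {'b'}})
assert work == []        # creating the generator does no work at all
assert next(g) == 'a'
assert work == [3]       # pulling one element ran only the first round
```

The Channel-based `toposort3` above pays task-switching overhead for every element to recreate this behavior, which is where much of its 68 µs goes.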


#8

Adding an extra loop, you can roughly approximate how xtop() scales for worst-case inputs:

function xtop(data, xloop = 3)
  k = xlookup(data);
  for iter = 1:xloop
  for x in data ...

> @time xtop(data2, 1)     0.000025
> @time xtop(data2, 10)    0.000082
> @time xtop(data2, 100)   0.000653
> @time xtop(data2, 1000)  0.006233      ~best case for 10000 * 11 = 110000 length &
> @time xtop(data2, 10000) 0.060356 - ~worst case for sqrt(10000 * 11) = 332 length

#9

Practically identical

Seriously, you can’t expect the same (poor?) implementation in different languages to perform very differently. Try writing the same implementation even in C and I bet the performance won’t be much better.

You constantly allocate and deallocate hash maps, and that simply cannot have good performance, as @stevengj pointed out. This is a slow take on this algorithm, and no programming language can change that.

@y4lu provided a different (better) implementation. Try porting that to Python and see how performant it is.
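Taking up that suggestion, here is a rough Python port of xtop() (my translation; like the original it assigns heuristic ranks in a single pass over the dict and, as noted above, may need more passes for unluckily ordered input):

```python
def xlookup(data):
    # Initial rank: position of each key in iteration order.
    return {key: i for i, key in enumerate(data)}

def xtop(data):
    k = xlookup(data)
    for key, deps in data.items():
        maxh = k[key] - 1
        for n in deps:
            if n == key or n not in k:
                continue  # skip self-deps and unknown keys
            maxh = max(maxh, k[n])  # rank must exceed every known dep
        k[key] = maxh + 1
    return k
```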


#10

A relatively performant and readable implementation is in LightGraphs.jl.

julia> using BenchmarkTools, LightGraphs
julia> function dict_to_graph(d::Dict{T, Set{T}}) where T
       g = DiGraph(length(d))
       enum = Dict{T,Int64}()
       kk=1
       for k in keys(d)
           enum[k]=kk
           kk+=1
       end

       for (k,v) in d
           ki = enum[k]
           for vv in v
               if vv in keys(enum)
                   add_edge!(g, ki, enum[vv])
               end
           end
       end
       return (g,enum)
       end;

julia> data2 = Dict{Char,Set{Char}}( 'f' => Set(['g', 'h', 'i']),
         'd' => Set(['e']),
         'e' => Set(['f', 'g', 'i', 'k']),
         'j' => Set(['l', 'k']),
         'h' => Set(['i', 'k']),
         'i' => Set(['j', 'k']),
         'k' => Set(['l']),
         'c' => Set(['f', 'j', 'e']),
         'a' => Set(['b']),
         'g' => Set(['h', 'i']),
         'b' => Set(['c']));

This gives

julia> @btime dict_to_graph($data2);
  3.396 μs (52 allocations: 4.70 KiB)
julia> g,enum = dict_to_graph(data2);
julia> @btime topological_sort_by_dfs($g);
  921.875 ns (18 allocations: 1.58 KiB)

julia> ienum = Vector{eltype(enum.keys)}(undef, length(enum));

julia> for (k,v) in enum
       ienum[v]=k
       end;

julia> ienum[topological_sort_by_dfs(g)]
11-element Array{Char,1}:
 'a'
 'b'
 'c'
 'd'
 'e'
 'f'
 'g'
 'h'
 'i'
 'j'
 'k'

julia> ienum[topological_sort_by_dfs(reverse(g))]
11-element Array{Char,1}:
 'k'
 'j'
 'i'
 'h'
 'g'
 'f'
 'e'
 'c'
 'b'
 'a'
 'd'

#11

Thanks for all your replies!

Again, my question was not how to implement a topological sort, otherwise I would be using library code or some module. My question was why Julia is slower than Python on THAT particular code, when it is supposed to be faster, so as to better understand the practice of writing Julia code. Excessive memory allocations and not enough laziness in the computations would, I guess, be the answer.
I’m sorry for the misunderstanding.


#12

Why would you assume Julia is faster than Python when all of the time in Python is spent inside optimized C code? You aren’t even running actual Python code… (Julia’s code for Sets is, of course, implemented in Julia.)

Where Julia shines is when you have code that is not of the form where you can just offload the main computation to a C implementation someone else wrote, and you have to write the code yourself.


#13

The ‘identical code’ idea is a bit misplaced, though, as it either compiles to the same instructions in the two languages (and you get the exact same benchmark) or it compiles to different instructions (which is about the same as using different code).
I don’t think Rosetta Code’s intention was at all performance-focused in this case.


#14

Wasn’t Julia 25% faster in the original direct translation? Doing substantially better than Python on a very Pythonic style of code seems pretty decent. I can certainly understand that you want something faster, but for that a different style of algorithm is required.

I would be interested in the fastest toposort one can write in Julia. The LightGraphs version is already impressively fast; it might be possible to do better, however.


#15

On a smaller input, yes. On a larger input, however, Julia was nearly 2 times slower than Python. The reason is memory allocations: Python somehow re-uses memory when generating the new dictionary from the old one, which gives it an advantage.
My second question was about lazy computations, as the Python code uses yield. I tried both Channels and ResumableFunctions, and the result was, unfortunately, equally poor in terms of performance. I didn’t try Lazy.jl, though, as I couldn’t find a simple tutorial on how to do this without rewriting the original code significantly.
Thank you! :slight_smile:


#16

Reference counting allows this kind of thing. In Julia it’s idiomatic to explicitly mutate data structures instead of relying on the language to do it implicitly.

Yield isn’t really lazy. Terminology aside, Python has heavily optimized this pattern because it’s a common way to implement things in Python. This is not a common way to write high performance code in Julia, and it could certainly be optimized further.
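In Python terms, the distinction between implicit reuse and explicit mutation is the difference between `a - done` (allocate a new set) and `a.difference_update(done)` (mutate in place); the Julia analogue is `setdiff` versus `setdiff!`. A minimal sketch:

```python
a = {'ieee', 'dware', 'gtech'}
done = {'ieee'}

b = a - done               # allocates a brand-new set on every call
a.difference_update(done)  # mutates `a` in place, no new set created

assert a == b == {'dware', 'gtech'}
assert a is not b          # same contents, but distinct objects
```

Rewriting the hot loop around the mutating form is the idiomatic way to cut allocations in Julia.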


#17

Line-for-line translations from another language into Julia are hardly guaranteed to give better performance—to take advantage of Julia you generally need to write a different style of code than what you are used to in Python. The goal of Julia is not to make it impossible to write slow code, it is to make it possible to write fast code.

In this particular case, you are comparing slow code in Python to slow code in Julia and the slow Python code (whose inner loops are inside well-optimized C library routines) happens to be a bit faster, but this is not an especially interesting comparison because continually allocating lots of dictionaries within inner loops is not how you would write optimized Julia code.


#18

This is interesting in the same way as benchmarking recursive Fibonacci, because: “It is important to note that the benchmark codes are not written for absolute maximal performance. Instead, the benchmarks are written to test the performance of identical algorithms and code patterns implemented in each language.” (Source: https://julialang.org/benchmarks/)

ping @Sukera (our discussion here ):

We now have this competent answer.

We now have these possibilities for the Performance Tips section of the docs:
a) keep not discouraging people from using generators, because “we have them” :stuck_out_tongue_winking_eye:
b) admit that this code pattern is better implemented in Python (*) and add a performance tip not to use it in performance-critical parts of code (this can be changed in the future!)
c) pretend this is not happening.

(*) We don’t need to mention Python there; it is enough to say that a generator (cite:) “is not a common way to write high performance code in Julia”. And we could certainly add this to the doc page Noteworthy differences from Python, because it will bite many Pythonistas trying to port their code!


#19

That thread you linked is about the exact same problem: not reusing things, but rather continuously allocating new objects in a tight loop. This is precisely one of the cases where using a generator is just wrong. Even though it’s only marginally related, I stand by my point: generators are fine when you reuse them or their contents, but not when you continuously recreate them.


#20

I think it is a misinterpretation of

%timeit list(toposort2(data))
10000 loops, best of 3: 20.2 µs per loop

and

  minimum time:     15.935 μs (0.00% GC)
  median time:      16.964 μs (0.00% GC)
  mean time:        24.255 μs (26.77% GC)
  maximum time:     35.626 ms (99.92% GC)

where the OP compared the minimum time in Julia to the average time in Python (more precisely, the best of 3 averages over 10000 iterations each). (I read it as Julia being about 20% slower.)