Topological sort (performance)

Hi.

I needed to implement a topological sort, so I found some implementations on the Internet here: https://rosettacode.org/wiki/Topological_sort

I initially needed it in Python and experimented with Julia just out of curiosity. Here is the Python code:

try:
    from functools import reduce  # Python 3 moved reduce into functools
except ImportError:
    pass  # Python 2: reduce is a builtin
 
data = {
    'des_system_lib':   set('std synopsys std_cell_lib des_system_lib dw02 dw01 ramlib ieee'.split()),
    'dw01':             set('ieee dw01 dware gtech'.split()),
    'dw02':             set('ieee dw02 dware'.split()),
    'dw03':             set('std synopsys dware dw03 dw02 dw01 ieee gtech'.split()),
    'dw04':             set('dw04 ieee dw01 dware gtech'.split()),
    'dw05':             set('dw05 ieee dware'.split()),
    'dw06':             set('dw06 ieee dware'.split()),
    'dw07':             set('ieee dware'.split()),
    'dware':            set('ieee dware'.split()),
    'gtech':            set('ieee gtech'.split()),
    'ramlib':           set('std ieee'.split()),
    'std_cell_lib':     set('ieee std_cell_lib'.split()),
    'synopsys':         set(),
    }
 
def toposort2(data):
    for k, v in data.items():
        v.discard(k) # Ignore self dependencies
    extra_items_in_deps = reduce(set.union, data.values()) - set(data.keys())
    data.update({item:set() for item in extra_items_in_deps})
    while True:
        ordered = set(item for item,dep in data.items() if not dep)
        if not ordered:
            break
        for el in sorted(ordered):
            yield el
        data = {item: (dep - ordered) for item,dep in data.items()
                if item not in ordered}
    assert not data, "A cyclic dependency exists amongst %r" % data

and I got this performance:

%timeit list(toposort2(data))
10000 loops, best of 3: 20.2 µs per loop

Here is a Julia translation from the site:

function toposort2(data::Dict{T,Set{T}}) where T
    for (k, v) in data
        delete!(v, k)
    end
    extraitems = setdiff(reduce(∪, values(data)), keys(data))
    for item in extraitems
        data[item] = Set{T}()
    end
    rst = Vector{T}()
    while true
        ordered = Set(item for (item, dep) in data if isempty(dep))
        if isempty(ordered) break end
        append!(rst, ordered)
        data = Dict{T,Set{T}}(item => setdiff(dep, ordered) for (item, dep) in data if item ∉ ordered)
    end
    @assert isempty(data) "a cyclic dependency exists amongst $(keys(data))"
    return rst
end

And this is the benchmark I got:

data = Dict{String,Set{String}}(
    "des_system_lib" => Set(split("std synopsys std_cell_lib des_system_lib dw02 dw01 ramlib ieee")),
    "dw01" =>           Set(split("ieee dw01 dware gtech")),
    "dw02" =>           Set(split("ieee dw02 dware")),
    "dw03" =>           Set(split("std synopsys dware dw03 dw02 dw01 ieee gtech")),
    "dw04" =>           Set(split("dw04 ieee dw01 dware gtech")),
    "dw05" =>           Set(split("dw05 ieee dware")),
    "dw06" =>           Set(split("dw06 ieee dware")),
    "dw07" =>           Set(split("ieee dware")),
    "dware" =>          Set(split("ieee dware")),
    "gtech" =>          Set(split("ieee gtech")),
    "ramlib" =>         Set(split("std ieee")),
    "std_cell_lib" =>   Set(split("ieee std_cell_lib")),
    "synopsys" =>       Set{String}()  # note: Set("") would be an empty Set{Char}
    )

using BenchmarkTools
@benchmark toposort2(data)

BenchmarkTools.Trial: 
  memory estimate:  31.05 KiB
  allocs estimate:  447
  --------------
  minimum time:     15.935 μs (0.00% GC)
  median time:      16.964 μs (0.00% GC)
  mean time:        24.255 μs (26.77% GC)
  maximum time:     35.626 ms (99.92% GC)
  --------------
  samples:          10000
  evals/sample:     1

Less than a 25% improvement on a purely algorithmic task, where Julia should have a significant margin over Python, as advertised. Not that impressive at all, and the Julia version is not even lazy, compared to the Python code.

I have modified the Julia version to use channels for lazy generation, to recreate the Python version more precisely:

function toposort3(data::Dict{T,Set{T}}) where T
    for (k, v) in data
        delete!(v, k)
    end
    extraitems = setdiff(reduce(∪, values(data)), keys(data))
    for item in extraitems
        data[item] = Set{T}()
    end
    Channel(ctype=T) do c
        while true
            ordered = Set(item for (item, dep) in data if isempty(dep))
            if isempty(ordered) break end
            for el in ordered push!(c, el) end
            data = Dict{T,Set{T}}(item => setdiff(dep, ordered) for (item, dep) in data if item ∉ ordered)
        end
        @assert isempty(data) "a cyclic dependency exists amongst $(keys(data))"
    end
end

Now the performance is:

@benchmark collect(toposort3(data))

BenchmarkTools.Trial: 
  memory estimate:  42.02 KiB
  allocs estimate:  603
  --------------
  minimum time:     68.883 μs (0.00% GC)
  median time:      72.482 μs (0.00% GC)
  mean time:        87.621 μs (9.52% GC)
  maximum time:     35.695 ms (99.61% GC)
  --------------
  samples:          10000
  evals/sample:     1

Either Python is not as slow as I was led to believe, or the Julia implementation is missing something important here. I am doing this just out of curiosity, so I don't want to waste your time with these code listings. I understand that performance isn't all there is to Julia; still, it may be hard to convince someone to learn a new language if the benefit isn't really that apparent.

Thank you for your comments!


This looks pretty quick, but may need more than 1 pass

function xlookup(data)
  k = Dict();
  for (i, x) in enumerate(data)
    k[x.first] = i;
    end;
  return(k);
  end;

function xtop(data)
  k = xlookup(data);
  for x in data
    i = k[x.first];
    maxh = i-1;
    for n in x.second
      n==x.first && continue;
      if(!haskey(k, n))  continue; end; ##println("Key missing! ", n);
      h = k[n];
      if(h > maxh)  maxh = h; end;
      end;
    k[x.first] = maxh+1;
    end;
  return k;
  end;

@time xtop(data) - 0.000036 sec, 11 allocations


Sorry, your code does not seem to do a topological sort. However, my question wasn't about HOW to do it. I was mostly wondering why the vanilla Python code above, with simple loops and dict and set manipulations, without any numpy, numba, etc., was not slower, and strictly speaking was even faster, than 'identical' Julia code. Have I misinterpreted the benchmark results?

It did give a valid set of topological ranks though in this instance, so when sorted

Any["dware" Set(["ieee", "dware"]) 3]
Any["gtech" Set(["ieee", "gtech"]) 4]
Any["dw01" Set(["ieee", "dw01", "dware", "gtech"]) 5]
Any["dw04" Set(["ieee", "dw01", "dware", "gtech", "dw04"]) 6]
Any["synopsys" Set(String[]) 6]
Any["dw02" Set(["ieee", "dw02", "dware"]) 7]
Any["dw03" Set(["synopsys", "dw03", "dw02", "dw01", "dware", "ieee", "gtech", "std"]) 8]
Any["std_cell_lib" Set(["ieee", "std_cell_lib"]) 8]
Any["dw05" Set(["dw05", "ieee", "dware"]) 9]
Any["dw07" Set(["ieee", "dware"]) 11]
Any["ramlib" Set(["ieee", "std"]) 12]
Any["des_system_lib" Set(["ramlib", "synopsys", "des_system_lib", "dw02", "ieee", "dw01", "std_cell_lib", "std"]) 13]
Any["dw06" Set(["ieee", "dw06", "dware"]) 13]

I guess you could port it to Python and benchmark again, random code vs random code.

Your innermost loop is inside a call to setdiff, for which the corresponding dep - ordered Python code is a built-in function that is implemented in well-optimized C code. (The individual elements are also strings, for which Python is again calling highly optimized C for hashing, comparison, etc.) Moreover, you are allocating new objects continually inside your loops, which is not something that will come close to cpu-bound performance in any language. Under these circumstances, the overhead of the Python interpreter is not limiting.

If you want to exceed the Python speed here, you probably need to re-think the algorithm and data structures.
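As an illustration of such a rethink, here is a rough sketch of Kahn's algorithm in Python (names and structure are my own, not from the thread): build reverse edges and in-degree counters once, then mutate the counters in place instead of rebuilding the whole dictionary on every round.

```python
from collections import deque

def toposort_kahn(data):
    """Kahn's algorithm: track in-degree counts and mutate them in place
    instead of rebuilding the dict of pending dependencies each round."""
    indegree = {}    # node -> number of unresolved dependencies
    dependents = {}  # node -> list of nodes that depend on it
    for node, deps in data.items():
        indegree.setdefault(node, 0)
        for d in deps:
            if d == node:          # ignore self-dependencies
                continue
            dependents.setdefault(d, []).append(node)
            indegree.setdefault(d, 0)
            indegree[node] += 1
    queue = deque(sorted(n for n, c in indegree.items() if c == 0))
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in dependents.get(n, ()):
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    if len(order) != len(indegree):
        raise ValueError("cyclic dependency")
    return order

print(toposort_kahn({'b': {'a'}, 'c': {'a', 'b'}, 'a': set()}))  # ['a', 'b', 'c']
```

The same structure translates directly to Julia with a `Dict{T,Int}` of counters and a `Vector{T}` used as the queue, with no per-round allocation.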


Try it with less obscure input:

data2 = Dict{Char,Set{Char}}( 'f' => Set(['g', 'h', 'i']),
  'd' => Set(['e']),
  'e' => Set(['f', 'g', 'i', 'k']),
  'j' => Set(['l', 'k']),
  'h' => Set(['i', 'k']),
  'i' => Set(['j', 'k']),
  'k' => Set(['l']),
  'c' => Set(['f', 'j', 'e']),
  'a' => Set(['b']),
  'g' => Set(['h', 'i']),
  'b' => Set(['c']))

toposort2(data2)

12-element Array{Char,1}:
 'l'
 'k'
 'j'
 'i'
 'h'
 'g'
 'f'
 'e'
 'd'
 'c'
 'b'
 'a'

xtop(data2)

Dict{Any,Any} with 11 entries:
  'f' => 11
  'd' => 4
  'e' => 12
  'j' => 8
  'h' => 8
  'i' => 9
  'k' => 7
  'c' => 13
  'a' => 12
  'g' => 10
  'b' => 14

The innermost loop in Julia:

data = Dict{T,Set{T}}(item => setdiff(dep, ordered) for (item, dep) in data if item ∉ ordered)

in Python:

data = {item: (dep - ordered) for item,dep in data.items()
                if item not in ordered}

Practically identical.
I have modified the algorithm to a destructuring (in-place) form to reduce allocations:

function toposort_destructuring(data::Dict{T,Set{T}}) where T
    data = deepcopy(data)
    for (k, v) in data
        delete!(v, k)
    end
    extraitems = setdiff(reduce(∪, values(data)), keys(data))
    for item in extraitems
        data[item] = Set{T}()
    end
    rst = Vector{T}()
    while (tail = findfirst( isempty, data)) != nothing
        push!(rst, tail)
        delete!(data, tail)
        for (_, dep) in data
                delete!(dep, tail)
        end
    end
    @assert isempty(data) "a cyclic dependency exists amongst $(keys(data))"
    return rst
end

The benchmark was even worse:

@benchmark toposort_destructuring(data)

BenchmarkTools.Trial: 
  memory estimate:  19.53 KiB
  allocs estimate:  183
  --------------
  minimum time:     18.505 μs (0.00% GC)
  median time:      19.534 μs (0.00% GC)
  mean time:        23.121 μs (11.39% GC)
  maximum time:     46.126 ms (99.91% GC)
  --------------
  samples:          207866
  evals/sample:     1

With a bigger input set, however, the destructuring algorithm outperformed the original, but Python was even faster than Julia. Take a look:

Julia:

str=collect("" * a * b for a in 'a':'z' for b in 'a':'z')
lst = zip(str, Iterators.drop(str,1))
tst = collect((a, Set((b,))) for (a, b) in lst)
dict = Dict(tst)

@benchmark toposort2(dict)

BenchmarkTools.Trial: 
  memory estimate:  164.59 MiB
  allocs estimate:  2307031
  --------------
  minimum time:     99.260 ms (20.26% GC)
  median time:      103.084 ms (22.06% GC)
  mean time:        103.281 ms (21.54% GC)
  maximum time:     114.402 ms (23.58% GC)
  --------------
  samples:          49
  evals/sample:     1

@benchmark toposort_destructuring(dict)

BenchmarkTools.Trial: 
  memory estimate:  8.08 MiB
  allocs estimate:  10504
  --------------
  minimum time:     18.658 ms (0.00% GC)
  median time:      20.467 ms (0.00% GC)
  mean time:        20.688 ms (3.79% GC)
  maximum time:     26.442 ms (10.46% GC)
  --------------
  samples:          242
  evals/sample:     1

Python:

import string
str = ["" + a + b for a in string.ascii_lowercase for b in string.ascii_lowercase]
lst = zip(str, str[1:])
lss = [(a, set((b,))) for (a, b) in lst]
dct = dict(lss)

%timeit list(toposort2(dct))

10 loops, best of 3: 55.8 ms per loop

With laziness added to the equation, Julia's performance would be even poorer. Laziness might well matter here, as Python generators are lazy by nature. How to implement lazy computation effectively in Julia is still a question, as both Channels and ResumableFunctions have horrible performance for that matter. I have seen the Lazy.jl module, but its documentation is brief and obscure to someone not already familiar with it.
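To make the laziness point concrete, here is a small sketch of my own (not code from the thread) showing that a generator in the style of `toposort2` computes each level only when the consumer asks for it:

```python
from itertools import islice

def lazy_toposort(data):
    """Simplified toposort-style generator: each level of the sort is
    computed only when the consumer pulls items from it."""
    data = {k: set(v) for k, v in data.items()}  # defensive copy
    while data:
        ready = sorted(k for k, v in data.items() if not v)
        if not ready:
            raise ValueError("cyclic dependency")
        yield from ready
        data = {k: v - set(ready) for k, v in data.items() if k not in ready}

deps = {'a': set(), 'b': {'a'}, 'c': {'b'}}
# Pulling two items runs only two rounds; the third round is never computed.
print(list(islice(lazy_toposort(deps), 2)))  # ['a', 'b']
```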

Adding an extra loop, you can roughly approximate how xtop() scales for worst-case inputs:

function xtop(data, xloop = 3)
  k = xlookup(data);
  for iter = 1:xloop
  for x in data ...

> @time xtop(data2, 1)      0.000025
> @time xtop(data2, 10)     0.000082
> @time xtop(data2, 100)    0.000653
> @time xtop(data2, 1000)   0.006233   # ~best case for 10000 * 11 = 110000 length
> @time xtop(data2, 10000)  0.060356   # ~worst case for sqrt(10000 * 11) = 332 length

Practically identical

Seriously, you can't expect the same (poor?) implementation in different languages to perform differently. Try writing the same implementation even in C, and I bet the performance won't be much better.

You constantly allocate and deallocate hash maps, and that simply cannot have good performance, as @stevengj pointed out. This is a slow take on this algorithm, and no programming language can change that.

@y4lu provided a different (better) implementation. Try porting that to Python and see how performant it is.
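For reference, a rough Python port of that approach might look like this (my own sketch; like the original, it does a single forward pass and may need more passes in general):

```python
def xlookup(data):
    # Map each key to its 1-based enumeration position, as in the Julia version.
    return {key: i for i, key in enumerate(data, start=1)}

def xtop(data):
    # Single pass: lift each node's rank above the ranks of the
    # dependencies seen so far (self-deps and unknown keys are skipped).
    k = xlookup(data)
    for key, deps in data.items():
        maxh = k[key] - 1
        for n in deps:
            if n == key or n not in k:
                continue
            maxh = max(maxh, k[n])
        k[key] = maxh + 1
    return k

print(xtop({'a': set(), 'b': {'a'}, 'c': {'a', 'b'}}))
```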


A relatively performant and readable implementation is in LightGraphs.jl.

julia> using BenchmarkTools, LightGraphs
julia> function dict_to_graph(d::Dict{T, Set{T}}) where T
       g = DiGraph(length(d))
       enum = Dict{T,Int64}()
       kk=1
       for k in keys(d)
           enum[k]=kk
           kk+=1
       end

       for (k,v) in d
           ki = enum[k]
           for vv in v
               if vv in keys(enum)
                   add_edge!(g, ki, enum[vv])
               end
           end
       end
       return (g,enum)
       end;

julia> data2 = Dict{Char,Set{Char}}( 'f' => Set(['g', 'h', 'i']),
         'd' => Set(['e']),
         'e' => Set(['f', 'g', 'i', 'k']),
         'j' => Set(['l', 'k']),
         'h' => Set(['i', 'k']),
         'i' => Set(['j', 'k']),
         'k' => Set(['l']),
         'c' => Set(['f', 'j', 'e']),
         'a' => Set(['b']),
         'g' => Set(['h', 'i']),
         'b' => Set(['c']));

This gives

julia> @btime dict_to_graph($data2);
  3.396 μs (52 allocations: 4.70 KiB)
julia> g,enum = dict_to_graph(data2);
julia> @btime topological_sort_by_dfs($g);
  921.875 ns (18 allocations: 1.58 KiB)

julia> ienum = Vector{eltype(enum.keys)}(undef, length(enum));

julia> for (k,v) in enum
       ienum[v]=k
       end;

julia> ienum[topological_sort_by_dfs(g)]
11-element Array{Char,1}:
 'a'
 'b'
 'c'
 'd'
 'e'
 'f'
 'g'
 'h'
 'i'
 'j'
 'k'

julia> ienum[topological_sort_by_dfs(reverse(g))]
11-element Array{Char,1}:
 'k'
 'j'
 'i'
 'h'
 'g'
 'f'
 'e'
 'c'
 'b'
 'a'
 'd'

Thanks for all your replies!

Again, my question was not how to implement a topological sort, otherwise I would be using library code or some module. My question was why Julia is slower than Python on THAT particular code, when it is supposed to be faster, so as to better understand the practices of writing Julia code. Excessive memory allocations and not enough laziness in computations would, I guess, be the answer.
I'm sorry for the misunderstanding.
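Incidentally, recent Python versions ship exactly this as library code: `graphlib.TopologicalSorter` in the standard library (Python 3.9+). A minimal sketch, using the same node-to-dependencies convention as the code above:

```python
from graphlib import TopologicalSorter  # standard library since Python 3.9

# graphlib uses the same convention as the thread: node -> its dependencies.
# Nodes that appear only as dependencies ('a' here) are included automatically.
graph = {'c': {'a', 'b'}, 'b': {'a'}}
order = list(TopologicalSorter(graph).static_order())
print(order)  # ['a', 'b', 'c'] (the only valid order for this chain)
```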


Why would you assume Julia is faster than Python when all of the time in Python is spent inside optimized C code? You aren't even running actual Python code… (Julia's code for Sets is, of course, implemented in Julia.)

Where Julia shines is when you have some code that is not of the form where you can just offload the main computation to a C implementation someone else wrote, and you have to write the code yourself.
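That difference is easy to see directly. A rough micro-benchmark sketch of my own (made-up data) comparing Python's C-implemented set difference against the same computation done with an interpreted, Python-level loop:

```python
import timeit

a = {f"item{i}" for i in range(1000)}
b = {f"item{i}" for i in range(0, 1000, 2)}

# Built-in set difference: the loop over elements runs in CPython's C code.
t_builtin = timeit.timeit(lambda: a - b, number=2000)

# The same result computed with an interpreted Python-level set comprehension.
t_interp = timeit.timeit(lambda: {x for x in a if x not in b}, number=2000)

# Both compute the same set; on CPython the built-in is typically much faster.
print(f"C built-in: {t_builtin:.3f}s   Python-level loop: {t_interp:.3f}s")
```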


The 'identical code' idea is a bit misplaced, though, as either it compiles to the same instructions in the two languages (and you get the exact same benchmark) or it compiles to different instructions (which is about the same as using different code).
I don't think Rosetta Code's intention was at all performance-focused in this case.

Wasn't Julia 25% faster in the original direct translation? Doing substantially better than Python on a very Pythonic style of code seems pretty decent. I can certainly understand that you want something faster, but for that a different style of algorithm is required.

I would be interested in the fastest toposort one can write in Julia. The LightGraphs version is already impressively fast. It might be possible to do better, however.


On a smaller input, yes. On a larger input, however, Julia was nearly 2 times slower than Python. The reason is memory allocations. Python somehow reuses memory when generating the new dictionary from the old one, which gives it an advantage.
My second question was about lazy computations, as the Python code uses yield. I tried both Channels and ResumableFunctions, and the result was, unfortunately, equally poor in terms of performance. I didn't try Lazy.jl, though, as I couldn't find a simple tutorial on how to do this without rewriting the original code significantly.
Thank you! :slight_smile:

Reference counting allows this kind of thing. In Julia it's idiomatic to explicitly mutate data structures instead of relying on the language to do it implicitly.

Yield isn't really lazy. Terminology aside, Python has heavily optimized this pattern because it's a common way to implement things in Python. It is not a common way to write high-performance code in Julia, and it could certainly be optimized further.
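For comparison, the explicit-mutation idiom can be written in Python too. A sketch of my own (not code from the thread) of what the toposort looks like when one dict and its sets are mutated in place instead of being rebuilt every round:

```python
def toposort_inplace(data):
    """Mutation-first variant: one dict and its sets are updated in place
    instead of allocating a fresh dict of fresh sets on every round."""
    data = {k: set(v) for k, v in data.items()}  # copy; caller's data survives
    for k, v in data.items():
        v.discard(k)                        # drop self-dependencies
    for item in set().union(*data.values()) - set(data):
        data[item] = set()                  # deps that are not keys themselves
    order = []
    while data:
        ready = sorted(k for k, v in data.items() if not v)
        if not ready:
            raise ValueError(f"cyclic dependency amongst {set(data)}")
        order.extend(ready)
        for k in ready:
            del data[k]
        for v in data.values():
            v.difference_update(ready)      # in place: no new sets allocated
    return order

print(toposort_inplace({'b': {'a'}, 'c': {'a', 'b'}}))  # ['a', 'b', 'c']
```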


Line-for-line translations from another language into Julia are hardly guaranteed to give better performance; to take advantage of Julia you generally need to write a different style of code than what you are used to in Python. The goal of Julia is not to make it impossible to write slow code, it is to make it possible to write fast code.

In this particular case, you are comparing slow code in Python to slow code in Julia and the slow Python code (whose inner loops are inside well-optimized C library routines) happens to be a bit faster, but this is not an especially interesting comparison because continually allocating lots of dictionaries within inner loops is not how you would write optimized Julia code.


This is about as interesting as benchmarking recursive Fibonacci, because: "It is important to note that the benchmark codes are not written for absolute maximal performance. Instead, the benchmarks are written to test the performance of identical algorithms and code patterns implemented in each language." (Source: Julia Micro-Benchmarks)

ping @Sukera (our discussion here):

We now have this competent answer:

We now have these possibilities for the Performance Tips in the docs:
a) not discourage people from using generators, because "we have it" :stuck_out_tongue_winking_eye:
b) admit that this code pattern is better implemented in Python (*) and add a performance tip not to use it in performance-critical parts of code (this can change in the future!)
c) pretend this is not happening.

(*) We don't need to mention Python there; it is enough to say that a generator (cite:) "is not a common way to write high performance code in Julia". And we should certainly add this to the doc "Noteworthy differences from Python", because it will bite many Pythonistas trying to port their code!

That thread you linked is about the exact same problem: not reusing things, but rather continuously allocating new objects in a tight loop. This is precisely one of the cases where using a generator is just wrong. Even though it's only marginally related, I stand by my point: generators are fine when you reuse them or their contents, but not when you continuously recreate them.


I think it is a misinterpretation of

%timeit list(toposort2(data))
10000 loops, best of 3: 20.2 µs per loop

and

  minimum time:     15.935 μs (0.00% GC)
  median time:      16.964 μs (0.00% GC)
  mean time:        24.255 μs (26.77% GC)
  maximum time:     35.626 ms (99.92% GC)

where the OP compared the minimum time in Julia to the average time in Python (more precisely, the best of 3 averages of 10000 iterations). (I read it as Julia being 20% slower.)