Can I speed up this recursive set implementation?

If this is the case, it doesn’t change much. But maybe I did it too quickly.

To see the difference, you need to test denser bitvectors:

julia> b0=Random.bitrand(1<<20);

julia> @btime findall($b0);
  423.937 μs (3 allocations: 4.00 MiB)

julia> @btime bv2idx($b0);
  856.161 μs (3 allocations: 4.00 MiB)

julia> bv2idxLogical(b)=(1:length(b))[b]
bv2idxLogical (generic function with 1 method)

julia> @btime bv2idxLogical($b0);
  383.551 μs (3 allocations: 4.00 MiB)

PS. I don’t know why findall appears to be slower than logical indexing. At a naive first glance the code looks OK, but this is the first time I’m seeing it. The logical indexing loop, on the other hand, I wrote myself, and I did count cycles with pen & paper. You could fix that with the one-liner Base.findall(b::BitVector) = (1:length(b))[b].

But it’s unclear to me whether that is worth it: the logical indexing / broadcast stuff is somewhat convoluted dispatch / type-soup, and there is value in having clean straight-line code via @less findall(b).

2 Likes

That was just an example designed to give short benchmarking code. The real structure of operations can be much messier, and the same x_i can be involved in several different places. In general it’s far from being a chain; it’s a tree of unions. But I don’t need the intermediate nodes, I only want to materialize the union at the one and only leaf.

I am wondering: What do you think you can gain by lazy set unions?

Naively, if you have a computation res = materialize( A | (B | C) ) (using | for set union), what do you think laziness of B | C can possibly gain you? When you finally materialize res, most implementations will require you to also materialize B | C.

I think that exploring lazy set unions is premature until you have at least a half-baked fairy-tale story how you hypothetically can gain anything at all by that (cache locality? faster memory recycling? less spiky memory use? any computation that you can avoid? a fancy construction that can rapidly compute 4-way unions better than 3x 2-way unions?).

Also consider that you might also have tmp = B|C and then res = materialize( (A | tmp) | (D | tmp) ). If you are lazy and use the obvious materialization strategy, then you must cache the result of materialize(tmp), in order to compute it only once, on pain of exponential (!) slowdown.

Storing this intermediate materialization will cause you some pain with atomics / multi-threading, and will especially force your temporaries to be garbage collected (no way of reusing pre-allocated arrays for the materialization). Either you use the julia allocator / tracing GC for that, or you implement your own reference counting in your recursive set implementation (with associated pain on atomics / multithreading).

The idea is that a lazy union allows me, at the very end, to iterate through all of the elements in the union tree without materializing intermediate nodes as sets in their own right.

How?

What is, algorithmically, your strategy for iterating through a union tree that has any chance of being faster than materialization of the intermediate nodes?

Also: Your tree is not a tree, it is a DAG. The same expression / tmp value can be used multiple times.

Algorithmically it won’t be faster, but at least I won’t have Julia heap-allocating one new set per intermediate node.

Yes indeed

So your answer to “fairy tale gains” is “I can use a pre-allocated temp for intermediate nodes and reuse the memory rapidly, instead of waiting for GC”?

Or is it “I have so many intermediate nodes alive at the same time that I will run out of main memory”?

How then do you deal with the exponential slow-down from DAG-reuse? You probably need to keep the materialization of the intermediate around until it has been fully consumed.

I suggest you check whether GC is indeed the bottleneck of your eager implementation.

How about both?

And by the way, I find your tone slightly dismissive when you speak of my “fairy tale” wishes. That’s okay with me because I’m an experienced forum user and I am aware I know less than the people I ask here (which is the whole point), but I wouldn’t be surprised if less seasoned visitors were discouraged by that kind of reaction.

In some cases it will be a problem (very deep computational graph), in others, not at all. I’m not saying this will be the ideal set type for every scenario, I’m saying it might be interesting in some.

I will, but as stated above, it’s not uniform over all cases. I want to give the user several choices that are each as good as possible in a specific situation.

Apologies! I didn’t want to dismiss your approach, I wanted to invite you to be bold but concrete in what you hope to achieve by laziness. Otherwise the discussion can become too unfocussed.

A different plausible goal could have been: “I observe that when dealing with solverXYZ, the vast majority of intermediate values do not directly contribute to the end result, they are only used for error estimation / tolerance checking. I want to avoid differentiating through them!”

Good idea!
The non-lazy implementation allocates a new set from the parents every time. The lazy one points to the parents instead, at the risk of allowing duplicates when the DAG is traversed at the very end. My hope in doing so is to get fewer heap allocations, at the cost of a potentially larger computation time when I materialize the final lazy set.
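
Schematically, something like this (made-up names, not my actual implementation):

# Eager: every union materializes a fresh Set.
eager_union(a::Set{Int}, b::Set{Int}) = union(a, b)   # one new heap-allocated Set per node

# Lazy: a node only stores references to its parents; nothing is materialized
# until the very end, when the whole DAG is walked once.
struct LazyUnion
    parents::Vector{Any}   # each parent is either a Set{Int} or another LazyUnion
end

lazy_union(a, b) = LazyUnion(Any[a, b])

# Single final materialization: walk the DAG and push every leaf into one output set.
# A node shared by several paths is visited more than once, so duplicates cost time,
# not correctness.
function materialize(u)
    out = Set{Int}()
    stack = Any[u]
    while !isempty(stack)
        node = pop!(stack)
        if node isa LazyUnion
            append!(stack, node.parents)
        else
            union!(out, node)
        end
    end
    return out
end

A, B, C = Set([1, 2]), Set([2, 3]), Set([3, 4])
materialize(lazy_union(A, lazy_union(B, C))) == union(A, B, C)   # true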

You will allocate an intermediate node on the heap – no way will you get fewer heap allocs. They may be smaller, though!

Can you also give a general outline why you want to do the whole set thing in the first place?

I’d be especially interested in a comparison to the following two naive constructions:

  1. Use AD at a point x0 to get the numerical Jacobian. Then look at the exact zeros. Maybe take a handful of nearby small perturbations of x0.
  2. Use finite differences. Again, look for exact zeros.

There are two obvious problems:

  1. You have bad luck and \partial_i f_j(x_0) = 0 but there exists a sequence x_k \to x_0 such that \partial_i f_j(x_k) > 0 for all k. You have claimed that the Jacobian structurally vanishes in a neighborhood, but it doesn’t :(
  2. You do have \partial_i f_j = 0 in a neighborhood of x_0, but numerical / floating point infidelities mess this up. Think f(x) = (1+x)-x or f(x) = sin(x)^2 + cos(x)^2.

Of these, (1) can be solved by randomization and (2) is a giant pain in the butt that your union approach won’t address at all (in mathematical terminology, you’re computing something like “term rank” which overestimates true rank).
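
For concreteness, a minimal sketch of construction (1) plus the randomization fix (the function name, sample count and perturbation radius are made up, purely illustrative):

using ForwardDiff

# Mark an entry as structurally nonzero if it is nonzero at x0 or at any of a
# handful of small random perturbations of x0.
function jacobian_pattern_samples(f, x0; nsamples = 5, radius = 1e-3)
    pattern = abs.(ForwardDiff.jacobian(f, x0)) .> 0
    for _ in 1:nsamples
        x = x0 .+ radius .* randn(length(x0))
        pattern .|= abs.(ForwardDiff.jacobian(f, x)) .> 0   # keep entries that were ever nonzero
    end
    return pattern   # BitMatrix: false means "exact zero at every sampled point"
end

f(x) = [x[1] * x[2], sin(x[2]), max(x[3], 0.0)]   # the last entry has a point-dependent pattern
jacobian_pattern_samples(f, rand(3))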

I’d recommend taking a very short look at Matrices and Matroids for Systems Analysis (Springer). Skim the introduction and then decide whether you care about the contents. The book is good and is not a multi-year journey like the big Zeidler, but it’s not a one-day read either.

1 Like

I’m not sure I understand it clearly enough yet. A more complete example would be best.

Maybe the package SuiteSparseGraphBLAS.jl would be right for this, conceptually and practically. The sparsity pattern can be calculated using linear algebra on the (OR, AND) semiring. Let me try to provide some examples of the package:


julia> using SuiteSparseGraphBLAS

julia> using SparseArrays  # sprand used below comes from SparseArrays

julia> v = GBVector(sprand(Bool, 10,0.6))
10x1 GraphBLAS bool matrix, bitmap by col
  8 entries, memory: 288 bytes

    (1,1)   1
    (2,1)   1
    (4,1)   1
    (5,1)   1
    (6,1)   1
    (8,1)   1
    (9,1)   1
    (10,1)   1


julia> w = GBVector(sprand(Bool, 10,0.6))
10x1 GraphBLAS bool matrix, bitmap by col
  5 entries, memory: 288 bytes

    (1,1)   1
    (3,1)   1
    (4,1)   1
    (7,1)   1
    (8,1)   1


julia> v+w
10x1 GraphBLAS int64_t matrix, full by col
  10 entries, memory: 384 bytes

    (1,1)   1
    (2,1)   1
    (3,1)   1
    (4,1)   1
    (5,1)   1
    (6,1)   1
    (7,1)   1
    (8,1)   1
    (9,1)   1
    (10,1)   1

As can be seen, the sum v + w is the union. Similarly for matrices:

julia> A = GBMatrix(sprand(Bool, 5,5,0.5))
5x5 GraphBLAS bool matrix, bitmap by col
  9 entries, memory: 320 bytes

    (2,1)   1
    (5,1)   1
    (1,2)   1
    (2,2)   1
    (4,2)   1
    (2,3)   1
    (3,3)   1
    (4,4)   1
    (5,4)   1


julia> B = GBMatrix(sprand(Bool, 5,5,0.5))
5x5 GraphBLAS bool matrix, bitmap by col
  16 entries, memory: 320 bytes

    (1,1)   1
    (3,1)   1
    (1,2)   1
    (3,2)   1
    (4,2)   1
    (1,3)   1
    (2,3)   1
    (4,3)   1
    (5,3)   1
    (2,4)   1
    (4,4)   1
    (5,4)   1
    (1,5)   1
    (2,5)   1
    (3,5)   1
    (5,5)   1


julia> A*B
5x5 GraphBLAS bool matrix, bitmap by row
  20 entries, memory: 320 bytes

    (1,3)   1
    (1,4)   1
    (1,5)   1
    (2,1)   1
    (2,2)   1
    (2,3)   1
    (2,4)   1
    (2,5)   1
    (3,1)   1
    (3,2)   1
    (3,5)   1
    (4,2)   1
    (4,3)   1
    (4,4)   1
    (4,5)   1
    (5,1)   1
    (5,2)   1
    (5,3)   1
    (5,4)   1
    (5,5)   1

The matrices A and B can be the sparsity patterns of the Jacobians of g and h where f = g(h(x)), and then A*B is the sparsity pattern of the Jacobian of f (at appropriate evaluation points). Again, this is the regular matrix multiplication on the semiring (OR, AND).

The implementation seems to be efficient.
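
For what it’s worth, the same idea can be sanity-checked with plain SparseArrays (my own sketch, nothing GraphBLAS-specific; compose_pattern is just a made-up name):

using SparseArrays

# If Jg and Jh are Bool sparsity patterns of the Jacobians of g and h, the pattern of
# the Jacobian of g ∘ h is their boolean matrix product (an OR of ANDs over the inner index).
compose_pattern(Jg, Jh) = (Int.(Jg) * Int.(Jh)) .> 0   # accumulate in Int, threshold back to Bool

Jg = sprand(5, 5, 0.5) .!= 0   # random Bool patterns standing in for real Jacobian structures
Jh = sprand(5, 5, 0.5) .!= 0
compose_pattern(Jg, Jh)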

1 Like

I’m currently writing it up for a conference whose name I shall not give, so I’ll link the preprint here as soon as it’s out! But I’ll do a brief summary.

Beyond the problems with numerical cancellations, your two approaches have a deeper issue related to complexity: they require paying n times the cost of the function f. Whether it is with autodiff or finitediff, computing a dense Jacobian costs n Jacobian-vector products, and computing a JVP costs roughly as much as the function itself.

If you know the sparsity pattern ahead of time, you can exploit it in conjunction with graph coloring to greatly reduce the number of products necessary. Essentially, you evaluate several Jacobian columns at once. The following paper is a great summary:

https://epubs.siam.org/doi/10.1137/S0036144504444711

Interestingly, this works even when you have an overestimate of the sparsity pattern: your coloring will just be less tight but your Jacobian will be accurate.
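
Just to illustrate the compression idea (a toy with a hand-made coloring and a pattern written down by hand, not what our package actually does):

using ForwardDiff

# Columns in the same color class share no rows of the pattern, so a single JVP against
# the sum of their basis vectors recovers all of them at once.
f(x) = [x[1]^2 + x[2], x[2] * x[3], x[3] - x[4], x[4] * x[1]]

x0      = rand(4)
pattern = Bool[1 1 0 0; 0 1 1 0; 0 0 1 1; 1 0 0 1]   # assumed known sparsity of the Jacobian of f
colors  = [1, 2, 1, 2]   # hand-made coloring: columns 1 & 3 and 2 & 4 are structurally orthogonal

J = zeros(4, 4)
for c in unique(colors)
    seed = Float64.(colors .== c)                                 # sum of the basis vectors of this class
    jvp  = ForwardDiff.derivative(t -> f(x0 .+ t .* seed), 0.0)   # one JVP for the whole class
    for j in findall(==(c), colors), i in findall(pattern[:, j])
        J[i, j] = jvp[i]   # each row belongs to exactly one column of the class, so decompression is unambiguous
    end
end
J ≈ ForwardDiff.jacobian(f, x0)   # true, using 2 JVPs instead of 4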

You’re completely right and this was in fact my very first idea, the one that started this whole project with @hill. Conceptually this is a bit different, and a lot more work to implement, so it will go in the perspectives of our paper.

Roughly speaking:

  • We replace each scalar y_i with a sparse boolean vector (aka a set of integers) to track the dependencies on the input indices x_j
  • You would replace each vector y with a sparse boolean matrix to track the dependencies on the input vector x

There is a nice correspondence between operations on sparse arrays and operations on sets, which involves replacing + with max / or. Indeed SuiteSparseGraphBLAS.jl is the right tool for the job!
The only trouble is that operator overloading is a bit of a pain in the ass to implement at the array level versus scalar level. So that’s why we started out with scalars, in the same way that ForwardDiff.jl defines scalar Dual numbers but has limited support for array overloads (I think).
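
To make the scalar-level idea concrete, here is a stripped-down sketch of such a tracer (illustration only, much cruder than the real implementation):

# Each value carries the set of input indices it depends on, and every operation
# simply unions those sets.
struct Tracer
    deps::Set{Int}
end

Base.:+(a::Tracer, b::Tracer) = Tracer(union(a.deps, b.deps))
Base.:*(a::Tracer, b::Tracer) = Tracer(union(a.deps, b.deps))
Base.sin(a::Tracer)           = Tracer(a.deps)

# Trace f on symbolic inputs and read off the m×n Bool pattern.
function pattern(f, n)
    x = [Tracer(Set([j])) for j in 1:n]
    y = f(x)
    return [j in yi.deps for yi in y, j in 1:n]
end

f(x) = [x[1] * x[2], sin(x[3]) + x[1]]
pattern(f, 3)   # Bool[1 1 0; 1 0 1]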

The devil is in the parentheses: “at appropriate evaluation points”.
Consider the typical use case given by the convolutional layer of a neural net. For this to work, you want to be able to get the sparsity pattern of conv at a specific point x, and you have to derive it by hand. Ideally you don’t even want this sparse boolean matrix to ever get instantiated: it’s just a linear operator turning sparse matrices into other sparse matrices. But you have to do this for every single array function, and it’s a nightmare.

To complement the analogy with ForwardDiff.jl, this is more the realm of ChainRules.jl, where rules are defined at array level to speed things up.

I apologize to everyone for this weird thread where I exposed a very specific problem first instead of giving the bird’s eye view. I promise the high-level view is coming, but I’m already writing it down in my LaTeX file and I don’t have time to duplicate it just now ^^

3 Likes

Maybe it’s not worth it if you’re in a production context, but when you want to learn, everything makes sense (speaking of soups, as we say in Italian: “ogni cosa fa brodo”, everything adds to the broth).
Digging around a bit with @edit I found this function call (for which I still haven’t found the final code), which improves performance by roughly a factor of 10:


julia> @btime f= findall($b0);
  11.100 μs (1 allocation: 7.94 KiB)

julia> @btime Base.to_indices($b0,($b0,))[1];
  904.878 ns (0 allocations: 0 bytes)


julia> Base.to_indices(b0,(b0,))[1] == findall(b0)
true

Now (my) question is: how is to_indices so efficient?

1 Like

Ah, ok, I misunderstood your whole approach. I was thinking about more “analysis-style” nonlinearities where the sparsity pattern of jacobians is mostly independent of evaluation points (i.e. if \partial_i f_j is not identically zero then it tends to vanish only on a set of codimension one). In that case performance of sparsity pattern discovery is not that important, because you do this only once, and can amortize over a lot of \partial f evaluations.

Whereas it now seems like you care more about ReLU / PL-style nonlinearities, where sparsity patterns very much depend on x_0 and you can therefore amortize sparsity pattern extraction only over very few \partial f evaluations, since the sparsity pattern is only valid in a very small neighborhood of x_0?

Base.to_indices(b0, (b0,))[1] doesn’t do significant work, it only counts the number of ones and creates a wrapper. The meat is in

julia> x=Base.to_indices(b0, (b0,))[1];

julia> r0, s = iterate(x);

julia> @less iterate(x, s)

Both logical indexing and findall have performance of roughly 0.4ns / collected item on my machine, which is ~1.5 cycles. There is a difference between logical indexing and findall of 0.04ns / ~ 0.12 cycles per item, or 3.5 cycles per consumed QWORD. The core loop has planned 20 cycles / QWORD, i.e. 0.6 cycles per item for branch misses. Unfortunately my machine doesn’t have crisp timings (I loved my old broadwell laptop for its cycle-perfect timings on everything, no turbo, just plain exact 2 ghz).

Finally (*), I had seen that, in the case of BitArrays, to_indices leads to the following function (wrapper?) in the julia/base/multidimensional.jl file:

Base.LogicalIndex(b0)

which is based on using the specialized iterate methods to do the collect (I think?).

iterate on BitVector
# When wrapping a BitArray, lean heavily upon its internals.
@inline function iterate(L::LogicalIndex{Int,<:BitArray})
    L.sum == 0 && return nothing
    Bc = L.mask.chunks
    return iterate(L, (1, 1, (), @inbounds Bc[1]))
end
@inline function iterate(L::LogicalIndex{<:CartesianIndex,<:BitArray})
    L.sum == 0 && return nothing
    Bc = L.mask.chunks
    irest = ntuple(one, ndims(L.mask)-1)
    return iterate(L, (1, 1, irest, @inbounds Bc[1]))
end
@inline function iterate(L::LogicalIndex{<:Any,<:BitArray}, (i1, Bi, irest, c))
    Bc = L.mask.chunks
    while c == 0
        Bi >= length(Bc) && return nothing
        i1 += 64
        @inbounds c = Bc[Bi+=1]
    end
    tz = trailing_zeros(c)
    c = _blsr(c)
    i1, irest = _overflowind(i1 + tz, irest, size(L.mask))
    return eltype(L)(i1, irest...), (i1 - tz, Bi, irest, c)
end

Did I understand correctly?
If I’m not missing something essential, I notice that the logic of iterate and findall is not very different (they both use trailing_zeros and _blsr to scan for true bits).
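
To check my understanding, here is that shared idea as a standalone loop (my own sketch, not the actual Base code):

using Random

# Walk the 64-bit chunks; within each chunk repeatedly take the lowest set bit with
# trailing_zeros, then clear it (the _blsr trick).
function scan_true_bits(b::BitVector)
    out = Vector{Int}(undef, count(b))
    k = 0
    for (ci, c) in enumerate(b.chunks)   # b.chunks is the packed UInt64 storage
        base = (ci - 1) * 64
        while c != 0
            out[k += 1] = base + trailing_zeros(c) + 1
            c &= c - 1   # clear the lowest set bit
        end
    end
    return out
end

b = Random.bitrand(1 << 10)
scan_true_bits(b) == findall(b)   # true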

If so, why does the benchmark give these results?


b0=falses(2^10)
a=rand(1:2^10, 10)
b0[a].=true

julia> @btime Base.LogicalIndex($b0);
  3.000 ns (0 allocations: 0 bytes)

  
julia> @btime f= findall($b0);
  35.181 ns (1 allocation: 144 bytes)

Does this mean that it measures different things in the two situations?

(*)
Is there a more efficient way than using @edit and a lot of patience to trace the path of a function call only at the “high” level, i.e. only see calls to other methods? I understand that the description is very vague, but I trust you understand the meaning.

Yes, it measures two different things. Base.LogicalIndex(b0) just creates the logical index. If you take a look at what that means, you see that it just stores a reference to b0 and also computes its sum. Crucially, the concrete indices are NOT constructed. If you typed julia> Base.LogicalIndex(b0) into your REPL, though, you would indeed get the concrete indices printed, but the computation of these happens inside the printing function stack.
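
A quick way to see the difference (timings omitted since they are machine-dependent):

using BenchmarkTools, Random

b0 = Random.bitrand(1 << 20)

# Constructing the wrapper only stores a reference to b0 plus its popcount...
@btime Base.LogicalIndex($b0);

# ...whereas iterating it is where the indices actually get built, so this is the
# number comparable to findall / logical indexing.
@btime [i for i in Base.LogicalIndex($b0)];

[i for i in Base.LogicalIndex(b0)] == findall(b0)   # true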

You can use Cthulhu.jl’s @descend to step through the call stack.

1 Like