Computing the preimage of count_ones

count_ones(x) counts how many 1s are in the binary representation of the integer x

I am looking for a function that computes the preimage of that.
Or, informally, the inverse (but it's not actually invertible).
That is to say, for any n I want a function that returns an iterator of values y such that count_ones(y) == n for all those values (and for no other integers), up to some maximum z.
The naive implementation is:

inv_count_ones(n, z) = filter(y->count_ones(y)==n, 1:z)

And it does indeed have

inv_count_ones(0, 8) == []
inv_count_ones(1, 8) == [1, 2, 4, 8]
inv_count_ones(2, 8) == [3, 5, 6]
inv_count_ones(3, 8) == [7]
inv_count_ones(4, 8) == []

But is there a better way?


Here's one way:

using Combinatorics
inv_count_ones(k,n) = 
  filter(<=(n), [reduce(|,(1 .<< s)) 
    for s in powerset(0:(8*sizeof(n)-leading_zeros(n))-1, k,k) 
    if length(s)>0])

The problem is pretty standard in combinatorics and bit manipulation. The key trick is to use the extra arguments of the powerset function, which restrict it to subsets of a given size.

julia> [inv_count_ones(i, 8) for i in 0:4] ==
       [[], [1,2,4,8], [3,5,6], [7], []]
true

BTW this can be transformed into an Iterator for these sets, which can be more efficient in memory (and perhaps time).

There's definitely a pattern in the positive Ints, if you can split the implementation by types and ranges.

julia> for i in 1:32
           ci = count_ones(i)
           if ci == 1 println("\n", ci) else print(" ", ci) end
       end

 2 2 3
 2 2 3 2 3 3 4
 2 2 3 2 3 3 4 2 3 3 4 3 4 4 5

You can tell n=1 occurs at powers of 2, and for a given n you can start the search at the smallest integer with n ones, i.e. search (2^n-1):z.
There is a pattern between the powers of 2, but I'm not sure how to exploit that for n>1.
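As a minimal sketch of that observation (my own variant, not from the thread): since the smallest integer with n one-bits is 2^n - 1, the naive filter can start there:

```julia
# Naive filter, but starting at 2^n - 1, the smallest integer with n ones.
# n == 0 is special-cased so 0 itself is not returned (the OP searches 1:z).
inv_count_ones_from(n, z) =
    n == 0 ? Int[] : filter(y -> count_ones(y) == n, (2^n - 1):z)
```

It agrees with the naive version, e.g. `inv_count_ones_from(2, 8) == [3, 5, 6]`, while skipping the smallest 2^n - 2 candidates.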

This is a fun exercise in combinatorics. I vaguely remember that this is pretty well-known and standard, but I don't recall the name, so I'll instead re-derive the construction here.

edit: @Syx_Pek provided the name down-thread, it's called the combinatorial number system.

Originally I described a re-derivation with code, but read wikipedia instead, it's clearer than what I wrote.

click here to expand my pre-edit off-the-cuff derivation

First of all, suppose you had your array inv_co(k) = filter(y->count_ones(y)==k, 1:(-1 >>> 1)), and you're given a specific N such that count_ones(N) == k. What is the index of N in this list inv_co(k)?

Well, it is one plus the number of smaller entries. This is easy to compute:

struct BitsIter
    u::Int
end
Base.iterate(b::BitsIter) = iterate(b, b.u)
Base.iterate(b::BitsIter, u) = (u == 0 ? nothing : (trailing_zeros(u), Base._blsr(u)))

function index(N)
    res = 1
    for (nbits, bitpos) in enumerate(BitsIter(N))
        # there are nbits many bits we need to distribute over bitpos many places
        res += binomial(bitpos, nbits)
    end
    return res
end

We can reality-check that:

julia> inv_count_ones(n, z) = filter(y->count_ones(y)==n, 1:z)
inv_count_ones (generic function with 1 method)

julia> length(inv_count_ones(5, 1000))
252

julia> map(index, inv_count_ones(5, 1000)) == 1:length(inv_count_ones(5, 1000))
true

OK, now we need to invert the index function (you want random access to your iterator / lazy collection of preimages, right?). That is: We are given idx and K, and want z such that count_ones(z) == K and index(z) == idx.

I think the construction is best explained by giving a (slowish recursive) implementation:

function inv_index0(k, idx)
    k == 1 && return 1 << idx
    pos = findfirst(p -> binomial(p, k) > idx, 1:64)
    rem = idx - binomial(pos - 1, k)
    return (1 << (pos - 1)) | inv_index0(k - 1, rem)
end
inv_index(k, idx) = inv_index0(k, idx - 1)

We can reality-check that:

julia> for i = 1:1000
           z = rand(1:(1 << 35))
           k = count_ones(z)
           if z != inv_index(k, index(z))
               @show z, k, i
               @assert false
           end
       end

Remaining parts of the exercise: Make a loop out of the tail-recursion, deal with integer overflows, make it fast (lookup table), optionally make a fast iterator, optionally make a fast container.
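A sketch of the first of those steps (my own loop version of inv_index, still ignoring overflow):

```julia
# Loop version of the tail-recursive inv_index0/inv_index pair above:
# peel off the highest bit position at each step instead of recursing.
function inv_index_loop(k, idx)
    idx -= 1                # 0-based index, as in the recursion
    z = 0
    while k > 1
        # smallest p with binomial(p, k) > idx; then p-1 is the bit to set
        pos = findfirst(p -> binomial(p, k) > idx, 1:64)
        idx -= binomial(pos - 1, k)
        z |= 1 << (pos - 1)
        k -= 1
    end
    return z | (1 << idx)   # k == 1: the remaining bit sits at position idx
end
```

For example, `[inv_index_loop(2, i) for i in 1:3]` reproduces `[3, 5, 6]`.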

This function takes a BitInteger and gives the next value of the same type that has the same number of bits:

function next_val(k::T) where {T<:Base.BitInteger}
    𝟙 = one(T)
    z = trailing_zeros(k)
    k += 𝟙 << z
    d = trailing_zeros(k) - z - 1
    k += 𝟙 << d - 𝟙
end

Basically, the logic is that you shift the last bit left by adding a bit at that position; if there's a cascade of carrying because there's more than one bit set in a row, then you'll end up with fewer bits in the resulting value, and you need to replace those with a number of least significant bits. Here's a function that collects them into a vector and returns it:

function list_vals(::Type{T}, b::Integer) where {T<:Base.BitInteger}
    vals = T[]
    𝟙 = one(T)
    k = 𝟙 << b - 𝟙
    while count_ones(k) == b
        push!(vals, k)
        z = trailing_zeros(k)
        k += 𝟙 << z
        d = trailing_zeros(k) - z - 1
        k += 𝟙 << d - 𝟙
    end
    return vals
end
list_vals(b::T) where {T<:Base.BitInteger} = list_vals(T, b)

With this definition we get:

julia> map(bitstring, list_vals(UInt8(5)))
56-element Vector{String}:

Ah, I just figured out a simpler way to do next_val that lets you cycle around back to the beginning once you overflow and also lets this work for arbitrary integer types:

function next_val(k::T) where {T<:Integer}
    𝟙 = one(T)
    b = count_ones(k)
    z = trailing_zeros(k)
    k += 𝟙 << z
    x = b - count_ones(k)
    k += 𝟙 << x - 𝟙
end

I think this also makes it more obvious that what you're doing is repopulating the number of bits that you lost. There's an extra count_ones(k) call at the top, but that's going to be a loop constant in the loop version; however, it means that the loop condition can't be based on counting ones anymore, so it's a bit of a wash. However, you can also precompute the correct array size using the binomial function, which is probably a win as well, giving this:

function list_vals(
    ::Type{T},
    b::Integer,
    n::Integer = binomial(8*sizeof(T), b),
) where {T<:Integer}
    𝟙 = one(T)
    k = 𝟙 << b - 𝟙
    v = Array{T}(undef, n)
    v[1] = k
    for i = 2:n
        z = trailing_zeros(k)
        k += 𝟙 << z
        x = b - count_ones(k)
        k += 𝟙 << x - 𝟙
        v[i] = k
    end
    return v
end

This is the Iterator version of my previous answer:

using Combinatorics
inv_count_ones(k, n) =
  Iterators.filter(<=(n),
    (reduce(|, 1 .<< s)
      for s in powerset(0:(8*sizeof(n)-leading_zeros(n))-1, k, k)
      if length(s) > 0))

Like the last version, it correctly follows the results in the OP.
(Perhaps the last answer was ignored because it feels inefficient, but the powerset function means it steps through just the right values, and fewer than half of them get filtered out by <=(n).)

An algorithm for this is known as Gosper's Hack (see HAKMEM, Hacker's Delight, or Knuth Vol. 4A).

julia> function GosperHack(N)
           S = N & -N
           T = N + S
           T + (((T ⊻ N) ÷ S) >> 2)
       end
GosperHack (generic function with 1 method)

julia> function Generate(N, z::T) where T
           Arr = T[]
           i = (one(T) << N) - 1
           while i ≤ z
               push!(Arr, i)
               i = GosperHack(i)
           end
           Arr
       end
Generate (generic function with 1 method)

There are two things one can do to it:

  1. Make it more convenient; it is already almost an iterator.
  2. Apply the bitmask speedup (Generating Combinations). The integer division is slower than a popcount on certain architectures.
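A sketch of point 1 (my own wrapper, names hypothetical): the loop in Generate maps directly onto Julia's iteration protocol. This assumes n ≥ 1, since Gosper's hack divides by the lowest set bit and 0 has none.

```julia
# Gosper's hack, as in GosperHack above:
gosper(i) = begin
    s = i & -i
    t = i + s
    t + (((t ⊻ i) ÷ s) >> 2)
end

# Lazy iterator over all values ≤ z with exactly n one-bits (n ≥ 1).
struct PopcountRange{T<:Integer}
    n::Int   # number of set bits
    z::T     # inclusive upper bound
end

function Base.iterate(it::PopcountRange{T}, i = (one(T) << it.n) - one(T)) where {T}
    i > it.z && return nothing
    return (i, gosper(i))   # yield i, precompute the successor as state
end

Base.eltype(::Type{PopcountRange{T}}) where {T} = T
Base.IteratorSize(::Type{<:PopcountRange}) = Base.SizeUnknown()
```

`collect(PopcountRange(2, 8))` then gives `[3, 5, 6]`, matching the OP.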

I timed these, expecting the div operation in GosperHack to really tank the performance (integer division is very slow), but it was the same as my next_val:

julia> @benchmark next_val(0x07)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  0.791 ns … 8.208 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     0.875 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   0.872 ns ± 0.075 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  0.791 ns       Histogram: frequency by time      0.916 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark GosperHack(0x07)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  0.791 ns … 25.250 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     0.875 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   0.877 ns ±  0.247 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▅           █          ▁                        ▁
  ▅▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▇ █
  0.791 ns     Histogram: log(frequency) by time        1 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Exactly the same timings: Gosper's hack had slightly higher mean and variance, but the min and median are identical! If you try it on more values, a bigger difference emerges:

julia> @benchmark map(next_val, 0x1:0xff)
BenchmarkTools.Trial: 10000 samples with 839 evaluations.
 Range (min … max):  146.106 ns …  19.987 μs  ┊ GC (min … max): 0.00% … 99.13%
 Time  (median):     159.863 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   172.080 ns ± 326.125 ns  ┊ GC (mean ± σ):  6.95% ±  3.97%

   ▁▄        █▁█▁
  ▇██▇▂▁▂▁▁▁▆████▆▅▅▆▄▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  146 ns           Histogram: frequency by time          214 ns <

 Memory estimate: 304 bytes, allocs estimate: 1.

julia> @benchmark map(GosperHack, 0x1:0xff)
BenchmarkTools.Trial: 10000 samples with 460 evaluations.
 Range (min … max):  228.804 ns …  27.354 μs  ┊ GC (min … max): 0.00% … 99.03%
 Time  (median):     241.487 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   251.756 ns ± 374.715 ns  ┊ GC (mean ± σ):  4.13% ±  3.30%

  ▇█▇▁  ▁▃▆█▇▇▄▃▄▃▂ ▁▁                                          ▂
  ████▇██████████████████▇▇▇▇▆▇▇▆▆▆▇▆▆▇▇▆▆▆▆▅▆▆▅▆▆▄▅▅▄▄▅▅▅▄▄▅▂▄▅ █
  229 ns        Histogram: log(frequency) by time        313 ns <

 Memory estimate: 304 bytes, allocs estimate: 1.

I'm guessing LLVM has seen 50 different ways of writing this code and replaces them all with the one it thinks is best. LLVM is often scarily good at integer bit math in a loop (perhaps unsurprisingly).

No, I considered that since the timings were so suspiciously similar, but the machine code is very different. The code for next_val is more, simpler instructions, while the code for GosperHack is very small but includes that udiv and also has an exception branch for the case where the divisor is zero, which can happen when the input is zero.

I get different timings if I fix the benchmark to use the Ref interpolation, to prevent the compiler from constant-folding the specific argument you are passing:

julia> @btime next_val(0x07);
  0.875 ns (0 allocations: 0 bytes)

julia> @btime GosperHack(0x07);
  0.875 ns (0 allocations: 0 bytes)

julia> @btime next_val($(Ref(0x07))[]);
  1.792 ns (0 allocations: 0 bytes)

julia> @btime GosperHack($(Ref(0x07))[]);
  1.458 ns (0 allocations: 0 bytes)

Oh right. Is it time to make a breaking release of BenchmarkTools that does this automatically yet? I basically always want it to work like the @code_* macros and time how long the function call takes, not how long it takes to evaluate a specific expression with constant propagation and whatnot. It used to not matter so much, but now we're really quite good at constant prop, so it's usually pretty misleading.


Wouldn't help in your example, of course, because you weren't using $ at all.

Doing that could also give us a way to solve the longstanding problem that interpolating types into benchmark expressions de-optimizes them.

julia> f(::Type{T}, x) where {T} = T(x) + T(x); 

julia> @btime f(Int, 1)
  1.082 ns (0 allocations: 0 bytes)

julia> @btime f($Int, 1)
  54.993 ns (0 allocations: 0 bytes)

The reason for this is that there's no way to create an instance of Tuple{Type{Int}}; you can only construct instances of Tuple{DataType} 🤬 🤬 🤬

Replacing $x with $(Ref{Core.Typeof(x)}(x))[] is properly able to communicate the required info though:

julia> @btime f($(Ref(Int))[], 1)
  70.626 ns (0 allocations: 0 bytes)

julia> @btime f($(Ref{Type{Int}}(Int))[], 1)
  1.943 ns (0 allocations: 0 bytes)

I mean to go a step further and change @benchmark f(x) to mean what @benchmark f($(Ref(x))[]) means. What it does now is as if @code_native f(x) showed you the code for () -> f(x) instead of the code of f for the type of x.


The OEIS is perfect for finding more about a pattern like this 🙂 Searching for the first numbers gives the right match as the first result:

I experimented with a strength reduction opportunity:

        z = trailing_zeros(k)
        k += 𝟙 << z

can also be implemented as

        k += k & -k

LLVM generates fewer instructions for it, which looks promising. Unfortunately, @benchmark doesn't seem to care for it. I also played with a slightly different approach. Also slower according to @benchmark.
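Spelled out in full (my own sketch of that substitution, applied to the generic next_val from earlier; next_val_blsi is my name for it):

```julia
# Generic next_val with the strength-reduced step: k & -k isolates the
# lowest set bit directly, replacing trailing_zeros plus a shift.
function next_val_blsi(k::T) where {T<:Integer}
    𝟙 = one(T)
    b = count_ones(k)
    k += k & -k          # add the lowest set bit; the carry ripples up
    x = b - count_ones(k)
    k += 𝟙 << x - 𝟙      # repopulate the bits lost to the carry
end
```

This works because z in the original is only used for the shift, so the isolated bit k & -k can be added directly.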
