Anything like C++ std::includes?

phendric · December 23, 2020, 5:38pm

Question: does Julia have an analog to C++'s std::includes? issubset doesn’t do what I’m looking for (I want following to return false, as the first array isn’t a sub-array of the 2nd):

julia> issubset([4,4,1],[1,4,2])
true

There’s a Stack Overflow post asking for something similar, but it’s several years old and the code samples there (issubvec, subset2) don’t appear to be compatible with a more-or-less current version of Julia (1.5.1).

I wrote a function to do what I want:

julia> function subset4(a::Array{Int},b::Array{Int})
           a_unq=unique(a)
           b_unq=unique(b)
           subset::Bool=true
           if issubset(a_unq, b_unq)
               for val in a_unq
                   if count(x->x==val,a)>count(y->y==val,b)
                       subset=false
                       break
                   end
               end
           else
               subset=false
           end
           return subset
       end
subset4 (generic function with 1 method)

It behaves correctly, but is much slower than either issubset or a C++ version of the thing.

issubset (after compile):

julia> @time for ii in 1:1000000 issubset(rand(1:10,3),rand(1:10,10)) end
  0.168592 seconds (2.00 M allocations: 259.399 MiB, 9.08% gc time)

subset4 (after compile):

julia> @time for ii in 1:1000000 subset4(rand(1:10,3),rand(1:10,10)) end
  0.795771 seconds (15.00 M allocations: 1.461 GiB, 11.04% gc time)

This is almost 5x slower.

C++ std::includes:

#include <iostream>
#include <array>
#include <random>
#include <algorithm>

int main() {
  std::array<int, 3> a;
  std::array<int, 10> b; 
  bool subset;
  
  std::mt19937_64 randomGenerator;
  randomGenerator.seed(std::random_device{}());
  std::uniform_int_distribution<int> digitDistro{1, 10};
  
  for(int j=0; j<1000000; j++) {
    for(int i=0; i<3; i++) a[i]=digitDistro(randomGenerator);
    for(int i=0; i<10; i++) b[i]=digitDistro(randomGenerator);
    
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    
    subset = std::includes(b.begin(), b.end(), a.begin(), a.end());
  }
  return 0;
}

$ g++ -o includesTest includesTest.cpp -O3
$ time ./includesTest

real    0m0.166s
user    0m0.166s
sys     0m0.000s

issubset is about as fast as std::includes.

I need something ideally as fast as the “fast” functions in this post, as I’m going to be iterating through a large number of combinations of numbers and checking if they’re a subset of a master list.

Oscar_Smith · December 23, 2020, 5:48pm

using Accumulators from DataStructures.jl gives a more efficient approach.

using DataStructures

function subset5(a::AbstractVector,b::AbstractVector)
    count_a = counter(a)
    count_b = counter(b)
    return all(count_a[key]<=count_b[key] for key in keys(count_a))
end

I time it at 3x slower than issubset for small inputs, and 17% slower for large inputs.

Edit: adding an early bailout for !issubset brings it to about 2x in both cases. There’s probably a size based heuristic that could give you the best of both worlds.

phendric · December 23, 2020, 6:18pm

I’m seeing a factor of 6x slower for small inputs. Not sure what you mean by “early bailout for !issubset.”

julia> @time for ii in 1:1000000 subset5(rand(1:10,3),rand(1:10,10)) end
  1.022921 seconds (15.00 M allocations: 1.490 GiB, 8.67% gc time)

julia> 1.022921/0.168592
6.067434990984151

Oscar_Smith · December 23, 2020, 6:27pm

I meant

function subset6(a::AbstractVector,b::AbstractVector)
    !issubset(a,b) && return false
    count_a = counter(a)
    count_b = counter(b)
    return all(count_a[key]<=count_b[key] for key in keys(count_a))
end

phendric · December 23, 2020, 6:31pm

Thanks - I’m still very much a Julia newbie.

I’m also seeing 2x, which is the best I’ve seen in Julia to this point:

julia> @time for ii in 1:1000000 subset6(rand(1:10,3),rand(1:10,10)) end
  0.336311 seconds (4.36 M allocations: 601.466 MiB, 9.98% gc time)

julia> 0.336311/0.168592
1.9948218183543707

Oscar_Smith · December 23, 2020, 6:47pm

Edit: dont’ use this. It’s wrong

function subset7(a::AbstractVector,b::AbstractVector)
	length(a)>length(b) && return false
	sort!(a)
	sort!(b)
	cur = a[1]
	cur != b[1] && return false 
	for i in 2:length(a)
		if cur == a[i]
			cur != b[i] && return false
		else
			new_cur = a[i]
			i=findfirst(x-> x!=cur, @view b[i:end])
			i === nothing || i>length(a) && return
			new_cur != b[i] && return false
			cur = new_cur
		end
	end
	return true
end

Here is a version that is even faster (it uses a slightly better version of the C++ algorithm.

carstenbauer · December 23, 2020, 7:08pm

Direct translation of what is listed as “possible implementation” in the C++ reference you linked.

function includes(a::AbstractVector,b::AbstractVector)
    first2 = firstindex(a)
    first1 = firstindex(b)
    last2 = lastindex(a)
    last1 = lastindex(b)

    while first2 != last2
        if (first1 == last1) || (b[first2] < a[first1])
            return false
        else !(a[first1] < b[first2])
            first2 += 1
        end
        first1 += 1
    end
    return true
end

julia> @btime subset4(x,y) setup=(x=rand(1:10,3);y=rand(1:10,10))
  410.442 ns (12 allocations: 1.19 KiB)

julia> @btime subset5(x,y) setup=(x=rand(1:10,3);y=rand(1:10,10))
  354.789 ns (8 allocations: 1.19 KiB)

julia> @btime subset7(x,y) setup=(x=rand(1:10,3);y=rand(1:10,10))
  63.083 ns (1 allocation: 16 bytes)

julia> @btime issubset(x,y) setup=(x=rand(1:10,3);y=rand(1:10,10))
  7.455 ns (0 allocations: 0 bytes)

julia> @btime includes(x,y) setup=(x=rand(1:10,3);y=rand(1:10,10))
  3.464 ns (0 allocations: 0 bytes)

This is about 100x faster than your original function (assuming that my Julia translation is correct).

Update: I noted that you sort the arrays as well. Adding corresponding sort! statements to the includes function I find

julia> @btime includes(x,y) setup=(x=rand(1:10,3);y=rand(1:10,10))
  45.820 ns (0 allocations: 0 bytes)
false

phendric · December 23, 2020, 7:33pm

Yes, that is fast. I saw the sample implementation(s) on that page, and thought “Maybe someday I’ll be good enough at Julia (or maybe just programming) to be able to translate that C++ code.” For what it’s worth, your translation looks remarkably like the C++ code it came from.

FYI, I sorted the arrays (in the C++ code) because std::includes requires it, not because I have any particular need to:

Returns true if the sorted range [first2, last2) is a subsequence of the sorted range [first1, last1) .

While I was wondering why @Oscar_Smith’s last function wasn’t doing what I expect…

julia> subset7([1,4],[2,4,1])
false

…you jumped in with your translation, which does.

julia> includes([1,4],[2,4,1])
true

Thanks everyone!

cchderrick · December 23, 2020, 8:03pm

Not sure if you had a and b swapped, and also the possible implementation did not seems to handle the case for the last element of b (first2!=last2) ?.

function includes(a::AbstractVector,b::AbstractVector)
    first2 = firstindex(b)
    last2 = lastindex(b)
    first1 = firstindex(a)
    last1 = lastindex(a)
    
    while first2 != last2
        if (first1 == last1) || (b[first2] < a[first1])
            return false
        else !(a[first1] < b[first2])
            first2 += 1
        end
        first1 += 1
    end
    first2 == last2 && return (b[first2] in a) 
    return true
end

v1 = ['a', 'b', 'c', 'f', 'h', 'x']
v2 = ['a', 'b', 'c']
v3 = ['a', 'c']
v4 = ['a', 'a', 'b']
v5 = ['g']
v6 = ['a', 'c', 'g']
v7 = ['A', 'B', 'C']

"""
$(v1)
includes:
a b c   : $(includes(v1, v2))
a c     : $(includes(v1, v3))
a a b   : $(includes(v1, v4))
g       : $(includes(v1, v5))
a c g   : $(includes(v1, v6))
\$v1     : $(includes(v1, v1))
""" |> print

Produce this, which is closer to the expected behavior.

['a', 'b', 'c', 'f', 'h', 'x']
includes:
a b c   : true
a c     : true
a a b   : false
g       : false
a c g   : false
$v1     : true

Edit:
Should be this instead.
first2 == last2 && return (b[first2] in a[first1:end])

['a', 'b', 'c', 'f', 'h', 'x']
includes:
a b c                : true
a c                  : true
a a b                : false
g                    : false
a c a                : false
a b c g              : false
a b c f h x          : true

carstenbauer · December 23, 2020, 8:04pm

Yes, I swapped a and b because of what the OP wrote in his initial post (first arrays is/isn’t subarray of the second).

phendric · December 23, 2020, 9:23pm

Not sure I follow you. Using @carstenbauer’s code, the following is correct and…I think…tests the thing you bring up (“last element of ‘b’”):

julia> includes(['a', 'b', 'c', 'f', 'h', 'x'], ['a', 'b', 'c', 'f', 'h', 'x'])
true

Yes, it’s swapped from the C++ function, but I don’t need an identical function signature. I used std::includes as an example of a) a thing that computes what I want it to, and b) is fast.

cchderrick · December 23, 2020, 9:44pm

includes(['a', 'b', 'c', 'g'], ['a', 'b', 'c', 'f', 'h', 'x'])
This will return true.
‘g’ is not even in the sequence.
The problem is this: (first2!=last2)
When the b sequence at its last position (first2==last2), which is ‘g’, it breaks out the while-loop and return true, without checking for any more ‘g’ in a after ‘c’.

Which makes me think the “possible implementation” from the reference page is just a partial solution, not a C++ expert so I am not sure.

phendric · December 23, 2020, 10:21pm

Oof. Yeah:

julia> includes(['a', 'b', 'c', 'd', 'U', 'f'], ['a', 'b', 'c', 'd', 'e', 'f'])
true

cchderrick · December 23, 2020, 11:06pm

For that particular implementation, I don’t think you can swap a and b, and it needed that last element checks.

first2 == last2 && return (b[first2] in a[first1:end])

And first2 == last2 is redundant.

function includes(a::AbstractVector,b::AbstractVector)
    @assert issorted(a)
    @assert issorted(b)
    first2 = firstindex(b)
    last2 = lastindex(b)
    first1 = firstindex(a)
    last1 = lastindex(a)
    
    while first2 != last2
        if (first1 == last1) || (b[first2] < a[first1])
            return false
        else !(a[first1] < b[first2])
            first2 += 1
        end
        first1 += 1
    end
    return (b[first2] in a[first1:end])
end

Here is how I tested it.

v1 = ['a', 'b', 'c', 'f', 'h', 'x']
v2 = ['a', 'b', 'c']
v3 = ['c','h']
v4 = ['x']
v5 = ['h','x']
v6 = ['g']
v7 = ['a', 'c', 'z']
v8 = ['a', 'b', 'c', 'g', 'x']
v9 = ['0', 'b', 'c', 'd', 'h', 'x']

"""
$(join(v1,' '))
includes:
$(rpad(join(v2,' '),20)) : $(includes(v1, v2))
$(rpad(join(v3,' '),20)) : $(includes(v1, v3))
$(rpad(join(v4,' '),20)) : $(includes(v1, v4))
$(rpad(join(v5,' '),20)) : $(includes(v1, v5))
$(rpad(join(v6,' '),20)) : $(includes(v1, v6))
$(rpad(join(v7,' '),20)) : $(includes(v1, v7))
$(rpad(join(v8,' '),20)) : $(includes(v1, v8))
$(rpad(join(v9,' '),20)) : $(includes(v1, v9))
""" |> print

And a b c g x still make no sense:

a b c f h x
includes:
a b c                : true
c h                  : true
x                    : true
h x                  : true
g                    : false
a c z                : false
a b c g x            : true
0 b c d h x          : false

Jean_Michel · December 24, 2020, 3:05pm

Could you please call your new implementation issublist, so as not to confuse with issubset, which compares Sets, and have a name which reflects better what it does?

carstenbauer · December 24, 2020, 5:03pm

AFAIU, the semi-official term is subsequence. So I suggest calling it issubsequence.

Topic		Replies	Views
Issubset() but order matters General Usage question , arrays	5	533	December 26, 2021
Finding patterns in vectors General Usage array , pattern-matching , vector	9	996	June 20, 2022
Why is Julia faster than C++ for quicksort? Performance performance , quicksort	15	2000	August 15, 2023
Slicing array on julia 4000ms vs c++ 400ms Performance vector	24	689	November 10, 2024
What does `include` do? New to Julia question	27	897	October 25, 2024

Anything like C++ std::includes?

Related topics