Any idea what the cutoff is? It might be worth making a PR to make intersect use a hybrid algorithm.
My guess is that this probably depends on your specific architecture and how much intersection the two sets actually have, but for me, the cutoff is around ~100 elements in each array:
julia> using BenchmarkTools, Random
julia> a = rand(100);
julia> b = shuffle!([rand(a, 50); rand(50)]);
julia> @btime [i for i in a if i in b];
7.031 μs (109 allocations: 2.81 KiB)
julia> @btime intersect(a, b);
6.372 μs (22 allocations: 9.11 KiB)
Perhaps it’s enough to just document this behavior and let users make an informed decision themselves. We should probably use the naive method for tuples though, since tuples usually aren’t much larger than 100 elements.
Indeed, it is dependent on the size of the arrays and of the number of repeated elements. Here is a very crude benchmark (using eval=1 in the benchmark, because otherwise it takes too long):
Any chance you can add the 3rd dimension for when the lists don’t have equal sizes? My guess is that the cutoff mainly depends on the size of the bigger list, but would be nice to confirm.
If I have done things right (this really a draft…), this is the plot:
Green: intersect
is faster. Blue: naive
is faster. The number of common elements
decreases from the first to the last plot.
This is the code:
using Plots
naive(v1,v2) = unique!([ x for x in v1 if x in v2 ])
lv1 = Int[]
lv2 = Int[]
nr=Int[]
avnr=Float64[]
t = Float64[]
for i in 10:25:500
println(i)
for j in 10:25:500
for k in 11:100:901
tnaive = 0.
tintersect = 0.
nint = 0
for l in 1:100
v1 = rand(1:k,i)
v2 = rand(1:k,j)
tnaive += @elapsed naive(v1,v2)
tintersect += @elapsed intersect(v1,v2)
nint += length(intersect(v1,v2))
end
push!(lv1,i)
push!(lv2,j)
push!(nr,k)
push!(t,tnaive-tintersect)
end
end
end
plot(layout=(3,3))
sp=0
for k in 11:100:901
global sp += 1
println(sp)
inds = findall(i -> i == k, nr)
l1 = lv1[inds]
l2 = lv2[inds]
times = t[inds]
c = [ (x > 0 ? :green : :blue) for x in times ]
scatter!(l1,l2,color=c,
xlabel="vector 1 size",
ylabel="vector 2 size",
markershape=:square,
markersize=10,
xlim=[-15,515],
ylim=[-15,515],
label="",
title="count(v1 ∩ v2) ∝ 1/$k",
subplot=sp)
end
plot!(size=(1000,1000))
savefig("./intersect.pdf")
This one with a different range of common elements (more than above):
Edit: before anybody asks, I have used @elapsed
instead of something more sophisticated because any of the tools of the BenchmarkTools
was taking too long to run.
To be honest I’m not sure I fully understand either. Let’s take this:
julia> isa.(Ref([1,2,3]), [Array, Dict, Int])
3-element BitVector:
1
0
0
julia> isa.([1,2,3], [Array, Dict, Int])
3-element BitVector:
0
0
1
What do we see in the first snippet and what do we see in the second?
The first evaluates to
julia> [
isa([1, 2, 3], Array),
isa([1, 2, 3], Dict),
isa([1, 2, 3], Int),
]
3-element Vector{Bool}:
1
0
0
and the second to
julia> [
isa(1, Array),
isa(2, Dict),
isa(3, Int),
]
3-element Vector{Bool}:
0
0
1
The Ref
makes the argument a “scalar” or “atom” in broadcasting.
I was suspecting the same but wasn’t sure, thank you for explaining it.
In other words, wrapping in a Ref
causes the Vector to be treated as a single unit/thing, as the Vector itself - vs unpacking it into its elements, as broadcasting would do without the Ref
.
What is the relation between the Ref
meaning in the first part of the docstring (that I admit I haven’t fully understood) and its usage as “protection” from the broadcasting?
I have always see Ref
in this second context, is it commonly used for other things?
Ref
is akin to a 0-dimensional array, that can therefore contain only one element.
Now think about what happens when broadcasting a vector of length 1:
julia> x = [10];
julia> y = [1,2,3];
julia> x .+ y
3-element Vector{Int64}:
11
12
13
everything happens as though the only element of x
was repeated enough times to match the length of y
(I think this is sometimes referred to as recycling the only element of x
). Well, the same thing happens with Ref
because it, too, contains only one element:
julia> Ref(10) .+ y
3-element Vector{Int64}:
11
12
13
I am not sure why everyone adopted Ref
for this, anything that acts as an scalar and encapsulates an arbitrary object will do. I, for an example, use single-element tuples instead of Ref
, but this was just a question of personal taste. Now we have the situation in which Ref
is used so much as a scalar container that people do not even know its reason to exist.
Ref
is basically what would be a “pointer” in other languages. It is an object that just stores the memory position of another object, it may not be pointing anywhere (i.e., be empty) in which case it has a null value stored (which cannot be a valid memory address) and you can check if this is the case. They are much more relevant in languages with memory models different from Julia like C, for example, where assigning a struct
to a local variable will copy every field in the struct at the right hand-side to the already preallocated struct
on the left-hand side (and, as so, to manipulate some things without excessive copies you will need ); in Julia, the variable (which does not require any preallocation, i.e., it is just a label not a box) will now point to the exact same object as the right-side, and if it already contained something before, it is forgotten.
I have a kinda abstract post about the Julia memory model here: Is it appropriate to say a variable is a label or box? - #19 by Henrique_Becker
However, if you never programmed in a programming language with a different memory model the distinction may be lost on you.
I’m sort of surprised that doesn’t error, as all the broadcasting and operation syntax is so carefully thought to be consistent with standard math. (Never did that because in any case x[1] .+ ...
or similar with begin
would be more natural).
Exactly my thoughts. I was thinking that a nobroadcast
would be less confusing to many, but well Ref
is much shorter.
I now wonder whether there is a performance benefit of explicitly passing Ref around. Or maybe inside recursion where Julia shouldn’t modify copies
I don’t think C struct assignments work as you suggest. The struct variable is just a pointer to the head of the struct. Maybe it works differently in C++ . . .
I do believe I am correct. In fact, the common C idiom of not declaring a struct
in local scope, but instead a pointer to it, and then using malloc
to obtain a memory location for the struct
(and an address to the pointer), exists exactly because of the C memory model I mentioned.
I was wrong, sorry. Haven’t programmed in C in ages. I’m surprised that it allows block copy of struct contents with =
. Was it always this way? IIRC, you can’t (or couldn’t) do this with strings or other arrays. In those cases, the variable name truly is the pointer to the first array element, right?
In C there is no “string” type there’s just “pointer to char”
Ah, now I understand the source of the confusion, more than ten years ago I have studied deeply the C99 standard and no other, however there are even older standards, of course. I just checked and my C99 Reference book says previous compilers did not allow struct
and union
attribution, so yes, it was not always possible.
You cannot do this with arrays (in C99, I’m not up to par with newer standards), but this has nothing to do with what is in the variable, local variable declared to be arrays are allocated in the stack and “popped up” after the function finishes, so technically, they contain the whole object inside themselves (not a pointer), it is just that attribution is not defined for them (will give an error). You can check they actually contain the whole object in themselves by the amount of C threads in forums that just boil down to “I returned a pointer from an array variable I had declared inside the function and outside the function it does not work (has other values, gives segfault, etc…).”
Strings in C are “always” char *
, so they are always a pointer, and act accordingly (I mean, you can declare a string as an array of char, but then everything that deals with strings will only want the address of the first character and in the end it will be the same as having defined it as a pointer from the start).
Fascinating trying to recollect how things used to be. I learned C on the job starting in 1984. At that time, apparently, C compilers were generally allowing struct assignment with =
, but it was not allowed according to the ‘bible’ (Kernighan & Ritchie’s The C Programming Language), which I studied. A detailed discussion of the history is at
and at
https://www.thecodingforums.com/threads/struct-assignment-historical-question.443010/
You also couldn’t pass structures by value when calling a function, according to K&R. An interesting quote from Larry Jones:
struct parameters, return values, and assignment were added to the Unix C compilers just about the time that K&R was actually published (the “Recent Changes to C” paper noting them is dated Nov. 15, 1978), leading to a great deal of frustration many years later when MS-DOS compilers appeared that implemented K&R exactly and thus didn’t support them.