What is Ref?

Any idea what the cutoff is? It might be worth making a PR to make intersect use a hybrid algorithm.

1 Like

My guess is that this probably depends on your specific architecture and how much intersection the two sets actually have, but for me, the cutoff is around ~100 elements in each array:

julia> using BenchmarkTools, Random

julia> a = rand(100);

julia> b = shuffle!([rand(a, 50); rand(50)]);

julia> @btime [i for i in a if i in b];
  7.031 μs (109 allocations: 2.81 KiB)

julia> @btime intersect(a, b);
  6.372 μs (22 allocations: 9.11 KiB)

Perhaps it’s enough to just document this behavior and let users make an informed decision themselves. We should probably use the naive method for tuples though, since tuples usually aren’t much larger than 100 elements.

4 Likes

Indeed, it is dependent on the size of the arrays and of the number of repeated elements. Here is a very crude benchmark (using eval=1 in the benchmark, because otherwise it takes too long):

intersect

10 Likes

Any chance you can add the 3rd dimension for when the lists don’t have equal sizes? My guess is that the cutoff mainly depends on the size of the bigger list, but would be nice to confirm.

If I have done things right (this really a draft…), this is the plot:

Green: intersect is faster. Blue: naive is faster. The number of common elements
decreases from the first to the last plot.

This is the code:

using Plots

naive(v1,v2) = unique!([ x for x in v1 if x in v2 ])

lv1 = Int[]
lv2 = Int[]
nr=Int[]
avnr=Float64[]
t = Float64[]
for i in 10:25:500
  println(i)
  for j in 10:25:500
    for k in 11:100:901
      tnaive = 0.
      tintersect = 0.
      nint = 0
      for l in 1:100
        v1 = rand(1:k,i)
        v2 = rand(1:k,j)
        tnaive += @elapsed naive(v1,v2)
        tintersect += @elapsed intersect(v1,v2)
        nint += length(intersect(v1,v2))
      end
      push!(lv1,i)
      push!(lv2,j)
      push!(nr,k)
      push!(t,tnaive-tintersect)
    end
  end
end

plot(layout=(3,3))

sp=0
for k in 11:100:901
  global sp += 1
  println(sp)
  inds = findall(i -> i == k, nr)
  l1 = lv1[inds]
  l2 = lv2[inds]
  times = t[inds]
  c = [ (x > 0 ? :green : :blue) for x in times ]
  scatter!(l1,l2,color=c,
           xlabel="vector 1 size",
           ylabel="vector 2 size",
           markershape=:square,
           markersize=10,
           xlim=[-15,515],
           ylim=[-15,515],
           label="",
           title="count(v1 ∩ v2) ∝ 1/$k",
           subplot=sp)

end

plot!(size=(1000,1000))
savefig("./intersect.pdf")

This one with a different range of common elements (more than above):

Edit: before anybody asks, I have used @elapsed instead of something more sophisticated because any of the tools of the BenchmarkTools was taking too long to run.

16 Likes

To be honest I’m not sure I fully understand either. Let’s take this:

julia> isa.(Ref([1,2,3]), [Array, Dict, Int])
3-element BitVector:
 1
 0
 0

julia> isa.([1,2,3], [Array, Dict, Int])
3-element BitVector:
 0
 0
 1

What do we see in the first snippet and what do we see in the second?

Think of Ref(x) like [x].

The first evaluates to

julia> [
           isa([1, 2, 3], Array),
           isa([1, 2, 3], Dict),
           isa([1, 2, 3], Int),
       ]
3-element Vector{Bool}:
 1
 0
 0

and the second to

julia> [
           isa(1, Array),
           isa(2, Dict),
           isa(3, Int),
       ]
3-element Vector{Bool}:
 0
 0
 1

The Ref makes the argument a “scalar” or “atom” in broadcasting.

7 Likes

I was suspecting the same but wasn’t sure, thank you for explaining it.

In other words, wrapping in a Ref causes the Vector to be treated as a single unit/thing, as the Vector itself - vs unpacking it into its elements, as broadcasting would do without the Ref.

3 Likes

What is the relation between the Ref meaning in the first part of the docstring (that I admit I haven’t fully understood) and its usage as “protection” from the broadcasting?
I have always see Ref in this second context, is it commonly used for other things?

Ref is akin to a 0-dimensional array, that can therefore contain only one element.

Now think about what happens when broadcasting a vector of length 1:

julia> x = [10];
julia> y = [1,2,3];
julia> x .+ y
3-element Vector{Int64}:
 11
 12
 13

everything happens as though the only element of x was repeated enough times to match the length of y (I think this is sometimes referred to as recycling the only element of x). Well, the same thing happens with Ref because it, too, contains only one element:

julia> Ref(10) .+ y
3-element Vector{Int64}:
 11
 12
 13
2 Likes

I am not sure why everyone adopted Ref for this, anything that acts as an scalar and encapsulates an arbitrary object will do. I, for an example, use single-element tuples instead of Ref, but this was just a question of personal taste. Now we have the situation in which Ref is used so much as a scalar container that people do not even know its reason to exist.

Ref is basically what would be a “pointer” in other languages. It is an object that just stores the memory position of another object, it may not be pointing anywhere (i.e., be empty) in which case it has a null value stored (which cannot be a valid memory address) and you can check if this is the case. They are much more relevant in languages with memory models different from Julia like C, for example, where assigning a struct to a local variable will copy every field in the struct at the right hand-side to the already preallocated struct on the left-hand side (and, as so, to manipulate some things without excessive copies you will need ); in Julia, the variable (which does not require any preallocation, i.e., it is just a label not a box) will now point to the exact same object as the right-side, and if it already contained something before, it is forgotten.

I have a kinda abstract post about the Julia memory model here: Is it appropriate to say a variable is a label or box? - #19 by Henrique_Becker
However, if you never programmed in a programming language with a different memory model the distinction may be lost on you.

6 Likes

I’m sort of surprised that doesn’t error, as all the broadcasting and operation syntax is so carefully thought to be consistent with standard math. (Never did that because in any case x[1] .+ ... or similar with begin would be more natural).

1 Like

Exactly my thoughts. I was thinking that a nobroadcast would be less confusing to many, but well Ref is much shorter.

I now wonder whether there is a performance benefit of explicitly passing Ref around. Or maybe inside recursion where Julia shouldn’t modify copies

1 Like

I don’t think C struct assignments work as you suggest. The struct variable is just a pointer to the head of the struct. Maybe it works differently in C++ . . .

1 Like

I do believe I am correct. In fact, the common C idiom of not declaring a struct in local scope, but instead a pointer to it, and then using malloc to obtain a memory location for the struct (and an address to the pointer), exists exactly because of the C memory model I mentioned.

I was wrong, sorry. Haven’t programmed in C in ages. I’m surprised that it allows block copy of struct contents with =. Was it always this way? IIRC, you can’t (or couldn’t) do this with strings or other arrays. In those cases, the variable name truly is the pointer to the first array element, right?

In C there is no “string” type there’s just “pointer to char”

1 Like

Ah, now I understand the source of the confusion, more than ten years ago I have studied deeply the C99 standard and no other, however there are even older standards, of course. I just checked and my C99 Reference book says previous compilers did not allow struct and union attribution, so yes, it was not always possible.

You cannot do this with arrays (in C99, I’m not up to par with newer standards), but this has nothing to do with what is in the variable, local variable declared to be arrays are allocated in the stack and “popped up” after the function finishes, so technically, they contain the whole object inside themselves (not a pointer), it is just that attribution is not defined for them (will give an error). You can check they actually contain the whole object in themselves by the amount of C threads in forums that just boil down to “I returned a pointer from an array variable I had declared inside the function and outside the function it does not work (has other values, gives segfault, etc…).”

Strings in C are “always” char *, so they are always a pointer, and act accordingly (I mean, you can declare a string as an array of char, but then everything that deals with strings will only want the address of the first character and in the end it will be the same as having defined it as a pointer from the start).

Fascinating trying to recollect how things used to be. I learned C on the job starting in 1984. At that time, apparently, C compilers were generally allowing struct assignment with =, but it was not allowed according to the ‘bible’ (Kernighan & Ritchie’s The C Programming Language), which I studied. A detailed discussion of the history is at

https://stackoverflow.com/questions/17513736/what-does-it-mean-that-automatic-structures-and-arrays-may-now-also-be-initializ

and at

https://www.thecodingforums.com/threads/struct-assignment-historical-question.443010/

You also couldn’t pass structures by value when calling a function, according to K&R. An interesting quote from Larry Jones:

struct parameters, return values, and assignment were added to the Unix C compilers just about the time that K&R was actually published (the “Recent Changes to C” paper noting them is dated Nov. 15, 1978), leading to a great deal of frustration many years later when MS-DOS compilers appeared that implemented K&R exactly and thus didn’t support them.