What is Ref?

I was trying to do something like this:

x = in.([1,2], [1,3,4])

(BTW, what really want is an array of all the items that are in array 1 AND array 2. Is a broadcasted in the best way to do that ? )

Wanting a list of indicators of whether the items in the first list are in the second list, so for example

 x = [1, 0]

However, what I’ve written does NOT work. However, the documentation is your friend !

https://docs.julialang.org/en/v1/base/collections/#Iterable-Collections

The important note being

wrap collection in a tuple or a Ref

julia> in.([1,3], Ref([1,2,4]))
2-element BitArray{1}:
 1
 0

But, what is this Ref thing ?? I went back to the documentation, but no luck finding it because several pages of matches show up. Interestingly there is a Core.Ref, but I’m pretty sure that’s not the same thing.

What is this Ref thing as used in my example ?

Thank you.

3 Likes

I think you want intersect.

Try something like

intersect([1,2], [1,3,4])

Documentation for intersect: Collections and Data Structures · The Julia Language


Ref, in this particular case, tells Julia to treat something as a scalar for the purposes of broadcasting. I don’t think you need it for what you want to accomplish.

2 Likes

The easiest way usually to figure this out is just to enter ?Ref in the REPL:

help?> Ref
search: Ref WeakRef prevfloat UndefRefError GlobalRef uppercasefirst lowercasefirst ProcessFailedException

  Ref{T}

  An object that safely references data of type T. This type is guaranteed to point to valid, Julia-allocated
  memory of the correct type. The underlying data is protected from freeing by the garbage collector as long as
  the Ref itself is referenced.

  In Julia, Ref objects are dereferenced (loaded or stored) with [].

  Creation of a Ref to a value x of type T is usually written Ref(x). Additionally, for creating interior pointers
  to containers (such as Array or Ptr), it can be written Ref(a, i) for creating a reference to the i-th element
  of a.

  When passed as a ccall argument (either as a Ptr or Ref type), a Ref object will be converted to a native
  pointer to the data it references.

  There is no invalid (NULL) Ref in Julia, but a C_NULL instance of Ptr can be passed to a ccall Ref argument.

  Use in broadcasting
  ≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡

  Ref is sometimes used in broadcasting in order to treat the referenced values as a scalar:

  julia> isa.(Ref([1,2,3]), [Array, Dict, Int])
  3-element BitArray{1}:
   1
   0
   0

It even explains, why Ref would be used in broadcasting.

7 Likes

It is actually just an alias for Core.Ref:

julia> Ref === Core.Ref
true
2 Likes

You can also write [ x for x in [1,2,3] if x in [3,4,5]], and I am mentioning this because it is, at least in this simple test, faster than intersect:

julia> using BenchmarkTools

julia> @btime [ i for i in [1,2,3] if i in [3,4,5] ]
  173.147 ns (7 allocations: 656 bytes)
1-element Array{Int64,1}:
 3

julia> @btime intersect([1,2,3],[3,4,5])
  494.927 ns (15 allocations: 1.33 KiB)
1-element Array{Int64,1}:
 3

EDIT: the results are not identical if there are repeated values in the vectors. In that case, you need to check what you want:

julia> @btime intersect([1, 4, 4, 5, 6], [4, 6, 6, 7, 8])
  541.625 ns (14 allocations: 1.36 KiB)
2-element Array{Int64,1}:
 4
 6

julia> @btime [ x for x in [1, 4, 4, 5, 6] if x in [4, 6, 6, 7, 8]]
  233.356 ns (9 allocations: 976 bytes)
3-element Array{Int64,1}:
 4
 4
 6

EDIT2: Dropping repeated elements with unique! is still faster than intersect:

julia> v1 = rand(1:10,10); 

julia> v2 = rand(1:10,20);

julia> @btime intersect($v1,$v2)
  773.919 ns (15 allocations: 1.46 KiB)
6-element Array{Int64,1}:
 9
 7
 2
 3
 4
 1

julia> @btime unique!([ x for x in $v1 if x in $v2 ])
  459.558 ns (9 allocations: 912 bytes)
6-element Array{Int64,1}:
 9
 7
 2
 3
 4
 1

Note that this is only for small n:

julia> using BenchmarkTools, Random

julia> a = rand(1000);

julia> b = shuffle!([rand(a, 500); rand(500)]);

julia> @btime [i for i in a if i in b];
  380.556 μs (1012 allocations: 24.06 KiB)

julia> @btime intersect(a, b);
  103.642 μs (27 allocations: 99.89 KiB)

This is because intersect converts the right-hand side to a Set first, which is not inexpensive for small n, but only scales linear for larger n. This has the benefit that lookup is now \mathcal O(1) instead of \mathcal O(n), so in total, if the vector on the left has length m and the one on the right length n, the complexity of intersect will be \mathcal O(m+n) instead of \mathcal O(m \cdot n) for your naive version.

16 Likes

Thanks for that ! I was really confused as to how interesect could be slower than the naive version and was just about to write a test case for large n.

I think either the naive version or intersect are easier for someone else reading through the code to understand rather than broadcasted in with a Ref()

1 Like

Any idea what the cutoff is? It might be worth making a PR to make intersect use a hybrid algorithm.

1 Like

My guess is that this probably depends on your specific architecture and how much intersection the two sets actually have, but for me, the cutoff is around ~100 elements in each array:

julia> using BenchmarkTools, Random

julia> a = rand(100);

julia> b = shuffle!([rand(a, 50); rand(50)]);

julia> @btime [i for i in a if i in b];
  7.031 μs (109 allocations: 2.81 KiB)

julia> @btime intersect(a, b);
  6.372 μs (22 allocations: 9.11 KiB)

Perhaps it’s enough to just document this behavior and let users make an informed decision themselves. We should probably use the naive method for tuples though, since tuples usually aren’t much larger than 100 elements.

4 Likes

Indeed, it is dependent on the size of the arrays and of the number of repeated elements. Here is a very crude benchmark (using eval=1 in the benchmark, because otherwise it takes too long):

intersect

10 Likes

Any chance you can add the 3rd dimension for when the lists don’t have equal sizes? My guess is that the cutoff mainly depends on the size of the bigger list, but would be nice to confirm.

If I have done things right (this really a draft…), this is the plot:

Green: intersect is faster. Blue: naive is faster. The number of common elements
decreases from the first to the last plot.

This is the code:

using Plots

naive(v1,v2) = unique!([ x for x in v1 if x in v2 ])

lv1 = Int[]
lv2 = Int[]
nr=Int[]
avnr=Float64[]
t = Float64[]
for i in 10:25:500
  println(i)
  for j in 10:25:500
    for k in 11:100:901
      tnaive = 0.
      tintersect = 0.
      nint = 0
      for l in 1:100
        v1 = rand(1:k,i)
        v2 = rand(1:k,j)
        tnaive += @elapsed naive(v1,v2)
        tintersect += @elapsed intersect(v1,v2)
        nint += length(intersect(v1,v2))
      end
      push!(lv1,i)
      push!(lv2,j)
      push!(nr,k)
      push!(t,tnaive-tintersect)
    end
  end
end

plot(layout=(3,3))

sp=0
for k in 11:100:901
  global sp += 1
  println(sp)
  inds = findall(i -> i == k, nr)
  l1 = lv1[inds]
  l2 = lv2[inds]
  times = t[inds]
  c = [ (x > 0 ? :green : :blue) for x in times ]
  scatter!(l1,l2,color=c,
           xlabel="vector 1 size",
           ylabel="vector 2 size",
           markershape=:square,
           markersize=10,
           xlim=[-15,515],
           ylim=[-15,515],
           label="",
           title="count(v1 ∩ v2) ∝ 1/$k",
           subplot=sp)

end

plot!(size=(1000,1000))
savefig("./intersect.pdf")

This one with a different range of common elements (more than above):

Edit: before anybody asks, I have used @elapsed instead of something more sophisticated because any of the tools of the BenchmarkTools was taking too long to run.

16 Likes

To be honest I’m not sure I fully understand either. Let’s take this:

julia> isa.(Ref([1,2,3]), [Array, Dict, Int])
3-element BitVector:
 1
 0
 0

julia> isa.([1,2,3], [Array, Dict, Int])
3-element BitVector:
 0
 0
 1

What do we see in the first snippet and what do we see in the second?

Think of Ref(x) like [x].

The first evaluates to

julia> [
           isa([1, 2, 3], Array),
           isa([1, 2, 3], Dict),
           isa([1, 2, 3], Int),
       ]
3-element Vector{Bool}:
 1
 0
 0

and the second to

julia> [
           isa(1, Array),
           isa(2, Dict),
           isa(3, Int),
       ]
3-element Vector{Bool}:
 0
 0
 1

The Ref makes the argument a “scalar” or “atom” in broadcasting.

7 Likes

I was suspecting the same but wasn’t sure, thank you for explaining it.

In other words, wrapping in a Ref causes the Vector to be treated as a single unit/thing, as the Vector itself - vs unpacking it into its elements, as broadcasting would do without the Ref.

3 Likes

What is the relation between the Ref meaning in the first part of the docstring (that I admit I haven’t fully understood) and its usage as “protection” from the broadcasting?
I have always see Ref in this second context, is it commonly used for other things?

Ref is akin to a 0-dimensional array, that can therefore contain only one element.

Now think about what happens when broadcasting a vector of length 1:

julia> x = [10];
julia> y = [1,2,3];
julia> x .+ y
3-element Vector{Int64}:
 11
 12
 13

everything happens as though the only element of x was repeated enough times to match the length of y (I think this is sometimes referred to as recycling the only element of x). Well, the same thing happens with Ref because it, too, contains only one element:

julia> Ref(10) .+ y
3-element Vector{Int64}:
 11
 12
 13
2 Likes

I am not sure why everyone adopted Ref for this, anything that acts as an scalar and encapsulates an arbitrary object will do. I, for an example, use single-element tuples instead of Ref, but this was just a question of personal taste. Now we have the situation in which Ref is used so much as a scalar container that people do not even know its reason to exist.

Ref is basically what would be a “pointer” in other languages. It is an object that just stores the memory position of another object, it may not be pointing anywhere (i.e., be empty) in which case it has a null value stored (which cannot be a valid memory address) and you can check if this is the case. They are much more relevant in languages with memory models different from Julia like C, for example, where assigning a struct to a local variable will copy every field in the struct at the right hand-side to the already preallocated struct on the left-hand side (and, as so, to manipulate some things without excessive copies you will need ); in Julia, the variable (which does not require any preallocation, i.e., it is just a label not a box) will now point to the exact same object as the right-side, and if it already contained something before, it is forgotten.

I have a kinda abstract post about the Julia memory model here: Is it appropriate to say a variable is a label or box? - #19 by Henrique_Becker
However, if you never programmed in a programming language with a different memory model the distinction may be lost on you.

7 Likes

I’m sort of surprised that doesn’t error, as all the broadcasting and operation syntax is so carefully thought to be consistent with standard math. (Never did that because in any case x[1] .+ ... or similar with begin would be more natural).

1 Like