Some_thing in some_list

I am from matlab. Codes like

if  some_thing in some_list 

is new to me. It of course makes coding simple, but what is the price? I guess julia just checks the elements of some_list one by one to see whether some_thing is in it. That is very time consuming, right?

1 Like

if you have no information about the structure of some_list, then there is no other option to find an element

if you do have information about the structure of some_list, that can be put in the type and in can be specialized to be more performant on that type

2 Likes

It really depends on what some_thing and some_list are.

For example, if some_thing is 5 and some_list is 1:4 which a UnitRange in Julia, then the code used is as follows.

in(x::Integer, r::AbstractUnitRange{<:Integer}) = (first(r) <= x) & (x <= last(r))

In Julia, 1:4 is not [1, 2, 3, 4] exactly but rather a UnitRange, a struct with two fields. Only the code that is needed execute is a comparison with those two fields.

6 Likes

Matlab has the ismember Array elements that are members of set array - MATLAB ismember function which basically does the same thing. It’s not going to be especially fast, and you should likely consider using a Set if you have an algorithm that is bottle-necked by queries.

5 Likes

Is the RHS of this substantively different than first(r) <= x <= last(t) ? Git blame pegs this code from the v0.6 era, so maybe that syntax wasn’t supported yet?

1 Like

Just to spell this out explicitly: if some_list is a Set (despite the name), that’s another example where some_thing in some_list has a more efficient implementation than checking elements one by one; specifically, it does a hash table lookup.

The difference is the lack of short-circuiting in in since it uses & and not &&. Not sure why that’s preferable. Maybe something to do with SIMDability when broadcasting? Anyway, once & is replaced with && the two variants generate the same lowered code:

function shortcircuitingin(x::Integer, r::AbstractUnitRange{<:Integer})
    return (first(r) <= x) && (x <= last(r))
end

function chainingin(x::Integer, r::AbstractUnitRange{<:Integer})
    return first(r) <= x <= last(r)
end
julia> @code_lowered in(1, 2:3)
CodeInfo(
1 ─ %1 = Base.first(r)
│   %2 = %1 <= x
│   %3 = Base.last(r)
│   %4 = x <= %3
│   %5 = %2 & %4
└──      return %5
)

julia> @code_lowered shortcircuitingin(1, 2:3)
CodeInfo(
1 ─ %1 = Main.first(r)
│   %2 = %1 <= x
└──      goto #3 if not %2
2 ─ %4 = Main.last(r)
│   %5 = x <= %4
└──      return %5
3 ─      return false
)

julia> @code_lowered chainingin(1, 2:3)
CodeInfo(
1 ─ %1 = Main.first(r)
│   %2 = %1 <= x
└──      goto #3 if not %2
2 ─ %4 = Main.last(r)
│   %5 = x <= %4
└──      return %5
3 ─      return false
)

in general, branches are slow so Julia’s version uses 1 branch instead of 2 (although realistically LLVM probably could do this for us)

You mean like this?

julia> for k in 1:10
           if isodd(k)
                println(k)
           end
       end
1
3
5
7
9

Others have pointed out that the particular code executed depends on the type of some_list, I wanted to add an observation, which is that a Set is not necessarily more performant than a Vector for small collections of values.

A Set will use a hash lookup of some_thing in the Set, which scales well: it takes about as long to check some_thing against a Set with thousands of elements as it does to check it against a Set with only a few.

But if some_list is a short Vector (where “short” is strictly empirical), testing membership of some_thing might be faster than a Set. The Set membership has to hash some_thing and probe it, this is branchy code, whereas a small Vector of primitive values is probably laid out in contiguous memory and may be scanned with good cache locality. But the average time to do this always increases as the Vector gets longer, and at some point a Set is clearly the better choice. But if you need the data to be in a Vector, you can’t save time converting it into a Set just to check membership.

The meta-lesson here is that performance is empirical, when it’s important you must profile and run benchmarks to have confidence that the code is optimal.

6 Likes

That applies to the default Set (and Dict it’s built on), but Julia has others (smallintdict in Julia, only in C it seems). FYI: OrderedSet might already be faster, for small sets, from:

It also implements LittleDict which is a ordered dictionary, that is much faster than any other AbstractDict (ordered or not) for small collections.

[You can use that for sets already; must prevent possibilities of duplicates manually.]

I’ve thought of implementing a HybridDict of that one and a regular one, would be helpful for small dicts, keeping scalability of large ones, also helpful for sets. Python dicts are already ordered, I don’t know of its set implementation, might have such an optimization already, and you can use:

PySet: PythonCall.jl/src/Wrap/PySet.jl at 2165f91602e7f1de862384f442c2c9814766c1e6 · JuliaPy/PythonCall.jl · GitHub

LittleDict uses a pair of Vectors under the hood (or Tuples if frozen/imutable), so it’s doing the same thing if you squint :slight_smile:

The ordered data structures are very nice if you need to preserve insertion order! My point is that a question like “is checking membership of a scalar in a vector fast or slow” can’t be answered without more information. LittleDict says the heuristic should be about 30 elements, so let’s go with that: if your Vector has length(v) < 30 or so, switching to a Set is unlikely to result in a speedup. Still might be the correct data structure, if you need to compute unions or intersections for example. It all depends on what you need.

2 Likes