FlexiJoins vs SortMerge (particularly in astronomy workflows)

I wouldn’t call FlexiJoins.jl “polished” tbh :slight_smile: It does support lots of different join conditions with generally efficient implementations, but stuff like error messages and internal implementation design can definitely be improved.
A major feature of FlexiJoins.jl is uniform convenient interface for a wide range of join conditions and types. In particular, it supports SkyCoords.jl directly – joining two catalogs is just

innerjoin((A, B), by_distance(x -> x.coords, separation, ≤(1u"arcsecond")))

assuming elements of both datasets contain the coords property with SkyCoords objects.

I haven’t checked, but if SortMerge.jl is more performant, it could be added as an alternative backend to FlexiJoins.jl. Currently, it uses NearestNeighbors.jl trees for by-distance joins.

2 Likes

I made some test and the outcome is that FlexiJoins typically performs better than SortMerge, but the latter is faster when the catalogs are pre-sorted.

MWE (see documentation here):

using SkyCoords, AstroLib, SortMerge
using FlexiJoins, Unitful, UnitfulAstro

# Generate some random coordinates
nn = 10_000_000
c1 = ICRSCoords.(rand(nn) .* 2pi, rand(nn) .* pi .- pi/2);
c2 = ICRSCoords.(rand(nn) .* 2pi, rand(nn) .* pi .- pi/2);

# Cross match using SortMerge:
lt(v, i, j) = (v[i].dec < v[j].dec)
function sd(v1, v2, i1, i2, threshold_arcsec)
    threshold_rad = threshold_arcsec / 3600. * pi / 180.

    d = (v1[i1].dec - v2[i2].dec) / threshold_rad
    (abs(d) >= 1)  &&  (return sign(d))

    maxd = max(abs(v1[i1].dec), abs(v2[i2].dec))
    if pi/2. - maxd > pi / 180. # avoid this optimization in regions below 1 deg from the poles
        d = abs(v1[i1].ra - v2[i2].ra)
        (d > pi)  &&  (d = 2pi - d)
        d *= cos(maxd) / threshold_rad
        (d >= 1)  &&  (return 999)
    end
    
    dd = gcirc(0, v1[i1].ra, v1[i1].dec, v2[i2].ra, v2[i2].dec)    
    (dd < threshold_rad)  &&  (return 0)
    return 999
end
@time jj = sortmerge(c1, c2, lt1=lt, lt2=lt, sd=sd, 1.);

# Cross match using FlexiJoins:
@time result = innerjoin((c1, c2), by_distance(identity, separation, <(1u"arcsecond")));

To cross match two catalog, each with 10 million entries, SortMerge takes ~33 sec on average, while FlexiJoin provides some ~30% better performance with an average of ~24 s.

On the other hand, when the input catalogs are sorted by declination:

cs1 = c1[sortperm([r.dec for r in c1])];
cs2 = c2[sortperm([r.dec for r in c2])];

# Cross match using SortMerge:
@time jj = sortmerge(cs1, cs2, sd=sd, 1., sorted=true);

# Cross match using FlexiJoins:
@time result = innerjoin((cs1, cs2), by_distance(identity, separation, <(1u"arcsecond")));

SortMerge takes ~9.6 sec, while FlexiJoins is requires ~16 sec.

As discussed during the meeting, I will submit a PR to list SortMerge in the JuliaAstro ecosystem.

Thanks (and sorry for the slightly off-topic post).

2 Likes

That’s awesome, but the documentation from FlexiJoin is not exactly clear. Would it be possible to return the indices of the matches rather than the matches themselves?

Oh that’s super cool to know

First of all, I cannot help but wonder why you need this :slight_smile: Would be curious to hear about the usecase, why indices and not elements themselves are needed.

Nevertheless indices are easy to retrieve – FlexiJoins.jl returns views of the original datasets:

julia> A = rand(100);
julia> B = rand(10);

julia> j = innerjoin((;A, B), by_distance(identity, Euclidean(), ≤(0.01)))
15-element StructArray(view(::Vector{Float64}, [21, 24, 24, 31, 47, 48, 50, 51, 58, 72, 75, 77, 90, 91, 91]), view(::Vector{Float64}, [6, 2, 4, 10, 7, 5, 9, 5, 6, 8, 6, 7, 7, 2, 4])) with eltype @NamedTuple{A::Float64, B::Float64}:
 (A = 0.2736830841831832, B = 0.2703319077489147)
 (A = 0.6382293710508645, B = 0.6413080028298531)
<...>

# individual element of the join result:
julia> j[1]
(A = 0.2736830841831832, B = 0.2703319077489147)

# indices of A in the join result:
julia> parentindices(j.A)
([21, 24, 24, 31, 47, 48, 50, 51, 58, 72, 75, 77, 90, 91, 91],)
1 Like

parentindices, like parent, goes all the way up the ancestry tree, so the parentindices of a view of a view doesn’t necessarily give the indices into the joined arrays.

julia> parentindices(view(view([10, 20, 30, 40], 2:4), 1:2))
(2:3,)

So it sometimes will do what OP is asking for, but sometimes not quite.

Good point regarding nested views. There’s joinindices(...) with the same arguments as innerjoin(...) that returns indices instead of elements – in fact, all actual calculations happen within joinindices.
If “getting indices of matches” is something generally useful, it can be streamlined and documented. I don’t think I ever needed/used it myself, so curious about the usecases @Matt_jl.

2 Likes

A use case is as follows: assume your tables are so large they don’t fit in RAM, you’ll need to

  • load just the coordinates column;
  • cross-match tables;
  • load remaining columns only for matching entries.