FlexiJoins vs SortMerge (particularly in astronomy workflows)

aplavin · April 28, 2025, 11:18pm

I wouldn’t call FlexiJoins.jl “polished” tbh It does support lots of different join conditions with generally efficient implementations, but stuff like error messages and internal implementation design can definitely be improved.
A major feature of FlexiJoins.jl is uniform convenient interface for a wide range of join conditions and types. In particular, it supports SkyCoords.jl directly – joining two catalogs is just

innerjoin((A, B), by_distance(x -> x.coords, separation, ≤(1u"arcsecond")))

assuming elements of both datasets contain the coords property with SkyCoords objects.

I haven’t checked, but if SortMerge.jl is more performant, it could be added as an alternative backend to FlexiJoins.jl. Currently, it uses NearestNeighbors.jl trees for by-distance joins.

gcalderone · April 30, 2025, 11:01pm

I made some test and the outcome is that FlexiJoins typically performs better than SortMerge, but the latter is faster when the catalogs are pre-sorted.

MWE (see documentation here):

using SkyCoords, AstroLib, SortMerge
using FlexiJoins, Unitful, UnitfulAstro

# Generate some random coordinates
nn = 10_000_000
c1 = ICRSCoords.(rand(nn) .* 2pi, rand(nn) .* pi .- pi/2);
c2 = ICRSCoords.(rand(nn) .* 2pi, rand(nn) .* pi .- pi/2);

# Cross match using SortMerge:
lt(v, i, j) = (v[i].dec < v[j].dec)
function sd(v1, v2, i1, i2, threshold_arcsec)
    threshold_rad = threshold_arcsec / 3600. * pi / 180.

    d = (v1[i1].dec - v2[i2].dec) / threshold_rad
    (abs(d) >= 1)  &&  (return sign(d))

    maxd = max(abs(v1[i1].dec), abs(v2[i2].dec))
    if pi/2. - maxd > pi / 180. # avoid this optimization in regions below 1 deg from the poles
        d = abs(v1[i1].ra - v2[i2].ra)
        (d > pi)  &&  (d = 2pi - d)
        d *= cos(maxd) / threshold_rad
        (d >= 1)  &&  (return 999)
    end
    
    dd = gcirc(0, v1[i1].ra, v1[i1].dec, v2[i2].ra, v2[i2].dec)    
    (dd < threshold_rad)  &&  (return 0)
    return 999
end
@time jj = sortmerge(c1, c2, lt1=lt, lt2=lt, sd=sd, 1.);

# Cross match using FlexiJoins:
@time result = innerjoin((c1, c2), by_distance(identity, separation, <(1u"arcsecond")));

To cross match two catalog, each with 10 million entries, SortMerge takes ~33 sec on average, while FlexiJoin provides some ~30% better performance with an average of ~24 s.

On the other hand, when the input catalogs are sorted by declination:

cs1 = c1[sortperm([r.dec for r in c1])];
cs2 = c2[sortperm([r.dec for r in c2])];

# Cross match using SortMerge:
@time jj = sortmerge(cs1, cs2, sd=sd, 1., sorted=true);

# Cross match using FlexiJoins:
@time result = innerjoin((cs1, cs2), by_distance(identity, separation, <(1u"arcsecond")));

SortMerge takes ~9.6 sec, while FlexiJoins is requires ~16 sec.

As discussed during the meeting, I will submit a PR to list SortMerge in the JuliaAstro ecosystem.

Thanks (and sorry for the slightly off-topic post).

Matt_jl · May 1, 2025, 12:40pm

That’s awesome, but the documentation from FlexiJoin is not exactly clear. Would it be possible to return the indices of the matches rather than the matches themselves?

The_Mastermage · May 1, 2025, 2:59pm

Oh that’s super cool to know

aplavin · May 1, 2025, 9:34pm

First of all, I cannot help but wonder why you need this Would be curious to hear about the usecase, why indices and not elements themselves are needed.

Nevertheless indices are easy to retrieve – FlexiJoins.jl returns views of the original datasets:

julia> A = rand(100);
julia> B = rand(10);

julia> j = innerjoin((;A, B), by_distance(identity, Euclidean(), ≤(0.01)))
15-element StructArray(view(::Vector{Float64}, [21, 24, 24, 31, 47, 48, 50, 51, 58, 72, 75, 77, 90, 91, 91]), view(::Vector{Float64}, [6, 2, 4, 10, 7, 5, 9, 5, 6, 8, 6, 7, 7, 2, 4])) with eltype @NamedTuple{A::Float64, B::Float64}:
 (A = 0.2736830841831832, B = 0.2703319077489147)
 (A = 0.6382293710508645, B = 0.6413080028298531)
<...>

# individual element of the join result:
julia> j[1]
(A = 0.2736830841831832, B = 0.2703319077489147)

# indices of A in the join result:
julia> parentindices(j.A)
([21, 24, 24, 31, 47, 48, 50, 51, 58, 72, 75, 77, 90, 91, 91],)

jar1 · May 1, 2025, 10:10pm

parentindices, like parent, goes all the way up the ancestry tree, so the parentindices of a view of a view doesn’t necessarily give the indices into the joined arrays.

julia> parentindices(view(view([10, 20, 30, 40], 2:4), 1:2))
(2:3,)

So it sometimes will do what OP is asking for, but sometimes not quite.

aplavin · May 2, 2025, 6:00am

Good point regarding nested views. There’s joinindices(...) with the same arguments as innerjoin(...) that returns indices instead of elements – in fact, all actual calculations happen within joinindices.
If “getting indices of matches” is something generally useful, it can be streamlined and documented. I don’t think I ever needed/used it myself, so curious about the usecases @Matt_jl.

gcalderone · May 5, 2025, 8:42am

A use case is as follows: assume your tables are so large they don’t fit in RAM, you’ll need to

load just the coordinates column;
cross-match tables;
load remaining columns only for matching entries.

Topic		Replies	Views
[ANN] FlexiJoins.jl: fresh take on joining datasets Package Announcements	32	3188	April 3, 2025
Spatial join with dataframes General Usage dataframes , geo	27	2084	September 6, 2024
Conditional left join 2 dataframes when none of the columns are common General Usage dataframes , flexijoins	26	1993	May 26, 2022
Mapping Julia’s Astronomy Package Ecosystem: Bridging Gaps and Highlighting Strengths Astro/Space	48	1422	May 2, 2025
Left join algorithm for columnar tables General Usage tables , flexijoins	4	137	November 18, 2024

FlexiJoins vs SortMerge (particularly in astronomy workflows)

Related topics