I'm new to Julia and am hoping to use Apple's MLX Framework with Julia

I’m new to Julia and I am loving the language. Some test code I have runs in about 2 minutes, which I found astounding given the amount of computation involved and that the code is written with loops. However, when I port the code to Python and use Apple’s MLX framework on my M4 Mac Studio, it runs in about 1 second. Supposedly, the new M5 chip would be 4X faster still. I don’t mind waiting a couple of minutes for this kind of problem, but for some problems I will want the extra performance my hardware has to offer. Is there a chance that the MLX framework will be accessible to Julia users in the near future?

Welcome! A discrepancy that large suggests something bigger is going on. Have you seen the performance tips chapter of the manual and followed its advice?

Yes, I have, but there is an awful lot of information on that page to take into account. I doubt that my function is fully optimized yet. It is threaded to use the 12 performance CPU cores on my Mac Studio. I’ll keep trying, but I doubt that I will be able to obtain a 120X improvement.

Have you tried using Reactant.jl to optimize it?

No I haven’t, but I just tried out a Metal version, and the speed up is considerable. I’m beginning to regret having posted my question.

julia> @time radii, C_r = grassberger_procaccia(U); #default
120.226431 seconds (2.13 G allocations: 1.692 TiB, 35.43% gc time)

julia> @time radii, C_r = grassberger_procaccia2(U); #Metal
44.735732 seconds (25.25 M allocations: 1.639 GiB, 0.19% gc time, 16.20% compilation time: 2% of which was recompilation)

Depending on your use case, AppleAccelerate.jl may also help.

Do you have a minimum working example of the code you’re trying to speed up? It’s the best way to nerd-snipe people here.

Thanks, I did try that, and it didn’t seem to help. But the code has changed a lot since then, so it is worth trying again.

I’m still working on the Metal version; the results aren’t quite right yet. The code below should run, but maybe not on all hardware. I’ve been making heavy use of AI, so anything wrong with this code is Gemini 3 Pro’s fault. Note that this is the original version that I was trying to optimize. The Metal version, or something better, will replace it.

using Base.Threads
using LinearAlgebra

function grassberger_procaccia(data::AbstractMatrix; n_bins::Int=64, r_min=nothing,
                               r_max=nothing)

    n_obs, n_features = size(data)

    # Transpose for column major
    data_t = permutedims(data)

    # Estimate Range (if not provided)
    if isnothing(r_max) || isnothing(r_min)
        sample_dists = Float64[]
        attempts = 0
        # safe loop
        while length(sample_dists) < 1000 && attempts < 10000
            i, j = rand(1:n_obs, 2)
            # Simple check to deal with cases causing problems
            if i != j
                d = norm(view(data_t, :, i) .- view(data_t, :, j))
                if d > 0 push!(sample_dists, d) end
            end
            attempts += 1
        end
        if isnothing(r_max) r_max = maximum(sample_dists) * 1.5 end
        if isnothing(r_min) r_min = minimum(sample_dists) / 2.0 end
    end

    ln_min = log(r_min)
    ln_max = log(r_max)
    bin_width = (ln_max - ln_min) / n_bins
    radii = exp.(range(ln_min + bin_width, ln_max, length=n_bins))

    max_id = Threads.maxthreadid()
    histograms = [zeros(Int, n_bins) for _ in 1:max_id]

    @threads for i in 1:(n_obs - 1)
        tid = Threads.threadid()
        
        if tid > max_id
            continue 
        end
        
        col_i = view(data_t, :, i)
        
        for j in (i + 1):n_obs
            dist = norm(col_i .- view(data_t, :, j))
            
            if dist > 0
                val = (log(dist) - ln_min) / bin_width
                
                if isfinite(val)
                    bin_idx = floor(Int, val) + 1
                    if 1 <= bin_idx <= n_bins
                        @inbounds histograms[tid][bin_idx] += 1
                    end
                end
            end
        end
    end

    total_hist = sum(histograms)

    # Cumulative Sum
    counts_cumulative = cumsum(total_hist)

    # Normalize
    total_pairs = n_obs * (n_obs - 1) / 2
    C_r = counts_cumulative ./ total_pairs

    return radii, C_r
end

X = randn(Float64, (37651,300)) #same size as my real data
foreach(normalize!, eachrow(X))
@time radii, C_r = grassberger_procaccia(X)
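For completeness: once you have `radii` and `C_r`, the correlation dimension is conventionally estimated as the slope of log C(r) against log r over the scaling region. A minimal least-squares sketch (`correlation_dimension` is a hypothetical helper name, not part of the function above):

```julia
# Estimate the correlation dimension as the least-squares slope of
# log C(r) vs log r over the scaling region.
function correlation_dimension(radii, C_r)
    keep = C_r .> 0                      # log requires positive counts
    x, y = log.(radii[keep]), log.(C_r[keep])
    n = length(x)
    (n * sum(x .* y) - sum(x) * sum(y)) / (n * sum(x .^ 2) - sum(x)^2)
end

# Toy check: for an ideal power law C(r) ∝ r^D, the slope recovers D exactly.
radii = exp.(range(log(0.01), log(1.0), length=32))
D = 1.7
C_r = radii .^ D
correlation_dimension(radii, C_r)  # ≈ 1.7
```

In practice you would restrict the fit to the visually linear part of the log–log curve rather than the full range.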

That seems like a lot of allocations. I think many of them are coming from your norm(x .- y)s — that allocates a temporary for x .- y on every call. That’s where I’m sure you could do significantly better, even on a plain old CPU. Changing those to Distances.euclidean from Distances.jl (and nothing else!) gets me down to 18s on my crappy old M1 laptop with way fewer allocations:

julia> @time radii, C_r = grassberger_procaccia(X)
250.868078 seconds (2.13 G allocations: 1.692 TiB, 37.04% gc time, 0.28% compilation time)

julia> @time radii, C_r = grassberger_procaccia_euclidean(X)
 18.588317 seconds (862.84 k allocations: 128.668 MiB, 0.02% gc time, 1.61% compilation time)
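The swap really is that small. A sketch of the changed distance computation, assuming Distances.jl is installed (toy vectors, not your data):

```julia
using Distances        # euclidean(x, y): one-pass distance, no temporary array
using LinearAlgebra

x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]

# Original form: materializes the temporary x .- y on every call
d1 = norm(x .- y)

# Replacement: walks both arrays in a single pass, allocation-free
d2 = euclidean(x, y)

d1 ≈ d2   # identical result; only the allocation behavior differs
```

Inside the hot loop this means replacing `norm(col_i .- view(data_t, :, j))` with `euclidean(col_i, view(data_t, :, j))`, and likewise in the range-estimation loop.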

You’ll also want to avoid using `tid` to divide up your threads’ work; see “PSA: Thread-local state is no longer recommended; Common misconceptions about threadid() and nthreads()” for more details there.
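The pattern recommended there is one accumulator per *task*, with the work split into chunks up front, rather than indexing shared state by `threadid()`. A minimal sketch of that idea (`chunked_histogram` and the toy binning are illustrative, not your function):

```julia
using Base.Threads

# Split the index range into nthreads() chunks and spawn one task per chunk.
# Each task owns its own histogram, so no task ever touches another's state.
function chunked_histogram(vals, n_bins)
    nchunks = nthreads()
    chunks = Iterators.partition(eachindex(vals), cld(length(vals), nchunks))
    tasks = map(chunks) do idxs
        @spawn begin
            hist = zeros(Int, n_bins)          # owned by this task alone
            for i in idxs
                b = clamp(vals[i], 1, n_bins)  # toy binning for illustration
                hist[b] += 1
            end
            hist
        end
    end
    return sum(fetch, tasks)                   # merge the per-task results
end

vals = rand(1:8, 10_000)
h = chunked_histogram(vals, 8)
sum(h) == 10_000
```

This stays correct even if the scheduler migrates tasks between threads, which is exactly what breaks the `threadid()`-indexed version.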

It works. Amazing. Thank you for your help. Not only is the language wonderful, but so is the community.

julia> @time radii, C_r = grassberger_procaccia(X)
  7.789150 seconds (962.49 k allocations: 136.976 MiB, 0.03% gc time, 4.86% compilation time)