I'm new to Julia and am hoping to use Apple's MLX Framework with Julia

I’m new to Julia and I am loving the language. Some test code I have runs in about 2 minutes, which I found astounding given the amount of computation involved and that the code is written with loops. However, when I port the code to Python and use Apple’s MLX framework on my M4 Mac Studio, it runs in about 1 second. Supposedly, the new M5 chip would be 4X faster still. I don’t mind waiting a couple of minutes for this kind of problem, but for some problems I will want the extra performance my hardware has to offer. Is there a chance that the MLX framework will be accessible to Julia users in the near future?

Welcome! A discrepancy that large suggests something bigger is going on. Have you seen the performance tips chapter of the manual and followed its advice?

Yes, I have, but there is an awful lot of information on that page to take into account. I doubt that my function is fully optimized yet. It is threaded to use the 12 performance CPU cores on my Mac Studio. I’ll keep trying, but I doubt that I will be able to obtain a 120X improvement.

Have you tried using Reactant.jl to optimize it?

No I haven’t, but I just tried out a Metal version, and the speed up is considerable. I’m beginning to regret having posted my question.

julia> @time radii, C_r = grassberger_procaccia(U); #default
120.226431 seconds (2.13 G allocations: 1.692 TiB, 35.43% gc time)

julia> @time radii, C_r = grassberger_procaccia2(U); #Metal
44.735732 seconds (25.25 M allocations: 1.639 GiB, 0.19% gc time, 16.20% compilation time: 2% of which was recompilation)

Depending on your use case, AppleAccelerate.jl may also help.

Do you have a minimum working example of the code you’re trying to speed up? It’s the best way to nerd-snipe people here.

Thanks, I did try that, and it didn’t seem to help. But the code has changed a lot since then, so it is worth trying again.

I’m still working on the Metal version; the results aren’t quite right yet. The code below should run, but maybe not on all hardware. I’ve been making heavy use of AI, so anything wrong with this code is Gemini 3 Pro’s fault. Note that this is the original version that I was trying to optimize. The Metal version, or something better, will replace it.

using Base.Threads
using LinearAlgebra

function grassberger_procaccia(data::AbstractMatrix; n_bins::Int=64, r_min=nothing,
                               r_max=nothing)

    n_obs, n_features = size(data)

    # Transpose for column major
    data_t = permutedims(data)

    # Estimate Range (if not provided)
    if isnothing(r_max) || isnothing(r_min)
        sample_dists = Float64[]
        attempts = 0
        # safe loop
        while length(sample_dists) < 1000 && attempts < 10000
            i, j = rand(1:n_obs, 2)
            # Simple check to deal with cases causing problems
            if i != j
                d = norm(view(data_t, :, i) .- view(data_t, :, j))
                if d > 0 push!(sample_dists, d) end
            end
            attempts += 1
        end
        if isnothing(r_max) r_max = maximum(sample_dists) * 1.5 end
        if isnothing(r_min) r_min = minimum(sample_dists) / 2.0 end
    end

    ln_min = log(r_min)
    ln_max = log(r_max)
    bin_width = (ln_max - ln_min) / n_bins
    radii = exp.(range(ln_min + bin_width, ln_max, length=n_bins))

    max_id = Threads.maxthreadid()
    histograms = [zeros(Int, n_bins) for _ in 1:max_id]

    @threads for i in 1:(n_obs - 1)
        tid = Threads.threadid()
        
        if tid > max_id
            continue 
        end
        
        col_i = view(data_t, :, i)
        
        for j in (i + 1):n_obs
            dist = norm(col_i .- view(data_t, :, j))
            
            if dist > 0
                val = (log(dist) - ln_min) / bin_width
                
                if isfinite(val)
                    bin_idx = floor(Int, val) + 1
                    if 1 <= bin_idx <= n_bins
                        @inbounds histograms[tid][bin_idx] += 1
                    end
                end
            end
        end
    end

    total_hist = sum(histograms)

    # Cumulative Sum
    counts_cumulative = cumsum(total_hist)

    # Normalize
    total_pairs = n_obs * (n_obs - 1) / 2
    C_r = counts_cumulative ./ total_pairs

    return radii, C_r
end

X = randn(Float64, (37651,300)) #same size as my real data
foreach(normalize!, eachrow(X))
@time radii, C_r = grassberger_procaccia(X)
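For completeness: once you have `radii` and `C_r`, the correlation dimension is conventionally estimated as the slope of log C(r) against log r over the scaling region. A minimal least-squares sketch (`correlation_dimension` is a hypothetical helper name, not part of the function above):

```julia
# Estimate the correlation dimension as the least-squares slope of
# log C(r) vs log r over the scaling region.
function correlation_dimension(radii, C_r)
    keep = C_r .> 0                      # log requires positive counts
    x, y = log.(radii[keep]), log.(C_r[keep])
    n = length(x)
    (n * sum(x .* y) - sum(x) * sum(y)) / (n * sum(x .^ 2) - sum(x)^2)
end

# Toy check: for an ideal power law C(r) ∝ r^D, the slope recovers D exactly.
radii = exp.(range(log(0.01), log(1.0), length=32))
D = 1.7
C_r = radii .^ D
correlation_dimension(radii, C_r)  # ≈ 1.7
```

In practice you would restrict the fit to the visually linear part of the log–log curve rather than the full range.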

That seems like a lot of allocations. I think many of them are coming from your norm(x .- y)s — that allocates a temporary for x .- y on every call. That’s where I’m sure you could do significantly better, even on a plain old CPU. Changing those to Distances.euclidean from Distances.jl (and nothing else!) gets me down to 18s on my crappy old M1 laptop with way fewer allocations:

julia> @time radii, C_r = grassberger_procaccia(X)
250.868078 seconds (2.13 G allocations: 1.692 TiB, 37.04% gc time, 0.28% compilation time)

julia> @time radii, C_r = grassberger_procaccia_euclidean(X)
 18.588317 seconds (862.84 k allocations: 128.668 MiB, 0.02% gc time, 1.61% compilation time)
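The swap really is that small. A sketch of the changed distance computation, assuming Distances.jl is installed (toy vectors, not your data):

```julia
using Distances        # euclidean(x, y): one-pass distance, no temporary array
using LinearAlgebra

x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]

# Original form: materializes the temporary x .- y on every call
d1 = norm(x .- y)

# Replacement: walks both arrays in a single pass, allocation-free
d2 = euclidean(x, y)

d1 ≈ d2   # identical result; only the allocation behavior differs
```

Inside the hot loop this means replacing `norm(col_i .- view(data_t, :, j))` with `euclidean(col_i, view(data_t, :, j))`, and likewise in the range-estimation loop.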

You’ll also want to avoid using `tid` to divide up your threads’ work; see “PSA: Thread-local state is no longer recommended; Common misconceptions about threadid() and nthreads()” for more details there.
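The pattern recommended there is one accumulator per *task*, with the work split into chunks up front, rather than indexing shared state by `threadid()`. A minimal sketch of that idea (`chunked_histogram` and the toy binning are illustrative, not your function):

```julia
using Base.Threads

# Split the index range into nthreads() chunks and spawn one task per chunk.
# Each task owns its own histogram, so no task ever touches another's state.
function chunked_histogram(vals, n_bins)
    nchunks = nthreads()
    chunks = Iterators.partition(eachindex(vals), cld(length(vals), nchunks))
    tasks = map(chunks) do idxs
        @spawn begin
            hist = zeros(Int, n_bins)          # owned by this task alone
            for i in idxs
                b = clamp(vals[i], 1, n_bins)  # toy binning for illustration
                hist[b] += 1
            end
            hist
        end
    end
    return sum(fetch, tasks)                   # merge the per-task results
end

vals = rand(1:8, 10_000)
h = chunked_histogram(vals, 8)
sum(h) == 10_000
```

This stays correct even if the scheduler migrates tasks between threads, which is exactly what breaks the `threadid()`-indexed version.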

It works. Amazing. Thank you for your help. Not only is the language wonderful, but so is the community.

julia> @time radii, C_r = grassberger_procaccia(X)
  7.789150 seconds (962.49 k allocations: 136.976 MiB, 0.03% gc time, 4.86% compilation time)