Is it possible for you to post a working example of that function alone?
Edit:
function get_gains!(X, current_values, idxs, gains)
@inbounds Threads.@threads for i in eachindex(idxs)
s = 0.0
for j in eachindex(current_values)
s += @fastmath sqrt(current_values[j] + X[j, idxs[i]])
end
gains[i] = s
end
end;
If that is the bottleneck, have you tried to use @tturbo in the inner (or outer) loops instead @threads? Or @turbo in the inner loop at least? Seems to me that what is being computed is too simple for standard threading to be useful. If the number of elements of current_values is large this is also probably an easy use case for a GPU accelerated map reduce.
Edit 2:
In [25]:
opt2 = LazyGreedy(X_digits);
res2 = @btime fit(opt2, k);
bench2 = @benchmark fit(opt2, k);
t2 = bench2.times[1]/1e9;
13.836 s (22202 allocations: 8.95 GiB)
Those allocations, unless you know exactly where are they coming from, are an indication of a type instability. I find very hard to debug a notebook, but that may be an indication that the program can be much faster.