Unique() to a certain tolerance?

faq
#1

Hi all,
I’m writing code that evaluates the energy of a large number of spin states, and many of those states have the same energy: typically about 2^20 different states, but only a few thousand distinct energies. I want to know how many of these energies are different, so I evaluate all the energies, store them in an array E of length 2^20, and then evaluate

unique(E)

to get the number of energies that are strictly different. So far, so good, but the problem is that I find different states with energies that are very close, for instance

-2.9999999999
-2.9999999998

and I’m 100% sure these have the same value, and the difference is just numerical error. But unique(E) says these states are different.
So the question is: is there a simple way to tell unique() to compare only up to, let’s say, 5 decimal places? Or should I first round off the energies and then use unique() on the result? And if I do have to round the energies before feeding them to unique(), what is the right way to do that?

Thanks a lot,

Ferran.

#2

You can use “trunc”:

data = rand(10)
unique(trunc.(data, digits = 4))
#3

You can pass a function to unique:

unique(x -> round(x, digits=0), randn(10))

Perhaps this will be unique(round(_, digits=0), randn(10)) soon…

#4

Maybe related to:

1 Like
#5

Someone periodically asks this, but note that:

So no matter what you do, you’ll be potentially putting values that differ by one unit in the last place into different buckets.
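This boundary effect is easy to demonstrate. A minimal sketch (the specific values are just illustrative):

```julia
# Two energies that differ by only ~2e-11,
# but straddle the round-to-4-digits boundary:
a = 2.00004999999
b = 2.00005000001
round(a, digits=4)  # 2.0
round(b, digits=4)  # 2.0001
length(unique(round.([a, b], digits=4)))  # 2 — still "different" after rounding
```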

2 Likes
#6

Truncating before unique unfortunately introduces a discontinuity when some energy levels sit close to a truncation boundary. An easy variant is the following (if your energy is one-dimensional and you know the discretization parameter a priori):

function discretize(energy_levels, epsilon)
    res = Vector{Vector{eltype(energy_levels)}}()
    current = Vector{eltype(energy_levels)}()
    bound = energy_levels[1][1] + epsilon
    mingap = Inf
    for (energy, idx) in energy_levels
        if energy <= bound
            push!(current, (energy, idx))
        else
            mingap = min(mingap, energy - current[end][1])
            push!(res, current)
            current = [(energy, idx)]
            bound = energy + epsilon
        end
    end
    push!(res, current)
    return mingap, res
end

Then:

julia> N = 10_000;
julia> energy_levels = sort!([(randn() + 5*iseven(i), i) for i=1:N]);
julia> discretize(energy_levels, 1.0)
(4.000029130021192, Array{Tuple{Float64,Int64},1}[[(1.00344e-5, 1649), (0.00115351, 1023), (0.00124319, 4715), (0.001514, 2551), (0.00183936, 499), (0.00186862, 7835), (0.00190698, 4597), (0.00227954, 4959), (0.00234361, 6277), (0.00271333, 4933)  …  (0.997599, 9123), (0.997636, 8371), (0.997659, 7433), (0.997705, 1337), (0.998313, 769), (0.998855, 7799), (0.99897, 7941), (0.999027, 2609), (0.999211, 5695), (0.999993, 4315)]])

julia> energy_levels = sort!([(randn(), i) for i=1:N]);
julia> discretize(energy_levels, 1.0)
(0.00019930580942706388, Array{Tuple{Float64,Int64},1}[[(-3.65232, 7208), (-3.49434, 2538), (-3.27886, 9182), (-3.24522, 2502), (-3...

You see that the mingap is large in the first case (because your distribution really is just two values with some small error) and small in the second case (assumptions for discretize to make sense are violated). Use mingap > 2*epsilon as a reality-check.

If you don’t know the error scale a priori, then you need to do some kind of hierarchical clustering in order to discover the error scale.

If your numerical errors are on the same scale as the spectral gaps, then you are SOL and need better computations. If you have a continuous spectrum, then you are SOL, period. If eigenvalues accumulate somewhere (a typical situation for compact operators like the inverse Laplacian) and you know the accumulation points (0 for compact operators), then you should special-case the accumulation points.

edit: bugfix

#7

I would normally use sigdigits=n instead of digits. Using digits is like an absolute tolerance, and is sensitive to the overall scaling of the data, whereas sigdigits is a scale-invariant relative tolerance.
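A quick sketch of the difference, using values like the ones in the original post (the 1e6 rescaling is just for illustration):

```julia
E = [-2.9999999999, -2.9999999998]
length(unique(round.(E, digits=5)))            # 1 — both round to -3.0
length(unique(round.(1e6 .* E, digits=5)))     # 2 — absolute tolerance breaks after rescaling
length(unique(round.(1e6 .* E, sigdigits=5)))  # 1 — relative tolerance is scale-invariant
```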

2 Likes
#8

Oh, thanks for the replies and the different proposals… I will try them at once 🙂
Best,
Ferran.

#9

The problem with trunc and round is that they cut in a fixed direction relative to the bucket boundaries.
For example, if you truncate 2.00 and 2.03 to one decimal place you get 2.0 for both, so they compare equal. But 2.49 and 2.51 become 2.4 and 2.5, and will be different, even though they are closer together.

One solution could be to sort all the numbers, compute their consecutive differences, and then decide from those gaps how to group the values…
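In code (a minimal illustration of the directionality problem):

```julia
# Values 0.03 apart that truncate to the same bucket:
trunc(2.00, digits=1), trunc(2.03, digits=1)  # (2.0, 2.0)
# Values only 0.02 apart that truncate to different buckets:
trunc(2.49, digits=1), trunc(2.51, digits=1)  # (2.4, 2.5)
```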

#10

I would do something roughly like this:

  1. sort and then diff all the values,
  2. drop the zeros from the differences, establish a cutoff \Delta from the rest (eg using a quantile, I would plot first),
  3. consecutive sorted values within a distance of \Delta go in the same bin, with a sanity check that the bins don’t get too wide (e.g. values chaining together as 1+\epsilon, 1+2\epsilon, …), though the way you describe the problem this should not happen.
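The steps above can be sketched as follows (assuming scalar energies; `bin_by_gaps` and the hard-coded cutoff are illustrative, not from the thread — \Delta would come from inspecting the gap distribution in step 2):

```julia
# Sketch of the sort → diff → bin approach. `cutoff` plays the role of Δ.
function bin_by_gaps(E::AbstractVector{<:Real}, cutoff::Real)
    s = sort(E)
    bins = [[s[1]]]
    for x in s[2:end]
        # a gap larger than cutoff starts a new bin
        if x - last(last(bins)) > cutoff
            push!(bins, [x])
        else
            push!(last(bins), x)
        end
    end
    return bins
end

E = [-2.9999999999, -1.0, -2.9999999998, -0.9999999997]
bins = bin_by_gaps(E, 1e-6)
length(bins)  # 2 — one bin near -3, one near -1
```

Per point 3, a sanity check on each bin’s width (maximum minus minimum) guards against bins chaining together.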
#11

That’s exactly what my proposal does.

The general approach (higher-dimensional “energy” function, unknown error scale) is to do some kind of hierarchical clustering / dendrogram (preferably using something like k-NN trees), spot the gap, and use the clustering at that level. This is complicated and requires human input at some point (alternatively, it requires assumptions on the errors and the distribution that you lack).