Unique() to a certain tolerance?

faq
#1

Hi all,
I’m writing code that evaluates the energy of a large number of spin states, and many of those states have the same energy: typically about 2^20 different states, but only a few thousand distinct energies. I want to know how many of these energies are different, so I evaluate all the energies, store them in an array E of length 2^20, and then evaluate

unique(E)

to get the number of energies that are strictly different. So far, so good, but the problem is that I find different states with energies that are very close, for instance

-2.9999999999
-2.9999999998

and I’m 100% sure these have the same value, and the difference is just numerical error. But unique(E) says these states are different.
So the question is: is there a simple way to tell unique() to compare only up to, let’s say, 5 decimal places? Or should I first round off the energies and then use unique() on the result? And if I do have to round the energies before feeding them to unique(), what is the right way to do that?

Thanks a lot,

Ferran.

#2

You can use “trunc”:

data = rand(10)
unique(trunc.(data, digits = 4))
#3

You can pass a function to unique:

unique(x -> round(x, digits=0), randn(10))

Perhaps this will be unique(round(_, digits=0), randn(10)) soon…

#4

Maybe related to:

1 Like
#5

Someone periodically asks this, but note that:

So no matter what you do, you’ll be potentially putting values that differ by one unit in the last place into different buckets.
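This boundary effect is easy to demonstrate. A minimal sketch (the specific values are just illustrative):

```julia
# Two energies that differ by only ~2e-11,
# but straddle the round-to-4-digits boundary:
a = 2.00004999999
b = 2.00005000001
round(a, digits=4)  # 2.0
round(b, digits=4)  # 2.0001
length(unique(round.([a, b], digits=4)))  # 2 — still "different" after rounding
```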

2 Likes
#6

Truncating before unique unfortunately introduces a discontinuity when some energy levels sit close to a truncation boundary. An easy variant is the following (if your energy is one-dimensional and you know the discretization parameter a priori):

function discretize(energy_levels, epsilon)
    res = Vector{Vector{eltype(energy_levels)}}()
    current = Vector{eltype(energy_levels)}()
    bound = energy_levels[1][1] + epsilon
    mingap = Inf
    for (energy, idx) in energy_levels
        if energy <= bound
            push!(current, (energy, idx))
        else
            mingap = min(mingap, energy - current[end][1])
            push!(res, current)
            current = [(energy, idx)]
            bound = energy + epsilon
        end
    end
    push!(res, current)
    return mingap, res
end

Then:

julia> N = 10_000;
julia> energy_levels = sort!([(randn() + 5*iseven(i), i) for i=1:N]);
julia> discretize(energy_levels, 1.0)
(4.000029130021192, Array{Tuple{Float64,Int64},1}[[(1.00344e-5, 1649), (0.00115351, 1023), (0.00124319, 4715), (0.001514, 2551), (0.00183936, 499), (0.00186862, 7835), (0.00190698, 4597), (0.00227954, 4959), (0.00234361, 6277), (0.00271333, 4933)  …  (0.997599, 9123), (0.997636, 8371), (0.997659, 7433), (0.997705, 1337), (0.998313, 769), (0.998855, 7799), (0.99897, 7941), (0.999027, 2609), (0.999211, 5695), (0.999993, 4315)]])

julia> energy_levels = sort!([(randn(), i) for i=1:N]);
julia> discretize(energy_levels, 1.0)
(0.00019930580942706388, Array{Tuple{Float64,Int64},1}[[(-3.65232, 7208), (-3.49434, 2538), (-3.27886, 9182), (-3.24522, 2502), (-3...

You see that the mingap is large in the first case (because your distribution really is just two values with some small error) and small in the second case (assumptions for discretize to make sense are violated). Use mingap > 2*epsilon as a reality-check.

If you don’t know the error scale a priori, then you need to do some kind of hierarchical clustering in order to discover the error scale.

If your numerical errors are on the same scale as the spectral gaps, then you are SOL and need better computations. If you have a continuous spectrum, then you are SOL, period. If eigenvalues accumulate somewhere (a typical situation for compact operators like the inverse Laplacian) and you know the accumulation points (0 for compact operators), then you should special-case the accumulation points.

edit: bugfix

#7

I would normally use sigdigits=n instead of digits. Using digits is like an absolute tolerance, and is sensitive to the overall scaling of the data, whereas sigdigits is a scale-invariant relative tolerance.
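A quick sketch of the difference, using values like the ones in the original post (the 1e6 rescaling is just for illustration):

```julia
E = [-2.9999999999, -2.9999999998]
length(unique(round.(E, digits=5)))            # 1 — both round to -3.0
length(unique(round.(1e6 .* E, digits=5)))     # 2 — absolute tolerance breaks after rescaling
length(unique(round.(1e6 .* E, sigdigits=5)))  # 1 — relative tolerance is scale-invariant
```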

2 Likes
#8

Oh, thanks for the replies and the different proposals… I will try them at once 🙂
Best,
Ferran.

#9

The problem with trunc and round is that they cut in a fixed direction relative to the bucket boundaries.
For example, if you truncate 2.00 and 2.03 to one decimal place you get 2.0 for both, so they compare equal. But 2.49 and 2.51 become 2.4 and 2.5, and will be different, even though they are closer together.

One solution could be to sort all the numbers, compute their consecutive differences, and then decide from those gaps how to group the values…
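In code (a minimal illustration of the directionality problem):

```julia
# Values 0.03 apart that truncate to the same bucket:
trunc(2.00, digits=1), trunc(2.03, digits=1)  # (2.0, 2.0)
# Values only 0.02 apart that truncate to different buckets:
trunc(2.49, digits=1), trunc(2.51, digits=1)  # (2.4, 2.5)
```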

#10

I would do something roughly like this:

  1. sort and then diff all the values,
  2. drop the zeros from the differences, establish a cutoff \Delta from the rest (eg using a quantile, I would plot first),
  3. consecutive sorted values within a distance of \Delta go in the same bin, with a sanity check that the bins don’t get too wide (e.g. values chaining together as 1+\epsilon, 1+2\epsilon, …), though the way you describe the problem this should not happen.
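The steps above can be sketched as follows (assuming scalar energies; `bin_by_gaps` and the hard-coded cutoff are illustrative, not from the thread — \Delta would come from inspecting the gap distribution in step 2):

```julia
# Sketch of the sort → diff → bin approach. `cutoff` plays the role of Δ.
function bin_by_gaps(E::AbstractVector{<:Real}, cutoff::Real)
    s = sort(E)
    bins = [[s[1]]]
    for x in s[2:end]
        # a gap larger than cutoff starts a new bin
        if x - last(last(bins)) > cutoff
            push!(bins, [x])
        else
            push!(last(bins), x)
        end
    end
    return bins
end

E = [-2.9999999999, -1.0, -2.9999999998, -0.9999999997]
bins = bin_by_gaps(E, 1e-6)
length(bins)  # 2 — one bin near -3, one near -1
```

Per point 3, a sanity check on each bin’s width (maximum minus minimum) guards against bins chaining together.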
#11

That’s exactly what my proposal does.

The general approach (higher-dimensional “energy” function, unknown error scale) is to do some kind of hierarchical clustering / dendrogram (preferably using something like k-NN trees), spot the gap, and use the clustering at that level. This is complicated and requires human input at some point (alternatively, it requires assumptions on the errors and the distribution that you lack).