Breakers.jl is a pure Julia partial implementation of the R
classInt library, that handles fisher, kmeans, quantile and equal breaks of a vector (potentially containing missing
values). This aids in binning continuous data to allow use with discrete color values in thematic mapping. It has been tested using a real-world dataset of the populations of the 3,235 county populations of the US and overseas possessions and produces identical results to those of classInt.
It would be fun to have some benchmarks between the R implementation and your implementation! Just some size vs compute time plots?
For my use case, the relatively poor performance compared to R isnβt a concernβif I were dealing with more than a few thousand items, however, Iβd have to do something like they do in Fortran
C SUBROUTINE FISH(M, X, VLAB, RLAB, TITLE, K, DMWORK, WORK, DMIWRK,
C * IWORK, OUNIT)
SUBROUTINE FISH(M, X, K, DMWORK, WORK, DMIWRK, IWORK, LLOUT)
C
C<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
C
C PURPOSE
C -------
C
C CLUSTERS A SEQUENCE OF CASES INTO SUBSEQUENCES BY FISHER'S
C METHOD OF EXACT OPTIMIZATION
C
C DESCRIPTION
C -----------
C
C 1. THE "EXACT OPTIMIZATION" METHOD OF W. D. FISHER MAXIMIZES THE
C BETWEEN-CLUSTER SUM OF SQUARES. NOTE THAT THE PARTITION IS
C GUARANTEED OPTIMAL BUT NOT UNIQUE.
C
C 2. IF A PARTITION INTO K CLUSTERS IS REQUESTED, OPTIMAL PARTITIONS
C INTO K-1, K-2, ..., 2, 1 CLUSTERS ARE ALSO FOUND AND INCLUDED
C IN THE OUTPUT.
C
C 3. THE OUTPUT IS WRITTEN ON FORTRAN UNIT OUNIT AND CONSISTS OF THE
C VECTOR OF CASE LABELS AND THE VECTOR OF THE OBSERVATIONS. THEN
C THE OPTIMAL PARTITIONS INTO K, K-1, ..., 2, 1 SUBSETS WITH
C SUMMARY STATISTICS ARE PRINTED. THEY INCLUDE THE MEAN AND
C STANDARD DEVIATION OF THE OBSER- VATIONS FOR EACH CLUSTER FOR
C EACH PARTIION. THE MEMBERS OF THE FIRST CLUSTER FOR ANY
C PARTITION BEGIN AT THE TOP OF THE VECTOR OF LABELS AND CONTINUE
C FOR THE NUMBER IN THE CLUSTER.
C
C INPUT PARAMETERS
C ----------------
C R1MACH(2) = B**EMAX*(1 - B**(-T)), the largest magnitude.
C
C M INTEGER SCALAR (UNCHANGED ON OUTPUT).
C THE NUMBER OF CASES.
C
C X REAL VECTOR DIMENSIONED AT LEAST M (UNCHANGED ON OUTPUT)
C OBSERVED VALUES.
C
C K INTEGER SCALAR (UNCHANGED ON OUTPUT).
C THE NUMBER OF CLUSTER SUBSEQUENCES REQUESTED.
C
C VLAB 4-CHARACTER VARIABLE (UNCHANGED ON OUTPUT).
C THE LABEL OF THE VARIABLE.
C
C RLAB VECTOR OF 4-CHARACTER VARIABLES DIMENSIONED AT LEAST M.
C (UNCHANGED ON OUTPUT).
C THE LABELS OF THE CASES.
C
C TITLE 10-CHARACTER VARIABLE (UNCHANGED ON OUTPUT).
C TITLE OF THE DATA SET.
C
C DMWORK INTEGER SCALAR (UNCHANGED ON OUTPUT).
C THE LEADING DIMENSION OF THE MATRIX WORK. MUST BE AT LEAST M.
C
C WORK REAL MATRIX WHOSE FIRST DIMENSION MUST BE DMWORK AND SECOND
C DIMENSION MUST BE AT LEAST K.
C WORK MATRIX.
C
C DMIWRK INTEGER SCALAR (UNCHANGED ON OUTPUT).
C THE LEADING DIMENSION OF THE MATRIX IWORK. MUST BE AT LEAST M.
C
C IWORK INTEGER MATRIX WHOSE FIRST DIMENSION MUST BE DMIWRK AND SECOND
C DIMENSION MUST BE AT LEAST K.
C WORK MATRIX.
C
C OUNIT INTEGER SCALAR (UNCHANGED ON OUTPUT).
C UNIT NUMBER FOR OUTPUT.
C
C REFERENCES
C ----------
C
C FISHER, W. D. (1958). "ON GROUPING FOR MAXIMAL HOMOGENEITY,"
C JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 53, 789-798.
C
C HARTIGAN, J. A. (1975). CLUSTERING ALGORITHMS, JOHN WILEY &
C SONS, INC., NEW YORK. PAGES 130-142.
C
C<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
C
C INTEGER DMWORK, DMIWRK, OUNIT
IMPLICIT LOGICAL(A-Z)
INTEGER DMWORK, DMIWRK, IWORK
INTEGER I, J, K, M, II, III, IK, JJ, L, LL, IL, IU
DOUBLE PRECISION X, WORK, LLOUT
DIMENSION X(*), WORK(DMWORK,*), IWORK(DMIWRK,*), LLOUT(K,*)
C CHARACTER*4 VLAB, RLAB(*)
C CHARACTER*10 TITLE
DOUBLE PRECISION R1MACH2, SS, S, SN, VAR, AMINL, AMAXL
C
C INITIALIZE AND OUTPUT DATA
C
R1MACH2=10.E30
DO 10 J=1,K
IWORK(1,J)=1
WORK(1,J)=0.
DO 11 I=1,M
WORK(I,J)=R1MACH2
C 10 WORK(I,J)=R1MACH(2)
11 CONTINUE
10 CONTINUE
C IF (OUNIT .GT. 0) THEN
C WRITE(OUNIT,1)
C 1 FORMAT('1')
C CALL OUT(1,M,1,X,VLAB,RLAB,TITLE,OUNIT)
C ENDIF
C
C COMPUTE WORK AND IWORK ITERATIVELY
C
DO 40 I=1,M
SS=0.
S=0.
DO 30 II=1,I
III=I-II+1
SS=SS+X(III)**2
S=S+X(III)
SN=II
VAR=SS-S**2/SN
IK=III-1
IF (IK.NE.0) THEN
DO 20 J=2,K
IF (WORK(I,J).GE.VAR+WORK(IK,J-1))THEN
IWORK(I,J)=III
WORK(I,J)=VAR+WORK(IK,J-1)
ENDIF
20 CONTINUE
ENDIF
30 CONTINUE
WORK(I,1)=VAR
IWORK(I,1)=1
40 CONTINUE
C
C PRINT RESULTS
C
C IF (OUNIT .GT. 0) CALL PFISH(M, X, K, DMWORK, WORK, DMIWRK,
C * IWORK, OUNIT)
C DO 130 J=1,K
J=1
JJ=K-J+1
IL=M+1
DO 120 L=1,JJ
LL=JJ-L+1
AMINL=R1MACH2
AMAXL=-R1MACH2
S=0.
SS=0.
IU=IL-1
IL=IWORK(IU,LL)
DO 110 II=IL,IU
IF(X(II).GE.AMAXL) AMAXL=X(II)
IF(X(II).LE.AMINL) AMINL=X(II)
S=S+X(II)
SS=SS+X(II)**2
110 CONTINUE
SN=IU-IL+1
S=S/SN
SS=SS/SN-S**2
SS=SQRT(ABS(SS))
LLOUT(L,1)=AMINL
LLOUT(L,2)=AMAXL
LLOUT(L,3)=S
LLOUT(L,4)=SS
C WRITE(OUNIT,4) LL,SN,S,SS
C 4 FORMAT(I5,5X,3F10.4)
120 CONTINUE
C 130 CONTINUE
RETURN
END
Iβm not sure if I understand, you have a roughly 10x speed-up, right? (3.5 ms min in julia, 40.9 ms min in R?)
But itβs only an order of magnitude.
Iβd take an order of magnitude improvement any day, but perhaps there are some optimizations to be found? I took a quick glance, you can perhaps try to disable bounds-checking, do all the sorting once (if possible) and perhaps use some multi-threading somewhere? Also, perhaps it is more efficient to store things in struct
s rather than Dict
s?
My attempt at a threaded version was an abysmal failureβmuch slower than single threaded.
Yeah that happens to me very often as well, it is important to check at what level you are threading, so any gains are not dominated by overhead.
I just took a glance at cut_data.jl
and can cut of another order of magnitude by first computing the necessary strings:
using Random, BenchmarkTools
Random.seed!(42)
function cut_data(x::Vector{T}, breaks::AbstractVector{<:Real}) where T <: Union{Missing, Real}
# Convert breaks to a simple Float64 vector to ensure consistent processing
# Always collect to handle SubArrays
breaks_float = collect(Float64.(collect(breaks)))
n = length(breaks_float)
result = similar(x, String)
if n < 2
error("At least 2 break points are required")
end
for i in eachindex(x)
value = x[i]
if ismissing(value)
result[i] = "Missing"
continue
end
found = false
# Special case for the minimum value
if value <= breaks_float[1]
result[i] = "β€ $(breaks_float[1])"
found = true
else
for j in 1:(n-1)
# Match values within an interval using strict inequality for upper bound
# to match R's classInt behavior (values at break points go to higher interval)
if value > breaks_float[j] && value < breaks_float[j+1]
result[i] = "$(breaks_float[j]) - $(breaks_float[j+1])"
found = true
break
end
# Special case for values exactly on break points (except minimum)
# Assign to the higher interval
if value == breaks_float[j+1] && j < n-1
result[i] = "$(breaks_float[j+1]) - $(breaks_float[j+2])"
found = true
break
end
end
# Special case for maximum value and values beyond
if !found && value >= breaks_float[n-1]
result[i] = "> $(breaks_float[n-1])"
found = true
end
end
if !found
result[i] = "Other"
end
end
return result
end
function cut_data2(x::Vector{T}, breaks::AbstractVector{<:Real}) where T <: Union{Missing, Real}
# Convert breaks to a simple Float64 vector to ensure consistent processing
# Always collect to handle SubArrays
breaks_float = collect(Float64.(collect(breaks)))
n = length(breaks_float)
result = similar(x, String)
if n < 2
error("At least 2 break points are required")
end
string_names = ["$(breaks_float[j]) - $(breaks_float[j+1])" for j in 1:length(breaks_float)-1]
string_names_begin = "β€ $(breaks_float[1])"
string_names_end = "> $(breaks_float[n-1])"
@inbounds for i in eachindex(x) # None of the bounds checking is necessary here, as you loop over `x`, `result` is similar to x. Indexing breaks_float is correct by definition of `n`.
value = x[i]
if ismissing(value)
result[i] = "Missing"
continue
end
found = false
# Special case for the minimum value
if value <= breaks_float[1]
result[i] = string_names_begin
found = true
else
for j in 1:(n-1)
# Match values within an interval using strict inequality for upper bound
# to match R's classInt behavior (values at break points go to higher interval)
if value > breaks_float[j] && value < breaks_float[j+1]
result[i] = string_names[j] #"$(breaks_float[j]) - $(breaks_float[j+1])"
found = true
break
end
# Special case for values exactly on break points (except minimum)
# Assign to the higher interval
if value == breaks_float[j+1] && j < n-1
result[i] = string_names[j+1]
found = true
break
end
end
# Special case for maximum value and values beyond
if !found && value >= breaks_float[n-1]
result[i] = string_names_end
found = true
end
end
if !found
result[i] = "Other"
end
end
return result
end
function test(N, k)
x = rand(N)
k = sort(rand(k))
x_long = vcat(x,k)
display(cut_data(x_long, k) == cut_data2(x_long, k)) # Ensure that your breaks are in the data set, so all cases of your function should be reached.
display(@benchmark cut_data($x_long, $k))
display(@benchmark cut_data2($x_long, $k))
end
test(100, 3)
gives
true
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
Range (min β¦ max): 18.917 ΞΌs β¦ 3.993 ms β GC (min β¦ max): 0.00% β¦ 98.88%
Time (median): 19.667 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 21.666 ΞΌs Β± 50.125 ΞΌs β GC (mean Β± Ο): 6.65% Β± 3.59%
ββββββββββββββββββ β
ββββββββββββββββββββββββ
βββββ
β
β
βββββββββββ
ββ
β
β
ββ
βββ
ββ
ββββββ β
18.9 ΞΌs Histogram: log(frequency) by time 29.2 ΞΌs <
Memory estimate: 72.67 KiB, allocs estimate: 682.
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
Range (min β¦ max): 1.071 ΞΌs β¦ 3.071 ms β GC (min β¦ max): 0.00% β¦ 99.86%
Time (median): 1.250 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 1.816 ΞΌs Β± 31.008 ΞΌs β GC (mean Β± Ο): 18.04% Β± 1.94%
ββ
β
βββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββ β
1.07 ΞΌs Histogram: frequency by time 3.75 ΞΌs <
Memory estimate: 4.51 KiB, allocs estimate: 35.
I think a further speed-up may be achieved by looping over breaks on the outside and over x
on the inside, as you may then make use of vectorized operations.
Also, the first part of value > breaks_float[j] && value < breaks_float[j+1]
is redundant as I believe you analyse the vector in ascending order, but I did not notice a sizeable improvement from removing it here.
If youβre interested in some more speed-ups (improving over R is of course an important pastime):
using Random, BenchmarkTools
Random.seed!(42)
function equal_breaks(x::AbstractVector{<:Real}, n::Integer)
# Check for empty vector
if isempty(x)
error("Input vector cannot be empty")
end
# Check for valid number of classes
if n <= 0
error("Number of classes must be positive")
end
# Sort the vector (in case it's not already sorted)
x_sorted = sort(x)
# Get min and max values
min_val = x_sorted[1]
max_val = x_sorted[end]
# Handle case when all values are the same
if min_val == max_val
return Float64[min_val, max_val]
end
# Calculate interval width
range_val = max_val - min_val
interval_width = range_val / n
# Generate breaks at equal intervals
breaks = Float64[min_val + interval_width * i for i in 0:n]
return breaks
end
function equal_breaks2(x::AbstractVector{<:Real}, n::Integer)
# Check for empty vector
isempty(x) && error("Input vector cannot be empty")
# Check for valid number of classes
n <= 0 && error("Number of classes must be positive")
# Get min and max values
min_val, max_val = extrema(x)
# Handle case when all values are the same
if min_val == max_val
return Float64[min_val, max_val]
end
# Calculate interval width
range_val = max_val - min_val
interval_width = range_val / n
# Generate breaks at equal intervals
breaks = Float64[min_val + interval_width * i for i in 0:n]
return breaks
end
function test(N, k)
x = rand(N)
display(equal_breaks(x, k) == equal_breaks2(x, k))
display(@benchmark equal_breaks($x, $k))
display(@benchmark equal_breaks2($x, $k))
end
test(1000, 3)
gives
true
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
Range (min β¦ max): 10.041 ΞΌs β¦ 46.625 ΞΌs β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 11.250 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 11.524 ΞΌs Β± 1.569 ΞΌs β GC (mean Β± Ο): 0.00% Β± 0.00%
ββββββββββββββββββ
ββββ β
βββββββββββββββββββββββββββββ
β
βββββββ
ββ
ββ
ββββ
ββ
β
ββββββ
β
ββββ β
10 ΞΌs Histogram: log(frequency) by time 16.8 ΞΌs <
Memory estimate: 18.34 KiB, allocs estimate: 4.
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
Range (min β¦ max): 4.554 ΞΌs β¦ 16.804 ΞΌs β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 4.595 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 4.744 ΞΌs Β± 504.184 ns β GC (mean Β± Ο): 0.00% Β± 0.00%
ββ β ββ ββ β
βββββββββββββββββββββββββββββββββββββββββ
βββββββββββ
β
ββββββ β
4.55 ΞΌs Histogram: log(frequency) by time 6.79 ΞΌs <
Memory estimate: 96 bytes, allocs estimate: 1.
Thank you! Itβs such a treasure to be in a community that helps bring newbies along.