[ANN] Breakers.jl

Breakers.jl is a pure Julia partial implementation of the R classInt library, that handles fisher, kmeans, quantile and equal breaks of a vector (potentially containing missing values). This aids in binning continuous data to allow use with discrete color values in thematic mapping. It has been tested using a real-world dataset of the populations of the 3,235 county populations of the US and overseas possessions and produces identical results to those of classInt.

8 Likes

It would be fun to have some benchmarks between the R implementation and your implementation! Just some size vs compute time plots?


For my use case, the relatively poor performance compared to R isn’t a concernβ€”if I were dealing with more than a few thousand items, however, I’d have to do something like they do in Fortran

C      SUBROUTINE FISH(M, X, VLAB, RLAB, TITLE, K, DMWORK, WORK, DMIWRK,
C     *                IWORK, OUNIT)
      SUBROUTINE FISH(M, X, K, DMWORK, WORK, DMIWRK, IWORK, LLOUT)
C
C<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
C
C   PURPOSE
C   -------
C
C      CLUSTERS A SEQUENCE OF CASES INTO SUBSEQUENCES BY FISHER'S
C      METHOD OF EXACT OPTIMIZATION
C
C   DESCRIPTION
C   -----------
C
C   1.  THE "EXACT OPTIMIZATION" METHOD OF W.  D.  FISHER MAXIMIZES THE
C       BETWEEN-CLUSTER SUM OF SQUARES.  NOTE THAT THE PARTITION IS
C       GUARANTEED OPTIMAL BUT NOT UNIQUE.
C
C   2.  IF A PARTITION INTO K CLUSTERS IS REQUESTED, OPTIMAL PARTITIONS
C       INTO K-1, K-2, ..., 2, 1 CLUSTERS ARE ALSO FOUND AND INCLUDED
C       IN THE OUTPUT.
C
C   3.  THE OUTPUT IS WRITTEN ON FORTRAN UNIT OUNIT AND CONSISTS OF THE
C       VECTOR OF CASE LABELS AND THE VECTOR OF THE OBSERVATIONS.  THEN
C       THE OPTIMAL PARTITIONS INTO K, K-1, ..., 2, 1 SUBSETS WITH
C       SUMMARY STATISTICS ARE PRINTED.  THEY INCLUDE THE MEAN AND
C       STANDARD DEVIATION OF THE OBSER- VATIONS FOR EACH CLUSTER FOR
C       EACH PARTIION.  THE MEMBERS OF THE FIRST CLUSTER FOR ANY
C       PARTITION BEGIN AT THE TOP OF THE VECTOR OF LABELS AND CONTINUE
C       FOR THE NUMBER IN THE CLUSTER.
C
C   INPUT PARAMETERS
C   ----------------
C  R1MACH(2) = B**EMAX*(1 - B**(-T)), the largest magnitude.
C
C   M     INTEGER SCALAR (UNCHANGED ON OUTPUT).
C         THE NUMBER OF CASES.
C
C   X     REAL VECTOR DIMENSIONED AT LEAST M (UNCHANGED ON OUTPUT)
C         OBSERVED VALUES.
C
C   K     INTEGER SCALAR (UNCHANGED ON OUTPUT).
C         THE NUMBER OF CLUSTER SUBSEQUENCES REQUESTED.
C
C   VLAB  4-CHARACTER VARIABLE (UNCHANGED ON OUTPUT).
C         THE LABEL OF THE VARIABLE.
C
C   RLAB  VECTOR OF 4-CHARACTER VARIABLES DIMENSIONED AT LEAST M.
C            (UNCHANGED ON OUTPUT).
C         THE LABELS OF THE CASES.
C
C   TITLE 10-CHARACTER VARIABLE (UNCHANGED ON OUTPUT).
C         TITLE OF THE DATA SET.
C
C   DMWORK INTEGER SCALAR (UNCHANGED ON OUTPUT).
C         THE LEADING DIMENSION OF THE MATRIX WORK.  MUST BE AT LEAST M.
C
C   WORK  REAL MATRIX WHOSE FIRST DIMENSION MUST BE DMWORK AND SECOND
C            DIMENSION MUST BE AT LEAST K.
C         WORK MATRIX.
C
C   DMIWRK INTEGER SCALAR (UNCHANGED ON OUTPUT).
C         THE LEADING DIMENSION OF THE MATRIX IWORK. MUST BE AT LEAST M.
C
C   IWORK INTEGER MATRIX WHOSE FIRST DIMENSION MUST BE DMIWRK AND SECOND
C            DIMENSION MUST BE AT LEAST K.
C         WORK MATRIX.
C
C   OUNIT INTEGER SCALAR (UNCHANGED ON OUTPUT).
C         UNIT NUMBER FOR OUTPUT.
C
C   REFERENCES
C   ----------
C
C     FISHER, W. D. (1958).  "ON GROUPING FOR MAXIMAL HOMOGENEITY,"
C     JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 53, 789-798.
C
C     HARTIGAN, J. A. (1975).  CLUSTERING ALGORITHMS, JOHN WILEY &
C        SONS, INC., NEW YORK.  PAGES 130-142.
C
C<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
C
C      INTEGER DMWORK, DMIWRK, OUNIT
      IMPLICIT LOGICAL(A-Z)
      INTEGER DMWORK, DMIWRK, IWORK
      INTEGER I, J, K, M, II, III, IK, JJ, L, LL, IL, IU
      DOUBLE PRECISION X, WORK, LLOUT
      DIMENSION X(*), WORK(DMWORK,*), IWORK(DMIWRK,*), LLOUT(K,*)
C      CHARACTER*4 VLAB, RLAB(*)
C      CHARACTER*10 TITLE
      DOUBLE PRECISION R1MACH2, SS, S, SN, VAR, AMINL, AMAXL
C
C     INITIALIZE AND OUTPUT DATA
C
      R1MACH2=10.E30
      DO 10 J=1,K
         IWORK(1,J)=1
         WORK(1,J)=0.
         DO 11 I=1,M
            WORK(I,J)=R1MACH2
C   10       WORK(I,J)=R1MACH(2)
   11    CONTINUE
   10 CONTINUE
C      IF (OUNIT .GT. 0) THEN
C         WRITE(OUNIT,1)
C    1    FORMAT('1')
C         CALL OUT(1,M,1,X,VLAB,RLAB,TITLE,OUNIT)
C      ENDIF
C
C     COMPUTE WORK AND IWORK ITERATIVELY
C
      DO 40 I=1,M
         SS=0.
         S=0.
         DO 30 II=1,I
            III=I-II+1
            SS=SS+X(III)**2
            S=S+X(III)
            SN=II
            VAR=SS-S**2/SN
            IK=III-1
            IF (IK.NE.0) THEN
               DO 20 J=2,K
                  IF (WORK(I,J).GE.VAR+WORK(IK,J-1))THEN
                     IWORK(I,J)=III
                     WORK(I,J)=VAR+WORK(IK,J-1)
                  ENDIF
   20          CONTINUE
            ENDIF
   30    CONTINUE
         WORK(I,1)=VAR
         IWORK(I,1)=1
   40 CONTINUE
C
C     PRINT RESULTS
C
C      IF (OUNIT .GT. 0) CALL PFISH(M, X, K, DMWORK, WORK, DMIWRK,
C     *                             IWORK, OUNIT)

C      DO 130 J=1,K
         J=1
         JJ=K-J+1
         IL=M+1
         DO 120 L=1,JJ
            LL=JJ-L+1
            AMINL=R1MACH2
            AMAXL=-R1MACH2
            S=0.
            SS=0.
            IU=IL-1
            IL=IWORK(IU,LL)
            DO 110 II=IL,IU
               IF(X(II).GE.AMAXL) AMAXL=X(II)
               IF(X(II).LE.AMINL) AMINL=X(II)
               S=S+X(II)
               SS=SS+X(II)**2
  110       CONTINUE
            SN=IU-IL+1
            S=S/SN
            SS=SS/SN-S**2
            SS=SQRT(ABS(SS))
            LLOUT(L,1)=AMINL
            LLOUT(L,2)=AMAXL
            LLOUT(L,3)=S
            LLOUT(L,4)=SS
C            WRITE(OUNIT,4) LL,SN,S,SS
C    4       FORMAT(I5,5X,3F10.4)
  120    CONTINUE
C  130 CONTINUE



      RETURN
      END

1 Like

I’m not sure if I understand, you have a roughly 10x speed-up, right? (3.5 ms min in julia, 40.9 ms min in R?)

2 Likes

But it’s only an order of magnitude. :sweat_smile:

2 Likes

I’d take an order of magnitude improvement any day, but perhaps there are some optimizations to be found? I took a quick glance, you can perhaps try to disable bounds-checking, do all the sorting once (if possible) and perhaps use some multi-threading somewhere? Also, perhaps it is more efficient to store things in structs rather than Dicts?

4 Likes

My attempt at a threaded version was an abysmal failureβ€”much slower than single threaded.

Yeah that happens to me very often as well, it is important to check at what level you are threading, so any gains are not dominated by overhead.

I just took a glance at cut_data.jl and can cut of another order of magnitude by first computing the necessary strings:

using Random, BenchmarkTools
Random.seed!(42)

function cut_data(x::Vector{T}, breaks::AbstractVector{<:Real}) where T <: Union{Missing, Real}
    # Convert breaks to a simple Float64 vector to ensure consistent processing
    # Always collect to handle SubArrays
    breaks_float = collect(Float64.(collect(breaks)))
    n = length(breaks_float)
    result = similar(x, String)
    
    if n < 2
        error("At least 2 break points are required")
    end
    
    for i in eachindex(x)
        value = x[i]
        if ismissing(value)
            result[i] = "Missing"
            continue
        end
        
        found = false
        
        # Special case for the minimum value
        if value <= breaks_float[1]
            result[i] = "≀ $(breaks_float[1])"
            found = true
        else
            for j in 1:(n-1)
                # Match values within an interval using strict inequality for upper bound
                # to match R's classInt behavior (values at break points go to higher interval)
                if value > breaks_float[j] && value < breaks_float[j+1]
                    result[i] = "$(breaks_float[j]) - $(breaks_float[j+1])"
                    found = true
                    break
                end
                
                # Special case for values exactly on break points (except minimum)
                # Assign to the higher interval
                if value == breaks_float[j+1] && j < n-1
                    result[i] = "$(breaks_float[j+1]) - $(breaks_float[j+2])"
                    found = true
                    break
                end
            end
            
            # Special case for maximum value and values beyond
            if !found && value >= breaks_float[n-1]
                result[i] = "> $(breaks_float[n-1])"
                found = true
            end
        end
        
        if !found
            result[i] = "Other"
        end
    end
    
    return result
end

function cut_data2(x::Vector{T}, breaks::AbstractVector{<:Real}) where T <: Union{Missing, Real}
    # Convert breaks to a simple Float64 vector to ensure consistent processing
    # Always collect to handle SubArrays
    breaks_float = collect(Float64.(collect(breaks)))
    n = length(breaks_float)
    result = similar(x, String)
    
    if n < 2
        error("At least 2 break points are required")
    end
    string_names = ["$(breaks_float[j]) - $(breaks_float[j+1])" for j in 1:length(breaks_float)-1]
    string_names_begin = "≀ $(breaks_float[1])"
    string_names_end = "> $(breaks_float[n-1])"

    @inbounds for i in eachindex(x) # None of the bounds checking is necessary here, as you loop over `x`, `result` is similar to x. Indexing breaks_float is correct by definition of `n`. 
        value = x[i]
        if ismissing(value)
            result[i] = "Missing"
            continue
        end
        
        found = false 
        
        # Special case for the minimum value
        if value <= breaks_float[1]
            result[i] = string_names_begin
            found = true
        else
            for j in 1:(n-1)
                # Match values within an interval using strict inequality for upper bound
                # to match R's classInt behavior (values at break points go to higher interval)
                if value > breaks_float[j] && value < breaks_float[j+1]
                    result[i] = string_names[j] #"$(breaks_float[j]) - $(breaks_float[j+1])"
                    found = true
                    break
                end
                
                # Special case for values exactly on break points (except minimum)
                # Assign to the higher interval
                if value == breaks_float[j+1] && j < n-1
                    result[i] = string_names[j+1]
                    found = true
                    break
                end
            end
            
            # Special case for maximum value and values beyond
            if !found && value >= breaks_float[n-1]
                result[i] = string_names_end
                found = true
            end
        end
        
        if !found
            result[i] = "Other"
        end
    end
    
    return result
end

function test(N, k)
    x = rand(N)
    k = sort(rand(k))
    x_long = vcat(x,k)
    display(cut_data(x_long, k) == cut_data2(x_long, k)) # Ensure that your breaks are in the data set, so all cases of your function should be reached. 
    display(@benchmark cut_data($x_long, $k))
    display(@benchmark cut_data2($x_long, $k))
end

test(100, 3)

gives

true
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  18.917 ΞΌs …  3.993 ms  β”Š GC (min … max): 0.00% … 98.88%
 Time  (median):     19.667 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   21.666 ΞΌs Β± 50.125 ΞΌs  β”Š GC (mean Β± Οƒ):  6.65% Β±  3.59%

   β–„β–‡β–ˆβ–ˆβ–‡β–„β–β–‚β–‚β–‚β–β–β–‚β–‚β–‚β–β–β–                                         β–‚
  β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–…β–‡β–†β–†β–†β–…β–…β–…β–‡β–‡β–†β–‡β–‡β–†β–†β–†β–†β–†β–…β–†β–…β–…β–…β–„β–…β–†β–†β–…β–†β–…β–†β–„β–„β–„β–‚β–ƒ β–ˆ
  18.9 ΞΌs      Histogram: log(frequency) by time      29.2 ΞΌs <

 Memory estimate: 72.67 KiB, allocs estimate: 682.
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.071 ΞΌs …  3.071 ms  β”Š GC (min … max):  0.00% … 99.86%
 Time  (median):     1.250 ΞΌs              β”Š GC (median):     0.00%
 Time  (mean Β± Οƒ):   1.816 ΞΌs Β± 31.008 ΞΌs  β”Š GC (mean Β± Οƒ):  18.04% Β±  1.94%

  β–ˆβ–…  ▁                                                       
  β–ˆβ–ˆβ–†β–ˆβ–ˆβ–ˆβ–‡β–†β–…β–„β–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–‚β–‚β–‚β–‚β–‚ β–ƒ
  1.07 ΞΌs        Histogram: frequency by time        3.75 ΞΌs <

 Memory estimate: 4.51 KiB, allocs estimate: 35.

I think a further speed-up may be achieved by looping over breaks on the outside and over x on the inside, as you may then make use of vectorized operations.

Also, the first part of value > breaks_float[j] && value < breaks_float[j+1] is redundant as I believe you analyse the vector in ascending order, but I did not notice a sizeable improvement from removing it here.

If you’re interested in some more speed-ups (improving over R is of course an important pastime):

using Random, BenchmarkTools
Random.seed!(42)

function equal_breaks(x::AbstractVector{<:Real}, n::Integer)
    # Check for empty vector
    if isempty(x)
        error("Input vector cannot be empty")
    end
    
    # Check for valid number of classes
    if n <= 0
        error("Number of classes must be positive")
    end
    
    # Sort the vector (in case it's not already sorted)
    x_sorted = sort(x)
    
    # Get min and max values
    min_val = x_sorted[1]
    max_val = x_sorted[end]
    
    # Handle case when all values are the same
    if min_val == max_val
        return Float64[min_val, max_val]
    end
    
    # Calculate interval width
    range_val = max_val - min_val
    interval_width = range_val / n
    
    # Generate breaks at equal intervals
    breaks = Float64[min_val + interval_width * i for i in 0:n]
    
    return breaks
end

function equal_breaks2(x::AbstractVector{<:Real}, n::Integer)
    # Check for empty vector
    isempty(x) && error("Input vector cannot be empty")
    
    # Check for valid number of classes
    n <= 0 && error("Number of classes must be positive")
        
    # Get min and max values
    min_val, max_val = extrema(x)
    
    # Handle case when all values are the same
    if min_val == max_val
        return Float64[min_val, max_val]
    end
    
    # Calculate interval width
    range_val = max_val - min_val
    interval_width = range_val / n
    
    # Generate breaks at equal intervals
    breaks = Float64[min_val + interval_width * i for i in 0:n]
    
    return breaks
end

function test(N, k)
    x = rand(N)
    display(equal_breaks(x, k) == equal_breaks2(x, k))
    display(@benchmark equal_breaks($x, $k))
    display(@benchmark equal_breaks2($x, $k))
end

test(1000, 3)

gives

true
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  10.041 ΞΌs … 46.625 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     11.250 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   11.524 ΞΌs Β±  1.569 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–β–„β–ƒβ–β–‚β–‚β–‚β–‚β–†β–ˆβ–ˆβ–‡β–†β–„β–‚β–‚β–„β–…β–„β–„β–‚β–                                      β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–‡β–…β–…β–†β–†β–†β–„β–†β–‡β–…β–‡β–…β–†β–…β–†β–‡β–†β–…β–„β–…β–…β–‡β–†β–„β–„β–„β–…β–…β–„β–„β–ƒβ–„ β–ˆ
  10 ΞΌs        Histogram: log(frequency) by time      16.8 ΞΌs <

 Memory estimate: 18.34 KiB, allocs estimate: 4.
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.554 ΞΌs …  16.804 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     4.595 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   4.744 ΞΌs Β± 504.184 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆβ–ˆ  β–‚  ▃▁  ▁▂                                               ▁
  β–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–‡β–ˆβ–ˆβ–‡β–†β–ˆβ–ˆβ–ˆβ–ˆβ–†β–‡β–‡β–‡β–‡β–†β–†β–‡β–†β–‡β–‡β–†β–†β–‡β–‡β–‡β–‡β–†β–‡β–†β–†β–‡β–†β–†β–†β–…β–†β–†β–†β–†β–†β–†β–†β–†β–†β–†β–…β–…β–†β–†β–„β–ƒβ–ƒβ–‚ β–ˆ
  4.55 ΞΌs      Histogram: log(frequency) by time      6.79 ΞΌs <

 Memory estimate: 96 bytes, allocs estimate: 1.
1 Like

Thank you! It’s such a treasure to be in a community that helps bring newbies along.

4 Likes