[ANN] Breakers.jl

technocrat · April 9, 2025, 4:38am

Breakers.jl is a pure Julia partial implementation of the R classInt library, that handles fisher, kmeans, quantile and equal breaks of a vector (potentially containing missing values). This aids in binning continuous data to allow use with discrete color values in thematic mapping. It has been tested using a real-world dataset of the populations of the 3,235 county populations of the US and overseas possessions and produces identical results to those of classInt.

JADekker · April 9, 2025, 6:28am

It would be fun to have some benchmarks between the R implementation and your implementation! Just some size vs compute time plots?

technocrat · April 9, 2025, 8:33am

For my use case, the relatively poor performance compared to R isn’t a concern—if I were dealing with more than a few thousand items, however, I’d have to do something like they do in Fortran

C      SUBROUTINE FISH(M, X, VLAB, RLAB, TITLE, K, DMWORK, WORK, DMIWRK,
C     *                IWORK, OUNIT)
      SUBROUTINE FISH(M, X, K, DMWORK, WORK, DMIWRK, IWORK, LLOUT)
C
C<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
C
C   PURPOSE
C   -------
C
C      CLUSTERS A SEQUENCE OF CASES INTO SUBSEQUENCES BY FISHER'S
C      METHOD OF EXACT OPTIMIZATION
C
C   DESCRIPTION
C   -----------
C
C   1.  THE "EXACT OPTIMIZATION" METHOD OF W.  D.  FISHER MAXIMIZES THE
C       BETWEEN-CLUSTER SUM OF SQUARES.  NOTE THAT THE PARTITION IS
C       GUARANTEED OPTIMAL BUT NOT UNIQUE.
C
C   2.  IF A PARTITION INTO K CLUSTERS IS REQUESTED, OPTIMAL PARTITIONS
C       INTO K-1, K-2, ..., 2, 1 CLUSTERS ARE ALSO FOUND AND INCLUDED
C       IN THE OUTPUT.
C
C   3.  THE OUTPUT IS WRITTEN ON FORTRAN UNIT OUNIT AND CONSISTS OF THE
C       VECTOR OF CASE LABELS AND THE VECTOR OF THE OBSERVATIONS.  THEN
C       THE OPTIMAL PARTITIONS INTO K, K-1, ..., 2, 1 SUBSETS WITH
C       SUMMARY STATISTICS ARE PRINTED.  THEY INCLUDE THE MEAN AND
C       STANDARD DEVIATION OF THE OBSER- VATIONS FOR EACH CLUSTER FOR
C       EACH PARTIION.  THE MEMBERS OF THE FIRST CLUSTER FOR ANY
C       PARTITION BEGIN AT THE TOP OF THE VECTOR OF LABELS AND CONTINUE
C       FOR THE NUMBER IN THE CLUSTER.
C
C   INPUT PARAMETERS
C   ----------------
C  R1MACH(2) = B**EMAX*(1 - B**(-T)), the largest magnitude.
C
C   M     INTEGER SCALAR (UNCHANGED ON OUTPUT).
C         THE NUMBER OF CASES.
C
C   X     REAL VECTOR DIMENSIONED AT LEAST M (UNCHANGED ON OUTPUT)
C         OBSERVED VALUES.
C
C   K     INTEGER SCALAR (UNCHANGED ON OUTPUT).
C         THE NUMBER OF CLUSTER SUBSEQUENCES REQUESTED.
C
C   VLAB  4-CHARACTER VARIABLE (UNCHANGED ON OUTPUT).
C         THE LABEL OF THE VARIABLE.
C
C   RLAB  VECTOR OF 4-CHARACTER VARIABLES DIMENSIONED AT LEAST M.
C            (UNCHANGED ON OUTPUT).
C         THE LABELS OF THE CASES.
C
C   TITLE 10-CHARACTER VARIABLE (UNCHANGED ON OUTPUT).
C         TITLE OF THE DATA SET.
C
C   DMWORK INTEGER SCALAR (UNCHANGED ON OUTPUT).
C         THE LEADING DIMENSION OF THE MATRIX WORK.  MUST BE AT LEAST M.
C
C   WORK  REAL MATRIX WHOSE FIRST DIMENSION MUST BE DMWORK AND SECOND
C            DIMENSION MUST BE AT LEAST K.
C         WORK MATRIX.
C
C   DMIWRK INTEGER SCALAR (UNCHANGED ON OUTPUT).
C         THE LEADING DIMENSION OF THE MATRIX IWORK. MUST BE AT LEAST M.
C
C   IWORK INTEGER MATRIX WHOSE FIRST DIMENSION MUST BE DMIWRK AND SECOND
C            DIMENSION MUST BE AT LEAST K.
C         WORK MATRIX.
C
C   OUNIT INTEGER SCALAR (UNCHANGED ON OUTPUT).
C         UNIT NUMBER FOR OUTPUT.
C
C   REFERENCES
C   ----------
C
C     FISHER, W. D. (1958).  "ON GROUPING FOR MAXIMAL HOMOGENEITY,"
C     JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 53, 789-798.
C
C     HARTIGAN, J. A. (1975).  CLUSTERING ALGORITHMS, JOHN WILEY &
C        SONS, INC., NEW YORK.  PAGES 130-142.
C
C<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
C
C      INTEGER DMWORK, DMIWRK, OUNIT
      IMPLICIT LOGICAL(A-Z)
      INTEGER DMWORK, DMIWRK, IWORK
      INTEGER I, J, K, M, II, III, IK, JJ, L, LL, IL, IU
      DOUBLE PRECISION X, WORK, LLOUT
      DIMENSION X(*), WORK(DMWORK,*), IWORK(DMIWRK,*), LLOUT(K,*)
C      CHARACTER*4 VLAB, RLAB(*)
C      CHARACTER*10 TITLE
      DOUBLE PRECISION R1MACH2, SS, S, SN, VAR, AMINL, AMAXL
C
C     INITIALIZE AND OUTPUT DATA
C
      R1MACH2=10.E30
      DO 10 J=1,K
         IWORK(1,J)=1
         WORK(1,J)=0.
         DO 11 I=1,M
            WORK(I,J)=R1MACH2
C   10       WORK(I,J)=R1MACH(2)
   11    CONTINUE
   10 CONTINUE
C      IF (OUNIT .GT. 0) THEN
C         WRITE(OUNIT,1)
C    1    FORMAT('1')
C         CALL OUT(1,M,1,X,VLAB,RLAB,TITLE,OUNIT)
C      ENDIF
C
C     COMPUTE WORK AND IWORK ITERATIVELY
C
      DO 40 I=1,M
         SS=0.
         S=0.
         DO 30 II=1,I
            III=I-II+1
            SS=SS+X(III)**2
            S=S+X(III)
            SN=II
            VAR=SS-S**2/SN
            IK=III-1
            IF (IK.NE.0) THEN
               DO 20 J=2,K
                  IF (WORK(I,J).GE.VAR+WORK(IK,J-1))THEN
                     IWORK(I,J)=III
                     WORK(I,J)=VAR+WORK(IK,J-1)
                  ENDIF
   20          CONTINUE
            ENDIF
   30    CONTINUE
         WORK(I,1)=VAR
         IWORK(I,1)=1
   40 CONTINUE
C
C     PRINT RESULTS
C
C      IF (OUNIT .GT. 0) CALL PFISH(M, X, K, DMWORK, WORK, DMIWRK,
C     *                             IWORK, OUNIT)

C      DO 130 J=1,K
         J=1
         JJ=K-J+1
         IL=M+1
         DO 120 L=1,JJ
            LL=JJ-L+1
            AMINL=R1MACH2
            AMAXL=-R1MACH2
            S=0.
            SS=0.
            IU=IL-1
            IL=IWORK(IU,LL)
            DO 110 II=IL,IU
               IF(X(II).GE.AMAXL) AMAXL=X(II)
               IF(X(II).LE.AMINL) AMINL=X(II)
               S=S+X(II)
               SS=SS+X(II)**2
  110       CONTINUE
            SN=IU-IL+1
            S=S/SN
            SS=SS/SN-S**2
            SS=SQRT(ABS(SS))
            LLOUT(L,1)=AMINL
            LLOUT(L,2)=AMAXL
            LLOUT(L,3)=S
            LLOUT(L,4)=SS
C            WRITE(OUNIT,4) LL,SN,S,SS
C    4       FORMAT(I5,5X,3F10.4)
  120    CONTINUE
C  130 CONTINUE



      RETURN
      END

JADekker · April 9, 2025, 8:42am

I’m not sure if I understand, you have a roughly 10x speed-up, right? (3.5 ms min in julia, 40.9 ms min in R?)

technocrat · April 9, 2025, 10:01am

But it’s only an order of magnitude.

JADekker · April 9, 2025, 10:15am

I’d take an order of magnitude improvement any day, but perhaps there are some optimizations to be found? I took a quick glance, you can perhaps try to disable bounds-checking, do all the sorting once (if possible) and perhaps use some multi-threading somewhere? Also, perhaps it is more efficient to store things in structs rather than Dicts?

technocrat · April 9, 2025, 9:15pm

My attempt at a threaded version was an abysmal failure—much slower than single threaded.

JADekker · April 10, 2025, 8:35am

Yeah that happens to me very often as well, it is important to check at what level you are threading, so any gains are not dominated by overhead.

I just took a glance at cut_data.jl and can cut of another order of magnitude by first computing the necessary strings:

using Random, BenchmarkTools
Random.seed!(42)

function cut_data(x::Vector{T}, breaks::AbstractVector{<:Real}) where T <: Union{Missing, Real}
    # Convert breaks to a simple Float64 vector to ensure consistent processing
    # Always collect to handle SubArrays
    breaks_float = collect(Float64.(collect(breaks)))
    n = length(breaks_float)
    result = similar(x, String)
    
    if n < 2
        error("At least 2 break points are required")
    end
    
    for i in eachindex(x)
        value = x[i]
        if ismissing(value)
            result[i] = "Missing"
            continue
        end
        
        found = false
        
        # Special case for the minimum value
        if value <= breaks_float[1]
            result[i] = "≤ $(breaks_float[1])"
            found = true
        else
            for j in 1:(n-1)
                # Match values within an interval using strict inequality for upper bound
                # to match R's classInt behavior (values at break points go to higher interval)
                if value > breaks_float[j] && value < breaks_float[j+1]
                    result[i] = "$(breaks_float[j]) - $(breaks_float[j+1])"
                    found = true
                    break
                end
                
                # Special case for values exactly on break points (except minimum)
                # Assign to the higher interval
                if value == breaks_float[j+1] && j < n-1
                    result[i] = "$(breaks_float[j+1]) - $(breaks_float[j+2])"
                    found = true
                    break
                end
            end
            
            # Special case for maximum value and values beyond
            if !found && value >= breaks_float[n-1]
                result[i] = "> $(breaks_float[n-1])"
                found = true
            end
        end
        
        if !found
            result[i] = "Other"
        end
    end
    
    return result
end

function cut_data2(x::Vector{T}, breaks::AbstractVector{<:Real}) where T <: Union{Missing, Real}
    # Convert breaks to a simple Float64 vector to ensure consistent processing
    # Always collect to handle SubArrays
    breaks_float = collect(Float64.(collect(breaks)))
    n = length(breaks_float)
    result = similar(x, String)
    
    if n < 2
        error("At least 2 break points are required")
    end
    string_names = ["$(breaks_float[j]) - $(breaks_float[j+1])" for j in 1:length(breaks_float)-1]
    string_names_begin = "≤ $(breaks_float[1])"
    string_names_end = "> $(breaks_float[n-1])"

    @inbounds for i in eachindex(x) # None of the bounds checking is necessary here, as you loop over `x`, `result` is similar to x. Indexing breaks_float is correct by definition of `n`. 
        value = x[i]
        if ismissing(value)
            result[i] = "Missing"
            continue
        end
        
        found = false 
        
        # Special case for the minimum value
        if value <= breaks_float[1]
            result[i] = string_names_begin
            found = true
        else
            for j in 1:(n-1)
                # Match values within an interval using strict inequality for upper bound
                # to match R's classInt behavior (values at break points go to higher interval)
                if value > breaks_float[j] && value < breaks_float[j+1]
                    result[i] = string_names[j] #"$(breaks_float[j]) - $(breaks_float[j+1])"
                    found = true
                    break
                end
                
                # Special case for values exactly on break points (except minimum)
                # Assign to the higher interval
                if value == breaks_float[j+1] && j < n-1
                    result[i] = string_names[j+1]
                    found = true
                    break
                end
            end
            
            # Special case for maximum value and values beyond
            if !found && value >= breaks_float[n-1]
                result[i] = string_names_end
                found = true
            end
        end
        
        if !found
            result[i] = "Other"
        end
    end
    
    return result
end

function test(N, k)
    x = rand(N)
    k = sort(rand(k))
    x_long = vcat(x,k)
    display(cut_data(x_long, k) == cut_data2(x_long, k)) # Ensure that your breaks are in the data set, so all cases of your function should be reached. 
    display(@benchmark cut_data($x_long, $k))
    display(@benchmark cut_data2($x_long, $k))
end

test(100, 3)

gives

true
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  18.917 μs …  3.993 ms  ┊ GC (min … max): 0.00% … 98.88%
 Time  (median):     19.667 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   21.666 μs ± 50.125 μs  ┊ GC (mean ± σ):  6.65% ±  3.59%

   ▄▇██▇▄▁▂▂▂▁▁▂▂▂▁▁▁                                         ▂
  ▆█████████████████████▇▅▇▆▆▆▅▅▅▇▇▆▇▇▆▆▆▆▆▅▆▅▅▅▄▅▆▆▅▆▅▆▄▄▄▂▃ █
  18.9 μs      Histogram: log(frequency) by time      29.2 μs <

 Memory estimate: 72.67 KiB, allocs estimate: 682.
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.071 μs …  3.071 ms  ┊ GC (min … max):  0.00% … 99.86%
 Time  (median):     1.250 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   1.816 μs ± 31.008 μs  ┊ GC (mean ± σ):  18.04% ±  1.94%

  █▅  ▁                                                       
  ██▆███▇▆▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂ ▃
  1.07 μs        Histogram: frequency by time        3.75 μs <

 Memory estimate: 4.51 KiB, allocs estimate: 35.

JADekker · April 10, 2025, 8:57am

I think a further speed-up may be achieved by looping over breaks on the outside and over x on the inside, as you may then make use of vectorized operations.

Also, the first part of value > breaks_float[j] && value < breaks_float[j+1] is redundant as I believe you analyse the vector in ascending order, but I did not notice a sizeable improvement from removing it here.

JADekker · April 10, 2025, 10:12am

If you’re interested in some more speed-ups (improving over R is of course an important pastime):

using Random, BenchmarkTools
Random.seed!(42)

function equal_breaks(x::AbstractVector{<:Real}, n::Integer)
    # Check for empty vector
    if isempty(x)
        error("Input vector cannot be empty")
    end
    
    # Check for valid number of classes
    if n <= 0
        error("Number of classes must be positive")
    end
    
    # Sort the vector (in case it's not already sorted)
    x_sorted = sort(x)
    
    # Get min and max values
    min_val = x_sorted[1]
    max_val = x_sorted[end]
    
    # Handle case when all values are the same
    if min_val == max_val
        return Float64[min_val, max_val]
    end
    
    # Calculate interval width
    range_val = max_val - min_val
    interval_width = range_val / n
    
    # Generate breaks at equal intervals
    breaks = Float64[min_val + interval_width * i for i in 0:n]
    
    return breaks
end

function equal_breaks2(x::AbstractVector{<:Real}, n::Integer)
    # Check for empty vector
    isempty(x) && error("Input vector cannot be empty")
    
    # Check for valid number of classes
    n <= 0 && error("Number of classes must be positive")
        
    # Get min and max values
    min_val, max_val = extrema(x)
    
    # Handle case when all values are the same
    if min_val == max_val
        return Float64[min_val, max_val]
    end
    
    # Calculate interval width
    range_val = max_val - min_val
    interval_width = range_val / n
    
    # Generate breaks at equal intervals
    breaks = Float64[min_val + interval_width * i for i in 0:n]
    
    return breaks
end

function test(N, k)
    x = rand(N)
    display(equal_breaks(x, k) == equal_breaks2(x, k))
    display(@benchmark equal_breaks($x, $k))
    display(@benchmark equal_breaks2($x, $k))
end

test(1000, 3)

gives

true
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  10.041 μs … 46.625 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     11.250 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.524 μs ±  1.569 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▄▃▁▂▂▂▂▆██▇▆▄▂▂▄▅▄▄▂▁                                      ▂
  ████████████████████████▇▇▇▇▅▅▆▆▆▄▆▇▅▇▅▆▅▆▇▆▅▄▅▅▇▆▄▄▄▅▅▄▄▃▄ █
  10 μs        Histogram: log(frequency) by time      16.8 μs <

 Memory estimate: 18.34 KiB, allocs estimate: 4.
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.554 μs …  16.804 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.595 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.744 μs ± 504.184 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██  ▂  ▃▁  ▁▂                                               ▁
  ███▇██▇██▇▆████▆▇▇▇▇▆▆▇▆▇▇▆▆▇▇▇▇▆▇▆▆▇▆▆▆▅▆▆▆▆▆▆▆▆▆▆▅▅▆▆▄▃▃▂ █
  4.55 μs      Histogram: log(frequency) by time      6.79 μs <

 Memory estimate: 96 bytes, allocs estimate: 1.

technocrat · April 10, 2025, 4:44pm

Thank you! It’s such a treasure to be in a community that helps bring newbies along.

Topic		Replies	Views
Performance optimization of a custom binning function Performance	43	1032	December 15, 2022
With Missings, Julia is slower than R General Usage	30	4244	February 26, 2021
Ironic observation about `sort` and `sortperm` speed for "small integers" vs R Performance sort , sortperm , r	32	4669	February 4, 2018
lhe/Biofast benchmark \| FASTQ parsing [Julia,Nim,Crystal,Python,...] Biology, Health, and Medicine performance , benchmark , community , biology	54	8605	May 25, 2020
C++ code much faster than Julia how can I optimize it? Performance	40	4579	October 10, 2022

[ANN] Breakers.jl

Related topics