[ANN] RuleMiner.jl - Efficient data mining in Julia

RuleMiner.jl - Data Mining in Julia

Links: GitHub || Docs

RuleMiner is a Julia package for data mining inspired by the arules R package and SPMF Java library.

Data mining is a domain of computer science dedicated to efficiently searching for patterns in datasets. Common data mining techniques include association rule mining, itemset mining, and sequence mining among others. These techniques have applications across diverse disciplines including market basket analysis, computational biology, web traffic analysis, epidemiology, and more.

Design Goals

  1. Provide an ergonomic API that can be used to easily import data, export results, and apply data-mining techniques.

One major headache with mining tools in other languages is their unfriendly output formats, which either use custom data structures or print directly to files. Thus, RuleMiner was designed to output directly into a DataFrames.jl DataFrame for better slicing, sorting, and general portability (see the sketch at the end of this section).

  2. Optimize the functions for fast execution by utilizing Julia’s parallelization capabilities.

Many of the common C and C++ libraries for data mining are only single-threaded. While they are fast, taking advantage of Julia’s mature parallelization options like Base.Threads can improve performance dramatically.
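
For a concrete picture of both goals, a typical session looks roughly like this (a minimal sketch: the argument lists and the :Support column name are illustrative assumptions, not a copy of the released API):

using RuleMiner, DataFrames

# Read a basket-format file (one transaction per line) into a transactions object
txns = load_transactions("retail.txt", ' ')

# Mine association rules at 5% minimum support; the result is an ordinary DataFrame
# (the trailing argument, assumed to be a maximum rule length, is illustrative)
rules = apriori(txns, 0.05, 10)

# Standard DataFrames.jl tooling works directly on the output
# (:Support is an assumed column name)
sort!(rules, :Support, rev=true)
first(rules, 10)

# The mining functions use Base.Threads internally, so start Julia with
# multiple threads (e.g. `julia -t auto`) to benefit from the parallelism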

Future Plans

Future work will include additional algorithms and optimizations. Check the project’s GitHub page for the current roadmap.

14 Likes

RuleMiner.jl 0.1.0 Release Notes

This inaugural release includes:

  • Loading data into Transactions objects with load_transactions
  • The A Priori algorithm for association rule mining
  • The ECLAT algorithm for frequent itemset mining

RuleMiner.jl 0.2.0 Release Notes

  • Added FP-Growth algorithm for frequent itemset mining
  • Implemented performance improvements to eclat and apriori
  • Added function to convert DataFrames to Transactions

RuleMiner.jl 0.3.0 Release Notes

  • Added support for four closed itemset mining algorithms:
    • FPClose
    • CHARM
    • LCM
    • CARPENTER
  • Added RecoverClosed algorithm to recover frequent itemsets from closed itemsets (see the sketch after this list)
  • Moved from the @threads macro to @sync/@spawn blocks in ECLAT, Apriori, and FP tree creation to better balance the often asymmetric workload across processing cores
  • Added an additional dependency on Combinatorics.jl
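
A rough illustration of how these pieces fit together (a hedged sketch: the recovery function name and the argument lists are assumptions based on the algorithm names above, not verified signatures):

using RuleMiner

txns = load_transactions("retail.txt", ' ')

# Mine closed itemsets at 5% minimum support (charm, LCM, and carpenter work similarly)
closed = fpclose(txns, 0.05)

# Recover the full set of frequent itemsets from the closed ones
# (function name and second argument are assumed)
frequent = recover_closed(closed, 0.05)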

RuleMiner.jl 0.4.0 Release Notes

  • Added support for two maximal itemset mining algorithms:
    • FPMax
    • GenMax
  • Reworked internal FP Tree construction to use atomics instead of a lock-based multithreading approach, leading to significant speed gains and lower memory usage for all FP algorithms.
  • Reworked load_transactions to no longer rely on LuxurySparse.jl for COO-CSC sparse conversion and removed that dependency
  • Moved the DataFrame-to-Transactions conversion into the Transactions struct as a constructor.
  • Completely overhauled documentation and fully documented all public functions and structs.
2 Likes

RuleMiner.jl 0.5.0 Release Notes

  • Minimum supported version of Julia is now 1.10
  • New direct dependency on PrettyTables.jl (though it was already an indirect dependency)
  • Reworked Transactions objects
    • Renamed to Txns. Transactions is now an abstract type that will encompass other transaction-type objects in the future.
    • Moved load_transactions to just being a constructor for the struct
    • Reading files is now ~45% faster and uses ~70% less memory
    • Added validation checks to the constructors with meaningful error messages
    • Added pretty printing through a Base.show method
    • Added other utility methods, including for:
      • Base.getindex
      • Base.first
      • Base.last
      • Base.length
  • Exposed and documented FPTree objects for the public API
    • FPTree objects can now be created from Txns and assigned to variables outside of FP functions, allowing for re-use.
    • All FP functions (fpgrowth, fpclose, and fpmax) now accept FPTree objects in the data argument, which significantly speeds up execution when a pre-made tree is provided (see the sketch after this list)
    • Added pretty printing through a Base.show method
  • Added recover_maximal function to recover frequent itemsets from maximal itemsets
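
A rough sketch of the tree re-use pattern (the FPTree constructor arguments shown here are an assumption; check the docs for the exact signature):

using RuleMiner

txns = Txns("retail.txt", ' ')

# Build the tree once; the constructor is assumed to take the transactions and a minimum support
tree = FPTree(txns, 0.05)

# Reuse the same tree across the FP family of miners
frequent = fpgrowth(tree, 0.05)
closed   = fpclose(tree, 0.05)
maximal  = fpmax(tree, 0.05)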

Links: GitHub || Docs

10 Likes

Great work, thanks!

1 Like

Thanks a lot for your contribution! It would be interesting to compare the speed with the implementations from Christian Borgelt's Web Pages; did you benchmark against them?

1 Like

I’m definitely familiar with Dr. Borgelt’s work, though I hadn’t done any testing with the FIM python interface before today.

Here are some rough benchmarks for file IO, A Priori, and FP-Growth on the non-trivial retail dataset, with the caveat that the exact numbers will vary from system to system and dataset to dataset:

File IO

Python: avg 52.951 ms across 100 iterations

with open('retail.txt') as file:
    dataset = file.read()
    dataset = dataset.split('\n')
    dataset = [i.split(' ') for i in dataset]

RuleMiner.jl: avg 76.380 ms across 100 iterations

dataset = Txns("retail.txt",' ')

A Priori

pyfim: avg 827.783 ms across 100 iterations

arules(dataset, supp=5, conf=0, report='asC')

RuleMiner.jl: avg 9.284 ms across 100 iterations

apriori(dataset, 0.05, 100)

FP-Growth

pyfim: avg 54.353 ms across 100 iterations

fpgrowth(dataset, supp=5, conf=0, report='asC')

RuleMiner.jl: avg 2.682 ms across 100 iterations

fpgrowth(dataset, 0.05)
5 Likes

Out of curiosity, how did you implement file IO in RuleMiner.jl? Also, this would be super nice to have in your docs! It's always nice to show performance metrics! Congrats on mostly beating the state-of-the-art implementations – did you use BenchmarkTools.jl to benchmark?

1 Like

Thanks! I’ve thought about doing more benchmarking with other software, but I’ve been so focused on polishing my own code, I haven’t gotten around to it.

IO for RuleMiner is a lot more complicated than just reading the lines as vectors, which is why it’s slower than the naive python implementation above. The reading process converts transaction items to a 1-hot encoded sparse matrix which makes calculating item support in the mining algorithms much, much faster. You end up trading a little time on load for a massive speed advantage in the algorithms.

The performance bottleneck in file IO at this point is the need to assign distinct, dense, numeric item IDs for every item (forming the column indexes in the matrix). If anyone has any ideas on how to do that more efficiently than stuffing elements into a Dict, I’d love to hear it!
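
For context, that ID-assignment step is conceptually something like this (a simplified sketch of the general approach, not RuleMiner's actual internals):

# Assign a dense column index to each distinct item as the file is read
item_ids = Dict{String,Int}()
rows = Int[]   # transaction (row) indices
cols = Int[]   # item (column) indices

for (row, line) in enumerate(eachline("retail.txt"))
    for item in eachsplit(line, ' '; keepempty=false)
        col = get!(item_ids, String(item), length(item_ids) + 1)
        push!(rows, row)
        push!(cols, col)
    end
end

# The (rows, cols) pairs then form the nonzero pattern of the sparse matrix, e.g.
# sparse(rows, cols, trues(length(rows)), maximum(rows), length(item_ids))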

And yes, I did! I used BenchmarkTools.jl for the Julia benchmarks above and the timeit library for the Python ones.

1 Like

The timings you report are very impressive; I will double-check to make sure the benchmarks are correct. Would you like a PR with a couple of files to add benchmarks? I think it would attract more people to the project.

Currently I use Borgelt's work to extract rules at a FAANG company (we have a ton of data, and speed really matters). Speed and ease of use are paramount. I love Borgelt's code because of the API that uses strings to specify which metrics you want to compute. It is quite old-school but very versatile.

If your project is faster (and uses a similar amount of memory, which I still need to test), it definitely catches my attention, and I would love to lend a hand.

There are a couple of things I miss in rule mining software:

  • (I) An easy-to-use out-of-core capability leveraging some sort of memory-mapped data.
  • (II) Easy preprocessing/compression to build rules with integers. This would be especially useful when the raw data contains strings (to use less memory).
1 Like

That’s really cool! My design so far has been geared towards a more exploratory and experimental desktop/workstation use case, essentially a drop-in replacement for the R arules package. If you are looking for faster association rule mining, the SPMF Java library is even faster than my implementations.

As far as benchmarking, I think it’s a little early in the lifespan of this project to join a benchmark race against other software. I’d be interested in any results you come up with, but I’m not interested in maintaining a bunch of benchmarks while the package is still under heavy development.

I can also tell you without even testing it that RuleMiner’s RAM usage is almost certainly higher than Dr. Borgelt’s implementations because of my underlying matrix structure and multithreading implementations. There’s still lots of optimization work to do.

Hopefully this didn’t throw too much cold water on your enthusiasm for RuleMiner! This is still such a new project, and I want to manage expectations. Under-promise and over-deliver, as they say!

8 Likes

RuleMiner.jl 0.6.0 Release Notes

  • Txns
    • Fixed unbalanced center truncation of the Base.show output
    • Fixed Base.first and Base.last methods to return a single vector of items when no n argument is supplied
    • Added Base.lastindex method, so slicing with [end] is now supported
  • FPTree
    • Added truncation for large trees in Base.show output
  • apriori
    • Added support for confidence filtering within the function
    • New min_confidence argument in the function signature (see the sketch below)
    • Major optimizations:
      • ~30% faster execution time
      • ~80% reduction in memory usage
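
A hedged example of the new filtering (it assumes min_confidence is passed after min_support as a fraction; check the docs for the exact signature):

using RuleMiner

txns = Txns("retail.txt", ' ')

# Keep only rules with at least 5% support and 50% confidence (argument positions assumed)
rules = apriori(txns, 0.05, 0.5)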

Links: GitHub || Docs

5 Likes

Performance Patches (0.6.1 and 0.6.2)

I recently spent time optimizing the mining functions to use less RAM and pushed out two patch releases in the last couple days. I introduced a new matrix pruning function that optimizes the matrix for mining and converts the frequent submatrix to a BitMatrix for extremely fast computation. These changes affected all of the algorithms that use the matrix format for mining (i.e. not FP-Tree mining).
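
Conceptually, the pruning step does something like the following (a simplified sketch, not the package's actual prune_matrix internals):

using SparseArrays

# Keep only the columns (items) that meet minimum support, then densify the
# surviving submatrix into a BitMatrix for fast support counting
function prune_sketch(mat::SparseMatrixCSC{Bool}, min_support::Int)
    supports = vec(sum(mat; dims=1))
    frequent = findall(>=(min_support), supports)
    return BitMatrix(mat[:, frequent]), frequent
end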

All of the functions now use considerably less RAM, and all but one of them saw performance boosts. Here are some stats relative to the 0.6.0 release:

  • apriori
    • ~34% less RAM
    • ~3X faster execution
  • charm
    • ~99% less RAM
    • ~16X faster execution
  • LCM
    • ~99% less RAM
    • ~74X faster execution
  • carpenter
    • ~99% less RAM
    • ~4,625X faster execution (not a typo; it was extremely slow before)
  • genmax
    • ~92% less RAM
    • ~1.7X faster execution

Now, the regression:

  • eclat
    • ~92% less RAM
    • ~2.5X slower execution

Eclat was already exceedingly fast, but the new matrix pruning function takes longer to run than the entire previous eclat implementation did, so introducing the pruning step increased the total time significantly. That said, RAM usage is still dramatically lower than before, and eclat remains the fastest option for mining frequent itemsets in the package.

I’m planning to keep this architecture and focus on improving the matrix pruning subroutine to run faster, hopefully closing the regression gap on eclat and increasing performance of the other algorithms too. Right now, the largest bottleneck for the subroutine is the conversion from the SparseMatrixCSC storage format to the BitMatrix format. If anyone has any ideas on how to speed that up, I’d love to hear them!
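
For reference, a straightforward version of that conversion walks only the stored entries (a generic sketch, not the package's fast_convert function):

using SparseArrays

# Convert a sparse 0/1 matrix to a BitMatrix by setting only the stored entries
function sparse_to_bitmatrix(A::SparseMatrixCSC)
    B = falses(size(A))
    rv = rowvals(A)
    for j in 1:size(A, 2), k in nzrange(A, j)
        B[rv[k], j] = true
    end
    return B
end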

4 Likes

Amazing, Jared.

In prune_matrix you can leverage the fact that the SparseMatrixCSC you are using only contains ones.

Instead of this

 sum(X_csc; dims=1)

You can look at the column pointers of the array and subtract consecutive values. This gives the number of nonzero values in each column (which in your case equals the sum of the elements in the column, because they are all 1s).

For example, consider the following X_csc:

X_csc = sparse([0 1 0; 1 0 1; 0 1 1; 1 1 0; 1 1 1; 1 1 0])
6×3 SparseMatrixCSC{Int64, Int64} with 12 stored entries:
 ⋅  1  ⋅
 1  ⋅  1
 ⋅  1  1
 1  1  ⋅
 1  1  1
 1  1  ⋅

This has

X_csc.colptr
4-element Vector{Int64}:
  1
  5
 10
 13

Since your matrix has only ones, 5 - 1 = 4, 10 - 5 = 5, and 13 - 10 = 3 give the sums over the columns, without looking at the values.

This is the implementation

function get_supports_from_transaction_csc(X_csc::SparseMatrixCSC)
    
    n_result = size(X_csc, 2)
    vec_result = Vector{Int64}(undef, n_result)
    
    previous_elem = X_csc.colptr[1]
    for (i,x) in enumerate(X_csc.colptr[2:end])
        vec_result[i] = x-previous_elem
        previous_elem = x
    end
    return vec_result
end

Benchmark


using SparseArrays, BenchmarkTools

# Create a 100000 x 100 sparse matrix with a density of 0.01 (1% of elements are non-zero)
m = 100000
n = 100
density = 0.01

A_csc = sprand(m, n, density)
A_csc.nzval .= 1;
@btime get_supports_from_transaction_csc(A_csc)
  132.106 ns (2 allocations: 1.75 KiB)
@btime sum(A_csc; dims=1)
  18.208 μs (4 allocations: 1.02 KiB)

Option 1) Finding sorted items without storing all supports

You can use the idea above to change this

supports = sum(matrix, dims=1)
sorted_items = [i for i in axes(matrix,2) if supports[1,i] >= min_support]

To this

function get_sorted_items(X_csc::SparseMatrixCSC, min_support)
    
    sorted_items = Vector{Int64}()
    supports = Vector{Int64}()

    previous_elem = X_csc.colptr[1]
    for (i,x) in enumerate(X_csc.colptr[2:end])
        current_support = x-previous_elem 
        previous_elem = x
        if current_support >= min_support
            push!(sorted_items, i)
            push!(supports, current_support)
        end
        
    end
    return sorted_items, supports
end

Option 2) Finding sorted items storing all supports

function get_sorted_items2(X_csc::SparseMatrixCSC, min_support)
    
    supports  = get_supports_from_transaction_csc(X_csc)
    sorted_items = [i for i in axes(X_csc,2) if supports[i] >= min_support]

    return sorted_items, supports
end

It seems a bit faster to compute but at the expense of storing all supports.

Benchmark

Current version

function find_sorted_items_current(matrix::SparseMatrixCSC, min_support)
    supports = sum(matrix, dims=1)
    sorted_items = [i for i in axes(matrix,2) if supports[1,i] >= min_support]
    return sorted_items, supports
end
@btime find_sorted_items_current(A_csc, 4)
  18.833 μs (10 allocations: 3.03 KiB)

Options proposed

@btime get_sorted_items(A_csc, 4)
  950.818 ns (10 allocations: 4.75 KiB)
@btime get_sorted_items2(A_csc, 4)
  645.078 ns (8 allocations: 3.77 KiB)

If you want, I can do a PR during the week, but feel free to make the change yourself if this makes sense to you.

6 Likes

This is great stuff!

I’m definitely interested in getting the items without having to store all of the supports. Looking at your get_sorted_items function, it seems like we could optimize this even further by using a single Vector{Tuple{Int64,Int64}} instead of two Vector{Int64}s. This halves the number of allocations because we only have to push! once on each iteration.

using SparseArrays, BenchmarkTools

m = 100000
n = 100
density = 0.01

mat = sprand(Bool, m, n, density)

function get_sorted_items_current(matrix::SparseMatrixCSC, min_support)
    supports = sum(matrix, dims=1)
    sorted_items = [i for i in axes(matrix,2) if supports[1,i] >= min_support]
    return sorted_items, supports
end

function get_sorted_items(X_csc::SparseMatrixCSC, min_support)
    
    sorted_items = Vector{Int64}()
    supports = Vector{Int64}()

    previous_elem = X_csc.colptr[1]
    for (i,x) in enumerate(X_csc.colptr[2:end])
        current_support = x-previous_elem 
        previous_elem = x
        if current_support >= min_support
            push!(sorted_items, i)
            push!(supports, current_support)
        end
    end
    return sorted_items, supports
end

function get_sorted_items_improved(X_csc::SparseMatrixCSC, min_support)
    result = Vector{Tuple{Int64,Int64}}()
    previous_elem = X_csc.colptr[1]
    
    for (i,x) in enumerate(X_csc.colptr[2:end])
        current_support = x - previous_elem
        previous_elem = x
        if current_support >= min_support
            push!(result, (i, current_support))
        end
    end
    
    return result
end

Results:

@btime get_sorted_items_current(mat,500);
  8.625 μs (11 allocations: 3.08 KiB)
@btime get_sorted_items(mat,500);
  914.361 ns (10 allocations: 4.75 KiB)
@btime get_sorted_items_improved(mat,500);
  550.134 ns (5 allocations: 4.61 KiB)

Definitely going to integrate this into RuleMiner!

Be warned that the difference may not be super obvious in the final algorithms, because the internal fast_convert function is still orders of magnitude slower than these support-counting operations.

2 Likes

Nice, thanks for sharing; I will check the fast_convert method. I think sharing an MWE of micro-benchmarks for functions like fast_convert can drastically lower the barrier to getting help. But to do that, I guess one also needs to do execution profiling to understand the percentage of time spent in each of the function calls of a rule-mining algorithm to see what needs to be improved.

2 Likes

For sure, I’m going to keep working on that fast_convert function with additional profiling and optimization.

I hadn’t posted a reprex here mostly because I want to avoid turning this thread into a help thread.

I just opened another thread where we can talk about the fast_convert function and ways to improve it.

1 Like