[ANN] RuleMiner.jl - Efficient data mining in Julia

RuleMiner.jl - Data Mining in Julia

Links: GitHub || Docs

RuleMiner is a Julia package for data mining inspired by the arules R package and SPMF Java library.

Data mining is a domain of computer science dedicated to efficiently searching for patterns in datasets. Common data mining techniques include association rule mining, itemset mining, and sequence mining among others. These techniques have applications across diverse disciplines including market basket analysis, computational biology, web traffic analysis, epidemiology, and more.

Design Goals

  1. Provide an ergonomic API that can be used to easily import data, export results, and perform data-mining techniques.

One major headache with mining tools in other languages is their unfriendly output formats, which either use custom data structures or print directly to files. RuleMiner was therefore designed to output results directly into a DataFrames.jl DataFrame for better slicing, sorting, and general portability (a short sketch of this workflow follows these goals).

  2. Optimize the functions for fast execution by utilizing Julia’s parallelization capabilities.

Many of the common C and C++ libraries for data mining are only single-threaded. While they are fast, taking advantage of Julia’s mature parallelization options like Base.Threads can improve performance dramatically.
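Here is a minimal sketch of that workflow, written against the Txns/apriori calls that appear in the benchmarks later in this thread (the :Support column name is an assumption; check the docs for the actual output schema):

using RuleMiner, DataFrames

txns = Txns("retail.txt", ' ')     # load a basket-format transaction file
rules = apriori(txns, 0.05, 100)   # mine association rules at 5% minimum support
sort!(rules, :Support, rev=true)   # the result is a plain DataFrame, so ordinary
first(rules, 10)                   # DataFrames operations like sort! just work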

Future Plans

Future work will include additional algorithms and optimizations. Check the project’s GitHub page for the current roadmap.

15 Likes

RuleMiner.jl 0.1.0 Release Notes

This inaugural release includes:

  • Loading data into Transactions objects with load_transactions
  • The A Priori algorithm for association rule mining
  • The ECLAT algorithm for frequent itemset mining

RuleMiner.jl 0.2.0 Release Notes

  • Added FP-Growth algorithm for frequent itemset mining
  • Implemented performance improvements to eclat and apriori
  • Added function to convert DataFrames to Transactions

RuleMiner.jl 0.3.0 Release Notes

  • Added support for four closed itemset mining algorithms:
    • FPClose
    • CHARM
    • LCM
    • CARPENTER
  • Added RecoverClosed algorithm to recover frequent itemsets from closed itemsets
  • Moved from the @threads macro to @sync/@spawn blocks in ECLAT, Apriori, and FP tree creation to better balance the often asymmetric workload across processing cores
  • Added an additional dependency on Combinatorics.jl

RuleMiner.jl 0.4.0 Release Notes

  • Added support for two maximal itemset mining algorithms:
    • FPMax
    • GenMax
  • Reworked internal FP tree construction to use atomics instead of a lock-based multithreading approach, leading to significant speed gains and lower memory usage for all FP algorithms (a generic illustration of the idea follows this list).
  • Reworked load_transactions to no longer rely on LuxurySparse.jl for COO-CSC sparse conversion and removed that dependency
  • Moved the function to convert a DataFrame into a Transactions object directly into the Transactions struct as a constructor.
  • Completely overhauled documentation and fully documented all public functions and structs.
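For anyone unfamiliar with the atomics-versus-locks distinction, here is a generic, self-contained illustration of the concept (this is not the package's FP tree code):

using Base.Threads

counter = Threads.Atomic{Int}(0)     # shared counter, no lock required
@threads for _ in 1:1_000_000
    Threads.atomic_add!(counter, 1)  # lock-free increment from multiple threads
end
counter[]                            # 1000000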
2 Likes

RuleMiner.jl 0.5.0 Release Notes

  • Minimum supported version of Julia is now 1.10
  • New direct dependency on PrettyTables.jl (though it was already an indirect dependency)
  • Reworked Transactions objects
    • Renamed to Txns. Transactions is now an abstract type that will encompass other transaction-type objects in the future.
    • Moved load_transactions to just being a constructor for the struct
    • Reading files is now ~45% faster and uses ~70% less memory
    • Added validation checks to the constructors with meaningful error messages
    • Added pretty printing through a Base.show method
    • Added other utility methods, including for:
      • Base.getindex
      • Base.first
      • Base.last
      • Base.length
  • Exposed and documented FPTree objects for the public API
    • FPTree objects can now be created from Txns and assigned to variables outside of FP functions, allowing for re-use.
    • All FP functions (fpgrowth, fpclose, and fpmax) now accept FPTree objects in the data argument, which significantly speeds up execution when a pre-made tree is provided (see the sketch after this list)
    • Added pretty printing through a Base.show method
  • Added recover_maximal function to recover frequent itemsets from maximal itemsets
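Hedged sketch of the tree re-use pattern described above (the exact FPTree constructor arguments are assumed here; see the docs for the real signature):

using RuleMiner

txns = Txns("retail.txt", ' ')
tree = FPTree(txns, 0.05)        # build the tree once, outside the FP functions
frequent = fpgrowth(tree, 0.05)  # then reuse it across
closed   = fpclose(tree, 0.05)   # multiple FP algorithms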

Links: GitHub || Docs

10 Likes

Great work, thanks!

1 Like

Thanks a lot for your contribution! It would be interesting to compare the speed with the implementations on Christian Borgelt's web pages. Did you benchmark against them?

1 Like

I’m definitely familiar with Dr. Borgelt’s work, though I hadn’t done any testing with the FIM python interface before today.

Here are some rough benchmarks for file IO, A Priori, and FP-Growth on the non-trivial retail dataset, with the caveat that the exact numbers will vary from system to system and dataset to dataset:

File IO

Python: avg 52.951 ms across 100 iterations

with open('retail.txt') as file:
    dataset = file.read()
    dataset = dataset.split('\n')
    dataset = [i.split(' ') for i in dataset]

RuleMiner.jl: avg 76.380 ms across 100 iterations

dataset = Txns("retail.txt",' ')

A Priori

pyfim: avg 827.783 ms across 100 iterations

arules(dataset, supp=5, conf=0, report='asC')

RuleMiner.jl: avg 9.284 ms across 100 iterations

apriori(dataset, 0.05, 100)

FP-Growth

pyfim: avg 54.353 ms across 100 iterations

fpgrowth(dataset, supp=5, conf=0, report='asC')

RuleMiner.jl: avg 2.682 ms across 100 iterations

fpgrowth(dataset, 0.05)
5 Likes

Out of curiosity, how did you implement the file IO in RuleMiner.jl? Also, this would be super nice to have in your docs! Always nice to show performance metrics! Congrats on mostly beating the state-of-the-art implementations. Did you use BenchmarkTools.jl to benchmark?

1 Like

Thanks! I’ve thought about doing more benchmarking with other software, but I’ve been so focused on polishing my own code, I haven’t gotten around to it.

IO for RuleMiner is a lot more complicated than just reading the lines into vectors, which is why it's slower than the naive Python implementation above. The reading process converts transaction items into a one-hot encoded sparse matrix, which makes calculating item support in the mining algorithms much, much faster. You end up trading a little time on load for a massive speed advantage in the algorithms.

The performance bottleneck in file IO at this point is the need to assign distinct, dense, numeric item IDs to every item (these form the column indices of the matrix). If anyone has ideas on how to do that more efficiently than stuffing elements into a Dict, I'd love to hear them!
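For the curious, here is a minimal sketch of that general approach (this is not RuleMiner's actual implementation):

using SparseArrays

# Assign dense column IDs with a Dict as items are first seen, then build a
# one-hot sparse matrix from the collected (row, column) coordinates.
function read_basket_file(path::AbstractString, delim::AbstractChar=' ')
    item_ids = Dict{String,Int}()   # item name -> dense column index
    rows = Int[]; cols = Int[]      # COO coordinates of the ones
    row = 0
    for line in eachline(path)
        isempty(line) && continue
        row += 1
        for item in split(line, delim; keepempty=false)
            col = get!(item_ids, String(item), length(item_ids) + 1)
            push!(rows, row); push!(cols, col)
        end
    end
    # combine duplicate entries with | so repeated items stay Bool-valued
    matrix = sparse(rows, cols, trues(length(rows)), row, length(item_ids), |)
    return matrix, item_ids
end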

And yes, I did! I used BenchmarkTools.jl for the Julia benchmarks above and the timeit library for the Python ones.
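For reference, the Julia numbers were gathered with something along these lines (assumed setup; the original benchmark script is not shown in this thread):

using BenchmarkTools, RuleMiner

txns = Txns("retail.txt", ' ')
@benchmark apriori($txns, 0.05, 100) samples=100   # association rule benchmark
@benchmark fpgrowth($txns, 0.05) samples=100       # frequent itemset benchmark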

1 Like

The timings you report are very impressive; I will double-check to make sure my benchmarks are correct. Would you like a PR with a couple of benchmark files? I think it would attract more people to the project.

Currently I use Borgelt's work to extract rules at a FAANG company (we have a ton of data and speed really matters), so speed and ease of use are paramount. I love Borgelt's code because of the API that uses strings to specify which metrics you want to compute. It is quite old-school but very versatile.

If your project is faster (and uses a similar amount of memory; I need to test that as well), it definitely catches my attention, and I would love to lend a hand.

There are a couple of things I miss in rule mining software:

  • (I) Easy-to-use out-of-core capability leveraging some sort of memory-mapped data.
  • (II) Easy preprocessing/compression to build rules with integers. This would be especially useful when the raw data contains strings (to use less memory).
1 Like

That’s really cool! My design so far has been geared toward a more exploratory and experimental desktop/workstation use case, essentially a drop-in replacement for the R arules package. If you are looking for faster association rule mining, the SPMF Java library is even faster than my implementations.

As far as benchmarking goes, I think it’s a little early in the lifespan of this project to join a benchmark race against other software. I’d be interested in any results you come up with, but I’m not interested in maintaining a bunch of benchmarks while the package is still under heavy development.

I can also tell you without even testing it that RuleMiner’s RAM usage is almost certainly higher than Dr. Borgelt’s implementations because of my underlying matrix structure and multithreading implementations. There’s still lots of optimization work to do.

Hopefully this didn’t throw too much cold water on your enthusiasm for RuleMiner! This is still such a new project, and I want to manage expectations. Under-promise and over-deliver, as they say!

8 Likes

RuleMiner.jl 0.6.0 Release Notes

  • Txns
    • Fixed unbalanced center truncation of the Base.show output
    • Fixed Base.first and Base.last methods to return a single vector of items when no n argument is supplied
    • Added Base.lastindex method, so slicing with [end] is now supported
  • FPTree
    • Added truncation for large trees in Base.show output
  • apriori
    • Added support for confidence filtering within the function
    • New min_confidence argument in the function signature
    • Major optimizations:
      • ~30% faster execution time
      • ~80% reduction in memory usage

Links: GitHub || Docs

5 Likes

Performance Patches (0.6.1 and 0.6.2)

I recently spent time optimizing the mining functions to use less RAM and pushed out two patch releases in the last couple of days. I introduced a new matrix pruning function that prepares the matrix for mining and converts the frequent submatrix to a BitMatrix for extremely fast computation. These changes affect all of the algorithms that mine on the matrix format (i.e. everything except the FP tree algorithms).
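Conceptually, the pruning step looks something like this minimal sketch (the real internal function differs in its details):

using SparseArrays

# Keep only the columns (items) that meet min_support, then densify the
# frequent submatrix into a BitMatrix; assumes a 0/1-valued input matrix.
function prune_to_bitmatrix(X::SparseMatrixCSC, min_support::Integer)
    supports = vec(sum(X; dims=1))
    keep = findall(>=(min_support), supports)
    return BitMatrix(X[:, keep]), keep
end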

All of the functions now use considerably less RAM, and all but one of them saw performance boosts. Here are some stats relative to the 0.6.0 release:

  • apriori
    • ~34% less RAM
    • ~3X faster execution
  • charm
    • ~99% less RAM
    • ~16X faster execution
  • LCM
    • ~99% less RAM
    • ~74X faster execution
  • carpenter
    • ~99% less RAM
    • ~4,625X faster execution (not a typo; it was extremely slow before)
  • genmax
    • ~92% less RAM
    • ~1.7X faster execution

Now, the regression:

  • eclat
    • ~92% less RAM
    • ~2.5X slower execution

Eclat was already exceedingly fast, but the new matrix pruning function alone takes longer than the entire previous eclat implementation took to run, so adding the pruning step increased the total time significantly. That said, RAM usage is still dramatically lower than before, and eclat is still the fastest option for mining frequent itemsets in the package.

I’m planning to keep this architecture and focus on making the matrix pruning subroutine run faster, hopefully closing the regression gap on eclat and increasing the performance of the other algorithms too. Right now, the largest bottleneck in the subroutine is the conversion from the SparseMatrixCSC storage format to the BitMatrix format. If anyone has any ideas on how to speed that up, I’d love to hear them!

4 Likes

Amazing Jared.

In prune_matrix you can leverage the fact that the SparseMatrixCSC you are using only contains ones.

Instead of this

 sum(X_csc; dims=1)

You can look at the column pointers of the array and subtract consecutive values. This gives the number of nonzero values in each column (which in your case equals the sum of the elements in the column, because they are all 1s).

For example consider the following X_csc

X_csc = sparse([0 1 0; 1 0 1; 0 1 1; 1 1 0; 1 1 1; 1 1 0])
6×3 SparseMatrixCSC{Int64, Int64} with 12 stored entries:
 ⋅  1  ⋅
 1  ⋅  1
 ⋅  1  1
 1  1  ⋅
 1  1  1
 1  1  ⋅

This has

X_csc.colptr
4-element Vector{Int64}:
  1
  5
 10
 13

Since your matrix has only ones, 5-1 = 4, 10-5 = 5, and 13-10 = 3 give the sums over the columns without looking at the values.

This is the implementation

function get_supports_from_transaction_csc(X_csc::SparseMatrixCSC)
    
    n_result = size(X_csc, 2)
    vec_result = Vector{Int64}(undef, n_result)
    
    previous_elem = X_csc.colptr[1]
    for (i,x) in enumerate(X_csc.colptr[2:end])
        vec_result[i] = x-previous_elem
        previous_elem = x
    end
    return vec_result
end
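
For reference, the same per-column counts can also be read off in one line, since colptr stores cumulative entry counts:

col_counts = diff(X_csc.colptr)   # stored entries per column (equal to the column sums here)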

Benchmark


using SparseArrays, BenchmarkTools

# Create a 100000 x 100 sparse matrix with a density of 0.01 (1% of elements are non-zero)
m = 100000
n = 100
density = 0.01

A_csc = sprand(m, n, density)
A_csc.nzval .= 1;
@btime get_supports_from_transaction_csc(A_csc)
  132.106 ns (2 allocations: 1.75 KiB)
@btime sum(A_csc; dims=1)
  18.208 μs (4 allocations: 1.02 KiB)

Option 1) Finding sorted items without storing all supports

You can use the idea above to change this

supports = sum(matrix, dims=1)
sorted_items = [i for i in axes(matrix,2) if supports[1,i] >= min_support]

By this

function get_sorted_items(X_csc::SparseMatrixCSC, min_support)
    
    sorted_items = Vector{Int64}()
    supports = Vector{Int64}()

    previous_elem = X_csc.colptr[1]
    for (i,x) in enumerate(X_csc.colptr[2:end])
        current_support = x-previous_elem 
        previous_elem = x
        if current_support >= min_support
            push!(sorted_items, i)
            push!(supports, current_support)
        end
        
    end
    return sorted_items, supports
end

Option 2) Finding sorted items storing all supports

function get_sorted_items2(X_csc::SparseMatrixCSC, min_support)
    
    supports  = get_supports_from_transaction_csc(X_csc)
    sorted_items = [i for i in axes(X_csc,2) if supports[i] >= min_support]

    return sorted_items, supports
end

It seems a bit faster to compute but at the expense of storing all supports.

Benchmark

Current version

function find_sorted_items_current(matrix::SparseMatrixCSC, min_support)
    supports = sum(matrix, dims=1)
    sorted_items = [i for i in axes(matrix,2) if supports[1,i] >= min_support]
    return sorted_items, supports
end
@btime find_sorted_items_current(A_csc, 4)
  18.833 μs (10 allocations: 3.03 KiB)

Options proposed

@btime get_sorted_items(A_csc, 4)
  950.818 ns (10 allocations: 4.75 KiB)
@btime get_sorted_items2(A_csc, 4)
  645.078 ns (8 allocations: 3.77 KiB)

If you want, I can open a PR during the week, but feel free to make the change yourself if this makes sense to you.

6 Likes

This is great stuff!

I’m definitely interested in getting the items without having to store all of the supports. Looking at your get_sorted_items function, it seems like we could optimize this even further by using a single Vector{Tuple{Int64,Int64}} instead of two Vector{Int64}s. This roughly halves the allocations, because we only have to push! once on each iteration.

using SparseArrays, BenchmarkTools

m = 100000
n = 100
density = 0.01

mat = sprand(Bool, m, n, density)

function get_sorted_items_current(matrix::SparseMatrixCSC, min_support)
    supports = sum(matrix, dims=1)
    sorted_items = [i for i in axes(matrix,2) if supports[1,i] >= min_support]
    return sorted_items, supports
end

function get_sorted_items(X_csc::SparseMatrixCSC, min_support)
    
    sorted_items = Vector{Int64}()
    supports = Vector{Int64}()

    previous_elem = X_csc.colptr[1]
    for (i,x) in enumerate(X_csc.colptr[2:end])
        current_support = x-previous_elem 
        previous_elem = x
        if current_support >= min_support
            push!(sorted_items, i)
            push!(supports, current_support)
        end
    end
    return sorted_items, supports
end

function get_sorted_items_improved(X_csc::SparseMatrixCSC, min_support)
    result = Vector{Tuple{Int64,Int64}}()
    previous_elem = X_csc.colptr[1]
    
    for (i,x) in enumerate(X_csc.colptr[2:end])
        current_support = x - previous_elem
        previous_elem = x
        if current_support >= min_support
            push!(result, (i, current_support))
        end
    end
    
    return result
end

Results:

@btime get_sorted_items_current(mat,500);
  8.625 μs (11 allocations: 3.08 KiB)
@btime get_sorted_items(mat,500);
  914.361 ns (10 allocations: 4.75 KiB)
@btime get_sorted_items_improved(mat,500);
  550.134 ns (5 allocations: 4.61 KiB)

Definitely going to integrate this into RuleMiner!

Be warned that the difference may not be super obvious in the final algorithms, because the internal fast_convert function is still orders of magnitude slower than these support-counting operations.

2 Likes

Nice, thanks for sharing. I will check the fast_convert method. I think sharing MWE-style micro-benchmarks of functions like fast_convert can drastically lower the barrier to getting help. But to do this, I guess one also needs to profile execution to understand the percentage of time spent in each of the function calls of a rule mining algorithm, to see what needs to be improved.

2 Likes

For sure, I’m going to keep working on optimizing that fast_convert function and do additional profiling and optimization.

I hadn’t posted a reprex here mostly because I want to avoid turning this thread into a help thread.

I just opened another thread where we can talk about the fast_convert function and ways to improve it.

1 Like