[ANN] RuleMiner.jl - Efficient data mining in Julia

Links: GitHub || Docs

RuleMiner is a Julia package for data mining inspired by the arules R package and SPMF Java library.

Data mining is a domain of computer science dedicated to efficiently searching for patterns in datasets. Common data mining techniques include association rule mining, itemset mining, and sequence mining among others. These techniques have applications across diverse disciplines including market basket analysis, computational biology, web traffic analysis, epidemiology, and more.

Design Goals

  1. Provide an ergonomic API for easily importing data, exporting results, and applying data mining techniques.

One major headache with mining tools in other languages is their unfriendly output formats that either use custom data structures or print directly into files. Thus, RuleMiner was designed to output directly into a DataFrames.jl DataFrame for better slicing, sorting and general portability.
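As a sketch of what this looks like in practice (using the Txns API from the current release; the result column names below are illustrative assumptions, not guaranteed output):

```julia
using RuleMiner, DataFrames

# Load space-delimited transaction data (one transaction per line)
txns = Txns("retail.txt", ' ')

# Mine association rules at 5% minimum support with a max rule length of 100
rules = apriori(txns, 0.05, 100)

# The result is an ordinary DataFrame, so standard DataFrames.jl operations
# apply directly. (Column names here are assumptions for illustration.)
sort!(rules, :Confidence, rev = true)
top_rules = first(rules, 10)
```

Because the output is a plain DataFrame, it can be written out with CSV.jl, joined against other tables, or plotted without any conversion step.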

  2. Optimize the functions for fast execution by utilizing Julia’s parallelization capabilities.

Many of the common C and C++ libraries for data mining are only single-threaded. While they are fast, taking advantage of Julia’s mature parallelization options like Base.Threads can improve performance dramatically.

Future Plans

Future work will include additional algorithms and optimizations. Check the project’s GitHub page for the current roadmap.


RuleMiner.jl 0.1.0 Release Notes

This inaugural release includes:

  • Loading data into Transactions objects with load_transactions
  • The Apriori algorithm for association rule mining
  • The ECLAT algorithm for frequent itemset mining

RuleMiner.jl 0.2.0 Release Notes

  • Added FP-Growth algorithm for frequent itemset mining
  • Implemented performance improvements to eclat and apriori
  • Added function to convert DataFrames to Transactions

RuleMiner.jl 0.3.0 Release Notes

  • Added support for four closed itemset mining algorithms:
    • FPClose
    • CHARM
    • LCM
    • CARPENTER
  • Added RecoverClosed algorithm to recover frequent itemsets from closed itemsets
  • Moved from the @threads macro to @sync/@spawn blocks in ECLAT, Apriori, and FP tree creation to better balance the often-asymmetric workload across processor cores
  • Added an additional dependency on Combinatorics.jl

RuleMiner.jl 0.4.0 Release Notes

  • Added support for two maximal itemset mining algorithms:
    • FPMax
    • GenMax
  • Reworked internal FP Tree construction to use atomics instead of a lock-based multithreading approach, leading to significant speed gains and lower memory usage for all FP algorithms.
  • Reworked load_transactions to no longer rely on LuxurySparse.jl for COO-CSC sparse conversion and removed that dependency
  • Moved the function to convert a DataFrame into a Transactions object directly into the Transactions struct as a constructor.
  • Completely overhauled documentation and fully documented all public functions and structs.
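As a rough sketch of the DataFrame route (the exact constructor name and expected one-hot layout are assumptions for illustration, not confirmed API):

```julia
using RuleMiner, DataFrames

# A one-hot style DataFrame: one row per transaction,
# one Bool column per item marking presence
df = DataFrame(milk  = [true, false, true],
               bread = [true, true, false],
               eggs  = [false, true, true])

# Hypothetical: construct a transactions object directly from the DataFrame
# via the struct constructor described above
txns = Transactions(df)
```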

RuleMiner.jl 0.5.0 Release Notes

  • Minimum supported version of Julia is now 1.10
  • New direct dependency on PrettyTables.jl (though it was already an indirect dependency)
  • Reworked Transactions objects
    • Renamed to Txns. Transactions is now an abstract type that will encompass other transaction-type objects in the future.
    • load_transactions is now simply a constructor for the struct
    • Reading files is now ~45% faster and uses ~70% less memory
    • Added validation checks to the constructors with meaningful error messages
    • Added pretty printing through a Base.show method
    • Added other utility methods, including for:
      • Base.getindex
      • Base.first
      • Base.last
      • Base.length
  • Exposed and documented FPTree objects for the public API
    • FPTree objects can now be created from Txns and assigned to variables outside of FP functions, allowing for re-use.
    • All FP functions (fpgrowth, fpclose, and fpmax) now accept FPTree objects in the data argument. This significantly speeds up execution time if a pre-made tree is provided
    • Added pretty printing through a Base.show method
  • Added recover_maximal function to recover frequent itemsets from maximal itemsets
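The tree re-use might look roughly like this (the FPTree constructor arguments are an assumption based on the description above, and the fpclose/fpmax signatures are assumed to mirror fpgrowth):

```julia
using RuleMiner

txns = Txns("retail.txt", ' ')

# Build the FP tree once (assumed constructor: a Txns plus minimum support)
tree = FPTree(txns, 0.05)

# Re-use the same tree across the FP family of miners,
# skipping repeated tree construction on each call
frequent = fpgrowth(tree, 0.05)
closed   = fpclose(tree, 0.05)
maximal  = fpmax(tree, 0.05)
```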

Links: GitHub || Docs


Great work, thanks!


Thanks a lot for your contribution! It would be interesting to compare the speed with Christian Borgelt’s implementations – did you benchmark it?


I’m definitely familiar with Dr. Borgelt’s work, though I hadn’t done any testing with the FIM python interface before today.

Here are some rough benchmarks for file IO, Apriori, and FP-Growth on the non-trivial retail dataset, with the caveat that the exact numbers will vary from system to system and dataset to dataset:

File IO

Python: avg 52.951 ms across 100 iterations

with open('retail.txt') as file:
    dataset = file.read()
    dataset = dataset.split('\n')
    dataset = [i.split(' ') for i in dataset]

RuleMiner.jl: avg 76.380 ms across 100 iterations

dataset = Txns("retail.txt",' ')

Apriori

pyfim: avg 827.783 ms across 100 iterations

arules(dataset, supp=5, conf=0, report='asC')

RuleMiner.jl: avg 9.284 ms across 100 iterations

apriori(dataset, 0.05, 100)

FP-Growth

pyfim: avg 54.353 ms across 100 iterations

fpgrowth(dataset, supp=5, conf=0, report='asC')

RuleMiner.jl: avg 2.682 ms across 100 iterations

fpgrowth(dataset, 0.05)

Out of curiosity, how did you implement file IO in RuleMiner.jl? Also, this would be super nice to have in your docs! It’s always nice to show performance metrics! Congrats on mostly beating the state-of-the-art implementations – did you use BenchmarkTools.jl for the benchmarks?


Thanks! I’ve thought about doing more benchmarking with other software, but I’ve been so focused on polishing my own code, I haven’t gotten around to it.

IO for RuleMiner is a lot more complicated than just reading the lines as vectors, which is why it’s slower than the naive python implementation above. The reading process converts transaction items to a 1-hot encoded sparse matrix which makes calculating item support in the mining algorithms much, much faster. You end up trading a little time on load for a massive speed advantage in the algorithms.
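The payoff of the sparse one-hot layout is that per-item support reduces to a column sum. A self-contained illustration with the SparseArrays standard library (a generic sketch, not RuleMiner’s actual internals):

```julia
using SparseArrays

# Rows are transactions, columns are items; a stored `true` marks presence.
# Transaction 1 = {item1, item2}, 2 = {item2}, 3 = {item1, item3}
X = sparse([1, 1, 2, 3, 3], [1, 2, 2, 1, 3], trues(5), 3, 3)

# Support of each item = fraction of transactions containing it,
# which is just a column sum over the sparse matrix
n = size(X, 1)
support = vec(sum(X, dims = 1)) ./ n   # [2/3, 2/3, 1/3]
```

Column sums over a sparse CSC matrix only touch stored entries, which is what makes support counting in the mining loop so cheap.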

The performance bottleneck in file IO at this point is the need to assign distinct, dense, numeric item IDs for every item (forming the column indexes in the matrix). If anyone has any ideas on how to do that more efficiently than stuffing elements into a Dict, I’d love to hear it!
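For reference, the Dict-based assignment in question looks roughly like this (a generic sketch, not RuleMiner’s actual code):

```julia
# Assign dense, distinct integer IDs to items in first-seen order
function assign_ids(items)
    ids = Dict{String, Int}()
    for item in items
        # get! inserts length(ids) + 1 only if the item is not yet present
        get!(ids, item, length(ids) + 1)
    end
    return ids
end

ids = assign_ids(["milk", "bread", "milk", "eggs"])
# ids maps milk => 1, bread => 2, eggs => 3
```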

And yes, I did! I used BenchmarkTools.jl for the Julia benchmarks above and the timeit library for the Python ones.
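For anyone reproducing the Julia numbers, a standard BenchmarkTools.jl run looks along these lines (the sample count here is illustrative):

```julia
using BenchmarkTools, RuleMiner

txns = Txns("retail.txt", ' ')

# Benchmark mining only, excluding file IO; interpolating with $ keeps
# the setup work out of the timed expression
@benchmark fpgrowth($txns, 0.05) samples = 100
```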


The timings you report are very impressive; I will double-check to ensure the benchmarks are correct. Would you want a PR with a couple of files to add benchmarks? I think it would attract more people to the project.

Currently I use Borgelt’s work to extract rules at a FAANG company (we have a ton of data and speed really matters). The speed and ease of use are paramount. I love Borgelt’s code because of the API that uses strings to specify which metrics you want to compute. It is quite old-school but very versatile.

If your project is faster (and uses a similar amount of memory, I need to test that as well), it definitely catches my attention, and I would love to help give a hand.

There are a couple of things I miss in rule mining software:

  • (I) Easy-to-use out-of-core capability leveraging some sort of memory-mapped data.
  • (II) Easy preprocessing/compression to build rules with integers. This would be especially useful when the raw data contains strings (to use less memory).
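A minimal sketch of the kind of preprocessing described in (II), dictionary-encoding string items to integers before mining (generic code, not tied to any particular package):

```julia
# Dictionary-encode string transactions into integer transactions, keeping
# a reverse lookup table to decode mined rules afterwards
function encode(transactions::Vector{Vector{String}})
    dict = Dict{String, Int}()
    encoded = [[get!(dict, item, length(dict) + 1) for item in t]
               for t in transactions]
    decode = Vector{String}(undef, length(dict))
    for (item, id) in dict
        decode[id] = item
    end
    return encoded, decode
end

enc, dec = encode([["milk", "bread"], ["bread", "eggs"]])
# enc == [[1, 2], [2, 3]];  dec == ["milk", "bread", "eggs"]
```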

That’s really cool! My design so far has been geared towards a more exploratory and experimental desktop/workstation use case, essentially a drop-in replacement for the R arules package. If you are looking for faster association rule mining, the SPMF Java library is even faster than my implementations.

As far as benchmarking, I think it’s a little early in the lifespan of this project to join a benchmark race against other software. I’d be interested in any results you come up with, but I’m not interested in maintaining a bunch of benchmarks while the package is still under heavy development.

I can also tell you without even testing it that RuleMiner’s RAM usage is almost certainly higher than Dr. Borgelt’s implementations because of my underlying matrix structure and multithreading implementations. There’s still lots of optimization work to do.

Hopefully this didn’t throw too much cold water on your enthusiasm for RuleMiner! This is still such a new project, and I want to manage expectations. Under-promise and over-deliver, as they say!
