Autocategorization of bank expenses

Continuing the discussion from Personal Finance package?:

I want to automatically categorize bank expenses into about 25 different categories. I already have about 3000 expense-category pairs. Each expense consists of:

  1. the date of the expense (Date)
  2. a comment describing the business/account involved in the transaction (String)
  3. and the amount of the transaction (Float64)

While the date is probably useless for the categorization, the individual words and word-combinations in the comment as well as the sign and magnitude of the amount should do the trick.

I’ve never really used ML before and never on a mixed dataset like this (words and numbers). Is there some straightforward way for me to do this in Julia?

Thanks in advance!


First, tokenize your comment String into a vector of words.
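
To make the tokenizing step concrete, here is a minimal sketch that uses only Base Julia. The `tokenize` helper and the example comment are made up for illustration, and a dedicated tokenizer package will handle edge cases much better than this regex split.

```julia
# Hypothetical helper (not from any package): lowercase the comment and
# split on every run of characters that are not letters or digits.
tokenize(comment::AbstractString) =
    String.(split(lowercase(comment), r"[^\p{L}\p{N}]+"; keepempty=false))

# Made-up transaction comment.
tokens = tokenize("CARD PAYMENT Corner-Grocery 0423")
# tokens == ["card", "payment", "corner", "grocery", "0423"]
```

Lowercasing is a design choice here: bank comments are often shouting in all caps, and most embedding vocabularies are case-sensitive.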

Then convert each of those words into a word embedding, which is a vector that (kind of) represents the meaning.

Sum those vectors up across all the words in the comment.
That gives you a vector that kind of represents the meaning (or at least the word content)
of your transaction comment.
(This works unreasonably well as an input representation for classification tasks (this is basically my thesis topic))
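
A rough sketch of the lookup-and-sum step, assuming the embeddings are available as a plain `Dict` from word to vector. The table below is a toy with made-up 3-dimensional values; real pretrained embeddings have hundreds of dimensions and come from a package or a downloaded file. `comment_vector` is a hypothetical name.

```julia
# Toy embedding table; the numbers are invented purely for illustration.
toy_embeddings = Dict(
    "grocery" => [0.9, 0.1, 0.0],
    "fuel"    => [0.0, 0.8, 0.2],
    "payment" => [0.1, 0.1, 0.1],
)

# Sum the vectors of all words found in the table.
# Words missing from the table (e.g. "card" below) contribute a zero vector.
function comment_vector(words, table, dim)
    v = zeros(dim)
    for w in words
        v .+= get(table, w, zeros(dim))
    end
    return v
end

v = comment_vector(["card", "payment", "grocery"], toy_embeddings, 3)
```

Skipping out-of-vocabulary words is only one possible choice. Shop names often will not be in a pretrained vocabulary, so it may be worth checking how many of your words are actually found.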

Append to that vector the transaction amount as a float – maybe normalized or scaled.
Now you have a vector representing that transaction.
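
As a sketch of that append step: the signed log scaling below is just one assumption about what "normalized or scaled" could mean, chosen because it keeps the sign (money in vs. money out) while squashing large amounts. `scale_amount` and `features` are hypothetical names.

```julia
# Signed log scaling: keeps the sign of the amount, compresses its magnitude.
scale_amount(amount::Real) = sign(amount) * log1p(abs(amount))

# Append the scaled amount to the comment vector to get the full feature vector.
features(comment_vec::AbstractVector, amount::Real) =
    vcat(comment_vec, scale_amount(amount))

# Made-up comment vector and amount.
x = features([1.0, 0.2, 0.1], -42.5)
```

Standardizing the amount (subtract the mean, divide by the standard deviation of the training amounts) would be an equally reasonable alternative.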

Feed it to your favorite ML library.

I discuss several of them in the context of binary classification here:
but all the ones discussed also work for multiclass.
You may have to read the docs for them a bit.

I’d start with LIBSVM.jl
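
To show the shape of the data a classifier expects without depending on any package, here is a dependency-free stand-in: a nearest-centroid classifier. This is deliberately not an SVM and not the LIBSVM.jl API, just a baseline sketch. The function names, feature values, and the two categories are all made up.

```julia
# Nearest-centroid classifier: a simple stand-in baseline, NOT an SVM.
# X holds one feature vector per column; y holds the category of each column.
function fit_centroids(X::AbstractMatrix, y::AbstractVector)
    centroids = Dict{eltype(y), Vector{Float64}}()
    for label in unique(y)
        cols = findall(==(label), y)
        # Mean of all columns that belong to this category.
        centroids[label] = vec(sum(X[:, cols]; dims=2)) ./ length(cols)
    end
    return centroids
end

# Predict the category whose centroid is closest in squared Euclidean distance.
function predict(centroids::Dict, x::AbstractVector)
    best_label, best_dist = first(keys(centroids)), Inf
    for (label, c) in centroids
        d = sum(abs2, x .- c)
        if d < best_dist
            best_label, best_dist = label, d
        end
    end
    return best_label
end

# Four made-up transactions: two embedding dimensions plus the scaled amount.
X = [ 1.0  0.9  0.0  0.1;
      0.1  0.0  0.8  0.9;
     -3.8 -3.1 -4.2 -4.0]
y = ["groceries", "groceries", "fuel", "fuel"]

model = fit_centroids(X, y)
predict(model, [0.95, 0.05, -3.5])
```

With about 3000 labeled expenses it is worth holding some of them out and checking accuracy on those before trusting any classifier, whether it is this baseline or an SVM.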

  1. $Profit?

There are a bunch of actual details hidden under those steps.
But that is the big picture.


This sounds amazingly like the work I am doing!! Perhaps we can connect.

Absolutely! PM