I don’t know what you mean by “personalized”, but if you want to know what a particular person has found interesting, you can look at the list of their starred projects on GitHub.
Are these ratings binary, range-based or real? How many items and how many users do you have? All the details are important.
For example, range-based ratings (e.g. 1-5 stars for a movie on Netflix) and real-valued ratings (e.g. the normalized number of times a user listened to a singer on Last.fm) work well with cosine distance, but binary data (e.g. likes on Facebook or products bought on Amazon) often gives better results with Jaccard distance.
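To make the two distances concrete, here is a minimal Python sketch. The function names and the sample vectors are made up for illustration; the convention of returning 1.0 for an all-zero vector is an assumption, not something the metric dictates.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumed convention: an empty profile matches nothing
    return 1.0 - dot / (norm_a * norm_b)

def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| for two collections of liked item ids."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention: two empty sets are identical
    return 1.0 - len(a & b) / len(union)

# Range-based or real-valued ratings: one slot per item, 0 = not rated.
stars_alice = [5, 3, 0, 1]
stars_bob = [4, 2, 0, 1]
print(cosine_distance(stars_alice, stars_bob))

# Binary data: only the set of liked items matters.
likes_alice = {"item1", "item2", "item3"}
likes_bob = {"item2", "item3", "item4"}
print(jaccard_distance(likes_alice, likes_bob))  # 2 shared of 4 total -> 0.5
```

Cosine distance looks at the direction of the rating vector, so it uses the rating magnitudes; Jaccard distance only compares which items overlap.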
If you have only a thousand users in 100 MB of data, simple iterative collaborative filtering (CF) will work, but for millions of users and gigabytes of data you will have to use databases or data structures built for quick neighbor retrieval.
Or change the algorithm. For instance, if you have enough memory, you can factorize the user-item matrix and find latent components. If your data is binary and not very sparse, you can also use a restricted Boltzmann machine (RBM).
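One common way to factorize a user-item matrix is stochastic gradient descent over the observed ratings only. The sketch below assumes that approach, which is not necessarily the one meant above; the function names, hyperparameters, and the tiny `ratings` list are all invented for the example.

```python
import random

def factorize(ratings, n_users, n_items, k=2, epochs=2000,
              lr=0.01, reg=0.02, seed=0):
    """Fit k latent factors per user and per item by SGD.

    ratings: list of (user_index, item_index, value) triples.
    Returns (P, Q) so that dot(P[u], Q[i]) approximates the rating.
    """
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step with L2 regularization on both factors.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    """Predicted rating of item i by user u."""
    return sum(pu * qi for pu, qi in zip(P[u], Q[i]))

def rmse(P, Q, ratings):
    """Root mean squared error on the given rating triples."""
    total = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return (total / len(ratings)) ** 0.5

# Toy data: 3 users, 3 items, 6 observed ratings on a 1-5 scale.
ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 1), (2, 1, 5), (2, 2, 2)]
P, Q = factorize(ratings, n_users=3, n_items=3)
print("training RMSE:", rmse(P, Q, ratings))
print("predicted rating of item 2 by user 0:", predict(P, Q, 0, 2))
```

The missing cells of the matrix are never materialized, so memory use scales with the number of observed ratings plus the factor matrices.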
Finally, if you want to recommend GitHub projects to users based on their “stars”, most likely none of the above will work; a better approach may be to use item-based recommendation systems (based on similarity between projects, not users) or even some custom regression-based sorting.
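As a rough sketch of the item-based idea, the Python below scores each project a user has not starred by its summed Jaccard similarity to the projects they have starred. The similarity measure, the scoring rule, and the `stars` data are assumptions chosen to keep the example short.

```python
from collections import defaultdict
from itertools import combinations

def item_similarities(stars):
    """stars: dict user -> set of starred projects.

    Returns the Jaccard similarity for every ordered pair of projects
    that share at least one user.
    """
    users_by_item = defaultdict(set)
    for user, items in stars.items():
        for item in items:
            users_by_item[item].add(user)
    sims = {}
    for a, b in combinations(sorted(users_by_item), 2):
        common = users_by_item[a] & users_by_item[b]
        if common:
            union = users_by_item[a] | users_by_item[b]
            sims[(a, b)] = sims[(b, a)] = len(common) / len(union)
    return sims

def recommend(stars, user, n=3):
    """Rank unstarred projects by summed similarity to the user's stars."""
    seen = stars[user]
    scores = defaultdict(float)
    for (a, b), s in item_similarities(stars).items():
        if a in seen and b not in seen:
            scores[b] += s
    # Sort by score (descending), breaking ties by name for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))[:n]

# Made-up star lists.
stars = {
    "ann": {"repo_a", "repo_b", "repo_c"},
    "bob": {"repo_a", "repo_b"},
    "cat": {"repo_c", "repo_d"},
    "dan": {"repo_a"},
}
print(recommend(stars, "dan"))  # ['repo_b', 'repo_c']
```

Because similarities are computed between projects, they can be precomputed once and reused for every user, which is part of why this variant tends to scale better than user-to-user comparison.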
All of the above is easy to implement, but performance differs significantly depending on use case.
The RecSys repo is probably a decent example. Basically, what you want to do is find a lowish dimensional manifold where user preferences actually live and then use what you know about a user to estimate where they lie on the manifold. The SVD approach does this with a hyperplane. The high-level recipe for the SVD approach is roughly:
1. Normalize users so that each user has unit L2 norm in item space. This is slightly unintuitive since it considers users with different amounts of information to be equally important, but this is what I found works best in any case. Don’t de-duplicate users with the same set of items or anything like that; the number of them is important. If you have a preference scale (1-10 ratings or something like that), consider making it binary: the scale tends to be actively unhelpful and the only thing that matters is “like” or “don’t like”. If you really must incorporate the scale, consider doing it with separate dimensions for each item-rating. So if a user rated something as r there would be a 1 in the (user, 10*item+(r-1)) entry. The prediction then ends up being a pseudo-distribution over item ratings for each user.
2. Optional: split users into subgroups so that dominant groups don’t completely overwhelm the analysis. Subgroups do not actually need to be exclusive; you can let users appear in multiple subgroups. One slightly janky but effective way to do this is to do a preliminary, e.g. 10-dimensional, truncated SVD and split users based on their most prominent singular vectors (i.e. the singular vectors with largest absolute weight). You could, for example, include each user in the subgroups associated with their two largest singular vector weights. You can think of these as primary and secondary interests for each user. You will analyze each interest group separately so that as much modeling nuance goes into the 10th most common interest in the system as into the 1st most common interest. The latter is often orders of magnitude more popular, which is why it can overwhelm the analysis if you don’t do this step.
3. Take the truncated SVD up to some dimension (e.g. 10) of each subgroup and collect all of the singular vectors from all the groups. If you do a 10-dimensional SVD of 10 groups, you’ll have 100 “taste vectors”. The span of these 100 vectors is the hyperplane of “actual user tastes” that you’ve computed. You’ll have to play around with the number of subgroups and dimensions to get good results.
4. To make predictions, project users onto the subspace spanned by the collected vectors from step 3. This takes what you know about them and what you know about users as a whole to make an inference about individuals. The predicted good items for a user are those with the largest coefficient in their projected taste vector. Note that a user’s predictions are not affected by the subgroups they were in initially; that was just for finding a taste subspace.
All of this can be done very straightforwardly and efficiently in Julia with sparse matrices, svds and the partialsortperm function.
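For readers who want to see the normalize, SVD and project steps end to end, here is a dependency-free Python sketch on a toy binary matrix. It skips the optional subgroup split and replaces a real truncated SVD with a small power iteration, which is only reasonable at this size; the function names and the `users` data are assumptions made for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit_rows(rows):
    """Scale each user's row to unit L2 norm."""
    out = []
    for row in rows:
        norm = math.sqrt(dot(row, row))
        out.append([x / norm for x in row] if norm else list(row))
    return out

def taste_vectors(rows, k, iters=200):
    """Approximate the top-k right singular vectors of the user-item
    matrix A by power iteration on A^T A, with deflation."""
    n_items = len(rows[0])
    found = []
    for _ in range(k):
        v = [1.0] * n_items
        for _ in range(iters):
            for u in found:  # stay orthogonal to vectors already found
                c = dot(v, u)
                v = [x - c * y for x, y in zip(v, u)]
            av = [dot(row, v) for row in rows]                # A v
            w = [sum(row[i] * a for row, a in zip(rows, av))  # A^T (A v)
                 for i in range(n_items)]
            norm = math.sqrt(dot(w, w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        found.append(v)
    return found

def recommend(liked, basis, n_items, n=1):
    """Project a user onto the taste subspace and return the unseen
    items with the largest projected coefficients."""
    x = unit_rows([[1.0 if i in liked else 0.0 for i in range(n_items)]])[0]
    proj = [0.0] * n_items
    for v in basis:  # basis vectors are (approximately) orthonormal
        c = dot(x, v)
        proj = [p + c * y for p, y in zip(proj, v)]
    unseen = [i for i in range(n_items) if i not in liked]
    return sorted(unseen, key=lambda i: (-proj[i], i))[:n]

# Toy data: rows are users, columns are items, 1 = "like".
users = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
]
basis = taste_vectors(unit_rows(users), k=2)
print(recommend({0, 1}, basis, n_items=6))  # [2]
```

A new user who liked items 0 and 1 lands on the taste vector shared by the first three users, so item 2 gets the largest projected coefficient among the items they have not seen.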
@dfdx I have a bit of wiggle room as it’s for a Julia book. So performance or response time is not that much of an issue as it’s a toy example. However, illustrating a happy path, simple, exciting, Julian way of setting up a rec-sys that could be used in production with small optimizations, would be awesome.
What I have in mind for the “big finale” is a rec-sys for a dating website using data from occamslab.com. It contains user-profile ratings between 1 and 10 plus gender information for each profile. Before that, I also discuss other types of rec-sys, such as content-based ones, and look at simple examples (like movie recommendations based on genre).
Not sure how the GitHub stars idea got into this discussion, but no, it’s not related to this at all.
@StefanKarpinski RecSys.jl looks good in terms of API. Since the examples are targeting beginners, a workflow like the one presented, along the lines of train(...), save(...), load(...), recommend(...), would be great. Implementing all that logic by hand would be too much, I’m afraid. But I’ll check the source; maybe it’s not as intimidating as it sounds. Ideally there would be a plug-and-play workflow that would make things simple for beginners.
If I have to roll it by hand, I’d like to keep it simple, so I’m thinking about loading the data into a users-profiles matrix (DataFrame) and using Euclidean distance or Pearson’s correlation coefficient to make recommendations. Any better ideas?
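For reference, a hand-rolled version of the Pearson variant can stay quite small. The Python sketch below is one possible reading of that plan: it correlates users over the profiles both have rated and ranks unseen profiles by a similarity-weighted average. The decision to ignore neighbours with non-positive correlation, and the 1-10 toy ratings, are assumptions.

```python
import math

def pearson(a, b):
    """Pearson correlation over the profiles both users have rated.

    a, b: dict profile_id -> rating. Returns 0.0 when undefined
    (fewer than two shared profiles, or zero variance).
    """
    common = [p for p in a if p in b]
    n = len(common)
    if n < 2:
        return 0.0
    mean_a = sum(a[p] for p in common) / n
    mean_b = sum(b[p] for p in common) / n
    cov = sum((a[p] - mean_a) * (b[p] - mean_b) for p in common)
    var_a = sum((a[p] - mean_a) ** 2 for p in common)
    var_b = sum((b[p] - mean_b) ** 2 for p in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def recommend(ratings, user, n=3):
    """Rank unseen profiles by similarity-weighted average rating,
    using only neighbours with positive correlation (an assumption)."""
    totals, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = pearson(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for profile, r in other_ratings.items():
            if profile not in ratings[user]:
                totals[profile] = totals.get(profile, 0.0) + sim * r
                weights[profile] = weights.get(profile, 0.0) + sim
    scores = {p: totals[p] / weights[p] for p in totals}
    return sorted(scores, key=lambda p: (-scores[p], p))[:n]

# Made-up 1-10 ratings: user -> {profile: rating}.
ratings = {
    "u1": {"p1": 9, "p2": 2, "p3": 8},
    "u2": {"p1": 8, "p2": 1, "p3": 9, "p4": 10, "p5": 3},
    "u3": {"p1": 2, "p2": 9, "p3": 1, "p4": 2, "p5": 9},
}
print(recommend(ratings, "u1"))  # ['p4', 'p5']
```

Here u2 agrees with u1 and u3 disagrees, so only u2's ratings feed the ranking. Swapping `pearson` for a Euclidean-distance-based similarity would leave the rest of the code unchanged.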
It’s usually worth looking at what scikit-learn has to say. I’m not quite sure what the state-of-the-art is for doing this sort of thing. Considering the way many things are done lately, a naive approach could be to train a machine learning classifier of some sort on the user’s known data points. Of course this would require some schemes to improve convergence and might also pose some performance obstacles. Off the top of my head, I would attempt to train a classifier which has been initialized using users that are “similar” by some metric.
@ExpandingMan Thanks, yes, I was looking amongst other things at the Julia implementation of scikit-learn (Quick start guide - ScikitLearn.jl) and it’s quite nice. It may be that my data sample is too basic, but it seems quite difficult to frame this as a classification problem.
Found this nice demo of RecSys. I guess it’s worth giving it a try; hopefully it still works with recent Julia and dependencies.