Recommender Systems In Practice

Qianqian Shan
Contents

1 Introduction to Recommender Systems
  1.1 Types of Recommenders
2 Simple Recommendation Systems
  2.1 Basic Intuition
  2.2 News Feeds Algorithm
    2.2.1 Hacker News Ranking Algorithm
    2.2.2 Reddit News Ranking Algorithm
  2.3 Recommendations Based on Average Ratings
    2.3.1 Problems with Average Ratings
    2.3.2 Explore-Exploit Dilemma
  2.4 Bayesian Paradigm
    2.4.1 Bayesian Paradigm with an Example
    2.4.2 How Does the Bayesian Paradigm Help with Ranking/Recommendation?
  2.5 Supervised Learning for Recommender Systems
  2.6 Markov Models and Google Page Rank
    2.6.1 Markov Models
    2.6.2 State Distribution
    2.6.3 Connection to Recommender Systems
3 Evaluating a Recommender System
4 Collaborative Filtering (CF)
  4.1 Notations
  4.2 User-User CF
  4.3 Item-Item CF
  4.4 Recommender System with MapReduce
    4.4.1 Item CF
    4.4.2 Recommender System with MapReduce
5 Matrix Factorization
  5.1 Basic Models
  5.2 Justification and Relationship between SVD and MF
  5.3 Loss Function
  5.4 Matrix Factorization with Bias and Regularization
  5.5 Probabilistic MF
  5.6 Bayesian MF
  5.7 Improving Basic Models by Deep Learning Techniques
6 Restricted Boltzmann Machine (RBM) for Collaborative Filtering
  6.1 Motivation
  6.2 Bernoulli RBM
  6.3 Relaxed Bernoulli RBM
  6.4 RBM Training
  6.5 Categorical RBM for Recommender Systems
1 Introduction to Recommender Systems
Key points: matrix factorization, collaborative filtering.
1.1 Types of Recommenders
- Recommending things. Example: products recommended on online shopping websites.
- Recommending content. Example: news feeds, videos.
- Recommending music. Example: Spotify or Pandora music recommendation.
- Recommending people.
- Recommending search results. Example: Google personalized search results.
A recommender system starts with data about users. The data can come from:
1. Explicitly asking users. Drawbacks: it requires extra work from users, and the data tend to be sparse, as not everyone can be bothered to leave a rating. Everyone also has different standards: some people are more critical than others, and there may be cultural differences and so on.
2. Implicit behavior of users: look at things users do anyway and interpret them as indications of interest or disinterest. For example, clicks on a website or an ad; click data have no sparsity problem but are not always a reliable indication of interest. Things users purchase are a better indication of interest, and purchases are also resistant to fraud. Time spent consuming content is another good indication of a positive rating.
Ultimately the goal is to put the best content, the content users will love, in front of users in the form of a top-N list; the goal is not the ability to predict the items users will hate.
2 Simple Recommendation Systems
Non-personalized recommenders: individual likes and dislikes aren't used for custom suggestions to users. For example, amazon.com should show you products you might buy even if you have never used amazon.com before, and youtube.com should show you interesting videos even if you are a new user to youtube.com.
2.1 Basic Intuition
1. Population behavior can be used to guide recommendations. The business reason to make recommendations that users have a high probability of taking is that they bring more profit. For example, the highest-grossing movies are usually movies that lots of people are willing to pay for, so recommending those movies to users who haven't watched them might be a good recommendation: other people really like this, therefore you'll probably like it as well.
Issues of popularity:
(a) Popularity may not be the sole factor to consider when making recommendations. For example, when recommending restaurants to a traveler in a city, the most popular local restaurant may not be a good fit; the "top-50" popular music may not please everyone either.
(b) Age of popularity. For example, a piece of news that was popular three years ago may not be a good recommendation for users today.
2. Context can also guide recommendations, for example, finding product associations: products that are frequently bought together or logically go together. For example, when buying an iPhone, users may also buy an iPhone case at the same time. Lift can be used to quantify product associations,

Lift = p(A, B) / (p(A) p(B)) = p(A|B) / p(A) = p(B|A) / p(B).  (1)

That is, when buying B increases the probability of buying A, Lift > 1 (e.g., p(A|B) = 0.2 and p(A) = 0.1 gives Lift = 2).
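A tiny sketch of Equation 1 estimated from purchase counts (the counts below are made up for illustration):

def lift(n_ab, n_a, n_b, n_total):
    """Lift = p(A, B) / (p(A) p(B)), estimated from co-purchase counts."""
    p_ab = n_ab / n_total
    p_a, p_b = n_a / n_total, n_b / n_total
    return p_ab / (p_a * p_b)

# 50 of 1000 baskets contain both A and B; A appears in 100, B in 200.
print(lift(n_ab=50, n_a=100, n_b=200, n_total=1000))  # 2.5 > 1: associated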
These intuitions rely on non-machine-learning techniques. The techniques are simple but still powerful, and A/B testing different approaches can help decide which approach is the most suitable one.
2.2 News Feeds Algorithm
2.2.1 Hacker News Ranking Algorithm
The Hacker News algorithm handles the news-age issue in recommendations with a ratio of popularity to age, so that the higher the popularity, the higher the ranking, and the higher the age, the lower the ranking,

score ∝ f(popularity) / g(age).  (2)

The detailed score formula is

score = (ups − downs − 1)^0.8 / (age + 2)^gravity × penalty,  (3)

where
- ups and downs are the numbers of upvotes and downvotes, respectively, of a news article.
- age is in hours; the +2 offset avoids an infinite score when the age is zero.
- gravity controls how fast the score falls with respect to age.
- penalty is a multiplier for implementing customized "business rules", such as penalizing self-posts, "controversial" posts, and so on.
Gravity (1.8 in our case) is usually bigger than the exponent of the numerator in Equation 3 (0.8 in our case) so that the age effect grows faster than the popularity effect (the number of votes an article gets); thus newer news can appear closer to the top with a higher ranking.
Article votes are usually distributed according to a power law (a few articles get very many votes, while the majority get very few). Because the numerator's exponent is less than 1, your first 100 votes weigh more than another 100 votes after 1000 votes, so the scores are less spread out. More details of the algorithm are in Shirriff (2013).
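A minimal sketch of Equation 3, clamping the vote difference at zero so the fractional power stays real (the penalty defaults to 1):

def hacker_news_score(ups, downs, age_hours, gravity=1.8, penalty=1.0):
    """Hacker News ranking score per Equation 3."""
    points = max(ups - downs - 1, 0)
    return points ** 0.8 / (age_hours + 2) ** gravity * penalty

# A fresh story with modest votes can outrank an older, heavily voted one.
print(hacker_news_score(ups=50, downs=5, age_hours=2))
print(hacker_news_score(ups=200, downs=20, age_hours=24))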
2.2.2 Reddit News Ranking Algorithm
The ranking score for Reddit news is

score = sign(ups − downs) × log[max(1, |ups − downs|)] + age / 45000,  (4)

where
- With the logarithm term, the first 100 votes count more than another 100 votes after 1000 votes.
- The score can be either positive or negative.
- age is in seconds from the inception of Reddit, so the newer the article, the higher the age and the higher the score.
- Only the relative scores matter.
- Age grows (linearly) faster than popularity.

For rankings of comments, Reddit uses Wilson intervals.
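A sketch of Equation 4 in Python, assuming base-10 logarithms (the base is not specified in the formula above):

import math

def reddit_score(ups, downs, age_seconds):
    """Reddit news ranking score per Equation 4."""
    diff = ups - downs
    sign = 1 if diff > 0 else (-1 if diff < 0 else 0)
    return sign * math.log10(max(1, abs(diff))) + age_seconds / 45000

print(reddit_score(ups=300, downs=50, age_seconds=1_000_000))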
2.3 Recommendations Based on Average Ratings
Unlike news feeds, which usually have just one or two feedback options (e.g., upvote/downvote), ratings can have five or more star levels.
2.3.1 Problems with Average Ratings
A simple approach to making recommendations is to just sort products by average rating. One problem with this approach is that a product's high average score may come from a single rating, while another product with a lower average may have many more ratings; we need to consider how confident we are in those averages.
1. A typical strategy for ranking by ratings is to be pessimistic and use lower bounds of the average ratings. Suppose a random variable follows a normal distribution, X ~ N(µ, σ²); then the sample mean of X satisfies

X̄ ~ N(µ, σ²/N),  (5)

where N is the number of samples we have. The more samples we have, the spikier/skinnier the distribution of X̄. The corresponding approximate normal (1 − α) confidence interval (CI) is

( X̄ + z_left · s/√N,  X̄ + z_right · s/√N ),  (6)

where z_left = Φ⁻¹(α/2) and z_right = Φ⁻¹(1 − α/2), with Φ(·) the CDF of the standard normal distribution and s = √( (1/N) Σ_{i=1}^N (X_i − X̄)² ).

The above conclusion about X̄ holds, by the central limit theorem, even if X is not normally distributed.

A similar procedure can be used to obtain an approximate CI for a Bernoulli proportion. Wilson's confidence interval can be used for the Bernoulli case to obtain a more accurate lower bound (see the sketch after this list).

Comments on using lower bounds of average ratings for rankings:
- Sometimes it can be too pessimistic.
- The lower bound is good from the perspective that it accounts for the number of people who rated the item: more raters lead to a smaller CI and hence a higher lower bound. Popularity increases the score.
2. Smoothing (dampening) for averages when the number of raters is too small (e.g., one or even zero): add a small number to both the numerator and the denominator when averaging the ratings,

r = ( Σ_{i=1}^N X_i + λµ₀ ) / (N + λ).  (7)

This formula provides a smooth transition from a predefined rating value µ₀ to the true sample mean we collect: the more samples we have, the less effect the predefined terms have.
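A sketch of both ideas, assuming binary up/down votes for the Wilson lower bound; z = 1.96 gives a 95% interval, and the prior mean and λ in the smoothed average are made-up defaults:

import math

def wilson_lower_bound(ups, n, z=1.96):
    """Lower bound of the Wilson score interval for a Bernoulli proportion."""
    if n == 0:
        return 0.0
    p = ups / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin) / (1 + z * z / n)

def smoothed_average(ratings, prior_mean=3.0, lam=5.0):
    """Dampened average of Equation 7: shrinks toward prior_mean when data are scarce."""
    return (sum(ratings) + lam * prior_mean) / (len(ratings) + lam)

print(wilson_lower_bound(ups=90, n=100))  # many votes: high lower bound
print(wilson_lower_bound(ups=1, n=1))     # one perfect vote still ranks low
print(smoothed_average([5.0]))            # a single 5-star barely moves the prior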
2.3.2 Explore-Exploit Dilemma
Example: I am playing slot machines at a casino, and I'd like to know which machine has the best reward; I need to play them (collect data) to determine which is best.
Problem: How many times should I play each machine?
- Too few times: the estimate is not good and the CI is wide.
- Too many times: only one machine is optimal, so most trials are suboptimal (e.g., playing 10 machines equally means 90% of the trials go to suboptimal machines).
I want to explore by collecting more data, but I also want to exploit what I believe to be the best reward. The same dilemma appears in recommenders: suppose I watch a lot of videos on one specific topic, for example, making juice. Such videos become suboptimal once I have mastered juice-making and no longer want to watch them, so a stronger explore component is needed. The following Bayesian approach provides a solution to this explore-exploit dilemma.
2.4 Bayesian Paradigm
2.4.1 Bayesian Paradigm with an Example
Assume a random variable x taking two values {0, 1} follows a Bernoulli distribution with parameter π,

p(x) = π^x (1 − π)^{1−x}.  (8)

In the Bayesian paradigm, π is also a random variable. Suppose π follows a Beta distribution,

p(π) = (1 / B(α, β)) π^{α−1} (1 − π)^{β−1},  (9)

where B(α, β) = Γ(α)Γ(β) / Γ(α + β) and π ∈ [0, 1].
Instead of obtaining a maximum likelihood estimate of π as in the frequentist approach, we compute

p(π|X) ∝ p(X|π) p(π).  (10)

Plugging in p(X|π) and p(π), we obtain

p(π|X) ∝ π^{α + Σ_{i=1}^N X_i − 1} (1 − π)^{β + N − Σ_{i=1}^N X_i − 1},  (11)

which is also a Beta distribution.
Connection to the smoothed average: the posterior mean of π is

E(π|X) = ( α + Σ_{i=1}^N X_i ) / (α + β + N),  (12)

which has almost the same form as Equation 7.
2.4.2 How Does the Bayesian Paradigm Help with Ranking/Recommendation?
The more data we collect, the narrower the credible interval of the average rating, which is consistent with what we saw for the confidence interval when applying the central limit theorem to average ratings. The Bayesian approach, however, uses non-deterministic ranking by sorting samples drawn from the posteriors. For example, Figure 1 shows the posterior distributions of two bandits, where we have collected many samples for one bandit but few for the other. When sampling from posterior distributions like these two, the ranking of the product with the dashed-line posterior can be either higher or lower than that of the solid line, depending on the sampled values. Sampling thus automatically balances exploration and exploitation until the dashed line gets skinny. We sample the posterior in an "intelligently random" rather than totally random way, since the sampling also accounts for the data we have collected.

Figure 1: Bayesian Paradigm Scenario

Returning to the slot machine example, we can use such a Bayesian approach to make recommendations. Instead of playing each slot machine one at a time to collect more data, run one bandit per slot machine according to samples drawn from their posteriors in each round:
1. Model each slot machine as a separate Beta distribution, which models that slot machine's win rate.
2. In each loop, draw a sample from each of the Beta distributions and sort the samples in order (ranking); this is the order in which to rank items in a recommendation system.
3. Pick the top item and play that machine only.
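A minimal Thompson-sampling sketch of this loop, assuming binary win/lose rewards; the true win rates below are hypothetical and hidden from the algorithm:

import numpy as np

rng = np.random.default_rng(0)
true_win_rates = [0.2, 0.5, 0.75]     # unknown to the algorithm
alphas = np.ones(3)                   # Beta(1, 1) priors for each machine
betas = np.ones(3)

for trial in range(500):
    samples = rng.beta(alphas, betas)             # one posterior draw per machine
    best = int(np.argmax(samples))                # rank and pick the top machine
    reward = rng.random() < true_win_rates[best]  # play it, observe win/lose
    alphas[best] += reward                        # conjugate posterior update
    betas[best] += 1 - reward

print(alphas / (alphas + betas))                  # posterior mean win rates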
See the code at Bayesian bandit.ipynb for more details about the implementation. Figure 2 shows the posterior distributions of three bandits (slot machines) with true success probabilities 0.2, 0.5, and 0.75, respectively. After 50 trials, all three lines have fat credible intervals, though the blue line with true success probability 0.2 has started to shift to a lower mode. After 500 trials, the credible interval of the green line (true success probability 0.75) is much sharper, with a mode around its true success probability, whereas the other two lines still have fat credible intervals because, at this point, we are confident that the green line is better than them and we end up sampling those two cases less often.

Figure 2: Posterior distributions of three slot machines.
The above Bayesian paradigm can be easily extended to the Gaussian case, in which x can be any real number instead of just 0 and 1.
2.5 Supervised Learning for Recommender Systems
For supervised learning in the context of recommender systems, we may have input features X such as demographic data:
- Age
- Gender
- Religion
- Location
- Race
- Occupation
- Education
- Marital status
- ...
or any other data collected by your site, such as
- When the user signed up
- Which pages they viewed
- Purchase history
- ...
or data purchased from other companies.
Besides information about users, X should also include information about the products to be recommended.
The corresponding targets Y could be
- Did the user buy the product?
- Did the user click on the ad?
- Did the user make an account?
- Did the user rate this item?
- Did the user ...
Two types of models:
1. Build a separate model for each product. This can be problematic, as a lot of data will be needed for each product.
2. Build a single model that includes both user information and product attributes as model inputs.
Making more accurate predictions with supervised models helps improve the recommendation suggestions.
Instead of having explicit features such as age and gender, we can learn features implicitly with latent variable models, which can learn the best possible features automatically. The features from these models may be hard to interpret, but they are mathematically optimal, and we don't need to worry about collecting them manually. See Section 5 for more details about matrix factorization.
2.6 Markov Models and Google Page Rank
The PageRank of a page is the probability that a user would end up on that page if the user surfed
the Internet randomly for an infinite amount of time.
The higher the probability, the higher the page is ranked (recommended) in search results.
2.6.1 Markov Models
Markov models assume that the state x_t depends only on the state before it, x_{t−1},

p(x_t | x_{t−1}, x_{t−2}, ..., x_1) = p(x_t | x_{t−1}).  (13)

The transition probability matrix A has (i, j)-th element

A[i, j] = p(x_t = j | x_{t−1} = i),  (14)

which represents the probability of transitioning to state j from state i and satisfies Σ_{j=1}^M A[i, j] = Σ_{j=1}^M p(x_t = j | x_{t−1} = i) = 1. A is also called a "stochastic matrix" or "Markov matrix".
2.6.2 State Distribution
Denote by π_t the state distribution at time t. For example, suppose we have two states of weather, sunny and rainy; then π_t is a vector of size 2,

π_t = [p(x_t = sunny), p(x_t = rainy)].  (15)

To calculate the future state distribution at time t + 1, we use the law of total probability:
p(x_{t+1} = j) = Σ_{i=1}^M p(x_{t+1} = j, x_t = i)
             = Σ_{i=1}^M p(x_{t+1} = j | x_t = i) p(x_t = i)
             = Σ_{i=1}^M A[i, j] π_t(i)
             = π_{t+1}(j).  (16)

Rewriting the above element-wise relation in compact matrix form,

π_{t+1} = π_t A.  (17)

Extending the above one-step-ahead state distribution to k steps ahead,

π_{t+k} = π_t A^k.  (18)
2.6.3 Connection to Recommender Systems
Every page on the Internet is a state in Markov model. If page i links to page j, the probability
of jumping to page j when user is at page i is
p(x
t
= j|x
t1
= i) =
1
n
i
, (19)
where n
i
is the number of links on page i.
Most of transition probabilities are 0 with billions of pages on the Internet, we add smoothing to
transition matrix to obtain a weighted probability distribution, Google matrix G,
G = 0.85A + 0.15U, (20)
where U[i, j] = 1/M is from uniform distribution.
Extend Equation 18 to time , we have
π
= π
G, (21)
linear algebra shows that the state distribution is the eigenvector of G with corresponding eigen-
value as 1, which is the score of each page.
Justification of pagerank algorithm with P-F theorem: If G is Markov chain and all its ele-
ments are positive then the stationary distribution (state distribution that doesn’t change after
transitioning by G) and its limiting distribution (state distribution arrived after transitioning G
infinite times) are the same.
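A small power-iteration sketch on a toy three-page link graph (the link structure is made up for illustration):

import numpy as np

# Row i lists page i's out-links, normalized per Equation 19.
A = np.array([[0.0, 0.5, 0.5],    # page 0 links to pages 1 and 2
              [1.0, 0.0, 0.0],    # page 1 links to page 0
              [0.0, 1.0, 0.0]])   # page 2 links to page 1
M = A.shape[0]
G = 0.85 * A + 0.15 * np.ones((M, M)) / M   # Google matrix, Equation 20

pi = np.ones(M) / M          # start from the uniform distribution
for _ in range(100):         # pi_{t+1} = pi_t G, Equation 17
    pi = pi @ G
print(pi)                    # PageRank scores; they sum to 1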
See ways to evaluate rankings at https://en.wikipedia.org/wiki/Learning_to_rank.
To summarize, this section introduced non-personalized recommendations. Recommendation systems can be framed as machine learning problems, and we saw how recommendation systems can make use of supervised learning, confidence intervals, Bayesian bandits, and Markov models.
3 Evaluating a Recommender System
Use train/test sets with cross validation (k-fold or leave-one-out) to train the model. Use accuracy metrics such as RMSE and MAE, or metrics such as hit rate and so on, to evaluate the performance of the models. Other issues include:
- Coverage.
- Diversity: are we recommending the same types of things all the time?
- Novelty: how popular or obscure are the recommendations?
- Churn: how often do recommendations change?
- Responsiveness: how quickly does new user behavior affect recommendations?
The real impact of a model, however, should be evaluated on real users in an online setting; the results of online A/B tests are the only evaluation that matters for recommendation systems.
Another thing that can be done to evaluate recommender systems is to ask users to rate the recommendations (measure the perceived quality), but it's less likely to yield interpretable and useful data, as users may be confused by the ratings themselves, and it requires extra work from users.
4 Collaborative Filtering (CF)
4.1 Notations
With collaborative filtering, we can obtain a personalized score s(i, j) for each user i and item j, with i = 1, 2, ..., N and j = 1, 2, ..., M. A first attempt is the plain average rating,

s(i, j) = ( Σ_{i′∈Ω_j} r_{i′j} ) / |Ω_j|,  (22)

where Ω_j is the set of users who rated item j and r_{i′j} is the rating of item j from user i′.
Denote the user-item ratings matrix of size N × M as R, where r_{ij} is the rating of user i for item j.
Sparsity of R:
- The user-item ratings matrix is sparse (ratings can be missing, undefined, or nonexistent).
- Our goal is to make recommendations, so sparsity is good from a business point of view: it means we have items to recommend to users (e.g., movies that users haven't watched).
Our goal in CF is to predict the scores s(i, j) that users would give to items they haven't watched/purchased/... and to make recommendations accordingly.
4.2 User-User CF
User-user CF is a type of CF based on the similarity between users, calculated using people's ratings of items (find similar users, then recommend movies based on the behavior of those similar users).
Instead of assigning each user an equal weight, we adopt weighted ratings to take the degree of similarity between users into account,

s(i, j) = ( Σ_{i′∈Ω_j} w_{ii′} r_{i′j} ) / ( Σ_{i′∈Ω_j} |w_{ii′}| ).  (23)

Also, the interpretation of ratings varies from user to user, as users can be optimistic or pessimistic; for example, one user may rate an item 4 out of 5 even though the user doesn't like it, while another user may like the same item and also rate it 4. We use each user's deviation to de-bias the ratings,

dev(i, j) = r(i, j) − r̄_i.  (24)

Combining the weights and deviations above, we have

s(i, j) = r̄_i + ( Σ_{i′∈Ω_j} w_{ii′} (r_{i′j} − r̄_{i′}) ) / ( Σ_{i′∈Ω_j} |w_{ii′}| ),  (25)

where w_{ii′} can be negative and can be calculated in various ways:
- Pearson correlation coefficient:

w_{ii′} = ( Σ_{j∈Ψ_{ii′}} (r_{ij} − r̄_i)(r_{i′j} − r̄_{i′}) ) / ( √(Σ_{j∈Ψ_i} (r_{ij} − r̄_i)²) √(Σ_{j∈Ψ_{i′}} (r_{i′j} − r̄_{i′})²) ),  (26)

where Ψ_i is the set of items that user i has rated and Ψ_{ii′} = Ψ_i ∩ Ψ_{i′} is the set of items that both users i and i′ have rated.
- Cosine similarity:

cos θ = ( Σ_{i=1}^N x_i y_i ) / ( √(Σ_{i=1}^N x_i²) √(Σ_{i=1}^N y_i²) ),  (27)

which is very similar to the Pearson correlation coefficient, except that the latter is centered.
- Euclidean distance.
Improvements to the CF:
- Ψ_{ii′} may be empty when there are no common movies between users i and i′, so no correlation can be computed; those users simply won't be used for each other's predictions.
- We can consider only the most similar/correlated users when making predictions, which speeds up the computation. This, however, may also remove some useful information: for example, two users may have a very strong negative Pearson correlation, and dropping one from the other's prediction may not be desirable.
In practice, recommendations can be made faster with precomputed user-user weights.
See an implementation of user-user CF with the MovieLens 20 million dataset (link) at user-user-collaborative-filtering.ipynb. The steps are:
- Split the data into train and test sets.
- Compute the weights using the train set.
- Make predictions.
- Use an evaluation metric to test the predictions on both sets.
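A compact sketch of Equations 25 and 26 on a dense toy matrix, with NaN marking missing ratings (a real implementation would use sparse structures and precomputed weights):

import numpy as np

R = np.array([[5, 4, np.nan, 1],
              [4, 5, 3, np.nan],
              [1, np.nan, 4, 5]], dtype=float)   # toy user-item ratings

def pearson(R, i, k):
    """Equation 26; returns None when users i and k share no rated items."""
    common = ~np.isnan(R[i]) & ~np.isnan(R[k])
    if not common.any():
        return None
    di = R[i] - np.nanmean(R[i])                 # deviations from each user's mean
    dk = R[k] - np.nanmean(R[k])
    den = np.sqrt(np.nansum(di ** 2)) * np.sqrt(np.nansum(dk ** 2))
    return np.nansum((di * dk)[common]) / den if den > 0 else None

def predict(R, i, j):
    """Equation 25: user i's mean plus weighted deviations of users who rated j."""
    num = den = 0.0
    for k in range(R.shape[0]):
        if k == i or np.isnan(R[k, j]):
            continue
        w = pearson(R, i, k)
        if w is None:
            continue
        num += w * (R[k, j] - np.nanmean(R[k]))
        den += abs(w)
    return np.nanmean(R[i]) + (num / den if den > 0 else 0.0)

print(predict(R, 0, 2))   # predicted rating of user 0 for item 2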
4.3 Item-Item CF
Item-item CF is a type of CF based on the similarity between items, calculated using people's ratings of those items (movies A and C are more similar because they have similar ratings; if user A likes movie A, then it's more likely that he/she will also like movie C).
Similarly, the score is

s(i, j) = r̄_j + ( Σ_{j′∈Ψ_i} w_{jj′} (r_{ij′} − r̄_{j′}) ) / ( Σ_{j′∈Ψ_i} |w_{jj′}| ),  (28)

where Ψ_i is the set of items rated by user i. The weights for different items can be computed in a similar way, but summing over users i:
- Pearson correlation coefficient:

w_{jj′} = ( Σ_{i∈Ω_{jj′}} (r_{ij} − r̄_j)(r_{ij′} − r̄_{j′}) ) / ( √(Σ_{i∈Ω_j} (r_{ij} − r̄_j)²) √(Σ_{i∈Ω_{j′}} (r_{ij′} − r̄_{j′})²) ),  (29)

where Ω_j is the set of users who rated item j and Ω_{jj′} = Ω_j ∩ Ω_{j′} is the set of users who rated both items j and j′.
To summarize:
1. User-user CF chooses items for a user because those items have been liked by similar users.
2. Item-item CF chooses items for a user because this user has liked similar items in the past.
3. In practice, there is a lot more data when comparing items than when comparing users.
4. Item-item CF is faster, as it computes fewer weights when the number of users is far larger than the number of items, N >> M.
Situations when item-item CF is preferred:
- There are more users than products (e.g., Netflix has millions of users but maybe not millions of movies, especially popular movies).
- Items don't change frequently, so the computation is less expensive.
- It's more convincing to use the user's own historical item data.
Situations when user-user CF is preferred:
- Products change very frequently (e.g., news recommendation).
In practice, we can use a combination of different methods and do an ensemble.
User M1 M2 M3 M4 M5
A 1 1 1 0 0
B 1 1 0 1 0
C 0 1 1 0 1
D 0 1 0 1 0
Table 1: Indicator matrix of users and their corresponding rated movies, 1 indicates that a user
rated that movie while 0 indicates that a user didn’t rate that movie.
M1 M2 M3 M4 M5
M1 2 2 1 1 0
M2 2 4 2 2 1
M3 1 2 2 0 1
M4 1 2 0 2 0
M5 0 1 1 0 1
Table 2: Unnormalized co-occurrence matrix.
4.4 Recommender System with MapReduce
4.4.1 Item CF
What data to use?
We chose to use the rating history of users to build the relationships between movies; the data are from the Netflix Prize dataset. The underlying intuition is:
- If one user rated two movies, those two movies are more related than the other, unrated movies, given that users usually rate a movie on their own initiative (they are interested in both movies).
- When there are a lot of users, the rating history reflects the actual relations among movies rather than individual "noise".
How do we describe the relationship between different items?
Co-occurrence matrix: here, entry (i, j) counts the users who rated both movie i and movie j, as in Table 2. (The term comes from image processing, where a co-occurrence matrix is the distribution of co-occurring pixel values at a given offset.)
To take the total ratings of each movie into account, normalize the co-occurrence matrix so that each row sums to 1.
M1 M2 M3 M4 M5
M1 2/6 2/6 1/6 1/6 0/6
M2 2/11 4/11 2/11 2/11 1/11
M3 1/6 2/6 2/6 0/6 1/6
M4 1/5 2/5 0/5 2/5 0/5
M5 0/3 1/3 1/3 0/3 1/3
Table 3: Normalized co-occurrence matrix.
Movie User A's Rating
M1 3
M2 7
M3 8
M4 0
M5 0
Table 4: User ratings matrix.
Use the rating matrix to indicate the differences among movies for each user. For missing ratings (the user didn't rate the movie), use a neutral rating that the user usually gives, based on historical values. If there are no ratings from the user at all, cold-start with a neutral score (5 out of 10) or find the user's neutral score from other sources, such as other movie rating websites.
Multiply the co-occurrence matrix by the rating matrix to obtain the expected rating of each movie for a user, and make recommendations based on the expected scores.
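A sketch of this final step using the normalized co-occurrence matrix of Table 3 and user A's ratings of Table 4 (zeros stand in for the missing ratings):

import numpy as np

# Rows/columns are M1..M5; values are from Table 3.
C = np.array([[2/6,  2/6,  1/6,  1/6,  0/6],
              [2/11, 4/11, 2/11, 2/11, 1/11],
              [1/6,  2/6,  2/6,  0/6,  1/6],
              [1/5,  2/5,  0/5,  2/5,  0/5],
              [0/3,  1/3,  1/3,  0/3,  1/3]])

ratings = np.array([3.0, 7.0, 8.0, 0.0, 0.0])  # user A's ratings, Table 4

expected = C @ ratings          # expected rating of each movie for user A
unseen = ratings == 0           # only recommend movies the user hasn't rated
order = np.argsort(-(expected * unseen))
print(expected)                 # M4 gets 3.4, M5 gets 5.0
print("recommend first: M%d" % (order[0] + 1))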
Raw data: userID, movieID, rating.
Preprocessor: divide the data by userID and merge records with the same userID together.
4.4.2 Recommender System with MapReduce
1. Enter Docker and start Hadoop.
2. Download the code from https://github.com/QianqianShan/Recommender_System
3. Extract the file with tar -xvf RecommenderSystem.tar
4. cd RecommenderSystem
5. Make a directory to save the input data, hdfs dfs -mkdir /input
6. Copy the input file to the input folder, hdfs dfs -put input/* /input
7. Remove the output folders of each MapReduce job if they already exist (from a previous run of the code):
hdfs dfs -rm -r /dataDividedByUser
hdfs dfs -rm -r /coOccurrenceMatrix
hdfs dfs -rm -r /Normalize
hdfs dfs -rm -r /Multiplication
hdfs dfs -rm -r /Sum
8. Enter the source code folder, cd src/
9. Compile, hadoop com.sun.tools.javac.Main *.java
10. Make jar, jar cf recommender.jar *.class
11. Run the jobs, hadoop jar recommender.jar Driver /input /dataDividedByUser /coOccurrenceMatrix /Normalize /Multiplication /Sum
Note:
args0: original dataset
args1: output directory for DividerByUser job
args2: output directory for coOccurrenceMatrixBuilder job
args3: output directory for Normalize job
args4: output directory for Multiplication job
args5: output directory for Sum job
12. Check results from the last MapReduce job, hdfs dfs -cat /Sum/*
13. List all folders to check outputs of each MapReduce job, hdfs dfs -ls /
5 Matrix Factorization
5.1 Basic Models
The idea of matrix factorization is to split the approximate rating matrix R̂ (N × M) into two smaller matrices,

R̂ = W U^T,  (30)

where W (N × K) and U (M × K) are much smaller in size than R̂; that is, K should be relatively small (e.g., 10-50), so the factorization takes much less space.
The full product of W and U^T can still be large, but one can compute the rating for user i and item j separately as r̂_{ij} = R̂[i, j] = w_i^T u_j. Furthermore,

w_i^T u_j = ||w_i|| · ||u_j|| cos θ,  (31)

which can be interpreted as how well user i's preferences correlate with movie j's attributes. These features (preferences or attributes), however, are latent and hard to interpret, with K being the latent dimensionality.
5.2 Justification and Relationship between SVD and MF
In mathematics, a matrix X can be decomposed into three matrices with the singular value decomposition (SVD),

X = U S V^T,  (32)

so a matrix X can also be decomposed into two matrices simply by absorbing S into one of the other two matrices on the right-hand side of Equation 32. In other words, one can always turn a solution of SVD into a solution of MF.
In the U and V of an SVD decomposition, however, the vectors are orthonormal, which is not necessarily true in MF; also, SVD doesn't work for matrices with missing values.
5.3 Loss Function
In order to ensure the approximation of R by R̂ is good, that is,

R ≈ R̂ = W U^T,  (33)

we minimize a squared error loss

J = Σ_{(i,j)∈Ω} (r_{ij} − r̂_{ij})²,  (34)

where Ω is the set of pairs (i, j) for which user i rated item j.
5.4 Matrix Factorization with Bias and Regularization
As with using deviations in collaborative filtering, we introduce bias terms for matrix factorization,

r̂_{ij} = w_i^T u_j + b_i + c_j + µ,  (35)

where b_i is the user bias, c_j is the item bias, and µ is the global average. We use both a user bias and an item bias, as both dimensions can be biased.
To prevent overfitting, add regularization to the loss function,

J = Σ_{(i,j)∈Ω} (r_{ij} − r̂_{ij})² + λ( ||W||²_F + ||U||²_F + ||b||²_F + ||c||²_F ),  (36)

where ||·||_F denotes the Frobenius norm. Taking partial derivatives with respect to the parameters and setting them to zero gives

w_i = ( Σ_{j∈Ψ_i} u_j u_j^T + λI )⁻¹ Σ_{j∈Ψ_i} (r_{ij} − b_i − c_j − µ) u_j

u_j = ( Σ_{i∈Ω_j} w_i w_i^T + λI )⁻¹ Σ_{i∈Ω_j} (r_{ij} − b_i − c_j − µ) w_i

b_i = ( 1 / (|Ψ_i| + λ) ) Σ_{j∈Ψ_i} (r_{ij} − w_i^T u_j − c_j − µ)

c_j = ( 1 / (|Ω_j| + λ) ) Σ_{i∈Ω_j} (r_{ij} − w_i^T u_j − b_i − µ).  (37)

Update the parameters with alternating least squares to obtain the optimized values; the gradient descent algorithm is an alternative.
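A compact sketch of the updates in Equation 37 on a dense toy matrix, with a boolean mask marking the observed ratings (a real implementation would iterate over sparse entries):

import numpy as np

rng = np.random.default_rng(0)
R = np.array([[5, 4, 0, 1],
              [4, 0, 3, 1],
              [1, 1, 4, 5]], dtype=float)
mask = R > 0                                   # observed entries; 0 means missing
N, M, K, lam = R.shape[0], R.shape[1], 2, 0.1
W = rng.normal(size=(N, K))
U = rng.normal(size=(M, K))
b, c, mu = np.zeros(N), np.zeros(M), R[mask].mean()

for _ in range(20):                            # alternating least squares
    for i in range(N):
        js = np.where(mask[i])[0]
        A = U[js].T @ U[js] + lam * np.eye(K)
        W[i] = np.linalg.solve(A, (R[i, js] - b[i] - c[js] - mu) @ U[js])
    for j in range(M):
        iis = np.where(mask[:, j])[0]
        A = W[iis].T @ W[iis] + lam * np.eye(K)
        U[j] = np.linalg.solve(A, (R[iis, j] - b[iis] - c[j] - mu) @ W[iis])
    for i in range(N):
        js = np.where(mask[i])[0]
        b[i] = (R[i, js] - U[js] @ W[i] - c[js] - mu).sum() / (len(js) + lam)
    for j in range(M):
        iis = np.where(mask[:, j])[0]
        c[j] = (R[iis, j] - W[iis] @ U[j] - b[iis] - mu).sum() / (len(iis) + lam)

pred = W @ U.T + b[:, None] + c[None, :] + mu
print(np.sqrt(np.mean((R - pred)[mask] ** 2)))  # RMSE on the observed entries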
5.5 Probabilistic MF
Instead of setting R̂ = W U^T deterministically, let R follow a probability distribution,

R ~ N(W U^T, Σ = σ² I),  (39)

where R is the random rating matrix. We can then apply methods such as maximum likelihood estimation (MLE) to find the parameters. The log-likelihood can be written as

log L = C − Σ_{(i,j)∈Ω} (1 / (2σ²)) (r_{ij} − w_i^T u_j)²,  (40)

which gives the same parameter results as above.
5.6 Bayesian MF
Maximum a posteriori (MAP) estimation of W and U uses

p(W, U|R) = p(R|W, U) p(W) p(U) / p(R),  (41)

where R is the collected data on ratings.
MCMC can be used to draw samples from the posterior distributions of W and U and compute the posterior means,

E(r_{ij}|R) = E(w_i^T u_j | R) ≈ (1/T) Σ_{t=1}^T w_i^{(t)T} u_j^{(t)}.  (42)
5.7 Improving Basic Models by Deep Learning Techniques
Matrix factorization only models linear patterns; using a deep neural network (NN) can find non-linear patterns, and a NN can work with multiple branches doing different things to find different features of the inputs:
1. Concatenate the user and item vectors (rows of W and U) together as the input to a NN.
2. Apply a NN to the residuals from the matrix factorization model and make the final rating prediction as the sum of the matrix factorization prediction and the NN prediction on the residuals,

r̂ = MF(x) + NN(residuals).  (43)

3. An autoencoder (Sedhain et al., 2015) reproduces its own input, and it can reconstruct its input even if the input is noisy (e.g., missing ratings in the input). For example, we simply treat the rating matrix as a data matrix (with missing values) in which each user's row is a sample and each column is a feature, and use this data matrix as both the input and the output to train a NN model.
Adding your own noise to the inputs helps the model predict missing ratings instead of just reconstructing the input (predicting the ratings that are not missing).
The cost function is based on the known ratings,

J = (1/|Ω|) Σ_{i=1}^N Σ_{j=1}^M m_{ij} (r_{ij} − r̂_{ij})²,  (44)

where m_{ij} = 1 if (i, j) ∈ Ω and 0 otherwise.
Since the autoencoder uses the same ratings as both input and output to train the model, one cannot feed the test set in as input to check the MSE, as the model would produce the same output as the input; instead, one should use the predicted ratings reconstructed from the training inputs to test the performance on the test set. See the code at Autoencoder.ipynb for more details.
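A sketch of the masked cost in Equation 44, assuming R_hat is an autoencoder's reconstruction of the ratings:

import numpy as np

def masked_mse(R, R_hat, mask):
    """Equation 44: average squared error over the known ratings only."""
    return (mask * (R - R_hat) ** 2).sum() / mask.sum()

R = np.array([[5.0, 0.0], [3.0, 4.0]])
mask = (R > 0).astype(float)         # m_ij = 1 where the rating is known
R_hat = np.array([[4.5, 2.0], [3.2, 4.1]])
print(masked_mse(R, R_hat, mask))    # the missing (0, 1) entry contributes nothing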
6 Restricted Boltzmann Machine (RBM) for Collaborative Filtering
An RBM has an input (visible) layer v and a hidden layer h, and it is a two-way NN (one can go from v to h and from h to v); see Figure 3 from Wikipedia contributors (2019) for an example of an RBM.

Figure 3: RBM layers, from Wikipedia.
6.1 Motivation
We'd like to create a NN without imposing any structure on it, that is, "everything is connected to everything" (a Boltzmann machine, or BM). The idea originates from statistical mechanics, where systems tend to go towards equilibrium, from high energy to low energy.
Instead of applying a BM, which is difficult to train, we apply an RBM, which can scale the training up to non-trivial problems. That is, we restrict the BM so that visible units must connect to all the hidden units and vice versa; however, visible units cannot connect to other visible units, and hidden units cannot connect to other hidden units.
6.2 Bernoulli RBM
Both the visible and hidden units in a Bernoulli RBM can only be 0 or 1, but this constraint can usually be relaxed. The probabilities of h are

p(h = 1|v) = σ(W^T v + c),  (45)

or equivalently, in scalar form,

p(h_j = 1|v) = σ( Σ_{i=1}^D W_{ij} v_i + c_j ),  (46)

where i = 1, ..., D and j = 1, ..., M. With W a shared weight, symmetrically, the probabilities of v are

p(v = 1|h) = σ(W h + b).  (47)
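A sketch of Equations 45 and 47 with randomly initialized parameters (the layer sizes D and M are made up):

import numpy as np

rng = np.random.default_rng(0)
D, M = 6, 3                              # visible and hidden layer sizes
W = rng.normal(scale=0.1, size=(D, M))
b, c = np.zeros(D), np.zeros(M)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

v = rng.integers(0, 2, size=D).astype(float)  # a binary visible vector
p_h = sigmoid(W.T @ v + c)                    # Equation 45
h = (rng.random(M) < p_h).astype(float)       # sample the hidden units
p_v = sigmoid(W @ h + b)                      # Equation 47, back to visible
print(p_h, p_v)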
6.3 Relaxed Bernoulli RBM
One can relax the constraints on the values of a Bernoulli RBM:
1. First scale all values into the range [0, 1] and round each value x_i, so that x_i = 1 if x_i ≥ 0.5 and x_i = 0 otherwise.
2. Just pass the values in as they are. For example, when calculating p(h = 1|v), there is no constraint on the values of v.
3. When going from h to v, simply use h̃ = p(h = 1|v) as the input:

h̃ = p(h = 1|v) = σ(W^T v + c)
p(v′ = 1|h) = σ(W h̃ + b).  (48)

If we go v → h → v′, the cross entropy (the distance between v and v′) goes down as we train, even though we don't optimize it directly.
6.4 RBM Training
Maximum likelihood estimation is used to find the values of W, b, c by maximizing the probability of the observations v we have,

W, b, c = argmax_{W,b,c} log p(v; W, b, c),  (49)

where

p(v) = Σ_h (1/Z) exp( v^T W h + b^T v + c^T h ) = (1/Z) Σ_h exp(−E(v, h))  (50)

with energy E(v, h) = −v^T W h − b^T v − c^T h and partition function

Z = Σ_{v′} Σ_h exp(−E(v′, h)).  (51)

Due to the exponentially increasing computation time needed to compute these probabilities, we need to simplify the computation so it becomes tractable.
Define the free energy as

F(v) = −log Σ_h exp(−E(v, h)),  (52)

so that p(v) = exp(−F(v)) / Z.
Then for any parameter θ (θ = W, b, or c),

∂(−log p(v))/∂θ = ∂/∂θ [ F(v) + log Z ]
  = ∂F(v)/∂θ − Σ_{v′} (1/Z) exp(−F(v′)) ∂F(v′)/∂θ
  = ∂F(v)/∂θ − Σ_{v′} p(v′) ∂F(v′)/∂θ
  = ∂F(v)/∂θ − E[ ∂F(v′)/∂θ ]
  ≈ ∂F(v)/∂θ − (1/N) Σ ∂F(v′)/∂θ   (sample v′ from Gibbs sampling)
  ≈ ∂F(v)/∂θ − ∂F(v′)/∂θ   (N = 1),  (53)

that is, with N = 1: given v, sample h from p(h|v), then sample v′ from p(v′|h). The above is called contrastive divergence, and F(v) takes O(M) time to compute.
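A CD-1 sketch building on the earlier Bernoulli RBM snippet; the learning rate and the outer-product form of the gradients follow the standard contrastive-divergence update and are stated here as assumptions rather than taken from the text:

import numpy as np

rng = np.random.default_rng(1)
D, M, lr = 6, 3, 0.1
W = rng.normal(scale=0.1, size=(D, M))
b, c = np.zeros(D), np.zeros(M)
sigmoid = lambda x: 1 / (1 + np.exp(-x))

def cd1_step(v):
    """One contrastive divergence step: v -> h -> v' -> h', then update."""
    global W, b, c
    ph = sigmoid(W.T @ v + c)
    h = (rng.random(M) < ph).astype(float)     # sample h ~ p(h|v)
    pv = sigmoid(W @ h + b)
    v1 = (rng.random(D) < pv).astype(float)    # sample v' ~ p(v'|h)
    ph1 = sigmoid(W.T @ v1 + c)
    W += lr * (np.outer(v, ph) - np.outer(v1, ph1))  # positive minus negative phase
    b += lr * (v - v1)
    c += lr * (ph - ph1)

data = (rng.random((50, D)) < 0.5).astype(float)     # toy binary visible vectors
for epoch in range(10):
    for v in data:
        cd1_step(v)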
6.5 Categorical RBM for Recommender Systems
Ratings in recommender systems are usually not in {0, 1}, so we extend the Bernoulli RBM so that the visible units represent a K-class categorical distribution, while the hidden units still follow a Bernoulli distribution. The NN equations are updated accordingly:

p(h_j = 1|v) = σ( Σ_{k=1}^K Σ_{i=1}^D W^k_{ij} v^k_i + c_j )

p(v^k_i = 1|h) = softmax( Σ_{j=1}^M W^k_{ij} h_j + b^k_i ),  (54)

where
- h is a vector of size M,
- V is a 2-D array of size D × K,
- W is a 3-D array of size D × K × M,
- b is a 2-D array of size D × K,
- c is a vector of size M.
Predictions can be made with
- a weighted average of the predicted ratings with their corresponding probabilities, or
- the argmax of the probabilities over the predicted ratings.
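A small sketch of the two prediction rules, given hypothetical probabilities p(v_i^k = 1|h) for one movie on a 5-star scale:

import numpy as np

ratings = np.arange(1, 6)                          # K = 5 rating classes
probs = np.array([0.05, 0.10, 0.20, 0.40, 0.25])   # softmax output for one movie

expected = (ratings * probs).sum()        # weighted-average prediction: 3.7
most_likely = ratings[np.argmax(probs)]   # argmax prediction: 4
print(expected, most_likely)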
References

Sedhain, S., A. K. Menon, S. Sanner, and L. Xie (2015). AutoRec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web (WWW Companion), pp. 111-112.

Shirriff, K. (2013). How Hacker News ranking really works: scoring, controversy, and penalties. http://www.righto.com/2013/11/how-hacker-news-ranking-really-works.html. [Online; accessed 10-May-2019].

Wikipedia contributors (2019). Restricted Boltzmann machine. Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine. [Online; accessed 9-June-2019].