Recommender System
Qianqian Shan
Contents
1 Introduction to Recommender Systems
  1.1 Types of Recommenders
2 Simple Recommendation Systems
  2.1 Basic Intuition
  2.2 News Feeds Algorithm
    2.2.1 Hacker News Ranking Algorithm
    2.2.2 Reddit News Ranking Algorithm
  2.3 Recommendations Based on Average Ratings
    2.3.1 Problems with Average Ratings
    2.3.2 Explore-Exploit Dilemma
  2.4 Bayesian Paradigm
    2.4.1 Bayesian Paradigm with An Example
    2.4.2 How Bayesian Paradigm Helps with Ranking/Recommendation?
  2.5 Supervised Learning for Recommender Systems
  2.6 Markov Models and Google Page Rank
    2.6.1 Markov Models
    2.6.2 State Distribution
    2.6.3 Connection to Recommender Systems
3 Evaluate A Recommender System
4 Collaborative Filtering (CF)
  4.1 Notations
  4.2 User-User CF
  4.3 Item-Item CF
  4.4 Recommender System with MapReduce
    4.4.1 Item CF
    4.4.2 Recommender System with MapReduce
5 Matrix Factorization
  5.1 Basic Models
  5.2 Justification and Relationship between SVD and MF
  5.3 Loss Function
  5.4 Matrix Factorization with Bias and Regularization
  5.5 Probabilistic MF
  5.6 Bayesian MF
  5.7 Improving Basic Models By Deep Learning Techniques
6 Restricted Boltzmann Machine (RBM) for Collaborative Filtering
  6.1 Motivation
  6.2 Bernoulli RBM
  6.3 Relaxed Bernoulli RBM
  6.4 RBM Training
  6.5 Categorical RBM for Recommender Systems
1 Introduction to Recommender Systems
Key points: matrix factorization, collaborative filtering.
1.1 Types of Recommenders
• Recommending things. Example: Products recommended on online shopping websites.
• Recommending content. Example: News feeds, videos.
• Recommending Music. Example: Spotify or Pandora music recommendation.
• Recommending people.
• Recommending search results. Example: Google personalized search results.
A recommender system starts with data about users. The data can come from:
1. Explicitly asking users. Drawbacks: it requires extra work from users, and the data tend to be sparse, since not everyone can be bothered to leave a rating. Everyone also has different standards: some people are more critical than others, and there can be cultural differences and so on.
2. Implicit user behavior: look at things users do anyhow and interpret them as indications of interest or disinterest. For example, clicks on a website or ad; click data are not sparse, but clicks are not always a reliable indication of interest. Things users purchase are a better indication of interest, and purchases are more resistant to fraud. Time spent consuming content is another good indication of a positive rating.
Ultimately the goal is to put the content that users will love in front of them in the form of a top-N list, not to predict the items users will hate.
2 Simple Recommendation Systems
Non-personalized recommenders: individual likes and dislikes aren't used for custom suggestions. For example, amazon.com should show you products you might buy even if you have never used amazon.com before, and youtube.com should show you interesting videos even if you are a new user to youtube.com.
2.1 Basic Intuition
1. Population behavior can be used to guide recommendations. The business reason to recommend items that users have a high probability of acting on is that it brings more profit. For example, the highest grossing movies are usually movies that lots of people are willing to pay for, so recommending those movies to people who haven't watched them yet can be a good recommendation: other people really like this, therefore you'll probably like it as well.
Issues of popularity:
(a) Popularity may not be the sole factor to consider when making recommendations. For example, when recommending restaurants to a traveler in a city, the most popular local restaurant may not be a good fit, and the "top-50" popular music may not please everyone either.
(b) Age of popularity. For example, a piece of news that was popular three years ago may not be a good recommendation for users today.
2. Context can also guide recommendations, for example, finding product associations. Associated products are frequently bought together or logically go together: when buying an iPhone, users may also buy an iPhone case at the same time. Lift can be used to quantify product associations,
Lift = p(A, B) / (p(A) p(B)) = p(A|B) / p(A) = p(B|A) / p(B).   (1)
That is, when buying B increases the probability of buying A, then Lift > 1 (e.g., p(A|B) = 0.2 and p(A) = 0.1 gives Lift = 2).
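As a quick illustration, lift can be estimated from transaction data by counting co-purchase frequencies. A minimal sketch (the transaction list and item names here are made up):

    # Estimate lift from a list of transactions (sets of items).
    transactions = [{"iphone", "case"}, {"iphone"}, {"case", "charger"},
                    {"iphone", "case", "charger"}, {"charger"}]

    def lift(a, b, transactions):
        n = len(transactions)
        p_a = sum(a in t for t in transactions) / n               # p(A)
        p_b = sum(b in t for t in transactions) / n               # p(B)
        p_ab = sum(a in t and b in t for t in transactions) / n   # p(A, B)
        return p_ab / (p_a * p_b)                                 # Equation 1

    print(lift("iphone", "case", transactions))  # > 1 suggests an association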
These intuitions rely on non-machine-learning techniques, which are simple but still powerful. A/B testing different approaches can help decide which one is the most suitable.
2.2 News Feeds Algorithm
2.2.1 Hacker News Ranking Algorithm
The Hacker News algorithm handles the news-age issue in recommendations with a ratio of popularity to age: the higher the popularity, the higher the ranking, and the higher the age, the lower the ranking,

score ∼ f(popularity) / g(age).   (2)
The detailed score formula is

score = (ups − downs − 1)^0.8 / (age + 2)^gravity × penalty,   (3)
where
• ups and downs are the numbers of upvotes and downvotes, respectively, of a news article.
• age is in hours; the offset of 2 avoids an infinite score when the age is zero.
• gravity controls how fast the score falls with respect to age.
• penalty is a multiplier for implementation of customized “business rules” such as penalizing
self-posts, “controversial” posts and so on.
4
2. SIMPLE RECOMMENDATION SYSTEMS
Gravity (1.8 in our case) is usually bigger than the exponent of the numerator in Equation 3 (0.8 in our case), so the age effect grows faster than the popularity effect (the number of votes an article gets); newer news can thus appear closer to the top with a higher ranking.
Article votes are usually distributed according to a power law (a few articles get very many votes, while the majority get very few). Because the numerator's exponent is less than 1, the first 100 votes weigh more than another 100 votes received after 1000 votes, so the scores are less spread out. More details of the algorithm are in Shirriff (2013).
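A direct translation of Equation 3 into code (a sketch; the default gravity of 1.8 and penalty of 1.0 follow the values discussed above):

    # Hacker News score (Equation 3).
    def hn_score(ups, downs, age_hours, gravity=1.8, penalty=1.0):
        return (ups - downs - 1) ** 0.8 / (age_hours + 2) ** gravity * penalty

    print(hn_score(ups=101, downs=0, age_hours=1))   # fresh, popular article
    print(hn_score(ups=101, downs=0, age_hours=24))  # same votes, one day old: lower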
2.2.2 Reddit News Ranking Algorithm
The ranking score for Reddit news is

score = sign(ups − downs) × log[max(1, |ups − downs|)] + age / 45000,   (4)
where
• With the logarithm term, the first 100 votes count more than another 100 votes received after 1000 votes.
• Score can be either positive or negative.
• Age is in seconds since the inception of Reddit, so the newer the article, the larger the age term and the higher the score.
• Only the relative scores matter.
• Age grows (linearly) faster than popularity.
For rankings of comments, Reddit uses Wilson intervals.
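A sketch of Equation 4 in code; the logarithm base is assumed to be 10 here, as in common descriptions of this algorithm, since the text above just says log:

    import math

    # Reddit news ranking score (Equation 4); age is in seconds since
    # Reddit's inception, so newer posts get a larger age term.
    def reddit_score(ups, downs, age_seconds):
        diff = ups - downs
        sign = (diff > 0) - (diff < 0)  # sign(ups - downs)
        return sign * math.log10(max(1, abs(diff))) + age_seconds / 45000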
2.3 Recommendations Based on Average Ratings
Unlike news feeds, which usually have just 1 or 2 vote options (e.g., upvotes/downvotes), ratings can take 5 or more values (e.g., star ratings).
2.3.1 Problems with Average Ratings
A simple approach to recommendation is to just sort products by average rating. One problem with this approach is that a product's high average score may come from a single rating, while another product with a lower average may have many more ratings; we need to consider how confident we are in those averages.
1. A typical strategy for ranking by ratings is to be pessimistic and use lower bounds of the average ratings. Suppose a random variable follows a normal distribution, X ∼ N(µ, σ²); then the sample mean satisfies

X̄ ∼ N(µ, σ²/N),   (5)
where N is the number of samples we have. The more samples we have, the spikier/skinnier the distribution of X̄. The corresponding approximate 100(1 − α)% confidence interval (CI) is

( X̄ − z_right · s/√N ,  X̄ − z_left · s/√N ),   (6)

where z_left = Φ⁻¹(α/2), z_right = Φ⁻¹(1 − α/2), Φ(·) is the cdf of the standard normal distribution (so Φ⁻¹ is its quantile function), and

s = √( (1/N) Σ_{i=1}^N (X_i − X̄)² ).
By the central limit theorem, the above conclusion about X̄ holds approximately even if X is not normally distributed.
A similar procedure can be used to obtain approximate confidence intervals for Bernoulli data; Wilson's confidence interval for the Bernoulli distribution gives a more accurate lower bound.
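A sketch of the pessimistic ranking score, using the normal-approximation lower bound from Equation 6 (scipy's norm.ppf plays the role of Φ⁻¹):

    import numpy as np
    from scipy.stats import norm

    # Lower confidence bound of the average rating (Equation 6).
    def rating_lower_bound(ratings, alpha=0.05):
        x = np.asarray(ratings, dtype=float)
        n = len(x)
        s = x.std()                      # matches s as defined above (ddof = 0)
        z_right = norm.ppf(1 - alpha / 2)
        return x.mean() - z_right * s / np.sqrt(n)

    print(rating_lower_bound([5, 4]))        # few ratings: wide CI, low bound
    print(rating_lower_bound([5, 4] * 50))   # many ratings: higher lower bound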
Comments on using lower bounds of average ratings for rankings:
• Sometimes they can be too pessimistic.
• The lower bound is good in that it accounts for the number of people who rated the item: more raters lead to a smaller CI and thus a higher lower bound, so popularity increases the score.
2. Smoothing (dampening) addresses average ratings when the number of raters is too small (e.g., one or even zero): add small terms to both the numerator and the denominator when averaging the ratings,

r = ( Σ_{i=1}^N X_i + λµ₀ ) / (N + λ).   (7)

This formula provides a smooth transition from a predefined rating µ₀ to the true sample mean: the more samples we have, the smaller the effect of the predefined terms.
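Equation 7 in code (a sketch; the prior rating µ₀ = 3.0 and strength λ = 5.0 are arbitrary illustrative choices):

    # Smoothed (dampened) average rating (Equation 7).
    def smoothed_mean(ratings, mu0=3.0, lam=5.0):
        return (sum(ratings) + lam * mu0) / (len(ratings) + lam)

    print(smoothed_mean([]))           # no ratings: falls back to mu0 = 3.0
    print(smoothed_mean([5.0]))        # a single 5-star rating barely moves it
    print(smoothed_mean([5.0] * 100))  # many ratings dominate the prior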
2.3.2 Explore-Exploit Dilemma
Example: I am playing slot machines at a casino and I’d like to know which machine has the best
reward, I need to play them (collect data) to determine which is best.
Problem: How many times should I play each machine?
• Too few times: the estimate is not good and the CI is wide.
• Too many times: only one machine is optimal, so most trials are spent on suboptimal machines.
I want to explore by collecting more data, but I also want to exploit what I believe to be the best reward. The same happens in recommenders: I may watch a lot of videos on one specific topic, for example making juice. Such videos become suboptimal once I have mastered the skill and no longer want to watch them, so a stronger explore component is needed. The following Bayesian approach provides a solution to this explore-exploit dilemma.
2.4 Bayesian Paradigm
2.4.1 Bayesian Paradigm with An Example
Assume a random variable x taking two values {0, 1} follows a Bernoulli distribution with parameter π, as in Equation 8,

p(x) = π^x (1 − π)^{1−x}.   (8)
In the Bayesian paradigm, π is also a random variable. Suppose π follows a Beta distribution,

p(π) = ( 1 / B(α, β) ) π^{α−1} (1 − π)^{β−1},   (9)

where B(α, β) = Γ(α)Γ(β) / Γ(α + β) and π ∈ [0, 1].
Instead of obtaining a maximum likelihood estimate of π as in the frequentist approach, we compute
p(π|X) ∝ p(X|π)p(π). (10)
Plugging in p(X|π) and p(π), we obtain

p(π|X) ∝ π^{α + Σ_{i=1}^N X_i − 1} (1 − π)^{β + N − Σ_{i=1}^N X_i − 1},   (11)
which is also a beta distribution.
Connection to the smoothed average: the posterior mean of π is

E(π) = ( α + Σ_{i=1}^N X_i ) / ( α + β + N ),   (12)

which has almost the same form as Equation 7.
2.4.2 How Bayesian Paradigm Helps with Ranking/Recommendation?
As we collect more data, the credible interval of the average rating narrows, consistent with what we saw for the confidence interval when applying the central limit theorem to average ratings. The Bayesian approach, however, ranks non-deterministically by sorting samples drawn from the posteriors. For example, Figure 1 shows the posterior distributions of two bandits: we have collected many samples for one but not for the other. When sampling from posteriors like these, the ranking of the product with the dashed line can come out either higher or lower than that of the solid line, depending on the sampled values. Sampling thus automatically balances exploration and exploitation until the dashed distribution gets skinny. We sample the posterior in an "intelligently" random rather than totally random way, since the posterior also accounts for the data we have collected.
Returning to the slot machine example, we can use such a Bayesian approach to make recommendations. Instead of playing each slot machine one at a time to collect more data, run a bandit for each slot machine according to samples drawn from the posteriors in each round:
1. Model each slot machine with a separate Beta distribution, which models that machine's win rate.
2. In each loop, draw a sample from each Beta distribution and sort the samples (this ranking is the order in which items would be ranked in a recommendation system).
Figure 1: Bayesian Paradigm Scenario
3. Pick the top item and play that machine only.
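A minimal Thompson-sampling sketch of this loop, assuming Beta(1, 1) priors and the true win rates used in Figure 2:

    import numpy as np

    rng = np.random.default_rng(0)
    true_p = [0.2, 0.5, 0.75]        # unknown to the algorithm
    a = np.ones(3)                   # Beta parameters: successes + 1
    b = np.ones(3)                   # Beta parameters: failures + 1

    for trial in range(500):
        samples = rng.beta(a, b)           # one posterior draw per machine
        j = int(np.argmax(samples))        # rank the draws, pick the top machine
        reward = rng.random() < true_p[j]  # play that machine only
        a[j] += reward                     # conjugate posterior update
        b[j] += 1 - reward

    print(a / (a + b))  # posterior means concentrate around the best machine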
See the code in Bayesian bandit.ipynb for more details of the implementation. Figure 2 shows the posterior distributions of three bandits (slot machines) with true success probabilities 0.2, 0.5 and 0.75, respectively. After 50 trials, all three lines have fat credible intervals, though the blue line with true success probability 0.2 has started to shift toward a lower mode. After 500 trials, the credible interval of the green line (true success probability 0.75) is much sharper, with a mode around its true success probability, whereas the other two lines still have fat credible intervals: at this point we are confident that the green line is better than the others, so we end up sampling those two cases less often.
The above Bayesian paradigm extends easily to the Gaussian case, in which x can be any real number instead of just 0 and 1.
2.5 Supervised Learning for Recommender Systems
For supervised learning in the context of recommender systems, we may have input features X such as demographic data:
• Age
• Gender
• Religion
Figure 2: Posterior distributions of three slot machines.
• Location
• Race
• Occupation
• Education
• Marital status
• . . .
or any other data collected by your sites such as
• When the user signed up
• Which pages they viewed
• Purchase history
• . . .
or data purchased from other companies.
Besides information about users, X should also include information about the products to be recommended.
The corresponding targets Y could be
• Did the user buy the product?
• Did the user click on an ad?
• Did the user make an account?
• Did the user rate this item?
• Did the user . . .
Two types of models:
1. Build a separate model for each product. This can be problematic, as a lot of data will be needed for each product.
2. Build a single model that takes both user information and product attributes as inputs.
Making more accurate predictions with supervised models helps improve the recommendation suggestions.
Instead of using explicit features such as age and gender, we can learn features implicitly with latent variable models, which learn the best possible features automatically. Such features may be hard to interpret, but they are mathematically optimal and we don't need to worry about collecting them manually. See Section 5 for details on matrix factorization.
2.6 Markov Models and Google Page Rank
The PageRank of a page is the probability that a user would end up on that page if the user surfed
the Internet randomly for an infinite amount of time.
The higher the probability, the higher ranking (recommendation) the page shows up in search
results.
2.6.1 Markov Models
A Markov model assumes that state x_t depends only on the state before it, x_{t−1}:

p(x_t | x_{t−1}, x_{t−2}, ..., x_1) = p(x_t | x_{t−1}).   (13)
The transition probability matrix A has (i, j)-th element

A[i, j] = p(x_t = j | x_{t−1} = i),   (14)

which represents the probability of a transition from state i to state j and satisfies Σ_{j=1}^M A[i, j] = Σ_{j=1}^M p(x_t = j | x_{t−1} = i) = 1. A is also called a "stochastic matrix" or "Markov matrix".
2.6.2 State Distribution
Denote π
t
as the state distribution at time t., for example, suppose we have two states of weather,
sunny and rainy, then π
t
is a vector of size 2
π
t
= [p(x
t
= sunny), p(x
t
= rainy)]. (15)
To calculate the state distribution at time t + 1, we marginalize over the state at time t:
p(x_{t+1} = j) = Σ_{i=1}^M p(x_{t+1} = j, x_t = i)
             = Σ_{i=1}^M p(x_{t+1} = j | x_t = i) p(x_t = i)
             = Σ_{i=1}^M A[i, j] π_t(i)
             = π_{t+1}(j).   (16)
Rewriting the above element-wise relation in compact matrix form,

π_{t+1} = π_t A.   (17)
Extending the one-step-ahead state distribution to k steps ahead,

π_{t+k} = π_t A^k.   (18)
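Equations 17 and 18 in code, using the sunny/rainy example (the transition probabilities here are made up for illustration):

    import numpy as np

    A = np.array([[0.9, 0.1],     # sunny -> sunny, sunny -> rainy
                  [0.5, 0.5]])    # rainy -> sunny, rainy -> rainy
    pi_t = np.array([1.0, 0.0])   # start on a sunny day

    print(pi_t @ A)                              # pi_{t+1} (Equation 17)
    print(pi_t @ np.linalg.matrix_power(A, 30))  # pi_{t+30} (Equation 18)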
2.6.3 Connection to Recommender Systems
Every page on the Internet is a state in the Markov model. If page i links to page j, the probability of jumping to page j when the user is at page i is

p(x_t = j | x_{t−1} = i) = 1/n_i,   (19)

where n_i is the number of links on page i.
With billions of pages on the Internet, most transition probabilities are 0, so we smooth the transition matrix to obtain a weighted probability distribution, the Google matrix G,

G = 0.85A + 0.15U,   (20)

where U[i, j] = 1/M comes from the uniform distribution.
Extending Equation 18 to time ∞, we have

π_∞ = π_∞ G,   (21)

and linear algebra shows that this state distribution is the eigenvector of G with corresponding eigenvalue 1, which gives the score of each page.
Justification of the PageRank algorithm via the Perron-Frobenius theorem: if G is a Markov matrix and all its elements are positive, then its stationary distribution (the state distribution that doesn't change after transitioning by G) and its limiting distribution (the state distribution reached after transitioning by G infinitely many times) are the same.
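A power-iteration sketch of Equations 20 and 21 on a hypothetical three-page link graph:

    import numpy as np

    A = np.array([[0.0, 0.5, 0.5],    # page 0 links to pages 1 and 2
                  [1.0, 0.0, 0.0],    # page 1 links to page 0
                  [0.0, 1.0, 0.0]])   # page 2 links to page 1
    M = len(A)
    G = 0.85 * A + 0.15 * np.full((M, M), 1.0 / M)  # Google matrix (Equation 20)

    pi = np.full(M, 1.0 / M)
    for _ in range(100):   # converges to the eigenvector with eigenvalue 1
        pi = pi @ G
    print(pi)              # PageRank score of each page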
See ways to evaluate rankings at https://en.wikipedia.org/wiki/Learning_to_rank.
To summarize, this section introduced non-personalized recommendations. Recommendation systems can be framed as machine learning problems, and we saw how they can make use of supervised learning, confidence intervals, Bayesian bandits, and Markov models.
3 Evaluate A Recommender System
Use train/test sets with cross validation (k-fold or leave-one-out) to train the model. Use accuracy metrics such as RMSE and MAE, or metrics such as hit rate, to evaluate the performance of models. Other issues include:
• Coverage
• Diversity. Are we recommending the same types of things all the time?
• Novelty. How popular or obscure are your recommendations?
• Churn. How often do recommendations change?
• Responsiveness. How quickly does new user behavior affect recommendations?
The real impact of a model, however, should be evaluated on real users in an online setting; the results of online A/B tests are the only evaluation that ultimately matters for recommendation systems.
Another way to evaluate recommender systems is to ask users to rate the recommendations (measure the perceived quality), but this is less likely to produce interpretable and useful data, since users may be confused by the ratings themselves and it requires extra work from them.
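A sketch of the offline metrics mentioned above (hit rate here checks whether a held-out item appears in each user's top-N list; exact definitions vary across references):

    import numpy as np

    def rmse(y_true, y_pred):
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

    def mae(y_true, y_pred):
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        return float(np.mean(np.abs(y_true - y_pred)))

    def hit_rate(top_n_lists, held_out_items):
        hits = sum(item in top_n
                   for top_n, item in zip(top_n_lists, held_out_items))
        return hits / len(held_out_items)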
4 Collaborative Filtering (CF)
4.1 Notations
With collaborative filtering, we can obtain a personalized score s(i, j) for each user i and item j, with i = 1, 2, ..., N and j = 1, 2, ..., M. As a starting point, score item j by its average rating,

s(i, j) = ( Σ_{i′∈Ω_j} r_{i′j} ) / |Ω_j|,   (22)

where Ω_j is the set of users who rated item j, and r_{i′j} is the rating of item j by user i′.
Denote the N × M user-item ratings matrix by R, where r_{ij} is the rating of user i for item j.
Sparsity of R:
• The user-item ratings matrix is sparse (most ratings are missing/undefined/nonexistent).
• Our goal is to make recommendations, so sparsity is good from a business point of view: it means there are items left to recommend to users (e.g., movies that users haven't watched).
Our goal in CF is to predict the scores s(i, j) that users would give to items they haven't watched/purchased/..., and to make recommendations accordingly.
4.2 User-User CF
User-user CF is a type of CF based on the similarity between users, calculated using people's ratings of items (find similar users, then recommend movies based on the behavior of those similar users).
Instead of assigning each user an equal weight, we weight the ratings by the degree of similarity between users:
s(i, j) = ( Σ_{i′∈Ω_j} w_{ii′} r_{i′j} ) / ( Σ_{i′∈Ω_j} |w_{ii′}| ).   (23)
Also, interpretations of ratings vary from user to user, since users can be optimistic or pessimistic: one user may rate an item 4 out of 5 even though they don't like it, while another user may like the same item and also rate it 4. We therefore use each user's deviation from their own average to de-bias,

dev(i, j) = r(i, j) − r̄_i.   (24)
Combining the weighting and the deviation above, we have

s(i, j) = r̄_i + ( Σ_{i′∈Ω_j} w_{ii′} (r_{i′j} − r̄_{i′}) ) / ( Σ_{i′∈Ω_j} |w_{ii′}| ),   (25)
where w_{ii′} can be negative and can be calculated in various ways:
• Pearson correlation coefficient:

w_{ii′} = [ Σ_{j∈Ψ_{ii′}} (r_{ij} − r̄_i)(r_{i′j} − r̄_{i′}) ] / [ √(Σ_{j∈Ψ_i} (r_{ij} − r̄_i)²) √(Σ_{j∈Ψ_{i′}} (r_{i′j} − r̄_{i′})²) ],   (26)

where Ψ_i is the set of items that user i has rated, and Ψ_{ii′} = Ψ_i ∩ Ψ_{i′} is the set of items that both users i and i′ have rated.
• Cosine similarity:

cos θ = ( Σ_{i=1}^N x_i y_i ) / ( √(Σ_{i=1}^N x_i²) √(Σ_{i=1}^N y_i²) ),   (27)

which is very similar to the Pearson correlation coefficient, except that the latter is centered.
• Euclidean distance.
Improvements to the CF:
• Ψ_{ii′} may be empty when users i and i′ have no common movies, in which case no correlation can be computed; such users are not used for each other's predictions.
• We can consider only the most similar/correlated users when making predictions, which speeds up the computation. This, however, may also remove useful information: two users may have a very strong negative correlation under Pearson's correlation, and dropping one user from the other's prediction may not be desirable.
• In practice, recommendations can be faster with precomputed user-user weights.
See an implementation of user-user CF with the MovieLens 20 million dataset (link) in user-user-collaborative-filtering.ipynb. The steps are:
• Split data into train and test sets.
• Compute weights using train set.
• Make predictions.
• Use an evaluation metric to test the predictions for both sets.
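A minimal sketch of these steps' core computation, following Equations 25 and 26 (the toy dictionary layout here is hypothetical, not the notebook's actual data structures; for simplicity, the correlation's numerator and denominator are both restricted to the common item set, a frequent simplification of Equation 26):

    import numpy as np

    def user_mean(ratings, i):
        return np.mean(list(ratings[i].values()))

    def pearson(ratings, i, ip):
        common = set(ratings[i]) & set(ratings[ip])  # items rated by both users
        if len(common) < 2:
            return 0.0
        di = np.array([ratings[i][j] for j in common]) - user_mean(ratings, i)
        dip = np.array([ratings[ip][j] for j in common]) - user_mean(ratings, ip)
        denom = np.sqrt((di ** 2).sum() * (dip ** 2).sum())
        return float(di @ dip / denom) if denom > 0 else 0.0

    def predict(ratings, i, j):
        num = den = 0.0
        for ip in ratings:                    # i' ranges over users who rated j
            if ip == i or j not in ratings[ip]:
                continue
            w = pearson(ratings, i, ip)
            num += w * (ratings[ip][j] - user_mean(ratings, ip))
            den += abs(w)
        return user_mean(ratings, i) + (num / den if den > 0 else 0.0)

    ratings = {"u1": {"m1": 5, "m2": 3, "m3": 4},
               "u2": {"m1": 4, "m2": 2, "m3": 5},
               "u3": {"m1": 1, "m2": 5}}
    print(predict(ratings, "u3", "m3"))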
4.3 Item-Item CF
Item-item CF is a type of CF based on the similarity between items, calculated using people's ratings of those items (movies A and C are similar because they receive similar ratings; if a user likes movie A, then it's more likely that he/she will also like movie C).
Similarly, the score is

s(i, j) = r̄_j + ( Σ_{j′∈Ω_i} w_{jj′} (r_{ij′} − r̄_{j′}) ) / ( Σ_{j′∈Ω_i} |w_{jj′}| ),   (28)

and the weights for different items can be computed in a similar way, but summing over users i:
• Pearson correlation coefficient:

w_{jj′} = [ Σ_{i∈Ψ_{jj′}} (r_{ij} − r̄_j)(r_{ij′} − r̄_{j′}) ] / [ √(Σ_{i∈Ψ_j} (r_{ij} − r̄_j)²) √(Σ_{i∈Ψ_{j′}} (r_{ij′} − r̄_{j′})²) ],   (29)

where Ψ_j is the set of users who rated item j and Ψ_{jj′} = Ψ_j ∩ Ψ_{j′}.
To summarize,
1. User-user CF chooses items for a user because those items have been liked by similar users.
2. Item-item CF chooses items for a user because the user has liked similar items in the past.
3. In practice, there is a lot more data when comparing items than when comparing users.
4. Item-item CF is faster, as it computes fewer weights when the number of users far exceeds the number of items, N >> M.
Situations when item-item CF is preferred:
• There are more users than products (e.g., Netflix has millions of users but maybe not millions of movies, especially popular ones).
• Items don't change frequently, so the computation is less expensive.
• It's more convincing to use the user's own historical item data.
Situations when user CF is preferred:
• Products change very frequently (e.g., news recommendation).
In practice, we can use a combination of different methods and do an ensemble.
User M1 M2 M3 M4 M5
A 1 1 1 0 0
B 1 1 0 1 0
C 0 1 1 0 1
D 0 1 0 1 0
Table 1: Indicator matrix of users and their corresponding rated movies, 1 indicates that a user
rated that movie while 0 indicates that a user didn’t rate that movie.
M1 M2 M3 M4 M5
M1 2 2 1 1 0
M2 2 4 2 2 1
M3 1 2 2 0 1
M4 1 2 0 2 0
M5 0 1 1 0 1
Table 2: Unnormalized co-occurrence matrix.
4.4 Recommender System with MapReduce
4.4.1 Item CF
What data to use?
We chose to use the rating history of users to build the relationship between movies, data are
from Netflix Prize data. The underlying intuition is,
• If one user rated two movies, the two movies are more related than the other unrated movies
given that users will usually rate a movie on their own initiative (interested in both movies).
• When there are a lot of users, the rating history will reflect more about actual relations
among movies instead of individual “noise“.
How do we describe the relationship between different items?
With a co-occurrence matrix: here, entry (i, j) counts how many users rated both movie i and movie j (Table 2). To take the total ratings of each movie into account, normalize each row of the co-occurrence matrix by its row sum (Table 3).
M1 M2 M3 M4 M5
M1 2/6 2/6 1/6 1/6 0/6
M2 2/11 4/11 2/11 2/11 1/11
M3 1/6 2/6 2/6 0/6 1/6
M4 1/5 2/5 0/5 2/5 0/5
M5 0/3 1/3 1/3 0/3 1/3
Table 3: Normalized co-occurrence matrix.
Movie UserA Rating
M1 3
M2 7
M3 8
M4 0
M5 0
Table 4: User ratings matrix.
The rating matrix indicates the differences among movies for each user. For missing ratings (movies the user didn't rate), use a neutral rating the user would typically give, based on historical values. If a user has no ratings at all, cold-start with a neutral value (5 out of 10), or find the user's neutral score from other sources such as other movie rating websites.
Multiply the co-occurrence matrix by the rating matrix to obtain the expected rating of each movie for a user, and make recommendations based on the expected scores.
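The multiplication in code, using the numbers from Tables 2-4:

    import numpy as np

    cooc = np.array([[2, 2, 1, 1, 0],      # Table 2 (unnormalized)
                     [2, 4, 2, 2, 1],
                     [1, 2, 2, 0, 1],
                     [1, 2, 0, 2, 0],
                     [0, 1, 1, 0, 1]], dtype=float)
    norm_cooc = cooc / cooc.sum(axis=1, keepdims=True)  # Table 3

    user_a = np.array([3, 7, 8, 0, 0], dtype=float)     # Table 4

    expected = norm_cooc @ user_a  # expected rating of each movie for user A
    print(expected)                # recommend unrated movies with top scores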
Raw data: userID, movieID, rating.
Preprocessor: divide the data by userID and merge records with the same userID together.
4.4.2 Recommender System with MapReduce
1. Enter docker and start hadoop.
2. Download code from https://github.com/QianqianShan/Recommender System
3. Unzip the file with tar -xvf RecommenderSystem.tar
4. cd RecommenderSystem
5. Make directory to save input data, hdfs dfs -mkdir /input
6. Copy input file to the input folder, hdfs dfs -put input/* /input
7. Remove the folders created by each MapReduce job if they already exist (from a previous run of the code).
hdfs dfs -rm -r /dataDividedByUser
hdfs dfs -rm -r /coOccurrenceMatrix
hdfs dfs -rm -r /Normalize
hdfs dfs -rm -r /Multiplication
hdfs dfs -rm -r /Sum
8. Enter the source code folder, cd src/
9. Compile, hadoop com.sun.tools.javac.Main *.java
10. Make jar, jar cf recommender.jar *.class
11. Run the jobs, hadoop jar recommender.jar Driver /input /dataDividedByUser /coOccurrenceMatrix /Normalize /Multiplication /Sum
Note:
args0: original dataset
args1: output directory for DividerByUser job
args2: output directory for coOccurrenceMatrixBuilder job
args3: output directory for Normalize job
args4: output directory for Multiplication job
args5: output directory for Sum job
12. Check results from the last MapReduce job, hdfs dfs -cat /Sum/*
13. List all folders to check outputs of each MapReduce job, hdfs dfs -ls /
5 Matrix Factorization
5.1 Basic Models
The idea of matrix factorization is to split the approximate rating matrix R̂ (of size N × M) into two smaller matrices,

R̂ = W U^T,   (30)

where W (of size N × K) and U (of size M × K) should be much smaller in size than a full representation of R̂; that is, K should be relatively small (e.g., 10-50), so the factorization takes much less space.
The multiplication of W and U^T can still be large, but one can compute the rating for user i and item j separately as r̂_{ij} = R̂[i, j] = w_i^T u_j. Furthermore,

w_i^T u_j = ||w_i|| · ||u_j|| cos θ,   (31)

which can be interpreted as how well user i's preferences correlate with movie j's attributes. These features (preferences or attributes), however, are latent and hard to interpret; K is the latent dimensionality.
5.2 Justification and Relationship between SVD and MF
In mathematics, a matrix X can be decomposed into three matrices by singular value decomposition (SVD),

X = U S V^T,   (32)
so a matrix X can also be decomposed into two matrices simply by absorbing S into one of the other two matrices on the right-hand side of Equation 32. In other words, one can always turn an SVD solution into an MF solution.
In an SVD, however, the columns of U and V are orthonormal, which is not necessarily true in MF; also, SVD doesn't work for matrices with missing values.
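A small sketch of "absorbing" one SVD factor to obtain a two-matrix factorization (valid only for a fully observed X):

    import numpy as np

    X = np.random.default_rng(0).normal(size=(6, 4))
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    W = U * s                       # absorb the singular values S into U
    print(np.allclose(W @ Vt, X))   # True: X = (U S) V^T, a two-matrix form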
5.3 Loss Function
In order to ensure the approximation of R by R̂ is good, that is,

R ≈ R̂ = W U^T,   (33)

we minimize a squared error loss,

J = Σ_{(i,j)∈Ω} (r_{ij} − r̂_{ij})²,   (34)

where Ω is the set of pairs (i, j) for which user i rated item j.
5.4 Matrix Factorization with Bias and Regularization
As with the use of deviations in collaborative filtering, we introduce bias terms for matrix factorization,

r̂_{ij} = w_i^T u_j + b_i + c_j + µ,   (35)

where b_i is the user bias, c_j is the item bias, and µ is the global average. We use both user bias and item bias because both dimensions can be biased.
To prevent overfitting, add regularization to the loss function,

J = Σ_{(i,j)∈Ω} (r_{ij} − r̂_{ij})² + λ( ||W||²_F + ||U||²_F + ||b||²_F + ||c||²_F ),   (36)

where ||·||_F denotes the Frobenius norm. Taking partial derivatives with respect to the parameters and setting them to zero gives
w_i = ( Σ_{j∈Ψ_i} u_j u_j^T + λI )^{−1} Σ_{j∈Ψ_i} (r_{ij} − b_i − c_j − µ) u_j,
u_j = ( Σ_{i∈Ω_j} w_i w_i^T + λI )^{−1} Σ_{i∈Ω_j} (r_{ij} − b_i − c_j − µ) w_i,
b_i = ( 1 / (|Ψ_i| + λ) ) Σ_{j∈Ψ_i} (r_{ij} − w_i^T u_j − c_j − µ),
c_j = ( 1 / (|Ω_j| + λ) ) Σ_{i∈Ω_j} (r_{ij} − w_i^T u_j − b_i − µ).   (37)
Update the parameters with alternating least squares to obtain the optimized values; the gradient descent algorithm is an alternative.
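A one-round ALS sketch of the updates in Equation 37, on a dense toy matrix with a Boolean mask marking the observed entries (it assumes every user and item has at least one observed rating; real implementations use sparse structures):

    import numpy as np

    rng = np.random.default_rng(0)
    N, M, K, lam = 20, 15, 3, 0.1
    R = rng.integers(1, 6, size=(N, M)).astype(float)  # toy ratings 1..5
    mask = rng.random((N, M)) < 0.5                    # observed entries
    mu = R[mask].mean()
    W = rng.normal(size=(N, K))
    U = rng.normal(size=(M, K))
    b = np.zeros(N)
    c = np.zeros(M)

    for i in range(N):                                 # update w_i
        js = np.where(mask[i])[0]
        A = U[js].T @ U[js] + lam * np.eye(K)
        rhs = ((R[i, js] - b[i] - c[js] - mu)[:, None] * U[js]).sum(axis=0)
        W[i] = np.linalg.solve(A, rhs)
    for j in range(M):                                 # update u_j
        is_ = np.where(mask[:, j])[0]
        A = W[is_].T @ W[is_] + lam * np.eye(K)
        rhs = ((R[is_, j] - b[is_] - c[j] - mu)[:, None] * W[is_]).sum(axis=0)
        U[j] = np.linalg.solve(A, rhs)
    for i in range(N):                                 # update b_i
        js = np.where(mask[i])[0]
        b[i] = (R[i, js] - W[i] @ U[js].T - c[js] - mu).sum() / (len(js) + lam)
    for j in range(M):                                 # update c_j
        is_ = np.where(mask[:, j])[0]
        c[j] = (R[is_, j] - W[is_] @ U[j] - b[is_] - mu).sum() / (len(is_) + lam)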
5.5 Probabilistic MF
Instead of setting R̂ = W U^T deterministically, let the rating matrix follow a probabilistic distribution,

R ∼ N(W U^T, Σ = σ² I),   (39)

where R is the random matrix. We can then apply methods such as maximum likelihood estimation (MLE) to find the parameters. The log-likelihood can be written as
log L = C − Σ_{(i,j)∈Ω} ( 1 / (2σ²) ) (r_{ij} − w_i^T u_j)²,   (40)

which gives the same parameter estimates as minimizing the squared error loss above.
5.6 Bayesian MF
Maximum a posteriori (MAP) estimation of W and U uses

p(W, U|R) = p(R|W, U) p(W) p(U) / p(R),   (41)

where R is the collected ratings data.
MCMC can be used to draw samples from the posterior distributions of W and U and compute the posterior means,

E(r_{ij}|R) = E(w_i^T u_j | R) ≈ (1/T) Σ_{t=1}^T w_i^{(t)T} u_j^{(t)}.   (42)
5.7 Improving Basic Models By Deep Learning Techniques
Matrix factorization only models linear patterns; using a deep neural network (NN) can find non-linear patterns. A NN can work with multiple branches doing different things to find different features of the inputs.
1. Concatenate the W and U matrices together as the input to a neural network.
2. Apply a NN to the residuals from the matrix factorization model and make the final rating prediction as the sum of the matrix factorization prediction and the NN prediction on the residuals,

r̂ = MF(x) + NN(residuals).   (43)
3. An autoencoder (Sedhain et al., 2015) reproduces its own input, and can reconstruct the input even when it is noisy (e.g., has missing ratings). For example, simply treat the rating matrix as a data matrix (with missing values), with each user's row as a sample and each column as a feature, and use it as both input and output to train a NN model.
• Adding your own noise to the inputs helps the model predict missing ratings instead of just reconstructing the input (predicting the ratings that are not missing).
Figure 3: RBM layers, from Wikipedia.
• The cost function is based on the known ratings,

J = ( 1 / |Ω| ) Σ_{i=1}^N Σ_{j=1}^M m_{ij} (r_{ij} − r̂_{ij})²,   (44)

where m_{ij} = 1 if (i, j) ∈ Ω and 0 otherwise.
• Since the autoencoder uses the same ratings as both input and output during training, one cannot feed the test set in as input to check MSE (the model would just reproduce its input); instead, use the ratings reconstructed by the trained model to evaluate performance on the test set. See the code in Autoencoder.ipynb for more details.
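A sketch of the masked cost in Equation 44; in the notebook this idea would be wired into the training loss, but the computation itself is just:

    import numpy as np

    # Masked squared error: only observed ratings contribute (Equation 44).
    def masked_mse(r_true, r_pred, mask):
        mask = mask.astype(float)
        return float(np.sum(mask * (r_true - r_pred) ** 2) / np.sum(mask))

    r_true = np.array([[5.0, 0.0, 3.0],   # 0 marks a missing rating here
                       [0.0, 4.0, 0.0]])
    mask = r_true > 0
    r_pred = np.full_like(r_true, 3.5)
    print(masked_mse(r_true, r_pred, mask))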
6 Restricted Boltzmann Machine (RBM) for Collaborative Filtering
An RBM has an input (visible) layer v and a hidden layer h, and it is a two-way NN (one can go from v to h and from h to v); see Figure 3, from Wikipedia contributors (2019), for an example of an RBM.
6.1 Motivation
We'd like to create a NN without imposing any structure on it, that is, "everything is connected to everything" (a Boltzmann machine, or BM). The idea originates from statistical mechanics: systems tend to move toward equilibrium, from high energy to low energy.
Instead of the BM, which is difficult to train, we use the RBM, which scales training up to non-trivial problems. That is, we restrict the BM so that every visible unit connects to every hidden unit and vice versa, but visible units cannot connect to other visible units and hidden units cannot connect to other hidden units.
6.2 Bernoulli RBM
Both the visible and hidden units in a Bernoulli RBM can only be 0 or 1, but this constraint can usually be relaxed. The probabilities of h are

p(h = 1|v) = σ(W^T v + c),   (45)

or equivalently, in scalar form,

p(h_j = 1|v) = σ( Σ_{i=1}^D W_{ij} v_i + c_j ),   (46)

where i = 1, ..., D and j = 1, ..., M. With W a shared weight matrix, symmetrically, the probabilities of v are

p(v = 1|h) = σ(W h + b).   (47)
6.3 Relaxed Bernoulli RBM
One can relax the constraints on the values of a Bernoulli RBM:
1. First scale all values into the range [0, 1], then round each value x_i so that x_i = 1 if x_i >= 0.5 and x_i = 0 otherwise.
2. Just pass values in as they are; for example, when calculating p(h = 1|v), there is no constraint on the values of v.
3. When going from h to v, simply use h̃ = p(h = 1|v) as the input:

h̃ = p(h = 1|v) = σ(W^T v + c),
p(v′ = 1|h) = σ(W h̃ + b).   (48)
If we go v → h → v′, the cross entropy (the distance between v and v′) goes down as we train, even though we don't optimize it directly.
6.4 RBM Training
Maximum likelihood estimation is used to find the values of W, b, c by maximizing the probability of the observations v we have,

W, b, c = argmax_{W,b,c} log p(v; W, b, c),   (49)

where

p(v) = Σ_h (1/Z) exp( v^T W h + b^T v + c^T h ) = (1/Z) Σ_h exp(−E(v, h)),   (50)

with energy E(v, h) = −( v^T W h + b^T v + c^T h ) and

Z = Σ_{v′} Σ_h exp(−E(v′, h)).   (51)
Due to the exponentially increasing computation time to compute the probabilities, we need to
simplify the computation so it becomes tractable.
Define the free energy as

F(v) = −log Σ_h exp(−E(v, h)).   (52)
Then for any parameter θ (θ = W, b, or c),

−∂ log p(v)/∂θ = ∂/∂θ [ F(v) + log Z ]
             = ∂F(v)/∂θ − Σ_{v′} (1/Z) exp(−F(v′)) ∂F(v′)/∂θ
             = ∂F(v)/∂θ − Σ_{v′} p(v′) ∂F(v′)/∂θ
             = ∂F(v)/∂θ − E[ ∂F(v′)/∂θ ]
             ≈ ∂F(v)/∂θ − (1/N) Σ ∂F(v′)/∂θ   (v′ drawn by Gibbs sampling)
             ≈ ∂F(v)/∂θ − ∂F(v′)/∂θ   (N = 1),   (53)
that is, with N = 1: given v, sample h from p(h|v), then sample v′ from p(v′|h). This procedure is called contrastive divergence, and F(v) takes O(M) time to compute.
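A minimal CD-1 sketch for a Bernoulli RBM (a single training example; W is D × M, b has length D, c has length M; the learning rate is an arbitrary choice):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_step(v, W, b, c, rng, lr=0.01):
        # Positive phase: sample h from p(h|v) (Equation 45)
        ph = sigmoid(v @ W + c)
        h = (rng.random(ph.shape) < ph).astype(float)
        # Negative phase: reconstruct v' from p(v'|h), then compute p(h|v')
        pv = sigmoid(h @ W.T + b)
        v_prime = (rng.random(pv.shape) < pv).astype(float)
        ph_prime = sigmoid(v_prime @ W + c)
        # Approximate gradient: positive minus negative statistics (Equation 53)
        W += lr * (np.outer(v, ph) - np.outer(v_prime, ph_prime))
        b += lr * (v - v_prime)
        c += lr * (ph - ph_prime)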
6.5 Categorical RBM for Recommender Systems
Ratings in recommender systems are usually not in [0, 1], so we extend the Bernoulli RBM so that the visible units represent a K-class categorical distribution, while the hidden units remain Bernoulli. The NN equations are updated accordingly:
p(h_j = 1|v) = σ( Σ_{k=1}^K Σ_{i=1}^D W^k_{ij} v^k_i + c_j ),
p(v^k_i = 1|h) = softmax( Σ_{j=1}^M W^k_{ij} h_j + b^k_i ),   (54)

where
• h is a vector of size M,
• v is a 2-D array of size D × K,
• W is a 3-D array of size D × K × M,
• b is a 2-D array of size D × K,
• c is a vector of size M.
Predictions can be made by
• taking the weighted average of the possible ratings, weighted by their predicted probabilities, or
• taking the rating with the argmax probability.
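A sketch of both prediction rules, assuming p_i holds the K predicted class probabilities for item i and the ratings are 1..K:

    import numpy as np

    def predict_rating(p_i, rating_values=None):
        K = len(p_i)
        if rating_values is None:
            rating_values = np.arange(1, K + 1)        # e.g., 1..5 stars
        weighted = float(p_i @ rating_values)          # weighted-average rule
        argmax = float(rating_values[np.argmax(p_i)])  # argmax rule
        return weighted, argmax

    print(predict_rating(np.array([0.05, 0.1, 0.2, 0.4, 0.25])))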
References
Sedhain, S., A. K. Menon, S. Sanner, and L. Xie (2015). AutoRec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web, pp. 111-112.
Shirriff, K. (2013). How Hacker News ranking really works: scoring, controversy, and penalties. http://www.righto.com/2013/11/how-hacker-news-ranking-really-works.html. [Online; accessed 10-May-2019].
Wikipedia contributors (2019). Restricted Boltzmann machine. Wikipedia, the free encyclopedia. [Online; accessed 9-June-2019].