
Notes on The Elements of Statistical Learning
Qianqian Shan
Contents

1 Introduction
2 Overview of Supervised Learning
  2.1 Introduction
  2.2 Variable Types
  2.3 Two Simple Approaches to Prediction: Least Squares (LS) and Nearest Neighbor (NN)
  2.4 Statistical Decision Theory
    2.4.1 L_2 Loss Function
    2.4.2 L_1 Loss Function
    2.4.3 0-1 Loss Function for Categorical Output Variable G
  2.5 Local Methods in High Dimensions
  2.6 Statistical Models, Supervised Learning and Function Approximation
    2.6.1 Statistical Model for the Joint Distribution P(X, Y)
    2.6.2 Supervised Learning
    2.6.3 Function Approximation
  2.7 Structured Regression Models
  2.8 Classes of Restricted Estimators
    2.8.1 Roughness Penalty and Bayesian Methods
    2.8.2 Kernel Methods and Local Regression
    2.8.3 Basis Functions and Dictionary Methods
  2.9 Model Selection and the Bias-Variance Tradeoff
3 Linear Methods for Regression
  3.1 Introduction
  3.2 Linear Regression Models and Least Squares
    3.2.1 Gauss-Markov Theorem
    3.2.2 Multiple Regression from Simple Univariate Regression
  3.3 Subset Selection
  3.4 Shrinkage Methods
    3.4.1 Ridge Regression
    3.4.2 The Lasso (basis pursuit in signal processing)
    3.4.3 Comparison of Subset Selection, Ridge, Lasso and Other Penalty Forms
    3.4.4 Least Angle Regression (LAR)
  3.5 Methods Using Derived Input Directions
    3.5.1 Principal Components Regression (PCR)
    3.5.2 Partial Least Squares (PLS)
  3.6 Comparison of the Selection and Shrinkage Methods
  3.7 Multiple Outcome Shrinkage and Selection
  3.8 More on the Lasso and Related Path Algorithms
    3.8.1 Incremental Forward Stagewise Regression (FS0)
    3.8.2 Piecewise-Linear Path Algorithms
    3.8.3 The Dantzig Selector
    3.8.4 The Grouped Lasso
    3.8.5 Further Properties of the Lasso
    3.8.6 Pathwise Coordinate Optimization (PCO)
4 Linear Models for Classification
  4.1 Introduction
  4.2 Linear Regression of an Indicator Matrix
  4.3 Linear Discriminant Analysis
  4.4 Logistic Regression
  4.5 Separating Hyperplanes
    4.5.1 Rosenblatt's Perceptron Learning Algorithm
    4.5.2 Optimal Separating Hyperplanes (OSH)
5 Basis Expansion and Regularization
  5.1 Introduction
6 Kernel Smoothing Methods
7 Model Assessment and Selection
8 Model Inference and Averaging
9 Additive Models, Trees, and Related Methods
  9.1 Generalized Additive Models
10 Boosting and Additive Trees
1 Introduction
A problem is called
- supervised learning when an outcome variable is available to guide the learning process;
- unsupervised learning when only the features are observed and there is no measurement of the outcome.
2 Overview of Supervised Learning
2.1 Introduction
Synonyms:
- Inputs are often called predictors in the statistical literature, or more classically independent variables; the term features is preferred in the pattern recognition literature.
- Outputs are called responses, or classically dependent variables.
2.2 Variable Types
Output Type       Quantitative    Qualitative       Ordered Categorical
Prediction Task   Regression      Classification    See Chapter 4

Table 1: Variable Types.
2.3 Two Simple Approaches to Prediction: Least Squares (LS) and Nearest Neighbor (NN)
Model:
  LS: Ŷ = β̂_0 + ∑_{j=1}^{p} X_j β̂_j.
  NN: Ŷ = (1/k) ∑_{x_i ∈ N_k(x)} y_i.
Solution:
  LS: find β̂ by minimizing the RSS.
Parameter:
  LS: β.   NN: the number of neighbors k.
Effective number of parameters:
  LS: the dimension of β.   NN: N/k, generally bigger than p (the dimension).
Assumption:
  LS: a linear decision boundary is appropriate (low variance and potentially high bias).
  NN: no stringent assumptions (subregions can be wiggly and unstable; high variance and low bias).
Comments:
  A large subset of the most popular techniques in use today are variants of these two simple procedures.

Table 2: LS and NN.
Ways to enhance these two procedures:
1. Kernel methods use weights that decrease smoothly with distance, replacing the 0/1 weights used by k-nearest-neighbor methods.
2. The distance kernels can be modified to emphasize some variables more than others in high-dimensional spaces.
3. Local regression fits linear models by locally weighted least squares.
4. Linear models fit to basis expansions of the original inputs allow arbitrarily complex models.
5. Projection pursuit and neural network models consist of sums of nonlinearly transformed linear models.
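For concreteness, here is a minimal NumPy sketch of the two baseline procedures of Table 2; the function names and the choice of Euclidean distance are illustrative assumptions, not from the notes:

```python
import numpy as np

def ls_fit(X, y):
    """Least squares: append an intercept column and solve for the coefficients."""
    Xb = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return beta

def ls_predict(beta, X):
    """Evaluate Y_hat = beta_0 + sum_j X_j beta_j."""
    return np.column_stack([np.ones(len(X)), X]) @ beta

def knn_predict(X_train, y_train, x0, k=15):
    """k-nearest-neighbor prediction: average the responses of the k training
    points closest (in Euclidean distance) to x0."""
    dist = np.linalg.norm(X_train - x0, axis=1)
    return y_train[np.argsort(dist)[:k]].mean()
```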
2.4 Statistical Decision Theory
2.4.1 L_2 Loss Function

We seek a function f(X) to predict Y, a real-valued random output variable, given values of the input X. If the squared error loss L(Y, f(X)) = (Y - f(X))^2 is used, the criterion for choosing f is

EPE(f) = E(Y - f(X))^2 = ∫ [y - f(x)]^2 P(dx, dy) = E_X E_{Y|X}[ (Y - f(X))^2 | X ],

and it suffices to minimize EPE pointwise,

f(x) = argmin_c E_{Y|X}[ (Y - c)^2 | X = x ],   (2.1)

which leads to

f(x) = E(Y | X = x),   (2.2)

the conditional expectation.
Solution:
  LS: assume the regression function is approximately linear in its arguments, f(x) ≈ x^T β; then β = [E(XX^T)]^{-1} E(XY) is the least squares solution, obtained by replacing the expectations by averages over the training data.
  NN: f̂(x) = Ave(y_i | x_i ∈ N_k(x)), which means the expectation is approximated by averaging over sample data, and conditioning on a point is approximated by conditioning on some region "close" to the target point.
Interpretation:
  LS: f(x) is approximated by a globally linear function.
  NN: f(x) is approximated by a locally constant function.

Table 3: Squared Error Loss Function.
2.4.2 L_1 Loss Function

If the loss function is

L(Y, f(X)) = |Y - f(X)|,   (2.3)

the solution is the conditional median,

f̂(x) = median(Y | X = x),   (2.4)

which has discontinuities in its derivatives, and this hinders its widespread use.
2.4.3 0-1 Loss Function for Categorical Output Variable G
Ĝ(x) = G_k  if  P(G_k | X = x) = max_{g ∈ G} P(g | X = x),   (2.5)

which is known as the Bayes classifier: it classifies to the most probable class based on the conditional distribution P(G | X).
2.5 Local Methods in High Dimensions
Curse of dimensionality:
- Neighborhoods are no longer "local".
- All sample points are close to an edge of the sample.
- The sampling density is proportional to N^{1/p}, where p is the dimension of the input space and N is the sample size.
2.6 Statistical Models, Supervised Learning and Function Approximation
2.6.1 Statistical Model for the Joint Distribution P (X, Y )
Y = f(X) + , (2.6)
where is the random error with E() = 0 and independent of X. In this
model, f(x) = E(Y |X = x) and the conditional distribution of P (Y |X)
depneds on X only through the conditional mean of f (x).
2.6.2 Supervised Learning
Suppose the model in (2.6) is reasonable. Supervised learning attempts to learn f by example through a teacher: the algorithm can modify its input/output relationship in response to the differences y_i - f̂(x_i), which is known as learning by example.
2.6.3 Function Approximation
- In machine learning (with analogies to human reasoning) and neural networks (with biological analogies to the brain), the learning paradigm above has been the motivation for research.
- In applied mathematics and statistics, the approach taken is from the perspective of function approximation and estimation.

The data pairs {x_i, y_i} are viewed as points in (p+1)-dimensional Euclidean space with the relation y_i = f(x_i) + ε_i; the goal is to obtain a useful approximation to f(x) for all x in some region.

Treating supervised learning as a problem in function approximation encourages the geometrical concepts of Euclidean spaces and the mathematical concepts of probabilistic inference to be applied to the problem. This is the approach taken by the ESL book.
2.7 Structured Regression Models
Under the RSS criterion with an arbitrary function f, any function f̂ that passes through the training points is a solution. In order to obtain useful results for a finite sample of size N, we must restrict the class of eligible solutions; this is a major topic of the ESL book.

Any method that attempts to produce locally varying functions in small isotropic neighborhoods will run into problems in high dimensions because of the curse of dimensionality.

All methods that overcome the dimensionality problems have an associated, often implicit or adaptive, metric for measuring neighborhoods, which basically does not allow the neighborhood to be simultaneously small in all directions.
2.8 Classes of Restricted Estimators
2.8.1 Roughness Penalty and Bayesian Methods
PRSS(f, λ) = RSS(f) + λJ(f).   (2.7)

Penalty-function, or regularization, methods express our prior belief that the functions we seek exhibit a certain type of smooth behavior. They can be cast in a Bayesian framework by treating λJ(f) as the negative log-prior and PRSS(f, λ) as the negative log-posterior (up to constants); minimizing PRSS then amounts to finding the posterior mode.
2.8.2 Kernel Methods and Local Regression
In general, a local regression estimate of f(x_0) is defined as f_{θ̂}(x_0), where θ̂ minimizes

RSS(f_θ, x_0) = ∑_{i=1}^{N} K_λ(x_0, x_i) (y_i - f_θ(x_i))^2.   (2.8)

When f_θ(x) = θ_0, a constant, the solution is

f_{θ̂}(x_0) = ∑_i K_λ(x_0, x_i) y_i / ∑_i K_λ(x_0, x_i),   (2.9)

the Nadaraya-Watson kernel-weighted average. K_λ(x_0, x) is the kernel function, which assigns weights to points x in a region around x_0. When f_θ(x) = θ_0 + θ_1 x, we get local linear regression.

Nearest-neighbor methods can be thought of as kernel methods with a data-dependent metric:

K_k(x_0, x) = I(||x - x_0|| ≤ ||x_{(k)} - x_0||),   (2.10)

is the metric for k-nearest neighbors, where x_{(k)} is the training observation k-th closest in distance to x_0.
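A minimal one-dimensional sketch of the locally constant estimate (2.9); the Epanechnikov kernel and the bandwidth name lam are my illustrative choices, not specified in the notes at this point (the sketch assumes at least one training point receives positive weight):

```python
import numpy as np

def epanechnikov(t):
    """Compactly supported kernel: 0.75 * (1 - t^2) for |t| <= 1, else 0."""
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def nadaraya_watson(x0, x_train, y_train, lam=0.5):
    """Locally constant kernel estimate, Eq. (2.9): weighted average of the y_i
    with weights K_lambda(x0, x_i)."""
    w = epanechnikov((x_train - x0) / lam)
    return np.sum(w * y_train) / np.sum(w)
```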
2.8.3 Basis Functions and Dictionary Methods
1. Basis functions: the model for f is a linear expansion of basis functions,

   f_θ(x) = ∑_{m=1}^{M} θ_m h_m(x),   (2.11)

   where each h_m is a function of the input x. Radial basis functions are symmetric p-dimensional kernels located at particular centroids,

   f_θ(x) = ∑_{m=1}^{M} θ_m K_{λ_m}(μ_m, x).   (2.12)

2. Adaptively chosen basis function methods are known as dictionary methods: there is a possibly infinite set, or dictionary, D of candidate basis functions to choose from, and models are built up by employing some search mechanism. For example, a single-layer feed-forward neural network with linear output weights can be thought of as an adaptive basis function method (a tiny evaluation sketch follows this list),

   f_θ(x) = ∑_{m=1}^{M} β_m σ(α_m^T x + b_m),   (2.13)

   with σ(x) = 1/(1 + exp(-x)).
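A tiny sketch of evaluating the adaptive basis expansion (2.13); fitting the parameters (the search mechanism) is the hard part and is not shown, and the array shapes are my own assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dictionary_predict(x, alphas, bs, betas):
    """Evaluate Eq. (2.13): f(x) = sum_m beta_m * sigma(alpha_m^T x + b_m).
    alphas is an (M, p) array of directions; bs and betas are length-M arrays."""
    return float(betas @ sigmoid(alphas @ x + bs))
```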
2.9 Model Selection and the Bias-Variance Tradeoff

We would like to choose the model complexity that trades off bias against variance so as to minimize the test error.
3 Linear Methods for Regression
3.1 Introduction
Linear models can sometimes outperform fancier nonlinear models, especially in situations with a small number of training cases, a low signal-to-noise ratio, or sparse data. Also, linear models can be applied to transformations of the inputs, which considerably expands their scope.
3.2 Linear Regression Models and Least Squares
- Input vector X^T = (X_1, X_2, ..., X_p), output Y.
- Linear regression model: f(X) = β_0 + ∑_{j=1}^{p} X_j β_j.
- Estimation method: least squares,

  RSS(β) = ∑_{i=1}^{N} [y_i - f(x_i)]^2,   (3.1)

  or in matrix form,

  RSS(β) = (Y - Xβ)^T (Y - Xβ).   (3.2)

  The criterion is reasonable if the training observations represent independent random draws from the population, or if the y_i's are conditionally independent given the inputs.
- The solution is β̂ = (X^T X)^{-1} X^T Y, obtained by solving the normal equations X^T (Y - Xβ) = 0.
- The fitted values are Ŷ = Xβ̂ = X(X^T X)^{-1} X^T Y.
- The normal equations show that we minimize RSS by choosing β̂ so that the residual vector is orthogonal to the column space of X.
- A Z-score can be used to test the hypothesis that a particular β_j = 0.
- An F statistic can be used to test a group of coefficients simultaneously.
3.2.1 Gauss-Markov Theorem
Consider estimating any linear combination of the parameters, θ = a^T β. With β̂ the least squares estimator, θ̂ = a^T β̂ is unbiased: E(a^T β̂) = a^T β. The Gauss-Markov theorem then states that for any other linear unbiased estimator c^T Y,

Var(a^T β̂) ≤ Var(c^T Y).   (3.3)

The theorem implies that the least squares estimator has the smallest mean squared error of all linear estimators with NO bias.

From a pragmatic point of view, most models are distortions of the truth, and hence are biased; picking the right model amounts to creating the right balance between bias and variance.
3.2.2 Multiple Regression from Simple Univariate Regression
Given inputs x_1, ..., x_p, we say "regress b on a" to indicate a univariate regression of b on a without intercept. The Gram-Schmidt procedure for multiple regression is:

Algorithm:
1. Initialize z_0 = x_0 = 1.
2. For j = 1, ..., p:
   - For l = 0, ..., j-1, regress x_j on z_l to obtain γ̂_{lj} = ⟨z_l, x_j⟩ / ⟨z_l, z_l⟩.
   - Compute the residual vector z_j = x_j - ∑_{l=0}^{j-1} γ̂_{lj} z_l.
3. Regress y on z_p to obtain β̂_p = ⟨z_p, y⟩ / ⟨z_p, z_p⟩.

The z_j are orthogonal and form a basis for the column space of the inputs X. Rewriting Step 2 in matrix form,

X = ZΓ = ZD^{-1} · DΓ = Q · R,   (3.4)

where D is a diagonal matrix with D_{jj} = ||z_j||, Γ is an upper triangular matrix with diagonal elements equal to 1 and (i, j)-th off-diagonal element γ̂_{i-1, j-1}, and R = DΓ is also upper triangular.
The QR decomposition of X provides a convenient orthogonal basis for the column space of X, so that

Q^T Q = I,   (3.5)
β̂ = R^{-1} Q^T y,   (3.6)
ŷ = Q Q^T y.   (3.7)

A comparison that anticipates the next two sections: subset selection exhibits high variance because variables are selected by a discrete process, while shrinkage methods are more continuous and do not suffer as much from high variability.
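A small sketch of least squares through the QR decomposition, Eqs. (3.4)-(3.7); X is assumed to have full column rank and to already include an intercept column if one is wanted:

```python
import numpy as np

def ls_via_qr(X, y):
    """Least squares via X = QR: beta_hat = R^{-1} Q^T y, y_hat = Q Q^T y."""
    Q, R = np.linalg.qr(X)              # reduced QR, so Q^T Q = I
    beta = np.linalg.solve(R, Q.T @ y)  # back-substitution on the triangular R
    y_hat = Q @ (Q.T @ y)               # projection onto the column space of X
    return beta, y_hat
```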
3.3 Subset Selection
Least squares estimates have low bias but can have large variance (prediction accuracy), and a smaller subset exhibiting the strongest effects is preferred when there are a large number of predictors (interpretation). Hence subset selection.

1. Best subset selection: find the subset of size k that gives the smallest RSS.

2. Forward- and backward-stepwise selection:
   - Forward stepwise: start with the intercept, then sequentially add to the model the predictor that most improves the fit. A greedy algorithm that produces a nested sequence of models.
   - Backward stepwise: start with the full model, then sequentially delete the predictor that has the least impact on the fit. Can only be used when N > p.

3. Forward stagewise regression: more constrained than forward stepwise. Start with the intercept, centered predictors, and all coefficients equal to zero. At each step the algorithm identifies the variable most correlated with the current residual, regresses the residual on that variable, and adds the resulting coefficient to the variable's current coefficient. Stop when none of the variables is correlated with the current residual.
3.4 Shrinkage Methods
3.4.1 Ridge Regression
1. Assumption: the response will vary most in the directions of high variance of the inputs.

2. Purpose: when there are many correlated variables in a linear regression model, their coefficients can become poorly determined and exhibit high variance: a large positive coefficient on one variable can be canceled by a similarly large negative coefficient on its correlated cousin.

3. X needs to be standardized, as the solutions are not equivariant under scaling of the inputs.

4. Lagrangian form:

   β̂^ridge = argmin_β ∑_{i=1}^{N} ( y_i - β_0 - ∑_{j=1}^{p} x_{ij} β_j )^2 + λ ∑_{j=1}^{p} β_j^2,  λ ≥ 0.   (3.8)

5. Or, equivalently,

   β̂^ridge = argmin_β ∑_{i=1}^{N} ( y_i - β_0 - ∑_{j=1}^{p} x_{ij} β_j )^2,  subject to ∑_{j=1}^{p} β_j^2 ≤ t.   (3.9)

   There is a one-to-one correspondence between λ and t.

6. Matrix form:

   RSS(λ) = (y - Xβ)^T (y - Xβ) + λ β^T β.   (3.10)

7. Solution (see the sketch after this list):

   β̂^ridge = (X^T X + λI)^{-1} X^T y.   (3.11)

   The intercept estimate is β̂_0 = ȳ.

8. Bayesian interpretation: suppose the y_i are independent normal with mean β_0 + x_i^T β and variance σ^2, and each β_j has prior N(0, τ^2); then the negative log-posterior is equivalent to (3.8) with λ = σ^2/τ^2, and the ridge estimate is the mode of the posterior distribution.

9. Relationship to the least squares estimate: with orthonormal inputs such that X^T X = I, β̂^ridge = β̂/(1 + λ), so the ridge estimates are a scaled version of the least squares estimates.

10. Interpretation of ridge shrinkage via the SVD X = UDV^T:

    X β̂^ridge = X(X^T X + λI)^{-1} X^T y   (3.12)
              = ∑_{j=1}^{p} u_j [ d_j^2 / (d_j^2 + λ) ] u_j^T y,   (3.13)

    so ridge regression computes the coordinates of y with respect to the orthonormal basis U and then shrinks these coordinates by the factors d_j^2/(d_j^2 + λ): the smaller d_j^2, the greater the shrinkage applied to the coordinate of the associated basis vector. A smaller d_j corresponds to a direction in the column space of X with smaller variance.

11. Effective degrees of freedom: df(λ) = tr(X(X^T X + λI)^{-1} X^T) = ∑_{j=1}^{p} d_j^2/(d_j^2 + λ).

12. The hyperparameter λ can be determined by cross validation.
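The sketch referenced in item 7: the closed-form ridge solution (3.11) and the effective degrees of freedom of item 11, assuming the columns of X are centered and standardized (so the intercept is simply ȳ) and that λ ≥ 0:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge solution, Eq. (3.11): beta = (X^T X + lam*I)^{-1} X^T y,
    with the intercept estimated separately as the mean of y."""
    p = X.shape[1]
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    return y.mean(), beta

def ridge_df(X, lam):
    """Effective degrees of freedom (item 11): sum_j d_j^2 / (d_j^2 + lam)."""
    d = np.linalg.svd(X, compute_uv=False)
    return np.sum(d**2 / (d**2 + lam))
```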
3.4.2 The Lasso (basis pursuit in signal processing)
1. Uses an L_1 penalty.

2. X needs to be standardized, as the solutions are not equivariant under scaling of the inputs.

3. Lagrangian form:

   β̂^lasso = argmin_β (1/2) ∑_{i=1}^{N} ( y_i - β_0 - ∑_{j=1}^{p} x_{ij} β_j )^2 + λ ∑_{j=1}^{p} |β_j|,  λ ≥ 0.   (3.14)

4. Or, equivalently,

   β̂^lasso = argmin_β ∑_{i=1}^{N} ( y_i - β_0 - ∑_{j=1}^{p} x_{ij} β_j )^2,  subject to ∑_{j=1}^{p} |β_j| ≤ t.   (3.15)

5. The L_1 penalty makes the solution nonlinear in the y_i, and there is no closed-form expression.

6. Bayesian interpretation: suppose the y_i are independent normal with mean β_0 + x_i^T β and variance σ^2, and each β_j has an independent double-exponential (Laplace) prior with density (1/(2τ)) exp(-|β_j|/τ), τ = 1/λ; then the negative log-posterior is equivalent to (3.14), and the lasso estimate is the mode of the posterior distribution.

7. Relationship to the least squares estimate: with orthonormal inputs such that X^T X = I, β̂_j^lasso = sign(β̂_j)(|β̂_j| - λ)_+; this can be derived by considering the possible signs of β̂_j and β̂_j^lasso separately (see the soft-thresholding sketch below).
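The soft-thresholding operation of item 7 as a one-line sketch; here b stands for the vector of least squares coefficients under orthonormal inputs:

```python
import numpy as np

def soft_threshold(b, lam):
    """Lasso estimate under orthonormal inputs: sign(b) * (|b| - lam)_+."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)
```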
3.4.3 Comparison of Subset Selection, Ridge, Lasso and Other Penalty Forms

1. Best subset: drops all variables with coefficients smaller than the M-th largest ("hard thresholding").

2. Ridge: does a proportional shrinkage.

3. Lasso: translates each coefficient by a constant factor λ, truncating at zero ("soft thresholding").

Figure 1 (reproduced from the ESL book) shows the estimation picture for the lasso (left) and ridge (right) in the two-dimensional case, and explains why there are more opportunities for the lasso to produce parameter estimates that are exactly zero (more corners, flat edges and faces).

[Figure 1: Estimation picture of the lasso (left) and ridge (right) in two dimensions. Plots are from the ESL book.]

Generalized penalized regression:

β̂ = argmin_β (1/2) ∑_{i=1}^{N} ( y_i - β_0 - ∑_{j=1}^{p} x_{ij} β_j )^2 + λ ∑_{j=1}^{p} |β_j|^q,  λ ≥ 0, q ≥ 0.   (3.16)

- q = 0: subset selection.
- q ≤ 1: the prior is not uniform in direction, but concentrates more mass in the coordinate directions; q = 1 is the lasso.
- q = 2: ridge.
- q ∈ (1, 2): a compromise between ridge and lasso.
- The elastic net is another compromise between ridge and lasso, with penalty λ ∑_{j=1}^{p} ( α β_j^2 + (1 - α)|β_j| ). The elastic net selects variables like the lasso, and shrinks together the coefficients of correlated predictors like ridge. It also has considerable computational advantages over the L_q penalties. See Section 18.4 for more details.

3.4.4 Least Angle Regression (LAR)

1. LAR is a kind of "democratic" version of forward stepwise selection. Like forward stepwise, it first identifies the variable most correlated with the response, but it only enters "as much" of the predictor as it deserves.

2. X needs to be standardized.

3. The directions chosen by LAR to adjust the coefficients keep the correlations (of the active predictors with the residual) tied and decreasing.

4. The name "least angle" comes from a geometrical interpretation of the process: u_k = X_{A_k} δ_k makes the smallest (and equal) angle with each of the predictors in A_k. See details in Table 4.

Algorithm (LAR):

1. Standardize the predictors to have mean zero and unit norm. Start with the residual r = y - ȳ and β_1 = β_2 = ... = β_p = 0.

At each step k:

2. Find the predictor x_j most correlated with r and add it to the active set A_k, so there are k-1 nonzero coefficients plus the one that just entered at zero.

3. Move β_{A_k} toward its joint least squares coefficient, δ_k = (X_{A_k}^T X_{A_k})^{-1} X_{A_k}^T r_k, via β_{A_k}(α) = β_{A_k} + α δ_k, where the current residual is r_k = y - X_{A_k} β_{A_k}. The move stops when some competitor x_l has as much correlation with the current residual as x_j.

4. Continue until all p predictors have been entered. After min(N-1, p) steps, we arrive at the full least squares solution.

Table 4: LAR.
3.5 Methods Using Derived Input Directions

3.5.1 Principal Components Regression (PCR)

1. X needs to be standardized, as the solutions are not equivariant under scaling of the inputs.

2. Use derived input columns z_m = X v_m, where the v_m are the columns of V in the SVD X = UDV^T; V contains orthonormal eigenvectors of X^T X, so V^T V = I.

3. Regression form (see the sketch below):

   ŷ^pcr(M) = [1, X] ( ȳ, ∑_{m=1}^{M} θ̂_m v_m )^T = ȳ1 + ∑_{m=1}^{M} θ̂_m z_m,   (3.17)

   where θ̂_m = ⟨z_m, y⟩ / ⟨z_m, z_m⟩ and M ≤ p.
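The sketch referenced in item 3: PCR on standardized X and centered y, returning the coefficient vector β̂^pcr(M) = ∑_m θ̂_m v_m on the original inputs (the intercept is simply ȳ of the uncentered response):

```python
import numpy as np

def pcr_fit(X, y, M):
    """Principal components regression, Eq. (3.17): regress y on the first M
    derived inputs z_m = X v_m and map back to the original coordinates."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt.T[:, :M]
    Z = X @ V                                   # derived input columns z_m
    theta = (Z.T @ y) / np.sum(Z**2, axis=0)    # theta_m = <z_m, y> / <z_m, z_m>
    return V @ theta                            # beta_pcr = sum_m theta_m v_m
```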
3.5.2 Partial Least Squares (PLS)
1. X needs to be standardized, as the solutions are not equivariant under scaling of the inputs.

2. PLS seeks directions that have both high variance and high correlation with the response, whereas PCR keys only on high variance.

3. The m-th PLS direction φ̂_m solves

   max_α Corr^2(y, Xα) Var(Xα),  subject to ||α|| = 1, α^T S φ̂_l = 0, l = 1, ..., m-1.   (3.18)

4. The variance term in (3.18) tends to dominate, so PLS behaves much like ridge regression and principal components regression.

Algorithm 3.3 (PLS):

1. Standardize each predictor x_j to have mean zero and variance one. Set ŷ^(0) = ȳ1 and x_j^(0) = x_j, j = 1, ..., p.

2. For m = 1, ..., p:
   - Compute the partial least squares direction z_m = ∑_j φ̂_{mj} x_j^(m-1), where φ̂_{mj} = ⟨x_j^(m-1), y⟩; the inputs are weighted by the strength of their univariate effect on y.
   - Compute θ̂_m = ⟨z_m, y⟩ / ⟨z_m, z_m⟩.
   - Update ŷ^(m) = ŷ^(m-1) + θ̂_m z_m.
   - Prepare for the next step: orthogonalize each x_j^(m-1) with respect to z_m to obtain x_j^(m) = x_j^(m-1) - [ ⟨z_m, x_j^(m-1)⟩ / ⟨z_m, z_m⟩ ] z_m (i.e., keep only the residual of each x_j).

3. Output the sequence {ŷ^(m)}_1^p, where ŷ^(m) = ŷ^(0) + θ̂_1 z_1 + ... + θ̂_m z_m = X β̂^pls(m).

Table 6: PLS (Algorithm 3.3).
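A minimal sketch of Algorithm 3.3 on the training data, assuming the columns of X are standardized; it returns the fitted vector ŷ^(M):

```python
import numpy as np

def pls_fitted_values(X, y, M):
    """Partial least squares (Algorithm 3.3 sketch): build directions z_m from
    univariate effects, accumulate fitted values, then orthogonalize X."""
    Xm = X.copy().astype(float)
    y_hat = np.full(X.shape[0], y.mean())          # y_hat^(0) = ybar * 1
    for _ in range(M):
        phi = Xm.T @ y                             # phi_mj = <x_j^(m-1), y>
        z = Xm @ phi                               # z_m = sum_j phi_mj x_j^(m-1)
        theta = (z @ y) / (z @ z)                  # theta_m = <z_m, y> / <z_m, z_m>
        y_hat = y_hat + theta * z
        Xm = Xm - np.outer(z, (z @ Xm) / (z @ z))  # orthogonalize x_j w.r.t. z_m
    return y_hat
```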
Comparison of LAR and LASSO:

- Algorithm: LAR can be modified to implement the LASSO: if a nonzero coefficient hits zero in step 3, drop its variable from the active set and recompute the current joint least squares direction.
- Computation steps: LAR always takes p steps to reach the full least squares estimate; the LASSO path can have more than p steps.
- Stationarity conditions: for LAR, x_j^T(y - Xβ) = γ s_j for j in the active set, with |x_k^T(y - Xβ)| ≤ γ for all other k; for the LASSO, x_j^T(y - Xβ) = λ · sign(β_j) for active j. The two conditions coincide when the sign of β_j matches the sign of the inner product.
- Effective degrees of freedom (general definition for an adaptively fitted model: df(ŷ) = (1/σ^2) ∑_{i=1}^{N} Cov(ŷ_i, y_i)): for LAR, df is k after the k-th step of the procedure; for the LASSO, df approximately equals the number of nonzero predictors at any stage, and the approximation works best at the last model in the sequence that contains k predictors.

Table 5: A is the active set of variables at some stage of LAR (B the active set for the LASSO), and s_j = ±1 is the sign of the inner product in the stationarity condition.
3.6 Comparison of the Selection and Shrinkage Methods

1. Like ridge, PLS also tends to shrink the low-variance directions, but it can actually inflate some of the higher-variance directions, which makes PLS a little unstable.

2. Ridge vs. PCR: ridge regression shrinks the coefficients of the principal components, shrinking more in directions with smaller eigenvalues, while PCR discards the p - M smallest-eigenvalue components entirely.

3. To summarize, PLS, PCR and ridge regression tend to behave similarly. Ridge may be preferred because it shrinks smoothly rather than in discrete steps. The lasso falls somewhere between ridge and best subset selection, and enjoys some of the properties of each.
3.7 Multiple Outcome Shrinkage and Selection

To apply shrinkage in the multiple-output case, one can:

1. Apply a univariate technique individually to each outcome, which allows a different amount of regularization for each outcome.

2. Apply a univariate technique simultaneously to all outcomes.

3. Exploit correlations among the different responses, for example via canonical correlation analysis (CCA).

CCA and Reduced-Rank Regression

1. CCA finds a sequence of uncorrelated linear combinations Xv_m, m = 1, ..., M of the x_j, and a corresponding sequence of uncorrelated linear combinations Yu_m of the responses y_k, such that

   Corr^2(Yu_m, Xv_m)   (3.19)

   is maximized.
3.8 More on the Lasso and Related Path Algorithms
3.8.1 Incremental Forward Stagewise Regression (FS0)
Algorithm 3.4: Incremental Forward Stagewise Regression

1. Standardize each predictor x_j to have mean zero and unit norm. Set the residual r = y and β_1 = ... = β_p = 0.

2. Find the predictor x_j most correlated with r.

3. Update β_j ← β_j + δ_j and r ← r - δ_j x_j, where δ_j = ε · sign[⟨x_j, r⟩] for some small step size ε > 0.

4. Repeat steps 2 and 3 until the residuals are uncorrelated with all the predictors.

Remarks:

1. If δ_j = ⟨x_j, r⟩, the algorithm is exactly the forward stagewise procedure.

2. As ε → 0, Algorithm 3.4 gives the FS0 path, which often coincides with the lasso path (see the comparison below).

3. The algorithm plays an important role in nonlinear, adaptive methods like boosting (Chapters 10 and 16).

Comparison of LAR, LASSO and FS0: if the LAR coefficient profiles are monotone, then all three methods give identical results. If the profiles are not monotone but do not cross zero, then LAR and the lasso are identical. FS0 can be viewed as a monotone version of the lasso. A sketch of Algorithm 3.4 follows.
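A minimal sketch of Algorithm 3.4, assuming standardized predictors and centered y; eps plays the role of the small step size ε:

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, n_steps=5000):
    """Incremental forward stagewise regression (Algorithm 3.4 sketch)."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y.astype(float).copy()
    for _ in range(n_steps):
        corr = X.T @ r                       # inner products <x_j, r>
        j = np.argmax(np.abs(corr))          # predictor most correlated with r
        if np.abs(corr[j]) < 1e-8:           # residual uncorrelated with all x_j
            break
        delta = eps * np.sign(corr[j])
        beta[j] += delta
        r -= delta * X[:, j]
    return beta
```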
3.8.2 Piecewise-Linear Path Algorithms
3.8.3 The Dantzig Selector
Candes and Tao (2007) proposed the Dantzig selector as the solution of

min_β ||β||_1  subject to  ||X^T(y - Xβ)||_∞ ≤ s,   (3.20)

or equivalently,

min_β ||X^T(y - Xβ)||_∞  subject to  ||β||_1 ≤ t,   (3.21)

where ||·||_∞ is the L_∞ norm, the maximum absolute value of the components of the vector. The selector thus minimizes the maximum inner product of the current residual with all the predictors.
3.8.4 The Grouped Lasso
The grouped lasso shrinks and selects the members of a group together. It minimizes the convex criterion

min_{β ∈ R^p} ( ||y - β_0 1 - ∑_{l=1}^{L} X_l β_l||_2^2 + λ ∑_{l=1}^{L} √(p_l) ||β_l||_2 ),   (3.22)

where the p predictors are divided into L groups with p_l predictors in group l, and ||·||_2 is the Euclidean norm (the norm of a vector is zero only if all of its components are zero).
3.8.5 Further Properties of the Lasso
The lasso shrinkage causes the estimates of the nonzero coefficients to be biased towards zero. Ways to reduce this bias:

1. Run the lasso to identify the set of nonzero coefficients, and then fit an unrestricted linear model to the selected predictors.

2. Use the lasso to select the nonzero predictors, and then apply the lasso again to the selected predictors only; this is known as the relaxed lasso. (The variables in the second step have less "competition" from noise variables, so cross-validation tends to pick a smaller value of λ, and hence less penalty.)

3. Modify the penalty function so that larger coefficients are shrunken less severely, e.g. the smoothly clipped absolute deviation (SCAD) penalty; however, it is non-convex.

4. The adaptive lasso uses a weighted penalty ∑_{j=1}^{p} w_j |β_j| with w_j = 1/|β̂_j|^ν, where β̂_j is the OLS estimate and ν > 0. This approximates the |β|^q penalties.
3.8.6 Pathwise Coordinate Optimization (PCO)
Simple coordinate descent is an alternative to the LARS algorithm for computing the lasso solution. Idea: fix the penalty parameter λ in (3.14) and optimize successively over each parameter, holding the other parameters fixed at their current values.
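A rough sketch of this idea (my own minimal implementation, assuming centered y and predictors standardized to unit norm so that each coordinate update reduces to a single soft-thresholding step):

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Pathwise coordinate optimization for the lasso criterion (3.14):
    cycle over coordinates, soft-thresholding each one with the others fixed."""
    p = X.shape[1]
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]    # partial residual excluding j
            rho = X[:, j] @ r_j                       # <x_j, r_j>
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0)
    return beta
```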
4 Linear Models for Classification
4.1 Introduction
Suppose there are K classes labeled 1, 2, ..., K, and the fitted linear model for the k-th indicator response is f̂_k(x) = β̂_{k0} + β̂_k^T x. The decision boundary between classes k and l is the set where

f̂_k(x) = f̂_l(x),   (4.1)

or equivalently,

{x : (β̂_{k0} - β̂_{l0}) + (β̂_k - β̂_l)^T x = 0},   (4.2)

which is an affine set, or hyperplane.

To summarize, the regression approach is a member of a class of models that model discriminant functions δ_k(x) for each class (for example, the posterior probabilities Pr(G = k | X = x)) and then classify x to the class with the largest value of δ_k(x).

To generalize linear decision boundaries, we can expand the variable set to include squares and cross-products of the variables. Linear functions (linear decision boundaries) in the augmented space then map down to quadratic functions (quadratic decision boundaries) in the original space.
4.2 Linear Regression of an Indicator Matrix
Approach:

1. Suppose G has K classes with indicator responses Y = (Y_1, ..., Y_K).

2. Fit a linear regression based on the N observations: Ŷ = X(X^T X)^{-1} X^T Y. Denote B̂ = (X^T X)^{-1} X^T Y.

3. A new observation x is classified as Ĝ(x) = argmax_{k ∈ G} f̂_k(x), where f̂_k(x) is the k-th element of (1, x^T) B̂ (see the sketch at the end of this subsection).

Issues:

1. f̂_k(x) can be negative or greater than 1, which is not a good approximation of a probability.

2. When the number of classes is large, some classes can be masked by others.
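The sketch referenced in the approach above, assuming an intercept column is already included in X and the class labels are coded 0, ..., K-1 (both are my assumptions for illustration):

```python
import numpy as np

def indicator_regression_fit(X, g, K):
    """Regress an N x K indicator matrix Y on X: Bhat = (X^T X)^{-1} X^T Y."""
    Y = np.eye(K)[g]                              # indicator response matrix
    return np.linalg.solve(X.T @ X, X.T @ Y)

def indicator_regression_predict(x, Bhat):
    """Classify to the class with the largest fitted value f_k(x)."""
    return int(np.argmax(x @ Bhat))
```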
4.3 Linear Discriminant Analysis
Bayes' theorem gives

Pr(G = k | X = x) = f_k(x) π_k / ∑_{l=1}^{K} f_l(x) π_l.   (4.3)

Techniques based on models for the class densities:

- Linear and quadratic discriminant analysis use Gaussian densities.
- Mixtures of Gaussians allow for nonlinear decision boundaries (Section 6.8).
- General nonparametric density estimates for each class density allow the most flexibility (Section 6.6.2).
- Naive Bayes assumes that each class density is a product of marginal densities, i.e. the inputs are conditionally independent within each class.
Discriminant analysis: suppose each class density is multivariate Gaussian,

f_k(x) = (2π)^{-p/2} |Σ_k|^{-1/2} exp( -(1/2)(x - μ_k)^T Σ_k^{-1} (x - μ_k) ).   (4.4)

1. Linear discriminant analysis (LDA)

   (a) Assume all classes have a common covariance matrix, Σ_k = Σ for all k in (4.4).

   (b) The decision boundary between class k and class l is linear in x.

   (c) Linear discriminant functions, obtained from log(f_k(x) π_k):

       δ_k(x) = x^T Σ^{-1} μ_k - (1/2) μ_k^T Σ^{-1} μ_k + log π_k.   (4.5)

   (d) Equivalently, the decision rule is G(x) = argmax_k δ_k(x).

   (e) In practice the parameters of the Gaussian distributions are not known; they are estimated from the training data (see the sketch at the end of this subsection):
       - π̂_k = N_k / N.
       - μ̂_k = ∑_{g_i = k} x_i / N_k.
       - Σ̂ = ∑_{k=1}^{K} ∑_{g_i = k} (x_i - μ̂_k)(x_i - μ̂_k)^T / (N - K).

   (f) There is a correspondence between simple linear regression and LDA with two classes: the two-class LDA rule is driven by x^T Σ̂^{-1}(μ̂_1 - μ̂_2), and the least squares coefficient vector β̂ is proportional to this direction (derived from Equation 4.5), although the cut-points generally differ.

   (g) With more than two classes, LDA is not the same as linear regression, and it avoids the masking problems of linear regression.

2. Quadratic discriminant analysis (QDA)

   (a) The Σ_k are not assumed to be equal in (4.4).

   (b) Discriminant function:

       δ_k(x) = -(1/2) log|Σ_k| - (1/2)(x - μ_k)^T Σ_k^{-1} (x - μ_k) + log π_k.   (4.6)

   (c) The decision boundary between classes k and l is {x : δ_k(x) = δ_l(x)}.

3. To summarize, both LDA and QDA have a good track record. A likely reason is that the data can often only support simple decision boundaries such as linear or quadratic ones, and the estimates provided by the Gaussian models are stable. This is a bias-variance tradeoff: the bias of a simple boundary is accepted because it can be estimated with much lower variance than more exotic alternatives.

4. Regularized discriminant analysis (RDA)

   (a) A compromise between LDA and QDA is RDA,

       Σ̂_k(α) = α Σ̂_k + (1 - α) Σ̂,   (4.7)

       where Σ̂ is the pooled covariance matrix as in LDA and α ∈ [0, 1]. The value of α can be chosen based on the performance of the model on validation data, or by cross-validation.

   (b) Similarly, Σ̂ itself can be shrunk toward a scalar covariance,

       Σ̂(γ) = γ Σ̂ + (1 - γ) σ̂^2 I,   (4.8)

       where γ ∈ [0, 1]. See more on RDA in Chapter 12.

5. Computations for LDA can be simplified by diagonalizing Σ̂ (or Σ̂_k = U_k D_k U_k^T) and transforming the inputs as X* = D^{-1/2} U^T X, so that the transformed covariance matrix is the identity.

6. Reduced-rank LDA. Idea: the K centroids in the p-dimensional input space lie in an affine subspace of dimension at most K - 1, and if p is much larger than K this is a considerable drop in dimension. Fisher defined the optimal subspace (in our case, for LDA) to be the one in which the projected centroids are spread out as much as possible in terms of variance.

7. A note on the misclassification rate: the misclassification rate is driven by the area of overlap between the two class densities. Gaussian classification adds the log π_k correction factor to the distance calculation; moving the cut-point toward the smaller class can reduce (improve) the error rate.
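A small sketch of the LDA estimates of item 1(e) and the classification rule (4.5); the function names and the label handling are illustrative assumptions:

```python
import numpy as np

def lda_fit(X, y):
    """Estimate class priors, class means, and the pooled covariance matrix."""
    classes = np.unique(y)
    N, K = len(y), len(classes)
    priors = np.array([np.mean(y == k) for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    Sigma = sum((X[y == k] - means[i]).T @ (X[y == k] - means[i])
                for i, k in enumerate(classes)) / (N - K)
    return classes, priors, means, Sigma

def lda_predict(x, classes, priors, means, Sigma):
    """Classify x to the class with the largest discriminant score, Eq. (4.5)."""
    Sinv = np.linalg.inv(Sigma)
    scores = [x @ Sinv @ m - 0.5 * m @ Sinv @ m + np.log(p)
              for m, p in zip(means, priors)]
    return classes[int(np.argmax(scores))]
```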
4.4 Logistic Regression
1. Motivation: model the posterior probabilities of the K classes via linear functions in x, while ensuring that they sum to one and remain in [0, 1].

2. Usage: logistic regression models are used mostly as a data analysis and inference tool, where the goal is to understand the role of the input variables in explaining the outcome.

3. Model form:

   log[ Pr(G = 1 | X = x) / Pr(G = K | X = x) ] = β_{10} + β_1^T x
   log[ Pr(G = 2 | X = x) / Pr(G = K | X = x) ] = β_{20} + β_2^T x
   ...
   log[ Pr(G = K-1 | X = x) / Pr(G = K | X = x) ] = β_{(K-1)0} + β_{K-1}^T x,   (4.9)

   with class probabilities, for k = 1, 2, ..., K-1,

   Pr(G = k | X = x) = exp(β_{k0} + β_k^T x) / ( 1 + ∑_{l=1}^{K-1} exp(β_{l0} + β_l^T x) ) = p_k(x; θ),   (4.11)

   Pr(G = K | X = x) = 1 / ( 1 + ∑_{l=1}^{K-1} exp(β_{l0} + β_l^T x) ), and parameters θ = {β_{10}, β_1^T, ..., β_{(K-1)0}, β_{K-1}^T}.

4. Model fitting: maximize the log-likelihood, L(θ) = ∑_{i=1}^{N} log p_{g_i}(x_i; θ). Suitable algorithms for solving the resulting score equations include Newton-Raphson / iteratively reweighted least squares (IRLS) and coordinate-descent methods.

5. Quadratic approximations and inference. The ML estimates β̂ satisfy a self-consistency relationship: they are the coefficients of a weighted least squares fit with working responses

   z_i = x_i^T β̂ + (y_i - p̂_i) / ( p̂_i (1 - p̂_i) ),   (4.12)

   and weights w_i = p̂_i(1 - p̂_i) (see the sketch at the end of this subsection). This connection also gives:

   (a) The weighted sum of squares is the Pearson chi-square statistic,

       ∑_{i=1}^{N} (y_i - p̂_i)^2 / ( p̂_i (1 - p̂_i) ),   (4.13)

       a quadratic approximation to the deviance.

   (b) β̂ is asymptotically consistent (it converges to the true β) provided the model is correct.

   (c) The Rao score test and the Wald test can be used to test inclusion or exclusion of a term efficiently, without refitting the logistic model each time.

6. L_1-regularized logistic regression. To do variable selection and shrinkage, maximize a penalized log-likelihood with no penalty on the intercept; for two classes,

   max_{β_0, β} { ∑_{i=1}^{N} [ y_i(β_0 + β^T x_i) - log(1 + e^{β_0 + β^T x_i}) ] - λ ∑_{j=1}^{p} |β_j| }.   (4.14)

7. Logistic regression vs. LDA. From the LDA assumption in Equation 4.4 one can derive that

   log[ Pr(G = k | X = x) / Pr(G = K | X = x) ] = α_{k0} + α_k^T x,   (4.15)

   where α_{k0}, α_k are parameters depending on π_k, π_K, μ_k, μ_K and Σ. Equation 4.15 is linear in x because of the Gaussian and common-covariance assumptions, and it appears to be the same as Equation 4.9; however, the linear coefficients are estimated differently.

   (a) The logistic regression model is more general, as it makes fewer assumptions.

   (b) Logistic regression fits the parameters by maximizing the conditional likelihood of Pr(G = k | X), leaving the marginal density of X arbitrary. LDA fits the parameters by maximizing the full log-likelihood based on the joint density,

       Pr(X, G = k) = φ(X; μ_k, Σ) π_k,   (4.16)

       where the marginal density is the mixture density

       Pr(X) = ∑_{k=1}^{K} φ(X; μ_k, Σ) π_k.   (4.17)

       The extra restriction on the marginal density of X provides more information about the parameters and hence more efficient estimation, provided the f_k(x) really are Gaussian. Also, because of the mixture formulation, observations without class labels carry information about the parameters; by relying on these strong model assumptions, that information can also be used.

   (c) Comparison: it is generally felt that logistic regression is a safer, more robust bet than LDA, since it relies on fewer assumptions.
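The sketch referenced in item 5: two-class logistic regression fitted by Newton-Raphson / IRLS, where each step is a weighted least squares fit with the working response of Eq. (4.12); X is assumed to include an intercept column and y is coded 0/1:

```python
import numpy as np

def logistic_irls(X, y, n_iter=25):
    """Fit a two-class logistic regression by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))       # fitted probabilities
        p = np.clip(p, 1e-10, 1 - 1e-10)          # guard against zero weights
        W = p * (1 - p)                           # IRLS weights w_i
        z = X @ beta + (y - p) / W                # working response, Eq. (4.12)
        # weighted least squares step: beta = (X^T W X)^{-1} X^T W z
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return beta
```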
4.5 Separating Hyperplanes
Perceptrons are classifiers that compute a linear combination of the input features and return its sign; the separating hyperplane is

{x : β̂_0 + x^T β̂ = 0}.   (4.18)
4.5.1 Rosenblatt’s Perceptron Learning Algorithm
1. Goal: the perceptron learning algorithm tries to find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary.

2. Separating hyperplane classifiers try to separate the data into different classes by constructing linear decision boundaries explicitly. They provide the basis for the support vector classifiers of Chapter 12.

3. Suppose the response y is coded as ±1; then a misclassified point with y_i = 1 satisfies x_i^T β + β_0 < 0. The goal is to minimize

   D(β, β_0) = - ∑_{i ∈ M} y_i (x_i^T β + β_0),   (4.19)

   where M indexes the set of misclassified points. Stochastic gradient descent is used for the minimization (see the sketch after this list).

   Problems with this algorithm:

   - When the data are separable, there are many solutions. Remedy: add additional constraints on the separating hyperplane.
   - The "finite" number of steps to convergence can be very large. Remedy: seek a hyperplane in a much enlarged space obtained by creating many basis-function transformations of the original inputs. Perfect separation, however, cannot always be achieved (two different classes may share the same input, over-fitting, etc.).
   - When the data are not separable, the algorithm will not converge.
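The sketch referenced in item 3: stochastic-gradient updates for criterion (4.19), with y coded ±1 and rho a learning-rate parameter (an illustrative choice):

```python
import numpy as np

def perceptron_train(X, y, rho=1.0, n_epochs=100):
    """Rosenblatt perceptron sketch: update (beta, beta0) at each misclassified point."""
    beta = np.zeros(X.shape[1])
    beta0 = 0.0
    for _ in range(n_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ beta + beta0) <= 0:      # misclassified point
                beta += rho * yi * xi
                beta0 += rho * yi
                mistakes += 1
        if mistakes == 0:                          # data separated: stop early
            break
    return beta, beta0
```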
4.5.2 Optimal Separating Hyperplanes (OSH)
1. The OSH separates the two classes and maximizes the distance to the closest point from either class (Vapnik, 1996).

2. Intuition: a large margin on the training data will lead to good separation on the test data.

3. Optimization:

   max_{β_0, β, ||β|| = 1} M,  subject to  y_i(x_i^T β + β_0) ≥ M, i = 1, ..., N.   (4.20)

   Dropping the constraint ||β|| = 1 and setting ||β|| = 1/M, Equation 4.20 can be rewritten as

   min_{β, β_0} (1/2) ||β||^2,  subject to  y_i(x_i^T β + β_0) ≥ 1, i = 1, ..., N.   (4.21)

   The Lagrange (primal) function, to be minimized with respect to β and β_0, is

   L_P = (1/2) ||β||^2 - ∑_{i=1}^{N} α_i [ y_i(x_i^T β + β_0) - 1 ].   (4.22)

   Setting the derivatives with respect to β and β_0 to zero gives constraints on β and the α_i. To solve for the α values, use the Wolfe dual:

   - Set h(α) = min_{β, β_0} L_P = min_{β, β_0} L(β, β_0, α).
   - To find α, solve max_α h(α) subject to α_i ≥ 0.
   - This is equivalent to solving the dual problem max_{α, β, β_0} L(β, β_0, α) subject to α_i ≥ 0, ∂L/∂β = 0 and ∂L/∂β_0 = 0.
4. Interpretation: since α_i [ y_i(x_i^T β + β_0) - 1 ] = 0 for all i,

   - if α_i > 0, then y_i(x_i^T β + β_0) = 1, i.e. x_i is on the boundary of the slab;
   - if y_i(x_i^T β + β_0) > 1, then x_i is not on the boundary of the slab and α_i = 0.

   Thus β = ∑_{i=1}^{N} α_i y_i x_i is a linear combination of the support points x_i, those on the boundary of the slab (with α_i > 0). The OSH then produces f̂(x) = x^T β̂ + β̂_0, whose sign is used to classify new observations.

A comparison of LDA, logistic regression and the OSH is summarized in Table 7.

LDA: depends on all of the data, even points far away from the decision boundary.

Logistic regression: also depends on all of the data; the coefficient vector is defined by a weighted least squares fit of a zero-mean linearized response on the input features, with larger weights for points near the decision boundary. When a separating hyperplane exists, logistic regression will always find it, by driving the log-likelihood toward 0.

OSH: focuses more on the points that count and is more robust to model misspecification, although the identification of the support points depends on all of the data.

Table 7: Summary of LDA, logistic regression and OSH.
5 Basis Expansion and Regularization
5.1 Introduction
6 Kernel Smoothing Methods
7 Model Assessment and Selection
8 Model Inference and Averaging
9 Additive Models, Trees, and Related Methods

This chapter discusses some specific methods for supervised learning, which assume a different structured form for the unknown regression function.
9.1 Generalized Additive Models
Generalized additive models use automatic flexible statistical methods to
identify and characterize non-linear regression effects.
General form:

g(μ(X)) = α + f_1(X_1) + ... + f_p(X_p),   (9.1)

where

- μ(X) is the conditional mean of the response Y;
- α is a constant;
- the f_j's are unspecified smooth (nonparametric) functions;
- g(·) is the link function, which can be the identity, logit, probit, or log-linear link. Other exponential family distributions, such as the gamma and negative binomial, can also be extended from generalized linear models to generalized additive models.

Fitting additive models: the additive model has the form

Y = α + ∑_{j=1}^{p} f_j(X_j) + ε,   (9.2)
and the criterion is

PRSS(α, f_1, ..., f_p) = ∑_{i=1}^{N} ( y_i - α - ∑_{j=1}^{p} f_j(x_{ij}) )^2 + ∑_{j=1}^{p} λ_j ∫ [f_j''(t_j)]^2 dt_j.   (9.3)

The minimizer is an additive cubic spline model, i.e. each f_j is a cubic spline in the component X_j.
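These notes do not spell out how (9.3) is minimized; in ESL this is done by the backfitting algorithm (Algorithm 9.1), which cycles a one-dimensional smoother over the partial residuals. A rough sketch with the smoother left abstract (in ESL it is a cubic smoothing spline; here it is a user-supplied function, which is my simplification):

```python
import numpy as np

def backfit_additive(X, y, smoother, n_iter=20):
    """Backfitting sketch for the additive model (9.2): repeatedly smooth the
    partial residuals against each X_j. `smoother(x, r)` returns fitted values."""
    N, p = X.shape
    alpha = y.mean()
    f = np.zeros((N, p))
    for _ in range(n_iter):
        for j in range(p):
            r = y - alpha - f.sum(axis=1) + f[:, j]   # partial residual for X_j
            fj = smoother(X[:, j], r)
            f[:, j] = fj - fj.mean()                  # center so alpha stays ybar
    return alpha, f
```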
10 Boosting and Additive Trees