
Notes on The Elements of Statistical Learning
Qianqian Shan
Contents

1 Introduction
2 Overview of Supervised Learning
  2.1 Introduction
  2.2 Variable Types
  2.3 Two Simple Approaches to Prediction: Least Squares (LS) and Nearest Neighbor (NN)
  2.4 Statistical Decision Theory
    2.4.1 L_2 Loss Function
    2.4.2 L_1 Loss Function
    2.4.3 0-1 Loss Function for Categorical Output Variable G
  2.5 Local Methods in High Dimensions
  2.6 Statistical Models, Supervised Learning and Function Approximation
    2.6.1 Statistical Model for the Joint Distribution P(X, Y)
    2.6.2 Supervised Learning
    2.6.3 Function Approximation
  2.7 Structured Regression Models
  2.8 Classes of Restricted Estimators
    2.8.1 Roughness Penalty and Bayesian Methods
    2.8.2 Kernel Methods and Local Regression
    2.8.3 Basis Functions and Dictionary Methods
  2.9 Model Selection and the Bias-Variance Tradeoff
3 Linear Methods for Regression
  3.1 Introduction
  3.2 Linear Regression Models and Least Squares
    3.2.1 Gauss-Markov Theorem
    3.2.2 Multiple Regression from Simple Univariate Regression
  3.3 Subset Selection
  3.4 Shrinkage Methods
    3.4.1 Ridge Regression
    3.4.2 The Lasso (basis pursuit in signal processing)
    3.4.3 Comparison of Subset Selection, Ridge, Lasso and Other Penalty Forms
    3.4.4 Least Angle Regression (LAR)
  3.5 Methods Using Derived Input Directions
    3.5.1 Principal Components Regression (PCR)
    3.5.2 Partial Least Squares (PLS)
  3.6 Comparison of the Selection and Shrinkage Methods
  3.7 Multiple Outcome Shrinkage and Selection
  3.8 More on the Lasso and Related Path Algorithms
    3.8.1 Incremental Forward Stagewise Regression (FS0)
    3.8.2 Piecewise-Linear Path Algorithms
    3.8.3 The Dantzig Selector
    3.8.4 The Grouped Lasso
    3.8.5 Further Properties of the Lasso
    3.8.6 Pathwise Coordinate Optimization (PCO)
4 Linear Models for Classification
  4.1 Introduction
  4.2 Linear Regression of an Indicator Matrix
  4.3 Linear Discriminant Analysis
  4.4 Logistic Regression
  4.5 Separating Hyperplanes
    4.5.1 Rosenblatt's Perceptron Learning Algorithm
    4.5.2 Optimal Separating Hyperplanes (OSH)
5 Basis Expansion and Regularization
  5.1 Introduction
6 Kernel Smoothing Methods
7 Model Assessment and Selection
8 Model Inference and Averaging
9 Additive Models, Trees, and Related Methods
  9.1 Generalized Additive Models
10 Boosting and Additive Trees
1 Introduction
A problem is called
- supervised learning when an outcome variable is available to guide the learning process;
- unsupervised learning when only the features are observed and there is no measurement of the outcome.
2 Overview of Supervised Learning
2.1 Introduction
Synonyms:
- Inputs are often called predictors in the statistical literature, or more classically independent variables; the term features is preferred in the pattern recognition literature.
- Outputs are called responses, or classically dependent variables.
2.2 Variable Types
Output Type       Quantitative    Qualitative       Ordered Categorical
Prediction Task   Regression      Classification    See Chapter 4

Table 1: Variable Types.
2.3 Two Simple Approaches to Prediction: Least Squares (LS) and Nearest Neighbor (NN)
Model:
  LS: Ŷ = β̂_0 + ∑_{j=1}^{p} X_j β̂_j.
  NN: Ŷ = (1/k) ∑_{x_i ∈ N_k(x)} y_i.
Solution:
  LS: find β̂ by minimizing the RSS.
Parameter:
  LS: β.   NN: the number of neighbors k.
Effective number of parameters:
  LS: the dimension of β.   NN: N/k, generally bigger than p (the dimension).
Assumption:
  LS: a linear decision boundary is appropriate (low variance and potentially high bias).
  NN: no stringent assumptions (subregions can be wiggly and unstable; high variance and low bias).
Comments:
  A large subset of the most popular techniques in use today are variants of these two simple procedures.

Table 2: LS and NN.
Ways to enhance these two procedures:
1. Kernel methods use weights that decrease smoothly with distance, replacing the 0/1 weights used by k-nearest-neighbor methods.
2. The distance kernels can be modified to emphasize some variables more than others in high-dimensional spaces.
3. Local regression fits linear models by locally weighted least squares.
4. Linear models fit to basis expansions of the original inputs allow arbitrarily complex models.
5. Projection pursuit and neural network models consist of sums of nonlinearly transformed linear models.
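For concreteness, here is a minimal NumPy sketch of the two baseline procedures of Table 2; the function names and the choice of Euclidean distance are illustrative assumptions, not from the notes:

```python
import numpy as np

def ls_fit(X, y):
    """Least squares: append an intercept column and solve for the coefficients."""
    Xb = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return beta

def ls_predict(beta, X):
    """Evaluate Y_hat = beta_0 + sum_j X_j beta_j."""
    return np.column_stack([np.ones(len(X)), X]) @ beta

def knn_predict(X_train, y_train, x0, k=15):
    """k-nearest-neighbor prediction: average the responses of the k training
    points closest (in Euclidean distance) to x0."""
    dist = np.linalg.norm(X_train - x0, axis=1)
    return y_train[np.argsort(dist)[:k]].mean()
```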
2.4 Statistical Decision Theory
2.4.1 L_2 Loss Function

We seek a function f(X) to predict Y, a real-valued random output variable, given values of the input X. If the squared error loss L(Y, f(X)) = (Y - f(X))^2 is used, the criterion for choosing f is

EPE(f) = E(Y - f(X))^2 = ∫ [y - f(x)]^2 P(dx, dy) = E_X E_{Y|X}[ (Y - f(X))^2 | X ],

and it suffices to minimize EPE pointwise,

f(x) = argmin_c E_{Y|X}[ (Y - c)^2 | X = x ],   (2.1)

which leads to

f(x) = E(Y | X = x),   (2.2)

the conditional expectation.
Solution:
  LS: assume the regression function is approximately linear in its arguments, f(x) ≈ x^T β; then β = [E(XX^T)]^{-1} E(XY) is the least squares solution, obtained by replacing the expectations by averages over the training data.
  NN: f̂(x) = Ave(y_i | x_i ∈ N_k(x)), which means the expectation is approximated by averaging over sample data, and conditioning on a point is approximated by conditioning on some region "close" to the target point.
Interpretation:
  LS: f(x) is approximated by a globally linear function.
  NN: f(x) is approximated by a locally constant function.

Table 3: Squared Error Loss Function.
2.4.2 L_1 Loss Function

If the loss function is

L(Y, f(X)) = |Y - f(X)|,   (2.3)

the solution is the conditional median,

f̂(x) = median(Y | X = x),   (2.4)

which has discontinuities in its derivatives, and this hinders its widespread use.
2.4.3 0-1 Loss Function for Categorical Output Variable G
Ĝ(x) = G_k  if  P(G_k | X = x) = max_{g ∈ G} P(g | X = x),   (2.5)

which is known as the Bayes classifier: it classifies to the most probable class based on the conditional distribution P(G | X).
2.5 Local Methods in High Dimensions
Curse of dimensionality:
- Neighborhoods are no longer "local".
- All sample points are close to an edge of the sample.
- The sampling density is proportional to N^{1/p}, where p is the dimension of the input space and N is the sample size.
2.6 Statistical Models, Supervised Learning and Function Approximation
2.6.1 Statistical Model for the Joint Distribution P (X, Y )
Y = f(X) + , (2.6)
where is the random error with E() = 0 and independent of X. In this
model, f(x) = E(Y |X = x) and the conditional distribution of P (Y |X)
depneds on X only through the conditional mean of f (x).
2.6.2 Supervised Learning
Suppose the model in (2.6) is reasonable. Supervised learning attempts to learn f by example through a teacher: the algorithm can modify its input/output relationship in response to the differences y_i - f̂(x_i), which is known as learning by example.
2.6.3 Function Approximation
- In machine learning (with analogies to human reasoning) and neural networks (with biological analogies to the brain), the learning paradigm above has been the motivation for research.
- In applied mathematics and statistics, the approach taken is from the perspective of function approximation and estimation.

The data pairs {x_i, y_i} are viewed as points in (p+1)-dimensional Euclidean space with the relation y_i = f(x_i) + ε_i; the goal is to obtain a useful approximation to f(x) for all x in some region.

Treating supervised learning as a problem in function approximation encourages the geometrical concepts of Euclidean spaces and the mathematical concepts of probabilistic inference to be applied to the problem. This is the approach taken by the ESL book.
2.7 Structured Regression Models
Under the RSS criterion with an arbitrary function f, any function f̂ that passes through the training points is a solution. In order to obtain useful results for a finite sample of size N, we must restrict the class of eligible solutions; this is a major topic of the ESL book.

Any method that attempts to produce locally varying functions in small isotropic neighborhoods will run into problems in high dimensions because of the curse of dimensionality.

All methods that overcome the dimensionality problems have an associated, often implicit or adaptive, metric for measuring neighborhoods, which basically does not allow the neighborhood to be simultaneously small in all directions.
2.8 Classes of Restricted Estimators
2.8.1 Roughness Penalty and Bayesian Methods
PRSS(f, λ) = RSS(f) + λJ(f).   (2.7)

Penalty-function, or regularization, methods express our prior belief that the functions we seek exhibit a certain type of smooth behavior. They can be cast in a Bayesian framework by treating λJ(f) as the negative log-prior and PRSS(f, λ) as the negative log-posterior (up to constants); minimizing PRSS then amounts to finding the posterior mode.
2.8.2 Kernel Methods and Local Regression
In general, a local regression estimate of f(x_0) is defined as f_{θ̂}(x_0), where θ̂ minimizes

RSS(f_θ, x_0) = ∑_{i=1}^{N} K_λ(x_0, x_i) (y_i - f_θ(x_i))^2.   (2.8)

When f_θ(x) = θ_0, a constant, the solution is

f_{θ̂}(x_0) = ∑_i K_λ(x_0, x_i) y_i / ∑_i K_λ(x_0, x_i),   (2.9)

the Nadaraya-Watson kernel-weighted average. K_λ(x_0, x) is the kernel function, which assigns weights to points x in a region around x_0. When f_θ(x) = θ_0 + θ_1 x, we get local linear regression.

Nearest-neighbor methods can be thought of as kernel methods with a data-dependent metric:

K_k(x_0, x) = I(||x - x_0|| ≤ ||x_{(k)} - x_0||),   (2.10)

is the metric for k-nearest neighbors, where x_{(k)} is the training observation k-th closest in distance to x_0.
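A minimal one-dimensional sketch of the locally constant estimate (2.9); the Epanechnikov kernel and the bandwidth name lam are my illustrative choices, not specified in the notes at this point (the sketch assumes at least one training point receives positive weight):

```python
import numpy as np

def epanechnikov(t):
    """Compactly supported kernel: 0.75 * (1 - t^2) for |t| <= 1, else 0."""
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def nadaraya_watson(x0, x_train, y_train, lam=0.5):
    """Locally constant kernel estimate, Eq. (2.9): weighted average of the y_i
    with weights K_lambda(x0, x_i)."""
    w = epanechnikov((x_train - x0) / lam)
    return np.sum(w * y_train) / np.sum(w)
```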
2.8.3 Basis Functions and Dictionary Methods
1. Basis functions: the model for f is a linear expansion of basis functions,

   f_θ(x) = ∑_{m=1}^{M} θ_m h_m(x),   (2.11)

   where each h_m is a function of the input x. Radial basis functions are symmetric p-dimensional kernels located at particular centroids,

   f_θ(x) = ∑_{m=1}^{M} θ_m K_{λ_m}(μ_m, x).   (2.12)

2. Adaptively chosen basis function methods are known as dictionary methods: there is a possibly infinite set, or dictionary, D of candidate basis functions to choose from, and models are built up by employing some search mechanism. For example, a single-layer feed-forward neural network with linear output weights can be thought of as an adaptive basis function method (a tiny evaluation sketch follows this list),

   f_θ(x) = ∑_{m=1}^{M} β_m σ(α_m^T x + b_m),   (2.13)

   with σ(x) = 1/(1 + exp(-x)).
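A tiny sketch of evaluating the adaptive basis expansion (2.13); fitting the parameters (the search mechanism) is the hard part and is not shown, and the array shapes are my own assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dictionary_predict(x, alphas, bs, betas):
    """Evaluate Eq. (2.13): f(x) = sum_m beta_m * sigma(alpha_m^T x + b_m).
    alphas is an (M, p) array of directions; bs and betas are length-M arrays."""
    return float(betas @ sigmoid(alphas @ x + bs))
```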
2.9 Model Selection and the Bias-Variance Tradeoff

We would like to choose the model complexity that trades off bias against variance so as to minimize the test error.
3 Linear Methods for Regression
3.1 Introduction
Linear models can sometimes outperform fancier nonlinear models, especially in situations with a small number of training cases, a low signal-to-noise ratio, or sparse data. Also, linear models can be applied to transformations of the inputs, which considerably expands their scope.
3.2 Linear Regression Models and Least Squares
- Input vector X^T = (X_1, X_2, ..., X_p), output Y.
- Linear regression model: f(X) = β_0 + ∑_{j=1}^{p} X_j β_j.
- Estimation method: least squares,

  RSS(β) = ∑_{i=1}^{N} [y_i - f(x_i)]^2,   (3.1)

  or in matrix form,

  RSS(β) = (Y - Xβ)^T (Y - Xβ).   (3.2)

  The criterion is reasonable if the training observations represent independent random draws from the population, or if the y_i's are conditionally independent given the inputs.
- The solution is β̂ = (X^T X)^{-1} X^T Y, obtained by solving the normal equations X^T (Y - Xβ) = 0.
- The fitted values are Ŷ = Xβ̂ = X(X^T X)^{-1} X^T Y.
- The normal equations show that we minimize RSS by choosing β̂ so that the residual vector is orthogonal to the column space of X.
- A Z-score can be used to test the hypothesis that a particular β_j = 0.
- An F statistic can be used to test a group of coefficients simultaneously.
3.2.1 Gauss-Markov Theorem
Consider estimating any linear combination of the parameters, θ = a^T β. With β̂ the least squares estimator, θ̂ = a^T β̂ is unbiased: E(a^T β̂) = a^T β. The Gauss-Markov theorem then states that for any other linear unbiased estimator c^T Y,

Var(a^T β̂) ≤ Var(c^T Y).   (3.3)

The theorem implies that the least squares estimator has the smallest mean squared error of all linear estimators with NO bias.

From a pragmatic point of view, most models are distortions of the truth, and hence are biased; picking the right model amounts to creating the right balance between bias and variance.
3.2.2 Multiple Regression from Simple Univariate Regression
Given inputs x_1, ..., x_p, we say "regress b on a" to indicate a univariate regression of b on a without intercept. The Gram-Schmidt procedure for multiple regression is:

Algorithm:
1. Initialize z_0 = x_0 = 1.
2. For j = 1, ..., p:
   - For l = 0, ..., j-1, regress x_j on z_l to obtain γ̂_{lj} = ⟨z_l, x_j⟩ / ⟨z_l, z_l⟩.
   - Compute the residual vector z_j = x_j - ∑_{l=0}^{j-1} γ̂_{lj} z_l.
3. Regress y on z_p to obtain β̂_p = ⟨z_p, y⟩ / ⟨z_p, z_p⟩.

The z_j are orthogonal and form a basis for the column space of the inputs X. Rewriting Step 2 in matrix form,

X = ZΓ = ZD^{-1} · DΓ = Q · R,   (3.4)

where D is a diagonal matrix with D_{jj} = ||z_j||, Γ is an upper triangular matrix with diagonal elements equal to 1 and (i, j)-th off-diagonal element γ̂_{i-1, j-1}, and R = DΓ is also upper triangular.
The QR decomposition of X provides a convenient orthogonal basis for the column space of X, so that

Q^T Q = I,   (3.5)
β̂ = R^{-1} Q^T y,   (3.6)
ŷ = Q Q^T y.   (3.7)

A comparison that anticipates the next two sections: subset selection exhibits high variance because variables are selected by a discrete process, while shrinkage methods are more continuous and do not suffer as much from high variability.
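A small sketch of least squares through the QR decomposition, Eqs. (3.4)-(3.7); X is assumed to have full column rank and to already include an intercept column if one is wanted:

```python
import numpy as np

def ls_via_qr(X, y):
    """Least squares via X = QR: beta_hat = R^{-1} Q^T y, y_hat = Q Q^T y."""
    Q, R = np.linalg.qr(X)              # reduced QR, so Q^T Q = I
    beta = np.linalg.solve(R, Q.T @ y)  # back-substitution on the triangular R
    y_hat = Q @ (Q.T @ y)               # projection onto the column space of X
    return beta, y_hat
```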
3.3 Subset Selection
Least squares estimates have low bias but can have large variance (prediction accuracy), and a smaller subset exhibiting the strongest effects is preferred when there are a large number of predictors (interpretation). Hence subset selection.

1. Best subset selection: find the subset of size k that gives the smallest RSS.

2. Forward- and backward-stepwise selection:
   - Forward stepwise: start with the intercept, then sequentially add to the model the predictor that most improves the fit. A greedy algorithm that produces a nested sequence of models.
   - Backward stepwise: start with the full model, then sequentially delete the predictor that has the least impact on the fit. Can only be used when N > p.

3. Forward stagewise regression: more constrained than forward stepwise. Start with the intercept, centered predictors, and all coefficients equal to zero. At each step the algorithm identifies the variable most correlated with the current residual, regresses the residual on that variable, and adds the resulting coefficient to the variable's current coefficient. Stop when none of the variables is correlated with the current residual.
3.4 Shrinkage Methods
3.4.1 Ridge Regression
1. Assumption: the response will vary most in the directions of high variance of the inputs.

2. Purpose: when there are many correlated variables in a linear regression model, their coefficients can become poorly determined and exhibit high variance: a large positive coefficient on one variable can be canceled by a similarly large negative coefficient on its correlated cousin.

3. X needs to be standardized, as the solutions are not equivariant under scaling of the inputs.

4. Lagrangian form:

   β̂^ridge = argmin_β ∑_{i=1}^{N} ( y_i - β_0 - ∑_{j=1}^{p} x_{ij} β_j )^2 + λ ∑_{j=1}^{p} β_j^2,  λ ≥ 0.   (3.8)

5. Or, equivalently,

   β̂^ridge = argmin_β ∑_{i=1}^{N} ( y_i - β_0 - ∑_{j=1}^{p} x_{ij} β_j )^2,  subject to ∑_{j=1}^{p} β_j^2 ≤ t.   (3.9)

   There is a one-to-one correspondence between λ and t.

6. Matrix form:

   RSS(λ) = (y - Xβ)^T (y - Xβ) + λ β^T β.   (3.10)

7. Solution (see the sketch after this list):

   β̂^ridge = (X^T X + λI)^{-1} X^T y.   (3.11)

   The intercept estimate is β̂_0 = ȳ.

8. Bayesian interpretation: suppose the y_i are independent normal with mean β_0 + x_i^T β and variance σ^2, and each β_j has prior N(0, τ^2); then the negative log-posterior is equivalent to (3.8) with λ = σ^2/τ^2, and the ridge estimate is the mode of the posterior distribution.

9. Relationship to the least squares estimate: with orthonormal inputs such that X^T X = I, β̂^ridge = β̂/(1 + λ), so the ridge estimates are a scaled version of the least squares estimates.

10. Interpretation of ridge shrinkage via the SVD X = UDV^T:

    X β̂^ridge = X(X^T X + λI)^{-1} X^T y   (3.12)
              = ∑_{j=1}^{p} u_j [ d_j^2 / (d_j^2 + λ) ] u_j^T y,   (3.13)

    so ridge regression computes the coordinates of y with respect to the orthonormal basis U and then shrinks these coordinates by the factors d_j^2/(d_j^2 + λ): the smaller d_j^2, the greater the shrinkage applied to the coordinate of the associated basis vector. A smaller d_j corresponds to a direction in the column space of X with smaller variance.

11. Effective degrees of freedom: df(λ) = tr(X(X^T X + λI)^{-1} X^T) = ∑_{j=1}^{p} d_j^2/(d_j^2 + λ).

12. The hyperparameter λ can be determined by cross validation.
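The sketch referenced in item 7: the closed-form ridge solution (3.11) and the effective degrees of freedom of item 11, assuming the columns of X are centered and standardized (so the intercept is simply ȳ) and that λ ≥ 0:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge solution, Eq. (3.11): beta = (X^T X + lam*I)^{-1} X^T y,
    with the intercept estimated separately as the mean of y."""
    p = X.shape[1]
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    return y.mean(), beta

def ridge_df(X, lam):
    """Effective degrees of freedom (item 11): sum_j d_j^2 / (d_j^2 + lam)."""
    d = np.linalg.svd(X, compute_uv=False)
    return np.sum(d**2 / (d**2 + lam))
```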
3.4.2 The Lasso (basis pursuit in signal processing)
1. Uses an L_1 penalty.

2. X needs to be standardized, as the solutions are not equivariant under scaling of the inputs.

3. Lagrangian form:

   β̂^lasso = argmin_β (1/2) ∑_{i=1}^{N} ( y_i - β_0 - ∑_{j=1}^{p} x_{ij} β_j )^2 + λ ∑_{j=1}^{p} |β_j|,  λ ≥ 0.   (3.14)

4. Or, equivalently,

   β̂^lasso = argmin_β ∑_{i=1}^{N} ( y_i - β_0 - ∑_{j=1}^{p} x_{ij} β_j )^2,  subject to ∑_{j=1}^{p} |β_j| ≤ t.   (3.15)

5. The L_1 penalty makes the solution nonlinear in the y_i, and there is no closed-form expression.

6. Bayesian interpretation: suppose the y_i are independent normal with mean β_0 + x_i^T β and variance σ^2, and each β_j has an independent double-exponential (Laplace) prior with density (1/(2τ)) exp(-|β_j|/τ), τ = 1/λ; then the negative log-posterior is equivalent to (3.14), and the lasso estimate is the mode of the posterior distribution.

7. Relationship to the least squares estimate: with orthonormal inputs such that X^T X = I, β̂_j^lasso = sign(β̂_j)(|β̂_j| - λ)_+; this can be derived by considering the possible signs of β̂_j and β̂_j^lasso separately (see the soft-thresholding sketch below).
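The soft-thresholding operation of item 7 as a one-line sketch; here b stands for the vector of least squares coefficients under orthonormal inputs:

```python
import numpy as np

def soft_threshold(b, lam):
    """Lasso estimate under orthonormal inputs: sign(b) * (|b| - lam)_+."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)
```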
3.4.3 Comparison of Subset Selection, Ridge, Lasso and Other Penalty Forms

1. Best subset: drops all variables with coefficients smaller than the M-th largest ("hard thresholding").

2. Ridge: does a proportional shrinkage.

3. Lasso: translates each coefficient by a constant factor λ, truncating at zero ("soft thresholding").

Figure 1 (reproduced from the ESL book) shows the estimation picture for the lasso (left) and ridge (right) in the two-dimensional case, and explains why there are more opportunities for the lasso to produce parameter estimates that are exactly zero (more corners, flat edges and faces).

[Figure 1: Estimation picture of the lasso (left) and ridge (right) in two dimensions. Plots are from the ESL book.]

Generalized penalized regression:

β̂ = argmin_β (1/2) ∑_{i=1}^{N} ( y_i - β_0 - ∑_{j=1}^{p} x_{ij} β_j )^2 + λ ∑_{j=1}^{p} |β_j|^q,  λ ≥ 0, q ≥ 0.   (3.16)

- q = 0: subset selection.
- q ≤ 1: the prior is not uniform in direction, but concentrates more mass in the coordinate directions; q = 1 is the lasso.
- q = 2: ridge.
- q ∈ (1, 2): a compromise between ridge and lasso.
- The elastic net is another compromise between ridge and lasso, with penalty λ ∑_{j=1}^{p} ( α β_j^2 + (1 - α)|β_j| ). The elastic net selects variables like the lasso, and shrinks together the coefficients of correlated predictors like ridge. It also has considerable computational advantages over the L_q penalties. See Section 18.4 for more details.

3.4.4 Least Angle Regression (LAR)

1. LAR is a kind of "democratic" version of forward stepwise selection. Like forward stepwise, it first identifies the variable most correlated with the response, but it only enters "as much" of the predictor as it deserves.

2. X needs to be standardized.

3. The directions chosen by LAR to adjust the coefficients keep the correlations (of the active predictors with the residual) tied and decreasing.

4. The name "least angle" comes from a geometrical interpretation of the process: u_k = X_{A_k} δ_k makes the smallest (and equal) angle with each of the predictors in A_k. See details in Table 4.

Algorithm (LAR):

1. Standardize the predictors to have mean zero and unit norm. Start with the residual r = y - ȳ and β_1 = β_2 = ... = β_p = 0.

At each step k:

2. Find the predictor x_j most correlated with r and add it to the active set A_k, so there are k-1 nonzero coefficients plus the one that just entered at zero.

3. Move β_{A_k} toward its joint least squares coefficient, δ_k = (X_{A_k}^T X_{A_k})^{-1} X_{A_k}^T r_k, via β_{A_k}(α) = β_{A_k} + α δ_k, where the current residual is r_k = y - X_{A_k} β_{A_k}. The move stops when some competitor x_l has as much correlation with the current residual as x_j.

4. Continue until all p predictors have been entered. After min(N-1, p) steps, we arrive at the full least squares solution.

Table 4: LAR.
3.5 Methods Using Derived Input Directions

3.5.1 Principal Components Regression (PCR)

1. X needs to be standardized, as the solutions are not equivariant under scaling of the inputs.

2. Use derived input columns z_m = X v_m, where the v_m are the columns of V in the SVD X = UDV^T; V contains orthonormal eigenvectors of X^T X, so V^T V = I.

3. Regression form (see the sketch below):

   ŷ^pcr(M) = [1, X] ( ȳ, ∑_{m=1}^{M} θ̂_m v_m )^T = ȳ1 + ∑_{m=1}^{M} θ̂_m z_m,   (3.17)

   where θ̂_m = ⟨z_m, y⟩ / ⟨z_m, z_m⟩ and M ≤ p.
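The sketch referenced in item 3: PCR on standardized X and centered y, returning the coefficient vector β̂^pcr(M) = ∑_m θ̂_m v_m on the original inputs (the intercept is simply ȳ of the uncentered response):

```python
import numpy as np

def pcr_fit(X, y, M):
    """Principal components regression, Eq. (3.17): regress y on the first M
    derived inputs z_m = X v_m and map back to the original coordinates."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt.T[:, :M]
    Z = X @ V                                   # derived input columns z_m
    theta = (Z.T @ y) / np.sum(Z**2, axis=0)    # theta_m = <z_m, y> / <z_m, z_m>
    return V @ theta                            # beta_pcr = sum_m theta_m v_m
```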
3.5.2 Partial Least Squares (PLS)
1. X needs to be standardized, as the solutions are not equivariant under scaling of the inputs.

2. PLS seeks directions that have both high variance and high correlation with the response, whereas PCR keys only on high variance.

3. The m-th PLS direction φ̂_m solves

   max_α Corr^2(y, Xα) Var(Xα),  subject to ||α|| = 1, α^T S φ̂_l = 0, l = 1, ..., m-1.   (3.18)

4. The variance term in (3.18) tends to dominate, so PLS behaves much like ridge regression and principal components regression.

Algorithm 3.3 (PLS):

1. Standardize each predictor x_j to have mean zero and variance one. Set ŷ^(0) = ȳ1 and x_j^(0) = x_j, j = 1, ..., p.

2. For m = 1, ..., p:
   - Compute the partial least squares direction z_m = ∑_j φ̂_{mj} x_j^(m-1), where φ̂_{mj} = ⟨x_j^(m-1), y⟩; the inputs are weighted by the strength of their univariate effect on y.
   - Compute θ̂_m = ⟨z_m, y⟩ / ⟨z_m, z_m⟩.
   - Update ŷ^(m) = ŷ^(m-1) + θ̂_m z_m.
   - Prepare for the next step: orthogonalize each x_j^(m-1) with respect to z_m to obtain x_j^(m) = x_j^(m-1) - [ ⟨z_m, x_j^(m-1)⟩ / ⟨z_m, z_m⟩ ] z_m (i.e., keep only the residual of each x_j).

3. Output the sequence {ŷ^(m)}_1^p, where ŷ^(m) = ŷ^(0) + θ̂_1 z_1 + ... + θ̂_m z_m = X β̂^pls(m).

Table 6: PLS (Algorithm 3.3).
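A minimal sketch of Algorithm 3.3 on the training data, assuming the columns of X are standardized; it returns the fitted vector ŷ^(M):

```python
import numpy as np

def pls_fitted_values(X, y, M):
    """Partial least squares (Algorithm 3.3 sketch): build directions z_m from
    univariate effects, accumulate fitted values, then orthogonalize X."""
    Xm = X.copy().astype(float)
    y_hat = np.full(X.shape[0], y.mean())          # y_hat^(0) = ybar * 1
    for _ in range(M):
        phi = Xm.T @ y                             # phi_mj = <x_j^(m-1), y>
        z = Xm @ phi                               # z_m = sum_j phi_mj x_j^(m-1)
        theta = (z @ y) / (z @ z)                  # theta_m = <z_m, y> / <z_m, z_m>
        y_hat = y_hat + theta * z
        Xm = Xm - np.outer(z, (z @ Xm) / (z @ z))  # orthogonalize x_j w.r.t. z_m
    return y_hat
```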
Comparison of LAR and LASSO:

- Algorithm: LAR can be modified to implement the LASSO: if a nonzero coefficient hits zero in step 3, drop its variable from the active set and recompute the current joint least squares direction.
- Computation steps: LAR always takes p steps to reach the full least squares estimate; the LASSO path can have more than p steps.
- Stationarity conditions: for LAR, x_j^T(y - Xβ) = γ s_j for j in the active set, with |x_k^T(y - Xβ)| ≤ γ for all other k; for the LASSO, x_j^T(y - Xβ) = λ · sign(β_j) for active j. The two conditions coincide when the sign of β_j matches the sign of the inner product.
- Effective degrees of freedom (general definition for an adaptively fitted model: df(ŷ) = (1/σ^2) ∑_{i=1}^{N} Cov(ŷ_i, y_i)): for LAR, df is k after the k-th step of the procedure; for the LASSO, df approximately equals the number of nonzero predictors at any stage, and the approximation works best at the last model in the sequence that contains k predictors.

Table 5: A is the active set of variables at some stage of LAR (B the active set for the LASSO), and s_j = ±1 is the sign of the inner product in the stationarity condition.
3.6 Comparison of the Selection and Shrinkage Methods

1. Like ridge, PLS also tends to shrink the low-variance directions, but it can actually inflate some of the higher-variance directions, which makes PLS a little unstable.

2. Ridge vs. PCR: ridge regression shrinks the coefficients of the principal components, shrinking more in directions with smaller eigenvalues, while PCR discards the p - M smallest-eigenvalue components entirely.

3. To summarize, PLS, PCR and ridge regression tend to behave similarly. Ridge may be preferred because it shrinks smoothly rather than in discrete steps. The lasso falls somewhere between ridge and best subset selection, and enjoys some of the properties of each.
3.7 Multiple Outcome Shrinkage and Selection

To apply shrinkage in the multiple-output case, one can:

1. Apply a univariate technique individually to each outcome, which allows a different amount of regularization for each outcome.

2. Apply a univariate technique simultaneously to all outcomes.

3. Exploit correlations among the different responses, for example via canonical correlation analysis (CCA).

CCA and Reduced-Rank Regression

1. CCA finds a sequence of uncorrelated linear combinations Xv_m, m = 1, ..., M of the x_j, and a corresponding sequence of uncorrelated linear combinations Yu_m of the responses y_k, such that

   Corr^2(Yu_m, Xv_m)   (3.19)

   is maximized.
3.8 More on the Lasso and Related Path Algorithms
3.8.1 Incremental Forward Stagewise Regression (FS0)
Algorithm 3.4: Incremental Forward Stagewise Regression

1. Standardize each predictor x_j to have mean zero and unit norm. Set the residual r = y and β_1 = ... = β_p = 0.

2. Find the predictor x_j most correlated with r.

3. Update β_j ← β_j + δ_j and r ← r - δ_j x_j, where δ_j = ε · sign[⟨x_j, r⟩] for some small step size ε > 0.

4. Repeat steps 2 and 3 until the residuals are uncorrelated with all the predictors.

Remarks:

1. If δ_j = ⟨x_j, r⟩, the algorithm is exactly the forward stagewise procedure.

2. As ε → 0, Algorithm 3.4 gives the FS0 path, which often coincides with the lasso path (see the comparison below).

3. The algorithm plays an important role in nonlinear, adaptive methods like boosting (Chapters 10 and 16).

Comparison of LAR, LASSO and FS0: if the LAR coefficient profiles are monotone, then all three methods give identical results. If the profiles are not monotone but do not cross zero, then LAR and the lasso are identical. FS0 can be viewed as a monotone version of the lasso. A sketch of Algorithm 3.4 follows.
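A minimal sketch of Algorithm 3.4, assuming standardized predictors and centered y; eps plays the role of the small step size ε:

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, n_steps=5000):
    """Incremental forward stagewise regression (Algorithm 3.4 sketch)."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y.astype(float).copy()
    for _ in range(n_steps):
        corr = X.T @ r                       # inner products <x_j, r>
        j = np.argmax(np.abs(corr))          # predictor most correlated with r
        if np.abs(corr[j]) < 1e-8:           # residual uncorrelated with all x_j
            break
        delta = eps * np.sign(corr[j])
        beta[j] += delta
        r -= delta * X[:, j]
    return beta
```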
3.8.2 Piecewise-Linear Path Algorithms
3.8.3 The Dantzig Selector
Candes and Tao (2007) proposed the Dantzig selector as the solution of

min_β ||β||_1  subject to  ||X^T(y - Xβ)||_∞ ≤ s,   (3.20)

or equivalently,

min_β ||X^T(y - Xβ)||_∞  subject to  ||β||_1 ≤ t,   (3.21)

where ||·||_∞ is the L_∞ norm, the maximum absolute value of the components of the vector. The selector thus minimizes the maximum inner product of the current residual with all the predictors.
3.8.4 The Grouped Lasso
The grouped lasso shrinks and selects the members of a group together. It minimizes the convex criterion

min_{β ∈ R^p} ( ||y - β_0 1 - ∑_{l=1}^{L} X_l β_l||_2^2 + λ ∑_{l=1}^{L} √(p_l) ||β_l||_2 ),   (3.22)

where the p predictors are divided into L groups with p_l predictors in group l, and ||·||_2 is the Euclidean norm (the norm of a vector is zero only if all of its components are zero).
3.8.5 Further Properties of the Lasso
The lasso shrinkage causes the estimates of the nonzero coefficients to be biased towards zero. Ways to reduce this bias:

1. Run the lasso to identify the set of nonzero coefficients, and then fit an unrestricted linear model to the selected predictors.

2. Use the lasso to select the nonzero predictors, and then apply the lasso again to the selected predictors only; this is known as the relaxed lasso. (The variables in the second step have less "competition" from noise variables, so cross-validation tends to pick a smaller value of λ, and hence less penalty.)

3. Modify the penalty function so that larger coefficients are shrunken less severely, e.g. the smoothly clipped absolute deviation (SCAD) penalty; however, it is non-convex.

4. The adaptive lasso uses a weighted penalty ∑_{j=1}^{p} w_j |β_j| with w_j = 1/|β̂_j|^ν, where β̂_j is the OLS estimate and ν > 0. This approximates the |β|^q penalties.
3.8.6 Pathwise Coordinate Optimization (PCO)
Simple coordinate descent is an alternative to the LARS algorithm for computing the lasso solution. Idea: fix the penalty parameter λ in (3.14) and optimize successively over each parameter, holding the other parameters fixed at their current values.
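A rough sketch of this idea (my own minimal implementation, assuming centered y and predictors standardized to unit norm so that each coordinate update reduces to a single soft-thresholding step):

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Pathwise coordinate optimization for the lasso criterion (3.14):
    cycle over coordinates, soft-thresholding each one with the others fixed."""
    p = X.shape[1]
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]    # partial residual excluding j
            rho = X[:, j] @ r_j                       # <x_j, r_j>
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0)
    return beta
```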
4 Linear Models for Classification
4.1 Introduction
Suppose there are K classes labeled 1, 2, ..., K, and the fitted linear model for the k-th indicator response is f̂_k(x) = β̂_{k0} + β̂_k^T x. The decision boundary between classes k and l is the set where

f̂_k(x) = f̂_l(x),   (4.1)

or equivalently,

{x : (β̂_{k0} - β̂_{l0}) + (β̂_k - β̂_l)^T x = 0},   (4.2)

which is an affine set, or hyperplane.

To summarize, the regression approach is a member of a class of models that model discriminant functions δ_k(x) for each class (for example, the posterior probabilities Pr(G = k | X = x)) and then classify x to the class with the largest value of δ_k(x).

To generalize linear decision boundaries, we can expand the variable set to include squares and cross-products of the variables. Linear functions (linear decision boundaries) in the augmented space then map down to quadratic functions (quadratic decision boundaries) in the original space.
4.2 Linear Regression of an Indicator Matrix
Approach:

1. Suppose G has K classes with indicator responses Y = (Y_1, ..., Y_K).

2. Fit a linear regression based on the N observations: Ŷ = X(X^T X)^{-1} X^T Y. Denote B̂ = (X^T X)^{-1} X^T Y.

3. A new observation x is classified as Ĝ(x) = argmax_{k ∈ G} f̂_k(x), where f̂_k(x) is the k-th element of (1, x^T) B̂ (see the sketch at the end of this subsection).

Issues:

1. f̂_k(x) can be negative or greater than 1, which is not a good approximation of a probability.

2. When the number of classes is large, some classes can be masked by others.
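The sketch referenced in the approach above, assuming an intercept column is already included in X and the class labels are coded 0, ..., K-1 (both are my assumptions for illustration):

```python
import numpy as np

def indicator_regression_fit(X, g, K):
    """Regress an N x K indicator matrix Y on X: Bhat = (X^T X)^{-1} X^T Y."""
    Y = np.eye(K)[g]                              # indicator response matrix
    return np.linalg.solve(X.T @ X, X.T @ Y)

def indicator_regression_predict(x, Bhat):
    """Classify to the class with the largest fitted value f_k(x)."""
    return int(np.argmax(x @ Bhat))
```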
4.3 Linear Discriminant Analysis
Bayes' theorem gives

Pr(G = k | X = x) = f_k(x) π_k / ∑_{l=1}^{K} f_l(x) π_l.   (4.3)

Techniques based on models for the class densities:

- Linear and quadratic discriminant analysis use Gaussian densities.
- Mixtures of Gaussians allow for nonlinear decision boundaries (Section 6.8).
- General nonparametric density estimates for each class density allow the most flexibility (Section 6.6.2).
- Naive Bayes assumes that each class density is a product of marginal densities, i.e. the inputs are conditionally independent within each class.
Discriminant analysis: suppose each class density is multivariate Gaussian,

f_k(x) = (2π)^{-p/2} |Σ_k|^{-1/2} exp( -(1/2)(x - μ_k)^T Σ_k^{-1} (x - μ_k) ).   (4.4)

1. Linear discriminant analysis (LDA)

   (a) Assume all classes have a common covariance matrix, Σ_k = Σ for all k in (4.4).

   (b) The decision boundary between class k and class l is linear in x.

   (c) Linear discriminant functions, obtained from log(f_k(x) π_k):

       δ_k(x) = x^T Σ^{-1} μ_k - (1/2) μ_k^T Σ^{-1} μ_k + log π_k.   (4.5)

   (d) Equivalently, the decision rule is G(x) = argmax_k δ_k(x).

   (e) In practice the parameters of the Gaussian distributions are not known; they are estimated from the training data (see the sketch at the end of this subsection):
       - π̂_k = N_k / N.
       - μ̂_k = ∑_{g_i = k} x_i / N_k.
       - Σ̂ = ∑_{k=1}^{K} ∑_{g_i = k} (x_i - μ̂_k)(x_i - μ̂_k)^T / (N - K).

   (f) There is a correspondence between simple linear regression and LDA with two classes: the two-class LDA rule is driven by x^T Σ̂^{-1}(μ̂_1 - μ̂_2), and the least squares coefficient vector β̂ is proportional to this direction (derived from Equation 4.5), although the cut-points generally differ.

   (g) With more than two classes, LDA is not the same as linear regression, and it avoids the masking problems of linear regression.

2. Quadratic discriminant analysis (QDA)

   (a) The Σ_k are not assumed to be equal in (4.4).

   (b) Discriminant function:

       δ_k(x) = -(1/2) log|Σ_k| - (1/2)(x - μ_k)^T Σ_k^{-1} (x - μ_k) + log π_k.   (4.6)

   (c) The decision boundary between classes k and l is {x : δ_k(x) = δ_l(x)}.

3. To summarize, both LDA and QDA have a good track record. A likely reason is that the data can often only support simple decision boundaries such as linear or quadratic ones, and the estimates provided by the Gaussian models are stable. This is a bias-variance tradeoff: the bias of a simple boundary is accepted because it can be estimated with much lower variance than more exotic alternatives.

4. Regularized discriminant analysis (RDA)

   (a) A compromise between LDA and QDA is RDA,

       Σ̂_k(α) = α Σ̂_k + (1 - α) Σ̂,   (4.7)

       where Σ̂ is the pooled covariance matrix as in LDA and α ∈ [0, 1]. The value of α can be chosen based on the performance of the model on validation data, or by cross-validation.

   (b) Similarly, Σ̂ itself can be shrunk toward a scalar covariance,

       Σ̂(γ) = γ Σ̂ + (1 - γ) σ̂^2 I,   (4.8)

       where γ ∈ [0, 1]. See more on RDA in Chapter 12.

5. Computations for LDA can be simplified by diagonalizing Σ̂ (or Σ̂_k = U_k D_k U_k^T) and transforming the inputs as X* = D^{-1/2} U^T X, so that the transformed covariance matrix is the identity.

6. Reduced-rank LDA. Idea: the K centroids in the p-dimensional input space lie in an affine subspace of dimension at most K - 1, and if p is much larger than K this is a considerable drop in dimension. Fisher defined the optimal subspace (in our case, for LDA) to be the one in which the projected centroids are spread out as much as possible in terms of variance.

7. A note on the misclassification rate: the misclassification rate is driven by the area of overlap between the two class densities. Gaussian classification adds the log π_k correction factor to the distance calculation; moving the cut-point toward the smaller class can reduce (improve) the error rate.
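A small sketch of the LDA estimates of item 1(e) and the classification rule (4.5); the function names and the label handling are illustrative assumptions:

```python
import numpy as np

def lda_fit(X, y):
    """Estimate class priors, class means, and the pooled covariance matrix."""
    classes = np.unique(y)
    N, K = len(y), len(classes)
    priors = np.array([np.mean(y == k) for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    Sigma = sum((X[y == k] - means[i]).T @ (X[y == k] - means[i])
                for i, k in enumerate(classes)) / (N - K)
    return classes, priors, means, Sigma

def lda_predict(x, classes, priors, means, Sigma):
    """Classify x to the class with the largest discriminant score, Eq. (4.5)."""
    Sinv = np.linalg.inv(Sigma)
    scores = [x @ Sinv @ m - 0.5 * m @ Sinv @ m + np.log(p)
              for m, p in zip(means, priors)]
    return classes[int(np.argmax(scores))]
```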
4.4 Logistic Regression
1. Motivation: model the posterior probabilities of the K classes via linear functions in x, while ensuring that they sum to one and remain in [0, 1].

2. Usage: logistic regression models are used mostly as a data analysis and inference tool, where the goal is to understand the role of the input variables in explaining the outcome.

3. Model form:

   log[ Pr(G = 1 | X = x) / Pr(G = K | X = x) ] = β_{10} + β_1^T x
   log[ Pr(G = 2 | X = x) / Pr(G = K | X = x) ] = β_{20} + β_2^T x
   ...
   log[ Pr(G = K-1 | X = x) / Pr(G = K | X = x) ] = β_{(K-1)0} + β_{K-1}^T x,   (4.9)

   with class probabilities, for k = 1, 2, ..., K-1,

   Pr(G = k | X = x) = exp(β_{k0} + β_k^T x) / ( 1 + ∑_{l=1}^{K-1} exp(β_{l0} + β_l^T x) ) = p_k(x; θ),   (4.11)

   Pr(G = K | X = x) = 1 / ( 1 + ∑_{l=1}^{K-1} exp(β_{l0} + β_l^T x) ), and parameters θ = {β_{10}, β_1^T, ..., β_{(K-1)0}, β_{K-1}^T}.

4. Model fitting: maximize the log-likelihood, L(θ) = ∑_{i=1}^{N} log p_{g_i}(x_i; θ). Suitable algorithms for solving the resulting score equations include Newton-Raphson / iteratively reweighted least squares (IRLS) and coordinate-descent methods.

5. Quadratic approximations and inference. The ML estimates β̂ satisfy a self-consistency relationship: they are the coefficients of a weighted least squares fit with working responses

   z_i = x_i^T β̂ + (y_i - p̂_i) / ( p̂_i (1 - p̂_i) ),   (4.12)

   and weights w_i = p̂_i(1 - p̂_i) (see the sketch at the end of this subsection). This connection also gives:

   (a) The weighted sum of squares is the Pearson chi-square statistic,

       ∑_{i=1}^{N} (y_i - p̂_i)^2 / ( p̂_i (1 - p̂_i) ),   (4.13)

       a quadratic approximation to the deviance.

   (b) β̂ is asymptotically consistent (it converges to the true β) provided the model is correct.

   (c) The Rao score test and the Wald test can be used to test inclusion or exclusion of a term efficiently, without refitting the logistic model each time.

6. L_1-regularized logistic regression. To do variable selection and shrinkage, maximize a penalized log-likelihood with no penalty on the intercept; for two classes,

   max_{β_0, β} { ∑_{i=1}^{N} [ y_i(β_0 + β^T x_i) - log(1 + e^{β_0 + β^T x_i}) ] - λ ∑_{j=1}^{p} |β_j| }.   (4.14)

7. Logistic regression vs. LDA. From the LDA assumption in Equation 4.4 one can derive that

   log[ Pr(G = k | X = x) / Pr(G = K | X = x) ] = α_{k0} + α_k^T x,   (4.15)

   where α_{k0}, α_k are parameters depending on π_k, π_K, μ_k, μ_K and Σ. Equation 4.15 is linear in x because of the Gaussian and common-covariance assumptions, and it appears to be the same as Equation 4.9; however, the linear coefficients are estimated differently.

   (a) The logistic regression model is more general, as it makes fewer assumptions.

   (b) Logistic regression fits the parameters by maximizing the conditional likelihood of Pr(G = k | X), leaving the marginal density of X arbitrary. LDA fits the parameters by maximizing the full log-likelihood based on the joint density,

       Pr(X, G = k) = φ(X; μ_k, Σ) π_k,   (4.16)

       where the marginal density is the mixture density

       Pr(X) = ∑_{k=1}^{K} φ(X; μ_k, Σ) π_k.   (4.17)

       The extra restriction on the marginal density of X provides more information about the parameters and hence more efficient estimation, provided the f_k(x) really are Gaussian. Also, because of the mixture formulation, observations without class labels carry information about the parameters; by relying on these strong model assumptions, that information can also be used.

   (c) Comparison: it is generally felt that logistic regression is a safer, more robust bet than LDA, since it relies on fewer assumptions.
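The sketch referenced in item 5: two-class logistic regression fitted by Newton-Raphson / IRLS, where each step is a weighted least squares fit with the working response of Eq. (4.12); X is assumed to include an intercept column and y is coded 0/1:

```python
import numpy as np

def logistic_irls(X, y, n_iter=25):
    """Fit a two-class logistic regression by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))       # fitted probabilities
        p = np.clip(p, 1e-10, 1 - 1e-10)          # guard against zero weights
        W = p * (1 - p)                           # IRLS weights w_i
        z = X @ beta + (y - p) / W                # working response, Eq. (4.12)
        # weighted least squares step: beta = (X^T W X)^{-1} X^T W z
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return beta
```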
4.5 Separating Hyperplanes
Perceptrons are classifiers that compute a linear combination of the input features and return its sign; the separating hyperplane is

{x : β̂_0 + x^T β̂ = 0}.   (4.18)
4.5.1 Rosenblatt’s Perceptron Learning Algorithm
1. Goal: the perceptron learning algorithm tries to find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary.

2. Separating hyperplane classifiers try to separate the data into different classes by constructing linear decision boundaries explicitly. They provide the basis for the support vector classifiers of Chapter 12.

3. Suppose the response y is coded as ±1; then a misclassified point with y_i = 1 satisfies x_i^T β + β_0 < 0. The goal is to minimize

   D(β, β_0) = - ∑_{i ∈ M} y_i (x_i^T β + β_0),   (4.19)

   where M indexes the set of misclassified points. Stochastic gradient descent is used for the minimization (see the sketch after this list).

   Problems with this algorithm:

   - When the data are separable, there are many solutions. Remedy: add additional constraints on the separating hyperplane.
   - The "finite" number of steps to convergence can be very large. Remedy: seek a hyperplane in a much enlarged space obtained by creating many basis-function transformations of the original inputs. Perfect separation, however, cannot always be achieved (two different classes may share the same input, over-fitting, etc.).
   - When the data are not separable, the algorithm will not converge.
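The sketch referenced in item 3: stochastic-gradient updates for criterion (4.19), with y coded ±1 and rho a learning-rate parameter (an illustrative choice):

```python
import numpy as np

def perceptron_train(X, y, rho=1.0, n_epochs=100):
    """Rosenblatt perceptron sketch: update (beta, beta0) at each misclassified point."""
    beta = np.zeros(X.shape[1])
    beta0 = 0.0
    for _ in range(n_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ beta + beta0) <= 0:      # misclassified point
                beta += rho * yi * xi
                beta0 += rho * yi
                mistakes += 1
        if mistakes == 0:                          # data separated: stop early
            break
    return beta, beta0
```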
4.5.2 Optimal Separating Hyperplanes (OSH)
1. The OSH separates the two classes and maximizes the distance to the closest point from either class (Vapnik, 1996).

2. Intuition: a large margin on the training data will lead to good separation on the test data.

3. Optimization:

   max_{β_0, β, ||β|| = 1} M,  subject to  y_i(x_i^T β + β_0) ≥ M, i = 1, ..., N.   (4.20)

   Dropping the constraint ||β|| = 1 and setting ||β|| = 1/M, Equation 4.20 can be rewritten as

   min_{β, β_0} (1/2) ||β||^2,  subject to  y_i(x_i^T β + β_0) ≥ 1, i = 1, ..., N.   (4.21)

   The Lagrange (primal) function, to be minimized with respect to β and β_0, is

   L_P = (1/2) ||β||^2 - ∑_{i=1}^{N} α_i [ y_i(x_i^T β + β_0) - 1 ].   (4.22)

   Setting the derivatives with respect to β and β_0 to zero gives constraints on β and the α_i. To solve for the α values, use the Wolfe dual:

   - Set h(α) = min_{β, β_0} L_P = min_{β, β_0} L(β, β_0, α).
   - To find α, solve max_α h(α) subject to α_i ≥ 0.
   - This is equivalent to solving the dual problem max_{α, β, β_0} L(β, β_0, α) subject to α_i ≥ 0, ∂L/∂β = 0 and ∂L/∂β_0 = 0.
4. Interpretation: since α_i [ y_i(x_i^T β + β_0) - 1 ] = 0 for all i,

   - if α_i > 0, then y_i(x_i^T β + β_0) = 1, i.e. x_i is on the boundary of the slab;
   - if y_i(x_i^T β + β_0) > 1, then x_i is not on the boundary of the slab and α_i = 0.

   Thus β = ∑_{i=1}^{N} α_i y_i x_i is a linear combination of the support points x_i, those on the boundary of the slab (with α_i > 0). The OSH then produces f̂(x) = x^T β̂ + β̂_0, whose sign is used to classify new observations.

A comparison of LDA, logistic regression and the OSH is summarized in Table 7.

LDA: depends on all of the data, even points far away from the decision boundary.

Logistic regression: also depends on all of the data; the coefficient vector is defined by a weighted least squares fit of a zero-mean linearized response on the input features, with larger weights for points near the decision boundary. When a separating hyperplane exists, logistic regression will always find it, by driving the log-likelihood toward 0.

OSH: focuses more on the points that count and is more robust to model misspecification, although the identification of the support points depends on all of the data.

Table 7: Summary of LDA, logistic regression and OSH.
5 Basis Expansion and Regularization
5.1 Introduction
6 Kernel Smoothing Methods
7 Model Assessment and Selection
8 Model Inference and Averaging
9 Additive Models, Trees, and Related Methods

This chapter discusses some specific methods for supervised learning, which assume a different structured form for the unknown regression function.
9.1 Generalized Additive Models
Generalized additive models use automatic flexible statistical methods to
identify and characterize non-linear regression effects.
General form:

g(μ(X)) = α + f_1(X_1) + ... + f_p(X_p),   (9.1)

where

- μ(X) is the conditional mean of the response Y;
- α is a constant;
- the f_j's are unspecified smooth (nonparametric) functions;
- g(·) is the link function, which can be the identity, logit, probit, or log-linear link. Other exponential family distributions, such as the gamma and negative binomial, can also be extended from generalized linear models to generalized additive models.

Fitting additive models: the additive model has the form

Y = α + ∑_{j=1}^{p} f_j(X_j) + ε,   (9.2)
and the criterion is

PRSS(α, f_1, ..., f_p) = ∑_{i=1}^{N} ( y_i - α - ∑_{j=1}^{p} f_j(x_{ij}) )^2 + ∑_{j=1}^{p} λ_j ∫ [f_j''(t_j)]^2 dt_j.   (9.3)

The minimizer is an additive cubic spline model, i.e. each f_j is a cubic spline in the component X_j.
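These notes do not spell out how (9.3) is minimized; in ESL this is done by the backfitting algorithm (Algorithm 9.1), which cycles a one-dimensional smoother over the partial residuals. A rough sketch with the smoother left abstract (in ESL it is a cubic smoothing spline; here it is a user-supplied function, which is my simplification):

```python
import numpy as np

def backfit_additive(X, y, smoother, n_iter=20):
    """Backfitting sketch for the additive model (9.2): repeatedly smooth the
    partial residuals against each X_j. `smoother(x, r)` returns fitted values."""
    N, p = X.shape
    alpha = y.mean()
    f = np.zeros((N, p))
    for _ in range(n_iter):
        for j in range(p):
            r = y - alpha - f.sum(axis=1) + f[:, j]   # partial residual for X_j
            fj = smoother(X[:, j], r)
            f[:, j] = fj - fj.mean()                  # center so alpha stays ybar
    return alpha, f
```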
10 Boosting and Additive Trees