4 Linear Regression with multiple variables
4-1 Multiple features
Multiple features (variables)
Notation:

- $n$ = number of features
- $m$ = number of training examples
- $x^{(i)}$ = input (features) of the $i^{th}$ training example
- $x_j^{(i)}$ = value of feature $j$ in the $i^{th}$ training example
Hypothesis
$$h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2+\cdots+\theta_nx_n$$
Define $x_0=1$:
$$x=\begin{bmatrix}x_0\\x_1\\x_2\\\vdots\\x_n\end{bmatrix} \qquad \theta=\begin{bmatrix}\theta_0\\\theta_1\\\theta_2\\\vdots\\\theta_n\end{bmatrix}$$
$$h_\theta(x)=\theta^Tx$$
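As a quick illustration (not from the lecture; the numbers are made up), the vectorized hypothesis is a single inner product in Octave:

```octave
% Hypothetical values: x0 = 1 prepended to four features of one training example.
x     = [1; 2104; 5; 1; 45];        % (n+1) x 1 feature vector
theta = [80; 0.1; 0.01; 3; -2];     % (n+1) x 1 parameter vector
h = theta' * x;                     % h_theta(x) = theta^T x, a scalar prediction
```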
4-2 Gradient descent for multiple variables
Cost function
$$J(\theta)=J(\theta_0,\theta_1,\cdots,\theta_n)=\frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$
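A minimal Octave sketch of this cost function, assuming a design matrix `X` (m x (n+1), first column all ones), targets `y` (m x 1), and parameters `theta` ((n+1) x 1); the function name is illustrative, not the course's reference code:

```octave
function J = computeCost(X, y, theta)
  m = length(y);                           % number of training examples
  errors = X * theta - y;                  % h_theta(x^(i)) - y^(i) for all i, m x 1
  J = (1 / (2 * m)) * (errors' * errors);  % (1/2m) * sum of squared errors
end
```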
Gradient descent
$$\theta_j := \theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)$$
Previously ($n=1$):

Repeat {
$$\begin{cases}\theta_0 := \theta_0-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right) \\ \theta_1 := \theta_1-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x^{(i)} \end{cases}$$
}
New algorithm ($n \ge 1$):
Repeat {
$$\theta_j := \theta_j-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}$$
(simultaneously update $\theta_j$ for $j=0,\dots,n$)
}
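One possible vectorized implementation in Octave (a sketch with assumed names `X`, `y`, `theta`; not the course's reference code):

```octave
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
  m = length(y);
  J_history = zeros(num_iters, 1);
  for iter = 1:num_iters
    errors = X * theta - y;                         % m x 1 prediction errors
    theta = theta - (alpha / m) * (X' * errors);    % simultaneous update of every theta_j
    J_history(iter) = (1 / (2 * m)) * sum((X * theta - y) .^ 2);  % cost after the update
  end
end
```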
4-3 Gradient descent in practice I: Feature Scaling
Feature Scaling
Idea: make sure features are on a similar scale.
Get every feature into approximately a $0 \le x_i \le 1$ range ($x_0=1$).
Mean normalization
Replace $x_i$ with $x_i-\mu_i$ to make features have approximately zero mean (do not apply to $x_0=1$):
$$x_i \longleftarrow \frac{x_i-\mu_i}{s_i}$$
- $x_i$: value of feature $i$
- $\mu_i$: average value of $x_i$ in the training set
- $s_i$: range (max - min)
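A minimal Octave sketch of mean normalization, assuming `X` holds the raw features (one column per feature, without the $x_0=1$ column):

```octave
mu = mean(X);                              % 1 x n row vector of per-feature means
s  = max(X) - min(X);                      % 1 x n row vector of per-feature ranges (max - min)
X_norm = (X - mu) ./ s;                    % broadcasts: (x_i - mu_i) / s_i for every example
X = [ones(size(X_norm, 1), 1), X_norm];    % prepend the unscaled x0 = 1 column afterwards
```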
4-4 Gradient descent in practice II: Learning rate
Gradient descent
$$\theta_j := \theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)$$
- Debugging: make sure gradient descent is working correctly.
- How to choose the learning rate $\alpha$.
Convergence test:
Declare convergence if $J(\theta)$ decreases by less than $10^{-3}$ ($\epsilon$) in one iteration.
Summary
- For sufficiently small $\alpha$, $J(\theta)$ should decrease on every iteration.
- But if $\alpha$ is too small, gradient descent can be slow to converge.
To choose $\alpha$, try a range of values spaced roughly threefold apart:
$\cdots, 0.001, 0.003, 0.01, 0.03, \cdots$
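To debug, plot $J(\theta)$ against the iteration number for a few candidate values of $\alpha$; a sketch using the `gradientDescent` function sketched earlier (assumed names, made-up iteration count):

```octave
for alpha = [0.001 0.003 0.01 0.03]
  [theta, J_history] = gradientDescent(X, y, zeros(size(X, 2), 1), alpha, 400);
  plot(1:numel(J_history), J_history); hold on;   % J should fall on every iteration
end
xlabel('number of iterations'); ylabel('J(\theta)');
legend('0.001', '0.003', '0.01', '0.03');
```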
4-5 Features and polynomial regression
Housing price prediction
Polynomial regression
$$h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_3$$
$$\longrightarrow h_\theta(x)=\theta_0+\theta_1(size)+\theta_2(size)^2+\theta_3(size)^3$$
(i.e. set $x_1=(size)$, $x_2=(size)^2$, $x_3=(size)^3$)
Choice of features
$$h_\theta(x)=\theta_0+\theta_1x_1+\theta_2\sqrt{size}$$
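A sketch in Octave of building such features from a single size column (`data` and the variable names are hypothetical); since $(size)^2$ and $(size)^3$ have very different ranges, feature scaling matters here:

```octave
sz = data(:, 1);                           % hypothetical column of house sizes
X_poly = [sz, sz .^ 2, sz .^ 3];           % x1 = size, x2 = size^2, x3 = size^3
mu = mean(X_poly);  s = max(X_poly) - min(X_poly);
X_poly = (X_poly - mu) ./ s;               % mean-normalize each polynomial feature
X = [ones(size(X_poly, 1), 1), X_poly];    % prepend x0 = 1
```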
4-6 Normal equation
Intuition: for $\theta\in R$, minimize $J$ by setting $\frac{d}{d\theta}J(\theta)=0$ and solving for $\theta$; for $\theta\in R^{n+1}$, set each $\frac{\partial}{\partial\theta_j}J(\theta)=0$ and solve for $\theta_0,\dots,\theta_n$.
$$\theta=(X^TX)^{-1}X^Ty$$
pinv(X' * X) * X' * y   % Octave: computes theta via the normal equation
$m$ examples: $(x^{(1)},y^{(1)}),\cdots,(x^{(m)},y^{(m)})$; $n$ features.
$$x^{(i)}=\begin{bmatrix} x_0^{(i)}\\ x_1^{(i)}\\x_2^{(i)}\\\vdots\\x_n^{(i)}\end{bmatrix} \qquad X=\begin{bmatrix}\cdots & (x^{(1)})^T & \cdots \\ \cdots & (x^{(2)})^T & \cdots \\ &\vdots& \\ \cdots & (x^{(m)})^T & \cdots \end{bmatrix} \qquad y=\begin{bmatrix}y^{(1)}\\y^{(2)}\\\vdots\\y^{(m)}\end{bmatrix}$$
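Putting it together, a sketch of assembling `X` and `y` and solving the normal equation in Octave (the file name and column layout are assumptions):

```octave
data = load('housing.txt');              % assumed: features in columns 1..n, price in the last column
m = size(data, 1);
X = [ones(m, 1), data(:, 1:end-1)];      % m x (n+1) design matrix with x0 = 1
y = data(:, end);                        % m x 1 target vector
theta = pinv(X' * X) * X' * y;           % theta = (X^T X)^(-1) X^T y, no alpha and no iterations
```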
| Gradient Descent ($O(n)$) | Normal Equation ($O(n^3)$) |
|---|---|
| Need to choose $\alpha$ | No need to choose $\alpha$ |
| Needs many iterations | No need to iterate |
| Works well even when $n$ is large | Need to compute $(X^TX)^{-1}$ |
| | Slow if $n$ is very large |
4-7 Normal equation and non-invertibility (optional)
What if $X^TX$ is non-invertible (singular or degenerate)?
- Redundant features (linearly dependent), e.g. $x_1$ = size in $feet^2$ and $x_2$ = size in $m^2$
- Too many features (e.g. $m \le n$): delete some features, or use regularization
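As a small illustration (made-up numbers), with a redundant feature $X^TX$ is singular, so `inv` would warn that the matrix is singular, while `pinv` still returns a usable (minimum-norm) solution:

```octave
X = [ones(4, 1), [1; 2; 3; 4], [2; 4; 6; 8]];   % third column = 2 * second column (redundant)
y = [1; 2; 3; 4];
theta = pinv(X' * X) * X' * y;                  % works even though X' * X is non-invertible
```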
**Octave Tutorial**
Working on and submitting programming exercises