Andrew Ng · Machine Learning || chap4 Linear Regression with multiple variables (notes)

4 Linear Regression with multiple variables

4-1 Multiple features

Multiple features (variables)

Notation:

$n$ = number of features
$x^{(i)}$ = input (features) of the $i^{th}$ training example
$x_j^{(i)}$ = value of feature $j$ in the $i^{th}$ training example

Hypothesis

$h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2+\cdots+\theta_nx_n$

Define $x_0=1$. Then

$x=\begin{bmatrix}x_0\\x_1\\x_2\\\vdots\\x_n\end{bmatrix}$, $\theta=\begin{bmatrix}\theta_0\\\theta_1\\\theta_2\\\vdots\\\theta_n\end{bmatrix}$

$h_\theta(x)=\theta^Tx$

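As a quick illustration, here is a minimal Octave sketch of the vectorized hypothesis; the parameter and feature values are made-up numbers, not from the course:

```octave
% Vectorized hypothesis h_theta(x) = theta' * x, with x0 = 1 prepended.
theta = [80; 0.1; 50];     % hypothetical parameters theta_0, theta_1, theta_2
x     = [1; 2104; 3];      % x0 = 1, x1 = size, x2 = number of bedrooms
h     = theta' * x         % scalar prediction
```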
4-2 Gradient descent for multiple variables

Cost function
$J(\theta)=J(\theta_0,\theta_1,\cdots,\theta_n)=\frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$
Gradient descent

$\theta_j := \theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)$

Previously (n = 1):

Repeat {
$\theta_0 := \theta_0-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)$
$\theta_1 := \theta_1-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x^{(i)}$
}

New algorithm (n $\ge$ 1):

Repeat {
$\theta_j := \theta_j-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}$  (simultaneously update $\theta_j$ for $j=0,\dots,n$)
}

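A vectorized Octave sketch of this update, assuming a design matrix `X` (m x (n+1), first column all ones), a label vector `y`, an initial `theta`, and a hypothetical choice of learning rate:

```octave
% Batch gradient descent for multivariate linear regression (vectorized).
% X: m x (n+1) with X(:,1) == 1,  y: m x 1,  theta: (n+1) x 1
alpha     = 0.01;                       % learning rate (hypothetical value)
num_iters = 400;
m         = length(y);
for iter = 1:num_iters
  h     = X * theta;                    % m x 1 vector of predictions
  grad  = (1 / m) * X' * (h - y);       % (n+1) x 1 gradient of J(theta)
  theta = theta - alpha * grad;         % simultaneous update of every theta_j
end
```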
4-3 Gradient descent in practice I: Feature Scaling

Feature Scaling

Idea: make sure features are on a similar scale.

Get every feature into approximately a $0\le x_i \le 1$ range ($x_0=1$).

Mean normalization

Replace $x_i$ with $x_i-\mu_i$ to make features have approximately zero mean (do not apply to $x_0=1$):

$x_i \longleftarrow \dfrac{x_i-\mu_i}{s_i}$

$x_i$: value of feature $i$
$\mu_i$: average value of $x_i$ in the training set
$s_i$: range (max $-$ min)

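A sketch of mean normalization in Octave, assuming the same design matrix `X` as in the earlier sketch and dividing by the range (max - min) as above; the intercept column $x_0=1$ is left untouched:

```octave
% Mean-normalize every feature column of X except the intercept column.
mu = mean(X(:, 2:end));                     % 1 x n vector of feature means
s  = max(X(:, 2:end)) - min(X(:, 2:end));   % 1 x n vector of ranges (max - min)
X(:, 2:end) = (X(:, 2:end) - mu) ./ s;      % zero mean, similar scale for all features
```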
4-4 Gradient descent in practice II: Learning rate

Gradient descent

$\theta_j := \theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)$

  • Debugging: make sure gradient descent is working correctly

  • How to choose learning rate $\alpha$

    Convergence test:

    Declare convergence if $J(\theta)$ decreases by less than $10^{-3}$ ($\epsilon$) in one iteration.

Summary

For sufficiently small $\alpha$, $J(\theta)$ should decrease on every iteration.
But if $\alpha$ is too small, gradient descent can be slow to converge.

To choose $\alpha$, try a range of values in roughly threefold steps:

$\cdots$, 0.001, 0.003, 0.01, 0.03, $\cdots$

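One way to sketch this in Octave: record $J(\theta)$ at every iteration so it can be plotted against the iteration number, and stop once the decrease drops below $\epsilon=10^{-3}$. `X`, `y`, `theta`, `alpha`, and `num_iters` are assumed from the earlier sketches:

```octave
% Track the cost per iteration and apply the automatic convergence test.
epsilon   = 1e-3;
m         = length(y);
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
  theta           = theta - alpha * (1 / m) * X' * (X * theta - y);
  J_history(iter) = (1 / (2 * m)) * sum((X * theta - y) .^ 2);   % J(theta)
  if iter > 1 && J_history(iter - 1) - J_history(iter) < epsilon
    break;                                % decrease smaller than epsilon
  end
end
plot(1:iter, J_history(1:iter));          % J(theta) should fall on every iteration
xlabel('iteration'); ylabel('J(theta)');
```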
4-5 Features and polynomial regression

Housing price prediction

Polynomial regression

$h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_3$

$\longrightarrow\ h_\theta(x)=\theta_0+\theta_1(size)+\theta_2(size)^2+\theta_3(size)^3$

Choice of features

$h_\theta(x)=\theta_0+\theta_1(size)+\theta_2\sqrt{size}$
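A sketch of building these features in Octave; `size_ft` stands for a hypothetical m x 1 column of house sizes, and feature scaling matters here because the powers end up on very different ranges:

```octave
% Polynomial features: size, size^2, size^3.
m      = length(size_ft);
X_poly = [ones(m, 1), size_ft, size_ft .^ 2, size_ft .^ 3];

% Alternative choice of features: size and sqrt(size).
X_sqrt = [ones(m, 1), size_ft, sqrt(size_ft)];
```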

4-6 Normal equation

Intuition: for $\theta\in\mathbb{R}$, minimize $J(\theta)$ by setting the derivative to zero and solving; for $\theta\in\mathbb{R}^{n+1}$, setting every partial derivative $\frac{\partial}{\partial\theta_j}J(\theta)$ to zero gives the closed-form solution

$\theta=(X^TX)^{-1}X^Ty$

pinv(X'*X)*X'*y    % Octave: closed-form solution for theta

$m$ examples $(x^{(1)},y^{(1)}),\cdots,(x^{(m)},y^{(m)})$; $n$ features

$x^{(i)}=\begin{bmatrix}x_0^{(i)}\\x_1^{(i)}\\x_2^{(i)}\\\vdots\\x_n^{(i)}\end{bmatrix}$,  $X=\begin{bmatrix}\cdots & (x^{(1)})^T & \cdots \\ \cdots & (x^{(2)})^T & \cdots \\ & \vdots & \\ \cdots & (x^{(m)})^T & \cdots\end{bmatrix}$

$y=\begin{bmatrix}y^{(1)}\\y^{(2)}\\\vdots\\y^{(m)}\end{bmatrix}$
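Putting the pieces together in Octave, under the assumption of a hypothetical matrix `data` whose first $n$ columns are the features and whose last column is the target:

```octave
% Normal equation: build X and y as defined above, then solve in closed form
% (no feature scaling and no learning rate needed).
m = size(data, 1);
n = size(data, 2) - 1;
X = [ones(m, 1), data(:, 1:n)];     % m x (n+1) design matrix, x0 = 1
y = data(:, n + 1);                 % m x 1 target vector
theta = pinv(X' * X) * X' * y;      % (n+1) x 1 parameter vector
```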

| Gradient Descent ($O(n)$) | Normal Equation ($O(n^3)$) |
| --- | --- |
| Need to choose $\alpha$ | No need to choose $\alpha$ |
| Needs many iterations | No need to iterate |
| Works well even when $n$ is large | Need to compute $(X^TX)^{-1}$; slow if $n$ is very large |

4-7 Normal equation and non-invertibility (optional)

Non-invertible matrices are also called singular or degenerate matrices.

What if $X^TX$ is non-invertible?

  • Redundant features (linearly dependent)

    e.g. $x_1$ = size in $feet^2$

    $x_2$ = size in $m^2$

  • Too many features (e.g. $m \le n$)

    - Delete some features, or use regularization (a short check with `pinv` is sketched below)
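A small Octave check, assuming the `X` and `y` from the previous section; `pinv` computes the pseudoinverse, so it still returns a usable `theta` even when $X^TX$ is singular:

```octave
% Detect a singular X'*X (redundant features or m <= n); pinv, unlike inv,
% still produces a reasonable theta in that case.
if rank(X' * X) < size(X, 2)
  disp('X''*X is singular: check for redundant features or m <= n');
end
theta = pinv(X' * X) * X' * y;   % works whether or not X'*X is invertible
```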

Octave Tutorial

Working on and submitting programming exercises
