Andrew Ng
Notes taken while watching Andrew Ng's online course, written up with my own (very amateur) understanding ~ just for fun.
Concepts
- Structured Data : data in which each feature has a well-defined meaning (e.g. each column of a table means something specific).
- Unstructured Data : raw audio, images, or text, where you may want to recognize what is in the signal; the features might be the pixel values of an image. Note that humans are very good at interpreting unstructured data; a word or a piece of text is also a form of unstructured data.
- Key point : deep learning benefits from a huge amount of data, especially "labeled data".
- Algorithm innovation (computation) : for example, switching the activation from sigmoid to ReLU makes the gradient computation easier, which makes training bigger neural networks and trying out new ideas more convenient, improving your efficiency.
- Label : the thing you are trying to predict.
- Feature : an input variable used to make predictions.
- Example : one particular instance of data; it can be labeled (features plus label) or unlabeled (features only).
- Model : the mapping from features to predictions that is learned from the examples.
- Training : the process of learning the model's parameters from labeled examples.
- Inference : applying the trained model to make predictions on new, unlabeled examples.
- Overfitting : the model fits the training data too closely and generalizes poorly to new data.
- Convergence : the point in training where the loss stops changing much from one iteration to the next.
- Parameter : a variable the model learns during training, such as the weights w and the bias b.
- Hyperparameter : a setting chosen before training, such as the learning rate.
- Iteration : one round of model training (one gradient update).
Basics
Logistic Regression
- The dimension of the feature vector is its number of elements; for an image, $n_x = (\text{number of channels}) \times \text{rows} \times \text{cols}$ of the pixel matrices. $m$ refers to the total number of examples we have.
- Stack the examples column-wise to create a matrix $X$ of shape $(n_x, m)$; in Python, `X.shape` gives $(n_x, m)$ and `Y.shape` gives $(1, m)$.
- We assume the parameters of logistic regression are $w$, an $n_x$-dimensional vector, together with a scalar bias $b$.
- You can use the following function to get an estimated value:
$$
\widehat{y}=\sigma(w^{T}x+b)
$$
- Loss Function: the logistic regression loss function, chosen so that we can actually find the optimum (the best solution) rather than getting stuck in poor local optima. You can sanity-check it by assuming $y$ equals a certain value, like 1 or 0, and then seeing what value we hope $\widehat{y}$ to take.
$$
L(\widehat{y},y)=-\left(y\log\widehat{y}+(1-y)\log(1-\widehat{y})\right)
$$
- Cost Function: the average of the per-example losses over the training set; it measures how well the parameters $w$ and $b$ are doing overall. We should find appropriate $w$ and $b$ that make $J(w,b)$ as small as possible; a short NumPy sketch follows the formula below.
$$
J(w,b) = \frac{1}{m}\sum_{i=1}^{m} L(\widehat{y}^{(i)},y^{(i)})
$$
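To make this concrete, here is a minimal NumPy sketch of the forward pass and the cost for a small batch of $m$ examples (the tiny data set, the zero initialization, and the variable names are placeholders for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# X: (n_x, m) matrix of m examples stacked column-wise; Y: (1, m) labels
X = np.array([[1.0, 2.0, 0.5],
              [0.0, 1.0, 1.5]])      # n_x = 2, m = 3 (made-up data)
Y = np.array([[1, 0, 1]])
w = np.zeros((X.shape[0], 1))        # (n_x, 1) weight vector
b = 0.0

y_hat = sigmoid(np.dot(w.T, X) + b)  # (1, m) predictions
# cost J(w, b): average cross-entropy loss over the m examples
J = -np.mean(Y * np.log(y_hat) + (1 - Y) * np.log(1 - y_hat))
print(J)                             # about 0.693 with zero-initialized parameters
```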
Interpretation:
First, we want to minimize the cost function $J(w,b)$.
Second, this is the same as maximizing $-L$, because maximum likelihood estimation tells us to maximize the (log-)likelihood.
Then we assume our training examples are i.i.d. (independent and identically distributed).
$$y = 1 : p(y|x) = \widehat{y}$$
$$y = 0 : p(y|x) = 1 - \widehat{y}$$
We combine the two cases into one expression:
$$p(y|x) = \widehat{y}^{\,y}(1-\widehat{y})^{(1-y)}$$
Taking the log of both sides gives exactly $-L(\widehat{y},y)$, so maximizing the likelihood is the same as minimizing the loss.
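Written out, the step above is just
$$
\log p(y|x) = y\log\widehat{y} + (1-y)\log(1-\widehat{y}) = -L(\widehat{y},y)
$$
and for $m$ i.i.d. examples the log-likelihood of the whole training set is $\sum_{i=1}^{m}\log p(y^{(i)}|x^{(i)}) = -\sum_{i=1}^{m} L(\widehat{y}^{(i)},y^{(i)})$, so maximizing it is (up to the $\frac{1}{m}$ factor) the same as minimizing the cost $J(w,b)$.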
Use the gradient descent algorithm
- Used to train $w$ and $b$ and find the values that give the best result;
- Procedure: $J(w,b)$ is a convex function, so we can find the minimum (at the very least a local minimum). So we repeat the following updates until convergence:
$$
w=w-\alpha \frac{\partial J(w,b)}{\partial w},\qquad b=b-\alpha \frac{\partial J(w,b)}{\partial b}
$$
Implementing the gradient for logistic regression
- three core formulas:
$$z = w^Tx+b$$
$$\widehat{y} = a = \sigma(z)$$
$$L(a,y) = -(y\log(a)+(1-y)\log(1-a))$$
- The number of features equals the number of weights $w_i$, and there is only a single $b$. Here we compute the loss on a single example.
- Calculate:
$$
dz = a - y , db = dz
$$
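Where $dz = a - y$ comes from (the chain rule through the sigmoid; here $dz$ denotes $\frac{\partial L}{\partial z}$):
$$
\frac{\partial L}{\partial z} = \frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial z} = \left(-\frac{y}{a}+\frac{1-y}{1-a}\right)\cdot a(1-a) = a - y
$$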
Then you can compute $dw_1, dw_2, \dots$ and update each $w_i$ in the same way as before:
$$dw_1 = x_1 dz,\qquad w_1 = w_1 - \alpha\, dw_1
$$
For the $m$ examples, use a for loop:
- Accumulate the per-example values of the loss and of each $dw_i$ and $db$, then divide each accumulated sum by $m$ to obtain $J$, $dw_i$ and $db$; a loop-based sketch follows the update formulas below.
$$w_i = w_i - \alpha dw_i
$$
$$b = b - \alpha db$$
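As mentioned above, here is a rough sketch of that version with explicit for loops over the examples, followed by the gradient-descent updates (the toy data, learning rate, and number of steps are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# made-up toy data: n_x = 2 features, m = 4 examples
X = np.array([[0.5, 1.0, 1.5, 2.0],
              [1.0, 0.0, 1.0, 0.0]])
Y = np.array([1, 0, 1, 0])
n_x, m = X.shape
w = np.zeros(n_x)
b = 0.0
alpha = 0.1                        # learning rate (placeholder value)

for step in range(1000):           # each step is one iteration / gradient update
    J, dw, db = 0.0, np.zeros(n_x), 0.0
    for i in range(m):             # accumulate loss and gradients over the m examples
        z = np.dot(w, X[:, i]) + b
        a = sigmoid(z)
        J += -(Y[i] * np.log(a) + (1 - Y[i]) * np.log(1 - a))
        dz = a - Y[i]
        dw += X[:, i] * dz         # dw_i accumulates x_i * dz
        db += dz
    J, dw, db = J / m, dw / m, db / m
    w -= alpha * dw                # w_i = w_i - alpha * dw_i
    b -= alpha * db
```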
Vectorization
- Use vectorized operations to speed up the computation; explicit for loops are too slow.
- Key idea: turn the for-loop computation into matrix operations so NumPy can do the work efficiently, and by stacking the vectors into matrices you can compute all the results for the whole training set at once, as sketched below.
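A minimal sketch of the same training step fully vectorized, assuming the conventions above ($X$ of shape $(n_x, m)$, $Y$ of shape $(1, m)$, $w$ of shape $(n_x, 1)$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_vectorized(X, Y, alpha=0.1, num_steps=1000):
    """Vectorized logistic regression. X has shape (n_x, m), Y has shape (1, m)."""
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    for step in range(num_steps):
        Z = np.dot(w.T, X) + b     # (1, m): all m values of z at once
        A = sigmoid(Z)             # (1, m): all m predictions at once
        dZ = A - Y                 # (1, m)
        dw = np.dot(X, dZ.T) / m   # (n_x, 1): replaces the inner for loop
        db = np.sum(dZ) / m
        w -= alpha * dw
        b -= alpha * db
    return w, b
```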
Broadcasting in Python
`cal = A.sum(axis=0)  # axis=0 sums vertically (down the columns); axis=1 would sum horizontally`
In fact, NumPy automatically expands (broadcasts) the smaller array to the shape needed for the computation, by conceptually copying it vertically or horizontally.
You can read the NumPy documentation on broadcasting for details.
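For example, a small sketch of what broadcasting does here (the 3×4 matrix is arbitrary example data):

```python
import numpy as np

A = np.array([[56.0,   0.0,  4.4, 68.0],
              [ 1.2, 104.0, 52.0,  8.0],
              [ 1.8, 135.0, 99.0,  0.9]])     # arbitrary 3x4 example data
cal = A.sum(axis=0)                           # shape (4,): sum down each column
percentage = 100 * A / cal.reshape(1, 4)      # (3,4) / (1,4): the row is broadcast to all 3 rows
print(percentage)
```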
Do not use rank-1 arrays
`a = np.random.randn(5)  # creates a rank-1 array of shape (5,)`
Instead, do this:
`a = np.random.randn(5, 1)  # column vector of shape (5, 1)`
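A quick sketch of why rank-1 arrays are confusing and how to guard against them:

```python
import numpy as np

a = np.random.randn(5)       # rank-1 array: shape (5,), neither a row nor a column vector
print(a.shape, a.T.shape)    # (5,) (5,)  -- transposing does nothing
print(np.dot(a, a.T))        # a single number, not the (5,5) outer product you might expect

b = np.random.randn(5, 1)    # proper column vector: shape (5, 1)
print(np.dot(b, b.T).shape)  # (5, 5) outer product, as expected
assert b.shape == (5, 1)     # cheap sanity check on shapes
a = a.reshape(5, 1)          # or fix an existing rank-1 array by reshaping it
```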
Neural Network
Overview & Representation
- input layer -> hidden layer -> output layer
- We call the input layer layer 0, and count a network's layers by that convention (the input layer is not counted, so a network with one hidden layer and one output layer is a 2-layer network); see the forward-pass sketch below.
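As mentioned above, here is a minimal sketch of the forward pass for a 2-layer network (one hidden layer, one output layer) under this counting convention; the layer sizes, activations, and random data are placeholder choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_x, n_h, m = 3, 4, 5                  # input size, hidden size, number of examples (made up)
X = np.random.randn(n_x, m)            # layer 0: the input layer (not counted)

W1 = np.random.randn(n_h, n_x) * 0.01  # layer 1: the hidden layer's parameters
b1 = np.zeros((n_h, 1))
W2 = np.random.randn(1, n_h) * 0.01    # layer 2: the output layer's parameters
b2 = np.zeros((1, 1))

A1 = np.tanh(np.dot(W1, X) + b1)       # hidden-layer activations, shape (n_h, m)
A2 = sigmoid(np.dot(W2, A1) + b2)      # output predictions, shape (1, m)
```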
西瓜书 (Watermelon Book)
Performance Measures
Error rate and accuracy are skipped here; this part focuses on precision and recall.
Precision:
$$
P = \frac{TP}{TP+FP}
$$
Recall:
$$
R = \frac{TP}{TP+FN}
$$
Intuitive visualization: the P-R (precision-recall) curve.
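A small sketch computing both measures from the confusion-matrix counts (the counts below are made up):

```python
def precision_recall(tp, fp, fn):
    """Precision P = TP / (TP + FP), Recall R = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# made-up counts: 30 true positives, 10 false positives, 20 false negatives
print(precision_recall(tp=30, fp=10, fn=20))   # (0.75, 0.6)
```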
Neural Networks
Feedforward networks (FNN)
- The simplest kind
- No feedback connections between neurons; information can only flow forward
- No memory effect; suitable for many supervised learning tasks
Typical example: convolutional neural networks (CNN)
Feedback (recurrent) networks
- Have feedback / recurrent connections
- Outputs can be fed back into the network's input at later time steps
- Suited to tasks that need memory and contextual information, i.e. processing sequential data
Typical example: recurrent neural networks (RNN)
LSTM
An improved variant of the RNN
Nonlinear Classifiers
Increase the dimensionality (map the features into a higher-dimensional space where the classes become separable).