Deep Learning I
Naturally, the first lecture covers the basics of neural networks and training algorithms.
Deep learning is all the rage, so I'm giving it a look. Not that I chase whatever happens to be trendy; just poking around ╮(╯▽╰)╭
Still, what I really want to study is databases. It's an obsession. This semester's Advanced Database course isn't a good fit for me, sadly.
In any case, writing notes helps things stick.
The lecture opens with background: connectionist machines, A-type and B-type machines, brain models, Hebbian learning, Rosenblatt's perceptron, and so on. A quick read is enough for those; the perceptron, which I've studied several times before, is the important concept here.
Perceptron
The perceptron is a classic machine learning model: a binary classifier that maps an input vector x to a binary output y.

Sequential learning:
- Update the weights whenever the perceptron output is wrong
- Convergence was proved (for linearly separable data)
 
A binary perceptron acts as a linear classifier; it can represent all the elementary Boolean operations except XOR (which is not linearly separable), as the sketch below illustrates.
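To make this concrete, here is a minimal toy sketch (my own example, not from the lecture) of a threshold perceptron acting as an AND or OR gate; no single choice of weights and threshold reproduces XOR.

```python
import numpy as np

def perceptron(x, w, b):
    """Threshold unit: fires (1) if the weighted sum of inputs exceeds 0."""
    return int(np.dot(w, x) + b > 0)

# AND: both inputs must be 1 for the weighted sum to clear the threshold.
# OR:  a single active input is enough.
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    and_out = perceptron(x, w=[1, 1], b=-1.5)
    or_out  = perceptron(x, w=[1, 1], b=-0.5)
    print(x, "AND:", and_out, "OR:", or_out)
# XOR would need the unit to fire for (0,1) and (1,0) but not (1,1) --
# impossible for any single linear threshold.
```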
Activation: the function that acts on the weighted combination of inputs (and the threshold).
Usually a sigmoid? Homework 1 had us implement sigmoid, tanh, and ReLU. To serve as an activation, a function should be differentiable and nonlinear; see the sketch below.
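For reference, a minimal NumPy sketch of the three activations (my own version; the homework has its own class-based API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes to (0, 1); derivative s * (1 - s)

def tanh(z):
    return np.tanh(z)                  # squashes to (-1, 1); derivative 1 - t**2

def relu(z):
    return np.maximum(0.0, z)          # piecewise linear; (sub)derivative 0 or 1

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```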
MLP
The multilayer perceptron maps an input vector to an output vector, overcoming the single perceptron's inability to handle linearly inseparable data.
- MLPs are connectionist computational models
  - Individual perceptrons are the computational equivalent of neurons
  - An MLP is a layered composition of many perceptrons
- MLPs can model any Boolean function
  - Individual perceptrons can act as Boolean gates
  - Networks of perceptrons are Boolean functions
  - MLPs are Boolean machines
- MLPs can represent arbitrary decision boundaries
  - They can be used to classify data
  - MLPs are classification engines
- MLPs can also model continuous-valued functions

With the background done, we get to the "deep" part; depth > 2 counts as deep.
Since a Boolean function can be read off as a truth table, an MLP with a single hidden layer can represent any Boolean function: roughly, one hidden perceptron per input row that maps to 1, OR-ed together at the output.
For a single-hidden-layer Boolean network (worst case):
- 2^(N-1) perceptrons in the hidden layer (exponential in N)
- O(N·2^(N-1)) weights (superexponential)
 
For the same Boolean function, taking the XOR of all N inputs as the example, a deep network needs only:
- 3(N-1) perceptrons
- 9(N-1) parameters (linear in N)
- arranged in 2·log2(N) layers; see the sketch below
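A sketch of this construction (my own code, under the lecture's counting convention that each pairwise XOR costs 3 perceptrons, i.e. 9 parameters): the XOR of N inputs computed as a balanced tree of pairwise XOR gates.

```python
import numpy as np

def gate(x, w, b):
    """A single threshold perceptron (2 weights + 1 bias = 3 parameters)."""
    return int(np.dot(w, x) + b > 0)

def xor2(a, b):
    """XOR of two bits from 3 perceptrons: (a AND NOT b) OR (NOT a AND b)."""
    h1 = gate([a, b], w=[1, -1], b=-0.5)      # fires only for a=1, b=0
    h2 = gate([a, b], w=[-1, 1], b=-0.5)      # fires only for a=0, b=1
    return gate([h1, h2], w=[1, 1], b=-0.5)   # OR of the two hidden units

def xor_n(bits):
    """Balanced tree of pairwise XORs: ~2*log2(N) layers, 3*(N-1) perceptrons."""
    layer = list(bits)
    while len(layer) > 1:
        nxt = [xor2(layer[i], layer[i + 1]) for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:                    # odd element passes through unchanged
            nxt.append(layer[-1])
        layer = nxt
    return layer[0]

bits = [1, 0, 1, 1, 0, 1, 0, 0]
print(xor_n(bits), sum(bits) % 2)             # should agree: XOR of N bits is their parity
```

The contrast with the single-hidden-layer version (which would need on the order of 2^(N-1) hidden units) is exactly the point of this section.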
 
Conclusions:
- Reducing the number of layers below the minimum results in an exponentially sized network to express the function fully
- A network with fewer than the minimum required number of neurons cannot model the function
- Any Boolean circuit of depth d using AND, OR and NOT gates with unbounded fan-in must have size 2^(n^(1/d))
 
Shannon's theorem: for n > 2, there exist Boolean functions of n variables that require at least 2^n / n gates.

- For the general case of mutually intersecting hyperplanes in D dimensions, we will need O(N^D / (D-1)!) weights (assuming N ≫ D)
- Increasing the input dimension can increase the worst-case size of the shallower network exponentially, but not that of the deeper XOR net
 
 
MLPs can also do regression on continuous values:
- An MLP with many units can model an arbitrary function over an input
- A one-layer MLP can model an arbitrary function of a single input
- MLPs with additive output units are universal approximators
- A neural network can represent any function provided it has sufficient capacity
  - Not all architectures can represent any function
 
 
The slide deck runs over a hundred pages, so why are my notes only a few lines long (._.)
Neural Network
Architecture: MLPs
Parameters: weights and bias
Learning: determine value of parameters so network computes desired function

- More generally, given the function to model, we can derive the parameters of the network to model it, through computation
 
In general, though, g(X) is unknown, so we sample g(X):
- Basically, get input-output pairs for many samples of the input: (X_i, d_i), where d_i = g(X_i) + noise
- Good sampling: the samples of X are drawn from P(X)
 
Learning function: estimate the network parameters to “fit” the training points exactly
Perceptron Learning Algorithm
| Perceptron learning algo | Summary |
|---|---|
| ![]() | ![]() |
- Guaranteed to converge if the classes are linearly separable
- After no more than (R/r)^2 misclassifications, where R bounds the length of the inputs and r is the best-case margin; a minimal sketch of the loop follows
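A minimal sketch of the update rule (my own toy data; labels in {-1, +1}, bias folded into the weights via an appended 1):

```python
import numpy as np

def train_perceptron(X, y, max_epochs=100):
    """Cycle through the data, adding y_i * x_i to w whenever x_i is misclassified."""
    X = np.hstack([X, np.ones((len(X), 1))])   # append a 1 so the bias lives inside w
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:        # wrong side (or on the boundary)
                w += yi * xi                   # nudge the boundary toward xi
                mistakes += 1
        if mistakes == 0:                      # converged: everything classified
            break
    return w

# Linearly separable toy data: class is the sign of x1 + x2 - 1
X = np.array([[0., 0.], [0., 2.], [2., 0.], [2., 2.]])
y = np.array([-1, 1, 1, 1])
print(train_perceptron(X, y))
```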
 
Greedy algorithms
ADALINE: Adaptive Linear Neuron
| Learning | Rule |
|---|---|
| ![]() | ![]() |
| ![]() | ![]() |
MADALINE: Multiple ADALINE
- Effective for small networks, but not very useful for large nets; too expensive/greedy
 
| Training | Learning |
|---|---|
| ![]() | ![]() |
Risk & Error
Empirical risk
- The empirical risk is only an empirical approximation to the true risk which is our actual minimization objective
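In the lecture's notation (as I read it), with T training pairs (X_i, d_i) and a divergence function div, the empirical risk is the sample average

Err(W) = (1/T) · Σ_i div( f(X_i; W), d_i ) ≈ E[ div( f(X; W), d ) ],

i.e. it only approximates the expectation (the true risk) that we actually want to minimize.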
 
| Risk | Minimization |
|---|---|
| ![]() | ![]() |
Finding the W that minimizes Err takes some matrix calculus; having forgotten all of it, I spent forever re-deriving things for the first homework.
Problem statement:

| Find min | Hessian |
|---|---|
| ![]() | ![]() |
Solutions
| Closed Form | Iterative |
|---|---|
| ![]() | ![]() |
Because the closed form cannot always be solved (or is too expensive to solve), iterative methods are introduced, which brings in the workhorse training algorithm: gradient descent. A closed-form sketch for comparison is below.
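For a linear (least-squares) problem, a tiny sketch of the closed-form route (my own toy data), for contrast with the iterative approach introduced next:

```python
import numpy as np

# Toy linear data: d = 3*x1 - 2*x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
d = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=100)

# Closed form: the W that minimizes ||X W - d||^2 (least-squares solution)
W_closed, *_ = np.linalg.lstsq(X, d, rcond=None)
print(W_closed)   # should be close to [3, -2]
```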
Gradient Descent
| Approach | Multivariate |
|---|---|
| ![]() | ![]() |
| ![]() | ![]() |
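And a bare-bones sketch of the update w ← w − η·∇Err(w) on the same kind of least-squares objective (step size, stopping rule, and data are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
d = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=100)

def grad(W):
    """Gradient of Err(W) = (1/T) * ||X W - d||^2 with respect to W."""
    return 2.0 / len(X) * X.T @ (X @ W - d)

W, eta = np.zeros(2), 0.1
for step in range(1000):
    g = grad(W)
    if np.linalg.norm(g) < 1e-6:      # stop when the gradient has (almost) vanished
        break
    W = W - eta * g                   # step downhill along the negative gradient
print(W)                              # should approach the least-squares solution
```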
    
So the problem now becomes: use GD to find the parameters that minimize the objective (Err/Cost).
Problem statement:

Here (X, d) are the inputs (written in matrix form), W holds the network's parameters (the result we want from training), div measures the error between the network's predictions and the ground truth (i.e. it defines the objective function), and f denotes the network.
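Put together (my paraphrase of the slide), training solves

W* = argmin_W (1/T) · Σ_i div( f(X_i; W), d_i ),

where f(·; W) is the function computed by the network with parameters W.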
Network Construction
| Individual Neuron | Activations |
|---|---|
| ![]() | ![]() |
| Combination | Notation |
|---|---|
| ![]() | ![]() |
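A minimal sketch of what one unit computes, an affine combination of its inputs followed by the activation, and how a whole layer is just the vectorized version (my own notation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])        # inputs from the previous layer
w = np.array([0.1, 0.4, -0.3])        # one weight per input
b = 0.2                               # bias (the learned threshold)
y = sigmoid(np.dot(w, x) + b)         # single neuron: activation of the affine sum

W = np.array([[0.1, 0.4, -0.3],       # a layer is a stack of such neurons:
              [0.7, -0.2, 0.5]])      # one row of W per neuron
Y = sigmoid(W @ x + np.array([0.2, -0.1]))
print(y, Y)
```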
    
Training by GD

The derivative of the objective w.r.t. W is just the matrix whose entries are the derivatives of the objective w.r.t. each W_ij. Using the chain rule, we differentiate layer by layer from the last layer backwards until we reach the front.
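Written out for one weight (standard backprop bookkeeping, in my own notation): with z_i^(k) = Σ_j w_ij^(k) y_j^(k-1) + b_i^(k) and y_i^(k) = f(z_i^(k)),

∂Div/∂w_ij^(k) = ∂Div/∂y_i^(k) · f′(z_i^(k)) · y_j^(k-1),

and ∂Div/∂y_j^(k-1) = Σ_i ∂Div/∂y_i^(k) · f′(z_i^(k)) · w_ij^(k) is what gets passed back to the previous layer.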
| Chain rule | Network |
|---|---|
| ![]() | ![]() |
To train a neural network with gradient descent: first compute predictions from the initialized weights and biases, then compute the loss/err from the objective function and the ground truth. To find the weights and biases that minimize this objective we use gradient descent, which requires a backward computation: for every layer, compute the partial derivatives of the objective w.r.t. that layer's weights and biases, then update the weights according to the update rule. Repeat until convergence. A compact sketch follows.
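A compact sketch of this loop for a one-hidden-layer network with sigmoid units and a squared-error divergence (layer sizes, learning rate, and data are my own toy choices, not the course's setup):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))            # 4 samples, 3 input features
D = rng.random(size=(4, 1))            # targets in (0, 1)

W1, b1 = rng.normal(size=(3, 5)) * 0.1, np.zeros(5)   # input -> hidden
W2, b2 = rng.normal(size=(5, 1)) * 0.1, np.zeros(1)   # hidden -> output
eta = 0.5

for step in range(2000):
    # Forward pass: compute predictions from the current weights and biases
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)
    loss = np.mean((Y - D) ** 2)

    # Backward pass: chain rule, from the output layer back toward the input
    dY = 2 * (Y - D) / len(X)          # d loss / d Y
    dZ2 = dY * Y * (1 - Y)             # through the output sigmoid
    dW2, db2 = H.T @ dZ2, dZ2.sum(0)
    dH = dZ2 @ W2.T                    # gradient passed back to the hidden layer
    dZ1 = dH * H * (1 - H)             # through the hidden sigmoid
    dW1, db1 = X.T @ dZ1, dZ1.sum(0)

    # Update rule: step each parameter against its gradient
    W1 -= eta * dW1; b1 -= eta * db1
    W2 -= eta * dW2; b2 -= eta * db2

print(loss)                            # should be small after training
```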
| Forward pass | Backward pass |
|---|---|
| ![]() | ![]() |
Training by BackProp
| Pseudo code | Vector formulation |
|---|---|
| ![]() | ![]() |
- Assumptions (all of these frequently fail to hold in practice):
  - The computation of the output of one neuron does not directly affect the computation of other neurons in the same (or previous) layers
  - Outputs of neurons only combine through weighted addition
  - Activations are actually differentiable
- Vector activations: all outputs are functions of all inputs
The first two weeks' material was mostly covered by 601 and 605, which I took before; it's been a while and I'd forgotten some of it, so this doubles as review.
References
Multilayer Feedforward Networks
A Capacity Scaling Law for ANN
Learning representations by back-propagating errors