
Deep Learning I

11785 DL1 - NN Training & Optimization

Posted by freeCookie🍪 on February 4, 2018


Naturally, the course starts with the basics of neural networks and their training algorithms.

Deep learning is hot right now, so I'm giving it a look. Not that I study whatever happens to be trendy; just poking around ╮(╯▽╰)╭

But I still want to study databases. It's an obsession. This semester's Advanced Database course isn't a good fit for me, sadly.

Still, to make the material stick, I'm writing notes.

The lecture opened with background: connectionist machines, Turing's A-type and B-type machines, brain models, Hebbian learning, Rosenblatt's perceptron, and so on. A passing familiarity is enough for these; the perceptron, which I've studied several times before, is the important concept.

Perceptron

The perceptron is a classic machine-learning model: a binary classifier that maps an input vector x to a binary output y.

Sequential learning:

  • Update the weights whenever the perceptron output is wrong
  • Convergence was proved (for linearly separable data)

A binary perceptron can be viewed as a linear classifier; it can represent all of the elementary Boolean operations except XOR.
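
For example (my own toy sketch, not from the slides), hand-picked weights and thresholds give AND, OR and NOT directly, while no single linear threshold can realize XOR:

```python
import numpy as np

def perceptron(x, w, b):
    """Threshold unit: outputs 1 iff w·x + b >= 0."""
    return int(np.dot(w, x) + b >= 0)

# Hand-picked weights for the basic gates (inputs are 0/1).
AND = lambda x: perceptron(x, np.array([1, 1]), -1.5)
OR  = lambda x: perceptron(x, np.array([1, 1]), -0.5)
NOT = lambda x: perceptron(x, np.array([-1]), 0.5)

for a in (0, 1):
    for b in (0, 1):
        print((a, b), "AND:", AND([a, b]), "OR:", OR([a, b]))
print("NOT 0:", NOT([0]), "NOT 1:", NOT([1]))
# XOR is not linearly separable, so no single perceptron can compute it.
```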

Activation function: the function that acts on the weighted combination of the inputs (and the threshold).

A common choice is the sigmoid. The first homework implemented sigmoid, tanh, and ReLU. An activation function needs to be (almost everywhere) differentiable and nonlinear.
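
A minimal NumPy sketch of those three activations and their derivatives (my own version, not the homework's reference implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh(x):
    return np.tanh(x)

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2

def relu(x):
    return np.maximum(0.0, x)

def d_relu(x):
    # ReLU is not differentiable at 0; using 0 there (a subgradient) is the usual choice.
    return (x > 0).astype(float)
```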

MLP

The multilayer perceptron maps an input vector to an output vector, overcoming the single perceptron's inability to classify data that is not linearly separable.

  • MLPs are connectionist computational models
    • Individual perceptrons are the computational equivalent of neurons
    • An MLP is a layered composition of many perceptrons
  • MLPs can model any Boolean function
    • Individual perceptrons can act as Boolean gates
    • Networks of perceptrons are Boolean functions, so MLPs are Boolean machines
  • MLPs are classification engines
    • They can represent arbitrary decision boundaries
    • They can be used to classify data
  • MLPs can also model continuous-valued functions

With the background covered, we get to the "deep" part: a network with depth > 2 counts as deep.

Since a Boolean function can be viewed as a collection of truth-table rows, an MLP with a single hidden layer can represent any Boolean function: one hidden unit per true row, OR-ed together at the output.
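
A sketch of that construction (my own illustration, using the N-input XOR/parity as the target): one hidden threshold unit per true row of the truth table, which fires only on exactly that input pattern, and an OR unit on top:

```python
import numpy as np
from itertools import product

N = 3
parity = lambda bits: sum(bits) % 2                 # target Boolean function
true_rows = [r for r in product((0, 1), repeat=N) if parity(r) == 1]

def detects_row(x, row):
    # +1 weights where the row has a 1, -1 where it has a 0, threshold = #ones:
    # the unit fires iff x matches the row exactly.
    w = np.array([1 if b else -1 for b in row])
    return int(np.dot(w, x) >= sum(row))

def one_hidden_layer_mlp(x):
    hidden = [detects_row(np.array(x), row) for row in true_rows]  # AND-like layer
    return int(sum(hidden) >= 1)                                   # OR output unit

for x in product((0, 1), repeat=N):
    assert one_hidden_layer_mlp(x) == parity(x)
print("reproduced the truth table with", len(true_rows), "hidden units")  # 2^(N-1) = 4
```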

For a single-hidden-layer Boolean network (worst case, e.g. the XOR of N inputs):

  • 2^(N-1) perceptrons in the hidden layer - exponential in N
  • O(N·2^(N-1)) weights - superexponential

For the same Boolean function, taking the XOR of all N inputs as the example, a deep network needs only:

  • 3(N-1) perceptrons in the deep network
  • 9(N-1) parameters - linear in N
  • Can be arranged in 2·log2(N) layers
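
A quick back-of-the-envelope check of those counts, using the formulas above (my own sketch; the deep network is built from two-input XOR blocks of 3 perceptrons / 9 parameters each):

```python
import math

def shallow_xor(N):
    # One hidden unit per true row of the N-input parity truth table.
    return {"perceptrons": 2 ** (N - 1), "weights": N * 2 ** (N - 1)}

def deep_xor(N):
    # Tree of (N - 1) two-input XOR blocks, 3 perceptrons / 9 parameters each.
    return {"perceptrons": 3 * (N - 1), "parameters": 9 * (N - 1),
            "layers": 2 * math.log2(N)}

for N in (4, 8, 16, 32):
    print(N, shallow_xor(N), deep_xor(N))
```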

Conclusions:

  • Reducing the number of layers below the minimum will result in an exponentially sized network to express the function fully
  • A network with fewer than the minimum required number of neurons cannot model the function
  • Any Boolean circuit of depth d using AND, OR and NOT gates with unbounded fan-in must have size 2^(n^(1/d))

Shannon’s theorem: for n > 2, there is a Boolean function of n variables that requires at least 2^n/n gates.

  • For the general case of mutually intersecting hyperplanes in D dimensions, we will need O(N^D/(D-1)!) weights (assuming N ≫ D)
    • Increasing the input dimension can increase the worst-case size of the shallower network exponentially, but not that of the XOR-style deep network

MLPs can also perform regression on continuous values:

  • An MLP with many units can model an arbitrary function over an input
  • A one-layer MLP can model an arbitrary function of a single input
  • MLPs with additive output units are universal approximators
  • A neural network can represent any function provided it has sufficient capacity
    • Not all architectures can represent any function

The lecture deck is over a hundred slides; why do my notes come to only a few lines (._.)

Neural Network

Architecture: MLPs

Parameters: weights and biases

Learning: determine the values of the parameters so that the network computes the desired function

  • More generally, given the function to model, we can derive the parameters of the network to model it, through computation

In general, however, g(X) is not known, so we sample g(X):

  • Basically, get input-output pairs for many samples of the input: (Xi, di), where di = g(Xi) + noise
  • Good sampling: the samples of X are drawn from P(X)

Learning function: estimate the network parameters to “fit” the training points exactly

Perceptron Learning Algorithm

Perceptron learning algorithm summary:

  • Guaranteed to converge if the classes are linearly separable
  • After no more than (R/r)^2 misclassifications (R bounds the norm of the data, r is the margin)
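
A minimal sketch of the algorithm (assuming labels in {-1, +1} and, for the convergence guarantee, linearly separable data):

```python
import numpy as np

def train_perceptron(X, y, max_epochs=100):
    """X: (n_samples, n_features) array; y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(w, x_i) + b) <= 0:   # update only on a wrong output
                w += y_i * x_i
                b += y_i
                mistakes += 1
        if mistakes == 0:                          # converged: no misclassifications left
            break
    return w, b
```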

Greedy algorithms

ADALINE: Adaptive Linear Neuron

Learning Rule

MADALINE: Multiple ADALINE

  • Effective for small networks, but too expensive/greedy to be very useful for large nets

Training / Learning

Risk & Error

Empirical risk

  • The empirical risk is only an empirical approximation to the true risk, which is our actual minimization objective

Risk Minimization

Find the W that minimizes Err. This is matrix calculus; having forgotten most of it, I spent ages re-deriving things for the first homework.

Problem statement:

Find the W that minimizes Err(W); at a minimum the gradient is zero and the Hessian is positive semi-definite.

Solutions

Closed form / Iterative

Because a closed-form solution is not always available, iterative methods are introduced, which brings in the most widely used training algorithm: gradient descent.
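
For the linear least-squares case the closed form is just the normal equations; a minimal NumPy sketch on made-up data (my own example, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                   # inputs, one sample per row
w_true = np.array([2.0, -1.0, 0.5])
d = X @ w_true + 0.1 * rng.normal(size=100)     # targets: g(X) + noise

# Closed form: W = (X^T X)^(-1) X^T d, the normal-equations solution.
W_closed = np.linalg.solve(X.T @ X, X.T @ d)
print(W_closed)                                 # close to w_true
```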

Gradient Descent

Approach / Multivariate

So the problem now becomes: how do we use gradient descent to find the optimal parameters that minimize the objective (the error/cost)?
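
A minimal sketch of that loop on the same kind of least-squares objective as above (learning rate and step count are hand-picked assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([2.0, -1.0, 0.5])
d = X @ w_true + 0.1 * rng.normal(size=100)

W = np.zeros(3)                                  # initial guess
eta = 0.1                                        # learning rate (not tuned)
for step in range(500):
    grad = 2.0 / len(d) * X.T @ (X @ W - d)      # gradient of the mean squared error
    W -= eta * grad                              # step against the gradient
print(W)                                         # approaches the closed-form solution
```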

Problem statement:

Here (X, d) are the inputs and targets, written in matrix form; W is the network's parameters, i.e. the result we want from training; div computes the error between the network's predictions and the ground truth, i.e. the objective function; and f denotes the network itself.

Network Construction

Individual Neuron Activations
Combination Notation

Training by GD

The derivative of the objective with respect to W is the matrix whose entries are the derivatives of the objective with respect to each Wij. Apply the chain rule from the last layer backwards, layer by layer, until the result is obtained.

Chain rule / Network

To train a neural network with gradient descent: first compute the predictions from the initialized weights and biases, then obtain the loss/error from the objective function and the ground truth. To find the weights and biases that minimize this objective, apply gradient descent. This requires a backward computation: for every layer, compute the partial derivatives of the objective with respect to its weights and biases, and update the weights according to the update rule. Repeat this process until convergence.

Forward pass / Backward pass

Training by BackProp

Pseudo-code / Vector formulation

  • Assumptions (all of these conditions are frequently not applicable):
    • The computation of the output of one neuron does not directly affect the computation of other neurons in the same (or previous) layers
    • Outputs of neurons only combine through weighted addition
    • Activations are actually differentiable
  • Vector activations: all outputs are functions of all inputs
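
Putting the forward and backward passes together, here is a minimal sketch of training a one-hidden-layer MLP with sigmoid units and a squared-error divergence by batch gradient descent. The layer sizes, learning rate, and epoch count are my own assumptions for a toy XOR fit, not the lecture's pseudo-code:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy data: the 2-input XOR, fit with a 2-4-1 network.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)    # hidden-layer parameters
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)    # output-layer parameters
eta = 2.0                                        # learning rate (hand-picked)

for epoch in range(20000):
    # Forward pass: compute activations layer by layer.
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)

    # Backward pass: chain rule from the output back through each layer.
    delta2 = (y - d) * y * (1 - y)               # d(div)/d(z2) for div = 1/2 * ||y - d||^2
    delta1 = (delta2 @ W2.T) * h * (1 - h)       # d(div)/d(z1)

    # Gradient-descent update of every weight and bias.
    W2 -= eta * h.T @ delta2 / len(X)
    b2 -= eta * delta2.mean(axis=0)
    W1 -= eta * X.T @ delta1 / len(X)
    b1 -= eta * delta1.mean(axis=0)

print(np.round(y.ravel(), 2))                    # typically close to [0, 1, 1, 0], depending on the seed
```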

Most of the content of the first two weeks was already covered by 601 and 605, though some of it has faded with time, so this works as a review.

References

Multilayer Feedforward Networks

Deep Sum-Product Networks

VC Dimension of NN

A Capacity Scaling Law for ANN

30 Years of ANN

Perceptron Mistake Bounds

Learning Representations by Back-propagating Errors