Deep Learning I
Naturally, the first lecture covers the basics of neural networks and training algorithms.
Deep learning is all the rage, so I'm giving it a look. Not that I chase whatever happens to be trendy; just poking around ╮(╯▽╰)╭
Still, what I really want to study is databases. It's an obsession. This semester's Advanced Database course isn't a good fit for me, sadly.
In any case, writing notes helps things stick.
The lecture opens with background: connectionist machines, A-type and B-type machines, brain models, Hebbian learning, Rosenblatt's perceptron, and so on. A quick read is enough for those; the perceptron, which I've studied several times before, is the important concept here.
Perceptron
The perceptron is a classic machine learning model: a binary classifier that maps an input vector x to a binary output y.

Sequential learning:
- Update the weights whenever the perceptron output is wrong
- Convergence was proved (for linearly separable data)
 
A binary perceptron acts as a linear classifier; it can represent all the elementary Boolean operations except XOR (which is not linearly separable), as the sketch below illustrates.
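To make this concrete, here is a minimal toy sketch (my own example, not from the lecture) of a threshold perceptron acting as an AND or OR gate; no single choice of weights and threshold reproduces XOR.

```python
import numpy as np

def perceptron(x, w, b):
    """Threshold unit: fires (1) if the weighted sum of inputs exceeds 0."""
    return int(np.dot(w, x) + b > 0)

# AND: both inputs must be 1 for the weighted sum to clear the threshold.
# OR:  a single active input is enough.
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    and_out = perceptron(x, w=[1, 1], b=-1.5)
    or_out  = perceptron(x, w=[1, 1], b=-0.5)
    print(x, "AND:", and_out, "OR:", or_out)
# XOR would need the unit to fire for (0,1) and (1,0) but not (1,1) --
# impossible for any single linear threshold.
```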
Activation: the function that acts on the weighted combination of inputs (and the threshold).
Usually a sigmoid? Homework 1 had us implement sigmoid, tanh, and ReLU. To serve as an activation, a function should be differentiable and nonlinear; see the sketch below.
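For reference, a minimal NumPy sketch of the three activations (my own version; the homework has its own class-based API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes to (0, 1); derivative s * (1 - s)

def tanh(z):
    return np.tanh(z)                  # squashes to (-1, 1); derivative 1 - t**2

def relu(z):
    return np.maximum(0.0, z)          # piecewise linear; (sub)derivative 0 or 1

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```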
MLP
The multilayer perceptron maps an input vector to an output vector, overcoming the single perceptron's inability to handle linearly inseparable data.
- MLPs are connectionist computational models
  - Individual perceptrons are the computational equivalent of neurons
  - An MLP is a layered composition of many perceptrons
- MLPs can model any Boolean function
  - Individual perceptrons can act as Boolean gates
  - Networks of perceptrons are Boolean functions
  - MLPs are Boolean machines
- MLPs can represent arbitrary decision boundaries
  - They can be used to classify data
  - MLPs are classification engines
- MLPs can also model continuous-valued functions

With the background done, we get to the "deep" part; depth > 2 counts as deep.
Since a Boolean function can be read off as a truth table, an MLP with a single hidden layer can represent any Boolean function: roughly, one hidden perceptron per input row that maps to 1, OR-ed together at the output.
For a single-hidden-layer Boolean network (worst case):
- 2^(N-1) perceptrons in the hidden layer (exponential in N)
- O(N·2^(N-1)) weights (superexponential)
 
For the same Boolean function, taking the XOR of all N inputs as the example, a deep network needs only:
- 3(N-1) perceptrons
- 9(N-1) parameters (linear in N)
- arranged in 2·log2(N) layers; see the sketch below
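A sketch of this construction (my own code, under the lecture's counting convention that each pairwise XOR costs 3 perceptrons, i.e. 9 parameters): the XOR of N inputs computed as a balanced tree of pairwise XOR gates.

```python
import numpy as np

def gate(x, w, b):
    """A single threshold perceptron (2 weights + 1 bias = 3 parameters)."""
    return int(np.dot(w, x) + b > 0)

def xor2(a, b):
    """XOR of two bits from 3 perceptrons: (a AND NOT b) OR (NOT a AND b)."""
    h1 = gate([a, b], w=[1, -1], b=-0.5)      # fires only for a=1, b=0
    h2 = gate([a, b], w=[-1, 1], b=-0.5)      # fires only for a=0, b=1
    return gate([h1, h2], w=[1, 1], b=-0.5)   # OR of the two hidden units

def xor_n(bits):
    """Balanced tree of pairwise XORs: ~2*log2(N) layers, 3*(N-1) perceptrons."""
    layer = list(bits)
    while len(layer) > 1:
        nxt = [xor2(layer[i], layer[i + 1]) for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:                    # odd element passes through unchanged
            nxt.append(layer[-1])
        layer = nxt
    return layer[0]

bits = [1, 0, 1, 1, 0, 1, 0, 0]
print(xor_n(bits), sum(bits) % 2)             # should agree: XOR of N bits is their parity
```

The contrast with the single-hidden-layer version (which would need on the order of 2^(N-1) hidden units) is exactly the point of this section.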
 
Conclusions:
- Reducing the number of layers below the minimum results in an exponentially sized network to express the function fully
- A network with fewer than the minimum required number of neurons cannot model the function
- Any Boolean circuit of depth d using AND, OR and NOT gates with unbounded fan-in must have size 2^(n^(1/d))
 
Shannon's theorem: for n > 2, there exist Boolean functions of n variables that require at least 2^n / n gates.

- For the general case of mutually intersecting hyperplanes in D dimensions, we will need O(N^D / (D-1)!) weights (assuming N ≫ D)
- Increasing the input dimension can increase the worst-case size of the shallower network exponentially, but not that of the deeper XOR net
 
 
MLPs can also do regression on continuous values:
- An MLP with many units can model an arbitrary function over an input
- A one-layer MLP can model an arbitrary function of a single input
- MLPs with additive output units are universal approximators
- A neural network can represent any function provided it has sufficient capacity
  - Not all architectures can represent any function
 
 
The slide deck runs over a hundred pages, so why are my notes only a few lines long (._.)
Neural Network
Architecture: MLPs
Parameters: weights and bias
Learning: determine value of parameters so network computes desired function

- More generally, given the function to model, we can derive the parameters of the network to model it, through computation
 
In general, though, g(X) is unknown, so we sample g(X):
- Basically, get input-output pairs for many samples of the input: (X_i, d_i), where d_i = g(X_i) + noise
- Good sampling: the samples of X are drawn from P(X)
 
Learning function: estimate the network parameters to “fit” the training points exactly
Perceptron Learning Algorithm
| Perceptron learning algo | Summary |
|---|---|
| ![]() | ![]() |
- Guaranteed to converge if the classes are linearly separable
- After no more than (R/r)^2 misclassifications, where R bounds the length of the inputs and r is the best-case margin; a minimal sketch of the loop follows
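A minimal sketch of the update rule (my own toy data; labels in {-1, +1}, bias folded into the weights via an appended 1):

```python
import numpy as np

def train_perceptron(X, y, max_epochs=100):
    """Cycle through the data, adding y_i * x_i to w whenever x_i is misclassified."""
    X = np.hstack([X, np.ones((len(X), 1))])   # append a 1 so the bias lives inside w
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:        # wrong side (or on the boundary)
                w += yi * xi                   # nudge the boundary toward xi
                mistakes += 1
        if mistakes == 0:                      # converged: everything classified
            break
    return w

# Linearly separable toy data: class is the sign of x1 + x2 - 1
X = np.array([[0., 0.], [0., 2.], [2., 0.], [2., 2.]])
y = np.array([-1, 1, 1, 1])
print(train_perceptron(X, y))
```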
 
Greedy algorithms
ADALINE: Adaptive Linear Neuron
| Learning | Rule |
|---|---|
| ![]() | ![]() |
| ![]() | ![]() |
MADALINE: Multiple ADALINE
- Effective for small networks, but not very useful for large nets; too expensive/greedy
 
| Training | Learning |
|---|---|
| ![]() | ![]() |
Risk & Error
Empirical risk
- The empirical risk is only an empirical approximation to the true risk which is our actual minimization objective
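In the lecture's notation (as I read it), with T training pairs (X_i, d_i) and a divergence function div, the empirical risk is the sample average

Err(W) = (1/T) · Σ_i div( f(X_i; W), d_i ) ≈ E[ div( f(X; W), d ) ],

i.e. it only approximates the expectation (the true risk) that we actually want to minimize.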
 
| Risk | Minimization |
|---|---|
| ![]() | ![]() |
Finding the W that minimizes Err takes some matrix calculus; having forgotten all of it, I spent forever re-deriving things for the first homework.
Problem statement:

| Find min | Hessian |
|---|---|
| ![]() | ![]() |
Solutions
| Closed Form | Iterative |
|---|---|
| ![]() | ![]() |
Because the closed form cannot always be solved (or is too expensive to solve), iterative methods are introduced, which brings in the workhorse training algorithm: gradient descent. A closed-form sketch for comparison is below.
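For a linear (least-squares) problem, a tiny sketch of the closed-form route (my own toy data), for contrast with the iterative approach introduced next:

```python
import numpy as np

# Toy linear data: d = 3*x1 - 2*x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
d = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=100)

# Closed form: the W that minimizes ||X W - d||^2 (least-squares solution)
W_closed, *_ = np.linalg.lstsq(X, d, rcond=None)
print(W_closed)   # should be close to [3, -2]
```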
Gradient Descent
| Approach | Multivariate |
|---|---|
| ![]() | ![]() |
| ![]() | ![]() |
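And a bare-bones sketch of the update w ← w − η·∇Err(w) on the same kind of least-squares objective (step size, stopping rule, and data are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
d = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=100)

def grad(W):
    """Gradient of Err(W) = (1/T) * ||X W - d||^2 with respect to W."""
    return 2.0 / len(X) * X.T @ (X @ W - d)

W, eta = np.zeros(2), 0.1
for step in range(1000):
    g = grad(W)
    if np.linalg.norm(g) < 1e-6:      # stop when the gradient has (almost) vanished
        break
    W = W - eta * g                   # step downhill along the negative gradient
print(W)                              # should approach the least-squares solution
```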
    
So the problem now becomes: use GD to find the parameters that minimize the objective (Err/Cost).
Problem statement:

Here (X, d) are the inputs (written in matrix form), W holds the network's parameters (the result we want from training), div measures the error between the network's predictions and the ground truth (i.e. it defines the objective function), and f denotes the network.
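Put together (my paraphrase of the slide), training solves

W* = argmin_W (1/T) · Σ_i div( f(X_i; W), d_i ),

where f(·; W) is the function computed by the network with parameters W.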
Network Construction
| Individual Neuron | Activations |
|---|---|
| ![]() | ![]() |
| Combination | Notation |
|---|---|
| ![]() | ![]() |
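A minimal sketch of what one unit computes, an affine combination of its inputs followed by the activation, and how a whole layer is just the vectorized version (my own notation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])        # inputs from the previous layer
w = np.array([0.1, 0.4, -0.3])        # one weight per input
b = 0.2                               # bias (the learned threshold)
y = sigmoid(np.dot(w, x) + b)         # single neuron: activation of the affine sum

W = np.array([[0.1, 0.4, -0.3],       # a layer is a stack of such neurons:
              [0.7, -0.2, 0.5]])      # one row of W per neuron
Y = sigmoid(W @ x + np.array([0.2, -0.1]))
print(y, Y)
```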
    
Training by GD

The derivative of the objective w.r.t. W is just the matrix whose entries are the derivatives of the objective w.r.t. each W_ij. Using the chain rule, we differentiate layer by layer from the last layer backwards until we reach the front.
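Written out for one weight (standard backprop bookkeeping, in my own notation): with z_i^(k) = Σ_j w_ij^(k) y_j^(k-1) + b_i^(k) and y_i^(k) = f(z_i^(k)),

∂Div/∂w_ij^(k) = ∂Div/∂y_i^(k) · f′(z_i^(k)) · y_j^(k-1),

and ∂Div/∂y_j^(k-1) = Σ_i ∂Div/∂y_i^(k) · f′(z_i^(k)) · w_ij^(k) is what gets passed back to the previous layer.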
| Chain rule | Network |
|---|---|
| ![]() | ![]() |
To train a neural network with gradient descent: first compute predictions from the initialized weights and biases, then compute the loss/err from the objective function and the ground truth. To find the weights and biases that minimize this objective we use gradient descent, which requires a backward computation: for every layer, compute the partial derivatives of the objective w.r.t. that layer's weights and biases, then update the weights according to the update rule. Repeat until convergence. A compact sketch follows.
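A compact sketch of this loop for a one-hidden-layer network with sigmoid units and a squared-error divergence (layer sizes, learning rate, and data are my own toy choices, not the course's setup):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))            # 4 samples, 3 input features
D = rng.random(size=(4, 1))            # targets in (0, 1)

W1, b1 = rng.normal(size=(3, 5)) * 0.1, np.zeros(5)   # input -> hidden
W2, b2 = rng.normal(size=(5, 1)) * 0.1, np.zeros(1)   # hidden -> output
eta = 0.5

for step in range(2000):
    # Forward pass: compute predictions from the current weights and biases
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)
    loss = np.mean((Y - D) ** 2)

    # Backward pass: chain rule, from the output layer back toward the input
    dY = 2 * (Y - D) / len(X)          # d loss / d Y
    dZ2 = dY * Y * (1 - Y)             # through the output sigmoid
    dW2, db2 = H.T @ dZ2, dZ2.sum(0)
    dH = dZ2 @ W2.T                    # gradient passed back to the hidden layer
    dZ1 = dH * H * (1 - H)             # through the hidden sigmoid
    dW1, db1 = X.T @ dZ1, dZ1.sum(0)

    # Update rule: step each parameter against its gradient
    W1 -= eta * dW1; b1 -= eta * db1
    W2 -= eta * dW2; b2 -= eta * db2

print(loss)                            # should be small after training
```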
| Forward pass | Backward pass |
|---|---|
| ![]() | ![]() |
Training by BackProp
| Pseudo code | Vector formulation |
|---|---|
| ![]() | ![]() |
- Assumptions (all of these frequently fail to hold in practice):
  - The computation of the output of one neuron does not directly affect the computation of other neurons in the same (or previous) layers
  - Outputs of neurons only combine through weighted addition
  - Activations are actually differentiable
- Vector activations: all outputs are functions of all inputs
The first two weeks' material was mostly covered by 601 and 605, which I took before; it's been a while and I'd forgotten some of it, so this doubles as review.
References
Multilayer Feedforward Networks
A Capacity Scaling Law for ANN
Learning representations by back-propagating errors