

Just a review of machine learning for myself (really busy recently, so …)

Content is drawn from Dive into Deep Learning, Pattern Recognition and Machine Learning, and the web.


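  • Batch normalization: subtract the batch mean and divide by the batch standard deviation (mean->0, variance->1), then apply 2 trainable parameters, a scale and a shift, to counter internal covariate shift (i.e. the distribution of each layer's inputs keeps changing during training as the preceding layers' parameters change). See Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. It also makes optimization easier, and no feature ends up arbitrarily emphasized over the others.
  • Batch normalization is usually applied to each layer separately. The batch size cannot be 1 (the output would always be 0).
  • For fully connected layers, batch normalization usually sits between the affine transformation and the activation; for convolutional layers it usually sits between the convolution and the activation (applied to each output channel separately).
  • Batch normalization behaves differently in training and prediction: prediction uses the population mean and variance estimated over the whole training set during training.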
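  • MXNet's ndarray has 2 features beyond NumPy's: 1. automatic differentiation; 2. asynchronous computation on GPUs and distributed cloud architectures.
  • Broadcasting replicates an array along new (or size-1) dimensions so that the shapes match.
  • Missing data such as NaN can be handled by imputation (filling in values) or deletion.
  • Using x += y or z[:] = x writes the result into the existing ndarray's memory instead of allocating a new one, which saves memory.
  • scalar, vector, matrix, tensor: 0-, 1-, 2-, n-dimensional
  • $L_{p}$ norm: $\|x\|_{p}=\left(\sum_{i=1}^{n}|x_{i}|^{p}\right)^{1/p}$
  • Property of the $L_{p}$ norm, for vectors in $C^{n}$ where $0 < r < p$: $\|x\|_{p}\le\|x\|_{r}\le n^{(1/r-1/p)}\|x\|_{p}$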
  • calculus: integration, differentiation
  • product rule: $(fg)'=f'g+fg'$

  • quotient rule: $\left(\frac{f}{g}\right)'=\frac{f'g-fg'}{g^{2}}$
  • chain rule: $\frac{dy}{dx}=\frac{dy}{du}\frac{du}{dx}$
  • matrix calculus: [image] (see the matrix calculus article on Wikipedia)
  • A gradient is a vector whose components are the partial derivatives of a multivariate function with respect to all its variables
  • Bayes' Theorem: $P(A \mid B)=\frac{P(B \mid A)P(A)}{P(B)}$
  • Derivation: [image]
  • dot product: a scalar; cross product: a vector
  • stochastic gradient descent: update in the direction of the negative gradient of the minibatch loss: $w \leftarrow w-\frac{\eta}{|B|}\sum_{i\in B}\nabla_{w}l^{(i)}(w)$, with learning rate $\eta$ and minibatch $B$
  • likelihood: [image]
  • The negative log-likelihood is often used to turn maximizing a product into minimizing a sum.
  • minimizing squared error is equivalent to maximum likelihood estimation of a linear model under the assumption of additive Gaussian noise
  • one-hot encoding: a single 1, with 0 everywhere else
  • entropy of a distribution p: $H(p)=-\sum\limits_{x\in X}{p(x)\log p(x)}$
  • cross-entropy is asymmetric: $H(p,q)=-\sum\limits_{x\in X}{p(x)\log q(x)}$
  • minimizing cross-entropy == maximizing likelihood
  • Kullback-Leibler divergence (also called relative entropy or information gain) is the difference between cross-entropy and entropy: $D_{KL}(p\|q)=H(p,q)-H(p)=\sum\limits_{x\in X}{p(x)\log\frac{p(x)}{q(x)}}$
  • KL divergence is asymmetric and does not satisfy the triangle inequality
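  • cross validation: split the data into k sets, run k experiments (each trains on k-1 sets and validates on the remaining one), and average the results
  • forward propagation calculates and stores intermediate variables.
  • For a loss function $J$ and a weight matrix $W$ whose partial derivative we want, $\frac{\partial J}{\partial W}=\frac{\partial J}{\partial O}I^{T}+\lambda W$, where $O$ is the layer's output and $I$ is its input; the last term is the derivative of the regularization term. This also shows why forward propagation must keep its intermediate results. Note that the derivative of an activation function enters as an elementwise multiplication, and for some activation functions the derivative has to be computed separately for different ranges of values. As a result, training uses more memory than prediction.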
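  • Shift: distribution shift, covariate shift, label shift
  • covariate shift correction: this concerns only the features $X$ and has nothing to do with $y$. The training set comes from $q(x)$ and the test set from $p(x)$; since $\int p(x)f(x)dx=\int q(x)f(x)\frac{p(x)}{q(x)}dx$, it suffices to give each training example the weight $\frac{p(x)}{q(x)}$. In practice one usually mixes the two datasets and trains a classifier to tell them apart in order to estimate this weight; a logistic regression classifier makes this easy to compute.
  • label shift correction: same as above, add importance weights; in plain terms, adjust the outputs.
  • concept shift correction: the shift is usually gradual, so it is enough to update the model by training further on the new data.
  • Missing data can be filled in with the mean of that feature.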
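  • Logarithms are useful for relative loss: division becomes subtraction.
  • For complex models, a block denotes a particular structure, which can be one layer, several layers, or the whole model. It only needs its parameters and a forward function defined.
  • Sometimes a model needs to (and can) tie the same parameters across different layers. In backpropagation, the gradient of the shared parameters is then the sum of the gradients from each layer that uses them; e.g. for a->b->b->c, it is the sum over the first b and the second b.
  • chain rule (probability): $P(x_{1},\ldots,x_{n})=\prod_{i=1}^{n}P(x_{i} \mid x_{1},\ldots,x_{i-1})$


  • Layer widths (number of nodes) are usually chosen as powers of 2 for computational efficiency.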
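To make the batch normalization notes at the top of this list concrete, here is a minimal sketch in plain Python for a single feature. The function names, the `gamma`/`beta` parameter names and the `eps` stabilizer are my own assumptions for illustration, not the API of any library.

```python
import math

def batch_norm_train(batch, gamma, beta, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift.

    gamma (scale) and beta (shift) stand in for the 2 trainable parameters;
    eps is an assumed small constant that avoids division by zero.
    """
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    out = [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]
    return out, mean, var

def batch_norm_predict(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """At prediction time, use statistics estimated during training."""
    return gamma * (x - running_mean) / math.sqrt(running_var + eps) + beta

out, mean, var = batch_norm_train([1.0, 2.0, 3.0, 4.0], gamma=1.0, beta=0.0)
print(out)  # roughly zero mean and unit variance

# With a batch of size 1 the centered value is 0, so the output carries no signal.
print(batch_norm_train([5.0], gamma=1.0, beta=0.0)[0])  # [0.0]
```

A real layer would keep a moving average of `mean` and `var` across batches and pass those to the prediction function.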

Transfer learning

  • Chop off the classification layers and replace them with ones that cater to one's needs. Freeze the pretrained layers during training. Enabling training on the batch normalization layers as well may give better results.

Curriculum learning

Objective function

Mean absolute error

Mean squared error

Cross-entropy loss

  • $\mathrm{loss}(x,class)=-\log\left(\frac{\exp(x[class])}{\sum_{j}\exp(x[j])}\right)=-x[class]+\log\left(\sum_{j}\exp(x[j])\right)$
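The second form of the formula above is the one that is convenient to compute. A minimal sketch in plain Python, assuming the usual trick of subtracting the maximum logit inside the log-sum-exp (the function name is hypothetical):

```python
import math

def cross_entropy_loss(logits, target):
    """-x[class] + log(sum_j exp(x[j])), computed without overflow."""
    m = max(logits)  # shifting by the max keeps every exp() argument <= 0
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return -logits[target] + log_sum_exp

print(cross_entropy_loss([0.0, 0.0], 0))     # log(2), two equally likely classes
print(cross_entropy_loss([1000.0, 0.0], 0))  # finite, although exp(1000) alone would overflow
```

Shifting by the maximum does not change the result, because the shift cancels between the two terms of the log-sum-exp.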


Weight decay

  • i.e. L2 regularization
  • encourages weight values to decay towards zero, unless supported by the data.
  • This is the q=2 case (ridge): it spreads the weights evenly and drives them to small values.
  • With q=1 (lasso), if $\lambda$ is sufficiently large, some of the coefficients $w_{j}$ are driven to zero, leading to a sparse model, e.g. $w_{1}$ in the lasso panel on the right of the figure: [image]


Dropout

  • breaks up co-adaptation between layers
  • in training, zero out each hidden unit with probability $p$ and multiply it by $\frac{1}{1-p}$ if kept. This keeps the expected weighted sum, and hence the expected value of the activation, unchanged (which is also why the same layer can be used in test mode simply by setting p=0)
  • in testing, no dropout
  • different layers can use different dropout probabilities; a common trend is to set a lower dropout probability closer to the input layer
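The train-time rule above (drop with probability p, rescale survivors by 1/(1-p)) can be sketched in a few lines of plain Python. The function name and the `rng` argument are assumptions made for this illustration.

```python
import random

def dropout(activations, p, rng=random):
    """Inverted dropout: zero each unit with probability p, scale the rest by 1/(1-p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if p == 0.0:  # test mode: identity
        return list(activations)
    scale = 1.0 / (1.0 - p)
    # rng.random() is uniform on [0, 1), so a unit is dropped when the draw is < p
    return [a * scale if rng.random() >= p else 0.0 for a in activations]

rng = random.Random(0)  # seeded only to make the demo repeatable
out = dropout([1.0] * 100000, p=0.5, rng=rng)
print(sum(out) / len(out))  # close to 1.0: the expected activation is preserved
```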

Label smoothing

  • Instead of hard targets 1 and 0, use a smoothed distribution. Subtract $\epsilon$ from the target class and spread it over all the classes according to a distribution $u$ (so the result still sums to 1). The smoothed version is $q'(k \mid x)=(1-\epsilon)\delta_{k,y}+\epsilon u(k)$, where $x$ is the sample, $y$ is the target class and $u$ is a distribution over the classes. See Rethinking the Inception Architecture for Computer Vision.
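As a quick illustration of the smoothed target, here is a sketch that assumes $u$ is the uniform distribution over classes (one possible choice; the function name is hypothetical):

```python
def smooth_labels(target, num_classes, epsilon=0.1):
    """q'(k) = (1 - epsilon) * delta(k, target) + epsilon * u(k), with uniform u (an assumption)."""
    u = 1.0 / num_classes
    return [
        (1.0 - epsilon) * (1.0 if k == target else 0.0) + epsilon * u
        for k in range(num_classes)
    ]

q = smooth_labels(target=2, num_classes=4, epsilon=0.1)
print(q)       # the target class gets about 0.925, every other class about 0.025
print(sum(q))  # still sums to 1 (up to floating point rounding)
```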

Learning rate

Pick learning rate

  • Let the learning rate increase from small to large over successive mini-batches (multiplying by the same factor each step), record the loss at each rate, plot the loss against the learning rate (log scale on the learning rate axis), and pick the learning rate where the loss declines fastest (the steepest downward slope). See Cyclical Learning Rates for Training Neural Networks.

Differential learning rate

  • Use different learning rates for different layers of the model, e.g. a smaller learning rate for the pretrained layers in transfer learning and a larger learning rate for the new classification layer

Learning rate scheduling

  • Start with a large learning rate and shrink it after a fixed number of iterations or once some condition is met (e.g. 3 epochs without improvement in the loss)
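Both variants mentioned above (shrink after a fixed number of epochs, or after a few epochs without improvement) are easy to sketch. The names `step_decay` and `ReduceOnPlateau` and their default values are made up for this example and do not refer to any library.

```python
def step_decay(initial_lr, epoch, drop_every=10, factor=0.1):
    """Multiply the learning rate by `factor` once every `drop_every` epochs."""
    return initial_lr * factor ** (epoch // drop_every)

class ReduceOnPlateau:
    """Shrink the learning rate after `patience` epochs without a better loss."""

    def __init__(self, lr, patience=3, factor=0.1):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

scheduler = ReduceOnPlateau(lr=1.0, patience=3, factor=0.5)
for loss in [1.0, 0.9, 0.9, 0.9, 0.9]:
    lr = scheduler.step(loss)
print(lr)  # 0.5: three epochs passed without improving on 0.9
```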


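  • Taking partial derivatives: as a simple example, for a model with many dense layers the gradient is a product of many matrices, whose eigenvalues span a wide range, either very large or very small. Working in log-space cannot fix this.

  • Vanishing gradients: caused by e.g. using sigmoid as the activation function, whose derivative tends to 0 at both ends, see the plot: [image]
  • Exploding gradients: e.g. multiply 100 random matrices with Normal(0,1) entries and the output blows up, and so does the gradient; a single update then ruins the model parameters
  • Symmetry: in a fully connected layer all units of the same layer are interchangeable, so initializing them all to the same value makes them useless (they stay identical)
  • Initializations that usually work: e.g. Uniform(-0.07, 0.07) or Normal(mean=0, std=0.01)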

Xavier initialization

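  • The goal is to keep the variance stable after passing through a layer. Let $\sigma^2$ be the variance of a layer's $W$ at initialization. Forward propagation requires $n_{in}\sigma^2=1$ and backward propagation requires $n_{out}\sigma^2=1$. Both cannot hold at once, so we choose to satisfy $\frac{1}{2}(n_{in}+n_{out})\sigma^2=1$
  • One can use a Gaussian distribution with mean 0 and variance $\sigma^2=\frac{2}{n_{in}+n_{out}}$, or the uniform distribution $U(-\sqrt{\frac{6}{n_{in} + n_{out}}},\sqrt{\frac{6}{n_{in} + n_{out}}})$
  • Note that the variance of the uniform distribution $U(-a, a)$ is $\int_{-a}^{a}(x-0)^2 \cdot f(x)dx=\int_{-a}^{a}x^2 \cdot \frac{1}{2a}dx=\frac{a^2}{3}$, so $a=\sqrt{\frac{6}{n_{in}+n_{out}}}$ gives exactly the variance $\frac{2}{n_{in}+n_{out}}$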
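A small numerical check of the uniform variant, written as a sketch in plain Python (the function name and the layer sizes are arbitrary choices for the demo): sample a weight matrix and compare its empirical variance with $\frac{2}{n_{in}+n_{out}}$.

```python
import math
import random

def xavier_uniform(n_in, n_out, rng=random):
    """Sample an n_in x n_out weight matrix from U(-a, a) with a = sqrt(6 / (n_in + n_out))."""
    a = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-a, a) for _ in range(n_out)] for _ in range(n_in)]

rng = random.Random(0)  # seeded only so the demo is repeatable
weights = xavier_uniform(200, 300, rng)
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
variance = sum((w - mean) ** 2 for w in flat) / len(flat)
print(variance, 2.0 / (200 + 300))  # the two numbers should be close
```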


  • Gradient descent: step along the negative gradient; not applicable to extremely large models (memory, time)
  • weight = weight - learning_rate * gradient
  • Stochastic gradient descent: pick a sample or a subset of the data and step along its negative gradient
  • Hessian matrix: a square matrix of second-order partial derivatives of a scalar-valued function; it describes the local curvature of a function of many variables. $\mathbf{H}=\begin{bmatrix}\frac{\partial^{2}f}{\partial x_{1}^{2}}&\frac{\partial^{2}f}{\partial x_{1}\,\partial x_{2}}&\cdots&\frac{\partial^{2}f}{\partial x_{1}\,\partial x_{n}}\\ \frac{\partial^{2}f}{\partial x_{2}\,\partial x_{1}}&\frac{\partial^{2}f}{\partial x_{2}^{2}}&\cdots&\frac{\partial^{2}f}{\partial x_{2}\,\partial x_{n}}\\ \vdots&\vdots&\ddots&\vdots\\ \frac{\partial^{2}f}{\partial x_{n}\,\partial x_{1}}&\frac{\partial^{2}f}{\partial x_{n}\,\partial x_{2}}&\cdots&\frac{\partial^{2}f}{\partial x_{n}^{2}}\end{bmatrix}$
  • The Hessian matrix is symmetric (when the second partial derivatives are continuous). The Hessian matrix of a function f is the Jacobian matrix of the gradient of the function: $H(f(x))=J(\nabla f(x))$.
  • For a function whose input is a k-dimensional vector and whose output is a scalar:
    1. eigenvalues of the function's Hessian matrix at the zero-gradient position are all positive: local minimum
    2. eigenvalues of the function's Hessian matrix at the zero-gradient position are all negative: local maximum
    3. eigenvalues of the function's Hessian matrix at the zero-gradient position are both negative and positive: saddle point
  • (twice-differentiable) convex functions are those where the eigenvalues of the Hessian are never negative
  • Jensen's inequality: for a convex function $f$, $E[f(X)]\ge f(E[X])$
  • A convex function has no local minimum that is not global: every local minimum is a global minimum. There may, however, be several global minima, or none at all.
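The eigenvalue test in the list above can be tried on small examples. This sketch restricts itself to functions of 2 variables so that the eigenvalues of the symmetric 2x2 Hessian have a closed form; the function names are hypothetical.

```python
import math

def hessian_eigenvalues_2d(fxx, fxy, fyy):
    """Eigenvalues of the symmetric matrix [[fxx, fxy], [fxy, fyy]], smallest first."""
    mid = (fxx + fyy) / 2.0
    radius = math.sqrt(((fxx - fyy) / 2.0) ** 2 + fxy ** 2)
    return mid - radius, mid + radius

def classify_critical_point(fxx, fxy, fyy):
    """Classify a zero-gradient point from the signs of the Hessian eigenvalues."""
    low, high = hessian_eigenvalues_2d(fxx, fxy, fyy)
    if low > 0:
        return "local minimum"
    if high < 0:
        return "local maximum"
    if low < 0 < high:
        return "saddle point"
    return "inconclusive"  # a zero eigenvalue: the test alone cannot decide

print(classify_critical_point(2.0, 0.0, 2.0))    # f = x^2 + y^2 at the origin
print(classify_critical_point(-2.0, 0.0, -2.0))  # f = -(x^2 + y^2) at the origin
print(classify_critical_point(2.0, 0.0, -2.0))   # f = x^2 - y^2 at the origin
```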

Stochastic gradient descent

  • Whereas gradient descent uses the average gradient over the whole training set, SGD uses the gradient of a single randomly chosen sample
  • The stochastic gradient $\nabla f_{i}(\textbf{x})$ is an unbiased estimate of the gradient $\nabla f(\textbf{x})$.




  • Compared with SGD, it is less sensitive to the initial learning rate


  • Combine multiple models’ predictions to produce a final result (can be a collection of different checkpoints of a single model or models of different structures)



  • $\mathrm{ReLU}(z)=\max(z,0)$
  • mitigates vanishing gradient



  • Sigmoid refers to a family of S-shaped curves
  • Examples: the logistic function (the inverse of the logit function) and the hyperbolic tangent function
  • The logistic function has range 0-1: $f(x)=\frac{1}{1+e^{-x}}$
  • Its derivative is $\frac{df}{dx}=f(x)(1-f(x))=f(x)f(-x)$ (derivation)
  • tanh (hyperbolic tangent) function: $f(x)=\frac{1-e^{-2x}}{1+e^{-2x}}$
  • tanh has a shape similar to the logistic function, but it is symmetric about the origin; its derivative is $\frac{df}{dx}=1-f^{2}(x)$
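The two derivative identities above are easy to sanity-check numerically. A small sketch, assuming a central finite difference as the reference (the helper names are my own):

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_exp(x):
    return (1.0 - math.exp(-2.0 * x)) / (1.0 + math.exp(-2.0 * x))

def numerical_derivative(f, x, h=1e-6):
    """Central finite difference, used here only as an independent reference."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in [-2.0, 0.0, 0.5, 3.0]:
    f = logistic(x)
    t = tanh_from_exp(x)
    print(
        x,
        numerical_derivative(logistic, x), f * (1.0 - f), f * logistic(-x),
        numerical_derivative(tanh_from_exp, x), 1.0 - t * t,
    )
```

Each row should show the numerical derivative agreeing with the closed forms to several decimal places.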


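  • Softmax computation: $\hat{y}_{i}=\frac{\exp(o_{i})}{\sum_{j}\exp(o_{j})}$
  • softmax guarantees that each output is >=0 and that the outputs sum to 1
  • Computing softmax together with the cross-entropy (rather than separately) avoids the numerical overflow or underflow caused by the exponential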

Convolutional neural network

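  • Strictly speaking, the operation is cross-correlation: the sum of the products of corresponding elements
  • A true convolution first flips the kernel vertically and horizontally, then does the same as above
  • channel (filter, feature map, kernel): there can be many; an RGB input can be seen as 3 channels. Some simple kernels detect edges, e.g. [1, -1] responds to vertical edges and removes all horizontal lines
  • padding: pad the input so that the output has the same spatial size as the input. Kernels usually have odd side lengths precisely so that the padding can be the same on top and bottom, left and right
  • stride: reduces the resolution and the size of the output
  • For multiple channels: a 3-channel input needs a 3-channel kernel; each channel is cross-correlated with its own kernel slice and the 3 results are summed into a single output channel. For several output channels, use several such 3-channel kernels, so the number of channels changes between input and output. Overall, the kernel is 4-dimensional: height, width, input channels, output channels
  • 1*1 convolutional layer == fully connected layer applied at every pixel position across the channels
  • pooling: common choices are maximum and average
  • pooling makes the model less sensitive to location and performs spatial downsampling, which reduces the computation and the parameters needed downstream
  • pooling has no parameters; the number of input and output channels is the same
  • every convolutional layer is followed by an activation function
  • feature map: the output of one filter applied to the previous layer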
Recurrent neural network

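  • May suffer from vanishing or exploding gradients
  • Gradient clipping can be used: $\mathbf{g}\leftarrow\min\left(1,\frac{\theta}{\|\mathbf{g}\|}\right)\mathbf{g}$; the gradient norm is computed over all parameters
  • Markov model: a first-order Markov model is $P(x_{1},\ldots,x_{T})=P(x_{1})\prod_{t=2}^{T}P(x_{t} \mid x_{t-1})$
  • if $x_{t}$ can only take discrete values, then we have [image]
  • autoregressive model: as in a Markov model, use the past $\tau$ observations to compute the conditional probability of the next one
  • latent autoregressive model: e.g. GRU, LSTM; keeps updating a latent state [image]
  • tokenization: word, character or BPE (byte-pair encoding)
  • the vocabulary maps tokens to the integers 0-n, including some special tokens
  • RNN parameters do not change with the time step: [image] [image]
  • the error is the softmax cross-entropy on each label
  • perplexity: $\exp\left(-\frac{1}{n}\sum_{t=1}^{n}\log P(x_{t} \mid x_{t-1},\ldots,x_{1})\right)$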
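For the gradient clipping bullet in the list above, here is a minimal sketch. It assumes the common rule of rescaling all gradients by min(1, theta / norm), where the norm is taken over every parameter at once; `theta` and the function name are hypothetical.

```python
import math

def clip_gradients(grads, theta):
    """Rescale a list of per-parameter gradient lists so the global L2 norm is at most theta."""
    norm = math.sqrt(sum(g * g for param in grads for g in param))
    scale = min(1.0, theta / norm) if norm > 0 else 1.0
    return [[g * scale for g in param] for param in grads]

grads = [[3.0], [4.0]]             # global norm is 5
print(clip_gradients(grads, 1.0))  # scaled down to norm 1, direction unchanged
print(clip_gradients(grads, 10.0)) # already within the threshold, left as is
```

Clipping changes only the length of the update, not its direction.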


  • GRU (gated recurrent unit) is designed to deal with: 1. cases where an early observation is highly significant for predicting all future observations; 2. cases where some symbols carry no pertinent observation (and should be skipped); 3. logical breaks (where the internal state should be reset)
  • reset gate $R_{t}$: $R_{t}=\sigma(X_{t}W_{xr}+H_{t-1}W_{hr}+b_{r})$, captures short-term dependencies
  • update gate $Z_{t}$: $Z_{t}=\sigma(X_{t}W_{xz}+H_{t-1}W_{hz}+b_{z})$, captures long-term dependencies


  • Long short-term memory
  • input gate
  • forget gate
  • output gate
  • memory cell: entirely internal
  • Can be bidirectional (run 2 LSTMs over the input in opposite directions and combine their outputs)


  • Encoder-decoder: a neural network design pattern, encoder -> state (several vectors, i.e. tensors) -> decoder
  • An encoder is a network (FC, CNN, RNN, etc.) that takes the input and outputs a feature map, a vector or a tensor
  • A decoder is a network (usually with the same structure as the encoder) that takes the feature vector from the encoder and produces the closest match to the actual input or the intended output.
  • A sequence-to-sequence model is based on the encoder-decoder architecture; both the encoder and the decoder are RNNs
  • Inside an encoder-decoder model: [image] the hidden state of the encoder is used directly to initialize the decoder's hidden state, passing information from the encoder to the decoder

Computer vision

Data augmentation


  • Mixup: superimpose e.g. 2 images with respective weights, e.g. 0.3 and 0.7; the classification target becomes the weighted mix of the 2 classes (the true labels are no longer 1s but 0.3 and 0.7). See mixup: Beyond Empirical Risk Minimization.
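A sketch of the mixing step on toy data, with images flattened to lists of pixel values and labels given as one-hot lists. In practice the weight is usually drawn at random per batch; here it is passed in explicitly, and the function name is hypothetical.

```python
def mixup(x_a, y_a, x_b, y_b, lam):
    """Blend two examples and their one-hot labels with weights lam and 1 - lam."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y

image_a, label_a = [1.0, 0.0], [1.0, 0.0]  # class 0
image_b, label_b = [0.0, 1.0], [0.0, 1.0]  # class 1
mixed_image, mixed_label = mixup(image_a, label_a, image_b, label_b, lam=0.3)
print(mixed_image)  # about [0.3, 0.7]
print(mixed_label)  # about [0.3, 0.7], a soft target instead of a hard 1
```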

for image

  • Change to RGB, HSV, YUV, LAB color spaces
  • Change the brightness, contrast, saturation and hue; convert to grayscale
  • Affine transformation: horizontal or vertical flip, rotation, rescale, shear, translate
  • Crop

for text

  • Back-translation for machine translation tasks: use a translator in the opposite direction (target to source) to generate a (synthetic source data, monolingual target data) dataset

for audio


Natural language processing

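  • beam search: with a vocabulary of size $|Y|$, at each step keep the $k$ most likely candidates out of the $k|Y|$ extensions of the previous step's $k$ candidates. In the end we collect not $k$ candidates but $k \cdot L$, where $L$ is the maximum search length, e.g. a, a->b, a->b->c. Finally, perplexity is used to pick the most likely among these candidates.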




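  • language model: $P(w_{1},\ldots,w_{T})=\prod_{t=1}^{T}P(w_{t} \mid w_{1},\ldots,w_{t-1})$
  • Laplace smoothing (additive smoothing): $\hat{\theta}_{i}=\frac{x_{i}+\alpha}{N+\alpha m}$, where $x_{i}$ is the count of category $i$, $N$ is the total count and $m$ is the number of categories, so the estimate lies between the empirical probability and the uniform distribution $1/m$. $\alpha$ is usually between 0 and 1; with $\alpha=1$ it is also called add-one smoothing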
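A tiny worked example of additive smoothing, assuming the standard estimate (count + alpha) / (N + alpha * m); the function name is made up for the illustration.

```python
from collections import Counter

def laplace_smoothed_probs(observations, categories, alpha=1.0):
    """Estimate category probabilities as (count + alpha) / (N + alpha * m)."""
    counts = Counter(observations)
    n = len(observations)
    m = len(categories)
    return {c: (counts[c] + alpha) / (n + alpha * m) for c in categories}

probs = laplace_smoothed_probs(["a", "a", "b"], ["a", "b", "c"], alpha=1.0)
print(probs)  # "c" was never observed but still gets probability 1/6
```

With 3 observations and 3 categories, "a" moves from the empirical 2/3 to 3/6, pulled towards the uniform 1/3.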






  • Attention Is All You Need

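  • An attention layer has a memory made of key-value pairs $\bf (k_{1}, v_{1}),\ldots,(k_{n}, v_{n})$. Given a query $\bf{q}$, a score function $\alpha$ computes the similarity between the query and each key, and the output $\bf o$ is the correspondingly weighted sum of the values
  • $a_{i}=\alpha(\mathbf{q},\mathbf{k}_{i})$, $\mathbf{b}=\mathrm{softmax}(\mathbf{a})$, $\mathbf{o}=\sum_{i=1}^{n}b_{i}\mathbf{v}_{i}$
  • Two common attention layers, both of which may include dropout: dot product attention and multilayer perceptron attention. The score function of the former is simply a dot product; that of the latter is a trainable hidden layer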
  • seq2seq with attention mechanism: the encoder is unchanged. During decoding, the decoder output from the previous timestep $t-1$ is used as the query. The output of the attention model is viewed as the context information, and this context is concatenated with the decoder input $D_{t}$. Finally, we feed the concatenation into the decoder.
  • The decoder of the seq2seq with attention model receives three items from the encoder:
    1. the encoder outputs of all timesteps: they are used as the attention layer's memory with identical keys and values;
    2. the hidden state of the encoder's final timestep: it is used as the initial decoder's hidden state;
    3. the encoder valid length: so the attention layer will not consider the padding tokens within the encoder outputs.
  • transformer: mainly adds 3 things,
    1. transformer block: contains a multi-head attention layer and position-wise feed-forward network layers
    2. add and norm: a residual structure and a layer normalization [image]
    3. position encoding: the only place where positional information is added
  • multi-head attention: contains parallel self-attention layers (heads), which can be any attention (e.g. dot product attention, MLP attention)
  • position-wise feed-forward networks: take 3-d inputs with shape (batch size, sequence length, feature size), consist of two dense layers, equivalent to applying two $1*1$ convolution layers
  • layer normalization is similar to batch normalization, except that it normalizes not along the batch dimension (d=0) but along the last dimension
  • position encoding: $P_{i,2j}=\sin\left(i/10000^{2j/d}\right)$, $P_{i,2j+1}=\cos\left(i/10000^{2j/d}\right)$, where $i$ refers to the order in the sentence, $j$ refers to the position along the embedding vector dimension and $d$ is the embedding dimension. This function should make it easier to capture relative positions and imposes no limit on sequence length, though other choices, such as learned encodings, also work
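To tie the attention notes above together, here is a minimal single-query sketch of dot product attention in plain Python. Dividing the scores by the square root of the query dimension is an assumption borrowed from the scaled variant in Attention Is All You Need; masking, batching and dropout are left out, and the function names are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(query, keys, values):
    """Score each key against the query, softmax the scores, return the weighted sum of values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0], [3.0]]

# A query orthogonal to nothing in particular attends to both keys equally.
print(dot_product_attention([0.0, 0.0], keys, values))
# A query aligned with the first key puts almost all of its weight there.
print(dot_product_attention([10.0, 0.0], keys, values))
```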


Reinforcement learning

To be continued …

Written on January 3, 2020