Machine learning

Background

Just a review of machine learning for myself (really busy recently, so …)

Basics

Hyperparameters

Transfer learning

  • Chop off the classification layers and replace them with ones that fit your task. Freeze the pretrained layers during training; enabling training on the batch normalization layers as well may give better results (see the sketch below).

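A minimal PyTorch sketch of the idea above, assuming a torchvision ResNet and 10 target classes (both are illustrative choices, not fixed recommendations):

```python
import torch.nn as nn
from torchvision import models

# Load a pretrained backbone (any torchvision classifier works similarly).
model = models.resnet18(pretrained=True)

# Freeze the pretrained layers.
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head with one sized for the new task.
model.fc = nn.Linear(model.fc.in_features, 10)  # new layer trains by default

# Optionally keep training the batch-norm affine parameters.
for module in model.modules():
    if isinstance(module, nn.BatchNorm2d):
        for param in module.parameters():
            param.requires_grad = True
```
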
Curriculum learning

Objective function

Mean absolute error

Mean squared error

Cross-entropy loss

  • $\mathrm{loss}(x,\mathrm{class})=-\log\left(\frac{\exp(x[\mathrm{class}])}{\sum_{j}\exp(x[j])}\right)=-x[\mathrm{class}]+\log\left(\sum_{j}\exp(x[j])\right)$ (see the numeric example below)

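The same quantity computed numerically for one sample of raw logits (plain NumPy; the helper name is mine):

```python
import numpy as np

def cross_entropy(x, target_class):
    # -x[class] + log(sum_j exp(x[j])), shifted by max(x) for numerical stability
    log_sum_exp = np.log(np.exp(x - x.max()).sum()) + x.max()
    return -x[target_class] + log_sum_exp

logits = np.array([2.0, 1.0, 0.1])
print(cross_entropy(logits, target_class=0))  # ~0.417
```
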
Regularization

Dropout

Label smoothing

  • Instead of hard 0/1 targets, use a smoothed distribution: subtract $\epsilon$ from the target class and redistribute it over all classes according to a distribution $u(k)$, so the targets still sum to 1. The smoothed target is $q'(k|x)=(1-\epsilon)\delta_{k,y}+\epsilon u(k)$, where $x$ is the sample, $y$ the target class, $\delta_{k,y}$ the original one-hot target, and $u(k)$ the class distribution (often uniform). A sketch follows below. Rethinking the Inception Architecture for Computer Vision

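A small sketch of building the smoothed targets with a uniform $u(k)$ (the helper is my own, not from the paper):

```python
import numpy as np

def smooth_targets(target_class, num_classes, epsilon=0.1):
    # q'(k) = (1 - eps) * delta_{k,y} + eps * u(k), with u uniform
    q = np.full(num_classes, epsilon / num_classes)  # eps spread uniformly
    q[target_class] += 1.0 - epsilon                 # remaining mass on the true class
    return q

print(smooth_targets(target_class=2, num_classes=5))
# [0.02 0.02 0.92 0.02 0.02], still sums to 1
```
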
Learning rate

Pick learning rate

  • Let the learning rate grow from small to large over successive mini-batches (e.g. multiplying by the same factor each batch), record the loss at each rate, and plot loss against learning rate (log scale on the learning rate axis). Pick a rate in the region where the loss is still dropping steeply, not the minimum itself (see the sketch below). Cyclical Learning Rates for Training Neural Networks

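A rough sketch of the range test; the geometric schedule and the assumed `train_step(lr)` hook (runs one mini-batch at the given rate and returns its loss) are my own simplifications:

```python
import numpy as np

def lr_range_test(train_step, lr_min=1e-7, lr_max=1.0, num_iters=100):
    factor = (lr_max / lr_min) ** (1.0 / num_iters)  # same multiplier every mini-batch
    lr, lrs, losses = lr_min, [], []
    for _ in range(num_iters):
        losses.append(train_step(lr))  # one mini-batch at this learning rate
        lrs.append(lr)
        lr *= factor
    return np.array(lrs), np.array(losses)

# Plot losses against np.log10(lrs) and pick a rate where the loss curve
# is still falling steeply, somewhat before its minimum.
```
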
Differential learning rate

  • Use different learning rates for different layers of the model, e.g. a smaller learning rate for the pretrained layers in transfer learning and a larger one for the new classification layer (see the parameter-group sketch below).

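In PyTorch this maps onto optimizer parameter groups; the model is assumed to be the transfer-learning sketch above, and the specific rates are just examples:

```python
from torch import optim

optimizer = optim.SGD(
    [
        {"params": model.layer4.parameters(), "lr": 1e-4},  # pretrained block: small lr
        {"params": model.fc.parameters()},                  # new head: uses the default lr
    ],
    lr=1e-3,
    momentum=0.9,
)
```
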
Learning rate scheduling

  • Start with a large learning rate and shrink it after a number of iterations or once some condition is met (e.g. 3 epochs without improvement in the loss), as in the scheduler sketch below.

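For the "shrink after 3 epochs without improvement" pattern, PyTorch's ReduceLROnPlateau is one ready-made option; the optimizer is the one from the previous sketch, and `train_one_epoch` / `num_epochs` are hypothetical stand-ins for the training loop:

```python
from torch.optim import lr_scheduler

# Halve the learning rate after 3 epochs without improvement in validation loss.
scheduler = lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=3)

for epoch in range(num_epochs):
    val_loss = train_one_epoch()   # hypothetical: trains one epoch, returns validation loss
    scheduler.step(val_loss)       # decides whether to shrink the learning rate
```
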
Initialization

Xavier initialization

Optimizer

  • Gradient descent: step in the direction of the negative gradient computed on the full dataset; not practical for extremely large datasets or models (memory, time)
  • Stochastic gradient descent: compute the gradient on a single sample or a mini-batch of data and take a step, repeating over many batches (see the toy loop below)

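A toy mini-batch SGD loop for linear regression in plain NumPy (all numbers here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                                # toy inputs
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
lr, batch_size = 0.1, 32
for step in range(200):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # random mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size              # gradient of MSE on the batch
    w -= lr * grad                                            # step against the gradient
print(w)  # close to [1.0, -2.0, 0.5]
```
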
Adam

Ensembles

  • Combine multiple models’ predictions to produce the final result; the models can be different checkpoints of a single model or models with different architectures (see the sketch below)

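A minimal sketch of prediction averaging; the `predict_proba` interface is an assumption (scikit-learn style), not tied to any particular framework:

```python
import numpy as np

def ensemble_predict(models, x):
    # Average class probabilities from several models/checkpoints, then pick the argmax.
    probs = [m.predict_proba(x) for m in models]
    return np.mean(probs, axis=0).argmax(axis=-1)
```
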
Activations

ReLU

LeakyReLU

tanh

Sigmoid

Softmax

Convolutional neural network

Recurrent neural network

  • May suffer from vanishing or exploding gradients

LSTM

Computer vision

Data augmentation

general

  • Mixup: superimpose two images with weights, e.g. 0.3 and 0.7, and mix their labels with the same weights (the targets become 0.3 and 0.7 instead of one-hot 1s), so the classification loss is the correspondingly weighted combination of the two classes’ losses (see the sketch below). mixup: Beyond Empirical Risk Minimization

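A sketch for one pair of examples, with the mixing weight fixed at 0.3 for illustration (the paper samples it from a Beta distribution):

```python
def mixup_pair(x1, y1_onehot, x2, y2_onehot, lam=0.3):
    # Mix the inputs and the one-hot labels with the same weights lam and 1 - lam.
    x = lam * x1 + (1 - lam) * x2
    y = lam * y1_onehot + (1 - lam) * y2_onehot  # e.g. targets 0.3 and 0.7 instead of 1s
    return x, y

# With cross-entropy, training on (x, y) equals the weighted sum
# lam * loss(x, class1) + (1 - lam) * loss(x, class2).
```
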
for image

  • Change to RGB, HSV, YUV, LAB color spaces
  • Change the brightness, contrast, saturation and hue; convert to grayscale
  • Affine transformation: horizontal or vertical flip, rotation, rescale, shear, translate
  • Crop (a torchvision sketch follows this list)

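Most of these are available out of the box, e.g. in torchvision; the composition and parameter values below are only an example and assume input images of at least 224 px:

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.RandomGrayscale(p=0.1),
    transforms.RandomHorizontalFlip(),
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.9, 1.1), shear=10),
    transforms.RandomCrop(224, padding=8),
    transforms.ToTensor(),
])
```
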
for text

  • Back-translation for machine translation: use a model trained in the opposite direction to translate monolingual target-language text back into the source language, yielding a (synthetic source, real target) parallel dataset (see the sketch below)

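A sketch of that loop; the `reverse_model.translate` interface is hypothetical:

```python
def back_translate(reverse_model, target_sentences):
    # Build (synthetic source, real target) pairs from monolingual target data.
    pairs = []
    for tgt in target_sentences:
        synthetic_src = reverse_model.translate(tgt)  # target -> source direction (assumed API)
        pairs.append((synthetic_src, tgt))
    return pairs
```
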
for audio

Pooling

Natural language processing

Embeddings

Word2vec

N-grams

BPE

Metrics

BLEU

TER

Attention

Reinforcement learning

To be continued …

Written on September 3, 2019