# Background

Just a review of machine learning for myself (really busy recently, so …)

# Basics

## Transfer learning

• Chop off the classification layers and replace them with ones that cater to your needs. Freeze the pretrained layers during training. Enabling training on the batch-normalization layers as well may give better results.
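
A minimal PyTorch sketch of this, assuming torchvision's ResNet-18 as the pretrained network and a 10-class target task:

```python
import torch.nn as nn
from torchvision import models

# Load a pretrained backbone (ResNet-18 here, purely as an example).
model = models.resnet18(pretrained=True)

# Freeze all pretrained layers.
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head with one sized for the new task (10 classes assumed).
model.fc = nn.Linear(model.fc.in_features, 10)

# Optionally keep batch-norm layers trainable, which may give better results.
for module in model.modules():
    if isinstance(module, nn.BatchNorm2d):
        for param in module.parameters():
            param.requires_grad = True
```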

## Objective function

### Cross-entropy loss

• $loss(x,class)=-\log\left(\frac{\exp(x[class])}{\sum_{j}\exp(x[j])}\right)=-x[class]+\log\left(\sum_{j}\exp(x[j])\right)$
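
A quick numeric check of the formula against PyTorch's built-in cross-entropy (the logits here are made up):

```python
import torch
import torch.nn.functional as F

# Logits for one sample over 3 classes, and its target class.
x = torch.tensor([[2.0, 1.0, 0.1]])
target = torch.tensor([0])

# Right-hand side of the formula: -x[class] + log(sum_j exp(x[j]))
manual = -x[0, 0] + torch.logsumexp(x[0], dim=0)

# PyTorch's cross-entropy on raw logits computes the same thing.
builtin = F.cross_entropy(x, target)
print(manual.item(), builtin.item())  # both ≈ 0.417
```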

## Regularization

### Label smoothing

• Instead of hard targets 1 and 0, use a smoothed distribution: take $\epsilon$ off the target class and spread it over all classes according to a distribution (so the targets still sum to 1). The smoothed target is $q'(k|x)=(1-\epsilon)\delta_{k,y}+\epsilon u(k)$ ($x$ is the sample, $y$ is the target class, $u$ is the class distribution). Rethinking the Inception Architecture for Computer Vision
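
A small sketch of building the smoothed targets, assuming a uniform class distribution $u(k)=1/K$:

```python
import torch

def smooth_labels(targets, num_classes, epsilon=0.1):
    """Turn hard class indices into smoothed targets q'(k|x), assuming u(k) = 1/K."""
    # Every class receives epsilon / K ...
    smoothed = torch.full((targets.size(0), num_classes), epsilon / num_classes)
    # ... and the true class additionally receives (1 - epsilon).
    smoothed[torch.arange(targets.size(0)), targets] += 1.0 - epsilon
    return smoothed

print(smooth_labels(torch.tensor([2]), num_classes=5))
# tensor([[0.0200, 0.0200, 0.9200, 0.0200, 0.0200]])
```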

## Learning rate

### Pick learning rate

• Let the learning rate grow from a small value to a large one over successive mini-batches (e.g. multiply it by a fixed factor each step), record the loss at each rate, plot loss against learning rate (log scale on the learning rate), and pick the rate where the loss declines fastest (the steepest downward slope). Cyclical Learning Rates for Training Neural Networks
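
A rough sketch of that sweep (`model`, `loader`, and `criterion` are assumed to exist; the bounds and step count are arbitrary):

```python
import copy
import torch

def lr_range_test(model, loader, criterion, min_lr=1e-7, max_lr=1.0, num_steps=100):
    """Sweep the learning rate upward over mini-batches and record the loss at each rate."""
    model = copy.deepcopy(model)                      # keep the real model untouched
    factor = (max_lr / min_lr) ** (1.0 / num_steps)   # fixed multiplicative step
    lr = min_lr
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    lrs, losses = [], []
    for step, (inputs, targets) in enumerate(loader):
        if step >= num_steps:
            break
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        lrs.append(lr)
        losses.append(loss.item())
        lr *= factor
        for group in optimizer.param_groups:
            group["lr"] = lr
    return lrs, losses   # plot losses against lrs on a log x-axis, pick the steepest drop
```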

### Differential learning rate

• Use different learning rates for different layers of the model, e.g. a smaller learning rate for the pretrained layers in transfer learning and a larger one for the new classification layer
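
In PyTorch this is just per-parameter-group learning rates (a sketch reusing the ResNet-18 `model` from the transfer-learning example; the rates are arbitrary):

```python
import torch

optimizer = torch.optim.SGD(
    [
        # smaller rate for the pretrained backbone layers
        {"params": model.layer4.parameters(), "lr": 1e-4},
        # larger rate for the freshly initialized classification head
        {"params": model.fc.parameters(), "lr": 1e-2},
    ],
    lr=1e-3,        # default rate for any group that does not set its own
    momentum=0.9,
)
```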

### Learning rate scheduling

• Start with a large learning rate and shrink it after a fixed number of iterations or when some condition is met (e.g. 3 epochs without improvement in the loss)
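
Both flavours exist as ready-made PyTorch schedulers (a sketch; `optimizer`, `num_epochs`, and the train/validate helpers are assumed):

```python
import torch

# Either a fixed schedule: shrink the rate by 10x every 30 epochs ...
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# ... or a conditional one: shrink by 10x after 3 epochs without improvement in the loss.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=3)

for epoch in range(num_epochs):
    train_one_epoch()          # assumed training helper
    val_loss = validate()      # assumed validation helper
    scheduler.step(val_loss)   # ReduceLROnPlateau takes the monitored metric
```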

## Optimizer

• Gradient descent: step in the direction of the negative gradient computed on the full dataset; not practical for extremely large models or datasets (memory, time)
• weight = weight - learning_rate * gradient
• Stochastic gradient descent: pick a single sample or a mini-batch of the data, compute the gradient on it, and apply the same update (sketched below)
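
The update rule written out for mini-batches (a minimal sketch; `model`, `criterion`, and `loader` are assumed):

```python
import torch

learning_rate = 0.01

for inputs, targets in loader:                 # each step sees only a mini-batch
    loss = criterion(model(inputs), targets)
    model.zero_grad()
    loss.backward()                            # gradient w.r.t. this mini-batch only
    with torch.no_grad():
        for weight in model.parameters():
            if weight.grad is not None:        # skip frozen parameters
                weight -= learning_rate * weight.grad   # weight = weight - learning_rate * gradient
```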

## Ensembles

• Combine multiple models' predictions to produce a final result (the models can be different checkpoints of a single model, or models with different architectures)
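
A minimal sketch of averaging predictions across several trained models (or checkpoints):

```python
import torch

def ensemble_predict(models, inputs):
    """Average the softmax outputs of several models and take the most probable class."""
    with torch.no_grad():
        probs = torch.stack([torch.softmax(m(inputs), dim=1) for m in models])
    return probs.mean(dim=0).argmax(dim=1)
```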

### Sigmoid
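
• $\sigma(x)=\frac{1}{1+e^{-x}}$: squashes any real input into $(0,1)$; commonly used as the output activation for binary classification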

# Recurrent neural network

• May suffer from vanishing or exploding gradients

# Computer vision

## Data augmentation

### general

• Mixup: superimpose two training examples (e.g. two images) with weights such as 0.3 and 0.7; the classification loss becomes the same weighted combination of the two classes' losses (the true labels are no longer 1s, but 0.3 and 0.7), as sketched below. mixup: Beyond Empirical Risk Minimization
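
A sketch of one mixup training step (`model` and a cross-entropy-style `criterion` are assumed; `alpha` controls the Beta distribution the mixing weight is drawn from):

```python
import numpy as np
import torch

def mixup_step(inputs, targets, model, criterion, alpha=0.4):
    """Blend pairs of samples and blend their losses with the same weights."""
    lam = float(np.random.beta(alpha, alpha))   # mixing weight, e.g. 0.3 / 0.7
    perm = torch.randperm(inputs.size(0))       # pair each sample with a shuffled partner
    mixed = lam * inputs + (1.0 - lam) * inputs[perm]
    outputs = model(mixed)
    # Weighted combination of the two losses == loss against the soft 0.3 / 0.7 labels.
    return lam * criterion(outputs, targets) + (1.0 - lam) * criterion(outputs, targets[perm])
```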

### for image

• Change to RGB, HSV, YUV, LAB color spaces
• Change the brightness, contrast, saturation and hue; convert to grayscale
• Affine transformation: horizontal or vertical flip, rotation, rescale, shear, translate
• Crop (a torchvision sketch of these transforms follows below)
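
Most of the image augmentations above are one-liners in torchvision (a sketch; all parameter values are arbitrary):

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.RandomGrayscale(p=0.1),
    transforms.RandomHorizontalFlip(),
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.9, 1.1), shear=10),
    transforms.RandomCrop(224, padding=4),
    transforms.ToTensor(),
])
```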

### for text

• Back-translation for a machine translation task: use a translator trained in the opposite direction on monolingual target-side data to generate a (synthetic source data, monolingual target data) dataset
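
A rough sketch of the data-generation loop; `translate_target_to_source` stands for any model trained in the reverse direction and is purely hypothetical:

```python
def back_translate(monolingual_target_sentences, translate_target_to_source):
    """Build (synthetic source, real target) pairs from target-side monolingual data."""
    synthetic_pairs = []
    for target_sentence in monolingual_target_sentences:
        synthetic_source = translate_target_to_source(target_sentence)  # reverse-direction model
        synthetic_pairs.append((synthetic_source, target_sentence))
    return synthetic_pairs
```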

# To be continued …

Written on September 3, 2019