Deep Learning

Andrew Ng's Deep Learning Specialization (video course)

Course 1 — Neural Networks and Deep Learning: av66314465
Course 2 — Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization: av66524657
Course 3 — Structuring Machine Learning Projects: av66644404
Course 4 — Convolutional Neural Networks: av66646276
Course 5 — Sequence Models: av66647398

Course Overview

Machine learning components used by AlphaGo: Monte Carlo tree search plus two deep neural networks

  • Neural network topics:
    • Convolutional neural networks: ImageNet (residual networks), Inception
      • Applications: object detection, face verification, neural style transfer
    • Sequence models: network architectures, GRU, LSTM
      • Applications: word embeddings, language models, attention mechanisms, trigger word detection
  • Areas where neural networks are used:
    • Computer vision
      • Applications
        • Image classification
        • Semantic segmentation
        • Object recognition and detection
        • Motion and tracking
        • Visual question answering
        • 3D reconstruction
    • Natural language processing
      • Composed of two main parts
        • Natural language understanding
        • Natural language generation
      • Applications
        • Speech recognition
        • Machine translation
        • Speech synthesis
        • Human-machine dialogue
        • Voice assistants
        • Question answering systems
        • Machine reading comprehension
    • Generative adversarial networks
      • Characteristics
        • Learn from training data in order to generate the data you need
        • An unsupervised learning method (no labels are required)
        • Train a generator network and a discriminator network at the same time
          • Generator network: produces realistic images to fool the discriminator
          • Discriminator network: distinguishes generated images from real images
  • Deep learning frameworks
    • TensorFlow, from Google, aimed at industry
      • 2019.10 -> Version 2.0
      • The main framework used in this course; the version studied is 1.x. Version 2.0 differs greatly from 1.x and needs to be learned separately.
        • Programming language: Python 3.x
        • Environment: Anaconda, Jupyter Notebook, TensorFlow 1.x
    • Keras, merged into TensorFlow
    • PyTorch, uses Python, from Facebook, popular for research
    • PaddlePaddle, from Baidu
    • Deeplearning4j, uses Java
    • MXNet, from Amazon
    • Caffe & Caffe2, merged into PyTorch
    • CNTK, from Microsoft
    • Theano, an early framework, no longer maintained
    • Chainer, uses Python
  • Supplementary online video course: Andrew Ng's deep learning courses

Course 1: Videos 1-6 (Introduction)

  1. Course sequence

    1. (Weeks 1-4) Neural Networks and Deep Learning
      • Cats Recognition
    2. Improving Deep Neural Networks: Hyperparameter tuning(参数调整), Regularization(正则化) and Optimization(优化)
    3. Structuring(结构化) your Machine Learning project
    4. Convolutional Neural Networks 卷积神经网络
    5. Natural Language Processing: Building Sequence Models
  2. What is a neural network?

    A function that estimates an output; it needs a large amount of data in order to learn the parameters that determine how the final result is computed.

    We do not hand-code the details of how the problem is solved; the network learns them from the data.

  3. Supervised Learning 监督学习

    Given a set of inputs, learn to produce the corresponding outputs.

    • Example:

      ​ Home features -> Price (Standard NN)

      ​ Ad, user Info … -> Click on ad? (Standard NN)

      ​ Image -> Objects (CNN)

      ​ Audio -> Text transcript (RNN【Recurrent Neural Network】)

      ​ English -> Chinese (RNN)

      ​ Image, Radar Info -> Position of other cars. (Custom / Hybrid)

    Standard NN 标准神经网络,Convolutional NN 卷积神经网络,Recurrent NN 循环神经网络

    Neural networks make it easier for computers to understand unstructured data.

  4. Why are they just now taking off?

    The scale of data and computation has been growing rapidly, together with progress in algorithms.

    Using new techniques (the ReLU function instead of the sigmoid function) speeds up the training of a model.

    ReLU function

Course 1: Videos 7-24

  1. Binary Classification
    • Techniques for processing the entire training set efficiently.
    • An introduction to the forward pass (forward propagation step) and the backward pass (backward propagation step) 前向传播和反向传播
    • Logistic regression 逻辑回归
      • An algorithm for binary classification.
    • Important notation
      • n (or n_x) is the height of the matrix X, i.e. the number of features; m is the number of training examples, written { (x(1), y(1)), …, (x(m), y(m)) }.
      • There are two sets: one for training, another for testing.
      • (x(i), y(i)) denotes a single training example.
  2. Logistic Regression
    • Given x, we want ŷ (y hat), an estimate of the probability that y = 1.
    • Parameters: w (an n_x-dimensional vector), b (a real number)
    • Output: ŷ = Sigmoid(w^T * x + b) = Sigmoid(z), where z = w^T * x + b
      • ![sigmoid function](./深度学习专业课内容/Sigmoid function.png "sigmoid function")
      • Sigmoid(z) = 1 / (1 + e^(-z))
      • Learn w and b so that ŷ becomes a good estimate of y.
      • Another, more formal way of writing the same model.
  3. Logistic Regression cost function

To train the parameters w and b of the logistic regression model, we need to define a cost function.

yHat

For a good prediction, we want ŷ(i) to be as close as possible to y(i) for each training example i.

The loss function is used to measure how well our algorithm is doing on a single example.

Here is the poor choice of loss (squared error), which makes the optimization non-convex with multiple local optima:

![Old Loss function](./深度学习专业课内容/Old Loss function.png “Old Loss function”)

Here is the better choice, whose graph is convex:

![New Loss function](./深度学习专业课内容/Loss function.png “New Loss Function”)

Cost function: the average of the loss over the whole training set, as a function of the parameters.

![Cost function](./深度学习专业课内容/Cost function.png "Cost function")

It computes the total cost over the whole training set and measures how well your trained model performs.
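
The formulas above, as a minimal NumPy sketch (my own illustration, not code from the course): the sigmoid output ŷ, the per-example cross-entropy loss, and the cost as the average loss over m examples.

```python
import numpy as np

def sigmoid(z):
    # sigmoid(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, X):
    # X has shape (n_x, m): one column per training example
    return sigmoid(np.dot(w.T, X) + b)

def cost(A, Y):
    # average cross-entropy loss over the m examples
    m = Y.shape[1]
    losses = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    return np.sum(losses) / m

# tiny example: 2 features, 3 training examples
w = np.zeros((2, 1)); b = 0.0
X = np.array([[1.0, 2.0, -1.0], [0.5, -0.5, 1.5]])
Y = np.array([[1, 0, 1]])
A = predict(w, b, X)
print(A, cost(A, Y))
```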

  1. Gradient Descent 梯度下降法

The method used to train w and b.

Idea: find the minimum of the cost function, i.e. the global optimum.

Cost Function

How gradient descent is actually used:

GradientDescent

Parameters in this function:

α (learning rate): controls how big a step we take on each iteration of gradient descent.

w := … means that w is updated to the value of the expression on the right.

The key of the algorithm:

  • use the derivative (导数) to find the direction that moves closer to the optimum
  • update the parameter in that direction and repeat until no further improvement is possible (see the sketch below).
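
A tiny sketch of the w := w - α * dw loop on a made-up one-dimensional convex function, J(w) = (w - 3)^2, just to show the repeated-update idea:

```python
# minimize J(w) = (w - 3)^2 with gradient descent
w = 0.0
alpha = 0.1             # learning rate: how big a step each iteration takes
for i in range(100):
    dw = 2 * (w - 3)    # derivative dJ/dw at the current w
    w = w - alpha * dw  # w := w - alpha * dw
print(w)                # converges toward the minimum at w = 3
```
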
  1. derivatives 导数

Skipped. (PS: as someone preparing for the graduate entrance exam, if I still needed this kind of introductory material on derivatives, I might as well give up, haha.)

  1. Computation Graph 计算图

Also an introduction to derivatives, but the computation graph makes the whole procedure much easier to understand.

Example:

ComputationGraph

Computing derivatives on a computation graph

![image-20200430115619173](./深度学习专业课内容/The backward propagation.png)

  1. Logistic Regression - Gradient descent 逻辑回归中的梯度下降法

LogisticRegressionRecap

LogisticRegressionDerivatives

Then find the derivatives of the loss L:

da = dL / da = - y / a + (1 - y) / (1 - a)

dz = dL / dz = dL / da * da / dz = dL / da * a(1 - a) = a - y

**dw1 = dL / dw1 = x1 * dz = x1 * (a - y)**

**dw2 = x2 * dz**

db = dz

Using gradient descent on m examples

The function being minimized is the cost J(w, b), the average of the losses.

LogisticRegressionOnMExamples

CalculateTheLossFunction
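
A sketch of one gradient-descent step over m examples with explicit loops, following the formulas above (dz = a - y, dw_j = x_j * dz, db = dz, then average and update); the function and variable names are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_step_loop(w, b, X, Y, alpha):
    # X: (n_x, m), Y: (1, m) -- the explicit-loop (slow) version
    n_x, m = X.shape
    dw = np.zeros((n_x, 1))
    db = 0.0
    for i in range(m):
        z = np.dot(w[:, 0], X[:, i]) + b   # scalar z = w^T x + b for example i
        a = sigmoid(z)
        dz = a - Y[0, i]                   # dL/dz = a - y
        dw[:, 0] += X[:, i] * dz           # dL/dw_j = x_j * dz, accumulated
        db += dz                           # dL/db = dz, accumulated
    dw /= m                                # average over the m examples
    db /= m
    w = w - alpha * dw                     # gradient-descent update
    b = b - alpha * db
    return w, b
```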

  1. Vectorization 向量化

A very important technique for avoiding repetitive and inefficient explicit loops.

ExampleForVectorizedSpeed

A much faster way to compute over large datasets.
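
The standard speed comparison between np.dot and an explicit loop, sketched after the one in the video (the exact timings depend on your machine):

```python
import time
import numpy as np

a = np.random.rand(1000000)
b = np.random.rand(1000000)

tic = time.time()
c = np.dot(a, b)                 # vectorized dot product
print("vectorized:", 1000 * (time.time() - tic), "ms")

tic = time.time()
c = 0.0
for i in range(1000000):         # explicit for loop over every element
    c += a[i] * b[i]
print("for loop:  ", 1000 * (time.time() - tic), "ms")
```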

  1. Vectorizing Logistic Regression 向量化逻辑回归

Using the Python module NumPy to simplify the code.

  1. Vectorizing Logistic Regression's Gradient Computation 向量化逻辑回归的梯度计算

ImplementingLogisticRegression
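
A NumPy sketch of one fully vectorized iteration, with no explicit loop over the examples (my reading of the implementation on the slide, not a copy of it):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_step_vectorized(w, b, X, Y, alpha):
    # X: (n_x, m), Y: (1, m), w: (n_x, 1), b: scalar
    m = X.shape[1]
    Z = np.dot(w.T, X) + b       # (1, m)
    A = sigmoid(Z)               # (1, m)
    dZ = A - Y                   # (1, m)
    dw = np.dot(X, dZ.T) / m     # (n_x, 1)
    db = np.sum(dZ) / m          # scalar
    w = w - alpha * dw
    b = b - alpha * db
    return w, b
```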

  1. Broadcasting in Python

Broadcasting

ExampleForBroadcasting
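
A small sketch of NumPy broadcasting rules (the arrays are made up; the point is how shapes (1, 4), (3, 1) and scalars get stretched to match):

```python
import numpy as np

A = np.array([[1.0,   2.0,   3.0,   4.0],
              [10.0,  20.0,  30.0,  40.0],
              [100.0, 200.0, 300.0, 400.0]])    # shape (3, 4)

col_sum = A.sum(axis=0).reshape(1, 4)           # shape (1, 4)
percent = 100 * A / col_sum                     # (3, 4) / (1, 4): the row is broadcast down all 3 rows
print(percent)

v = np.array([[1.0], [2.0], [3.0]])             # shape (3, 1)
print(v + 100)                                  # the scalar 100 is broadcast to every element
print(A + v)                                    # (3, 4) + (3, 1): the column is broadcast across all 4 columns
```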

Course 1: Videos 25-35

  1. Neural Network Overview

  2. Neural Network Representation

Neural Network Representation

image-20200515110648704

image-20200515111536381

  1. Vectorizing

image-20200515112005402

image-20200515112240179

  1. Activation functions

Here are four useful and commonly used activation functions:

  • sigmoid: rarely used in hidden layers any more, because it has several disadvantages (e.g. saturation).
  • tanh: squashes the output to between -1 and 1.
  • ReLU: Rectified Linear Unit, a = max(0, z) (Andrew Ng's default recommendation).
  • Leaky ReLU: a ReLU variant whose slope is small but non-zero when the input is below zero, e.g. a = max(0.01z, z).

image-20200515114039515
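
A NumPy sketch of the four activations listed above (the 0.01 slope in Leaky ReLU is the usual default, not a required value):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # output in (0, 1)

def tanh(z):
    return np.tanh(z)                  # output in (-1, 1)

def relu(z):
    return np.maximum(0, z)            # a = max(0, z)

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)    # small non-zero slope for z < 0

z = np.linspace(-3, 3, 7)
print(relu(z))
print(leaky_relu(z))
```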

  1. Why we use non-linear activation functions

If we use a linear activation function in the hidden layers, the hidden layers are useless, because a composition of linear functions is still linear. **So we only use a linear activation in the final output layer (e.g. when the network outputs a real number), apart from some special circumstances.** In that case, using ReLU for the output is fine too.

  1. The derivatives of activation functions
  • sigmoid -> a’ = sigmoid’(z) = a (1 - a)
  • tanh -> a’ = tanh’(z) = 1 - a^2
  • ReLU -> a’ = (max(0, z))’ = 0, if z < 0; 1, if z >= 0.
  • Leaky ReLU -> a’ = (max(0.01 * z, z))’ = 0.01, if z < 0; 1 if z >=0.
  1. Gradient descent for neural networks

image-20200515120536239

backward propagation

  1. Random Initialization

If we don't randomly initialize the parameter W, every unit in a layer computes the same result and stays identical after each update, so gradient descent effectively doesn't work: the units all output the same thing.

Initializing b with zeros is OK, but W should be randomly initialized.

In Python:

  • w1 = np.random.randn(a, b) * 0.01 (why 0.01 instead of 100 or 1000?)

    • The multiplier is usually very small: if the initial weights are too large, sigmoid/tanh saturate (their gradients become tiny), which makes learning too slow.
  • b1 = np.zeros((b, 1))

image-20200518213524372

Course 1: Videos 36-43

  1. Deep L-layer Neural network

image-20200522101551302

image-20200522101958747

  1. Getting your matrix dimensions right

For a single example:

image-20200522104705786

Z[l] -> (n[l], 1); A[l] -> (n[l], 1)

After vectorization: when the parameter b is added, Python automatically broadcasts it, duplicating b into an (n[l], m) matrix

image-20200522105131223

  1. Why deep representations?

image-20200522110207245

  1. Building blocks of deep neural networks

image-20200522111438793

image-20200522111720050

  1. Forward propagation in a deep network

image-20200522102625960

  1. Forward and backward propagation

Forward propagation:

image-20200522112104484

Backward propagation:

image-20200522112407294

Note the parameter ==dA== that initializes the backward propagation

image-20200522112701454

  1. Parameters ==vs== Hyperparameters

Hyperparameters are chosen by the researcher, and they control the final values of the parameters ==W, b==.

Course 2: Videos 1-14

  1. Train / dev / test sets

image-20200525092506646

image-20200525093136620

image-20200525093421911

  1. Bias / Variance 偏差/方差

image-20200525093718667

image-20200525094113692

  1. Basic recipe for machine learning

image-20200525095149437

  1. Regularization 正则化

image-20200525095646075

image-20200525100202996

  1. Why regularization reduces overfittings?

image-20200525101523571

image-20200525101056476

  1. Dropout regularization 随机失活正则化

Randomly choose units in each layer and disable them (set their activations to zero) during training.

image-20200525102156007

image-20200525103248930

image-20200525103527601

image-20200525104132554

  1. Other regularization methods

Data augmentation (when we can’t get more data)

flipping horizontally, random rotations, or distortions

image-20200525105156574

Early stopping (avoid overfitting)

image-20200525105734977

  1. Normalizing inputs

There are two steps:

  • subtract mean (zero mean)
  • normalize variance

Use the same μ and σ² (computed on the training set) for the test set.

image-20200525110214245

image-20200525110622485
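
A sketch of the two steps, reusing the training-set μ and σ² on the test set (function and variable names are mine):

```python
import numpy as np

def fit_normalizer(X_train):
    # X_train: (n_x, m); statistics are computed per feature, across the m examples
    mu = np.mean(X_train, axis=1, keepdims=True)      # (n_x, 1)
    sigma2 = np.var(X_train, axis=1, keepdims=True)   # (n_x, 1)
    return mu, sigma2

def normalize(X, mu, sigma2, eps=1e-8):
    return (X - mu) / np.sqrt(sigma2 + eps)           # zero mean, unit variance

X_train = np.random.randn(3, 100) * 5 + 2
X_test = np.random.randn(3, 20) * 5 + 2
mu, sigma2 = fit_normalizer(X_train)
X_train_norm = normalize(X_train, mu, sigma2)
X_test_norm = normalize(X_test, mu, sigma2)           # same mu and sigma2 as the training set
```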

  1. Vanishing / exploding gradients

If the weights W are consistently a little greater than 1, the activations (and gradients) grow exponentially with depth and explode.

If the weights W are consistently a little smaller than 1, the activations (and gradients) shrink exponentially with depth and vanish.

Both cases make learning very slow.

image-20200525111309317

  1. Weight initialization for deep network

Because of vanishing/exploding gradients, we have to be careful with weight initialization.

image-20200525112033756
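
A sketch of variance-scaled initialization as I understand it from the slide: scale the random weights by sqrt(2/n_prev) for ReLU layers (He initialization), or sqrt(1/n_prev) for tanh (Xavier-style):

```python
import numpy as np

def init_layer(n_prev, n_curr, activation="relu"):
    if activation == "relu":
        scale = np.sqrt(2.0 / n_prev)    # He initialization for ReLU
    else:
        scale = np.sqrt(1.0 / n_prev)    # Xavier-style initialization for tanh
    W = np.random.randn(n_curr, n_prev) * scale
    b = np.zeros((n_curr, 1))
    return W, b

W1, b1 = init_layer(n_prev=5, n_curr=4, activation="relu")
print(W1.shape, b1.shape)   # (4, 5) (4, 1)
```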

  1. Gradient checking

For checking whether the computation of the derivatives is correct.

Numerical approximation of gradients 梯度数值逼近

Use the two-sided difference, whose error is much smaller than that of the one-sided difference.

image-20200525113200276

Gradient checking

image-20200525113515862

image-20200525113823819
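
A sketch of the two-sided numerical gradient and the relative-difference check; the toy quadratic cost is my own, and the thresholds in the comment follow the lecture's rule of thumb:

```python
import numpy as np

def numerical_grad(J, theta, eps=1e-7):
    # two-sided difference for each component of theta
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        plus = theta.copy();  plus[i] += eps
        minus = theta.copy(); minus[i] -= eps
        grad[i] = (J(plus) - J(minus)) / (2 * eps)
    return grad

def grad_check(analytic, approx):
    # relative difference: ||g - g_approx|| / (||g|| + ||g_approx||)
    num = np.linalg.norm(analytic - approx)
    den = np.linalg.norm(analytic) + np.linalg.norm(approx)
    return num / den

J = lambda t: np.sum(t ** 2)         # toy cost J(theta) = sum(theta_i^2)
theta = np.array([1.0, -2.0, 3.0])
analytic = 2 * theta                 # its true gradient is 2 * theta
approx = numerical_grad(J, theta)
print(grad_check(analytic, approx))  # ~1e-7 or less: great; ~1e-3: probably a bug
```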

  1. Gradient Checking implementation notes.
  • Don’t use in training, only to debug.
  • If algorithm fails grad check, look at components to try to identify bug.
  • Remember regularization.
  • Doesn’t work with dropout.
  • Run at random initialization; perhaps run again after some training.

Course 2: Videos 15-24

  1. Mini-batch gradient descent

When the number of training examples is very large, processing the whole training set on every gradient step becomes too slow.

A mini-batch is a small set of training examples extracted from the whole (enormous) training set.

image-20200530164024668

image-20200530164538205

image-20200530164819579

image-20200530171013466

image-20200530171241167
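
A sketch of shuffling and partitioning the training set into mini-batches (a batch size of 64 is just an example; powers of two in the 64-512 range are typical):

```python
import numpy as np

def make_mini_batches(X, Y, batch_size=64, seed=0):
    # X: (n_x, m), Y: (1, m)
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)                 # shuffle the examples first
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    batches = []
    for start in range(0, m, batch_size):     # the last batch may be smaller
        end = min(start + batch_size, m)
        batches.append((X_shuf[:, start:end], Y_shuf[:, start:end]))
    return batches

X = np.random.randn(3, 1000)
Y = (np.random.rand(1, 1000) > 0.5).astype(float)
for X_batch, Y_batch in make_mini_batches(X, Y):
    pass   # run one gradient-descent step per mini-batch here
```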

  1. Exponentially weighted averages 指数加权平均

image-20200530204422817

image-20200531095044302

image-20200531095701942

image-20200531095919807

The advantages of exponentially weighted averages:

  • it basically takes up just one line of code. 只占用一行代码
  • it only needs storage and memory for a single number to maintain the average. 只需要存储单行数字内容
  • it is not the best or most accurate way to compute an average, but it is very cheap.
  1. Bias correction in exponentially weighted averages

In the last example, using exponentially weighted averages makes the first several values too low to accurately portray the temperatures of the first several days.

So we use a bias correction to fix it.

image-20200531100714813

As t grows, β^t goes to zero, so the correction factor approaches 1 and has little effect on the later samples.

Most people don't bother with it unless the first several values matter.
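
A sketch of the running average v_t = β·v_{t-1} + (1-β)·θ_t together with the bias-corrected value v_t / (1 - β^t), on made-up temperature data:

```python
import numpy as np

temps = 20 + 5 * np.sin(np.linspace(0, 3, 60)) + np.random.randn(60)  # fake daily temperatures
beta = 0.9      # averages over roughly 1 / (1 - beta) = 10 days

v = 0.0
ewa, ewa_corrected = [], []
for t, theta in enumerate(temps, start=1):
    v = beta * v + (1 - beta) * theta          # one line of code, one number of state
    ewa.append(v)
    ewa_corrected.append(v / (1 - beta ** t))  # bias correction matters only for small t
print(ewa[:3])             # starts far too low
print(ewa_corrected[:3])   # much closer to the actual temperatures
```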

  1. Gradient descent with momentum 动量梯度下降法

With the previous methods, we have to keep the learning rate small, otherwise the updates oscillate and can move far away from the minimum.

With gradient descent with momentum, we use exponentially weighted averages of the gradients as the update direction, which damps the oscillations and keeps the steps pointed toward the minimum.

image-20200531101841756

image-20200531102131122
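
A sketch of the momentum update: keep exponentially weighted averages of dW and db and step along those instead of the raw gradients (β = 0.9 is the common default):

```python
def momentum_step(W, b, dW, db, vdW, vdb, alpha=0.01, beta=0.9):
    # vdW, vdb: exponentially weighted averages of past gradients (initialize them to zeros)
    vdW = beta * vdW + (1 - beta) * dW
    vdb = beta * vdb + (1 - beta) * db
    W = W - alpha * vdW     # step along the smoothed direction, not the raw gradient
    b = b - alpha * vdb
    return W, b, vdW, vdb
```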

  1. RMSprop

Another way to speed up gradient descent.

I don't fully understand it yet; roughly, each update is divided by a running average of the squared gradients, so directions that oscillate a lot take smaller steps.

image-20200531103133043

  1. Adam optimization algorithm

Adam: Adaptive Moment Estimation

Another generally applicable way to speed up gradient descent; it combines momentum and RMSprop.

image-20200531105317686

image-20200531105418927
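
A sketch of one Adam update for a parameter W, combining the momentum term (β1) with the RMSprop term (β2) and bias-correcting both; β1 = 0.9, β2 = 0.999, ε = 1e-8 are the usual defaults:

```python
import numpy as np

def adam_step(W, dW, vdW, sdW, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # t is the iteration number, starting at 1; vdW and sdW start at zeros
    vdW = beta1 * vdW + (1 - beta1) * dW          # momentum: average of the gradients
    sdW = beta2 * sdW + (1 - beta2) * (dW ** 2)   # RMSprop: average of the squared gradients
    vdW_hat = vdW / (1 - beta1 ** t)              # bias corrections
    sdW_hat = sdW / (1 - beta2 ** t)
    W = W - alpha * vdW_hat / (np.sqrt(sdW_hat) + eps)
    return W, vdW, sdW

# the same update is applied to b with its own vdb and sdb
```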

  1. Learning rate decay

There are several ways to make the learning rate decrease as the iterations go on.

image-20200531105938019

image-20200531110051578
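
A few decay schedules of the kind shown in the slides, sketched as small functions (decay_rate and k are hyperparameters; the exact forms here are common choices, not the only ones):

```python
import numpy as np

alpha0 = 0.2         # initial learning rate
decay_rate = 1.0
k = 0.95

def inverse_decay(epoch):
    return alpha0 / (1 + decay_rate * epoch)   # alpha0 / (1 + decay_rate * epoch_num)

def exponential_decay(epoch):
    return alpha0 * (k ** epoch)               # alpha0 * k^epoch_num

def sqrt_decay(epoch):
    return alpha0 / np.sqrt(1 + epoch)         # roughly alpha0 / sqrt(epoch_num)

print([round(inverse_decay(e), 3) for e in range(5)])
```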

  1. The problem of local optima 局部最优问题

image-20200531110604893

image-20200531110724616

These optimization methods help us move off plateaus and saddle points more quickly (in high dimensions, bad local optima are much rarer than plateaus).

Course 2: Videos 25-35

  1. Tuning process 调试处理

image-20200601204302879

Find the best hyperparameters‘ values

image-20200601204634913

image-20200601204808739

  1. Using an appropriate scale to pick hyperparameters

A common way to choose values, but it does not work for every hyperparameter.

image-20200601205022283

A special (logarithmic) scale for hyperparameters that are sensitive to very small changes.

image-20200601205248898

This is because of a property of exponentially weighted averages: as β gets very close to 1, a tiny change in β has a larger and larger effect.

image-20200601205745768

  1. Hyperparameters tuning in practice: Pandas vs. Caviar

Two different approaches to hyperparameter tuning: babysit a single model over time (the panda approach) or train many models in parallel (the caviar approach).

image-20200601210803581

  1. Normalizing activations in a network.

image-20200601211313319

image-20200601211758052

image-20200601214817911

image-20200601215137626

  1. Why does Batch Norm work?

image-20200603111555383

The function of Batch Norm is to control the mean and variance of each hidden layer's inputs: they are first normalized to zero mean and unit variance, then rescaled by learnable parameters.

So why do we want to control the mean and variance?

image-20200603112741076

This controlling reduces covariate shift: when the earlier hidden layers' parameters are updated, their outputs shift, which strongly influences and disturbs the later hidden layers' parameters (in Chinese, "牵一发而动全身", pull one hair and the whole body moves). That makes training unstable and chaotic, and when the newest data differ a lot from earlier data, the network may perform worse.

Normalizing each layer's inputs limits how much these distributions can shift, which averts the problem.

So the real point of Batch Norm is to make each hidden layer learn somewhat more independently of the others.

image-20200603113317734

  1. Batch norm at test time

At test time, we need another way to estimate μ and σ², typically an exponentially weighted average of the values seen across mini-batches during training.

image-20200603114739137

  1. Softmax regression

A generalization of logistic regression

Not just binary classification: it can classify an input into one of many classes.

image-20200603115657782

image-20200603120159683

The difference: the output layer takes in a vector and outputs a vector of class probabilities.

image-20200603120441817
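
A NumPy sketch of the softmax activation: exponentiate the logits and normalize so the outputs sum to 1 (subtracting the max first is a standard numerical-stability trick, not something emphasized in the slide):

```python
import numpy as np

def softmax(Z):
    # Z: (n_classes, m), one column of logits per example
    Z_shifted = Z - np.max(Z, axis=0, keepdims=True)   # for numerical stability
    expZ = np.exp(Z_shifted)
    return expZ / np.sum(expZ, axis=0, keepdims=True)

Z = np.array([[5.0], [2.0], [-1.0], [3.0]])   # four classes, one example
A = softmax(Z)
print(A, A.sum())                             # class probabilities; they sum to 1
```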

  1. Training a softmax classification.

image-20200603120759334

image-20200603121403079

image-20200603121622370

  1. Deep Learning frameworks

image-20200603121814542

image-20200603121851611

  1. TensorFlow
  • Basic usage

image-20200603122748753

image-20200603122903528

  • input data

image-20200603123148606

  • whole code example

image-20200603123435243
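
A minimal TensorFlow 1.x sketch in the spirit of the lecture's example, as I remember it: define the cost w^2 - 10w + 25 with its coefficients fed in as data, and let the framework compute the gradients and run the update (this needs the 1.x API; under TF 2.x you would have to go through tf.compat.v1 and disable eager execution):

```python
import numpy as np
import tensorflow as tf   # TensorFlow 1.x

coefficients = np.array([[1.0], [-10.0], [25.0]])

w = tf.Variable(0.0, dtype=tf.float32)
x = tf.placeholder(tf.float32, [3, 1])             # lets the cost's coefficients be fed in as data
cost = x[0][0] * w ** 2 + x[1][0] * w + x[2][0]    # w^2 - 10w + 25 = (w - 5)^2
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

init = tf.global_variables_initializer()
with tf.Session() as session:
    session.run(init)
    for _ in range(1000):
        session.run(train, feed_dict={x: coefficients})
    print(session.run(w))   # approaches 5.0, the minimizer
```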

Course 3: Videos 1-12

  1. Why ML strategy?

image-20200606091453068

How to choose the most promising direction for improving the performance of an ML system.

  1. Orthogonalization 正交化

image-20200606092046899

Orthogonalization means designing the controls so that changing one of them does not affect the others.

image-20200606092742211

  1. Single number evaluation metric 单一数字评估指标

Precision 查准率: of the examples that your classifier recognizes as cats, what percentage actually are cats?

Recall 查全率: of all the images that really are cats, what percentage were correctly recognized by your classifier?

F1 Score: a combination (the harmonic mean) of P and R

image-20200606102433184

A single combined number makes it easy to figure out which classifier is performing better than the others overall.
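
The three metrics as formulas, in a small sketch (the counts are hypothetical):

```python
def precision(tp, fp):
    return tp / (tp + fp)         # of the predicted cats, how many really are cats

def recall(tp, fn):
    return tp / (tp + fn)         # of the real cats, how many were found

def f1(p, r):
    return 2 * p * r / (p + r)    # harmonic mean of precision and recall

p = precision(tp=90, fp=10)       # hypothetical counts
r = recall(tp=90, fn=30)
print(p, r, f1(p, r))
```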

  1. Satisficing and optimizing metrics.

We should care about both accuracy and running time: optimize one metric (e.g. accuracy) while requiring the others (e.g. running time) to merely satisfy a threshold.

image-20200606103351231

  1. Train/dev/test distributions

One bad idea: splitting the dev and test sets in a way that is not truly random, so that they end up with different distributions.

image-20200610103534117

==A really good idea is to make the dev and test sets come from the same distribution, each containing data from all the relevant conditions, such as every region in this example.==

image-20200610103903800

  1. Size of dev and test sets

image-20200610104151103

But in the era of big data, we decide the size of the dev and test sets differently: they can be a much smaller fraction of the data.

image-20200610104248502

image-20200610104500075

  1. When to change dev/test sets and metrics

Sometimes one algorithm performs better than the others on the metric, but it outputs some unacceptable results when it goes wrong. We may then refuse to use it, because the consequences of its rare but intolerable blunders are too severe.

image-20200610105102600

The remedy: add weights to the error metric so that each kind of error counts according to how much we care about it, which lets us pick the algorithm that best fits our particular requirements.

image-20200610110225404

  1. Why human-level performance?

image-20200610110759027

Bayes optimal error (also called Bayesian optimal error, or Bayes error for short) is the error of the best possible theoretical function mapping from x to y.

image-20200610111146468

  1. Avoidable bias

image-20200611115346592

  1. Understanding human-level performance

image-20200613091305773

How should you define human-level error?

image-20200613092311029

**Normally we should use the best human performance as the human-level error, as a substitute or estimate for the Bayes error; but when our model does not yet perform anywhere near human level, it hardly matters which human benchmark we choose as the Bayes error.**

When the avoidable bias is non-negligible, we should focus on reducing bias; otherwise we should pay attention to variance.

  1. Surpassing human-level performance

image-20200613093244367

image-20200613093611006

  1. Improving your model performance

image-20200613094016023

image-20200613094305142

Course 3: Videos 13-22

  1. Carrying out error analysis 错误分析

image-20200615144412091

  1. Look at the misclassified examples to see how the errors arise.
  2. Figure out the possible causes.
  3. Tally them in a spreadsheet to see which cause is worth fixing first.

image-20200615145025091

  1. cleaning up incorrectly labeled data

image-20200615145422040

As long as the total data set is big enough, random labeling errors are acceptable, but systematic labeling errors are not.

image-20200615150053960

image-20200615150516455

  1. Build your first system quickly, then iterate

image-20200615151036893

image-20200615151140380

Guideline: Build a system first, and iterate.

  1. Training and testing on different distributions.

There are two ways to solve it.

  1. Merge them and shuffle.

image-20200615153559584

  1. Use the large generic dataset mainly for training, and reserve the data you really care about for the dev and test sets.

image-20200615154153186

  1. Bias and Variance with mismatched data distributions

image-20200615160321279

image-20200615160644471

image-20200615161232388

  1. Addressing data mismatch

image-20200615161755207

image-20200615161946845

  1. Transfer learning 迁移学习

Reuse knowledge learned on an old task for a new task.

image-20200615163710699

image-20200615164002861

  1. Multi-task learning 多任务学习

image-20200615164203212

image-20200615164535439

image-20200615164843129

  1. End-to-End deep learning 端对端深度学习

image-20200615165235342

image-20200615165819492

image-20200615165939240

  1. Whether to use end-to-end deep learning

image-20200615170349079

image-20200615170925241

Course 4: Videos 1-11

  1. Computer Vision

image-20200616084951453

  1. Edge Detection

Convolution 卷积

image-20200616085511773

image-20200616085604059

image-20200616085655347

image-20200616085806762

The use of convolution for vertical edge detection

image-20200616090001930

image-20200616090644499

image-20200616090731360
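
A sketch of a "valid" 2D convolution (really cross-correlation, as used in the lecture) applied with the vertical edge filter to a half-bright, half-dark image:

```python
import numpy as np

def conv2d_valid(image, kernel):
    # "valid" convolution: no padding, stride 1, output size (n - f + 1) per dimension
    n_h, n_w = image.shape
    f = kernel.shape[0]
    out = np.zeros((n_h - f + 1, n_w - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]])
image = np.hstack([np.ones((6, 3)) * 10, np.zeros((6, 3))])  # bright left half, dark right half
print(conv2d_valid(image, vertical_edge))   # strong responses along the vertical edge in the middle
```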

  1. Padding

image-20200616091338859

Two padding strategies: valid convolutions and same convolutions

image-20200616091512938

We usually use an odd filter size f.

  1. Strided convolution

image-20200616091748865

image-20200616091757404

image-20200616091814573

image-20200616091925121

image-20200616092444984
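
The output-size rule with padding p and stride s, as a small sketch: each spatial dimension goes from n to floor((n + 2p - f) / s) + 1 ("valid" means p = 0; "same" picks p = (f - 1) / 2 so that the size is preserved when s = 1):

```python
from math import floor

def conv_output_size(n, f, p=0, s=1):
    # n: input size, f: filter size, p: padding, s: stride
    return floor((n + 2 * p - f) / s) + 1

print(conv_output_size(6, 3))             # valid:   6x6 conv 3x3 -> 4x4
print(conv_output_size(6, 3, p=1))        # same:    p = (f - 1) / 2 = 1 keeps the size at 6
print(conv_output_size(7, 3, p=0, s=2))   # strided: (7 - 3) / 2 + 1 = 3
```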

  1. Convolution over volumes

image-20200616092639613

image-20200616092811448

image-20200616093039875

  1. One layer of convolutional neural network

image-20200616093610946

image-20200616093711561

image-20200616094232888

An Example of ConvNet

image-20200616095041404

image-20200616095216286

  1. Pooling layers 池化层(汇合层)

Pooling layers are used to reduce the size of the representation, to speed up computation, and to make some of the detected features a bit more robust. 缩减模型大小,提高计算速度,同时提高所提取特征的健壮性

image-20200616095619531

image-20200616095802291

image-20200616100019095
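
A sketch of 2D max pooling with filter size f and stride s (there are no parameters to learn; average pooling would just replace max with mean):

```python
import numpy as np

def max_pool2d(image, f=2, s=2):
    n_h, n_w = image.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i * s:i * s + f, j * s:j * s + f]
            out[i, j] = np.max(window)    # keep only the strongest activation in each window
    return out

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]], dtype=float)
print(max_pool2d(x))   # [[9. 2.] [6. 3.]]
```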

  1. Neural network example

image-20200616100952112

image-20200616101006120

  1. Why convolution?

image-20200616101443682

image-20200616101630163

image-20200616101947696

Course 4: Videos 12-22

  1. Why look at case studies?

image-20200622195228815

  1. Classic networks
  • LeNet-5: recognizes handwritten digits

image-20200622195928733

  • AlexNet

Takes an image as input and outputs scores over many classes.

image-20200622200326093

  • VGG - 16

image-20200626102816754

  1. Residual Network — Res Net

image-20200626103551402

image-20200626103751284

image-20200626104551750

image-20200626104906027

image-20200626105033227

  1. Network in Network and 1 * 1 convolutions

image-20200626105530885

image-20200626105656715

  1. Inception Network Motivation

image-20200626110040713

image-20200626110247040

image-20200626110517819

1*1 convolutions are used to reduce the number of channels (the third dimension).

  1. Inception Network

image-20200626110818596

image-20200626111002896

  1. Using open-source implementations

Github

  1. Transfer Learning

image-20200626111551586

image-20200626111711592

image-20200626111812124

  1. Data augmentation

image-20200626112129330

image-20200626112227329

image-20200626112411712

image-20200626112530514

  1. The state of computer vision

image-20200626112849086

image-20200626113335320

image-20200626113601239

Course 4: Videos 22-32

  1. Object localization

image-20200626130754143

image-20200626131613768

image-20200626131924761

image-20200626132129076

  1. Landmark detection

image-20200626133150844

  1. Object detection

image-20200626134703112

image-20200626134920442

  1. Convolutional implementation of sliding windows

image-20200626135530490

image-20200626141058696

image-20200626141205599

  1. Bounding box prediction

image-20200626141323362

YOLO algorithm

image-20200626170208739

image-20200626170233582

image-20200626170446787

image-20200626171014053

  1. Intersection over union 交并比

image-20200626171327450

Is a predicted box good enough, or too far off?

We need a measure to evaluate it: the intersection over union of the predicted box and the ground-truth box (commonly counted as correct when IoU >= 0.5).

image-20200626171519270
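
A sketch of intersection over union for two boxes given as (x1, y1, x2, y2) corners (the 0.5 threshold mentioned above is the usual convention):

```python
def iou(box_a, box_b):
    # boxes as (x1, y1, x2, y2) with x1 < x2 and y1 < y2
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)   # overlap area, 0 if the boxes are disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 4, 4), (2, 2, 6, 6)))   # 4 / 28, about 0.14: too low to count as a correct detection
```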

  1. Non-max suppression 非极大值抑制

Each object should be detected only once: keep the highest-confidence box and suppress the overlapping boxes that have a high IoU with it.

image-20200626172514450

  1. Anchor Boxes

Detect multiple objects whose midpoints fall in the same grid cell, by assigning each object to one of several predefined anchor box shapes.

image-20200626173155401

image-20200626173254454

image-20200626173556907

  1. YOLO algorithm

image-20200626174008319

image-20200626174316780

image-20200626174513394

  1. Region proposals 候选区域

image-20200626175039025

image-20200626180816656

Course 4: Videos 33-43