hchang 07 Nov 2021

DROPOUT: A SIMPLE WAY TO PREVENT NEURAL NETWORKS FROM OVERFITTING

1. Intro

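1) Pre-presentation quiz

Dropout sets some units of a given layer to 0, which cuts their connections to the rest of the layer. :(O/X)
Dropout permanently disconnects some units of a given layer from that layer. :(O/X)
At test time, even the units that were disconnected during training are used to produce the result. :(O/X)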

2) Biggest questions that came up while reading

Meaning of a “thinned” network…!: a network hollowed out on the inside because some of its units were dropped

In the paragraph below,

 In the simplest case, each unit is retained with a fixed probability p independent of other units, where p can be chosen using a validation set or can simply be set at 0.5

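what does “where p can be chosen using a validation set” mean..? (Presumably p is treated as a hyperparameter: train with several candidate values and keep the one with the lowest validation error.)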

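What is a feed-forward neural net?: a network whose connections contain no cycles, so information only flows forward from input to output.
What is a Boltzmann Machine?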

3) What the authors want to address:

Controlling overfitting in deep neural nets

Keeping the model reasonably practical

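4) The problems as the authors see them:

If a network is built up so that it fits the existing (training) data well, it overfits.
The best remedy for overfitting is to train diverse models on diverse data and let the ensemble effect cancel the overfitting, but that much diverse data is hard to obtain, and the hyperparameters of each model have to be tuned one by one.
Running several models as an ensemble is also a poor fit for production use.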

Solution:

Use dropout.

Randomly drop units: this keeps units from co-adapting too much.

During training, dropout samples from an exponential number of different "thinned" networks.    
At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks.

This has the effect of ensembling many models.

What is a hidden layer?

Multiple non-linear hidden layers make the model very expressive, which lets it learn complicated relationships between input and output.

the result of sampling noise

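What is sampling noise?
It is the noise that inevitably arises from sampling: the training set is only a finite sample, so it contains relationships that exist by chance and do not hold in the test data.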

With unlimited computation, the best way to “regularize” a fixed-sized model is to average the predictions of all possible settings of the parameters, weighting each setting by its posterior probability given the training data.
If computation were unlimited, the best regularization would be to average the predictions of every possible parameter setting, each weighted by its posterior probability. Why?

Bayesian gold standard

The “Bayesian gold standard” is to “regularize” predictions by computing the posterior predictive distribution $P(y|x,D)= \int p(y|x,w)p(w|D)dw$ , where $D$ is our dataset and w is the weights in our network. This integral is intractable in most cases.

We propose to do this by approximating an equally weighted geometric mean of the predictions of an exponential number of learned models that share parameters.
So they propose approximating a geometric mean.
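To make "equally weighted geometric mean" concrete, here is a small sketch. It assumes a single softmax layer with dropout on its inputs and p = 0.5, a case chosen because the result is exact there; for deeper networks the weight-scaling rule is only an approximation. The sizes and the names `W`, `x`, `geo_mean_pred` and `scaled_pred` are made up for illustration and do not come from the paper.

```python
import itertools
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
n_in, n_out = 4, 3           # illustrative sizes
W = rng.normal(size=(n_out, n_in))
x = rng.normal(size=n_in)

# Log-predictions of all 2^n thinned networks (each mask keeps or drops inputs).
log_preds = []
for mask in itertools.product([0.0, 1.0], repeat=n_in):
    log_preds.append(np.log(softmax(W @ (np.array(mask) * x))))

# Equally weighted geometric mean = exp(mean of logs), then renormalize.
geo = np.exp(np.mean(log_preds, axis=0))
geo_mean_pred = geo / geo.sum()

# One pass through the full network with the weights scaled by p = 0.5.
scaled_pred = softmax(0.5 * W @ x)

print(np.allclose(geo_mean_pred, scaled_pred))
```

The two predictions agree because the log of a softmax output is linear in the logits up to a term shared by all classes, and that shared term cancels when the geometric mean is renormalized.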

The authors introduce what they consider the best remedy for overfitting

Model combination nearly always improves the performance of machine learning methods. With large neural networks, however, the obvious idea of averaging the outputs of many separately trained nets is prohibitively expensive. Combining several models is most helpful when the individual models are different from each other and in order to make neural net models different, they should either have different architectures or be trained on different data. Training many different architectures is hard because finding optimal hyperparameters for each architecture is a daunting task and training each large network requires a lot of computation. Moreover, large networks normally require large amounts of training data and there may not be enough data available to train different networks on different subsets of the data. Even if one was able to train many different large networks, using them all at test time is infeasible in applications where it is important to respond quickly.

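This seems to be describing ensembling.

Using several models helps, provided the models were trained on different data or built with different architectures. But building many models with different architectures is expensive, because suitable hyperparameters have to be found for each model and every large model has to be trained separately. On top of that, every test input needs a result from all of the models, which is awkward inside an application that has to respond quickly.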

Dropout is a technique that addresses both these issues. It prevents overfitting and provides a way of approximately combining exponentially many different neural network architectures efficiently.

It supposedly has these amazing capabilities, and I can't wait to find out why…

The term “dropout” refers to dropping out units (hidden and visible) in a neural network. By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. The choice of which units to drop is random

First, units are dropped out at random.

In the simplest case, each unit is retained with a fixed probability p independent of other units, where p can be chosen using a validation set or can simply be set at 0.5, which seems to be close to optimal for a wide range of networks and tasks. For the input units, however, the optimal probability of retention is usually closer to 1 than to 0.5.

Wait, where did this probability p suddenly come from..? It says p can be chosen with a validation set or simply set to 0.5, but I can't tell what that means yet.

Wow..!!

Applying dropout to a neural network amounts to sampling a “thinned” network from it. The thinned network consists of all the units that survived dropout (Figure 1b). A neural net with n units, can be seen as a collection of $2^n$ possible thinned neural networks. These networks all share weights so that the total number of parameters is still $O(n^2)$, or less. For each presentation of each training case, a new thinned network is sampled and trained. So training a neural network with dropout can be seen as training a collection of $2^n$ thinned networks with extensive weight sharing, where each thinned network gets trained very rarely, if at all.

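So it was not about removing some nodes once and then training on what is left..!! At every training step, a random subset of units is chosen, the output is computed with only those units, and the weights are updated. At test time, however, all units are used, with the weights multiplied by the retention probability p that was used during training. This approximates the effect of averaging the predictions of the $2^n$ thinned networks.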
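The sketch below illustrates the train/test difference on the simplest possible case, a single linear output fed by n units. This is an assumption made so that the average over all thinned networks can be computed exactly. The values of `n`, `p`, `w` and `y` are arbitrary and not taken from the paper.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 0.8                 # 5 units, retention probability p (illustrative)
w = rng.normal(size=n)        # outgoing weights of the units
y = rng.normal(size=n)        # activations of the units

# Training time: every presentation samples a fresh thinned network.
mask = rng.binomial(1, p, size=n)     # 1 = keep, 0 = drop
train_output = w @ (mask * y)

# Exact expectation over all 2^n thinned networks, weighted by mask probability.
expected = 0.0
for m in itertools.product([0, 1], repeat=n):
    m = np.array(m)
    prob = p ** m.sum() * (1 - p) ** (n - m.sum())
    expected += prob * (w @ (m * y))

# Test time: keep every unit and scale the weights by p.
test_output = (p * w) @ y

print(np.isclose(expected, test_output))
```

For a linear output the match is exact. Once a non-linearity sits between the layers, scaling by p only approximates the average.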

Boltzmann Machine: a Boltzmann machine is a stochastic recurrent neural network. Because its connectivity is unconstrained, it is not very useful for practical machine learning or inference problems; putting constraints on the connectivity (as in a Restricted Boltzmann Machine) makes practical problems solvable.

What comes next

This paper is structured as follows.
Section 2 describes the motivation for this idea.

Section 3 describes relevant previous work.

Section 4 formally describes the dropout model.

Section 5 gives an algorithm for training dropout networks.

In Section 6, we present our experimental results where we apply dropout to problems in different domains and compare it with other forms of regularization and model combination.

Section 7 analyzes the effect of dropout on different properties of a neural network and describes how dropout interacts with the network’s hyperparameters.

Section 8 describes the Dropout RBM model.

In Section 9 we explore the idea of marginalizing dropout.

In Appendix A we present a practical guide for training dropout nets. This includes a detailed analysis of the practical considerations involved in choosing hyperparameters when training dropout networks.

2. Motivation

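The motivation is said to come from the role of sex in evolution. Asexual reproduction may be better for adapting to one particular situation, but sexual reproduction is more effective for adapting to a variety of situations.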

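3. Related Work

Not much here. The main takeaway is roughly: drop 20% of the input units and 50% of the hidden units?

It mostly seems to introduce existing models that add noise to prevent overfitting.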

4. Model Description

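$z_i^{(l+1)} = w_i^{(l+1)} y^{(l)} + b_i^{(l+1)}$
$y_i^{(l+1)} = f(z_i^{(l+1)})$

Starting from this standard feed-forward structure, dropout changes the computation to the following, where $*$ is an element-wise product:

$r_j^{(l)} \sim \mathrm{Bernoulli}(p)$
$\tilde{y}^{(l)} = r^{(l)} * y^{(l)}$
$z_i^{(l+1)} = w_i^{(l+1)} \tilde{y}^{(l)} + b_i^{(l+1)}$
$y_i^{(l+1)} = f(z_i^{(l+1)})$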
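The equations above can be sketched as a single NumPy function. This is a minimal illustration, not the authors' implementation: the function name `dropout_forward`, the choice of tanh for $f$, and the layer sizes are all assumptions.

```python
import numpy as np

def dropout_forward(y, W, b, p, rng, f=np.tanh):
    """One layer of the dropout feed-forward pass (training time).

    y: activations y^(l), shape (n_in,)
    W: weights, row i is w_i^(l+1), shape (n_out, n_in)
    b: biases b^(l+1), shape (n_out,)
    p: retention probability (probability that a unit is kept)
    f: activation function (tanh is an arbitrary choice here)
    """
    r = rng.binomial(1, p, size=y.shape)   # r_j ~ Bernoulli(p)
    y_tilde = r * y                        # thinned activations
    z = W @ y_tilde + b                    # pre-activations z^(l+1)
    return f(z), r

rng = np.random.default_rng(0)
y = rng.normal(size=6)
W = rng.normal(size=(4, 6))
b = np.zeros(4)
out, r = dropout_forward(y, W, b, p=0.5, rng=rng)
print(out.shape, r)
```

Returning the mask `r` alongside the output is a design choice for this sketch: the backward pass needs the same mask to know which units were dropped.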

5. Learning Dropout Nets

1) Backpropagation

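Training proceeds by stochastic gradient descent on minibatches, so it feels like the ordinary deep learning models we have been working with. The notable difference is that a thinned network is sampled for each training case in the minibatch, and the weights attached to a dropped unit receive zero gradient from that case.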
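A rough sketch of one such minibatch step for a single linear layer is shown below. The squared-error loss, the sizes, the learning rate `lr` and the omission of the bias are assumptions made to keep the gradient easy to write by hand; the paper itself does not prescribe them.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_in, n_out, p = 8, 5, 3, 0.5   # illustrative sizes
Y = rng.normal(size=(batch, n_in))      # layer inputs, one row per training case
T = rng.normal(size=(batch, n_out))     # regression targets (made up)
W = rng.normal(size=(n_out, n_in))

R = rng.binomial(1, p, size=(batch, n_in))   # a separate mask per training case
Y_tilde = R * Y
Z = Y_tilde @ W.T                            # linear layer, no bias for brevity

# Squared-error loss 0.5 * ||z - t||^2: the per-case gradient w.r.t. W is
# outer(z - t, y_tilde), so the columns of dropped units are exactly zero.
per_case_grad = (Z - T)[:, :, None] * Y_tilde[:, None, :]   # (batch, n_out, n_in)
grad_W = per_case_grad.mean(axis=0)                         # average over the minibatch

lr = 0.1
W_new = W - lr * grad_W
```

A weight column is left unchanged by this step only if its unit happened to be dropped in every case of the minibatch. Otherwise it is updated from the cases that kept the unit.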

One particular form of regularization was found to be especially useful for dropout: constraining the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant c. In other words, if w represents the vector of weights incident on any hidden unit, the neural network was optimized under the constraint $\|w\|_2 \leq c$. This constraint was imposed during optimization by projecting w onto the surface of a ball of radius c, whenever w went out of it.

For one thing, batchnorm apparently did not exist yet when the paper was written (2014). To keep training stable, max-norm is used so that the norm of w never exceeds c.
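A minimal sketch of the max-norm projection, assuming each row of `W` holds the incoming weights of one hidden unit. The helper name `max_norm_project` and the example values are made up for illustration. In practice it would be applied after each gradient update.

```python
import numpy as np

def max_norm_project(W, c):
    """Rescale each row of W (incoming weights of one hidden unit) to norm <= c."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    # Rows already inside the ball keep scale 1; the others are shrunk onto its surface.
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

W = np.array([[3.0, 4.0],    # norm 5, shrunk to norm 2
              [0.6, 0.8]])   # norm 1, left unchanged
print(max_norm_project(W, c=2.0))
```

Only the length of an offending weight vector changes; its direction is preserved.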