Deep Learning

ADAMP: SLOWING DOWN THE SLOWDOWN FOR MOMENTUM OPTIMIZERS ON SCALE-INVARIANT WEIGHTS

Jongmin Lim 2024. 6. 10. 09:48

Background

[Momentum]

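Momentum accumulates past gradients so that successive updates keep a consistent direction, which encourages faster convergence.
As a result, optimization can proceed faster and more stably.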
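The momentum update described above can be sketched as follows. This is a minimal illustration on a toy quadratic objective, not code from the paper; the names `momentum_step`, `grad_quadratic`, `lr`, and `beta` are assumptions made for this example.

```python
def momentum_step(w, p, grad, lr=0.1, beta=0.9):
    """One heavy-ball step: p <- beta * p + grad, then w <- w - lr * p."""
    p = [beta * pi + gi for pi, gi in zip(p, grad)]
    w = [wi - lr * pi for wi, pi in zip(w, p)]
    return w, p

def grad_quadratic(w):
    # Gradient of the toy objective f(w) = 0.5 * (w0^2 + 10 * w1^2).
    return [w[0], 10.0 * w[1]]

w, p = [1.0, 1.0], [0.0, 0.0]
for _ in range(100):
    w, p = momentum_step(w, p, grad_quadratic(w), lr=0.05, beta=0.9)

# Loss after training; it started at 5.5 for w = [1, 1].
final_loss = 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)
```

The buffer `p` keeps a decaying sum of past gradients, so directions that persist across steps are amplified while directions that flip sign tend to cancel.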

[Batch Normalization]

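For each batch of data $X$, the mean is set to 0 and the variance to 1, which keeps the scale of the input constant.
Applying the learnable scale parameter $\gamma$ and shift parameter $\beta$ to the normalized value $\hat{X}$ gives the final output: $Y= \gamma \hat{X} + \beta$
Because normalization removes the scale of its input, and the output scale is set by the learned $\gamma$ and $\beta$, the model output stays essentially the same even when the weights feeding the layer are rescaled.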
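The scale invariance that batch normalization induces can be checked numerically. The sketch below assumes a 1-D batch and a single hypothetical scalar weight `w` in front of the normalization; `batch_norm` is a simplified stand-in written for this example, not a library function.

```python
import math

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-12):
    """Normalize a 1-D batch to zero mean and unit variance, then apply gamma and beta."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [gamma * (xi - mean) / math.sqrt(var + eps) + beta for xi in x]

inputs = [0.5, -1.2, 3.0, 0.7]  # toy batch of scalar inputs
w = 0.8                         # hypothetical scalar weight before the BN layer

y = batch_norm([w * xi for xi in inputs], gamma=2.0, beta=0.5)
y_scaled = batch_norm([10.0 * w * xi for xi in inputs], gamma=2.0, beta=0.5)

# Largest elementwise difference between the two outputs.
max_diff = max(abs(a - b) for a, b in zip(y, y_scaled))
```

Multiplying `w` by 10 leaves the output unchanged up to the tiny effect of `eps`, which is why only the direction of such a weight matters to the model.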

Notable sentences

They let weights converge more quickly with often better generalization performances.
In this paper, we verify that the widely-adopted combination of the two ingredients lead to the premature decay of effective step sizes and sub-optimal model performances.
Because of the scale invariance, this modification only alters the effective step sizes without changing the effective update directions, thus enjoying the original convergence properties of GD optimizers.
We propose a simple solution to slow down the decay of effective step sizes while maintaining the step directions of the original optimizer in the effective space.

Abstract

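Normalization techniques such as Batch Normalization let weights converge more quickly, often with better generalization performance.
- This is believed to be due to the normalization-induced scale invariance.
However, for scale-invariant weights, adding momentum can make the effective step size decrease faster than expected.
- As the step size shrinks, training proceeds in ever smaller steps and eventually slows down or stalls.
- In addition, problems such as irregular learning speed or increased instability during convergence can occur.
This paper verifies that combining Batch Normalization with momentum leads to premature decay of the effective step sizes and to sub-optimal model performance.
It then proposes SGDP and AdamP, which remove at each optimizer step the radial component, the part of the update that increases the weight norm.
The proposed method modifies only the effective step sizes.
- That is, the modified optimizers improve training speed and stability while preserving the update directions, so they retain the benefits of the original gradient descent methods.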
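Removing the radial component amounts to projecting the update onto the plane orthogonal to the weight. The sketch below shows only that projection on a 2-D example; `remove_radial`, the vectors, and `lr` are illustrative assumptions, and the published SGDP and AdamP optimizers contain further details that these notes do not cover.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def remove_radial(w, update):
    """Return update minus its component along w (projection onto the tangent space)."""
    coef = dot(w, update) / dot(w, w)
    return [ui - coef * wi for ui, wi in zip(update, w)]

w = [3.0, 4.0]    # current weight, ||w|| = 5
p = [-2.0, 1.0]   # hypothetical momentum vector; w . p < 0, so w - lr * p moves outward
lr = 0.1

p_proj = remove_radial(w, p)
w_plain = [wi - lr * pi for wi, pi in zip(w, p)]
w_proj = [wi - lr * pi for wi, pi in zip(w, p_proj)]

norm_plain = math.sqrt(dot(w_plain, w_plain))  # norm after the unmodified step
norm_proj = math.sqrt(dot(w_proj, w_proj))     # norm after the projected step
```

The projected update is orthogonal to `w`, so the projected step grows the norm only through the second-order term and the increase stays smaller than for the unmodified step.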

1 INTRODUCTION

Most models use normalization techniques such as Batch Normalization, which make the weights scale-invariant, so the magnitude of the weights does not affect the model.

Therefore, what affects the model is the weight divided by its l2-norm, called the effective weight: $\hat{w}=\frac{w}{\|w\|_2}$.

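However, the space in which optimization actually takes place is not the space of effective weights but the nominal space in which the original weights live.

For this reason, a gap arises between the effective step size and the actual nominal step size. In the figure, when $w_t$ is updated to $w_{t+1}$, the effective step that actually affects the model is the part shown in orange.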

The nominal step size and the effective step size differ by a factor of roughly $\frac{1}{\|w_{t+1}\|_2}$.
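The relation between the two step sizes can be checked on a small example. The sketch below applies the same nominal step to a short and a long weight vector and compares the measured effective step with the nominal step divided by the new weight norm; all names (`step_sizes`, `delta`, `small`, `large`) are assumptions made for this illustration.

```python
import math

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def normalize(v):
    n = norm(v)
    return [vi / n for vi in v]

def step_sizes(w, delta):
    """Return (nominal, effective, approximation) step sizes for the step w -> w + delta."""
    w_next = [wi + di for wi, di in zip(w, delta)]
    nominal = norm(delta)
    effective = norm([a - b for a, b in zip(normalize(w_next), normalize(w))])
    approx = nominal / norm(w_next)
    return nominal, effective, approx

delta = [0.0, 0.1]                      # same nominal step, orthogonal to w
small = step_sizes([1.0, 0.0], delta)   # weight with norm 1
large = step_sizes([10.0, 0.0], delta)  # weight with norm 10
```

Both calls have the same nominal step size, but the effective step for the longer weight is about ten times smaller, so a growing weight norm silently shrinks the step the model actually experiences.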

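With a plain gradient descent (GD) algorithm, the weight norm increases during training: the gradient with respect to a scale-invariant weight is orthogonal to the weight, so every step adds to the norm.

With momentum-based optimizers such as Adam, the weight norm grows even faster.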
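The norm growth can be reproduced on a toy scale-invariant objective. The sketch below assumes $f(w) = -\cos(w, \text{target})$ in two dimensions and uses SGD with heavy-ball momentum rather than Adam, to keep the example short; the function names and hyperparameters are illustrative choices, not values from the paper.

```python
import math

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def grad(w, target):
    """Gradient of f(w) = -cos(w, target) for a unit-norm target; always orthogonal to w."""
    n = norm(w)
    w_hat = [wi / n for wi in w]
    c = sum(a * b for a, b in zip(w_hat, target))
    return [-(ti - c * wi) / n for ti, wi in zip(target, w_hat)]

def run(beta, steps=50, lr=0.1):
    """Train from w = [1, 0] and return the final weight norm (beta = 0 is plain GD)."""
    w, p = [1.0, 0.0], [0.0, 0.0]
    target = [0.0, 1.0]
    for _ in range(steps):
        g = grad(w, target)
        p = [beta * pi + gi for pi, gi in zip(p, g)]
        w = [wi - lr * pi for wi, pi in zip(w, p)]
    return norm(w)

norm_gd = run(beta=0.0)
norm_momentum = run(beta=0.9)
```

Both runs start from a norm of 1; plain GD ends slightly above it, while the momentum run ends with a clearly larger norm, which in turn means smaller effective steps.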

Reference

https://arxiv.org/abs/2006.08217

AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights

