
OpenAI

- ์ƒ˜ ์•ŒํŠธ๋งŒ์ด 2015๋…„ 12์›” 11์ผ ์„ค๋ฆฝ

- ์ธ๊ณต์ง€๋Šฅ์ด ์ธ๋ฅ˜์— ์žฌ์•™์ด ๋˜์ง€ ์•Š๊ณ , ์ธ๋ฅ˜์—๊ฒŒ ์ด์ต์„ ์ฃผ๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ํ•จ

- Microsoft๊ฐ€ ํˆฌ์ž๋ฅผ ๋ฐ›๊ณ , ๋…์  ๋ผ์ด์„ ์Šค ์ œ๊ณต

- ๋น„์˜๋ฆฌ ๊ธฐ์—…์œผ๋กœ ์‹œ์ž‘ํ–ˆ์œผ๋‚˜ ํ˜„์žฌ๋Š” ํ•œ๊ณ„ ์˜๋ฆฌ๊ธฐ์—… ํ˜•ํƒœ

ChatGPT

- A conversational AI chatbot based on the very large language model GPT-3.5

- A service that returns human-like answers to questions typed into a chat interface

- A Transformer-based language model with a very large number of parameters, trained on a very large amount of data

 

Language Model

- ์–ธ์–ด ๋ชจ๋ธ์€ ๋‹จ์–ด์— ํ™•๋ฅ ์„ ๋ถ€์—ฌํ•ด์„œ ๋ฌธ์žฅ์ด ์–ผ๋งˆ๋‚˜ ์ž์—ฐ์Šค๋Ÿฌ์šด์ง€๋ฅผ ํ‰๊ฐ€ํ•  ์ˆ˜ ์žˆ๋Š” ๋ชจ๋ธ

- ์ด์ „์˜ ๋ฌธ๋งฅ์„ ๋ฐ”ํƒ•์œผ๋กœ ๋‹ค์Œ ๋˜๋Š” ํŠน์ • ์œ„์น˜์— ์ ํ•ฉํ•œ ๋‹จ์–ด๋ฅผ ํ™•๋ฅ ์ ์œผ๋กœ ์˜ˆ์ธก

- ์–ธ์–ด ๋ชจ๋ธ์˜ ํ‰๊ฐ€ ์ง€ํ‘œ๋Š” ํ…์ŠคํŠธ์˜ PPL(Perplexity)์„ ์‚ฌ์šฉ

   Ngram > RNN(1986) > LSTM(1997) > Transformer(2017)
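
As a quick illustration of PPL (not from the original post; the per-token probabilities are made-up numbers), perplexity is the exponentiated average negative log-probability the model assigns to the tokens of a text:

```python
import math

def perplexity(token_probs):
    """PPL = exp(-(1/N) * sum(log p_i)); lower PPL means the model finds the text more natural."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# Hypothetical probabilities a model assigned to the four tokens of a sentence
print(perplexity([0.2, 0.5, 0.9, 0.4]))  # ~2.30
```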

 

N-gram

- ํ…์ŠคํŠธ์—์„œ n๊ฐœ์˜ ๋‹จ์–ด ์‹œํ€€์Šค์˜ ํ™•๋ฅ ๋งŒ์ด ๊ณ ๋ ค๋˜๋Š” ๋งค์šฐ ๊ฐ„๋‹จํ•œ ์–ธ์–ด๋ชจ๋ธ

- ์ฝ”ํผ์Šค์— ๋‚˜์˜ค์ง€ ์•Š์€ ๋‹จ์–ด ์‹œํ€€์Šค์˜ ํ™•๋ฅ ์„ ์ •ํ™•ํ•˜๊ฒŒ ์ถ”์ •ํ•˜์ง€ ๋ชปํ•จ

- ์ด์ „ n-1๊ฐœ์˜ ๋ฌธ๋งฅ๋งŒ ๊ณ ๋ คํ•˜๋ฏ€๋กœ ๊ธด ์‹œํ€€์Šค์— ์ ํ•ฉํ•˜์ง€ ์•Š์Œ
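
A minimal count-based bigram (n = 2) sketch, using a made-up toy corpus, that shows both weaknesses: probabilities come only from counted two-word sequences, and unseen sequences get probability zero:

```python
from collections import Counter

# Toy corpus; a real n-gram model is trained on a large corpus, usually with smoothing
corpus = "the cat sat on the mat the cat ran".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(prev, word):
    """P(word | prev) by maximum likelihood; an unseen bigram gets probability 0."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(bigram_prob("the", "cat"))   # 0.666... (seen twice after "the")
print(bigram_prob("cat", "flew"))  # 0.0 -> unseen sequence, underestimated
```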

 

RNN

- A language model based on a neural network with the concept of recurrence applied

- Uses networks such as RNN, LSTM, and GRU

- Can remember longer context than an n-gram model, and estimates probabilities of unseen sequences more accurately (see the sketch below)
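
A minimal RNN language model sketch in PyTorch; the vocabulary size, dimensions, and the LSTM choice are illustrative assumptions, not values from the post:

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """Embed tokens, run them through an LSTM, and project each hidden state to next-token logits."""
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)  # (batch, seq_len, vocab_size)

model = RNNLM()
logits = model(torch.randint(0, 1000, (2, 16)))  # dummy batch of token ids
print(logits.shape)  # torch.Size([2, 16, 1000])
```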

 

Transformer

- The method proposed in the 2017 paper "Attention Is All You Need"

- Its high scalability makes stable training possible even for very large language models

- It is also highly general: beyond text, it is used for speech and video as well (its core attention operation is sketched below)

Search

- Greedy Search: only the single highest-probability word's path survives at each step

- Beam Search: the top-k highest-scoring partial sequences are kept and extended in parallel (see the sketch below)
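
A greedy-decoding sketch, assuming a `model` that returns next-token logits (e.g. the hypothetical RNNLM sketched earlier); beam search would instead keep the top-k partial sequences at each step:

```python
import torch

def greedy_decode(model, tokens, steps=10):
    """Greedy search: at every step, only the single most probable next token survives."""
    for _ in range(steps):
        logits = model(tokens)                        # (batch, seq_len, vocab)
        next_id = logits[:, -1].argmax(-1, keepdim=True)
        tokens = torch.cat([tokens, next_id], dim=1)  # extend the one surviving path
    return tokens
```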

Sampling

- Top-K Sampling: sample from among the K highest-probability words

- Top-p Sampling: sample from the smallest set of words whose cumulative probability reaches p (both are sketched below)
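
Minimal sketches of both schemes over a dummy logit vector; note that this Top-p variant keeps the prefix of tokens whose cumulative probability stays at or below p (always keeping at least the top token), while some implementations also include the token that first crosses p:

```python
import torch

def top_k_sample(logits, k=50):
    """Top-K: sample only among the k highest-probability tokens."""
    probs = torch.softmax(logits, dim=-1)
    topk_probs, topk_ids = probs.topk(k)
    idx = torch.multinomial(topk_probs / topk_probs.sum(), 1)
    return topk_ids[idx]

def top_p_sample(logits, p=0.9):
    """Top-p (nucleus): sample among the smallest high-probability prefix summing to ~p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = probs.sort(descending=True)
    keep = sorted_probs.cumsum(-1) <= p
    keep[0] = True  # always keep at least the most probable token
    kept = sorted_probs[keep]
    idx = torch.multinomial(kept / kept.sum(), 1)
    return sorted_ids[idx]

logits = torch.randn(1000)  # dummy next-token logits
print(top_k_sample(logits), top_p_sample(logits))
```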

GPTs

InstructGPT

- Background: GPT-3's outputs were often unhelpful, untruthful, and sometimes harmful

Step 1: Supervised fine-tuning (SFT)

   - Prompt selection: select diverse, suitable prompts from the dataset

   - Labelers write answers: labelers write appropriate answers that comply with the service policy

   - Model training: fine-tune GPT-3 on the data built this way

Step 2: Reward Model Training

   - Response generation: the SFT model generates several answers for each prompt

   - Labelers rank and score: labelers write a ranking and score for each response

   - Reward model training: train a model that predicts the reward of a response (the usual pairwise loss is sketched below)
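
The reward model is trained on the labelers' pairwise preferences; a minimal sketch of that ranking loss with made-up reward values (the reward model itself is omitted):

```python
import torch
import torch.nn.functional as F

def reward_pair_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected): pushes the preferred answer's reward above the other's."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Dummy scalar rewards the model assigned to the preferred / rejected answer per prompt
r_chosen, r_rejected = torch.randn(4), torch.randn(4)
print(reward_pair_loss(r_chosen, r_rejected))
```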

Step 3: Policy Training (LM Training)

   - Select prompt data for training

   - Language model update and answer generation: update the model according to the loss function, then generate answer data with the new model (the objective is sketched below)

 

๋น„๊ฐ๋… ์‚ฌ์ „ํ•™์Šต (Un-supervised Pre-training)

- Wav2Vec ๊ฐ™์€ ๋น„๊ฐ๋… ์‚ฌ์ „ํ•™์Šต ๊ธฐ์ˆ ์ด ์ฃผ๋ชฉ๋ฐ›๊ณ  ์žˆ์Œ

- ์‚ฌ๋žŒ์ด ๋งŒ๋“  ๋ผ๋ฒจ์ด ํ•„์š” ์—†์œผ๋ฏ€๋กœ ์ˆ˜๋ฐฑ๋งŒ ์‹œ๊ฐ„๊นŒ์ง€ ๋ฐ์ดํ„ฐ๊ฐ€ ์‚ฌ์šฉ ๊ฐ€๋Šฅ

- ๋น„๊ฐ๋… ์‚ฌ์ „ ํ•™์Šต ๋ชจ๋ธ์„ ํŒŒ์ธ ํŠœ๋‹ํ•ด์„œ ํŠนํžˆ ์ ์€ ๋ฐ์ดํ„ฐ์—์„œ SOTA๋ฅผ ๋‹ฌ์„ฑ

 

Motivation

๋น„๊ฐ๋… ์‚ฌ์ „ ํ•™์Šต์˜ ํ•œ๊ณ„

- ์Œ์„ฑ์ธ์‹์„ ์œ„ํ•ด ํŒŒ์ธํŠœ๋‹ ํ•„์š”

- ๋จธ์‹  ๋Ÿฌ๋‹์ด ์šฐ์ˆ˜ํ•˜์ง€๋งŒ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์—์„œ ์„ฑ๋Šฅ ๋‚ฎ์Œ

- ๋จธ์‹  ๋Ÿฌ๋‹์ด ์‚ฌ๋žŒ์ด ์ธ์ง€ํ•˜์ง€ ๋ชปํ•˜๋Š” ๋ฐ์ดํ„ฐ์˜ ํŠน์„ฑ์„ ์•…์šฉํ•˜๊ณ  ์žˆ๊ธฐ ๋•Œ๋ฌธ

Whisper

- ์Œ์„ฑ์ธ์‹์˜ ์ตœ์ข… ๋ชฉ์ ์€ ๋ชจ๋“  ๋ฐ์ดํ„ฐ์—์„œ ํŒŒ์ธ ํŠœ๋‹์ด ์—†์ด ์•ˆ์ •์ ์œผ๋กœ ๋™์ž‘ํ•˜๋Š” ๊ฒƒ

 

Data Preprocessing

- ์ธํ„ฐ๋„ท์—์„œ ์ „์‚ฌ๊ฐ€ ์žˆ๋Š” ์Œ์„ฑ ๋ฐ์ดํ„ฐ๋ฅผ ๊ตฌ์ถ•ํ–ˆ๋‹ค. : ๋‹ค์–‘ํ•œ ํ™”์ž, ์–ธ์–ด, ํ™˜๊ฒฝ์ด ํฌํ•จ, ์ธ์‹๊ธฐ๋ฅผ ๊ฐ•๊ฑดํ•˜๊ฒŒ ๋งŒ๋“ค์–ด ์คŒ

- ์ „์‚ฌ ์˜ค๋ฅ˜๊ฐ€ ๋งŽ์•„์„œ ์ž๋™ ํ•„ํ„ฐ๋ง์„ ๊ฐœ๋ฐœ : ์ธ์‹๊ธฐ๊ฐ€ ๋งŒ๋“ค์–ด๋‚ธ ์ „์‚ฌ๋ฅผ ๊ฑธ๋Ÿฌ์•ผ ํ–ˆ์Œ

- ์–ธ์–ด ๊ฒ€์ถœ๊ธฐ ๊ฐœ๋ฐœ : ์Œ์„ฑ ์–ธ์–ด ๊ฒ€์ถœ๊ธฐ(VoxLingua107 ์‚ฌ์šฉ) ๊ฐœ๋ฐœ, ์ „์‚ฌ ํ…์ŠคํŠธ๋กœ CLD2์—์„œ ๋‚˜์˜จ ์–ธ์–ด์™€ ๋‹ค๋ฅด๋ฉด ํ•™์Šต๋ฐ์ดํ„ฐ ์ œ์™ธ, ์ „์‚ฌ ์–ธ์–ด๊ฐ€ ์˜์–ด๋ฉด ๋ฒˆ์—ญ ๋ฐ์ดํ„ฐ๋กœ ์‚ฌ์šฉ

- ์Œ์„ฑ์€ 30์ดˆ ๋‹จ์œ„๋กœ ์ž๋ฅด๊ณ  ์ „์‚ฌ๋„ ๋ถ„๋ฆฌ : ์ž˜๋ž๋Š”๋ฐ ์ „์‚ฌ๊ฐ€ ์—†๋Š” ์Œ์„ฑ๋„ VAD ํ•™์Šต ๋ฐ์ดํ„ฐ๋กœ ์‚ฌ์šฉ

 

Long-form Transcription

- Long recordings are transcribed as consecutive 30-second segments

- The window is shifted according to the timestamps estimated by the model

- For stable transcription of long audio, beam search and temperature scheduling based on repetition and the model's predicted log-probability are important (see the usage sketch below)

Comparison with Human Performance

- To gauge the remaining headroom on the test sets, Whisper was compared with professional human transcribers

- Whisper's performance comes very close to human performance

Limitations and Future Work

Current limitations of Whisper

- Inaccurate timestamps

- Hallucinations

- Low performance on low-resource languages

- No speaker recognition

- No real-time transcription

- Pure PyTorch inference

Whisper Installation & Usage
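
A minimal sketch of the standard install-and-transcribe flow (the audio file name is a placeholder, and ffmpeg must be installed separately):

```python
# Install first:  pip install -U openai-whisper   (ffmpeg is also required)
import whisper

model = whisper.load_model("base")      # tiny / base / small / medium / large
result = model.transcribe("audio.mp3")  # placeholder file name

print(result["language"])  # detected language
print(result["text"])      # full transcript
```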

Results

 
