I recently found an NLP tutorial on Bilibili; this blog post will serve as a record of the key points from the course.
Long Short Term Memory (LSTM) Model
LSTM uses a “conveyor belt” to get longer memory than SimpleRNN.
Each of the following blocks has a parameter matrix:
- Forget gate
- Input gate
- New values
- Output values
Number of parameters: 4 × shape(h) × [shape(h) + shape(x)]
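The parameter count above can be checked with a few lines of plain Python. Note one assumption: frameworks such as Keras also add a bias vector of size shape(h) to each of the four blocks, which the slide formula omits.

```python
def lstm_param_count(h, x, bias=False):
    """Parameters of one LSTM layer: each of the 4 blocks (forget gate,
    input gate, new values, output gate) holds an h x (h + x) weight
    matrix; with bias=True, an extra bias of size h per block is added,
    matching what frameworks like Keras report."""
    per_block = h * (h + x) + (h if bias else 0)
    return 4 * per_block

# h = 32 hidden units, x = 100-dimensional input vectors
print(lstm_param_count(32, 100))             # 16896, the formula above
print(lstm_param_count(32, 100, bias=True))  # 17024, with bias terms
```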
- SimpleRNN and LSTM are two kinds of RNNs; always use LSTM instead of SimpleRNN.
- Use Bi-RNN instead of RNN whenever possible.
- Stacked RNN may be better than a single RNN layer (if n is big).
- Pretrain the embedding layer (if n is small).
- Partition text to (segment, next_char) pairs.
- One-hot encode the characters.
- Character → v × 1 vector.
- Segment → l × v matrix.
- Build and train a neural network
- l × v matrix ⇒ LSTM ⇒ Dense ⇒ v × 1 vector
- Propose a seed segment
- Repeat the following:
a) Feed the segment (with one-hot) to the neural network.
b) The neural network outputs probabilities.
c) next_char ← sample from the probabilities.
d) Append next_char to the segment.
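The generation loop above can be sketched in numpy. The trained LSTM+Dense network is replaced here by a stand-in function that just returns a softmax over the vocabulary, so only the one-hot encoding and the sampling step are real:

```python
import numpy as np

chars = sorted(set("hello world"))           # toy character vocabulary
char_to_idx = {c: i for i, c in enumerate(chars)}
v = len(chars)

def one_hot_segment(segment):
    """Encode a segment of length l as an l x v one-hot matrix."""
    m = np.zeros((len(segment), v))
    for i, c in enumerate(segment):
        m[i, char_to_idx[c]] = 1.0
    return m

def fake_model(x):
    """Stand-in for the trained network: any function of the input
    pushed through a softmax to get probabilities over the v chars."""
    logits = x.sum(axis=0)
    e = np.exp(logits - logits.max())
    return e / e.sum()

segment = "hello"                            # seed segment
for _ in range(5):                           # repeat steps a) to d)
    probs = fake_model(one_hot_segment(segment))       # a), b)
    next_char = np.random.choice(chars, p=probs)       # c) sample, not argmax
    segment += next_char                               # d)

print(segment)
```

Sampling from the probabilities (rather than always taking the argmax) is what keeps the generated text from looping on the single most likely character.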
- Bi-LSTM instead of LSTM (Encoder only!!!)
- Encoder’s final states (h_t and c_t) carry all the information of the English sentence.
- If the sentence is long, the final states have forgotten early inputs.
- Bi-LSTM (left-to-right and right-to-left) has longer memory.
- Use Bi-LSTM in the encoder; use unidirectional LSTM in the decoder.
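The "longer memory" point can be seen from the states alone. Below is a toy numpy sketch: a SimpleRNN-style tanh update stands in for each LSTM direction (an assumption for brevity), and the encoder hands the decoder the concatenation of the two final states, so words from the start of the sentence survive in the right-to-left state:

```python
import numpy as np

h, l = 4, 7                          # hidden size, sentence length
rng = np.random.default_rng(0)
xs = rng.standard_normal((l, 3))     # toy input sequence of 3-dim vectors

def run_rnn(inputs):
    """Toy recurrent pass (tanh update) standing in for one LSTM
    direction; only the final state matters for the encoder output."""
    W = rng.standard_normal((h, h)) * 0.1
    U = rng.standard_normal((h, inputs.shape[1])) * 0.1
    state = np.zeros(h)
    for x in inputs:
        state = np.tanh(W @ state + U @ x)
    return state

h_fwd = run_rnn(xs)                  # left-to-right pass: remembers the end
h_bwd = run_rnn(xs[::-1])            # right-to-left pass: remembers the start

# Bi-LSTM encoder output: both directions' final states, concatenated.
encoder_state = np.concatenate([h_fwd, h_bwd])
print(encoder_state.shape)           # (2*h,)
```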
- Word-Level Tokenization
- Word-Level tokenization instead of char-level.
- The average length of English words is 4.5 letters.
- The sequences will be 4.5x shorter.
- Shorter sequence -> less likely to forget.
- But you will need a large dataset:
- # of (frequently used) chars is ~10^2 → one-hot suffices.
- # of (frequently used) words is ~10^4 → must use embedding.
- Embedding layer has many parameters → overfitting!
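A quick back-of-the-envelope calculation makes the overfitting risk concrete. The embedding dimension of 300 below is an assumed value, not from the notes:

```python
# Parameter budget behind the word-level vs char-level trade-off.
char_vocab = 10**2       # ~frequently used characters: one-hot is fine
word_vocab = 10**4       # ~frequently used words: needs an embedding layer
embedding_dim = 300      # assumed embedding size for illustration

one_hot_params = 0                            # one-hot has no trainable weights
embedding_params = word_vocab * embedding_dim # one row of weights per word

print(embedding_params)  # 3000000 trainable weights before the LSTM even starts
```

Millions of embedding weights trained on a small dataset will overfit, which is why the notes recommend pretraining the embedding layer when the dataset is small.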
- Multi-Task Learning (this way there is only one shared encoder while the training data is effectively doubled, so the encoder can be trained better)
A drawback of the Seq2Seq model is that it cannot retain the complete information of a very long sentence: individual words may be forgotten, and since the decoder has no access to the full sentence, it cannot produce a correct translation.
For this reason, researchers introduced the Attention mechanism. Its characteristics are as follows:
- Attention tremendously improves Seq2Seq model.
- With attention, Seq2Seq model does not forget source input.
- With attention, the decoder knows where to focus.
- Downside: much more computation
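The "knows where to focus" point can be sketched in a few lines of numpy. Plain dot-product scores are an assumption here (the course may use a learned alignment function instead), but the mechanics are the same: one score per source position, a softmax, and a weighted sum over all encoder states.

```python
import numpy as np

def attention(decoder_state, encoder_states):
    """Minimal dot-product attention sketch: score every encoder
    state against the current decoder state, softmax the scores,
    and return the weights plus the weighted-sum context vector."""
    scores = encoder_states @ decoder_state   # one score per source step
    e = np.exp(scores - scores.max())
    weights = e / e.sum()                     # softmax: where to focus
    context = weights @ encoder_states        # weighted sum of all h_i
    return weights, context

encoder_states = np.random.randn(6, 8)  # 6 source steps, hidden size 8
decoder_state = np.random.randn(8)
weights, context = attention(decoder_state, encoder_states)
print(weights.sum())                    # 1.0: every source step is reachable
```

This also shows the downside: every decoder step touches all source steps, so the cost grows with (source length × target length) rather than staying linear.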
- Transformer is a Seq2Seq model; it has an encoder and a decoder.
- Transformer model is not RNN.
- Transformer is based on attention and self-attention.
- BERT is for pre-training Transformer’s encoder.



