I recently found an NLP tutorial on Bilibili; this blog post will serve as a record of the key points from the course.
Long Short Term Memory (LSTM) Model
LSTM uses a “conveyor belt” to get longer memory than SimpleRNN.
Each of the following blocks has a parameter matrix:
- Forget gate
- Input gate
- New values
- Output values
Number of parameters: 4 × shape(h) × [shape(h) + shape(x)]
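The parameter count above can be checked with a few lines of plain Python. Note one assumption: frameworks such as Keras also add a bias vector of size shape(h) to each of the four blocks, which the slide formula omits.

```python
def lstm_param_count(h, x, bias=False):
    """Parameters of one LSTM layer: each of the 4 blocks (forget gate,
    input gate, new values, output gate) holds an h x (h + x) weight
    matrix; with bias=True, an extra bias of size h per block is added,
    matching what frameworks like Keras report."""
    per_block = h * (h + x) + (h if bias else 0)
    return 4 * per_block

# h = 32 hidden units, x = 100-dimensional input vectors
print(lstm_param_count(32, 100))             # 16896, the formula above
print(lstm_param_count(32, 100, bias=True))  # 17024, with bias terms
```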
- SimpleRNN and LSTM are two kinds of RNNs; always use LSTM instead of SimpleRNN.
- Use Bi-RNN instead of RNN whenever possible.
- Stacked RNN may be better than a single RNN layer (if n is big).
- Pretrain the embedding layer (if n is small).
- Partition text to (segment, next_char) pairs.
- One-hot encode the characters.
- Character → v × 1 vector.
- Segment → l × v matrix.
- Build and train a neural network
- l × v matrix ⇒ LSTM ⇒ Dense ⇒ v × 1 vector
- Propose a seed segment
- Repeat the following:
a) Feed the segment (with one-hot) to the neural network.
b) The neural network outputs probabilities.
c) next_char ← sample from the probabilities.
d) Append next_char to the segment.
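The generation loop above can be sketched in numpy. The trained LSTM+Dense network is replaced here by a stand-in function that just returns a softmax over the vocabulary, so only the one-hot encoding and the sampling step are real:

```python
import numpy as np

chars = sorted(set("hello world"))           # toy character vocabulary
char_to_idx = {c: i for i, c in enumerate(chars)}
v = len(chars)

def one_hot_segment(segment):
    """Encode a segment of length l as an l x v one-hot matrix."""
    m = np.zeros((len(segment), v))
    for i, c in enumerate(segment):
        m[i, char_to_idx[c]] = 1.0
    return m

def fake_model(x):
    """Stand-in for the trained network: any function of the input
    pushed through a softmax to get probabilities over the v chars."""
    logits = x.sum(axis=0)
    e = np.exp(logits - logits.max())
    return e / e.sum()

segment = "hello"                            # seed segment
for _ in range(5):                           # repeat steps a) to d)
    probs = fake_model(one_hot_segment(segment))       # a), b)
    next_char = np.random.choice(chars, p=probs)       # c) sample, not argmax
    segment += next_char                               # d)

print(segment)
```

Sampling from the probabilities (rather than always taking the argmax) is what keeps the generated text from looping on the single most likely character.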
- Bi-LSTM instead of LSTM (Encoder only!!!)
- Encoder’s final states (h_t and c_t) carry all the information of the English sentence.
- If the sentence is long, the final states have forgotten early inputs.
- Bi-LSTM (left-to-right and right-to-left) has longer memory.
- Use Bi-LSTM in the encoder; use unidirectional LSTM in the decoder.
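The "longer memory" point can be seen from the states alone. Below is a toy numpy sketch: a SimpleRNN-style tanh update stands in for each LSTM direction (an assumption for brevity), and the encoder hands the decoder the concatenation of the two final states, so words from the start of the sentence survive in the right-to-left state:

```python
import numpy as np

h, l = 4, 7                          # hidden size, sentence length
rng = np.random.default_rng(0)
xs = rng.standard_normal((l, 3))     # toy input sequence of 3-dim vectors

def run_rnn(inputs):
    """Toy recurrent pass (tanh update) standing in for one LSTM
    direction; only the final state matters for the encoder output."""
    W = rng.standard_normal((h, h)) * 0.1
    U = rng.standard_normal((h, inputs.shape[1])) * 0.1
    state = np.zeros(h)
    for x in inputs:
        state = np.tanh(W @ state + U @ x)
    return state

h_fwd = run_rnn(xs)                  # left-to-right pass: remembers the end
h_bwd = run_rnn(xs[::-1])            # right-to-left pass: remembers the start

# Bi-LSTM encoder output: both directions' final states, concatenated.
encoder_state = np.concatenate([h_fwd, h_bwd])
print(encoder_state.shape)           # (2*h,)
```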
- Word-Level Tokenization
- Word-Level tokenization instead of char-level.
- The average length of English words is 4.5 letters.
- The sequences will be 4.5x shorter.
- Shorter sequence -> less likely to forget.
- But you will need a large dataset:
- # of (frequently used) chars is ~10^2 → one-hot suffices.
- # of (frequently used) words is ~10^4 → must use embedding.
- Embedding layer has many parameters → overfitting!
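A quick back-of-the-envelope calculation makes the overfitting risk concrete. The embedding dimension of 300 below is an assumed value, not from the notes:

```python
# Parameter budget behind the word-level vs char-level trade-off.
char_vocab = 10**2       # ~frequently used characters: one-hot is fine
word_vocab = 10**4       # ~frequently used words: needs an embedding layer
embedding_dim = 300      # assumed embedding size for illustration

one_hot_params = 0                            # one-hot has no trainable weights
embedding_params = word_vocab * embedding_dim # one row of weights per word

print(embedding_params)  # 3000000 trainable weights before the LSTM even starts
```

Millions of embedding weights trained on a small dataset will overfit, which is why the notes recommend pretraining the embedding layer when the dataset is small.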
- Multi-Task Learning (this way there is only one shared encoder while the training data is effectively doubled, so the encoder can be trained better)
A drawback of the Seq2Seq model is that it cannot retain the complete information of a very long sentence: individual words may be forgotten, and since the decoder has no access to the full sentence, it cannot produce a correct translation.
For this reason, researchers introduced the Attention mechanism. Its characteristics are as follows:
- Attention tremendously improves Seq2Seq model.
- With attention, Seq2Seq model does not forget source input.
- With attention, the decoder knows where to focus.
- Downside: much more computation
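The "knows where to focus" point can be sketched in a few lines of numpy. Plain dot-product scores are an assumption here (the course may use a learned alignment function instead), but the mechanics are the same: one score per source position, a softmax, and a weighted sum over all encoder states.

```python
import numpy as np

def attention(decoder_state, encoder_states):
    """Minimal dot-product attention sketch: score every encoder
    state against the current decoder state, softmax the scores,
    and return the weights plus the weighted-sum context vector."""
    scores = encoder_states @ decoder_state   # one score per source step
    e = np.exp(scores - scores.max())
    weights = e / e.sum()                     # softmax: where to focus
    context = weights @ encoder_states        # weighted sum of all h_i
    return weights, context

encoder_states = np.random.randn(6, 8)  # 6 source steps, hidden size 8
decoder_state = np.random.randn(8)
weights, context = attention(decoder_state, encoder_states)
print(weights.sum())                    # 1.0: every source step is reachable
```

This also shows the downside: every decoder step touches all source steps, so the cost grows with (source length × target length) rather than staying linear.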
- Transformer is a Seq2Seq model; it has an encoder and a decoder.
- Transformer model is not RNN.
- Transformer is based on attention and self-attention.
- BERT is for pre-training Transformer’s encoder.



