2021SC@SDUSC
Model overview: to improve the platform's extensibility and breadth, we added more traditional machine-learning classification models, several Transformer variants, and some cutting-edge graph neural network (GNN) extensions.
This post focuses on interpreting the Transformer variants.
| NLP | traditional | GNN |
|---|---|---|
| TransformerEncoder | DNN | TextGNN |
| ReformerEncoder | RNN | GCN |
| PerformerEncoder | LSTM | GAN |
| LinformerEncoder | BiLSTM | |
| RoutingTransformerEncoder | LSTMAttention | |
| DNA bert | GRU | |
| Prot bert | TextCNN | |
| | TextRCNN | |
| | VDCNN | |
| | RNN_CNN | |
This paper is from ICLR 2020 (https://openreview.net/pdf?id=rkgNKkHtvB). Targeting the problem of training Transformers on long sequences, Reformer offers an extremely memory-efficient scheme.
Reformer introduces four main innovations:
- Axial positional embeddings to shrink the positional-encoding matrix
- A self-attention mechanism based on Locality Sensitive Hashing (LSH)
- Chunked computation of the feed-forward (FFN) layers: only part of the sequence is processed at a time, so the whole sequence never has to be held in memory
- Reversible residual connections in place of standard residuals, so the intermediate activations of the earlier layers need not be stored during the forward pass; only the output of the last layer is kept
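To illustrate the LSH idea, here is a minimal sketch (our own simplification, not the library's implementation) of the angular-LSH bucketing used in the paper: vectors are projected onto random directions, and positions that land in the same bucket attend to each other.

```python
import torch

def lsh_buckets(x, n_buckets, seed=0):
    """Angular LSH: project each vector onto random directions and take
    the argmax over [R x, -R x] as its bucket id."""
    torch.manual_seed(seed)
    d = x.shape[-1]
    R = torch.randn(d, n_buckets // 2)   # one random rotation
    h = x @ R                            # (seq, n_buckets // 2)
    h = torch.cat([h, -h], dim=-1)       # (seq, n_buckets)
    return h.argmax(dim=-1)              # bucket id per position

x = torch.randn(8, 16)                   # 8 token vectors of dim 16
buckets = lsh_buckets(x, n_buckets=4)
print(buckets.shape)  # torch.Size([8])
```

Because the hash depends only on direction, scaling a vector does not change its bucket; similar vectors tend to collide, so restricting attention to same-bucket pairs approximates full attention at much lower cost.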
Here we implement it directly with the reformer-pytorch library.
Installation: `pip install reformer_pytorch`
Example: a simple way to assemble the model architecture:
```python
import torch
from reformer_pytorch import ReformerLM

model = ReformerLM(
    num_tokens = 20000,
    dim = 1024,
    depth = 12,
    max_seq_len = 8192,
    heads = 8,
    lsh_dropout = 0.1,
    ff_dropout = 0.1,
    post_attn_dropout = 0.1,
    layer_dropout = 0.1,          # layer dropout from 'Reducing Transformer Depth on Demand' paper
    causal = True,                # auto-regressive or not
    bucket_size = 64,             # average size of qk per bucket, 64 was recommended in paper
    n_hashes = 4,                 # 4 is permissible per author, 8 is the best but slower
    emb_dim = 128,                # embedding factorization for further memory savings
    dim_head = 64,                # fix the dimension of each head, making it independent of the embedding dimension and the number of heads
    ff_chunks = 200,              # number of chunks for feedforward layer, make higher if there are memory issues
    attn_chunks = 8,              # process lsh attention in chunks, only way for memory to fit when scaling to 16k tokens
    num_mem_kv = 128,             # persistent learned memory key values, from all-attention paper
    full_attn_thres = 1024,       # use full attention if context length is less than set value
    reverse_thres = 1024,         # turn off reversibility for 2x speed for sequence lengths shorter or equal to the designated value
    use_scale_norm = False,       # use scale norm from 'Transformers without tears' paper
    use_rezero = False,           # remove normalization and use rezero from 'ReZero is All You Need'
    one_value_head = False,       # use one set of values for all heads from 'One Write-Head Is All You Need'
    weight_tie = False,           # tie parameters of each layer for no memory per additional depth
    weight_tie_embedding = False, # use token embedding for projection of output, some papers report better results
    n_local_attn_heads = 2,       # many papers suggest mixing local attention heads aids specialization and improves on certain tasks
    pkm_layers = (4, 7),          # specify layers to use product key memory; paper shows 1 or 2 modules near the middle of the transformer is best
    pkm_num_keys = 128,           # defaults to 128, but can be increased to 256 or 512 as memory allows
    use_full_attn = False         # only turn on this flag to override and use full attention for all sequence lengths, for comparison against LSH attention
).cuda()

x = torch.randint(0, 20000, (1, 8192)).long().cuda()
y = model(x)  # (1, 8192, 20000)
```
Since our needs are not that extensive, after weighing the data and other factors we selected only some of these parameters:
`bucket_size = 1, max_seq_len = config.max_len`
These two parameters deserve special explanation. Because `bucket_size` controls how the sequence is chunked, the constraint `max_seq_len % (2 * bucket_size) == 0` must be satisfied. When loading the dataset, however, we can only pad every sequence to the length of the longest one, and since user input on the platform is inconsistent, an odd length can occur, in which case the sequence may have to be padded out.
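As an illustration (the helper name is our own, not part of reformer-pytorch), a sequence length can be rounded up so that it divides evenly by `2 * bucket_size`:

```python
def pad_to_multiple(seq_len, bucket_size):
    """Round seq_len up to the nearest multiple of 2 * bucket_size,
    the divisibility constraint ReformerLM places on max_seq_len."""
    multiple = 2 * bucket_size
    remainder = seq_len % multiple
    return seq_len if remainder == 0 else seq_len + multiple - remainder

print(pad_to_multiple(8192, 64))  # 8192, already divisible
print(pad_to_multiple(101, 1))    # 102, an odd length is padded by one
```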
Forward pass: the forward pass simply invokes the Reformer model defined during initialization together with the linear classification head.
Moving the input `x` onto CUDA completes the preparation:
```python
def forward(self, x):
    x = x.cuda()
    padding_mask = get_attn_pad_mask(x)
    # x = self.embedding(x)
    # representation = self.reformer_encoder(x)[:, 0, :].squeeze(1)
    representation = self.reformer_encoder(x, input_mask=padding_mask)[:, 0, :].squeeze(1)
    output = self.classification(representation)
    return output, representation
```
Performer
This paper is from ICLR 2021 and is both forward-looking and innovative.
Performer is a Transformer architecture whose attention mechanism scales linearly: it lets the model train faster and also handle longer input sequences, which is very appealing for image datasets such as ImageNet64 and text datasets such as PG-19. Performer uses an efficient (linear) generalized attention framework in which different similarity measures (i.e., different kernels) realize different attention mechanisms. The framework is implemented by the FAVOR+ algorithm (Fast Attention Via positive Orthogonal Random features), which provides a scalable, low-variance, unbiased estimate of attention expressed through a random-feature-map decomposition. The method guarantees linear space and time complexity on one hand and preserves accuracy on the other. It can also be applied to the softmax operation alone, and can be combined with other techniques such as reversible layers.
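The linear-attention idea can be sketched as follows (our own simplification: a plain ReLU feature map stands in for FAVOR+'s positive orthogonal random features; the names and shapes are assumptions, not the library's code). With a feature map phi, attention becomes `phi(Q) (phi(K)^T V)` divided by a per-query normalizer, costing O(N·d²) instead of O(N²·d):

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized (linear) attention: summarize keys/values once in a
    (d, d) matrix, then read it out per query. phi = ReLU here, a simple
    stand-in for FAVOR+'s positive random features."""
    q, k = torch.relu(q), torch.relu(k)
    kv = torch.einsum("nd,ne->de", k, v)   # (d, d): sum_j phi(k_j) v_j^T
    z = 1.0 / (q @ k.sum(dim=0) + eps)     # (n,): normalizer per query
    return (q @ kv) * z.unsqueeze(-1)      # (n, d)

q = torch.randn(128, 16)
k = torch.randn(128, 16)
v = torch.randn(128, 16)
out = linear_attention(q, k, v)
print(out.shape)  # torch.Size([128, 16])
```

The key point is the order of multiplication: `phi(K)^T V` is computed first, so the N x N attention matrix is never materialized.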
Here we likewise wrap and apply the corresponding package:

```python
self.performer_encoder = PerformerLM(
    num_tokens=self.emb_dim,
    dim=self.emb_dim,
    heads=8,
    depth=1,
    max_seq_len=config.max_len,
    reversible=True,
    local_attn_heads=4,               # 4 heads are local attention, 4 others are global performers
    local_window_size=config.max_len, # window size of local attention
    # return_embeddings=True
)
```
Because the transformer requires fixed-length input while our inputs are variable-length, we need to mark the positions of the real tokens and mask out the rest: in the mask, 0 denotes padding and nonzero denotes the actual input sequence.
```python
def get_attn_pad_mask(input_ids):
    pad_attn_mask_expand = torch.zeros_like(input_ids)
    batch_size, seq_len = input_ids.size()
    for i in range(batch_size):
        for j in range(seq_len):
            if input_ids[i][j] != 0:
                pad_attn_mask_expand[i][j] = 1
    return pad_attn_mask_expand
```
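The double loop above is equivalent to a one-line vectorized comparison, which is how it would normally be written in PyTorch (the function name here is our own):

```python
import torch

def get_attn_pad_mask_vectorized(input_ids):
    # 1 where the token is real, 0 where it is padding (token id 0)
    return (input_ids != 0).long()

ids = torch.tensor([[5, 3, 0, 0], [7, 0, 0, 0]])
print(get_attn_pad_mask_vectorized(ids))
# tensor([[1, 1, 0, 0],
#         [1, 0, 0, 0]])
```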
We then use this function to build the mask and plug it into the forward pass:

```python
representation = self.performer_encoder(x, mask=padding_mask)[:, 0, :].squeeze(1)
```