2021SC@SDUSC
Model overview: to improve the platform's extensibility and breadth, we added more traditional machine-learning classification models, several Transformer variants, and some cutting-edge graph neural network (GNN) extensions.
This post focuses on interpreting the Transformer variants.
| NLP | traditional | GNN |
|---|---|---|
| TransformerEncoder | DNN | TextGNN |
| ReformerEncoder | RNN | GCN |
| PerformerEncoder | LSTM | GAN |
| LinformerEncoder | BiLSTM | |
| RoutingTransformerEncoder | LSTMAttention | |
| DNA bert | GRU | |
| Prot bert | TextCNN | |
| | TextRCNN | |
| | VDCNN | |
| | RNN_CNN | |
This paper is from ICLR 2020 (https://openreview.net/pdf?id=rkgNKkHtvB). Targeting the problem of training Transformers on long sequences, Reformer offers an extremely memory-efficient scheme.
Reformer introduces four main innovations:
- Axial positional embeddings to shrink the positional-encoding matrix
- A self-attention mechanism based on Locality Sensitive Hashing (LSH)
- Chunked computation of the feed-forward (FFN) layers: only part of the sequence is processed at a time, so the whole sequence never has to be held in memory
- Reversible residual connections in place of standard residuals, so the intermediate activations of the earlier layers need not be stored during the forward pass; only the output of the last layer is kept
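To illustrate the LSH idea, here is a minimal sketch (our own simplification, not the library's implementation) of the angular-LSH bucketing used in the paper: vectors are projected onto random directions, and positions that land in the same bucket attend to each other.

```python
import torch

def lsh_buckets(x, n_buckets, seed=0):
    """Angular LSH: project each vector onto random directions and take
    the argmax over [R x, -R x] as its bucket id."""
    torch.manual_seed(seed)
    d = x.shape[-1]
    R = torch.randn(d, n_buckets // 2)   # one random rotation
    h = x @ R                            # (seq, n_buckets // 2)
    h = torch.cat([h, -h], dim=-1)       # (seq, n_buckets)
    return h.argmax(dim=-1)              # bucket id per position

x = torch.randn(8, 16)                   # 8 token vectors of dim 16
buckets = lsh_buckets(x, n_buckets=4)
print(buckets.shape)  # torch.Size([8])
```

Because the hash depends only on direction, scaling a vector does not change its bucket; similar vectors tend to collide, so restricting attention to same-bucket pairs approximates full attention at much lower cost.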
Here we implement it directly with the reformer-pytorch library.
Installation: `pip install reformer_pytorch`
Example: a simple way to assemble the model architecture:
```python
import torch
from reformer_pytorch import ReformerLM

model = ReformerLM(
    num_tokens = 20000,
    dim = 1024,
    depth = 12,
    max_seq_len = 8192,
    heads = 8,
    lsh_dropout = 0.1,
    ff_dropout = 0.1,
    post_attn_dropout = 0.1,
    layer_dropout = 0.1,          # layer dropout from 'Reducing Transformer Depth on Demand' paper
    causal = True,                # auto-regressive or not
    bucket_size = 64,             # average size of qk per bucket, 64 was recommended in paper
    n_hashes = 4,                 # 4 is permissible per author, 8 is the best but slower
    emb_dim = 128,                # embedding factorization for further memory savings
    dim_head = 64,                # fix the dimension of each head, making it independent of the embedding dimension and the number of heads
    ff_chunks = 200,              # number of chunks for feedforward layer, make higher if there are memory issues
    attn_chunks = 8,              # process lsh attention in chunks, only way for memory to fit when scaling to 16k tokens
    num_mem_kv = 128,             # persistent learned memory key values, from all-attention paper
    full_attn_thres = 1024,       # use full attention if context length is less than set value
    reverse_thres = 1024,         # turn off reversibility for 2x speed for sequence lengths shorter or equal to the designated value
    use_scale_norm = False,       # use scale norm from 'Transformers without tears' paper
    use_rezero = False,           # remove normalization and use rezero from 'ReZero is All You Need'
    one_value_head = False,       # use one set of values for all heads from 'One Write-Head Is All You Need'
    weight_tie = False,           # tie parameters of each layer for no memory per additional depth
    weight_tie_embedding = False, # use token embedding for projection of output, some papers report better results
    n_local_attn_heads = 2,       # many papers suggest mixing local attention heads aids specialization and improves on certain tasks
    pkm_layers = (4, 7),          # specify layers to use product key memory; paper shows 1 or 2 modules near the middle of the transformer is best
    pkm_num_keys = 128,           # defaults to 128, but can be increased to 256 or 512 as memory allows
    use_full_attn = False         # only turn on this flag to override and use full attention for all sequence lengths, for comparison against LSH attention
).cuda()

x = torch.randint(0, 20000, (1, 8192)).long().cuda()
y = model(x)  # (1, 8192, 20000)
```
Since our needs are not that extensive, after weighing the data and other factors we selected only some of these parameters:
`bucket_size = 1, max_seq_len = config.max_len`
These two parameters deserve special explanation. Because `bucket_size` controls how the sequence is chunked, the constraint `max_seq_len % (2 * bucket_size) == 0` must be satisfied. When loading the dataset, however, we can only pad every sequence to the length of the longest one, and since user input on the platform is inconsistent, an odd length can occur, in which case the sequence may have to be padded out.
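As an illustration (the helper name is our own, not part of reformer-pytorch), a sequence length can be rounded up so that it divides evenly by `2 * bucket_size`:

```python
def pad_to_multiple(seq_len, bucket_size):
    """Round seq_len up to the nearest multiple of 2 * bucket_size,
    the divisibility constraint ReformerLM places on max_seq_len."""
    multiple = 2 * bucket_size
    remainder = seq_len % multiple
    return seq_len if remainder == 0 else seq_len + multiple - remainder

print(pad_to_multiple(8192, 64))  # 8192, already divisible
print(pad_to_multiple(101, 1))    # 102, an odd length is padded by one
```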
Forward pass: the forward pass simply invokes the Reformer model defined during initialization together with the linear classification head.
Moving the input `x` onto CUDA completes the preparation:
```python
def forward(self, x):
    x = x.cuda()
    padding_mask = get_attn_pad_mask(x)
    # x = self.embedding(x)
    # representation = self.reformer_encoder(x)[:, 0, :].squeeze(1)
    representation = self.reformer_encoder(x, input_mask=padding_mask)[:, 0, :].squeeze(1)
    output = self.classification(representation)
    return output, representation
```
Performer
This paper is from ICLR 2021 and is both forward-looking and innovative.
Performer is a Transformer architecture whose attention mechanism scales linearly: it lets the model train faster and also handle longer input sequences, which is very appealing for image datasets such as ImageNet64 and text datasets such as PG-19. Performer uses an efficient (linear) generalized attention framework in which different similarity measures (i.e., different kernels) realize different attention mechanisms. The framework is implemented by the FAVOR+ algorithm (Fast Attention Via positive Orthogonal Random features), which provides a scalable, low-variance, unbiased estimate of attention expressed through a random-feature-map decomposition. The method guarantees linear space and time complexity on one hand and preserves accuracy on the other. It can also be applied to the softmax operation alone, and can be combined with other techniques such as reversible layers.
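The linear-attention idea can be sketched as follows (our own simplification: a plain ReLU feature map stands in for FAVOR+'s positive orthogonal random features; the names and shapes are assumptions, not the library's code). With a feature map phi, attention becomes `phi(Q) (phi(K)^T V)` divided by a per-query normalizer, costing O(N·d²) instead of O(N²·d):

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized (linear) attention: summarize keys/values once in a
    (d, d) matrix, then read it out per query. phi = ReLU here, a simple
    stand-in for FAVOR+'s positive random features."""
    q, k = torch.relu(q), torch.relu(k)
    kv = torch.einsum("nd,ne->de", k, v)   # (d, d): sum_j phi(k_j) v_j^T
    z = 1.0 / (q @ k.sum(dim=0) + eps)     # (n,): normalizer per query
    return (q @ kv) * z.unsqueeze(-1)      # (n, d)

q = torch.randn(128, 16)
k = torch.randn(128, 16)
v = torch.randn(128, 16)
out = linear_attention(q, k, v)
print(out.shape)  # torch.Size([128, 16])
```

The key point is the order of multiplication: `phi(K)^T V` is computed first, so the N x N attention matrix is never materialized.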
Here we likewise wrap and apply the corresponding package:

```python
self.performer_encoder = PerformerLM(
    num_tokens=self.emb_dim,
    dim=self.emb_dim,
    heads=8,
    depth=1,
    max_seq_len=config.max_len,
    reversible=True,
    local_attn_heads=4,               # 4 heads are local attention, 4 others are global performers
    local_window_size=config.max_len, # window size of local attention
    # return_embeddings=True
)
```
Because the transformer requires fixed-length input while our inputs are variable-length, we need to mark the positions of the real tokens and mask out the rest: in the mask, 0 denotes padding and nonzero denotes the actual input sequence.
```python
def get_attn_pad_mask(input_ids):
    pad_attn_mask_expand = torch.zeros_like(input_ids)
    batch_size, seq_len = input_ids.size()
    for i in range(batch_size):
        for j in range(seq_len):
            if input_ids[i][j] != 0:
                pad_attn_mask_expand[i][j] = 1
    return pad_attn_mask_expand
```
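The double loop above is equivalent to a one-line vectorized comparison, which is how it would normally be written in PyTorch (the function name here is our own):

```python
import torch

def get_attn_pad_mask_vectorized(input_ids):
    # 1 where the token is real, 0 where it is padding (token id 0)
    return (input_ids != 0).long()

ids = torch.tensor([[5, 3, 0, 0], [7, 0, 0, 0]])
print(get_attn_pad_mask_vectorized(ids))
# tensor([[1, 1, 0, 0],
#         [1, 0, 0, 0]])
```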
We then use this function to build the mask and plug it into the forward pass:

```python
representation = self.performer_encoder(x, mask=padding_mask)[:, 0, :].squeeze(1)
```