
[Error Fix] ValueError: batch length of `text`: xx does not match batch length of `text_pair`: xx



Example input and output that trigger the error

Sample code:

from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

special_tokens_dict = {'cls_token': '<CLS>'}  # token string assumed; unrelated to the error
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)

text = ["this is the first sentences", "this is the second sentece, ", "this one is the third sentence"]
tp = ['first sentence']  # batch length 1, while `text` has batch length 3
output = tokenizer(text, tp)  # raises ValueError
print(output)

The error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
 in ()
      8 text = ["this is the first sentences", "this is the second sentece, ", "this one is the third sentence"]
      9 tp = ['first sentence']
---> 10 output = tokenizer(text,tp)
     11 print(output)

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in __call__(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2377             if text_pair is not None and len(text) != len(text_pair):
   2378                 raise ValueError(
-> 2379                     f"batch length of `text`: {len(text)} does not match batch length of `text_pair`: {len(text_pair)}."
   2380                 )
   2381             batch_text_or_text_pairs = list(zip(text, text_pair)) if text_pair is not None else text

ValueError: batch length of `text`: 3 does not match batch length of `text_pair`: 1.

In the sample code, note that `text_pair` (`tp`) has length 1 while `text` has length 3. That mismatch is what triggers the error.
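To make the mismatch concrete, checking the two batch lengths before calling the tokenizer shows the problem immediately (a minimal sketch using the same `text` and `tp` values as above; no model download needed):

```python
text = ["this is the first sentences", "this is the second sentece, ", "this one is the third sentence"]
tp = ['first sentence']

# The tokenizer treats a list/tuple `text` as a batch, so it expects
# exactly one `text_pair` entry per `text` entry.
print(len(text), len(tp))  # 3 vs 1: the tokenizer will refuse this batch
ok = len(text) == len(tp)
print("lengths match:", ok)
```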

Error analysis

Exception class: ValueError

Raise code

        if is_split_into_words:
            is_batched = isinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))
        else:
            is_batched = isinstance(text, (list, tuple))

        if is_batched:
            if isinstance(text_pair, str):
                raise TypeError(
                    "when tokenizing batches of text, `text_pair` must be a list or tuple with the same length as `text`."
                )
            if text_pair is not None and len(text) != len(text_pair):
                raise ValueError(
                    f"batch length of `text`: {len(text)} does not match batch length of `text_pair`: {len(text_pair)}."
                )
            batch_text_or_text_pairs = list(zip(text, text_pair)) if text_pair is not None else text
            return self.batch_encode_plus(


Analysis:
The error is raised in the main `__call__` path of the transformers `PreTrainedTokenizerBase` class (in tokenization_utils_base.py). This method tokenizes one or more sequences, or one or more pairs of sequences, and prepares them for the model. When `text` is given as a batch (a list or tuple), `text_pair` must be a list or tuple with the same length as `text`.
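The batching check can be reproduced in isolation. The sketch below is a simplified stand-in for the logic in the raise code above, not the library implementation itself:

```python
def pair_up(text, text_pair=None):
    """Simplified mimic of the batching check in PreTrainedTokenizerBase.__call__."""
    is_batched = isinstance(text, (list, tuple))
    if is_batched:
        if isinstance(text_pair, str):
            raise TypeError(
                "when tokenizing batches of text, `text_pair` must be a list or tuple "
                "with the same length as `text`."
            )
        if text_pair is not None and len(text) != len(text_pair):
            raise ValueError(
                f"batch length of `text`: {len(text)} does not match "
                f"batch length of `text_pair`: {len(text_pair)}."
            )
        # Equal-length batches are zipped into (sentence, pair) tuples
        return list(zip(text, text_pair)) if text_pair is not None else list(text)
    return (text, text_pair) if text_pair is not None else text

print(pair_up(["a", "b"], ["x", "y"]))  # equal lengths pair up 1:1
try:
    pair_up(["a", "b", "c"], ["x"])     # mismatched batch lengths
except ValueError as e:
    print(e)
```

Running it shows that equal-length batches zip cleanly, while a length mismatch reproduces the same ValueError message as the library.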

Solution

Make sure `text_pair` has the same length as `text` whenever `text` is a batch of sequences.
The corrected code:

from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

special_tokens_dict = {'cls_token': '<CLS>'}  # token string assumed; unrelated to the error
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)

text = ["this is the first sentences", "this is the second sentece, ", "this one is the third sentence"]
tp = ['first sentence', "second", "third"]  # now the same batch length as `text`
output = tokenizer(text, tp)
print(output)

Output:

{'input_ids': [[5661, 318, 262, 717, 13439, 11085, 6827], [5661, 318, 262, 1218, 1908, 68, 344, 11, 220, 12227], [5661, 530, 318, 262, 2368, 6827, 17089]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]]}
Reprinted from www.mshxw.com; original article: https://www.mshxw.com/it/715546.html