
[Error Fix] ValueError: batch length of `text`: xx does not match batch length of `text_pair`: xx



Example input and output that trigger the error

Sample code:

from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

special_tokens_dict = {'cls_token': '<CLS>'}  # token string assumed; unrelated to the error
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)

text = ["this is the first sentences", "this is the second sentece, ", "this one is the third sentence"]
tp = ['first sentence']  # batch length 1, while `text` has batch length 3
output = tokenizer(text, tp)  # raises ValueError
print(output)

The error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
 in ()
      8 text = ["this is the first sentences", "this is the second sentece, ", "this one is the third sentence"]
      9 tp = ['first sentence']
---> 10 output = tokenizer(text,tp)
     11 print(output)

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in __call__(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2377             if text_pair is not None and len(text) != len(text_pair):
   2378                 raise ValueError(
-> 2379                     f"batch length of `text`: {len(text)} does not match batch length of `text_pair`: {len(text_pair)}."
   2380                 )
   2381             batch_text_or_text_pairs = list(zip(text, text_pair)) if text_pair is not None else text

ValueError: batch length of `text`: 3 does not match batch length of `text_pair`: 1.

In the sample code, note that `text_pair` (`tp`) has length 1 while `text` has length 3. That mismatch is what triggers the error.
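To make the mismatch concrete, checking the two batch lengths before calling the tokenizer shows the problem immediately (a minimal sketch using the same `text` and `tp` values as above; no model download needed):

```python
text = ["this is the first sentences", "this is the second sentece, ", "this one is the third sentence"]
tp = ['first sentence']

# The tokenizer treats a list/tuple `text` as a batch, so it expects
# exactly one `text_pair` entry per `text` entry.
print(len(text), len(tp))  # 3 vs 1: the tokenizer will refuse this batch
ok = len(text) == len(tp)
print("lengths match:", ok)
```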

Error analysis

Exception class: ValueError

Raise code

        if is_split_into_words:
            is_batched = isinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))
        else:
            is_batched = isinstance(text, (list, tuple))

        if is_batched:
            if isinstance(text_pair, str):
                raise TypeError(
                    "when tokenizing batches of text, `text_pair` must be a list or tuple with the same length as `text`."
                )
            if text_pair is not None and len(text) != len(text_pair):
                raise ValueError(
                    f"batch length of `text`: {len(text)} does not match batch length of `text_pair`: {len(text_pair)}."
                )
            batch_text_or_text_pairs = list(zip(text, text_pair)) if text_pair is not None else text
            return self.batch_encode_plus(


Analysis:
The error is raised in the main `__call__` path of the transformers `PreTrainedTokenizerBase` class (in tokenization_utils_base.py). This method tokenizes one or more sequences, or one or more pairs of sequences, and prepares them for the model. When `text` is given as a batch (a list or tuple), `text_pair` must be a list or tuple with the same length as `text`.
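The batching check can be reproduced in isolation. The sketch below is a simplified stand-in for the logic in the raise code above, not the library implementation itself:

```python
def pair_up(text, text_pair=None):
    """Simplified mimic of the batching check in PreTrainedTokenizerBase.__call__."""
    is_batched = isinstance(text, (list, tuple))
    if is_batched:
        if isinstance(text_pair, str):
            raise TypeError(
                "when tokenizing batches of text, `text_pair` must be a list or tuple "
                "with the same length as `text`."
            )
        if text_pair is not None and len(text) != len(text_pair):
            raise ValueError(
                f"batch length of `text`: {len(text)} does not match "
                f"batch length of `text_pair`: {len(text_pair)}."
            )
        # Equal-length batches are zipped into (sentence, pair) tuples
        return list(zip(text, text_pair)) if text_pair is not None else list(text)
    return (text, text_pair) if text_pair is not None else text

print(pair_up(["a", "b"], ["x", "y"]))  # equal lengths pair up 1:1
try:
    pair_up(["a", "b", "c"], ["x"])     # mismatched batch lengths
except ValueError as e:
    print(e)
```

Running it shows that equal-length batches zip cleanly, while a length mismatch reproduces the same ValueError message as the library.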

Solution

Make sure `text_pair` has the same length as `text` whenever `text` is a batch of sequences.
The corrected code:

from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

special_tokens_dict = {'cls_token': '<CLS>'}  # token string assumed; unrelated to the error
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)

text = ["this is the first sentences", "this is the second sentece, ", "this one is the third sentence"]
tp = ['first sentence', "second", "third"]  # now the same batch length as `text`
output = tokenizer(text, tp)
print(output)

Output:

{'input_ids': [[5661, 318, 262, 717, 13439, 11085, 6827], [5661, 318, 262, 1218, 1908, 68, 344, 11, 220, 12227], [5661, 530, 318, 262, 2368, 6827, 17089]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]]}
Reprinted from www.mshxw.com; original article: https://www.mshxw.com/it/715546.html