Deep Learning, Day 18

Text Tokenization

Preparing the Dictionary and Stop Words

1 Prepare the dictionary

# In config.py:
user_dict_path = "C:/Users/dajian/PycharmProjects/pythonProject9/7.chat_service/corpus/user_dict/keywords.txt"
# In the tokenizer module (after `import jieba` and `import config`):
jieba.load_userdict(config.user_dict_path)
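The contents of keywords.txt are not shown here. jieba's user dictionary is a UTF-8 text file with one entry per line, written as the word followed by an optional frequency and an optional part-of-speech tag, separated by spaces. The sketch below writes a small sample file and reads it back to make that layout concrete. The sample entries and both helper functions are hypothetical illustrations, not the course's data and not jieba's own parser.

```python
import os
import tempfile

# Hypothetical sample entries in the "word [freq] [pos_tag]" layout.
SAMPLE_ENTRIES = ["人工智能 10 n", "深度学习 5", "分词"]


def write_user_dict(path, entries):
    """Write one entry per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(entries) + "\n")


def parse_user_dict(path):
    """Return (word, freq, tag) tuples; missing fields become None."""
    parsed = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            word = parts[0]
            has_freq = len(parts) > 1 and parts[1].isdigit()
            freq = int(parts[1]) if has_freq else None
            has_tag = len(parts) > 1 and not parts[-1].isdigit()
            tag = parts[-1] if has_tag else None
            parsed.append((word, freq, tag))
    return parsed


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "keywords.txt")
    write_user_dict(demo_path, SAMPLE_ENTRIES)
    entries = parse_user_dict(demo_path)

print(entries)
```

A file in this layout is what `jieba.load_userdict` expects to receive as its path argument.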

2 Prepare the stop words

# In config.py:
stopwords_path = "C:/Users/dajian/PycharmProjects/pythonProject9/7.chat_service/corpus/user_dict/stopwords.txt"
# In lib/stopwords.py (after `import config`):
stopwords = [i.strip() for i in open(config.stopwords_path, encoding="UTF-8").readlines()]
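As a variation on the one-liner, the sketch below reads the stop words inside a `with` block so the file handle is closed, stores them in a set for fast membership tests, and skips blank lines. The sample file contents, the token list, and the helper names `load_stopwords` and `remove_stopwords` are made up for illustration; only the one-word-per-line layout is taken from the code above.

```python
import os
import tempfile


def load_stopwords(path):
    """Read one stop word per line into a set, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not stop words."""
    return [t for t in tokens if t not in stopwords]


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "stopwords.txt")
    # Hypothetical stop word file, including one blank line.
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("的\n了\n\n吗\n")
    demo_stopwords = load_stopwords(demo_path)

filtered = remove_stopwords(["今天", "的", "天气", "好", "吗"], demo_stopwords)
print(filtered)
```

A list works as well for a small file; the set mainly pays off when every token of every sentence is checked against a long stop word list.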

Preparing a method that splits a sentence into single characters

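def cut_sentence_by_word(sentence):
    """
    Tokenize mixed Chinese and English text.
    Chinese: split into single characters.
    English: split into whole words.
    """
    # "python和c++哪个难" -> [python, 和, c++, 哪, 个, 难]
    # (the "++" is reconstructed from a garbled example; it assumes `letters` also contains "+")
    result = []
    temp = ""
    for word in sentence:
        if word.lower() in letters:  # word is a letter: append it to temp
            temp += word
        else:
            if temp != "":  # word is not a letter and temp is non-empty: move temp into result
                result.append(temp.lower())
                temp = ""
            result.append(word.strip())  # word is not a letter: add it to result directly
    if temp != "":  # the sentence ended with letters: add the last group to result
        result.append(temp.lower())
    return result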
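For comparison, the same splitting rule can be sketched with a regular expression instead of a character loop. This is a swapped-in technique, not the method used in these notes. It assumes that "+" counts as part of an English word (so that "c++" stays one token), and the name `cut_by_char_regex` is hypothetical. Unlike the loop, it drops whitespace instead of emitting empty strings, and it lowercases every token rather than only the letter runs.

```python
import re

# A run of ASCII letters or "+" is one token; otherwise any single
# non-whitespace character (for example one Chinese character) is a token.
_TOKEN = re.compile(r"[A-Za-z+]+|\S")


def cut_by_char_regex(sentence):
    """Split Chinese into single characters and English into lowercased words."""
    return [token.lower() for token in _TOKEN.findall(sentence)]


print(cut_by_char_regex("python和c++哪个难"))
```

The loop version is easier to step through in a debugger; the regex version is shorter and keeps the definition of a "letter" in one pattern.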

Wrapping up the tokenization method

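import jieba
import jieba.posseg as psg
import config
import string
from lib.stopwords import stopwords

# Load the prepared user dictionary into jieba
jieba.load_userdict(config.user_dict_path)
# Characters treated as part of an English word
# (the `+ "+"` is reconstructed from the c++ example; the source line was garbled)
letters = string.ascii_lowercase + "+"


def cut_sentence_by_word(sentence):
    """
    Tokenize mixed Chinese and English text.
    Chinese: split into single characters.
    English: split into whole words.
    """
    # "python和c++哪个难" -> [python, 和, c++, 哪, 个, 难]
    result = []
    temp = ""
    for word in sentence:
        if word.lower() in letters:  # word is a letter: append it to temp
            temp += word
        else:
            if temp != "":  # word is not a letter and temp is non-empty: move temp into result
                result.append(temp.lower())
                temp = ""
            result.append(word.strip())  # word is not a letter: add it to result directly
    if temp != "":  # the sentence ended with letters: add the last group to result
        result.append(temp.lower())
    return result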
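The module above imports `jieba.posseg` and `stopwords` but never uses them, so the wrapper that ties everything together is not shown. One possible shape for it is sketched below. The function name `cut`, its parameters `by_word` and `stopwords`, and the lazy import are all assumptions made for illustration, not the author's code. `jieba.lcut` is jieba's list-returning tokenizer; the `"+"` in `LETTERS` is likewise assumed from the c++ example.

```python
import string

# Assumption: "+" counts as a word character so that "c++" stays one token.
LETTERS = string.ascii_lowercase + "+"


def cut_sentence_by_word(sentence):
    """Single Chinese characters, whole lowercased English words."""
    result, temp = [], ""
    for ch in sentence:
        if ch.lower() in LETTERS:
            temp += ch
        else:
            if temp:
                result.append(temp.lower())
                temp = ""
            result.append(ch.strip())
    if temp:
        result.append(temp.lower())
    return result


def cut(sentence, by_word=False, stopwords=frozenset()):
    """Hypothetical wrapper: pick a tokenizer, then drop empty strings and stop words."""
    if by_word:
        tokens = cut_sentence_by_word(sentence)
    else:
        import jieba  # imported lazily so the by_word path runs without jieba installed
        tokens = jieba.lcut(sentence)
    return [t for t in tokens if t and t not in stopwords]


print(cut("python和c++哪个难", by_word=True, stopwords={"和"}))
```

Passing the stop words in as an argument keeps the function easy to test; in the project layout above, the caller would pass the `stopwords` list imported from `lib.stopwords`.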
