Since you haven't really been able to show a clear expected output yet, here is my best shot:
    list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
    list_B = ['email', 'mail', 'address', 'netmail']
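For the two lists above, we will find the cosine similarity between each element of one list and the elements of the other, i.e. each element of list_B against every element of list_A. Each word is treated as a vector of its character counts: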
    from collections import Counter
    from math import sqrt

    def word2vec(word):
        # count the characters in the word
        cw = Counter(word)
        # precompute the set of distinct characters
        sw = set(cw)
        # precompute the "length" (Euclidean norm) of the word vector
        lw = sqrt(sum(c * c for c in cw.values()))
        # return a tuple
        return cw, sw, lw

    def cosdis(v1, v2):
        # which characters are common to the two words?
        common = v1[1].intersection(v2[1])
        # cosine similarity: dot product divided by both vector lengths
        return sum(v1[0][ch] * v2[0][ch] for ch in common) / v1[2] / v2[2]

    list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
    list_B = ['email', 'mail', 'address', 'netmail']
    threshold = 0.80  # if needed

    for key in list_A:
        for word in list_B:
            res = cosdis(word2vec(word), word2vec(key))
            print("The cosine similarity between : {} and : {} is: {}".format(word, key, res * 100))
            # if res > threshold:
            #     print("Found a word with cosine similarity > 80 : {} with original word: {}".format(word, key))

Output:
    The cosine similarity between : email and : email is: 100.0
    The cosine similarity between : mail and : email is: 89.44271909999159
    The cosine similarity between : address and : email is: 26.967994498529684
    The cosine similarity between : netmail and : email is: 84.51542547285166
    The cosine similarity between : email and : user is: 22.360679774997898
    The cosine similarity between : mail and : user is: 0.0
    The cosine similarity between : address and : user is: 60.30226891555272
    The cosine similarity between : netmail and : user is: 18.89822365046136
    The cosine similarity between : email and : this is: 22.360679774997898
    The cosine similarity between : mail and : this is: 25.0
    The cosine similarity between : address and : this is: 30.15113445777636
    The cosine similarity between : netmail and : this is: 37.79644730092272
    The cosine similarity between : email and : email is: 100.0
    The cosine similarity between : mail and : email is: 89.44271909999159
    The cosine similarity between : address and : email is: 26.967994498529684
    The cosine similarity between : netmail and : email is: 84.51542547285166
    The cosine similarity between : email and : address is: 26.967994498529684
    The cosine similarity between : mail and : address is: 15.07556722888818
    The cosine similarity between : address and : address is: 100.0
    The cosine similarity between : netmail and : address is: 22.79211529192759
    The cosine similarity between : email and : customer is: 31.62277660168379
    The cosine similarity between : mail and : customer is: 17.677669529663685
    The cosine similarity between : address and : customer is: 42.640143271122085
    The cosine similarity between : netmail and : customer is: 40.08918628686365
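To see where a figure such as 89.44 for mail and email comes from, it may help to work one pair through by hand. The snippet below is only an illustrative sketch of the same character-count idea; the variable names are mine and are not part of the code above.

```python
from collections import Counter
from math import sqrt

# Character counts act as the word vectors.
mail = Counter("mail")    # m, a, i, l each appear once
email = Counter("email")  # e, m, a, i, l each appear once

# Dot product over the shared characters m, a, i, l: 1*1 added four times.
dot = sum(mail[ch] * email[ch] for ch in set(mail) & set(email))

# Vector lengths: sqrt(4) for "mail", sqrt(5) for "email".
len_mail = sqrt(sum(c * c for c in mail.values()))
len_email = sqrt(sum(c * c for c in email.values()))

similarity = dot / len_mail / len_email
print(similarity * 100)  # roughly 89.44
```

So the score is 4 / (2 * sqrt(5)): the two words share four of their characters, and the extra "e" in email only lengthens its vector.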
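Note: I have also left the threshold part commented out in the code, in case you only want the words whose similarity exceeds a certain threshold (e.g. 80%).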
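If you do want the thresholded version, one possible way to collect the matches instead of printing them is sketched below. The helper names char_cosine and pairs_above are hypothetical (they do not appear in the code above), and the sketch assumes every word is non-empty, since an empty string would give a zero-length vector and a division by zero.

```python
from collections import Counter
from math import sqrt

def char_cosine(a, b):
    """Cosine similarity between the character-count vectors of two words."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in set(ca) & set(cb))
    norm_a = sqrt(sum(c * c for c in ca.values()))
    norm_b = sqrt(sum(c * c for c in cb.values()))
    return dot / (norm_a * norm_b)

def pairs_above(keys, words, threshold=0.80):
    """Hypothetical helper: (word, key, score) for each pair scoring above threshold."""
    matches = []
    for key in keys:
        for word in words:
            score = char_cosine(word, key)
            if score > threshold:
                matches.append((word, key, score))
    return matches

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']

matches = pairs_above(list_A, list_B)
for word, key, score in matches:
    print("{} ~ {}: {:.2f}".format(word, key, score * 100))
```

Because 'email' occurs twice in list_A, its matches are reported twice; deduplicate list_A first if that is not wanted.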
EDIT:
OP: But what I want to do is not compare word by word, but list by list.
Using Counter and math, treating each list as a vector of word counts:
    from collections import Counter
    import math

    counterA = Counter(list_A)
    counterB = Counter(list_B)

    def counter_cosine_similarity(c1, c2):
        # every distinct word that appears in either list
        terms = set(c1).union(c2)
        dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
        magA = math.sqrt(sum(c1.get(k, 0) ** 2 for k in terms))
        magB = math.sqrt(sum(c2.get(k, 0) ** 2 for k in terms))
        return dotprod / (magA * magB)

    print(counter_cosine_similarity(counterA, counterB) * 100)
Output:
53.03300858899106
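One edge case worth keeping in mind: if either list is empty, its magnitude is zero and the division raises ZeroDivisionError. A minimal sketch of a guarded variant follows; the name safe_counter_cosine_similarity is hypothetical, and returning 0.0 for an empty input is an assumed convention, not something the question specifies.

```python
from collections import Counter
from math import sqrt

def safe_counter_cosine_similarity(c1, c2):
    """Hypothetical variant that returns 0.0 when either Counter is empty."""
    # Only words present in both Counters contribute to the dot product.
    dotprod = sum(c1[k] * c2[k] for k in set(c1) & set(c2))
    mag1 = sqrt(sum(v * v for v in c1.values()))
    mag2 = sqrt(sum(v * v for v in c2.values()))
    if mag1 == 0 or mag2 == 0:
        return 0.0  # assumed convention for an empty list
    return dotprod / (mag1 * mag2)

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']

score = safe_counter_cosine_similarity(Counter(list_A), Counter(list_B))
empty_score = safe_counter_cosine_similarity(Counter(list_A), Counter([]))
print(score * 100)   # same list-level similarity as above, about 53.03
print(empty_score)   # 0.0 instead of a ZeroDivisionError
```

For the two lists here the shared words are 'email' (2 x 1) and 'address' (1 x 1), so the dot product is 3 and the result is 3 / (sqrt(8) * 2).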



