Computing the similarity between lists of words

Since you haven't shown a crystal-clear expected output, here is my best shot:

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']

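For the two lists above, we compute the cosine similarity between each element of one list and every element of the other, i.e. email from list_B against every element of list_A. Each word is represented as a vector of its character counts, so the score measures overlap in spelling, not in meaning: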

from collections import Counter
from math import sqrt

def word2vec(word):
    # count the characters in the word
    cw = Counter(word)
    # precompute the set of distinct characters
    sw = set(cw)
    # precompute the "length" (Euclidean norm) of the word vector
    lw = sqrt(sum(c * c for c in cw.values()))
    # return a tuple
    return cw, sw, lw

def cosdis(v1, v2):
    # which characters are common to the two words?
    common = v1[1].intersection(v2[1])
    # by definition of cosine similarity we have
    return sum(v1[0][ch] * v2[0][ch] for ch in common) / v1[2] / v2[2]

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']

threshold = 0.80  # if needed

for key in list_A:
    for word in list_B:
        res = cosdis(word2vec(word), word2vec(key))
        print("The cosine similarity between : {} and : {} is: {}".format(word, key, res * 100))
        # if res > threshold:
        #     print("Found a word with cosine similarity > 80 : {} with original word: {}".format(word, key))

Output:

The cosine similarity between : email and : email is: 100.0
The cosine similarity between : mail and : email is: 89.44271909999159
The cosine similarity between : address and : email is: 26.967994498529684
The cosine similarity between : netmail and : email is: 84.51542547285166
The cosine similarity between : email and : user is: 22.360679774997898
The cosine similarity between : mail and : user is: 0.0
The cosine similarity between : address and : user is: 60.30226891555272
The cosine similarity between : netmail and : user is: 18.89822365046136
The cosine similarity between : email and : this is: 22.360679774997898
The cosine similarity between : mail and : this is: 25.0
The cosine similarity between : address and : this is: 30.15113445777636
The cosine similarity between : netmail and : this is: 37.79644730092272
The cosine similarity between : email and : email is: 100.0
The cosine similarity between : mail and : email is: 89.44271909999159
The cosine similarity between : address and : email is: 26.967994498529684
The cosine similarity between : netmail and : email is: 84.51542547285166
The cosine similarity between : email and : address is: 26.967994498529684
The cosine similarity between : mail and : address is: 15.07556722888818
The cosine similarity between : address and : address is: 100.0
The cosine similarity between : netmail and : address is: 22.79211529192759
The cosine similarity between : email and : customer is: 31.62277660168379
The cosine similarity between : mail and : customer is: 17.677669529663685
The cosine similarity between : address and : customer is: 42.640143271122085
The cosine similarity between : netmail and : customer is: 40.08918628686365
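To see where one of these numbers comes from, here is the mail / email pair worked by hand. This is only an illustration of the character-count arithmetic; the vector notation is mine, not part of the original code. The two words share the characters m, a, i and l, each occurring once in both, and email has one extra e:

```latex
\[
\begin{aligned}
\mathbf{v}_{\text{mail}} \cdot \mathbf{v}_{\text{email}} &= 1\cdot1 + 1\cdot1 + 1\cdot1 + 1\cdot1 = 4,\\
\lVert \mathbf{v}_{\text{mail}} \rVert &= \sqrt{1^2+1^2+1^2+1^2} = 2,\qquad
\lVert \mathbf{v}_{\text{email}} \rVert = \sqrt{1^2+1^2+1^2+1^2+1^2} = \sqrt{5},\\
\cos\theta &= \frac{4}{2\sqrt{5}} = \frac{2}{\sqrt{5}} \approx 0.8944 .
\end{aligned}
\]
```

Multiplied by 100, this matches the 89.44 reported for mail and email.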

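Note: I have also left the threshold part commented out in the code, in case you only need the words whose similarity exceeds a certain threshold (i.e. 80%).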
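If you do want the thresholded variant, something along these lines might work. It is a minimal sketch, not the code from this answer: the helper names `char_cosine` and `pairs_above` are hypothetical, and dropping the duplicate `'email'` key from `list_A` is my own assumption about what is wanted.

```python
from collections import Counter
from math import sqrt


def char_cosine(a, b):
    """Cosine similarity between the character-count vectors of two words."""
    ca, cb = Counter(a), Counter(b)
    # only characters present in both words contribute to the dot product
    dot = sum(ca[ch] * cb[ch] for ch in ca.keys() & cb.keys())
    norm_a = sqrt(sum(n * n for n in ca.values()))
    norm_b = sqrt(sum(n * n for n in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty word has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)


def pairs_above(keys, words, threshold=0.80):
    """Collect (word, key, similarity) for every pair scoring above threshold."""
    found = []
    for key in dict.fromkeys(keys):  # assumption: skip repeated keys, keep order
        for word in words:
            sim = char_cosine(word, key)
            if sim > threshold:
                found.append((word, key, sim))
    return found


list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']

for word, key, sim in pairs_above(list_A, list_B):
    print("{} ~ {}: {:.2f}".format(word, key, sim * 100))
```

With the lists above, only email, mail and netmail (against email) and address (against address) clear the 80% bar, which is consistent with the full output listed earlier.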

EDIT:

OP: But what I want to do is not a word-by-word comparison, but a list-by-list one.

Using Counter and math:

from collections import Counter
import math

counterA = Counter(list_A)
counterB = Counter(list_B)

def counter_cosine_similarity(c1, c2):
    # all distinct words occurring in either list
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0) ** 2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0) ** 2 for k in terms))
    return dotprod / (magA * magB)

print(counter_cosine_similarity(counterA, counterB) * 100)

Output:

53.03300858899106
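As a sanity check, the same figure can be derived by hand from the word counts. The vector symbols A and B below are my own shorthand for the two Counters. list_A contains email twice and user, this, address, customer once each; list_B contains email, mail, address, netmail once each. Only email and address appear in both lists:

```latex
\[
\begin{aligned}
\mathbf{A}\cdot\mathbf{B} &= \underbrace{2\cdot 1}_{\text{email}} + \underbrace{1\cdot 1}_{\text{address}} = 3,\\
\lVert\mathbf{A}\rVert &= \sqrt{2^2+1^2+1^2+1^2+1^2} = \sqrt{8},\qquad
\lVert\mathbf{B}\rVert = \sqrt{1^2+1^2+1^2+1^2} = 2,\\
\cos\theta &= \frac{3}{2\sqrt{8}} = \frac{3}{4\sqrt{2}} \approx 0.5303 .
\end{aligned}
\]
```

Note that the repeated email in list_A raises both the dot product and the norm of A, so duplicates do influence the list-level score.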
