
Computing the similarity between lists of words

Since you have not really been able to show a crystal-clear expected output, here is my best shot:

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']

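For the two lists above, we compute the cosine similarity between each element of one list and each element of the other, i.e. email from list_B against every element of list_A. Each word is represented as a vector of its character counts.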

from collections import Counter
from math import sqrt


def word2vec(word):
    # count the characters in word
    cw = Counter(word)
    # precompute the set of distinct characters
    sw = set(cw)
    # precompute the "length" (Euclidean norm) of the word vector
    lw = sqrt(sum(c * c for c in cw.values()))
    # return a tuple: (counts, character set, norm)
    return cw, sw, lw


def cosdis(v1, v2):
    # which characters are common to the two words?
    common = v1[1].intersection(v2[1])
    # by definition of cosine similarity: dot product divided by both norms
    return sum(v1[0][ch] * v2[0][ch] for ch in common) / v1[2] / v2[2]


list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']

threshold = 0.80     # if needed
for key in list_A:
    for word in list_B:
        res = cosdis(word2vec(word), word2vec(key))
        print("The cosine similarity between : {} and : {} is: {}".format(word, key, res * 100))
        # if res > threshold:
        #     print("Found a word with cosine similarity > 80 : {} with original word: {}".format(word, key))

Output:

The cosine similarity between : email and : email is: 100.0
The cosine similarity between : mail and : email is: 89.44271909999159
The cosine similarity between : address and : email is: 26.967994498529684
The cosine similarity between : netmail and : email is: 84.51542547285166
The cosine similarity between : email and : user is: 22.360679774997898
The cosine similarity between : mail and : user is: 0.0
The cosine similarity between : address and : user is: 60.30226891555272
The cosine similarity between : netmail and : user is: 18.89822365046136
The cosine similarity between : email and : this is: 22.360679774997898
The cosine similarity between : mail and : this is: 25.0
The cosine similarity between : address and : this is: 30.15113445777636
The cosine similarity between : netmail and : this is: 37.79644730092272
The cosine similarity between : email and : email is: 100.0
The cosine similarity between : mail and : email is: 89.44271909999159
The cosine similarity between : address and : email is: 26.967994498529684
The cosine similarity between : netmail and : email is: 84.51542547285166
The cosine similarity between : email and : address is: 26.967994498529684
The cosine similarity between : mail and : address is: 15.07556722888818
The cosine similarity between : address and : address is: 100.0
The cosine similarity between : netmail and : address is: 22.79211529192759
The cosine similarity between : email and : customer is: 31.62277660168379
The cosine similarity between : mail and : customer is: 17.677669529663685
The cosine similarity between : address and : customer is: 42.640143271122085
The cosine similarity between : netmail and : customer is: 40.08918628686365
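As a quick sanity check on the numbers above, the value for mail and email can be worked out by hand. The vector notation below (u and v as character-count vectors over the letters a, e, i, l, m) is introduced here only for illustration:

```latex
\[
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
\]
% mail  -> counts over (a, e, i, l, m) = (1, 0, 1, 1, 1)
% email -> counts over (a, e, i, l, m) = (1, 1, 1, 1, 1)
\[
\cos(\text{mail}, \text{email})
  = \frac{4}{\sqrt{4}\,\sqrt{5}}
  = \frac{2}{\sqrt{5}}
  \approx 0.8944
\]
```

Multiplied by 100, this agrees with the 89.44 printed for mail and email.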

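Note: I have also left the threshold check commented out in the code, in case you only need the word pairs whose similarity exceeds a certain threshold (i.e. 80%).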
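If only the pairs above the threshold are wanted, the commented-out check can be turned into a small filter. The following is a minimal sketch rather than the answer's own code: the helper names char_cosine and pairs_above are hypothetical, and returning 0.0 for an empty word is an assumption made here to avoid a division by zero.

```python
from collections import Counter
from math import sqrt


def char_cosine(a, b):
    """Cosine similarity between the character-count vectors of two words."""
    ca, cb = Counter(a), Counter(b)
    # Only characters present in both words contribute to the dot product.
    dot = sum(ca[ch] * cb[ch] for ch in ca.keys() & cb.keys())
    norm = sqrt(sum(n * n for n in ca.values())) * sqrt(sum(n * n for n in cb.values()))
    # Assumption: an empty word has a zero vector, so report 0.0 for it.
    return dot / norm if norm else 0.0


def pairs_above(list_a, list_b, threshold=0.80):
    """Collect (word_from_b, word_from_a, similarity) for pairs strictly above threshold."""
    matches = []
    for key in list_a:
        for word in list_b:
            sim = char_cosine(word, key)
            if sim > threshold:
                matches.append((word, key, sim))
    return matches


list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']

for word, key, sim in pairs_above(list_A, list_B):
    print("{} ~ {}: {:.2f}%".format(word, key, sim * 100))
```

Because email occurs twice in list_A, its matches are reported twice; deduplicating list_A first (for example with set(list_A)) would avoid that if order does not matter.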

EDIT

OP: But what I actually want is a comparison not word by word, but list by list.

Using Counter and math:

from collections import Counter
import math

# list_A and list_B are the two lists defined earlier
counterA = Counter(list_A)
counterB = Counter(list_B)


def counter_cosine_similarity(c1, c2):
    # every word that occurs in either list
    terms = set(c1).union(c2)
    # dot product of the two word-count vectors
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    # Euclidean norms of the two word-count vectors
    magA = math.sqrt(sum(c1.get(k, 0) ** 2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0) ** 2 for k in terms))
    return dotprod / (magA * magB)


print(counter_cosine_similarity(counterA, counterB) * 100)

Output:

53.03300858899106
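The list-level figure can be verified by hand in the same way. Here each list is treated as a vector of whole-word counts, so only exact word matches (email and address) contribute; the symbols A and B for the two count vectors are notation introduced for this check:

```latex
% Word counts over (email, user, this, address, customer, mail, netmail)
% list_A -> (2, 1, 1, 1, 1, 0, 0)
% list_B -> (1, 0, 0, 1, 0, 1, 1)
\[
A \cdot B = 2 \cdot 1 + 1 \cdot 1 = 3, \qquad
\lVert A \rVert = \sqrt{2^2 + 1^2 + 1^2 + 1^2 + 1^2} = \sqrt{8}, \qquad
\lVert B \rVert = \sqrt{1^2 + 1^2 + 1^2 + 1^2} = 2
\]
\[
\cos(A, B) = \frac{3}{2\sqrt{8}} \approx 0.5303
\]
```

Note that near matches such as mail and email count for nothing at this level, which is why the list similarity is lower than the word-level scores might suggest.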


When reposting, please credit the source: reposted from www.mshxw.com
Article URL: https://www.mshxw.com/it/646696.html