Simple implementation of N-Gram, tf-idf and cosine similarity in Python

Check out the NLTK package: http://www.nltk.org. It has everything you need.

For cosine_similarity:

    import math
    import numpy

    def cosine_distance(u, v):
        """
        Returns the cosine of the angle between vectors v and u. This is equal to
        u.v / |u||v|.
        """
        return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))
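
Note that, despite its name, the function above returns the cosine similarity (close to 1.0 for vectors pointing the same way), not a distance. As a quick sanity check of the formula, here is a minimal pure-Python sketch. The name `cosine_similarity_plain` is a hypothetical helper for illustration, not part of NLTK or NumPy, and returning 0.0 for a zero vector is a convention chosen here.

```python
import math

def cosine_similarity_plain(u, v):
    """Cosine of the angle between two equal-length sequences of numbers."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # The angle is undefined for a zero vector; 0.0 is an assumed convention.
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine_similarity_plain([1, 0], [0, 1]))        # orthogonal vectors
print(cosine_similarity_plain([1, 2, 3], [2, 4, 6]))  # parallel vectors
```

Orthogonal vectors give 0 and parallel vectors give a value of 1 up to floating-point rounding. If an actual distance is needed, one common choice is 1 minus this value.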

For ngrams:

    from itertools import chain

    def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
        """
        A utility that produces a sequence of ngrams from a sequence of items.
        For example:

        >>> ngrams([1,2,3,4,5], 3)
        [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

        Use ingram for an iterator version of this function.  Set pad_left
        or pad_right to true in order to get additional ngrams:

        >>> ngrams([1,2,3,4,5], 2, pad_right=True)
        [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]

        @param sequence: the source data to be converted into ngrams
        @type sequence: C{sequence} or C{iterator}
        @param n: the degree of the ngrams
        @type n: C{int}
        @param pad_left: whether the ngrams should be left-padded
        @type pad_left: C{boolean}
        @param pad_right: whether the ngrams should be right-padded
        @type pad_right: C{boolean}
        @param pad_symbol: the symbol to use for padding (default is None)
        @type pad_symbol: C{any}
        @return: The ngrams
        @rtype: C{list} of C{tuple}s
        """
        if pad_left:
            sequence = chain((pad_symbol,) * (n-1), sequence)
        if pad_right:
            sequence = chain(sequence, (pad_symbol,) * (n-1))
        sequence = list(sequence)

        count = max(0, len(sequence) - n + 1)
        return [tuple(sequence[i:i+n]) for i in range(count)]
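
To show how n-grams feed into the counting steps that follow, here is a small sketch that extracts word bigrams from a sentence and counts them. The helper `word_ngrams` and the sample sentence are made up for illustration; the helper mirrors only the unpadded case of the function above and uses nothing beyond the standard library.

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a list of tokens, without padding."""
    return [tuple(tokens[i:i + n]) for i in range(max(0, len(tokens) - n + 1))]

tokens = "to be or not to be".split()
bigrams = word_ngrams(tokens, 2)
bigram_counts = Counter(bigrams)  # maps each bigram to how often it occurs

print(bigrams)
print(bigram_counts.most_common(1))  # → [(('to', 'be'), 2)]
```

Because the n-grams are tuples, they are hashable and can be used directly as dictionary keys or as the "terms" of a tf-idf vector.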

For tf-idf, you will have to compute the term distribution first. I am using Lucene to do that, but you can very well do something similar with NLTK; use FreqDist:

http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html#frequency_distribution_index_term
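
As a rough sketch of that idea without Lucene, the snippet below counts term frequencies per document and derives document frequencies from them. It uses `collections.Counter` as a stand-in for NLTK's `FreqDist`, which offers a similar counting interface. The function name `tfidf_vectors`, the toy documents, and the weighting (raw count for tf, log(N/df) for idf) are assumptions made for illustration; Lucene's scoring uses its own variants of these formulas.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a {term: tf-idf weight} dict.

    tf is the raw count of the term in the document;
    idf is log(N / df), where df is the number of documents containing the term.
    """
    term_counts = [Counter(doc) for doc in docs]
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())  # each document counts once per term
    n_docs = len(docs)
    return [
        {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in counts.items()}
        for counts in term_counts
    ]

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

With this weighting, a term that occurs in every document (here "the") gets weight 0, while a term unique to one document (here "mat") gets the largest idf.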

If you prefer pylucene, this shows how to compute tf.idf:

    # Fragment: assumes searcher, sim, tmap, max_doc, fieldname,
    # maxtokensperdoc and cosine_normalization are already defined.
    # reader = lucene.IndexReader(FSDirectory.open(index_loc))
    docs = reader.numDocs()
    for i in xrange(docs):
        tfv = reader.getTermFreqVector(i, fieldname)
        if tfv:
            rec = {}
            terms = tfv.getTerms()
            frequencies = tfv.getTermFrequencies()
            for (t, f, x) in zip(terms, frequencies, xrange(maxtokensperdoc)):
                df = searcher.docFreq(Term(fieldname, t))  # number of docs with the given term
                tmap.setdefault(t, len(tmap))
                rec[t] = sim.tf(f) * sim.idf(df, max_doc)  # compute TF.IDF
            # and normalize the values using cosine normalization
            if cosine_normalization:
                denom = sum([x**2 for x in rec.values()])**0.5
                for k, v in rec.items():
                    rec[k] = v / denom
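
The cosine normalization step at the end can be tried in isolation. The sketch below is a plain-Python illustration, not PyLucene code: `l2_normalize` and `sparse_dot` are hypothetical helpers, and the weights are made-up numbers. Once two weight dictionaries are scaled to unit length, their dot product is their cosine similarity.

```python
def l2_normalize(rec):
    """Scale a {term: weight} dict to unit Euclidean length."""
    denom = sum(w ** 2 for w in rec.values()) ** 0.5
    if denom == 0:
        return dict(rec)  # nothing to scale in an all-zero (or empty) record
    return {term: w / denom for term, w in rec.items()}

def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * b[term] for term, w in a.items() if term in b)

doc_a = l2_normalize({"cat": 3.0, "mat": 4.0})
doc_b = l2_normalize({"cat": 1.0, "dog": 1.0})

print(doc_a)
print(sparse_dot(doc_a, doc_b))  # cosine similarity of the two documents
```

Guarding against a zero denominator matters in practice: a document whose terms all have zero weight would otherwise raise a division error.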

