Simple implementation of N-Gram, tf-idf and cosine similarity in Python

Check out the NLTK package: http://www.nltk.org. It has everything you need.

For cosine_similarity:

    import math
    import numpy

    def cosine_distance(u, v):
        """
        Returns the cosine of the angle between vectors v and u. This is equal to
        u.v / |u||v|.
        """
        return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))
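The formula u.v / |u||v| is undefined when either vector is all zeros. As a minimal sketch using only the standard library, the version below adds a guard for that case. The name `cosine_similarity` and the choice to return 0.0 for a zero vector are assumptions made for illustration, not part of NLTK.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length sequences of numbers.

    Returns 0.0 when either vector is all zeros (an assumed convention).
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])  # parallel vectors, close to 1.0
orthogonal = cosine_similarity([1, 0], [0, 1])            # perpendicular vectors
zero_case = cosine_similarity([0, 0], [1, 1])             # guarded case
```

Parallel vectors score close to 1 (up to floating-point rounding) and perpendicular vectors score 0, whatever their lengths.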

For ngrams:

    from itertools import chain

    def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
        """
        A utility that produces a sequence of ngrams from a sequence of items.
        For example:

        >>> ngrams([1,2,3,4,5], 3)
        [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

        Use ingram for an iterator version of this function.  Set pad_left
        or pad_right to true in order to get additional ngrams:

        >>> ngrams([1,2,3,4,5], 2, pad_right=True)
        [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]

        @param sequence: the source data to be converted into ngrams
        @type sequence: C{sequence} or C{iterator}
        @param n: the degree of the ngrams
        @type n: C{int}
        @param pad_left: whether the ngrams should be left-padded
        @type pad_left: C{boolean}
        @param pad_right: whether the ngrams should be right-padded
        @type pad_right: C{boolean}
        @param pad_symbol: the symbol to use for padding (default is None)
        @type pad_symbol: C{any}
        @return: The ngrams
        @rtype: C{list} of C{tuple}s
        """
        if pad_left:
            sequence = chain((pad_symbol,) * (n-1), sequence)
        if pad_right:
            sequence = chain(sequence, (pad_symbol,) * (n-1))
        sequence = list(sequence)
        count = max(0, len(sequence) - n + 1)
        return [tuple(sequence[i:i+n]) for i in range(count)]
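To apply the same idea to text, the sequence first has to be split into tokens. The sketch below is a self-contained illustration of word n-grams without padding. The helper name `word_ngrams` and the lowercase, whitespace-based tokenization are assumptions; a real tokenizer would handle punctuation better.

```python
def word_ngrams(text, n):
    """Return the word n-grams of `text` as a list of tuples.

    Tokenization is a plain lowercase whitespace split (an assumption
    made to keep the example short).
    """
    tokens = text.lower().split()
    # zip over n shifted views of the token list; zip stops at the
    # shortest view, so only complete n-grams are produced.
    return list(zip(*(tokens[i:] for i in range(n))))

bigrams = word_ngrams("The quick brown fox", 2)
# → [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
too_long = word_ngrams("two words", 3)
# → []
```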

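For tf-idf, you will have to compute the term distribution first. I am using Lucene to do that, but you may well be able to do something similar with NLTK using FreqDist:

http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html#frequency_distribution_index_term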
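As a rough sketch of that step without Lucene, the example below uses `collections.Counter` from the standard library in the role a frequency distribution would play. The weighting is an assumption chosen for simplicity: raw term count for tf and log(N / df) for idf. This is one common variant and not the formula Lucene uses. The function name `tf_idf` and the sample documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumed weighting: tf = raw count, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # term counts for this document
        weights.append({t: count * math.log(n_docs / df[t])
                        for t, count in tf.items()})
    return weights

docs = [
    ["apple", "banana", "apple"],
    ["apple", "cherry"],
    ["apple", "banana"],
]
weights = tf_idf(docs)
```

With this weighting, a term that occurs in every document ("apple" here) gets weight 0, while a term unique to one document ("cherry") gets the highest idf.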

If you like pylucene, this will show you how to compute tf.idf:

    # Python 2 fragment. It assumes reader, searcher, sim, tmap, fieldname,
    # max_doc, maxtokensperdoc and cosine_normalization are defined elsewhere.
    # reader = lucene.IndexReader(FSDirectory.open(index_loc))
    docs = reader.numDocs()
    for i in xrange(docs):
        tfv = reader.getTermFreqVector(i, fieldname)
        if tfv:
            rec = {}
            terms = tfv.getTerms()
            frequencies = tfv.getTermFrequencies()
            for (t, f, x) in zip(terms, frequencies, xrange(maxtokensperdoc)):
                df = searcher.docFreq(Term(fieldname, t))  # number of docs with the given term
                tmap.setdefault(t, len(tmap))
                rec[t] = sim.tf(f) * sim.idf(df, max_doc)  # compute TF.IDF
            # and normalize the values using cosine normalization
            if cosine_normalization:
                denom = sum([w**2 for w in rec.values()])**0.5
                for k, v in rec.items():
                    rec[k] = v / denom
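The cosine normalization at the end of that loop scales each document's weights to unit length, after which the cosine similarity of two documents is just their dot product. The sketch below shows this on plain dictionaries. The helper names `cosine_normalize` and `sparse_dot` and the sample weights are hypothetical and are not part of pylucene.

```python
import math

def cosine_normalize(rec):
    """Scale a {term: weight} dict to unit Euclidean length."""
    denom = math.sqrt(sum(w * w for w in rec.values()))
    if denom == 0.0:
        return dict(rec)  # nothing to scale in an all-zero vector
    return {t: w / denom for t, w in rec.items()}

def sparse_dot(a, b):
    """Dot product of two sparse {term: weight} vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dict
    return sum(w * b[t] for t, w in a.items() if t in b)

# Made-up tf-idf weights for two documents.
doc1 = cosine_normalize({"python": 3.0, "ngram": 4.0})
doc2 = cosine_normalize({"python": 4.0, "lucene": 3.0})

# Both vectors now have length 1, so this dot product is their cosine.
similarity = sparse_dot(doc1, doc2)
```

Here only "python" is shared, so the similarity is 0.6 * 0.8, about 0.48.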
