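Here is a possible solution using Apache Lucene. I didn't use the latest version but version 3.6.2, since it is the one I know best. Besides `/lucene-core-x.x.x.jar`, don't forget to add `/contrib/analyzers/common/lucene-analyzers-x.x.x.jar` from the downloaded archive to your project: it contains the language-specific analyzers (the English one in particular, in your case).

Note that this will _only_ find the frequencies of the input text's words, grouped by their stems. Comparing these frequencies with English language statistics has to be done afterwards.
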
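Data model

One keyword per stem. Different words may share the same stem, hence the `terms` set. The keyword frequency is incremented each time a term is added (even if that term has already been found: a set automatically removes duplicates).
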
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class Keyword implements Comparable<Keyword> {

        private final String stem;
        private final Set<String> terms = new HashSet<String>();
        private int frequency = 0;

        public Keyword(String stem) {
            this.stem = stem;
        }

        // record one more occurrence; the set silently drops duplicate terms
        public void add(String term) {
            terms.add(term);
            frequency++;
        }

        @Override
        public int compareTo(Keyword o) {
            // descending order
            return Integer.valueOf(o.frequency).compareTo(frequency);
        }

        // two keywords are equal when they share the same stem
        @Override
        public boolean equals(Object obj) {
            if (this == obj) {
                return true;
            } else if (!(obj instanceof Keyword)) {
                return false;
            } else {
                return stem.equals(((Keyword) obj).stem);
            }
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(new Object[] { stem });
        }

        public String getStem() {
            return stem;
        }

        public Set<String> getTerms() {
            return terms;
        }

        public int getFrequency() {
            return frequency;
        }
    }

Utilities

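To stem a word:

    // imports (package names as of Lucene 3.6.x)
    import java.io.IOException;
    import java.io.StringReader;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.ClassicTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public static String stem(String term) throws IOException {
        TokenStream tokenStream = null;
        try {
            // tokenize
            tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(term));
            // stem
            tokenStream = new PorterStemFilter(tokenStream);

            // add each token in a set, so that duplicates are removed
            Set<String> stems = new HashSet<String>();
            CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                stems.add(token.toString());
            }

            // if no stem or 2+ stems have been found, return null
            if (stems.size() != 1) {
                return null;
            }
            String stem = stems.iterator().next();
            // if the stem has non-alphanumerical chars, return null
            if (!stem.matches("[a-zA-Z0-9-]+")) {
                return null;
            }

            return stem;
        } finally {
            if (tokenStream != null) {
                tokenStream.close();
            }
        }
    }

To search a collection (used by the list of potential keywords):
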
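    // return the element equal to the example if there is one;
    // otherwise add the example to the collection and return it
    public static <T> T find(Collection<T> collection, T example) {
        for (T element : collection) {
            if (element.equals(example)) {
                return element;
            }
        }
        collection.add(example);
        return example;
    }

Core

Here is the main input method:
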
    // additional imports (package names as of Lucene 3.6.x)
    import java.util.Collections;
    import java.util.LinkedList;
    import java.util.List;
    import org.apache.lucene.analysis.ASCIIFoldingFilter;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.standard.ClassicFilter;

    public static List<Keyword> guessFromString(String input) throws IOException {
        TokenStream tokenStream = null;
        try {
            // hack to keep dashed words (e.g. "non-specific" rather than "non" and "specific")
            input = input.replaceAll("-+", "-0");
            // replace any punctuation char but apostrophes and dashes by a space
            input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
            // replace most common english contractions
            input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");

            // tokenize input
            tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(input));
            // to lowercase
            tokenStream = new LowerCaseFilter(Version.LUCENE_36, tokenStream);
            // remove dots from acronyms (and "'s" but already done manually above)
            tokenStream = new ClassicFilter(tokenStream);
            // convert any char to ASCII
            tokenStream = new ASCIIFoldingFilter(tokenStream);
            // remove english stop words
            tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, EnglishAnalyzer.getDefaultStopSet());

            List<Keyword> keywords = new LinkedList<Keyword>();
            CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                String term = token.toString();
                // stem each term
                String stem = stem(term);
                if (stem != null) {
                    // create the keyword or get the existing one if any
                    Keyword keyword = find(keywords, new Keyword(stem.replaceAll("-0", "-")));
                    // add its corresponding initial token
                    keyword.add(term.replaceAll("-0", "-"));
                }
            }

            // reverse sort by frequency
            Collections.sort(keywords);

            return keywords;
        } finally {
            if (tokenStream != null) {
                tokenStream.close();
            }
        }
    }

Example

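As a small warm-up, the three `replaceAll` calls at the top of `guessFromString` can be tried on their own, without Lucene. The sketch below uses only the JDK; the class name `PreprocessDemo`, the method name `preprocess` and the sample sentence are my own illustrative choices, not part of the code above.

```java
class PreprocessDemo {

    // Same three regex steps as at the top of guessFromString
    static String preprocess(String input) {
        // mark dashes with a trailing 0 so that dashed words are kept in one piece
        input = input.replaceAll("-+", "-0");
        // replace any punctuation char but apostrophes and dashes by a space
        input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
        // drop the most common English contractions ('t, 's, 've, 'll, ...)
        input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");
        return input;
    }

    public static void main(String[] args) {
        String cleaned = preprocess("Don't split non-specific words, please.");
        // show the intermediate string, brackets make trailing spaces visible
        System.out.println("[" + cleaned + "]");
        // the "-0" marker is turned back into "-" once tokens are stemmed
        System.out.println("[" + cleaned.replaceAll("-0", "-") + "]");
    }
}
```

Note that the comma and the final dot each become a space, so the cleaned string can contain double or trailing spaces; the tokenizer does not care.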
Using the `guessFromString` method on the introduction of the Java Wikipedia article, here are the 10 most frequent keywords (i.e. stems) found:

    java         x12    [java]
    compil       x5     [compiled, compiler, compilers]
    sun          x5     [sun]
    develop      x4     [developed, developers]
    languag      x3     [languages, language]
    implement    x3     [implementation, implementations]
    applic       x3     [application, applications]
    run          x3     [run]
    origin       x3     [originally, original]
    gnu          x3     [gnu]

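Iterate over the output list to see the original words found for each stem, by getting its `terms` set (displayed between brackets `[...]` in the example above).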
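
That iteration could look like the minimal sketch below. To keep it runnable without Lucene, it uses a trimmed-down stand-in for the `Keyword` class; the `KeywordReport` class and its `format` helper are hypothetical names of mine, and the terms are sorted only to make the printed order predictable.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

class KeywordReport {

    // Trimmed-down stand-in for the Keyword class from the data model
    static final class Keyword {
        private final String stem;
        private final Set<String> terms = new HashSet<String>();
        private int frequency = 0;

        Keyword(String stem) { this.stem = stem; }

        void add(String term) { terms.add(term); frequency++; }
        String getStem() { return stem; }
        Set<String> getTerms() { return terms; }
        int getFrequency() { return frequency; }
    }

    // One report line: stem, frequency, then the original words in sorted order
    static String format(Keyword keyword) {
        return keyword.getStem() + " x" + keyword.getFrequency()
                + " " + new TreeSet<String>(keyword.getTerms());
    }

    public static void main(String[] args) {
        Keyword keyword = new Keyword("compil");
        String[] found = { "compiled", "compiler", "compilers", "compiler", "compiled" };
        for (String term : found) {
            keyword.add(term);
        }
        // with the real code, loop over the list returned by guessFromString instead
        System.out.println(format(keyword));
    }
}
```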
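What's next

Compare the _stem frequency / frequencies sum_ ratios with those from English language statistics, and keep me in the loop if you manage it: I could be quite interested too :)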
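
For what it's worth, here is one way that comparison might be started. This is only a sketch under assumptions: `StemRatioSketch`, `ratios` and `overuse` are hypothetical names, the reference ratio passed in `main` is a made-up placeholder rather than a real English statistic, and dividing the text ratio by the reference ratio is just one possible scoring choice.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class StemRatioSketch {

    // Share of each stem in the text: stem frequency / sum of all frequencies
    static Map<String, Double> ratios(Map<String, Integer> stemFrequencies) {
        int sum = 0;
        for (int frequency : stemFrequencies.values()) {
            sum += frequency;
        }
        Map<String, Double> result = new LinkedHashMap<String, Double>();
        for (Map.Entry<String, Integer> entry : stemFrequencies.entrySet()) {
            result.put(entry.getKey(), sum == 0 ? 0.0 : (double) entry.getValue() / sum);
        }
        return result;
    }

    // How many times more frequent a stem is in the text than in the reference.
    // Stems missing from the reference (null) fall back to a small floor value.
    static double overuse(double textRatio, Double referenceRatio, double floor) {
        double reference = (referenceRatio == null || referenceRatio < floor) ? floor : referenceRatio;
        return textRatio / reference;
    }

    public static void main(String[] args) {
        Map<String, Integer> frequencies = new LinkedHashMap<String, Integer>();
        frequencies.put("java", 12);
        frequencies.put("run", 3);

        // PLACEHOLDER reference ratio, not a real statistic
        Map<String, Double> reference = new LinkedHashMap<String, Double>();
        reference.put("run", 0.001);

        for (Map.Entry<String, Double> entry : ratios(frequencies).entrySet()) {
            double score = overuse(entry.getValue(), reference.get(entry.getKey()), 1e-6);
            System.out.println(entry.getKey() + " -> " + score);
        }
    }
}
```

Stems that score high are frequent in the text but rare in general English, which makes them better keyword candidates than stems that are simply common everywhere.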



