Nutch Development (Part 4)
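Contents
Development environment
1. Introduction to Nutch plugin design
2. Understanding the plugin directory structure
3. build.xml
4. ivy.xml
5. plugin.xml
6. Understanding the parse-html plugin
   HtmlParser
      setConf(Configuration conf)
      parse(InputSource input)
      getParse(Content content)
7. Understanding the parse-metatags plugin
   metaTagsParser
      filter method
      addIndexedmetatags method
      metadata plugin configuration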
Development environment
Linux, Ubuntu 20.04 LTS
IDEA
Nutch 1.18
Solr 8.11
Please credit the source when reposting! By 鸭梨的药丸哥
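1. Introduction to Nutch plugin design
Nutch is highly extensible. Its plugin system is based on the Eclipse 2.x plugin system.
Nutch exposes several extension points. Each extension point is an interface, and a plugin is developed by implementing that interface. Nutch provides the following extension points; to develop a plugin, we only need to implement the corresponding interface.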
IndexWriter – Writes crawled data to a specific indexing backend (Solr, ElasticSearch, a CSV file, etc.).
IndexingFilter – Permits one to add metadata to the indexed fields. All plugins found which implement this extension point are run sequentially on the parse (from javadoc).
Parser – Parser implementations read through fetched documents in order to extract data to be indexed. This is what you need to implement if you want Nutch to be able to parse a new type of content, or extract more data from currently parseable content.
HtmlParseFilter – Permits one to add additional metadata to HTML parses (from javadoc).
Protocol – Protocol implementations allow Nutch to use different protocols (ftp, http, etc.) to fetch documents.
URLFilter – URLFilter implementations limit the URLs that Nutch attempts to fetch. The RegexURLFilter distributed with Nutch provides a great deal of control over what URLs Nutch crawls; however, if you have very complicated rules about what URLs you want to crawl, you can write your own implementation.
URLNormalizer – Interface used to convert URLs to normal form and optionally perform substitutions.
ScoringFilter – A contract defining behavior of scoring plugins. A scoring filter will manipulate scoring variables in CrawlDatum and in resulting search indexes. Filters can be chained in a specific order, to provide multi-stage scoring adjustments.
SegmentMergeFilter – Interface used to filter segments during segment merge. It allows filtering on more sophisticated criteria than just URLs. In particular it allows filtering based on metadata collected while parsing a page.

2. Understanding the plugin directory structure
Nutch plugin directories all look alike, so the parse-html directory is enough to illustrate the layout.
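/src          # source directory
build.xml     # tells Ant how to build this plugin (for example, where the compiled jar goes)
ivy.xml       # the plugin's Ivy configuration (dependency management, much like Maven's pom.xml)
plugin.xml    # Nutch's description of the plugin (for example, which extension points it implements and the names of the implementing classes)

3. build.xml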
build.xml tells Ant how to build this plugin.
metadata plugin configuration
Now look at the configuration and compare it with addIndexedmetatags. This shows why the values of index.parse.md need the metatag. prefix.
<property>
  <name>metatags.names</name>
  <value>description,keywords</value>
  <description>Names of the metatags to extract, separated by ','.
  Use '*' to extract all metatags. Prefixes the names with 'metatag.'
  in the parse-metadata. For instance to index description and keywords,
  you need to activate the plugin index-metadata and set the value of the
  parameter 'index.parse.md' to 'metatag.description,metatag.keywords'.
  </description>
</property>

<property>
  <name>index.parse.md</name>
  <value>metatag.description,metatag.keywords</value>
  <description>Comma-separated list of keys to be taken from the parse
  metadata to generate fields. Can be used e.g. for 'description' or
  'keywords' provided that these values are generated by a parser
  (see parse-metatags plugin)
  </description>
</property>
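How the two properties interact can be sketched in plain Java. This is only a simplified stand-in, not the Nutch source: the class MetatagPrefixSketch and its methods extract and toIndexFields are hypothetical names, a plain Map plays the role of the parse metadata, and lower-casing the tag names is an assumption. The one thing taken from the configuration above is that the parse step stores each tag under 'metatag.' plus its name, so the index step only finds it when index.parse.md uses the same prefixed key.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Simplified stand-in (NOT the Nutch classes): a Map acts as the parse metadata,
// and the two String arguments act as the two configuration properties.
class MetatagPrefixSketch {

    // Parse side: keep only the tags listed in metatags.names ('*' keeps all)
    // and store each one under "metatag." + name.
    // Assumption: names are compared and stored in lower case.
    static Map<String, String> extract(Map<String, String> pageMetaTags, String metatagsNames) {
        Set<String> wanted = new LinkedHashSet<>();
        for (String name : metatagsNames.split(",")) {
            wanted.add(name.trim().toLowerCase(Locale.ROOT));
        }
        Map<String, String> parseMetadata = new LinkedHashMap<>();
        for (Map.Entry<String, String> tag : pageMetaTags.entrySet()) {
            String lower = tag.getKey().toLowerCase(Locale.ROOT);
            if (wanted.contains("*") || wanted.contains(lower)) {
                parseMetadata.put("metatag." + lower, tag.getValue());
            }
        }
        return parseMetadata;
    }

    // Index side: look up each key listed in index.parse.md in the parse metadata.
    // Keys that are not present produce no field.
    static Map<String, String> toIndexFields(Map<String, String> parseMetadata, String indexParseMd) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String key : indexParseMd.split(",")) {
            String trimmed = key.trim();
            String value = parseMetadata.get(trimmed);
            if (value != null) {
                fields.put(trimmed, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> page = new LinkedHashMap<>();
        page.put("Description", "A demo page");
        page.put("keywords", "nutch,plugin");
        page.put("author", "someone");

        Map<String, String> parsed = extract(page, "description,keywords");

        // With the prefix, both values are found.
        System.out.println(toIndexFields(parsed, "metatag.description,metatag.keywords"));
        // Without the prefix, the keys do not exist in the parse metadata.
        System.out.println(toIndexFields(parsed, "description,keywords"));
    }
}
```

In this sketch the first lookup returns both fields, while the second returns an empty map, because the parse metadata contains no key without the prefix.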



