Nutch Development (Part 4)

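Contents

Nutch Development (Part 4)
  Development environment
  1. Introduction to Nutch's plugin design
  2. The plugin directory structure
  3. build.xml
  4. ivy.xml
  5. plugin.xml
  6. The parse-html plugin
    HtmlParser
      setConf(Configuration conf)
      parse(InputSource input)
      getParse(Content content)
  7. The parse-metatags plugin
    metaTagsParser
      filter method
      addIndexedmetatags method
      Configuring the metadata plugin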

Development environment

Linux, Ubuntu 20.04 LTS
IDEA
Nutch 1.18
Solr 8.11

Please credit the source when reposting! By 鸭梨的药丸哥

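1. Introduction to Nutch's plugin design

Nutch is highly extensible. Its plugin system is modeled on the Eclipse 2.x plugin system.

Nutch exposes several extension points. Each extension point is an interface, and a plugin is written by implementing that interface. Nutch provides the following extension points; to develop our own plugin we only need to implement the corresponding interface: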

IndexWriter – Writes crawled data to a specific indexing backend (Solr, ElasticSearch, a CSV file, etc.).
IndexingFilter – Permits one to add metadata to the indexed fields. All plugins found which implement this extension point are run sequentially on the parse (from javadoc).
Parser – Parser implementations read through fetched documents in order to extract data to be indexed. This is what you need to implement if you want Nutch to be able to parse a new type of content, or extract more data from currently parseable content.
HtmlParseFilter – Permits one to add additional metadata to HTML parses (from javadoc).
Protocol – Protocol implementations allow Nutch to use different protocols (ftp, http, etc.) to fetch documents.
URLFilter – URLFilter implementations limit the URLs that Nutch attempts to fetch. The RegexURLFilter distributed with Nutch provides a great deal of control over what URLs Nutch crawls; however, if you have very complicated rules about what URLs you want to crawl, you can write your own implementation.
URLNormalizer – Interface used to convert URLs to normal form and optionally perform substitutions.
ScoringFilter – A contract defining behavior of scoring plugins. A scoring filter will manipulate scoring variables in CrawlDatum and in resulting search indexes. Filters can be chained in a specific order, to provide multi-stage scoring adjustments.
SegmentMergeFilter – Interface used to filter segments during segment merge. It allows filtering on more sophisticated criteria than just URLs. In particular it allows filtering based on metadata collected while parsing a page.
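To make the "implement an interface" idea concrete, here is a minimal sketch of a URL filter. It does not depend on the Nutch jars: SimpleUrlFilter is a hypothetical stand-in that only mirrors the single-method contract of Nutch's URLFilter as described above (return the URL to keep it, return null to reject it). SuffixUrlFilter and UrlFilterDemo are likewise illustrative names, not classes shipped with Nutch. A real plugin would implement org.apache.nutch.net.URLFilter and also handle its Hadoop Configuration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Demo entry point: applies a chain of filters to a URL, one after another.
class UrlFilterDemo {
    /** Runs every filter in order; a null from any filter rejects the URL. */
    static String applyAll(List<? extends SimpleUrlFilter> filters, String url) {
        String current = url;
        for (SimpleUrlFilter f : filters) {
            if (current == null) {
                return null; // already rejected by an earlier filter
            }
            current = f.filter(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<SimpleUrlFilter> filters = new ArrayList<>();
        filters.add(new SuffixUrlFilter(".jpg", ".png", ".zip"));
        System.out.println(applyAll(filters, "http://example.com/index.html"));
        System.out.println(applyAll(filters, "http://example.com/logo.PNG"));
    }
}

// HYPOTHETICAL stand-in for the extension-point interface (assumption:
// it mirrors only the filter(String) method of Nutch's URLFilter).
interface SimpleUrlFilter {
    /** Returns the URL if it is accepted, or null if it should be dropped. */
    String filter(String urlString);
}

// Illustrative implementation: rejects URLs ending in a blocked suffix.
class SuffixUrlFilter implements SimpleUrlFilter {
    private final List<String> blockedSuffixes = new ArrayList<>();

    SuffixUrlFilter(String... suffixes) {
        for (String s : suffixes) {
            blockedSuffixes.add(s.toLowerCase(Locale.ROOT));
        }
    }

    @Override
    public String filter(String urlString) {
        String lower = urlString.toLowerCase(Locale.ROOT);
        for (String suffix : blockedSuffixes) {
            if (lower.endsWith(suffix)) {
                return null; // reject: the crawler should skip this URL
            }
        }
        return urlString; // accept unchanged
    }
}
```

The first call keeps the HTML page and the second rejects the image, because the suffix comparison is case-insensitive. The other extension points follow the same pattern: Nutch defines the interface, and the plugin supplies the class behind it.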

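2. The plugin directory structure

Nutch plugin directories all look alike, so walking through the parse-html directory is enough.

/src        # source directory
build.xml   # tells Ant how to build this plugin (for example, where the built jar goes)
ivy.xml     # the plugin's Ivy configuration (dependency management, the same role as Maven's pom.xml)
plugin.xml  # Nutch's description of the plugin (for example, which extension points it implements and the names of the implementing classes)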

3. build.xml

build.xml tells Ant how to build this plugin.

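Configuring the metadata plugin
Now look at the configuration and compare it with addIndexedmetatags. This shows why the entries in the plugin's index.parse.md must carry the metatag. prefix.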

<property>
  <name>metatags.names</name>
  <value>description,keywords</value>
  <description> Names of the metatags to extract, separated by ','.
  Use '*' to extract all metatags. Prefixes the names with 'metatag.'
  in the parse-metadata. For instance to index description and keywords,
  you need to activate the plugin index-metadata and set the value of the
  parameter 'index.parse.md' to 'metatag.description,metatag.keywords'.
  </description>
</property>

<property>
  <name>index.parse.md</name>
  <value>metatag.description,metatag.keywords</value>
  <description>
  Comma-separated list of keys to be taken from the parse metadata to generate fields.
  Can be used e.g. for 'description' or 'keywords' provided that these values are generated
  by a parser (see parse-metatags plugin)
  </description>
</property>
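
The interplay between the two settings can be sketched in plain Java. This is an illustration only, not the plugin's source: the class MetatagPrefixDemo and the methods collectMetatags and selectIndexFields are hypothetical names, and ordinary maps stand in for Nutch's parse metadata. It assumes, as the metatags.names description states, that the parse side stores each extracted tag under 'metatag.' plus its lower-cased name, and that the index side looks keys up exactly as they are written in index.parse.md.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative model of the parse-metatags / index-metadata hand-off.
class MetatagPrefixDemo {
    /** Parse side: keep the configured tags, stored as "metatag." + lower-cased name. */
    static Map<String, String> collectMetatags(Map<String, String> htmlMetaTags,
                                               Set<String> wantedNames) {
        Map<String, String> parseMetadata = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : htmlMetaTags.entrySet()) {
            String lcName = e.getKey().toLowerCase(Locale.ROOT);
            // '*' means "extract every meta tag"
            if (wantedNames.contains("*") || wantedNames.contains(lcName)) {
                parseMetadata.put("metatag." + lcName, e.getValue());
            }
        }
        return parseMetadata;
    }

    /** Index side: copy only the keys listed in index.parse.md into index fields. */
    static Map<String, String> selectIndexFields(Map<String, String> parseMetadata,
                                                 String indexParseMd) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String key : indexParseMd.split(",")) {
            String trimmed = key.trim();
            String value = parseMetadata.get(trimmed);
            if (value != null) {
                fields.put(trimmed, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("Description", "A demo page");
        meta.put("keywords", "nutch,plugin");
        meta.put("author", "someone");

        // Corresponds to metatags.names = description,keywords
        Set<String> wanted = new HashSet<>(Arrays.asList("description", "keywords"));
        Map<String, String> parseMd = collectMetatags(meta, wanted);

        // With the prefix, both keys are found in the parse metadata.
        System.out.println(selectIndexFields(parseMd, "metatag.description,metatag.keywords"));
        // Without the prefix, no key matches and nothing is indexed.
        System.out.println(selectIndexFields(parseMd, "description,keywords"));
    }
}
```

In this model the second lookup returns an empty map, which is the situation the metatag. prefix in index.parse.md avoids.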
