Nutch Development (Part 1)
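Development environment
1. Importing the Nutch project into IDEA
2. Nutch source directory overview
3. Nutch crawl steps
4. The launcher classes
5. The Nutch sh script
6. Running Injector
   6.1 Configuration
   6.2 Creating a URL list
   6.3 Creating the run configuration in IDEA
   6.4 Equivalent command
7. Analysis of the Injector main function
8. Running Generator
   8.1 Creating the run configuration in IDEA
   8.2 Equivalent command
9. Running Fetcher
   9.1 Creating the run configuration in IDEA
   9.2 Error analysis
   9.3 Configuring http.agent.name
   9.4 Equivalent command
10. Running ParseSegment
   10.1 Creating the run configuration in IDEA
   10.2 Equivalent command
11. Running CrawlDb
   11.1 Creating the run configuration in IDEA
   11.2 Equivalent command
12. Running LinkDb
   12.1 Creating the run configuration in IDEA
   12.2 Equivalent command
Next chapter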
Development environment
Linux, Ubuntu 20.04 LTS
IDEA
Nutch 1.18
Solr 8.11
Please credit the source when reposting. By 鸭梨的药丸哥
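1. Importing the Nutch project into IDEA
To develop for Nutch, it is best to download the Nutch source code as well. Get the source package from the official site.
Download address for version 1.18: https://www.apache.org/dyn/closer.lua/nutch/1.18/apache-nutch-1.18-src.tar.gz
I work with the source package on Linux. Many Nutch commands need to run on Linux, so for convenience I develop Nutch plugins on Linux as well.
Before building the source, make sure ant is installed. It can be installed with the following commands:
sudo apt-get update
sudo apt-get install ant
Build Nutch as an Eclipse project:
ant eclipse
Then import the project into IDEA as an Eclipse project. There are plenty of guides for this online; import the Nutch source project in the usual way, and choose the Eclipse project option when importing.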
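2. Nutch source directory overview
The directory structure produced by building the Nutch source differs slightly from the structure of the downloaded bin package.
build/   # generated by the ant eclipse build
conf/    # configuration files
docs/    # API documentation
ivy/     # folder for the ivy dependency management tool
lib/     # placeholder folder for the Hadoop native libraries (not downloaded automatically; they speed up data (de)compression)
src/     # source code
3. Nutch crawl steps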
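The whole Nutch crawl is split into many steps:
injector -> generator -> fetcher -> parseSegment -> updateCrawlDB -> Invert links -> Index -> DeleteDuplicates -> IndexMerger
1. Build the initial URL set.
2. Inject the URL set into the crawldb database (inject).
3. Create a fetch list from the crawldb database (generate).
4. 4.1) Run the fetch to retrieve the pages (fetch).
   4.2) Run the parse to extract information from the fetched pages (parse).
5. Update the database, storing the information gathered from the fetched pages (updatedb).
6. Repeat steps 3 to 5 until the preset crawl depth is reached. This loop is known as the "generate/fetch/update" cycle.
7. Update the linkdb database from the contents of the segments (invertlinks).
8. Build the index (index), for example in Solr.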
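The steps above can be sketched as a small plan builder. This is only an illustration of the loop structure, not Nutch code: the class name CrawlPlan and the method steps are made up for this example, and the strings are simply the bin/nutch command names from the list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists the crawl commands for a given depth, in order. */
class CrawlPlan {

    /** depth is the number of generate/fetch/parse/updatedb rounds to run. */
    static List<String> steps(int depth) {
        List<String> plan = new ArrayList<>();
        plan.add("inject");                  // step 2: seed URLs go into the crawldb
        for (int round = 0; round < depth; round++) {
            plan.add("generate");            // step 3: build the fetch list
            plan.add("fetch");               // step 4.1: download the pages
            plan.add("parse");               // step 4.2: parse the downloaded pages
            plan.add("updatedb");            // step 5: write results back to the crawldb
        }
        plan.add("invertlinks");             // step 7: update the linkdb from the segments
        plan.add("index");                   // step 8: index, for example into Solr
        return plan;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" -> ", steps(2)));
    }
}
```

Only steps 3 to 5 sit inside the loop; inject runs once at the start, and invertlinks and index run once after the last round.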
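An architecture diagram drawn by the Nutch author. It shows the architecture of an older version, from before the full-text search functionality was split out of Nutch.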
4. The launcher classes
The main launcher classes are as follows:
| Operation | Class in Nutch 1.x (i.e. trunk) | Class in Nutch 2.x |
|---|---|---|
| inject | org.apache.nutch.crawl.Injector | org.apache.nutch.crawl.InjectorJob |
| generate | org.apache.nutch.crawl.Generator | org.apache.nutch.crawl.GeneratorJob |
| fetch | org.apache.nutch.fetcher.Fetcher | org.apache.nutch.fetcher.FetcherJob |
| parse | org.apache.nutch.parse.ParseSegment | org.apache.nutch.parse.ParserJob |
| updatedb | org.apache.nutch.crawl.CrawlDb | org.apache.nutch.crawl.DbUpdaterJob |
| invertlinks | org.apache.nutch.crawl.LinkDb | ??? |
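5. The Nutch sh script
Looking at Nutch's sh script, you can see that the nutch script ultimately just calls a specific launcher class to do the work.
The excerpt below is taken from the script. Each COMMAND maps to a launcher class, and the command-line arguments are then passed on to that class.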
# figure out which class to run
if [ "$COMMAND" = "crawl" ] ; then
  echo "Command $COMMAND is deprecated, please use bin/crawl instead"
  exit -1
elif [ "$COMMAND" = "inject" ] ; then
  CLASS=org.apache.nutch.crawl.Injector
elif [ "$COMMAND" = "generate" ] ; then
  CLASS=org.apache.nutch.crawl.Generator
elif [ "$COMMAND" = "freegen" ] ; then
  CLASS=org.apache.nutch.tools.FreeGenerator
elif [ "$COMMAND" = "fetch" ] ; then
  CLASS=org.apache.nutch.fetcher.Fetcher
elif [ "$COMMAND" = "parse" ] ; then
  CLASS=org.apache.nutch.parse.ParseSegment
elif [ "$COMMAND" = "readdb" ] ; then
  CLASS=org.apache.nutch.crawl.CrawlDbReader
elif [ "$COMMAND" = "mergedb" ] ; then
  CLASS=org.apache.nutch.crawl.CrawlDbMerger
elif [ "$COMMAND" = "readlinkdb" ] ; then
  CLASS=org.apache.nutch.crawl.LinkDbReader
elif [ "$COMMAND" = "readseg" ] ; then
  CLASS=org.apache.nutch.segment.SegmentReader
elif [ "$COMMAND" = "mergesegs" ] ; then
  CLASS=org.apache.nutch.segment.SegmentMerger
elif [ "$COMMAND" = "updatedb" ] ; then
  CLASS=org.apache.nutch.crawl.CrawlDb
elif [ "$COMMAND" = "invertlinks" ] ; then
  CLASS=org.apache.nutch.crawl.LinkDb
elif [ "$COMMAND" = "mergelinkdb" ] ; then
  CLASS=org.apache.nutch.crawl.LinkDbMerger
elif [ "$COMMAND" = "dump" ] ; then
  CLASS=org.apache.nutch.tools.FileDumper
elif [ "$COMMAND" = "commoncrawldump" ] ; then
  CLASS=org.apache.nutch.tools.CommonCrawlDataDumper
elif [ "$COMMAND" = "solrindex" ] ; then
  CLASS="org.apache.nutch.indexer.IndexingJob -D solr.server.url=$1"
  shift
elif [ "$COMMAND" = "index" ] ; then
  CLASS=org.apache.nutch.indexer.IndexingJob
elif [ "$COMMAND" = "solrdedup" ] ; then
  echo "Command $COMMAND is deprecated, please use dedup instead"
  exit -1
elif [ "$COMMAND" = "dedup" ] ; then
  CLASS=org.apache.nutch.crawl.DeduplicationJob
elif [ "$COMMAND" = "solrclean" ] ; then
  CLASS="org.apache.nutch.indexer.CleaningJob -D solr.server.url=$2 $1"
  shift; shift
elif [ "$COMMAND" = "clean" ] ; then
  CLASS=org.apache.nutch.indexer.CleaningJob
elif [ "$COMMAND" = "parsechecker" ] ; then
  CLASS=org.apache.nutch.parse.ParserChecker
elif [ "$COMMAND" = "indexchecker" ] ; then
  CLASS=org.apache.nutch.indexer.IndexingFiltersChecker
elif [ "$COMMAND" = "filterchecker" ] ; then
  CLASS=org.apache.nutch.net.URLFilterChecker
elif [ "$COMMAND" = "normalizerchecker" ] ; then
  CLASS=org.apache.nutch.net.URLNormalizerChecker
elif [ "$COMMAND" = "domainstats" ] ; then
  CLASS=org.apache.nutch.util.domain.DomainStatistics
elif [ "$COMMAND" = "protocolstats" ] ; then
  CLASS=org.apache.nutch.util.ProtocolStatusStatistics
elif [ "$COMMAND" = "crawlcomplete" ] ; then
  CLASS=org.apache.nutch.util.CrawlCompletionStats
elif [ "$COMMAND" = "webgraph" ] ; then
  CLASS=org.apache.nutch.scoring.webgraph.WebGraph
elif [ "$COMMAND" = "linkrank" ] ; then
  CLASS=org.apache.nutch.scoring.webgraph.LinkRank
elif [ "$COMMAND" = "scoreupdater" ] ; then
  CLASS=org.apache.nutch.scoring.webgraph.ScoreUpdater
elif [ "$COMMAND" = "nodedumper" ] ; then
  CLASS=org.apache.nutch.scoring.webgraph.NodeDumper
elif [ "$COMMAND" = "plugin" ] ; then
  CLASS=org.apache.nutch.plugin.PluginRepository
elif [ "$COMMAND" = "junit" ] ; then
  CLASSPATH="$CLASSPATH:$NUTCH_HOME/test/classes/"
  if $local; then
    for f in "$NUTCH_HOME"/test/lib/*.jar; do
      CLASSPATH="${CLASSPATH}:$f";
    done
  fi
  CLASS=org.junit.runner.JUnitCore
elif [ "$COMMAND" = "startserver" ] ; then
  CLASS=org.apache.nutch.service.NutchServer
elif [ "$COMMAND" = "webapp" ] ; then
  CLASS=org.apache.nutch.webui.NutchUiServer
elif [ "$COMMAND" = "warc" ] ; then
  CLASS=org.apache.nutch.tools.warc.WARCExporter
elif [ "$COMMAND" = "updatehostdb" ] ; then
  CLASS=org.apache.nutch.hostdb.UpdateHostDb
elif [ "$COMMAND" = "readhostdb" ] ; then
  CLASS=org.apache.nutch.hostdb.ReadHostDb
elif [ "$COMMAND" = "sitemap" ] ; then
  CLASS=org.apache.nutch.util.SitemapProcessor
elif [ "$COMMAND" = "showproperties" ] ; then
  CLASS=org.apache.nutch.tools.ShowProperties
else
  CLASS=$COMMAND
fi
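The same dispatch idea can be written as a lookup table. The sketch below is my own illustration in Java, not part of Nutch: the class CommandDispatch and its resolve method are hypothetical, and only a handful of the commands from the excerpt are included. Like the final else branch of the script, an unknown command is treated as a class name.

```java
import java.util.Map;

/** Hypothetical sketch of the command-to-class lookup done by the nutch script. */
class CommandDispatch {

    // A few entries taken from the script excerpt; the real script has many more.
    private static final Map<String, String> CLASSES = Map.of(
        "inject", "org.apache.nutch.crawl.Injector",
        "generate", "org.apache.nutch.crawl.Generator",
        "fetch", "org.apache.nutch.fetcher.Fetcher",
        "parse", "org.apache.nutch.parse.ParseSegment",
        "updatedb", "org.apache.nutch.crawl.CrawlDb",
        "invertlinks", "org.apache.nutch.crawl.LinkDb");

    /** Returns the launcher class for a command, or the command itself if it is not in the table. */
    static String resolve(String command) {
        return CLASSES.getOrDefault(command, command);
    }

    public static void main(String[] args) {
        System.out.println(resolve("inject"));
        System.out.println(resolve("com.example.MyTool")); // made-up class name, falls through unchanged
    }
}
```

This is also why the IDEA run configurations in the following sections only need the class name and the arguments: the script adds nothing beyond choosing the class and setting up the classpath.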
6. Running Injector
The main function for inject is in the Injector class of the org.apache.nutch.crawl package.
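6.1 Configuration
To run inject, first add a plugin.folders setting to apache-nutch-1.18/conf/nutch-site.xml to override the default relative path. This is needed because the working directory when running directly from source is different from the one used by the nutch script.
<property>
  <name>plugin.folders</name>
  <value>/home/liangwy/IdeaProjects/apache-nutch-1.18/src/plugin</value>
  <description>Directories where nutch plugins are located. Each element may be a relative or absolute path. If absolute, it is used as is. If relative, it is searched for on the classpath.</description>
</property>
6.2 Creating a URL list
mkdir urls
touch urls/seeds.txt
vim urls/seeds.txt   # then enter the first batch of URLs to crawl
6.3 Creating the run configuration in IDEA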
From the main menu choose Run -> Edit Configurations, click the + button, and create an Application:
Name: Injector
Main Class: org.apache.nutch.crawl.Injector (the main class in 1.x; check the source for the exact name, in 2.x it is called InjectorJob)
VM options: -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
Program arguments: /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/crawldb /home/liangwy/apache-nutch-1.18/urls (the directory holding the seed URL file seeds.txt)
Note: Program arguments takes the same arguments you would pass to the nutch script.
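6.4 Equivalent command
Running this configuration is equivalent to running the nutch command in the bin/ directory of the Nutch binary distribution:
./nutch inject /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/crawldb /home/liangwy/apache-nutch-1.18/urls
7. Analysis of the Injector main function
The main function of Injector is as follows: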
public static void main(String[] args) throws Exception {
  int res = ToolRunner.run(NutchConfiguration.create(), new Injector(), args);
  System.exit(res);
}
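Injector is run through ToolRunner. Opening ToolRunner's run function shows that the method ultimately invoked is Injector's own run function.
Method parameters:
Configuration conf   # the Nutch configuration
Tool tool            # the tool class to run (e.g. Injector, Generator)
String[] args        # the command-line arguments passed to the tool class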
public static int run(Configuration conf, Tool tool, String[] args) throws Exception {
  if (CallerContext.getCurrent() == null) {
    CallerContext ctx = (new Builder("CLI")).build();
    CallerContext.setCurrent(ctx);
  }
  if (conf == null) {
    conf = new Configuration();
  }
  // parse the generic options into the configuration
  GenericOptionsParser parser = new GenericOptionsParser(conf, args);
  tool.setConf(conf);
  String[] toolArgs = parser.getRemainingArgs();
  // the actual work is still done by the tool's own run method
  return tool.run(toolArgs);
}
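To make the pattern easier to see in isolation, here is a minimal sketch that imitates it without any Hadoop dependency. Everything in it is a stand-in I made up for illustration: MiniTool plays the role of Tool, a plain Map plays the role of Configuration, and the option handling only understands the "-D key=value" form, which is far less than the real GenericOptionsParser does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for the ToolRunner pattern; these are not Hadoop's real classes. */
class MiniToolRunner {

    /** Hypothetical, cut-down version of the Tool interface. */
    interface MiniTool {
        void setConf(Map<String, String> conf);
        int run(String[] args);
    }

    /** Example tool that only records what it receives. */
    static class RecordingTool implements MiniTool {
        Map<String, String> conf;
        String[] args;

        @Override
        public void setConf(Map<String, String> conf) { this.conf = conf; }

        @Override
        public int run(String[] args) {
            this.args = args;
            return 0; // 0 signals success, like a process exit code
        }
    }

    /** Moves "-D key=value" pairs into conf, then hands the remaining args to the tool. */
    static int run(Map<String, String> conf, MiniTool tool, String[] args) {
        List<String> remaining = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            boolean isGeneric = "-D".equals(args[i])
                && i + 1 < args.length && args[i + 1].contains("=");
            if (isGeneric) {
                String[] kv = args[++i].split("=", 2);
                conf.put(kv[0], kv[1]);
            } else {
                remaining.add(args[i]);
            }
        }
        tool.setConf(conf);
        return tool.run(remaining.toArray(new String[0]));
    }

    public static void main(String[] args) {
        RecordingTool tool = new RecordingTool();
        Map<String, String> conf = new HashMap<>();
        int exitCode = run(conf, tool,
            new String[] {"-D", "http.agent.name=demo", "crawldb", "urls"});
        System.out.println(exitCode + " " + conf + " " + Arrays.toString(tool.args));
    }
}
```

The point of the split is that every tool gets its configuration in the same way, so a class like Injector only has to deal with its own arguments, here the crawldb path and the urls directory.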
8. Running Generator
8.1 Creating the run configuration in IDEA
From the main menu choose Run -> Edit Configurations, click the + button, and create an Application:
Name: Generator
Main Class: org.apache.nutch.crawl.Generator
Program arguments: /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/crawldb /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments -topN 100
Note: Program arguments takes the same arguments you would pass to the nutch script.
8.2 Equivalent command
Running this configuration is equivalent to running the nutch command in the bin/ directory of the Nutch binary distribution:
./nutch generate /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/crawldb /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments -topN 100
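As I understand it, -topN limits the fetch list to the N highest-scoring URLs. The sketch below shows only that cutoff, on a plain map of URL to score. It is an assumption-laden illustration, not the Generator's actual logic, which also takes the fetch schedule and other settings into account; the class TopNSketch and the sample URLs are made up.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical sketch of a top-N cutoff by score; not the real Generator logic. */
class TopNSketch {

    /** Returns the n URLs with the highest scores; ties are broken by URL for a stable order. */
    static List<String> topN(Map<String, Double> scores, int n) {
        Comparator<Map.Entry<String, Double>> byScoreDesc =
            Map.Entry.<String, Double>comparingByValue().reversed();
        Comparator<Map.Entry<String, Double>> order =
            byScoreDesc.thenComparing(Map.Entry.<String, Double>comparingByKey());
        return scores.entrySet().stream()
            .sorted(order)
            .limit(n)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Double> scores = Map.of(
            "http://example.com/a", 1.0,
            "http://example.com/b", 3.0,
            "http://example.com/c", 2.0);
        System.out.println(topN(scores, 2));
    }
}
```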
9. Running Fetcher
9.1 Creating the run configuration in IDEA
From the main menu choose Run -> Edit Configurations, click the + button, and create an Application:
Name: Fetcher
Main Class: org.apache.nutch.fetcher.Fetcher
Program arguments: /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments/20220114175955 -threads 16
Note: Program arguments takes the same arguments you would pass to the nutch script.
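The segment directory in the arguments above, 20220114175955, is the one the generate step created on my machine, so yours will have a different name. The name reads as a timestamp in the yyyyMMddHHmmss pattern. Assuming that pattern, a small hypothetical helper (SegmentNameSketch is not a Nutch class) can convert between the name and a date:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper that treats a segment name as a yyyyMMddHHmmss timestamp. */
class SegmentNameSketch {

    // Assumed pattern, inferred from the look of the directory name.
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    /** Parses a segment directory name into a date and time. */
    static LocalDateTime parse(String segmentName) {
        return LocalDateTime.parse(segmentName, FORMAT);
    }

    /** Formats a date and time the way the segment directory is named. */
    static String format(LocalDateTime time) {
        return time.format(FORMAT);
    }

    public static void main(String[] args) {
        System.out.println(parse("20220114175955"));
        System.out.println(format(LocalDateTime.now()));
    }
}
```

Because the names sort in time order, the most recently generated segment is the last one when the segments directory is listed alphabetically.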
9.2 Error analysis
The run fails because http.agent.name is not configured. This property can be set in conf/nutch-site.xml.
Fetcher: No agents listed in 'http.agent.name' property.
Fetcher: java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:563)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:431)
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:545)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:518)
9.3 Configuring http.agent.name
Add the following configuration to the conf/nutch-site.xml file:
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.43</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  NOTE: You should also check other related properties:
  http.robots.agents
  http.agent.description
  http.agent.url
  http.agent.email
  http.agent.version
  and set their values appropriately.
  </description>
</property>
<property>
  <name>http.robots.agents</name>
  <value>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.43,*</value>
</property>
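The stack trace in 9.2 shows that the exception comes from a configuration check that runs before fetching starts. The sketch below imitates that kind of check on a plain Map. It is a guess at the shape of the logic, not the source of Fetcher.checkConfiguration, and the class AgentNameCheck is hypothetical; only the error message is copied from the log.

```java
import java.util.Map;

/** Hypothetical sketch of a startup check for http.agent.name; not the real Fetcher code. */
class AgentNameCheck {

    /** True when http.agent.name is present and not blank. */
    static boolean isConfigured(Map<String, String> conf) {
        String agent = conf.get("http.agent.name");
        return agent != null && !agent.trim().isEmpty();
    }

    /** Fails fast with the same message seen in the log when the property is missing. */
    static void check(Map<String, String> conf) {
        if (!isConfigured(conf)) {
            throw new IllegalArgumentException(
                "Fetcher: No agents listed in 'http.agent.name' property.");
        }
    }

    public static void main(String[] args) {
        check(Map.of("http.agent.name", "MyCrawler")); // passes silently
        try {
            check(Map.of());                           // empty config, so the check fails
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing before any request is sent makes sense here, since every page fetch would otherwise go out without a usable User-Agent header.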
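9.4 Equivalent command
Running this configuration is equivalent to running the nutch command in the bin/ directory of the Nutch binary distribution:
./nutch fetch /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments/20220114175955 -threads 16
10. Running ParseSegment
10.1 Creating the run configuration in IDEA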
From the main menu choose Run -> Edit Configurations, click the + button, and create an Application:
Name: ParseSegment
Main Class: org.apache.nutch.parse.ParseSegment
Program arguments: /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments/20220114175955
Note: Program arguments takes the same arguments you would pass to the nutch script.
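10.2 Equivalent command
Running this configuration is equivalent to running the nutch command in the bin/ directory of the Nutch binary distribution:
./nutch parse /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments/20220114175955
11. Running CrawlDb
11.1 Creating the run configuration in IDEA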
From the main menu choose Run -> Edit Configurations, click the + button, and create an Application:
Name: CrawlDb
Main Class: org.apache.nutch.crawl.CrawlDb
Program arguments: /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/crawldb/ -dir /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments
Note: Program arguments takes the same arguments you would pass to the nutch script.
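Note that the arguments above pass the parent segments directory with -dir instead of a single segment path. My understanding is that -dir means "use every segment found under this directory". The sketch below shows what that expansion could look like; SegmentDirs is a made-up helper, and the demo builds a temporary directory with two example segment names rather than touching a real crawl.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch of expanding a -dir argument into its segment directories. */
class SegmentDirs {

    /** Returns the names of all subdirectories of segmentsDir, sorted by name. */
    static List<String> list(Path segmentsDir) {
        try (Stream<Path> children = Files.list(segmentsDir)) {
            return children
                .filter(p -> Files.isDirectory(p))      // plain files are not segments
                .map(p -> p.getFileName().toString())
                .sorted()
                .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a throwaway segments directory and lists it; the names are examples only. */
    static List<String> demo() {
        try {
            Path root = Files.createTempDirectory("segments");
            Files.createDirectory(root.resolve("20220115101010"));
            Files.createDirectory(root.resolve("20220114175955"));
            Files.createFile(root.resolve("notes.txt"));
            return list(root);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```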
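11.2 Equivalent command
Running this configuration is equivalent to running the nutch command in the bin/ directory of the Nutch binary distribution:
./nutch updatedb /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/crawldb/ -dir /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments
12. Running LinkDb
12.1 Creating the run configuration in IDEA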
From the main menu choose Run -> Edit Configurations, click the + button, and create an Application:
Name: LinkDb
Main Class: org.apache.nutch.crawl.LinkDb
Program arguments: /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/linkdb -dir /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments/
Note: Program arguments takes the same arguments you would pass to the nutch script.
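12.2 Equivalent command
Running this configuration is equivalent to running the nutch command in the bin/ directory of the Nutch binary distribution:
./nutch invertlinks /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/linkdb -dir /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments/
Next chapter
The next chapter shows how to combine these steps into one run.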



