MapReduce Word Count


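MapReduce Word Count

I. Setting up the environment
II. Writing the WordCount program
- 1. Add the required jars in IDEA (IDEA supports adding a whole folder)
- 2. Write the code
- 3. Package the code
- 4. Debug the program
Summary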

I. Setting up the environment

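Install the JDK and IDEA on Windows.
Install Linux in a VM.
Configure the JDK: Hadoop is written in Java, and every Java application needs a Java runtime, so the JDK must be set up on the Linux server as well.
Upload the JDK archive to the Linux server and extract it.
Configure the global JDK environment variables.
Install Hadoop: upload the Hadoop archive to the Linux server and extract it, then configure the global Hadoop environment variables.
Pseudo-distributed configuration: edit the core configuration files hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.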

II. Writing the WordCount program

1. Add the required jars in IDEA (IDEA supports adding a whole folder)

2. Write the code

1. Write the Mapper
2. Write the Reducer
3. Write the main entry point

The Mapper code is as follows:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input:  <byte offset of the line, line text>
// Output: <word, 1>
public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

   @Override
   protected void map(LongWritable key, Text value, Context context)
         throws IOException, InterruptedException {

      String data = value.toString();
      // Split the line into words
      String[] words = data.split(" ");

      // Emit each word with a count of 1
      for (String w : words) {
         context.write(new Text(w), new LongWritable(1));
      }
   }

}
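
The map step can be tried locally without a cluster. The following is a minimal sketch in plain Java, not the Hadoop API: the class `MapPhaseSketch` and its `map` method are hypothetical names introduced only for illustration. It also shows one possible variation: the Mapper above splits on a single space, so a line with consecutive spaces yields empty strings that get counted as words, whereas this sketch splits on runs of whitespace and skips empty tokens.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Local illustration only (hypothetical class): mimics the (word, 1) pairs
// that a word count mapper emits for a single input line.
class MapPhaseSketch {

    // Split on runs of whitespace and emit one (word, 1) pair per token.
    static List<Map.Entry<String, Long>> map(String line) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String w : line.trim().split("\\s+")) {
            if (!w.isEmpty()) { // a blank line splits into one empty token
                pairs.add(new AbstractMap.SimpleImmutableEntry<>(w, 1L));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Two spaces between "hello" and "world" on purpose
        System.out.println(map("hello  world hello"));
        // prints [hello=1, world=1, hello=1]
    }
}
```

Whether to adopt the `\\s+` split in the real Mapper depends on the input data; for text with single spaces only, both variants behave the same.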

The main entry point is as follows:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountMain {

   public static void main(String[] args) throws Exception {
      // A job = map + reduce; start from the default configuration
      Configuration conf = new Configuration();

      // Create the job
      Job job = Job.getInstance(conf);
      // Specify the class that marks the job's jar (the entry point)
      job.setJarByClass(WordCountMain.class);

      // Specify the mapper and its output types
      job.setMapperClass(WordCountMapper.class);
      job.setMapOutputKeyClass(Text.class);
      job.setMapOutputValueClass(LongWritable.class);

      // Specify the reducer and the final output types
      job.setReducerClass(WordCountReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(LongWritable.class);

      // Specify the input and output paths (args[0] = input, args[1] = output)
      FileInputFormat.setInputPaths(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));

      // Submit the job, wait for it to finish, and report success via the exit code
      System.exit(job.waitForCompletion(true) ? 0 : 1);
   }

}

The Reducer code is as follows:

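```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input:  <word, list of counts emitted by the mappers>
// Output: <word, total count>
public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

   @Override
   protected void reduce(Text k3, Iterable<LongWritable> v3, Context context)
         throws IOException, InterruptedException {
      // v3 is a collection; each element is a v2 value emitted by the mapper for this key
      long total = 0;
      for (LongWritable l : v3) {
         total = total + l.get();
      }

      // Emit the word with its total count
      context.write(k3, new LongWritable(total));
   }

}
```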
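
To see what the Reducer receives, the grouping that the framework performs between map and reduce can be imitated locally. This is a rough sketch in plain Java, not the Hadoop API: `ReducePhaseSketch`, `shuffle`, and `reduce` are hypothetical names for illustration, and the sorted `TreeMap` only approximates the fact that keys reach the reducer in sorted order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local illustration only (hypothetical class): imitates grouping by key
// followed by the summing done in a word count reducer.
class ReducePhaseSketch {

    // "Shuffle": collect the 1s emitted for each word under that word's key.
    static Map<String, List<Long>> shuffle(List<String> words) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (String w : words) {
            grouped.computeIfAbsent(w, k -> new ArrayList<>()).add(1L);
        }
        return grouped;
    }

    // "Reduce": sum all values that belong to one key.
    static long reduce(Iterable<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, List<Long>> grouped =
                shuffle(Arrays.asList("hello", "world", "hello"));
        for (Map.Entry<String, List<Long>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getValue()));
        }
        // prints "hello	2" then "world	1" (tab-separated)
    }
}
```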

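3. Package the code

Package the WordCount Java program as a jar and upload it to the pseudo-distributed environment to run it.

4. Debug the program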

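Create a text file whose words will be counted.
Edit the file and paste in the article that serves as the data source.
Upload the file to HDFS.
Change to the directory that contains the jar and run the word count job with the hadoop jar command, passing the HDFS input path and the output path as arguments.
View the output once the job has finished.
The results can be viewed from the command line in the virtual machine.
They can also be viewed in the HDFS web management UI.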
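
Since WordCountMain does not set an output format, the job should fall back to Hadoop's default text output, which writes one `word<TAB>count` line per key into files such as part-r-00000 in the output directory. As a small sketch of how such lines could be read back for checking, here is a plain Java parser; `OutputLineSketch` is a hypothetical name and the sample lines are made up, not real job output.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Local illustration only (hypothetical class): parses lines laid out as
// "word<TAB>count", the layout assumed for the job's text output.
class OutputLineSketch {

    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t", 2); // key, then value
            if (parts.length == 2) {
                counts.put(parts[0], Long.parseLong(parts[1].trim()));
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines, not taken from an actual run
        List<String> sample = Arrays.asList("hello\t2", "world\t1");
        System.out.println(parse(sample)); // prints {hello=2, world=1}
    }
}
```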

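Summary

Word count is one of the simplest programs and also one of the clearest illustrations of the MapReduce model, which is why it is often called the "Hello World" of MapReduce. Its job is to count how many times each word occurs across a set of text files.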
