MapReduce_Big Data Systems

MapReduce

Advantages and disadvantages of MapReduce
- Advantages
- Disadvantages
WordCount example
- 1. Requirements
- 2. Environment setup
- 3. Mapper
- 4. Reducer
- 5. Driver
Serialization
- 1. Concept
- Features

Writing as I learn

Advantages and disadvantages of MapReduce

Advantages

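Easy to program
By implementing a few interfaces you get a complete distributed program, and that program can run across a large number of inexpensive PCs. In other words, writing a distributed program is essentially the same as writing a serial one. This is what made Hadoop programming so popular.
Good scalability
When computing resources run short, you can extend the cluster's computing capacity simply by adding machines.
High fault tolerance
MapReduce was designed to run on inexpensive machines, which requires high fault tolerance. For example, if one machine goes down, the computing tasks on it are moved to another node so that the job does not fail. This needs no manual intervention and is handled entirely inside Hadoop.
Suited to offline processing of massive data sets, at the PB level and above
Clusters of more than a thousand servers can work concurrently to provide the data processing capacity.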

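Disadvantages

Not good at real-time computing
MapReduce cannot return results within milliseconds or seconds the way MySQL can.
Not good at stream computing
Stream computing takes dynamic input, whereas the input data set of MapReduce is static and cannot change while the job runs. MapReduce's own design requires the data source to be static.
Not good at DAG (directed acyclic graph) computing
Here several applications depend on one another: the input of each application is the output of the one before it. MapReduce can still be used in this situation, but every MapReduce job writes its output to disk, which causes heavy disk I/O and poor performance.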
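The disk round trip between dependent jobs can be illustrated without Hadoop. The following is only a rough sketch under stated assumptions: `ChainedJobsSketch`, `countWords`, `mostFrequentWord`, and `run` are hypothetical names, and a local temporary file stands in for the output that a real job would write to the cluster's file system.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: two dependent "jobs" that only communicate through a file.
final class ChainedJobsSketch {

    // Job 1: count words and write "word<TAB>count" records to disk.
    static void countWords(List<String> lines, Path output) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        List<String> records = new ArrayList<>();
        counts.forEach((word, count) -> records.add(word + "\t" + count));
        try {
            Files.write(output, records);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Job 2: read job 1's records back from disk and pick the most frequent word.
    static String mostFrequentWord(Path input) {
        try {
            String best = null;
            int bestCount = -1;
            for (String record : Files.readAllLines(input)) {
                String[] fields = record.split("\t");
                int count = Integer.parseInt(fields[1]);
                if (count > bestCount) {
                    best = fields[0];
                    bestCount = count;
                }
            }
            return best;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Runs both jobs; the intermediate result exists only as a file between them.
    static String run(List<String> lines) {
        try {
            Path intermediate = Files.createTempFile("job1-output", ".txt");
            try {
                countWords(lines, intermediate);
                return mostFrequentWord(intermediate);
            } finally {
                Files.deleteIfExists(intermediate);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("hello world", "hello hadoop")));
    }
}
```

Every extra job in the chain adds one more write and one more read of the full intermediate result, which is the cost described above.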

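WordCount example

1. Requirements

Count and output the total number of occurrences of each word in a given text file.

To meet this requirement, write a Mapper, a Reducer, and a Driver.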
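Before turning to Hadoop, it may help to see the data flow this requirement implies. The following is a minimal in-memory sketch, not the Hadoop API: `WordCountFlowSketch` and its methods are hypothetical names, and the `shuffle` step only imitates the grouping by key that the framework performs between the Mapper and the Reducer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Hadoop-free illustration of the WordCount data flow.
final class WordCountFlowSketch {

    // Map phase: turn one line into a list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split(" ")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: collect the values of identical keys together, sorted by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the grouped values of one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs all three phases over the input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(pairs).entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, world=1}
        System.out.println(wordCount(List.of("hello world", "hello hadoop")));
    }
}
```

In the real job the Mapper and Reducer correspond to `map` and `reduce` here, while grouping and sorting are done by the framework rather than by user code.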

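2. Environment setup

Use IDEA + JDK 17 + the bundled Maven.

Create the project
Open IDEA, create a new project, and choose Maven.

IDEA automatically downloads the Maven-related files at this point; progress is shown in the lower-right corner. Wait for the download to finish.
Import the dependencies
Paste the following dependencies into pom.xml (Maven reads dependencies from pom.xml, not from the .iml file).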

 
    
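<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>RELEASE</version>
    </dependency>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-core</artifactId>
        <version>2.8.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.2</version>
    </dependency>
</dependencies>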

Configure log4j
Under the main directory, create a resources directory.

Under the resources directory, create a file named log4j.properties
and paste in the following:

log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d	%p	[%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d       %p      [%c] - %m%n

Under main/java, create the package com.mapreduce.wordcount.

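3. Mapper

Purpose: split the input and generate key-value pairs
Create a Java class, WordCountMapper

Extend the Mapper parent class, and read through Mapper's source code.
Be sure to write it out yourself once.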

package com.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;


public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Output key and value objects, reused for every record; the value is always 1
    Text k = new Text();
    IntWritable v = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1. Convert the Text to a String
        String line = value.toString();
        // 2. Split on spaces
        String[] words = line.split(" ");
        // 3. Emit (word, 1) for each word
        for (String word : words) {
            k.set(word);
            context.write(k, v);
        }
    }
}
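One detail of the Mapper worth checking is how `split(" ")` treats irregular whitespace. The sketch below is an illustration only: `TokenizeSketch` and both method names are hypothetical, and the whitespace-based alternative is a suggestion rather than part of the original Mapper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that contrasts two ways of splitting a line into words.
final class TokenizeSketch {

    // Same rule as the Mapper: a single space is the delimiter.
    static List<String> splitOnSingleSpace(String line) {
        return Arrays.asList(line.split(" "));
    }

    // Alternative (an assumption, not in the original): split on any run of
    // whitespace and drop empty tokens.
    static List<String> splitOnWhitespace(String line) {
        List<String> words = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Two spaces between the first two words, a tab before the third.
        String line = "hello  world\thadoop";
        System.out.println(splitOnSingleSpace(line).size()); // prints 3, including one empty token
        System.out.println(splitOnWhitespace(line).size());  // prints 3, all real words
    }
}
```

With the single-space rule, a doubled space produces an empty string that would be counted as a word, and a tab is not treated as a separator at all. Whether that matters depends on how clean the input file is.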

4. Reducer

Purpose: aggregate and count
Create a Java class, wordcountReduce
Extend the parent class Reducer

package com.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;


public class wordcountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Output value object, reused for every key
    IntWritable v = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // 1. Sum all the counts for this key
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // 2. Write out (word, total)
        v.set(sum);
        context.write(key, v);
    }
}

5. Driver

Eight steps
This is boilerplate: every driver looks the same.

package com.mapreduce.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;


public class WordCount {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // 1. Create the job from a configuration
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2. Set the jar path (the jar is located via this compiled class)
        job.setJarByClass(WordCount.class);

        // 3. Connect the Mapper and the Reducer
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(wordcountReduce.class);

        // 4. Set the key/value types of the Mapper output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // 5. Set the key/value types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 6. Set the input path
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        // 7. Set the output path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 8. Submit the job and wait for it to finish
        boolean res = job.waitForCompletion(true);
        System.exit(res ? 0 : 1);
    }
}

Package the project to get a wordcount jar, then upload it to the virtual machine to run it.

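Serialization

1. Concept

Features

Compact: uses storage space efficiently
Fast: low overhead for reading and writing data
Extensible: can be upgraded as the communication protocol is upgraded
Interoperable: supports interaction between multiple languages (data exchange)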
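The "compact" property can be made concrete with a small comparison. Hadoop's Writable interface declares the two methods `write(DataOutput)` and `readFields(DataInput)`; the sketch below borrows only those method names and does not depend on Hadoop. `WordCountRecord` and its helper methods are hypothetical, and the comparison against `java.io.Serializable` is an illustration rather than a benchmark.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical record serialized field by field, in the style of a Writable.
final class WordCountRecord implements Serializable {
    private static final long serialVersionUID = 1L;

    String word = "";
    int count;

    WordCountRecord() {}

    WordCountRecord(String word, int count) {
        this.word = word;
        this.count = count;
    }

    // Write only the field values, with no class metadata.
    void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(count);
    }

    // Read the fields back in exactly the order they were written.
    void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        count = in.readInt();
    }

    static byte[] toCompactBytes(WordCountRecord record) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            record.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static WordCountRecord fromCompactBytes(byte[] bytes) {
        WordCountRecord record = new WordCountRecord();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            record.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return record;
    }

    // Standard Java serialization of the same object, for size comparison.
    static byte[] toJavaSerializedBytes(WordCountRecord record) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(record);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        WordCountRecord record = new WordCountRecord("hello", 2);
        System.out.println("field-by-field bytes: " + toCompactBytes(record).length);
        System.out.println("java.io.Serializable bytes: " + toJavaSerializedBytes(record).length);
    }
}
```

The field-by-field form for ("hello", 2) takes 11 bytes: a 2-byte length prefix, 5 bytes of text, and a 4-byte int. Standard Java serialization of the same object is larger because it also writes a stream header and a class description.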
