MapReduce-Map Join（面试问题）

个人学习整理，所有资料来自尚硅谷
B站学习连接：添加链接描述

MapReduce-Map Join（面试问题） 1. 问题：

Reduce Join中，在map阶段，map函数同时读取两个文件File1和File2，为了区分两种来源的key/value数据对，对每条数据打一个标签flag，比如flag=0表示来自文件File1，flag=1表示来自文件File2。如下：

Reduce Join的Mapper类

package com.atguigu.mapreduce.reduceJoin;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

public class TableMapper extends Mapper {

    private String fileName;
    private Text outK = new Text();
    private TableBean outV = new TableBean();

    @Override
    protected void setup(Mapper.Context context) throws IOException, InterruptedException {
        FileSplit split = (FileSplit) context.getInputSplit();
        //在Setup获取fileName，默认切片规则：一个文件一个切片。因此一个文件进入之后有一个setup方法，一个map方法
        //若不是用setup方法，则每一行都会获取当前文件的名称
        //fileName后续要使用，要设置为全局变量
        fileName = split.getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Mapper.Context context) throws IOException, InterruptedException {
        //1.获取一行
        String line = value.toString();

        //2.判断是哪个文件
        if (fileName.contains("order")){//处理的是order表
            String[] split = line.split("t");
            //3.封装kv
            //order表字段：id pid amount,key:pid,value:TableBean
            outK.set(split[1]);
            outV.setId(split[0]);
            outV.setPid(split[1]);
            outV.setAmount(Integer.parseInt(split[2]));
            outV.setPname("");
            outV.setFlag("order");
        }else{//处理的是pd表
            String[] split = line.split("t");
            //3.封装kv
            //pd表字段：pid pname,key:pid,value:TableBean
            outK.set(split[0]);
            outV.setId("");
            outV.setPid(split[0]);
            outV.setAmount(0);
            outV.setPname(split[1]);
            outV.setFlag("pd");
        }
        //写出
        context.write(outK,outV);
    }
}

缺点：这种方式种，合并操作是在Reduce阶段完成，Reduce端的处理压力太大，Map节点的运算负载则很低，资源利用率不高，且在Reduce阶段极易产生数据倾斜（即大量数据在Reduce端进行汇总）。

解决方案： Map端实现数据合并。

2. Map join 2.1 使用场景

Map Join适用于一张表十分小（存在内存中，所以要小）、一张表十分大的场景。

2.2 优点

在Map端缓存多张表，提前处理业务逻辑，这样增加Map业务端，减少Reduce端数据的压力，尽可能的减少数据倾斜。

2.3 具体办法

采用DistributedCache。

（1）在Mapper的setup阶段，将文件读取到缓存集合中。

（2）在Driver驱动类中加载缓存

//缓存普通文件到Task运行节点
job.addCacheFile(new URI("file:///e:/cache/pd.txt"));
//如果是集群运行，需要设置HDFS路径
job.addCacheFile(new URI("hdfs://hadoop102:8020/cache/pd.txt"))

3. Map Join案例实操

（1）需求：

数据连接：添加链接描述
提取码：eei5
（2）分析

原先两个文件在切片时是单独切片的，决定了两个文件分别进入的是maptask1和maptask2，不在一个maptask中，两个文件怎么join呢？——将其中一个表pd缓存在内存中，将order表正常加载到maptask1中，执行maptask1的时候，通过内存传入（加入内存中以集合的方式存储{01：小米，02：华为，03：格力}，maptask1中遍历每一行，通过每行中的pid从内存中的集合中获得相应的产品名称，最后拼接字符串并写出）。Map端Join的逻不需要Reduce阶段，设置ReduceTask数量为0。

setup方法中：

获取缓存的文件（小表）

循环读取缓存文件的一行

切割

缓存数据到集合， 01，小米；02，华为；03，格力

关流

map方法中：

获取一行

截取

获取pid

获取订单id和商品名称

拼接

写出

MapJoinMapper类

package com.atguigu.mapreduce.mapjoin;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;

public class MapJoinMapper extends Mapper {
    HashMap pdMap = new HashMap<>();
    private Text outK = new Text();
    @Override
    protected void setup(Mapper.Context context) throws IOException, InterruptedException {
        //获取缓存的文件，并把文件内容封装到集合pd.txt
        URI[] cacheFiles = context.getCacheFiles();
        FileSystem fs = FileSystem.get(context.getConfiguration());//先获取一个fs文件系统
        FSDataInputStream fis = fs.open(new Path(cacheFiles[0]));//取出缓存文件中的值，这里只有一个
        //从流中读取数据
        BufferedReader reader = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
        String line;
        while (StringUtils.isNotEmpty(line=reader.readLine())){
            //切割
            String[] fields = line.split("t");

            //赋值
            pdMap.put(fields[0],fields[1]);//pid,pname
        }
        //关流
        IOUtils.closeStream(reader);
    }

    @Override
    protected void map(LongWritable key, Text value, Mapper.Context context) throws IOException, InterruptedException {
        //处理order.txt:id,pid,amount
        String line = value.toString();
        String[] fields = line.split("t");
        //获取pid所对应的pname
        String pname = pdMap.get(fields[1]);
        //获取id和amount
        //封装
        outK.set(fields[0]+"t"+pname+"t"+ fields[2]);
        context.write(outK,NullWritable.get());
    }
}

MapperJoinDriver类

package com.atguigu.mapreduce.mapjoin;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

public class MapJoinDriver {
    public static void main(String[] args) throws IOException, URISyntaxException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setJarByClass(MapJoinDriver.class);
        job.setMapperClass(MapJoinMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        //注意URI的输入格式
        job.addCacheFile(new URI("file:///D:/downloads/hadoop-3.1.0/data/11_input/tablecache/pd.txt"));
        job.setNumReduceTasks(0);

        FileInputFormat.setInputPaths(job,new Path("D:\downloads\hadoop-3.1.0\data\11_input\inputtable2"));
        FileOutputFormat.setOutputPath(job,new Path("D:\downloads\hadoop-3.1.0\data\output\output13"));

        boolean result = job.waitForCompletion(true);
        System.exit(result?0:1);
    }
}

（3）测试

4. 总结

Reduce Join 缺点：

会出现数据倾斜；

Map Join 优缺点：

缺点：只适合大小表join

优点：不会出现数据倾斜;

Map Join 实现：

将小表数据加入缓存分发到各个计算节点,按连接关键字建立索引；

job.addCacheFile(new URI("file:xxxxxxx));
job.setNumReduceTasks(0);

MapReduce-Map Join（面试问题）

大数据系统相关栏目本月热门文章