1. Input
Read the file.
2. Split
The input is divided into splits; for example, one split per block (block size 128 MB).
3. Map
word -> (word, 1): each word is emitted as a key-value pair.
4. Shuffle
By default, records are partitioned by the hash of the key, so records with the same key are guaranteed to reach the same reduce task, which makes the final aggregation possible.
5. Reduce
Aggregate: here the values for each key are summed.
6. Result
The output is written to a file.
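The six steps above can be sketched as a plain in-memory Java program (this is a simulation of the data flow, not the Hadoop API; the class name WordCountFlow and the sample input are made up for illustration):

```java
import java.util.*;

public class WordCountFlow {

    // runs the word-count pipeline in memory: map -> shuffle -> reduce
    static Map<String, Integer> count(List<String> splits) {
        // 3. map: emit a (word, 1) pair for every word
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String split : splits)
            for (String word : split.split(" "))
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));

        // 4. shuffle: group the pairs by key (Hadoop additionally sorts, and
        //    partitions by hash(key) so equal keys meet at a single reducer)
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());

        // 5. reduce: sum the value list of each key
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            result.put(e.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        // 1-2. input + split: pretend the file arrived as two splits
        List<String> splits = Arrays.asList("hello world", "hello mapreduce");
        // 6. result: print instead of writing a file
        System.out.println(count(splits)); // {hello=2, mapreduce=1, world=1}
    }
}
```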
Java implementation:
1. Map:
public static class MRMaper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private final LongWritable one = new LongWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // receive one line of input
        String line = value.toString();
        // split the line on spaces
        String[] words = line.split(" ");
        for (String word : words) {
            // emit the map result through the context
            context.write(new Text(word), one);
        }
    }
}
2. Reduce:
public static class MRReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable value : values) {
            // add up the occurrences of the word
            sum += value.get();
        }
        // emit the reduce result through the context
        context.write(key, new LongWritable(sum));
    }
}
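To actually run, the two classes above still need a driver that wires them into a Hadoop job. A minimal job-configuration sketch (the class name WordCountDriver and the reducer name MRReducer are assumptions; adjust them to the actual class names in your project):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(MRMaper.class);     // the map class above
        job.setReducerClass(MRReducer.class);  // assumed name of the reduce class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input file/dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir (must not exist yet)

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```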
SQL (find spus to recommend between users who have collected at least two of the same spus):
select t5.t4_id id1, t5.t4_id2 id2, t6.spu_id spu
from (select t4.t1_id t4_id, t4.t2_id t4_id2, count(t4.spu) count_spu
from (select t3.t1_id, t3.t2_id, t3.t1_spu spu
from (select t1.id t1_id,
t1.spu_id t1_spu,
t2.id t2_id,
t2.spu_id t2_spu
from (select t.id, t.spu_id from collect_spu t) t1
inner join collect_spu t2
on t1.id <> t2.id -- i.e. not the same user
where t1.spu_id = t2.spu_id) t3) t4
group by t4.t1_id, t4.t2_id
having count(t4.spu) >= 2) t5
join collect_spu t6
on t5.t4_id = t6.id
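The query's intent can be checked against a tiny in-memory dataset: self-join collect_spu on spu_id with id <> id, keep user pairs sharing at least two spus (the HAVING clause), then join back to pull the first user's spus as candidates. A sketch in plain Java (the class name CollectSpuQuery and the sample rows are made up; only the table name collect_spu comes from the query):

```java
import java.util.*;

public class CollectSpuQuery {

    // one row of collect_spu: a user id and a collected spu id
    static class Row {
        final int id, spuId;
        Row(int id, int spuId) { this.id = id; this.spuId = spuId; }
    }

    // t3..t5: pairs of users (id1, id2) who share at least two collected spus
    static List<int[]> similarPairs(List<Row> collectSpu) {
        // self-join on spu_id with id <> id, counting shared spus per (id1, id2)
        Map<String, Integer> shared = new LinkedHashMap<>();
        for (Row a : collectSpu)
            for (Row b : collectSpu)
                if (a.id != b.id && a.spuId == b.spuId)
                    shared.merge(a.id + "," + b.id, 1, Integer::sum);
        // having count >= 2
        List<int[]> pairs = new ArrayList<>();
        for (Map.Entry<String, Integer> e : shared.entrySet())
            if (e.getValue() >= 2) {
                String[] p = e.getKey().split(",");
                pairs.add(new int[]{Integer.parseInt(p[0]), Integer.parseInt(p[1])});
            }
        return pairs;
    }

    // t5 join t6: for each similar pair, emit (id1, id2, spu) with id1's spus
    static List<int[]> recommendations(List<Row> collectSpu) {
        List<int[]> out = new ArrayList<>();
        for (int[] pair : similarPairs(collectSpu))
            for (Row r : collectSpu)
                if (r.id == pair[0])
                    out.add(new int[]{pair[0], pair[1], r.spuId});
        return out;
    }
}
```

With users 1 and 2 both collecting spus 10 and 20, similarPairs yields both orderings (1,2) and (2,1), exactly as the symmetric self-join in the SQL does.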



