1. Input
Read the file.
2. Split
The input is divided into splits; for example, one split per block (block size 128 MB).
3. Map
word -> (word, 1): each word is emitted as a key-value pair.
4. Shuffle
By default, records are partitioned by the hash of the key, so records with the same key are guaranteed to reach the same reduce task, which makes the final aggregation possible.
5. Reduce
Aggregate: here the values for each key are summed.
6. Result
The output is written to a file.
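The six steps above can be sketched as a plain in-memory Java program (this is a simulation of the data flow, not the Hadoop API; the class name WordCountFlow and the sample input are made up for illustration):

```java
import java.util.*;

public class WordCountFlow {

    // runs the word-count pipeline in memory: map -> shuffle -> reduce
    static Map<String, Integer> count(List<String> splits) {
        // 3. map: emit a (word, 1) pair for every word
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String split : splits)
            for (String word : split.split(" "))
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));

        // 4. shuffle: group the pairs by key (Hadoop additionally sorts, and
        //    partitions by hash(key) so equal keys meet at a single reducer)
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());

        // 5. reduce: sum the value list of each key
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            result.put(e.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        // 1-2. input + split: pretend the file arrived as two splits
        List<String> splits = Arrays.asList("hello world", "hello mapreduce");
        // 6. result: print instead of writing a file
        System.out.println(count(splits)); // {hello=2, mapreduce=1, world=1}
    }
}
```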
Java implementation:
1. Map:
public static class MRMaper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private final LongWritable one = new LongWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // receive one line of input
        String line = value.toString();
        // split the line on spaces
        String[] words = line.split(" ");
        for (String word : words) {
            // emit the map result through the context
            context.write(new Text(word), one);
        }
    }
}
2. Reduce:
public static class MRReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable value : values) {
            // add up the occurrences of the word
            sum += value.get();
        }
        // emit the reduce result through the context
        context.write(key, new LongWritable(sum));
    }
}
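To actually run, the two classes above still need a driver that wires them into a Hadoop job. A minimal job-configuration sketch (the class name WordCountDriver and the reducer name MRReducer are assumptions; adjust them to the actual class names in your project):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(MRMaper.class);     // the map class above
        job.setReducerClass(MRReducer.class);  // assumed name of the reduce class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input file/dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir (must not exist yet)

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```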
SQL (find spus to recommend between users who have collected at least two of the same spus):
select t5.t4_id id1, t5.t4_id2 id2, t6.spu_id spu
from (select t4.t1_id t4_id, t4.t2_id t4_id2, count(t4.spu) count_spu
from (select t3.t1_id, t3.t2_id, t3.t1_spu spu
from (select t1.id t1_id,
t1.spu_id t1_spu,
t2.id t2_id,
t2.spu_id t2_spu
from (select t.id, t.spu_id from collect_spu t) t1
inner join collect_spu t2
on t1.id <> t2.id -- i.e. not the same user
where t1.spu_id = t2.spu_id) t3) t4
group by t4.t1_id, t4.t2_id
having count(t4.spu) >= 2) t5
join collect_spu t6
on t5.t4_id = t6.id
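The query's intent can be checked against a tiny in-memory dataset: self-join collect_spu on spu_id with id <> id, keep user pairs sharing at least two spus (the HAVING clause), then join back to pull the first user's spus as candidates. A sketch in plain Java (the class name CollectSpuQuery and the sample rows are made up; only the table name collect_spu comes from the query):

```java
import java.util.*;

public class CollectSpuQuery {

    // one row of collect_spu: a user id and a collected spu id
    static class Row {
        final int id, spuId;
        Row(int id, int spuId) { this.id = id; this.spuId = spuId; }
    }

    // t3..t5: pairs of users (id1, id2) who share at least two collected spus
    static List<int[]> similarPairs(List<Row> collectSpu) {
        // self-join on spu_id with id <> id, counting shared spus per (id1, id2)
        Map<String, Integer> shared = new LinkedHashMap<>();
        for (Row a : collectSpu)
            for (Row b : collectSpu)
                if (a.id != b.id && a.spuId == b.spuId)
                    shared.merge(a.id + "," + b.id, 1, Integer::sum);
        // having count >= 2
        List<int[]> pairs = new ArrayList<>();
        for (Map.Entry<String, Integer> e : shared.entrySet())
            if (e.getValue() >= 2) {
                String[] p = e.getKey().split(",");
                pairs.add(new int[]{Integer.parseInt(p[0]), Integer.parseInt(p[1])});
            }
        return pairs;
    }

    // t5 join t6: for each similar pair, emit (id1, id2, spu) with id1's spus
    static List<int[]> recommendations(List<Row> collectSpu) {
        List<int[]> out = new ArrayList<>();
        for (int[] pair : similarPairs(collectSpu))
            for (Row r : collectSpu)
                if (r.id == pair[0])
                    out.add(new int[]{pair[0], pair[1], r.spuId});
        return out;
    }
}
```

With users 1 and 2 both collecting spus 10 and 20, similarPairs yields both orderings (1,2) and (2,1), exactly as the symmetric self-join in the SQL does.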



