Contents
1: Flume Transactions
2: Flume Agent Internals
3: Replicating and Multiplexing
3.1: Example
4: Failover
4.1: Example
5: Load Balancing
6: Aggregation
6.1: Example
1: Flume Transactions
A Flume transaction has two halves. The Put transaction sits between the Source and the Channel: doPut writes a batch of events into a temporary putList, doCommit pushes the batch into the Channel if the Channel has enough free space, and doRollback discards the batch when it does not. The Take transaction sits between the Channel and the Sink: doTake pulls events into a takeList, doCommit removes them from the Channel once the Sink has written them out, and doRollback returns the takeList to the Channel. Together these two transactions underpin Flume's at-least-once delivery semantics: a rollback can duplicate events, but committed events are not lost.
2: Flume Agent Internals
Key components:
1) ChannelSelector
A ChannelSelector decides which Channel(s) an Event will be sent to. There are two types: Replicating and Multiplexing. A ReplicatingSelector sends every Event to all Channels, while a Multiplexing selector routes different Events to different Channels according to configured rules.
2) SinkProcessor
There are three types of SinkProcessor: DefaultSinkProcessor, LoadBalancingSinkProcessor, and FailoverSinkProcessor. DefaultSinkProcessor drives a single Sink; LoadBalancingSinkProcessor and FailoverSinkProcessor operate on a Sink Group, where LoadBalancingSinkProcessor provides load balancing and FailoverSinkProcessor provides failover.
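The replicating selector appears in the Section 3 example below. For reference, a minimal multiplexing sketch looks like the following; the header name (state) and the mapping values are hypothetical and would come from whatever header your source or interceptor sets:

# Hypothetical multiplexing selector: route events by the value of their "state" header
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2
# Events whose header value matches nothing above go to the default channel
a1.sources.r1.selector.default = c3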
3: Replicating and Multiplexing
Flume supports sending an event stream to one or more destinations. In this pattern the same data can be replicated into multiple channels, or different data can be distributed to different channels, and each sink can then deliver to a different destination.
3.1: Example
Use Flume-1 to monitor a file for changes. Flume-1 passes new content to Flume-2, which stores it in HDFS, and at the same time to Flume-3, which writes it to the local file system.
Analysis:
1: We create three Flume config files: flume-file-flume.conf, flume-flume-hdfs.conf, and flume-flume-dir.conf.
2: flume-file-flume.conf monitors /opt/data/test.log and sends new content to ports 4141 and 4142 on hadoop112. flume-flume-hdfs.conf receives the data on port 4141 and stores it in HDFS; flume-flume-dir.conf receives the data on port 4142 and stores it in the local directory /opt/data/flume3.
3: Note: create any local directories yourself beforehand, and start Hadoop first.

# flume-file-flume.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Replicate the data flow to all channels (the default)
a1.sources.r1.selector.type = replicating
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/data/test.log
a1.sources.r1.shell = /bin/bash -c
# Describe the sink
# An avro sink is a data sender
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop112
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop112
a1.sinks.k2.port = 4142
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
# flume-flume-hdfs.conf
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
# An avro source is a data-receiving service
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop112
a2.sources.r1.port = 4141
# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://hadoop112:9000/flume2/%Y%m%d/%H
# Prefix for uploaded files
a2.sinks.k1.hdfs.filePrefix = flume2-
# Whether to roll directories based on time
a2.sinks.k1.hdfs.round = true
# How many time units before creating a new directory
a2.sinks.k1.hdfs.roundValue = 1
# The time unit for rolling directories
a2.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# How many Events to accumulate before flushing to HDFS
a2.sinks.k1.hdfs.batchSize = 100
# File type; compression is supported
a2.sinks.k1.hdfs.fileType = DataStream
# How often (seconds) to roll a new file
a2.sinks.k1.hdfs.rollInterval = 30
# Roll each file at roughly 128 MB
a2.sinks.k1.hdfs.rollSize = 134217700
# File rolling is independent of the number of Events
a2.sinks.k1.hdfs.rollCount = 0
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
# flume-flume-dir.conf
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop112
a3.sources.r1.port = 4142
# Describe the sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/data/flume3
# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2
Open three terminals on hadoop112 and run the following commands from the Flume installation directory, in order:
[atguigu@hadoop112 flume-1.7.0]$ bin/flume-ng agent -c conf/ -n a1 -f job/group1/flume-file-flume.conf
[atguigu@hadoop112 flume-1.7.0]$ bin/flume-ng agent -c conf/ -n a2 -f job/group1/flume-flume-hdfs.conf
[atguigu@hadoop112 flume-1.7.0]$ bin/flume-ng agent -c conf/ -n a3 -f job/group1/flume-flume-dir.conf
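To generate traffic, append lines to the monitored file (a minimal sketch; the echoed text is arbitrary):

[atguigu@hadoop112 ~]$ echo 'hello flume' >> /opt/data/test.log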
Results
A new file appears in /opt/data/flume3 every 30 seconds (the file_roll sink rolls on a 30-second interval by default, and creates a file even if no data arrived in that interval).
The corresponding date/hour directories and files appear under /flume2 in HDFS.
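One way to verify, assuming the Hadoop client is on the PATH:

[atguigu@hadoop112 flume-1.7.0]$ hdfs dfs -ls -R /flume2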
4: Failover
Flume supports grouping multiple sinks logically into a sink group. Combined with different SinkProcessors, a sink group can provide load balancing or failure recovery: when the active Flume agent goes down, the standby sink with the highest priority takes over.
4.1: Example
A Flume agent monitors port 44444 and feeds two sinks with different priorities; observe which one produces output.
Analysis: 1: configure a sink group, then configure two sinks with different priorities.
# flume-netcat-flume.conf
# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinkgroups = g1
a1.sinks = k1 k2
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop112
a1.sources.r1.port = 44444
# Configure the failover strategy
a1.sinkgroups.g1.processor.type = failover
# Configure priorities; the higher number wins
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop112
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop112
a1.sinks.k2.port = 4142
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
# flume-flume-console1.conf
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop112
a2.sources.r1.port = 4141
# Describe the sink
a2.sinks.k1.type = logger
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
# flume-flume-console2.conf
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop112
a3.sources.r1.port = 4142
# Describe the sink
a3.sinks.k1.type = logger
# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2
Open three terminals on hadoop112 and run the following commands from the Flume installation directory, in order:
[atguigu@hadoop112 flume-1.7.0]$ bin/flume-ng agent -c conf/ -n a2 -f job/group2/flume-flume-console1.conf -Dflume.root.logger=INFO,console
[atguigu@hadoop112 flume-1.7.0]$ bin/flume-ng agent -c conf/ -n a3 -f job/group2/flume-flume-console2.conf -Dflume.root.logger=INFO,console
[atguigu@hadoop112 flume-1.7.0]$ bin/flume-ng agent -c conf/ -n a1 -f job/group2/flume-netcat-flume.conf
Then connect to the port and send some data:
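For example, with netcat (assuming nc is installed; each line you type becomes one event, and the netcat source answers OK):

[atguigu@hadoop112 ~]$ nc hadoop112 44444
hello
OK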
Results:
At first, only the agent listening on port 4142 (fed by k2, the higher-priority sink) prints output; once that agent is killed, the agent on port 4141 starts printing.
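One way to simulate the failure is to Ctrl+C the a3 agent's terminal, or to find and kill its process (the PID and output below are illustrative):

[atguigu@hadoop112 ~]$ jps -ml | grep console2
3456 org.apache.flume.node.Application -n a3 -f job/group2/flume-flume-console2.conf
[atguigu@hadoop112 ~]$ kill -9 3456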
5: Load Balancing
A LoadBalancingSinkProcessor spreads the load across the sinks in a sink group: it selects the next sink round-robin (the default) or at random, and the selected sink takes events from the channel and forwards them downstream.
# flume-netcat2-flume.conf
# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinkgroups = g1
a1.sinks = k1 k2
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop112
a1.sources.r1.port = 44444
# Configure the load-balancing strategy
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
# Maximum backoff time (ms) for a failed sink
a1.sinkgroups.g1.processor.selector.maxTimeOut = 30000
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop112
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop112
a1.sinks.k2.port = 4142
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
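By default the load_balance processor picks sinks round-robin; to select at random instead, one extra line is enough:

# Optional: random instead of the default round_robin selection
a1.sinkgroups.g1.processor.selector = random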
Open three terminals on hadoop112 and run the following commands from the Flume installation directory, in order (the two console agents are reused from the failover example):
[atguigu@hadoop112 flume-1.7.0]$ bin/flume-ng agent -c conf/ -n a2 -f job/group2/flume-flume-console1.conf -Dflume.root.logger=INFO,console
[atguigu@hadoop112 flume-1.7.0]$ bin/flume-ng agent -c conf/ -n a3 -f job/group2/flume-flume-console2.conf -Dflume.root.logger=INFO,console
[atguigu@hadoop112 flume-1.7.0]$ bin/flume-ng agent -c conf/ -n a1 -f job/group2/flume-netcat2-flume.conf
Then connect to the port and send some data, again with nc hadoop112 44444 as in the failover example.
Results
Both agents (ports 4141 and 4142) print data; with the default round-robin selector, events roughly alternate between them.
6: Aggregation
6.1: Example
Flume-1 on hadoop113 monitors the file /opt/data/test.log, and Flume-2 on hadoop114 monitors a network port. Flume-1 and Flume-2 both send their data to Flume-3 on hadoop112, which prints the final combined stream to its console.
# Flume config on hadoop113: flume1-file-flume.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/data/test.log
a1.sources.r1.shell = /bin/bash -c
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop112
a1.sinks.k1.port = 4141
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# Flume config on hadoop114: flume1-netcat-flume.conf
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
a2.sources.r1.type = netcat
a2.sources.r1.bind = hadoop114
a2.sources.r1.port = 44444
# Describe the sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = hadoop112
a2.sinks.k1.port = 4141
# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
# Flume config on hadoop112: flume1-flume-logger.conf
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop112
a3.sources.r1.port = 4141
# Describe the sink
a3.sinks.k1.type = logger
# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
Launch commands:
[atguigu@hadoop114 flume-1.7.0]$ bin/flume-ng agent -c conf/ -n a2 -f job/group3/flume1-netcat-flume.conf
[atguigu@hadoop113 flume-1.7.0]$ bin/flume-ng agent -c conf/ -n a1 -f job/group3/flume1-file-flume.conf
[atguigu@hadoop112 flume-1.7.0]$ bin/flume-ng agent -n a3 -c conf/ -f job/group3/flume1-flume-logger.conf -Dflume.root.logger=INFO,console
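To exercise both paths (a minimal sketch; the text sent is arbitrary):

[atguigu@hadoop113 ~]$ echo 'hello from hadoop113' >> /opt/data/test.log
[atguigu@hadoop114 ~]$ nc hadoop114 44444
hello from hadoop114
OK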
Test results: the a3 agent on hadoop112 prints the events from both hadoop113's log file and hadoop114's netcat port to its console.



