
[Experience Sharing] Lesson 88: Pulling Data from Flume into Spark Streaming, a Hands-On Case Study with Source Code Walkthrough

This lesson is split into two parts:
    Part 1: Hands-on with Spark Streaming pulling from Flume
    Part 2: Source code analysis of Spark Streaming pulling from Flume


First, a quick look at the two modes in which Flume can feed Spark Streaming: push mode (Flume pushes to Spark Streaming) and pull mode (Spark Streaming pulls from Flume).

  Push mode: Flume acts as the buffer that holds the data; it watches the configured host and port, and as soon as it can connect to the service it pushes the data over (simple, loosely coupled). The drawbacks are that if the Spark Streaming application has not been started, the Flume side throws errors, and Spark Streaming can be pushed data faster than it is able to consume it.

  Pull mode: a custom sink is installed on the Flume agent, and Spark Streaming fetches data from the channel itself, pulling at whatever rate it can handle, which makes it more stable.
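
In API terms the difference is simply which FlumeUtils factory method the application calls. A minimal Scala sketch, assuming an existing StreamingContext named ssc and that the host/port match the Flume configuration used later in this post:

  import org.apache.spark.streaming.flume.FlumeUtils

  // Push mode: Spark Streaming runs an Avro receiver on SparkMaster:9999
  // and Flume's avro sink pushes events to it.
  val pushedStream = FlumeUtils.createStream(ssc, "SparkMaster", 9999)

  // Pull mode: the Flume agent runs org.apache.spark.streaming.flume.sink.SparkSink
  // on SparkMaster:9999 and Spark Streaming polls it for batches of events.
  val pulledStream = FlumeUtils.createPollingStream(ssc, "SparkMaster", 9999)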



Hands-on with Flume pull mode:
Step 1: Install Flume. This is not covered again here; see Lesson 87 (Flume Pushing Data to Spark Streaming: Hands-On Case Study and Source Code Internals).
Step 2: Configure Flume. Following the official guide (http://spark.apache.org/docs/latest/streaming-flume-integration.html), either add the required dependencies or download the following 3 jars directly and put them into the lib directory of the Flume installation:

  spark-streaming-flume-sink_2.10-1.6.0.jar, scala-library-2.10.5.jar, commons-lang3-3.3.2.jar
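
  Note that these jars are for the Flume agent side (they provide SparkSink). The Spark application itself also needs the spark-streaming-flume artifact on its classpath, as described in the same integration guide. A minimal sbt sketch, assuming Spark 1.6.0 on Scala 2.10:

  // build.sbt (sketch; versions are assumed to match the cluster)
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-streaming"       % "1.6.0" % "provided",
    "org.apache.spark" %% "spark-streaming-flume" % "1.6.0"
  )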
  
  
Step 3: Configure the Flume agent: copy flume-conf.properties.template to flume-conf.properties and edit it as follows. (Note that the channel is named memoryChannel but is actually configured as a file channel.)

  # Flume pull mode
  agent0.sources = source1
  agent0.channels = memoryChannel
  agent0.sinks = sink1
  

  # Configure source1
  agent0.sources.source1.type = spooldir
  agent0.sources.source1.spoolDir = /home/hadoop/flume/tmp/TestDir
  agent0.sources.source1.channels = memoryChannel
  agent0.sources.source1.fileHeader = false
  agent0.sources.source1.interceptors = il
  agent0.sources.source1.interceptors.il.type = timestamp
  

  # Configure sink1
  agent0.sinks.sink1.type = org.apache.spark.streaming.flume.sink.SparkSink
  agent0.sinks.sink1.hostname = SparkMaster
  agent0.sinks.sink1.port = 9999

  agent0.sinks.sink1.channel = memoryChannel

  

  # Configure the channel
  agent0.channels.memoryChannel.type = file
  agent0.channels.memoryChannel.checkpointDir = /home/hadoop/flume/tmp/checkpoint
  agent0.channels.memoryChannel.dataDirs = /home/hadoop/flume/tmp/dataDir


Command to start Flume:
  root@SparkMaster:~/flume/flume-1.6.0/bin# ./flume-ng agent --conf ../conf/ --conf-file ../conf/flume-conf.properties --name agent0 -Dflume.root.logger=INFO,console
or: root@SparkMaster:~/flume/flume-1.6.0# flume-ng agent --conf ./conf/ --conf-file ./conf/flume-conf.properties --name agent0 -Dflume.root.logger=INFO,console


Step 4: Write a simple application (Java version)
  
  package com.dt.spark.SparkApps.sparkstreaming;
  import java.nio.charset.StandardCharsets;
  import java.util.Arrays;
  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.function.FlatMapFunction;
  import org.apache.spark.api.java.function.Function2;
  import org.apache.spark.api.java.function.PairFunction;
  import org.apache.spark.streaming.Durations;
  import org.apache.spark.streaming.api.java.JavaDStream;
  import org.apache.spark.streaming.api.java.JavaPairDStream;
  import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
  import org.apache.spark.streaming.api.java.JavaStreamingContext;
  import org.apache.spark.streaming.flume.FlumeUtils;
  import org.apache.spark.streaming.flume.SparkFlumeEvent;
  import scala.Tuple2;
  public class SparkStreamingPullDataFromFlume {
      public static void main(String[] args) {
          SparkConf conf = new SparkConf().setMaster("spark://SparkMaster:7077");
          conf.setAppName("SparkStreamingPullDataFromFlume");
          JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(30));
          // Pull the data from the SparkSink running on the Flume agent
          JavaReceiverInputDStream<SparkFlumeEvent> lines =
                  FlumeUtils.createPollingStream(jsc, "SparkMaster", 9999);
          // Split each event body into words (the body is a ByteBuffer, decoded here as UTF-8)
          JavaDStream<String> words = lines.flatMap(new FlatMapFunction<SparkFlumeEvent, String>() {
              public Iterable<String> call(SparkFlumeEvent event) throws Exception {
                  String line = StandardCharsets.UTF_8.decode(event.event().getBody()).toString();
                  return Arrays.asList(line.split(" "));
              }
          });
          // Map each word to a (key, value) pair
          JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
              public Tuple2<String, Integer> call(String word) throws Exception {
                  return new Tuple2<String, Integer>(word, 1);
              }
          });
          // reduceByKey: merge the values that share the same key
          JavaPairDStream<String, Integer> wordsCount = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
              public Integer call(Integer v1, Integer v2) throws Exception {
                  return v1 + v2;
              }
          });
          wordsCount.print();
          jsc.start();
          jsc.awaitTermination();
          jsc.close();
      }
  }
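
  For comparison, an equivalent Scala version of the same pull-mode word count; this is a sketch assuming the same host and port, with a hypothetical object name:

  package com.dt.spark.SparkApps.sparkstreaming

  import java.nio.charset.StandardCharsets

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.flume.FlumeUtils

  object SparkStreamingPullDataFromFlumeScala {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf()
        .setMaster("spark://SparkMaster:7077")
        .setAppName("SparkStreamingPullDataFromFlumeScala")
      val ssc = new StreamingContext(conf, Seconds(30))

      // Poll the SparkSink on the Flume agent and decode each event body as UTF-8
      FlumeUtils.createPollingStream(ssc, "SparkMaster", 9999)
        .map(e => StandardCharsets.UTF_8.decode(e.event.getBody).toString)
        .flatMap(_.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
        .print()

      ssc.start()
      ssc.awaitTermination()
    }
  }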
Package the program into a jar file and upload it to the Spark cluster.


Step 5: Start HDFS, the Spark cluster, and Flume
Start Flume: root@SparkMaster:~/flume/flume-1.6.0/bin# ./flume-ng agent --conf ../conf/ --conf-file ../conf/flume-conf.properties --name agent0 -Dflume.root.logger=INFO,console
Step 6: Upload a test file into the /home/hadoop/flume/tmp/TestDir directory and watch how Flume's log output changes
Step 7: Run the program with spark-submit:
  ./spark-submit --class com.dt.spark.SparkApps.sparkstreaming.SparkStreamingPullDataFromFlume --name SparkStreamingPullDataFromFlume /home/hadoop/spark/SparkStreamingPullDataFromFlume.jar
  Check the output every 30 seconds (the batch interval).
  

  
  Part 2: Source code analysis

  
1. Creating the stream: createPollingStream (FlumeUtils.scala)
Note: the default storage level is MEMORY_AND_DISK_SER_2
/**
* Creates an input stream that is to be used with the Spark Sink deployed on a Flume agent.
* This stream will poll the sink for data and will pull events as they are available.
* This stream will use a batch size of 1000 events and run 5 threads to pull data.
* @param hostname Address of the host on which the Spark Sink is running
* @param port Port of the host at which the Spark Sink is listening
* @param storageLevel Storage level to use for storing the received objects
*/
def createPollingStream(
    ssc: StreamingContext,
    hostname: String,
    port: Int,
    storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
  ): ReceiverInputDStream[SparkFlumeEvent] = {
  createPollingStream(ssc, Seq(new InetSocketAddress(hostname, port)), storageLevel)
}
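A caller can override that default by passing a storage level explicitly to this overload; a small usage sketch, assuming a StreamingContext named ssc:

  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.flume.FlumeUtils

  // Replace the default MEMORY_AND_DISK_SER_2 with a non-replicated level
  val stream = FlumeUtils.createPollingStream(
    ssc, "SparkMaster", 9999, StorageLevel.MEMORY_AND_DISK_SER)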
2. Parameter configuration: the default global parameters below are private, so they cannot be changed from outside; this overload fills them in when the caller does not supply explicit values:
private val DEFAULT_POLLING_PARALLELISM = 5
private val DEFAULT_POLLING_BATCH_SIZE = 1000
/**
* Creates an input stream that is to be used with the Spark Sink deployed on a Flume agent.
* This stream will poll the sink for data and will pull events as they are available.
* This stream will use a batch size of 1000 events and run 5 threads to pull data.
* @param addresses List of InetSocketAddresses representing the hosts to connect to.
* @param storageLevel Storage level to use for storing the received objects
*/
def createPollingStream(
    ssc: StreamingContext,
    addresses: Seq[InetSocketAddress],
    storageLevel: StorageLevel
  ): ReceiverInputDStream[SparkFlumeEvent] = {
  createPollingStream(ssc, addresses, storageLevel,
    DEFAULT_POLLING_BATCH_SIZE, DEFAULT_POLLING_PARALLELISM)
}
3. Creating the FlumePollingInputDStream object
/**
* Creates an input stream that is to be used with the Spark Sink deployed on a Flume agent.
* This stream will poll the sink for data and will pull events as they are available.
* @param addresses List of InetSocketAddresses representing the hosts to connect to.
* @param maxBatchSize Maximum number of events to be pulled from the Spark sink in a
*                     single RPC call
* @param parallelism Number of concurrent requests this stream should send to the sink. Note
*                    that having a higher number of requests concurrently being pulled will
*                    result in this stream using more threads
* @param storageLevel Storage level to use for storing the received objects
*/
def createPollingStream(
    ssc: StreamingContext,
    addresses: Seq[InetSocketAddress],
    storageLevel: StorageLevel,
    maxBatchSize: Int,
    parallelism: Int
  ): ReceiverInputDStream[SparkFlumeEvent] = {
  new FlumePollingInputDStream[SparkFlumeEvent](ssc, addresses, maxBatchSize,
    parallelism, storageLevel)
}
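Because this overload takes maxBatchSize and parallelism as parameters, the private defaults above (1000 events per call, 5 threads) can effectively be tuned per stream, and several sink addresses can be supplied. A sketch assuming a StreamingContext named ssc and a hypothetical second Flume agent host:

  import java.net.InetSocketAddress
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.flume.FlumeUtils

  val addresses = Seq(
    new InetSocketAddress("SparkMaster", 9999),
    new InetSocketAddress("SparkWorker1", 9999))  // hypothetical second agent

  // Pull up to 2000 events per RPC call, using 10 concurrent fetcher threads
  val stream = FlumeUtils.createPollingStream(
    ssc, addresses, StorageLevel.MEMORY_AND_DISK_SER_2, 2000, 10)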
4. FlumePollingInputDStream extends ReceiverInputDStream and overrides getReceiver, which returns a FlumePollingReceiver
private[streaming] class FlumePollingInputDStream[T: ClassTag](
    _ssc: StreamingContext,
    val addresses: Seq[InetSocketAddress],
    val maxBatchSize: Int,
    val parallelism: Int,
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[SparkFlumeEvent](_ssc) {
   override def getReceiver(): Receiver[SparkFlumeEvent] = {
    new FlumePollingReceiver(addresses, maxBatchSize, parallelism, storageLevel)
  }
}
5. FlumePollingReceiver builds its thread pools with daemon (background) threads, using lazy vals and a thread factory to create the channel factory executor and the NioClientSocketChannelFactory (which is backed by Netty underneath)
lazy val channelFactoryExecutor =
  Executors.newCachedThreadPool(new ThreadFactoryBuilder().setDaemon(true).
    setNameFormat("Flume Receiver Channel Thread - %d").build())
lazy val channelFactory =
  new NioClientSocketChannelFactory(channelFactoryExecutor, channelFactoryExecutor)
6. receiverExecutor is likewise a thread pool; connections is a queue of FlumeConnection handles, one per connection to an agent in the (possibly distributed) Flume deployment, and the worker threads take these handles to fetch data.
lazy val receiverExecutor = Executors.newFixedThreadPool(parallelism,
  new ThreadFactoryBuilder().setDaemon(true).setNameFormat("Flume Receiver Thread - %d").build())
private lazy val connections = new LinkedBlockingQueue[FlumeConnection]()
7. On start, a NettyTransceiver is created for each configured address, and FlumeBatchFetcher tasks are submitted in a loop according to the parallelism (5 by default)
override def onStart(): Unit = {
  // Create the connections to each Flume agent.
  addresses.foreach(host => {
    val transceiver = new NettyTransceiver(host, channelFactory)
    val client = SpecificRequestor.getClient(classOf[SparkFlumeProtocol.Callback], transceiver)
    connections.add(new FlumeConnection(transceiver, client))
  })
  for (i <- 0 until parallelism) {
    // Threads that pull data from Flume.
    receiverExecutor.submit(new FlumeBatchFetcher(this))
  }
}
8. FlumeBatchFetcher: each worker thread takes a FlumeConnection, asks the sink for a batch via getBatch, stores the events into Spark, and then acks on success or nacks on failure (excerpt from its run loop; the surrounding code was truncated in the original post)
      try {
        getBatch(client) match {
          case Some(eventBatch) =>
            batchReceived = true
            seq = eventBatch.getSequenceNumber
            val events = toSparkFlumeEvents(eventBatch.getEvents)
            if (store(events)) {
              sendAck(client, seq)
            } else {
              sendNack(batchReceived, client, seq)
            }
          case None =>
        }
      } catch {
        // exception handling omitted (truncated in the original post)
9. The method that fetches one batch of events at a time (getBatch):
/**
* Gets a batch of events from the specified client. This method does not handle any exceptions
* which will be propogated to the caller.
* @param client Client to get events from
* @return [[Some]] which contains the event batch if Flume sent any events back, else [[None]]
*/
private def getBatch(client: SparkFlumeProtocol.Callback): Option[EventBatch] = {
  val eventBatch = client.getEventBatch(receiver.getMaxBatchSize)
  if (!SparkSinkUtils.isErrorBatch(eventBatch)) {
    // No error, proceed with processing data
    logDebug(s"Received batch of ${eventBatch.getEvents.size} events with sequence " +
      s"number: ${eventBatch.getSequenceNumber}")
    Some(eventBatch)
  } else {
    logWarning("Did not receive events from Flume agent due to error on the Flume agent: " +
      eventBatch.getErrorMsg)
    None
  }
}


  Note:
  Source: DT_大数据梦工厂 (DT Big Data Dream Factory)
  For more content, follow the WeChat official account: DT_Spark
  If you are interested in big data and Spark, you can join Wang Jialin's free public Spark class, held every evening at 20:00 in YY room 68917580



