
[Experience Sharing] Lesson 88: Pulling Data from Flume into Spark Streaming, a Hands-On Case Study with Source Code Walkthrough

This lesson is split into two parts:
    Part 1: Hands-on with Spark Streaming pulling from Flume
    Part 2: Source code analysis of Spark Streaming pulling from Flume


First, a quick look at the two modes in which Flume can feed Spark Streaming: push mode (Flume pushes to Spark Streaming) and pull mode (Spark Streaming pulls from Flume).

  Push mode: Flume acts as the buffer that holds the data; it watches the configured host and port, and as soon as it can connect to the service it pushes the data over (simple, loosely coupled). The drawbacks are that if the Spark Streaming application has not been started, the Flume side throws errors, and Spark Streaming can be pushed data faster than it is able to consume it.

  Pull mode: a custom sink is installed on the Flume agent, and Spark Streaming fetches data from the channel itself, pulling at whatever rate it can handle, which makes it more stable.
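
In API terms the difference is simply which FlumeUtils factory method the application calls. A minimal Scala sketch, assuming an existing StreamingContext named ssc and that the host/port match the Flume configuration used later in this post:

  import org.apache.spark.streaming.flume.FlumeUtils

  // Push mode: Spark Streaming runs an Avro receiver on SparkMaster:9999
  // and Flume's avro sink pushes events to it.
  val pushedStream = FlumeUtils.createStream(ssc, "SparkMaster", 9999)

  // Pull mode: the Flume agent runs org.apache.spark.streaming.flume.sink.SparkSink
  // on SparkMaster:9999 and Spark Streaming polls it for batches of events.
  val pulledStream = FlumeUtils.createPollingStream(ssc, "SparkMaster", 9999)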



Hands-on with Flume pull mode:
Step 1: Install Flume. This is not covered again here; see Lesson 87 (Flume Pushing Data to Spark Streaming: Hands-On Case Study and Source Code Internals).
Step 2: Configure Flume. Following the official guide (http://spark.apache.org/docs/latest/streaming-flume-integration.html), either add the required dependencies or download the following 3 jars directly and put them into the lib directory of the Flume installation:

  spark-streaming-flume-sink_2.10-1.6.0.jar, scala-library-2.10.5.jar, commons-lang3-3.3.2.jar
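
  Note that these jars are for the Flume agent side (they provide SparkSink). The Spark application itself also needs the spark-streaming-flume artifact on its classpath, as described in the same integration guide. A minimal sbt sketch, assuming Spark 1.6.0 on Scala 2.10:

  // build.sbt (sketch; versions are assumed to match the cluster)
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-streaming"       % "1.6.0" % "provided",
    "org.apache.spark" %% "spark-streaming-flume" % "1.6.0"
  )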
  
  
Step 3: Configure the Flume agent: copy flume-conf.properties.template to flume-conf.properties and edit it as follows. (Note that the channel is named memoryChannel but is actually configured as a file channel.)

  # Flume pull mode
  agent0.sources = source1
  agent0.channels = memoryChannel
  agent0.sinks = sink1
  

  # Configure source1
  agent0.sources.source1.type = spooldir
  agent0.sources.source1.spoolDir = /home/hadoop/flume/tmp/TestDir
  agent0.sources.source1.channels = memoryChannel
  agent0.sources.source1.fileHeader = false
  agent0.sources.source1.interceptors = il
  agent0.sources.source1.interceptors.il.type = timestamp
  

  # Configure sink1
  agent0.sinks.sink1.type = org.apache.spark.streaming.flume.sink.SparkSink
  agent0.sinks.sink1.hostname = SparkMaster
  agent0.sinks.sink1.port = 9999

  agent0.sinks.sink1.channel = memoryChannel

  

  # Configure the channel
  agent0.channels.memoryChannel.type = file
  agent0.channels.memoryChannel.checkpointDir = /home/hadoop/flume/tmp/checkpoint
  agent0.channels.memoryChannel.dataDirs = /home/hadoop/flume/tmp/dataDir


Command to start Flume:
  root@SparkMaster:~/flume/flume-1.6.0/bin# ./flume-ng agent --conf ../conf/ --conf-file ../conf/flume-conf.properties --name agent0 -Dflume.root.logger=INFO,console
or: root@SparkMaster:~/flume/flume-1.6.0# flume-ng agent --conf ./conf/ --conf-file ./conf/flume-conf.properties --name agent0 -Dflume.root.logger=INFO,console


Step 4: Write a simple application (Java version)
  
  package com.dt.spark.SparkApps.sparkstreaming;
  import java.nio.charset.StandardCharsets;
  import java.util.Arrays;
  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.function.FlatMapFunction;
  import org.apache.spark.api.java.function.Function2;
  import org.apache.spark.api.java.function.PairFunction;
  import org.apache.spark.streaming.Durations;
  import org.apache.spark.streaming.api.java.JavaDStream;
  import org.apache.spark.streaming.api.java.JavaPairDStream;
  import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
  import org.apache.spark.streaming.api.java.JavaStreamingContext;
  import org.apache.spark.streaming.flume.FlumeUtils;
  import org.apache.spark.streaming.flume.SparkFlumeEvent;
  import scala.Tuple2;
  public class SparkStreamingPullDataFromFlume {
      public static void main(String[] args) {
          SparkConf conf = new SparkConf().setMaster("spark://SparkMaster:7077");
          conf.setAppName("SparkStreamingPullDataFromFlume");
          JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(30));
          // Pull the data from the SparkSink running on the Flume agent
          JavaReceiverInputDStream<SparkFlumeEvent> lines =
                  FlumeUtils.createPollingStream(jsc, "SparkMaster", 9999);
          // Split each event body into words (the body is a ByteBuffer, decoded here as UTF-8)
          JavaDStream<String> words = lines.flatMap(new FlatMapFunction<SparkFlumeEvent, String>() {
              public Iterable<String> call(SparkFlumeEvent event) throws Exception {
                  String line = StandardCharsets.UTF_8.decode(event.event().getBody()).toString();
                  return Arrays.asList(line.split(" "));
              }
          });
          // Map each word to a (key, value) pair
          JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
              public Tuple2<String, Integer> call(String word) throws Exception {
                  return new Tuple2<String, Integer>(word, 1);
              }
          });
          // reduceByKey: merge the values that share the same key
          JavaPairDStream<String, Integer> wordsCount = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
              public Integer call(Integer v1, Integer v2) throws Exception {
                  return v1 + v2;
              }
          });
          wordsCount.print();
          jsc.start();
          jsc.awaitTermination();
          jsc.close();
      }
  }
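
  For comparison, an equivalent Scala version of the same pull-mode word count; this is a sketch assuming the same host and port, with a hypothetical object name:

  package com.dt.spark.SparkApps.sparkstreaming

  import java.nio.charset.StandardCharsets

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.flume.FlumeUtils

  object SparkStreamingPullDataFromFlumeScala {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf()
        .setMaster("spark://SparkMaster:7077")
        .setAppName("SparkStreamingPullDataFromFlumeScala")
      val ssc = new StreamingContext(conf, Seconds(30))

      // Poll the SparkSink on the Flume agent and decode each event body as UTF-8
      FlumeUtils.createPollingStream(ssc, "SparkMaster", 9999)
        .map(e => StandardCharsets.UTF_8.decode(e.event.getBody).toString)
        .flatMap(_.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
        .print()

      ssc.start()
      ssc.awaitTermination()
    }
  }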
Package the program into a jar file and upload it to the Spark cluster.


Step 5: Start HDFS, the Spark cluster, and Flume
Start Flume: root@SparkMaster:~/flume/flume-1.6.0/bin# ./flume-ng agent --conf ../conf/ --conf-file ../conf/flume-conf.properties --name agent0 -Dflume.root.logger=INFO,console
Step 6: Upload a test file into the /home/hadoop/flume/tmp/TestDir directory and watch how Flume's log output changes
Step 7: Run the program with spark-submit:
  ./spark-submit --class com.dt.spark.SparkApps.sparkstreaming.SparkStreamingPullDataFromFlume --name SparkStreamingPullDataFromFlume /home/hadoop/spark/SparkStreamingPullDataFromFlume.jar
  Check the output every 30 seconds (the batch interval).
  

  
  Part 2: Source code analysis

  
1. Creating the stream: createPollingStream (FlumeUtils.scala)
Note: the default storage level is MEMORY_AND_DISK_SER_2
/**
* Creates an input stream that is to be used with the Spark Sink deployed on a Flume agent.
* This stream will poll the sink for data and will pull events as they are available.
* This stream will use a batch size of 1000 events and run 5 threads to pull data.
* @param hostname Address of the host on which the Spark Sink is running
* @param port Port of the host at which the Spark Sink is listening
* @param storageLevel Storage level to use for storing the received objects
*/
def createPollingStream(
    ssc: StreamingContext,
    hostname: String,
    port: Int,
    storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
  ): ReceiverInputDStream[SparkFlumeEvent] = {
  createPollingStream(ssc, Seq(new InetSocketAddress(hostname, port)), storageLevel)
}
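A caller can override that default by passing a storage level explicitly to this overload; a small usage sketch, assuming a StreamingContext named ssc:

  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.flume.FlumeUtils

  // Replace the default MEMORY_AND_DISK_SER_2 with a non-replicated level
  val stream = FlumeUtils.createPollingStream(
    ssc, "SparkMaster", 9999, StorageLevel.MEMORY_AND_DISK_SER)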
2. Parameter configuration: the default global parameters below are private, so they cannot be changed from outside; this overload fills them in when the caller does not supply explicit values:
private val DEFAULT_POLLING_PARALLELISM = 5
private val DEFAULT_POLLING_BATCH_SIZE = 1000
/**
* Creates an input stream that is to be used with the Spark Sink deployed on a Flume agent.
* This stream will poll the sink for data and will pull events as they are available.
* This stream will use a batch size of 1000 events and run 5 threads to pull data.
* @param addresses List of InetSocketAddresses representing the hosts to connect to.
* @param storageLevel Storage level to use for storing the received objects
*/
def createPollingStream(
    ssc: StreamingContext,
    addresses: Seq[InetSocketAddress],
    storageLevel: StorageLevel
  ): ReceiverInputDStream[SparkFlumeEvent] = {
  createPollingStream(ssc, addresses, storageLevel,
    DEFAULT_POLLING_BATCH_SIZE, DEFAULT_POLLING_PARALLELISM)
}
3. Creating the FlumePollingInputDStream object
/**
* Creates an input stream that is to be used with the Spark Sink deployed on a Flume agent.
* This stream will poll the sink for data and will pull events as they are available.
* @param addresses List of InetSocketAddresses representing the hosts to connect to.
* @param maxBatchSize Maximum number of events to be pulled from the Spark sink in a
*                     single RPC call
* @param parallelism Number of concurrent requests this stream should send to the sink. Note
*                    that having a higher number of requests concurrently being pulled will
*                    result in this stream using more threads
* @param storageLevel Storage level to use for storing the received objects
*/
def createPollingStream(
    ssc: StreamingContext,
    addresses: Seq[InetSocketAddress],
    storageLevel: StorageLevel,
    maxBatchSize: Int,
    parallelism: Int
  ): ReceiverInputDStream[SparkFlumeEvent] = {
  new FlumePollingInputDStream[SparkFlumeEvent](ssc, addresses, maxBatchSize,
    parallelism, storageLevel)
}
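Because this overload takes maxBatchSize and parallelism as parameters, the private defaults above (1000 events per call, 5 threads) can effectively be tuned per stream, and several sink addresses can be supplied. A sketch assuming a StreamingContext named ssc and a hypothetical second Flume agent host:

  import java.net.InetSocketAddress
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.flume.FlumeUtils

  val addresses = Seq(
    new InetSocketAddress("SparkMaster", 9999),
    new InetSocketAddress("SparkWorker1", 9999))  // hypothetical second agent

  // Pull up to 2000 events per RPC call, using 10 concurrent fetcher threads
  val stream = FlumeUtils.createPollingStream(
    ssc, addresses, StorageLevel.MEMORY_AND_DISK_SER_2, 2000, 10)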
4. FlumePollingInputDStream extends ReceiverInputDStream and overrides getReceiver, which returns a FlumePollingReceiver
private[streaming] class FlumePollingInputDStream[T: ClassTag](
    _ssc: StreamingContext,
    val addresses: Seq[InetSocketAddress],
    val maxBatchSize: Int,
    val parallelism: Int,
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[SparkFlumeEvent](_ssc) {
   override def getReceiver(): Receiver[SparkFlumeEvent] = {
    new FlumePollingReceiver(addresses, maxBatchSize, parallelism, storageLevel)
  }
}
5. FlumePollingReceiver builds its thread pools with daemon (background) threads, using lazy vals and a thread factory to create the channel factory executor and the NioClientSocketChannelFactory (which is backed by Netty underneath)
lazy val channelFactoryExecutor =
  Executors.newCachedThreadPool(new ThreadFactoryBuilder().setDaemon(true).
    setNameFormat("Flume Receiver Channel Thread - %d").build())
lazy val channelFactory =
  new NioClientSocketChannelFactory(channelFactoryExecutor, channelFactoryExecutor)
6. receiverExecutor is likewise a thread pool; connections is a queue of FlumeConnection handles, one per connection to an agent in the (possibly distributed) Flume deployment, and the worker threads take these handles to fetch data.
lazy val receiverExecutor = Executors.newFixedThreadPool(parallelism,
  new ThreadFactoryBuilder().setDaemon(true).setNameFormat("Flume Receiver Thread - %d").build())
private lazy val connections = new LinkedBlockingQueue[FlumeConnection]()
7. On start, a NettyTransceiver is created for each configured address, and FlumeBatchFetcher tasks are submitted in a loop according to the parallelism (5 by default)
override def onStart(): Unit = {
  // Create the connections to each Flume agent.
  addresses.foreach(host => {
    val transceiver = new NettyTransceiver(host, channelFactory)
    val client = SpecificRequestor.getClient(classOf[SparkFlumeProtocol.Callback], transceiver)
    connections.add(new FlumeConnection(transceiver, client))
  })
  for (i <- 0 until parallelism) {
    // Threads that pull data from Flume.
    receiverExecutor.submit(new FlumeBatchFetcher(this))
  }
}
8. FlumeBatchFetcher: each worker thread takes a FlumeConnection, asks the sink for a batch via getBatch, stores the events into Spark, and then acks on success or nacks on failure (excerpt from its run loop; the surrounding code was truncated in the original post)
      try {
        getBatch(client) match {
          case Some(eventBatch) =>
            batchReceived = true
            seq = eventBatch.getSequenceNumber
            val events = toSparkFlumeEvents(eventBatch.getEvents)
            if (store(events)) {
              sendAck(client, seq)
            } else {
              sendNack(batchReceived, client, seq)
            }
          case None =>
        }
      } catch {
        // exception handling omitted (truncated in the original post)
9. The method that fetches one batch of events at a time (getBatch):
/**
* Gets a batch of events from the specified client. This method does not handle any exceptions
* which will be propogated to the caller.
* @param client Client to get events from
* @return [[Some]] which contains the event batch if Flume sent any events back, else [[None]]
*/
private def getBatch(client: SparkFlumeProtocol.Callback): Option[EventBatch] = {
  val eventBatch = client.getEventBatch(receiver.getMaxBatchSize)
  if (!SparkSinkUtils.isErrorBatch(eventBatch)) {
    // No error, proceed with processing data
    logDebug(s"Received batch of ${eventBatch.getEvents.size} events with sequence " +
      s"number: ${eventBatch.getSequenceNumber}")
    Some(eventBatch)
  } else {
    logWarning("Did not receive events from Flume agent due to error on the Flume agent: " +
      eventBatch.getErrorMsg)
    None
  }
}


  Note:
  Source: DT_大数据梦工厂 (DT Big Data Dream Factory)
  For more content, follow the WeChat official account: DT_Spark
  If you are interested in big data and Spark, you can join Wang Jialin's free public Spark class, held every evening at 20:00 in YY room 68917580



