Posted by 高峰之巅 on 2017-12-18 13:12:43

Summer Project, Part 2: Building and Testing a Docker-Based Hadoop Distributed Cluster

  I actually started this article back in April. We were in a data mining competition at the time, and on the advice of some gurus from the School of Computer Science we decided to do deep learning with TensorFlow, on top of our own Hadoop distributed cluster.
  We were feeling pretty full of ourselves: with no prior background, we set out to build our own big data platform in a month and apply an AI framework to the problem.
  The outcome was predictable: GG~~~~ (we only got Hadoop stood up... in the end we went back to honest web crawlers).

  Back then we built everything on VM virtual machines, which amounted to running 17 copies of CentOS 7 on 17 machines. This time we package the environment with Docker.
  I. Technology Stack
  Docker 1.12.6
  CentOS 7
  JDK 1.8.0_121
  Hadoop 2.7.3: distributed computing framework
  ZooKeeper 3.4.9: coordination service for distributed applications
  HBase 1.2.4: distributed storage database
  Spark 2.0.2: distributed big data computing engine
  Python 2.7.13
  TensorFlow 1.0.1: machine learning framework
  II. Building the Environment and Creating the Image
  1. Pull the base image: docker pull centos
  2. Start a container: docker run -it -d --name hadoop centos
  3. Enter the container: docker exec -it hadoop /bin/bash
  4. Install Java (these big data tools need a JDK; some components are written in Java). I install it under /usr.
  Configure the environment variables in /etc/profile:
  

#config java
export JAVA_HOME=/usr/java/jdk1.8.0_121
export JRE_HOME=/usr/java/jdk1.8.0_121/jre
export CLASSPATH=$JAVA_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
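  The installation itself is not shown above; a minimal sketch, assuming the JDK tarball jdk-8u121-linux-x64.tar.gz has already been downloaded into the container under /root:

# Unpack the JDK so that it ends up at /usr/java/jdk1.8.0_121 (tarball name/location assumed)
mkdir -p /usr/java
tar -zxf /root/jdk-8u121-linux-x64.tar.gz -C /usr/java

# Apply the profile and verify
source /etc/profile
java -version    # should report 1.8.0_121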

  5. Install Hadoop (http://hadoop.apache.org/releases.html). I install it under /usr/local/.
  Configure the environment variables in /etc/profile:
  

#config hadoop
export HADOOP_HOME=/usr/local/hadoop/
export PATH=$HADOOP_HOME/bin:$PATH
export PATH=$PATH:$HADOOP_HOME/sbin
#hadoop log file directory
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
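  Again, the unpacking step is not spelled out; a minimal sketch, assuming the hadoop-2.7.3.tar.gz release tarball is already in the container under /root:

# Unpack the release and move it to the path used by HADOOP_HOME (tarball location assumed)
tar -zxf /root/hadoop-2.7.3.tar.gz -C /usr/local
mv /usr/local/hadoop-2.7.3 /usr/local/hadoop

# After sourcing /etc/profile (next step), verify with:
hadoop version    # should print Hadoop 2.7.3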

  Run source /etc/profile to make the environment variables take effect.
  Edit the configuration files under /usr/local/hadoop/etc/hadoop/:
  (1) slaves (add the DataNode hostnames)
  

Slave1  
Slave2
  

  (2)core-site.xml
  

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/usr/local/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://Master:9000</value>
  </property>
</configuration>
  

  (3)hdfs-site.xml
  

<configuration>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>Master:9001</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop/dfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
</configuration>
  

  (4) Create mapred-site.xml
  

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>Master:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>Master:19888</value>
  </property>
</configuration>
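  A stock Hadoop 2.7.3 distribution ships only mapred-site.xml.template, so the file above has to be created from the template first, for example:

cd /usr/local/hadoop/etc/hadoop
cp mapred-site.xml.template mapred-site.xml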
  

  (5)yarn-site.xml
  

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>Master:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>Master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>Master:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>Master:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>Master:8088</value>
  </property>
</configuration>
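  The local directories referenced in core-site.xml and hdfs-site.xml do not exist in a fresh container; it does no harm to create them up front (paths taken from the configs above):

mkdir -p /usr/local/hadoop/tmp
mkdir -p /usr/local/hadoop/dfs/name
mkdir -p /usr/local/hadoop/dfs/data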
  

  6. Install ZooKeeper (https://zookeeper.apache.org/). I install it under /usr/local/.
  Configure the environment variables in /etc/profile:
  

#config zookeeper
export ZOOKEEPER_HOME=/usr/local/zookeeper
export PATH=$PATH:$ZOOKEEPER_HOME/bin:$ZOOKEEPER_HOME/conf

  (1)/usr/local/zookeeper/conf/zoo.cfg
  

initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/usr/local/zookeeper/data
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
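  The config above has no ensemble members, so ZooKeeper would run standalone. For a replicated ensemble across Master, Slave1 and Slave2 (the quorum referenced later in hbase-site.xml), each node would additionally need server entries in zoo.cfg and its own myid file in dataDir; a sketch under that assumption:

# Append the ensemble members to zoo.cfg (identical on every node)
cat >> /usr/local/zookeeper/conf/zoo.cfg <<'EOF'
server.1=Master:2888:3888
server.2=Slave1:2888:3888
server.3=Slave2:2888:3888
EOF

# On each node, write its own id into dataDir (1 on Master, 2 on Slave1, 3 on Slave2)
mkdir -p /usr/local/zookeeper/data
echo 1 > /usr/local/zookeeper/data/myid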
  

  7. Install HBase (http://hbase.apache.org/). I install it under /usr/local/.
  (1) /usr/local/hbase/conf/hbase-env.sh
  

export JAVA_HOME=/usr/java/jdk1.8.0_121
export HBASE_MANAGES_ZK=false

  (2)hbase-site.xml
  

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://Master:9000/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
  <property>
    <name>zookeeper.session.timeout</name>
    <value>120000</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>Master,Slave1,Slave2</value>
  </property>
  <property>
    <name>hbase.tmp.dir</name>
    <value>/usr/local/hbase/data</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
</configuration>
  

  (3)core-site.xml
  

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/usr/local/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://Master:9000</value>
  </property>
</configuration>
  

  (4)hdfs-site.xml
  

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
  

  (5) regionservers (these are my three nodes)
  

Master #namenode
Slave1 #datanode01
Slave2 #datanode02

  8. Install Spark (http://spark.apache.org/). I install it under /usr/local/.
  Configure the environment variables:
  

#config spark
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

  (1) cp ./conf/slaves.template ./conf/slaves
  Add the worker nodes to slaves:
  

Slave1  
Slave2
  

  (2)spark-env.sh
  

export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export SPARK_MASTER_IP=10.211.1.129
export JAVA_HOME=/usr/java/jdk1.8.0_121

  9. If you want to train models with TensorFlow: pip install tensorflow
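  A quick check that the install worked (assuming the Python 2.7.13 from the stack above is the default python):

python -c "import tensorflow as tf; print(tf.__version__)"    # should print 1.0.1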
  That completes the configuration of our NameNode (Master) node.
  10. Exit the container with exit.
  Create the image: docker commit edcabfcd69ff vitoyan/hadoop
  Publish it: docker push vitoyan/hadoop

  Take a look at it on Docker Hub:

  III. Testing
  For a fully distributed setup we still need to add more nodes (more containers, or hosts); a sketch of launching them from the image follows below.
  One NameNode controls multiple DataNodes.
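  A minimal sketch of spinning up worker containers from the image built above (the image name vitoyan/hadoop comes from the commit step; --hostname saves changing the hostname by hand inside each container):

# Start two workers from the committed image
docker run -it -d --name Slave1 --hostname Slave1 vitoyan/hadoop
docker run -it -d --name Slave2 --hostname Slave2 vitoyan/hadoop

# Look up their IPs for /etc/hosts (same trick as step 4 below)
docker exec Slave1 hostname -i
docker exec Slave2 hostname -i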
  1. Install SSH and networking tools: yum install openssh-server net-tools openssh-clients -y
  2. Generate a key pair: ssh-keygen -t rsa
  3. Append the public key to the remote host (container): ssh-copy-id -i ~/.ssh/id_rsa.pub root@10.211.1.129 (this way the containers can reach each other without a password, which is a prerequisite for a Hadoop cluster)
  4. On the host, look up a container's IP: docker exec hadoop hostname -i (then exchange public keys between all containers in the same way)
  5. Change the hostnames to Master, Slave1, Slave2, Slave3, ... to tell the containers apart.
  6. Add the following to /etc/hosts in every container:
  

10.211.1.129 Master
10.211.1.130 Slave1
10.211.1.131 Slave2
10.102.25.3 Slave3
10.102.25.4 Slave4
10.102.25.5 Slave5
10.102.25.6 Slave6
10.102.25.7 Slave7
10.102.25.8 Slave8
10.102.25.9 Slave9
10.102.25.10 Slave10
10.102.25.11 Slave11
10.102.25.12 Slave12
10.102.25.13 Slave13
10.102.25.14 Slave14
10.102.25.15 Slave15
10.102.25.16 Slave16

  7. The Hadoop configuration on each Slave only needs to be copied over and adjusted to the corresponding hostname, for example as sketched below.
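  Something along these lines, assuming the passwordless SSH from step 3 is already in place:

# Push the Hadoop configuration directory to each worker
scp -r /usr/local/hadoop/etc/hadoop root@Slave1:/usr/local/hadoop/etc/
scp -r /usr/local/hadoop/etc/hadoop root@Slave2:/usr/local/hadoop/etc/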
  8. Basic commands:
  (1) Start the Hadoop cluster:
  cd /usr/local/hadoop
  hdfs namenode -format
  sbin/start-all.sh
  Check that it started successfully: jps (the expected daemons are sketched below)
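  If everything came up, jps should show roughly the following daemons (process names only; the pids will differ):

# On Master (NameNode side)
jps
# NameNode
# SecondaryNameNode
# ResourceManager
# Jps

# On each Slave (DataNode side)
jps
# DataNode
# NodeManager
# Jps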

  (2) Start the ZooKeeper coordination service:
  cd /usr/local/zookeeper/bin
  ./zkServer.sh start
  Check that it started successfully: zkServer.sh status
  (3) Start the HBase distributed database:
  cd /usr/local/hbase/bin/
  ./start-hbase.sh
  (4) Start the Spark cluster:
  cd /usr/local/spark/
  sbin/start-master.sh
  sbin/start-slaves.sh
  Cluster web UI: http://master:8080
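  Before running full benchmarks, a quick smoke test with the bundled MapReduce example is a cheap sanity check (the examples jar ships with the Hadoop 2.7.3 binary distribution; its path is assumed to be under /usr/local/hadoop):

# Estimate pi with 2 map tasks x 10 samples each; exercises HDFS and YARN end to end
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 2 10

# And a trivial HDFS round trip
hdfs dfs -mkdir -p /tmp/smoke
hdfs dfs -put /etc/hosts /tmp/smoke/
hdfs dfs -cat /tmp/smoke/hosts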
  Cluster benchmarking: http://blog.itpub.net/8183550/viewspace-684152/
  My Hadoop image: https://hub.docker.com/r/vitoyan/hadoop/
  Feel free to pull it.
  over!!!!!