Setting up a Hadoop environment and running wordcount

olga · posted 2016-12-4 11:00:18

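1. Choosing a machine: I have no spare resources, only one server on hand, so this is a pseudo-distributed deployment.
2. Choosing a Hadoop version: Hadoop comes in 1.x and 2.x release lines. The two differ considerably, and installation and configuration are not the same, so decide which one you are using before you start. For reference see http://younglibin.iyunv.com/blog/1921385 (this post uses the older 1.2.1 release).
  Because I only have one server, I followed the pseudo-distributed (single-node) setup guide: http://hadoop.apache.org/docs/r1.2.1/single_node_setup.html
3. Start the setup. Server: 172.16.236.11
  Create a user libin with the password password.
  Copy the downloaded Hadoop package from the local machine: scp hadoop-1.2.1.tar.gz libin@172.16.236.11:~/
4. Log in and unpack it:
  ssh libin@172.16.236.11
  tar -zxvf hadoop-1.2.1.tar.gz
  cd hadoop-1.2.1
  The remaining steps configure Hadoop; paths such as conf/ are relative to this directory.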
5. vi conf/core-site.xml: define the URI and port of the Hadoop master (the NameNode)
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>



<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
6. vi conf/hdfs-site.xml: set the number of replicas kept for the stored data
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>



<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
7. vi conf/mapred-site.xml: set the host and port the JobTracker runs on
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>



<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
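With three files edited by hand, reading the values back can catch a typo before any daemon is started. The following is a minimal sketch and not part of Hadoop: get_hadoop_prop is a hypothetical helper, and it assumes each <name> and <value> element sits on a single line, as in the files above.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Hadoop): print the <value> that
# follows <name>PROPERTY</name> in a Hadoop XML configuration file.
# Assumes every <name> and <value> element fits on one line.
get_hadoop_prop() {
  awk -v want="$2" '
    index($0, "<name>" want "</name>") { found = 1 }
    found && match($0, /<value>[^<]*<\/value>/) {
      # Strip the 7-character opening tag and the 8-character closing tag.
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
  ' "$1"
}

# Usage, run from the hadoop-1.2.1 directory:
#   get_hadoop_prop conf/core-site.xml fs.default.name
#   get_hadoop_prop conf/hdfs-site.xml dfs.replication
#   get_hadoop_prop conf/mapred-site.xml mapred.job.tracker
```

An empty result means the property was not found, which usually points at a misspelled name or a file saved under the wrong path.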
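8. Set up passwordless ssh login:
  This step is required. Hadoop's start and stop scripts log in to every node over ssh (here that is just localhost) to launch the daemons, and without it you are prompted for a password again and again.
  Now check that you can ssh to the localhost without a passphrase:
  $ ssh localhost
  If you cannot ssh to localhost without a passphrase, execute the following commands: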
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
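Two practical details are worth a sketch here. Appending with >> adds a duplicate line to authorized_keys every time the command is repeated, and sshd with its default StrictModes setting ignores an authorized_keys file that other users can write to. The helper below is an illustration only (authorize_key is a hypothetical name): it appends a public key only if that line is not already present, then tightens the file mode. Separately, OpenSSH 7.0 and later disable DSA keys by default, so on a newer system the same steps would likely need -t rsa instead of -t dsa.

```shell
#!/bin/sh
# Hypothetical helper: append a public key to an authorized_keys file only
# if that exact line is not already there, then restrict the file mode.
authorize_key() {
  pub_file=$1
  auth_file=$2
  touch "$auth_file"
  # -F: fixed string, -x: whole-line match, -q: quiet
  if ! grep -qxF -e "$(cat "$pub_file")" "$auth_file"; then
    cat "$pub_file" >> "$auth_file"
  fi
  chmod 600 "$auth_file"
}

# Usage with the key generated above:
#   authorize_key ~/.ssh/id_dsa.pub ~/.ssh/authorized_keys
```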
9. Format the NameNode
bin/hadoop namenode -format
  You should see:
  13/08/14 13:48:12 INFO common.Storage: Storage directory /tmp/hadoop-libin/dfs/name has been successfully formatted.
  which means the NameNode was formatted successfully.
10. Start the Hadoop cluster
libin@d03:~/hadoop-1.2.1$ bin/start-all.sh
starting namenode, logging to /home/libin/hadoop-1.2.1/libexec/../logs/hadoop-libin-namenode-d03.out
localhost: starting datanode, logging to /home/libin/hadoop-1.2.1/libexec/../logs/hadoop-libin-datanode-d03.out
localhost: Error: JAVA_HOME is not set.
  The error says JAVA_HOME is not configured. Set it in conf/hadoop-env.sh:
  export JAVA_HOME=/home/libin/jdk1.6.0_31
  After that, start Hadoop again. If jps lists the following processes, Hadoop started successfully:
libin@d03:~/hadoop-1.2.1$ jps
22002 TaskTracker
22119 Jps
21706 DataNode
21841 SecondaryNameNode
20710 NameNode
20967 JobTracker
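The JAVA_HOME failure above only shows up after start-all.sh has already begun launching daemons. A small pre-flight check can catch it earlier. This is a sketch under assumptions: has_java_home is a hypothetical helper, and it only looks for an uncommented export JAVA_HOME= line; it does not verify that the path points at a working JDK.

```shell
#!/bin/sh
# Hypothetical pre-flight check: succeed if the given hadoop-env.sh contains
# an uncommented "export JAVA_HOME=..." line with a non-empty value.
has_java_home() {
  grep -Eq '^[[:space:]]*export[[:space:]]+JAVA_HOME=.+' "$1"
}

# Usage, run from the hadoop-1.2.1 directory:
#   has_java_home conf/hadoop-env.sh || echo "set JAVA_HOME in conf/hadoop-env.sh first"
```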
12. Use hadoop commands to look at the files in HDFS
  We usually create an input directory and an output directory: whatever a job needs as input goes under input, and its results are written under output.
libin@d03:~/hadoop-1.2.1$ hadoop fs -ls
Warning: $HADOOP_HOME is deprecated.

ls: Cannot access .: No such file or directory.

libin@d03:~/hadoop-1.2.1$ hadoop fs -mkdir input
Warning: $HADOOP_HOME is deprecated.

libin@d03:~/hadoop-1.2.1$ hadoop fs -mkdir output
Warning: $HADOOP_HOME is deprecated.

libin@d03:~/hadoop-1.2.1$ hadoop fs -ls
Warning: $HADOOP_HOME is deprecated.

Found 2 items
drwxr-xr-x - libin supergroup 0 2013-08-14 13:57 /user/libin/input
drwxr-xr-x - libin supergroup 0 2013-08-14 13:57 /user/libin/output
13. Create a file locally and upload it to the input directory in HDFS
libin@d03:~/hadoop-1.2.1$ hadoop fs -put input-local/libin input/libin
Warning: $HADOOP_HOME is deprecated.

libin@d03:~/hadoop-1.2.1$ hadoop fs -ls
Warning: $HADOOP_HOME is deprecated.

Found 2 items
drwxr-xr-x - libin supergroup 0 2013-08-14 14:00 /user/libin/input
drwxr-xr-x - libin supergroup 0 2013-08-14 13:57 /user/libin/output
libin@d03:~/hadoop-1.2.1$ hadoop fs -ls /user/libin/input
Warning: $HADOOP_HOME is deprecated.

Found 1 items
-rw-r--r-- 1 libin supergroup 29 2013-08-14 14:00 /user/libin/input/libin
libin@d03:~/hadoop-1.2.1$ hadoop fs -ls input
Warning: $HADOOP_HOME is deprecated.

Found 1 items
-rw-r--r-- 1 libin supergroup 29 2013-08-14 14:00 /user/libin/input/libin
libin@d03:~/hadoop-1.2.1$
14. Run the examples that ship with Hadoop:
  Running the examples jar without arguments lists them:
libin@d03:~/hadoop-1.2.1$ hadoop jar hadoop-examples-1.2.1.jar
Warning: $HADOOP_HOME is deprecated.

An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
dbcount: An example job that count the pageview counts from a database.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using monte-carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sleep: A job that sleeps at each map and reduce task.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
libin@d03:~/hadoop-1.2.1$
15. Run the classic wordcount, the "hello world" of Hadoop
libin@d03:~/hadoop-1.2.1$ hadoop jar hadoop-examples-1.2.1.jar wordcount
Warning: $HADOOP_HOME is deprecated.

Usage: wordcount <in> <out>
libin@d03:~/hadoop-1.2.1$ hadoop jar hadoop-examples-1.2.1.jar wordcount input/libin output
Warning: $HADOOP_HOME is deprecated.

13/08/14 14:02:14 INFO mapred.JobClient: Cleaning up the staging area hdfs://localhost:9000/tmp/hadoop-libin/mapred/staging/libin/.staging/job_201308141349_0001
13/08/14 14:02:14 ERROR security.UserGroupInformation: PriviledgedActionException as:libin cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory output already exists
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory output already exists
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:137)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:973)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:550)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:580)
at org.apache.hadoop.examples.WordCount.main(WordCount.java:82)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
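  The first run shows that wordcount takes two arguments, <in> and <out>.
  The second run fails with "Output directory output already exists". Hadoop checks the output directory before it runs a job and refuses to start if the directory is already there, so that a repeated run cannot overwrite the results of an earlier one. Pointing the job at an output directory that does not exist yet fixes it: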
libin@d03:~/hadoop-1.2.1$ hadoop jar hadoop-examples-1.2.1.jar wordcount input/libin output/wordcount
Warning: $HADOOP_HOME is deprecated.

13/08/14 14:02:27 INFO input.FileInputFormat: Total input paths to process : 1
13/08/14 14:02:27 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/08/14 14:02:27 WARN snappy.LoadSnappy: Snappy native library not loaded
13/08/14 14:02:27 INFO mapred.JobClient: Running job: job_201308141349_0002
13/08/14 14:02:28 INFO mapred.JobClient: map 0% reduce 0%
13/08/14 14:02:32 INFO mapred.JobClient: map 100% reduce 0%
13/08/14 14:02:40 INFO mapred.JobClient: map 100% reduce 100%
13/08/14 14:02:40 INFO mapred.JobClient: Job complete: job_201308141349_0002
13/08/14 14:02:40 INFO mapred.JobClient: Counters: 29
13/08/14 14:02:40 INFO mapred.JobClient: Job Counters
13/08/14 14:02:40 INFO mapred.JobClient: Launched reduce tasks=1
13/08/14 14:02:40 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=3336
13/08/14 14:02:40 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/08/14 14:02:40 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/08/14 14:02:40 INFO mapred.JobClient: Launched map tasks=1
13/08/14 14:02:40 INFO mapred.JobClient: Data-local map tasks=1
13/08/14 14:02:40 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8179
13/08/14 14:02:40 INFO mapred.JobClient: File Output Format Counters
13/08/14 14:02:40 INFO mapred.JobClient: Bytes Written=37
13/08/14 14:02:40 INFO mapred.JobClient: FileSystemCounters
13/08/14 14:02:40 INFO mapred.JobClient: FILE_BYTES_READ=71
13/08/14 14:02:40 INFO mapred.JobClient: HDFS_BYTES_READ=138
13/08/14 14:02:40 INFO mapred.JobClient: FILE_BYTES_WRITTEN=110523
13/08/14 14:02:40 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=37
13/08/14 14:02:40 INFO mapred.JobClient: File Input Format Counters
13/08/14 14:02:40 INFO mapred.JobClient: Bytes Read=29
13/08/14 14:02:40 INFO mapred.JobClient: Map-Reduce Framework
13/08/14 14:02:40 INFO mapred.JobClient: Map output materialized bytes=71
13/08/14 14:02:40 INFO mapred.JobClient: Map input records=8
13/08/14 14:02:40 INFO mapred.JobClient: Reduce shuffle bytes=71
13/08/14 14:02:40 INFO mapred.JobClient: Spilled Records=14
13/08/14 14:02:40 INFO mapred.JobClient: Map output bytes=69
13/08/14 14:02:40 INFO mapred.JobClient: CPU time spent (ms)=1460
13/08/14 14:02:41 INFO mapred.JobClient: Total committed heap usage (bytes)=401997824
13/08/14 14:02:41 INFO mapred.JobClient: Combine input records=10
13/08/14 14:02:41 INFO mapred.JobClient: SPLIT_RAW_BYTES=109
13/08/14 14:02:41 INFO mapred.JobClient: Reduce input records=7
13/08/14 14:02:41 INFO mapred.JobClient: Reduce input groups=7
13/08/14 14:02:41 INFO mapred.JobClient: Combine output records=7
13/08/14 14:02:41 INFO mapred.JobClient: Physical memory (bytes) snapshot=311259136
13/08/14 14:02:41 INFO mapred.JobClient: Reduce output records=7
13/08/14 14:02:41 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1118924800
13/08/14 14:02:41 INFO mapred.JobClient: Map output records=10
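Since every run needs an output directory that does not exist yet, rerunning the same job means either picking a new path each time or deleting the old one first. A minimal sketch of the second option follows. run_wordcount is a hypothetical wrapper. It relies on hadoop fs -test -e and hadoop fs -rmr, the Hadoop 1.x shell commands for checking and recursively deleting an HDFS path. Because it deletes the old results, it is only appropriate when they are disposable.

```shell
#!/bin/sh
# Hypothetical wrapper around the bundled wordcount example.
# WARNING: removes the output directory in HDFS if it already exists.
run_wordcount() {
  in_path=$1
  out_path=$2
  # "hadoop fs -test -e" exits with status 0 when the path exists.
  if hadoop fs -test -e "$out_path"; then
    hadoop fs -rmr "$out_path"
  fi
  hadoop jar hadoop-examples-1.2.1.jar wordcount "$in_path" "$out_path"
}

# Usage, run from the hadoop-1.2.1 directory:
#   run_wordcount input/libin output/wordcount
```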
16. Check the result
libin@d03:~/hadoop-1.2.1$ hadoop fs -cat output/wordcount/part-r-00000
Warning: $HADOOP_HOME is deprecated.

a 4
c 1
d 1
is 1
li 1
libin 1
tmp? 1
libin@d03:~/hadoop-1.2.1$
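For an input this small, the result can be cross-checked locally without Hadoop. The sketch below assumes the bundled WordCount splits each line on whitespace (it tokenizes with Java's StringTokenizer) and that keys come out in byte order. local_wordcount is a hypothetical helper built from standard Unix tools.

```shell
#!/bin/sh
# Hypothetical local cross-check: count whitespace-separated words on stdin
# and print "word<TAB>count", sorted by byte order.
local_wordcount() {
  tr -s '[:space:]' '\n' |        # one word per line
    sed '/^$/d' |                 # drop the empty line left by leading whitespace
    LC_ALL=C sort |
    uniq -c |
    awk '{ print $2 "\t" $1 }'    # uniq -c prints "count word"; swap them
}

# Usage: compare against the job output
#   local_wordcount < input-local/libin
#   hadoop fs -cat output/wordcount/part-r-00000
```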
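  All done. The next step is to develop new MapReduce programs on top of this setup.
  The addresses configured here would ideally be hostnames resolved through DNS, but that brings in reverse DNS lookups, so I did not configure it that way. If you are setting up a real cluster, do configure reverse DNS resolution.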
