Hadoop---Installing Hadoop
Hadoop is an open-source framework for writing and running distributed applications that process data at large scale. Running Hadoop requires Java 1.6 or later. The JDK can be downloaded from: http://www.oracle.com/technetwork/java/javase/downloads/jdk-7u3-download-1501626.html
Download the JDK (this walkthrough uses JDK 6u29) and copy it to the Linux server via Samba or FTP.
Running ./jdk-6u29-linux-i586-rpm.bin produces jdk-6u29-linux-i586.rpm, which is installed with:
rpm -ivh jdk-6u29-linux-i586.rpm
By default the JDK is installed under /usr/java.
Next, set JAVA_HOME: open ~/.bash_profile with vi and add the variable:
# User specific environment and startup programs

PATH=$PATH:$HOME/bin
JAVA_HOME=/usr/java/jdk1.6.0_29

export PATH
export JAVA_HOME
unset USERNAME
Run source ~/.bash_profile to make the new variable take effect.
Hadoop can be downloaded from http://labs.renren.com/apache-mirror/hadoop/common/hadoop-1.0.2/
Copy hadoop-1.0.2.tar.gz to the Linux server
and unpack it with tar zxvf hadoop-1.0.2.tar.gz.
Change into the bin directory of the unpacked tree (e.g. /opt/hadoop-1.0.2/bin)
and run Hadoop without any arguments:
./hadoop
which prints:
# ./hadoop
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  mradmin              run a Map-Reduce admin client
  fsck                 run a DFS filesystem checking utility
  fs                   run a generic filesystem user client
  balancer             run a cluster balancing utility
  fetchdt              fetch a delegation token from the NameNode
  jobtracker           run the MapReduce job Tracker node
  pipes                run a Pipes job
  tasktracker          run a MapReduce task Tracker node
  historyserver        run job history servers as a standalone daemon
  job                  manipulate MapReduce jobs
  queue                get information regarding JobQueues
  version              print the version
  jar <jar>            run a jar file
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
 or
  CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
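The last entry means hadoop can launch any class on its classpath. As a quick illustration, here is a trivial probe class (the class name ClasspathProbe is hypothetical, chosen only for this sketch) that prints the JVM classpath the hadoop script assembled, the same list the classpath subcommand reports:

public class ClasspathProbe {
    public static void main(String[] args) {
        // Print the classpath the launching script handed to the JVM.
        System.out.println(System.getProperty("java.class.path"));
    }
}

Compile it, make the resulting .class file visible to Hadoop (for example via the HADOOP_CLASSPATH variable in conf/hadoop-env.sh), and run ./hadoop ClasspathProbe.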
For example, running ./hadoop classpath gives:
# ./hadoop classpath
/opt/hadoop-1.0.2/libexec/../conf:/usr/java/jdk1.6.0_29/lib/tools.jar:/opt/hadoop-1.0.2/libexec/..:/opt/hadoop-1.0.2/libexec/../hadoop-core-1.0.2.jar:/opt/hadoop-1.0.2/libexec/../lib/asm-3.2.jar:/opt/hadoop-1.0.2/libexec/../lib/aspectjrt-1.6.5.jar:/opt/hadoop-1.0.2/libexec/../lib/aspectjtools-1.6.5.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-beanutils-1.7.0.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-beanutils-core-1.8.0.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-cli-1.2.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-codec-1.4.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-collections-3.2.1.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-configuration-1.6.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-daemon-1.0.1.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-digester-1.8.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-el-1.0.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-httpclient-3.0.1.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-lang-2.4.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-logging-1.1.1.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-logging-api-1.0.4.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-math-2.1.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-net-1.4.1.jar:/opt/hadoop-1.0.2/libexec/../lib/core-3.1.1.jar:/opt/hadoop-1.0.2/libexec/../lib/hadoop-capacity-scheduler-1.0.2.jar:/opt/hadoop-1.0.2/libexec/../lib/hadoop-fairscheduler-1.0.2.jar:/opt/hadoop-1.0.2/libexec/../lib/hadoop-thriftfs-1.0.2.jar:/opt/hadoop-1.0.2/libexec/../lib/hsqldb-1.8.0.10.jar:/opt/hadoop-1.0.2/libexec/../lib/jackson-core-asl-1.8.8.jar:/opt/hadoop-1.0.2/libexec/../lib/jackson-mapper-asl-1.8.8.jar:/opt/hadoop-1.0.2/libexec/../lib/jasper-compiler-5.5.12.jar:/opt/hadoop-1.0.2/libexec/../lib/jasper-runtime-5.5.12.jar:/opt/hadoop-1.0.2/libexec/../lib/jdeb-0.8.jar:/opt/hadoop-1.0.2/libexec/../lib/jersey-core-1.8.jar:/opt/hadoop-1.0.2/libexec/../lib/jersey-json-1.8.jar:/opt/hadoop-1.0.2/libexec/../lib/jersey-server-1.8.jar:/opt/hadoop-1.0.2/libexec/../lib/jets3t-0.6.1.jar:/opt/hadoop-1.0.2/libexec/../lib/jetty-6.1.26.jar:/opt/hadoop-1.0.2/libexec/../lib/jetty-util-6.1.26.jar:/opt/hadoop-1.0.2/libexec/../lib/jsch-0.1.42.jar:/opt/hadoop-1.0.2/libexec/../lib/junit-4.5.jar:/opt/hadoop-1.0.2/libexec/../lib/kfs-0.2.2.jar:/opt/hadoop-1.0.2/libexec/../lib/log4j-1.2.15.jar:/opt/hadoop-1.0.2/libexec/../lib/mockito-all-1.8.5.jar:/opt/hadoop-1.0.2/libexec/../lib/oro-2.0.8.jar:/opt/hadoop-1.0.2/libexec/../lib/servlet-api-2.5-20081211.jar:/opt/hadoop-1.0.2/libexec/../lib/slf4j-api-1.4.3.jar:/opt/hadoop-1.0.2/libexec/../lib/slf4j-log4j12-1.4.3.jar:/opt/hadoop-1.0.2/libexec/../lib/xmlenc-0.52.jar:/opt/hadoop-1.0.2/libexec/../lib/jsp-2.1/jsp-2.1.jar:/opt/hadoop-1.0.2/libexec/../lib/jsp-2.1/jsp-api-2.1.jar
As the usage text shows, the command for running a (Java) Hadoop program is hadoop jar: Hadoop programs written in Java are packaged as executable jar files.
The Hadoop directory ships a file named hadoop-examples-1.0.2.jar (the exact name differs between Hadoop versions) containing a set of example programs. Their source can be found under hadoop-1.0.2/src/examples/org/apache/hadoop/examples (a sketch of the jar's entry point follows the listing):
# ll
total 228
-rw-rw-r-- 1 root root  2797 Mar 25 08:01 AggregateWordCount.java
-rw-rw-r-- 1 root root  2879 Mar 25 08:01 AggregateWordHistogram.java
drwxr-xr-x 2 root root  4096 Apr 11 21:50 dancing
-rw-rw-r-- 1 root root 13089 Mar 25 08:01 DBCountPageView.java
-rw-rw-r-- 1 root root  3751 Mar 25 08:01 ExampleDriver.java
-rw-rw-r-- 1 root root  3334 Mar 25 08:01 Grep.java
-rw-rw-r-- 1 root root  6582 Mar 25 08:01 Join.java
-rw-rw-r-- 1 root root  8282 Mar 25 08:01 MultiFileWordCount.java
-rw-rw-r-- 1 root root   853 Mar 25 08:01 package.html
-rw-rw-r-- 1 root root 11914 Mar 25 08:01 PiEstimator.java
-rw-rw-r-- 1 root root 40350 Mar 25 08:01 RandomTextWriter.java
-rw-rw-r-- 1 root root 10190 Mar 25 08:01 RandomWriter.java
-rw-rw-r-- 1 root root  7809 Mar 25 08:01 SecondarySort.java
-rw-rw-r-- 1 root root  9156 Mar 25 08:01 SleepJob.java
-rw-rw-r-- 1 root root  8040 Mar 25 08:01 Sort.java
drwxr-xr-x 2 root root  4096 Apr 11 21:50 terasort
-rw-rw-r-- 1 root root  2395 Mar 25 08:01 WordCount.java
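ExampleDriver.java in this listing is the jar's entry point: it maps the program name given after hadoop jar (such as wordcount) to the example class to run. A minimal sketch of how it does this, based on the ProgramDriver utility class shipped with Hadoop 1.x (only the wordcount registration is shown; the real file registers every example):

import org.apache.hadoop.examples.WordCount;
import org.apache.hadoop.util.ProgramDriver;

public class ExampleDriver {
    public static void main(String[] argv) {
        int exitCode = -1;
        ProgramDriver pgd = new ProgramDriver();
        try {
            // Register example programs by name; "wordcount" maps to WordCount.
            pgd.addClass("wordcount", WordCount.class,
                "A map/reduce program that counts the words in the input files.");
            // Look up argv[0] and invoke the matching class with the remaining args.
            pgd.driver(argv);
            exitCode = 0;
        } catch (Throwable e) {
            e.printStackTrace();
        }
        System.exit(exitCode);
    }
}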
We will use WordCount to test-drive Hadoop (a condensed view of its mapper and reducer follows the usage message below).
Running wordcount without any arguments prints a short usage message:
# ./hadoop jar /opt/hadoop-1.0.2/hadoop-examples-1.0.2.jar wordcount
Usage: wordcount <in> <out>
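Before feeding it real input, it is worth seeing what wordcount actually does. The following is a condensed sketch of the mapper and reducer in src/examples/org/apache/hadoop/examples/WordCount.java (abridged from the 1.0.2 source; the local variable line is pulled out here for readability; the class also needs java.io.IOException, java.util.StringTokenizer, org.apache.hadoop.io.*, and org.apache.hadoop.mapreduce.*):

public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // The default StringTokenizer splits on whitespace only, so punctuation
        // stays attached to words ("Eat," and "Eat" become different tokens).
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);   // emit (word, 1) for every token
        }
    }
}

public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();           // add up all the 1s for this word
        }
        result.set(sum);
        context.write(key, result);     // emit (word, total count)
    }
}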
Download an English essay from the web, save it as test.txt under /opt/data,
and run wordcount again:
# ./hadoop jar /opt/hadoop-1.0.2/hadoop-examples-1.0.2.jar wordcount /opt/data/test.txt /opt/data/output
12/04/11 22:48:41 INFO util.NativeCodeLoader: Loaded the native-hadoop library
file:/opt/data/test.txt
12/04/11 22:48:41 INFO input.FileInputFormat: Total input paths to process : 1
12/04/11 22:48:41 WARN snappy.LoadSnappy: Snappy native library not loaded
12/04/11 22:48:42 INFO mapred.JobClient: Running job: job_local_0001
12/04/11 22:48:42 INFO util.ProcessTree: setsid exited with exit code 0
12/04/11 22:48:42 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@9ced8e
12/04/11 22:48:42 INFO mapred.MapTask: io.sort.mb = 100
12/04/11 22:48:43 INFO mapred.MapTask: data buffer = 79691776/99614720
12/04/11 22:48:43 INFO mapred.MapTask: record buffer = 262144/327680
12/04/11 22:48:43 INFO mapred.MapTask: Starting flush of map output
12/04/11 22:48:43 INFO mapred.MapTask: Finished spill 0
12/04/11 22:48:43 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/04/11 22:48:43 INFO mapred.JobClient:  map 0% reduce 0%
12/04/11 22:48:45 INFO mapred.LocalJobRunner:
12/04/11 22:48:45 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
12/04/11 22:48:45 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@281d4b
12/04/11 22:48:45 INFO mapred.LocalJobRunner:
12/04/11 22:48:45 INFO mapred.Merger: Merging 1 sorted segments
12/04/11 22:48:45 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 6079 bytes
12/04/11 22:48:45 INFO mapred.LocalJobRunner:
12/04/11 22:48:45 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
12/04/11 22:48:45 INFO mapred.LocalJobRunner:
12/04/11 22:48:45 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
12/04/11 22:48:45 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to /opt/data/output
12/04/11 22:48:46 INFO mapred.JobClient:  map 100% reduce 0%
12/04/11 22:48:48 INFO mapred.LocalJobRunner: reduce > reduce
12/04/11 22:48:48 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
12/04/11 22:48:49 INFO mapred.JobClient:  map 100% reduce 100%
12/04/11 22:48:49 INFO mapred.JobClient: Job complete: job_local_0001
12/04/11 22:48:49 INFO mapred.JobClient: Counters: 20
12/04/11 22:48:49 INFO mapred.JobClient:   File Output Format Counters
12/04/11 22:48:49 INFO mapred.JobClient:     Bytes Written=4241
12/04/11 22:48:49 INFO mapred.JobClient:   FileSystemCounters
12/04/11 22:48:49 INFO mapred.JobClient:     FILE_BYTES_READ=301803
12/04/11 22:48:49 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=368355
12/04/11 22:48:49 INFO mapred.JobClient:   File Input Format Counters
12/04/11 22:48:49 INFO mapred.JobClient:     Bytes Read=5251
12/04/11 22:48:49 INFO mapred.JobClient:   Map-Reduce Framework
12/04/11 22:48:49 INFO mapred.JobClient:     Map output materialized bytes=6083
12/04/11 22:48:49 INFO mapred.JobClient:     Map input records=21
12/04/11 22:48:49 INFO mapred.JobClient:     Reduce shuffle bytes=0
12/04/11 22:48:49 INFO mapred.JobClient:     Spilled Records=946
12/04/11 22:48:49 INFO mapred.JobClient:     Map output bytes=9182
12/04/11 22:48:49 INFO mapred.JobClient:     Total committed heap usage (bytes)=321134592
12/04/11 22:48:49 INFO mapred.JobClient:     CPU time spent (ms)=0
12/04/11 22:48:49 INFO mapred.JobClient:     SPLIT_RAW_BYTES=88
12/04/11 22:48:49 INFO mapred.JobClient:     Combine input records=970
12/04/11 22:48:49 INFO mapred.JobClient:     Reduce input records=473
12/04/11 22:48:49 INFO mapred.JobClient:     Reduce input groups=473
12/04/11 22:48:49 INFO mapred.JobClient:     Combine output records=473
12/04/11 22:48:49 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
12/04/11 22:48:49 INFO mapred.JobClient:     Reduce output records=473
12/04/11 22:48:49 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
12/04/11 22:48:49 INFO mapred.JobClient:     Map output records=970
View the result of the count:
# more /opt/data/output/*
::::::::::::::
/opt/data/output/part-r-00000
::::::::::::::
"Eat, 1
"How 1
"she 1
And 1
But 1
Darkness 1
Epicurean 1
Eyes". 1
He 1
I 24
If 2
In 1
It 3
Nature 2
Occasionally, 1
Only 1
Particularly 1
Persian 1
Recently 1
So 1
Sometimes 1
Such 1
The 2
Their 1
There 1
To 2
Use 1
We 3
What 1
When 1
Yet, 1
The wordcount program has a weakness: it splits text purely on whitespace rather than on punctuation, so "Eat,", "eat", and "Eat" are each counted as separate words. This can be fixed by editing WordCount.java:
change StringTokenizer itr = new StringTokenizer(line) to
StringTokenizer itr = new StringTokenizer(line," \t\n\r\f,.:;?![]'")
Recompile and run it once more, and the results are much improved.
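In context, and assuming the stock TokenizerMapper from the 1.0.2 example source sketched earlier, the modified map() would look roughly like this:

public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    String line = value.toString();
    // Split on common punctuation as well as whitespace. Appending \" to the
    // delimiter list would additionally strip the double quotes visible in
    // the output above ("Eat, "How, ...).
    StringTokenizer itr = new StringTokenizer(line, " \t\n\r\f,.:;?![]'");
    while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
    }
}

After the change, rebuild the examples jar and rerun the job against the same input to compare the two outputs.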