Background, Requirements, and Current State
What we have today can hardly be called a log system: all the logs simply sit on a few servers, shared over NFS and processed there, which brings a number of problems:
a) Data is stored haphazardly, with no systematic directory management;
b) Storage space is limited, and expanding it is very troublesome;
c) CV/PV and other logs are stored in separate places and are inconvenient to merge;
d) Media-service logs are stored centrally; the data volume is so large that lightweight backups are impractical;
e) Data loss happens from time to time, and lost data cannot be recovered;
f) Data retrieval performance is poor and often becomes the bottleneck for computation;
k) Install Hadoop
i. Download: get the release from the official site; the steps are not repeated here.
ii. We install Hadoop under /usr/local/:
tar zxvf hadoop-0.20.2.tar.gz
ln -s hadoop-0.20.2 hadoop
cd hadoop
iii. Configure Hadoop (I copied the official default configuration and only list the changes below. This is a single-node setup; for a cluster see: http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html )
conf/core-site.xml:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
conf/mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
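In addition to the two files above, a single-node (pseudo-distributed) setup usually also drops the HDFS replication factor to 1, since there is only one DataNode. This snippet follows the standard Hadoop 0.20 documentation and is not part of the original notes:
conf/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>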
iv. Format HDFS
If the error Error: JAVA_HOME is not set. appears, it means JAVA_HOME has not been configured.
We set JAVA_HOME globally:
vi /etc/environment
Add JAVA_HOME and append /usr/local/hadoop/bin to the PATH:
JAVA_HOME="/usr/lib/jvm/java-6-sun"
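The format command itself was not recorded in the original notes; with Hadoop 0.20.2 the standard command is the one below, run once before the first start:
hadoop namenode -format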
v. Start Hadoop
start-all.sh
vi. Check that Hadoop is running properly
netstat -nl | more
tcp6 0 0 127.0.0.1:9000 :::* LISTEN
tcp6 0 0 127.0.0.1:9001 :::* LISTEN
tcp6 0 0 :::50090 :::* LISTEN
tcp6 0 0 :::50070 :::* LISTEN
vii. Test
hadoop fs -put CHANGES.txt input/
hadoop fs -ls input
This example counts the occurrences of each string that matches the regular expression:
hadoop jar hadoop-*-examples.jar grep input output '[a-z.]+'
root@hadoop-desktop:/usr/local/hadoop# hadoop fs -cat output/* |more
cat: Source must be a file.
3828 .
1969 via
1375 to
l) API overview
See the attached Word document: hadoop的API.docx
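Since the attachment is not reproduced here, below is a minimal sketch of the Hadoop 0.20 FileSystem API just to give a flavour of it; the class name and paths are illustrative and assume the single-node setup above:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsListExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.default.name from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // List the input/ directory used in the test step above
        for (FileStatus status : fs.listStatus(new Path("input"))) {
            System.out.println(status.getPath() + "\t" + status.getLen());
        }
        // Read the first line of the file uploaded with `hadoop fs -put`
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("input/CHANGES.txt"))));
        System.out.println(reader.readLine());
        reader.close();
        fs.close();
    }
}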
m) Install Hive
n) Download: get the latest release from the official site; the steps are not repeated here.
o) Unpack:
tar zxvf hive-0.5.0-bin.tar.gz
ln -s hive-0.5.0-bin hive
p) Configure the Hive environment
vi /etc/environment
HIVE_HOME="/usr/local/hive/"
q) Create the Hive warehouse directory
hadoop fs -mkdir /user/hive/warehouse
hadoop fs -chmod g+w /user/hive/warehouse
r) Start Hive
hive
This drops you into the hive> prompt.
Create the pokes table:
hive> CREATE TABLE pokes (foo INT, bar STRING);
Load the test data; the file being loaded has two columns:
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
hive> select count(1) from pokes;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201004131133_0018, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201004131133_0018
Kill Command = /usr/local/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201004131133_0018
2010-04-13 16:32:12,188 Stage-1 map = 0%, reduce = 0%
2010-04-13 16:32:29,536 Stage-1 map = 100%, reduce = 0%
2010-04-13 16:32:38,768 Stage-1 map = 100%, reduce = 33%
2010-04-13 16:32:44,916 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201004131133_0018
OK
500
Time taken: 38.379 seconds
hive> select count(bar),bar from pokes group by bar;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201004131133_0017, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201004131133_0017
Kill Command = /usr/local/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201004131133_0017
2010-04-13 16:26:55,791 Stage-1 map = 0%, reduce = 0%
2010-04-13 16:27:11,165 Stage-1 map = 100%, reduce = 0%
2010-04-13 16:27:20,268 Stage-1 map = 100%, reduce = 33%
2010-04-13 16:27:25,348 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201004131133_0017
OK
3 val_0
1 val_10
……………
Time taken: 37.979 seconds
s) Hive API:
See: http://hadoop.apache.org/hive/docs/current/api/org/apache/hadoop/hive/conf/
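As a rough illustration of what that package contains: HiveConf is the Configuration subclass Hive reads its settings from. The sketch below only shows the idea and has not been verified against 0.5.0, so treat the constant names as assumptions:
import org.apache.hadoop.hive.conf.HiveConf;

public class HiveConfExample {
    public static void main(String[] args) {
        // Loads hive-default.xml / hive-site.xml from the classpath
        HiveConf conf = new HiveConf(HiveConfExample.class);
        // Warehouse directory created in step q) above
        System.out.println(conf.getVar(HiveConf.ConfVars.METASTOREWAREHOUSE));
        // One of the knobs the CLI hints at in the query output above
        System.out.println(conf.get("hive.exec.reducers.bytes.per.reducer"));
    }
}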
t) Hive with MySQL as the metastore
See: http://www.mazsoft.com/blog/post/2010/02/01/Setting-up-HadoopHive-to-use-MySQL-as-metastore.aspx
The idea is to keep the metastore in MySQL so that the list of tables survives an HDFS failure. It does not seem necessary to me: if HDFS is down, having the metadata alone is of little use.
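For reference, what the linked post describes boils down to pointing the metastore's JDO connection at MySQL in hive-site.xml, roughly as below; the host, database name and credentials are placeholders, and the MySQL JDBC driver jar also has to be copied into $HIVE_HOME/lib:
conf/hive-site.xml:
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive_meta?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
</property>
</configuration>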
u) Some Hive & Hadoop tests:
v) Loading gz or bz2 compressed source data: comparison of space used and load time:
hive> load data local inpath 'ok.txt.gz' overwrite into table page_test2 partition(dt='2010-04-16');
Copying data from file:/usr/local/ok.txt.gz
Loading data to table page_test2 partition {dt=2010-04-16}
OK
Time taken: 3.649 seconds
Below is the size of the file backing this Hive table on the Hadoop datanode:
root@hadoop:/tmp/hadoop-root/dfs/data/current# du -ch blk_-945326243445352181
22M blk_-945326243445352181
22M total
w) Loading the plain-text file:
hive> load data local inpath 'ok.txt' overwrite into table page_test partition(dt='2010-04-17');
Copying data from file:/usr/local/ok.txt
Loading data to table page_test partition {dt=2010-04-17}
OK
Time taken: 41.593 seconds
Below is the size of one of the files backing this Hive table on the Hadoop datanode (64 MB is the HDFS block size, so this is a single block of the 196 MB file rather than the whole table):
root@hadoop:/tmp/hadoop-root/dfs/data/current# du -ch blk_7538941016314062501
64M blk_7538941016314062501
64M total
x) Source file sizes:
root@hadoop:/usr/local# du -ch ok.txt
196M ok.txt
196M total
root@hadoop:/usr/local# du -ch ok.txt.gz
22M ok.txt.gz
22M total
y) Hive query comparison:
From the results, querying the compressed data is actually slightly faster than querying the uncompressed data, which is a little surprising. (A likely reason is that the gzip file means far less disk I/O per map task; on the other hand gzip is not splittable, so it can only be read by a single mapper.)
Query over the gz file, loaded into a partition, using Hive: hive> select count(1) from page_test2 a where a.dt='2010-04-16';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201004131133_0026, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201004131133_0026
Kill Command = /usr/local/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201004131133_0026
2010-04-16 13:43:39,435 Stage-1 map = 0%, reduce = 0%
2010-04-16 13:47:30,921 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201004131133_0026
OK
17166483
Time taken: 239.447 seconds
Query over the txt file, loaded into a partition, using Hive QL: hive> select count(1) from page_test a where a.dt='2010-04-16';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201004131133_0025, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201004131133_0025
Kill Command = /usr/local/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201004131133_0025
2010-04-16 13:37:11,927 Stage-1 map = 0%, reduce = 0%
2010-04-16 13:42:01,382 Stage-1 map = 100%, reduce = 22%
2010-04-16 13:42:13,683 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201004131133_0025
OK
17166483
Time taken: 314.291 seconds
Querying the txt data without partitions was not recorded exactly, but it took a little over 400 seconds.
z) Hive development
a) Start the Hive service:
Start the Hive server on port 10000:
HIVE_PORT=10000 ./bin/hive --service hiveserver
b) Check that the service has started:
netstat -nl | grep 10000
c) Write a test program:
This is the example from the official docs. It compiles for me, but fails with an error at runtime and I have not tracked down where the problem is. (One suspect, flagged in the comments below, is that the example calls executeQuery() on statements that return no result set, such as drop/create/load; some Hive JDBC driver versions reject that, and the initial drop table may also complain if the table does not exist yet.)
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveJdbcClient {
    private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

    /**
     * @param args
     * @throws SQLException
     */
    public static void main(String[] args) throws SQLException {
        try {
            // The Hive JDBC driver and its dependencies must be on the classpath
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        }
        // Connects to the standalone Hive server started in step a) above
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        String tableName = "testHiveDriverTable";
        // NOTE: drop/create/load return no result set; some driver versions reject
        // executeQuery() here -- a possible source of the runtime error mentioned above
        stmt.executeQuery("drop table " + tableName);
        ResultSet res = stmt.executeQuery("create table " + tableName + " (key int, value string)");
        // show tables
        String sql = "show tables '" + tableName + "'";
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        if (res.next()) {
            System.out.println(res.getString(1));
        }
        // describe table
        sql = "describe " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1) + "\t" + res.getString(2));
        }
        // load data into table
        // NOTE: filepath has to be local to the hive server
        // NOTE: /tmp/a.txt is a ctrl-A separated file with two fields per line
        String filepath = "/tmp/a.txt";
        sql = "load data local inpath '" + filepath + "' into table " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        // select * query
        sql = "select * from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
        }
        // regular hive query (this one runs a MapReduce job)
        sql = "select count(1) from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1));
        }
    }
}
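To build and run the client, the jars under $HIVE_HOME/lib plus the Hadoop core jar need to be on the classpath, and the standalone Hive server from step a) must already be listening on localhost:10000, which is what the connection URL above assumes.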