We recently started sharing data with the company Reyun (热云): we hand over the raw data and let them do the computation themselves. The data is synced once a day, a dozen-plus GB after compression, and is produced by Hive MapReduce queries. The flow was: write the query result to a local NFS mount with an insert overwrite local directory select statement, compress the data there, and serve the files for download from the NFS server machine.

The trouble is that before compression the data is far too large, roughly 90 GB, so the Hive job errored out and aborted at its final step, writing the SELECT result to the local file system. And even if the copy had succeeded, the subsequent compression would not finish without several hours of work. The idea, then, was to enable Hadoop's built-in data compression so the MapReduce job writes already-compressed output directly to the local file system.
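For reference, the original uncompressed export ran roughly like the sketch below; the NFS path and table here are placeholders for illustration, not the real ones:

-- in the hive CLI: dump the raw result onto the NFS mount (hypothetical path/table)
INSERT OVERWRITE LOCAL DIRECTORY '/mnt/nfs/export/rawdata'
SELECT * FROM some_db.some_table;

# then, on the NFS server: compress every output file before serving it
# (on ~90 GB this pass alone takes hours)
gzip -r /mnt/nfs/export/rawdata

With compression pushed into the job itself, both the oversized local write and the hours-long gzip pass go away. The concrete steps are as follows: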
1. Before executing the Hive statement, set the following parameters in the hive CLI (a quick sanity check is shown after the list):
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
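The first three parameters compress the job's final output with gzip (the BLOCK setting controls compression granularity when the output is a SequenceFile); the last two additionally compress the intermediate map output that is shuffled to the reducers. To confirm a parameter took effect, issue set with the key alone and the CLI echoes its current value:

hive> set mapreduce.output.fileoutputformat.compress;
mapreduce.output.fileoutputformat.compress=true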
Now execute a test statement:
insert overwrite directory '/user/reyun/testrawdataexport'
select keywords,count(1) from qiku.illegalpackage group by keywords;
It fails with the following error:
Diagnostic Messages for this Task:
Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#1
at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:333)
at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:255)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:351)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
Clearly, something went wrong during the shuffle. Looking at the error log on the node that actually executed the task:
2015-10-14 10:00:03,555 INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.gz]
2015-10-14 10:00:03,556 WARN [main] org.apache.hadoop.mapred.IFile: Could not obtain compressor from CodecPool
The compressor is a facility Hadoop ships with: when compressing a file, the framework (here IFile, which writes the intermediate map output) simply asks CodecPool for a compressor. Evidently CodecPool has no compressor available to hand out, so one has to be configured, which is exactly what step 2 below does.
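A quick way to see which codecs the cluster can actually back with compressors is Hadoop's checknative utility (available on Hadoop 2.x; output varies per cluster):

$ hadoop checknative -a

It prints one line per library (hadoop, zlib, snappy, lz4, bzip2, ...), each marked true or false depending on whether a native implementation was loaded.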
2. Configure Hadoop so it loads the various compression codecs and can provide compression. Add the following snippet to Hadoop's core-site.xml:
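(The exact value list must match the codec classes present on the cluster; the lines below are a typical registration of the stock codecs, shown as an illustrative sketch.)

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>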