elixiat posted on 2016-12-5 11:17:43

Hadoop study notes 4: compressing (zipping) files with Hadoop

  Hadoop 0.20.2
  1. Using the streaming command (excerpted from the Hadoop developer documentation):

Besides plain-text output, you can also generate gzip-format output. To do so, just pass these options to the streaming job: '-jobconf mapred.output.compress=true -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec'.
  2. Using a program:
  Input files:

$ bin/hadoop fs -ls /temp/in
Found 2 items
-rw-r--r--   1 Administrator supergroup         52 2012-02-09 10:02 /temp/in/t1.txt
-rw-r--r--   1 Administrator supergroup         35 2012-02-09 10:02 /temp/in/t2.txt

  Test code:

package com.hadoop.test;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ZipFile {
    // Map-only job: every input line is passed through unchanged as the output key.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, NullWritable> {
        public void map(LongWritable key, Text value,
                OutputCollector<Text, NullWritable> output, Reporter reporter)
                throws IOException {
            // TextOutputFormat writes only the key when the value is NullWritable.
            output.collect(value, NullWritable.get());
        }
    }

    public static void main(String[] args) {
        JobConf conf = new JobConf(ZipFile.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(NullWritable.class);
        // Input and output are DIRECTORIES (not files).
        FileInputFormat.setInputPaths(conf, new Path("/temp/in"));
        FileOutputFormat.setOutputPath(conf, new Path("/temp/out-" + System.currentTimeMillis()));
        conf.setMapperClass(Map.class);
        // Compress the job output with gzip.
        FileOutputFormat.setCompressOutput(conf, true);
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
        // No reducers: mapper output is written straight to the output files.
        conf.setNumReduceTasks(0);
        try {
            JobClient.runJob(conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
  Output files:

$ bin/hadoop fs -ls /temp/out-1328857284203
Found 2 items
-rw-r--r--   3 Administrator supergroup         67 2012-02-10 15:01 /temp/out-1328857284203/part-00000.gz
-rw-r--r--   3 Administrator supergroup         53 2012-02-10 15:01 /temp/out-1328857284203/part-00001.gz

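  Using the command:
  $ bin/hadoop fs -get /temp/out-1328857284203/part-00000.gz out1.gz
  the compressed file is downloaded to the local machine, where it is still a gzip file. After decompressing it, the contents match the original file.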
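The check described above can also be done in code instead of by hand. The following is only a minimal sketch using the JDK's java.util.zip classes, on the assumption that the part files are ordinary gzip streams (which is what GzipCodec produces). The class name GzipCheck and its methods are made up for illustration and are not part of the original post or of Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, not part of Hadoop.
public class GzipCheck {
    /** Gzip-compresses a byte array in memory. */
    public static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        // Closing the GZIPOutputStream writes the gzip trailer into buf.
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(raw);
        }
        return buf.toByteArray();
    }

    /** Reads a gzip stream to the end and returns the decompressed bytes. */
    public static byte[] gunzip(InputStream compressed) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(compressed)) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            // Decompress a downloaded part file (e.g. out1.gz) to standard output.
            try (InputStream file = Files.newInputStream(Paths.get(args[0]))) {
                System.out.write(gunzip(file));
                System.out.flush();
            }
            return;
        }
        // Without arguments, run an in-memory round trip as a self-check.
        byte[] raw = "hello hadoop\n".getBytes(StandardCharsets.UTF_8);
        byte[] restored = gunzip(new ByteArrayInputStream(gzip(raw)));
        System.out.println(Arrays.equals(raw, restored));
    }
}
```

Running it as `java GzipCheck out1.gz` should print the decompressed lines, which can then be compared with the original input file.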