Hadoop's OutputFormat and InputFormat

remington_young posted on 2016-12-7 09:56:45

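A Hadoop job has to specify an InputFormat and an OutputFormat for its data input and output. These two classes describe how data is read and how it is written, including its format.
InputFormat: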

public abstract List<InputSplit> getSplits(JobContext context)
    throws IOException, InterruptedException;

public abstract RecordReader<K, V> createRecordReader(InputSplit split,
                                                      TaskAttemptContext context)
    throws IOException, InterruptedException;
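createRecordReader: creates the RecordReader that performs the actual read, turning one split into key/value records.
getSplits: computes the logical splits of the input, that is, the chunks of data to be read, one per task.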
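To make the division of labour between the two methods concrete, here is a minimal sketch in plain Java. It does not use the Hadoop API: `InputFormatSketch`, `Range`, and `readRecords` are hypothetical stand-ins, and the input is assumed to be an in-memory list of lines rather than a file on HDFS.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the InputFormat contract. None of these types are Hadoop classes.
final class InputFormatSketch {

    // Hypothetical split: a contiguous range of record indices.
    static final class Range {
        final int start;
        final int length;

        Range(int start, int length) {
            this.start = start;
            this.length = length;
        }
    }

    // Plays the role of getSplits: partition `total` records into ranges of at most `perSplit`.
    static List<Range> getSplits(int total, int perSplit) {
        if (perSplit <= 0) {
            throw new IllegalArgumentException("perSplit must be positive");
        }
        List<Range> splits = new ArrayList<>();
        // A long counter avoids int overflow when start + perSplit exceeds Integer.MAX_VALUE.
        for (long start = 0; start < total; start += perSplit) {
            int length = (int) Math.min(perSplit, total - start);
            splits.add(new Range((int) start, length));
        }
        return splits;
    }

    // Plays the role of createRecordReader: produce "index<TAB>value" records for one split only.
    static List<String> readRecords(List<String> input, Range split) {
        List<String> records = new ArrayList<>();
        for (int i = split.start; i < split.start + split.length; i++) {
            records.add(i + "\t" + input.get(i));
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> input = List.of("a", "b", "c", "d", "e");
        for (Range split : getSplits(input.size(), 2)) {
            System.out.println(readRecords(input, split));
        }
    }
}
```

The point of the separation is that splitting only describes the work (cheap, done once up front), while reading happens later and independently for each split.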
Next, look at the InputSplit class:

public abstract long getLength() throws IOException, InterruptedException;

public abstract String[] getLocations() throws IOException, InterruptedException;
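getLength returns the size of the split in bytes, and getLocations returns the names of the hosts on which the split's data is stored, so the split carries its length and its location.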
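One way to see why a split exposes its locations: a scheduler can prefer to run a task on a host that already holds the data. The sketch below is only an illustration in plain Java under that assumption. `LocalitySketch`, `Split`, and `pickFor` are hypothetical names, and this is not how Hadoop's scheduler is actually implemented.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Illustration of how split metadata could drive task placement. Not Hadoop code.
final class LocalitySketch {

    // Hypothetical stand-in for InputSplit, exposing the same two pieces of metadata.
    static final class Split {
        private final long length;
        private final String[] locations;

        Split(long length, String... locations) {
            this.length = length;
            this.locations = locations;
        }

        long getLength() {
            return length;
        }

        String[] getLocations() {
            return locations;
        }
    }

    // Prefer a split whose data lives on `host`; otherwise fall back to the first pending split.
    static Optional<Split> pickFor(String host, List<Split> pending) {
        for (Split split : pending) {
            if (Arrays.asList(split.getLocations()).contains(host)) {
                return Optional.of(split);
            }
        }
        return pending.stream().findFirst();
    }

    public static void main(String[] args) {
        List<Split> pending = List.of(
                new Split(128L, "node1", "node2"),
                new Split(64L, "node3"));
        Split chosen = pickFor("node3", pending).orElseThrow();
        System.out.println(chosen.getLength() + " bytes on " + Arrays.toString(chosen.getLocations()));
    }
}
```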
OutputFormat:

public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
    throws IOException, InterruptedException;

public abstract void checkOutputSpecs(JobContext context)
    throws IOException, InterruptedException;

public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context)
    throws IOException, InterruptedException;
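getRecordWriter: returns the RecordWriter that defines how each record is written.
checkOutputSpecs: validates the output specification before the job runs, for example that the output location is usable and does not already exist.
getOutputCommitter: returns the OutputCommitter, which is responsible for committing the written output (or aborting it) for tasks and for the job.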
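The three responsibilities can be sketched with plain `java.nio.file` calls. This is a simplified model, not the Hadoop API: `OutputFormatSketch`, `openWriter`, `write`, and `commit` are hypothetical names, and the "write to a temporary path, then move it into place on commit" pattern is an assumption borrowed from how file-based output is commonly handled.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Simplified model of the OutputFormat contract on a local file system. Not Hadoop code.
final class OutputFormatSketch {

    // Role of checkOutputSpecs: fail before any work starts if the output directory exists.
    static void checkOutputSpecs(Path outputDir) throws IOException {
        if (Files.exists(outputDir)) {
            throw new FileAlreadyExistsException(outputDir.toString());
        }
    }

    // Role of getRecordWriter: open a writer on a temporary file.
    static BufferedWriter openWriter(Path tempFile) throws IOException {
        Files.createDirectories(tempFile.getParent());
        return Files.newBufferedWriter(tempFile, StandardCharsets.UTF_8);
    }

    // The record format itself: one "key<TAB>value" line per record.
    static void write(BufferedWriter out, String key, String value) throws IOException {
        out.write(key + "\t" + value);
        out.newLine();
    }

    // Role of the OutputCommitter: promote finished output to its final location.
    static void commit(Path tempFile, Path finalFile) throws IOException {
        Files.move(tempFile, finalFile, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path outputDir = Files.createTempDirectory("of-sketch").resolve("out");
        checkOutputSpecs(outputDir);

        Path temp = outputDir.resolve("_temporary").resolve("part-0");
        try (BufferedWriter writer = openWriter(temp)) {
            write(writer, "k1", "v1");
        }
        commit(temp, outputDir.resolve("part-0"));
        System.out.println(Files.readAllLines(outputDir.resolve("part-0")));
    }
}
```

Keeping the commit step separate from the writer means a failed or duplicated task attempt can be discarded without leaving partial records in the final output.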
