Why didn’t I use Serialization when we first started Hadoop? Because it looked
big and hairy and I thought we needed something lean and mean, where we had
precise control over exactly how objects are written and read, since that is central
to Hadoop. With Serialization you can get some control, but you have to fight for
it.
The logic for not using RMI was similar. Effective, high-performance inter-process
communications are critical to Hadoop. I felt like we’d need to precisely control
how things like connections, timeouts and buffers are handled, and RMI gives you
little control over those.
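The control Cutting describes is what Hadoop's own serialization contract, org.apache.hadoop.io.Writable, provides: the implementer writes and reads every byte explicitly through two methods. The interface is reproduced below for reference; the Attribute class that follows implements it by delegating to the built-in IntWritable and Text wrappers.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Mirror of the org.apache.hadoop.io.Writable contract, shown here for reference.
public interface Writable {
    // Serialize this object's fields to the binary stream.
    void write(DataOutput out) throws IOException;
    // Read this object's fields from the binary stream, overwriting any existing state.
    void readFields(DataInput in) throws IOException;
}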
Class Attribute:
package siat.miner.etl.instance;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
public class Attribute implements Writable {

    public static final int ATTRIBUTE_TYPE_STRING = 1;  // string type
    public static final int ATTRIBUTE_TYPE_NOMINAL = 2; // nominal type
    public static final int ATTRIBUTE_TYPE_REAL = 3;    // real type

    private IntWritable type;
    private Text name;

    public Attribute() {
        // Writables need a no-argument constructor so the framework can
        // instantiate them reflectively before calling readFields().
        this.type = new IntWritable(0);
        this.name = new Text("");
    }

    public Attribute(int type, String name) {
        this.type = new IntWritable(type);
        this.name = new Text(name);
    }

    public IntWritable getType() {
        return type;
    }

    public void setType(int type) {
        this.type = new IntWritable(type);
    }

    public Text getName() {
        return name;
    }

    public void setName(String name) {
        this.name = new Text(name);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Read the fields back in exactly the order they were written.
        type.readFields(in);
        name.readFields(in);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize by delegating to the wrapped Writable fields.
        type.write(out);
        name.write(out);
    }
}
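Attribute implements only Writable, which is enough for it to travel through Hadoop as a value. If it were ever used as a MapReduce key it would also need a sort order, i.e. it would have to implement WritableComparable. The sketch below shows what that variant might look like, assuming a sort by name first and then by type; the class name AttributeKey and the ordering are illustrative, not part of the original code.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical key variant of Attribute: only needed if it is used as a MapReduce key.
public class AttributeKey implements WritableComparable<AttributeKey> {

    private IntWritable type = new IntWritable(0);
    private Text name = new Text("");

    @Override
    public void write(DataOutput out) throws IOException {
        type.write(out);
        name.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Same field order as write().
        type.readFields(in);
        name.readFields(in);
    }

    @Override
    public int compareTo(AttributeKey other) {
        // Sort by name, then by type (an assumed ordering for illustration).
        int cmp = name.compareTo(other.name);
        return cmp != 0 ? cmp : type.compareTo(other.type);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof AttributeKey)) {
            return false;
        }
        AttributeKey other = (AttributeKey) o;
        return type.equals(other.type) && name.equals(other.name);
    }

    @Override
    public int hashCode() {
        // HashPartitioner uses hashCode() to assign keys to reducers,
        // so it should be stable across JVMs.
        return name.hashCode() * 163 + type.get();
    }
}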
Class TestA:
package siat.miner.etl.test;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;
import siat.miner.etl.instance.Attribute;
public class TestA implements Writable {

    private Attribute a;
    private IntWritable b;

    /**
     * Round-trips a TestA instance through an in-memory byte stream.
     *
     * @param args unused
     * @throws IOException if serialization fails
     */
    public static void main(String[] args) throws IOException {
        Attribute a = new Attribute(Attribute.ATTRIBUTE_TYPE_NOMINAL, "name");
        TestA ta = new TestA(a, new IntWritable(1));

        // Serialize ta into an in-memory byte array.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream oos = new DataOutputStream(bos);
        ta.write(oos);

        // Deserialize the bytes back into a fresh TestA instance.
        TestA tb = new TestA();
        tb.readFields(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));

        // Print the recovered fields to confirm the round trip worked.
        System.out.println(tb.a.getName() + " " + tb.a.getType() + " " + tb.b);
    }

    public TestA(Attribute a, IntWritable b) {
        this.a = a;
        this.b = b;
    }

    public TestA() {
        // No-argument constructor so readFields() can populate an empty instance.
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Recreate the nested Writables, then read them in write order.
        a = new Attribute();
        a.readFields(in);
        b = new IntWritable();
        b.readFields(in);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Nested Writables serialize themselves; write them in a fixed order.
        a.write(out);
        b.write(out);
    }
}
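The serialize-then-deserialize dance in TestA.main can be factored into a small reusable helper so other tests do not repeat the stream plumbing. This is only a sketch of the same pattern; the class and method names are made up for illustration.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical helper that factors out the round trip performed in TestA.main.
public class WritableRoundTrip {

    // Serialize any Writable into a byte array.
    public static byte[] serialize(Writable w) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        w.write(out);
        out.flush();
        return bos.toByteArray();
    }

    // Populate an existing Writable instance from a byte array.
    public static void deserialize(Writable w, byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        w.readFields(in);
    }
}

With it, the body of main reduces to byte[] bytes = WritableRoundTrip.serialize(ta); followed by WritableRoundTrip.deserialize(tb, bytes);.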