Using protobuf to Serialize a Hadoop RPC Return Value
protobuf (Protocol Buffers) is a framework developed by Google for serializing structured data, used for communication protocols, data storage, and the like; it supports C++, Java, Python, and other languages. Hadoop 0.23 and later use protobuf for RPC serialization and deserialization. As an experiment, this post uses protobuf to serialize/deserialize the return value of one RPC on Hadoop 0.19. To use protobuf you first need to download and install it: after downloading and unpacking the tarball, run the following steps in order:
$ ./configure
$ make
$ make check
$ make install # this step requires root privileges (sudo)
Since Hadoop is written in Java, you also need to build the protobuf Java jar from the java directory of the protobuf source tree:
$ mvn test # Maven must be installed locally
$ mvn install
$ mvn package # generates a jar under the target directory,
# named protobuf-java-2.4.1.jar. Copy this jar into Hadoop's lib
# directory so it is available at compile time and at run time.
With the local environment in place, the next step is to write the proto file. For simplicity, the ClusterStatus class was chosen as the target; the corresponding proto file looks like this:
package mapred;

option java_package = "org.apache.hadoop.mapred";
option java_outer_classname = "ClusterStatusProtos";

message ClusterStatus {
  optional int32 task_trackers = 1;
  optional int32 map_tasks = 2;
  optional int32 reduce_tasks = 3;
  optional int32 max_map_tasks = 4;
  optional int32 max_reduce_tasks = 5;

  enum JTState {
    INITIALIZING = 0;
    RUNNING = 1;
  }
  optional JTState state = 6;
}
Once the proto file is written, compile it with the protoc tool to generate the corresponding Java file:
protoc -I=$SRC_DIR --java_out=$DST_DIR $SRC_DIR/my.proto
The key flag is --java_out, which specifies the directory the generated Java files are written to. Compiling the proto file above produces org/apache/hadoop/mapred/ClusterStatusProtos.java.
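To get a feel for the generated API before wiring it into Hadoop, here is a minimal sketch that builds a message, round-trips it through a byte stream, and prints its encoded size. It assumes the generated ClusterStatusProtos class and protobuf-java-2.4.1.jar are on the classpath; the demo class name is mine:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.hadoop.mapred.ClusterStatusProtos;
import org.apache.hadoop.mapred.ClusterStatusProtos.ClusterStatus.JTState;

public class ClusterStatusProtoDemo {
  public static void main(String[] args) throws IOException {
    // All fields are set through the generated builder.
    ClusterStatusProtos.ClusterStatus status =
        ClusterStatusProtos.ClusterStatus.newBuilder()
            .setTaskTrackers(10)
            .setMapTasks(42)
            .setReduceTasks(7)
            .setMaxMapTasks(100)
            .setMaxReduceTasks(50)
            .setState(JTState.RUNNING)
            .build();

    // Length-delimited write/read: the same pair of calls the patched
    // ClusterStatus.write()/readFields() shown later relies on.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    status.writeDelimitedTo(out);

    ClusterStatusProtos.ClusterStatus parsed =
        ClusterStatusProtos.ClusterStatus.parseDelimitedFrom(
            new ByteArrayInputStream(out.toByteArray()));

    // Varint encoding keeps small values compact, typically fewer bytes
    // than the five fixed 4-byte ints plus enum of the Writable format.
    System.out.println(parsed.getSerializedSize() + " bytes, state = "
        + parsed.getState());
  }
}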
The next step is to modify ClusterStatus.java. To avoid changing how ObjectWritable.java serializes Hadoop objects, the ClusterStatus class now wraps a ClusterStatusProtos.ClusterStatus.Builder object named status: all of its member variables are removed and every accessor operates on status instead. The corresponding diff:
Index: src/mapred/org/apache/hadoop/mapred/ClusterStatus.java
===================================================================
--- src/mapred/org/apache/hadoop/mapred/ClusterStatus.java (revision 106658)
+++ src/mapred/org/apache/hadoop/mapred/ClusterStatus.java (working copy)
@@ -21,9 +21,12 @@
 import java.io.DataInput;
 import java.io.DataOutput;
 import java.io.IOException;
+import java.io.InputStream;
 
 import org.apache.hadoop.io.Writable;
-import org.apache.hadoop.io.WritableUtils;
+import org.apache.hadoop.io.DataOutputOutputStream;
+import org.apache.hadoop.mapred.ClusterStatusProtos;
+import org.apache.hadoop.mapred.ClusterStatusProtos.ClusterStatus.JTState;
 
 /**
  * Status information on the current state of the Map-Reduce cluster.
@@ -50,14 +53,10 @@
  * @see JobClient
  */
 public class ClusterStatus implements Writable {
+
+  ClusterStatusProtos.ClusterStatus.Builder status =
+      ClusterStatusProtos.ClusterStatus.newBuilder();
 
-  private int task_trackers;
-  private int map_tasks;
-  private int reduce_tasks;
-  private int max_map_tasks;
-  private int max_reduce_tasks;
-  private JobTracker.State state;
-
   ClusterStatus() {}
 
   /**
@@ -70,13 +69,13 @@
    * @param state the {@link JobTracker.State} of the <code>JobTracker</code>
    */
   ClusterStatus(int trackers, int maps, int reduces, int maxMaps,
-                int maxReduces, JobTracker.State state) {
-    task_trackers = trackers;
-    map_tasks = maps;
-    reduce_tasks = reduces;
-    max_map_tasks = maxMaps;
-    max_reduce_tasks = maxReduces;
-    this.state = state;
+                int maxReduces, JTState state) {
+    status.setTaskTrackers(trackers)
+          .setMapTasks(maps)
+          .setReduceTasks(reduces)
+          .setMaxMapTasks(maxMaps)
+          .setMaxReduceTasks(maxReduces)
+          .setState(state);
   }
 
@@ -86,7 +85,7 @@
    * @return the number of task trackers in the cluster.
    */
   public int getTaskTrackers() {
-    return task_trackers;
+    return status.getTaskTrackers();
   }
 
   /**
@@ -95,7 +94,7 @@
    * @return the number of currently running map tasks in the cluster.
    */
   public int getMapTasks() {
-    return map_tasks;
+    return status.getMapTasks();
   }
 
   /**
@@ -104,7 +103,7 @@
    * @return the number of currently running reduce tasks in the cluster.
    */
   public int getReduceTasks() {
-    return reduce_tasks;
+    return status.getReduceTasks();
   }
 
   /**
@@ -113,7 +112,7 @@
    * @return the maximum capacity for running map tasks in the cluster.
    */
   public int getMaxMapTasks() {
-    return max_map_tasks;
+    return status.getMaxMapTasks();
   }
 
   /**
@@ -122,7 +121,7 @@
    * @return the maximum capacity for running reduce tasks in the cluster.
    */
   public int getMaxReduceTasks() {
-    return max_reduce_tasks;
+    return status.getMaxReduceTasks();
   }
 
   /**
@@ -131,26 +130,17 @@
    *
    * @return the current state of the <code>JobTracker</code>.
    */
-  public JobTracker.State getJobTrackerState() {
-    return state;
+  public JTState getJobTrackerState() {
+    return status.getState();
   }
 
   public void write(DataOutput out) throws IOException {
-    out.writeInt(task_trackers);
-    out.writeInt(map_tasks);
-    out.writeInt(reduce_tasks);
-    out.writeInt(max_map_tasks);
-    out.writeInt(max_reduce_tasks);
-    WritableUtils.writeEnum(out, state);
+    status.build().writeDelimitedTo(
+        DataOutputOutputStream.constructOutputStream(out));
   }
 
   public void readFields(DataInput in) throws IOException {
-    task_trackers = in.readInt();
-    map_tasks = in.readInt();
-    reduce_tasks = in.readInt();
-    max_map_tasks = in.readInt();
-    max_reduce_tasks = in.readInt();
-    state = WritableUtils.readEnum(in, JobTracker.State.class);
+    status = ClusterStatusProtos.ClusterStatus
+        .parseDelimitedFrom((InputStream) in).toBuilder();
   }
 }
Note that protobuf only accepts InputStream/OutputStream objects for input and output, while Hadoop's serialization interfaces hand you DataOutput/DataInput, so a conversion is needed. For DataOutput to OutputStream, the wrap method of DataOutputOutputStream.java from HADOOP-7379 is used; for DataInput to InputStream, a plain cast suffices, because in Hadoop's RPC path the DataInput is in practice a DataInputStream, which is an InputStream (a DataInput that isn't one would throw ClassCastException). After adjusting JobTracker.java, LocalJobRunner.java, and a few other files, the code compiles and runs. Unfortunately, the performance benefit of the change was not measured.
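For reference, here is a minimal sketch of what such a DataOutput-to-OutputStream adapter can look like; the actual DataOutputOutputStream attached to HADOOP-7379 may differ in detail:

import java.io.DataOutput;
import java.io.IOException;
import java.io.OutputStream;

// Adapter that lets protobuf's writeDelimitedTo(OutputStream) write
// into a Hadoop DataOutput.
public class DataOutputOutputStream extends OutputStream {
  private final DataOutput out;

  private DataOutputOutputStream(DataOutput out) {
    this.out = out;
  }

  // If the DataOutput is already an OutputStream (e.g. a
  // DataOutputStream), skip the wrapper and return it directly.
  public static OutputStream constructOutputStream(DataOutput out) {
    if (out instanceof OutputStream) {
      return (OutputStream) out;
    }
    return new DataOutputOutputStream(out);
  }

  @Override
  public void write(int b) throws IOException {
    out.writeByte(b);
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    out.write(b, off, len);
  }
}

A symmetric DataInput-to-InputStream adapter would make readFields() robust against a DataInput that is not an InputStream, at the cost of a little more code.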
Problems you may run into:
Compile-time or runtime errors complaining that a class or method cannot be found.
Solutions:
1. Make sure protobuf-java-2.4.1.jar has been placed in Hadoop's lib directory;
2. Do a full rebuild: ant clean; ant.
References:
1. http://code.google.com/apis/protocolbuffers/docs/javatutorial.html
2. https://issues.apache.org/jira/browse/HADOOP-7379