
[Experience Sharing] Running Hadoop Map/Reduce via Remote Invocation

  In a Web project, once a user dispatches a task, it fits the B/S architecture much better to have the backend server remotely call the server hosting the JobTracker and run the Map/Reduce job there.
  Since I could not find any material on this online, I implemented it myself and am sharing it here.
  Note: this is based on Hadoop 1.1.2.
  If you repost, please credit the source: http://sgq0085.iyunv.com/admin/blogs/1879442
  A typical WordCount looks like this:

package com.gqshao.hadoop.remote;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;

public class WordCount extends Configured implements Tool {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public int run(String[] args) throws Exception {
        this.getClass().getResource("/hadoop/");
        Configuration conf = getConf();
        Job job = new Job(conf);
        conf.set("mapred.job.tracker", "192.168.0.128:9001");
        conf.set("fs.default.name", "hdfs://192.168.0.128:9000");
        conf.set("hadoop.job.ugi", "hadoop");
        conf.set("hadoop.tmp.dir", "/user/gqshao/temp/");

        job.setJarByClass(WordCount.class);
        job.setJobName("wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        String hdfs = "hdfs://192.168.0.128:9000";
        args = new String[] { hdfs + "/user/gqshao/input/big",
                hdfs + "/user/gqshao/output/WordCount/" + new Date().getTime() };
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int ret = ToolRunner.run(new WordCount(), args);
        System.exit(ret);
    }
}

 Here both the input and output directories point to HDFS, but when the program is actually run (typically with -Xms128m -Xmx512m -XX:MaxPermSize=128M) the output contains the following line:
INFO: Running job: job_local_0001
  which proves that the Map/Reduce program ran locally. In other words, with this approach the only option is to build the jar ahead of time, copy it onto a Cluster server, and run it there from the jar.
  So how can a Map/Reduce program be run remotely? Investigation turned up two requirements.
  1. Hadoop's configuration files must be loaded into the current process's ClassLoader (or placed under the /bin directory).
  Tracing job.waitForCompletion(true); → submit(); → info = jobClient.submitJobInternal(conf); → status = jobSubmitClient.submitJob(jobId, submitJobDir.toString(), jobCopy.getCredentials());
  shows that private JobSubmissionProtocol jobSubmitClient; has two implementations.
  In the init() method of org.apache.hadoop.mapred.JobClient you can see that if mapred.job.tracker is set in the conf, the job is submitted to the Hadoop Cluster; otherwise it runs locally:

  public void init(JobConf conf) throws IOException {
      String tracker = conf.get("mapred.job.tracker", "local");
      tasklogtimeout = conf.getInt(TASKLOG_PULL_TIMEOUT_KEY, DEFAULT_TASKLOG_TIMEOUT);
      this.ugi = UserGroupInformation.getCurrentUser();
      if ("local".equals(tracker)) {
          conf.setNumMapTasks(1);
          this.jobSubmitClient = new LocalJobRunner(conf);
      } else {
          this.rpcJobSubmitClient = createRPCProxy(JobTracker.getAddress(conf), conf);
          this.jobSubmitClient = createProxy(this.rpcJobSubmitClient, conf);
      }
  }
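  In other words, whether submission goes to the Cluster or to the LocalJobRunner depends entirely on what the submitting thread can see at the moment the JobConf is built. A minimal check is sketched below (the class name is hypothetical, and it assumes the configuration directory described next has already been made visible to the thread's context ClassLoader):

import org.apache.hadoop.mapred.JobConf;

public class TrackerCheck {
    public static void main(String[] args) {
        // JobConf picks up core-site.xml / mapred-site.xml through the context ClassLoader,
        // so this only reports the cluster address once that directory is visible.
        JobConf conf = new JobConf();
        String tracker = conf.get("mapred.job.tracker", "local");
        System.out.println("local".equals(tracker)
                ? "submitJob would use LocalJobRunner"
                : "submitJob would go to the JobTracker at " + tracker);
    }
}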
  So the configuration files under a given directory need to be loaded at runtime. This can be done as follows:

    /**
     * Load the configuration files by pushing the given resource directory onto the thread's context ClassLoader.
     */
    public static void setConf(Class<?> clazz, Thread thread, String path) {
        URL url = clazz.getResource(path);
        try {
            File confDir = new File(url.toURI());
            if (!confDir.exists()) {
                return;
            }
            URL key = confDir.getCanonicalFile().toURI().toURL();
            ClassLoader classLoader = thread.getContextClassLoader();
            classLoader = new URLClassLoader(new URL[] { key }, classLoader);
            thread.setContextClassLoader(classLoader);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
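  As a (hedged) alternative to swapping the context ClassLoader, the site files can also be handed to the Configuration explicitly via addResource; the directory layout below is only an example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class ExplicitConfLoader {
    // Sketch: load the cluster's site files from a local directory instead of relying on the ClassLoader.
    public static Configuration load(String confDir) {
        Configuration conf = new Configuration();
        conf.addResource(new Path(confDir, "core-site.xml"));
        conf.addResource(new Path(confDir, "hdfs-site.xml"));
        conf.addResource(new Path(confDir, "mapred-site.xml"));
        return conf;
    }
}

  A Configuration built this way can then be passed to ToolRunner.run(conf, new WordCount(), args) instead of depending on what the ClassLoader happens to expose.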
  2. Set the jar to be run at submission time.
  Reading further into jobClient.submitJobInternal(conf); shows that when the client submits a job to Hadoop, it packages the job into a jar and copies it to the submitJarFile path on the file system. The jar to run must therefore be specified in the conf. This can be done as follows:

    /**
     * Build the jar dynamically from the compiled classes of the given class's package.
     */
    public static File createJar(Class<?> clazz) throws Exception {
        String fqn = clazz.getName();
        String base = fqn.substring(0, fqn.lastIndexOf("."));
        base = "/" + base.replaceAll("\\.", Matcher.quoteReplacement("/"));
        URL root = clazz.getResource("");

        JarOutputStream out = null;
        final File jar = File.createTempFile("HadoopRunningJar-", ".jar", new File(System.getProperty("java.io.tmpdir")));
        System.out.println(jar.getAbsolutePath());
        // Remove the temporary jar when the JVM exits.
        Runtime.getRuntime().addShutdownHook(new Thread() {
            public void run() {
                jar.delete();
            }
        });
        try {
            File path = new File(root.toURI());
            Manifest manifest = new Manifest();
            manifest.getMainAttributes().putValue("Manifest-Version", "1.0");
            manifest.getMainAttributes().putValue("Created-By", "RemoteHadoopUtil");
            out = new JarOutputStream(new FileOutputStream(jar), manifest);
            writeBaseFile(out, path, base);
        } finally {
            if (out != null) {
                out.flush();
                out.close();
            }
        }
        return jar;
    }

    /**
     * Recursively add the .class files under a directory to the jar.
     */
    private static void writeBaseFile(JarOutputStream out, File file, String base) throws IOException {
        if (file.isDirectory()) {
            File[] fl = file.listFiles();
            if (base.length() > 0) {
                base = base + "/";
            }
            for (int i = 0; i < fl.length; i++) {
                writeBaseFile(out, fl[i], base + fl[i].getName());
            }
        } else {
            out.putNextEntry(new JarEntry(base));
            FileInputStream in = null;
            try {
                in = new FileInputStream(file);
                byte[] buffer = new byte[1024];
                int n = in.read(buffer);
                while (n != -1) {
                    out.write(buffer, 0, n);
                    n = in.read(buffer);
                }
            } finally {
                if (in != null) {
                    in.close();
                }
            }
        }
    }
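  For reference, the JobConf.setJar(...) call used in the modified WordCount below appears (in the 1.x source) to simply record the jar path under the mapred.jar property, so the same wiring can be expressed against a plain Configuration. A small sketch with a hypothetical helper name:

import java.io.File;
import org.apache.hadoop.conf.Configuration;

public class JarWiring {
    // Roughly equivalent to JobConf.setJar(...): record the generated jar under mapred.jar
    // so the client copies it to the job's submit directory on the file system.
    public static void attachJar(Configuration conf, File jarFile) {
        conf.set("mapred.jar", jarFile.getAbsolutePath());
    }
}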
  The modified WordCount looks like this:

package com.gqshao.hadoop.remote;

import java.io.File;
import java.io.IOException;
import java.util.*;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;

public class WordCount extends Configured implements Tool {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            System.out.println("line===>" + line);
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = new Job(conf);
        System.out.println(conf.get("mapred.job.tracker"));
        System.out.println(conf.get("fs.default.name"));

        /**
         * Call 2: build the jar on the fly and attach it to the job.
         */
        File jarFile = RemoteHadoopUtil.createJar(WordCount.class);
        ((JobConf) job.getConfiguration()).setJar(jarFile.toString());

        job.setJarByClass(WordCount.class);
        job.setJobName("wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        String hdfs = "hdfs://192.168.0.128:9000";
        args = new String[] { hdfs + "/user/gqshao/input/WordCount/",
                hdfs + "/user/gqshao/output/WordCount/" + new Date().getTime() };
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean success = job.waitForCompletion(true);
        System.out.println(job.isComplete());
        System.out.println("JobID: " + job.getJobID());
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        /**
         * Call 1: load the Hadoop configuration directory onto the current thread's ClassLoader.
         */
        RemoteHadoopUtil.setConf(WordCount.class, Thread.currentThread(), "/hadoop");
        int ret = ToolRunner.run(new WordCount(), args);
        System.exit(ret);
    }
}
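  Coming back to the B/S scenario from the introduction: in a Web project the submission above would normally be pushed off the request thread. A rough sketch of how a backend service might wrap the call (the class name and the single-thread executor are my own assumptions, not part of the attachment):

package com.gqshao.hadoop.remote;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;

public class RemoteJobService {
    private final ExecutorService pool = Executors.newSingleThreadExecutor();

    // Submit the WordCount job in the background so the HTTP request can return immediately.
    public void submitWordCount(final String[] args) {
        pool.submit(new Runnable() {
            public void run() {
                try {
                    // The /hadoop config directory must be visible to this worker thread's
                    // context ClassLoader before the JobConf is built.
                    RemoteHadoopUtil.setConf(WordCount.class, Thread.currentThread(), "/hadoop");
                    ToolRunner.run(new Configuration(), new WordCount(), args);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
    }
}

  Because setConf only touches the context ClassLoader of the thread it is handed, the configuration directory has to be attached inside the worker thread, as done above.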
 The attachment contains the complete code and test cases; discussion is welcome. After unpacking, run mvn eclipse:clean eclipse:eclipse in the project directory (Maven is required).
