hadoop中使用hprof工具进行性能分析

neocai 发表于 2016-12-12 09:29:09

在编写完成MapReduce程序之后，调优就成为了一个大问题。如何使用现有工具快速地分析出任务的性能？

对于本地的java应用程序，进行分析可能稍微简单，但是hadoop是一个分布式框架，MapReduce任务可能在集群中的任意机器上被调度运行。而且本地Job运行器是一个与集群差异非常大的环境，数据流的形式也不同，应该在实际集群上对比新的执行时间和旧的执行时间。

hadoop的任务中可以选择启用profile，这可以在特定的Map/Reduce任务启动执行hprof分析。prof是一个JDK自带的分析工具，虽然只有基本功能，但是同样能够提供程序CPU运行和堆使用情况等相关的有用信息。

官网的说明：

https://docs.oracle.com/javase/8/docs/technotes/samples/hprof.html

经过半天的摸索，大概使用方法介绍一下，所有的参数如下：

Option Name and ValueDescription                Default
--------------------------------                -------
heap=dump|sites|all heap profiling             all
cpu=samples|times|oldCPU usage                   off
monitor=y|n          monitor contention          n
format=a|b          text(txt) or binary output a
file=<file>          write data to file          java.hprof[.txt]
net=<host>:<port>    send data over a socket    off
depth=<size>       stack trace depth          4
interval=<ms>       sample interval in ms       10
cutoff=<value>       output cutoff point          0.0001
lineno=y|n          line number in traces?       y
thread=y|n          thread in traces?          n
doe=y|n             dump on exit?                y
msa=y|n             Solaris micro state accounting n
force=y|n          force output to <file>       y
verbose=y|n          print messages about dumps y

CPU使用分析

cpu=samples

hprof工具通过抽样分析（定时）线程可以收集所有的运行线程，记录最频繁的StackTraces，官网上给出的例子，通过javac来进行性能分析：

Command used: javac -J-agentlib:hprof=cpu=samples Hello.java
CPU SAMPLES BEGIN (total = 126) Fri Oct 22 12:12:14 2004
rank selfaccum count trace method
1 53.17% 53.17%    67 300027 java.util.zip.ZipFile.getEntry
2 17.46% 70.63%    22 300135 java.util.zip.ZipFile.getNextEntry
35.56% 76.19%    7 300111 java.lang.ClassLoader.defineClass2
43.97% 80.16%    5 300140 java.io.UnixFileSystem.list
52.38% 82.54%    3 300149 java.lang.Shutdown.halt0
61.59% 84.13%    2 300136 java.util.zip.ZipEntry.initFields
71.59% 85.71%    2 300138 java.lang.String.substring
81.59% 87.30%    2 300026 java.util.zip.ZipFile.open
90.79% 88.10%    1 300118 com.sun.tools.javac.code.Type$ErrorType.<init>
100.79% 88.89%    1 300134 java.util.zip.ZipFile.ensureOpen

count代表一个特定的StackTrace被采样的次数，而不是方法真实被调用了几次。这个选项不需要字节码注入，也不需要修改ClassLoader，不会对程序的正常执行造成多大的干扰。

cpu=times

hprof工具还可以通过字节码注入的方式分析每个方法的入口和出口，保存具体方法的调用次数和时间消耗，同样也会带来性能的消耗，比上一种方式要慢很多。

Command used: javac -J-agentlib:hprof=cpu=times Hello.java
CPU TIME (ms) BEGIN (total = 103099259) Fri Oct 22 12:21:23 2004
rank selfaccum count trace method
15.28%5.28%    1 308128 com.sun.tools.javac.Main.compile
25.16% 10.43%    1 308127 com.sun.tools.javac.main.Main.compile
35.15% 15.58%    1 308126 com.sun.tools.javac.main.Main.compile
44.07% 19.66%    1 308060 com.sun.tools.javac.main.JavaCompiler.compile
53.90% 23.56%    1 306652 com.sun.tools.javac.comp.Enter.main
63.90% 27.46%    1 306651 com.sun.tools.javac.comp.Enter.complete
73.74% 31.21%    4 305626 com.sun.tools.javac.jvm.ClassReader.listAll
83.74% 34.95%    18 305625 com.sun.tools.javac.jvm.ClassReader.list
93.24% 38.18%    1 305831 com.sun.tools.javac.comp.Enter.classEnter
103.24% 41.42%    1 305827 com.sun.tools.javac.comp.Enter.classEnter
113.24% 44.65%    1 305826 com.sun.tools.javac.tree.Tree$TopLevel.accept

这里的数据count代表了方法进入的真实次数。

Heap内存分析

heap=sites

hprof工具还可以打印出Java堆的相关对象信息。

下面的SITES纪录告诉我们最多的ZipEntry对象在一个特定的Site下，占用44%的总内存。

Command used: javac -J-agentlib:hprof=heap=sites Hello.java
SITES BEGIN (ordered by live bytes) Fri Oct 22 11:52:24 2004
percent       live       alloc'edstack class
rank selfaccum bytes objs bytesobjs trace name
1 44.73% 44.73% 1161280 145161161280 14516 302032 java.util.zip.ZipEntry
28.95% 53.67% 232256 14516 232256 14516 302033 com.sun.tools.javac.util.List
35.06% 58.74% 131504 2 131504 2 301029 com.sun.tools.javac.util.Name[]
45.05% 63.79% 131088 1 131088 1 301030 byte[]
55.05% 68.84% 131072 1 131072 1 301710 byte[]

还会打印如下的堆栈信息：

TRACE 302032:
java.util.zip.ZipEntry.<init>(ZipEntry.java:101)
java.util.zip.ZipFile$3.nextElement(ZipFile.java:435)
java.util.zip.ZipFile$3.nextElement(ZipFile.java:413)
com.sun.tools.javac.jvm.ClassReader.openArchive(ClassReader.java:1442)
TRACE 302033:
com.sun.tools.javac.util.List.<init>(List.java:43)
com.sun.tools.javac.util.List.<init>(List.java:51)
com.sun.tools.javac.util.ListBuffer.append(ListBuffer.java:98)
com.sun.tools.javac.jvm.ClassReader.openArchive(ClassReader.java:1442)

每个栈帧都包含了类型名称，方法名称和行号，用户可以通过depth设置栈的深度（默认是4），堆栈信息揭露了哪些方法触发了堆内存分配。

heap=dump

如果想得到一个大而全的当前堆活动对象，可以使用这个选项，但是这会导致一个巨大的输出文件。

分析源码并使用

在hadoop的源码中，方法中指定了profile的使用：

org.apache.hadoop.mapred.MapReduceChildJVM
public static List<String> getVMCommand(InetSocketAddress taskAttemptListenerAddr,
Task task,
ID jvmID)

if (conf.getProfileEnabled()) {
if (conf.getProfileTaskRange(task.isMapTask()
).isIncluded(task.getPartition())) {
vargs.add(
String.format(
conf.getProfileParams(),
getTaskLogFile(TaskLog.LogName.PROFILE)
)
);
if (task.isMapTask()) {
vargs.add(conf.get(MRJobConfig.TASK_MAP_PROFILE_PARAMS, ""));
}
else {
vargs.add(conf.get(MRJobConfig.TASK_REDUCE_PROFILE_PARAMS, ""));
}
}
}

hadoop任务中可以通过编码来设置profile：

conf.setProfileEnabled(true);
conf.setProfileParams(…);
conf.setProfileTaskRange...

也可以通过设置参数：

[*]mapreduce.task.profile=true，可以设置profile enabled状态，开启profile模式；
[*]mapreduce.task.profile.maps=0-2，我们不可能将所有map都进行profile，profile是非常消耗资源的(事实上能够看出使用profile的map/reduce执行速度明显变慢)，不建议在生产环境中使用profiler，那么使用这个参数就可以指定执行profile的part；
[*]mapreduce.task.profile.reduces=0-2，同上；
[*]mapreduce.task.profile.params，指定profile的选项，默认值：-agentlib:hprof=cpu=samples,heap=sites,force=n,thread=y, verbose=n,file=%s，

最后的profiler文件放在<LOG_DIR>中，与stderr, stdout在同文件夹，名称为profile.out

hadoop jar命令中加入如下的参数：

-Dmapreduce.task.profile=true -Dmapreduce.task.profile.params="-agentlib:hprof=cpu=samples,heap=dump,force=y,interval=100,thread=y,verbose=n,file=%s"

经过实验采样分析后的CPU指数:

CPU SAMPLES BEGIN (total = 62259) Wed Nov 19 14:49:57 2014
rank selfaccum count trace method
1 34.01% 34.01% 21173 300882 sun.nio.ch.EPollArrayWrapper.epollWait
26.02% 40.02% 3746 301467 com.xxx.Counter.update
35.65% 45.68% 3518 301353 java.lang.String.split
42.66% 48.34% 1656 301358 java.lang.Double.parseDouble
52.28% 50.62% 1422 301240 java.io.FileInputStream.readBytes
62.27% 52.89% 1414 301354 java.lang.Double.parseDouble
72.08% 54.97% 1292 301349 java.util.HashMap.hash
82.07% 57.04% 1291 301346 sun.nio.cs.UTF_8$Decoder.decodeArrayLoop
91.35% 58.39% 842 301373 java.util.HashMap.hash
100.98% 59.37% 611 301364 org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompress

考虑到使用cpu=times时，使用字节码增强技术可能导致计算量增大，MR任务可能出现超时的情况（超时的日志如下，表明TaskTracker可能一段时间内没有向JobTracker发送必要的信息），如果进行测试工作，可以将超时参数暂时设置得稍大一点，以避免这种情况。

AttemptID:attempt_1413206225298_36800_m_000000_1 Timed out after 1200 secs

页: [1]

运维网's Archiver

hadoop中使用hprof工具进行性能分析