A YARN OOM case

Some production jobs were failing with an OOM error.

Since it was the map tasks that were failing, the first suspicion was that there were too few map tasks, each handling too much data and therefore running out of memory. The job parameters were adjusted to increase the number of maps, but the problem persisted, so the map count was not the cause.
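For reference, increasing the number of map tasks usually means lowering the maximum input split size. A minimal sketch, assuming the job driver implements Tool so generic -D options are honored; the jar name, driver class, paths, and the 256 MB value are illustrative and not from the original post:
# Smaller max split size -> more input splits -> more map tasks.
# 268435456 bytes = 256 MB (illustrative value).
hadoop jar my-job.jar com.example.MyDriver \
    -D mapreduce.input.fileinputformat.split.maxsize=268435456 \
    /input/path /output/path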
Using the job id, I found the corresponding logs on the JobHistory server, located the failing task id and the host it ran on, and from the task log identified the problematic container id.
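If log aggregation is enabled, the same digging can be done from the command line instead of the JobHistory web UI; a sketch, using the application id that appears in the NodeManager log excerpt below:
# Fetch all aggregated container logs for the application, then look for the
# failing task attempt and the container/host it ran on.
yarn logs -applicationId application_1399203487215_21532 > app_21532.log
grep -n "container_1399203487215_21532_01" app_21532.log | head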
Since containers are allocated by the ResourceManager, the RM log shows how the container was allocated. For example:
2014-05-06 16:00:00,632 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_1399267192386_43455_01_000037 of capacity <memory:1536, vCores:1> on host xxxx:44614, which currently has 4 containers, <memory:6144, vCores:4> used and <memory:79872, vCores:42> available
This line shows the container id, the host it was assigned to, and the memory and CPU capacity of the container (here 1536 MB and 1 vcore).
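A quick way to pull that allocation record out of the RM log, assuming shell access to the ResourceManager host (the log path below is illustrative):
# Search the ResourceManager log for the allocation record of one container id.
grep "Assigned container container_1399267192386_43455_01_000037" \
    /var/log/hadoop-yarn/yarn-*-resourcemanager-*.log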
Next, look at the NodeManager log for this container:
2014-05-05 10:14:47,001 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_1399203487215_21532_01_000035 by user hdfs
2014-05-05 10:14:47,001 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hdfs IP=10.201.203.111       OPERATION=Start Container Request       TARGET=ContainerManageImpl      RESULT=SUCCESS  APPID=application_1399203487215_21532   CONTAINERID=container_1399203487215_21532_01_000035
2014-05-05 10:14:47,001 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Adding container_1399203487215_21532_01_000035 to application application_1399203487215_21532
2014-05-05 10:14:47,055 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1399203487215_21532_01_000035 transitioned from NEW to LOCALIZING
2014-05-05 10:14:47,058 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1399203487215_21532_01_000035
2014-05-05 10:14:47,060 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Writing credentials to the nmPrivate file /home/vipshop/hard_disk/10/yarn/local/nmPrivate/container_1399203487215_21532_01_000035.tokens. Credentials list:
2014-05-05 10:14:47,412 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1399203487215_21532_01_000035 transitioned from LOCALIZING to LOCALIZED
2014-05-05 10:14:47,454 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1399203487215_21532_01_000035 transitioned from LOCALIZED to RUNNING
2014-05-05 10:14:47,493 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [bash, /home/vipshop/hard_disk/6/yarn/local/usercache/hdfs/appcache/application_1399203487215_21532/container_1399203487215_21532_01_000035/default_container_executor.sh]
2014-05-05 10:14:48,827 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Copying from /home/vipshop/hard_disk/10/yarn/local/nmPrivate/container_1399203487215_21532_01_000035.tokens to /home/vipshop/hard_disk/11/yarn/local/usercache/hdfs/appcache/application_1399203487215_21532/container_1399203487215_21532_01_000035.tokens
2014-05-05 10:14:49,169 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1399203487215_21532_01_000035
2014-05-05 10:14:49,305 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 21209 for container-id container_1399203487215_21532_01_000035: 66.7 MB of 1.5 GB physical memory used; 2.1 GB of 3.1 GB virtual memory used
2014-05-05 10:14:53,063 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 21209 for container-id container_1399203487215_21532_01_000035: 984.1 MB of 1.5 GB physical memory used; 2.1 GB of 3.1 GB virtual memory used
2014-05-05 10:14:56,379 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 21209 for container-id container_1399203487215_21532_01_000035: 984.5 MB of 1.5 GB physical memory used; 2.1 GB of 3.1 GB virtual memory used
.......
2014-05-05 10:19:26,823 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 21209 for container-id container_1399203487215_21532_01_000035: 1.1 GB of 1.5 GB physical memory used; 2.1 GB of 3.1 GB virtual memory used
2014-05-05 10:19:27,459 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hdfs IP=10.201.203.111       OPERATION=Stop Container Request        TARGET=ContainerManageImpl      RESULT=SUCCESS  APPID=application_1399203487215_21532   CONTAINERID=container_1399203487215_21532_01_000035
2014-05-05 10:19:27,459 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1399203487215_21532_01_000035 transitioned from RUNNING to KILLING
2014-05-05 10:19:27,459 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1399203487215_21532_01_000035
2014-05-05 10:19:27,800 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1399203487215_21532_01_000035 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
So although the container was allocated 1.5 GB, the task was killed while it was using only 1.1 GB of physical memory ("1.1 GB of 1.5 GB physical memory used"), with more than 400 MB still free. The container was not killed by the NodeManager for exceeding its memory limit, so the problem did not look like the overall task memory being too small; it looked more like a PermGen issue (the default is 64 MB).
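One way to check the PermGen hypothesis on a task that is still running is to sample the task JVM with jstat on the NodeManager host. A sketch, assuming JDK 6/7; note that the ProcessTree id in the NM log (21209) is the container launch script, and the task JVM is its child process:
# Find the java child of the container launch script, then sample GC statistics
# every 5 seconds; on JDK 6/7 the PC/PU columns show PermGen capacity/usage in KB.
JVM_PID=$(pgrep -P 21209 java)
jstat -gc "$JVM_PID" 5000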
The mapred settings were updated as follows (a 1280 MB heap plus a 128 MB PermGen still fits inside the 1536 MB container):
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1280m -Xms1280m -Xmn256m -XX:SurvivorRatio=6 -XX:MaxPermSize=128m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx1280m -Xms1280m -Xmn256m -XX:SurvivorRatio=6 -XX:MaxPermSize=128m</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1280m -Xms1280m -Xmn256m -XX:SurvivorRatio=6 -XX:MaxPermSize=128m</value>
  <final>true</final>
</property>
The job was re-run and completed successfully.
In general, for Java OOM problems the best approach is to enable GC logging and capture a heap dump, then analyze the dump with a tool such as MAT.
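For example, the task JVM options can be extended so GC activity goes to the task log and a heap dump is written when an OOM occurs; the flags are standard HotSpot options, while the jar name, driver class, paths, and dump location are illustrative:
# Enable GC logging and heap-dump-on-OOM for the map JVMs of a single job run;
# the resulting .hprof file can be opened in MAT for analysis.
hadoop jar my-job.jar com.example.MyDriver \
    -D mapreduce.map.java.opts="-Xmx1280m -XX:MaxPermSize=128m -verbose:gc -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp" \
    /input/path /output/path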


This article was reposted from 菜菜光's 51CTO blog. Original link: http://blog.51cto.com/caiguangguang/1407424; please contact the original author before republishing.