HBase Thrift2 CPU过高问题分析

目录
目录1
1.现象描述1
2.问题定位2
3.解决方案5
4.相关代码5
1.现象描述
外界连接9090端口均超时,但telnet端口总是成功。使用top命令观察,发现单个线程的CPU最高达99.99%,但并不总是99.9%,而是在波动。当迁走往该机器的流量后,能够访问成功,但仍然有超时,读超时比写超时多:
#./hbase_stress--hbase=110.13.136.207:9090--test=2--timeout=10 [2016-11-2710:15:21/771][139756154767104/31562][ERROR][hbase_stress.cpp:302]TransportException(thrift://110.13.136.207:9090):EAGAIN(timedout) [2016-11-2710:15:31/775][139756154767104/31562][ERROR][hbase_stress.cpp:302]TransportException(thrift://110.13.136.207:9090):EAGAIN(timedout) PIDUSERPRNIVIRTRESSHRS%CPU%MEMTIME+COMMAND 20727zhangsan20010.843g9.263g26344R99.926.41448:00java 20729zhangsan20010.843g9.263g26344R99.926.41448:00java 20730zhangsan20010.843g9.263g26344R99.926.41449:10java 20728zhangsan20010.843g9.263g26344R99.826.41448:00java 20693zhangsan20010.843g9.263g26344S0.026.40:00.00java 20727zhangsan20010.843g9.263g26344R75.526.41448:06java 20728zhangsan20010.843g9.263g26344R75.226.41448:06java 20729zhangsan20010.843g9.263g26344R75.226.41448:06java 20730zhangsan20010.843g9.263g26344R75.226.41449:15java 20716zhangsan20010.843g9.263g26344S24.926.493:48.75java |
2.问题定位
使用ps命令找出CPU最多的线程,和top显示的一致:
$ps-mp20693-oTHREAD,tid,time|sort-rn zhangsan18.819----207301-00:11:23 zhangsan18.719----207291-00:10:13 zhangsan18.719----207281-00:10:13 zhangsan18.719----207271-00:10:13 zhangsan16.119-futex_--2073120:44:51 zhangsan5.219-futex_--2073206:46:39 |
然后借助jstack,发现为GC进程:
"Gangworker#0(ParallelCMSThreads)"os_prio=0tid=0x00007fb7200d4000nid=0x50f7runnable "Gangworker#1(ParallelCMSThreads)"os_prio=0tid=0x00007fb7200d5800nid=0x50f8runnable "Gangworker#2(ParallelCMSThreads)"os_prio=0tid=0x00007fb7200d7800nid=0x50f9runnable "Gangworker#3(ParallelCMSThreads)"os_prio=0tid=0x00007fb7200d9000nid=0x50farunnable |
使用jstat工具查看GC,情况很不乐观,问题就是有GC引起的:
$jstat-gcutil206931000100 S0S1EOMCCSYGCYGCTFGCFGCTGCT 0.0099.67100.00100.0098.0894.4142199369.1322708434869.60135238.733 0.0099.67100.00100.0098.0894.4142199369.1322708434870.44835239.580 0.0099.67100.00100.0098.0894.4142199369.1322708434870.44835239.580 0.0099.67100.00100.0098.0894.4142199369.1322708434870.44835239.580 $jstat-gccapacity20693 NGCMNNGCMXNGCS0CS1CECOGCMNOGCMXOGCOCMCMNMCMXMCCCSMNCCSMXCCSCYGCFGC 191808.01107520.01107520.0110720.0110720.0886080.0383680.08094144.08094144.08094144.00.01077248.031584.00.01048576.03424.04219927156 $jstat-gcold20693 MCMUCCSCCCSUOCOUYGCFGCFGCTGCT 31584.030978.73424.03232.78094144.08094144.0421992717434964.34735333.479 $jstat-gcoldcapacity20693 OGCMNOGCMXOGCOCYGCFGCFGCTGCT 383680.08094144.08094144.08094144.0421992719234982.62335351.755 $jstat-gcnewcapacity20693 NGCMNNGCMXNGCS0CMXS0CS1CMXS1CECMXECYGCFGC 191808.01107520.01107520.0110720.0110720.0110720.0110720.0886080.0886080.04219927202 $jstat-gc20693 S0CS1CS0US1UECEUOCOUMCMUCCSCCCSUYGCYGCTFGCFGCTGCT 110720.0110720.00.0110395.9886080.0886080.08094144.08094144.031584.030978.73424.03232.742199369.1322720634996.53835365.671 $jstat-gcnew20693 S0CS1CS0US1UTTMTTDSSECEUYGCYGCT 110720.0110720.00.0110396.96655360.0886080.0886080.042199369.132 |
使用lsof显示该进程的连接数不多,完全在安全范围内,问题应当是有对象不能被回收。使用jmap查看内存详情,先看堆的使用情况:
$jmap-heap20693 AttachingtoprocessID20693,pleasewait... Debuggerattachedsuccessfully. Servercompilerdetected. JVMversionis25.77-b03 usingparallelthreadsinthenewgeneration. usingthread-localobjectallocation. ConcurrentMark-SweepGC HeapConfiguration: MinHeapFreeRatio=40 MaxHeapFreeRatio=70 MaxHeapSize=9422503936(8986.0MB) NewSize=196411392(187.3125MB) MaxNewSize=1134100480(1081.5625MB) OldSize=392888320(374.6875MB) NewRatio=2 SurvivorRatio=8 MetaspaceSize=21807104(20.796875MB) CompressedClassSpaceSize=1073741824(1024.0MB) MaxMetaspaceSize=17592186044415MB G1HeapRegionSize=0(0.0MB) HeapUsage: NewGeneration(Eden+1SurvivorSpace): capacity=1020723200(973.4375MB) used=1020398064(973.1274261474609MB) free=325136(0.3100738525390625MB) 99.96814650632022%used EdenSpace: capacity=907345920(865.3125MB) used=907345920(865.3125MB) free=0(0.0MB) 100.0%used FromSpace: capacity=113377280(108.125MB) used=113052144(107.81492614746094MB) free=325136(0.3100738525390625MB) 99.71322649476156%used ToSpace: capacity=113377280(108.125MB) used=0(0.0MB) free=113377280(108.125MB) 0.0%used concurrentmark-sweepgeneration: capacity=8288403456(7904.4375MB) used=8288403424(7904.437469482422MB) free=32(3.0517578125E-5MB) 99.9999996139184%used 10216internedStringsoccupying934640bytes. |
进一步查看对象的情况:
$jmap-histo20693 num#instances#bytesclassname ---------------------------------------------- 1:728352122518411456[B 2:498271471993085880java.util.TreeMap$Entry 3:12855993617087664java.util.TreeMap 4:4285217445662568org.apache.hadoop.hbase.client.ClientScanner 5:4285222377099536org.apache.hadoop.hbase.client.Scan 6:4284875377069000org.apache.hadoop.hbase.client.ScannerCallable 7:4285528342921344[Ljava.util.HashMap$Node; 8:4284880308511360org.apache.hadoop.hbase.client.ScannerCallableWithReplicas 9:8570671274261472java.util.LinkedList 10:4285579205707792java.util.HashMap 11:4285283205693584org.apache.hadoop.hbase.client.RpcRetryingCaller 12:3820914152836560org.apache.hadoop.hbase.filter.SingleColumnValueFilter 13:4291904137340928java.util.concurrent.ConcurrentHashMap$Node 14:8570636137130176java.util.TreeMap$EntrySet 15:4285278137128896org.apache.hadoop.hbase.io.TimeRange 16:8570479137127664java.util.concurrent.atomic.AtomicBoolean 17:289140992525088org.apache.hadoop.hbase.NoTagsKeyValue 18:428654068584640java.lang.Integer 19:428529868564768java.util.TreeMap$KeySet 20:428527568564400java.util.TreeSet 21:428500668560096java.uti 您可能想查找下面的文章:文章分类最近更新的内容
|