Hazelcast cluster members run out of memory because of a huge number of "IsStillRunningService" objects

We have a system that uses the Hazelcast IExecutor service and IMap in version 3.5. We recently noticed that Hazelcast cluster members were running out of memory in production, and eventually all nodes crashed with OOM.
During root cause analysis we found thousands of the log entries shown below, and the log file size had increased exponentially. The storage volume that holds the logs also ran out of space.

WARNING: [10.7.90.189]:30103 [FB] [3.5] Asking if operation execution has been started: com.hazelcast.spi.impl.operationservice.i[email protected]48b3ac3b 
Mar 30, 2016 11:09:29 AM com.hazelcast.spi.impl.operationservice.impl.Invocation 
WARNING: [10.7.90.189]:30103 [FB] [3.5] While asking 'is-executing': Invocation{ serviceName='hz:core:partitionService', op=com.hazelcast.spi.impl.operationservice.impl.operations.IsStillExecutingOperation{serviceName='hz:core:partitionService', partitionId=-1, callId=59834, invocationTime=1459349279980, waitTimeout=-1, callTimeout=5000}, partitionId=-1, replicaIndex=0, tryCount=0, tryPauseMillis=0, invokeCount=1, callTimeout=5000, target=Address[1.2.3.4]:30102, backupsExpected=0, backupsCompleted=0}
com.hazelcast.core.OperationTimeoutException: No response for 10000 ms. Aborting invocation! Invocation{ serviceName='hz:core:partitionService', op=com.hazelcast.spi.impl.operationservice.impl.operations.IsStillExecutingOperation{serviceName='hz:core:partitionService', partitionId=-1, callId=268177, invocationTime=1459349295209, waitTimeout=-1, callTimeout=5000}, partitionId=-1, replicaIndex=0, tryCount=0, tryPauseMillis=0, invokeCount=1, callTimeout=5000, target=Address[10.7.90.190]:30102, backupsExpected=0, backupsCompleted=0} No response has been received! backups-expected:0 backups-completed: 0
     at com.hazelcast.spi.impl.operationservice.impl.Invocation.newOperationTimeoutException(Invocation.java:491) 
     at com.hazelcast.spi.impl.operationservice.impl.IsStillRunningService$IsOperationStillRunningCallback.setOperationTimeout(IsStillRunningService.java:224) 
     at com.hazelcast.spi.impl.operationservice.impl.IsStillRunningService$IsOperationStillRunningCallback.onFailure(IsStillRunningService.java:219) 
     at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture$1.run(InvocationFuture.java:137) 
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
     at java.lang.Thread.run(Thread.java:745) 
     at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:76) 
     at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:92)

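I believe cluster members keep sending heartbeats to make sure that all members are alive, and I think the default interval is 10 seconds. The problem is that if any member stops responding or gets into a bad state, the remaining members keep issuing is-executing calls against it. After looking at a heap dump, we found that more than 73% of the heap was filled with "IsStillRunningService" objects.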
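To make the suspected failure pattern concrete, here is a minimal sketch of how an executor backed by an unbounded LinkedBlockingQueue accumulates work when its worker threads are stuck waiting on something that never answers. This is an illustration only, under the assumption that the heap growth follows this general pattern; it is not Hazelcast's implementation, and the class and method names (UnboundedQueueBacklog, backlogAfterSubmitting) are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Illustration only (hypothetical, not Hazelcast code). */
final class UnboundedQueueBacklog {

    /** Submits `tasks` jobs that block until released and returns the queued backlog. */
    public static int backlogAfterSubmitting(int tasks, int poolSize) {
        // Stands in for a remote member that never responds.
        CountDownLatch release = new CountDownLatch(1);
        // Counts down once per worker thread that has picked up a task.
        CountDownLatch started = new CountDownLatch(Math.min(tasks, poolSize));
        // No capacity bound: every task the workers cannot take stays on the heap.
        LinkedBlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        ThreadPoolExecutor executor =
                new ThreadPoolExecutor(poolSize, poolSize, 0L, TimeUnit.MILLISECONDS, queue);
        try {
            for (int i = 0; i < tasks; i++) {
                executor.execute(() -> {
                    started.countDown();
                    try {
                        release.await(); // worker is stuck here
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            started.await(); // wait until all workers are blocked
            return queue.size();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } finally {
            release.countDown();
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // 2 workers are blocked, so the other 998 tasks sit in the queue.
        System.out.println(backlogAfterSubmitting(1000, 2)); // prints 998
    }
}
```

With a bounded queue the executor would reject work instead of growing without limit, which trades an OOM for visible rejections.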

Questions:

How can we find out what exactly went wrong?
Is it a coincidence that the storage space ran out, or could there be a correlation? It happened twice in one week, so we think one could lead to the other.
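As one possible starting point for the first question, the warnings can be tallied by target address to see which member the others were waiting on. The following is a rough sketch, assuming each log record sits on a single unwrapped line and uses the `target=Address[host]:port` form seen in the excerpt; IsExecutingLogTally and tally are hypothetical names, not part of Hazelcast.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Hazelcast): counts log records per target member. */
final class IsExecutingLogTally {

    // Matches "target=Address[host]:port" as it appears in the log excerpt.
    private static final Pattern TARGET =
            Pattern.compile("target=Address\\[([^\\]]+)\\]:(\\d+)");

    /** Returns the number of target occurrences, keyed by "host:port", in first-seen order. */
    public static Map<String, Integer> tally(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = TARGET.matcher(line);
            while (m.find()) {
                counts.merge(m.group(1) + ":" + m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Shortened sample records; real input would come from the member's log file.
        List<String> sample = Arrays.asList(
                "... callTimeout=5000, target=Address[1.2.3.4]:30102, backupsExpected=0 ...",
                "... callTimeout=5000, target=Address[10.7.90.190]:30102, backupsExpected=0 ...",
                "... callTimeout=5000, target=Address[10.7.90.190]:30102, backupsExpected=0 ...");
        System.out.println(tally(sample)); // prints {1.2.3.4:30102=1, 10.7.90.190:30102=2}
    }
}
```

A target that dominates the counts is a reasonable candidate for the member that stopped responding first, though the tally alone does not prove it.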

Hazelcast XML configuration:

<hazelcast xsi:schemaLocation="http://www.hazelcast.com/schema/config http://www.hazelcast.com/schema/config/hazelcast-config-3.5.xsd"
           xmlns="http://www.hazelcast.com/schema/config"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

    <map name="myMap">
        <backup-count>0</backup-count>
        <time-to-live-seconds>43200</time-to-live-seconds>
        <eviction-policy>LRU</eviction-policy>
        <max-size policy="USED_HEAP_PERCENTAGE">75</max-size>
        <eviction-percentage>10</eviction-percentage>
        <in-memory-format>OBJECT</in-memory-format>
    </map>

    <executor-service name="calculation">
        <pool-size>10</pool-size>
        <queue-capacity>400</queue-capacity>
    </executor-service>

    <executor-service name="loader">
        <pool-size>5</pool-size>
        <queue-capacity>400</queue-capacity>
    </executor-service>

    <properties>
        <property name="hazelcast.icmp.timeout">5000</property>
        <property name="hazelcast.initial.wait.seconds">10</property>
        <property name="hazelcast.connection.monitor.interval">5000</property>
    </properties>

    <network>
        <port auto-increment="true" port-count="100">30101</port>
        <join>
            <multicast enabled="false">
                <multicast-group>224.2.2.3</multicast-group>
                <multicast-port>54327</multicast-port>
            </multicast>
            <tcp-ip enabled="true">
                <interface>1.2.3.4</interface>
                <interface>1.2.3.5</interface>
                <interface>1.2.3.6</interface>
            </tcp-ip>
            <aws enabled="false"/>
        </join>
        <interfaces enabled="false">
            <interface>127.0.0.1</interface>
        </interfaces>
    </network>
</hazelcast>

[Image: StackTrace]
[Image: LinkedBlockingQueue which holds IsStillRunningService objects]

2016-04-03 A.K.Desai

You can upgrade to 3.6, which includes fixes so that the is-still-running mechanism no longer runs into an OOME. In 3.7 the whole mechanism is removed and replaced by a less problematic approach.

https://github.com/hazelcast/hazelcast/pull/7719

2016-04-04 03:58:36 pveentjer

Thank you. We went ahead with version 3.6.2 and have not seen such errors in production for the past 2-3 weeks.
