2016-09-02 12 views
0

クラスタ内のノードが毎回OOMEに当たっています。最近この問題に遭遇したマシン(XX.XX.XX.187)のログを以下に示します。 XX.XX.XX.187とXX.XX.XX.184(これはクラスタ内の別のノード)間の通信に問題があるようですが、わかりません。マシンがOOMEにヒットすると、Hazelcastはシャットダウンされます。OutOfMemoryError - HazelCast v3.6.4

次のログは興味深いものである:

2016-08-28T10:30:44:715+0200 - com.hazelcast.cluster.impl.operations.MasterConfirmationOperation WARNING: [XX.XX.XX.187]:45000 [dev] [3.6.4] Address[XX.XX.XX.184]:45000 has sent MasterConfirmation, but this node is not master! 
2016-08-28T10:31:14:715+0200 - com.hazelcast.cluster.impl.operations.MasterConfirmationOperation WARNING: [XX.XX.XX.187]:45000 [dev] [3.6.4] Address[XX.XX.XX.184]:45000 has sent MasterConfirmation, but this node is not master! 
2016-08-28T10:31:44:715+0200 - com.hazelcast.cluster.impl.operations.MasterConfirmationOperation WARNING: [XX.XX.XX.187]:45000 [dev] [3.6.4] Address[XX.XX.XX.184]:45000 has sent MasterConfirmation, but this node is not master! 
2016-08-28T10:31:51:986+0200 - com.hazelcast.nio.tcp.TcpIpConnection INFO: [XX.XX.XX.187]:45000 [dev] [3.6.4] Connection [Address[XX.XX.XX.184]:45000] lost. Reason: Socket explicitly closed 
2016-08-28T10:31:51:987+0200 - com.hazelcast.cluster.ClusterService INFO: [XX.XX.XX.187]:45000 [dev] [3.6.4] Removing Member [XX.XX.XX.184]:45000 
2016-08-28T10:31:51:987+0200 - com.hazelcast.partition.InternalPartitionService INFO: [XX.XX.XX.187]:45000 [dev] [3.6.4] Removing Member [XX.XX.XX.184]:45000 
2016-08-28T10:31:51:988+0200 - com.hazelcast.cluster.ClusterService INFO: [XX.XX.XX.187]:45000 [dev] [3.6.4] 

Members [3] { 
Member [XX.XX.XX.186]:45000 
Member [XX.XX.XX.187]:45000 this 
Member [XX.XX.XX.185]:45000 
} 

2016-08-28T10:31:51:988+0200 - com.hazelcast.transaction.TransactionManagerService INFO: [XX.XX.XX.187]:45000 [dev] [3.6.4] Committing/rolling-back alive transactions of Member [XX.XX.XX.184]:45000, UUID: 3fb9cff4-a33d-4ca1-b51d-6b387fbb0856 
2016-08-28T10:31:52:497+0200 - com.hazelcast.nio.tcp.SocketAcceptorThread INFO: [XX.XX.XX.187]:45000 [dev] [3.6.4] Accepting socket connection from /XX.XX.XX.184:54655 
2016-08-28T10:31:52:497+0200 - com.hazelcast.nio.tcp.TcpIpConnectionManager INFO: [XX.XX.XX.187]:45000 [dev] [3.6.4] Established socket connection between /XX.XX.XX.187:45000 and /XX.XX.XX.184:54655 
2016-08-28T10:32:14:715+0200 - com.hazelcast.cluster.impl.operations.MasterConfirmationOperation WARNING: [XX.XX.XX.187]:45000 [dev] [3.6.4] MasterConfirmation has been received from Address[XX.XX.XX.184]:45000, but it is not a member of this cluster! 
2016-08-28T10:32:14:718+0200 - com.hazelcast.nio.tcp.TcpIpConnection INFO: [XX.XX.XX.187]:45000 [dev] [3.6.4] Connection [Address[XX.XX.XX.184]:45000] lost. Reason: java.io.EOFException[Remote socket closed!] 
2016-08-28T10:32:14:718+0200 - com.hazelcast.nio.tcp.nonblocking.NonBlockingSocketReader WARNING: [XX.XX.XX.187]:45000 [dev] [3.6.4] hz._hzInstance_1_dev.IO.thread-in-1 Closing socket to endpoint Address[XX.XX.XX.184]:45000, Cause:java.io.EOFException: Remote socket closed! 
2016-08-28T10:32:30:801+0200 - com.hazelcast.nio.tcp.SocketAcceptorThread INFO: [XX.XX.XX.187]:45000 [dev] [3.6.4] Accepting socket connection from /XX.XX.XX.184:34362 
2016-08-28T10:32:30:802+0200 - com.hazelcast.nio.tcp.TcpIpConnectionManager INFO: [XX.XX.XX.187]:45000 [dev] [3.6.4] Established socket connection between /XX.XX.XX.187:45000 and /XX.XX.XX.184:34362 
2016-08-28T10:32:30:812+0200 - com.hazelcast.cluster.impl.operations.JoinCheckOperation INFO: [XX.XX.XX.187]:45000 [dev] [3.6.4] Ignoring join check from Address[XX.XX.XX.184]:45000, because this node is not master... 
2016-08-28T10:34:30:801+0200 - com.hazelcast.cluster.impl.operations.JoinCheckOperation INFO: [XX.XX.XX.187]:45000 [dev] [3.6.4] Ignoring join check from Address[XX.XX.XX.184]:45000, because this node is not master... 
2016-08-28T10:36:30:800+0200 - com.hazelcast.cluster.impl.operations.JoinCheckOperation INFO: [XX.XX.XX.187]:45000 [dev] [3.6.4] Ignoring join check from Address[XX.XX.XX.184]:45000, because this node is not master... 
2016-08-28T10:38:30:802+0200 - com.hazelcast.cluster.impl.operations.JoinCheckOperation INFO: [XX.XX.XX.187]:45000 [dev] [3.6.4] Ignoring join check from Address[XX.XX.XX.184]:45000, because this node is not master... 
2016-08-28T10:38:30:845+0200 - com.hazelcast.nio.tcp.TcpIpConnection INFO: [XX.XX.XX.187]:45000 [dev] [3.6.4] Connection [Address[XX.XX.XX.184]:45000] lost. Reason: java.io.EOFException[Remote socket closed!] 
2016-08-28T10:38:30:845+0200 - com.hazelcast.nio.tcp.nonblocking.NonBlockingSocketReader WARNING: [XX.XX.XX.187]:45000 [dev] [3.6.4] hz._hzInstance_1_dev.IO.thread-in-1 Closing socket to endpoint Address[XX.XX.XX.184]:45000, Cause:java.io.EOFException: Remote socket closed! 
2016-08-28T10:38:36:857+0200 - com.hazelcast.cluster.ClusterService INFO: [XX.XX.XX.187]:45000 [dev] [3.6.4] 

Members [4] { 
Member [XX.XX.XX.186]:45000 
Member [XX.XX.XX.187]:45000 this 
Member [XX.XX.XX.185]:45000 
Member [XX.XX.XX.184]:45000 
} 

2016-08-28T10:38:36:857+0200 - com.hazelcast.nio.tcp.InitConnectionTask INFO: [XX.XX.XX.187]:45000 [dev] [3.6.4] Connecting to /XX.XX.XX.184:45000, timeout: 0, bind-any: true 
2016-08-28T10:38:36:858+0200 - com.hazelcast.nio.tcp.TcpIpConnectionManager INFO: [XX.XX.XX.187]:45000 [dev] [3.6.4] Established socket connection between /XX.XX.XX.187:58928 and /XX.XX.XX.184:45000 
...... 
.... 
2016-08-28T23:07:44:727+0200 - com.hazelcast.cluster.impl.ClusterHeartbeatManager WARNING: [XX.XX.XX.187]:45000 [dev] [3.6.4] Ignoring heartbeat from Member [XX.XX.XX.186]:45000 since it is expired (now: Sun Aug 28 23:07:47 CEST 2016, timestamp: Sun Aug 28 23:05:12 CEST 2016) 
2016-08-28T23:07:44:727+0200 - com.hazelcast.cluster.impl.ClusterHeartbeatManager WARNING: [XX.XX.XX.187]:45000 [dev] [3.6.4] Ignoring heartbeat from Member [XX.XX.XX.186]:45000 since it is expired (now: Sun Aug 28 23:07:47 CEST 2016, timestamp: Sun Aug 28 23:05:07 CEST 2016) 
2016-08-28T23:07:45:056+0200 - com.hazelcast.cluster.impl.ClusterHeartbeatManager WARNING: [XX.XX.XX.187]:45000 [dev] [3.6.4] Ignoring heartbeat from Member [XX.XX.XX.186]:45000 since it is expired (now: Sun Aug 28 23:07:48 CEST 2016, timestamp: Sun Aug 28 23:05:17 CEST 2016) 
2016-08-28T23:07:52:793+0200 - com.hazelcast.cluster.impl.ClusterHeartbeatManager INFO: [XX.XX.XX.187]:45000 [dev] [3.6.4] System clock apparently jumped from 2016-08-28T23:05:03.295 to 2016-08-28T23:07:52.792 since last heartbeat (+164497 ms) 
2016-08-28T23:07:52:794+0200 - com.hazelcast.cluster.impl.ClusterHeartbeatManager WARNING: [XX.XX.XX.187]:45000 [dev] [3.6.4] Resetting heartbeat timestamps because of huge system clock jump! Clock-Jump: 164497 ms, Heartbeat-Timeout: 300000 ms 
2016-08-28T23:09:34:960+0200 - com.hazelcast.nio.tcp.TcpIpConnection INFO: [XX.XX.XX.187]:45000 [dev] [3.6.4] Connection [Address[XX.XX.XX.185]:45000] lost. Reason: Socket explicitly closed 
2016-08-28T23:09:34:960+0200 - com.hazelcast.nio.tcp.TcpIpConnection INFO: [XX.XX.XX.187]:45000 [dev] [3.6.4] Connection [Address[XX.XX.XX.186]:45000] lost. Reason: Socket explicitly closed 
2016-08-28T23:09:34:961+0200 - com.hazelcast.nio.tcp.TcpIpConnection INFO: [XX.XX.XX.187]:45000 [dev] [3.6.4] Connection [Address[XX.XX.XX.184]:45000] lost. Reason: Socket explicitly closed 
2016-08-28T23:09:34:962+0200 - com.hazelcast.instance.Node WARNING: [XX.XX.XX.187]:45000 [dev] [3.6.4] Terminating forcefully... 
2016-08-28T23:09:34:962+0200 - com.hazelcast.instance.Node INFO: [XX.XX.XX.187]:45000 [dev] [3.6.4] Shutting down connection manager... 
2016-08-28T23:09:34:963+0200 - com.hazelcast.instance.Node INFO: [XX.XX.XX.187]:45000 [dev] [3.6.4] Shutting down node engine... 
2016-08-28T23:09:35:291+0200 - com.hazelcast.instance.NodeExtension INFO: [XX.XX.XX.187]:45000 [dev] [3.6.4] Destroying node NodeExtension. 
2016-08-28T23:09:35:291+0200 - com.hazelcast.instance.Node INFO: [XX.XX.XX.187]:45000 [dev] [3.6.4] Hazelcast Shutdown is completed in 329 ms. 

のOutOfMemoryExceptionが6ギガバイトの周りに消費するように成長するNonBlockingSocketWriter.writeQueueによって引き起こされます。

ログを通過してこの問題の原因が考えられますか?どんな助けや指導も感謝しています。

答えて

0

まず、NTPを使用して、マシンの時計を同期させてください。既にそれを行っている場合は、各マシンのシステム負荷はどれくらいですか? JVmが長時間停止し、その理由でハートビートが無視され、メンバーがすでに再接続を引き起こして死んでいると思われる可能性があるというGCの問題がある可能性があります。

+1

...とgithubの-スタックオーバーフロー、グループの別の答えを投稿しても意味がありません。あなたは、GCの問題によりJVMが停止し、ハートビートが失われることを推測するのは間違いありません。だから、私たちは主に、HazelcastのNonBlockingSocketWriterに起因すると思われるGCの問題のデバッグに興味があります。ヒープダンプ解析では、NonBlockingSocketWriterの 'writeQueue'フィールドが大きくなりすぎてOOMエラーが発生しています。 https://github.com/hazelcast/hazelcast/issues/8827でさらにコメントと更新を掲載します – Kiran