2016-03-31 15 views
2

数日前の 1.0.0-rc3を使用して、ジョブごとのYARNセッションのコンテキストでJobManager HAを動作させようとしていますが、タスクマネージャ複数のネットワークインターフェイスを使用しています( )。リーダーゲートウェイとポートの取得に失敗しました

は、手動でジョブマネージャのプロセスを殺した後、新たに割り当てられた 第2のジョブマネージャのjobmanager.log読み取り:IPが見つからない

2016-03-02 18:01:09,635 WARN Remoting             
    - Tried to associate with unreachable remote address [akka.tcp://[email protected]:34811]. 
Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. 
Reason: Connection refused: /10.127.68.136:34811 
2016-03-02 18:01:09,644 WARN org.apache.flink.runtime.webmonitor.JobManagerRetriever  
    - Failed to retrieve leader gateway and port. 
akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://[email protected]:34811/), 
Path(/user/jobmanager)] 
    at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65) 
    at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63) 
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) 
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67) 
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82) 
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) 
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) 
    at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) 
    at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58) 
    at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74) 
    at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110) 
    at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73) 
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) 
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) 
    at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:267) 
    at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:508) 
    at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:541) 
    at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:531) 
    at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:87) 
    at akka.remote.EndpointWriter.postStop(Endpoint.scala:561) 
    at akka.actor.Actor$class.aroundPostStop(Actor.scala:475) 
    at akka.remote.EndpointActor.aroundPostStop(Endpoint.scala:415) 
    at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210) 
    at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172) 
    at akka.actor.ActorCell.terminate(ActorCell.scala:369) 
    at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462) 
    at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) 
    at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:279) 
    at akka.dispatch.Mailbox.run(Mailbox.scala:220) 
    at akka.dispatch.Mailbox.exec(Mailbox.scala:231) 
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) 
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 

が古いジョブマネージャからです。これまでのところ、これは期待される行動ですか?

問題は、新しいジョブマネージャーにも発生し、古いジョブマネージャーには、古いジョブ に接続しようとしましたが失敗しました。 ZooKeeperLeaderRetrievalServiceが関連taskmanager.logに見られるように、利用可能 ネットワークインタフェースを介してサイクリングを開始:5回繰り返した後

2016-03-02 18:01:13,636 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService 
- Starting ZooKeeperLeaderRetrievalService. 
2016-03-02 18:01:13,646 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils   
    - Trying to select the network interface and address to use by connecting to the leading 
JobManager. 
2016-03-02 18:01:13,646 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils   
    - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics 
2016-03-02 18:01:13,712 INFO org.apache.flink.runtime.net.ConnectionUtils    
    - Retrieved new target address /10.127.68.136:34811. 
2016-03-02 18:01:14,079 INFO org.apache.flink.runtime.net.ConnectionUtils    
    - Trying to connect to address /10.127.68.136:34811 
2016-03-02 18:01:14,082 INFO org.apache.flink.runtime.net.ConnectionUtils    
    - Failed to connect from address 'task.manager.eth0.hostname.com/10.127.68.136': Connection 
refused 
2016-03-02 18:01:14,082 INFO org.apache.flink.runtime.net.ConnectionUtils    
    - Failed to connect from address '/10.127.68.136': Connection refused 
2016-03-02 18:01:14,082 INFO org.apache.flink.runtime.net.ConnectionUtils    
    - Failed to connect from address '/10.120.193.110': Connection refused 
2016-03-02 18:01:14,082 INFO org.apache.flink.runtime.net.ConnectionUtils    
    - Failed to connect from address '/10.127.68.136': Connection refused 
2016-03-02 18:01:14,083 INFO org.apache.flink.runtime.net.ConnectionUtils    
    - Failed to connect from address '/127.0.0.1': Connection refused 
2016-03-02 18:01:14,083 INFO org.apache.flink.runtime.net.ConnectionUtils    
    - Failed to connect from address '/10.120.193.110': Connection refused 
2016-03-02 18:01:14,083 INFO org.apache.flink.runtime.net.ConnectionUtils    
    - Failed to connect from address '/10.127.68.136': Connection refused 
2016-03-02 18:01:14,083 INFO org.apache.flink.runtime.net.ConnectionUtils    
    - Failed to connect from address '/127.0.0.1': Connection refused 

、タスクマネージャは HEURISTIC戦略が終わるリーダーを取得しようとして使用して停止します新しいjobmanagerが発見し、タスクマネージャは、eth1のを使用して jobmanagerで登録することができるされた後

2016-03-02 18:01:23,650 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService 
- Stopping ZooKeeperLeaderRetrievalService. 
2016-03-02 18:01:23,655 INFO org.apache.zookeeper.ClientCnxn        
    - EventThread shut down 
2016-03-02 18:01:23,655 INFO org.apache.zookeeper.ZooKeeper        
    - Session: 0x25229757cff035b closed 
2016-03-02 18:01:23,664 INFO org.apache.flink.runtime.taskmanager.TaskManager   
    - TaskManager will use hostname/address 'task.manager.eth1.hostname.com' (10.120.193.110) 
for communication. 

:これからのeth1(10.120.193.110)を使用。問題は、eth1への接続が不可能なことです。だから、flink は常にeth0を使うべきです。我々は後で見るの例外がある:

java.io.IOException: Connecting the channel failed: Connecting to remote task manager + 'other.task.manager.eth1.hostname/10.120.193.111:46620' 
has failed. This might indicate that the remote task manager has been lost. 
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:196) 
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:131) 
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:83) 
    at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:60) 
    at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:115) 
    at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:388) 
    at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:411) 
    at org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:108) 
    at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:175) 
    at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:65) 
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:224) 
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559) 
    at java.lang.Thread.run(Thread.java:744) 

根本的な原因はまだ古いjobmanager 場所を使用しているネットワークインタフェースの選択であると思われるので、右のインターフェイスを選択することができません。特に、 ネットワークインターフェイス上での反復順序が、HEURISTICとSLOWの戦略の間で異なります。 これは、間違ったインタフェースが選択されることにつながります。

+0

問題の内容をより明確に説明できますか? –

+0

私は再び編集されました –

答えて

1

問題は、今後のバグ修正リリース1.0.1で修正する必要があります。また、現時点でFlinkのバージョン1.1-SNAPSHOTを使用することもできます。

関連する問題