Breaking down TensorFlow performance with timeline and benchmarks

Using TF 0.12.1, I am trying to understand how TensorFlow performance breaks down. In particular, I am looking at the Inception-v3 model and how long the forward pass step takes.

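My first step was to run a benchmark of the inference step alone. To avoid queueing time, I set the training examples to a constant tensor and run it through the Inception model. The train method below is configured for 8 GPUs, a batch size of 32, and 1 parameter server.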

from datetime import datetime
import time

import tensorflow as tf

# Excerpt: FLAGS, inception, slim, _tower_loss and the RMSPROP_* constants
# are assumed to be defined elsewhere in the same module.


def train(dataset):
    """Train on dataset for a number of steps."""
    with tf.Graph().as_default(), tf.device('/cpu:0'):
        # Create a variable to count the number of train() calls. This equals
        # the number of batches processed * FLAGS.num_gpus.
        global_step = tf.get_variable(
            'global_step', [],
            initializer=tf.constant_initializer(0), trainable=False)

        # Calculate the learning rate schedule.
        num_batches_per_epoch = (dataset.num_examples_per_epoch() /
                                 FLAGS.batch_size)
        decay_steps = int(num_batches_per_epoch * FLAGS.num_epochs_per_decay)

        # Decay the learning rate exponentially based on the number of steps.
        lr = tf.train.exponential_decay(FLAGS.initial_learning_rate,
                                        global_step,
                                        decay_steps,
                                        FLAGS.learning_rate_decay_factor,
                                        staircase=True)

        # Create an optimizer that performs gradient descent.
        opt = tf.train.RMSPropOptimizer(lr, RMSPROP_DECAY,
                                        momentum=RMSPROP_MOMENTUM,
                                        epsilon=RMSPROP_EPSILON)

        # Get images and labels for ImageNet and split the batch across GPUs.
        assert FLAGS.batch_size % FLAGS.num_gpus == 0, (
            'Batch size must be divisible by number of GPUs')
        split_batch_size = int(FLAGS.batch_size / FLAGS.num_gpus)
        num_classes = dataset.num_classes() + 1

        # Calculate the gradients for each model tower.
        tower_grads = []
        reuse_variables = None
        for i in xrange(FLAGS.num_gpus):
            with tf.device('/gpu:%d' % i):
                with tf.name_scope('%s_%d' % (inception.TOWER_NAME, i)) as scope:
                    # Force all Variables to reside on the CPU.
                    with slim.arg_scope([slim.variables.variable], device='/cpu:0'):
                        # Calculate the loss for one tower of the ImageNet model.
                        # This function constructs the entire ImageNet model but
                        # shares the variables across all towers. The inputs are
                        # constant zero tensors so that no input queue is involved.
                        image_shape = (FLAGS.batch_size, FLAGS.image_size,
                                       FLAGS.image_size, 3)
                        labels_shape = (FLAGS.batch_size,)
                        images = tf.zeros(image_shape, dtype=tf.float32)
                        labels = tf.zeros(labels_shape, dtype=tf.int32)

                        logits = _tower_loss(images, labels, num_classes,
                                             scope, reuse_variables)

                    # Reuse variables for the next tower.
                    reuse_variables = True

        # Build an initialization operation to run below.
        init = tf.initialize_all_variables()

        # Start running operations on the Graph. allow_soft_placement must be
        # set to True to build towers on GPU, as some of the ops do not have
        # GPU implementations.
        sess = tf.Session(config=tf.ConfigProto(
            allow_soft_placement=True,
            log_device_placement=FLAGS.log_device_placement))
        sess.run(init)

        # Start the queue runners.
        tf.train.start_queue_runners(sess=sess)

        # Time each run of the logits op (forward pass only).
        for step in xrange(FLAGS.max_steps):
            start_time = time.time()
            loss_value = sess.run(logits)
            duration = time.time() - start_time
            examples_per_sec = FLAGS.batch_size / float(duration)
            format_str = ('%s: step %d (%.1f examples/sec; %.3f '
                          'sec/batch)')
            print(format_str % (datetime.now(), step,
                                examples_per_sec, duration))

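With this setup, I observe 0.44 seconds per run of the logits operation, which performs the forward pass. However, when I run the timeline tool, the inference time appears much smaller (see the figure below). In the GPU rows, I observe an initial burst followed by a longer GPU burst. I assume the first burst is the forward pass and the second is backpropagation.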

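If the initial burst really is the forward pass, it is substantially shorter than 0.44 seconds. Can anyone explain this discrepancy? Is there a mistake in the benchmark code, or is the timeline tool not capturing the full picture? In addition, there are a few GPU operations before the first large burst that I cannot explain. Any insight would be greatly appreciated!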
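One way to put a number on the timeline side of this comparison is to total the trace directly instead of reading it off the viewer. The following is a minimal sketch, assuming the timeline was saved in Chrome trace format: a JSON object with a `traceEvents` list in which complete events carry `ph: "X"` plus `ts` and `dur` in microseconds. The function name `summarize_trace` and the sample events are hypothetical and are not taken from the trace in the question.

```python
from collections import defaultdict


def summarize_trace(trace):
    """Return {pid: (span_us, busy_us)} for complete ('X') events.

    span_us is the time from the first event start to the last event end.
    busy_us is the total time covered by at least one event.
    """
    intervals = defaultdict(list)
    for event in trace.get("traceEvents", []):
        if event.get("ph") == "X":
            start = event["ts"]
            intervals[event["pid"]].append((start, start + event.get("dur", 0)))

    summary = {}
    for pid, spans in intervals.items():
        spans.sort()
        first_start = spans[0][0]
        last_end = max(end for _, end in spans)
        # Merge overlapping intervals so concurrent ops are not double counted.
        busy = 0
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start > cur_end:
                busy += cur_end - cur_start
                cur_start, cur_end = start, end
            else:
                cur_end = max(cur_end, end)
        busy += cur_end - cur_start
        summary[pid] = (last_end - first_start, busy)
    return summary


# Hypothetical sample: two overlapping ops, a gap, then a third op.
example = {"traceEvents": [
    {"ph": "X", "pid": 1, "tid": 0, "name": "op_a", "ts": 0, "dur": 100},
    {"ph": "X", "pid": 1, "tid": 0, "name": "op_b", "ts": 50, "dur": 100},
    {"ph": "X", "pid": 1, "tid": 0, "name": "op_c", "ts": 400, "dur": 50},
]}
summary = summarize_trace(example)
```

For a real trace, load the file with `json.load` and pass the resulting dict in. A large difference between the span and the busy time for a device points to idle gaps, and a span much shorter than the wall-clock time measured around `sess.run` suggests that time is being spent outside the traced ops.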

Source

2017-05-04 user3249763

TensorFlow has received many significant performance improvements since TF 0.12.1. If you are interested in solid performance numbers, use the latest version of TensorFlow or the 1.2 release.

If you want a high-performance model as a starting point, I strongly recommend working from https://github.com/tensorflow/benchmarks, which includes an Inception-v3 model.

To understand the detailed performance of a single step, I recommend instrumenting the C++ TensorFlow runtime.

It is also important to run the experiment for a number of iterations so that the system can "warm up" and fully initialize. (In this case, the overhead inside Python is significant.)
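The warm-up advice can be sketched as a small timing helper. This is only an illustration under stated assumptions: the name `benchmark`, its parameters, and the default step counts are hypothetical, and `step_fn` stands for whatever callable performs one step.

```python
import time


def benchmark(step_fn, warmup_steps=5, timed_steps=20):
    """Return mean seconds per call of step_fn, excluding warm-up calls."""
    # Warm-up calls are not timed: the first calls may pay one-time
    # initialization costs.
    for _ in range(warmup_steps):
        step_fn()

    durations = []
    for _ in range(timed_steps):
        start = time.time()
        step_fn()
        durations.append(time.time() - start)
    return sum(durations) / len(durations)
```

With the code from the question, `step_fn` could be something like `lambda: sess.run(logits)`. Comparing the mean with and without warm-up steps gives a rough idea of how much of the 0.44 seconds is one-time cost.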

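Note: when tuning a model, do not set allow_soft_placement=True. For now, it is better to verify that every operation you expect to run on the GPU really is placed there. You can check this in the log output controlled by the log_device_placement parameter.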

Source

2017-05-08 19:38:11 saeta
