Splitting reducer output in Hadoop

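The output files produced by my Reduce operation are huge (1 GB after gzipping). I want the output broken into smaller files of 200 MB each. Is there a property or Java class that splits reducer output by size or by number of lines? I can't increase the number of reducers, because that hurts the performance of the Hadoop job.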

2012-05-03 hznut

I wonder why you can't just use more reducers, but I'll take you at your word.

You can use MultipleOutputs to write to multiple files from a single reducer. For example, say the output file for each reducer is 1 GB and you want 256 MB files instead. That means you need to write 4 files per reducer rather than one.

In your job driver, do this:

// Old (org.apache.hadoop.mapred) API.
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

JobConf conf = ...;

// You should probably pass this in as parameter rather than hardcoding 4.
conf.setInt("outputs.per.reducer", 4);

// This sets up the infrastructure to write multiple files per reducer.
MultipleOutputs.addMultiNamedOutput(conf, "multi", YourOutputFormat.class, YourKey.class, YourValue.class);

In your reducer, do this:

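// Fields initialized in configure().
private int numFiles;
private MultipleOutputs multipleOutputs;

@Override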
public void configure(JobConf conf) { 
    numFiles = conf.getInt("outputs.per.reducer", 1); 
    multipleOutputs = new MultipleOutputs(conf); 

    // other init stuff 
    ... 
} 

@Override 
public void reduce(YourKey key,
        Iterator<YourValue> valuesIter,
        OutputCollector<OutKey, OutVal> ignoreThis,
        Reporter reporter) throws IOException {
    // Do your business logic just as you're doing currently. 
    OutKey outputKey = ...; 
    OutVal outputVal = ...; 

    // Now this is where it gets interesting. Hash the value to find 
    // which output file the data should be written to. Don't use the 
    // key since all the data will be written to one file if the number 
    // of reducers is a multiple of numFiles. 
    int fileIndex = (outputVal.hashCode() & Integer.MAX_VALUE) % numFiles; 

    // Now use multiple outputs to actually write the data. 
    // This will create output files named: multi_0-r-00000, multi_1-r-00000, 
    // multi_2-r-00000, multi_3-r-00000 for reducer 0. For reducer 1, the files 
    // will be multi_0-r-00001, multi_1-r-00001, multi_2-r-00001, multi_3-r-00001. 
    multipleOutputs.getCollector("multi", Integer.toString(fileIndex), reporter)
        .collect(outputKey, outputVal);
} 

@Override
public void close() throws IOException {
    // You must do this!!!! 
    multipleOutputs.close(); 
}

This pseudocode was written with the old MapReduce API (org.apache.hadoop.mapred) in mind. Equivalent APIs exist in the newer mapreduce API, though, so either way you should be all set.
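To see what the fileIndex computation in the reducer does, here is a small standalone sketch with no Hadoop dependency. The class and method names (OutputBucketing, filesPerReducer, fileIndex, outputName) are made up for illustration, and the file name pattern simply follows the one described in the reducer comments. It also shows one way to derive the "outputs.per.reducer" value from an estimated reducer output size instead of hardcoding 4.

```java
import java.util.Locale;

/** Standalone illustration; no Hadoop classes are involved. All names are hypothetical. */
final class OutputBucketing {

    /** Number of files needed so that each holds at most targetFileBytes (ceiling division). */
    static int filesPerReducer(long reducerOutputBytes, long targetFileBytes) {
        if (reducerOutputBytes <= 0 || targetFileBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (int) ((reducerOutputBytes + targetFileBytes - 1) / targetFileBytes);
    }

    /** Same bucketing as in the reducer: clear the sign bit, then take the remainder. */
    static int fileIndex(Object outputValue, int numFiles) {
        return (outputValue.hashCode() & Integer.MAX_VALUE) % numFiles;
    }

    /** Builds a name in the pattern from the reducer comments, e.g. multi_2-r-00001. */
    static String outputName(String namedOutput, int fileIndex, int reducerId) {
        return String.format(Locale.ROOT, "%s_%d-r-%05d", namedOutput, fileIndex, reducerId);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 1 GB of reducer output, 256 MB target files, as in the example above.
        int numFiles = filesPerReducer(1024 * mb, 256 * mb);
        for (String value : new String[] {"alpha", "beta", "gamma"}) {
            int idx = fileIndex(value, numFiles);
            System.out.println(value + " -> " + outputName("multi", idx, 0));
        }
    }
}
```

Keep in mind that hashing only evens out the number of records per file. The files come out roughly equal in size if the records are similar in size, and the input size is an estimate, so treat the target as approximate.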

2012-05-04 06:21:18 deridex

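I can't increase the number of reducers because more data would have to be shuffled, which makes the job slower. I have confirmed this both in theory and in practice. I like your proposed solution, though. I'll give it a try. – hznut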

There is no property that does this. You need to write your own output format and record writer.
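For a rough idea of what such a record writer has to do, here is a minimal sketch of the rolling logic in plain Java, with no Hadoop types. RollingLineWriter and its file naming are hypothetical. In a real job this logic would presumably live inside a custom RecordWriter returned by your own OutputFormat and write through Hadoop's file system API rather than java.nio. It also counts uncompressed bytes, so for gzipped output the threshold would have to be scaled by an assumed compression ratio.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

/** Hypothetical helper: writes lines and starts a new file once a byte budget is reached. */
final class RollingLineWriter implements AutoCloseable {
    private final Path dir;
    private final String prefix;
    private final long maxBytesPerFile;
    private BufferedWriter current;
    private long bytesInCurrent;
    private int filesOpened;

    RollingLineWriter(Path dir, String prefix, long maxBytesPerFile) {
        if (maxBytesPerFile <= 0) {
            throw new IllegalArgumentException("maxBytesPerFile must be positive");
        }
        this.dir = dir;
        this.prefix = prefix;
        this.maxBytesPerFile = maxBytesPerFile;
    }

    void write(String line) throws IOException {
        long recordBytes = line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
        // Roll before writing if this record would push the current file past the limit.
        // A single record larger than the limit still gets written, in a file of its own.
        if (current == null || (bytesInCurrent > 0 && bytesInCurrent + recordBytes > maxBytesPerFile)) {
            roll();
        }
        current.write(line);
        current.write('\n');
        bytesInCurrent += recordBytes;
    }

    private void roll() throws IOException {
        if (current != null) {
            current.close();
        }
        Path next = dir.resolve(String.format(Locale.ROOT, "%s-%05d", prefix, filesOpened));
        current = Files.newBufferedWriter(next, StandardCharsets.UTF_8);
        filesOpened++;
        bytesInCurrent = 0;
    }

    int filesOpened() {
        return filesOpened;
    }

    @Override
    public void close() throws IOException {
        if (current != null) {
            current.close();
            current = null;
        }
    }
}
```

Unlike the hashing approach, this keeps records in their original order and bounds each file by size, at the cost of having to maintain your own output format.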

2012-05-03 21:20:14
