Subsampling an .hdf5 file with multiple datasets

I am trying to build a smaller sample file by extracting some "rows" from a large .h5 file.

I extract the rows at random so that the sample resembles the original file.

import h5py
import numpy as np

# Get the dataset length and prepare the sample indices
source_file = h5py.File(args.data_path, "r") 
dataset = source_file['X'] 
indices = np.sort(np.random.choice(dataset.shape[0],args.nb_rows)) 

#checking we're extracting a subsample 
if args.nb_rows > dataset.shape[0]: 
    raise ValueError("Can't extract more rows than dataset contains. Dataset has %s rows" % dataset.shape[0]) 

target_file = h5py.File(target, "w") 
for k in source_file.keys(): 
    dataset = source_file[k] 
    dataset = dataset[indices,:,:,:] 
    dest_dataset = target_file.create_dataset(k, shape=(dataset.shape), dtype=np.float32) 
    dest_dataset.write_direct(dataset)
target_file.close() 
source_file.close()

However, when nb_rows is something like 10,000, I get TypeError("Indexing elements must be in increasing order"). The indices are sorted, so this error should not occur. Am I misunderstanding something?

Asked 2017-07-06 by Malo Marrec

I think you are getting duplicates.

Clearly you will get duplicates when args.nb_rows > dataset.shape[0]:

In [499]: np.random.choice(10, 20) 
Out[499]: array([2, 4, 1, 5, 2, 8, 4, 3, 7, 0, 2, 6, 6, 8, 9, 3, 8, 4, 2, 5]) 
In [500]: np.sort(np.random.choice(10, 20)) 
Out[500]: array([1, 1, 1, 2, 2, 2, 4, 4, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8, 8, 9])

But you can still get duplicates when the number is smaller:

In [502]: np.sort(np.random.choice(10, 9)) 
Out[502]: array([0, 0, 1, 1, 1, 5, 5, 9, 9])
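Sorting does not remove the repeats, and as far as I understand h5py expects the index list to be strictly increasing, which is why a sorted array can still raise this error. A minimal sketch of that check, using the arrays from the sessions in this answer (the helper name is_strictly_increasing is made up for illustration):

```python
import numpy as np

def is_strictly_increasing(indices):
    """Return True when every index is larger than the one before it."""
    indices = np.asarray(indices)
    # np.diff gives the gaps between neighbours; a repeat shows up as a gap of 0.
    return bool(np.all(np.diff(indices) > 0))

# Sorted, but with repeats (the values shown in Out[502]).
sorted_with_repeats = np.array([0, 0, 1, 1, 1, 5, 5, 9, 9])
# Sorted and unique (the values produced with replace=False in Out[504]).
sorted_unique = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])

print(is_strictly_increasing(sorted_with_repeats))  # False
print(is_strictly_increasing(sorted_unique))        # True
```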

Turn replace off so that np.random.choice samples without replacement (it samples with replacement by default):

In [504]: np.sort(np.random.choice(10, 9, replace=False)) 
Out[504]: array([0, 1, 2, 3, 4, 5, 6, 7, 8])
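Applied to the code in the question, the change could look roughly like the sketch below. It is not the asker's exact script: the function name subsample_h5 and its parameters are hypothetical, it assumes every top-level key is a dataset with the same number of rows as 'X', and it keeps each dataset's own dtype instead of forcing np.float32.

```python
import h5py
import numpy as np

def subsample_h5(source_path, target_path, nb_rows, key="X"):
    """Copy nb_rows randomly chosen rows of every dataset into a new file.

    Hypothetical helper: assumes all top-level datasets share the first
    dimension of source_file[key].
    """
    with h5py.File(source_path, "r") as source_file, \
            h5py.File(target_path, "w") as target_file:
        total_rows = source_file[key].shape[0]
        # Validate before sampling: replace=False cannot draw more rows than exist.
        if nb_rows > total_rows:
            raise ValueError(
                "Can't extract more rows than dataset contains. "
                "Dataset has %s rows" % total_rows
            )
        # replace=False gives unique indices; sorting makes them strictly increasing.
        indices = np.sort(np.random.choice(total_rows, nb_rows, replace=False))
        for name in source_file.keys():
            # Select the sampled rows and keep all remaining dimensions.
            data = source_file[name][indices, ...]
            target_file.create_dataset(name, data=data)
    return indices
```

Using the same indices for every dataset keeps the rows aligned across datasets, for example samples and their labels.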

Answered 2017-07-06 23:16:53 by hpaulj
