2016-03-28 6 views
1

私はパンダで前処理したいデータセットを持っています。これは、そのデータセットの2行のサンプルです:pandasで自分のcsvファイルを再フォーマットする(値のセットを列に変換し、別のセットを対応する値に一致させる)

| text | rank | date  | provinces.0 | provinces.1  | provinces.2  | provinces.3 | provinces.4 | provinces.5 | provinces.6    | provinces.7    | provinces.8 | provinces.9   | provinces.10 | provinceFrequency.0 | provinceFrequency.1 | provinceFrequency.2 | provinceFrequency.3 | provinceFrequency.4 | provinceFrequency.5 | provinceFrequency.6 | provinceFrequency.7 | provinceFrequency.8 | provinceFrequency.9 | provinceFrequency.10 | | 
|--------|------|-----------|-------------|------------------|------------------|-------------|-------------|---------------|---------------------------|---------------------------|--------------|----------------------|--------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|----------------------|---| 
| Topic1 | 0 | 9/13/2015 | Ontario  | Quebec   | British Columbia | Alberta  | Manitoba | Saskatchewan | Newfoundland and Labrador |       |    |      |    | 192     | 378     | 8     | 10     | 1     | 1     | 1     |      |      |      |      | | 
| Topic2 | 1 | 9/13/2015 | Manitoba | British Columbia | Alberta   | Ontario  | Nova Scotia | New Brunswick | Quebec     | Newfoundland and Labrador | Saskatchewan | Prince Edward Island | Nunavut  | 7     | 61     | 51     | 112     | 7     | 8     | 11     | 2     | 2     | 1     | 2     | | 
|  |  |   |    |     |     |    |    |    |       |       |    |      |    |      |      |      |      |      |      |      |      |      |      |      | | 

このデータセットは、別の州でカナダのツイーターにトピックをトレンド含まれ、「provinces.0」の列は、州の名前を示し、「provinceFrequency.0は」数を示していますその地域の注目のトピックのヒット数。

私はこの形式に変換したい:

| Topic | Date  | Ontario | Quebec | Nova Scotia | New Brunswick | Manitoba | British Columbia | Prince Edward Island | Saskatchewan | Alberta | Newfoundland and Labrador | 
|--------|-----------|---------|--------|-------------|---------------|----------|------------------|----------------------|--------------|---------|---------------------------| 
| Topic1 | 9/13/2015 | 192  | 378 | -   | -    | 1  | 8    | -     | 1   | 10  | 1       | 

私はネイティブのPythonコードとコードの多くとそれを行うことができ、パンダでそれを行うための簡単な方法はありますか?

+1

@moderatorsは、IMOこの質問はかなり困難であるので、I SO ___keep__に投票します.SO – MaxU

+0

ソースから宛先へのフィールドマッピングを提供した場合に役立ちます。それ以外の場合は、読者は水平方向にたくさんスクロールし、それらのマッピングを導き出す必要があります。 – J0e3gan

答えて

2

このタスクは少しトリッキーだった:

import pandas as pd 

df = pd.read_csv(r'D:\download\Sheet1.csv') 

# `id_vars` helper list for `melt()` 
id_vars = df.columns[df.columns.str.contains('provinces\.')].tolist() 

# `value_vars` helper list for `melt()` 
val_vars = df.columns[df.columns.str.contains('provinceFrequency\.')].tolist() 

mlt = pd.melt(df, id_vars=id_vars, value_vars=val_vars) 

mlt['variable'] = mlt['variable'].str.replace(r'provinceFrequency', 'provinces') 

# add column with the _correct_ province 
mlt['prov'] = mlt.apply(lambda row: row[row['variable']], axis=1) 

new = mlt[['prov', 'value']].reset_index() 

# free memory 
del mlt 

# set original df's index for joining in future 
new['idx'] = new['index']%len(df) 

# pivot (convert rows to columns) 
pvt = pd.pivot_table(new, index='idx', columns='prov', values='value', aggfunc='first') 

# free memory 
del new 

# join original `DF` with the pivoted DF `PVT` using index 
rslt = df[['text','rank','date']].join(pvt) 

print(rslt) 

出力:ステップバイ

 text rank  date Alberta British Columbia Manitoba \ 
0 Topic1  0 9/13/2015  10.0    8.0  1.0 
1 Topic2  1 9/13/2015  51.0    61.0  7.0 

    New Brunswick Newfoundland and Labrador Nova Scotia Nunavut Ontario \ 
0   NaN      1.0   NaN  NaN 192.0 
1   8.0      2.0   7.0  2.0 112.0 

    Prince Edward Island Quebec Saskatchewan 
0     NaN 378.0   1.0 
1     1.0 11.0   2.0 

ステップ:

In [263]: id_vars 
Out[263]: 
['provinces.0', 
'provinces.1', 
'provinces.2', 
'provinces.3', 
'provinces.4', 
'provinces.5', 
'provinces.6', 
'provinces.7', 
'provinces.8', 
'provinces.9', 
'provinces.10'] 

In [264]: val_vars 
Out[264]: 
['provinceFrequency.0', 
'provinceFrequency.1', 
'provinceFrequency.2', 
'provinceFrequency.3', 
'provinceFrequency.4', 
'provinceFrequency.5', 
'provinceFrequency.6', 
'provinceFrequency.7', 
'provinceFrequency.8', 
'provinceFrequency.9', 
'provinceFrequency.10'] 

In [265]: mlt 
Out[265]: 
    provinces.0  provinces.1  provinces.2 provinces.3 provinces.4 \ 
0  Ontario   Quebec British Columbia  Alberta  Manitoba 
1  Manitoba British Columbia   Alberta  Ontario Nova Scotia 
2  Ontario   Quebec British Columbia  Alberta  Manitoba 
3  Manitoba British Columbia   Alberta  Ontario Nova Scotia 
4  Ontario   Quebec British Columbia  Alberta  Manitoba 
5  Manitoba British Columbia   Alberta  Ontario Nova Scotia 
6  Ontario   Quebec British Columbia  Alberta  Manitoba 
7  Manitoba British Columbia   Alberta  Ontario Nova Scotia 
8  Ontario   Quebec British Columbia  Alberta  Manitoba 
9  Manitoba British Columbia   Alberta  Ontario Nova Scotia 
10  Ontario   Quebec British Columbia  Alberta  Manitoba 
11 Manitoba British Columbia   Alberta  Ontario Nova Scotia 
12  Ontario   Quebec British Columbia  Alberta  Manitoba 
13 Manitoba British Columbia   Alberta  Ontario Nova Scotia 
14  Ontario   Quebec British Columbia  Alberta  Manitoba 
15 Manitoba British Columbia   Alberta  Ontario Nova Scotia 
16  Ontario   Quebec British Columbia  Alberta  Manitoba 
17 Manitoba British Columbia   Alberta  Ontario Nova Scotia 
18  Ontario   Quebec British Columbia  Alberta  Manitoba 
19 Manitoba British Columbia   Alberta  Ontario Nova Scotia 
20  Ontario   Quebec British Columbia  Alberta  Manitoba 
21 Manitoba British Columbia   Alberta  Ontario Nova Scotia 

     provinces.5    provinces.6    provinces.7 \ 
0 Saskatchewan Newfoundland and Labrador      NaN 
1 New Brunswick      Quebec Newfoundland and Labrador 
2 Saskatchewan Newfoundland and Labrador      NaN 
3 New Brunswick      Quebec Newfoundland and Labrador 
4 Saskatchewan Newfoundland and Labrador      NaN 
5 New Brunswick      Quebec Newfoundland and Labrador 
6 Saskatchewan Newfoundland and Labrador      NaN 
7 New Brunswick      Quebec Newfoundland and Labrador 
8 Saskatchewan Newfoundland and Labrador      NaN 
9 New Brunswick      Quebec Newfoundland and Labrador 
10 Saskatchewan Newfoundland and Labrador      NaN 
11 New Brunswick      Quebec Newfoundland and Labrador 
12 Saskatchewan Newfoundland and Labrador      NaN 
13 New Brunswick      Quebec Newfoundland and Labrador 
14 Saskatchewan Newfoundland and Labrador      NaN 
15 New Brunswick      Quebec Newfoundland and Labrador 
16 Saskatchewan Newfoundland and Labrador      NaN 
17 New Brunswick      Quebec Newfoundland and Labrador 
18 Saskatchewan Newfoundland and Labrador      NaN 
19 New Brunswick      Quebec Newfoundland and Labrador 
20 Saskatchewan Newfoundland and Labrador      NaN 
21 New Brunswick      Quebec Newfoundland and Labrador 

    provinces.8   provinces.9 provinces.10  variable value \ 
0   NaN     NaN   NaN provinces.0 192.0 
1 Saskatchewan Prince Edward Island  Nunavut provinces.0 7.0 
2   NaN     NaN   NaN provinces.1 378.0 
3 Saskatchewan Prince Edward Island  Nunavut provinces.1 61.0 
4   NaN     NaN   NaN provinces.2 8.0 
5 Saskatchewan Prince Edward Island  Nunavut provinces.2 51.0 
6   NaN     NaN   NaN provinces.3 10.0 
7 Saskatchewan Prince Edward Island  Nunavut provinces.3 112.0 
8   NaN     NaN   NaN provinces.4 1.0 
9 Saskatchewan Prince Edward Island  Nunavut provinces.4 7.0 
10   NaN     NaN   NaN provinces.5 1.0 
11 Saskatchewan Prince Edward Island  Nunavut provinces.5 8.0 
12   NaN     NaN   NaN provinces.6 1.0 
13 Saskatchewan Prince Edward Island  Nunavut provinces.6 11.0 
14   NaN     NaN   NaN provinces.7 NaN 
15 Saskatchewan Prince Edward Island  Nunavut provinces.7 2.0 
16   NaN     NaN   NaN provinces.8 NaN 
17 Saskatchewan Prince Edward Island  Nunavut provinces.8 2.0 
18   NaN     NaN   NaN provinces.9 NaN 
19 Saskatchewan Prince Edward Island  Nunavut provinces.9 1.0 
20   NaN     NaN   NaN provinces.10 NaN 
21 Saskatchewan Prince Edward Island  Nunavut provinces.10 2.0 

         prov 
0      Ontario 
1     Manitoba 
2      Quebec 
3   British Columbia 
4   British Columbia 
5      Alberta 
6      Alberta 
7      Ontario 
8     Manitoba 
9     Nova Scotia 
10    Saskatchewan 
11    New Brunswick 
12 Newfoundland and Labrador 
13      Quebec 
14      NaN 
15 Newfoundland and Labrador 
16      NaN 
17    Saskatchewan 
18      NaN 
19  Prince Edward Island 
20      NaN 
21     Nunavut 

In [269]: mlt[['prov', 'value']].reset_index() 
Out[269]: 
    index      prov value 
0  0     Ontario 192.0 
1  1     Manitoba 7.0 
2  2      Quebec 378.0 
3  3   British Columbia 61.0 
4  4   British Columbia 8.0 
5  5     Alberta 51.0 
6  6     Alberta 10.0 
7  7     Ontario 112.0 
8  8     Manitoba 1.0 
9  9    Nova Scotia 7.0 
10  10    Saskatchewan 1.0 
11  11    New Brunswick 8.0 
12  12 Newfoundland and Labrador 1.0 
13  13      Quebec 11.0 
14  14      NaN NaN 
15  15 Newfoundland and Labrador 2.0 
16  16      NaN NaN 
17  17    Saskatchewan 2.0 
18  18      NaN NaN 
19  19  Prince Edward Island 1.0 
20  20      NaN NaN 
21  21     Nunavut 2.0 

In [270]: # set original index for joining in future 

In [271]: new['idx'] = new['index']%len(df) 

In [272]: new 
Out[272]: 
    index      prov value idx 
0  0     Ontario 192.0 0 
1  1     Manitoba 7.0 1 
2  2      Quebec 378.0 0 
3  3   British Columbia 61.0 1 
4  4   British Columbia 8.0 0 
5  5     Alberta 51.0 1 
6  6     Alberta 10.0 0 
7  7     Ontario 112.0 1 
8  8     Manitoba 1.0 0 
9  9    Nova Scotia 7.0 1 
10  10    Saskatchewan 1.0 0 
11  11    New Brunswick 8.0 1 
12  12 Newfoundland and Labrador 1.0 0 
13  13      Quebec 11.0 1 
14  14      NaN NaN 0 
15  15 Newfoundland and Labrador 2.0 1 
16  16      NaN NaN 0 
17  17    Saskatchewan 2.0 1 
18  18      NaN NaN 0 
19  19  Prince Edward Island 1.0 1 
20  20      NaN NaN 0 
21  21     Nunavut 2.0 1 

In [273]: pvt = pd.pivot_table(new, index='idx', columns='prov', values='value', aggfunc='first') 

In [274]: pvt 
Out[274]: 
prov Alberta British Columbia Manitoba New Brunswick \ 
idx 
0  10.0    8.0  1.0   NaN 
1  51.0    61.0  7.0   8.0 

prov Newfoundland and Labrador Nova Scotia Nunavut Ontario \ 
idx 
0       1.0   NaN  NaN 192.0 
1       2.0   7.0  2.0 112.0 

prov Prince Edward Island Quebec Saskatchewan 
idx 
0      NaN 378.0   1.0 
1      1.0 11.0   2.0 

In [275]: rslt = df[['text','rank','date']].join(pvt) 

In [276]: rslt 
Out[276]: 
    text rank  date Alberta British Columbia Manitoba \ 
0 Topic1  0 9/13/2015  10.0    8.0  1.0 
1 Topic2  1 9/13/2015  51.0    61.0  7.0 

    New Brunswick Newfoundland and Labrador Nova Scotia Nunavut Ontario \ 
0   NaN      1.0   NaN  NaN 192.0 
1   8.0      2.0   7.0  2.0 112.0 

    Prince Edward Island Quebec Saskatchewan 
0     NaN 378.0   1.0 
1     1.0 11.0   2.0 
+0

それは絶対に素晴らしかった! – Iman

+0

@Iman、私は助けることができてうれしい。あなたはあなたの質問を編集して、あなたのサンプル入力データと期待される出力をあなたの質問の中にテキストとして入れてください。たぶん、司会者はこの質問を閉じないでしょう。私は/それが他の人のためにも役立つかもしれないと思います... – MaxU

+0

私はhttp://www.tablesgenerator.com/markdown_tablesを使用して、あなたの助けに感謝しました。タイトルを変更する – Iman

関連する問題