2016-09-13 26 views
1

Filling in missing values the way I need (pandas)

email user_name sessions ymo 
[email protected] JD 1 2015-03-01 
[email protected] JD 2 2015-05-01 

all_ymo 

[Timestamp('2015-01-01 00:00:00'), 
Timestamp('2015-02-01 00:00:00'), 
Timestamp('2015-03-01 00:00:00'), 
Timestamp('2015-04-01 00:00:00'), 
Timestamp('2015-05-01 00:00:00'), 
Timestamp('2015-06-01 00:00:00'), 
Timestamp('2015-07-01 00:00:00'), 
Timestamp('2015-08-01 00:00:00'), 
Timestamp('2015-09-01 00:00:00'), 
Timestamp('2015-10-01 00:00:00'), 
Timestamp('2015-11-01 00:00:00'), 
Timestamp('2015-12-01 00:00:00')] 

email user_name sessions ymo 
[email protected] JD 0 2015-01-01 
[email protected] JD 0 2015-02-01 
[email protected] JD 1 2015-03-01 
[email protected] JD 0 2015-04-01 
[email protected] JD 2 2015-05-01 
[email protected] JD 0 2015-06-01 
[email protected] JD 0 2015-07-01 
[email protected] JD 0 2015-08-01 
[email protected] JD 0 2015-09-01 
[email protected] JD 0 2015-10-01 
[email protected] JD 0 2015-11-01 
[email protected] JD 0 2015-12-01 

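The ymo column holds pd.Timestamp values.

Unfortunately, this answer: Adding values for missing data combinations in Pandas, does not handle the existing ymo values properly.

I tried something like this, but it is extremely slow: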

for em in all_emails: 
    existent_ymo = fill_ymo[fill_ymo['email'] == em]['ymo'] 
    existent_ymo = set([pd.Timestamp(datetime.date(t.year, t.month, t.day)) for t in existent_ymo]) 
    missing_ymo = list(existent_ymo - all_ymo) 
    multi_ind = pd.MultiIndex.from_product([[em], missing_ymo], names=col_names) 
    fill_ymo = sessions.set_index(col_names).reindex(multi_ind, fill_value=0).reset_index() 
+0

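Rather than filling in the missing items, populate a new DataFrame starting from pd.date_range, then add the session values where the dates match. If email and user_name map one-to-one, you can keep just one of them in the DataFrame to save memory (if size is an issue). – dodell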
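The suggestion in this comment could look roughly like the sketch below for the single-user data in the question. The email value is a placeholder and the names `full` and `filled` are made up for illustration; this is one possible reading of the comment, not code from the commenter.

```python
import pandas as pd

# Sample data from the question; the email value is a placeholder (assumption).
sessions = pd.DataFrame({
    "email": ["jd@example.com", "jd@example.com"],
    "user_name": ["JD", "JD"],
    "sessions": [1, 2],
    "ymo": pd.to_datetime(["2015-03-01", "2015-05-01"]),
})

# Start from the full list of month-start dates...
full = pd.DataFrame({"ymo": pd.date_range("2015-01-01", periods=12, freq="MS")})

# ...then attach the session counts where the dates match.
filled = full.merge(sessions[["ymo", "sessions"]], on="ymo", how="left")
filled["sessions"] = filled["sessions"].fillna(0).astype(int)

# Single user: email and user_name are constants, so add them back once.
filled["email"] = sessions["email"].iloc[0]
filled["user_name"] = sessions["user_name"].iloc[0]
filled = filled[["email", "user_name", "sessions", "ymo"]]
print(filled)
```

The left merge keeps all twelve months and leaves NaN where no session row exists, which is why the fillna(0) step follows.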

Answers

2

I tried to create a more general solution using periods:

print (df) 
    email user_name sessions  ymo 
0 [email protected]  JD   1 2015-03-01 
1 [email protected]  JD   2 2015-05-01 
2 [email protected]  AB   1 2015-03-01 
3 [email protected]  AB   2 2015-05-01 


mbeg = pd.period_range('2015-01', periods=12, freq='M') 
print (mbeg) 
PeriodIndex(['2015-01', '2015-02', '2015-03', '2015-04', '2015-05', '2015-06', 
      '2015-07', '2015-08', '2015-09', '2015-10', '2015-11', '2015-12'], 
      dtype='int64', freq='M') 
# convert column ymo to monthly periods
df['ymo'] = df['ymo'].dt.to_period('M')
# per user: reindex onto all 12 months, filling missing rows with 0
# (the chain is wrapped in parentheses so it can span several lines)
df = (df.groupby(['email','user_name'])
        .apply(lambda x: x.set_index('ymo')
                          .reindex(mbeg, fill_value=0)
                          .drop(['email','user_name'], axis=1))
        .rename_axis(['email','user_name','ymo'])
        .reset_index())
print (df) 

     email user_name  ymo sessions 
0 [email protected]  JD 2015-01   0 
1 [email protected]  JD 2015-02   0 
2 [email protected]  JD 2015-03   1 
3 [email protected]  JD 2015-04   0 
4 [email protected]  JD 2015-05   2 
5 [email protected]  JD 2015-06   0 
6 [email protected]  JD 2015-07   0 
7 [email protected]  JD 2015-08   0 
8 [email protected]  JD 2015-09   0 
9 [email protected]  JD 2015-10   0 
10 [email protected]  JD 2015-11   0 
11 [email protected]  JD 2015-12   0 
12 [email protected]  AB 2015-01   0 
13 [email protected]  AB 2015-02   0 
14 [email protected]  AB 2015-03   1 
15 [email protected]  AB 2015-04   0 
16 [email protected]  AB 2015-05   2 
17 [email protected]  AB 2015-06   0 
18 [email protected]  AB 2015-07   0 
19 [email protected]  AB 2015-08   0 
20 [email protected]  AB 2015-09   0 
21 [email protected]  AB 2015-10   0 
22 [email protected]  AB 2015-11   0 
23 [email protected]  AB 2015-12   0 

Finally, if you need datetimes, use to_timestamp:

df.ymo = df.ymo.dt.to_timestamp() 
print (df) 
     email user_name  ymo sessions 
0 [email protected]  JD 2015-01-01   0 
1 [email protected]  JD 2015-02-01   0 
2 [email protected]  JD 2015-03-01   1 
3 [email protected]  JD 2015-04-01   0 
4 [email protected]  JD 2015-05-01   2 
5 [email protected]  JD 2015-06-01   0 
6 [email protected]  JD 2015-07-01   0 
7 [email protected]  JD 2015-08-01   0 
8 [email protected]  JD 2015-09-01   0 
9 [email protected]  JD 2015-10-01   0 
10 [email protected]  JD 2015-11-01   0 
11 [email protected]  JD 2015-12-01   0 
12 [email protected]  AB 2015-01-01   0 
13 [email protected]  AB 2015-02-01   0 
14 [email protected]  AB 2015-03-01   1 
15 [email protected]  AB 2015-04-01   0 
16 [email protected]  AB 2015-05-01   2 
17 [email protected]  AB 2015-06-01   0 
18 [email protected]  AB 2015-07-01   0 
19 [email protected]  AB 2015-08-01   0 
20 [email protected]  AB 2015-09-01   0 
21 [email protected]  AB 2015-10-01   0 
22 [email protected]  AB 2015-11-01   0 
23 [email protected]  AB 2015-12-01   0 

Solution with datetimes:

print (df) 
    email user_name sessions  ymo 
0 [email protected]  JD   1 2015-03-01 
1 [email protected]  JD   2 2015-05-01 
2 [email protected]  AB   1 2015-03-01 
3 [email protected]  AB   2 2015-05-01 

mbeg = pd.date_range('2015-01-31', periods=12, freq='M') - pd.offsets.MonthBegin() 

# per user: reindex onto the month-start dates, filling missing rows with 0
df = (df.groupby(['email','user_name'])
        .apply(lambda x: x.set_index('ymo')
                          .reindex(mbeg, fill_value=0)
                          .drop(['email','user_name'], axis=1))
        .rename_axis(['email','user_name','ymo'])
        .reset_index())
print (df) 
     email user_name  ymo sessions 
0 [email protected]  JD 2015-01-01   0 
1 [email protected]  JD 2015-02-01   0 
2 [email protected]  JD 2015-03-01   1 
3 [email protected]  JD 2015-04-01   0 
4 [email protected]  JD 2015-05-01   2 
5 [email protected]  JD 2015-06-01   0 
6 [email protected]  JD 2015-07-01   0 
7 [email protected]  JD 2015-08-01   0 
8 [email protected]  JD 2015-09-01   0 
9 [email protected]  JD 2015-10-01   0 
10 [email protected]  JD 2015-11-01   0 
11 [email protected]  JD 2015-12-01   0 
12 [email protected]  AB 2015-01-01   0 
13 [email protected]  AB 2015-02-01   0 
14 [email protected]  AB 2015-03-01   1 
15 [email protected]  AB 2015-04-01   0 
16 [email protected]  AB 2015-05-01   2 
17 [email protected]  AB 2015-06-01   0 
18 [email protected]  AB 2015-07-01   0 
19 [email protected]  AB 2015-08-01   0 
20 [email protected]  AB 2015-09-01   0 
21 [email protected]  AB 2015-10-01   0 
22 [email protected]  AB 2015-11-01   0 
23 [email protected]  AB 2015-12-01   0 
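
As a side note, the month-start index that the code above builds by subtracting pd.offsets.MonthBegin() from month-end dates can also be produced directly with the "MS" (month start) frequency alias. As far as I know, recent pandas releases warn about the "M" alias in date_range, so the direct form may be the safer choice there. A minimal sketch:

```python
import pandas as pd

# freq="MS" ("month start") yields the first day of each month directly,
# so no offset arithmetic is needed.
mbeg = pd.date_range("2015-01-01", periods=12, freq="MS")
print(mbeg[0], mbeg[-1])
```

The result can be passed to reindex exactly like the mbeg defined above.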
2
  • generate month-start dates and reindex
  • ffill and bfill the columns ['email', 'user_name']
  • fillna(0) for the column 'sessions'

mbeg = pd.date_range('2015-01-31', periods=12, freq='M') - pd.offsets.MonthBegin() 

df1 = df.set_index('ymo').reindex(mbeg) 

df1[['email', 'user_name']] = df1[['email', 'user_name']].ffill().bfill() 
df1['sessions'] = df1['sessions'].fillna(0).astype(int) 

df1 


+0

Unfortunately, this does not work when there are rows for another user with the same date: ValueError: cannot reindex from a duplicate axis. – LetMeSOThat4U
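
One way around the duplicate-axis error, sketched here as an alternative rather than as the answerer's method, is to make the index unique by including the user in it and to reindex against every (user, month) combination. The email values are placeholders, the names `users`, `full_index` and `out` are made up for illustration, and the sketch assumes each user has at most one row per month.

```python
import pandas as pd

# Two users sharing the same dates; the email values are placeholders (assumption).
df = pd.DataFrame({
    "email": ["jd@example.com", "jd@example.com", "ab@example.com", "ab@example.com"],
    "user_name": ["JD", "JD", "AB", "AB"],
    "sessions": [1, 2, 1, 2],
    "ymo": pd.to_datetime(["2015-03-01", "2015-05-01", "2015-03-01", "2015-05-01"]),
})

mbeg = pd.date_range("2015-01-01", periods=12, freq="MS")

# One (email, user_name) pair per user, in order of first appearance.
users = df[["email", "user_name"]].drop_duplicates()

# Cross every user with every month to get the complete target index.
full_index = pd.MultiIndex.from_tuples(
    [(e, u, m) for e, u in users.itertuples(index=False) for m in mbeg],
    names=["email", "user_name", "ymo"],
)

# The (email, user_name, ymo) index is unique, so reindex no longer raises.
out = (df.set_index(["email", "user_name", "ymo"])["sessions"]
         .reindex(full_index, fill_value=0)
         .reset_index())
print(out)
```

Because the whole frame is reindexed once, this also avoids a Python-level loop or groupby.apply call per user.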