How to consistently one-hot encode dataframes with changing values?

I am receiving a stream of content in the form of dataframes, and each batch has different values in its columns. For example, one batch looks like this:

day1_data = {'state': ['MS', 'OK', 'VA', 'NJ', 'NM'], 
      'city': ['C', 'B', 'G', 'Z', 'F'], 
      'age': [27, 19, 63, 40, 93]}

and another one:

day2_data = {'state': ['AL', 'WY', 'VA'], 
      'city': ['A', 'B', 'E'], 
      'age': [42, 52, 73]}

How can the columns be one-hot encoded so that a consistent number of columns is returned?

If I use pandas get_dummies() on each batch, it returns a different number of columns:

import pandas as pd

df1 = pd.get_dummies(pd.DataFrame(day1_data))
df2 = pd.get_dummies(pd.DataFrame(day2_data))

len(df1.columns) == len(df2.columns)  # False: 11 columns vs. 7

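I can get all the possible values for each column. The question is: what is the simplest way to use that information to generate a one-hot encoding for each daily batch, so that the number of columns stays consistent?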

Source

2017-12-30 oshi2016

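Both data sources have the same columns: 'age', 'city' and 'state'. Is this always the case? If not, please provide a more realistic example with different columns. Interesting question.

Do you know all the values a given column can contain in advance? – akilat90

Why not concatenate them and then get the dummies?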
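
For illustration, the concatenation idea from the comment above could look roughly like the following. This is only a sketch: the variable names (`combined`, `enc1`, `enc2`) are made up here, and it assumes all batches are available at the same time, which may not hold for a true stream.

```python
import pandas as pd

day1 = pd.DataFrame({'state': ['MS', 'OK', 'VA', 'NJ', 'NM'],
                     'city': ['C', 'B', 'G', 'Z', 'F'],
                     'age': [27, 19, 63, 40, 93]})
day2 = pd.DataFrame({'state': ['AL', 'WY', 'VA'],
                     'city': ['A', 'B', 'E'],
                     'age': [42, 52, 73]})

# Stack the batches, tagging each row with the batch it came from.
combined = pd.concat([day1, day2], keys=['day1', 'day2'])

# Encode once, so every batch gets the same set of dummy columns.
encoded = pd.get_dummies(combined)

# Split back into per-batch frames using the outer index level.
enc1 = encoded.loc['day1']
enc2 = encoded.loc['day2']

print(list(enc1.columns) == list(enc2.columns))
```

The columns only cover values seen in the concatenated batches, so a value that first appears on a later day would still change the column count.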

Since all the possible values are known in advance, here is a slightly hacky way to do it.

import numpy as np
import pandas as pd

# This is a one-time process.
# Keep all the possible values here in lists.
# Other categorical variables with this kind of data can be added too.
all_possible_states = ['AL', 'MS', 'OK', 'VA', 'NJ', 'NM', 'CD', 'WY']
all_possible_cities = ['A', 'B', 'C', 'D', 'E', 'G', 'Z', 'F']

# Declare our transformer class
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

class MyOneHotEncoder(BaseEstimator, TransformerMixin):

    def __init__(self, all_possible_values):
        # Stored under the parameter's name so BaseEstimator.get_params() works
        self.all_possible_values = all_possible_values
        self.le = LabelEncoder()
        self.ohe = OneHotEncoder()
        # Both encoders are fitted here, on the full list of possible values
        self.ohe.fit(self.le.fit_transform(all_possible_values).reshape(-1, 1))

    def fit(self, X, y=None):
        # Nothing to learn from a batch; fitting already happened in __init__
        return self

    def transform(self, X, y=None):
        # Labels -> integer codes -> dense one-hot matrix
        return self.ohe.transform(self.le.transform(X).reshape(-1, 1)).toarray()

# Allow the transformers to see all the possible values here
encoders = {}
encoders['state'] = MyOneHotEncoder(all_possible_states)
encoders['city'] = MyOneHotEncoder(all_possible_cities)
# Do this for all categorical columns

# This is the function that will be used on the incoming data
def encode(df):

    tup = (encoders['state'].transform(df['state']),
           encoders['city'].transform(df['city']),
           # Add all other columns which are not to be transformed
           df[['age']])

    return np.hstack(tup)

# Testing: 
day1_data = pd.DataFrame({'state': ['MS', 'OK', 'VA', 'NJ', 'NM'], 
     'city': ['C', 'B', 'G', 'Z', 'F'], 
     'age': [27, 19, 63, 40, 93]}) 

print(encode(day1_data)) 
[[ 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 
    0. 0. 27.] 
[ 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 
    0. 0. 19.] 
[ 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 
    1. 0. 63.] 
[ 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 
    0. 1. 40.] 
[ 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 
    0. 0. 93.]] 


day2_data = pd.DataFrame({'state': ['AL', 'WY', 'VA'], 
      'city': ['A', 'B', 'E'], 
      'age': [42, 52, 73]}) 

print(encode(day2_data)) 
[[ 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 
    0. 0. 42.] 
[ 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 
    0. 0. 52.] 
[ 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 
    0. 0. 73.]]
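
A more compact variant of the same idea is sketched below. It assumes a scikit-learn version whose OneHotEncoder accepts string columns and the `categories` parameter; the names `ohe` and `encode_fixed` are illustrative only and not part of the code above. Passing `handle_unknown='ignore'` is an additional assumption about the desired behaviour: a value outside the known lists is encoded as all zeros instead of raising an error.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

all_possible_states = ['AL', 'CD', 'MS', 'NJ', 'NM', 'OK', 'VA', 'WY']
all_possible_cities = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'Z']

# One encoder for both columns; the category lists fix the output layout.
ohe = OneHotEncoder(categories=[all_possible_states, all_possible_cities],
                    handle_unknown='ignore')

# fit() must still be called once; a frame built from the lists is enough.
ohe.fit(pd.DataFrame({'state': all_possible_states,
                      'city': all_possible_cities}))

def encode_fixed(df):
    # One-hot part (8 state columns + 8 city columns), then the raw age.
    onehot = ohe.transform(df[['state', 'city']]).toarray()
    return np.hstack([onehot, df[['age']].to_numpy()])

day1 = pd.DataFrame({'state': ['MS', 'OK', 'VA', 'NJ', 'NM'],
                     'city': ['C', 'B', 'G', 'Z', 'F'],
                     'age': [27, 19, 63, 40, 93]})
day2 = pd.DataFrame({'state': ['AL', 'WY', 'VA'],
                     'city': ['A', 'B', 'E'],
                     'age': [42, 52, 73]})

print(encode_fixed(day1).shape, encode_fixed(day2).shape)
```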

Feel free to ask in the comments if anything is unclear.
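
If staying within pandas is preferred, a possible alternative is to cast each categorical column to a fixed `pd.CategoricalDtype` before calling `get_dummies`, which then emits one column per declared category whether or not it occurs in the batch. This is a minimal sketch under the assumption that the full value lists are known; the `categories` dict and `encode_batch` function are hypothetical names.

```python
import pandas as pd

# All possible values per categorical column (assumed known up front).
categories = {
    'state': ['AL', 'CD', 'MS', 'NJ', 'NM', 'OK', 'VA', 'WY'],
    'city': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'Z'],
}

def encode_batch(df):
    df = df.copy()
    for col, values in categories.items():
        # A fixed dtype makes get_dummies emit every declared category.
        df[col] = df[col].astype(pd.CategoricalDtype(categories=values))
    return pd.get_dummies(df)

day1 = pd.DataFrame({'state': ['MS', 'OK', 'VA', 'NJ', 'NM'],
                     'city': ['C', 'B', 'G', 'Z', 'F'],
                     'age': [27, 19, 63, 40, 93]})
day2 = pd.DataFrame({'state': ['AL', 'WY', 'VA'],
                     'city': ['A', 'B', 'E'],
                     'age': [42, 52, 73]})

enc1 = encode_batch(day1)
enc2 = encode_batch(day2)
print(list(enc1.columns) == list(enc2.columns))
```

Values that are not in the declared lists become missing after the cast and end up as an all-zero row for that column, so it may be worth checking for them explicitly.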

Source

2018-01-04 12:49:04
