Merging CountVectorizer in Scikit-Learn feature extraction

I am new to scikit-learn and need some help with something I have been working on.

I am trying to classify two types of documents (say, type A and type B) using multinomial Naive Bayes classification. To get the term counts for these documents, I am using the CountVectorizer class in sklearn.feature_extraction.text.

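The problem is that the two types of documents require different regular expressions to extract tokens (the token_pattern parameter of CountVectorizer). I can't seem to find a way to first load the type A training documents and then the type B ones. Is it possible to do something like: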

vecA = CountVectorizer(token_pattern="[a-zA-Z]+", ...) 
vecA.fit(list_of_type_A_document_content) 
... 
vecB = CountVectorizer(token_pattern="[a-zA-Z0-9]+", ...) 
vecB.fit(list_of_type_B_document_content) 
... 
# Somehow merge the two vectorizers results and get the final sparse matrix
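
To make the difference between the two patterns concrete, here is a minimal sketch that runs both analyzers on the same string. The sample text is made up for illustration and is not taken from the documents in question; note that CountVectorizer lowercases its input by default.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample text, not from the original documents
sample = "Order A17 shipped to depot 42"

# build_analyzer() returns the callable that lowercases and tokenizes text
analyzer_a = CountVectorizer(token_pattern=r"[a-zA-Z]+").build_analyzer()
analyzer_b = CountVectorizer(token_pattern=r"[a-zA-Z0-9]+").build_analyzer()

tokens_a = analyzer_a(sample)  # letters only, so "a17" is cut down to "a"
tokens_b = analyzer_b(sample)  # letters and digits, so "a17" and "42" survive

print(tokens_a)
print(tokens_b)
```

The letters-only pattern drops digits entirely, so the two vectorizers end up with different vocabularies even on identical text.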

Source

2016-05-06 Archit Shukla

You can try:

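from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

vecA = CountVectorizer(token_pattern="[a-zA-Z]+", ...)
vecA.fit_transform(list_of_type_A_document_content)
vecB = CountVectorizer(token_pattern="[a-zA-Z0-9]+", ...)
vecB.fit_transform(list_of_type_B_document_content)
# Concatenate the feature columns produced by the two fitted vectorizers
combined_features = FeatureUnion([('CountVectorizer', vecA), ('CountVect', vecB)])
combined_features.transform(test_data)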

You can read more about FeatureUnion at http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html. It is available from version 0.13.1.
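
For a fully runnable variant, here is a minimal sketch that feeds the combined features into MultinomialNB. The toy documents, the labels and the step names "letters" and "alnum" are made up for illustration. Note that it differs from the snippet above in one respect: calling fit_transform on the union fits both vectorizers on the whole training set, rather than fitting each one on its own document type.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion

# Hypothetical toy corpora standing in for the two document types
docs_a = ["alpha beta gamma", "beta gamma delta"]
docs_b = ["id42 code7 alpha", "code7 id99 beta"]
train_docs = docs_a + docs_b
train_labels = ["A", "A", "B", "B"]

# Each vectorizer tokenizes with its own pattern; FeatureUnion stacks
# their count matrices side by side into one sparse matrix.
union = FeatureUnion([
    ("letters", CountVectorizer(token_pattern=r"[a-zA-Z]+")),
    ("alnum", CountVectorizer(token_pattern=r"[a-zA-Z0-9]+")),
])

X_train = union.fit_transform(train_docs)
clf = MultinomialNB().fit(X_train, train_labels)

# Test documents must go through the same fitted union
X_test = union.transform(["gamma delta alpha", "id42 code7"])
predictions = clf.predict(X_test)

print(X_train.shape)
print(predictions)
```

The combined matrix has one column per vocabulary entry of each vectorizer, so a token matched by both patterns is counted twice. Whether that double counting is acceptable for Naive Bayes depends on the data.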

Source

2016-11-29 02:44:19
