How do I use a custom text data format with sklearn's CountVectorizer()?

There is a nice introduction on how to use sklearn for text analytics.

However, the tutorial above uses sklearn's datasets as "Bunch" objects, which are not specified in detail, so I am having trouble getting my own data into the format that the sklearn methods expect. I want to use CountVectorizer() on my text data for further processing, but calling CountVectorizer.fit_transform(my_string_array) always throws an error:

AttributeError: 'list' object has no attribute 'lower'

So far I have tried initializing the following numpy array types and loading my strings into them, but none of them worked:

np.chararray(shape)
np.empty(shape, dtype=str/obj)


2017-02-22 ben0it8

A simple example:

from sklearn.feature_extraction.text import CountVectorizer 

docs = ['This is the first document', 'This is the second document'] 
count_vect = CountVectorizer() 
X_train_counts = count_vect.fit_transform(docs)

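docs needs to be a collection of strings, i.e. a list, numpy array, etc.; CountVectorizer requires a sequence or list of strings to process, as mentioned here. If the text is already tokenized, you need to tell CountVectorizer that it doesn't need to split the strings: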

from sklearn.feature_extraction.text import CountVectorizer 

docs = [['This', 'is', 'the', 'first', 'document'], 
     ['This', 'is', 'the', 'second', 'document']] 
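# lowercase=False is required here: otherwise lower() is called on each token list
count_vect = CountVectorizer(tokenizer=lambda text: text, lowercase=False)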
X_train_counts = count_vect.fit_transform(docs)
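
As a possible alternative for already tokenized input, here is a minimal sketch that passes a callable as analyzer, so that CountVectorizer skips its own preprocessing and tokenization. The helper name identity_analyzer is made up for this illustration; it is not part of sklearn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each document is already a list of tokens.
docs = [['This', 'is', 'the', 'first', 'document'],
        ['This', 'is', 'the', 'second', 'document']]


def identity_analyzer(tokens):
    """Hypothetical helper: return the tokens unchanged."""
    return tokens


# A callable analyzer receives each raw document and returns its features,
# so no lowercasing or splitting is applied to the token lists.
count_vect = CountVectorizer(analyzer=identity_analyzer)
X_train_counts = count_vect.fit_transform(docs)

print(sorted(count_vect.vocabulary_))
# ['This', 'document', 'first', 'is', 'second', 'the']
```

Note that the tokens keep their original case here, because nothing lowercases them.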


2017-02-22 15:54:38 elyase

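The problem is that my data is an array of arrays of strings. – ben0it8

@ben0it8, OK, why is your data 2D rather than 1D? Is it [['string1', 'string2', ....], ['string1', 'string2', ....]] (a collection of documents)? Have you already tokenized the documents? – elyase

Yes, each element of my array corresponds to the strings of one document. – ben0it8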

The documentation for the input parameter says:

input : string {‘filename’, ‘file’, ‘content’}

If ‘filename’, the sequence passed as an argument to fit is expected to be a list of filenames that need reading to fetch the raw content to analyze.

If ‘file’, the sequence items must have a ‘read’ method (file-like object) that is called to fetch the bytes in memory.

Otherwise the input is expected to be the sequence strings or bytes items are expected to be analyzed directly.

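You are supplying [['string1', 'string2', ....], ['string1', 'string2', ....]]. The outer array is a sequence, so that requirement is satisfied.

Next, CountVectorizer() iterates over the elements of the list you passed.

For each element it expects an object of type string and calls lower() on it to make the string lowercase. Instead it gets ['string1', 'string2', ....], which is a list and obviously has no lower() method. Hence the error.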
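
The mechanism can be reproduced without sklearn. The sketch below is only a simplified imitation of the lowercasing step, not the library's actual code:

```python
# Simplified imitation of the lowercasing step; NOT scikit-learn's real code.
docs = [['string1', 'string2'], ['string1', 'string2']]

errors = []
for doc in docs:
    try:
        doc.lower()  # fine for a str, fails for a list
    except AttributeError as exc:
        errors.append(str(exc))

print(errors[0])
# 'list' object has no attribute 'lower'
```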

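Solution: in my opinion, using a single list of strings instead of a list of lists of strings will not change the results you get from CountVectorizer().

Make each inner list of strings (one per document) into a single string by running the following, where data is the list of lists of strings you are using: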

data = [" ".join(x) for x in data]

Assuming your data is:

data = [['yo', 'dude'],['how','are', 'you']] 
data = [" ".join(x) for x in data]

Output:

['yo dude', 'how are you']

This can now be passed to CountVectorizer without any errors.
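
Put together, a minimal end-to-end sketch could look like this. The variable names docs and X are chosen for this illustration only:

```python
from sklearn.feature_extraction.text import CountVectorizer

# One inner list of tokens per document.
data = [['yo', 'dude'], ['how', 'are', 'you']]

# Join each token list into a single space-separated string.
docs = [" ".join(tokens) for tokens in data]

count_vect = CountVectorizer()
X = count_vect.fit_transform(docs)

print(X.shape)
# (2, 5)
print(sorted(count_vect.vocabulary_))
# ['are', 'dude', 'how', 'yo', 'you']
```

Keep in mind that the joined strings are tokenized again with the default settings, which lowercase the text and ignore single-character tokens, so the result can differ from the original token lists in those cases.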


2017-02-23 02:45:06
