How can I add word frequency to the nltk naive Bayes classifier?

I am currently studying the naive Bayes classifier using nltk.

In the documentation (http://www.nltk.org/book/ch06.html), section 1.3 Document Classification, there is an example of featuresets.

featuresets = [(document_features(d), c) for (d,c) in documents] 
train_set, test_set = featuresets[100:], featuresets[:100] 
classifier = nltk.NaiveBayesClassifier.train(train_set) 

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

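So an example of the form of featuresets is [({'contains(waste)': False, 'contains(lot)': False, ...}, 'neg'), ...].

However, I want to change the feature dictionary from 'contains(waste)': False to 'contains(waste)': 2. I think the form 'contains(waste)': 2 describes a document better, because it records how often the word occurs. The featuresets would then look like [({'contains(waste)': 2, 'contains(lot)': 5, ...}, 'neg'), ...].

But I am worried that 'contains(waste)': 2 and 'contains(waste)': 1 are completely different words to NaiveBayesClassifier. If so, it cannot capture the similarity between 'contains(waste)': 2 and 'contains(waste)': 1.

{'contains(lot)': 1 and 'contains(waste)': 1} and {'contains(waste)': 2 and 'contains(waste)': 1} could be the same to the program.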
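The count-valued feature form described above could be built roughly as follows. This is only a sketch: `document_count_features` and the three-word `word_features` list are hypothetical stand-ins for the 2000-word vocabulary in the book example, and whether the classifier makes good use of such values is a separate question.

```python
from collections import Counter

# Hypothetical stand-in for the 2000-word vocabulary from the book example.
word_features = ["waste", "lot", "plot"]

def document_count_features(document, vocabulary):
    """Map each vocabulary word to the number of times it occurs in the document."""
    counts = Counter(w.lower() for w in document)
    # Counter returns 0 for words that never occur.
    return {"contains({})".format(word): counts[word] for word in vocabulary}

doc = ["What", "a", "waste", "a", "total", "waste", "of", "a", "lot"]
features = document_count_features(doc, word_features)
print(features)
# {'contains(waste)': 2, 'contains(lot)': 1, 'contains(plot)': 0}
```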

Can nltk.NaiveBayesClassifier understand word frequency?

This is the code I use:

import random

import konlpy.tag
from nltk.classify import naivebayes


def split_and_count_word(data):
    # belongs_to : Main
    # Role : make featuresets from Korean words using konlpy.
    # Parameter : dictionary data (dict of contents, ex. {'politic': {'parliament': [content, content]}, ..})
    # Return : list featuresets ([({'word': True, ...}, 'politic'), ...] == featureset + category)
    featuresets = []
    twitter = konlpy.tag.Twitter()  # Korean word splitter
    for big_cat in data:
        for small_cat in data[big_cat]:
            # save the category name needed in featuresets
            category = str(big_cat[0:3]) + '/' + str(small_cat)
            count = 0
            print(small_cat)
            for one_news in data[big_cat][small_cat]:
                count += 1
                if count % 100 == 0:
                    print(count, end=' ')
                # one_news is a list in a list, so open it!
                doc = one_news
                # split into words using konlpy
                list_of_splited_word = twitter.morphs(doc[:-63])  # delete useless sentences.
                # keep only the split words that are at least two characters long
                list_of_up_two_word = [word for word in list_of_splited_word if len(word) > 1]
                dict_of_featuresets = make_featuresets(list_of_up_two_word)
                # save
                featuresets.append((dict_of_featuresets, category))
    return featuresets


def make_featuresets(data):
    # belongs_to : split_and_count_word
    # Role : make featuresets
    # Parameter : list list_of_up_two_word (ex. ['비누', '떨어', '지다'])
    # Return : dictionary {word : True for word in data}
    # PROBLEM :(
    # cannot consider the frequency of a word
    return {word: True for word in data}


def naive_train(featuresets):
    # belongs_to : Main
    # Role : learning by naive Bayes rule
    # Parameter : list featuresets ([({'word': True, ...}, 'pol/pal'), ...])
    # Return : object classifier (nltk NaiveBayesClassifier object),
    #          list test_set (the featuresets that are randomly selected)
    random.shuffle(featuresets)
    train_set, test_set = featuresets[1000:], featuresets[:1000]
    classifier = naivebayes.NaiveBayesClassifier.train(train_set)
    return classifier, test_set


featuresets = split_and_count_word(data)
classifier, test_set = naive_train(featuresets)

Source

2016-10-20 dizwe

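NLTK's naive Bayes classifier treats feature values as logically distinct. The values are not limited to True and False, but they are never treated as quantities. If a feature takes the values f=2 and f=3, those are counted as separate values. The only way to add quantities to such a model is to sort them into buckets such as f=1, f="few" (2-5), f="several" (6-10), f="many" (11 or more). (Note: if you go this route, there are algorithms for choosing the value ranges of the buckets.) Even then, the model does not "know" that "few" lies between "one" and "several". To handle quantities directly, you need a different machine learning tool.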
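The bucketing idea can be sketched like this, using the example ranges from the answer. The helper names `bucket` and `bucketed_features`, the `count(...)` key format, and the separate value for a count of 0 are assumptions made for illustration; the ranges themselves are only the answer's examples, not tuned values.

```python
from collections import Counter

def bucket(count):
    """Map a raw count to a nominal value, using the example ranges above."""
    if count == 0:
        return 0           # assumption: the answer names no bucket for absent words
    if count == 1:
        return 1
    if count <= 5:
        return "few"       # 2-5
    if count <= 10:
        return "several"   # 6-10
    return "many"          # 11 or more

def bucketed_features(words, vocabulary):
    """Build a feature dict whose values are bucket labels instead of raw counts."""
    counts = Counter(words)
    return {"count({})".format(w): bucket(counts[w]) for w in vocabulary}

words = ["waste"] * 3 + ["lot"] * 12 + ["plot"]
features = bucketed_features(words, ["waste", "lot", "plot", "good"])
print(features)
# {'count(waste)': 'few', 'count(lot)': 'many', 'count(plot)': 1, 'count(good)': 0}
```

A dict like this can be paired with a label and used in the training list in the same way as the True/False featuresets in the question. The classifier still sees "few" and "several" as unrelated values.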

Source

2016-11-13 20:21:43 alexis

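Thank you for giving me an idea. So does that mean I cannot add a word that is already in the feature dictionary? For example, the dictionary would be {"hello": True, "hello": True, "my": True ...}. Also, could you recommend another useful machine learning module? – dizwe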

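As you already pointed out in your comment to @aberger, you cannot have the same key twice in a dict. I cannot point you directly to a solution that handles quantities. nltk's ['MaxentClassifier'](http://www.nltk.org/api/nltk.classify.html#nltk.classify.maxent.MaxentClassifier) does use numeric weights, but through the API it normally works with the "nominal" features you provide, so you would have to find the right way to use it. Take a look at scikit-learn too. The best classifier depends on the task, so try a few! – alexis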
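The point about duplicate keys is easy to check in plain Python. The snippet below is a small illustration, not part of the original exchange: a repeated key in a dict literal leaves a single entry, while `collections.Counter` keeps one key per word and stores how often it occurred.

```python
from collections import Counter

# A dict literal with a repeated key keeps only one entry for that key.
features = {"hello": True, "hello": True, "my": True}
print(len(features))   # 2

# Counter keeps one key per word and records the number of occurrences.
words = ["hello", "hello", "my"]
counts = Counter(words)
print(counts["hello"])  # 2
print(counts["my"])     # 1
```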

Thanks, I will try it! – dizwe
