How to train a Naive Bayes classifier for n-grams (movie_reviews)

Below is the training code for a unigram model on the movie_reviews dataset. I would like to train bigram and trigram models as well and analyze their performance. How can I do that?

import nltk.classify.util 
from nltk.classify import NaiveBayesClassifier 
from nltk.corpus import movie_reviews 
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 

def create_word_features(words): 
    useful_words = [word for word in words if word not in stopwords.words("english")] 
    my_dict = dict([(word, True) for word in useful_words]) 
    return my_dict 

pos_data = [] 
for fileid in movie_reviews.fileids('pos'): 
    words = movie_reviews.words(fileid) 
    pos_data.append((create_word_features(words), "positive"))  

neg_data = [] 
for fileid in movie_reviews.fileids('neg'): 
    words = movie_reviews.words(fileid) 
    neg_data.append((create_word_features(words), "negative")) 

train_set = pos_data[:800] + neg_data[:800] 
test_set = pos_data[800:] + neg_data[800:] 

classifier = NaiveBayesClassifier.train(train_set) 

accuracy = nltk.classify.util.accuracy(classifier, test_set)

2017-12-28 Olivia Brown

Have you seen this post: [n-grams with Naive Bayes classifier](https://stackoverflow.com/questions/14003291/n-gram-with-naive-bayes-classifier)? – Sax

Or: [N-grams with Naive Bayes classifier error](https://stackoverflow.com/questions/19209895/n-grams-with-naive-bayes-classifier-error) – Sax

Simply change your featurizer:

from nltk import ngrams 

def create_ngram_features(words, n=2): 
    ngram_vocab = ngrams(words, n) 
    my_dict = dict([(ng, True) for ng in ngram_vocab]) 
    return my_dict
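To make the shape of these features concrete, here is a minimal sketch that does not need NLTK. `ngrams_sketch` and `create_ngram_features_sketch` are hypothetical names introduced only for illustration; the first is a pure-Python stand-in that is assumed to behave like `nltk.ngrams` for plain token lists without padding.

```python
def ngrams_sketch(words, n=2):
    """Return consecutive n-token tuples (illustrative stand-in for nltk.ngrams)."""
    words = list(words)
    # Zip the token list against copies of itself shifted by 1 .. n-1 positions.
    return zip(*(words[i:] for i in range(n)))


def create_ngram_features_sketch(words, n=2):
    # Same shape as create_ngram_features above: each n-gram tuple maps to True.
    return {ng: True for ng in ngrams_sketch(words, n)}


tokens = ["a", "very", "good", "movie"]
bigram_features = create_ngram_features_sketch(tokens, n=2)
print(bigram_features)
# {('a', 'very'): True, ('very', 'good'): True, ('good', 'movie'): True}
```

Note that the feature keys are tuples, even for n=1, whereas the unigram featurizer in the question uses plain strings as keys.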

By the way, your code will be a lot faster if you change your featurizer to use a set for your stopword list and initialize it only once.

stoplist = set(stopwords.words("english")) 

def create_word_features(words): 
    useful_words = [word for word in words if word not in stoplist] 
    my_dict = dict([(word, True) for word in useful_words]) 
    return my_dict
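The speedup has two sources: the original list comprehension re-evaluates `stopwords.words("english")` for every word, and membership tests on a list scan it element by element while a set uses hashing. The sketch below measures only the second effect, using synthetic data so that it runs without NLTK. The stoplist size, the review length, and the name `filter_with` are assumptions made for illustration.

```python
import timeit

# Hypothetical stand-ins: a 150-word stoplist and a 700-token review.
stop_list = ["stop%d" % i for i in range(150)]
stop_set = set(stop_list)
review = ["token%d" % i for i in range(650)] + stop_list[:50]


def filter_with(stop_container):
    # Keep only the tokens that are not in the stopword container.
    return [w for w in review if w not in stop_container]


# Both containers must give the same filtered tokens; only the speed differs.
assert filter_with(stop_list) == filter_with(stop_set)

list_time = timeit.timeit(lambda: filter_with(stop_list), number=200)
set_time = timeit.timeit(lambda: filter_with(stop_set), number=200)
print("list: %.4fs  set: %.4fs" % (list_time, set_time))
```

The absolute timings depend on the machine, so none are quoted here.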

Someone should really tell the NLTK people to convert the stopword list to a set type, since it is "technically" a unique list (i.e. a set):

>>> from nltk.corpus import stopwords
>>> type(stopwords.words('english'))
<class 'list'>
>>> type(set(stopwords.words('english')))
<class 'set'>

For the fun of benchmarking:

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import ngrams

def create_ngram_features(words, n=2):
    ngram_vocab = ngrams(words, n)
    my_dict = dict([(ng, True) for ng in ngram_vocab])
    return my_dict

for n in [1,2,3,4,5]:
    pos_data = []
    for fileid in movie_reviews.fileids('pos'):
        words = movie_reviews.words(fileid)
        pos_data.append((create_ngram_features(words, n), "positive"))

    neg_data = []
    for fileid in movie_reviews.fileids('neg'):
        words = movie_reviews.words(fileid)
        neg_data.append((create_ngram_features(words, n), "negative"))

    train_set = pos_data[:800] + neg_data[:800]
    test_set = pos_data[800:] + neg_data[800:]

    classifier = NaiveBayesClassifier.train(train_set)

    accuracy = nltk.classify.util.accuracy(classifier, test_set)
    print(str(n)+'-gram accuracy:', accuracy)

[out]:

1-gram accuracy: 0.735 
2-gram accuracy: 0.7625 
3-gram accuracy: 0.8275 
4-gram accuracy: 0.8125 
5-gram accuracy: 0.74
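For reading these numbers: the accuracy reported here is the fraction of test items whose predicted label matches the gold label. The sketch below reproduces that calculation on toy data without NLTK. `accuracy_sketch`, `toy_classify`, and `toy_test_set` are hypothetical names for illustration only, and the toy classifier is a hand-written rule, not a trained Naive Bayes model.

```python
def accuracy_sketch(classify, labeled_featuresets):
    """Fraction of (features, label) pairs for which classify(features) == label."""
    correct = sum(
        1 for features, label in labeled_featuresets if classify(features) == label
    )
    return correct / len(labeled_featuresets)


def toy_classify(features):
    # Hypothetical rule: predict "positive" whenever the unigram ("good",) occurs.
    return "positive" if ("good",) in features else "negative"


toy_test_set = [
    ({("good",): True}, "positive"),
    ({("bad",): True}, "negative"),
    ({("good",): True, ("not",): True}, "negative"),  # misclassified by the rule
    ({("dull",): True}, "negative"),
]

toy_accuracy = accuracy_sketch(toy_classify, toy_test_set)
print(toy_accuracy)  # 0.75 (3 of 4 correct)
```

The third toy item shows why a single accuracy figure can hide systematic errors such as negation, so inspecting misclassified reviews is a useful complement when comparing n-gram orders.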

Your original code returns an accuracy of 0.725.

Use more orders of ngrams:

import nltk.classify.util 
from nltk.classify import NaiveBayesClassifier 
from nltk.corpus import movie_reviews 
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
from nltk import everygrams 

def create_ngram_features(words, n=2): 
    ngram_vocab = everygrams(words, 1, n) 
    my_dict = dict([(ng, True) for ng in ngram_vocab]) 
    return my_dict 

for n in range(1,6): 
    pos_data = [] 
    for fileid in movie_reviews.fileids('pos'):
        words = movie_reviews.words(fileid)
        pos_data.append((create_ngram_features(words, n), "positive"))

    neg_data = []
    for fileid in movie_reviews.fileids('neg'):
        words = movie_reviews.words(fileid)
        neg_data.append((create_ngram_features(words, n), "negative"))

    train_set = pos_data[:800] + neg_data[:800] 
    test_set = pos_data[800:] + neg_data[800:] 
    classifier = NaiveBayesClassifier.train(train_set) 

    accuracy = nltk.classify.util.accuracy(classifier, test_set) 
    print('1-gram to', str(n)+'-gram accuracy:', accuracy)

[out]:

1-gram to 1-gram accuracy: 0.735 
1-gram to 2-gram accuracy: 0.7625 
1-gram to 3-gram accuracy: 0.7875 
1-gram to 4-gram accuracy: 0.8 
1-gram to 5-gram accuracy: 0.82
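The difference from the previous benchmark is that each feature set now contains every n-gram from order 1 up to n, not just the n-grams of a single order. A minimal pure-Python sketch of that idea follows. `everygrams_sketch` is a hypothetical stand-in written for illustration; it is assumed to produce the same set of n-grams as `nltk.everygrams(words, 1, n)` for a plain token list, although the iteration order may differ.

```python
def everygrams_sketch(words, min_len=1, max_len=2):
    """Yield all n-grams with min_len <= n <= max_len as tuples."""
    words = list(words)
    for n in range(min_len, max_len + 1):
        # Slide a window of width n over the tokens.
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])


tokens = ["not", "very", "good"]
features = {ng: True for ng in everygrams_sketch(tokens, 1, 2)}

# Print the features grouped by n-gram order for readability.
for ng in sorted(features, key=lambda ng: (len(ng), ng)):
    print(ng)
```

Because the features are stored as dictionary keys, the order in which the n-grams are generated does not affect the classifier input.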


2017-12-28 08:53:35 alvas

I don't think there is any use for the set() function here, `stoplist = set(stopwords.words("english"))`; `stopwords.words("english")` is already a set. –

`stopwords.words("english")` is a unique list, so it is technically a "set", but its native Python type is a list. Casting it to a set and initializing it only once speeds up the code =) – alvas

Ohh, thanks –
