2017-09-27 14 views
0

私は次のようにword2vecモデルを構築しています。なぜGensim word2vecに一文字の語彙がありますか?

from gensim.models import word2vec, Phrases 
documents = ["the mayor of new york was there", "human computer interaction and machine learning has now become a trending research area","human computer interaction is interesting","human computer interaction is a pretty interesting subject", "human computer interaction is a great and new subject", "machine learning can be useful sometimes","new york mayor was present", "I love machine learning because it is a new subject area", "human computer interaction helps people to get user friendly applications"] 
sentence_stream = [doc.split(" ") for doc in documents] 

bigram = Phrases(sentence_stream, min_count=1, delimiter=b' ') 
trigram = Phrases(bigram[sentence_stream], min_count=1, delimiter=b' ') 

for sent in sentence_stream: 
    bigrams_ = bigram[sent] 
    trigrams_ = trigram[bigram[sent]] 

    print(bigrams_) 
    print(trigrams_) 


# Set values for various parameters 
num_features = 10 # Word vector dimensionality      
min_word_count = 1 # Minimum word count       
num_workers = 4  # Number of threads to run in parallel 
context = 5   # Context window size                      
downsampling = 1e-3 # Downsample setting for frequent words 


model = word2vec.Word2Vec(trigrams_, workers=num_workers, \ 
      size=num_features, min_count = min_word_count, \ 
      window = context, sample = downsampling) 

vocab = list(model.wv.vocab.keys()) 
print(vocab[:10]) 

しかし、私がモデルの語彙のために得る出力は、次のように1文字です。

['h', 'u', 'm', 'a', 'n', ' ', 'c', 'o', 'p', 't'] 

私はbigramsとtrigramsを正しく取得しています。したがって、コードを間違えたところで私はちょうど混乱しています。何が問題なのか教えてください。

+2

'trigrams_'は内側のリストがあなたの文を表すネストされたリストでなければなりません。ここに文字列のリストがあるようです。 – Kasramvd

+0

ああ、それは解決しました:) –

答えて

0

これが私の問題を解決しました。次のように、リストのリストをword2vecモデルに渡す必要があります。

trigram_sentences_project = [] 


bigram = Phrases(sentence_stream, min_count=1, delimiter=b' ') 
trigram = Phrases(bigram[sentence_stream], min_count=1, delimiter=b' ') 


for sent in sentence_stream: 
    #bigrams_ = [b for b in bigram[sent] if b.count(' ') == 1] 
    #trigrams_ = [t for t in trigram[bigram[sent]] if t.count(' ') == 2] 
    bigrams_ = bigram[sent] 
    trigrams_ = trigram[bigram[sent]] 
    trigram_sentences_project.append(trigrams_) 
関連する問題