Why does my Gensim word2vec model have a vocabulary of single characters?
I am building a word2vec model as follows.
from gensim.models import word2vec, Phrases
documents = ["the mayor of new york was there", "human computer interaction and machine learning has now become a trending research area","human computer interaction is interesting","human computer interaction is a pretty interesting subject", "human computer interaction is a great and new subject", "machine learning can be useful sometimes","new york mayor was present", "I love machine learning because it is a new subject area", "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]
bigram = Phrases(sentence_stream, min_count=1, delimiter=b' ')
trigram = Phrases(bigram[sentence_stream], min_count=1, delimiter=b' ')
for sent in sentence_stream:
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigram[sent]]
    print(bigrams_)
    print(trigrams_)
# Set values for various parameters
num_features = 10 # Word vector dimensionality
min_word_count = 1 # Minimum word count
num_workers = 4 # Number of threads to run in parallel
context = 5 # Context window size
downsampling = 1e-3 # Downsample setting for frequent words
model = word2vec.Word2Vec(trigrams_, workers=num_workers,
                          size=num_features, min_count=min_word_count,
                          window=context, sample=downsampling)
vocab = list(model.wv.vocab.keys())
print(vocab[:10])
However, the vocabulary I get from the model consists of single characters:
['h', 'u', 'm', 'a', 'n', ' ', 'c', 'o', 'p', 't']
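The bigrams and trigrams themselves come out correctly, so I am confused about where my code goes wrong. Can you tell me what the problem is?
'trigrams_' must be a nested list in which each inner list represents one of your sentences. Here it looks like you have a list of strings. – Kasramvd
Ah, that solved it :) –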
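To illustrate the point made in the comment, here is a minimal sketch in plain Python. It does not call Gensim; `collect_vocab` is a hypothetical stand-in that only mimics the "iterate over sentences, then over tokens" pattern a word2vec trainer follows, and the sample sentences are made up for the example. A flat list of strings is read as a list of sentences, so each string is split into its characters.

```python
from collections import Counter


def collect_vocab(sentences):
    """Hypothetical helper (not a Gensim API): count tokens the way a
    trainer would, by looping over sentences and then over their tokens."""
    counts = Counter()
    for sentence in sentences:
        for token in sentence:
            counts[token] += 1
    return counts


# What the question passes: a single sentence, i.e. a flat list of strings.
# Each string is treated as a "sentence", and iterating a string yields
# its individual characters.
flat = ["human computer interaction", "helps", "people"]
flat_vocab = collect_vocab(flat)

# What is expected: a list of sentences, each of which is a list of tokens.
nested = [
    ["human computer interaction", "helps", "people"],
    ["new york", "mayor", "was", "present"],
]
nested_vocab = collect_vocab(nested)

print(sorted(flat_vocab))    # single characters
print(sorted(nested_vocab))  # whole tokens and phrases
```

In the question's code this would presumably mean collecting the result for every sentence, for example `trigram_sentences = [trigram[bigram[sent]] for sent in sentence_stream]` (the name `trigram_sentences` is made up here), and passing that nested list to `word2vec.Word2Vec` instead of `trigrams_`, which only holds the tokens of the last sentence in the loop.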