Why do the tm and RTextTools packages produce different output?

I have a dataset of 260 RTI applications, and I need to run LDA on them. I created term-document matrices with both the tm and RTextTools packages, but the outputs differ considerably. The tm result shows no sparse entries, and the total number of terms is very different. Here is the code:

library("tm") 
library("RTextTools") 
<I read the data here into a variable called 'data'> 
doc = Corpus(VectorSource(data)) 
m = create_matrix(data, language = "english", removeNumbers = TRUE, removePunctuation = TRUE, stemWords = TRUE, weighting = weightTf) #RTextTools statement 
tdm <- TermDocumentMatrix(doc, control = list(removePunctuation = TRUE, removeNumbers = TRUE, language = "english", stemWords = TRUE, stopWords = TRUE, weighting = weightTf)) #tm statement 
>m 
#<<DocumentTermMatrix (documents: 260, terms: 951)>> 
Non-/sparse entries: 2669/244591 
Sparsity   : 99% 
>tdm 
#<<TermDocumentMatrix (terms: 1024, documents: 1)>> 
Non-/sparse entries: 1024/0 
Sparsity   : 0%

If you need the dataset to better understand the problem, please let me know.

Source

2017-07-06 BlackSwan

See ?termFreq: the options need to be stemming=TRUE, stopwords=TRUE instead of stemWords = TRUE, stopWords = TRUE. Also, SimpleCorpus objects trigger a default behavior of TermDocumentMatrix that may override your control parameters.
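The fix described above could look roughly like the following. This is only a sketch, assuming a recent tm version (0.7 or later) with the SnowballC package installed for stemming. The three sample strings are made-up stand-ins for the asker's `data` variable, and VCorpus is used in place of Corpus() so that the result is not a SimpleCorpus.

```r
library(tm)  # stemming = TRUE additionally needs the SnowballC package installed

# Hypothetical stand-in for the asker's data: one string per document
data <- c("Applications were filed in 2016, and more applications followed.",
          "The department replied to the application within 30 days.",
          "Several requests are still pending with the department.")

# Build a VCorpus instead of calling Corpus(), which would yield a SimpleCorpus here
doc <- VCorpus(VectorSource(data))

# Option names as documented in ?termFreq: stopwords and stemming
tdm <- TermDocumentMatrix(doc, control = list(
  removePunctuation = TRUE,
  removeNumbers     = TRUE,
  stopwords         = TRUE,   # instead of stopWords
  stemming          = TRUE,   # instead of stemWords
  weighting         = weightTf
))

tdm  # print the summary: term count, document count, sparsity
```

With the asker's real data, `data` should hold one element per document, so that the printed summary reports one document per application.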

Source

2017-07-06 12:42:43 lukeA

Do you recommend using VCorpus? – BlackSwan

@HimabinduBoddupalli Yes. – lukeA

doc = VCorpus(VectorSource(data)); tdm <- TermDocumentMatrix(doc, control = list(language = "english", removeNumbers = TRUE, removePuncutation = TRUE, stemming = TRUE, stopwords = TRUE, weighting = weightTf)) does not work. – BlackSwan
