Getting text back into an R object from the tm package

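I am new to the tm package and would appreciate any help. I have a large number of posts from which I have stripped unwanted symbols and stop words using various functions of the tm package (see below). I am left with 201 documents containing the clean strings I need, but they are in a VCorpus object rather than a plain R object. Is it possible to combine all of these processed documents into a single text file, as one long string?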

In other words, how do I convert a VCorpus object into a data frame, a list, or some other R object?

corpus <-iconv(posts$message, "latin1", "ASCII", sub="") 

corpus <- Corpus(VectorSource(docs)) 
corpus <- tm_map(corpus, PlainTextDocument) 
corpus <- tm_map(corpus, removePunctuation) 

corpus <- tm_map(corpus, removeNumbers) 
corpus <- tm_map(corpus, tolower) 

#remove special characters for emails

for(j in seq(corpus)) 
{ 
    corpus[[j]] <- gsub("/", " ", corpus[[j]]) 
    corpus[[j]] <- gsub("@", " ", corpus[[j]]) 
    corpus[[j]] <- gsub("\\|", " ", corpus[[j]]) 
} 


library(SnowballC) 

corpus <- tm_map(corpus, stemDocument) 

#remove common English stopwords 
docs <- tm_map(docs, removeWords, stopwords("english")) 

#remove words that will be common in our given context 
docs <- tm_map(docs, removeWords, c("department", "email", "job", "fresher", "internship")) 

#removeUrls 
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x) 

corpus <- tm_map(corpus, removeURL) 

> corpus 
<<VCorpus>> 
Metadata: corpus specific: 0, document level (indexed): 0 
Content: documents: 201

Source

2016-08-03 the_darkside

Please don't edit your question into a completely different question. Open a new question instead. – MrFlick

A corpus is a list of plain text documents. If you want to extract all of the content as a character array, you can do it like this, using the crude test data:

library(tm)
data("crude") 
x <- tm_map(crude, stemDocument, lazy = TRUE) 
x <- tm_map(x, content_transformer(tolower)) 

xx <- sapply(x, content) 
str(xx)

This uses sapply and content to loop over the list and extract all of the content. Use lapply instead of sapply if you want a list.
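The same extraction can be carried through to the single text file the question asks about. The following is only a minimal sketch: the three strings in `docs` are made-up stand-ins for the cleaned posts, and `out_file` is a hypothetical temporary path, neither of which comes from the original question.

```r
library(tm)

# Made-up stand-in for the 201 cleaned posts (illustrative data only)
docs <- c("first cleaned post", "second cleaned post", "third cleaned post")
corpus <- VCorpus(VectorSource(docs))

# One character string per document
texts <- unname(sapply(corpus, content))

# The same content kept as a list instead of a character vector
texts_list <- lapply(corpus, content)

# A data frame with one row per document
df <- data.frame(doc_id = seq_along(texts), text = texts,
                 stringsAsFactors = FALSE)

# All documents joined into one long string
all_text <- paste(texts, collapse = " ")

# Written to a single text file, one document per line
# (out_file is a hypothetical temporary path)
out_file <- tempfile(fileext = ".txt")
writeLines(texts, out_file)
```

Whether the documents should be joined with a space, a newline, or written one per line depends on what the downstream step expects, so the separator here is just one possible choice.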

Source

2016-08-03 22:31:53 MrFlick

When I substitute my own data I get the following error: 'Error in UseMethod("meta", x) : no applicable method for 'meta' applied to an object of class "character"'. –

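Then include a reproducible example in your question so that we can see how your corpus differs from the sample I created. You need to wrap 'removeURL' in 'content_transformer'. I recommend reading the documentation for the latter function. – MrFlick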
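As a rough illustration of the wrapping suggested in the comment above, the sketch below passes every plain string function through content_transformer, so the documents stay tm documents rather than being turned into bare character vectors. The two strings in `docs` are invented sample data and `toSpace` is a hypothetical helper name; `removeURL` is copied unchanged from the question.

```r
library(tm)

# Invented sample posts (illustrative data only)
docs <- c("See http://example.com/page for details",
          "Mail me at user@example.com | thanks")
corpus <- VCorpus(VectorSource(docs))

# removeURL exactly as defined in the question
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)

# Plain string functions are wrapped so that each document keeps its class
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, content_transformer(removeURL))

# Hypothetical helper replacing the for loop over "/", "@" and "|"
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "[/@|]")

# Extraction with sapply and content now works on the transformed corpus
texts <- sapply(corpus, content)
```

Functions that tm already provides as transformations, such as removePunctuation, removeNumbers and stemDocument, can be passed to tm_map directly; the wrapper is only needed for functions like tolower or gsub that return a character vector.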
