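Classification algorithm in R

I am using R for text classification. With the script below, I want to predict the class of new documents from past data that contains the text "Description" and its "Class". I am not getting good accuracy. Could anyone help me understand which algorithm I could use to improve it? Any suggestions are welcome.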

library(plyr) 
library(tm) 
library(e1071) 

setwd("C:/Data") 

past <- read.csv("Past - Copy.csv",header=T,na.strings=c("")) 
future <- read.csv("Future - Copy.csv",header=T,na.strings=c("")) 

training <- rbind.fill(past,future) 

Res_Desc_Train <- subset(training,select=c("Class","Description")) 

##Step 1 : Create Document Matrix of ticket Descriptions available past data 

docs <- Corpus(VectorSource(Res_Desc_Train$Description)) 
docs <-tm_map(docs,content_transformer(tolower)) 

#helpers for removing potentially problematic symbols (defined here but not used below) 
toSpace <- content_transformer(function(x, pattern) { return (gsub(pattern, " ", x))}) 
removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]","",x) 
#text was already lower-cased above; remove stop words before punctuation 
#so that contractions such as "don't" still match the stop word list 
docs <- tm_map(docs, removeWords, stopwords('english')) 
docs <- tm_map(docs, removeNumbers) 
docs <- tm_map(docs, removePunctuation) 
docs <- tm_map(docs, stripWhitespace) 


#inspect(docs[440]) 
dataframe<-data.frame(text=unlist(sapply(docs, `[`, "content")), stringsAsFactors=F) 

dtm <- DocumentTermMatrix(docs,control=list(stopwords=FALSE,wordLengths =c(2,Inf))) 

##Let's remove the variables which are 95% or more sparse. 
dtm <- removeSparseTerms(dtm,sparse = 0.95) 

Weighteddtm <- weightTfIdf(dtm,normalize=TRUE) 
mat.df <- as.data.frame(as.matrix(Weighteddtm), stringsAsFactors = FALSE) 
mat.df <- cbind(mat.df, Res_Desc_Train$Class) 
colnames(mat.df)[ncol(mat.df)] <- "Class" 
Assignment.Distribution <- table(mat.df$Class) 

Res_Desc_Train_Assign <- mat.df$Class 

### Features have different ranges; min-max normalize each one to the range 0 to 1 
### (standardizing with z-scores would be an alternative) 

normalize <- function(x) { 
    rng <- max(x) - min(x) 
    # a constant column has zero range; return zeros instead of NaN 
    if (rng == 0) return(rep(0, length(x))) 
    (x - min(x)) / rng 
} 
#normalize(c(1,2,3,4,5)) 

num_col <- ncol(mat.df)-1 
mat.df_normalize <- as.data.frame(lapply(mat.df[,1:num_col], normalize)) 
mat.df_normalize <- cbind(mat.df_normalize, Res_Desc_Train_Assign) 
colnames(mat.df_normalize)[ncol(mat.df_normalize)] <- "Class" 

#names(mat.df) 
outcomeName <- "Class" 

train = mat.df_normalize[c(1:nrow(past)),] 
test = mat.df_normalize[((nrow(past)+1):nrow(training)),] 


train$Class <- as.factor(train$Class) 

###SVM Model 
x <- subset(train, select = -Class) 
y <- train$Class 
model <- svm(x, y, probability = TRUE) 
test1 <- subset(test, select = -Class) 
svm.pred <- predict(model, test1, decision.values = TRUE, probability = TRUE) 
svm_prob <- attr(svm.pred, "probabilities") 

finalresult <- cbind(test,svm.pred,svm_prob)
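
The script above predicts classes for the `future` rows but does not measure accuracy, and those rows presumably have no known class to compare against. One way to estimate accuracy would be to hold out part of the labelled `past` rows for validation. The following is a minimal base-R sketch under that assumption; `stratified_split`, `train_frac`, and the toy `labels` vector are hypothetical names used for illustration and are not part of tm or e1071.

```r
# Hypothetical helper: split row indices so that every class keeps roughly
# the same share in the training part and in the validation part.
stratified_split <- function(labels, train_frac = 0.8, seed = 42) {
  set.seed(seed)
  # Row indices grouped by class (rows with an NA label are not in any group)
  idx_by_class <- split(seq_along(labels), labels)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    n_train <- max(1, floor(length(idx) * train_frac))
    # sample.int avoids the pitfall of sample() on a length-one vector
    idx[sample.int(length(idx), n_train)]
  }), use.names = FALSE)
  list(train = sort(train_idx),
       valid = setdiff(seq_along(labels), train_idx))
}

# Toy labels standing in for the Class column of the labelled rows
labels <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))
parts <- stratified_split(labels)
table(labels[parts$train])
table(labels[parts$valid])
```

With the question's data, the indices could be applied to the first `nrow(past)` rows of `mat.df_normalize`, fitting the model on `parts$train` and comparing predictions with the known classes on `parts$valid`.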

Source

2017-07-19 user3734568

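Have you tried tuning the SVM model?

You are running the model with its default parameters, which is likely why the accuracy does not improve. Building a model is an iterative process: change the parameters, run the model, check the accuracy, and then repeat the whole process.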

#grid search over cost and gamma (tune() uses 10-fold cross-validation by default) 
tune_out <- tune(svm, train.x=x, train.y=y, kernel="radial", ranges=list(cost=10^(-1:2), gamma=c(.5,1,2))) 
print(tune_out) 
#take the best cost & gamma found by tune() and pass them to tuned_model 

tuned_model <- svm(x, y, kernel="radial", cost=tune_out$best.parameters$cost, gamma=tune_out$best.parameters$gamma) 
#now check the accuracy of this model on the test dataset and adjust the tuning ranges accordingly. Repeat the whole process.
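
The "check accuracy" step could be sketched roughly as follows, assuming the true classes of the rows being predicted are known (for example, a held-out part of the past data). `evaluate_predictions` is a hypothetical helper written for this illustration, and the toy `actual` and `predicted` vectors stand in for the held-out labels and the output of `predict(tuned_model, ...)`.

```r
# Hypothetical helper: confusion matrix, overall accuracy and per-class
# recall from two label vectors of equal length.
evaluate_predictions <- function(actual, predicted) {
  # Use one shared set of levels so the confusion matrix is square
  lv <- union(levels(factor(actual)), levels(factor(predicted)))
  actual <- factor(actual, levels = lv)
  predicted <- factor(predicted, levels = lv)
  cm <- table(Actual = actual, Predicted = predicted)
  list(confusion = cm,
       accuracy = sum(diag(cm)) / sum(cm),
       # recall is NaN for a class that never occurs in `actual`
       recall = diag(cm) / rowSums(cm))
}

# Toy vectors for illustration only
actual    <- c("A", "A", "B", "B", "C", "C")
predicted <- c("A", "B", "B", "B", "C", "A")
res <- evaluate_predictions(actual, predicted)
res$confusion
res$accuracy
res$recall
```

Looking at per-class recall alongside overall accuracy can show whether a low score comes from a few poorly predicted classes, which may matter when the class distribution is uneven.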

Source

2017-07-19 12:50:03 Prem

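Thank you for your help. I will use the solution you shared and check whether the accuracy improves. At the moment the accuracy is low, at about 52%. – user3734568

In that case, try increasing the training dataset so that the model can be trained properly. – Prem

Thanks for your suggestion. I will check whether I can get more data to train the model. I currently have 13383 documents in the training dataset. – user3734568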
