2016-11-21 6 views
2

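I am trying to train an SVM model on training and test data. The program works fine when I combine the test and training data, but when I split them and test the model's accuracy, it fails with an error saying that the test and training datasets have different numbers of features; the test set has more features than the training set.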

Traceback (most recent call last): 
    File "/home/PycharmProjects/analysis.py", line 160, in <module> 
    main() 
    File "/home/PycharmProjects/analysis.py", line 156, in main 
    learn_model(tf_idf_train,target,tf_idf_test) 
    File "/home/PycharmProjects/analysis.py", line 113, in learn_model 
    predicted = classifier.predict(data_test) 
    File "/home/.local/lib/python3.4/site-packages/sklearn/svm/base.py", line 573, in predict 
    y = super(BaseSVC, self).predict(X) 
    File "/home/.local/lib/python3.4/site-packages/sklearn/svm/base.py", line 310, in predict 
    X = self._validate_for_predict(X) 
    File "/home/.local/lib/python3.4/site-packages/sklearn/svm/base.py", line 479, in _validate_for_predict 
    (n_features, self.shape_fit_[1])) 
    ValueError: X.shape[1] = 19137 should be equal to 4888, the number of features at training time 

As the error says, the test set has more features than the training set (19137 versus 4888). Here is my code:

def load_train_file():
    with open('~1k comments.csv', encoding='ISO-8859-1') as csv_file:
        reader = csv.reader(csv_file, delimiter=",", quotechar='"')
        reader.__next__()
        data = []
        target = []
        for row in reader:
            if row[0] and row[1]:
                data.append(row[0])
                target.append(row[1])

    return data, target


def load_file():
    with open('comments.csv', encoding='ISO-8859-1') as csv_file:
        reader = csv.reader(csv_file, delimiter=",", quotechar='"')
        reader.__next__()
        data = []
        target = []
        for row in reader:
            if row[0] and row[1]:
                data.append(row[0])
                target.append(row[1])
        print(len(data))

    return data


# preprocess creates the term frequency matrix for the review data set
def preprocess():
    dataTrain, targetTrain = load_train_file()
    testData = load_file()
    count_vectorizer = CountVectorizer(binary='true')
    dataTrain = count_vectorizer.fit_transform(dataTrain)
    tfidf_train_data = TfidfTransformer(use_idf=True).fit_transform(dataTrain)

    count_vectorizer = CountVectorizer()
    testData = count_vectorizer.fit_transform(testData)
    tfidf_test_data = TfidfTransformer(use_idf=True).fit_transform(testData)

    return tfidf_train_data, tfidf_test_data


def learn_model(data, target, testData):
    data_train, data_test, target_train, target_test = cross_validation.train_test_split(data, target, test_size=0.001, random_state=43)
    e = np.zeros(testData.shape[0])
    data_train1, data_test, target_train1, target_test = cross_validation.train_test_split(testData, e, test_size=.9, random_state=43)
    classifier = SVC(gamma=.01, C=100.)
    classifier.fit(data_train, target_train)
    predicted = classifier.predict(data_test)
    for x in range(0, 50):
        print(testData[x] + str(predicted[x]))


def evaluate_model(target_true, target_predicted):
    print(classification_report(target_true, target_predicted))
    print("The accuracy score is {:.2%}".format(accuracy_score(target_true, target_predicted)))


def main():
    data, target = load_train_file()
    datatest = load_file()

    tf_idf_train, tf_idf_test = preprocess()
    # print(tf_idf_train.shape())
    # print(tf_idf_test.shape())

    learn_model(tf_idf_train, target, tf_idf_test)
    # learn_model(data,target,datatest)


main()

How can I solve this problem?

Answer

5

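You need to use the same vectorizer and transformer for both the training and the test data. Also, the vectorizer must not be fitted on the test data; it should only transform it, using the vocabulary learned from the training data. So instead of this: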

count_vectorizer = CountVectorizer(binary='true') 
dataTrain = count_vectorizer.fit_transform(dataTrain) 
tfidf_train_data = TfidfTransformer(use_idf=True).fit_transform(dataTrain) 

count_vectorizer = CountVectorizer() 
testData = count_vectorizer.fit_transform(testData) 
tfidf_test_data = TfidfTransformer(use_idf=True).fit_transform(testData) 

use something like this:

count_vectorizer = CountVectorizer(binary=True) 
tfidf_transformer = TfidfTransformer(use_idf=True) 
dataTrain = count_vectorizer.fit_transform(dataTrain) 
tfidf_train_data = tfidf_transformer.fit_transform(dataTrain)

testData = count_vectorizer.transform(testData) 
tfidf_test_data = tfidf_transformer.transform(testData) 
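
To see why this matters, here is a minimal sketch on toy data. The documents and the variable names `train_docs`, `test_docs` and `wrong_test` are made up for illustration and are not from the question. A second vectorizer fitted on the test documents learns its own vocabulary, so its column count differs from the training matrix, while reusing the fitted objects keeps the widths equal.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy stand-ins for the comment files (hypothetical data, for illustration only)
train_docs = ["good movie", "bad movie", "great plot"]
test_docs = ["good plot and great acting", "terrible movie with bad pacing"]

# Wrong: a separate vectorizer fitted on the test data builds a different vocabulary
wrong_test = CountVectorizer().fit_transform(test_docs)

# Right: fit on the training data only, then reuse the fitted objects on the test data
count_vectorizer = CountVectorizer(binary=True)
tfidf_transformer = TfidfTransformer(use_idf=True)
train_counts = count_vectorizer.fit_transform(train_docs)
tfidf_train = tfidf_transformer.fit_transform(train_counts)
tfidf_test = tfidf_transformer.transform(count_vectorizer.transform(test_docs))

print("train features:          ", tfidf_train.shape[1])
print("test features (refitted):", wrong_test.shape[1])
print("test features (reused):  ", tfidf_test.shape[1])
```

Words that appear only in the test documents are simply ignored by `transform`, which is what the classifier expects.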

You can also use a Pipeline to make this cleaner:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(
    CountVectorizer(binary=True), 
    TfidfTransformer(use_idf=True), 
) 
tfidf_train_data = pipe.fit_transform(dataTrain) 
tfidf_test_data = pipe.transform(testData) 
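
The classifier can go into the same pipeline, so that raw text goes in and predictions come out. The following is only a sketch under assumptions: the toy documents, the labels and the names `train_docs`, `train_labels`, `new_docs` and `model` are hypothetical, and the SVC settings are copied from the question rather than tuned.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical labelled training comments and unlabelled new comments
train_docs = ["good movie", "great plot", "bad movie", "terrible plot"]
train_labels = ["pos", "pos", "neg", "neg"]
new_docs = ["a great movie", "bad and terrible"]

# Vectorizer, tf-idf transformer and classifier chained into one estimator
model = make_pipeline(
    CountVectorizer(binary=True),
    TfidfTransformer(use_idf=True),
    SVC(gamma=0.01, C=100.0),  # parameters taken from the question's code
)

# fit() fits every step on the training data only;
# predict() applies the already fitted steps to the new documents
model.fit(train_docs, train_labels)
predicted = model.predict(new_docs)

for doc, label in zip(new_docs, predicted):
    print(label, "<-", doc)
```

With this layout there is no separate test matrix to keep in sync, so a feature-count mismatch between fitting and prediction cannot arise.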

Alternatively, use TfidfVectorizer, which combines CountVectorizer and TfidfTransformer in a single vectorizer object:

from sklearn.feature_extraction.text import TfidfVectorizer 
vec = TfidfVectorizer(binary=True, use_idf=True) 
tfidf_train_data = vec.fit_transform(dataTrain) 
tfidf_test_data = vec.transform(testData) 
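
As a quick sanity check, the one-step and two-step variants can be compared on toy data. This is a sketch with made-up documents; the names `one_step_*`, `two_step_*`, `same_train` and `same_test` are introduced here for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy documents
train_docs = ["good movie", "bad movie", "great plot"]
test_docs = ["good plot and great acting", "terrible movie with bad pacing"]

# One step: TfidfVectorizer
vec = TfidfVectorizer(binary=True, use_idf=True)
one_step_train = vec.fit_transform(train_docs)
one_step_test = vec.transform(test_docs)

# Two steps: CountVectorizer followed by TfidfTransformer
counts = CountVectorizer(binary=True)
tfidf = TfidfTransformer(use_idf=True)
two_step_train = tfidf.fit_transform(counts.fit_transform(train_docs))
two_step_test = tfidf.transform(counts.transform(test_docs))

# Compare the dense versions of the small matrices
same_train = np.allclose(one_step_train.toarray(), two_step_train.toarray())
same_test = np.allclose(one_step_test.toarray(), two_step_test.toarray())
print(same_train, same_test)
```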