Saturday, 27 June 2015

Predict classes of test data using k-fold cross-validation with sklearn

I am working on a data mining project and I am using the sklearn package in Python to classify my data.

In order to train my model and evaluate the quality of the predicted values, I am using the sklearn.cross_validation.cross_val_predict function.

However, when I try to run my model on the test data, it asks for the class labels, which are not available.

I have seen (possible) workarounds using the sklearn.grid_search.GridSearchCV function, but I am loath to use such a method for a fixed set of parameters.

Going through the sklearn.cross_validation documentation, I have come across the cross_val_score function. Since I am fairly new to the world of classification problems, I am not quite sure whether this is the function that would solve my problem.
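
For reference, here is a minimal sketch of what cross_val_score returns. It assumes a recent scikit-learn release, where these helpers live in sklearn.model_selection (the sklearn.cross_validation module was removed in version 0.20), and it uses small synthetic arrays rather than my real data. The function yields one score per fold and no predictions at all, so on its own it does not label unseen data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the training data (assumption: not the real data set)
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] > 0).astype(int)

# One accuracy value per fold; nothing is predicted for new samples
scores = cross_val_score(SVC(), X, y, cv=2)
print(scores.shape)  # (2,)
```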

Any help will be awesome!

Thanks!

Edit:

Hello! I get the impression I was fairly vague in my original query. I'll try to detail exactly what I am doing. Here goes:

I have generated three numpy.ndarrays, X, X_test and y, with 10158, 22513 and 10158 rows respectively, which correspond to my training data, my test data and the class labels for the training data.

Thereafter, I run the following code:

    from sklearn.svm import SVC
    from sklearn.cross_validation import cross_val_predict
    clf = SVC()
    testPred = cross_val_predict(clf, X, y, cv=2)

This works fine and I can then use testPred and y as mentioned in the tutorials.

However, I am looking to predict the classes of X_test. The error message is rather self-explanatory and says:

    ValueError: Found arrays with inconsistent numbers of samples: [10158 22513]
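
As far as I can tell, the mismatch arises because cross_val_predict returns one out-of-fold prediction per row of the array it is given, so that array must have as many rows as y. The sketch below reproduces this on small synthetic data. It assumes a recent scikit-learn release (sklearn.model_selection instead of sklearn.cross_validation), the array sizes are made up, and the exact wording of the error may differ between versions.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Synthetic stand-ins for X, y and X_test (assumption: sizes are illustrative)
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] > 0).astype(int)
X_test = rng.normal(size=(25, 4))

# Each training row is predicted by a model that did not see it during fitting
oof_pred = cross_val_predict(SVC(), X, y, cv=2)
print(oof_pred.shape)  # (60,)

# Passing X_test together with y fails because the row counts differ
try:
    cross_val_predict(SVC(), X_test, y, cv=2)
    raised = False
except ValueError as err:
    raised = True
    print(err)
```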

The workaround I am currently using (I do not know whether it is a workaround or the only way to do it) is:

    from sklearn import grid_search
    # thereafter I create the parameter grid (grid) and appropriate scoring function (scorer)
    model = grid_search.GridSearchCV(estimator=clf, param_grid=grid, scoring=scorer,
                                     refit=True, cv=2, n_jobs=-1)
    model.fit(X, y)
    # With refit=True, best_estimator_ has already been refit on all of X, y,
    # so this extra fit is redundant (but harmless)
    model.best_estimator_.fit(X, y)
    testPred = model.best_estimator_.predict(X_test)

This technique works fine for the time being; however, if I didn't have to use the GridSearchCV function, I'd be able to sleep much better.
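
For comparison, the kind of flow I would hope for with a fixed set of parameters might look like the sketch below: cross-validate on the training data to estimate quality, then fit once on all of it and predict the test data, with no grid search. This is only a sketch under assumptions: it uses sklearn.model_selection from recent scikit-learn releases, synthetic arrays, and illustrative parameter values (C=1.0, kernel="rbf") that are not my real settings.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-ins for X, y and X_test (assumption: sizes are illustrative)
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] > 0).astype(int)
X_test = rng.normal(size=(25, 4))

# Fixed hyperparameters (hypothetical values chosen for illustration)
clf = SVC(C=1.0, kernel="rbf")

# Step 1: estimate quality on the training data with 2-fold cross-validation
cv_scores = cross_val_score(clf, X, y, cv=2)
print(cv_scores.mean())

# Step 2: fit once on all training data, then predict the unlabeled test data
clf.fit(X, y)
test_pred = clf.predict(X_test)
print(test_pred.shape)  # (25,)
```

cross_val_score works on clones of clf, so the estimator still has to be fitted explicitly before predict is called.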
