WEKA - Classification - Training and Test Set
I am working on a classification problem using three different classifiers: Decision Tree, Naive Bayes and IBk. I have two data sets which have the same layout and attribute names, but the values in each are different.

Training Set Example:

State, Population, HouseholdIncome, FamilyIncome, perCapInc, NumUnderPov, EducationLevel_1, EducationLevel_2, EducationLevel_3, UnemploymentRate, EmployedRate, ViolentCrimesPerPop, Crime Rate
8, 0.19, 0.37, 0.39, 0.4, 0.08, 0.1, 0.18, 0.48, 0.27, 0.68, 0.2, Low

I would like my decision tree to use the 12 attributes to predict whether the target class value (Crime Rate) is Low, Med or High, based on the ViolentCrimesPerPop figure, which in this example is 0.2.

My question is: in my test set, do I just provide more unseen examples in the same format, or should I take away one of the attributes so I can see if it has learnt anything?
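Testing your classifier on the same data you trained it on is not a good idea, because the model has (hopefully) already learnt to classify those instances correctly, so the result tells you little about how it handles new data. The usual setup is to train on the training dataset and then test on a different dataset with the same format and structure, to see how the model performs on instances it has not seen. So provide more unseen examples in the same layout rather than removing an attribute; the test set keeps its class values so that the predictions can be compared against them.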
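To illustrate why accuracy on the training data is misleading, here is a minimal sketch in plain Python (not Weka). It uses a 1-nearest-neighbour rule, roughly what IBk does with k = 1, and the rows, the two-feature layout and the helper names are all made up for illustration. Every training row is its own nearest neighbour, so the training accuracy is perfect by construction, while the held-out rows give a more honest figure.

```python
import math

def nearest_neighbour_predict(train_rows, features):
    """Return the label of the training row closest to `features` (1-NN)."""
    closest = min(train_rows, key=lambda row: math.dist(row[0], features))
    return closest[1]

def accuracy(train_rows, eval_rows):
    """Fraction of eval_rows whose label matches the 1-NN prediction."""
    correct = sum(
        1
        for features, label in eval_rows
        if nearest_neighbour_predict(train_rows, features) == label
    )
    return correct / len(eval_rows)

# Hypothetical rows: (feature vector, class label). Both sets share one layout,
# and the test rows keep their labels so predictions can be checked.
train = [
    ((0.19, 0.37), "Low"),
    ((0.45, 0.52), "Med"),
    ((0.80, 0.91), "High"),
    ((0.25, 0.30), "Low"),
]
test = [
    ((0.22, 0.35), "Low"),
    ((0.50, 0.60), "Med"),
    ((0.60, 0.70), "High"),
]

train_accuracy = accuracy(train, train)  # evaluated on data the model has seen
test_accuracy = accuracy(train, test)    # evaluated on unseen data
print("training accuracy:", train_accuracy)
print("test accuracy:", test_accuracy)
```

The gap between the two numbers is the reason a separate test set is needed.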
It is a good idea to separate your dataset into three sets: Training, Testing and Validation. The training set is used to train each of the models that you are building. Each model's performance is then checked against the testing set. As you continue to adjust the parameters of your model (for example, pruning options on Decision Trees, k for k-NN, or Neural Network parameters), you can see how well the model is performing against the testing set. Finally, once these parameters have been settled for your model, you can run it against the validation set to confirm that the model did not over-fit on the testing data (due to the parameter adjustments applied to the model itself). Note that many texts swap the last two names, calling the set used for tuning the "validation" set and the final held-out set the "test" set; what matters is that the final check uses data that played no part in training or tuning. A further discussion of these sets may be found here. Generally, I have used a data split of 60-20-20, but 50-25-25 is common as well. It really comes down to how much data you have to play with. I hope this helps!
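The 60-20-20 split can be sketched as follows. This is plain Python rather than Weka, and the function name `three_way_split`, the fixed seed and the integer stand-in rows are assumptions made for the example. The set names follow the usage in this answer (testing for tuning, validation for the final check).

```python
import random

def three_way_split(rows, fractions=(0.6, 0.2, 0.2), seed=42):
    """Shuffle rows and cut them into training, testing and validation sets."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(rows)                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed keeps the split repeatable
    n = len(shuffled)
    first_end = round(n * fractions[0])
    second_end = first_end + round(n * fractions[1])
    training = shuffled[:first_end]
    testing = shuffled[first_end:second_end]   # used while tuning parameters
    validation = shuffled[second_end:]         # held back for the final check
    return training, testing, validation

rows = list(range(100))                        # stand-in for 100 dataset rows
training, testing, validation = three_way_split(rows)
print(len(training), len(testing), len(validation))
```

Shuffling before cutting matters when the file is ordered (for example by State), since otherwise one set could end up with rows the others never see. Passing `fractions=(0.5, 0.25, 0.25)` gives the 50-25-25 variant.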