WEKA - Classification - Training and Test Set
I am working on a classification problem using three different classifiers: Decision Tree, Naive Bayes and IBk. I have two data sets with the same layout and attribute names, but the values in each are different.

Training set example:

State, Population, HouseholdIncome, FamilyIncome, perCapInc, NumUnderPov, EducationLevel_1, EducationLevel_2, EducationLevel_3, UnemploymentRate, EmployedRate, ViolentCrimesPerPop, Crime Rate
8, 0.19, 0.37, 0.39, 0.4, 0.08, 0.1, 0.18, 0.48, 0.27, 0.68, 0.2, Low

I would like my decision tree to use the 12 attributes to predict whether the target class value is Low, Med or High, based on the ViolentCrimesPerPop figure, which in this example is 0.2.

My question is: in my test set, do I just provide more unseen examples in the same format, or should I take away one of the attributes so I can see whether it has learnt anything?
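It is not a good idea to test your classifier on the same data you trained it on, because the model has, hopefully, already learnt to classify those instances correctly, so the result tells you little about how it handles new data. The usual setup is to train on the training dataset and then test on a different dataset with the same format and structure (including the class attribute, so that the predictions can be compared with the true labels) to see how it performs.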
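To illustrate why evaluating on the training data is misleading, here is a minimal sketch in plain Python rather than WEKA. It uses a 1-nearest-neighbour classifier (the same idea as IBk with k = 1) on synthetic data; the function names, the two made-up features and the noisy Low/High label are all assumptions for the example, not taken from the question's data set. Note that the unseen rows keep their class label, which is what allows the predictions to be scored.

```python
import random

def nearest_neighbour_predict(train_rows, features):
    """Return the label of the training row closest to `features` (1-NN)."""
    def squared_distance(row):
        return sum((a - b) ** 2 for a, b in zip(row[0], features))
    return min(train_rows, key=squared_distance)[1]

def accuracy(train_rows, eval_rows):
    """Fraction of eval_rows whose predicted label matches the true label."""
    correct = sum(
        1 for feats, label in eval_rows
        if nearest_neighbour_predict(train_rows, feats) == label
    )
    return correct / len(eval_rows)

def make_rows(n, rng):
    """Synthetic rows (hypothetical data): two features and a noisy label."""
    rows = []
    for _ in range(n):
        feats = (rng.random(), rng.random())
        label = "High" if feats[0] + rng.gauss(0, 0.3) > 0.5 else "Low"
        rows.append((feats, label))
    return rows

rng = random.Random(0)
train_rows = make_rows(200, rng)
test_rows = make_rows(200, rng)   # unseen rows, same format, labels kept

train_accuracy = accuracy(train_rows, train_rows)
test_accuracy = accuracy(train_rows, test_rows)
print(f"accuracy on training data: {train_accuracy:.2f}")
print(f"accuracy on unseen data:   {test_accuracy:.2f}")
```

Every training row is its own nearest neighbour, so the score on the training data is perfect by construction, while the score on the unseen rows is the more honest estimate.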
It is a good idea to separate your dataset into three sets: training, testing and validation. The training set is used to train each of the models that you are building. Each model's performance is then checked against the testing set. As you continue to adjust the parameters of your model (for example, pruning options on decision trees, k for k-NN, or neural network parameters), you can see how well the model performs against the testing set. Finally, once these parameters have been finalized, you can run the model against the validation set to confirm that it did not over-fit to the testing data (because the parameter adjustments were chosen using that data). Be aware that many texts use these two names the other way round, calling the tuning set "validation" and the final held-out set "test". A further discussion of these sets may be found here. Generally, I have used a data split of 60-20-20, but 50-25-25 is common as well; it really comes down to how much data you have to play with. I hope this helps!
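The 60-20-20 split described above could be sketched as follows in plain Python. This is only an illustration: `three_way_split`, its `fractions` and `seed` parameters, and the stand-in rows are hypothetical names chosen for the example, and the set names follow this answer's usage (tune against testing, confirm against validation).

```python
import random

def three_way_split(rows, fractions=(0.6, 0.2, 0.2), seed=42):
    """Shuffle rows and cut them into (training, testing, validation)."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(rows)                    # copy, leave the input untouched
    random.Random(seed).shuffle(shuffled)    # fixed seed for repeatability
    n = len(shuffled)
    first_cut = round(n * fractions[0])
    second_cut = first_cut + round(n * fractions[1])
    # The last set takes whatever remains, so no row is lost to rounding.
    return (shuffled[:first_cut],
            shuffled[first_cut:second_cut],
            shuffled[second_cut:])

rows = list(range(100))                      # stand-in for 100 data rows
training, testing, validation = three_way_split(rows)
print(len(training), len(testing), len(validation))
```

Shuffling before cutting matters when the file is ordered (for example by State), because otherwise the three sets would not be drawn from the same distribution. For 50-25-25, pass `fractions=(0.5, 0.25, 0.25)`.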