Nearest Neighbors

We use a subset of the "Iris Plants Database" dataset (provided by WEKA, contained in the "iris.arff" file).

Each plant record (i.e., example) is described by the following five attributes:

SepalLength: the sepal length in cm.
SepalWidth: the sepal width in cm.
PetalLength: the petal length in cm.
PetalWidth: the petal width in cm.
Class: the classification attribute, with the possible values {Iris-setosa, Iris-versicolor, Iris-virginica}.

We want to predict the class for each of the following plants:

Plant #16. (SepalLength=4.6; SepalWidth=3.6; PetalLength=1.0; PetalWidth=0.2).
Plant #17. (SepalLength=6.1; SepalWidth=2.8; PetalLength=4.0; PetalWidth=1.3).
Plant #18. (SepalLength=7.7; SepalWidth=3.0; PetalLength=6.1; PetalWidth=2.3).

Part 1 - Manual Computation

Apply the Nearest Neighbor learning algorithm to classify the three to-be-predicted plants (i.e., Plants #16-18), that is, to determine the class of each plant.

Try three different values for the neighborhood size: k=1, k=3, and k=5. Use one of the geometric distance functions (e.g., the Manhattan or the Euclidean distance function).
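The manual computation can be cross-checked with a small script. The following is a minimal sketch, not a reference solution: the function names (manhattan, euclidean, knn_predict) are hypothetical, the three training rows are placeholders that must be replaced with Plants #1-15, and the tie-breaking rule (the class of the nearest tied neighbor wins) is an assumption that may differ from what WEKA does.

```python
from collections import Counter
from math import sqrt


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k, distance=euclidean):
    """train: list of (features, label) pairs; query: tuple of features."""
    # Sort the training examples by distance to the query, nearest first.
    neighbors = sorted(train, key=lambda ex: distance(ex[0], query))[:k]
    # Majority vote. On a tie, the class encountered first (i.e., the one
    # belonging to the nearest tied neighbor) wins; this is an assumption.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Placeholder rows for illustration only; replace with Plants #1-15.
toy_train = [
    ((5.1, 3.5, 1.4, 0.2), "Iris-setosa"),
    ((7.0, 3.2, 4.7, 1.4), "Iris-versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "Iris-virginica"),
]

plant_16 = (4.6, 3.6, 1.0, 0.2)
for k in (1, 3):
    print(k, knn_predict(toy_train, plant_16, k, distance=manhattan))
```

With only three placeholder rows, k=3 gives one vote per class, so the result depends entirely on the tie-breaking rule. With the real 15-example training set, ties are still possible and should be resolved explicitly in the manual computation.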

For k=1, convert the data of the set of Plants #16-18 (together with their predicted class) into the ARFF format, and save it in the "plants_test1.arff" file.
For k=3, convert the data of the set of Plants #16-18 (together with their predicted class) into the ARFF format, and save it in the "plants_test2.arff" file.
For k=5, convert the data of the set of Plants #16-18 (together with their predicted class) into the ARFF format, and save it in the "plants_test3.arff" file.
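The ARFF conversion described above could be scripted as follows. This is a sketch under stated assumptions: the helper to_arff is hypothetical, the attribute names follow this assignment's spelling (the header must match the one used in "plants_train.arff"), and the class column is filled with "?" (the ARFF marker for a missing value) as a placeholder to be replaced with the class predicted in Part 1.

```python
ATTRIBUTES = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]
CLASSES = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]


def to_arff(relation, rows):
    """rows: list of (features, label) pairs; returns the ARFF text."""
    lines = ["@relation " + relation, ""]
    # Four numeric attributes followed by the nominal class attribute.
    lines += ["@attribute " + name + " numeric" for name in ATTRIBUTES]
    lines.append("@attribute Class {" + ",".join(CLASSES) + "}")
    lines += ["", "@data"]
    for features, label in rows:
        lines.append(",".join(str(v) for v in features) + "," + label)
    return "\n".join(lines) + "\n"


# Plants #16-18; replace each "?" with the class predicted for that plant.
test_plants = [
    ((4.6, 3.6, 1.0, 0.2), "?"),
    ((6.1, 2.8, 4.0, 1.3), "?"),
    ((7.7, 3.0, 6.1, 2.3), "?"),
]

arff_text = to_arff("plants_test1", test_plants)
print(arff_text)
```

The printed text can be saved as "plants_test1.arff"; calling the function again with the k=3 and k=5 predictions yields the other two files.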

Part 2 - Analysis with WEKA

Convert the dataset containing 15 examples (i.e., Plants #1-15) into the ARFF format (supported by WEKA), and save it in the "plants_train.arff" file.
Launch the WEKA tool, and then activate the "Explorer" environment.
Open the "plants_train" dataset (i.e., saved in the "plants_train.arff" file).

- For each attribute and for each of its possible values, how many instances in each class have the feature value (i.e., the class distribution of the feature values)?
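The class distribution of the feature values can also be tallied outside WEKA as a sanity check. The sketch below assumes a hypothetical helper class_distribution and uses placeholder rows, not the assignment's Plants #1-15.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]


def class_distribution(rows, attribute_names=ATTRIBUTES):
    """Map each (attribute, value) pair to a Counter of class labels."""
    dist = defaultdict(Counter)
    for features, label in rows:
        for name, value in zip(attribute_names, features):
            dist[(name, value)][label] += 1
    return dist


# Placeholder rows for illustration only; replace with Plants #1-15.
toy_rows = [
    ((5.1, 3.5, 1.4, 0.2), "Iris-setosa"),
    ((4.9, 3.0, 1.4, 0.2), "Iris-setosa"),
    ((7.0, 3.2, 4.7, 1.4), "Iris-versicolor"),
]

dist = class_distribution(toy_rows)
for (name, value), counts in sorted(dist.items()):
    print(name, value, dict(counts))
```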

Go to the "Classify" tab. Select the IBk classifier. In the "Test options" panel select the "Supplied test set" option. Activate the nearby "Set..." button and locate the "plants_test1.arff" file. Run the classifier and observe the results shown in the "Classifier output" window.

- How many instances are used for training? How many for testing?
- How many instances are incorrectly classified?
- What is the MAE (mean absolute error) made by the learned classifier?
- What can you infer from the information shown in the Confusion Matrix?
- Visualize the errors made by the learned classifier. In the plot, how can you differentiate between the correctly and incorrectly classified instances? In the plot, how can you see the detailed information of an incorrectly classified instance?
- How can you save the learned classifier to a file?
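To help interpret the Confusion Matrix, the sketch below builds one by hand, using the layout WEKA prints (rows are the actual classes, columns are the classes assigned by the classifier). The function names and the label lists are hypothetical and serve only as an illustration; they are not results for Plants #16-18.

```python
CLASSES = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]


def confusion_matrix(actual, predicted, classes=CLASSES):
    """Rows index the actual class, columns the predicted class."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def count_errors(matrix):
    """Off-diagonal cells are the incorrectly classified instances."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return total - correct


# Hypothetical labels, for illustration only.
actual = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
predicted = ["Iris-setosa", "Iris-virginica", "Iris-virginica"]

matrix = confusion_matrix(actual, predicted)
for row in matrix:
    print(row)
print("incorrectly classified:", count_errors(matrix))
```

In this exercise the "actual" labels are the ones stored in the test file (i.e., the manual predictions from Part 1), so an off-diagonal count signals a disagreement between the manual computation and IBk.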

Now, click on the "IBk -K 1 -W 0" label (i.e., next to the "Choose" button). Set KNN to 3 (i.e., to use a neighborhood size of 3), and then click the "OK" button to save the new setting. Activate the nearby "Set..." button and locate the "plants_test2.arff" file. Run the classifier and observe the results shown in the "Classifier output" window.

- How many instances are incorrectly classified?
- What is the MAE (mean absolute error) made by the learned classifier?
- What can you infer from the information shown in the Confusion Matrix?
- Visualize the errors made by the learned classifier. In the plot, how can you differentiate between the correctly and incorrectly classified instances? In the plot, how can you see the detailed information of an incorrectly classified instance?

Now, click on the "IBk -K 3 -W 0" label (i.e., next to the "Choose" button). Set KNN to 5 (i.e., to use a neighborhood size of 5), and then click the "OK" button to save the new setting. Activate the nearby "Set..." button and locate the "plants_test3.arff" file. Run the classifier and observe the results shown in the "Classifier output" window.

- How many instances are incorrectly classified?
- What is the MAE (mean absolute error) made by the learned classifier?
- What can you infer from the information shown in the Confusion Matrix?
- Visualize the errors made by the learned classifier. In the plot, how can you differentiate between the correctly and incorrectly classified instances? In the plot, how can you see the detailed information of an incorrectly classified instance?