Final Project

Important Dates:

Project proposal due by August 10.
Final project paper due August 24.

Overview

The final project will give you the opportunity to use the techniques covered in the course to:

Requirements

Your final project must include the application of three or more data-mining techniques to a dataset that you choose. In addition, you must write a clear and compelling presentation of the results that you obtain, both from the data mining and any other analysis that you perform.

Choosing a Dataset and a Problem to Solve

You should begin by selecting a dataset to analyze. Each student in the class must choose a different dataset. Possible sources of datasets include:

You are welcome to choose any dataset that interests you and that has enough data to enable a meaningful analysis.

In making your choice, you should be sure to consider what problem or problems you would be able to solve by employing data mining on the dataset. In other words, you should ask yourself: How could I use data mining to answer one or more questions about this dataset?

The problems/questions will typically be of two basic types:

  1. classification/estimation: Is it possible to determine, predict, or estimate some attribute in the dataset, based on the values of other attributes? The medical-diagnosis and credit-card application rating problems that we covered in lecture are examples of this type of problem. If you choose a problem of this type, you will employ some type of classification learning and/or numeric estimation.
  2. finding associations: Are there non-obvious associations or relationships between attributes in the dataset? Market-basket analysis (e.g., finding products that customers tend to purchase together) is one example of this type of problem. If you choose a problem of this type, you will employ some type of association learning.
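
To make the second type of problem more concrete, here is a minimal Python sketch of two measures commonly used to rank association rules: support and confidence. It is only an illustration. The function names and the `baskets` data are hypothetical, and in the project itself Weka will do this work for you.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that
    also contain `consequent`. Assumes the antecedent occurs at least once."""
    antecedent = set(antecedent)
    both = antecedent | set(consequent)
    n_antecedent = sum(antecedent <= t for t in transactions)
    n_both = sum(both <= t for t in transactions)
    return n_both / n_antecedent


# Hypothetical market baskets, one set of products per customer visit.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

print(f"support(bread, milk)      = {support(baskets, ['bread', 'milk']):.2f}")
print(f"confidence(bread -> milk) = {confidence(baskets, ['bread'], ['milk']):.2f}")
```

In this made-up data, bread and milk appear together in 3 of the 5 baskets (support 0.60), and 3 of the 4 baskets that contain bread also contain milk (confidence 0.75).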

Important notes:

You will use the Weka data-mining software to perform the actual data mining. Don't worry if you are uncertain at this point about the particular algorithms that you will use. All that matters for now is that you have a sense of the general data-mining approach that you will employ.

It will almost certainly be necessary to transform the dataset in some way before performing data mining. You should consider what transformations will be needed on the datasets that you are considering.
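
As one example of such a transformation, the sketch below discretizes a numeric attribute into equal-width bins, which can be useful when an algorithm expects nominal rather than numeric values. This is only one of many possible transformations and may not be the one your dataset needs. The function name, the bin labels, and the `ages` values are assumptions made for illustration.

```python
def equal_width_bins(values, num_bins=3):
    """Replace each numeric value with a nominal label: bin1, bin2, ..."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    labels = []
    for v in values:
        if width == 0:
            # All values are identical, so everything goes in one bin.
            index = 0
        else:
            # The maximum value would compute to index num_bins,
            # so clamp it into the last bin.
            index = min(int((v - lo) / width), num_bins - 1)
        labels.append(f"bin{index + 1}")
    return labels


# Hypothetical values of a numeric "age" attribute.
ages = [20, 30, 40, 50, 60, 80]
print(equal_width_bins(ages))
```

Here the range 20 to 80 is cut into three bins of width 20, so 20 and 30 fall in bin1, 40 and 50 in bin2, and 60 and 80 in bin3.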

Writing a Project Proposal

Before beginning work on your project, you will need to submit a brief proposal outlining what you intend to do. This will allow us to make sure that you are on the right track, and to give you some initial guidance. Your proposal should include the following:

Your project proposal should be submitted to me by email.

Splitting Your Dataset

Your project will probably include an application of data mining to learn a model that predicts the value of some output or class variable. In order to validate the model or models that you produce, you should separate your dataset into two files. One file -- containing N percent of the examples (where N is typically somewhere between 70 and 90) -- should be used for training, and the remaining 100 - N percent should be used for testing. The testing examples should not be touched until you have developed a model using the training examples and are ready to test it. Note that you should save your dataset as an ARFF file before splitting it.
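
If you would like to script the split, the following Python sketch shows one way to do it. It assumes an ARFF file with one example per line after the `@data` line, and it copies the full header into both the training and the test output so that each can be loaded on its own. The function name, the default of 80 percent, and the sample data are hypothetical choices, not course requirements.

```python
import random


def split_arff(lines, train_percent=80, seed=42):
    """Split the lines of an ARFF file into (train_lines, test_lines).

    Both results start with the complete ARFF header.
    """
    # Everything up to and including the @data line is the header.
    data_index = next(
        i for i, line in enumerate(lines) if line.strip().lower() == "@data"
    )
    header = lines[: data_index + 1]
    # Keep only real examples: skip blank lines and % comment lines.
    examples = [
        line
        for line in lines[data_index + 1:]
        if line.strip() and not line.lstrip().startswith("%")
    ]
    # Shuffle with a fixed seed so that the split can be reproduced.
    random.Random(seed).shuffle(examples)
    cut = round(len(examples) * train_percent / 100)
    return header + examples[:cut], header + examples[cut:]


# Hypothetical dataset with a 4-line header and 10 examples.
arff = """@relation weather
@attribute temperature numeric
@attribute play {yes,no}
@data
85,no
80,no
83,yes
70,yes
68,yes
65,no
64,yes
72,no
69,yes
75,yes
""".splitlines()

train, test = split_arff(arff, train_percent=80)
print(len(train) - 4, "training examples,", len(test) - 4, "test examples")
```

With N = 80 and 10 examples, this yields 8 training examples and 2 test examples. Shuffling before the cut matters when the file is sorted (for example, by class), because otherwise the test set could end up containing only one kind of example. To apply the sketch to a real file, read its lines, call `split_arff`, and write each result to its own `.arff` file.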

Writing the Report

You should submit a written report on your final project that incorporates all of the items mentioned in the Requirements section above. Your report should include at least the following sections:

Your report does not need to be overly long. Just make sure that it is a clear and complete presentation of the steps that you took and the conclusions you reached.

Submitting Your Work

You should email me the following files: