Projects Overview

All the projects this semester are related to pattern recognition and data mining. A pattern recognition system typically consists of three components: data collection, feature extraction, and classification. The data collection component captures the raw data of the object; for face recognition, for example, the raw data might be a photo. Operating on the raw data, the feature extraction component calculates feature measurements, such as eye color, hair color, eye spacing, hair texture, nose size relative to face size, nose shape, ear size relative to face size, and ear shape. Operating on these feature measurements, usually referred to collectively as the feature vector, the classifier decides which class to place the object into. For example, if the system is trying to identify individuals from a population of n people, then there are n classes; if the system is trying to distinguish males from females, then there are two classes. The difficulty of the problem usually increases with the number of classes. A pattern recognition system must be trained before it is usable, so the data are usually separated into two parts: one for training the system to create decision boundaries in feature space, and one for testing the system to determine its accuracy.
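The training/testing separation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `split_samples` and `accuracy`, the 50/50 split, and the toy two-class data are all hypothetical and are not prescribed by the course.

```python
import random

def split_samples(samples, train_fraction=0.5, seed=0):
    """Shuffle labeled samples and split them into training and test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(classify, test_samples):
    """Fraction of test samples whose predicted class matches the true class."""
    if not test_samples:
        raise ValueError("no test samples")
    correct = sum(1 for features, label in test_samples
                  if classify(features) == label)
    return correct / len(test_samples)

# Toy data (hypothetical): each sample is (feature vector, class label).
samples = [([0.1, 0.2], "A"), ([0.2, 0.1], "A"),
           ([0.9, 0.8], "B"), ([0.8, 0.9], "B")]
train, test = split_samples(samples, train_fraction=0.5)
```

Any classifier built from `train` can then be passed to `accuracy` as a function that maps a feature vector to a class label, so that the reported accuracy comes only from samples the system did not see during training.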

Most of the projects this semester concern biometrics. The common biometrics include face, iris, fingerprint, and voiceprint. This semester's biometric projects cover the less common biometrics of mouse movements, stylometry, and keystroke patterns. We have chosen the less common biometrics because it is easier to perform new and unique research and to publish in these areas. All biometrics have authentication and identification applications. In authentication (verification) applications, a user is either accepted or rejected; the response is binary: yes, you are the person you claim to be, or no, you are not. In identification applications, a user is identified from within a population of, say, n users (a one-of-n response), which is usually the more difficult problem.
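The difference between the two response types can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each enrolled user is represented by a single template feature vector and that similarity is measured by Euclidean distance; the names `authenticate`, `identify`, and `templates`, the threshold value, and the sample data are hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def authenticate(sample, claimed_id, templates, threshold):
    """Binary response: accept only if the sample is close to the claimed user's template."""
    return distance(sample, templates[claimed_id]) <= threshold

def identify(sample, templates):
    """One-of-n response: return the enrolled user whose template is closest."""
    return min(templates, key=lambda user_id: distance(sample, templates[user_id]))

# Hypothetical enrolled users and a new sample to be checked.
templates = {"alice": [0.2, 0.4], "bob": [0.8, 0.1]}
sample = [0.25, 0.35]

accepted_as_alice = authenticate(sample, "alice", templates, threshold=0.2)
accepted_as_bob = authenticate(sample, "bob", templates, threshold=0.2)
best_match = identify(sample, templates)
```

Authentication compares the sample against one claimed identity only, while identification must compare it against every enrolled user, which is one reason the one-of-n problem tends to get harder as n grows.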

This semester we will continue projects on the mouse movement, stylometry, and keystroke biometrics. These continued projects will take new directions compared to the earlier ones. In the earlier projects we attacked the identification problem to establish the feasibility of each biometric, reasoning that a good result (reasonably high recognition accuracy) on identification would be more significant than one on the easier authentication problem. Also, for ease of implementation, we used a simple classification technique called nearest neighbor. This semester we will focus more on the authentication problem, which is usually considered the more important one and the one for which comparable evaluation statistics can be obtained.
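For readers unfamiliar with nearest neighbor classification, the idea is to assign a new feature vector the class of the closest training sample. The following Python sketch assumes Euclidean distance and uses a hypothetical function name and toy data; it is not the code used in the earlier projects.

```python
import math

def nearest_neighbor(query, training_set):
    """Return the label of the training sample closest to query (Euclidean distance)."""
    best_label, best_dist = None, math.inf
    for features, label in training_set:
        d = math.dist(query, features)  # math.dist requires Python 3.8 or later
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label

# Hypothetical training data: several samples per user, (feature vector, user ID).
training_set = [
    ([0.10, 0.20], "user1"),
    ([0.15, 0.25], "user1"),
    ([0.90, 0.80], "user2"),
    ([0.85, 0.95], "user2"),
]
predicted = nearest_neighbor([0.2, 0.2], training_set)
```

The technique needs no explicit training step beyond storing the labeled samples, which is what makes it easy to implement.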

This semester's biometric projects will focus on the front-end system components (data gathering and feature extraction) and will provide the feature data to the Authentication and Data Mining teams for back-end classification processing. The feature vector files should be in the following human-readable text format. The first record identifies the file (type of biometric and other important characteristics of the data), and each subsequent record consists of the ID of the user, the date the biometric sample was taken, the number of features in the feature vector, and the feature vector data. A more precise format may be provided later by the instructor or the Data Mining team. If the date of capture for the earlier-collected data is not known (the most likely case), try to estimate it (say, to within a month) from discussions with the customers or previous team members; if no estimate is possible, use a default of 1/1/1900. System performance results can be reported in the technical papers of both the front-end biometric team and the back-end team obtaining the results.
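Until a more precise format is provided, one possible reading of the record layout above is sketched below in Python. The comma delimiter, the month/day/year date style, the example header text, and the function names `format_record` and `parse_record` are all assumptions made for illustration; only the field order and the 1/1/1900 default come from the description above.

```python
from datetime import date

DEFAULT_DATE = date(1900, 1, 1)  # default from the text for unknown capture dates

def format_record(user_id, features, captured=None):
    """Build one record: user ID, capture date, feature count, then the feature values."""
    captured = captured or DEFAULT_DATE
    date_text = f"{captured.month}/{captured.day}/{captured.year}"
    fields = [str(user_id), date_text, str(len(features))]
    fields += [repr(float(v)) for v in features]
    return ",".join(fields)  # assumed delimiter; user IDs must not contain commas

def parse_record(line):
    """Parse one record and check that the declared feature count matches the data."""
    user_id, date_text, count, *values = line.strip().split(",")
    month, day, year = (int(part) for part in date_text.split("/"))
    features = [float(v) for v in values]
    if int(count) != len(features):
        raise ValueError(f"expected {count} features, found {len(features)}")
    return user_id, date(year, month, day), features

# Hypothetical first record identifying the file, followed by one data record.
header = "keystroke,long-text input"
record = format_record("user01", [0.12, 3.4, 5.0])
```

Storing the feature count in each record lets the back-end teams detect truncated or malformed lines before classification, which is the reason `parse_record` checks it.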

Although we probably cannot undertake all six of these projects this semester, it is anticipated that we will do at least four and possibly five of them.