Stylometry System

Background

Stylometry is the study of the unique linguistic styles and writing behaviors of individuals in order to determine authorship. Stylometry can be used to attribute authorship to anonymous or disputed documents, and it has legal as well as academic and literary applications. We are interested in the question of the authorship of email, an area of forensic linguistics.

Stylometry uses statistical analysis, pattern recognition, and artificial intelligence techniques. For features, stylometry typically analyzes the text by using word frequencies and identifying patterns in common parts of speech. It might also use what are called "rare pairs", which capture an individual's habits of word collocation. Pattern recognition techniques that have been used include neural networks, genetic algorithms, and Support Vector Machines (SVM).
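As a rough illustration of two of the feature types mentioned above, the following Python sketch counts word frequencies and adjacent word pairs. The function name word_and_pair_counts and the tokenization rule are assumptions made for this example. Counting adjacent pairs is only a simple stand-in for collocation habits, not a definition of "rare pairs".

```python
from collections import Counter
import re


def word_and_pair_counts(text):
    """Return (word_counts, pair_counts) for a piece of text.

    Hypothetical helper: words are lowercase runs of letters and
    apostrophes, and a "pair" is simply two adjacent words.
    """
    words = re.findall(r"[a-z']+", text.lower())
    word_counts = Counter(words)
    # zip(words, words[1:]) yields each word together with its successor.
    pair_counts = Counter(zip(words, words[1:]))
    return word_counts, pair_counts


if __name__ == "__main__":
    wc, pc = word_and_pair_counts("The cat saw the cat")
    print(wc.most_common(2))
    print(pc.most_common(2))
```

Pairs that one author uses repeatedly but that are uncommon across all authors would be candidates for distinguishing features. How to set that threshold is left open here.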

Project

We will create a pattern recognition system to identify the author of arbitrary email using stylometry features. The system will consist of three components: data collection, feature extraction, and classification. A paper and an MIT thesis describe existing systems. The paper uses Java-based data mining software (C4.5, neural network, and SVM classifiers).

Data Collection

For data we will use the email gathered from an earlier study of the keystroke biometric. Because the data from that study is not in simple text form (it includes timing and other information for each keystroke), the initial step will be to write a short program to convert the data into simple text files.
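The format of the keystroke data is not specified here, so the following Python sketch is only a guess at what the conversion program might look like. It assumes, hypothetically, that each record is a (key, press time, release time) tuple, that printable keys are logged as the character they produce, and that the special keys are named "Backspace" and "Enter". The real data may differ on all of these points.

```python
def keystrokes_to_text(events):
    """Replay keystroke records into plain text, discarding the timing.

    events: iterable of (key, press_ms, release_ms) tuples.
    This record layout is an assumption, not the actual study format.
    """
    chars = []
    for key, _press_ms, _release_ms in events:
        if key == "Backspace":
            # Undo the most recently typed character, if any.
            if chars:
                chars.pop()
        elif key == "Enter":
            chars.append("\n")
        elif len(key) == 1:
            # A single character is treated as printable text.
            chars.append(key)
        # Any other key name (Shift, Ctrl, arrows, ...) is ignored.
    return "".join(chars)


if __name__ == "__main__":
    sample = [("H", 0, 80), ("i", 100, 170), ("x", 200, 260),
              ("Backspace", 300, 350), ("!", 400, 450)]
    print(keystrokes_to_text(sample))
```

This sketch does not handle cursor movement or text selection, which would matter if the original data records edits made in the middle of a message.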

Feature Extraction

The feature extraction component calculates stylometric features from each text file, such as average word length and letter frequencies.
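A minimal Python sketch of these two features is shown below. The function name extract_features, the word-matching rule, and the choice to return a single 27-element vector (average word length followed by the relative frequencies of a through z) are assumptions for illustration, not a fixed design.

```python
from collections import Counter
import re
import string


def extract_features(text):
    """Return [average word length, freq('a'), ..., freq('z')].

    Hypothetical feature layout: one number for average word length,
    then 26 relative letter frequencies (case-insensitive).
    """
    words = re.findall(r"[A-Za-z']+", text)
    avg_word_length = sum(len(w) for w in words) / len(words) if words else 0.0

    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters)
    letter_freqs = [counts[c] / total if total else 0.0
                    for c in string.ascii_lowercase]

    return [avg_word_length] + letter_freqs


if __name__ == "__main__":
    features = extract_features("aa bbb")
    print(features[0])    # average word length: 2.5
    print(features[1:3])  # relative frequencies of 'a' and 'b'
```

Using relative rather than absolute letter counts keeps the features comparable between short and long messages.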

Classification

Initially, we will use a simple classification technique called nearest neighbor, which assigns an unknown email to the author of the closest known sample in feature space. If time permits, we may use more sophisticated classification techniques such as SVM.
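The nearest neighbor technique can be sketched in a few lines of Python. The sketch assumes that each email has already been reduced to a numeric feature vector and that Euclidean distance is the similarity measure. Both are illustrative choices, and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor(training, query):
    """Return the author label of the training vector closest to query.

    training: list of (feature_vector, author) pairs.
    Returns None if training is empty.
    """
    best_author, best_dist = None, float("inf")
    for vector, author in training:
        dist = euclidean(vector, query)
        if dist < best_dist:
            best_author, best_dist = author, dist
    return best_author


if __name__ == "__main__":
    # Toy two-dimensional vectors; real vectors would come from feature extraction.
    training = [([0.0, 0.0], "alice"), ([10.0, 10.0], "bob")]
    print(nearest_neighbor(training, [1.0, 2.0]))  # alice
    print(nearest_neighbor(training, [9.0, 8.0]))  # bob
```

Because features such as average word length and letter frequency lie on different scales, normalizing each feature before computing distances would likely be needed in practice.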

If time permits, we might combine the stylometry identification system with the keystroke biometric system to determine and study the expected increase in identification accuracy.