Stylometry is the study of the unique linguistic styles and writing behaviors of individuals
in order to determine authorship.
Stylometry can be used to attribute authorship to anonymous or disputed documents, and
it has legal as well as academic and literary applications.
We are interested in the question of the authorship of email, an area of forensic linguistics.
Stylometry uses statistical analysis, pattern recognition, and artificial intelligence techniques.
As features, stylometry typically uses word frequencies
and patterns in the use of common parts of speech.
It may also use so-called "rare pairs", which capture an individual's habits of word collocation.
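As a rough illustration of these features, the following Python sketch counts word frequencies and adjacent word pairs in a text. The function name word_and_pair_counts is hypothetical, and treating a "pair" as two adjacent words is an assumption made only for this example; the collocation features used in the literature may be defined differently.

```python
import re
from collections import Counter


def word_and_pair_counts(text):
    """Count single words and adjacent word pairs in a text.

    Assumption: a "pair" is two words that appear next to each other.
    """
    # Lowercase the text and keep only runs of letters as words.
    words = re.findall(r"[a-z]+", text.lower())
    word_counts = Counter(words)
    # zip(words, words[1:]) yields each word together with its successor.
    pair_counts = Counter(zip(words, words[1:]))
    return word_counts, pair_counts


if __name__ == "__main__":
    words, pairs = word_and_pair_counts("The cat saw the cat")
    print(words.most_common(3))
    print(pairs.most_common(3))
```

Pairs that one author uses often but other authors use rarely would be candidates for distinguishing features.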
Pattern recognition techniques that have been used include neural networks, genetic algorithms,
and Support Vector Machines (SVM).
We will create a pattern recognition system to identify
the author of arbitrary email using stylometry features.
The system will consist of three components: data collection, feature extraction, and classification.
A paper and
an MIT thesis describe existing systems.
The paper uses a Java-based data mining software package
(with C4.5, neural network, and SVM classifiers).
For data we will use the email gathered from an earlier study of the keystroke biometric.
Because the data from that study is not in simple text form (it includes timing and other information for each keystroke),
the initial step will be to write a short program to convert the data into simple text files.
The feature extraction component calculates features, such as the average word length and the letter frequencies.
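A minimal sketch of such a feature extraction step, in Python, might look like the following. The function name extract_features and the layout of the feature vector (average word length followed by the relative frequencies of the letters a through z) are assumptions for illustration, not a description of the final system.

```python
import re
import string
from collections import Counter


def extract_features(text):
    """Return [average word length, freq of 'a', ..., freq of 'z'].

    Assumption: words are runs of ASCII letters, and letter
    frequencies are relative to the total number of letters.
    """
    words = re.findall(r"[A-Za-z]+", text)
    avg_word_len = sum(len(w) for w in words) / len(words) if words else 0.0

    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters)
    letter_freqs = [
        counts[c] / total if total else 0.0 for c in string.ascii_lowercase
    ]
    return [avg_word_len] + letter_freqs


if __name__ == "__main__":
    vector = extract_features("Please send me the report by Friday.")
    print(len(vector))  # one average plus 26 letter frequencies
    print(vector[0])    # average word length
```

Each email is thus reduced to a fixed-length numeric vector, which is the form a classifier expects.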
Initially, we will use a simple classification technique called nearest neighbor, and
if time permits, we may use more sophisticated classification techniques like SVM.
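The nearest neighbor idea can be sketched in a few lines of Python. This example assumes that each email has already been reduced to a numeric feature vector and that Euclidean distance is the measure of similarity; the function name nearest_neighbor, the author labels, and the sample data are hypothetical.

```python
import math


def nearest_neighbor(sample, training):
    """Return the author of the training vector closest to sample.

    training is a list of (feature_vector, author) pairs.
    Assumption: Euclidean distance is used to compare vectors.
    """
    best_author, best_dist = None, math.inf
    for features, author in training:
        dist = math.dist(sample, features)  # Euclidean distance (Python 3.8+)
        if dist < best_dist:
            best_author, best_dist = author, dist
    return best_author


if __name__ == "__main__":
    # Hypothetical two-feature vectors for two authors.
    training = [
        ([4.1, 0.08], "author_a"),
        ([5.3, 0.12], "author_b"),
    ]
    print(nearest_neighbor([4.3, 0.09], training))
```

In practice the features would probably need to be scaled to comparable ranges first, since otherwise a feature with large values, such as average word length, dominates the distance.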
Time permitting, we might also combine the stylometry identification system with the keystroke biometric
system to measure and study the expected increase in identification accuracy.