Stylometry System

Background

Stylometry is the study of the unique linguistic styles and writing behaviors of individuals in order to determine authorship. Stylometry can be used to attribute authorship to anonymous or disputed documents, and it has legal as well as academic and literary applications. We are interested in the question of the authorship of email, an area of forensic linguistics.

Stylometry uses statistical analysis, pattern recognition, and artificial intelligence techniques. For features, it typically relies on word frequencies and on patterns in the use of common parts of speech. It might also use what are called "rare pairs", which capture an individual's habits of word collocation. A paper and an MIT thesis describe existing systems. The paper uses Java-based data mining software (C4.5, neural network, and SVM classifiers).
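As a rough illustration of the kind of features mentioned above, the following Python sketch computes relative word frequencies and counts adjacent word pairs. The function names (tokenize, word_frequencies, adjacent_pairs) are hypothetical, and treating adjacent word pairs as a stand-in for collocation habits is an assumption made here, not necessarily how the cited systems define "rare pairs".

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text):
    """Map each word to its share of all word tokens in the text."""
    tokens = tokenize(text)
    if not tokens:
        return {}
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}


def adjacent_pairs(text):
    """Count each pair of neighboring words (assumed proxy for collocation)."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))
```

Pairs that one author uses repeatedly but that are uncommon across other authors would be the candidates of interest; deciding what counts as "rare" needs a reference collection that this sketch does not include.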

Project

We will continue developing and testing the existing system. The system will consist of three components: data collection, feature extraction, and classification. For data we will use plaintext emails; each team member will create ten different emails, each 100-200 words long, on ten different subjects (what you like/don't like to eat, what you like/don't like to do, what type of schoolwork you like/dislike, etc.). The feature extraction component calculates feature measurements, such as the average word length and the letter frequencies. For classification we will initially use a simple technique called nearest neighbor, which assigns an email of unknown authorship to the author of the known email whose feature measurements are closest to its own.
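The feature extraction and classification components can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the project's actual implementation: the function names are hypothetical, the feature vector contains only the two example measurements named above (average word length and the 26 letter frequencies), and Euclidean distance is assumed as the distance measure.

```python
import math
import string


def extract_features(text):
    """Return [average word length, freq of 'a', ..., freq of 'z']."""
    # Word length counts alphabetic characters only; tokens without letters are skipped.
    lengths = [sum(ch.isalpha() for ch in token) for token in text.split()]
    lengths = [n for n in lengths if n > 0]
    avg_word_len = sum(lengths) / len(lengths) if lengths else 0.0

    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    total = len(letters) or 1  # avoid division by zero on empty text
    letter_freqs = [letters.count(ch) / total for ch in string.ascii_lowercase]
    return [avg_word_len] + letter_freqs


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor(training, query_text):
    """Return the author of the training email closest to query_text.

    training is a list of (author, email_text) pairs.
    """
    query = extract_features(query_text)
    author, _ = min(
        ((name, euclidean(extract_features(text), query)) for name, text in training),
        key=lambda pair: pair[1],
    )
    return author
```

Note that the features here are unscaled, so average word length (typically around 4 to 5) outweighs the letter frequencies (each below 1) in the distance. Normalizing each feature to a common range before computing distances is one common remedy.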