Stylometry System


For general background information, see Overview of Biometric Projects.

Stylometry is the study of the unique linguistic styles and writing behaviors of individuals in order to determine authorship. Stylometry can be used to attribute authorship to anonymous or disputed documents, and it has legal as well as academic and literary applications. We are interested in the question of the authorship of email, an area of forensic linguistics.

Stylometry uses statistical analysis, pattern recognition, and artificial intelligence techniques. For features, stylometry typically analyzes a text's word frequencies and identifies patterns in its use of common parts of speech. It might also use what are called "rare pairs", which capture an individual's habits of word collocation. A paper and an MIT thesis describe some existing systems.
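As a rough illustration of the word-frequency and collocation features mentioned above, the following Python sketch counts relative word frequencies and adjacent word pairs. It is not taken from the existing system: the function names, the tokenization rule, and the choice to treat a collocation as two adjacent words are all assumptions made for the example. Deciding which pairs count as "rare" would additionally require a reference corpus, which is not shown here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens.

    Assumption: a word is a run of ASCII letters, optionally with
    internal apostrophes (so "don't" stays one token).
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def relative_word_frequencies(tokens):
    """Map each distinct word to its share of all tokens."""
    total = len(tokens)
    if total == 0:
        return {}
    return {word: n / total for word, n in Counter(tokens).items()}


def adjacent_pair_counts(tokens):
    """Count each pair of neighbouring words.

    Note: pairs are formed across sentence boundaries, because
    punctuation is discarded during tokenization.
    """
    return Counter(zip(tokens, tokens[1:]))


if __name__ == "__main__":
    sample = "The cat saw the dog. The cat ran."
    words = tokenize(sample)
    print(relative_word_frequencies(words))
    print(adjacent_pair_counts(words).most_common(3))
```

Comparing these counts for one author against counts from other authors is one way such features could feed a classifier, though the technical papers referenced on this page should be consulted for what the existing system actually does.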


We will continue to develop and test the existing system; see the Stylometry Technical Paper (fall 2007) and associated slides. Also see the Authentication Technical Paper (fall 2007) and associated slides.

Data collection simply involves gathering plaintext files. The feature extraction component already exists, but it is written in C# and should probably be rewritten in Java. It calculates measurements such as the average word length and the letter frequencies, and additional features might be added.
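To make the two measurements named above concrete, here is a minimal Python sketch that computes the average word length and the letter frequencies of a plaintext sample and joins them into a single feature vector. It is an illustration only, not the existing C# component: the function names, the whitespace-based word splitting, the restriction to the 26 ASCII letters, and the ordering of the vector are all assumptions.

```python
import string
from collections import Counter


def average_word_length(text):
    """Mean length of the words in text, ignoring surrounding punctuation.

    Assumption: words are separated by whitespace.
    """
    words = [w.strip(string.punctuation) for w in text.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


def letter_frequencies(text):
    """Relative frequency of each letter a-z, as a list of 26 floats."""
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    total = len(letters)
    counts = Counter(letters)
    return [counts[c] / total if total else 0.0
            for c in string.ascii_lowercase]


def feature_vector(text):
    """Hypothetical layout: average word length, then the 26 letter frequencies."""
    return [average_word_length(text)] + letter_frequencies(text)


if __name__ == "__main__":
    vector = feature_vector("I like tea. I do not like coffee.")
    print(len(vector), vector[:3])
```

Additional features could be appended to the list returned by feature_vector, provided every sample uses the same ordering.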

This semester's project will have two primary focuses. First, we will become familiar with the system and collect as much additional data as possible, including data from each team member. Each participant (subject) will create ten different emails, each 100-200 words long and each on a different subject (e.g., what you like/don't like to eat, what you like/don't like to do, what type of schoolwork you like/dislike, etc.). To increase the accuracy of the system, we would also like to increase the number of feature measurements this semester, so some programming will be required. Second, following what we did last semester, we will format the feature-vector data for ease of processing by the Biometric Authentication System team; see Feature Data Format.
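The collection requirements described above (ten emails per participant, each 100-200 words) could be checked automatically before samples are added to the data set. The Python sketch below is one hypothetical way to do that. The function names are invented for the example, and it assumes that words are separated by whitespace and that the 100-200 range is inclusive.

```python
def word_count(text):
    """Number of whitespace-separated words in text."""
    return len(text.split())


def check_subject_samples(emails, required_count=10,
                          min_words=100, max_words=200):
    """Return a list of problems found in one participant's emails.

    An empty list means the samples meet the stated requirements.
    The default limits mirror the collection guidelines; the inclusive
    bounds are an assumption.
    """
    problems = []
    if len(emails) != required_count:
        problems.append(
            f"expected {required_count} emails, found {len(emails)}")
    for index, text in enumerate(emails, start=1):
        n = word_count(text)
        if not min_words <= n <= max_words:
            problems.append(
                f"email {index} has {n} words "
                f"(expected {min_words}-{max_words})")
    return problems


if __name__ == "__main__":
    # Placeholder texts stand in for real emails read from plaintext files.
    emails = [" ".join(["word"] * 150)] * 9 + ["too short"]
    for problem in check_subject_samples(emails):
        print(problem)
```

This check covers only the count and length of the samples; whether each email addresses a different subject still has to be verified by hand.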