Stylometry System


Stylometry is the study of the unique linguistic styles and writing behaviors of individuals in order to determine authorship. Stylometry can be used to attribute authorship to anonymous or disputed documents, and it has legal as well as academic and literary applications. We are interested in the question of the authorship of email, an area of forensic linguistics.

Stylometry uses statistical analysis, pattern recognition, and artificial intelligence techniques. For features, stylometry typically analyzes word frequencies and patterns in the use of common parts of speech. It might also use what are called "rare pairs", which capture an individual's habits of word collocation. A paper and an MIT thesis describe existing systems. The paper uses Java-based data mining software (C4.5, neural network, and SVM).


We will continue to develop and test the system that was developed and tested previously. The data collection and feature extraction components already exist, and it may not be necessary to modify them. The feature extraction component calculates measurements such as the average word length and the letter frequencies; additional features might be added. The previous team did not implement the nearest neighbor classifier correctly. This might be fixed, but a fix is not essential, since the feature extraction files will be passed to the back-end teams for processing.
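To make the two measurements named above concrete, here is a minimal sketch in Python. It is not the existing feature extraction component: the function name extract_features, the tokenization rule (split on whitespace, strip surrounding punctuation), and the output layout (average word length followed by 26 relative letter frequencies) are all assumptions made for illustration.

```python
from collections import Counter
import string


def extract_features(text):
    """Return [average word length, freq('a'), ..., freq('z')] for one email.

    Hypothetical layout; the existing component may order or define
    its features differently.
    """
    # Split on whitespace and strip punctuation from both ends of each token.
    # Internal punctuation (e.g. the apostrophe in "don't") is kept.
    words = [w.strip(string.punctuation) for w in text.split()]
    words = [w for w in words if w]
    avg_word_len = sum(len(w) for w in words) / len(words) if words else 0.0

    # Relative frequency of each ASCII letter, case-insensitive.
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters)
    letter_freqs = [
        counts[c] / total if total else 0.0 for c in string.ascii_lowercase
    ]

    return [avg_word_len] + letter_freqs


if __name__ == "__main__":
    features = extract_features("I like pasta, but I don't like olives.")
    print(len(features))  # one average plus 26 letter frequencies
```

Relative rather than raw letter counts are used here so that emails of different lengths (100-200 words) remain comparable.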

This semester's project will have two focuses. First, we will become familiar with the system and collect additional data. For data we will use plaintext emails. We will collect data from as many participants (subjects) as possible. Each participant (including each team member) will create ten different emails, each of length 100-200 words, and each one on a different subject (e.g., what you like/don't like to eat, what you like/don't like to do, what type of schoolwork you like/dislike, etc.).

Second, and most importantly, we will format the feature-vector data for ease of processing by other project systems, specifically the Biometric Authentication System and the Data Mining Systems teams.
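The exact file format has not been fixed here, so the following is only a sketch of one plausible layout: a CSV file with one row per email, an author identifier, an email number, and then the feature values. The function name write_feature_csv, the column names (author, email, f1, f2, ...), and the six-decimal formatting are assumptions for illustration and would need to be agreed with the receiving teams.

```python
import csv
import io


def write_feature_csv(samples, out):
    """Write feature vectors as CSV, one row per email.

    samples: list of (author_id, email_id, feature_list) tuples.
    out: any writable text stream.
    The column layout is hypothetical, not an agreed interchange format.
    """
    if not samples:
        return
    n = len(samples[0][2])
    writer = csv.writer(out, lineterminator="\n")
    # Header row: identifiers first, then generic feature column names.
    writer.writerow(["author", "email"] + [f"f{i}" for i in range(1, n + 1)])
    for author, email, feats in samples:
        # Every row must have the same number of features to stay rectangular.
        if len(feats) != n:
            raise ValueError("all feature vectors must have the same length")
        writer.writerow([author, email] + [f"{x:.6f}" for x in feats])


if __name__ == "__main__":
    buf = io.StringIO()
    write_feature_csv([("a01", 1, [4.5, 0.25]), ("a02", 1, [5.1, 0.31])], buf)
    print(buf.getvalue())
```

A flat, rectangular text file of this kind can be read by most data mining tools without custom parsing, which is the main reason for choosing it in this sketch.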

Also, if time permits, we will correctly implement the nearest neighbor algorithm, rerun the previous experiment, possibly improve the method of running experiments, and run a larger experiment by combining the new data with the old.
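As a reference point for that work, a nearest neighbor classifier can be sketched in a few lines. This is an illustration and not the previous team's code: the function name nearest_neighbor, the (label, vector) data layout, and the choice of Euclidean distance are assumptions.

```python
import math


def nearest_neighbor(train, query):
    """Return the label of the training vector closest to query.

    train: list of (label, feature_vector) pairs.
    query: feature vector of the email to attribute.
    Uses Euclidean distance (an assumption; other metrics are possible).
    """
    if not train:
        raise ValueError("training set is empty")
    # Pick the training pair whose vector is nearest to the query.
    best_label, _ = min(train, key=lambda item: math.dist(item[1], query))
    return best_label


if __name__ == "__main__":
    # Toy two-feature vectors; real vectors would come from feature extraction.
    train = [("alice", [4.0, 0.10]), ("bob", [5.5, 0.30])]
    print(nearest_neighbor(train, [4.2, 0.12]))
```

Because Euclidean distance is dominated by features with large numeric ranges, features such as average word length and letter frequencies would normally be scaled to a common range before classification.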