Stylometry System
Long-Text Experiments

Background

Stylometry (see Wikipedia definition) is the study of the linguistic style and writing behavior unique to an individual, with the goal of determining authorship. It can be used to attribute anonymous or disputed documents to an author, and it has legal, academic, and literary applications. Stylometry has been used to determine, or narrow down, the authorship of historic documents, ransom notes, and other documents in forensic investigations.

Stylometry uses statistical analysis, pattern recognition, and artificial intelligence techniques. Typical features are word frequencies and patterns in the use of common parts of speech. A framework paper and an MIT thesis describe some existing systems.
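
As an illustration of the word-frequency features mentioned above, the following is a minimal Python sketch. It is not the feature set of any system described here: the function name word_frequency_features and the short FUNCTION_WORDS list are assumptions made for the example, and a real system would use a much larger vocabulary.

```python
import re
from collections import Counter

# Hypothetical vocabulary of common English function words, chosen only
# for illustration; it is not taken from any existing stylometry system.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "for"]


def word_frequency_features(text, vocabulary=FUNCTION_WORDS):
    """Return the relative frequency of each vocabulary word in the text."""
    # Lowercase the text and split it into simple alphabetic tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        # An empty text yields an all-zero feature vector.
        return [0.0] * len(vocabulary)
    counts = Counter(tokens)
    # Normalize by text length so that samples of different sizes compare.
    return [counts[word] / total for word in vocabulary]
```

Calling word_frequency_features(sample_text) on each plaintext sample gives one fixed-length vector per sample, which a classifier can then compare across authors. Dividing by the token count is one simple way to keep features comparable between short and long texts.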

This project continues previous projects; see the Research Day 2010 paper, the Research Day 2011 paper, and especially the IJCB2011 Conference Paper.

Project

Biometric systems consist of data collection, feature extraction, and pattern classification. Here, the data are simply plaintext files.

Last semester we developed a reasonably robust Pace University Stylometry Biometric System (PSBS), and Vinnie Monaco is currently enlarging its feature set. The design of the stylometry features is based on the following criteria:

Last semester we used the PSBS in an effort to enhance the Pace University Keystroke Biometric System (PKBS) on the answers entered by students taking online short-answer tests; see the IJCB2011 Conference Paper cited above. However, because those stylometry results were rather poor, this project will focus solely on stylometry and on much longer text input, with the aim of obtaining reasonable accuracy from the PSBS.

We have some long-text samples and expect to obtain more from DPS students and graduates teaching at various institutions. Most of this semester's effort will go into running experiments that measure accuracy (e.g., Equal Error Rate) as a function of text length.
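
For reference, the Equal Error Rate (EER) is the operating point at which the false accept rate equals the false reject rate. The Python sketch below approximates it from two lists of distance scores. It assumes that the classifier produces a distance for each pair of samples and that smaller distances mean "more likely the same author"; the function name equal_error_rate and the threshold sweep are illustrative assumptions, not a description of how the PSBS computes its results.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER from distance scores (smaller = more similar).

    genuine:  distances between samples written by the same author
    impostor: distances between samples written by different authors
    """
    # Use every observed score as a candidate decision threshold.
    thresholds = sorted(set(genuine) | set(impostor))
    best_gap = None
    eer = None
    for t in thresholds:
        # A pair is accepted as same-author when its distance is <= t.
        frr = sum(d > t for d in genuine) / len(genuine)      # false rejects
        far = sum(d <= t for d in impostor) / len(impostor)   # false accepts
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2
    return eer
```

To study accuracy against text length, one option is to truncate every sample to a fixed number of words, recompute the scores, and call equal_error_rate once per length.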