Stylometry System
Long-Text Experiments

Background

Stylometry (see Wikipedia definition) is the study of individuals' unique linguistic styles and writing behaviors in order to determine authorship. Stylometry can be used to attribute authorship to anonymous or disputed documents, and it has legal as well as academic and literary applications. It has been used to determine, or narrow the possibilities of, the authorship of historic documents, ransom notes, and other documents in forensic settings.

Stylometry uses statistical analysis, pattern recognition, and artificial intelligence techniques. The input data are simply plaintext files. For features, stylometry typically analyzes the text by using word frequencies and identifying patterns in common parts of speech. A framework paper and MIT Thesis describe some existing systems.
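
As a concrete illustration of the word-frequency style of feature, the short Python sketch below computes relative frequencies of a handful of function words from a plaintext file. The particular word list and the normalization are illustrative assumptions, not the actual PSBS feature set.

    from collections import Counter
    import re

    # A small, illustrative list of function words; the real PSBS feature set is larger.
    FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "is", "was"]

    def word_frequency_features(path):
        """Return relative frequencies of selected function words in a plaintext file."""
        with open(path, encoding="utf-8", errors="ignore") as f:
            words = re.findall(r"[a-z']+", f.read().lower())
        counts = Counter(words)
        total = len(words) or 1
        return [counts[w] / total for w in FUNCTION_WORDS]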

Last semester we developed a reasonably robust Pace University Stylometry Biometric System (PSBS), and Vinnie Monaco is currently enlarging its feature set. The motivation for developing the system is as follows:

The 2008 federal Higher Education Opportunity Act requires institutions of higher learning to make greater access-control efforts, adopting identification technologies as they become more ubiquitous, to assure that the students of record are the ones actually accessing the systems and taking the exams in online courses. To meet these needs, keystroke and stylometry biometrics were investigated at Pace University toward developing a robust system to authenticate (verify) online test takers. The performance of the stylometry system on online tests, however, was rather poor, and simply fusing the keystroke and stylometry systems by combining their features did not boost the performance of the keystroke system alone. This work has been described in last semester's technical paper from Research Day 2011 and extended in the IJCB2011 paper to be presented at the International Joint Conference on Biometrics in October 2011.

Everything related to last semester's stylometry project (user guides for the input system and feature extractor, technical papers) is at Spring 2011 and Revised data collected.

Project

Because the stylometry results were rather poor last semester, this project will focus solely on stylometry and on much longer text input with the aim of obtaining reasonable accuracy on the PSBS.

This semester we will first find books on the Internet, for example from Project Gutenberg, where an HTML book can be cut-and-pasted into Notepad to get it into text form (.txt). We will start with 30 authors and 10 writing samples from each author, 5 for training and 5 for testing.
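
As an alternative to manual cut-and-paste, a short script can strip the HTML tags from a book that has already been saved as an HTML file. The sketch below uses Python's standard html.parser module; the file names in the example are hypothetical.

    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collect the text of an HTML document, ignoring tags, scripts, and styles."""
        def __init__(self):
            super().__init__()
            self.parts = []
            self.skip = 0
        def handle_starttag(self, tag, attrs):
            if tag in ("script", "style"):
                self.skip += 1
        def handle_endtag(self, tag):
            if tag in ("script", "style") and self.skip:
                self.skip -= 1
        def handle_data(self, data):
            if not self.skip:
                self.parts.append(data)

    def html_book_to_txt(html_path, txt_path):
        parser = TextExtractor()
        with open(html_path, encoding="utf-8", errors="ignore") as f:
            parser.feed(f.read())
        with open(txt_path, "w", encoding="utf-8") as f:
            f.write(" ".join(" ".join(parser.parts).split()))

    # Example usage (file names are hypothetical):
    # html_book_to_txt("book.html", "book.txt")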

Most of this semester's effort will be spent running experiments to obtain accuracy (e.g., Equal Error Rate) as a function of text length or as a function of population size (number of authors). We would like to run experiments with samples of different word lengths - for example, the first 250 words of each of the 300 samples, the first 500 words of each sample, etc.
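
For reference, the Equal Error Rate can be approximated from a set of genuine (same-author) and impostor (different-author) match scores by sweeping a decision threshold over the observed scores. The sketch below is an illustrative assumption about how the metric could be computed, not the PSBS scoring code itself; it assumes higher scores indicate a stronger match.

    def equal_error_rate(genuine_scores, impostor_scores):
        """Approximate the EER: the error rate where false accepts equal false rejects."""
        best_gap, eer = None, None
        for t in sorted(set(genuine_scores) | set(impostor_scores)):
            frr = sum(s < t for s in genuine_scores) / len(genuine_scores)     # false reject rate
            far = sum(s >= t for s in impostor_scores) / len(impostor_scores)  # false accept rate
            gap = abs(far - frr)
            if best_gap is None or gap < best_gap:
                best_gap, eer = gap, (far + frr) / 2
        return eer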

After experimenting with texts from books, we might experiment with student essays from English classes or other student-generated material. We have some long-text student samples and expect to obtain more from DPS students and DPS graduates teaching at various institutions.

Fast Agile XP Deliverables

We will use the agile methodology, particularly Extreme Programming (XP) which involves small releases and fast turnarounds in roughly two-week iterations. Many of these deliverables can be done in parallel by different members or subsets of the team.

The following is the current list of deliverables (ordered by the date initiated, initiated date marked in bold red if programming involved, deliverable modifications marked in red, completion date and related comments marked in green, pseudo-code marked in blue):

  1. 9/22 10/25 Find 30 authors that have each written 10 books (e.g., from Project Gutenberg). To facilitate distinguishing among the authors we will use a variety of book types, say six different book types (science fiction, romance, etc.) and five authors of each type for a total of 30 authors. But so as not to make it too easy, all the books will be reasonably contemporary, written within the last 50 years. We now have 300 books (30 authors and 10 books from each author), half for training and half for testing PSBS.
  2. 9/22 10/25 Create a program that reads in a book text file and outputs samples of different lengths - the first 250, 500, 750, 1000, 1500, 2000, 2500, 3000, 4000, 5000 words (10 different word lengths). A sketch of such a program appears after this list.
  3. 9/22 10/25 Experiment 1: using the 250-word samples, train PSBS on 5 samples from each of the 30 authors and test PSBS on the other 5 samples from each of the 30 authors to obtain a Receiver Operating Characteristic (ROC) Curve. Work with Vinnie Monaco to learn how to run the various programs of the system.
  4. 9/22 10/25 Run similar experiments using the other sample lengths.
  5. 11/21 Run a 10K-word experiment similar to the others. Include 250, 500, 1000, 2000, 5000, and 10000 word ROC curves on one graph.
  6. 11/21 Choose 15 of the 30 authors and run a 15-author experiment, and for comparison you could use the other 15 authors for another 15-author experiment. Performance of the 15-author experiments will likely increase over that of the 30-author experiment because performance typically increases as the number of users decreases. Include only 2000, 5000, and 10000 word ROC curves on one graph.
  7. 11/21 Rerun the previous experiment but this time add all 10 samples from the other 15 authors to the training set. This additional training, even though the samples are not from the test authors, will likely increase performance (we found that to be the case in previous keystroke experiments). And, as in the previous experiment, for comparison you could run the other 15-author experiment. Include only 2000, 5000, and 10000 word ROC curves on one graph.
  8. 11/21 All the experiments so far involved strong training -- that is, the system was trained on samples from the test subjects. Finally, and only if time permits, you could try a weak training experiment -- train on all 10 samples from 15 authors and test on the 10 samples of the other 15. Performance will likely decrease from that of the 15-author, strong-training experiment. However, because you are training on more samples from each author, you could conceivably get better performance, so it would be interesting to see what happens. Include only 2000, 5000, and 10000 word ROC curves on one graph.
  9. 12/1 Include a table of recommended additional features in the technical paper.
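
A minimal sketch of the sample-extraction program described in deliverable 2, assuming each book is a plain .txt file, that whitespace-separated tokens count as words, and that the output file naming is only illustrative:

    import sys

    SAMPLE_LENGTHS = [250, 500, 750, 1000, 1500, 2000, 2500, 3000, 4000, 5000]

    def write_samples(book_path):
        """Write the first N words of a book to a separate file for each sample length."""
        with open(book_path, encoding="utf-8", errors="ignore") as f:
            words = f.read().split()
        stem = book_path.rsplit(".", 1)[0]
        for n in SAMPLE_LENGTHS:
            if len(words) < n:
                break  # the book is too short for this and any longer lengths
            with open(f"{stem}_{n}w.txt", "w", encoding="utf-8") as out:
                out.write(" ".join(words[:n]))

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            write_samples(path)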