Stylometry System
Long-Text Experiments

Background

Stylometry (see Wikipedia definition) is the study of the unique linguistic styles and writing behaviors of individuals in order to determine authorship. It can be used to attribute authorship to anonymous or disputed documents, and it has legal as well as academic and literary applications. Stylometry has been used to determine, or narrow the possibilities for, the authorship of historic documents, ransom notes, and other documents of forensic interest.

Stylometry uses statistical analysis, pattern recognition, and artificial intelligence techniques. The input data are simply plaintext files. For features, stylometry typically analyzes the text by using word frequencies and identifying patterns in common parts of speech. A framework paper and MIT Thesis describe some existing systems.

Last semester we developed a reasonably robust Pace University Stylometry Biometric System (PSBS), and Vinnie Monaco is currently enlarging its feature set. The design of the stylometry features is based on the following criteria:

The 2008 federal Higher Education Opportunity Act requires institutions of higher learning to make greater access control efforts to assure that students of record are those actually accessing the systems and taking exams in online courses, and to adopt identification technologies as they become more ubiquitous. To meet these needs, keystroke and stylometry biometrics were investigated at Pace University with the aim of developing a robust system to authenticate (verify) online test takers. The performance of the stylometry system on online tests, however, was rather poor, and simply fusing the keystroke and stylometry systems by combining their features did not improve on the performance of the keystroke system alone. This work was described in last semester's technical paper from Research Day 2011 and extended in the IJCB2011 paper, to be presented at the International Joint Conference on Biometrics in October 2011.

Project

Everything related to last semester's stylometry project (user guides for input system and feature extractor, technical papers) is at Spring 2011 and Revised data collected.

Because the stylometry results were rather poor last semester, this project will focus solely on stylometry and on much longer text input with the aim of obtaining reasonable accuracy on the PSBS.

This semester we will first find books on the Internet. For example, see Project Gutenberg, where you can cut and paste an HTML book into Notepad to get it into text form (.txt). We will start with 30 authors and 10 writing samples from each author, 5 for training and 5 for testing.

Most of this semester's effort will be running experiments to obtain accuracy (e.g., Equal Error Rate) as a function of text length or as a function of population size (number of authors). We would like to run experiments with samples of different word lengths - for example, the first 250 words of each of the 300 samples, the first 500 words of each sample, etc. We will likely

We have some long-text samples and expect to obtain more from DPS students and DPS graduates teaching at various institutions.
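The Equal Error Rate named above as the accuracy measure can be illustrated with a minimal Python sketch. It assumes the system produces one similarity score per comparison, that a higher score means "more likely the same author", and that scores are available as two plain lists (genuine and impostor comparisons). The function names and the score format are hypothetical and are not taken from the PSBS.

```python
def error_rates(genuine, impostor, threshold):
    """Return (FAR, FRR) at a threshold; a score >= threshold counts as a match.

    Assumption: higher scores mean the two samples are more likely
    to come from the same author.
    """
    # False reject: a genuine (same-author) comparison scored below the threshold.
    frr = sum(s < threshold for s in genuine) / len(genuine)
    # False accept: an impostor (different-author) comparison scored at or above it.
    far = sum(s >= threshold for s in impostor) / len(impostor)
    return far, frr


def equal_error_rate(genuine, impostor):
    """Approximate the EER by sweeping every observed score as a threshold.

    Returns (eer, threshold), where eer is the mean of FAR and FRR at the
    threshold where the two rates are closest.
    """
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = error_rates(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    genuine_scores = [0.9, 0.8, 0.7, 0.4]
    impostor_scores = [0.1, 0.2, 0.3, 0.6]
    eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
    print(f"EER = {eer:.2f} at threshold {threshold}")
```

Running this once per sample length (or per population size) and plotting the resulting EER values would give the accuracy curves described above. With few test samples the rate moves in coarse steps, so small differences between settings should be read cautiously.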

Fast Agile XP Deliverables

We will use the agile methodology, particularly Extreme Programming (XP) which involves small releases and fast turnarounds in roughly two-week iterations. Many of these deliverables can be done in parallel by different members or subsets of the team. The following is the current list of deliverables (ordered by the date initiated, deliverable modifications marked in red, initiated date marked in bold red if programming involved, completion date and related comments marked in green, pseudo-code marked in blue):

  1. 9/22 Work with your customers to determine the type or types of books to use. For example, do we want all the books to be of the same type, like science fiction or romance? Or do we want books of a variety of types and from different time periods to facilitate distinguishing among them? Then find 30 authors who have each written 10 books. This yields 300 books, half for training and half for testing PSBS.
  2. 9/22 Create a program that reads in a book text file and outputs samples of different lengths - the first 250, 500, 750, 1000, 1500, 2000, 2500, 3000, 4000, 5000 words (10 different word lengths).
  3. 9/22 (1-2 weeks) Experiment 1: using the 250-word samples, train PSBS on 5 samples from each of the 30 authors and test PSBS on the other 5 samples from each of the 30 authors to obtain a Receiver Operating Characteristic (ROC) Curve. Work with Vinnie Monaco to learn how to run the various programs of the system.
  4. 9/22 (1-2 weeks) Run similar experiments using the other sample lengths.
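The sample-extraction program in deliverable 2 could look something like the following Python sketch. It assumes that a "word" is any whitespace-delimited token and that each book is a UTF-8 plaintext file. The function names and the output file naming scheme are hypothetical. Splitting on whitespace also discards the original line and paragraph breaks, which may or may not be acceptable depending on the features PSBS extracts.

```python
from pathlib import Path

# The ten sample lengths (in words) listed in deliverable 2.
SAMPLE_LENGTHS = [250, 500, 750, 1000, 1500, 2000, 2500, 3000, 4000, 5000]


def make_samples(text, lengths=SAMPLE_LENGTHS):
    """Map each requested length n to the first n words of text.

    Lengths that the text is too short to fill are skipped, so every
    returned sample has exactly n words.
    """
    words = text.split()  # assumption: words are whitespace-delimited tokens
    return {n: " ".join(words[:n]) for n in lengths if len(words) >= n}


def write_samples(book_path, out_dir, lengths=SAMPLE_LENGTHS):
    """Read one book file and write one sample file per length.

    Output files are named <book stem>_<n>.txt (a hypothetical convention).
    Returns the list of paths written.
    """
    book_path = Path(book_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    text = book_path.read_text(encoding="utf-8", errors="replace")
    written = []
    for n, sample in make_samples(text, lengths).items():
        target = out_dir / f"{book_path.stem}_{n}.txt"
        target.write_text(sample, encoding="utf-8")
        written.append(target)
    return written
```

Because each sample is a prefix of the next longer one, the experiments at different lengths see the same opening text and differ only in how much of it is used. Downloaded book files often begin with front matter such as title pages or licence text, so it may be worth trimming that before sampling so that the first 250 words are the author's own prose.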
Your first task is to obtain data in text readable form for input to the PSBS. Books are available from so go to that site and access a book. I just did and was able to

Step 1. For experimental purposes we can start with 30 authors and 10 samples (books) from each author, 5 for training and 5 for testing. Therefore, find 30 authors that have written 10 books. We now have 300 books.

Step 2. Here comes the hard part. So we need a program that begins to read a book text file and outputs samples of different lengths.

Step 3.