Stylometry System

Background

Stylometry (see Wikipedia definition) is the study of the unique linguistic styles and writing behaviors of individuals in order to determine authorship. Stylometry can be used to attribute authorship to anonymous or disputed documents, and it has legal as well as academic and literary applications. It has been used to determine, or at least narrow the possibilities of, the authorship of historic documents, ransom notes, and other documents examined in forensic work.

Stylometry uses statistical analysis, pattern recognition, and artificial intelligence techniques. For features, stylometry typically analyzes the text by using word frequencies and identifying patterns in common parts of speech. A framework paper and MIT Thesis describe some existing systems.
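As a rough illustration of the word-frequency features mentioned above, the following minimal sketch computes the relative frequency of a few common words in a text. The function name, the short word list, and the tokenizing rule are assumptions made for this example; they are not taken from the project's feature extraction program.

```python
import re
from collections import Counter

# Hypothetical vocabulary of common function words, chosen for illustration only.
# A real stylometry system would typically use a much larger list.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is"]


def word_frequency_features(text, vocabulary=FUNCTION_WORDS):
    """Return the relative frequency of each vocabulary word in the text."""
    # Lowercase the text and keep runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        # An empty text has no words, so every frequency is zero.
        return [0.0] * len(vocabulary)
    counts = Counter(words)
    return [counts[word] / total for word in vocabulary]
```

Each text sample becomes a fixed-length vector of numbers, one per vocabulary word, which is the kind of input a pattern classifier expects.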

This project continues previous projects. See the associated Research Day 2010 paper entitled Stylometry System - Use Cases and Feasibility Study, and especially last semester's technical paper.

Project

Biometric systems consist of data collection, feature extraction, and pattern classification. Here, the data are simply plaintext files, and you will use the text data that corresponds to similarly captured keystroke data.

The design of the stylometry features is based on the following criteria:

The first major objective of this work is to measure the authentication accuracy of the stylometry system and compare it to that of the keystroke system. The second major objective is to combine the keystroke and stylometry systems so that the combined system is more accurate than either the standalone keystroke system or the standalone stylometry system.
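One possible way to approach both objectives is sketched below: a helper that measures accuracy from match scores, and a weighted-sum rule that fuses the two systems' scores. This is only an assumption for illustration. The project description does not say how the systems will be combined, and the function names, the 0-to-1 score range, the threshold, and the weight are all hypothetical.

```python
def fuse_scores(keystroke_score, stylometry_score, weight=0.5):
    """Weighted-sum fusion of two match scores, each assumed to lie in [0, 1].

    `weight` is the share given to the keystroke score (hypothetical parameter).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return weight * keystroke_score + (1.0 - weight) * stylometry_score


def accuracy(scores, labels, threshold=0.5):
    """Fraction of trials where the accept/reject decision matches the label.

    A trial is accepted when its score is at least `threshold`; `labels`
    holds True for genuine-user trials and False for impostor trials.
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal in length")
    correct = sum((score >= threshold) == label
                  for score, label in zip(scores, labels))
    return correct / len(scores)
```

Running `accuracy` on the keystroke scores, the stylometry scores, and the fused scores for the same trials gives three numbers that can be compared directly.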

Last semester Vinnie Monaco developed the data capturing and feature extraction programs for this work. The data capturing program captures the text portion of the keystroke data collected by the keystroke project so that we are working on the same underlying data. We also use a generic Feature Data Format so we can

Before we can use these programs, however, Vinnie will revise the data capturing program to use the more accurate Java applet method used in earlier work rather than the JavaScript method tested last semester.

This semester's team, together with the Keystroke team, will obtain new data samples from a large population of about 50 subjects. Each subject will provide 10 data samples in two sets of five, with the two sets recorded at least two weeks apart. You will also learn how to run the programs (code) of the system and run various experiments.

Fast Agile XP Deliverables

We will use the agile methodology, particularly Extreme Programming (XP), which involves small releases and fast turnarounds in roughly two-week iterations. Many of these deliverables can be done in parallel by different members or subsets of the team. The following is the current list of deliverables (ordered by the date initiated, deliverable modifications marked in red, initiated date marked in bold red if programming involved, completion date and related comments marked in green, pseudo-code marked in blue):

  1. 2/1 (one week). Review the work from last semester. Work with your instructor, customer John Stewart, and Vinnie Monaco to plan the work for the semester.
  2. 2/1 (ongoing for the semester). Collect data samples together with the Keystroke team.
  3. 2/1 (1-2 weeks). Work with Vinnie Monaco to learn how to run the various programs of the system.