Stylometry System

Stylometry (see Wikipedia definition) is the study of the unique linguistic styles and writing behaviors of individuals in order to determine authorship. Stylometry can be used to attribute authorship to anonymous or disputed documents, and it has legal as well as academic and literary applications. It has been used to determine, or at least narrow down, the authorship of historical documents and, in forensics, of ransom notes and other documents.

Stylometry uses statistical analysis, pattern recognition, and artificial intelligence techniques. For features, stylometry typically analyzes the text by using word frequencies and identifying patterns in common parts of speech. A paper and MIT Thesis describe some existing systems.
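As a rough illustration of a word-frequency feature, the sketch below computes the relative frequency of a few common function words in a text and compares two such profiles. The word list, the function names, and the distance measure are assumptions chosen for illustration; they are not taken from the systems described in the paper or thesis.

```python
import re
from collections import Counter

# Illustrative, deliberately short list of English function words (an assumption).
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "for"]


def function_word_profile(text):
    """Return the relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return [0.0] * len(FUNCTION_WORDS)
    counts = Counter(tokens)
    return [counts[word] / total for word in FUNCTION_WORDS]


def profile_distance(p, q):
    """Mean absolute difference between two profiles (smaller means more similar)."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)
```

A real system would use a much larger feature set (and likely part-of-speech patterns as well), but the same idea applies: each document becomes a numeric vector, and documents by the same author are expected to lie closer together than documents by different authors.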

Previous System

A previous system was developed and tested; see the Stylometry Technical Paper (fall 2007) and associated slides, as well as the Authentication Technical Paper (fall 2007) and associated slides. A rather primitive feature extraction component was developed in C#, but it is not of current value. Since we will not be using the previously developed system, this is a new project rather than a continuation.

Project

This project has two parts, which can be done in parallel until the first part is complete.

Part 1 (two- to three-week duration)

Conduct a library and internet search to determine an interesting and unique application of stylometry for research. Unique applications might include determining the age or gender of the author, verifying one's identity in biometric applications (such as the identity of a student taking an online test), or determining email authorship. A table enumerating all the possible applications of stylometry would be appropriate. Unless otherwise determined, the focus this semester will be determining email authorship, an area of forensic linguistics.

The DPS customer is particularly interested in creating stylometric profiles of a user based on the user's social networking comments. A profile on a networking site such as Facebook can be scanned for a user's comments, since these comments are tagged with the author's name. Emails from the same person can then be tested, with the existing system, against this profile for a match. The software would scan HTML pages from a user profile, extract the comments that follow the posting person's name, and use these comments to build a stylometric profile of the user. This stylometric profile could then be used to identify the authors of emails. Emails and profile comments are both informal online forms of communication, and the use of special characters may be similar in these two communication venues. This special application might require the coding of unique stylometry features to capture such things as the usage of chatroom shorthand.
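The comment-scanning step could be sketched as follows, using only the Python standard library. The HTML structure assumed here (an "author" span followed by a "text" span for each comment) is hypothetical, since real social networking pages use their own markup and the extractor would have to be adapted to it. The shorthand list and the feature names are likewise illustrative assumptions.

```python
import re
from html.parser import HTMLParser

# Illustrative list of chatroom shorthand terms (an assumption).
SHORTHAND = ["lol", "brb", "btw", "omg", "thx"]


class CommentExtractor(HTMLParser):
    """Collect comment text that follows a given author's name.

    Assumes hypothetical markup of the form:
    <span class="author">Name</span><span class="text">Comment</span>
    """

    def __init__(self, target_author):
        super().__init__()
        self.target_author = target_author
        self.comments = []
        self._field = None           # "author", "text", or None
        self._current_author = None  # most recent author name seen

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("author", "text"):
            self._field = css_class

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

    def handle_data(self, data):
        if self._field == "author":
            self._current_author = data.strip()
        elif self._field == "text" and self._current_author == self.target_author:
            self.comments.append(data.strip())


def shorthand_features(comments):
    """Return per-token shorthand rates and a special-character rate."""
    text = " ".join(comments)
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    total_tokens = len(tokens) or 1  # avoid division by zero on empty input
    features = {word: tokens.count(word) / total_tokens for word in SHORTHAND}
    special = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    features["special_char_rate"] = special / (len(text) or 1)
    return features
```

To use the sketch, create `CommentExtractor("Some Name")`, call its `feed()` method with the page's HTML, and pass the resulting `comments` list to `shorthand_features`. The same feature function could then be applied to the text of an email so that the two feature sets can be compared.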

Build logical arguments for why such software would be both valuable and unique, and develop use cases to support the arguments.

Part 2 (full semester duration)

Develop a powerful stylometry system. First, read the existing literature and check for the possibility of available existing software.

The system and experimental setup will have several components.