Data Mining Customer-Related Subway Incidents

Project Continuation

This is a continuation of an earlier project, and the results of that project can be found in a Research Day paper entitled Data Mining Customer-Related Subway Incidents. Rujul Inamdar (listed as the project's subject matter expert) did most of the data mining work on the earlier project and will serve as a consultant on this project.


We have a large database of train incident reports from the N.Y. Transit Authority. The data for each incident consists of the time of day, day of week, season (winter, spring, summer, fall), the station where it occurred, the borough where it occurred (Manhattan, Brooklyn, Bronx, Queens), and the trouble code. A brief summary of the data in the database and some related material are shown in Predictive Model for Customer Related Subway Incidents and Analysis of Service Related Contributory or Causative Factors of Subway Rail Rage.


We want to conduct data mining experiments on the database. We are particularly interested in problems involving train customers, which are recorded under trouble codes such as armed customer, door incident, sick customer, injured customer, unruly customer, and vandalism. We will begin by examining how the number of incidents of each type varies over time (from year to year), across seasons, and across stations and boroughs. Our overall goal is to spot trends and make discoveries in the data that lead to improvements in how incidents are handled and, possibly, to fewer incidents.
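As a starting point, the counts described above could be computed along the following lines. This is only a minimal sketch: the record layout, the field names (timestamp, borough, trouble_code), the sample values, and the month-to-season mapping are all assumptions made for illustration, not the actual database schema.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample records; the real database layout is not shown here.
incidents = [
    {"timestamp": "2005-01-14 08:15", "borough": "Manhattan", "trouble_code": "sick customer"},
    {"timestamp": "2005-07-02 17:40", "borough": "Brooklyn", "trouble_code": "unruly customer"},
    {"timestamp": "2006-01-20 09:05", "borough": "Manhattan", "trouble_code": "sick customer"},
    {"timestamp": "2006-10-31 23:10", "borough": "Queens", "trouble_code": "vandalism"},
]


def parse_time(record):
    """Parse the (assumed) timestamp format of a record."""
    return datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M")


def season_of(month):
    """Map a month number to a season (Dec-Feb = winter, an assumed convention)."""
    if month in (12, 1, 2):
        return "winter"
    if month in (3, 4, 5):
        return "spring"
    if month in (6, 7, 8):
        return "summer"
    return "fall"


def count_by(records, key_func):
    """Count records grouped by whatever key_func returns for each record."""
    return Counter(key_func(r) for r in records)


by_year_and_code = count_by(incidents, lambda r: (parse_time(r).year, r["trouble_code"]))
by_season = count_by(incidents, lambda r: season_of(parse_time(r).month))
by_borough = count_by(incidents, lambda r: r["borough"])

for (year, code), n in sorted(by_year_and_code.items()):
    print(year, code, n)
```

Passing the grouping as a function keeps the same counting routine usable for any attribute (station, day of week, and so on) without further changes.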

The following is an example scenario that might be investigated. A passenger’s frustration builds while waiting for a train that has been delayed. Upon its arrival, the train crew informs the customers of further extensive delays. The customer’s vented displeasure over the perceived poor service escalates from a heated verbal exchange to a shoving match and threats of further physical harm against a member of the train crew. Related incident reports in the database might include a report of customer threats against a member of the train crew. Then, using the time, date, and location of the reported threats, we might mine the database to discover that a train coming into that platform was indeed delayed for over thirty minutes and might have been the source of the customer’s anger. On the other hand, an examination of a similar threat might find no obvious poor service that could have caused the anger, suggesting that the customer was irritated by a cause unrelated to train service (mental illness, family problems, etc.). By examining a large number of such incidents, we might be able to estimate percentages on the pathways of a Customer Aggression Model.
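The matching step in this scenario might be sketched as follows. The record fields (station, time, minutes_late), the sample data, and the one-hour look-back window are hypothetical choices made for illustration; only the thirty-minute delay threshold comes from the scenario itself.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M"  # assumed timestamp format

# Hypothetical threat reports and delay reports.
threats = [
    {"station": "Station A", "time": "2005-03-10 18:05"},
    {"station": "Station B", "time": "2005-03-11 09:30"},
]
delays = [
    {"station": "Station A", "time": "2005-03-10 17:50", "minutes_late": 35},
    {"station": "Station B", "time": "2005-03-11 06:00", "minutes_late": 40},
]


def preceding_delay(threat, delay_records, window=timedelta(hours=1), min_late=30):
    """Return the first delay of more than min_late minutes at the same
    station that was reported within `window` before the threat, or None."""
    threat_time = datetime.strptime(threat["time"], TIME_FORMAT)
    for delay in delay_records:
        delay_time = datetime.strptime(delay["time"], TIME_FORMAT)
        gap = threat_time - delay_time
        if (delay["station"] == threat["station"]
                and delay["minutes_late"] > min_late
                and timedelta(0) <= gap <= window):
            return delay
    return None


def service_related_fraction(threat_records, delay_records):
    """Fraction of threats preceded by a qualifying delay at the same station."""
    if not threat_records:
        return 0.0
    linked = sum(1 for t in threat_records
                 if preceding_delay(t, delay_records) is not None)
    return linked / len(threat_records)


fraction = service_related_fraction(threats, delays)
print(f"Threats with a preceding delay: {fraction:.0%}")
```

A fraction computed this way over many incidents would give one rough estimate for a service-related pathway of the model; a threat with no matching delay is not proof of an unrelated cause, only the absence of a recorded one.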

Through a literature search, we would also like to compare how such incidents are monitored and handled by the other large metro systems of the world (London, Berlin, Moscow, Paris, Tokyo, Shanghai, etc.).

Inroads on the following items should be made early in the semester:


Weka is a collection of data mining algorithms implemented in Java; see the following links.