Big Data False Alarms

Background

This topic is both a Big Data challenge and an opportunity. Security monitoring of Hadoop transactions has always been troublesome because data monitoring tools generate a large volume of alerts when users query tables. Many of these alerts are false positives. How many can be safely ignored? Because of time and volume constraints, the security analysts whose job is to review these alerts bypass a great number of them. This lack of scrutiny is intentional: there is not enough time to analyze, and act upon, every alert within the time span allotted by the service level agreement.

Some organizations have suffered negative consequences because they failed to identify a real alert among the many false positives. This challenge keeps growing with the use of Big Data, given the higher velocity, larger volume, and greater variety of data streams that need to be ingested or disbursed in real or near-real time.

Can we confirm and prioritize the alarms reported in the violation logs so that security analysts can concentrate primarily on the true alerts?

Project Description

The deputy security director from the CISO (Chief Information Security Office) in your company has asked you to come up with an approach to make the analysis of the logs from a newly installed Data Leak Detection/Data Loss Prevention security application, which monitors the Big Data tables, more manageable. She is overwhelmed by the sheer volume of alerts being generated as Hadoop databases are accessed 24/7, online and in batch mode.

Your customer is interested in creating a process to confirm and prioritize true alerts. She believes that if her security analysts reviewed a confirmed and prioritized list of alerts, they could scrutinize them more thoroughly and take pertinent action in a timely manner. She indicates that her Data Leak Detection tool is not intelligent enough to eliminate the large volume of false alarms.

High-level intent: To confirm the true nature of these alerts, and then to classify them into two categories, true alarms and false alarms, for review by the security analysts. Eventually, we would like to create an automated interface that the analysts can use to feed newly gained information about true positives back to the learning model and the Data Leak Detection (DLD) tool. We can mine the data using Big Data analytics as well as other data mining tools.
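
As a starting point for discussion, the confirm, prioritize, and feed-back loop described above might be sketched as follows. This is a minimal illustration, not the project's prescribed method: the alert fields (user, table, off_hours), the AlertPrioritizer class, and the smoothed-rate scoring are all hypothetical choices made for this example, and a real DLD log will expose different fields.

```python
from collections import defaultdict


class AlertPrioritizer:
    """Toy model: ranks alerts by how often similar alerts were confirmed true.

    All field names below are assumptions made for illustration only.
    """

    def __init__(self):
        # Maps (field, value) -> [confirmed_true_count, total_reviewed_count]
        self.counts = defaultdict(lambda: [0, 0])

    def _features(self, alert):
        # Hypothetical alert attributes; replace with fields from the real logs.
        return [(name, alert[name]) for name in ("user", "table", "off_hours")]

    def record_feedback(self, alert, is_true_alert):
        """Store an analyst's verdict so later scores reflect it."""
        for key in self._features(alert):
            entry = self.counts[key]
            entry[0] += int(is_true_alert)
            entry[1] += 1

    def score(self, alert):
        """Average the smoothed true-alert rate over the alert's features."""
        rates = []
        for key in self._features(alert):
            true_count, total = self.counts.get(key, (0, 0))
            # Laplace smoothing: unseen feature values score a neutral 0.5.
            rates.append((true_count + 1) / (total + 2))
        return sum(rates) / len(rates)

    def prioritize(self, alerts):
        """Return alerts ordered from most to least likely to be true."""
        return sorted(alerts, key=self.score, reverse=True)


if __name__ == "__main__":
    model = AlertPrioritizer()
    # Analyst feedback on two previously reviewed alerts (made-up data).
    model.record_feedback({"user": "u1", "table": "payroll", "off_hours": True}, True)
    model.record_feedback({"user": "u2", "table": "weblogs", "off_hours": False}, False)

    new_alerts = [
        {"user": "u4", "table": "weblogs", "off_hours": False},
        {"user": "u3", "table": "payroll", "off_hours": True},
    ]
    for alert in model.prioritize(new_alerts):
        print(round(model.score(alert), 3), alert)
```

In this sketch, each analyst verdict passed to record_feedback shifts the scores of later alerts that share a user, table, or time-of-day pattern, which is the kind of feedback interface the intent statement anticipates. At Hadoop scale the same idea would need a distributed implementation and a properly validated model.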

This project is a continuation of two earlier projects:

  1. Fall 2016 project, see 2016 Fall Project Paper
  2. Spring 2016 project, see 2016 Research Day Conference paper.

Questions to Ponder

What Big Data analytic approach can we use to understand the nature of the false positive alerts in the generated logs?