Text Analytics

Topic: Text Analytics: Building a Dynamic Cybersecurity Thesaurus using Twitter Data

This project continues last semester's work, which produced the Fall 2017 Technical Paper.

Background: With the emergence of big data, text analysis systems are built to process large data sets. For meaningful analysis and mining, text must first be prepared by removing 'noise.' This preparation is called preprocessing. Preprocessing techniques include, but are not limited to, tokenizing the file into individual words, removing stop words, stemming the text, and building a thesaurus. A thesaurus is a precompiled list of words in a given domain of knowledge that provides a standard vocabulary for indexing and searching. For this project, the domain of knowledge is cybersecurity, or specific concepts within it (e.g., private key encryption or SSL).
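The preprocessing steps named above (tokenizing, stop-word removal, stemming, and collecting candidate vocabulary) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual pipeline. The stop-word list, the `crude_stem` suffix stripper (a simplified stand-in for a real stemmer such as Porter's), the function names, and the sample tweets are all assumptions made up for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are far longer).
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip one common English suffix if at least 3 characters remain.

    This is a simplified stand-in for a real stemming algorithm.
    """
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]


def build_term_counts(documents):
    """Count stemmed terms across documents as raw material for a thesaurus."""
    counts = Counter()
    for doc in documents:
        counts.update(preprocess(doc))
    return counts


# Hypothetical sample tweets, invented for illustration only.
SAMPLE_TWEETS = [
    "Private key encryption is used in SSL",
    "Attackers are targeting SSL certificates",
]

if __name__ == "__main__":
    for term, count in build_term_counts(SAMPLE_TWEETS).most_common():
        print(term, count)
```

Frequent terms from such a count would only be candidates for the thesaurus. Deciding which of them are genuine cybersecurity concepts is a separate step that this sketch does not attempt.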

Project Description: This project aims to build a thesaurus (dictionary) of cybersecurity concepts based on Twitter data.

Project Deliverables:

Data Set: A data set should be generated for this project.

