Computer Science





CS 325/CIT 348 [Data Mining]






4 Hours per week















Introduction to Data Mining [ISBN: 0321321367]

P. Tan, M. Steinbach, & V. Kumar

Pearson Prentice Hall/ 2006




Data Mining: Introductory and Advanced Topics


M. Dunham/Pearson Prentice Hall/ 2003

Internet; Journals




Spring 2011




Dr. A. Joseph and Dr. J. Lawler




Course Description: This course will provide an overview of topics such as data mining and knowledge discovery; data mining with structured and unstructured data; foundations of pattern clustering; clustering paradigms; clustering for data mining; data mining using neural networks and genetic algorithms; fast discovery of association rules; applications of data mining to pattern classification; and feature selection. The goal of this course is to introduce students to current machine learning and related data mining methods. It is intended to provide enough background to allow students to apply machine learning and data mining techniques to learning problems in a variety of application areas.










Dr. A. Joseph



163 Williams St., 2nd floor, Room 231



212 346 1492


Office Hours:


Monday (NYC)            9:00am – 2:00pm







Grading Policy


Final examination:



In-class examinations (six 20-minute exams):


30% [best 5 of 6]




Student participation and contribution:







Project and project presentation:


15%  (3% for presentation)




Extra credit assignment (Optional):


Note: Only for students who are otherwise fulfilling all the course requirements.


10% (due no later than week 12)


Final grade Determination


90% -- 100%

85% -- 89%



82% -- 84%



80% -- 81%



75% -- 79%



70% -- 74%



65% -- 69%



60% -- 64%



Below 60%



Note: Grade is computed to the nearest whole number.
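The "best 5 of 6" rule for the in-class examinations can be illustrated with a short calculation. The sketch below is only an illustration under stated assumptions: the syllabus does not say how the five retained exams are combined, so the code assumes they are weighted equally and scored on a 0-100 scale. The function name and the sample scores are hypothetical.

```python
def in_class_exam_component(scores, keep=5, weight=30.0):
    """Return the in-class exam contribution to the final grade, in percentage points.

    scores: one percentage (0-100) per in-class exam; a missed exam counts as 0.
    keep:   number of best scores retained (5 of 6 in this syllabus).
    weight: share of the final grade carried by the in-class exams (30%).

    Assumption: the retained exams are weighted equally.
    """
    if len(scores) < keep:
        raise ValueError("need at least as many scores as exams retained")
    best = sorted(scores, reverse=True)[:keep]  # drop the lowest score(s)
    average = sum(best) / keep                  # mean of the retained exams
    return weight * average / 100.0


# Hypothetical scores for the six in-class exams.
sample_scores = [70, 85, 90, 60, 95, 80]
component = in_class_exam_component(sample_scores)
print(component)
```

With these sample scores the lowest exam (60) is dropped, the remaining five average 84, and the exams contribute about 25.2 of the 30 available percentage points.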



Learning Objectives and Outcomes


Students are expected to accomplish the following learning objectives and attain the corresponding outcomes by the end of the course.


Objective #1

Students will develop an intimate understanding of data and their characteristics.



a.       Demonstrate a clear understanding of the complexity of, and possible solutions to, the problems of data collection and data organization, as well as of the expertise available to analyze the data.

b.       Know when to determine and prepare data for quality analysis and its importance to informed decision making as well as be able to identify and clearly explain at least six indicators of data quality.

c.        Able to state and discuss a global definition of a data warehouse as well as know the categories of data it contains and the main transformation methods used to prepare them.

d.       Understand and know different ways in which data are characterized as well as how to identify and preprocess them.

e.        Able to demonstrate deep knowledge and understanding of data similarity and dissimilarity with regard to the operations involved and data analysis.


Objective #2

Students will develop a sound knowledge and understanding of data preparation and exploration.



a.       Able to analyze basic representations and characteristics of raw data, apply different normalization techniques to numerical attributes, and recognize different techniques for data preparation.

b.       Able to compare different methods for elimination of missing data as well as compare different methods for outlier detection.

c.         Can apply summary statistics such as mean, median, and standard deviation to capture important characteristics in data sets.

d.       Know and be able to explain the purpose and significance of data visualization as well as know the forms, representations, and procedures of visualization techniques appropriate for a particular application.

e.        Able to identify the differences between dimensionality reduction based on features and reduction-of-values techniques, and can clearly explain data reduction in the preprocessing phase.

f.        Show unambiguous understanding of the basic principles of feature selection and feature composition tasks.

g.        Demonstrate a clear understanding of the differences between decision tree and decision rule representation in a classification model.

h.       Able to identify the basic components of an artificial neural network and its properties and capabilities in such learning tasks as classification and pattern association.

i.         Able to describe the main steps of a genetic algorithm with an illustrative example.


Objective #3

Students will improve their team-building, social, organizational, and collaborative skills through assignments, team activities, and projects; these are skills that they can further develop in other classes and in their professional careers.



a.       Demonstrate an ability to work effectively in teams.

b.       Demonstrate the ability to communicate effectively, both verbally and in writing.

c.        Able to differentiate between the different types of learning teams and can clearly explain the stages of team development and the characteristics of an effective team.

d.       Know the importance of task, friendship, and interaction to a team’s performance.

e.        Able to demonstrate a clear understanding of the role and significance of team norms, teamwork skills, communication, leadership, decision making, and conflict management in the effective functioning of a team.


Objective #4

Students will develop foundational knowledge and understanding of the core concepts of data mining inherent in classification, cluster analysis, and association analysis as well as examples of their applications.



a.       Show clear understanding by being able to describe or discuss hierarchical (e.g., agglomerative), partitional (e.g., k-means), ROCK, and DBSCAN algorithms as well as their appropriateness to different data clustering applications.

b.       Able to briefly describe supervised, unsupervised, and relative cluster evaluation measures as well as to compare and contrast them.

c.        Able to demonstrate using illustrative examples basic knowledge and understanding of statistical, distance, decision tree, neural networks, rule-based, and support vector machine algorithms in solving the classification problem.

d.       Able to evaluate, compare, and contrast the performance of two or more classifier models using different techniques.

e.        Demonstrate the ability to differentiate between and descriptively explain the different types of association analysis algorithms, such as Apriori, sampling, partitioning, parallel, distributed, and frequent pattern growth.

f.        Able to compare and contrast qualitative and quantitative measures for evaluating the quality of association patterns.


Objective #5

Students will acquire the knowledge, skills, and expertise needed to design and develop innovative and imitative algorithms for competitive products, processes, or services in a technology oriented financial and health informatics related enterprise.



a.       Develop skills and expertise in applying the knowledge of classification, clustering, and association algorithms to solve problems relating to financial and health care services, processes, or products.

b.       Demonstrate the needed know-how to design and develop or modify algorithms for specific data mining applications in finance and health care.


Objective #6

Students will be provided with opportunities to increase their knowledge of and exposure to entrepreneurial skills through course activities, assignments, and interactions with mentors.



a.       Acquire entrepreneurial skills while interacting with financial, health care, and/or information technology experts for at least 10 hours to determine and execute the project as measured by different reporting mechanisms.



Tentative Examination Schedule:


Course Section: CS 325/CIT 348 (CRN: 23191/23190)

In-class examination dates: 2/9, 2/23, 3/9, 3/30, 4/13, & 4/27

Project due date: April 14, 2011

Final examination date: May 5, 2011



Note 1: In general, the lessons will highlight inquiry-based lecture-discussion and may include storytelling. The central focus of the course will be critical thinking and problem-solving. To get the most out of the course, each student is expected to study the reading assignments and genuinely attempt each homework problem before coming to class. The idea is to come to class ready with questions about and ideas relating to the course materials and associated problems.


Note 2: In the interest of learning, it is very important to come to class prepared to learn – do all required assignments. Failure to do so could diminish your ability to get the most out of each lesson and the class. Remember that learning is action oriented. That is, it is not enough to come to class to listen to what others have to say. You should come to class prepared to become involved in all aspects of classroom activities because learning is an active process.


Note 3: It is very important that you read and familiarize yourself with the SCSIS Statement of Student Responsibilities (see Blackboard).









Data: Types of data; data quality; data preprocessing; measures of similarity and dissimilarity; large data sets; and data warehouses.

Readings: chapter 2

Problems: chapter 2/ 1, 3, 6, 7, 9, & 12





Data Preparation and Exploration: Raw data representation, characteristics, and transformation; missing data; summary statistics; decision trees and rules; data reduction techniques; neural networks; genetic algorithms; and visualization.

Reading: chapter 3 & handouts

Problem: Chapter 3/ 1, 2, 4, 6, & 17.





Classification: Introduction; approach to solving a classification problem; decision tree induction; model overfitting; evaluating classifier performance; comparing classifiers; rule-based classifiers; nearest-neighbor classifiers; Bayesian classifiers; neural networks; and support vector machines.

Reading: chapters 4 & 5

Problems: Chapter 4/ 1;

Chapter 5/ 1.





Cluster Analysis: Introduction; K-means; agglomerative hierarchical clustering; DBSCAN; cluster evaluation; and clustering with categorical attributes.

Reading: chapter 8

Problems: To be assigned.





Project Submission and Presentation






Association Analysis: Problem definition; generation and compact representation of frequent itemsets; rule generation; algorithms (sampling, partitioning, parallel, distributed, & FP-growth); and evaluation of association patterns.

Reading: Chapter 6

Problems: Chapter 6/ 1






Review for Final Examination









Final Examination.






Note 1: This course is structured around freely formed small collaborative groups in a cooperative learning environment.  Students are encouraged to work together in their respective groups to form effective and productive teams that share the learning experience within the context of the course, help each other with learning difficulties, spend time to get to know each other, and spend time each week to discuss and help one another with the course work (content and assignments).  Each group member is responsible for the completion and submission of each assignment.  Each group member will be individually graded. 


Note 2: During the first class session, student background information will be collected to get a sense of the diversity of student educational background and an assessment test will be given to determine students’ knowledge of the subject.


Group project: Students in small groups of two to four will participate in a project or research and prepare a report that involves the use of a low-level or high-level programming language.  In this project, students will write a program to determine the solution of a technical problem, and then demonstrate their knowledge and understanding of how the program is processed in the typical digital computer system.  Assignment of grades to individual students for the group project will be based upon their involvement in the following items: programming, report writing, proofreading and correction of program code and the written report, and combinations of the above.


Web support: This course is supported with most or all of the following Blackboard postings: lesson questions, lessons (PowerPoint), instructions and guidelines pertaining to the course, computer architecture and related news, group and class discussions boards, email correspondence about the course, homework solutions, examination grades, and miscellaneous course related activities and information.


Supplementary materials: Handouts in class or web postings of current events and issues affecting computer architecture.  Some books that may be helpful for the course will be posted on Blackboard.


In-class group activity and participation: Students are encouraged to bring to class current newsworthy events in computer organization/architecture and related news to share with the class.  Students will inform the class of the news events and their significance to computing.  Devote 15-20 minutes to this activity.


The collaborative groups are designed to function outside of the classroom.  Collaborative group activities will be reinforced inside the class during the lessons.  Student groups are encouraged to function cohesively and to participate in class activities. Devote 30-45 minutes of each class period to collaborative group activities.



Students are strongly encouraged to download posted lessons from Blackboard, review them, and come prepared to ask informed questions about the material in these lessons.


Every effort will be made to present each lesson using the storytelling format supported with subsequent discussion and elaboration on the central points of the lesson.


The key elements of a story are the following: causality, conflict, complication, and character.



The following excerpts about collaborative learning are from research documents:


·         In the university environment, educational success and social adjustments depend primarily on the availability and effectiveness of developmental academic support systems.


·         Most organized learning occurs in some kind of group; group characteristics and group processes significantly contribute to success or failure in the classroom and directly affect the quality and quantity of learning within the group.


·         Group work invariably produces tensions that are normally absent, unnoticed, or suppressed in traditional classes.  Students bring with them a variety of personality types, cognitive styles, expectations about their own role in the classroom and their relationship to the teacher, peers, and the subject matter of the course.


·         Collaborative learning involves both management and decision-making skills to choose among competing needs.  The problems encountered with collaboration have management, political, competence, and ethical dimensions.


·         The two key underlying principles of the collaborative pedagogy are that active student involvement is a more powerful learning tool than passive attendance and that students working in groups can make for more effective learning than students acting alone.  The favorable outcomes of collaborative learning include greater conceptual understanding, a heightened ability to apply concepts, and improved attendance.  Moreover, students becoming responsible for their own learning is likely to increase their skills for coping with ambiguity, uncertainty, and continuous change, all of which are characteristics of contemporary organizations.



An entrepreneur is one who creates a new activity in the face of risk and uncertainty for the purpose of achieving success and growth by identifying opportunities and putting together the required resources to benefit from them.


Creativity is the ability to develop new ideas and to discover new ways of looking at problems and opportunities.


Innovation is the ability to apply creative solutions to those problems and opportunities to enhance or to enrich people’s lives.


Each group may be viewed as a small business that is seeking creative and innovative ways to maximize its product, academic outcome or average group grade.  A satisfactory product is the break-even group average grade of 85%.  Groups getting average grades above 85% are profitable enterprises.