Learning Objectives

This course focuses on both concepts and practice. We will introduce (a) the core data mining concepts and (b) practical skills for applying data mining techniques to solve real-world problems.

Concepts

  • Study the major data mining problems as different types of computational tasks (prediction, classification, clustering, etc.) and the algorithms appropriate for addressing these tasks
  • Learn how to analyze data through statistical and graphical summarization and through supervised and unsupervised learning algorithms
  • Systematically evaluate data mining algorithms and understand how to choose algorithms for different analysis tasks

Practice

  • Learn how to gather and process raw data into suitable input for a range of data mining algorithms
  • Critique the methods and results of a data mining project
  • Design and implement data mining applications using real-world datasets, and evaluate and select proper data mining algorithms to apply to practical scenarios

Course Content

Topics to be covered:

  • Data exploration and visualization
  • Supervised learning (or predictive analysis): Regression, Classification
  • Unsupervised learning (or descriptive analysis): Clustering, Dimension reduction
  • Evaluation and model assessment
  • Special topics: Network mining, Text mining, Recommendation
  • Selected topics (TBD): advanced topics in clustering and classification techniques, outlier analysis, data science research trends, etc.

See the course schedule for weekly topics.

Computing

This course will use R for computing. R is freely available online. Our default IDE will be RStudio, which can be downloaded for free. We will use R Markdown to create reproducible data science documents.

Prerequisites

Students are expected to be familiar with the basics of Linear Algebra, Probability and Statistics, and should be comfortable with programming. We will use R for computing, so familiarity with R is preferred. If you have never programmed before, get started with the list of learning resources on the course website.

Grading

Grades are based on three major activities listed below. Assignments are due as scheduled, and grades on late work will be decreased by 10% per day late.

  • 30% in-class & post-class participation (including quizzes and reading assignments)
  • 40% homework and midterm
  • 30% final project (including several milestones)
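
As a rough illustration of the arithmetic behind these weights, the sketch below combines three category scores into a course total. It is not the official grade calculation. The function names, the 0-100 score scale, and the reading of the late policy (each late day removes 10% of the earned score, never dropping below zero) are assumptions made for this example. The example is written in Python only so that it is self-contained; the course itself uses R.

```python
# Category weights as stated in the grading list above (they sum to 1.0).
WEIGHTS = {
    "participation": 0.30,
    "homework_midterm": 0.40,
    "final_project": 0.30,
}


def late_adjusted(score, days_late, penalty_per_day=0.10):
    """Apply a late penalty to one score on a 0-100 scale.

    Assumption (not spelled out in the syllabus): each late day removes
    10% of the earned score, linearly, and the result is floored at zero.
    """
    factor = max(0.0, 1.0 - penalty_per_day * days_late)
    return score * factor


def course_grade(scores):
    """Weighted course total from per-category scores on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    example = {"participation": 100, "homework_midterm": 80, "final_project": 90}
    print(round(course_grade(example), 2))
    print(round(late_adjusted(90, days_late=2), 2))
```

Under these assumptions, an assignment submitted ten or more days late would receive no credit.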

Class Participation

Class participation will be assessed through online quizzes assigned each week, as well as the students’ participation in class.

Readings

This course does not have a single textbook. It will use materials from the recommended books listed below. These books are available online (some over the Pitt network). Reading assignments will be given throughout the semester, and links to electronic copies of the readings will be provided. There are also other recommended books for further reading and for learning R.

Optional Course Textbooks

  • Data Mining and Business Analytics with R, Johannes Ledolter, Wiley, 2013, ISBN: 978-1118447147 (online access via Pitt network) (hereafter referred to as “DMR”)
  • Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data (2nd ed.), Bing Liu, Springer, 2011, ISBN: 978-3642194597 (available online) (hereafter referred to as “WDM”)
  • Practical Data Science with R, Nina Zumel and John Mount, Manning Publications, 2014, ISBN: 978-1617291562 (online access via Pitt network) (hereafter referred to as “DSR”)

University Policies

See the university policies page.