math189r

Info

Mathematics of Big Data I

Professor Weiqing Gu
Harvey Mudd College
Fall 2016

Meeting Times

M  06:30-09:15PM. SHAN B460 (Lecture)
Tr 06:30-08:00PM. SHAN B460 (Optional Section)

Summary of Goals

Description

This is a course in how to utilize data: infer, predict, coerce, and classify. We will cover a large breadth of material, spanning supervised and unsupervised learning, recommender systems, and Bayesian modelling, with a high level of mathematical rigor. Upon successful completion of the course, students should be fully equipped to enter industry as data scientists, read active research in the field of Machine Learning, and approach huge (data and otherwise) problems seen in the real world.

Another goal of this course is for you to become comfortable using Amazon Web Services and GitHub, as these tools are extremely prevalent in industry and academia when developing and deploying models. To that end, all code for homework and your final project will be hosted on GitHub.

Structure

There will be mandatory Monday lectures with readings to be completed before class (detailed below).

A section held each Thursday will either review prerequisite material, go over supplementary material, or investigate an interesting application of our coursework. Sections will be taught by the instructor or a teaching assistant.

For the review sections (the first two), attendance is recommended for anyone who (for the Linear Algebra review) is unfamiliar with any of {Cholesky decomposition, SVD, inner product, outer product}, or (for the Probability review) is unfamiliar with any of {Bayes' rule, binomial distribution, Bernoulli distribution, multinomial distribution, Poisson distribution, Gaussian distribution, covariance}. Notes will be posted prior to the meetings, so check those out beforehand and see if you feel comfortable with the material. We expect around half of the students to be comfortable with more than 75% of the linear algebra and probability we will be using in the course. This is fine! Just come to the review sections.
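As a quick self-check for the review prerequisites, the sketch below runs a few of the listed operations in NumPy (this is illustrative, not course-provided code; the toy Bayes' rule numbers are made up). If any of it feels unfamiliar, the review sections are for you.

```python
import numpy as np

rng = np.random.default_rng(0)

# Inner and outer products of two vectors.
u = rng.standard_normal(3)
v = rng.standard_normal(3)
inner = u @ v            # scalar: sum of elementwise products
outer = np.outer(u, v)   # 3x3 matrix: outer[i, j] = u[i] * v[j]

# Cholesky decomposition of a symmetric positive-definite matrix:
# A = L @ L.T with L lower triangular.
A = np.array([[4.0, 2.0], [2.0, 3.0]])
L = np.linalg.cholesky(A)
assert np.allclose(L @ L.T, A)

# Singular value decomposition: M = U @ diag(s) @ Vt, for any matrix M.
M = rng.standard_normal((4, 2))
U, s, Vt = np.linalg.svd(M, full_matrices=False)
assert np.allclose(U @ np.diag(s) @ Vt, M)

# Bayes' rule on a toy diagnostic example (all probabilities invented):
# P(disease | positive) = P(positive | disease) P(disease) / P(positive)
p_d = 0.01
p_pos_given_d = 0.95
p_pos_given_not_d = 0.05
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)
p_d_given_pos = p_pos_given_d * p_d / p_pos
```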

For sections after the review sections (e.g. Convex Optimization Overview; Sparsity; SVM Training), attendance is again not required but highly (!) recommended. Sections are designed to give you inspiration and insight into your final project, and to shed light on the material in a new way.

Textbook

Murphy, Kevin P. Machine learning: a probabilistic perspective. MIT press, 2012.

Grading

Reading Summaries

All readings are compulsory, but some are more compulsory than others.

To encourage the goal of reading active research in the field, half-page reading summaries for all non-Murphy readings will be due at the beginning of class. They must be legible and must clearly demonstrate that you have read the paper. Credit will be given on a {0-10} scale for each summary. Your summaries should be written at a high level and should focus on the main point of the readings (i.e. avoid complicated math). As long as your summary is reasonable, you will be given full credit.

Homework (Every ~2 Weeks. Due Monday)

For coding: feel free to use any of {R, Julia, Python, Matlab}. If you want to use something not on that list, just ask us; with a good reason we'll probably say yes.

The homework is split approximately evenly between mathematical analysis that extends our course material and application of algorithms to real-world data.

Midterm

The midterm will be a 3-hour take-home exam covering all topics seen through October 10 (inclusive). You will receive the exam in class on Oct. 10 (the Monday before break) and have until Oct. 24 (the Monday after break) to complete it. More detailed instructions will be given on the exam, but you are to turn the exam in to the designated box outside Prof. Gu's office immediately after completion or, if she is not there, slide it under her door. The hard cut-off for handing the exam in is the beginning of class on Oct. 24.

Final Project

Description

This is by far the largest component of the course. You will discover, explore, and attack a real world problem of your choosing. There are 3 types of projects you can work on, shown below in order of increasing difficulty.

  1. Application of existing algorithm to a new problem and potentially new data.
  2. Algorithmic work. Extend an existing algorithm or conceive a new one to solve some problem. This inherently includes (1) because you will need to test this new algorithm on data.
  3. Theoretical work. Create a new convergence bound on a learning algorithm. Show that at some limit one learning algorithm becomes another. Etc.

These project types also carry increasing risk. For example, you cannot turn in a paper saying you worked on a convergence bound for months with no results. Type (2) has medium risk because part of the process of creating a new algorithm is creating baselines to improve upon.

At any time during the course, please feel free to come discuss your project and ask questions of the instructor or TAs.

Requirements

Due Dates

Only one copy of each item need be turned in per group.

Disabilities

Students who need disability-related accommodations are encouraged to discuss this with the instructor as soon as possible.

Teaching Assistants (TAs)

Name                                Email
Conner DiPaolo (head grader/tutor)  cdipaolo@hmc.edu
Paul David (CGU Graduate Student)   paul.david@cgu.edu
Kathryn Dover                       kdover@hmc.edu
Zoe Tucker                          ztucker@hmc.edu
Ricky Pan                           rpan@hmc.edu
Natchanon Suaysom                   nsuaysom@hmc.edu
Mek Jenrungrot                      mjenrungrot@hmc.edu
Herrick Fang                        hfang@hmc.edu
Bo Zhang                            bzhang@hmc.edu