math189r

Info

Mathematics of Big Data I

Professor Weiqing Gu
Harvey Mudd College
Fall 2016

Meeting Times

M  06:30-09:15PM. SHAN B460 (Lecture)
Tr 06:30-08:00PM. SHAN B460 (Optional Section)

Summary of Goals

Description

This is a course in how to utilize data: infer, predict, coerce, and classify. We will cover a large breadth of material, spanning supervised and unsupervised learning, recommender systems, and Bayesian modelling, with a high level of mathematical rigor. Upon successful completion of the course, students should be fully equipped to enter industry as data scientists, read active research in the field of Machine Learning, and approach huge (data and otherwise) problems seen in the real world.

Another goal of this course is for you to become comfortable using Amazon Web Services and GitHub, as these tools are extremely prevalent in industry and academia when developing and deploying models. To that end, all code for homework and your final project will be hosted on GitHub.

Structure

There will be mandatory Monday lectures with readings to be completed before class (detailed below).

A section held each Thursday will either review prerequisite material, go over supplementary material, or investigate an interesting application of our coursework. Sections will be taught by the instructor or a teaching assistant.

For the review sections (the first two), attendance is recommended for anyone who (for the Linear Algebra review) is unfamiliar with any of {Cholesky decomposition, SVD, inner product, outer product}, or (for the Probability review) is unfamiliar with any of {Bayes' rule, binomial distribution, Bernoulli distribution, multinomial distribution, Poisson distribution, Gaussian distribution, covariance}. Notes will be posted prior to the meetings, so check those out beforehand and see if you feel comfortable with the material. We expect around half of the students to be comfortable with more than 75% of the linear algebra and probability we will be using in the course. This is fine! Just come to the review sections.
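As a quick self-check for the review prerequisites, the sketch below runs a few of the listed operations in NumPy (this is illustrative, not course-provided code; the toy Bayes' rule numbers are made up). If any of it feels unfamiliar, the review sections are for you.

```python
import numpy as np

rng = np.random.default_rng(0)

# Inner and outer products of two vectors.
u = rng.standard_normal(3)
v = rng.standard_normal(3)
inner = u @ v            # scalar: sum of elementwise products
outer = np.outer(u, v)   # 3x3 matrix: outer[i, j] = u[i] * v[j]

# Cholesky decomposition of a symmetric positive-definite matrix:
# A = L @ L.T with L lower triangular.
A = np.array([[4.0, 2.0], [2.0, 3.0]])
L = np.linalg.cholesky(A)
assert np.allclose(L @ L.T, A)

# Singular value decomposition: M = U @ diag(s) @ Vt, for any matrix M.
M = rng.standard_normal((4, 2))
U, s, Vt = np.linalg.svd(M, full_matrices=False)
assert np.allclose(U @ np.diag(s) @ Vt, M)

# Bayes' rule on a toy diagnostic example (all probabilities invented):
# P(disease | positive) = P(positive | disease) P(disease) / P(positive)
p_d = 0.01
p_pos_given_d = 0.95
p_pos_given_not_d = 0.05
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)
p_d_given_pos = p_pos_given_d * p_d / p_pos
```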

For sections after the review sections (e.g. Convex Optimization Overview; Sparsity; SVM Training), attendance is again not required but highly (!) recommended. Sections are designed to give you inspiration and insight into your final project, and to shed light on the material in a new way.

Textbook

Murphy, Kevin P. Machine learning: a probabilistic perspective. MIT press, 2012.

Grading

Reading Summaries

All readings are compulsory, but some are more compulsory than others.

To encourage the goal of reading active research in the field, half-page reading summaries for all non-Murphy readings will be due at the beginning of class. They must be legible and must clearly demonstrate that you have read the paper. Credit will be given on a {0-10} scale for each summary. Your summaries should be written at a high level and should focus on the main point of the readings (i.e. avoid complicated math). As long as your summary is reasonable, you will be given full credit.

Homework (Every ~2 Weeks. Due Monday)

For coding: feel free to use any of {R, Julia, Python, Matlab}. If you want to use something not on that list, just ask us; with a good reason we'll probably say yes.

The homework is split approximately evenly between mathematical analysis that extends our course material and application of algorithms to real-world data.

Midterm

The midterm will be a 3-hour take-home exam covering all topics seen through October 10 (inclusive). You will receive the exam in class on Oct. 10 (the Monday before break) and have until Oct. 24 (the Monday after break) to complete it. More detailed instructions will be given on the exam, but you are to turn the exam in to the designated box outside Prof. Gu's office immediately after completion or, if she is not there, slide it under her door. The hard cut-off for handing the exam in is the beginning of class on Oct. 24.

Final Project

Description

This is by far the largest component of the course. You will discover, explore, and attack a real world problem of your choosing. There are 3 types of projects you can work on, shown below in order of increasing difficulty.

  1. Application of existing algorithm to a new problem and potentially new data.
  2. Algorithmic work. Extend an existing algorithm or conceive a new one to solve some problem. This inherently includes (1) because you will need to test this new algorithm on data.
  3. Theoretical work. Create a new convergence bound on a learning algorithm. Show that at some limit one learning algorithm becomes another. Etc.

These project types also carry increasing risk. For example, you cannot turn in a paper saying you worked on a convergence bound for months with no results. Type (2) has medium risk because part of the process of creating a new algorithm is creating baselines to improve upon.

At any time during the course, please feel free to come discuss your project and ask questions of the instructor or TAs.

Requirements

Due Dates

Only one copy of each item need be turned in per group.

Disabilities

Students who need disability-related accommodations are encouraged to discuss this with the instructor as soon as possible.

Teaching Assistants (TAs)

Name                                Email
Conner DiPaolo (head grader/tutor)  cdipaolo@hmc.edu
Paul David (CGU Graduate Student)   paul.david@cgu.edu
Kathryn Dover                       kdover@hmc.edu
Zoe Tucker                          ztucker@hmc.edu
Ricky Pan                           rpan@hmc.edu
Natchanon Suaysom                   nsuaysom@hmc.edu
Mek Jenrungrot                      mjenrungrot@hmc.edu
Herrick Fang                        hfang@hmc.edu
Bo Zhang                            bzhang@hmc.edu