Course Description
Gives students hands-on experience applying data science techniques and machine learning algorithms to real-world problems. Students will work in small teams on internal challenges, many of which will be sponsored by local companies and organizations, and will represent the university in larger teams for external challenges at the national and global level, such as those hosted by Kaggle. Students will be expected to participate in both internal and external challenges, attend meetings, and give short presentations to the group when appropriate.
Motivation
Data science is one of the fastest-growing sectors of our economy, and there is great demand for data scientists with practical experience applying statistical techniques and machine learning algorithms to real data. Several courses in the CS curriculum develop these techniques, in the areas of machine learning, statistical modeling, network science, numerical analysis, and data science more broadly, and these courses often include a hands-on project. However, no course focuses specifically on putting this myriad of tools to work on real data and developing intuition for when to apply certain techniques over others. The present course fills this gap, allowing students to work in teams both small and large to solve real-world prediction challenges and to gain experience that is valuable whether they enter the workforce or remain in academia.
Topics
To accompany the prediction challenges and other activities hosted by the team, we will have short presentations on topics relevant to the current competition or data science more broadly. A non-exhaustive list of topics is as follows.
- Basic Concepts: classification and regression, prediction vs causation, regularization and overfitting.
- Algorithms: linear regression, logistic regression, support vector machines, boosting, decision trees and forests, neural networks, gradient and stochastic gradient descent.
- Practical Techniques: ensemble methods and aggregation, tradeoffs in regularization, parameter and hyperparameter tuning, data imputation techniques, and cross-validation.
- Software and Tools: tutorials on several modern data science software packages; as of this writing, these include scikit-learn, pandas, Vowpal Wabbit, and XGBoost.
- Context and Industry Practice: via weekly presentations from practicing data scientists, students will learn about techniques actually used in industry and academia, and which algorithms work well for which problems.
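As an illustration of two of the topics above (cross-validation and the tradeoff in regularization strength), here is a minimal sketch in plain Python. It is not taken from the course materials: the function names, the synthetic data, and the candidate regularization values are all assumptions made up for this example, and in practice one would more likely rely on a package such as scikit-learn.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and split it into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge fit for the model y ~ w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_mse(xs, ys, lam, k=5):
    """Average held-out mean squared error over k folds."""
    errors = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        fold_error = sum((ys[i] - w * xs[i]) ** 2 for i in held_out) / len(held_out)
        errors.append(fold_error)
    return sum(errors) / k

# Synthetic data, invented for illustration: y = 2x + Gaussian noise.
rng = random.Random(42)
xs = [rng.uniform(-1, 1) for _ in range(40)]
ys = [2 * x + rng.gauss(0, 0.3) for x in xs]

# Hypothetical grid of regularization strengths to compare.
candidates = [0.0, 0.1, 1.0, 10.0]
scores = {lam: cv_mse(xs, ys, lam) for lam in candidates}
best_lam = min(scores, key=scores.get)
print("CV error by lambda:", scores)
print("Selected lambda:", best_lam)
```

The point of the sketch is that the regularization strength is chosen by error on held-out folds rather than on the training data, which is the basic guard against overfitting when tuning hyperparameters.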
Assessment
The general requirement for the course is to participate in the competitions and other activities of the team. As the specifics of these competitions and activities will change from semester to semester, the course is formally structured around two components: weekly attendance, and three written reports submitted to Moodle detailing what you have done. The requirements for each are as follows:
0. Attendance
- We'll have a sign-in sheet for attendance every week. This will be used for the participation part of the grade. Missing 4-5 weeks drops your grade by one increment (B to B-, for instance), and missing 6 or more weeks drops your grade by two increments (B to C+, for instance).
1. Midterm report 1 (The Proposal)
- Report 1 Due: September 24, 11:59 PM.
- Purpose: To make sure you are on track
- Format: Text
- Length: 3-4 paragraphs
- Content:
- Summary: Brief description of the activities you have been involved in and are planning. Some examples might be: competitions, prediction tasks, and so on. Please also note who you have been working with, if anyone.
- Techniques: Brief description of the techniques you have used
- Goals: Your goals for the remainder of the semester, both in terms of activities (what will you do) and education (what do you hope to learn)
2. Midterm report 2 (The Progress Report)
- Report 2 Due: October 22, 11:59 PM.
- Purpose: To make sure you are on track
- Format: Text
- Length: 3-4 paragraphs
- Content:
- Summary: Give a more detailed description of the activities you have been involved in so far, with links to (or mentions of) specific data sets or problems, and who you have been working with, if anyone. Describe which of your goals you have met already, and what you plan to do for the remainder of the semester.
- Techniques: Brief description of the techniques you have used
- Goals: Your goals for the remainder of the semester, both in terms of activities (what will you do) and education (what do you hope to learn)
3. Final report
- Final Report Due: December 10, 11:59 PM.
- Purpose: To assess your level of participation, effort, and learning
- Format: PDF
- Length: Roughly 3 pages single-spaced for 4802 students, and 4+ pages single-spaced for 5802 students
- Content:
- Summary: A 1-2 paragraph description of the activities you were involved in (e.g. competitions or prediction tasks) and who you worked with, if anyone
- Details: For each activity, give a detailed account of your approach and techniques, including descriptions of any hurdles you had to overcome. Include relevant plots and figures (note that these do not count toward the page length). If you participated in prediction competitions, include links to the leaderboard and/or a screenshot showing your score.
- Goals: Briefly describe whether you accomplished the goals you set out in your midterm reports.
- Attachments: Include any relevant code or other digital artifacts