Homework 10
Due: 2024/11/19 - 5PM
Programming (10 points)
Some more exercises with PCA and train/test datasets
0. Take a look at the Evaluation Metrics section of our WK10 notebook, and read Classification: Accuracy, recall, precision, and related metrics from the supplemental material.
1. Use the HW10 template to start a repository in your organization’s GitHub space. It should be named HW10. Open the notebook file using GitHub Codespaces to continue the exercises.
2. Complete the exercises in the notebook. Like last week, there are some interpretation questions on this one.
3. Submit a link to your repository with the completed exercises using Brightspace.
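As a warm-up for the Evaluation Metrics reading in step 0, here is a minimal sketch of how accuracy, precision, and recall come out of the four confusion-matrix counts. The labels below are made up for illustration, and the helper names (`confusion_counts`, `precision`, `recall`) are hypothetical rather than taken from the WK10 notebook.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive and truth != positive:
            fp += 1
        elif guess != positive and truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    """Of everything predicted positive, how much really was positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Of everything that really was positive, how much was found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up labels: 4 positive samples and 6 negative samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fp, fn, tn)
prec = precision(tp, fp)
rec = recall(tp, fn)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
```

Note that accuracy looks decent here even though the model misses half of the positive samples, which is the kind of gap the interpretation questions tend to ask about.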
Project (5 points)
Milestone 01: Ideation and High-Level Description
Welcome to the final project description!
This is an open-ended assignment where you get to explore and play with some of the concepts we covered in the class presentations, homework, and exercises. It is also an opportunity to get familiar with some of the recurring themes of ML that we talked about in class (everything is a number, reduce quantity but keep quality, etc.).
The goal is to make a “thing” with data, while keeping in mind some of the socio-economic-political implications of designing a system that works with data. Whose data is it? Who is designing it, and for whom?
The Thing
The project can take one of many forms.
It can be about data visualization, exploration and/or organization. Maybe there’s a dataset, or type of data, that you’re interested in exploring in depth. We can use some of the modeling techniques we saw to find patterns and create different narratives around huge collections of images or sounds, for example. This could involve the design and implementation of a simple dashboard or navigation interface, or just the backend for such an imaginary system.
The project could be about model development, similar to the classification and regression exercises. Maybe there’s some aspect of an environmental issue that can be detected and quantified using satellite image processing and/or weather data.
Even though we haven’t explicitly looked at specific examples, you could work on something generative, where you create new data from existing data. The first step in these kinds of models involves extracting relevant features from very diverse data samples. PCA is an example of a feature extraction algorithm, but there are others. Generative text has its own family of models and math tricks.
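To make the feature-extraction step concrete, here is a minimal sketch of PCA computed with NumPy's singular value decomposition. The data is synthetic and the function names (`fit_pca`, `transform`) are hypothetical; this is one common way to compute PCA, not necessarily how the class notebooks do it.

```python
import numpy as np


def fit_pca(X, n_components):
    """Return the feature means, the top principal directions, and
    the fraction of variance each of those directions explains."""
    mean = X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions.
    _, singular_values, Vt = np.linalg.svd(X - mean, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return mean, Vt[:n_components], explained[:n_components]


def transform(X, mean, components):
    """Project samples onto the principal directions."""
    return (X - mean) @ components.T


# Synthetic data (an assumption for illustration): 200 samples with
# 5 features that mostly vary along 2 hidden directions, plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

mean, components, explained = fit_pca(X, n_components=2)
features = transform(X, mean, components)

print("extracted features shape:", features.shape)
print("variance explained by 2 components:", round(float(explained.sum()), 4))
```

The extracted `features` array is the smaller, "reduce quantity but keep quality" representation that a later model, generative or otherwise, could work from.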
The project can also be a tool for creative expression and abstract exploration of data. We’ve been looking at the mathematical and programming components of a lot of different kinds of models. Can these be adjusted or hacked for more subjective uses?
And, finally, the project could be about collecting data to create a dataset that doesn’t exist. This might require practical skills from other classes, but as long as it’s informed by the materials we saw in this class, it’s a good way to practice ML-thinking.
This could also be an opportunity to start a bigger project that you will continue to develop as part of your thesis or independent research.
This Milestone
Research the field and datasets for ideas and possibilities.
The books in our resources page can be a starting point. Another great resource, for projects and datasets, is Kaggle. The Hugging Face tasks page and blog are also good. They tend to use larger, more complex models and datasets that are a little bit beyond the scope of our class, but they could be a source of ideas.
Write about a couple of possibilities that you’d like to explore. It would be good to have more than one option, but you don’t need more than three.
Include some preliminary information about data and datasets. Does the data exist? Is it in one dataset or will you have to combine datasets? How big are these datasets?
We’ll discuss specific models and strategies in the next milestone, but for now, you should have an idea of the kind of processing required by each idea. Think about the overall processing, modeling, and evaluation flow. Is there a lot of pre-processing? Does it make sense to split data into train/test datasets? Would it use one of the classic ML techniques we’ve looked at, or does it require bigger, more robust, more specific types of modeling? How would you evaluate your model/tool?
Write about your ideas in a text file, and include images, graphs, or sketches where appropriate.
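If it helps to picture the processing, modeling, and evaluation flow while you write, here is a minimal sketch of one using only the Python standard library. Everything in it is an assumption for illustration: the toy data is randomly generated, the nearest-centroid classifier is a stand-in for whatever model your idea needs, and the function names are hypothetical.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_centroids(train):
    """Model step: compute the mean feature vector for each label."""
    sums, counts = {}, {}
    for features, label in train:
        totals = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            totals[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: [total / counts[label] for total in totals]
        for label, totals in sums.items()
    }


def predict(centroids, features):
    """Pick the label whose centroid is closest to the sample."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Toy data (made up): two well-separated clusters labeled "a" and "b".
rng = random.Random(42)
data = [([rng.gauss(0, 1), rng.gauss(0, 1)], "a") for _ in range(40)]
data += [([rng.gauss(6, 1), rng.gauss(6, 1)], "b") for _ in range(40)]

# 1. Split, 2. fit on the training rows only, 3. evaluate on held-out rows.
train, test = train_test_split(data)
model = fit_centroids(train)
correct = sum(predict(model, features) == label for features, label in test)
test_accuracy = correct / len(test)

print(f"train size={len(train)} test size={len(test)}")
print(f"test accuracy={test_accuracy:.2f}")
```

The point is the shape of the flow rather than the model: the test rows never influence the fit, so the final number says something about data the model has not seen.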
Submission
Document your research and ideas in a repository called Project in your GitHub account. You can write your proposal and findings using Markdown, and GitHub will process and display them nicely, with formatted text and images.
Submit a link to this repository through Brightspace.
Future Milestones
- Milestone 02 - 2024/11/26: Data and a Plan. Maybe some preliminary graphs.
- Milestone 03 - 2024/12/03: Initial Modeling/Programming. Explore classic ML, if sensible.
- Milestone 04 - 2024/12/10: Tuning and Adjusting.
- Milestone 05 - 2024/12/17: Final Presentation and Discussion