STAT437 – Unsupervised Learning - Final Project

Published: 2024-06-08


STAT437 – Unsupervised Learning - Final Project – 100 Points

Due: May 7th by 11:59pm CST on Canvas.

Group Report

You should work in groups of 3-4 people

You must do at least 25% of the work in order to get full credit.

To receive full credit, you should follow the steps and answer the questions given in this document for your project. However, if you think that there are additional questions or analyses that would add additional insights to your overall research goal, you’re more than welcome to pursue these in addition to what is stipulated in this document.

Main Purpose of the Analysis

Identifying, Exploring, and Describing the “Inherent” Clusters

The purpose of this analysis is to learn as much as you can about the dataset, the clustering structure, and the clusters that exist in your dataset. Ideally, you will be able to identify and describe each of the “inherent” clusters that exist in your high-dimensional dataset.

Actionable Insights

In your research motivation you should identify at least one type of person, application, etc. for which the insights extracted from your clusters would be useful and actionable.

Thinking about what makes a cluster(ing) “meaningful”?

Different Cluster Definitions

For instance, one could apply k-means to an unclusterable dataset and return, say, k=3 clusters. But would these k=3 clusters be considered “meaningful”? Depending on the research motivation for the cluster analysis, perhaps not. If the person that you describe in your research motivation considers a “meaningful” cluster to be a set of observations that is relatively well-separated from other observations, then this set of k-means clusters would not be meaningful and could be misleading about the nature of the dataset.
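The scenario above can be sketched with a hand-written Lloyd's-style k-means run on synthetic uniform noise. The data, the `kmeans` helper, and the seeds are illustrative assumptions, not part of the project materials. The algorithm still partitions the unclusterable points into k=3 groups, so the output alone cannot tell you whether the clusters are meaningful.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm (hypothetical helper); returns one label per row of X."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random (fancy indexing copies).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points (leave it alone if empty).
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Uniform noise on the unit square: no cluster structure by construction.
rng = np.random.default_rng(437)
X = rng.uniform(size=(300, 2))

labels = kmeans(X, k=3)
# Cluster sizes: k-means reports a partition even though nothing is separated.
print(np.bincount(labels, minlength=3))
```

Whether such a partition is useful depends on the cluster definition implied by your research motivation, not on the fact that the algorithm returned labels.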

Variable Scaling

Furthermore, you should consider whether the type of person/application described in your research motivation would want the clustering structure of the dataset to be dominated by higher-scale variables, or would want each variable to contribute to the clustering structure equally. If they would want equal contributions, then you should scale your variables first.
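As a minimal sketch of what scaling does, the snippet below z-scores a small made-up dataframe (the column names and values are hypothetical). Before scaling, distances between rows are driven almost entirely by `income`; after scaling, both columns have mean 0 and standard deviation 1.

```python
import pandas as pd

# Hypothetical data: two variables on very different scales.
df = pd.DataFrame({
    "income": [32000.0, 54000.0, 41000.0, 120000.0],
    "age": [23.0, 45.0, 31.0, 52.0],
})

# Spread of each variable before scaling.
print(df.std(ddof=0))

# z-score scaling: subtract the column mean, divide by the column std.
scaled = (df - df.mean()) / df.std(ddof=0)
print(scaled.round(2))
```

Z-scoring is only one option. Other scalings (or deliberately unequal weights) may suit your research motivation better.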

Irrelevant Clustering Structure Contributions

By performing basic descriptive analytics on your dataset (even after variable scaling), do you think that the clustering structure detected by some of your clustering related algorithms may still be dominated by some variables more than others?

• For instance, do you suspect that the clustering structure (or clustering algorithm results) might be dominated by the categorical variables? If so, what are some work-arounds you can try for this?

• Do you have some numerical variables that only have a few distinct values, but the gaps between these values don’t necessarily indicate any meaningful clusters?

o Ex: a clustering structure that is just defined by the discrete value gaps of 4 shoe sizes (7, 8, 9, 10) doesn’t necessarily indicate a meaningful clustering structure.
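One quick way to spot numerical variables with only a few distinct values is to count the unique values per column. This is a sketch on made-up data; `shoe_size` and `height_cm` are hypothetical column names.

```python
import pandas as pd

# Hypothetical data: one discrete-valued column, one continuous column.
df = pd.DataFrame({
    "shoe_size": [7, 8, 9, 10, 8, 9, 7, 10],
    "height_cm": [160.2, 171.5, 175.1, 182.3, 168.9, 177.4, 158.7, 185.0],
})

# Number of distinct values in each column.
distinct_counts = df.nunique()
print(distinct_counts)
```

A column with very few distinct values relative to the number of rows is a candidate for closer inspection before you interpret any clusters that line up with its value gaps.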

Clustering Results that Only Exist because of some Data Preprocessing Property/Decision

For instance, does your dataset have a suspicious amount of 0 values for a given variable, whereas most of the rest of the variable values are much higher than 0? Do you suspect that whoever made/curated the dataset might have imputed missing values with 0’s? If so, you might want to consider dropping these rows (or columns) because a clustering algorithm that simply finds clusters of observations that had missing values is not very interesting.
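A rough check for this pattern is to compare the share of exact zeros in each column with the smallest nonzero value. The sketch below uses invented values and hypothetical column names.

```python
import pandas as pd

# Hypothetical data: "glucose" has zeros that sit far below all other values.
df = pd.DataFrame({
    "glucose": [0, 0, 110, 95, 0, 130, 102, 88],
    "bmi": [22.1, 30.4, 27.8, 25.0, 31.2, 24.6, 29.9, 26.3],
})

# Fraction of rows that are exactly 0 in each column.
zero_share = (df == 0).mean()
# Smallest value in each column once zeros are masked out.
nonzero_min = df[df != 0].min()
print(zero_share)
print(nonzero_min)

# One possible response: drop the rows with suspicious zeros.
cleaned = df[df["glucose"] != 0]
print(len(cleaned))
```

A large zero share combined with a large gap to the smallest nonzero value is only a hint that zeros may stand in for missing values. Check the dataset documentation before deciding to drop rows or columns.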

Masking Variables

A masking variable in a dataset that you intend to cluster is a variable that does not contribute anything to the clustering structure and can weaken the ability of an algorithm to detect the clustering structure. Do you think that your dataset has any masking variables? If so, you might consider deleting them or giving them smaller scaling weights.

Additional Analyses

If you think that there are additional questions or analyses that would add additional insights to your overall research goal, you’re more than welcome to pursue these in addition to what is stipulated in this document.

Intended Audience/Reader of your Project

The intended audience of your report/presentations should be someone who has taken this STAT437 class.

Theoretically, you should be able to send/present your report to one of your classmates (who is not on your team), and they should be able to understand everything that you did and the claims that you are making.

Project AI Tools Policy

To reiterate, code or text that exhibits the linguistic style or structure of AI tools like ChatGPT will lose significant points in the corresponding professionalism sections in the report rubric. In addition to running the risk of the AI tools making false claims or writing incorrect code (for which you would also lose points for correctness), using AI tools to generate technical content or code can make the reader of your report/code more skeptical that you understand the claims/code that have been written.

Other reasons why ChatGPT generated content in a report can lead to a decline in report professionalism are the following.

• The report is written in a style/to an audience that is not our intended target audience (i.e. a boss/client/researcher that has the same STAT437 knowledge as you). For instance, your report should not read as if you are educating the reader on a new topic that they are not familiar with.

• The report injects superfluous, off-topic sentences/code that are not relevant to the target reader and your overarching research motivation. This can make your report less concise, readable, and clear.

• The report discusses broad potential generalities, rather than addressing what you know about the actual dataset that you are exploring.

What’s not ok

• GENERATING code/content with AI tools will lose points.

What is ok

• Using AI to TRANSLATE written content in another language (that has been written by YOU) to English is ok.

Qualitative Evaluation Criteria

In addition to being graded for correctness and completion, this project will also be graded on a qualitative basis. Qualitatively, we will be looking for the following things.

Clarity about Analyses, Algorithms, and Data Choices

o Someone who has taken this class should be able to read through your report and/or watch your presentation and easily be able to do the following.

Replicate what you did in your analyses.

Know why you made the choices that you did in your analyses.

Clarity about Motivation (i.e. the “so what?”) of your Analyses

o Beginning of the Report and Presentation:

Someone who is about to read your report and watch your presentation should be able to clearly answer the following questions.

• “Why should I (or someone else) care about the report that I am about to read/listen to?”

• “What research questions do they intend to answer?”

• “How do these research questions relate to their motivation?”

Therefore, in the introduction of your report and presentation you should make this clear.

o Middle of the Report and Presentation:

While in the middle of your report and presentation, your audience should be able to clearly answer the following question.

• “How does each of these analyses/algorithms/data choices that they’re making/using tie back into the overarching motivation of this whole analysis?”

Therefore, for each new analysis/model/algorithm/data choice that you make, you should explain this and make it clear to your audience.

o End of the Report and Presentation:

Someone who has just finished reading your report and watching your presentation should be able to clearly answer the questions:

• “Why should I (or someone else) care about the analysis that I just read/listened to?”

• “Did their analyses and conclusions answer the research questions that they stated at the beginning of the report/presentation? If so, how? What were the answers to these research questions?”

• “How would the results/answers to these research questions be useful to someone?”

Therefore, in the conclusion of your report and presentation you should make this clear.

Project Format [5 components]

Project Report [76 pt]

Deadline: Tuesday, May 7th 11:59pm CST on Canvas.

Should contain: Everything stipulated in the Project Report Specifications discussed below.

Format:

o Jupyter notebook.

o This should look like a clean data analysis report that you would theoretically submit to an employer (not a homework assignment). Thus, at the very least, your report should have:

a title

headings for each of your sections

You should write paragraphs and in complete sentences.

o You can use and modify the attached project Final_Project_YOURNAMESHERE.ipynb file as a template for this report if you’d like. You can add and delete as many cells as you’d like in this file.

Graded:

o See “Project Report Specifications” section below for point breakdown.

Project Presentation [16 pt]

Presentation Date: Wednesday May 8am CST in-person, LOCATION TBD.

Format:

o Your presentation should be no more than 9 minutes.

o You must present some part of the presentation in order to get full presentation credit.

o Presentation should be presented in slides (not the Jupyter notebook).

Graded:

o See attached presentation rubric for what you should present and how you will be graded.

Report Peer Evaluation [4 pts]

Deadline: Sunday, May 12 11:59pm CST on Canvas.

• Steps:

o You will be randomly assigned to read another group’s report (as an individual).

o After reading their report you will fill out a survey form on Canvas, which will ask you the following questions (see last pages of this document).

• Graded:

o For completeness

Presentation Peer Evaluation [2.5 pts]

Deadline: Sunday, May 12 11:59pm CST on Canvas.

• Steps:

o You will be randomly assigned to watch another group’s presentation (as an individual). It will not be the same group that you read the report for.

o After watching their presentation, you will fill out a survey form on Canvas, which will ask you the following questions (see last pages of this document).

• Graded:

o For completeness

Individual Research Impact Questions [1.5 pts]

Deadline: Sunday, May 12 11:59pm CST on Canvas.

• Steps:

o As an individual, you must do at least 25% of the work in your team to get full credit.

o You will be asked a few questions about the work that you individually contributed to your group.

o You should have an understanding as to how your individual contributions influenced and were influenced by the insights and decisions made by your teammates. (See questions in last pages of this document)

• Graded:

o For completeness

Dataset Options

You can choose your own dataset or you can choose from one of the three supplied datasets below.

The csvs for each of these datasets are located in the same folder that this document is in. There is more information about each of these datasets below.

Choosing your Own Dataset

There are several places you can go to find interesting datasets, but here are some places you can start.

https://www.kaggle.com/datasets

https://archive.ics.uci.edu/ml/datasets.php

https://corgis-edu.github.io/corgis/csv/

https://data.world/datasets/clustering

https://github.com/fivethirtyeight/data

For students interested in sports data:

a. NFL: https://www.nflfastr.com/

b. MLB and other baseball: https://billpetti.github.io/baseballr/

c. CFB: https://saiemgilani.github.io/cfbfastR/index.html

d. More sports stuff: https://sportsdataverse.org/

If you decide to choose your own dataset, it must meet the following specifications.

1. Dataset Size Specifications

Your dataset should have:

at least 5 attributes (not including the pre-assigned class labels if there are any) and

at least 50 rows

(If there’s a dataset that doesn’t meet these specifications, but that you’re really interested in, you can talk to me about it.)

2. Data Cleaning and Scaling Considerations

Before moving on to checking whether the dataset is clusterable below, you should think about any type of cleaning and scaling that would need to be done in order to create an insightful analysis. Do this cleaning and scaling before moving on to the clusterability check.

3. Clusterability and Cluster Algorithm Fit Specifications

In this project you will be asked to apply at least two clustering algorithms to this dataset. Thus, before proceeding with further analysis, you should do the following.

• First, test whether your dataset is clusterable.

o You should apply the t-SNE algorithm on your scaled and/or unscaled dataset (depending on what you intend to use).

o If the t-SNE algorithm suggests that there is a clustering structure, your dataset has passed this check.

• Clustering Algorithm Suitability

o Next, you want to make sure that you know of at least two clustering algorithms that will be able to cluster this particular type of dataset. For instance, if this is a numerical, structured dataset, then you know many clustering algorithms that can take this dataset as input. (You are not constrained to the clustering algorithms that we have learned in this class. However, ensuring that there are at least two clustering algorithms that we have learned in this class that will cluster your dataset can be a useful backup just in case the work in this project takes longer than you expected.)
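The t-SNE clusterability check described above might be sketched as follows, assuming scikit-learn is available. The synthetic two-blob data, the perplexity values, and the seed are illustrative assumptions. With your own data, `X` would be your cleaned (and possibly scaled) numerical array.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for a real dataset: two well-separated blobs in 5 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 1, size=(40, 5)),
    rng.normal(8, 1, size=(40, 5)),
])

# t-SNE output depends on perplexity (and on the random seed), so it is
# worth comparing several embeddings rather than relying on a single one.
embeddings = {}
for perplexity in (5, 15, 30):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)

for perplexity, emb in embeddings.items():
    print(perplexity, emb.shape)
```

Each embedding can then be drawn as a scatterplot. Groupings that persist across several perplexity values and seeds are more convincing evidence of clustering structure than groupings that appear in only one plot.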

Dataset Options (if you don’t want to choose your own)

1. Breast Cancer Gene Expression Profile Data: Chanrion et al. (2008) reported results of a study of 155 patients treated for breast cancer with tamoxifen. The patients were followed for a period of time and diagnosed as having a recurrence of breast cancer (R), or being recurrence free (RF).

Various clinical measurements were made including tumor size at the time of treatment. The gene expression levels were measured for a large number of gene sequences. Here we focus on a sample of 50 gene sequences.

The data for this example are in two different files:

• 'clinical_data.csv' contains the clinical measurements and status for the patients in the study

• 'gene_expr.csv' contains the gene expression values for 50 genes.

I’d suggest clustering the 155 patients based on their 50 gene expression values in 'gene_expr.csv', and then (potentially) using one or more of the columns in 'clinical_data.csv' as an interesting “pre-assigned class label”. For instance, you might explore the nature of the relationship between the clustering structure identified in the 50 genes and one (or more) of the columns in 'clinical_data.csv'. You should check out this paper to learn more about the dataset.

Chanrion, Maïa, et al. "A gene expression signature that can predict the recurrence of tamoxifen-treated primary breast cancer." Clinical Cancer Research 14.6 (2008): 1744-1752.

2. Sample of the Afro-MNIST Dataset: The ethiopic_MNIST_sample.csv file in the zip file contains a random sample of 1000 28-by-28 pixel images of digits (“1”-“10”) from the Ge‘ez (Ethiopic) Afro-MNIST dataset. The Ge‘ez script lacks the digit “0”, so the images in this file represent the numerals 1-10 for that script.

The full dataset and more information about the full dataset can be found here:

https://www.kaggle.com/datasets/danjwu/afromnist

3. U.S. State Substance Abuse Rates: This dataset is about substance abuse (cigarettes, marijuana, cocaine, alcohol) among different age groups and states. Data was collected from individual states as part of the NSDUH study. The data ranges from 2002 to 2018. Both totals (in thousands of people) and rates (as a percentage of the population) are given.

Analysis tips/ideas:

• Rates: Given that U.S. states have different populations, a more insightful analysis would focus on just clustering the rates columns in this dataset.

• 51 Rows: The objects that you cluster in this analysis should be individual states. Otherwise you will most likely get 50 clusters that correspond to each state, which is not very insightful. So the dataset that you cluster should only have 51 rows (50 states + Washington DC).

• Years: You might choose to cluster this dataset in the following way.

• Cluster the rates for just one year (like 2018)

• Cluster the rates for all years (or multiple years) all at once.

• (See code below, for instance, for how you can create columns in a dataframe that correspond to the 'Rates.Tobacco.Use Past Month.12-17' and 'Rates.Tobacco.Use Past Month.18-25' rates for each year).

• You might choose to cluster the rates for just one year, and then cluster the rates for another year, then compare the clustering results of the two years.

• Which Rates: You can choose to cluster all the rate columns. Alternatively, you might choose to focus on a subset of rates like:

• just one age group

• just a subset of drugs or drug activities

The full dataset and more information about the full dataset can be found here: https://corgis-edu.github.io/corgis/csv/drugs/

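# df is the substance abuse dataframe, with one row per (State, Year).
# Reshape it to one row per state, with one column per (rate, year) pair.
pivoted_df = df.pivot(index='State', columns='Year',
                      values=['Rates.Tobacco.Use Past Month.12-17',
                              'Rates.Tobacco.Use Past Month.18-25'])
# Flatten the (rate, year) column MultiIndex into names like '<rate>_<year>'.
# Doing this before reset_index() keeps the 'State' column from becoming 'State_'.
pivoted_df.columns = ['_'.join(map(str, col)) for col in pivoted_df.columns]
pivoted_df = pivoted_df.reset_index()
pivoted_df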

Project Report

Your report should include the analyses, code, and explanations detailed in each of the following sections.

General Report Professionalism Points

Report Professionalism

* Content written in the linguistic style of ChatGPT (or other AI content/code GENERATION tools) will automatically lose all points in this section. I don't write recommendation letters for anyone who looks like they used ChatGPT (or other AI content/code generation tools) to write their section of their report. However, foreign language to English TRANSLATION tools are ok.

* Write in complete sentences

* Write text in the markdown files (not code blocks).

* You are not copy-pasting the prompts/questions from this rubric and answering them. Rather, you should incorporate the requirements in this rubric naturally into a paragraph.

* Appropriate titles/headers are used. [6 pt]

1. Introduction

You should write an introduction (1-2 paragraphs) for your report. Your introduction should include/incorporate the following things.

Research Introduction and Motivation

* Clearly state the motivation for why someone might want to learn more about the clusters that exist in this dataset.

* Describe at least one type of person/application that may find your cluster analysis useful and how they might use it. Be specific about how they might use it.

* You should use at least THREE CITATIONS that support your motivation/answers in this section.

* Make sure that your citations are referenced and cited appropriately in this document. [3 pt]

How research motivation impacts the desired type of clustering

Based on the research application or motivation that you selected, discuss whether the person using the results of your analysis would find the following potential types of clustering results useful. If the type of result would be useful, explain HOW it might be useful to this particular application/person. If this result would NOT be useful, explain why.

* A clustering where each cluster is well-separated.

* A clustering where at least one cluster was a singleton outlier.

* A fuzzy clustering

* A dendrogram which displays a nested cluster relationship. [3 pt]

2. Dataset Discussion

You should write a paragraph in your report discussing your dataset(s) that you will be using to answer these research questions. This paragraph should include/incorporate the following things.

Dataset Display

* Read your csv file and display the first 5 rows of your dataframe.

* How many rows are in your dataframe (originally before any data cleaning)? [0.25 pt]

Dataset Source

* State where YOU got this csv file (dataset) from.

* Provide a link/reference to where it came from.

* State when you downloaded this csv file. [0.5 pt]

Original Dataset Information

In the place where you found this dataset, try to answer the following questions. If the source does not give the answer to these questions, say so.

* What do the rows (ie. observations) represent in this dataset?

* How was this dataset collected?

* Is this dataset inclusive of ALL possible types of observations that could have been considered in this dataset? If not, what types of observations might be left out?

* How does your answer to the question above impact the types of actions that the person in your research motivation might take based on the answer to your research questions? [2.5 pt]

Selected Variables

* Briefly describe the variables you intend to use in your analysis. [1 pt]

3. Basic Dataset Cleaning and Exploration

You should show and discuss any dataset cleaning decisions that you made in this section.

Missing Value Detection and Cleaning

* Does your dataset have any missing values?

* If so, clean these missing values.

* Are there any downsides to cleaning the missing values in this particular way? [1.5 pt]
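The missing-value steps above could be sketched like this on a small invented dataframe (the column names are hypothetical). It shows two common cleaning choices side by side, since each has a different downside: dropping rows shrinks the dataset, while imputing introduces values that were never observed.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries.
df = pd.DataFrame({
    "x1": [1.0, np.nan, 3.0, 4.0],
    "x2": [10.0, 20.0, np.nan, 40.0],
    "x3": [5.0, 6.0, 7.0, 8.0],
})

# Detection: number of missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Choice A: drop every row that has at least one missing value.
dropped = df.dropna()
print(len(df) - len(dropped), "rows dropped")

# Choice B: fill each missing value with its column median.
imputed = df.fillna(df.median())
print(imputed)
```

Which choice is appropriate depends on how many rows are affected and on whether imputed values could create artificial clusters in your analysis.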

Outlier Identification - Two Variable Outliers

* For every pair of numerical explanatory variables that you're using, create a scatterplot.

* Are you able to detect any outliers in these plots? [1 pt]

Outlier Identification - 3+ Variable Outliers

* Use one of the techniques that we discussed in this class that has the ability to identify outliers including those that can only be seen in 3 or more dimensions.

* Does this analysis suggest that there are additional outliers that we were not able to detect in our scatterplots? [2.5 pt]
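This document does not name the techniques covered in class, so the following is only an illustration of the general idea and uses squared Mahalanobis distance as an assumed stand-in. It scores each observation against the joint correlation structure of all variables, so it can flag a point whose individual coordinates are unremarkable. The synthetic data and the planted outlier are invented for the example.

```python
import numpy as np

# Synthetic data: three strongly correlated variables (all close to z).
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 1))
X = z + 0.1 * rng.normal(size=(200, 3))

# Planted outlier: each coordinate is within the usual range,
# but the combination breaks the correlation pattern.
X = np.vstack([X, [1.0, -1.0, 1.0]])

# Squared Mahalanobis distance of every row from the sample mean.
centered = X - X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)

# Index of the most extreme observation under this measure.
most_extreme = int(d2.argmax())
print(most_extreme, round(float(d2[most_extreme]), 1))
```

If your class covered a different multivariate outlier technique, use that one for the rubric item. The point of the sketch is only that the score is computed from all variables jointly.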

Outlier Consideration

* In the context of your research motivation, what do you think should be done with any identified outliers? Explain.

- Should they be dropped? If so, what are some of the pros and cons of dropping these outliers?

- Should they be clustered in their own singleton clusters?

- Should they be clustered with larger clusters that may happen to be further away?

* If you identified outliers in your dataset, does this impact the type of clustering algorithms or clustering evaluation metrics that you might use in your analysis? Explain. [2 pt]

Noise Consideration and Identification

* Use a technique discussed in this class to determine if your dataset has any noise.

* In the context of your research motivation, what do you think should be done with any identified noise? Explain.

* If you identified noise in your dataset, does this impact the type of clustering algorithms or clustering evaluation metrics that you might use in your analysis? Explain. [2 pt]

Other Data Cleaning

* Were there any other data cleaning steps that you deemed suitable for this analysis? What were they? Why did you choose to perform this additional data cleaning?

* If so, perform them here.

* If you dropped rows, how many did you drop? [0.5 pt]

4. Basic Descriptive Analytics

Before using any unsupervised learning algorithms, you should learn more about your dataset by performing some basic descriptive analytics.

OPTION 1

If your dataset is a structured dataset (ie. not image, audio, time-series data etc.), do the following.

* For your numerical attributes, calculate basic summary statistics about each attribute.

* For any categorical attributes (including the pre-assigned class labels, if your dataset has any) count up the number of observations of each type.

* Determine whether there are any strong pairwise relationships between the variables in your dataset. [2 pt]
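For a structured dataset, the three items above map onto a few pandas calls. The sketch below uses a small invented dataframe with hypothetical column names.

```python
import pandas as pd

# Hypothetical structured dataset: three numerical columns, one categorical.
df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.1, 3.9, 6.2, 8.1, 9.8],
    "x3": [5.0, 1.0, 4.0, 2.0, 3.0],
    "group": ["a", "b", "a", "a", "b"],
})

# Summary statistics for the numerical attributes.
summary = df.describe()
print(summary)

# Number of observations in each category.
counts = df["group"].value_counts()
print(counts)

# Pairwise Pearson correlations between the numerical attributes.
corr = df.select_dtypes("number").corr()
print(corr.round(2))
```

Pearson correlation only captures linear relationships, so pairwise scatterplots remain a useful complement to the correlation matrix.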

OPTION 2

If your dataset is an image dataset, do the following.

1. If your dataset has pre-assigned class labels:

* Visualize the first few images of each type of class-label.

* Discuss how much image variability each of the classes has, and what image elements are different.

2. If your dataset DOES NOT have pre-assigned class labels:

* Visualize a random sample of images from this dataset.

* Discuss how much image variability the images in your dataset have, and what image elements are different. [2 pt]

5. Scaling Decisions

From your analyses conducted here, discuss whether you should scale the dataset or not. Explain why or why not. If you choose to scale, then do so in this section. [1 pt]

6. Clusterability and Clustering Structure Questions

Does your analysis suggest that the dataset is clusterable? (The answer to this should be yes). Explain why. [1.75 pt]

Describe the Underlying Clustering Structure of the Dataset

* Approximately how many underlying clusters does the data have?

* What are the shapes of the underlying clusters?

* Are the clusters balanced in size?

* Are there any clusters that are not well-separated?

* Is there any evidence of nested cluster relationships in this dataset? [2.5 pt]

Clustering Structure and Attribute Association

* Is there an association between each of the attributes and the clustering structure suggested by the t-SNE plot? Show the appropriate visualizations to explain. [1.5 pt]

Understanding the t-SNE algorithm

* Caution your reader about 4 original dataset properties that your t-SNE plots are not able to reveal or represent. [1 pt]

7. Clustering Algorithm Selection Motivation

Clustering Algorithm #1

Explain why you chose to use your first clustering algorithm to cluster this dataset. In your explanation, you should discuss and consider:

* your research motivation

* the "ideal dataset properties" that this algorithm is designed to work best for (your analyses above should give you a sense as to whether many of these ideal properties are met or not). [2.5 pt]

Clustering Algorithm #2

Explain why you chose to use your second clustering algorithm to cluster this dataset. In your explanation, you should discuss and consider:

* your research motivation

* the "ideal dataset properties" that this algorithm is designed to work best for (your analyses above should give you a sense as to whether many of these ideal properties are met or not). [2.5 pt]

Follow Up Question:

With what you have observed about the nature of the dataset clustering structure and your particular research motivation, do you believe that there is any benefit to returning a fuzzy clustering of the dataset? Explain. If so, how might this result be useful with respect to your research motivation? [0.75 pt]

Follow Up Question:

With what you have observed about the nature of the dataset clustering structure and your particular research motivation, do you believe that there is any benefit to returning a hierarchical clustering of the dataset? Explain. If so, how might this result be useful with respect to your research motivation? [0.75 pt]