DS4400: Machine Learning I Summer-II 2023 Homework-II (Part-II)
Generative Models: Document Classification Using Naïve Bayes
Due Date: Friday 8/4/2023 by 11:59 PM
Introduction
In this assignment, we’ll implement and evaluate Naïve Bayes document classification models. As a starting point, you can reuse the code from the provided lab in the Generative Models module on the course Canvas page. Your implementation should be in Python. You may use any libraries you wish for visualizing results, but your algorithms should be hand-crafted from scratch and must not use scikit-learn or other ML libraries.
Instructions
1. Load the 20 Newsgroups dataset using sklearn’s datasets module (link). Download the data for the following five topics (download both the train and test components, and remove the metadata from the text blobs):
a. comp.sys.ibm.pc.hardware
b. comp.sys.mac.hardware
c. rec.sport.baseball
d. rec.sport.hockey
e. talk.politics.guns
2. Preprocess the data to get rid of escape sequences, remove stop words, and clean the data (hint: look at how the lab preprocesses the data).
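Steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the lab's code: the helper names `load_split` and `preprocess`, the short `STOP_WORDS` set, and the choice to keep only alphabetic tokens are all assumptions made for this example. The only library call is `fetch_20newsgroups` from sklearn's datasets module, which the assignment permits for loading.

```python
import re

# The five topics from step 1, in the lowercase form the dataset uses.
CATEGORIES = [
    "comp.sys.ibm.pc.hardware",
    "comp.sys.mac.hardware",
    "rec.sport.baseball",
    "rec.sport.hockey",
    "talk.politics.guns",
]

# Tiny illustrative stop-word list (an assumption; the lab may use a longer one).
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "this", "to", "was", "with",
}


def load_split(subset):
    """Fetch one split ('train' or 'test') with the metadata stripped."""
    # Imported lazily so the preprocessing helper works without scikit-learn.
    from sklearn.datasets import fetch_20newsgroups

    bunch = fetch_20newsgroups(
        subset=subset,
        categories=CATEGORIES,
        remove=("headers", "footers", "quotes"),  # drop metadata from the text
    )
    return bunch.data, bunch.target, bunch.target_names


def preprocess(text):
    """Lowercase, replace escape sequences and non-letters, drop stop words."""
    text = text.lower()
    text = re.sub(r"[\n\r\t]+", " ", text)  # escape sequences become spaces
    text = re.sub(r"[^a-z\s]", " ", text)   # keep letters only (an assumption)
    return [t for t in text.split() if t not in STOP_WORDS and len(t) > 1]
```

A call such as `docs, labels, names = load_split("train")` followed by `[preprocess(d) for d in docs]` would then give one token list per document.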
Bernoulli Event Model:
a. Using the code from the lab, fit the parameters of a Bernoulli event model for the five topics, then list and analyze the ten top-ranking words for each topic.
b. Implement the predict function for the Bernoulli multivariate event model that predicts the topic of a test document (Note: do not smooth the test document vectors). (20 points)
c. For the test set, estimate the precision and recall for each individual topic.
Additionally, calculate the overall precision and recall across the five topics using micro, macro, and weighted macro aggregations (15 points). Report the metrics for:
i. Top 100 most common words
ii. Top 1000 most common words
iii. Top 10000 most common words
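The Bernoulli steps above might be sketched as follows, assuming each document is already a list of tokens and each label is an integer class index. The function names `build_vocab`, `fit_bernoulli`, and `predict_bernoulli` are hypothetical, and the Laplace (add-one) smoothing of the training estimates is an assumption of this sketch rather than something the assignment specifies. Test documents are treated as plain word-presence sets and are not smoothed.

```python
import math
from collections import Counter


def build_vocab(docs, size):
    """Return the `size` most common words across tokenized documents."""
    counts = Counter(tok for doc in docs for tok in doc)
    return [word for word, _ in counts.most_common(size)]


def fit_bernoulli(docs, labels, vocab, n_classes):
    """Estimate class priors and P(word present | class) for each vocab word."""
    index = {w: j for j, w in enumerate(vocab)}
    doc_counts = [0] * n_classes
    present = [[0] * len(vocab) for _ in range(n_classes)]
    for doc, y in zip(docs, labels):
        doc_counts[y] += 1
        for w in set(doc):  # presence only, so each word counts once per doc
            if w in index:
                present[y][index[w]] += 1
    priors = [c / len(docs) for c in doc_counts]
    # Add-one smoothing on the training estimates (an assumption of this sketch).
    probs = [
        [(present[k][j] + 1) / (doc_counts[k] + 2) for j in range(len(vocab))]
        for k in range(n_classes)
    ]
    return priors, probs


def predict_bernoulli(doc, vocab, priors, probs):
    """Return the class index with the highest log-posterior score."""
    words = set(doc)  # binary test vector, left unsmoothed
    scores = []
    for k, prior in enumerate(priors):
        score = math.log(prior) if prior > 0 else float("-inf")
        for j, w in enumerate(vocab):
            p = probs[k][j]
            # Absent words contribute too: that is what makes it Bernoulli.
            score += math.log(p) if w in words else math.log(1.0 - p)
        scores.append(score)
    return max(range(len(scores)), key=scores.__getitem__)
```

Calling `build_vocab(train_docs, 100)`, then `1000`, then `10000` would give the three vocabulary sizes the report asks for. Working in log space avoids underflow when the vocabulary is large.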
3. Multinomial Event Model:
a. Implement the fit function for the multinomial event model for the five topics. (20 points)
b. Fit the model to the training data and identify the top 10 words for each topic (10 points)
c. Implement the predict function for the multinomial event model that predicts the topic of a test document (Note: do not smooth the test document vectors). (20 points)
d. For the test set, estimate the precision and recall for each individual topic.
Additionally, calculate the overall precision and recall across the five topics using micro, macro, and weighted macro aggregations (15 points). Report the metrics for:
i. Top 100 most common words
ii. Top 1000 most common words
iii. Top 10000 most common words
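One possible shape for the multinomial fit and predict functions is sketched below, again assuming tokenized documents, integer class labels, and a fixed vocabulary list. The names `fit_multinomial`, `predict_multinomial`, and `top_words` are hypothetical, and add-one smoothing of the training counts is an assumption. Test documents contribute their raw word counts, with out-of-vocabulary words ignored.

```python
import math


def fit_multinomial(docs, labels, vocab, n_classes):
    """Estimate class priors and P(word | class) from word occurrence counts."""
    index = {w: j for j, w in enumerate(vocab)}
    doc_counts = [0] * n_classes
    word_counts = [[0] * len(vocab) for _ in range(n_classes)]
    for doc, y in zip(docs, labels):
        doc_counts[y] += 1
        for w in doc:  # every occurrence counts, unlike the Bernoulli model
            j = index.get(w)
            if j is not None:
                word_counts[y][j] += 1
    priors = [c / len(docs) for c in doc_counts]
    probs = []
    for k in range(n_classes):
        # Add-one smoothing on the training counts (an assumption of this sketch).
        total = sum(word_counts[k]) + len(vocab)
        probs.append([(c + 1) / total for c in word_counts[k]])
    return priors, probs


def predict_multinomial(doc, vocab, priors, probs):
    """Return the class index with the highest log-posterior score."""
    index = {w: j for j, w in enumerate(vocab)}
    scores = []
    for k, prior in enumerate(priors):
        score = math.log(prior) if prior > 0 else float("-inf")
        for w in doc:  # raw, unsmoothed test counts; unknown words are skipped
            j = index.get(w)
            if j is not None:
                score += math.log(probs[k][j])
        scores.append(score)
    return max(range(len(scores)), key=scores.__getitem__)


def top_words(vocab, class_probs, n=10):
    """Return the n vocabulary words with the highest probability for a class."""
    ranked = sorted(range(len(vocab)), key=lambda j: class_probs[j], reverse=True)
    return [vocab[j] for j in ranked[:n]]
```

Ranking by raw `P(word | class)` is only one way to define the "top 10 words"; ranking by a ratio against the other classes is another reasonable choice and tends to surface more topic-specific words.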
4. Which of the two models would you choose? Provide detailed justification for your choice. (10 points)
5. Compare the performance of the two models on each topic and provide reasons for the differences / similarities. (10 points)
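The per-topic and aggregate metrics requested for both models could be computed along these lines. This is a sketch under stated assumptions: the function name `precision_recall_report` and the returned dictionary layout are hypothetical, "weighted macro" is interpreted here as the support-weighted average of the per-class scores, and a ratio with a zero denominator is reported as 0.0.

```python
def precision_recall_report(y_true, y_pred, n_classes):
    """Per-class precision/recall plus micro, macro, and weighted averages."""
    tp = [0] * n_classes
    fp = [0] * n_classes
    fn = [0] * n_classes
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but the document was not p
            fn[t] += 1  # true class t was missed

    def ratio(num, den):
        # Convention chosen for this sketch: undefined ratios become 0.0.
        return num / den if den else 0.0

    precision = [ratio(tp[k], tp[k] + fp[k]) for k in range(n_classes)]
    recall = [ratio(tp[k], tp[k] + fn[k]) for k in range(n_classes)]
    support = [tp[k] + fn[k] for k in range(n_classes)]  # true docs per class
    total = sum(support)

    return {
        "per_class": list(zip(precision, recall)),
        # Micro: pool the counts over all classes, then take the ratio.
        "micro": (
            ratio(sum(tp), sum(tp) + sum(fp)),
            ratio(sum(tp), sum(tp) + sum(fn)),
        ),
        # Macro: unweighted mean of the per-class scores.
        "macro": (
            ratio(sum(precision), n_classes),
            ratio(sum(recall), n_classes),
        ),
        # Weighted: per-class scores averaged with class support as weights.
        "weighted": (
            ratio(sum(p * s for p, s in zip(precision, support)), total),
            ratio(sum(r * s for r, s in zip(recall, support)), total),
        ),
    }
```

Because every document receives exactly one predicted topic, micro precision and micro recall both reduce to overall accuracy, which is a useful sanity check when reporting results for the three vocabulary sizes.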
Submit
• Code (Python files or Jupyter notebooks). If submitting a Jupyter notebook, include a PDF version of your notebook to facilitate grading.
2023-08-07