
CS5489 - Tutorial 2

Text Document Classification with Naive Bayes

In this tutorial you will classify text documents using Naive Bayes classifiers. We will be working with the dataset called "20 Newsgroups", which is a collection of 20,000 newsgroup posts organized into 20 categories.

First we need to initialize Python. Run the cell below.

In    [  ]:

%matplotlib inline
import matplotlib_inline   # setup output image format
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
import matplotlib.pyplot as plt
import matplotlib
from numpy import *
from sklearn import *
from scipy import stats, special
random.seed(100)

Next, put the file "20news-bydate_py3.pkz" into the same directory as this ipynb file. Do not unzip the file.

Next, we will extract 4 classes from the dataset. Run the cell below.

In    [  ]:

# strip away headers/footers/quotes from the text
removeset = ('headers', 'footers', 'quotes')
# only use 4 categories
cats = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
# load the training and testing sets
newsgroups_train = datasets.fetch_20newsgroups(subset='train',
                       remove=removeset, categories=cats, data_home='./')
newsgroups_test = datasets.fetch_20newsgroups(subset='test',
                       remove=removeset, categories=cats, data_home='./')

Now, we check if we got all the data. The training set should have 2034 documents, and the test set should have 1353 documents.

In   [ ]:

print("training set size:", len(newsgroups_train.data))
print("testing set size: ", len(newsgroups_test.data))
print(newsgroups_train.target_names)

Count the number of examples in each class. newsgroups_train.target is an array of class values (0 through 3), and newsgroups_train.target[i] is the class of the i-th document.

In     [  ]:

print("class counts")
for i in [0,1,2,3]:
    print("{:20s}: {}".format(newsgroups_train.target_names[i], sum(newsgroups_train.target == i)))

Now have a look at the documents. newsgroups_train.data is a list of strings, and newsgroups_train.data[i] is the i-th document.

In   [ ]:

for i in [0,1,2,3]:
    print("--- document {} (class={}) ---".format(
        i, newsgroups_train.target_names[newsgroups_train.target[i]]))
    print(newsgroups_train.data[i])

Tip: while you do the tutorial, it is okay to make additional code cells in the file. This lets you avoid re-running code (for example, you can train a classifier in one cell and test it in another).

Build document vectors

Create the vocabulary from the training data. Then build the document vectors for the training and testing sets. You can decide how many words you want in the vocabulary.

In     [  ]:

# pull out the document data and labels
traindata = newsgroups_train.data
trainY = newsgroups_train.target
testdata = newsgroups_test.data
testY = newsgroups_test.target

In   [ ]:

### INSERT YOUR CODE HERE

In     [  ]:


Bernoulli Naive Bayes

Learn a Bernoulli Naive Bayes model from the training set. What is the prediction accuracy on the test set?
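The general fit/predict/score pattern might look like the following sketch on a tiny invented dataset. The toy documents, labels, and the smoothing value alpha=0.1 are assumptions made for illustration only; the real exercise uses the document vectors built from the newsgroups data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score

# hypothetical toy data: class 0 = space-like, class 1 = graphics-like
toy_train = ["rocket launch orbit", "orbit moon shuttle",
             "pixel image render", "render polygon image"]
toy_trainY = [0, 0, 1, 1]
toy_test = ["shuttle orbit", "polygon pixel"]
toy_testY = [0, 1]

# Bernoulli NB models word presence/absence, so use binary features
toy_vect = CountVectorizer(binary=True)
toy_trainX = toy_vect.fit_transform(toy_train)
toy_testX = toy_vect.transform(toy_test)

# alpha is the smoothing parameter (value here is an arbitrary assumption)
toy_bmodel = BernoulliNB(alpha=0.1)
toy_bmodel.fit(toy_trainX, toy_trainY)

# accuracy = fraction of test documents whose predicted class is correct
toy_pred = toy_bmodel.predict(toy_testX)
toy_acc = accuracy_score(toy_testY, toy_pred)
print("toy test accuracy:", toy_acc)
```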

In     [  ]:

### INSERT YOUR CODE HERE

In    [  ]:

What are the most important (frequent) words for each category? Run the code below.

Note: model.feature_log_prob_[i] will index the word log-probabilities for the i-th class.

In     [  ]:

# get the word names
fnames = asarray(cntvect.get_feature_names_out())
for i,c in enumerate(newsgroups_train.target_names):
    # indices of the 10 words with the largest log-probability for class i
    tmp = argsort(bmodel.feature_log_prob_[i])[-10:]
    print("class", c)
    for t in tmp:
        print("    {:9s} ({:.5f})".format(fnames[t], bmodel.feature_log_prob_[i][t]))
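The ranking in the cell above relies on argsort returning indices in ascending order, so slicing the last k entries yields the k largest values. A small sketch with invented numbers (the words and probabilities below are hypothetical, not taken from the dataset):

```python
import numpy as np

# hypothetical log-probabilities for a 5-word vocabulary (one class)
toy_words = np.asarray(["moon", "orbit", "pixel", "render", "rocket"])
toy_logprob = np.log(np.asarray([0.10, 0.40, 0.05, 0.15, 0.30]))

# argsort sorts ascending, so the last 3 indices point at the 3 largest values
top3 = np.argsort(toy_logprob)[-3:]
for t in top3:
    print("{:9s} ({:.5f})".format(toy_words[t], toy_logprob[t]))
```

The words come out from least to most probable within the top group, which is why the most probable word appears last in each class listing.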

Multinomial Naive Bayes model

Now learn a multinomial Naive Bayes model using the TF-IDF representation for the documents. Again try different parameter values to improve the test accuracy.
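One possible way to chain counts, TF-IDF weighting, and a multinomial model is sketched below on a tiny invented dataset. The use of CountVectorizer followed by TfidfTransformer, the toy documents, and alpha=0.1 are assumptions for illustration; the tutorial leaves the actual parameter choices to you.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# hypothetical toy data: class 0 = space-like, class 1 = graphics-like
toy_train = ["rocket launch orbit", "orbit moon shuttle",
             "pixel image render", "render polygon image"]
toy_trainY = [0, 0, 1, 1]
toy_test = ["shuttle orbit", "polygon pixel"]
toy_testY = [0, 1]

# step 1: word counts; step 2: re-weight the counts with TF-IDF
toy_cntvect = CountVectorizer()
toy_tf = TfidfTransformer(use_idf=True)
toy_trainXtf = toy_tf.fit_transform(toy_cntvect.fit_transform(toy_train))
toy_testXtf = toy_tf.transform(toy_cntvect.transform(toy_test))

# multinomial NB accepts the fractional TF-IDF weights as features
toy_mmodel = MultinomialNB(alpha=0.1)
toy_mmodel.fit(toy_trainXtf, toy_trainY)

toy_pred = toy_mmodel.predict(toy_testXtf)
toy_acc = accuracy_score(toy_testY, toy_pred)
print("toy test accuracy:", toy_acc)
```

Both the vectorizer and the TF-IDF transformer are fitted on the training documents only and then applied unchanged to the test documents.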

In    [  ]:


### INSERT YOUR CODE HERE


In    [  ]:

What are the most important features for the Multinomial model? Run the code below.

In    [  ]:

# get the word names
fnames = asarray(cntvect.get_feature_names_out())
for i,c in enumerate(newsgroups_train.target_names):
    tmp = argsort(mmodel_tf.feature_log_prob_[i])[-