W21 - COMP1005                                                                                                              Due Monday, March 29th at Noon

(No late submissions allowed)


Assignment 4

Word Fun (Sets, Dictionaries, Tuples) & Art

Due: Monday, March 29th at NOON (no late submissions allowed)


Submit a single zip file called A4.zip.

The assignment has 50 marks.


Notes: It is essential that you use the built-in, default archiving program to create this zip file. If we cannot easily open your zip file and extract the Python files from it, we cannot grade your assignment. Other file formats, such as rar, 7zip, etc., will not be accepted.

Windows: Highlight (select with ctrl-click) all of your files for submission. Right-click and select “Send to” and then “Compressed (zipped) folder”. Rename the new folder to “A4.zip”.

MacOS: Highlight (select with shift-click) all of your files for submission in Finder. Right-click on one of the files and select “compress N items…” where N is the number of files you have selected. Rename the “Archive.zip” file “A4.zip”.

Linux: use the zip program.


After submitting your A4.zip file to cuLearn, be sure that you download it and then unzip it to be certain that what you have submitted is what you wanted to submit. This also checks that your zip file is not corrupted and can be unzipped.


Please note that reasons similar in nature to “I submitted the wrong files” or “I didn’t know the zip file was corrupt” will not be accepted as an excuse after the due date.

Submit early and often. cuLearn will save your latest submission. I would highly suggest that you submit as soon as you have one question done and keep re-submitting each time you add another problem (or partial problem).


Q1: Word Stats                                                                                                                           [40 marks]

In this problem you will generate some statistics for a body of text (English text). You are NOT allowed to import any modules to help with this.

You will write the following functions (in a file called words.py):

unique_words( text : str ) -> list:

The function takes a body of text (string) and outputs (returns) a list of all unique words in the text. The output list will contain strings. For example, calling

unique_words( "The,. cat. Live’s in the! the road.")

will return the list

[ "the", "cat", "live’s", "road", "in" ]

Notice that punctuation is removed from the words. Notice that all words in the output are lower-case. Notice the word `the` only appears once in the output list. The order of the words in the output list does NOT matter.
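
The behaviour described above can be sketched as follows. This is only one possible approach, not the expected solution. It assumes that words are separated by whitespace and that punctuation only needs to be stripped from the ends of each word. The PUNCTUATION string is a hypothetical choice (the handout does not list which characters count as punctuation); the apostrophe is left out on purpose so that "live’s" survives as in the example.

```python
# Hypothetical set of characters treated as punctuation (an assumption; the
# handout does not list them). The apostrophe is left out on purpose so that
# a word such as "live’s" keeps its apostrophe, as in the example.
PUNCTUATION = ".,!?;:\"()[]"


def unique_words(text):
    """Return a list of the distinct lower-case words found in text."""
    seen = set()  # a set stores each word at most once
    for token in text.split():  # split on any whitespace
        # Remove punctuation from both ends, then normalise the case.
        word = token.strip(PUNCTUATION).lower()
        if word:  # skip tokens that were nothing but punctuation
            seen.add(word)
    return list(seen)


print(unique_words("The,. cat. Live’s in the! the road."))
```

Because a set is used to collect the words, the order of the returned list is arbitrary, which is acceptable since the order does NOT matter.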

top_words( text : str, number : int) -> list

The function takes a body of text (string) and an integer. The function returns a list of tuples. Each tuple will look like (word, frequency) where word is a string and frequency is the number of times that word appears in the input text. The returned list will have “number” tuples in it that correspond to the “number” most frequently used words in the input text.

For example, calling

top_words("cat dog cat. Dog cat cat kitten.", 2)

will return the list (of tuples)

[("cat", 4), ("dog", 2)]

Notice that the most frequently used word appears first. Your list must return the tuples in decreasing order (based on frequency of the word appearing). Again, we don’t care about the case of words. All outputs should be in lower-case.

What if “number” is larger than the actual number of unique words in the text? Only make your output list as big as needed to include all unique words.

What if there are multiple words with the same frequency? Your output list can have more than “number” tuples in it. When you are creating your output list, if you reach “number” words and there are more words with the same frequency, then you should include the rest of the words with that frequency. For example, suppose you wanted the top 2 words, but the data was as follows:

‘cat’ appears 10 times

‘dog’ appears 7 times

‘eel’ appears 7 times

‘cow’ appears 4 times

The output list would be [(‘cat’, 10), (‘dog’, 7), (‘eel’, 7)]. The order of dog and eel does not matter.
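
The counting and tie-handling rules above can be sketched as follows. This is a minimal sketch under stated assumptions, not the expected solution. The helper split_words and the PUNCTUATION string are hypothetical names introduced for illustration, and returning an empty list when “number” is zero or negative is an assumption the handout does not cover.

```python
# Hypothetical set of punctuation characters (an assumption, as the handout
# does not list them); apostrophes are kept inside words.
PUNCTUATION = ".,!?;:\"()[]"


def split_words(text):
    """Hypothetical helper: lower-case words with end punctuation removed."""
    words = []
    for token in text.split():
        word = token.strip(PUNCTUATION).lower()
        if word:
            words.append(word)
    return words


def top_words(text, number):
    """Return (word, frequency) tuples for the most frequent words."""
    counts = {}  # maps each word to the number of times it appears
    for word in split_words(text):
        counts[word] = counts.get(word, 0) + 1

    # Sort the (word, frequency) pairs by frequency, largest first.
    ranked = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

    if number <= 0:  # assumption: nothing is requested, so return nothing
        return []

    result = ranked[:number]  # slicing copes with number > len(ranked)

    # Keep adding words that tie with the last word already included.
    index = number
    while index < len(ranked) and ranked[index][1] == result[-1][1]:
        result.append(ranked[index])
        index += 1
    return result


print(top_words("cat dog cat. Dog cat cat kitten.", 2))
```

With the cat/dog/eel/cow data above, asking for the top 2 words produces three tuples, because eel ties with dog at 7.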

display_punctuation_stats( text : str, punctuation : str ) -> str

The function takes a body of text (string) and a string consisting of punctuation characters. The function outputs (returns) a string that when printed will be a visual display (frequency plot) of the frequency that each punctuation character appears in the text. Note that the function returns a string. When that string is later printed, it is a plot of the frequencies of the punctuation marks.

For example, if text is "the, c!a!t. Sits.. on, the! Bed’s edge." then calling

print( display_punctuation_stats(text, ".?,!’")) will display the following on the screen (output is shown in yellow background with blue font; no spaces at the end of any line; newline at the end of each line EXCEPT the last)

Punctuation Stats (10 in total)

-+---------------------------------------- (#=10)

.|########################################

?|

,|####################

!|##############################

’|##########

-+----------------------------------------

The order of the punctuation marks follows the same order as the input string.

The punctuation mark with the maximal count will display 40 hash symbols (#). All others are scaled using the same scale (the one that makes the max 40 #’s). The second line indicates the scaling factor. You can round to the nearest integer when doing your scaling.
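
The scaling described above can be sketched as follows. This is a rough sketch under several assumptions, not the expected solution. It assumes that the value after “#=” is the number of #’s drawn per occurrence (40 divided by the maximal count, rounded), which matches the example but is not stated explicitly. It also assumes that the border lines contain 40 dashes and that a bar is empty when no listed character appears at all. BAR_WIDTH is a hypothetical name introduced for illustration.

```python
BAR_WIDTH = 40  # number of #'s drawn for the most frequent character


def display_punctuation_stats(text, punctuation):
    """Return a frequency plot (as one string) for the given characters."""
    # Count each character in the same order as the punctuation string.
    counts = [text.count(mark) for mark in punctuation]
    total = sum(counts)
    largest = max(counts) if counts else 0

    # Assumption: the "#=" value is the number of #'s per occurrence.
    # It is 0 when none of the characters appear (avoids dividing by zero).
    scale = round(BAR_WIDTH / largest) if largest > 0 else 0

    border = "-+" + "-" * BAR_WIDTH
    lines = [
        "Punctuation Stats (" + str(total) + " in total)",
        border + " (#=" + str(scale) + ")",
    ]
    for mark, count in zip(punctuation, counts):
        # Scale so that the largest count fills the whole bar width.
        bar = round(BAR_WIDTH * count / largest) if largest > 0 else 0
        lines.append(mark + "|" + "#" * bar)
    lines.append(border)

    # Joining with "\n" puts a newline after every line EXCEPT the last.
    return "\n".join(lines)


text = "the, c!a!t. Sits.. on, the! Bed’s edge."
print(display_punctuation_stats(text, ".?,!’"))
```

Note that Python's built-in round sends exact halves to the nearest even integer; if halves should always round up, int(value + 0.5) is one alternative.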


Program

Your solution will consist of the THREE functions above. You will also include a main function (and main guard) that prompts the user for the name of a file, loads the file and then runs your functions with the data read from the file. Your program will display some information as shown in the example: (yellow background indicates user input)

Input name of file : fancy-story.txt

Input the punctuation you are interested in : .,!

fancy-story.txt stats

---------------------

Number of unique words : 287

Top 5 words used: the, cat, dog, indeed, was

Punctuation Stats (121 in total)

-+---------------------------------------- (#=7)

.|########################################

,|####################

!|##############################

-+----------------------------------------

Note that the numbers (121, #=7) are made up here. They will depend on the actual contents of the file loaded and the desired punctuation. Note that the ---- line has as many characters as the name of the input file.

You can assume (1) the file entered exists and is in the same directory as words.py and (2) the punctuation input will have no whitespace in it. Note: if, for some reason, a user entered non-punctuation characters as input, your code should still work and find the frequencies of the characters in the input string. For example, if the user entered abc then the plot would be for the frequencies of the letters a, b and c.

Include your words.py file in your submission zip file.

Note: the marks breakdown is as follows:

unique_words – 10 marks

top_words – 10 marks

display_punctuation_stats – 10 marks

main – 10 marks


Q2: Drawing                                                                                                                               [10 marks]

Think about your experience so far in COMP 1005. Think about what you have learned and what you have done. The joys and frustrations. Think about what you might be able to do with what you have learned. Your task in this problem is to either draw a picture that expresses this reflection or to write about it (or a combination of both). My hope in asking you to do this exercise is that you will critically reflect on what you have learned and perhaps where you would like to take what you have learned forward. It should also make this assignment a bit lighter than the others. The intention is that this problem should not cause you any stress. Do not worry about your “artistic ability”. You will not be graded on how “artistic” your drawing is or how grammatically correct your writing is. If you put an honest effort into the problem, you will receive full marks. Have fun!

You can create your drawing or writing in any way you wish but you should save it in PDF format. Ideally, the size of your drawing should be a standard letter size in horizontal orientation and the length of writing should not be more than one page. Time permitting, we will show your pictures to the class.

If you want your submission to remain private (and not shown to the class) then save your file as private-name.pdf, where name is your name. If you agree to have your picture/text displayed (possibly this semester or in future semesters of this course or related courses) then submit your drawing in a file called public-name.pdf, where name is your first (given) name. For public submissions, do NOT include your full name/ID in your picture/text unless you are OK with everyone seeing it. Since you are submitting using cuLearn, we already know who you are so we don’t need this information in your picture.

Note: Offensive/rude/insensitive submissions will receive zero marks and may be forwarded to the Dean's office depending on the severity. (This has never happened before and I do not anticipate it happening now.) Save your drawing or writing in a file named as specified above and add it to your submission zip file.


Recap                                                                                                                                             [A4.zip]

Submit a single zip file called A4.zip. Your zip file should have two (or three) files in it.

words.py

Either private-name.pdf or public-name.pdf (or both).