INFM 203 Final Exam — Fall 2022

Practical Lab Exploration Component

There are two sections in this Practical Lab Exploration Component:

1. The first section should be carried out by you using your regular Operating System (Windows, macOS, or perhaps Linux) plus Jupyter.

2. The second section should be carried out using your Cloudera CDH VM.

Since this is part of the Final Exam for this course, you are honor bound to work on this Practical Exam by yourself and without the help of others, family or otherwise.

Before you go further, read the last page below — Purpose of this Practical Final Exam Component — so that you understand the purpose of this Practical Part of the Final Exam.

_______________

Open up the area after each question / section in this MS Word file and type in (or paste) your answer(s) and your comments. Email the resulting file to me at [email protected]. The deadline is absolutely NO LATER than Thursday, December 08, but you can submit anytime prior to that, including immediately (try not to let it interfere with your written Final Exams for other courses or with your online INFM 203 Final Exam on Canvas).

Your email Subject line should be:

INFM 203-10 Final Practical, Fall 2022

The MS Word file attached to the email (derived from this Template file, but with your additions and changes) should be named:

YourLastName,YourFirstName-INFM_203_Final.docx

The file naming is important because it causes my email software to sort your Final Exam Response Email into a directory, so that I don’t lose it amongst my other daily email. Make sure that you include both LastName and FirstName, in that order.

First Section

Use a Python-based Jupyter Notebook on your regular Operating System (Windows, macOS, or perhaps Linux). This means that you will have previously installed Anaconda 5 or a later version (generally the latest). You are not expected to be an expert in Python 3, because I want you to find a Python 3 program on the Internet and then use it, but hopefully you have at least an introductory Python 3 background.

You will be running wordcount written in Python 3 (if you can only find an appropriate program in Python 2, you would have to convert it to Python 3) to process a text document. The text document is the US Constitution and is available at: https://www.usconstitution.net/const.txt or https://www.constitution.org/cons/constitu.txt (amongst other places).

The wordcount program should have the following characteristics:

· Remove punctuation from the source file (but be careful with apostrophe-s at the end of a word, e.g., state’s)

· Lowercase all words

· Eliminate stop words with a set of stopwords preferably provided in a separate file

· Stem the words (optional extra). Stemming is the process of reducing a word to its stem, i.e., its root form (there are Python ways of doing this). This is quite tough, and I do not expect that any student will be able to complete it.

· Sort in order of frequency of occurrence of the words found with the highest occurrences coming first
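The characteristics above can be sketched in Python 3 roughly as follows. This is only a minimal illustration under stated assumptions, not a required or complete solution: the names `normalize` and `count_words` are hypothetical, the tiny inline stop-word set stands in for a fuller list kept in a separate file, and dropping a trailing apostrophe-s (so that "state's" counts as "state") is just one possible way of handling it. Stemming is not covered.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word set for illustration only;
# a real run would load a fuller list from a separate file.
STOPWORDS = {"the", "of", "and", "to", "in", "a", "be", "shall"}


def normalize(text):
    """Lowercase the text, handle apostrophe-s, strip punctuation, return words."""
    text = text.lower()
    # Treat the typographic apostrophe like the plain ASCII one.
    text = text.replace("\u2019", "'")
    # Assumption: drop a trailing 's so that "state's" is counted as "state".
    text = re.sub(r"'s\b", "", text)
    # Replace everything that is not a letter, digit, or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return text.split()


def count_words(text, stopwords=STOPWORDS):
    """Return (word, count) pairs, highest count first, ties in alphabetical order."""
    words = [w for w in normalize(text) if w not in stopwords]
    counts = Counter(words)
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
```

Sorting on the pair (negative count, word) keeps the ordering deterministic when several words share the same frequency.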

Suggested steps (but you are free to choose your own steps):

__1. You could use Jupyter Notebook installed on your host operating system (Windows, macOS, Linux). Thus, make sure that it is installed (or still installed) and available to use. (If you must reinstall, go to http://anaconda.com/install and install a version with Python 3 64-bit.)

__2. Find an appropriate Python 3 program for wordcount on the Internet. Check that it has the appropriate characteristics (all lowercase, remove punctuation, remove stop words) — or add them from elsewhere. Please state where you found [all] the program(s).

__3. Get the text copy of the US Constitution as the text that you will analyze and load it to your host operating system (Windows, macOS, …).

__4. Start a fresh, new Jupyter notebook that you will run locally on your host operating system.

__5. Using Markdown, explain the parts of the program as you develop and run it.  This is an important part of using the Jupyter Notebook that we will use for Step One.

__6. Read in your stopwords file.

__7. Read and process your text file (US Constitution, text).

__8. Output the top 20 words occurring in the text directly from the Python program such as, for instance:

President 123

Congress 99

__9. Optional extra:  Create a Word Cloud of the top 20 or so words.

__10. Send me your .ipynb file as a second attachment when you email me this completed documentation file.  I might like to run it too.
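Steps 6 through 8 could look something like the sketch below. It assumes a stop-word file with one word per line and uses a simplified tokenizer (runs of letters only); the function names and the file names `const.txt` and `stopwords.txt` mentioned afterwards are hypothetical placeholders, and discarding one-letter tokens is an assumption made to get rid of the stray "s" left over from words such as "state's".

```python
import re
from collections import Counter


def load_stopwords(path):
    """Read one stop word per line into a set, ignoring blank lines."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def top_words(text_path, stopwords_path, n=20):
    """Return the n most frequent non-stop words as (word, count) pairs."""
    stopwords = load_stopwords(stopwords_path)
    with open(text_path, encoding="utf-8", errors="replace") as handle:
        text = handle.read().lower()
    # Simplified tokenization: runs of letters only, so "state's" yields
    # "state" and "s"; one-letter tokens are then discarded (an assumption).
    words = [
        w for w in re.findall(r"[a-z]+", text)
        if len(w) > 1 and w not in stopwords
    ]
    return Counter(words).most_common(n)


def print_top_words(text_path, stopwords_path, n=20):
    """Print one 'word count' line per entry, most frequent first."""
    for word, count in top_words(text_path, stopwords_path, n):
        print(word, count)
```

In a notebook cell, a call such as `print_top_words("const.txt", "stopwords.txt")` would then print the top 20 words, provided both files sit in the notebook's working directory.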

Second Section

Use your Cloudera Sandbox VM to do the same wordcount analysis, but this time with the Pig Latin language, and this time inside your VM.

This program should not be run with Jupyter, but as a Pig Latin script.  Use the US Constitution text again for this program.

Again, start by finding an appropriate program written in Pig Latin on the Internet. It might not have all the features — because you might not be able to find a program with all the features that you had with Python 3.

Load the US Constitution text file into your VM using wget or drag & drop.  Upload it to HDFS where it will be accessed from Pig Latin.

Document your Pig Latin script internally with comments to explain what it does, and how.  This means that the program will be self-documenting with these comments – a good programming practice.

Show that it can run and produce the same (or, essentially the same) results as your Python 3 program.
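One possible way to check that the two results agree is to copy the Pig output out of HDFS to a local file and compare it with the Python counts. The sketch below assumes the Pig script stored its result as tab-separated `word<TAB>count` lines (tab is the default delimiter for Pig's `PigStorage`); the function names are hypothetical, and the Python counts are assumed to be available as a dictionary.

```python
def parse_pig_output(lines):
    """Parse tab-separated 'word<TAB>count' lines into a {word: count} dict."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        word, count = line.split("\t")
        counts[word] = int(count)
    return counts


def compare_counts(python_counts, pig_counts):
    """Return {word: (python_count, pig_count)} for every word whose counts differ."""
    all_words = set(python_counts) | set(pig_counts)
    return {
        word: (python_counts.get(word, 0), pig_counts.get(word, 0))
        for word in sorted(all_words)
        if python_counts.get(word, 0) != pig_counts.get(word, 0)
    }
```

An empty result from `compare_counts` means the two programs agree exactly; any remaining entries usually point to differences in tokenization or stop-word handling between the two scripts.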

The last thing in this report document should be to paste the Pig Latin program that you run — with its comments — and the results of processing – here:

Purpose of this Practical Final Exam Component

1. Complete the circle back to our original wordcount program (or, better, any improved wordcount2 program).

2. Recognize that this course is not only about the theory of Big Data technologies, but also about doing practical things.

3. Before writing new code, see if something is already available. Don’t be afraid to use pre-written code or to work with material developed by others, provided that you first understand it and then adapt it where appropriate.

4. Something fun — and hopefully not too hard — to finish the course and do something practical as part of the Final Exam.

Please do not spend more than about 3-5 hours total!  If you have difficulty, others in our class will be having similar difficulties as well. The quality of your work is, in my mind, more important than the quantity.  I want to see:

· Whether you understand the steps to process the data in Section One and Section Two.

· How far you are able to go with the minimal directions that I am providing here.

· Again, let me repeat:  The quality of your work is, in my mind, more important than the quantity. You do not need to be able to complete everything, but I want to see what you can achieve and how you present your thinking and your work.

Good Luck!  And thank you for taking part in this course. I have been glad to have been your Instructor. I sincerely hope that you learned a lot in this course.

Please keep in touch in later years as you go forward with your future career choices and on to success there! Continue your learning – it is a long journey that you are just starting on at Commencement. You will always be able to find me at my personal email address: [email protected]