Project 1: Data ETL Pipeline


Project 1 Instructions

Important: You need to watch the videos first, then read these instructions, then start the project.

The goal of this project is to construct a data pipeline in a series of functions. The pipeline will take a raw data object from Reddit, transform the data, and load it into a CSV file. You have been supplied with starter code and the code from my recordings to use as reference.

● My starter code. (https://i.ibb.co/BT8c5GX/image.png)

● The code at the end of the videos. (https://i.ibb.co/SwtckTN/image.png)

Remember: You may use my starter code. In fact, I strongly recommend that you do. But your grade will be based on your work, not the starter code.

Finally, keep this in mind: Project 2 (the “final” project) will build on this project. There’s not much you should do differently based on that, but my advice is to make sure you really understand your code well in this project and document it well in comments. That will help later on.


Step 1: Set a goal for your project

In my example, the project goal was to report the titles of Reddit articles and their word count. That’s a pretty boring goal! Yours should be a bit more interesting than that, but it doesn’t have to solve world hunger. What I mean is, have an achievable goal for the project that actually produces some kind of knowledge about the data.

Here’s a real student example goal from the Twitter version of this assignment from an earlier semester:

My project reads 100 Tweets mentioning the hashtag #blacklivesmatter and reports what other hashtags are used in these Tweets, including the number of times all hashtags appear in the 100 Tweets.

Include the goal at the top of your script, as it appears in the starter code.


Step 2: Explore the data, confirm the project goal

You might find that the data object available from the Reddit API you are using does not help you achieve your goal. Be sure the data you need is available before you waste time trying to achieve the impossible or the very unfeasible.

If necessary, update your goal to something achievable based on the available data. You will write the goal at the top of your code, wrapped in a multi-line string like this:

'''

Your goal here. Lorem ipsum dolor sit amet. Nullam fermentum mattis risus et mollis.

'''
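One way to confirm that the data supports your goal is to list the fields available on a post before you write anything else. The sketch below is only an illustration: `available_fields` is a hypothetical helper, and `sample_listing` is a tiny hand-made object shaped like a Reddit listing (a "data" key holding a list of "children", each with its own "data"), not real data. Check the object you actually get back before relying on that shape.

```python
def available_fields(listing):
    """Return the sorted field names found on the first post of a listing."""
    children = listing.get("data", {}).get("children", [])
    if not children:
        return []
    return sorted(children[0].get("data", {}).keys())


# Hand-made sample in the assumed shape of a Reddit listing (not real data).
sample_listing = {
    "kind": "Listing",
    "data": {
        "children": [
            {"kind": "t3", "data": {"title": "Example post", "score": 10, "num_comments": 2}},
        ]
    },
}

fields = available_fields(sample_listing)
print(fields)             # the field names you could build a goal around
print("title" in fields)  # is the field your goal depends on present?
```

If the field your goal depends on is missing from the list, that is your cue to revise the goal.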


Step 3: Write the code

Once you have a goal in mind and have confirmed that the data you need is available, it’s time to start hacking! Here are a few tips:

1. Start with the starter code. Yes, you have to type it yourself! But therein lies part of the learning, especially for new programmers and new Pythonistas: Singing along with a song can help you learn to sing, following a recipe can help you learn to cook, and typing code that already has been written can help you learn to code.

2. Start with minimal features and test along the way. Your application is not going to be super-complex, but each time you add functionality you should test that functionality right away. Once your application can do the basics, always go back over your code to see what you can do to improve it.

3. Do not cheat, and not just for the obvious reason. If you are not able to write the code you need to write, you’re also quite unlikely to be able to understand code that you find on Stack Overflow or anywhere else online.

Using code you did not write yourself (except starter code that I give you) is a violation of academic integrity. I have a zero-tolerance policy for cheating in all its forms.


Requirements for your application

Much of what your code must do will be determined by your project goal. However, in addition to standard practices of good coding that we have already covered, here are some things that you must do in your code.


Requirements: Extraction function

Start with the function definition as it appears in the starter code and change the function to meet the following requirements. In your extraction function:

1. The subreddit can be passed into the function.

2. If a subreddit is not passed to the function when the function is called, default to r/worldnews.

3. The user agent string is defined in a “global” variable. In Python, “global” variables are defined just like regular variables, but the assignments must go immediately after your import statements, and the names are always in ALLCAPS.
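The three requirements above could look roughly like the sketch below. It is not the starter code: the names `build_listing_url` and `extract_listing`, the placeholder user agent text, and the use of the public `https://www.reddit.com/r/<subreddit>.json` endpoint with the standard library's `urllib` are all assumptions made for illustration. Use whatever request approach the starter code uses.

```python
import json
import urllib.request

# "Global" variable: assigned right after the imports, name in ALLCAPS.
# The value is a placeholder; write your own user agent string.
USER_AGENT = "project-1-script/0.1 (placeholder user agent)"


def build_listing_url(subreddit="worldnews"):
    """Return the JSON listing URL for a subreddit (hypothetical helper)."""
    return f"https://www.reddit.com/r/{subreddit}.json"


def extract_listing(subreddit="worldnews"):
    """Download a subreddit listing and return it as a Python dict.

    If no subreddit is passed in, the default is r/worldnews.
    """
    request = urllib.request.Request(
        build_listing_url(subreddit),
        headers={"User-Agent": USER_AGENT},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `extract_listing()` with no argument would request r/worldnews, while `extract_listing("python")` would request r/python.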


Requirements: Transformation function

This function must return a curated data object with the transformed data you will use in the final (loading) phase.

The goal of this function is to iterate over a subreddit’s listings and transform the data so it is useful. There are various ways to transform the raw data into something useful, so what you and your code need to do is going to be determined by your project’s goal.

Rename the function something meaningful based on what it does, and update the docstring so it accurately describes the function.
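As a rough illustration of a transformation function, here is a sketch built around the (boring) example goal of titles and word counts. The function name `collect_title_word_counts` is hypothetical, the sample data is hand-made, and the code assumes each post sits under `listing["data"]["children"]` with its title at `child["data"]["title"]`. Your own function will look different because your goal is different.

```python
def collect_title_word_counts(listing):
    """Return one {"title", "word_count"} dict per post in the listing."""
    rows = []
    for child in listing["data"]["children"]:
        title = child["data"]["title"]
        # Splitting on whitespace is a simple way to count words.
        rows.append({"title": title, "word_count": len(title.split())})
    return rows


# Hand-made sample in the assumed shape of a Reddit listing (not real data).
sample_listing = {
    "data": {
        "children": [
            {"data": {"title": "First example post"}},
            {"data": {"title": "Another one"}},
        ]
    }
}

rows = collect_title_word_counts(sample_listing)
print(rows)
```

Note that the function returns the curated object instead of printing it, so the loading step can take it from there.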


Requirements: Loading function

You will write your data to a CSV-formatted text file. If you need to do any more data transformation in this step, keep it minimal or add another transformation function to your code.

Rename the function something meaningful based on what it does, and update the docstring so it accurately describes the function.
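A loading function can be quite short if the transformation step already produced clean rows. The sketch below assumes the curated data is a list of dicts with "title" and "word_count" keys; the function name `write_rows_to_csv` and the default file name are hypothetical. It uses `csv.DictWriter` from the standard library, which may or may not match what the starter code does.

```python
import csv


def write_rows_to_csv(rows, path="titles.csv"):
    """Write a list of dicts to a CSV file with a header row."""
    fieldnames = ["title", "word_count"]
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Because the csv module handles quoting, a title that contains a comma still ends up in a single column.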


Other Requirements

You do not have to add any error handling inside the function definitions. Instead, write your function calls in a way that your script will fail gracefully if there is an error anywhere in your functions. Yes, this is a rather crude way to attempt to avoid errors, but it’s a good place to start with error handling based only on what we have covered in class.
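One possible reading of "fail gracefully" is to wrap the function calls, not the function bodies, in a single try/except. The sketch below assumes that is the tool intended here. The three functions are stand-ins with hypothetical names, and the extraction stand-in raises an error on purpose so you can see what happens when something goes wrong.

```python
def extract():
    """Stand-in for an extraction function; fails on purpose here."""
    raise ValueError("simulated extraction failure")


def transform(raw):
    """Stand-in for a transformation function."""
    return raw


def load(rows):
    """Stand-in for a loading function."""
    print(f"Loaded {len(rows)} rows")


def run_pipeline():
    """Call each step in order; report the error instead of crashing."""
    try:
        raw = extract()
        rows = transform(raw)
        load(rows)
    except Exception as error:
        print(f"The pipeline stopped because of an error: {error}")
        return False
    return True


succeeded = run_pipeline()
```

Catching every `Exception` in one place is crude, as noted above, but it keeps the error handling out of the function definitions.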

In addition to the requirements above, your code must follow the standard practices of good coding that we have already covered.


What to submit

When your project code is complete, submit these three things:

1. Your Python script through codepost.io. This is where your code will be graded, checked for plagiarism, and commented on by the grader and me. Name the file project-1.py

2. The same Python script as an attachment through Canvas. This must be the same script that you uploaded to codepost.io. This will facilitate peer review-type work that we may do with these projects.

3. One video that confirms that your code executes successfully. Record a screencast (not a phone video!) of your code successfully running on your computer. Why? There can be situations where a student’s code runs in their environment but does not run in mine. By the time I discover the issue, it will be too late to figure out what is wrong. Therefore, as a safety measure, you must do a screen recording of your code executing successfully. The video does not have to be “professional” or “polished”; it just has to be clear that it is your code and that it executes successfully.