AD654: Marketing Analytics Assignment 1
AD654: Marketing Analytics
Assignment 1: Data Exploration & Visualization
For this assignment, you will use the file july4_snapshot.csv, which can be found on our course Blackboard page. This dataset includes information about the park visitors on one particular day of operations – July 4, 2021.
Once you have completed this assignment, you will upload two files into Blackboard: The .ipynb file that you create in Jupyter Notebook (or Colab), and an .html file that was generated from your .ipynb file. If you run into any trouble with submitting the .html file to Blackboard, you can submit it as a PDF instead. Please include your last name in your filenames. The exact way you save it is up to you, but the last name makes it easier to keep track of the file (e.g. BakerAssignment 1.ipynb, bakerAssgn1.ipynb, bakerAssignment 1.html, etc. -- any of these would be fine).
For any question that asks you to perform some particular task, you just need to show your input and output. Tasks will always be written in regular, non-italicized font.
For any question that asks you to include interpretation, write your answer in a Markdown cell in Jupyter Notebook (or a ‘Text’ cell if you used Colab). Any homework question that needs interpretation will be written in italicized font. Do not simply write your answer in a code cell as a comment, but use a Markdown or Text cell instead.
Remember to be resourceful! There are many helpful resources available to you, including the video library, the lecture notes on Blackboard, recitations, the office hours sessions, and the web.
In the prompt, variables might not be referred to in the exact way that their names appear in the dataset. This is okay -- that’s very realistic. You should familiarize yourself with the dataset and its variables through the dataset description table.
Dataset Description:
Variable | Description
visitor | This is an incremental count variable – each person who visited Lobster Land on July 4th is assigned a unique number. Note that the actual number of visitors is larger than the number of rows here – if a person purchased tickets for a family, only the ticket buyer is included in this dataset.
day_pass | This variable indicates that the visitor either used a day pass (1) or did not use a day pass (0). A day pass gives the buyer access to Lobster Land for one full day. Season ticket holders do not purchase day passes.
season_ticket | A “1” in this variable means that the person used a season ticket, whereas a “0” means that the person did not.
domestic | A “1” in this variable means that the person is a U.S. resident, whereas a “0” indicates that the person is not.
state | The home state of domestic visitors to Lobster Land on July 4th.
country | The visitor’s home country, one of: BRA (Brazil), CAN (Canada), CHN (China), FRA (France), GER (Germany), IND (India), JPN (Japan), MEX (Mexico), ROK (South Korea), UK (United Kingdom), USA (United States of America).
gender | This shows the gender of the visitor, with 1 representing female, and 0 representing male.
age | This is an integer variable depicting the age of the visitor.
maine_res | A “1” in this variable means that the person is from Maine, whereas a “0” indicates that the person is not.
stay_four | A “1” in this variable means that the person stayed at Lobster Land for more than four hours on July 4th, whereas a “0” indicates that the person did not.
payment_method | A “1” in this variable indicates that the visitor used cash to purchase a ticket, whereas a “0” indicates the use of a credit card, debit card, or digital payment method.
ice_cream_purch | This variable indicates whether the visitor purchased ice cream during their visit (1) or did not purchase ice cream (0).
ice_cream_flavor | This indicates the type of ice cream purchased by the visitor during their stay at Lobster Land on July 4th.
sky_chair | This variable indicates whether the visitor rode the “Sky Chair” ride during their visit (1) or did not go on this ride (0). A picture of this ride is below.
ferris_wheel | This variable indicates whether the visitor rode the “Ferris Wheel” ride during their visit (1) or did not go on this ride (0). A picture of this ride is below.
lobster_claw | This variable indicates whether the visitor rode the “Lobster Claw” roller coaster during their visit (1) or did not go on this ride (0). A picture of this ride is below.
lobster_junior | This variable indicates whether the visitor rode the “Lobster Junior” kids’ roller coaster during their visit (1) or did not go on this ride (0). A picture of this ride is below.
merch_spend | Total merchandise spending on July 4th by the visitor.
lobsterama_spend | Total spending at the Lobsterama (a sit-down restaurant inside of Lobster Land) by the visitor on July 4th.
Your Tasks:
Bring this dataset into your local environment (in Jupyter Notebook, or in Colab).
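As a rough illustration of this step, the sketch below reads CSV text into a pandas dataframe and inspects it. It is only a sketch under stated assumptions: the rows are invented stand-ins (not the real data), only a handful of columns from the description table are used, and the in-memory `toy_csv` buffer is a placeholder for the actual file. In your notebook you would pass the path to july4_snapshot.csv to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Invented stand-in rows (NOT the real data). In the notebook, pass the
# path to july4_snapshot.csv to pd.read_csv instead of this buffer.
toy_csv = io.StringIO(
    "visitor,domestic,state,country,age,merch_spend\n"
    "1,1,ME,USA,34,12.3456\n"
    "2,0,,CAN,41,0.0\n"
    "3,1,NH,USA,19,48.9012\n"
    "4,0,,FRA,27,5.5\n"
    "5,1,ME,USA,52,20.0\n"
    "6,1,MA,USA,8,3.25\n"
)
df = pd.read_csv(toy_csv)

preview = df.head()        # head() returns the first five rows by default
n_rows, n_cols = df.shape  # shape is an attribute, not a method: (rows, columns)

print(preview)
print(n_rows, n_cols)
print(df.dtypes)           # inferred types; 0/1 flags load as integers
```

Note that a column stored as integers (such as a 0/1 flag) is not necessarily numeric in meaning, so the inferred dtypes are a starting point rather than an answer.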
I. Exploratory Data Analysis: Exploration & Manipulation
A. Call the head() function on this dataframe and look at your results.
B. How many rows of the dataset are visible in Jupyter now?
C. Take a look at the dataset’s shape attribute.
a. How many rows, and how many columns, are in this entire dataframe?
D. Read the dataset description, and take a look at the variables in the dataset.
a. Which of your variables should be seen as categorical, and which ones should be seen as numeric?
E. Lobster Land has two monetary-related variables in this dataset. One of them has too many decimal places! This was caused by an issue with Lobster Land’s software system. Using Python, round that variable’s values to just two decimal places.
F. Are there any missing values in this dataset? If so, how many total values are missing? Use Python code to answer this question.
a. What percentage of ALL of the values in this entire dataframe are NaN?
b. Generate a table that shows the percentage of missing values for each column in the dataset.
c. Using the missingno package in Python, display a matrix that depicts missing values within the dataset as white spaces.
d. Again using the missingno package in Python, display a bar chart that depicts missingness (or completeness) for each variable.
e. Make a separate subset of the dataframe that only includes the rows that have NaN values for ‘state’. What important thing do they all have in common, which helps to explain why there are NaN values for these rows?
G. Erroneous Data.
a. We just received an update from the Lobster Land front ticket office. Apparently, some guests’ ages were mistakenly copied down at the time that their tickets were purchased. The youngest age of any guest who purchased a ticket on July 4th was 15. Alter the dataframe so that any guest age currently less than 15 becomes 15.
H. Lobster Land wants to know more about how its international guests compare to its domestic ones.
a. First, find the percentage of guests from the entire dataset who stayed at Lobster Land for more than four hours on July 4th.
b. Now, let’s break this down a bit more. What percentage of domestic visitors stayed for more than 4 hours on that day? What percentage of international visitors stayed for more than 4 hours on that day?
c. If the values you found in step b were different, what do you think might explain this difference? (No domain knowledge is required here – take a moment to think about it, and come up with a thoughtful, plausible explanation).
I. Removing a variable
a. Pick any variable from the dataset that is redundant (in other words, all the information that it contains is already included in another variable). Remove the variable that you have identified as redundant.
i. In a sentence or two, explain why this variable is not needed.
J. Renaming a variable.
a. Pick any variable in the dataset, and rename it. (For this step, it doesn’t matter which variable you pick -- the purpose is just to become familiar with the process for doing this -- it can sometimes be a very helpful step in data cleaning/data preparation).
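The pandas operations behind steps E through J can be sketched on a tiny invented dataframe. This is a minimal sketch, not a solution: the values are made up, the `scratch` column is hypothetical (it is not in the dataset and exists only so there is something to drop), and the new name `merchandise_spend` is an arbitrary choice. The column names otherwise follow the description table.

```python
import numpy as np
import pandas as pd

# Invented toy rows; 'scratch' is a hypothetical column, not part of the dataset.
df = pd.DataFrame({
    "domestic":    [1, 0, 1, 0, 1, 1],
    "state":       ["ME", np.nan, "NH", np.nan, "ME", "MA"],
    "age":         [34, 41, 19, 27, 52, 8],
    "stay_four":   [1, 1, 0, 1, 0, 1],
    "merch_spend": [12.3456, 0.0, 48.9012, 5.5, 20.0, 3.25],
    "scratch":     [0, 0, 0, 0, 0, 0],
})

# Round a monetary column to two decimal places.
df["merch_spend"] = df["merch_spend"].round(2)

# Missing values: total count, share of all cells, and percentage per column.
total_missing = int(df.isna().sum().sum())
pct_missing_overall = 100 * total_missing / df.size  # size = rows * columns
pct_missing_by_col = df.isna().mean() * 100

# Subset containing only the rows where 'state' is NaN.
no_state = df[df["state"].isna()]

# Raise every age below the floor up to the floor; other ages are unchanged.
df["age"] = df["age"].clip(lower=15)

# The mean of a 0/1 flag is a proportion, overall or within groups.
pct_stay_overall = 100 * df["stay_four"].mean()
pct_stay_by_group = 100 * df.groupby("domestic")["stay_four"].mean()

# Remove one column, then rename another.
df = df.drop(columns=["scratch"])
df = df.rename(columns={"merch_spend": "merchandise_spend"})

print(pct_missing_by_col.round(1))
print(pct_stay_by_group)
```

For the missingno steps, `msno.matrix(df)` and `msno.bar(df)` (after `import missingno as msno`) take the same dataframe; they are left out of the sketch so that it runs without that package installed.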
II. Data Visualization
K. Using any plotting tool in Python, generate a boxplot that shows maine_res on the x-axis, and merchandise spending on the y-axis.
a. What do you notice about this relationship? In a couple of sentences, why does this fit or not fit with what you would intuitively expect?
L. Which rides were most popular / least popular on July 4th, 2021? Generate one barplot that depicts the total number of people who went on the Sky Chairs, the Ferris Wheel, the Lobster Claw, and the Lobster Junior. (Note: there are many ways you can solve this – any approach that gets the job done is completely fine).
a. In a sentence or two, what does this plot show?
M. Build a histogram that depicts the ages of people who visited Lobster Land on July 4th, 2021.
a. How can you increase the number of bins in your histogram?
b. Create another age histogram, but with more bins. Be sure to include an x-axis label and a title with your histogram.
c. How is your second histogram different from your first one? What is the impact of increasing the number of bins?
d. Now, make faceted histograms that show the age distribution of those visitors who went on the Lobster Junior, as well as the age distribution of those who did not. Does this result fit with your intuition/expectation? Why or why not?
N. Use the countplot() function from seaborn to show a comparison of the home countries of international visitors to Lobster Land on July 4th. Set up the bars so that they are in decreasing or increasing order of size.
a. What does this graph show? In a sentence or two, explain what it depicts.
O. Now, use the barplot() function from seaborn to show a comparison of payment methods for visitors from different countries. Construct this plot so that countries are on the x-axis, and the proportion of guests who paid with cash is on the y-axis. Do not include confidence intervals with the bars. Set up the bars so that they are in decreasing or increasing order of size.
a. What does this graph show? In a sentence or two, explain what it depicts.
Part III: Wildcard: Metrics and “Quantified Self” (1 point)
A. For three days, you will gather observational data, acting as both the researcher AND as the subject.
a. Choose ANY aspect of your daily life that you can quantify. This could be a health metric, a financial metric, an entertainment metric, a metric about a hobby, or just ANYTHING else that you can measure/quantify.
b. For each of the three days, keep track of this metric.
c. Write ONE thoughtful paragraph about your experience. Include your results in the write-up. In your paragraph, you might answer questions such as: What did you learn by tracking this? Did it change your behavior in any way? Did anyone else around you react to what you were doing? Would you consider running the “experiment” for a longer period of time?
Thoughtful, complete paragraphs here will receive one full point. The goal here is to focus on the impact of measuring some particular metric.
● The goal is to pick ONE metric and to be thoughtful about it – saying you tracked 7 different metrics for the three days, and showing a bunch of charts and fake data is NOT the purpose here – just be genuine.
● Pick something that you have some control over (so do not use something like a stock market index)
● There is NO Python code required for this – do not use any.
2023-02-19