
AD654: Marketing Analytics

Assignment 1: Data Exploration & Visualization

For this assignment, you will use the file july4_snapshot.csv, which can be found on our course Blackboard page. This dataset includes information about the park visitors on one particular day of operations: July 4, 2021.

Once you have completed this assignment, you will upload two files into Blackboard:  The .ipynb file that you create in Jupyter Notebook (or Colab), and an .html file that was generated from your .ipynb file.  If you run into any trouble with submitting the .html file to Blackboard, you can submit it as a PDF instead. Please include your last name in your filenames.  The exact way you save it is up to you, but the last name makes it easier to keep track of the file (e.g. BakerAssignment 1.ipynb, bakerAssgn1.ipynb, bakerAssignment 1.html, etc. -- any of these would be fine).

For any question that asks you to perform some particular task, you just need to show your input and output. Tasks will always be written in regular, non-italicized font.

For any question that asks you to include interpretation, write your answer in a Markdown cell in Jupyter Notebook (or a ‘Text’ cell if you used Colab). Any homework question that needs interpretation will be written in italicized font. Do not simply write your answer in a code cell as a comment; use a Markdown or Text cell instead.

Remember to be resourceful!   There are many helpful resources available to you, including the video library, the lecture notes on Blackboard, recitations, the office hours sessions, and the web.

In the prompt, variables might not be referred to in the exact way that their names appear in the dataset. This is okay -- that’s very realistic. You should familiarize yourself with the dataset and its variables through the dataset description table.

Dataset Description:

Variable

Description

visitor

This is an incremental count variable – each person who visited Lobster Land on July 4th is assigned a unique number. Note that the actual number of visitors is larger than the number of rows here – if a person purchased tickets for a family, only the ticket buyer is included in this dataset.

day_pass

This variable indicates that the visitor either used a day pass (1) or did not use a day pass (0). A day pass gives the buyer access to Lobster Land for one full day. Season ticket holders do not purchase day passes.

season_ticket

A “1” in this variable means that the person used a season ticket, whereas a “0” means that the person did not.

domestic

A “1” in this variable means that the person is a U.S. resident, whereas a “0” indicates that the person is not.

state

The home state of domestic visitors to Lobster Land on July 4th.

country

The visitor’s home country, one of: BRA (Brazil), CAN (Canada), CHN (China), FRA (France), GER (Germany), IND (India), JPN (Japan), MEX (Mexico), ROK (South Korea), UK (United Kingdom), or USA (United States of America).

gender

This shows the gender of the visitor, with 1 representing female, and 0 representing male.

age

This is an integer variable depicting the age of the visitor.

maine_res

A “1” in this variable means that the person is from Maine, whereas a “0” indicates that the person is not.

stay_four

A “1” in this variable means that the person stayed at Lobster Land for more than four hours on July 4th, whereas a “0” indicates that the person did not.

payment_method

A “1” in this variable indicates that the visitor used cash to purchase a ticket, whereas a “0” indicates the use of a credit card, debit card, or digital payment method.

ice_cream_purch

This variable indicates whether the visitor purchased ice cream during their visit (1) or did not purchase ice cream (0).

ice_cream_flavor

This indicates the type of ice cream purchased by the visitor during their stay at Lobster Land on July 4th.

sky_chair

This variable indicates whether the visitor rode the “Sky Chair” ride during their visit (1) or did not go on this ride (0). A picture of this ride is below.

ferris_wheel

This variable indicates whether the visitor rode the “Ferris Wheel” ride during their visit (1) or did not go on this ride (0). A picture of this ride is below.

lobster_claw

This variable indicates whether the visitor rode the “Lobster Claw” roller coaster during their visit (1) or did not go on this ride (0). A picture of this ride is below.

lobster_junior

This variable indicates whether the visitor rode the “Lobster Junior” kids’ roller coaster during their visit (1) or did not go on this ride (0). A picture of this ride is below.

merch_spend

Total merchandise spending on July 4th by the visitor.

lobsterama_spend

Total spending at the Lobsterama (a sit-down restaurant inside of Lobster Land) by the visitor on July 4th.

Your Tasks:

Bring this dataset into your local environment (in Jupyter Notebook, or in Colab).

I. Exploratory Data Analysis: Exploration & Manipulation

A.  Call the head() function on this dataframe and look at your results.

B.  How many rows of the dataset are visible in Jupyter now?

C.  Take a look at the dataset’s shape attribute.

a.   How many rows, and how many columns, are in this entire dataframe?
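As a minimal sketch of tasks A through C, the snippet below loads a small CSV and inspects it with head() and shape. The rows here are made up and read from an in-memory string so the example runs on its own; with the real file you would pass "july4_snapshot.csv" to pd.read_csv instead.

```python
import io
import pandas as pd

# Made-up stand-in rows (NOT the real dataset) so the sketch is runnable
# without the course file. With the real data: pd.read_csv("july4_snapshot.csv")
csv_text = """visitor,day_pass,age,merch_spend
1,1,34,12.5
2,0,27,0.0
3,1,45,30.25
4,1,19,8.0
5,0,52,15.75
6,1,23,4.5
"""
df = pd.read_csv(io.StringIO(csv_text))

preview = df.head()          # head() returns the first 5 rows by default
n_rows, n_cols = df.shape    # shape is an attribute: (rows, columns)

print(preview)
print(f"{n_rows} rows, {n_cols} columns")
```

Note that head() shows only a preview, while shape describes the whole dataframe, so the two row counts will usually differ.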

D.  Read the dataset description, and take a look at the variables in the dataset.

a.   Which of your variables should be seen as categorical, and which ones should be seen as numeric?

E.   Lobster Land has two monetary-related variables in this dataset. One of them has too many decimal places! This was caused by an issue with Lobster Land’s software system. Using Python, round that variable’s values to just two decimal places.
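One way to round a numeric column is the pandas Series.round() method. The sketch below uses a hypothetical column named spend with made-up values; substitute whichever monetary variable you identify in the real dataset.

```python
import pandas as pd

# Hypothetical column name and made-up values, for illustration only.
df = pd.DataFrame({"spend": [12.345678, 0.1, 7.999999]})

# round(2) keeps two digits after the decimal point; assign the result
# back to the column so the change persists in the dataframe.
df["spend"] = df["spend"].round(2)

print(df)
```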

F.   Are there any missing values in this dataset? If so, how many total values are missing? Use Python code to answer this question.

a.   What percentage of ALL of the values in this entire dataframe are NaN?

b.   Generate a table that shows the percentage of missing values for each column in the dataset.

c.   Using the missingno package in Python, display a matrix that depicts missing values within the dataset as white spaces.

d.  Again using the missingno package in Python, display a bar chart that depicts missingness (or completeness) for each variable.

e.   Make a separate subset of the dataframe that only includes the rows that have NaN values for ‘state’. What important thing do they all have in common, which helps to explain why there are NaN values for these rows?
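The pandas side of the missing-value questions above can be sketched as follows, using a tiny made-up dataframe rather than the real data. It covers the total count, the overall percentage, the per-column percentages, and a subset of rows where one column is NaN; the missingno plots are not shown here.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only (not the real dataset).
df = pd.DataFrame({
    "visitor": [1, 2, 3, 4],
    "age": [34, 27, np.nan, 19],
    "state": ["ME", None, "NH", None],
})

# isna() gives a True/False frame; summing twice counts every missing cell.
total_missing = int(df.isna().sum().sum())

# df.size is rows * columns, i.e. the total number of cells.
pct_missing_overall = total_missing / df.size * 100

# The mean of a boolean column is the share of True values in that column.
pct_missing_by_col = df.isna().mean() * 100

# Boolean indexing keeps only the rows where 'state' is missing.
state_missing = df[df["state"].isna()]

print(total_missing, pct_missing_overall)
print(pct_missing_by_col)
print(state_missing)
```

Once you have the subset, scanning its other columns (for example with describe() or value_counts()) is one way to look for what the rows share.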

G.  Erroneous Data.

a.   We just received an update from the Lobster Land front ticket office. Apparently, some guests’ ages were copied down incorrectly at the time that their tickets were purchased. The youngest age of any guest who purchased a ticket on July 4th was 15. Alter the dataframe so that any guest age currently less than 15 becomes 15.
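One possible approach to this kind of floor correction is Series.clip(), which raises every value below a lower bound up to that bound. The ages below are made up for illustration.

```python
import pandas as pd

# Made-up ages; some are below the minimum of 15.
df = pd.DataFrame({"age": [8, 15, 42, 3, 67]})

# clip(lower=15) replaces any value under 15 with 15 and leaves the rest alone.
df["age"] = df["age"].clip(lower=15)

print(df["age"].tolist())
```

An equivalent alternative is boolean indexing: df.loc[df["age"] < 15, "age"] = 15.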

H.  Lobster Land wants to know more about how its international guests compare to its domestic ones.

a.   First, find the percentage of guests from the entire dataset who stayed at Lobster Land for more than four hours on July 4th.

b.   Now, let’s break this down a bit more. What percentage of domestic visitors stayed for more than 4 hours on that day? What percentage of international visitors stayed for more than 4 hours on that day?

c.   If the values you found in step b were different, what do you think might explain this difference? (No domain knowledge is required here; take a moment to think about it, and come up with a thoughtful, plausible explanation.)
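Because stay_four and domestic are 0/1 variables, the mean of stay_four is the proportion of 1s, and groupby() gives that proportion per group. A minimal sketch with made-up values:

```python
import pandas as pd

# Made-up 0/1 values for illustration only (not the real dataset).
df = pd.DataFrame({
    "domestic":  [1, 1, 1, 1, 0, 0],
    "stay_four": [1, 0, 1, 1, 1, 0],
})

# Mean of a 0/1 column = share of rows equal to 1; multiply by 100 for percent.
overall_pct = df["stay_four"].mean() * 100

# Same calculation, computed separately for domestic == 0 and domestic == 1.
pct_by_group = df.groupby("domestic")["stay_four"].mean() * 100

print(round(overall_pct, 2))
print(pct_by_group)
```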

I.     Removing a variable

a.   Pick any variable from the dataset that is redundant (in other words, all the information that it contains is already included in another variable). Remove the variable that you have identified as redundant.

i.     In a sentence or two, explain why this variable is not needed.

J.     Renaming a variable.

a.   Pick any variable in the dataset, and rename it.  (For this step, it doesn’t matter which variable you pick -- the purpose is just to become familiar with the process for doing this -- it can sometimes be a very helpful step in data cleaning/data preparation).
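The mechanics of removing and renaming columns can be sketched with DataFrame.drop() and DataFrame.rename(). The column names below are hypothetical and unrelated to the Lobster Land data; here height_m is redundant because it can be derived entirely from height_cm.

```python
import pandas as pd

# Hypothetical columns for illustration only.
df = pd.DataFrame({
    "id": [1, 2],
    "height_cm": [170, 182],
    "height_m": [1.70, 1.82],
})

# drop(columns=...) returns a new dataframe without the listed columns.
df = df.drop(columns=["height_m"])

# rename(columns=...) takes a mapping of {old_name: new_name}.
df = df.rename(columns={"id": "person_id"})

print(df.columns.tolist())
```

Both methods return a new dataframe by default, so the result must be assigned back for the change to stick.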

II. Data Visualization

K.   Using any plotting tool in Python, generate a boxplot that shows maine_res on the x-axis, and merchandise spending on the y-axis.

a.   What do you notice about this relationship? In a couple of sentences, why does this fit or not fit with what you would intuitively expect?
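A boxplot of a numeric variable split by a 0/1 category can be drawn with seaborn's boxplot() function, among other tools. The sketch below uses made-up values under the dataset's column names; the Agg backend line is only there so the example runs outside a notebook and should be dropped in Jupyter or Colab.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in Jupyter/Colab
import pandas as pd
import seaborn as sns

# Made-up values for illustration only (not the real dataset).
df = pd.DataFrame({
    "maine_res":   [0, 0, 0, 1, 1, 1],
    "merch_spend": [25.0, 40.5, 31.0, 5.0, 12.5, 8.0],
})

# x is the grouping variable, y is the numeric variable summarized per box.
ax = sns.boxplot(data=df, x="maine_res", y="merch_spend")
ax.set_title("Merchandise spending by Maine residency (made-up data)")
```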

L.  Which rides were most popular / least popular on July 4th, 2021? Generate one barplot that depicts the total number of people who went on the Sky Chairs, the Ferris Wheel, the Lobster Claw, and the Lobster Junior. (Note: there are many ways you can solve this; any approach that gets the job done is completely fine.)

a.   In a sentence or two, what does this plot show?
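Since each ride variable is coded 0/1, summing a column gives the number of riders. One of many possible approaches is sketched below with made-up values: sum the four columns, then draw the totals with matplotlib. The Agg backend line is only for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in Jupyter/Colab
import matplotlib.pyplot as plt
import pandas as pd

# Made-up 0/1 ride indicators for illustration only.
df = pd.DataFrame({
    "sky_chair":      [1, 0, 1, 1, 0],
    "ferris_wheel":   [1, 1, 1, 1, 0],
    "lobster_claw":   [0, 0, 1, 0, 0],
    "lobster_junior": [0, 1, 0, 1, 0],
})

ride_cols = ["sky_chair", "ferris_wheel", "lobster_claw", "lobster_junior"]

# Column sums of 0/1 indicators = rider counts; sort so the ranking is visible.
ride_totals = df[ride_cols].sum().sort_values(ascending=False)

fig, ax = plt.subplots()
ax.bar(ride_totals.index, ride_totals.values)
ax.set_xlabel("Ride")
ax.set_ylabel("Number of riders")
ax.set_title("Riders per ride (made-up data)")
```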

M.    Build a histogram that depicts the ages of people who visited Lobster Land on July 4th, 2021.

a.   How can you increase the number of bins in your histogram?

b.   Create another age histogram, but with more bins.    Be sure to include an x-axis label and a title with your histogram.

c.   How is your second histogram different from your first one? What is the impact of increasing the number of bins?

d.   Now, make faceted histograms that show the age distribution of those visitors who went on the Lobster Junior, as well as the age distribution of those who did not. Does this result fit with your intuition/expectation? Why or why not?
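The histogram steps above can be sketched with matplotlib, where the bins argument controls the number of bins and side-by-side subplots act as simple facets. The ages and the 0/1 flag named rode are randomly generated stand-ins, not the real data, and the Agg backend line is only for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in Jupyter/Colab
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data; "rode" is a hypothetical 0/1 flag.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(15, 80, size=200),
    "rode": rng.integers(0, 2, size=200),
})

# A single histogram: a larger bins value means narrower, more numerous bars.
fig, ax = plt.subplots()
counts, edges, _ = ax.hist(df["age"], bins=30)
ax.set_xlabel("Age")
ax.set_title("Visitor ages, 30 bins (made-up data)")

# Simple facets: one panel per value of the flag, sharing both axes so the
# two distributions can be compared directly.
fig2, axes = plt.subplots(1, 2, sharex=True, sharey=True)
for ax_i, (flag, group) in zip(axes, df.groupby("rode")):
    ax_i.hist(group["age"], bins=15)
    ax_i.set_xlabel("Age")
    ax_i.set_title(f"rode = {flag}")
```

seaborn's FacetGrid or displot with a col argument would be another way to get the faceted version.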

N.  Use the countplot() function from seaborn to show a comparison of the home countries of international visitors to Lobster Land on July 4th. Set up the bars so that they are in decreasing or increasing order of size.

a.   What does this graph show?   In a sentence or two, explain what it depicts.
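One way to control bar order in countplot() is its order argument, fed with the index of value_counts(), which is already sorted from most to least frequent. The country codes below are made up for illustration, and the Agg backend line is only for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in Jupyter/Colab
import pandas as pd
import seaborn as sns

# Made-up rows for illustration only (not the real dataset).
df = pd.DataFrame({"country": ["CAN", "CAN", "CAN", "FRA", "JPN", "JPN"]})

# value_counts() sorts by frequency, descending, so its index is the bar order.
order = df["country"].value_counts().index.tolist()

ax = sns.countplot(data=df, x="country", order=order)
ax.set_title("Visitors by home country (made-up data)")
```

Reversing the list (order[::-1]) gives increasing order instead.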

O.   Now, use the barplot() function from seaborn to show a comparison of payment methods for visitors from different countries.  Construct this plot so that countries are on the x-axis, and the proportion of guests who paid with cash is on the y-axis. Do not include confidence intervals with the bars.  Set up the bars so that they are in decreasing or increasing order of size.

a.  What does this graph show? In a sentence or two, explain what it depicts.

Part III: Wildcard: “Metrics and Quantified Self” (1 point)

A.  For three days, you will gather observational data, acting as both the researcher AND as the subject.

a.   Choose ANY aspect of your daily life that you can quantify. This could be a health metric, a financial metric, an entertainment metric, a metric about a hobby, or just ANYTHING else that you can measure/quantify.

b.   For each of the three days, keep track of this metric.

c.   Write ONE thoughtful paragraph about your experience. Include your results in the write-up. In your paragraph, you might answer questions such as: What did you learn by tracking this? Did it change your behavior in any way? Did anyone else around you react to what you were doing? Would you consider running the “experiment” for a longer period of time?

Thoughtful, complete paragraphs here will receive one full point. The goal here is to focus on the impact of measuring some particular metric.

●   The goal is to pick ONE metric and to be thoughtful about it. Saying you tracked 7 different metrics for the three days, and showing a bunch of charts and fake data, is NOT the purpose here – just be genuine.

●   Pick something that you have some control over (so do not use something like a stock market index)

There is NO Python code required for this – do not use any.