This assessment aims to guide you in exploring a data set through the process of exploratory data analysis (EDA), primarily through visualisation of that data using various data science tools.

You will need to draw on what you have learnt, and will continue to learn, in class. You are also encouraged to seek out alternative information from reputable sources. If you use or are 'inspired' by any source code from one of these sources, you must reference it.

Learning outcomes

You will learn the following through completing this assessment:

1. Read in files and extract data from them into a data frame.

2. Wrangle and process data.

3. Use graphical and non-graphical tools to perform EDA.

4. Use basic tools for managing and processing big data.

5. Determine information

6. Communicate your findings in your report.

Submission details

The Python code as a Jupyter notebook file (.ipynb). A PDF print of your Jupyter notebook containing the code, figures and answers to all the questions. Hint: Wrap your code using the Jupyter magics or pythonic standard.

Please note: Marks will be assigned based on the correctness and clarity of your answers and code. The PDF should be concise and not take up an excessive number of pages. You should not print the data frames in your PDF (comment out the code that prints those).

Zip file submissions attract a penalty of 10%. Submit the two separate files requested above together. You will need to submit your PDF to Turnitin.

Task

In this course, you have learned about the definitions, skill sets, tools, applications and knowledge domains of data science. By completing the EDA, we hope you can get a clearer understanding of how a career in data science compares to others in the IT industry.

The Data
In late 2018, a survey was conducted for a large Australian collective of IT professionals. The survey, which received 7000 responses, aimed to gather information about IT professionals. The dataset was made public, and many insights have emerged since. We have taken the data set and heavily modified it, both to clean the data (a significant component of data science) and to ensure original assignment submissions.

The data set is called assignment1_dataset.csv, and contains respondents' answers to survey questions. Each column contains the respondents' answers to a specific question. Do not alter this dataset.

How to complete this assessment

The following notebook has been constructed to provide you with directions (blue), questions (yellow) and background information. Responses to both blue directions and yellow questions are assessed.
Underneath the blue direction boxes, there are empty cells with the comment #Your code. Place your code in these. You should not need to, but you may insert new cells under this cell if required.
To respond to questions you should double click on the cell beneath each question with the comment Answer. Write your answer under these.

Please note, your commenting and adherence to Python code standards will be marked. This notebook has been designed to give you a template for the layout of future notebooks you might create. If you require further information on Python standards, please visit https://www.python.org/dev/peps/pep-0008/
Do not change any of the directions or answer boxes, the order of questions, order of code entry cells or the name of the input files.

Table of contents
●Student information
●Load data
●1. Demographic analysis
■1.1. Age
■1.2. Gender
■1.3. Country
■1.4. Roles
●2. Education
■2.1. Formal education
●3. Employment
■3.1. Employment status
■3.2. Job satisfaction
●4. Salary
■4.1. Salary overview
■4.2. Salary by country
■4.3. Salary & gender
■4.4. Salary & formal education
■4.5. Salary & employment sector
●5. Predicting salary
●6. Tasks & tools
■6.1. Data science - common tasks
■6.2. Data science - common tools
●7. Data quality assessment
Enter your information in the following cell. Please make sure you specify what version of Python you are using, as your tutor may not be using the same version and will adjust your code accordingly.
Student Information

Please enter your details here.
Name:
Student number:
Tutorial number: For example, P07
Tutor:
Environment: Python (what version of Python you are using) and distribution (e.g. Anaconda 5.3.0 (64-bit))
Load your libraries and files

This assessment will be conducted using pandas. You will also be required to create visualisations. We recommend Seaborn, which is more visually appealing than matplotlib. However, you may choose either. For further information on Seaborn visit https://seaborn.pydata.org/

Hint: Remember to comment on what each library does.

In [17]:
# Your code
1. Demographic Analysis
Who are the survey participants?
Let's get a general understanding of the characteristics of the survey participants. Demographic overviews are a standard way to start an exploration of survey data. The types of participants can heavily affect survey responses.
1.1 Age
Visualisation is a quick and easy way to gain an overview of the data. One method is through a boxplot. Boxplots are a way to show the distribution of numerical data and display the five descriptive statistics: minimum, first quartile, median, third quartile, and maximum. Outliers should also be shown.
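As a rough illustration of the five statistics a boxplot summarises, the sketch below computes them for a small made-up list of ages. The `ages` values and the `five_number_summary` helper are hypothetical and are not taken from the assignment data set. The 1.5 × IQR fences are the usual default convention for marking outliers, assumed here rather than prescribed by this assessment.

```python
import pandas as pd

# Hypothetical ages, invented for illustration only (not from the assignment data).
ages = pd.Series([22, 25, 27, 30, 31, 34, 38, 41, 45, 90])


def five_number_summary(values: pd.Series) -> dict:
    """Return the minimum, quartiles and maximum of a numeric Series."""
    return {
        "min": values.min(),
        "q1": values.quantile(0.25),
        "median": values.median(),
        "q3": values.quantile(0.75),
        "max": values.max(),
    }


summary = five_number_summary(ages)

# Points further than 1.5 * IQR beyond the quartiles are what most
# plotting libraries draw as individual outlier markers by default.
iqr = summary["q3"] - summary["q1"]
lower_fence = summary["q1"] - 1.5 * iqr
upper_fence = summary["q3"] + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(summary)
print(outliers.tolist())
```

Comparing the distance between the median and each quartile, and between the quartiles and the extremes, is what tells you whether a distribution is skewed.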
1. Create a box plot showing the age of all the participants.
Your plot must have labels for each axis, a title, numerical points for the age axis and also show the outliers.
In [31]:
# Your code
2. Calculate the five descriptive statistics as shown on the boxplot, as well as the mean.
Round your answer to the nearest whole number.
In [35]:
# Your code
Answer
3.i. Looking at the boxplot, what general conclusion can you make about the age of the participants? You must explain your answer with reference to all five descriptive statistics. Simply listing them will not suffice. You must discuss the conclusions drawn based on the relationship of these descriptive statistics to each other. You must also mention the outliers if there are any.
3.ii. Would the mode be greater or lower than the mean? Why?
Answer i
Answer ii
4. Regardless of the errors that the data show, we are interested in working-age IT professionals, aged between 20 and 65.
Calculate how many respondents were under 20 or over 65.
In [29]:
# Your code
Answer
1.2 Gender
We are interested in the gender of respondents. Within the STEM fields, there are more males than females or other genders. In 2016 the Office of the Chief Scientist found that women held only 25% of jobs in STEM. Let's see how that compares to our participants.
5. Plot the gender distribution of survey participants.
In [28]:
# Your code
6. Calculate what percentage of respondents were men and what percentage were women.
In [27]:
# Your code
Answer
7. Let's see if there is any relationship between age and gender.
Create a box plot showing the age of all the participants according to gender.
In [26]:
# Your code
8. What comments can you make about the relationship between the age and gender of the respondents?
Hint: You need to determine the descriptive statistics.

In [25]:
# Your code
Answer
1.3 Country

We know that people practice IT all over the world. The United States is thought of as a central "hub" for commercial IT services as well as research, followed by the United Kingdom and Germany.

Because the field is evolving so quickly, it may be that these perceptions, formed in the late 2000s, are now inaccurate. So let's find out where IT professionals live.

9. Create a bar graph of the respondents according to which country they are from.
In [24]:
# Your code
10. Find the percentage of respondents from the top 5 countries.
Print your result, rounded to two decimal places, before writing out your answer.
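A minimal sketch of how category shares can be computed in pandas, using invented data. The column name `Country` and the single-letter country labels are assumptions made for illustration and may not match the assignment data set.

```python
import pandas as pd

# Hypothetical responses; the column name "Country" is an assumption.
df = pd.DataFrame({"Country": ["A", "A", "A", "B", "B", "C", "D", "E", "F", "A"]})

# value_counts(normalize=True) returns each category's share of the
# non-missing rows, sorted from most to least frequent.
shares = df["Country"].value_counts(normalize=True) * 100

# Keep the five most frequent categories and round for display.
top5 = shares.head(5).round(2)
print(top5)
```

Note that `value_counts` ignores missing values by default, so the percentages are relative to the respondents who answered the question.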
In [1]:
# Your code
Answer
11. What comments can you make about the United States, the United Kingdom and Germany? Are these results consistent with what you expected?
Explain why.
Answer
12. Now that we have another demographic variable, let's see if there is any relationship between country, age and gender. We are specifically interested in the top 5 countries.
Calculate the mean, median and count for the ages of each gender for each of these countries.
Hint: You may need to create a copy or slice.
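The hint about copies and slices can be sketched as follows on invented data. The column names `Country`, `Gender` and `Age` are assumptions, and only the top 2 countries are kept here to keep the example small.

```python
import pandas as pd

# Hypothetical data; the column names are assumptions.
df = pd.DataFrame({
    "Country": ["Australia", "Australia", "Australia", "India", "India", "Brazil"],
    "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
    "Age": [30, 40, 50, 25, 35, 45],
})

# Find the most frequent countries, then filter to them. Calling .copy()
# makes the subset independent of df, which avoids pandas'
# SettingWithCopyWarning if the subset is modified later.
top_countries = df["Country"].value_counts().head(2).index
subset = df[df["Country"].isin(top_countries)].copy()

# One row per (country, gender) pair, one column per statistic.
age_stats = subset.groupby(["Country", "Gender"])["Age"].agg(["mean", "median", "count"])
print(age_stats)
```

Reporting the count next to the mean and median helps you judge how much weight a group's statistics deserve when the group is small.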
In [2]:
# Your code
13. What pattern (if any) do you notice in the relationship between age and gender for each of these countries?
Answer
1.4 Roles
Now let's investigate the different roles assumed by IT professionals and how they are distributed. Since we are specifically interested in data science, we will also create a flag for each of the participants to indicate whether his/her role is data-science related.

14. Plot a bar graph depicting the counts of different roles (each bar should represent the count of participants assuming a certain job role).
In [32]:
# Your code
15. What is the percentage of Data Scientists among the survey respondents?
In [3]:
# Your code
Answer

16. Data Scientists usually work closely with specific functions in organisations. Data Analysts and Data Engineers are among the top collaborators with Data Scientists. Our analysis will now focus on these data science roles.

Create a boolean column "DataScienceRelated" which holds True if a participant's job title is one of "Data Scientist", "Data Analyst" or "Data Engineer", and False otherwise.
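One common way to build a membership flag like this in pandas is `Series.isin`. The sketch below uses invented rows, and the column name `JobTitle` is an assumption about how the role is stored.

```python
import pandas as pd

# Hypothetical data; the column name "JobTitle" is an assumption.
df = pd.DataFrame({
    "JobTitle": ["Data Scientist", "Web Developer", "Data Engineer", "Data Analyst", "Tester"]
})

data_science_titles = ["Data Scientist", "Data Analyst", "Data Engineer"]

# isin() returns a boolean Series: True where the title is in the list.
df["DataScienceRelated"] = df["JobTitle"].isin(data_science_titles)

# The mean of a boolean column is the fraction of True values.
percentage = df["DataScienceRelated"].mean() * 100
print(round(percentage, 2))
```

Because `isin` matches exact strings, titles with stray whitespace or different capitalisation would be flagged False, so it is worth inspecting the unique values first.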

17. What is the percentage of Data Science related roles among the survey participants?
In [23]:
# Your code
Answer
2. Education
So far, we have seen that there may be some relationships between age, gender and the country that the respondents are from. Next, we should look at what their education is like.
2.1 Formal education
We saw in a recent activity that a significant number of data scientist job advertisements call for a Master's degree or a PhD. Let's see if this is a reasonable ask based on the respondents' formal education.
1. Plot a bar chart showing the percentage of each type of education for the three data science related roles.
Hint: You should label your axes appropriately and include a legend and a title.
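One possible way to obtain within-group percentages before plotting is `pd.crosstab` with `normalize="index"`. The data and the column names `JobTitle` and `FormalEducation` below are invented for illustration.

```python
import pandas as pd

# Hypothetical data; column names and category labels are assumptions.
df = pd.DataFrame({
    "JobTitle": ["Data Scientist", "Data Scientist", "Data Analyst", "Data Analyst"],
    "FormalEducation": [
        "Master's degree",
        "Doctoral degree",
        "Bachelor's degree",
        "Master's degree",
    ],
})

# normalize="index" converts each row (one job title) into proportions
# that sum to 1; multiplying by 100 gives percentages.
education_pct = pd.crosstab(df["JobTitle"], df["FormalEducation"], normalize="index") * 100
print(education_pct.round(1))
```

A table shaped like this (one row per role, one column per education type) can be passed to a grouped bar chart directly, for example with the DataFrame's own `plot(kind="bar")` method when matplotlib is installed.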
In [4]:
# Your code
2. Based on what you have seen, do you think that requiring a Master's or Doctoral degree is unrealistic for job advertisers looking for someone with data science skills, or is it job-dependent?
Answer
3. Let's see if the trend is reflected in the Australian respondents.
Plot a bar chart like above but only for Australia, and display the counts of the number of Australian respondents holding a Doctoral degree for each of the three job roles as text output.
In [5]:
# Your code
4. Display as text output the mean and median age of ALL respondents according to each degree type.
In [6]:
# Your code
3. Employment
Many of you will be seeking work after your degree. Let's have a look at the state of the employment market for the respondents of the survey.
Let's have a look at the data.
3.1 Employment status

The type of employment will affect the salary of a worker. Those employed part-time will likely earn less than those who work full-time.
1. Plot the type of employment the respondents have on a bar chart for respondents who do not assume data science related roles.
In [7]:
# Your code
2. Now plot the type of employment the respondents have on a bar chart only for those assuming data science related roles
In [8]:
# Your code
3. Comparing the two graphs, would you say that data science roles differ from non-data science roles in the type of employment?

Explain your answers.

Answer
4. Let's investigate whether the type of employment is country dependent.
Print out the percentages of all respondents who are employed full-time in Australia, the United Kingdom and the United States.
In [9]:
# Your code
Remember earlier, we saw that age seemed to have some interesting characteristics when plotted with other variables.
Let's find out the median age of employees by type of employment.
5. Plot a boxplot of the respondents' age, grouped by employment type.

In [10]:
# Your code
6. What are your observations?
Answer
7. You may be wondering if a relevant computer-related degree is necessary to help gain full-time employment after graduation.
Plot the respondents' employment types (for all respondents) for each of the two categories of "EducationIsComputerRelated".
In [11]:
# Your code
8. Looking at the graph, does holding a computer-related degree improve your chances of securing a full-time job?
Explain your answers.
Answer
3.2 Job Satisfaction
Let's now investigate how happy IT professionals are about their jobs. It is also relevant to look at the years of experience to see whether the job gets boring after a while.
9. Create a bar chart for the percentage of respondents who are looking for another job grouped by the different job titles.
In [12]:
# Your code
10. What are the two roles that have the highest and lowest percentage of employees looking for other jobs?
Answer
11. Let's focus on data science-related roles. Plot a box plot depicting the distribution of years-of-experience of those respondents who are looking for another job versus those who are not, for each of the three roles.
In [13]:
# Your code
12. What can you say about whether years of experience impacts happiness?
Answer
4. Salary
Data science is considered a very well-paying role and was named 'best job of the year' for 2019.
In this section, we would like to investigate the salary ranges for the different job roles in the IT industry and compare them to those of Data Science roles.
4.1 Salary overview
Note that the salaries given in the dataset are in USD. If we are to investigate the salaries in AUD, we need to consider the currency conversion.
You can use the following rate of conversion:
1 USD = 1.47 AUD
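Applying a fixed conversion rate is a single element-wise multiplication in pandas. The sketch below uses the rate stated above on invented salaries. The column names `JobTitle` and `Salary` are assumptions about the data set.

```python
import pandas as pd

USD_TO_AUD = 1.47  # conversion rate stated in the text

# Hypothetical data; the column names "JobTitle" and "Salary" are assumptions.
df = pd.DataFrame({
    "JobTitle": ["Data Scientist", "Data Scientist", "Tester"],
    "Salary": [100000, 80000, 50000],
})

# Multiplying a column by a scalar converts every row at once.
df["SalaryAUD"] = df["Salary"] * USD_TO_AUD

# Summarise the converted salaries per job title.
salary_summary = df.groupby("JobTitle")["SalaryAUD"].agg(["max", "median"])
print(salary_summary.round(2))
```

Keeping the rate in a named constant makes it easy to see, and to change, which conversion was applied.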
Let's have a look at the data.
1. Create a derived column "SalaryAUD" containing the salary data converted into Australian Dollars (AUD).
Print out the maximum and median salary in AUD for each of the job roles in our dataset.
In [14]:
# Your code
2. Do those figures confirm that data scientists are well paid?
Answer
4.2 Salary by country
Since each country has a different cost of living and different pay indexes, we want to compare these jobs only in Australia.
3. Plot a boxplot of the Australian respondents' salary distribution, grouped by job title.
In [15]:
# Your code
4. How are data scientists paid in comparison to other roles in Australia?
Answer
5. Australia's salaries look pretty good in general. Is that the case for all other countries?
Plot the salaries of all countries on a bar chart (with error bars).
Hint: Consider all job titles and filter for full-time employees only.
In [16]:
# Your code
4.3 Salary and Gender
The gender pay gap in the tech industry is a big talking point. Let's see if the respondents are noticing the effect.
7. Plot the salaries of all respondents grouped by gender on a boxplot.
In [22]:
# Your code
8. What do you notice about the distributions?
Answer
9. The salaries may be affected by the country the respondent is from. In Australia, the weekly difference in pay between men and women is 17.7%, and in the United States it is 26%.
Print the median salaries of Australia, the United States and India grouped by gender.
In [17]:
# Your code
4.4 Salary and formal education
Is getting your Master's really worth it? Do PhDs get more money?
Let's see.
10. Plot the salary distribution of all respondents, grouped by formal education type, on a boxplot.
In [33]:
# Your code
11. Is it better to get your Master's or PhD?
Explain your answer.
Answer
4.5 Salary and Employment Sector
Do government jobs pay better than the private sector? Does it differ based on the country?
Let's see.
12. Plot a bar chart (with error bars) of the salaries of respondents for each of the employment sectors.
In [18]:
# Your code
13. Which seems to be the highest paying sector overall?
Do you think it would differ based on the country?
Propose a method to find out and explain your answer.
Answer
5. Predicting salary
We have looked at many variables and seen that there are a lot of factors that could affect your salary.
Let's say we wanted to predict it; one method we could use is a linear regression. This is a basic but powerful model that can give us some insights. Note, though, that there are more robust ways to predict salary based on categorical variables. But this exercise will give you a taste of predictive modelling.
1. Plot the salary and years-of-experience of respondents on a scatterplot.
In [34]:
# Your code
2. Let's refine this.
Remove salary outliers using the 2-sigma rule and then create a linear regression between the salary and years of experience of full-time respondents.
Plot the linear fit over the scatterplot.
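To make the 2-sigma rule concrete, the sketch below applies it to synthetic data and then fits a straight line with NumPy. The data are generated, not taken from the survey, and the "true" slope of 3000 is an arbitrary choice for the simulation. `np.polyfit` is one of several ways to fit a line. No filtering by employment type is done here.

```python
import numpy as np

# Synthetic data generated for illustration; not the assignment data.
rng = np.random.default_rng(0)
experience = rng.uniform(0, 20, size=200)
salary = 50000 + 3000 * experience + rng.normal(0, 5000, size=200)
salary[:2] = [900000, 850000]  # inject two implausibly high values

# 2-sigma rule: keep observations whose salary lies within two
# standard deviations of the mean salary.
mean, std = salary.mean(), salary.std()
keep = np.abs(salary - mean) <= 2 * std
x, y = experience[keep], salary[keep]

# Least-squares straight line: y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept  # values to draw over the scatterplot

print(int(keep.sum()), round(slope), round(intercept))
```

Note that extreme values inflate the mean and standard deviation themselves, so the rule can fail to catch outliers on the opposite side of the distribution. It is worth checking what was actually removed.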
In [19]:
# Your code
3. Do you think that this is a good way to predict salaries?
Explain your answer.
Answer
6. Tasks and tools
You might be wondering (or not) what different tasks you will be assigned in a data science role and what kind of tools you would be using the most.
In this section, we perform the necessary text processing to investigate such aspects.
6.1 Data science common tasks
We focus here on the three data science job roles and investigate the tasks usually carried out in such roles.
1. Investigate the 'KindsOfTasksPerformed' column and perform the required text processing to enable you to plot a word cloud depicting the frequency of the different tasks.
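The kind of text processing involved can be sketched as below, assuming each cell holds several tasks separated by commas. That separator, and the example task strings, are assumptions; inspect the actual column before choosing how to split it.

```python
from collections import Counter

import pandas as pd

# Hypothetical cell values; the comma separator is an assumption about
# how multiple tasks are stored in one cell.
tasks = pd.Series([
    "Data cleaning, Modelling",
    "Modelling,Visualisation",
    None,
    "data cleaning , Reporting",
])

# Split each non-missing cell into individual tasks, normalise
# whitespace and case, then count how often each task appears.
task_counts = Counter(
    task.strip().lower()
    for cell in tasks.dropna()
    for task in cell.split(",")
    if task.strip()
)

print(task_counts.most_common())
```

A dictionary of frequencies like this is the kind of input that word-cloud libraries accept, for example the `generate_from_frequencies` method of the third-party `wordcloud` package.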
In [20]:
# Your code
6.2 Data Science Common Tools
Now we compare the skill set required by data science roles and other IT roles.
2. Filter your respondents based on the DataScienceRelated flag and plot two separate bar charts depicting the tools used by data science roles versus other roles.
Hint: You will need to do similar text processing to the previous task.
In [21]:
# Your code
3. What do you think are the most commonly used tools for a data science role?
Answer
7. Data quality assessment
'Garbage in, garbage out'.
The saying means that poor quality data will return unreliable and often conflicting results. In this task, you need to assess your data set critically and understand not just what its use means for the outcome of your analysis, but also how those insights inform decisions which lead to broader effects.
1. Now that you have analysed the data, go into the data set file and identify two anomalies. These could be parts of the data that don't seem quite right or logically can't co-exist. Write a paragraph about these explaining what part of your analysis alerted you to them, why they are anomalies, why they may exist, and what could be done to fix them.