General Python Tips and Tricks

Introduction

Python is an extremely versatile programming language and one of the most widely used across a broad range of applications. With the right packages it can compete with industry-standard statistical software (SAS, Stata, etc.) for data manipulation and analytics. In many cases, large statistical analyses run faster when written in streamlined Python code. However, the Python language itself is quite different from the languages used by most statistical programs. Hopefully this document helps with the learning curve!

Necessary Packages

We will be using the following packages to assist us in our analyses:

•    Pandas: short for “panel data”, this is a library specifically tailored for data manipulation and analysis.

•    Numpy: a library that supports functions operating on arrays and matrices.

•    Matplotlib: allows us to plot output in appealing formats.

•    Sklearn: short for Scikit-learn, this will provide most of our statistical tools, such as regressions.

•    Researchpy: a library that includes useful statistical tests like the T-test.

•    Datetime: a module from Python's standard library that lets us manipulate variables with a date format.

We want complete access to most of these packages, but from Matplotlib and Sklearn we only need specific plotting and regression functions, respectively. You can accomplish this with the following code:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression

import researchpy as rp

import datetime

Useful Tools

1.    Forming quantiles grouped by another variable

Let's assume z is the grouping variable and y is the variable for which we want to generate N quantiles:

df['y_quantile'] = df.groupby(['z'])['y'].transform(lambda x: pd.qcut(x, N, labels = False))

For those interested in learning how this is all working, here is the breakdown of the code:

•    df['y_quantile'] tells Python to create a new column called y_quantile in the dataframe df, using the output from whatever is on the other side of the equals sign

•    df.groupby(['z']) tells Python to group the data by the z variable before any manipulations

•    ['y'].transform() tells Python to apply whatever is in the parentheses to the variable y within each group of z, returning a result aligned with the original rows

•    lambda x: ___ tells Python that we are defining an anonymous function that takes an argument x. In this case, transform passes each group's y values into the function as x

•    pd.qcut(x, N, labels = False) uses the qcut function from the pandas package to split the variable x into N quantiles. N = 4 gives you quartiles, N = 10 gives you deciles, etc. Because we set labels = False, qcut returns integer bin numbers starting from 0, so quartiles will give you bins labeled 0, 1, 2, and 3.
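To see the pieces work together, here is a minimal sketch on a made-up dataframe. The column names z and y follow the text above; the data values and the choice N = 2 are assumptions, picked only so the result is easy to check by hand.

```python
import pandas as pd

# Hypothetical toy data: two groups in z, four observations of y in each
df = pd.DataFrame({
    "z": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "y": [1, 2, 3, 4, 10, 20, 30, 40],
})

N = 2  # two quantiles, i.e. split each group at its own median

# Within each z group, assign every y value to one of N quantile bins
df["y_quantile"] = df.groupby(["z"])["y"].transform(
    lambda x: pd.qcut(x, N, labels=False)
)

print(df)
```

Note that the bins are computed separately inside each group: a y of 10 is in the bottom half of group b even though it is larger than every value in group a.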

2.    Generating summary statistics

Let's assume we wish to find the means of y grouped by z and our newly defined y_quantile variable:

df.groupby(['z', 'y_quantile'])['y'].mean()

A quick way to get a bunch of summary statistics all at once is to use the describe() function in place of mean(). This will give the count, mean, standard deviation, minimum, maximum, and quartiles all in the same output!
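As a rough illustration of describe() on grouped data, the sketch below uses a made-up dataframe; the values are assumptions, and only the column names z and y come from the text.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "z": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "y": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})

# One row per z group; one column per summary statistic
summary = df.groupby("z")["y"].describe()
print(summary)

# Individual statistics can be read off by group label and column name
mean_a = summary.loc["a", "mean"]
print(mean_a)
```

The result is itself a dataframe, so it can be filtered, sorted, or exported like any other.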

3.    Plotting a graph of y on x

plt.plot(x, y)

plt.title('title name')

plt.xlabel('xAxis name')

plt.ylabel('yAxis name')

plt.show()

4.    Two-sided independent T-test for a statistically significant difference in the mean value of y between two groups, x_1 and x_2, both of which are values of the variable x

rp.ttest(group1 = df['y'][df['x'] == 'x_1'], group1_name = "Upper",

         group2 = df['y'][df['x'] == 'x_2'], group2_name = "Lower")

I just used "Upper" and "Lower" as placeholder names here. Use whichever names make the output easy to interpret.
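If you want to sanity-check the numbers that rp.ttest reports, the same kind of test can be run with scipy.stats.ttest_ind. This is a swapped-in alternative rather than part of the package list above: SciPy, the equal-variance setting, and the toy values are all assumptions of this sketch.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data: five observations of y in each group of x
df = pd.DataFrame({
    "x": ["x_1"] * 5 + ["x_2"] * 5,
    "y": [1, 2, 3, 4, 5, 2, 3, 4, 5, 6],
})

group1 = df["y"][df["x"] == "x_1"]
group2 = df["y"][df["x"] == "x_2"]

# Two-sided independent-samples t-test, assuming equal variances
result = stats.ttest_ind(group1, group2, equal_var=True)
t_stat, p_value = result.statistic, result.pvalue
print(t_stat, p_value)
```

Here the group means differ by 1 and the pooled standard error is also 1, so the t statistic is -1; with so few observations the p-value is far from conventional significance levels.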

5.    Produce labels for "Friday" and "All other days" for comparative analysis

datetime_obj = datetime.datetime.strptime(date, '%m/%d/%y')

This generates a new variable "datetime_obj" that is just a conversion of the "date" variable (a string such as '01/05/24') into a proper date format that the datetime package recognizes. Because we imported the package with import datetime, strptime is reached through the datetime class inside it, hence datetime.datetime.

To get the weekdays, simply call the weekday() function on our new "datetime_obj" variable, and define the result as a new variable in our dataframe. Note that this function returns an integer, with 0 corresponding to Monday and 6 corresponding to Sunday. Assuming you've already created this new variable and called it "weekday", you can convert it to a more recognizable string with the following:

weekDays = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

weekday_string = weekDays[weekday]
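The lines above convert one date at a time. For a whole dataframe column, one possible approach is pd.to_datetime together with the .dt accessor. The sketch below assumes a column named date holding strings in the %m/%d/%y format; the dates themselves are made up.

```python
import pandas as pd

# Hypothetical toy data: dates stored as month/day/two-digit-year strings
df = pd.DataFrame({"date": ["01/05/24", "01/06/24", "01/08/24"]})

# Parse the whole column of strings into pandas datetimes
df["datetime_obj"] = pd.to_datetime(df["date"], format="%m/%d/%y")

# Integer weekday for every row: 0 = Monday, ..., 6 = Sunday
df["weekday"] = df["datetime_obj"].dt.weekday

# Look up the short name for each integer weekday
weekDays = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
df["weekday_string"] = df["weekday"].map(lambda d: weekDays[d])

print(df)
```

This keeps the integer weekday column, so later filters such as weekday == 4 still work as written.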

Now you can do most manipulations using the groupby function on the weekday variable, or you can take the intermediate step of splitting the dataframe into two sub-samples based on the weekday of interest (here Friday, which is weekday 4):

df_Friday = df[df['weekday'] == 4]

df_Other = df[df['weekday'] != 4]
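As an alternative to splitting the dataframe, the "Friday" and "All other days" labels can be stored in a single column and used with groupby. This is only a sketch: the column name day_label, the outcome column y, and the data values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: integer weekdays (4 = Friday) and an outcome y
df = pd.DataFrame({
    "weekday": [4, 0, 4, 2],
    "y": [1.0, 2.0, 3.0, 4.0],
})

# Label each row by whether it falls on a Friday
df["day_label"] = np.where(df["weekday"] == 4, "Friday", "All other days")

# Compare the mean of y across the two labels
means = df.groupby("day_label")["y"].mean()
print(means)
```

The same day_label column could also be used to select the two groups for the T-test in item 4.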