General Python Tips and Tricks
Introduction
Python is an extremely versatile programming language and one of the most widely used across a broad range of applications. With the right packages it can even compete with industry-standard statistical software (SAS, Stata, etc.) for data manipulation and analytics. In many cases you will find that large statistical analyses run faster when written in streamlined Python code. However, the Python language itself is quite different from the languages used by most statistical programs. Hopefully this document helps with the learning curve!
Necessary Packages
We will be using the following packages to assist us in our analyses:
• Pandas: short for “panel data”, this is a library specifically tailored for data manipulation and analysis.
• Numpy: a library that helps support functions operating on arrays and matrices.
• Matplotlib: allows us to plot output in appealing formats.
• Sklearn: short for Scikit-learn, this will provide most of our statistical tools, such as regressions.
• Researchpy: a library that includes useful statistical tests like the t-test.
• Datetime: a module from the Python standard library that lets us manipulate variables with a date format.
We will import most of these packages in full, but we only need the plotting interface from Matplotlib and the linear regression class from Sklearn. You can accomplish this with the following code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import researchpy as rp
from datetime import datetime
Useful Tools
1. Forming quantiles grouped by another variable
Let's assume z is the grouping variable, and y is the variable for which we want to generate N quantiles:
df['y_quantile'] = df.groupby(['z'])['y'].transform(lambda x: pd.qcut(x, N, labels = False))
For those interested in learning how this is all working, here is the breakdown of the code:
• df['y_quantile'] tells Python to create a new column called y_quantile in the dataframe df, using the output from whatever is on the other side of the equals sign.
• df.groupby(['z']) tells Python to group the data by the z variable before any manipulations.
• ['y'].transform() tells Python to select the variable y and apply whatever is in the parentheses to it, separately within each z group.
• lambda x: ___ tells Python that we are defining a function that takes a variable x. In this case transform passes the y values of each z group into the function as the argument x.
• pd.qcut(x, N, labels = False) uses the qcut function from the pandas package to split the variable x into N quantiles. N = 4 gives you quartiles, N = 10 gives you deciles, etc. Because we pass labels = False, qcut returns integer bin numbers starting from 0 instead of interval labels, so quartiles will give you bins numbered 0, 1, 2, and 3.
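To make the breakdown above concrete, here is a minimal sketch on a small made-up dataframe. The data values and the choice N = 2 are assumptions for illustration only; the groupby/transform/qcut line is the same one discussed above.

```python
import pandas as pd

# Hypothetical toy data: two groups in z, four y values in each.
df = pd.DataFrame({
    "z": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "y": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})

N = 2  # number of quantiles per group (2 splits each group at its median)

# Within each z group, assign every y value to one of N quantile bins.
df["y_quantile"] = df.groupby(["z"])["y"].transform(
    lambda x: pd.qcut(x, N, labels=False)
)

print(df)
```

Because the bins are formed within each group, the value 4.0 in group a and the value 40.0 in group b both land in the top bin, even though they are far apart in absolute terms.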
2. Generating summary statistics
Let's assume we wish to find the means of y grouped by z and our newly defined y_quantile variable:
df.groupby(['z', 'y_quantile'])['y'].mean()
A quick way to get a bunch of summary statistics all at once is to use the describe() function in place of mean(). This will give the count, mean, standard deviation, minimum, quartiles, and maximum all in the same output!
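As a small sketch of the difference between the two calls, the example below uses a made-up dataframe and groups by z only to keep the output short. The data values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data with one grouping variable z.
df = pd.DataFrame({
    "z": ["a", "a", "a", "b", "b", "b"],
    "y": [1.0, 2.0, 3.0, 10.0, 20.0, 60.0],
})

# Mean of y within each z group.
group_means = df.groupby(["z"])["y"].mean()

# Count, mean, std, min, quartiles, and max of y within each z group.
group_summary = df.groupby(["z"])["y"].describe()

print(group_means)
print(group_summary)
```

The result of describe() is itself a dataframe with one row per group, so a single statistic can be pulled out with, for example, group_summary['mean'].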
3. Plotting a graph of y on x
plt.plot(x, y)
plt.title('title name')
plt.xlabel('xAxis name')
plt.ylabel('yAxis name')
plt.show()
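When x and y are columns of a dataframe, the same plot can be built as in the sketch below. The data are made up, and the example uses Matplotlib's figure/axes interface (plt.subplots) rather than the plt.plot calls above; both produce the same kind of line plot.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: y is an exact linear function of x.
df = pd.DataFrame({"x": np.arange(10)})
df["y"] = 2 * df["x"] + 1

# Create a figure with a single set of axes and draw y against x.
fig, ax = plt.subplots()
ax.plot(df["x"], df["y"], marker="o")
ax.set_title("y on x")
ax.set_xlabel("x")
ax.set_ylabel("y")
```

Call plt.show() afterwards to display the figure in an interactive session.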
4. Two-sided independent t-test for a statistically significant difference in the mean value of y between two groups x_1 and x_2, both of which are values of the variable x
rp.ttest(group1 = df['y'][df['x'] == 'x_1'], group1_name = 'Upper',
group2 = df['y'][df['x'] == 'x_2'], group2_name = 'Lower')
I just used “Upper” and “Lower” as placeholder names here. Use whichever names make the output easy to interpret.
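For intuition about what an independent t-test measures, the sketch below computes the equal-variance (pooled) t statistic by hand with pandas on made-up data. This is an illustration of the textbook formula only, not Researchpy's implementation, and it stops short of the p-value that a full test reports.

```python
import math

import pandas as pd

# Hypothetical toy data: y observed for two groups of x.
df = pd.DataFrame({
    "x": ["x_1"] * 4 + ["x_2"] * 4,
    "y": [5.0, 6.0, 7.0, 8.0, 1.0, 2.0, 3.0, 4.0],
})

# Select y for each group, the same way as in the rp.ttest call.
g1 = df["y"][df["x"] == "x_1"]
g2 = df["y"][df["x"] == "x_2"]
n1, n2 = len(g1), len(g2)

# Pooled variance (assumes both groups share the same variance).
pooled_var = ((n1 - 1) * g1.var() + (n2 - 1) * g2.var()) / (n1 + n2 - 2)

# t statistic: difference in means divided by its standard error.
t_stat = (g1.mean() - g2.mean()) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
dof = n1 + n2 - 2  # degrees of freedom

print(t_stat, dof)
```

A large absolute t statistic relative to its degrees of freedom indicates that the difference in group means is unlikely to be due to chance alone.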
5. Produce labels for “Friday” and “All other days” for comparative analysis
datetime_obj = datetime.strptime(date, '%m/%d/%y')  # requires: from datetime import datetime
This generates a new variable datetime_obj, which is just the date variable converted into a proper date format that the datetime package recognizes. Here date is a string in month/day/two-digit-year form, such as 09/23/22, to match the format '%m/%d/%y'.
To get the weekdays, simply call the weekday() function on our new datetime_obj variable, and define the result as a new variable in our dataframe. Note that this function returns an integer, with 0 corresponding to Monday and 6 corresponding to Sunday. Assuming you've already created this new variable and called it weekday, you can convert it to a more recognizable string with the following:
weekDays = ('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun')
weekday_string = weekDays[weekday]
Now you can do most manipulations using the groupby function on the weekday variable, or you can do the intermediate step of splitting the dataframe into two new sub-samples based on the weekday of interest:
df_Friday = df[df['weekday'] == 4]
df_Other = df[df['weekday'] != 4]
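Putting the steps of this tip together, a minimal sketch might look like the following. The dataframe, the column names date and day_label, and the use of np.where to build the label are assumptions for illustration; the dates are strings in the same '%m/%d/%y' format used above.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Hypothetical toy data: date strings in month/day/two-digit-year form.
df = pd.DataFrame({
    "date": ["09/19/22", "09/22/22", "09/23/22", "09/24/22"],
    "y": [1.0, 2.0, 3.0, 4.0],
})

# Parse each date string, then take its weekday number (0 = Monday, 6 = Sunday).
df["weekday"] = df["date"].apply(
    lambda d: datetime.strptime(d, "%m/%d/%y").weekday()
)

# Label Fridays (weekday 4) versus every other day.
df["day_label"] = np.where(df["weekday"] == 4, "Friday", "All other days")

# Compare the mean of y across the two labels.
print(df.groupby(["day_label"])["y"].mean())
```

With the label stored as a column, the groupby and t-test tools from the earlier tips can be applied to day_label directly, without splitting the dataframe first.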
2022-09-22