IS5126 Assignment 2 (14 Marks)
Submissions and Grading
1. The deadline is Monday, October 31, 11:59 pm.
2. Submit your Python notebook file, a Word/PDF file, and the predicted CSV output (for Q4) to the LumiNUS submission folder. Please zip all the files into one archive and use your student ID as the file name.
3. Do not amend the data files directly. Do not include data files in your zip folder.
4. Plagiarism is strictly prohibited.
5. For all questions, please set random seeds so that TA can replicate your results.
6. The default packages to use are: numpy, pandas, sklearn, statistics, scipy, lightgbm, xgboost, matplotlib, and seaborn. You may use other packages if needed but please include them clearly in your notebook.
7. The TA may judge the quality of your code and deduct up to 3 points out of 14. For example, if you include many unnecessary lines of code (which suggests you do not really know which lines are required to answer the question), the TA can deduct points even if the answer is correct. Roughly:
a. 1 mark for reproducibility (the code does not run on the first attempt),
b. 1 mark for cluttered/messy code,
c. 1 mark for other issues (e.g., the code still does not run after clarifying with the student).
8. You can add comments in your code to show your understanding and help the TA follow it.
9. The TA can award up to +2 bonus points to submissions whose effort or quality she feels is excellent. The maximum you may receive is still 14 out of 14.
a. E.g., the student's code runs faster than normal (efficient).
b. E.g., excellent coding: functions/loops used to group commands and good variable naming (neat), and/or a consistently clear report/comments in the notebook (detailed).
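A minimal sketch of point 5 above (the seed value 42 is an arbitrary assumption, not a requirement):

```python
# Fix seeds so the TA can replicate results (point 5 above).
import random
import numpy as np

SEED = 42  # any fixed integer works; 42 is an assumed choice

random.seed(SEED)
np.random.seed(SEED)
# sklearn and boosting estimators take the seed explicitly, e.g.:
#   LGBMClassifier(random_state=SEED)
#   KFold(shuffle=True, random_state=SEED)
```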
Dataset
1. The dataset contains the real financial numbers of USA public firms. The training and testing datasets are two CSV files in the LumiNUS folder. The DV is a binary variable indicating bankruptcy or not. The input features are several key accounting variables from the financial statements. More details are in Variable_List.csv.
2. We also provide a template Python code to simplify this assignment. You are required to modify the template; you are not allowed to code from scratch in a substantially different style.
3. In A2, we only practice prediction modelling. First, do not use GVKey, Datadate, and Company name in your prediction model; those are only for your reference, to identify which company each row refers to. There are also sample codes for you to fill in for this simple step.
4. There is only one categorical variable: the industry definition code. Details are at https://en.wikipedia.org/wiki/Standard_Industrial_Classification. A SIC code has 4 digits with a hierarchical meaning: firms whose codes share the same first digit are in the same broadly defined industry (see, for example, https://siccode.com/sic-code-lookup-directory), and firms whose codes share the same first two digits are in the same industry, as explained at, for example, https://www.naics.com/sic-codes-industry-drilldown/
5. There are 20 numerical features, 1 categorical feature, and 1 binary DV. You are not allowed to use “date” for prediction.
6. Use the data in training.csv to build your model, then make a final prediction on test.csv (which has no labels).
7. The prediction performance metric is PR-AUC. You can use average precision as a proxy.
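Point 7 above can be sketched with sklearn's average-precision scorer (the toy labels and scores below are illustrative only, not from the dataset):

```python
# Average precision as a proxy for PR-AUC (point 7 above).
from sklearn.metrics import average_precision_score

y_true = [0, 0, 1, 1]            # toy ground-truth labels
y_score = [0.1, 0.4, 0.35, 0.8]  # toy predicted probabilities
ap = average_precision_score(y_true, y_score)  # ~0.83 for this toy case
# In GridSearchCV, select the same metric with scoring="average_precision".
```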
Q1 (3 Points). Data Preprocessing
The objective of Q1 is to build a pipeline that satisfies the following requirements: (1) feature engineering by adding interaction terms, (2) scaling the numerical variables, and (3) encoding the categorical values. The specification details are as follows (you should use the sample template and modify from there):
Q1-1 (1.5 points): The first requirement of the pipeline is adding interaction terms. This applies only to the numerical features. Please add the interaction terms of all pairs as additional features, coded as one step in the pipeline. There are 20 numerical features, so there are 20*19/2 = 190 pairs, and you will have 190 more columns after this step.
• A more meaningful feature engineering step would be to compute the industry average of the 20 columns and then construct 20 new columns equal to the original values minus the industry averages. To save you some time, this is not a requirement. If you are interested, you can try coding it and see whether the prediction performance improves further (this part is not allowed to be added into Q4).
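A minimal sketch of the Q1-1 interaction step, assuming the 20 numerical columns are already extracted into an array (the random data below is a stand-in for the real features):

```python
# All pairwise interaction terms as one pipeline step.
# With interaction_only=True and include_bias=False, PolynomialFeatures
# keeps the 20 original columns and appends the 20*19/2 = 190 pairwise
# products, for 210 numerical columns in total.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_num = np.random.rand(100, 20)  # stand-in for the 20 numerical features
interactions = PolynomialFeatures(degree=2, interaction_only=True,
                                  include_bias=False)
X_out = interactions.fit_transform(X_num)  # shape (100, 210)
```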
Q1-2 (1.5 points): The second step is scaling the numerical variables. In general, there are four scaling methods: StandardScaler (standardization), MinMaxScaler, MaxAbsScaler, and RobustScaler. Set the default scaler to StandardScaler. For the requirement in Q1-2, please code a StandardScaler solution; later, in Q2, you will tune among the 4 scalers.
Q1-3 (OPTIONAL, no points): Review the customized Pipeline in the sample code, which handles the SIC code in the following way.
• First create 11 categories according to the table in Appendix A.
• Apply sklearn.preprocessing.OneHotEncoder on these 11 categories.
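A hedged sketch of the idea (this is not the provided template; the function name and category labels below are made up for illustration): bucket each 4-digit SIC code into one of the 11 Appendix A divisions, then one-hot encode the result.

```python
# Map a 4-digit SIC code to one of the 11 Appendix A divisions,
# then one-hot encode. Labels are illustrative, not from the template.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

def sic_to_division(sic):
    """Return the Appendix A division label for a 4-digit SIC code."""
    bins = [(100, 999, "Agriculture"), (1000, 1499, "Mining"),
            (1500, 1799, "Construction"), (1800, 1999, "NotUsed"),
            (2000, 3999, "Manufacturing"), (4000, 4999, "Transport"),
            (5000, 5199, "Wholesale"), (5200, 5999, "Retail"),
            (6000, 6799, "Finance"), (7000, 8999, "Services"),
            (9100, 9729, "PublicAdmin")]
    for lo, hi, name in bins:
        if lo <= sic <= hi:
            return name
    return "Other"  # codes falling outside every listed range

sic_codes = np.array([[2834], [6020], [7372]])  # toy example codes
divisions = np.vectorize(sic_to_division)(sic_codes)
encoded = OneHotEncoder(handle_unknown="ignore").fit_transform(divisions).toarray()
```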
Q2: Practicing Grid Search and Simple Manual Tuning (3 points)
You shall perform grid search on LightGBM, XGBoost, and SVM, as well as fit the models for prediction. The code template has already been provided; you simply need to fill in the missing parts to fulfil the requirements.
Q2-1: Grid search for LightGBM, XGBoost, and SVM (0.5 marks each)
You will create these different models and run GridSearch with the parameters in the following table. For faster processing, set n_jobs=-1 on the GridSearchCV function; this allows parallel processing across your CPU cores.
Model                Scaler Types                      Parameters to Tune
LightGBM             StandardScaler, MinMaxScaler,     max_depth = {3, 6, 9}
                     RobustScaler, MaxAbsScaler        n_estimators = {100, 150}
                                                       learning_rate = {0.01, 0.05}
XGBoost              StandardScaler and                max_depth = {6}
                     MinMaxScaler                      n_estimators = {100, 200}
                                                       learning_rate = {0.01, 0.05}
SVM (set to          MinMaxScaler only                 C = {1, 10}
RBF kernel)                                            gamma = {1, 10}
The performance metric is PR-AUC. This is not meant to be a fair comparison of the algorithms – a fairer comparison would entail tuning each algorithm as well as you possibly can before assessing them. However, given time constraints, we only need you to perform GridSearch on the ranges above.
Q2-2: Simple Manual Tuning Adjustments from GridSearch results (0.5 mark)
For the LightGBM best estimator model, you are required to replicate the best_estimator results. grid_LGB.best_params_ shows the parameter combination tested with CV that yields the best result. We will then create a separate LGBM model with that list of parameters.
Step 1: Search for the code template snippet provided. Your task is to modify best_params so that it matches the best params of your LGB CV model. This creates a pipeline with the manually inserted params you specified. Run the following blocks of code, which save the model to manually_tuned_LGB.
Step 2: Calling the following part of the code will test and generate the cross-validated PR-AUC from your CV model and from your manually inserted params model. Check to see if they are similar.
Step 3 (Optional): The purpose of the above is to practice retrieving the best parameters from your CV search and to give you flexibility in modelling later on. As additional practice for Q4, you may create another variable, e.g., best_params2, and save it into manually_tuned_LGB2. Try to see whether you can score higher than the model generated from CV. Do not save this into manually_tuned_LGB – the TA will check the code to see whether your outcome is replicated.
Q3: Stacking
Modify the Stacking sample code that I uploaded to the LumiNUS folder. You need to use the pipeline to chain everything together, similar to that sample code.
• Level-0: LightGBM, XGBoost, and SVM-RBF kernel (the ones you obtained from Q2-1)
• Level-1: LightGBM
• Use the sklearn.ensemble.StackingClassifier package, not the other two methods covered in class.
• You don’t need to split train.csv into two parts. Just use the code template and fill in the missing part.
The requirement for tuning the L1 model is that you use the template code and fill in param_grid. For this question, we would like you to experiment with LightGBM parameters. While the code template provides GridSearchCV, due to the time needed to train a single iteration of the StackingClassifier, you are allowed to manually tune the model instead (by specifying only one value in each parameter list).
The key objective of Q3 is for you to generate the cross-validated PR-AUC results from StackingClassifier.
Notes: One method you could try is to follow the intuition behind 1_ManualTuning_XGBoost.ipynb in Week 6 on LumiNUS. That is, first hold everything else constant (refer to pipe_LGB, the default LGBM pipeline object, to see its default values) and specify only one parameter as a list of multiple values. After you retrieve the best value for that parameter, fix it and move on to the next parameter.
However, if you do this approach, do note that this does not necessarily lead to a globally optimal solution (the actual best solution of this optimization problem). There is no easy solution for tuning or optimization in general to find this globally best solution.
You may leave your final params list in the code for this part. However, you should document and write down the general steps and the multiple experiments you undertook to arrive at the final params list you had (in markdown).
While the detail of the documentation is not marked for Q3, it will be assessed in Q4. So it is recommended that you practice it here (to understand the effect of changing the LGBM parameters) on this simple setup, as you will have many more variables to consider and change in Q4.
Q4: Final Prediction on test.csv
Now please use your pipelines with stacking from Q3 to make predictions on the test.csv dataset.
You should copy the code cells from Q3 and begin Q4 in a separate section of code. For Q1–Q3, as long as your code runs and is correctly specified to the requirements, you will get full marks. Q4 is about tuning, cross-validation, and the encoding of the SIC code. The grading is mostly based on your prediction result and how you encode SIC codes, so please submit both the code in a Jupyter Notebook and the results in a CSV file.
• Name your file q4_pred.csv. This is not a data-competition task: you are not allowed to use structures different from Q1–Q2 that may affect prediction performance significantly. More details on the elements of the earlier parts you are allowed to deviate from are below.
o First, remember to submit predicted probabilities, not 0/1 labels, as your prediction values. Make sure you do not mess up the order of the predictions; in the past, 1 or 2 students out of 50 submitted results that were completely wrong for various reasons.
o Q1-1: you are allowed to include all interaction terms OR no interaction terms at all. You are not allowed to do feature selection on the 190 features in Q3 (although, in general, this may be a helpful step in a pipeline).
o Q1-2: You are allowed to tune the 4 types of scaling, or use no scaling at all, but you are not allowed to include additional scaling methods. You should justify why you picked a particular method.
o Q3: you are allowed to tune the parameters of the 3 level-0 models and the 1 level-1 model as you like. You are NOT allowed to change the algorithms: level-0 must be LightGBM, XGBoost, and SVM-RBF, and level-1 must be LightGBM.
o For encoding of the SIC code, you are allowed to explore any method that encodes the categorical SIC code to numbers.
• The only three things allowed are hyper-parameter tuning, cross-validation, and encoding of the SIC code. You are allowed to use any combination of cross-validation. For tuning, you are allowed to use only the RandomizedSearchCV or GridSearchCV methods. You are encouraged to use repeated stratified cross-validation, which may improve performance a little.
• Optional: You can also explore RandomizedSearchCV from online resources if you wish to explore more combinations at a time. (The lecture mentioned that one can do RandomizedSearch first and then a second round of GridSearch around those values. How to do this? Hint: you can retrieve cv_results_ and shortlist the best N average PR-AUC scores after organizing the results in a data frame. This does require some work, however.)
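For the submission file itself, a sketch of the final output step (LogisticRegression and the toy data below are stand-ins for your fitted Q3/Q4 stacking pipeline, and pred_prob is an assumed column name):

```python
# Write predicted probabilities, in the original row order of the test
# set, to q4_pred.csv. Submit probabilities, never 0/1 labels.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy stand-ins; in the assignment, fit your Q3 stacking pipeline on
# training.csv and load the real test.csv instead.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(100, 3))
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(10, 3))

model = LogisticRegression().fit(X_train, y_train)

# Column 1 of predict_proba is P(class = 1), i.e. bankruptcy probability.
pred = model.predict_proba(X_test)[:, 1]
pd.DataFrame({"pred_prob": pred}).to_csv("q4_pred.csv", index=False)
```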
Your prediction performance will decide the grading. The grading is the following.
• Prediction metric is PR-AUC.
• As long as your prediction performance is >= the median: 4 out of 6.
• 40th–50th percentile => 3.5
• 30th–40th percentile => 3
• 20th–30th percentile => 2.5
• 10th–20th percentile => 2
• 0–10th percentile => 1
• 2 Points is for TA to evaluate the quality of your efforts spent in Q4, especially how you do the encoding of SIC codes.
Appendix A:
Range of SIC Codes    Division
0100-0999             1. Agriculture, Forestry and Fishing
1000-1499             2. Mining
1500-1799             3. Construction
1800-1999             (not used)
2000-3999             4. Manufacturing
4000-4999             5. Transportation, Communications, Electric, Gas and Sanitary service
5000-5199             6. Wholesale Trade
5200-5999             7. Retail Trade
6000-6799             8. Finance, Insurance and Real Estate
7000-8999             9. Services
9100-9729             10. Public Administration
2022-10-29