
DE2 Data Science Assignment 2

Who is going to be the winner of League of Legends games?

INTRODUCTION

Background and Glossary

League of Legends (LoL) is one of the most popular video games in the world. It is a multiplayer online battle arena game in which two teams (blue and red) face off. The goal is to reach and destroy the opponent’s Nexus, the primary objective in each team’s base: whichever team destroys the enemy’s Nexus first wins the game. As shown in Figure 1, there are lanes and jungle areas where players of the opposing teams fight, kill minions and elite monsters to obtain gold and experience, place wards, and destroy towers along the lanes to get closer to the opponent’s base and finally take down their Nexus.

Context and problem

Data analysis plays a crucial role in the video game development process. Insights from gaming analytics can help game designers learn which factors are good indicators of a team’s chance of winning, and can help predict gaming bottlenecks, their causes and their timing. We accessed a League of Legends dataset and realised that its data, taken from the first 10 minutes of each game, gives us a good opportunity to build predictive models that analyse different attributes and predict whether our target team (the blue team in our case) will win by the end of the game (usually about 30 minutes). The resulting insight can then be used to improve the existing game, or even to develop new concepts, storylines and mechanics.

Introduction to the dataset

Our dataset, League of Legends Diamond Ranked Games (10 min), comes from the Riot Games API and contains first-10-minute stats for 9,879 ranked solo-queue games played at high ELO (DIAMOND I to MASTER) from //// to //////. The authors extracted 40 attributes, including:

• Unique identifiers (gameId)

• Target attribute (blueWins)

• Characteristics of the game for blue team (blueWardsPlaced…)

• Characteristics of the game for red team (redWardsPlaced…)

The innovative aspect of this dataset is that it uses features from only the first ten minutes of the game, effectively allowing a League of Legends player to see whether some factors are exceptionally good indicators of who will win a game before it ends. This applies not only to amateur players, but also to professional players, coaches and other related parties (spectators, sponsors etc.) in professional League of Legends games.

Data preparation

The dataset consists of attributes recorded for the blue and red teams respectively; some of these attributes should therefore be processed so that the two teams can be compared directly and a better model can be built.

Firstly, the following attributes are removed:

• gameId: It is a non-predictive identifier and has no use for our predictive purpose.

• blueDragons/Heralds, redDragons/Heralds: In the game’s definition, dragons and heralds are elite monsters, so the EliteMonsters attributes already cover these four columns.

• blueKills/Deaths/Assists, redKills/Deaths/Assists: Since there are only two teams, blueKills = redDeaths and redKills = blueDeaths, so these columns are redundant; we combined them into a new attribute (KDA_Diff below).

Secondly, the following attributes are added as new columns; simple calculations integrate the blue and red attributes into a single value per game and therefore simplify the dataset (a short code sketch of these steps follows the list):

• VisionScoreDiff: the blue/red WardsPlaced and WardsDestroyed attributes together represent each team’s vision score.

VisionScoreDiff = ( blueWardsPlaced + blueWardsDestroyed ) - ( redWardsPlaced + redWardsDestroyed )

• KDA_Diff: kills, deaths and assists are similar features in the game, so they can be combined into a single feature.

KDA_Diff = ( blueKills - blueDeaths + blueAssists ) - ( redKills - redDeaths + redAssists )

• EliteMonsterDiff/TotalMinionsKilledDiff/TotalJungleMinionsKilledDiff/AvgLevelDiff: Differences are calculated using corresponding blue and red attributes.
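A minimal pandas sketch of these preparation steps is shown below. This is not the authors’ exact code: the CSV filename and the raw column names are assumptions based on the dataset description above.

```python
# A minimal sketch of the data preparation described above (assumed filename and column names).
import pandas as pd

df = pd.read_csv("high_diamond_ranked_10min.csv")   # assumed filename

# Remove the identifier and the dragon/herald columns already covered by EliteMonsters.
df = df.drop(columns=["gameId", "blueDragons", "blueHeralds", "redDragons", "redHeralds"])

# Derived difference features (blue minus red), as defined in the list above.
df["VisionScoreDiff"] = (df["blueWardsPlaced"] + df["blueWardsDestroyed"]) \
                        - (df["redWardsPlaced"] + df["redWardsDestroyed"])
df["KDA_Diff"] = (df["blueKills"] - df["blueDeaths"] + df["blueAssists"]) \
                 - (df["redKills"] - df["redDeaths"] + df["redAssists"])
df["EliteMonsterDiff"] = df["blueEliteMonsters"] - df["redEliteMonsters"]
df["TotalMinionsKilledDiff"] = df["blueTotalMinionsKilled"] - df["redTotalMinionsKilled"]
df["TotalJungleMinionsKilledDiff"] = (df["blueTotalJungleMinionsKilled"]
                                      - df["redTotalJungleMinionsKilled"])
df["AvgLevelDiff"] = df["blueAvgLevel"] - df["redAvgLevel"]
# GoldPerMinDiff, used later in the report, can be computed in the same way
# from blueGoldPerMin and redGoldPerMin (assumed column names).

# The raw blue/red columns that have been folded into the new features
# can now be dropped as well, leaving the twelve main features plus blueWins.
```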

After this elimination and combination we have twelve main feature attributes and one target attribute. We then split the data into three sets for training, validation and testing, containing 80%, 10% and 10% of the dataset respectively. The validation set is used for further development after building a model on the training set, and the final test set is held back until the end so that the models can be evaluated on unseen data, giving a realistic estimate of their accuracy.
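One way to obtain the 80/10/10 split is sketched below with two calls to scikit-learn’s train_test_split; the random seed and the variable names (df from the sketch above) are assumptions.

```python
# A sketch of the 80% / 10% / 10% split described above.
from sklearn.model_selection import train_test_split

X = df.drop(columns=["blueWins"])
y = df["blueWins"]

# First carve off 20% for validation + test, then split that portion half-and-half.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=42)
```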

Performance metrics in context

• Accuracy: The proportion of games whose outcome (won by the blue team or not) was correctly predicted.

• Precision: The proportion of games predicted to be won by the blue team that actually were won by the blue team.

• Recall: The proportion of games actually won by the blue team that were predicted as blue-team wins.

• For a balanced dataset, accuracy is the main metric, since correct predictions of blue-team wins and blue-team losses are equally weighted.

As our target attribute is a binary variable and the two classes are reasonably balanced, undersampling is not necessary to obtain representative data in this situation. What we do need, however, is to beat the baseline accuracy of 50.1%, which a model would achieve by always predicting the majority class.


THE PREDICTIVE MODELS

Linear regression

We checked the linear relations (see Appendix) between the blue team’s and the red team’s attributes in our original dataset, which gave us some insight into which attributes may affect the blue team’s victory and deserve further research. We also produced a Seaborn plot (see Appendix) of the 12 attributes obtained after data preparation, giving further insight into which attributes to develop further.
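The full plots are in the Appendix; the sketch below shows one way such exploratory plots can be produced with Seaborn (a correlation heatmap and a pairplot), assuming X_train and y_train from the split sketch above. The subset of attributes passed to the pairplot is chosen for illustration only.

```python
# A sketch of the exploratory plots referenced above (full versions in the Appendix).
import seaborn as sns
import matplotlib.pyplot as plt

prepared = X_train.copy()
prepared["blueWins"] = y_train

sns.heatmap(prepared.corr(), cmap="coolwarm", center=0)   # pairwise linear relations
plt.show()

sns.pairplot(prepared, hue="blueWins",
             vars=["blueGoldDiff", "blueExperienceDiff", "KDA_Diff", "EliteMonsterDiff"])
plt.show()
```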

Logistic regression

Algorithm design

As we want to model the relationship between ‘blueWins’ and all the other features, and ‘blueWins’ is a binary dependent variable, logistic regression is used in our next step of model building. We first generated 12 graphs, one for each of the 12 attributes that may affect victory in a game. After analysing the graphs, we found that blueExperienceDiff, blueGoldDiff, GoldPerMinDiff and KDA_Diff are the 4 factors with the strongest relation to ‘blueWins’.


After producing summaries of all 12 single-attribute logistic regression models (see Appendix), we found that ‘GoldPerMinDiff’, ‘blueExperienceDiff’ and ‘blueGoldDiff’ gave the best prediction accuracies, at 0.723, 0.713 and 0.723 respectively, while ‘blueTowersDestroyed’, ‘TotalJungleMinionsKilledDiff’ and ‘VisionScoreDiff’ had almost no effect on the prediction, with accuracies close to the 0.5 baseline. Therefore, we did further research by removing these three irrelevant attributes and adding the other 9 attributes one by one, recording the results at each step.
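A sketch of the single-attribute comparison is shown below: one logistic regression is fitted per feature on the training set and scored on the validation set. It assumes X_train, X_val, y_train and y_val from the earlier split sketch and is not the authors’ exact procedure.

```python
# A sketch of the per-attribute logistic regression comparison described above.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

single_feature_accuracy = {}
for feature in X_train.columns:
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train[[feature]], y_train)
    single_feature_accuracy[feature] = accuracy_score(y_val, clf.predict(X_val[[feature]]))

# Print the attributes from strongest to weakest single-feature predictor.
for feature, acc in sorted(single_feature_accuracy.items(), key=lambda kv: -kv[1]):
    print(f"{feature:32s} {acc:.3f}")
```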

As attributes are added, the predictive accuracy increases from 0.601 (with only the ‘blueFirstBlood’ attribute) to 0.732 (after adding all 9 attributes). This is a good model (see Figure 6) with an accuracy above 0.73, which suggests these 9 attributes are key factors for the blue team to win a game.
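The incremental experiment might look like the sketch below: the three weak attributes are dropped, and the remaining features are added one at a time starting from ‘blueFirstBlood’. The order in which the report added the other attributes is not specified, so the loop order here is an assumption.

```python
# A sketch of the incremental attribute-adding experiment described above.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

weak = ["blueTowersDestroyed", "TotalJungleMinionsKilledDiff", "VisionScoreDiff"]
order = ["blueFirstBlood"] + [c for c in X_train.columns
                              if c not in weak and c != "blueFirstBlood"]

features = []
for nxt in order:
    features.append(nxt)
    clf = LogisticRegression(max_iter=1000).fit(X_train[features], y_train)
    acc = accuracy_score(y_val, clf.predict(X_val[features]))
    print(f"{len(features):2d} feature(s): accuracy = {acc:.3f}")
```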

Forward Selection

To improve the accuracy of our model, we use a forward selection algorithm to explore different combinations of attributes. The algorithm automatically builds candidate attribute subsets of increasing size, fits a model to each, checks its performance, and collects and compares the results. When the best-fitting model appears (see Figure 7), it is saved and tested with the validation set. Forward selection helps us avoid overfitted models, and gives us a best result of 0.734 training accuracy and 0.746 validation accuracy with a three-variable model: ‘blueGoldDiff’, ‘blueExperienceDiff’ and ‘EliteMonsterDiff’.
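A greedy forward-selection loop consistent with this description is sketched below; the stopping rule and the use of training accuracy to rank candidates are assumptions, and the authors’ exact procedure may differ.

```python
# A sketch of a greedy forward-selection loop over the logistic regression features.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

selected, remaining = [], list(X_train.columns)
best_so_far = 0.0
while remaining:
    scores = {}
    for feature in remaining:
        trial = selected + [feature]
        clf = LogisticRegression(max_iter=1000).fit(X_train[trial], y_train)
        scores[feature] = accuracy_score(y_train, clf.predict(X_train[trial]))
    best_feature = max(scores, key=scores.get)
    if scores[best_feature] <= best_so_far:
        break                                    # no further improvement: stop adding
    best_so_far = scores[best_feature]
    selected.append(best_feature)
    remaining.remove(best_feature)

final = LogisticRegression(max_iter=1000).fit(X_train[selected], y_train)
print("selected features  :", selected)
print("training accuracy  :", accuracy_score(y_train, final.predict(X_train[selected])))
print("validation accuracy:", accuracy_score(y_val, final.predict(X_val[selected])))
```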


Nonlinear Model

Decision Tree

A scikit-learn decision tree classifier is trained on the training set using Gini impurity, a measure of how often a randomly chosen sample from a node would be misclassified if it were labelled according to the node’s class distribution. For two classes, Gini = 1 - p^2 - (1 - p)^2, where p is the proportion of one class, so the maximum Gini is 0.5 (at p = 0.5) and the minimum is 0. We used a structured approach to select the maximum depth and minimum impurity for the decision tree, to ensure the model did not overfit.

To make sure our model did not overfit and remained optimal, we plotted maximum depth against precision, accuracy and recall for our training set and validation set. The results show that the three validation scores peak when the maximum depth is 4, so we use 4 as the maximum depth of our decision tree model (see the graph below).
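The depth sweep behind that graph might look like the sketch below; the range of candidate depths and the random seed are assumptions.

```python
# A sketch of the max-depth sweep described above, comparing validation scores per depth.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

for depth in range(1, 11):
    tree = DecisionTreeClassifier(criterion="gini", max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    y_pred = tree.predict(X_val)
    print(f"depth={depth:2d}  "
          f"acc={accuracy_score(y_val, y_pred):.3f}  "
          f"prec={precision_score(y_val, y_pred):.3f}  "
          f"rec={recall_score(y_val, y_pred):.3f}")

# Depth 4 gave the best validation scores, so the final tree uses max_depth=4.
final_tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=42)
final_tree.fit(X_train, y_train)
```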


By analysing the decision tree, we found that the tree splits first on ‘blueGoldDiff’, and the factors ‘GoldPerMinDiff’, ‘blueExperienceDiff’ and ‘EliteMonsterDiff’ also appear. Surprisingly, ‘VisionScoreDiff’ appears after the third split. We can conclude that ‘blueGoldDiff’, ‘GoldPerMinDiff’, ‘blueExperienceDiff’ and ‘EliteMonsterDiff’ are important attributes for predicting victory.

Random Forest

We then chose a random forest classifier, an ensemble algorithm for creating a non-linear model that builds a variety of similar decision trees and combines their predictions to obtain a better result. Random forests correct for decision trees’ habit of overfitting to the training set and generally outperform single decision trees, although their accuracy is typically lower than that of gradient-boosted trees. Figure 9 shows that the relevant features differ slightly, but the random forest’s trees have Gini values similar to those of our decision tree when splitting, suggesting that the decision tree we selected was a good outcome.
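A sketch of this step is shown below, assuming scikit-learn’s RandomForestClassifier with a default-sized forest of 100 trees; feature_importances_ provides the relevance ranking referred to above.

```python
# A sketch of the random forest model and its feature-relevance ranking.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, forest.predict(X_val)))

# Rank attributes by the forest's impurity-based importance scores.
for name, importance in sorted(zip(X_train.columns, forest.feature_importances_),
                               key=lambda pair: -pair[1]):
    print(f"{name:32s} {importance:.3f}")
```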


Support Vector Machine

In machine learning, Support Vector Machines (SVMs) can efficiently build a non-linear model by mapping the inputs into high-dimensional feature spaces, which makes it easier to find the best separation between classes. For our model we used a C value of 1, an ‘rbf’ kernel and a gamma of 0.005 (defined as 1 divided by the number of features). However, this model produced a relatively low accuracy and a precision below 50%, so SVM is not suitable for our aim of correctly predicting the winner of the game.
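The configuration above might be written as the sketch below; only the hyperparameters come from the report, and everything else is assumed.

```python
# A sketch of the SVM configuration described above: C=1, RBF kernel, gamma=0.005.
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score

svm = SVC(C=1.0, kernel="rbf", gamma=0.005)
svm.fit(X_train, y_train)
y_pred = svm.predict(X_val)
print("accuracy :", accuracy_score(y_val, y_pred))
print("precision:", precision_score(y_val, y_pred))
# Note: RBF SVMs are sensitive to feature scale; standardising the inputs
# (e.g. with StandardScaler) would be a natural next step to try.
```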

MODEL COMPARISON OF RESULTS, DISCUSSION


After all the analysis, we found that the highest-scoring attributes were common to most models: blueGoldDiff, blueExperienceDiff, EliteMonsterDiff and GoldPerMinDiff, meaning they contribute most to the blue team winning the game. This is logical, as gold and experience are directly related to players’ levels, and a higher-level player will kill more elite monsters, so the chance of winning is higher.

CONCLUSION

Based on the performance metrics discussed above, the logistic regression (LogReg) model is chosen for the reasons listed below:

• Highest accuracy: This is our most important metric; it means we have the largest proportion of correct predictions of who is going to win, regardless of which team it is.

• High precision: This means that when our model predicts a blue-team win, it is usually correct, which would be important in a real game situation.

• Simplicity: The LogReg model is easier to interpret and clearly shows the good indicators among the attributes.

Therefore, the LogReg model was applied to the test set, where it predicted the winner of the game with 73.785% accuracy and 68.089% precision. The precision dropped slightly compared to the validation set, due to the limited size of the test set. Overall, the predictive analysis showed that the outcome of a game is best predicted using the gold and experience differences between the two teams, with a small contribution from the other factors.
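The final evaluation on the held-out test set might look like the sketch below, assuming the forward-selected model and feature subset from the earlier sketch.

```python
# A sketch of the final evaluation of the chosen LogReg model on the test set.
from sklearn.metrics import accuracy_score, precision_score

y_test_pred = final.predict(X_test[selected])
print("test accuracy :", accuracy_score(y_test, y_test_pred))
print("test precision:", precision_score(y_test, y_test_pred))
```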