2019, Dec 07
Winning Team Prediction
1. Introduction
League of Legends is one of the most popular team-based strategy games. Two teams of 5 players each try to destroy the other's base. Each player (also known as a "Summoner") chooses a champion from more than 140 unique playable characters.
As in all strategy games, there are several objectives that give a huge advantage to the team that takes them. Information about these objectives for each team, together with other important characteristics, will help us train a winner prediction model.
Today we are going to analyze the League of Legends Ranked Games dataset from Kaggle. As we will see, this dataset contains a large number of League of Legends games along with important information about the events that occurred in each one.
2. Game introduction
To understand the steps taken while developing the prediction model, we first need to learn the most important objectives of the game.
- The Nexus is the most important structure of each team and the main objective of its enemies: the team that destroys the enemy Nexus wins. Allies have to work together to protect it from strong enemies that will try to destroy it at any price.
- Inhibitors are important structures: once an Inhibitor is destroyed, the minions of the opposing team become much stronger. There is one Inhibitor in each lane and at least one has to be destroyed to win; otherwise, the Nexus is indestructible.
- Towers are important defensive structures that protect the base and lanes of each team. Each team has 3 in each lane and 2 in front of the Nexus. They attack enemy units that come within their range and deal heavy damage, so destroying them gives a big advantage to the enemy team.
- Rift Herald is one of the epic monsters that appear during the game. The team that kills it gets an item that, once used, summons a Rift Herald that helps them destroy enemy towers.
- Dragon is another epic monster that reappears periodically during the game. The team that kills it gets its stats improved.
- Baron is probably the most important epic monster. As with the Dragon, the team that kills it gets a huge improvement in its stats; it also makes the team's minions stronger, so pushing lanes and taking objectives becomes easier.
3. Dependencies
Before starting with the data analysis, these are the main libraries used:
- Pandas: all vector and matrix operations on the data.
- Seaborn, Plotly and Matplotlib: plotting the obtained results.
- Scikit-learn: splitting the data and working with the model.
import pandas as pd
import seaborn as sns
import sklearn.tree as tree
import plotly.express as px
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.offline import iplot
from matplotlib import pyplot as plt
from sklearn.metrics import accuracy_score, roc_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
4. Data exploration
Before analyzing our dataset, we should first familiarize ourselves with it. In other words, we need to understand our data if we want to make good decisions and learn what we need to look for.
path = r".\Data\games.csv"
df = pd.read_csv(path)
print('Total entries:', df.shape[0])
print('Total fields for each entry:', df.shape[1])
As we can see, there is a lot of information in our dataset. More specifically, we have a total of 51490 entries with 61 fields each.
Once we know how big our dataset is, let's take a look at a few examples:
pd.set_option('display.max_columns', None)
df.head(5)
When taking a look at these few examples, we notice that some of the fields appear twice (once for each team). Moreover, if we group the attributes, we obtain something similar to these categories:
Game settings: General information about the match. (5 attributes)
Examples: gameId, creationTime, gameDuration, seasonId, winner.
First on doing something: Indicates which team was the first to take a certain objective. (6 attributes)
Examples: firstBlood, firstTower, firstInhibitor, ...
Champions information: Describes all champion-related choices: champions selected, champions banned, and summoner spells. (40 attributes)
Examples: t1_champ1id, t2_champ4_sum2, t1_ban5, ...
Number of objectives: Counts the objectives taken by each team. (10 attributes)
Examples: t1_towerKills, t2_inhibitorKills, t1_dragonKills, ...
The next step is searching for missing values in our dataset. This is an important step when preparing our data to train a model.
print('Total of missing values found:', df.isna().sum().sum())
As we can see, our dataset does not have missing values. Now, let's take a closer look at the information we have about our data.
5. Data analysis
Once we have familiarized ourselves with our dataset, it is time to analyze it.
First of all, we are going to take a look at the information we have about each attribute:
df.describe()
Standard deviation (std) shows how dispersed the values of an attribute are around its mean. Keep in mind that it depends on the scale of each attribute, so on its own it does not tell us which attributes will be the best ones when training our model; an attribute with a standard deviation of 0, however, is constant and cannot help at all.
In addition, we can observe some interesting facts:
- Being the winning team does not depend on which team you belong to. We know that because the mean of 'winner' is practically 1.5 and it only has two possible values: 1 or 2.
- Knowing the season will not help us when predicting the winning team since all games are from season 9. We know that because the value is 9 and the standard deviation is equal to 0.
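The first observation follows from simple arithmetic: since 'winner' only takes the values 1 and 2, its mean equals 1 plus the fraction of games won by team 2. A minimal sketch with made-up values (the toy list below is an assumption and is not taken from the dataset):

```python
# Toy 'winner' column: hypothetical values, not the real dataset.
winner = [1, 2, 2, 1, 1, 2, 1, 2]

mean_winner = sum(winner) / len(winner)

# mean = 1 * p1 + 2 * p2 = 1 + p2, because p1 + p2 = 1.
team2_win_rate = mean_winner - 1
team1_win_rate = 1 - team2_win_rate

print(mean_winner, team1_win_rate, team2_win_rate)
```

A mean close to 1.5 therefore means that both teams win roughly half of the games.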
Another thing we can see from the previous table is that the attributes do not all follow the same distribution.
For example, the attribute gameDuration has a roughly Gaussian distribution, which is uncommon in our dataset. Most of the other attributes are distributed very differently, for example the attribute firstTower.
df[['gameDuration', 'firstTower']].hist(figsize=(10,5))
plt.show()
We should only keep the data that helps us predict which team wins, so let's find out which data that is. To do that, we are going to use the correlation matrix to filter our attributes and keep only the useful ones.
The following figure shows the correlation of all the attributes with the one we want to predict.
# We take only the row of the attribute we want to predict from the correlation matrix.
winner_correlation = df.corr()['winner'].to_frame().T
plt.subplots(figsize=(20, 1))
sns.heatmap(winner_correlation)
plt.show()
Analyzing this correlation matrix, we can clearly see that some attributes are correlated with the one we want to predict. Let's find out which ones:
correlation_threshold = 0.3
# According to the correlation values, we filter the useful attributes.
# Note: 'winner' itself passes the filter (its correlation with itself is 1), so it stays in the list.
attributes = list(filter(lambda x: abs(winner_correlation[x].iloc[0]) > correlation_threshold, df.columns))
print('Total of useful attributes:', len(attributes))
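To make the filtering criterion explicit, here is a minimal sketch that computes the Pearson correlation by hand and applies the same 0.3 threshold. The toy columns and their values are hypothetical, and the helper `pearson` is introduced only for illustration; in the notebook, `df.corr()` performs this computation for every attribute.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy data: 'winner' plus two candidate attributes.
toy = {
    'winner': [1, 1, 1, 2, 2, 2],
    't1_towerKills': [11, 9, 10, 3, 2, 4],  # high when team 1 wins
    't1_ban1': [7, 42, 13, 7, 42, 13],      # unrelated to the outcome
}

threshold = 0.3
# Keep the attributes whose absolute correlation with 'winner' exceeds the threshold.
useful = [name for name, values in toy.items()
          if abs(pearson(values, toy['winner'])) > threshold]
print(useful)
```

The absolute value matters here: an attribute that is strongly negatively correlated with 'winner' (as the tower kills of team 1 are in this toy data) is just as informative as a positively correlated one.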
We are going to divide our attributes into two different categories according to their correlation value:
Moderate:
- firstTower
- firstInhibitor
- firstDragon
- t1_inhibitorKills
- t1_baronKills
- t1_dragonKills
- t2_inhibitorKills
- t2_baronKills
- t2_dragonKills
Strong:
- t1_towerKills
- t2_towerKills
The next plots show the relation between these attributes and the one we want to predict: for each objective, they show the percentage of games won (or not) by the team that took that objective first.
# We filter the attributes that have information about which team has first taken a certain objective.
objectives = list(filter(lambda x: 'first' in x, attributes))
fig, ax = plt.subplots(1, len(objectives), figsize=(15,5))
total = df.shape[0]
for idx, objective in enumerate(objectives):
    # We calculate the percentage for the winning and the losing team.
    first_objective_wins = df[df[objective] == df['winner']].shape[0]
    first_objective_wins = round((first_objective_wins / total) * 100, 2)
    first_objective_loses = 100 - first_objective_wins
    ax[idx].set_title(objective.replace('first', 'First '))
    # Recent seaborn versions require x and y to be passed as keyword arguments.
    sns.barplot(x=['Winning Team', 'Losing Team'],
                y=[first_objective_wins, first_objective_loses],
                palette='cool',
                ax=ax[idx])
plt.show()
If we analyze these plots, we can see that the team that takes certain objectives first wins the game in a large share of cases. So, it makes sense to choose these attributes for training our prediction model.
Moreover, if we look at the average number of objective kills for each team, we also see an important relation between being the winning team and the number of objective kills:
# We filter the attributes that represent the objective kills of the winning/losing team.
objectives = list(filter(lambda x: 'Kills' in x, attributes))
# We modify the names of the attributes to make them more understandable.
# sorted() keeps the order of the plots stable between runs.
objectives = sorted(set(map(lambda x: x.split('_')[-1], objectives)))
fig, ax = plt.subplots(1, len(objectives), figsize=(15, 5))
for i, objective in enumerate(objectives):
    # We filter the games in which the team 1 has won.
    t1_wins = df[df['winner'] == 1][['t1_' + objective, 't2_' + objective]]
    t1_wins.columns = ['winner', 'loser']
    # We filter the games in which the team 2 has won.
    t2_wins = df[df['winner'] == 2][['t2_' + objective, 't1_' + objective]]
    t2_wins.columns = ['winner', 'loser']
    # We compute the mean of the winning/losing objective kills.
    mean = pd.concat([t1_wins, t2_wins], axis=0).mean(axis=0)
    ax[i].set_title(objective.replace('Kills', ' number of kills').capitalize())
    # Recent seaborn versions require x and y to be passed as keyword arguments.
    sns.barplot(x=['Winning Team', 'Losing Team'],
                y=[mean['winner'], mean['loser']],
                palette='cool',
                ax=ax[i])
plt.show()
6. Data preparation
To successfully train our model, we first need to prepare our data.
The first step is cleaning our dataset by keeping only the most useful attributes.
# We keep only the useful attributes.
df = df[attributes]
The next step is to split our data into two different sets:
- Train: subset used for training the model.
- Test: subset used for evaluating the model.
X = df.drop(labels=['winner'], axis=1)
y = df['winner']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
7. Model creation
To reach our objective and predict which team is the winner, we are going to use a decision tree model.
What is a decision tree?
Decision trees can be used in regression or classification problems.
A decision tree consists of different types of nodes:
Decision nodes: Depending on the value of a certain attribute, they indicate which branch you must take.
Leaf nodes: These are the last nodes and indicate the output.
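As a rough illustration of these two node types, the sketch below encodes a tiny tree as nested dictionaries and walks it from the root to a leaf. The attributes, thresholds and outputs are hypothetical; they are not the ones learned by the model in this post.

```python
# Hypothetical tree: decision nodes hold a test, leaf nodes hold the output.
toy_tree = {
    'attribute': 't1_towerKills', 'threshold': 6.5,
    'left': {
        'attribute': 't2_towerKills', 'threshold': 3.5,
        'left': {'prediction': 1},
        'right': {'prediction': 2},
    },
    'right': {'prediction': 1},
}

def predict(node, sample):
    """Follow decision nodes until a leaf node is reached."""
    while 'prediction' not in node:
        # Take the left branch when the condition "value <= threshold" holds.
        satisfied = sample[node['attribute']] <= node['threshold']
        node = node['left'] if satisfied else node['right']
    return node['prediction']

print(predict(toy_tree, {'t1_towerKills': 10, 't2_towerKills': 2}))
print(predict(toy_tree, {'t1_towerKills': 3, 't2_towerKills': 9}))
```

Every prediction costs at most one comparison per level of the tree, which is part of why decision trees are cheap to evaluate.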
In our problem we are going to use a decision tree classifier, so let's understand how it is built internally.
First, we start at the root of the decision tree, where all the data is mixed. Here, we need to find the attribute that most reduces the entropy when the data is split according to its values. In other words, the attribute with the largest Information Gain.
Then, the process is repeated until we get nodes where all the samples of the subset belong to the same class.
Normally, a limit on the depth of the tree is set to avoid overfitting, which results in some leaf nodes that are not completely pure.
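The entropy and Information Gain mentioned above can be sketched in a few lines. This is only an illustration on hypothetical labels, with helper names chosen for this example; Scikit-learn computes these quantities internally when criterion='entropy'.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting parent into left and right."""
    weight_left = len(left) / len(parent)
    weight_right = len(right) / len(parent)
    children_entropy = weight_left * entropy(left) + weight_right * entropy(right)
    return entropy(parent) - children_entropy

# Hypothetical node with 4 wins for each team.
parent = [1, 1, 1, 1, 2, 2, 2, 2]
# A split that separates the classes perfectly...
perfect = information_gain(parent, [1, 1, 1, 1], [2, 2, 2, 2])
# ...and one that leaves both children as mixed as the parent.
useless = information_gain(parent, [1, 1, 2, 2], [1, 1, 2, 2])
print(perfect, useless)
```

The perfect split removes all the uncertainty of the parent node, while the second one removes none, so the tree would choose the first.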
Moreover, the Scikit-learn library has a class named GridSearchCV, which is very useful when looking for the best parameters.
Internally, it uses cross-validation to repeatedly split the data into two different sets (train and validation) and tries all the combinations of the parameters you define. Once all combinations have been evaluated, you can get the best parameter combination as well as the best score reached.
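To get a feeling for how much work this implies, the sketch below enumerates the combinations of a grid and builds simple cross-validation folds in plain Python. The parameter values mirror the ones used in this post, but the number of attributes (12, that is, 11 features plus 'winner') is an assumption, and the contiguous folds are a simplification: for classifiers, GridSearchCV uses stratified folds.

```python
from itertools import product

# Assumption: 'attributes' holds 12 names (11 features plus 'winner').
n_attributes = 12
grid = {
    'splitter': ['best', 'random'],
    'max_depth': list(range(1, n_attributes)),
    'criterion': ['gini', 'entropy'],
}
k = 10

# Every combination of one value per parameter.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

# Toy example with 25 samples: each sample is used for validation exactly once.
folds = list(k_fold_indices(n_samples=25, k=k))
print(len(combinations), 'combinations x', k, 'folds =',
      len(combinations) * k, 'fits')
```

Under these assumptions the search trains the model once per combination and fold, plus one final fit of the best combination on the whole training set, since GridSearchCV uses refit=True by default.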
First of all, we need to find the parameters that obtain the best result.
There are a lot of parameters we can modify when creating a model using the DecisionTreeClassifier from Scikit-learn. In this problem, we are going to try modifying some of them:
- splitter: Indicates which strategy is used when choosing the split at each node.
- criterion: Indicates which function is going to measure the quality of a split.
- max_depth: Indicates the maximum depth of the decision tree.
k = 10
# We define the parameters we want to try.
parameters = dict()
parameters['splitter'] = ['best', 'random']
parameters['max_depth'] = range(1, len(attributes))
parameters['criterion'] = ['gini', 'entropy']
decision_tree = DecisionTreeClassifier()
# We search for the best combination of the parameters.
grid_decision_tree = GridSearchCV(estimator=decision_tree, cv=k, param_grid=parameters)
grid_decision_tree.fit(X_train, y_train)
print("Best parameters: ", grid_decision_tree.best_params_)
Once we know which parameters are the best ones, the next step is creating a decision tree using them.
# We get the best combination of the parameters found.
best_params = grid_decision_tree.best_params_
best_criterion = best_params['criterion']
best_max_depth = best_params['max_depth']
best_splitter = best_params['splitter']
# We fit the decision tree with these parameters.
decision_tree = DecisionTreeClassifier(criterion=best_criterion, max_depth=best_max_depth, splitter=best_splitter)
decision_tree = decision_tree.fit(X_train, y_train)
In order to evaluate the model, we are going to use the test set to predict which team is the winner and then calculate the accuracy:
y_pred = decision_tree.predict(X_test)
accuracy = round(accuracy_score(y_test, y_pred)*100, 2)
print('Accuracy:', accuracy, '%')
Our model has reached a really good accuracy, so we can say it predicts well who is going to be the winner.
We are also going to plot the ROC curve of our model to see how it works. The ROC curve is a visual representation of the relation between the True Positive Rate (TPR) and the False Positive Rate (FPR) with different thresholds.
# We take the predicted probabilities for each class.
y_proba = decision_tree.predict_proba(X_test)
# We represent y_test in the format we need to compute the ROC curve (team 2 is the positive class).
y_test_binary = y_test.apply(lambda x: 0 if x == 1 else 1)
# We compute the ROC curve.
fpr, tpr, thresholds = roc_curve(y_test_binary, y_proba[:, 1])
# We represent the ROC curve. The trace is not named roc_curve to avoid shadowing the imported function.
roc_trace = go.Scatter(x=fpr, y=tpr, name="ROC curve", marker_color='#1DA6FF')
reference = go.Scatter(x=fpr, y=fpr, name="Reference", marker_color='#4F555D')
data = [roc_trace, reference]
figure = go.Figure(data)
figure.update_layout(title="ROC curve",
                     xaxis=dict(title="False Positive Rate (FPR)"),
                     yaxis=dict(title="True Positive Rate (TPR)"),
                     width=800,
                     height=400)
iplot(figure)
We can see a really good result in this ROC curve, since it reaches practically 100% TPR very quickly.
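To make the relation between TPR, FPR and the thresholds concrete, here is a minimal pure-Python sketch that computes the points of a ROC curve. The labels and scores are hypothetical, and `roc_points` is a helper written for this example; it is a simplified stand-in for what Scikit-learn's roc_curve computes.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs, one per distinct score used as a threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # A sample is predicted positive when its score reaches the threshold.
        predicted = [score >= threshold for score in scores]
        tp = sum(1 for p, y in zip(predicted, y_true) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, y_true) if p and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical labels (1 = team 2 wins) and predicted probabilities.
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
points = roc_points(y_true, scores)
for false_positive_rate, true_positive_rate in points:
    print(false_positive_rate, true_positive_rate)
```

Lowering the threshold can only add predicted positives, so both rates grow from (0, 0) to (1, 1); a good model raises the TPR long before the FPR.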
Now, we are going to plot the decision tree of our model to see what it looks like.
fn = list(X_train.columns)
cn = ['Team 1 wins', 'Team 2 wins']
fig, axes = plt.subplots(figsize=(25, 10))
tree.plot_tree(decision_tree, feature_names=fn, class_names=cn)
plt.show()
If we take a look at this figure, we can see how big our decision tree is.
Since its size makes it impossible to analyze, we are going to plot a simpler decision tree with fewer nodes and a maximum depth of 2 to get an idea of what our decision tree looks like.
# We create a decision tree of maximum depth equal to 2.
simple_decision_tree = DecisionTreeClassifier(criterion=best_criterion, max_depth=2, splitter=best_splitter)
simple_decision_tree = simple_decision_tree.fit(X_train, y_train)
fig, axes = plt.subplots(figsize=(15, 8))
tree.plot_tree(simple_decision_tree, feature_names=fn, class_names=cn)
plt.show()
In this simpler decision tree, we can see 3 decision nodes and 4 leaf nodes. Each decision node shows its condition, and each possible outcome is represented by a branch: the left branch means that the condition is satisfied, while the right one means it is not.
Each node shows certain information, such as the Gini value, the number of samples that satisfy all previous conditions, the number of samples that belong to each prediction target, and a provisional or final prediction.
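The Gini value shown in each node can be reproduced from the per-class sample counts of that node. A minimal sketch, using hypothetical counts rather than the ones in the figure:

```python
def gini(class_counts):
    """Gini impurity of a node given the number of samples per class."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# Hypothetical per-class counts of a node: [team 1 wins, team 2 wins].
print(gini([50, 50]))  # evenly mixed node
print(gini([100, 0]))  # pure node
print(gini([90, 10]))  # mostly team 1 wins
```

With two classes the value ranges from 0 for a pure node to 0.5 for an evenly mixed one, so nodes closer to 0 give more reliable predictions.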
8. Conclusions
After all the analysis and the results obtained, we can draw the following conclusions:
Using a decision tree is a good approach when trying to predict which team is going to win.
We have created a model that has really good prediction accuracy.
Not all the information about each game is relevant. At first, the dataset had more than 60 attributes and, after analyzing them, only 11 have been used to train the model.
"TowerKills" attributes are the most important ones since they have a strong correlation with the one we want to predict: the "winner" attribute. Moreover, if we take a look at the simpler decision tree we have plotted, we will see that, when the depth is limited to 2, these are the attributes used in all decision nodes.