2019, Dec 07

Winning Team Prediction

LoL winner prediction

1. Introduction

League of Legends is one of the most famous team-based strategy games. Each game pits two teams of 5 players against each other, with the goal of destroying the opposing base. Each player (also known as a "Summoner") chooses a champion from more than 140 unique playable characters.

Like all strategy games, there are several objectives that give a huge advantage to the team that takes them. The information about these objectives for each team, together with other important characteristics, will help us train a winner prediction model.

Today we are going to analyze the League of Legends Ranked Games dataset from Kaggle. As we will see, this dataset contains a large number of League of Legends games along with important information about the events that occurred in each of them.

2. Game introduction

In order to understand all the steps taken during the development of the prediction model, let's first review the most important objectives of the game.

  • The Nexus is the most important structure of a team and the main objective of the enemy: the team that destroys it wins the game. Allies have to work together to protect it from enemies that will try to destroy it at any cost.
  • Inhibitors are important structures: once a team's Inhibitor is destroyed, the enemy minions in that lane become much stronger. There is one Inhibitor in each lane, and at least one has to be destroyed to win; otherwise, the Nexus is indestructible.
  • Towers are important defensive structures that protect each team's base and lanes. Each team has 3 in each lane and 2 in front of the Nexus, and they focus enemy champions that attack nearby allies. They deal huge damage, so destroying them provides a big advantage to the enemy team.
  • The Rift Herald is one of the epic monsters that appear during the game. The team that kills it gets an item that, once used, summons the Rift Herald to fight for them and help destroy enemy towers.
  • The Dragon is another epic monster that respawns periodically throughout the game. The team that kills it gets its stats improved.
  • The Baron is probably the most important epic monster. Like the Dragon, the team that kills it gets a huge improvement in its stats and also makes its minions stronger, so pushing lanes and taking objectives becomes easier.


3. Dependencies

Before starting with data analysis, these are the main libraries used:

  • Pandas: data loading and all vector and matrix operations.
  • Seaborn, Plotly and Matplotlib: plotting the obtained results.
  • Scikit-learn: splitting the data and working with the model.
In [1]:
import pandas as pd
import seaborn as sns
import sklearn.tree as tree
import plotly.express as px
import plotly.graph_objs as go
import plotly.figure_factory as ff

from plotly.offline import iplot
from matplotlib import pyplot as plt
from sklearn.metrics import accuracy_score, roc_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

4. Data exploration

Before starting to analyze our dataset, we should first familiarize ourselves with it. In other words, we need to understand our data if we want to make good decisions and know what to look for.

In [2]:
path = r".\Data\games.csv"
df = pd.read_csv(path)

print(f'Total rows: {df.shape[0]}.')
print(f'Total fields for each row: {df.shape[1]}.')
Total rows: 51490.
Total fields for each row: 61.

As we can see, there is a lot of information in our dataset. More specifically, we have a total of 51490 rows, each with 61 available information fields.

Once we know how big our dataset is, let's take a look at a few examples:

In [3]:
pd.set_option('display.max_columns', None)
df.head(5)
Out[3]:
gameId creationTime gameDuration seasonId winner firstBlood firstTower firstInhibitor firstBaron firstDragon firstRiftHerald t1_champ1id t1_champ1_sum1 t1_champ1_sum2 t1_champ2id t1_champ2_sum1 t1_champ2_sum2 t1_champ3id t1_champ3_sum1 t1_champ3_sum2 t1_champ4id t1_champ4_sum1 t1_champ4_sum2 t1_champ5id t1_champ5_sum1 t1_champ5_sum2 t1_towerKills t1_inhibitorKills t1_baronKills t1_dragonKills t1_riftHeraldKills t1_ban1 t1_ban2 t1_ban3 t1_ban4 t1_ban5 t2_champ1id t2_champ1_sum1 t2_champ1_sum2 t2_champ2id t2_champ2_sum1 t2_champ2_sum2 t2_champ3id t2_champ3_sum1 t2_champ3_sum2 t2_champ4id t2_champ4_sum1 t2_champ4_sum2 t2_champ5id t2_champ5_sum1 t2_champ5_sum2 t2_towerKills t2_inhibitorKills t2_baronKills t2_dragonKills t2_riftHeraldKills t2_ban1 t2_ban2 t2_ban3 t2_ban4 t2_ban5
0 3326086514 1504279457970 1949 9 1 2 1 1 1 1 2 8 12 4 432 3 4 96 4 7 11 11 6 112 4 14 11 1 2 3 0 92 40 69 119 141 104 11 4 498 4 7 122 6 4 238 14 4 412 4 3 5 0 0 1 1 114 67 43 16 51
1 3229566029 1497848803862 1851 9 1 1 1 1 0 1 1 119 7 4 39 12 4 76 4 3 10 4 14 35 4 11 10 4 0 2 1 51 122 17 498 19 54 4 12 25 4 14 120 11 4 157 4 14 92 4 7 2 0 0 0 0 11 67 238 51 420
2 3327363504 1504360103310 1493 9 1 2 1 1 1 2 0 18 4 7 141 11 4 267 3 4 68 4 12 38 12 4 8 1 1 1 0 117 40 29 16 53 69 4 7 412 14 4 126 4 12 24 4 11 22 7 4 2 0 0 1 0 157 238 121 57 28
3 3326856598 1504348503996 1758 9 1 1 1 1 1 1 0 57 4 12 63 4 14 29 4 7 61 4 1 36 11 4 9 2 1 2 0 238 67 516 114 31 90 14 4 19 11 4 412 4 3 92 4 14 22 4 7 0 0 0 0 0 164 18 141 40 51
4 3330080762 1504554410899 2094 9 1 2 1 1 1 1 0 19 4 12 29 11 4 40 4 3 119 4 7 134 7 4 9 2 1 3 0 90 64 412 25 31 37 3 4 59 4 12 141 11 4 38 4 12 51 4 7 3 0 0 1 0 86 11 201 122 18

Looking at these few examples, we notice that many of the fields appear twice (once for each team). Moreover, if we group the attributes, we end up with something similar to the following categories (a quick sketch to verify the counts follows the list):

  • Game settings: General information about the match. (5 attributes)

    Examples: gameId, creationTime, gameDuration, seasonId, winner.

  • First on doing something: Indicates which team was the first to take a certain objective. (6 attributes)

    Example: firstBlood, firstTower, firstInhibitor,...

  • Champions information: Indicates all champion-related choices: champions selected, champions banned, and summoner spells chosen. (40 attributes)

    Example: t1_champ1id, t2_champ4_sum2, t1_ban5, ...

  • Number of objectives: Counters of the objectives taken by each team. (10 attributes)

    Examples: t1_towerKills, t2_inhibitorKills, t1_dragonKills, ...
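A quick way to double-check these counts is to group the column names by pattern. The sketch below is not part of the original notebook, and the grouping rules are simply an assumption based on the column naming:

# Group the column names by naming pattern and count each category.
champion_cols = [c for c in df.columns if 'champ' in c or 'ban' in c or 'sum' in c]
first_cols = [c for c in df.columns if c.startswith('first')]
counter_cols = [c for c in df.columns if c.endswith('Kills')]
settings_cols = [c for c in df.columns if c not in champion_cols + first_cols + counter_cols]

print(len(settings_cols), len(first_cols), len(champion_cols), len(counter_cols))  # expected: 5 6 40 10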

The next step is searching for missing values in our dataset. This is an important step when preparing our data to train a model.

In [4]:
print(f'Total missing values found: {df.isna().sum().sum()}.')
Total missing values found: 0.

As we can see, our dataset does not have any missing values. Now, let's take a closer look at the information we have about our data.

5. Data analysis

Once we have familiarized ourselves with our dataset, it is time to analyze it.

First of all, we are going to take a look at the information we have about each attribute:

In [5]:
df.describe()
Out[5]:
gameId creationTime gameDuration seasonId winner firstBlood firstTower firstInhibitor firstBaron firstDragon firstRiftHerald t1_champ1id t1_champ1_sum1 t1_champ1_sum2 t1_champ2id t1_champ2_sum1 t1_champ2_sum2 t1_champ3id t1_champ3_sum1 t1_champ3_sum2 t1_champ4id t1_champ4_sum1 t1_champ4_sum2 t1_champ5id t1_champ5_sum1 t1_champ5_sum2 t1_towerKills t1_inhibitorKills t1_baronKills t1_dragonKills t1_riftHeraldKills t1_ban1 t1_ban2 t1_ban3 t1_ban4 t1_ban5 t2_champ1id t2_champ1_sum1 t2_champ1_sum2 t2_champ2id t2_champ2_sum1 t2_champ2_sum2 t2_champ3id t2_champ3_sum1 t2_champ3_sum2 t2_champ4id t2_champ4_sum1 t2_champ4_sum2 t2_champ5id t2_champ5_sum1 t2_champ5_sum2 t2_towerKills t2_inhibitorKills t2_baronKills t2_dragonKills t2_riftHeraldKills t2_ban1 t2_ban2 t2_ban3 t2_ban4 t2_ban5
count 5.149000e+04 5.149000e+04 51490.000000 51490.0 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000 51490.000000
mean 3.306223e+09 1.502926e+12 1832.362808 9.0 1.493552 1.471295 1.450631 1.308487 0.926510 1.442804 0.731676 114.293397 6.601787 7.333929 118.101631 6.547796 7.198213 116.905127 6.542280 7.200602 117.657953 6.530511 7.221441 114.601748 6.622412 7.261235 5.699359 1.017537 0.372286 1.387182 0.251466 108.319713 108.786094 108.205904 107.630491 109.027287 115.852088 6.595261 7.305457 117.580618 6.546630 7.230627 117.481103 6.521849 7.227384 118.185881 6.535424 7.201476 115.941853 6.612682 7.249680 5.549466 0.985084 0.414547 1.404370 0.240105 108.216294 107.910216 108.690581 108.626044 108.066576
std 2.946096e+07 1.978026e+09 512.017696 0.0 0.499963 0.520326 0.542848 0.676097 0.841424 0.569579 0.822526 119.000867 4.025601 4.299902 123.577538 3.980675 4.224076 122.653184 3.966289 4.243279 123.354082 3.965559 4.244099 120.042622 4.020005 4.257531 3.799808 1.263934 0.583934 1.206818 0.433860 102.247492 102.942617 102.660955 103.000610 102.433377 121.694131 4.028611 4.280467 123.297642 3.976676 4.256462 122.939051 3.960422 4.242333 124.002327 3.963142 4.235044 122.015086 4.013472 4.253408 3.860989 1.256284 0.613768 1.224492 0.427151 102.551787 102.870710 102.592145 103.346952 102.756149
min 3.214824e+09 1.496892e+12 190.000000 9.0 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000
25% 3.292218e+09 1.502021e+12 1531.000000 9.0 1.000000 1.000000 1.000000 1.000000 0.000000 1.000000 0.000000 35.000000 4.000000 4.000000 35.000000 4.000000 4.000000 35.000000 4.000000 4.000000 36.000000 4.000000 4.000000 35.000000 4.000000 4.000000 2.000000 0.000000 0.000000 0.000000 0.000000 38.000000 38.000000 37.000000 38.000000 38.000000 35.000000 4.000000 4.000000 35.000000 4.000000 4.000000 36.000000 4.000000 4.000000 35.000000 4.000000 4.000000 35.000000 4.000000 4.000000 2.000000 0.000000 0.000000 0.000000 0.000000 38.000000 37.000000 38.000000 38.000000 38.000000
50% 3.320021e+09 1.503844e+12 1833.000000 9.0 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 79.000000 4.000000 4.000000 79.000000 4.000000 4.000000 78.000000 4.000000 4.000000 79.000000 4.000000 4.000000 78.000000 4.000000 4.000000 6.000000 1.000000 0.000000 1.000000 0.000000 90.000000 90.000000 90.000000 89.000000 90.000000 78.000000 4.000000 4.000000 79.000000 4.000000 4.000000 79.000000 4.000000 4.000000 79.000000 4.000000 4.000000 78.000000 4.000000 4.000000 6.000000 0.000000 0.000000 1.000000 0.000000 90.000000 90.000000 90.000000 90.000000 90.000000
75% 3.327099e+09 1.504352e+12 2148.000000 9.0 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000 136.000000 11.000000 11.000000 141.000000 11.000000 11.000000 141.000000 11.000000 11.000000 141.000000 11.000000 11.000000 136.000000 11.000000 11.000000 9.000000 2.000000 1.000000 2.000000 1.000000 141.000000 141.000000 141.000000 141.000000 141.000000 141.000000 11.000000 11.000000 141.000000 11.000000 11.000000 141.000000 11.000000 11.000000 141.000000 11.000000 11.000000 141.000000 11.000000 11.000000 9.000000 2.000000 1.000000 2.000000 0.000000 141.000000 141.000000 141.000000 141.000000 141.000000
max 3.331833e+09 1.504707e+12 4728.000000 9.0 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 516.000000 21.000000 21.000000 516.000000 21.000000 21.000000 516.000000 21.000000 21.000000 516.000000 21.000000 21.000000 516.000000 21.000000 21.000000 11.000000 10.000000 5.000000 6.000000 1.000000 516.000000 516.000000 516.000000 516.000000 516.000000 516.000000 21.000000 21.000000 516.000000 21.000000 21.000000 516.000000 21.000000 21.000000 516.000000 21.000000 21.000000 516.000000 21.000000 21.000000 11.000000 10.000000 4.000000 6.000000 1.000000 516.000000 516.000000 516.000000 516.000000 516.000000

The standard deviation (std) shows how dispersed the values of an attribute are around its mean. Looking at these values, attributes with a very high standard deviation will probably not be the best ones for training our model.

In addition, we can observe some interesting facts:

  • Being on team 1 or team 2 does not affect the chance of winning: the mean of 'winner' is practically 1.5, and it only takes two possible values, 1 or 2, so both teams win about equally often.
  • Knowing the season will not help us predict the winning team, since all games are from season 9: the value of 'seasonId' is always 9 and its standard deviation is 0. (A quick check of both facts is sketched below.)
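A minimal check of both observations (not an original notebook cell):

# Each side wins roughly half of the games, and every game belongs to season 9.
print(df['winner'].value_counts(normalize=True).round(3))
print(df['seasonId'].unique())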

Another thing we can infer from the previous table is that the attributes do not all follow the same kind of distribution.

For example, the attribute gameDuration has a roughly Gaussian distribution, which is uncommon in our dataset, while most of the other attributes are distributed very differently, for example the attribute firstTower.

In [6]:
df[['gameDuration', 'firstTower']].hist(figsize=(10,5))
plt.show()

We should only keep the data that helps us predict which team wins, so let's find out which data that is. To do so, we are going to use the correlation matrix to filter our attributes and keep only the useful ones.

The following figure shows the correlation of all the attributes with the one we want to predict.

In [7]:
# We take only the row of the attribute we want to predict from the correlation matrix.
winner_correlation = df.corr()['winner'].to_frame().T 
plt.subplots(figsize=(20, 1))
sns.heatmap(winner_correlation)
plt.show()

Analyzing this correlation matrix, we can clearly see that some attributes are correlated with the one we want to predict. Let's find out which ones they are:

In [8]:
correlation_threshold = 0.3

# According to the correlation values, we filter the useful attributes.
attributes = list(filter(lambda x: abs(float(winner_correlation[x])) > correlation_threshold, df.columns))

# Note that 'winner' itself passes the filter (its correlation with itself is 1), so it is included in this count.
print('Total of useful attributes:', len(attributes), '.')
Total of useful attributes: 12 .

We are going to divide our attributes into two different categories according to their correlation value (the underlying values can be inspected with the short snippet after these lists):

Moderate:

  • firstTower
  • firstInhibitor
  • firstDragon
  • t1_inhibitorKills
  • t1_baronKills
  • t1_dragonKills
  • t2_inhibitorKills
  • t2_baronKills
  • t2_dragonKills

Strong:

  • t1_towerKills
  • t2_towerKills
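The correlation values behind this grouping are easy to inspect. As a quick sketch (not an original notebook cell):

# Absolute correlation of each selected attribute with 'winner', from strongest to weakest.
winner_corr_abs = df.corr()['winner'].abs().drop('winner').sort_values(ascending=False)
print(winner_corr_abs[winner_corr_abs > correlation_threshold])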

The next plots show the relation between these attributes and the one we want to predict, by plotting the percentage of games in which the team that was first to take a certain objective went on to win (or lose).

In [9]:
# We filter the attributes that have information about which team has first taken a certain objective.
objectives = list(filter(lambda x: 'first' in x, attributes))

fig, ax = plt.subplots(1, len(objectives), figsize=(15,5))
total = df.shape[0]

for idx, objective in enumerate(objectives):
    
    # We calculate the percentage for the winning and the losing team.
    first_objective_wins = df[df[objective] == df['winner']].shape[0]
    first_objective_wins = round((first_objective_wins / total) * 100, 2)
    first_objective_loses = 100 - first_objective_wins
    
    ax[idx].set_title(objective.replace('first', 'First '))
    sns.barplot(x=['Winning Team', 'Losing Team'],
                y=[first_objective_wins, first_objective_loses],
                palette='cool',
                ax=ax[idx])
plt.show()

If we analyze these plots, we can see that a team has a very high probability of winning the game when it is the first to take certain objectives. So, it makes sense to choose these attributes for training our prediction model.

Moreover, if we look at the average number of objective kills for each team, we also see that there is an important relation between winning and the number of objectives taken:

In [10]:
# We filter the attributes that represent the objective kills of the winning/losing team.
objectives = list(filter(lambda x: 'Kills' in x, attributes))

# We modify the names of the attributes to make them more understandable.
objectives = list(set(map(lambda x: x.split('_')[-1], objectives)))

fig, ax = plt.subplots(1, len(objectives), figsize=(15, 5))

for i, objective in enumerate(objectives):
    
    # We filter the games in which the team 1 has won.
    t1_wins = df[df['winner'] == 1][['t1_' + objective, 't2_' + objective]]
    t1_wins.columns = ['winner', 'loser']
    
    # We filter the games in which the team 2 has won.
    t2_wins = df[df['winner'] == 2][['t2_' + objective, 't1_' + objective]]
    t2_wins.columns = ['winner', 'loser']
    
    # We compute the mean of the winning/losing objective kills.
    mean = pd.concat([t1_wins, t2_wins], axis=0).mean(axis=0)

    ax[i].set_title(objective.replace('Kills', ' number of kills').capitalize())

    sns.barplot(x=['Winning Team', 'Losing Team'],
                y=[mean['winner'], mean['loser']],
                palette='cool',
                ax=ax[i])
plt.show()

6. Data preparation

To successfully train our model, we first need to prepare our data.

The first step is to clean our dataset by keeping only the most useful attributes.

In [11]:
# We keep only the useful attributes.
df = df[attributes]

The next step is to split our data into two different sets:

  • Train: subset used for training the model.
  • Test: subset used for evaluating the model.
In [12]:
X = df.drop(labels=['winner'], axis=1)
y = df['winner']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

7. Model creation

In order to reach our objective and make it possible to predict which team wins, we are going to use a decision tree model.


What is a decision tree?

Decision trees can be used in regression or classification problems.

A decision tree consists of different types of nodes:

  • Decision nodes: Depending on the value of a certain attribute, they indicate which branch to take.

  • Leaf nodes: These are the terminal nodes and indicate the output.

In our problem we are going to use a classification decision tree, so let's understand how it works internally when it is built.

First, we start at the root of the decision tree, where all the data is mixed. Here, we need to find the attribute that most reduces the entropy when we split the data by its values. In other words, the attribute with the largest Information Gain.

Then, this process is repeated until we get nodes where all the samples of the subset belong to the same class.

Normally, a limit on the depth of the tree is set to avoid overfitting, which results in some leaf nodes that are not completely pure.
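To make the entropy and Information Gain idea more concrete, here is a minimal sketch (not part of the original notebook) that computes the gain of one hand-picked split on our training data; the chosen split is only illustrative, since the classifier selects its own attributes and thresholds:

import numpy as np

def entropy(labels):
    # Shannon entropy of an array of class labels.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(y, mask):
    # Entropy reduction obtained by splitting y into the two groups defined by a boolean mask.
    left, right = y[mask], y[~mask]
    weighted_children = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
    return entropy(y) - weighted_children

# Illustrative split: "team 1 destroyed more towers than team 2".
split = (X_train['t1_towerKills'] > X_train['t2_towerKills']).to_numpy()
print('Information gain of this split:', round(information_gain(y_train.to_numpy(), split), 3))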


Moreover, the Scikit-learn library provides GridSearchCV, which is very useful when looking for the best parameters.

Internally, GridSearchCV uses cross-validation to repeatedly split the training data into train and validation folds, and it tries every combination of the parameters you define. Once all combinations have been evaluated, you can retrieve the best parameter combination as well as the best score reached.

First of all, we need to find the parameters that give the best results.

There are a lot of parameters we can modify when creating a model with Scikit-learn's DecisionTreeClassifier. In this problem, we are going to try tuning some of them:

  • splitter: Indicates which strategy is used when choosing the split at each node.
  • criterion: Indicates which function is going to measure the quality of a split.
  • max_depth: Indicates the maximum depth of the decision tree.
In [13]:
k = 10  # number of folds used for cross-validation

# We define the parameters we want to try.
parameters = dict()
parameters['splitter'] = ['best', 'random']
parameters['max_depth'] = range(1, len(attributes))
parameters['criterion'] = ['gini', 'entropy']

decision_tree = DecisionTreeClassifier()

# We search for the best combination of the parameters.
grid_decision_tree = GridSearchCV(estimator=decision_tree, cv=k, param_grid=parameters)
grid_decision_tree.fit(X_train, y_train)

print("Best parameters: ", grid_decision_tree.best_params_)
Best parameters:  {'criterion': 'gini', 'max_depth': 7, 'splitter': 'best'}
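Besides the best parameters, GridSearchCV also exposes the best cross-validated score mentioned earlier; a quick way to look at it (output not shown here):

# Mean cross-validated accuracy of the best parameter combination.
print('Best cross-validation score:', round(grid_decision_tree.best_score_, 4))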

Once we know which parameters are the best, the next step is creating a decision tree using them.

In [14]:
# We get the best combination of the parameters found.
best_params = grid_decision_tree.best_params_
best_criterion = best_params['criterion']
best_max_depth = best_params['max_depth']
best_splitter = best_params['splitter']

# We fit the decision tree with these parameters.
decision_tree = DecisionTreeClassifier(criterion=best_criterion, max_depth=best_max_depth, splitter=best_splitter)
decision_tree = decision_tree.fit(X_train, y_train)

In order to evaluate the model, we are going to use the test set to predict which team is the winner and then calculate the accuracy:

In [15]:
y_pred = decision_tree.predict(X_test)

accuracy = round(accuracy_score(y_test, y_pred)*100, 2)

print('Accuracy:', accuracy, '%')
Accuracy: 96.93 %

Our model reaches a really good accuracy, so we can say it predicts the winner well.
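Accuracy alone can hide asymmetric errors between the two classes. As a complementary check (not part of the original analysis), a confusion matrix gives a per-class view:

from sklearn.metrics import confusion_matrix

# Rows correspond to the true winner (team 1, team 2) and columns to the predicted one.
print(confusion_matrix(y_test, y_pred))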

We are also going to plot the ROC curve of our model to see how it performs. The ROC curve is a visual representation of the relation between the True Positive Rate (TPR) and the False Positive Rate (FPR) at different classification thresholds.

In [16]:
# We take the predicted probabilities for each class.
y_proba = decision_tree.predict_proba(X_test)

# We encode the labels as 0/1 (1 meaning "team 2 wins"), as required by roc_curve.
y_test = y_test.apply(lambda x: 0 if x == 1 else 1)

# We compute the ROC curve using the predicted probability of the "team 2 wins" class.
fpr, tpr, thresholds = roc_curve(y_test, y_proba[:, 1])

# We build the plot traces (named roc_trace so it does not shadow the imported roc_curve function).
roc_trace = go.Scatter(x=fpr, y=tpr, name="ROC curve", marker_color='#1DA6FF')
reference = go.Scatter(x=fpr, y=fpr, name="Reference", marker_color='#4F555D')

data = [roc_trace, reference]

figure = go.Figure(data)
figure.update_layout(title="Roc curve",
                     xaxis=dict(title="False Positive Rate (FPR)"),
                     yaxis=dict(title="True Positive Rate (TPR)"),
                     width=800,
                     height=400)

iplot(figure)

Looking at this ROC curve, we can see a really good result, since the curve reaches practically 100% TPR very quickly.

Now, let's plot the decision tree of our model to see what it looks like.

In [17]:
fn = list(X_train.columns)
cn = ['Team 1 wins', 'Team 2 wins']

fig, axes = plt.subplots(figsize=(25, 10))
tree.plot_tree(decision_tree, feature_names=fn, class_names=cn)
plt.show()

If we take a look at this figure we can observe how big our decision tree model is.

Since its size makes it impossible to analyze, we are going to plot a simpler decision tree with fewer nodes and a maximum depth of 2 to get an idea of how our decision tree works.

In [18]:
# We create a decision tree of maximum depth equal to 2. 
simple_decision_tree = DecisionTreeClassifier(criterion=best_criterion, max_depth=2, splitter=best_splitter)
simple_decision_tree = simple_decision_tree.fit(X_train, y_train)

fig, axes = plt.subplots(figsize=(15, 8))
tree.plot_tree(simple_decision_tree, feature_names=fn, class_names=cn)
plt.show()

In this representation of the simpler decision tree, we can see that there are 3 decision nodes and 4 leaf nodes. In each decision node we can also see its condition, with each possible outcome represented by a branch: the left branch means the condition is satisfied, while the right one means it is not.

Each node also shows some extra information, such as the Gini value, the number of samples that satisfy all the previous conditions, the number of samples belonging to each prediction target, and the (intermediate or final) predicted class.
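As a complement to the figure, the same simple tree can also be printed as text, which makes the split conditions easier to read; a small sketch using Scikit-learn's export_text:

# Text version of the depth-2 tree, using the same feature names as the plot.
print(tree.export_text(simple_decision_tree, feature_names=fn))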

8. Conclusions

After all the analysis and the results obtained, we can draw the following conclusions:

  • Using a decision tree is a good practice when trying to predict which team is going to win.

  • We have created a model that has really good prediction accuracy.

  • Not all the information about each game is relevant. Initially, the dataset had more than 60 attributes, and after analyzing them, only 11 have been used to train the model.

  • "TowerKills" attributes are the most important ones since they have a strong correlation with the one we want to predict: the "winner" attribute. Moreover, if we take a look at the simpler decision tree we have represented we will see how in the case where depth is limited to 3, these are the attributes used in all decision nodes.


Eric Lozano

Computer Science Student at Universitat Autònoma de Barcelona (UAB)