This project uses a data set of ~10k Diamond and Master games, containing statistics from the first 10 minutes of each game, along with the Riot Games API to pull additional data about each match, in order to predict the outcome of games.
The Jupyter Notebook equivalent of this README with interactive plots can be found here:
https://www.kaggle.com/jbofill/league-of-legends-match-outcome-classifier
The target variable that I will attempt to predict is whether a team wins the game. In the data set, this is given as "blueWins"; which side it tracks doesn't really matter, since a red win is simply a blue loss.
It appears that wins and losses are distributed almost equally throughout the data set, which is good. This means I don't need to manually compensate for an uneven number of wins/losses and essentially "guess" whether a team won or not just to make the distribution even.
Red has 4949 wins
Blue has 4930 wins
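A quick sketch of how that class-balance check can be reproduced with pandas (the CSV file name below is a placeholder for the actual data set file):

```python
import pandas as pd

# Load the 10k-game data set (file name is a placeholder)
df = pd.read_csv('high_diamond_ranked_10min.csv')

# blueWins is 1 when blue wins, 0 when red wins
blue_wins = df['blueWins'].sum()
red_wins = len(df) - blue_wins
print(f'Blue wins: {blue_wins}, Red wins: {red_wins}')
```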
Having more information than your opponent in any game is one of the biggest advantages a team can have over another. Therefore, wards are extremely important for the success of a team as they grant extra vision of the enemy and map.
On average, about 20 wards are placed per game. It is mind boggling that some games have up to 276 wards placed.
It is clear that the distribution of wards placed is heavily right-skewed (a long tail of very high ward counts), which may require a transformation.
Here we can see a slight trend between the number of wards placed and a team winning: the winning team usually places more wards that game.
Both blue and red have a higher mean of wards placed when winning, though the difference is not as clear for blue side as it is for red.
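For reference, this comparison can be pulled straight from the data frame by grouping on the outcome (column names follow the blue/redWardsPlaced naming used later in this write-up):

```python
# Mean wards placed per side, split by whether blue won the game
ward_means = df.groupby('blueWins')[['blueWardsPlaced', 'redWardsPlaced']].mean()
print(ward_means)
```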
Similar to wards being so important due to the information they can provide to the team, denying that information is also key to winning. Due to this idea, I would suspect that the team with more destroyed wards has a better chance of winning the game.
In the range of 0-2 wards destroyed, there isn't a trend at all regarding the outcome of the game. Once the wards destroyed count exceeds 3, there is a clear trend that destroyed ward count correlates with winning. This is most likely because a team that destroys that many wards that quickly is dominating / playing very aggressively and putting a lot of pressure on the other team.
First blood is important because it provides gold for the team, which allows wards to be set up early. This also helps prevent the enemy jungler from ganking.
This bar graph takes the sum of games in which a certain team got first blood. It is clear that teams that got first blood are more likely to win, since it does provide the advantages stated earlier.
But it is also possible that the team was just overall better and won regardless of the "first blood advantage".
Kills in League are very important because they slow down the progression of your opponent's build, while also giving you gold to upgrade your champion. For this reason, it is pretty clear that kills will be a very important indicator of whether a team wins or loses.
The distributions of all the variations in this histogram are slightly right-skewed.
As expected, teams that have more kills generally win more games.
If kills are vital to the success of a team, it must mean that survivability is also very important. This means that most likely the team that dies least has the higher chance of winning.
As suspected, teams that win tend to die less on average.
A team with more assists also tends to have more kills, which should give them the advantage over their opponents. I expect the team with more assists to win more games.
It is evident that teams with more assists end up winning more games.
Towers prevent the enemy team from attacking the Nexus. The more towers that are destroyed, the easier it becomes to attack the Nexus.
Due to this idea, teams that win should have destroyed more towers than their opponents.
In the first 10 minutes, usually no towers are destroyed. If one is destroyed, the team that destroyed it usually wins.
Epic monsters are important because they provide high gold/experience and buffs to the team that defeats them. This in turn provides an advantage for that team. I would expect to see that teams win more on average if they kill more epic monsters.
Due to the rewards epic monsters provide, there is a definite advantage to the team with more monster kills.
The more experience a champion has earned, the stronger the champion will be. Therefore teams that win will most likely have more experience.
Stronger items and upgrades to items can be obtained with gold. So most likely the teams that win will have more gold since they had stronger champions that contributed to winning the game.
Creep Score, or CS, is one of the most important aspects of League. Having a good creep score means a reliable and steady stream of income, which, as pointed out earlier, is very important for winning games. Due to this, I am confident that the winning teams will have a higher CS than the losing teams.
These are the 20 most frequent bans when a team wins. Unfortunately, the API doesn't give information on who picked first, so when it says "Blue" or "Red" it just means the team that won. In the legend, the champions go in order of most frequent. For example, Other is the most banned, followed by Kassadin in Ban 1.
It seems that people in diamond and master are not fond of Kassadin at all. Generally, it seems that the ban selections are very strict at the top level of play.
The idea presented by the pie chart makes sense, as team-composition-driven metas typically form between patches or seasons. So it makes sense to see champions that work best together having pick rates close to one another.
I suspect the choices to be relatively similar to the champion selection visualization covering winning and losing teams. It's interesting to see that Lee Sin is selected more on Red when they lose. I don't think this contributes to Red losing though.
Interesting that when blue loses, Lee Sin is very popular. At the same time, Lee Sin is the third most frequent pick when Red wins.
Inhibitors are a very important structure in League of Legends because, once an inhibitor is destroyed, super minions spawn for the opposing team in that lane. That being said, the team with a destroyed inhibitor is put at a major disadvantage. I would suspect that teams win significantly more when they destroy the first inhibitor vs. not having destroyed an inhibitor first.
Quite a difference. The data set is relatively small, but I am certain that if more data were added, the difference would still be large.
Same result as Blue.
This means that destroying the inhibitor first is probably a good indicator of whether a team wins or not.
The Baron is the strongest monster in League because it gives the team that defeats it buffs, including increased attack damage, increased ability power, and empowered minions. Due to this, I suspect that this will also be a strong indicator of who wins the match.
It seems teams usually win more when getting the first baron, but the difference is not as severe as it is with inhibitors.
The idea is the same as it is with Blue.
I would suspect that getting the first Baron helps lead to the first inhibitor (if one has not been destroyed already), which would help secure the game.
Since deaths in League can be detrimental to a team, I would assume that a team that dies a lot to towers would most likely lose the match.
This was basically the distribution I had in mind. The more deaths to towers a team suffers, the less likely they are to win the game. I think this will also be a strong indicator of winning or losing when values lie outside the 4-6 tower-death range.
I will try several different models and pick the one that produces the most accurate results.
Model selection will proceed by getting feature importances, tuning hyperparameters with a grid search and 5-fold CV, and then using a train/test split to see which model performs the best.
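As a rough sketch of the feature-importance step, a tree-based model's importances can be used (X and y here are placeholders for the prepared feature matrix and the blueWins target):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# X: prepared feature matrix, y: blueWins target (placeholders)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Rank features by how much they contribute to the forest's splits
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(10))
```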
Some features will be removed regardless of correlation: redFirstBlood, red_firstInhibitor, red_firstBaron, red_firstRiftHerald, and gameId, simply because we have the blue counterparts to the first four and gameId is not necessary.
Other than those mentioned, I will keep everything else.
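Dropping those columns is a one-liner (column names as listed above):

```python
# Remove the red-side counterparts and the game id
df = df.drop(columns=['gameId', 'redFirstBlood', 'red_firstInhibitor',
                      'red_firstBaron', 'red_firstRiftHerald'])
```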
As for the features relating to champions, I will one hot encode all of them. This in turn will give me a dataframe with very large dimensions: with 135 champions in League and 15 champion-related columns, the dimensionality grows very rapidly.
I plan to one hot encode the champion-related columns, but this is problematic. There are currently 135 champions in League of Legends, which means that once I finish one hot encoding, I will have ~2025 extra columns added to my original data set. So I plan to try different methods of reducing the dimensionality and compare their scores.
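A minimal sketch of the encoding step (the way the champion pick/ban columns are selected here is an assumption about their naming):

```python
# Hypothetical selection of the 15 champion pick/ban columns
champ_cols = [c for c in df.columns if 'Champ' in c or 'Ban' in c]

# One hot encode them; the column count balloons to roughly 135 champions x 15 columns
encoded = pd.get_dummies(df, columns=champ_cols)
print(encoded.shape)
```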
I would first like to try removing champions that are hardly ever selected. I will set a threshold of 15 selections for a champion to be kept and see what happens.
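One way to apply that threshold, sketched below, is to bucket rarely selected champions into an "Other" category before encoding:

```python
# Count how often each champion appears across all pick/ban columns
counts = df[champ_cols].stack().value_counts()
rare_champs = counts[counts < 15].index

# Replace champions selected fewer than 15 times with a shared "Other" label
df[champ_cols] = df[champ_cols].where(~df[champ_cols].isin(rare_champs), 'Other')
```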
So not bad, but the dimensionality is still really high. I may just try it anyway to see whether PCA improves or worsens our results.
After 3 days of training, I decided that the dimensions of the data were too high to continue and required research on how to handle this situation.
I began by looking into Logistic PCA, but unfortunately there are no Python packages for it. I would've made one myself had I known linear algebra (I haven't taken that course yet, but soon will).
I then looked into MCA, which deals with nominal values in data. I read some papers to grasp an understanding of Multiple Correspondence Analysis and decided this would be my approach to reducing the dimensionality of my data.
I plan to do a grid search on the optimal number of components for MCA, but for now I am using 5 components for both champion selections (red and blue) and ban selections.
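A sketch of that MCA step using the prince package (one Python library that implements MCA; whether this exact library is used here is an assumption, and blue_champ_cols is a placeholder for the blue pick columns):

```python
import prince  # assumption: prince is just one Python option for MCA

# Reduce the blue champion picks to 5 components;
# the same is repeated for red picks and for ban selections
mca = prince.MCA(n_components=5, random_state=42)
mca = mca.fit(df[blue_champ_cols].astype(str))
blue_pick_components = mca.row_coordinates(df[blue_champ_cols].astype(str))
```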
On top of this, I will experiment with applying log1p and PCA to my continuous columns to see how that affects the accuracy.
My dataframe now has the dimensions 9879 x 26. This is a considerable improvement, and the training process should be much quicker now.
Since PCA assumes Gaussian, I am going to apply log1p transformations to the features with high skew values.
The only features that will require transformations are blue/redWardsPlaced and blue/redTowersDestroyed.
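The transformation itself is straightforward with numpy (column names as listed above):

```python
import numpy as np

skewed_cols = ['blueWardsPlaced', 'redWardsPlaced',
               'blueTowersDestroyed', 'redTowersDestroyed']

print(df[skewed_cols].skew())           # skewness before the transform
df[skewed_cols] = np.log1p(df[skewed_cols])
print(df[skewed_cols].skew())           # skewness after the transform
```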
The same features that are very skewed also have high kurtosis.
I would've liked the kurtosis of blue/redTowersDestroyed to be a little lower, but it should be OK.
The log transformation helped lower the kurtosis score, but outliers are still very present in towers destroyed.
This is fine though, as the reason for the outliers is that many times, no towers are destroyed within the first 10 min. That means that if one tower is destroyed, it is significant regarding whether a team wins or not.
All models will have their hyperparameters tuned with a grid search and 5 fold cv.
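As a sketch of that tuning loop for one of the models (the parameter grid is illustrative, not the exact grid used, and X_train / y_train are placeholders):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Illustrative grid; each model gets its own grid in practice
param_grid = {'C': [0.01, 0.1, 1, 10]}

grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```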
So far, I suspect that Logistic Regression and XGBoost may be overfitting. Since I have an endless supply of games to pull from, testing these models against fresh, unseen games would be the best way to check.
Really high accuracy score. Potential overfitting
Tuned hyperparameters with a grid search and 5 fold cv
The accuracy falls around 95% - 99%. I suspect overfitting as well, but I am going to mess with more models to see what happens.
Since I suspected my models might be overfitting, I grabbed ~5k more games to add to my validation data set. As of now, the models are being trained on 9879 games and are attempting to predict 5559 results.
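Scoring against that held-out set of new games then looks something like this (X_val / y_val are placeholders for the extra games pulled from the API, preprocessed the same way as the training data):

```python
from sklearn.metrics import accuracy_score

# X_val / y_val: the ~5.5k freshly pulled games (placeholders)
val_preds = grid.best_estimator_.predict(X_val)
print(accuracy_score(y_val, val_preds))
```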
I will discontinue using the train/test split, as I see no reason for it: the hold-out set it would produce is roughly half the size of my custom test data set.
The accuracy only went down by about 1% after more than doubling the validation data set, which is pretty good.
A huge performance drop with XGBoost, which is interesting. Originally, the model's accuracy was around 96% - 99%. It seems that XGBoost was overfitting to some degree.
RF seems to be the model that performs the worst so far.
Multiple Correspondence Analysis by Hervé Abdi & Dominique Valentin
Variable Extractions using Principal Components Analysis and Multiple Correspondence Analysis for Large Number of Mixed Variables Classification Problems by Hashibah Hamid, Nazrina Aziz, and Penny Ngu Ai Huong