
League of Legends CS vs. Kills Analysis (2022 Esports Data)

Step 1: Introduction

The dataset comes from professional League of Legends matches played in 2022. It contains 150,588 rows covering 12,549 unique matches, which works out to 12 rows per match (10 player rows plus 2 aggregated team rows).

To provide some context: during a match, players buy items, and items cost gold. There are 2 main ways to earn gold:

  1. Kill enemies
  2. Kill minions (Creep Score or CS)

There is a third way, destroying structures, but it is trivial by comparison and not considered here.

Items make players stronger, so the more gold you earn, the more opportunity you have to get stronger. However, I always get into a dispute with my friends about whether creep score (killing minions) is a good proxy for gold. I primarily focus on kills. This is a real debate I have with my hometown friends in Korea: I say, “Look, I have 10 kills,” and they say, “You only have 3 CS per minute” (which is bad, apparently). They argue that good CS means being good at securing kills, and that I just stole other people’s kills. So I’ve taken it upon myself to analyze whether a higher creep score is truly correlated with a greater ability to threaten kills on the enemy team.

“The central question of this project is: Is higher creep score (CS) a stronger predictor of winning than kills at 15 minutes?”

Each match has 10 players, with 5 players on each team.

Here are the columns that I will concern myself with:

Early Game Performance Metrics

Creep Score (CS) Metrics

Kill Metrics

“While I preserved data from 10, 15, and 20 minutes to enable comprehensive analysis, I focus on the 15-minute mark because it represents the end of the laning phase while still being predictive of the final outcome.”


Step 2: Data Cleaning and Imputation

Explanation of Data Cleaning Steps

  1. Column Selection: We specify a list of columns that are directly relevant to the analysis:
    • General identifiers and context: gameid, teamid, side, patch, result.
    • CS metrics: csat10, csat15, csat20 and their opponent counterparts (opp_csat10, etc.).
    • Kill metrics: killsat10, killsat15, killsat20 and corresponding opponent metrics.
    • Overall metrics: total cs, teamkills.
    • Position: Included so that we can filter out support players, who typically don’t prioritize kills or CS and instead help the ADC (a position that is kept in the analysis). Support rows are omitted as part of the imputation step. Selecting only these columns ensures that we load just the data needed for hypothesis testing.
  2. Data Type Conversion: Many columns in the raw data that should be numeric might be read as strings. We convert these columns to numeric using pd.to_numeric with errors='coerce', which automatically replaces any non-numeric entries with NaN. This is critical for numerical analyses.
  3. Filtering Rows: The dataset includes rows for all positions, including aggregated ‘team’ rows. We filter out ‘team’ rows initially. We also filter out ‘sup’ (support) position rows after imputation, as their CS/kill patterns differ significantly. I also noticed a very large number of rows where CS was 0; we remove these rows as well, since they likely represent incomplete game records.
  4. Imputation: Missing values in key numeric columns (csat*, killsat*, total cs, etc.) are imputed using a cascading median strategy (details below).
  5. Index Reset and Renaming: After filtering and imputation, the index is reset for easier handling. The column 'total cs' is renamed to 'total_cs' for consistency.
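
The cleaning steps above can be sketched on a tiny made-up sample. This is only an illustration under assumptions: the rows and values are invented, only a few of the listed columns are included, and the imputation step is left as a placeholder comment.

```python
import pandas as pd

# Tiny made-up sample standing in for the raw 2022 match data.
raw = pd.DataFrame({
    "gameid": ["g1", "g1", "g1", "g1", "g1"],
    "position": ["top", "sup", "team", "mid", "jng"],
    "result": [1, 1, 1, 0, 0],
    "csat15": ["135", "12", "520", "n/a", "0"],  # strings, one non-numeric
    "killsat15": ["2", "0", "7", "1", "0"],
    "total cs": [231, 30, 900, 193, 0],
})

numeric_cols = ["csat15", "killsat15", "total cs"]

# Step 2: coerce to numeric; unparseable entries become NaN.
raw[numeric_cols] = raw[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Step 3: drop aggregated 'team' rows and rows whose total CS is 0.
players = raw[(raw["position"] != "team") & (raw["total cs"] != 0)]

# (Step 4, imputation, would run here, before support rows are removed.)
players = players[players["position"] != "sup"]

# Step 5: rename the column and reset the index.
cleaned = players.rename(columns={"total cs": "total_cs"}).reset_index(drop=True)
print(cleaned)
```

In this sample the "n/a" entry survives as NaN in the cleaned frame, which is exactly the kind of value the imputation step is meant to fill.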

Head of Cleaned DataFrame

| position | result | csat10 | opp_csat10 | csat15 | opp_csat15 | killsat10 | opp_killsat10 | killsat15 | opp_killsat15 | total_cs | teamkills |
| :------- | -----: | -----: | ---------: | -----: | ---------: | --------: | ------------: | --------: | ------------: | -------: | --------: |
| top      |      0 |     89 |         81 |    135 |        121 |         0 |             0 |         0 |             0 |      231 |         9 |
| jng      |      0 |     58 |         63 |     89 |        100 |         1 |             0 |         2 |             0 |      148 |         9 |
| mid      |      0 |     81 |         81 |    120 |        119 |         0 |             0 |         0 |             3 |      193 |         9 |
| bot      |      0 |     78 |         90 |    115 |        149 |         1 |             0 |         2 |             3 |      226 |         9 |
| top      |      1 |     81 |         89 |    121 |        135 |         0 |             0 |         0 |             0 |      229 |        19 |

Univariate Analysis

Here we look at the distributions of key team-level metrics: Total CS and Total Kills, separated by match outcome (Win/Loss).

Distribution of Team-Level Total CS by Outcome

Interpretation: This histogram shows that CS distributions form a relatively normal curve centered around 900-1000 total CS per team. Winning teams (green) generally have higher CS than losing teams (red). Wow, I guess I might be on the path towards being wrong. The box plots at the top confirm that the median CS for winning teams is clearly higher than for losing teams. There appears to be a CS advantage associated with winning, supporting the idea that farming efficiency correlates with success. Ouch.

Distribution of Team-Level Total Kills by Outcome

Interpretation: This histogram shows the distribution of total team kills. Similar to CS, winning teams (green) tend to have higher kill counts than losing teams (red), although the distributions overlap significantly. The median number of kills is higher for winning teams, suggesting that securing more kills is also associated with winning.

Bivariate Analysis

This plot examines the relationship between a player’s role (position), their CS at 15 minutes, and the match outcome. Note: the reason why I will be looking at the 15 minute mark for the following graphs is because 15 minutes usually marks the end of the laning phase where most CS is collected. After 15 minutes players roam to other parts of the map and get into team fights and whatnot.

CS at 15 Minutes by Position and Outcome

Interpretation: This box plot reveals that Mid and Top laners generally achieve the highest CS by 15 minutes, followed closely by Bot lane ADCs. Junglers have significantly lower CS, as expected. Within each role, winning players (green) tend to have a higher median CS than losing players (red). The difference appears most pronounced in the Top lane, suggesting that a CS advantage in this lane might be particularly impactful for securing a win. This aligns with competitive LoL dynamics where lane dominance through CS creates pressure.

Interesting Aggregates

This combined chart shows the relationship between CS brackets at 15 minutes and the overall win rate for players within those brackets.

Win Rate by CS at 15 Minutes Brackets

Interpretation: This chart shows a clear positive relationship between CS at 15 minutes and win rate. The win rate climbs steadily from around 50% in the lower CS brackets to nearly 70% in the highest bracket. Most games fall into the middle brackets (e.g., 115-153 CS), but achieving exceptionally high CS is associated with a noticeably higher likelihood of winning. This highlights that strong early-game farming is a robust predictor of match success in this dataset. One caveat: the lowest brackets at the left of the graph show a high win rate despite low CS, but they contain very few games. A plausible explanation is that when a player is doing this badly at 15 minutes and the team still wins, there is usually another carry on the team, an x factor on your own side. The trend from 38.4 CS upward still holds.
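
A bracket summary like the one behind this chart can be sketched as follows. The data is made up, and the choice of five equal-width brackets is an assumption, not necessarily the binning used for the plot. Reporting the game count next to the win rate makes thin brackets easy to spot.

```python
import pandas as pd

# Made-up player rows; real data would have one row per non-support player.
df = pd.DataFrame({
    "csat15": [40, 60, 95, 110, 120, 130, 140, 150, 170, 190],
    "result": [0, 1, 0, 0, 1, 0, 1, 1, 1, 1],
})

# Five equal-width brackets (the bracket count is an assumption).
df["cs_bracket"] = pd.cut(df["csat15"], bins=5)

# Win rate and number of games per bracket.
summary = (
    df.groupby("cs_bracket", observed=True)["result"]
      .agg(win_rate="mean", games="count")
)
print(summary)
```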

Average CS at Different Timestamps by Position and Outcome

| position | csat10_0 | csat10_1 | csat15_0 | csat15_1 | csat20_0 | csat20_1 | cs_diff_10 | cs_diff_15 | cs_diff_20 |
| :------- | -------: | -------: | -------: | -------: | -------: | -------: | ---------: | ---------: | ---------: |
| bot      |  77.6326 |  79.7207 |  124.943 |  128.997 |  171.942 |  178.213 |    2.08813 |    4.05387 |    6.27169 |
| jng      |  62.0141 |   63.216 |   93.571 |  96.0065 |  122.054 |  126.875 |    1.20197 |    2.43545 |    4.82102 |
| mid      |  83.3052 |  85.1219 |  134.034 |  137.689 |  177.165 |  182.729 |    1.81672 |    3.65451 |    5.56431 |
| top      |  75.6975 |  77.8944 |  122.852 |  126.971 |  165.039 |  171.271 |    2.19683 |    4.11893 |    6.23265 |

Interpretation: This table provides the average CS values at 10, 15, and 20 minutes for each position, split by whether the team won (1) or lost (0). The cs_diff columns show the average CS advantage for winning players in each role at each time point. This numerically reinforces the observations from the box plot: winners hold a CS lead in every role at every time point shown, and the gap widens from 10 to 20 minutes.
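
A table of this shape can be built with a pivot, sketched here on made-up rows. The column names follow the cleaned DataFrame shown earlier; the exact code that produced the table is not shown in this write-up, so treat this as one possible approach.

```python
import pandas as pd

# Made-up rows; column names follow the cleaned DataFrame shown earlier.
df = pd.DataFrame({
    "position": ["top", "top", "mid", "mid", "top", "mid"],
    "result":   [0, 1, 0, 1, 1, 0],
    "csat10":   [75, 78, 83, 85, 80, 81],
    "csat15":   [122, 127, 134, 138, 129, 132],
})

# Mean CS per position and outcome, one column per (metric, result) pair.
avg = df.pivot_table(index="position", columns="result",
                     values=["csat10", "csat15"], aggfunc="mean")
avg.columns = [f"{metric}_{result}" for metric, result in avg.columns]

# Winner-minus-loser gap at each timestamp.
avg["cs_diff_10"] = avg["csat10_1"] - avg["csat10_0"]
avg["cs_diff_15"] = avg["csat15_1"] - avg["csat15_0"]
print(avg)
```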

Imputation

A significant portion of the data had missing values, particularly for the CS and Kill metrics at specific timestamps.

Missing Data Summary:

Dropping these rows would discard too much information. Therefore, imputation was necessary.

Imputation Strategy:

At first I used a simple median imputation, but because so many rows were missing data, the resulting graph looked awkward: the frequency count at the median value was extremely high. So, drawing on my background knowledge of League of Legends, I switched to an imputation strategy that takes the median by role and by team, because some teams like to play aggressively and some play passively. Furthermore, ADCs typically have more kills than other roles, so I thought this would be a good strategy.

A more nuanced, cascaded median imputation strategy was adopted:

  1. For a missing value, first try imputing using the median for that specific position and champion combination.
  2. If that combination isn’t available or its median is NaN, fall back to the median for the player’s position.
  3. If that fails, use the median for the player’s teamid.
  4. As a final fallback, use the overall median for the column across all non-support players.

This strategy leverages domain knowledge (different roles/champions/teams have different typical stats) to provide more realistic imputed values than a simple overall median. The reason I chose the position + champion combination is that some positions have more opportunity to get CS than others, and some champions have better farming mechanics (i.e., it’s easier to kill minions with their abilities) than others!
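
The four-step cascade can be sketched as below. This is a minimal illustration, not the notebook's actual code: the helper name `cascade_impute` is hypothetical, the rows are made up, a `champion` column is assumed to exist, and the sketch assumes the frame passed in already holds only non-support rows.

```python
import pandas as pd

def cascade_impute(df: pd.DataFrame, col: str) -> pd.Series:
    """Fill NaNs in `col` using progressively coarser group medians."""
    filled = df[col].copy()
    for keys in (["position", "champion"], ["position"], ["teamid"]):
        # transform("median") aligns each group's median back to its rows;
        # groups with no observed values yield NaN and fall through.
        group_median = df.groupby(keys)[col].transform("median")
        filled = filled.fillna(group_median)
    # Final fallback: overall median of the observed values.
    return filled.fillna(df[col].median())

# Made-up rows chosen so that each fallback level is exercised once.
df = pd.DataFrame({
    "position": ["top", "top", "top", "top", "mid", "jng", "mid"],
    "champion": ["Garen", "Garen", "Jax", "Ornn", "Ahri", "Vi", "Syndra"],
    "teamid":   ["t1", "t2", "t3", "t2", "t1", "t1", "t9"],
    "csat15":   [120.0, None, 130.0, None, None, 90.0, None],
})
result = cascade_impute(df, "csat15")
print(result.tolist())
```

Note that every median here is computed from the originally observed values only, so earlier fills never feed into later fallbacks.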

Imputation Quality Report

| column        | missing_values | original_mean | original_median | imputed_mean | imputed_median | percent_change |
| :------------ | -------------: | ------------: | --------------: | -----------: | -------------: | -------------: |
| total_cs      |          25098 |       198.968 |             214 |      239.868 |            237 |        20.5561 |
| csat10        |          22716 |        104.83 |              77 |      75.5753 |             78 |        27.9071 |
| csat15        |          22716 |       167.489 |             124 |      120.633 |            126 |        27.9758 |
| csat20        |          23172 |       224.578 |             167 |      161.911 |            169 |        27.9045 |
| opp_csat10    |          22716 |        104.83 |              77 |      75.5935 |             77 |        27.8898 |
| opp_csat15    |          22716 |       167.489 |             124 |      120.677 |            125 |        27.9493 |
| opp_csat20    |          23172 |       224.578 |             167 |      161.948 |            169 |        27.8879 |
| killsat10     |          22716 |       0.72826 |               0 |      0.42483 |              0 |        41.6651 |
| killsat15     |          22716 |       1.35872 |               1 |       0.8852 |              1 |        34.8504 |
| killsat20     |          23172 |       2.15704 |               1 |      1.43643 |              1 |        33.4075 |
| opp_killsat10 |          22716 |       0.72826 |               0 |     0.417548 |              0 |        42.6649 |
| opp_killsat15 |          22716 |       1.35872 |               1 |     0.883681 |              1 |        34.9622 |
| opp_killsat20 |          23172 |       2.15704 |               1 |       1.4194 |              1 |        34.1972 |

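Interpretation: This table compares the mean and median of key columns before and after imputation (calculated on the non-missing original values vs. the fully imputed column). The goal is for the imputed statistics to be reasonably close to the original ones, indicating the imputation hasn’t drastically changed the central tendency of the data. In this report the medians are fairly stable: the CS medians at 10, 15, and 20 minutes move by at most 2 and the kill medians do not move at all, although the total_cs median rises from 214 to 237. The means shift more, by roughly 21% to 43%, so the medians are the better evidence that the imputation preserved the overall data characteristics.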
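
The percent_change values in the report are consistent with the absolute change in the mean, expressed as a percentage of the original mean. That formula is inferred from the numbers in the table rather than stated anywhere in this write-up, so the sketch below is an assumption that happens to reproduce the reported values.

```python
def percent_change(original_mean: float, imputed_mean: float) -> float:
    """Absolute change in the mean, as a percentage of the original mean."""
    return abs(imputed_mean - original_mean) / original_mean * 100

# Means copied from the Imputation Quality Report.
rows = {
    "total_cs": (198.968, 239.868),
    "csat15": (167.489, 120.633),
    "killsat15": (1.35872, 0.8852),
}
for column, (orig, imputed) in rows.items():
    print(f"{column}: {percent_change(orig, imputed):.2f}%")
```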

Visualizing Imputation Effect (Example: CS at 15 Minutes)

Interpretation: Comparing the distribution of csat10 before imputation (orange, excluding NaNs) and after imputation (blue) shows how the imputation strategy filled the gaps. Ideally, the shape of the blue histogram should look like a “filled-in” version of the orange one, without introducing extreme artificial peaks, demonstrating that the imputation maintained the original data’s general distribution. Getting rid of these artificial peaks at the bottom and the top helps us predict better, as they are not representative of a typical League of Legends game. I don’t think blowouts are good to consider here.

Step 3: Framing a Prediction Problem

Prediction Problem: Can we predict the final outcome (win or loss) of a League of Legends match based only on performance metrics measured at the 15-minute mark?

Motivation: This problem aims to test the hypothesis that the early-to-mid game phase (specifically, the state at 15 minutes) is significantly predictive of the final match result. It explores whether early advantages in creep score (CS), kills, gold, or experience translate reliably into victories, addressing the common debate about the importance of the “laning phase” versus later team fights.

Prediction Type: This is a Binary Classification problem.

Response Variable:

Evaluation Metric:

I chose F1 over AUROC because AUROC is good for understanding separability before picking a decision threshold, but it doesn’t tell us how our final win/loss cutoff actually performs. Since the end goal is a concrete classifier, we need a metric that reflects performance at the cutoff, and this is exactly what F1 does. I will still report AUROC and accuracy to show the overall performance of the model.
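
The difference between the two metrics can be seen in a small hand-rolled sketch. The labels and scores are made up, and the F1 here is the plain positive-class F1 rather than the weighted variant reported later. The scores rank every win above every loss, so AUROC is perfect, yet F1 depends entirely on where the cutoff is placed.

```python
def f1_at_threshold(y_true, scores, threshold=0.5):
    """F1 for the positive class after cutting scores at `threshold`."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, y_true))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, y_true))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, y_true))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def auroc(y_true, scores):
    """Probability that a random win is scored above a random loss."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0]
scores = [0.45, 0.40, 0.35, 0.30, 0.25, 0.20]  # well ranked, all below 0.5

print("AUROC:", auroc(y_true, scores))
print("F1 at 0.50:", f1_at_threshold(y_true, scores, 0.50))
print("F1 at 0.33:", f1_at_threshold(y_true, scores, 0.33))
```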

Features at Prediction Time:


Step 4: Baseline Model

To establish a performance baseline, we start with a simple Logistic Regression model.
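
A baseline of this kind might look like the following scikit-learn sketch. Everything specific here is an assumption: the data is synthetic, the two features (cs_diff_15 and side) are the baseline features named in the final-model section, the "Blue"/"Red" labels are invented, and the preprocessing and default hyperparameters are illustrative choices rather than the settings actually used.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data: a CS lead at 15 minutes nudges the win probability.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "cs_diff_15": rng.normal(0, 15, n),
    "side": rng.choice(["Blue", "Red"], n),  # made-up side labels
})
p_win = 1 / (1 + np.exp(-X["cs_diff_15"] / 10))
y = (rng.random(n) < p_win).astype(int)

# Scale the numeric feature and one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["cs_diff_15"]),
    ("cat", OneHotEncoder(drop="first"), ["side"]),
])
model = make_pipeline(preprocess, LogisticRegression())

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
model.fit(X_train, y_train)

preds = model.predict(X_test)
score = f1_score(y_test, preds, average="weighted")
print("weighted F1:", score)
```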

Model Description:

Features Used:

Performance:

Assessment:


Step 5: Final Model

Building upon the baseline, the final model aims to improve prediction accuracy by incorporating more sophisticated features and optimizing model parameters.

Feature Engineering:

In addition to the baseline features (cs_diff_15, side), the final model includes:

These features were chosen because they represent key aspects of early-game performance (combat, economy, levels) and interactions between them that are commonly understood to influence match outcomes in League of Legends.

Modeling Algorithm and Hyperparameter Tuning:

Final Model Performance:

Comparison to Baseline:

The final model demonstrated a clear improvement over the baseline:

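| Metric         | Baseline Model | Final Model | Improvement |
| :------------- | -------------: | ----------: | ----------: |
| Accuracy       |         0.5817 |      0.6272 |     +0.0455 |
| F1 Score (Wgt) |         0.5805 |      0.6272 |     +0.0467 |
| ROC AUC        |         0.6141 |      0.6870 |     +0.0729 |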

Conclusion:

The final Logistic Regression model, enhanced with additional and engineered features reflecting key game interactions and tuned hyperparameters, outperforms the baseline model in predicting match outcomes based on the 15-minute mark. While still far from perfect (accuracy ~63%), the model achieves an F1-score of 0.6272 and ROC AUC of 0.6870, demonstrating that early-game advantages, particularly gold difference, are statistically significant indicators of eventual success. Drawing on my previous knowledge of League of Legends, one thing that may explain why the early laning phase is only a modest predictor of winning is that this data comes from pro games. In pro games there are a lot of upsets in the late stages because pros know how to capitalize on mistakes much better. In casual games there is more snowballing: if someone has a big advantage early on, inexperienced players have a low probability of coming back.

The feature importance aligns with game knowledge, emphasizing the impact of gold and experience differentials. The negative coefficient for CS difference warrants further investigation but might suggest complexities beyond simple farm counts when other advantages are considered.