A Logistic Regression-Based Tennis Match Rating System

This blog post is based on work I did in Spring 2023 as an analyst with the Cornell Sports Analytics Club.

Introduction

In many sports, player ratings are essential for evaluating individual performance, guiding roster decisions, and boosting fan engagement. Existing rating systems in tennis, such as UTR and NTRP, focus mainly on long-term ability and often fail to capture short-term performance. This presents a gap in player evaluation, particularly in high-level team tennis, where coaches and players need more immediate, data-driven insights.

The goal of this project was to develop a system for rating tennis performances in a single match by analyzing key match statistics and assigning each player a score between 1 and 10. To accomplish this, I built a logistic regression model that estimates the probability of a player winning a match based on 32 match statistics, which is then scaled to produce the rating. This approach achieves two main objectives:

  1. Provides a quantitative measure of player performance.
  2. Identifies which match statistics have the greatest impact on match outcomes.

The approach looks effective in preliminary testing, although more refinement and testing are needed. Overall, I think it's a novel and useful way of measuring tennis performances.

Motivation

Let's start by defining the difference between rating a player's ability and rating a player's performance. Ability captures a player’s underlying skill level: how strong they are independent of any single match. It is relatively stable over time and usually changes slowly as a player improves or declines. Performance, on the other hand, reflects how well a player actually played in a given match or even a single game. Performance can fluctuate from day to day depending on form, fitness, mental state, or external conditions. Measuring ability and measuring performance should be treated as separate problems.

2023 Canadian Open Final
Despite having very similar UTRs, Jannik Sinner (right) defeated Alex De Minaur 6-4, 6-1 in the 2023 Canadian Open final. According to the model, Sinner receives a performance rating of 8.72 and De Minaur scores a 2.81.

To illustrate this further, consider a scenario where Carlos Alcaraz plays a match on the ATP tour and a fifteen-year-old plays a match in a local tournament on the same day. It is perfectly feasible that Carlos Alcaraz's performance score out of 10 is lower than the junior player's, especially if Carlos commits many unforced errors that day and the fifteen-year-old plays consistently. However, that fifteen-year-old will almost certainly never have the tennis ability of Carlos Alcaraz.

Currently, there are many ways to rate a player's ability. The most popular and effective is the Universal Tennis Rating (UTR), which assigns competitive players a number between 1 (novice) and 16 (top professional men's player). Another example is the NTRP, an outdated system that rates ability from 1 to 7; there is also a newer system called the World Tennis Number. While there are strong systems for measuring ability, there is still no widely used system for measuring performance in tennis. A good performance rating system could be valuable in several ways:

  • Quantifying the performance of a player in a single match.
  • Comparing performances across matches.
  • Roster management: helping high school and college coaches compare players and decide lineups.
  • Fan engagement and marketing: for example, the US Open could highlight the player with the best performance in the quarterfinals.

I was inspired by performance rating systems from other sports. In soccer, for example, WhoScored.com assigns players a rating between 1 and 10 for each match. Positive actions, like a pass leading to a shot, increase the rating, whereas negative events, like committing a foul that leads to a dangerous free kick, decrease it. With this in mind, I set out some principles that a good performance rating system should follow:

  1. Ratings should be comparable across matches.
  2. Ratings should be (almost) independent of the level of a player.
  3. Ratings should be independent of (but probably correlated with) winning and losing.

Of course, some events (like hitting aces) are inherently easier for high-level players, so certain statistics will naturally skew toward stronger players. But the goal is to measure how well someone played relative to their match, not just how good they are overall.

Implementation Details

Data for this project was obtained from the Ultimate Tennis Statistics website, one of the best public sources of tennis match data. The project uses match data from 1,081 men's tennis matches from the Masters 1000, ATP Finals, and ATP 500 levels during the 2019, 2021, 2022, and 2023 seasons. This data was split into a training set and a testing set using an 80-20 split. The features collected included serving, returning, net play, and overall performance statistics:

  • Serving Stats
    • Ace Percentage
    • Double Fault Percentage
    • First Serve Percentage
    • First Serve Win Percentage
    • Second Serve Win Percentage
    • Break Points Saved Percentage
    • Service Points Won Percentage
    • Aces per Service Game
    • Double Faults per Second Serve
    • Double Faults per Service Game
    • Service In-Play Points Won
    • Points per Service Game
    • Points Lost per Service Game
    • Break Points per Service Game
    • Service Games Won Percentage
    • Service Games Lost per Set
  • Returning Stats
    • First Serve Return Points Won Percentage
    • Second Serve Return Points Won Percentage
    • Break Points Won Percentage
    • Return Points Won Percentage
    • Return In-Play Points Won
    • Points per Return Game
    • Points Won per Return Game
    • Return Games Won Percentage
  • Net Play
    • Points Played at the Net Percentage
    • Net Points Won Percentage
    • Points Won at the Net Percentage
  • Overall / Other
    • Points Won Percentage (All points)
    • Winner Percentage (Winners per Point Won)
    • Unforced Error Percentage (per Point Lost)
    • Forced Error Percentage
    • Games Won Percentage

Initially, I tried to explore more direct methods for training a model to rate performances, like manually sorting a large corpus of match data and then learning what features led to the sorted order. However, these approaches were not efficient. The key insight is that in tennis, good performances tend to translate into wins, so instead of trying to rate performance directly, I trained a logistic regression model to predict the probability of winning a match based on the available features.

Logistic regression has several advantages in this context. It is mathematically simple and computationally efficient, so it can be trained quickly even on large datasets. The model is also highly interpretable: each feature weight directly reflects how that statistic contributes to the chance of winning. Finally, the output of the model is a probability, which gives a nuanced sense of confidence rather than a strict yes/no prediction.

Logistic regression models the probability of a binary outcome (win/loss) as: $$\hat{y} = \sigma(w^T x + b)$$ where $$\sigma(z) = \frac{1}{1 + e^{-z}}$$ is the sigmoid function. In this project, the sigmoid function used has an extra parameter s, which makes it $$\sigma(z) = \frac{1}{1 + e^{-z/s}}$$.

The goal of the model is not just to produce a binary win/loss prediction, but to derive a continuous rating that reflects how strongly a player is expected to perform. If the sigmoid is too steep, most predictions collapse toward 0 or 1, and the intermediate values that could be interpreted as a “rating” are lost. By dividing z by s, the curve is “compressed” and becomes less steep. This spreads the predictions more evenly across the output interval so the output can serve as a rating measure instead of a hard classification.
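As a concrete illustration, here is a minimal NumPy sketch of the scaled sigmoid (the function name and the specific values of s are mine, not from the original code):

```python
import numpy as np

def scaled_sigmoid(z, s=1.0):
    """Sigmoid with a scale parameter s: larger s flattens the curve,
    spreading outputs away from 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z / s))

# With s = 1, a score of z = 3 is already near-certain (~0.95);
# with s = 4, the same score maps to an intermediate value (~0.68)
# that is more useful as a rating.
print(scaled_sigmoid(3.0))         # steep: close to 1
print(scaled_sigmoid(3.0, s=4.0))  # flatter: an intermediate value
```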

To fit the model, I optimized the parameters \(w\) and \(b\) by minimizing the log loss function, which measures how far the predicted probabilities are from the actual match outcomes. The loss for a dataset of \(m\) examples is: $$\mathcal{L}(y, \hat{y}) = - \frac{1}{m} \sum_{i=1}^m \Big[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \Big]$$
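A direct NumPy translation of this loss (the small epsilon clip to avoid log(0) is an implementation detail I am assuming, not quoting):

```python
import numpy as np

def log_loss(y, y_hat, eps=1e-12):
    # clip predictions away from 0 and 1 so the logs stay finite
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```

Predicting 0.5 on every example gives a loss of ln 2 ≈ 0.693, a useful baseline that an informative model should beat.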

First, the input data was normalized so that each feature had zero mean and unit variance. For feature j and sample i, the normalized value was computed as

\[ \tilde{x}_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}, \]

where \(\mu_j\) and \(\sigma_j\) are the mean and standard deviation of feature j. If a feature had zero variance, \(\sigma_j\) was set to one to avoid division by zero. This ensures all features contribute on a comparable scale.
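In code, the normalization step with the zero-variance guard might look like this sketch (function names are my own):

```python
import numpy as np

def fit_normalizer(X):
    """Compute per-feature mean and standard deviation on the training set."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # zero-variance guard: avoid division by zero
    return mu, sigma

def normalize(X, mu, sigma):
    # broadcast the per-feature statistics across all rows
    return (X - mu) / sigma
```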

The model was trained for 20,000 iterations with a learning rate of 0.001. In each iteration, predictions were computed from the current weights and bias, and the loss was calculated with the log-loss function.

The gradients of the loss with respect to the weights and bias are

\[ \frac{\partial L}{\partial w} = \frac{1}{m} X^T (\hat{y} - y), \quad \frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i=1}^m (\hat{y}^{(i)} - y^{(i)}) \]

and the update rules at each step of gradient descent are

\[ w \leftarrow w - \alpha \cdot \frac{1}{m} X^T (\hat{y} - y), \quad b \leftarrow b - \alpha \cdot \frac{1}{m} \sum_{i=1}^m (\hat{y}^{(i)} - y^{(i)}), \]

where \(\alpha\) is the learning rate. After 20,000 iterations, the change in loss between iterations was near zero.
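Putting the pieces together, a minimal batch gradient-descent loop for the standard (s = 1) case is sketched below; for the scaled sigmoid, the extra 1/s factor in the gradients is a constant that can be absorbed into the learning rate. The names and structure here are my own, not the original implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.001, iters=20000):
    """Batch gradient descent on the log loss."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(iters):
        y_hat = sigmoid(X @ w + b)      # current predictions
        grad_w = X.T @ (y_hat - y) / m  # dL/dw
        grad_b = np.mean(y_hat - y)     # dL/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```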

Finally, when predicting on new data, the same normalization step was applied using the training set statistics \((\mu_j, \sigma_j)\).
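At prediction time, the stored training statistics and the scaled sigmoid combine into a rating. The exact probability-to-rating mapping is not spelled out above; a linear 10·p scaling is one assumption that is consistent with a test-set average rating of 5.00 at an average probability of 0.5:

```python
import numpy as np

def predict_rating(x_new, w, b, mu, sigma, s=1.0):
    """Normalize a new stat line with the training-set statistics, then map
    the win probability to a rating out of 10 (linear mapping assumed)."""
    z = ((x_new - mu) / sigma) @ w + b
    p = 1.0 / (1.0 + np.exp(-z / s))
    return 10.0 * p
```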

Results

The dataset contained 1,729 training examples and 433 testing examples. To evaluate the model in the standard way for logistic regression, I first checked how accurately it predicted match outcomes (win/loss).

In the training set, the model correctly predicted wins and losses for 1,609 of 1,729 rows, a 93% accuracy rate. On the testing set, it correctly predicted the outcome of 406 of 433 rows, a 94% accuracy rate. Increasing the number of training iterations did not improve accuracy significantly. The average rating in the test set was 5.00 out of 10, and 5.77% of ratings fell above 9 or below 1.
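The accuracy figures here correspond to thresholding the predicted win probabilities at 0.5, which can be sketched as:

```python
import numpy as np

def accuracy(y, p):
    # threshold predicted win probabilities at 0.5 and compare to outcomes
    return np.mean((p >= 0.5).astype(int) == y)
```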

The results of the model have been incorporated into the dashboard below.

Conclusion & Reflections

2023 Madrid Open Final
In the 2023 Madrid Open Final, Alcaraz (right) defeated Struff 6-4, 3-6, 6-3. The model rated Alcaraz's performance as a 6.73 and Struff's as a 6.32.

While logistic regression models are simple, this model has a few drawbacks that must be taken into consideration. First, rating a player by predicting whether they will win strays slightly from the goal of measuring performance: a player can win despite putting in a poor performance and, likewise, lose despite putting in a good one. This model will not capture that completely. In addition, many indicators of good play, such as shot depth or a shot's average height over the net, are not directly captured by the model.

Another drawback is the number of statistics a user needs to keep track of to use the model. It is difficult to track over thirty tennis statistics accurately without the help of software which logs point-by-point results. This presents a barrier to large-scale adoption of this model.

Going forward, the model requires further testing across tennis matches at a variety of levels. One way to accomplish this is by integrating the model into match statistic tracking software, such as this tool. This would solve two problems at once: it would allow the model to be tested easily at scale and would make data collection feasible.

The Rating Dashboard
