Earlier this year, I enrolled in a Data Management course as a part of my MSc in Business Analytics. The final project in this course involved presenting a business case for a database and building a MySQL database as a solution. As a squash enthusiast, I chose to base my project on the Professional Squash Association (PSA).

Long story short, I did not complete this project because it was cancelled due to the spread of COVID-19. Unfortunately, I had already put approximately 20 hours of coding into the project. So, being as stubborn as I am, I committed to reusing the code by building an R package. The result is squashinformr: a package for scraping player, tournament, and ranking data from SquashInfo.
 

Politeness on the web

squashinformr is built on the polite package and, therefore, inherits its principles of “being nice on the web”. In short, web scraping is a data collection method that can be harmful to websites and their users, if done carelessly. Therefore, it is important to use our manners:

  1. Seek permission to scrape before you begin.
  2. Take slowly, so as to not be mistaken for a DDoS attack.
  3. Never ask twice: use memoization so that repeated requests are answered from a cache.
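
To make rules 2 and 3 concrete, here is a minimal base R sketch of a rate-limited, memoized fetcher. It is not how polite or squashinformr are implemented; the names make_polite_fetcher(), fake_fetch(), and the example URL are hypothetical, and the "request" is a stand-in function so the sketch runs offline. In practice, the polite package takes care of the delay and the caching for you, along with the robots.txt check behind rule 1.

```r
# Hypothetical helper: wraps any fetch function so that repeated requests
# are served from a cache and new requests are spaced out by `delay` seconds.
make_polite_fetcher <- function(fetch, delay = 5) {
  cache <- new.env(parent = emptyenv())  # remembers earlier responses
  last_request <- NULL                   # time of the most recent real request

  function(url) {
    # Rule 3: never ask twice. Return the stored response if we have one.
    if (exists(url, envir = cache, inherits = FALSE)) {
      return(get(url, envir = cache))
    }
    # Rule 2: take slowly. Wait until `delay` seconds have passed.
    if (!is.null(last_request)) {
      elapsed <- as.numeric(difftime(Sys.time(), last_request, units = "secs"))
      if (elapsed < delay) Sys.sleep(delay - elapsed)
    }
    response <- fetch(url)
    last_request <<- Sys.time()
    assign(url, response, envir = cache)
    response
  }
}

# Stand-in for a real HTTP request (an assumption, so nothing is scraped here).
request_count <- 0
fake_fetch <- function(url) {
  request_count <<- request_count + 1
  paste("contents of", url)
}

polite_fetch <- make_polite_fetcher(fake_fetch, delay = 0.1)
first  <- polite_fetch("https://example.com/rankings")
second <- polite_fetch("https://example.com/rankings")  # served from the cache
```

The second call never reaches fake_fetch(), which is the whole point of memoization: the website only sees one request per page.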

Following polite principles incurs mandatory delays (set by SquashInfo’s robots.txt file) on the scraping process. Therefore, it is important that users are patient when using squashinformr. SquashInfo currently offers full access to their data and extra features through a premium membership. Please consider signing up and/or subscribing to SquashInfo to support their work.
 

Installation

Install squashinformr via CRAN:

install.packages("squashinformr")

Install the development version of squashinformr from its GitHub repository via:

if (!require("remotes")) install.packages("remotes")

remotes::install_github("HaydenMacDonald/squashinformr")

 

Functions in v0.1.2

Generally, squashinformr functions fall into one of three families:  

  • Player functions for scraping player profile data
    • get_players()
    • get_player_recent_results()
    • get_player_recent_matches()
    • get_player_recent_games()
    • get_player_rankings_history()
    • get_matchup()
  • Ranking functions for scraping current and historical rankings tables
    • get_rankings()
    • get_historical_rankings()
  • Tournament functions for scraping tournament results data
    • get_tournaments()
    • get_tournament_players()
    • get_tournament_matches()
    • get_tournament_games()

Let’s take a closer look at some of these functions in action…


Player functions

Use get_players() to extract biographical information on players currently ranked within the top 500. Here, we’ll just scrape the current top 25 men’s players.  

library(squashinformr)

(top_25_men <- get_players(top = 25, category = "mens"))
## # A tibble: 25 x 15
##     rank first last    age gender birthplace nationality residence height weight
##    <dbl> <chr> <chr> <dbl> <chr>  <chr>      <chr>       <chr>      <dbl>  <dbl>
##  1     1 Moha~ Elsh~    29 Male   Alexandria Egypt       Bristol,~    185     82
##  2     2 Ali   Farag    28 Male   Cairo      Egypt       Cairo        183     70
##  3     3 Karim Abde~    28 Male   Alexandria Egypt       Giza         173     72
##  4     4 Tarek Momen    32 Male   Cairo      Egypt       Cairo        174     66
##  5     5 Paul  Coll     27 Male   Greymouth  New Zealand Amsterda~    179     83
##  6     6 Diego Elias    23 Male   Lima       Peru        Lima         188     82
##  7     7 Marw~ Elsh~    26 Male   Alexandria Egypt       Bristol,~    183     73
##  8     8 Simon Rösn~    32 Male   Wurzburg   Germany     Paderborn    190     89
##  9     9 Migu~ Rodr~    34 Male   Bogota     Colombia    Bogota       170     72
## 10    10 Joel  Makin    25 Male   Haverford~ Wales       Birmingh~    180     80
## # ... with 15 more rows, and 5 more variables: plays <chr>, racket <chr>,
## #   joined_psa <dbl>, university <chr>, club <chr>

 
If you’re familiar with squash, you might be under the impression that players who are younger, taller, or lighter on their feet have a competitive advantage.
 

The data returned from SquashInfo seem to imply that this is not the case! Age, height, weight, and BMI do not seem to have any clear relationship with PSA rankings. There are many factors that contribute to a player’s success, including endurance, positioning, accuracy, and play-style.
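
One way to probe this is a rank correlation between PSA rank and each attribute. The sketch below is only an illustration: it uses just the ten rows printed in the output above (typed in by hand), assumes height is in centimetres and weight in kilograms, and the object names top_10_men, traits, and rank_cor are my own. With the full top_25_men tibble you would use its columns directly.

```r
# The ten rows printed in the output above, typed in by hand.
top_10_men <- data.frame(
  rank   = 1:10,
  age    = c(29, 28, 28, 32, 27, 23, 26, 32, 34, 25),
  height = c(185, 183, 173, 174, 179, 188, 183, 190, 170, 180),  # assumed cm
  weight = c(82, 70, 72, 66, 83, 82, 73, 89, 72, 80)             # assumed kg
)

# BMI = weight (kg) / height (m)^2
top_10_men$bmi <- top_10_men$weight / (top_10_men$height / 100)^2

# Spearman rank correlation of each attribute with PSA rank.
traits <- c("age", "height", "weight", "bmi")
rank_cor <- sapply(traits, function(col) {
  cor(top_10_men$rank, top_10_men[[col]], method = "spearman")
})
round(rank_cor, 2)
```

Values close to zero would be consistent with the claim above, although ten players is far too small a sample to settle the question.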
 

Recent match data

Use get_player_recent_matches() to retrieve a player’s recent match data (since January 1, 2019). Let’s get Raneem El Welily’s recent match data.

(welily <- get_player_recent_matches(player = "Raneem El Welily", category = "womens"))
## # A tibble: 71 x 12
##     rank player opponent result games_won games_lost match_time round date
##    <int> <chr>  <chr>    <chr>      <dbl>      <dbl>      <dbl> <chr> <date>
##  1     1 Ranee~ Nour El~ L              2          3         72 QF    2020-03-01
##  2     1 Ranee~ Yathreb~ W             NA         NA         NA R3    2020-03-01
##  3     1 Ranee~ Alexand~ W              3          1         36 R2    2020-03-01
##  4     1 Ranee~ Nour El~ L              2          3         59 F     2020-03-01
##  5     1 Ranee~ Nouran ~ W              3          0         31 SF    2020-03-01
##  6     1 Ranee~ Olivia ~ W              3          0         29 QF    2020-03-01
##  7     1 Ranee~ Salma H~ W              3          0         22 R3    2020-03-01
##  8     1 Ranee~ Juliann~ W              3          0         26 R2    2020-03-01
##  9     1 Ranee~ Nouran ~ W              1          0         15 F     2020-02-01
## 10     1 Ranee~ Hania E~ W              3          0         40 SF    2020-02-01
## # ... with 61 more rows, and 3 more variables: event <chr>, country <chr>,
## #   psa <chr>

 

El Welily’s 2019-2020 match data exemplify her general dominance of women’s competitions. She has a 5.45 win/loss ratio across the tournaments she participated in, and 65% of her match wins are 3-0s. Even when she does lose a match, she does not make it easy for her opponents: her median match times in losses are longer regardless of the number of games played. The sample size for El Welily’s losses is small, though.
 

Apologies to those using dark theme! The data above are not available anymore, so I could not reproduce this graphic with a dark theme :(
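
Summaries like the win/loss ratio and the share of 3-0 wins can be computed along the following lines. This is a sketch, not the code behind the figures quoted above: it uses only the ten rows printed in the output (typed in by hand), so its numbers will differ from the full 71-row results, and the object names are my own.

```r
# The ten rows printed in the output above, typed in by hand.
recent <- data.frame(
  result     = c("L", "W", "W", "L", "W", "W", "W", "W", "W", "W"),
  games_won  = c(2, NA, 3, 2, 3, 3, 3, 3, 1, 3),
  games_lost = c(3, NA, 1, 3, 0, 0, 0, 0, 0, 0),
  match_time = c(72, NA, 36, 59, 31, 29, 22, 26, 15, 40),
  stringsAsFactors = FALSE
)

# Win/loss ratio: number of wins divided by number of losses.
wins   <- sum(recent$result == "W")
losses <- sum(recent$result == "L")
win_loss_ratio <- wins / losses

# Share of wins that were 3-0 (rows with missing game counts are not counted).
is_3_0 <- recent$result == "W" & recent$games_won == 3 & recent$games_lost == 0
share_3_0 <- sum(is_3_0, na.rm = TRUE) / wins

# Median match time (presumably minutes) in losses versus wins.
median_time <- tapply(recent$match_time, recent$result, median, na.rm = TRUE)
```

Note the row with missing values: one win has no game counts or match time recorded, so it counts as a win but is excluded from the 3-0 tally and the medians.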

Given El Welily’s general dominance, how does she play versus the current #2, Nouran Gohar? We can use get_matchup() to get a quantitative take on their head-to-head performance.

welily_v_gohar <- get_matchup("Raneem El Welily", "Nouran Gohar", category = "womens")

 

El Welily currently holds a lead on Gohar by five matches since January 1, 2019, and a majority of her wins over Gohar have been decisive (3-1 or 3-0).
 

Raneem El Welily v Nouran Gohar
Head-to-Head Game Statistics, 2019-2020

Metric                   Value
Games Played             32
Player 1 Games Won       20
Player 2 Games Won       11
Player 1 Avg Advantage   4.75
Player 2 Avg Advantage   4.73
Avg Point Diff           4.74
Player 1 Tiebreak Wins   2
Player 2 Tiebreak Wins   1
Pct Games Tiebreak       6.25
 
But game-level data show that Gohar can be equally dominant in any individual game. Their approximately equal average point advantages reflect this. So it may be El Welily’s consistency across games that allows her to win more matches over Gohar.
 

Ranking functions

PSA Rankings are the definitive international rankings that decide which players are the best in the world. They have a huge influence on how tournament organizers seed individual players. Using get_rankings(), let’s scrape the top 15 players from the most recent rankings tables (April 2020 at time of writing) for both competition categories.
 

apr <- get_rankings(top = 15, category = "both")

 

Men's PSA Rankings (April 2020)

Rank  Previous Rank  Change  Name                     Highest Ranking  Highest Ranking Date  Country
1     1                      Mohamed Elshorbagy       1                2014-11-01            EGY
2     2                      Ali Farag                1                2019-03-01            EGY
3     4              1       Karim Abdel Gawad        1                2017-05-01            EGY
4     3              −1      Tarek Momen              3                2019-02-01            EGY
5     5                      Paul Coll                5                2019-04-01            NZL
6     6                      Diego Elias              6                2020-01-01            PER
7     8              1       Marwan Elshorbagy        3                2018-05-01            EGY
8     7              −1      Simon Rösner             3                2018-12-01            GER
9     9                      Miguel Rodriguez         4                2015-06-01            COL
10    10                     Joel Makin               10               2020-03-01            WAL
11    11                     Mohamed Abouelghar       7                2019-06-01            EGY
12    12                     Fares Dessouky           8                2017-11-01            EGY
13    13                     Saurav Ghosal            10               2019-04-01            IND
14    14                     Mazen Hesham             13               2015-12-01            EGY
15    17             2       Omar Mosaad              3                2016-06-01            EGY

 

Women's PSA Rankings (April 2020)

Rank  Previous Rank  Change  Name                     Highest Ranking  Highest Ranking Date  Country
1     1                      Raneem El Welily         1                2015-09-01            EGY
2     2                      Nouran Gohar             2                2017-01-01            EGY
3     4              1       Nour El Sherbini         1                2016-05-01            EGY
4     3              −1      Camille Serme            2                2017-02-01            FRA
5     5                      Nour El Tayeb            3                2018-02-01            EGY
6     7              1       Sarah-Jane Perry         6                2017-09-01            ENG
7     10             3       Hania El Hammamy         7                2020-04-01            EGY
8     8                      Amanda Sobhy             6                2016-10-01            USA
9     6              −3      Joelle King              3                2019-02-01            NZL
10    9              −1      Tesni Evans              9                2018-11-01            WAL
11    12             1       Joshna Chinappa          10               2016-07-01            IND
12    13             1       Salma Hany               12               2019-03-01            EGY
13    16             3       Olivia Blatchford Clyne  12               2017-12-01            USA
14    15             1       Yathreb Adel             14               2020-04-01            EGY
15    14             −1      Alison Waters            3                2010-10-01            ENG
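
Judging by the rows above, the Change column appears to be the previous rank minus the current rank, so a positive value means a player moved up. Here is a small sketch of that arithmetic on four rows from the women's table, typed in by hand. The column names rank, previous_rank, and change are my own assumption and may not match what get_rankings() actually returns.

```r
# Four rows from the women's table above, typed in by hand.
# Column names are assumed for illustration only.
womens <- data.frame(
  name          = c("Nour El Sherbini", "Camille Serme",
                    "Hania El Hammamy", "Joelle King"),
  rank          = c(3, 4, 7, 9),
  previous_rank = c(4, 3, 10, 6),
  stringsAsFactors = FALSE
)

# Positive change = moved up the rankings; negative = moved down.
womens$change <- womens$previous_rank - womens$rank

# Biggest climbers first.
womens[order(-womens$change), ]
```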
 
get_rankings() is useful for analyzing how player rankings have changed recently. Alternatively, we can look back on an individual player’s ranking history with get_player_rankings_history().
 

ph <- get_player_rankings_history(rank = 1:4, category = "womens")

 
 
This is excellent data to use when discussing the storylines of the sport. In this example, the current top 4 women’s players all began their careers before 2012. Since the end of 2014, they have occupied the top 20 rankings. We can see that Nouran Gohar had a meteoric rise through the rankings after her first PSA ranking in January 2011. The following graphic shows the last year of these players’ ranking histories.
 

Shoutout to the ggbump package!


How will these rankings shift over the next couple of years?
 

Tournament functions

squashinformr provides useful functions for pulling tournament data at many levels of detail. This includes tournament metadata, players, matches, and games.

The most granular level is the individual game, accessed via get_tournament_games(). Let’s use this function to look at the JP Morgan Tournament of Champions, one of the premier tournaments in squash.

## Return game data for 2020's Tournament of Champions.
(toc <- get_tournament_games("tournament of champions", year = 2020))
## # A tibble: 388 x 15
##    tournament_name category tournament_date round match  game player_1 player_2
##    <chr>           <chr>    <date>          <ord> <int> <int> <chr>    <chr>
##  1 JP Morgan Tour~ Men's    2020-01-17      F        64     4 Mohamed~ Tarek M~
##  2 JP Morgan Tour~ Men's    2020-01-17      F        64     3 Mohamed~ Tarek M~
##  3 JP Morgan Tour~ Men's    2020-01-17      F        64     2 Mohamed~ Tarek M~
##  4 JP Morgan Tour~ Men's    2020-01-17      F        64     1 Mohamed~ Tarek M~
##  5 JP Morgan Tour~ Women's  2020-01-17      F        62     3 Camille~ Nour El~
##  6 JP Morgan Tour~ Women's  2020-01-17      F        62     2 Camille~ Nour El~
##  7 JP Morgan Tour~ Women's  2020-01-17      F        62     1 Camille~ Nour El~
##  8 JP Morgan Tour~ Men's    2020-01-17      SF       63     5 Tarek M~ Ali Far~
##  9 JP Morgan Tour~ Men's    2020-01-17      SF       63     4 Tarek M~ Ali Far~
## 10 JP Morgan Tour~ Men's    2020-01-17      SF       63     3 Tarek M~ Ali Far~
## # ... with 378 more rows, and 7 more variables: game_winner <chr>,
## #   player_1_score <dbl>, player_2_score <dbl>, player_1_seed <dbl>,
## #   player_2_seed <dbl>, player_1_nationality <chr>, player_2_nationality <chr>

 
Let’s compare the consistency of players’ performances in the men’s tournament. We can ask: by what point margin did players win or lose, on average, throughout this tournament?
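
A rough sketch of how such averages could be computed from game-level data follows. It relies on the column names shown in the output above (player_1, player_2, game_winner, player_1_score, player_2_score) and assumes that game_winner holds the same full name that appears in player_1 or player_2. The helper average_margins() is hypothetical, and the scores in toy_games are made-up placeholders, not real results.

```r
# Hypothetical helper: average point margin per player, in wins and in losses.
average_margins <- function(games) {
  margin <- abs(games$player_1_score - games$player_2_score)
  # The loser of each game is whichever player is not the winner.
  loser <- ifelse(games$game_winner == games$player_1,
                  games$player_2, games$player_1)
  mean_or_na <- function(x) if (length(x) > 0) mean(x) else NA_real_
  players <- sort(unique(c(games$player_1, games$player_2)))
  data.frame(
    player = players,
    avg_margin_in_wins = vapply(
      players, function(p) mean_or_na(margin[games$game_winner == p]),
      numeric(1), USE.NAMES = FALSE
    ),
    avg_margin_in_losses = vapply(
      players, function(p) mean_or_na(margin[loser == p]),
      numeric(1), USE.NAMES = FALSE
    ),
    stringsAsFactors = FALSE
  )
}

# Made-up placeholder games (NOT real results), just to exercise the helper.
toy_games <- data.frame(
  player_1       = c("Player A", "Player A", "Player A"),
  player_2       = c("Player B", "Player B", "Player B"),
  game_winner    = c("Player A", "Player B", "Player A"),
  player_1_score = c(11, 9, 11),
  player_2_score = c(7, 11, 5),
  stringsAsFactors = FALSE
)

margins <- average_margins(toy_games)
margins
```

With the real data, you would first filter toc down to the men's category and then pass it to the helper. A small average margin in losses is what I mean by a strong "defensive" showing below.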


If we only look at the quarter-finalists, we can see that most were able to defeat their opponents with a comfortable lead of 4 or 5 points.


The averages for Karim Abdel Gawad, Simon Rösner, and Paul Coll are particularly high. This reflects how they handily defeated many of their opponents up to the third round of the tournament.

Conversely, two players stand out from a “defensive” perspective. Simon Rösner lost only 3 games, all by a 2- or 3-point margin, in a five-game defeat to Karim Abdel Gawad. Meanwhile, Mohamed Elshorbagy, the eventual champion, lost only one game. Tarek Momen took a game off Elshorbagy in the final match, but only by a margin of 2 points. Elshorbagy certainly came to play in this tournament!
 

Conclusion

In this post, I’ve shown only a small selection of the functions provided in squashinformr. Install the package and try it out yourself! There is a huge amount of data on this excellent sport that has yet to be explored. Cheers!

The full code for this post is available here.
 

Help

If you find a bug, please submit an issue on GitHub.

If you are interested in helping me extend the functionality of this package, fork the repository, make changes, and submit them as a pull request. The squashinformr project is released with a Contributor Code of Conduct. By contributing to this project, you agree to its terms.
 

Disclaimer

SquashInfo is a valuable resource for the international squash community. By creating and sharing this package, I do not intend to compete with SquashInfo or any of its stakeholders. The squashinformr package was created to allow individuals to access data from SquashInfo in an efficient and responsible way, using polite principles. SquashInfo currently offers full access to their data and extra features through a premium membership. Please consider signing up and subscribing to SquashInfo to support their work.
 

License

The squashinformr package is released under a GPL-3 license.