Earlier this year, I enrolled in a Data Management course as a part of my MSc in Business Analytics. The final project in this course involved presenting a business case for a database and building a MySQL database as a solution. As a squash enthusiast, I chose to base my project on the Professional Squash Association (PSA).
Long story short, I did not complete this project because it was cancelled due to the spread of COVID-19. Unfortunately, I had already put approximately 20 hours of coding into the project. So, being as stubborn as I am, I committed to reusing the code by building an R package. The result is
squashinformr: a package for scraping player, tournament, and ranking data from SquashInfo.
Politeness on the web
squashinformr is built on the
polite package and, therefore, inherits its principles of “being nice on the web”. In short, web scraping is a data collection method that can be harmful to websites and their users, if done carelessly. Therefore, it is important to use our manners:
- Seek permission to scrape before you begin.
- Take slowly, so as to not be mistaken for a DDoS attack.
- Never ask twice by using memoization.
Adhering to polite principles incurs mandatory delays (set by SquashInfo’s robots.txt file) on the scraping process. Therefore, it is important that users are patient when using
squashinformr. SquashInfo currently offers full access to their data and extra features through a premium membership. Please consider signing up and/or subscribing to SquashInfo to support their work.
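To make “take slowly” and “never ask twice” concrete, here is a minimal base-R sketch of throttling and memoization. It is not how polite implements them: `make_polite_fetcher()`, `slow_fetch()`, and the 0.2-second delay are hypothetical names and values chosen for illustration, and the “request” is a stand-in function rather than a real HTTP call.

```r
# Hypothetical helper: wraps any fetch function so that
#   1. results are cached by key (never ask twice), and
#   2. uncached calls wait `delay` seconds first (take slowly).
make_polite_fetcher <- function(fetch, delay = 0.2) {
  cache <- new.env(parent = emptyenv())
  function(key) {
    if (exists(key, envir = cache, inherits = FALSE)) {
      return(get(key, envir = cache))   # served from the cache, no delay
    }
    Sys.sleep(delay)                    # mandatory pause before a new request
    value <- fetch(key)
    assign(key, value, envir = cache)
    value
  }
}

# Stand-in for a real request: counts how often it is actually called.
n_requests <- 0
slow_fetch <- function(path) {
  n_requests <<- n_requests + 1
  paste("content of", path)
}

polite_fetch <- make_polite_fetcher(slow_fetch)
first  <- polite_fetch("/rankings")  # waits, then "requests" the page
second <- polite_fetch("/rankings")  # returned from the cache
```

After both calls, `n_requests` is still 1: the second call never reaches the stand-in server. In a real scraper the delay would come from the site’s robots.txt rather than a hard-coded value.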
Install squashinformr via CRAN:

    install.packages("squashinformr")

Install the development version of squashinformr from its GitHub repository via:

    if (!require("remotes")) install.packages("remotes")
    remotes::install_github("HaydenMacDonald/squashinformr")

Functions in v0.1.2
squashinformr functions fall into one of three families:
- Player functions for scraping player profile data
- Ranking functions for scraping current and historical rankings tables
- Tournament functions for scraping tournament results data
Let’s take a closer look at some of these functions in action…
Use get_players() to extract biographical information on players currently ranked within the top 500. Here, we’ll just scrape the current top 25 men’s players.

    library(squashinformr)
    (top_25_men <- get_players(top = 25, category = "mens"))

    ## # A tibble: 25 x 15
    ##     rank first last    age gender birthplace nationality residence height weight
    ##    <dbl> <chr> <chr> <dbl> <chr>  <chr>      <chr>       <chr>      <dbl>  <dbl>
    ##  1     1 Moha~ Elsh~    29 Male   Alexandria Egypt       Bristol,~    185     82
    ##  2     2 Ali   Farag    28 Male   Cairo      Egypt       Cairo        183     70
    ##  3     3 Karim Abde~    28 Male   Alexandria Egypt       Giza         173     72
    ##  4     4 Tarek Momen    32 Male   Cairo      Egypt       Cairo        174     66
    ##  5     5 Paul  Coll     27 Male   Greymouth  New Zealand Amsterda~    179     83
    ##  6     6 Diego Elias    23 Male   Lima       Peru        Lima         188     82
    ##  7     7 Marw~ Elsh~    26 Male   Alexandria Egypt       Bristol,~    183     73
    ##  8     8 Simon Rösn~    32 Male   Wurzburg   Germany     Paderborn    190     89
    ##  9     9 Migu~ Rodr~    34 Male   Bogota     Colombia    Bogota       170     72
    ## 10    10 Joel  Makin    25 Male   Haverford~ Wales       Birmingh~    180     80
    ## # ... with 15 more rows, and 5 more variables: plays <chr>, racket <chr>,
    ## #   joined_psa <dbl>, university <chr>, club <chr>

If you’re familiar with squash, you might expect players who are younger, taller, or lighter on their feet to have a competitive advantage.
The data returned from SquashInfo seem to imply that this is not the case! Age, height, weight, and BMI do not seem to have any clear relationship with PSA rankings. There are many factors that contribute to a player’s success, including endurance, positioning, accuracy, and play-style.
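One way to check a claim like this is to compute BMI from the scraped columns and correlate each attribute with rank. The sketch below uses only the ten rows printed above rather than the full top 25, and it assumes that height is in centimetres and weight in kilograms (the output does not state units). `top_10_men` and `rank_cor` are illustrative names.

```r
# First ten rows of the printed top_25_men output
# (assumed units: height in cm, weight in kg).
top_10_men <- data.frame(
  rank   = 1:10,
  age    = c(29, 28, 28, 32, 27, 23, 26, 32, 34, 25),
  height = c(185, 183, 173, 174, 179, 188, 183, 190, 170, 180),
  weight = c(82, 70, 72, 66, 83, 82, 73, 89, 72, 80)
)

# BMI = weight (kg) / height (m)^2
top_10_men$bmi <- top_10_men$weight / (top_10_men$height / 100)^2

# Spearman rank correlation of each attribute with PSA rank.
# Values near 0 suggest no clear monotonic relationship.
attributes <- c("age", "height", "weight", "bmi")
rank_cor <- sapply(attributes, function(v) {
  cor(top_10_men$rank, top_10_men[[v]], method = "spearman")
})
round(rank_cor, 2)
```

With so few rows the coefficients are noisy, so treat this as a template for the full data set rather than evidence on its own.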
Recent match data
Use get_player_recent_matches() to retrieve a player’s recent match data (from January 1, 2019 onward). Let’s get Raneem El Welily’s recent match data.

    (welily <- get_player_recent_matches(player = "Raneem El Welily", category = "womens"))

    ## # A tibble: 71 x 12
    ##     rank player opponent result games_won games_lost match_time round date
    ##    <int> <chr>  <chr>    <chr>      <dbl>      <dbl>      <dbl> <chr> <date>
    ##  1     1 Ranee~ Nour El~ L              2          3         72 QF    2020-03-01
    ##  2     1 Ranee~ Yathreb~ W             NA         NA         NA R3    2020-03-01
    ##  3     1 Ranee~ Alexand~ W              3          1         36 R2    2020-03-01
    ##  4     1 Ranee~ Nour El~ L              2          3         59 F     2020-03-01
    ##  5     1 Ranee~ Nouran ~ W              3          0         31 SF    2020-03-01
    ##  6     1 Ranee~ Olivia ~ W              3          0         29 QF    2020-03-01
    ##  7     1 Ranee~ Salma H~ W              3          0         22 R3    2020-03-01
    ##  8     1 Ranee~ Juliann~ W              3          0         26 R2    2020-03-01
    ##  9     1 Ranee~ Nouran ~ W              1          0         15 F     2020-02-01
    ## 10     1 Ranee~ Hania E~ W              3          0         40 SF    2020-02-01
    ## # ... with 61 more rows, and 3 more variables: event <chr>, country <chr>,
    ## #   psa <chr>

El Welily’s 2019-2020 match data exemplify her general dominance of women’s competitions. She has a 5.45 win/loss ratio across the tournaments she participated in, and 65% of her match wins are 3-0s. Even when she does lose a match, she does not make it easy for her opponents: her median match times in losses are longer regardless of the number of games played, although the sample of El Welily’s losses is small.
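Summary figures like these can be derived from the `result`, `games_won`, `games_lost`, and `match_time` columns. The following sketch uses only the ten rows printed above, so its numbers differ from the full 71-match figures. The variable names are illustrative, and treating `match_time` as minutes is an assumption.

```r
# First ten rows of the printed welily output; NA marks a match with no game scores.
recent <- data.frame(
  result     = c("L", "W", "W", "L", "W", "W", "W", "W", "W", "W"),
  games_won  = c(2, NA, 3, 2, 3, 3, 3, 3, 1, 3),
  games_lost = c(3, NA, 1, 3, 0, 0, 0, 0, 0, 0),
  match_time = c(72, NA, 36, 59, 31, 29, 22, 26, 15, 40)
)

wins   <- sum(recent$result == "W")
losses <- sum(recent$result == "L")
win_loss_ratio <- wins / losses

# Share of wins that were 3-0 sweeps (rows with missing scores are not counted as sweeps).
is_sweep <- recent$result == "W" & recent$games_won == 3 & recent$games_lost == 0
pct_sweeps <- 100 * sum(is_sweep, na.rm = TRUE) / wins

# Median match time (assumed to be minutes) by result, ignoring missing times.
median_time <- tapply(recent$match_time, recent$result, median, na.rm = TRUE)
```

On these ten rows the win/loss ratio is 4 (8 wins, 2 losses), 62.5% of wins are 3-0, and the median match time is 65.5 in losses versus 29 in wins.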
Given El Welily’s general dominance, how does she play versus the current #2, Nouran Gohar? We can use
get_matchup() to get a quantitative take on their head-to-head performance.
welily_v_gohar <- get_matchup("Raneem El Welily", "Nouran Gohar", category = "womens")
El Welily currently holds a five-match lead over Gohar since January 1, 2019, and a majority of her wins over Gohar have been decisive (3-1 or 3-0).
Raneem El Welily v Nouran Gohar
Head-to-Head Game Statistics, 2019-2020

| Statistic | Value |
|---|---|
| Player 1 Games Won | 20 |
| Player 2 Games Won | 11 |
| Player 1 Avg Advantage | 4.75 |
| Player 2 Avg Advantage | 4.73 |
| Avg Point Diff | 4.74 |
| Player 1 Tiebreak Wins | 2 |
| Player 2 Tiebreak Wins | 1 |
| Pct Games Tiebreak | 6.25 |

But game-level data show that Gohar can be equally dominant in any individual game. Their approximately equal average point advantages reflect this. So it may be El Welily’s consistency across games that allows her to win more matches against Gohar.
The PSA rankings are the definitive international rankings that decide which players are the best in the world. They have a huge influence on how tournament organizers seed individual players. Using
get_rankings(), let’s scrape the top 15 players from the most recent rankings tables (April 2020 at time of writing) for both competition categories.
apr <- get_rankings(top = 15, category = "both")
Men's PSA Rankings (April 2020)

| Rank | Previous Rank | Change | Name | Highest Ranking | Highest Ranking Date | Country |
|---|---|---|---|---|---|---|
| 3 | 4 | 1 | Karim Abdel Gawad | 1 | 2017-05-01 | EGY |

Women's PSA Rankings (April 2020)

| Rank | Previous Rank | Change | Name | Highest Ranking | Highest Ranking Date | Country |
|---|---|---|---|---|---|---|
| 1 | 1 | - | Raneem El Welily | 1 | 2015-09-01 | EGY |
| 3 | 4 | 1 | Nour El Sherbini | 1 | 2016-05-01 | EGY |
| 5 | 5 | - | Nour El Tayeb | 3 | 2018-02-01 | EGY |
| 7 | 10 | 3 | Hania El Hammamy | 7 | 2020-04-01 | EGY |
| 13 | 16 | 3 | Olivia Blatchford Clyne | 12 | 2017-12-01 | USA |

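The Change column in these tables is the previous rank minus the current rank. As a small sketch, the women’s rows shown above can be rebuilt by hand to recompute it and pick out the climbers. The column names `rank`, `previous_rank`, and `change` are assumptions here and may not match what `get_rankings()` actually returns.

```r
# Rows taken from the women's table above; column names are assumed.
womens <- data.frame(
  name          = c("Raneem El Welily", "Nour El Sherbini", "Nour El Tayeb",
                    "Hania El Hammamy", "Olivia Blatchford Clyne"),
  rank          = c(1, 3, 5, 7, 13),
  previous_rank = c(1, 4, 5, 10, 16),
  stringsAsFactors = FALSE
)

# Positive change = moved up the rankings (a smaller rank number is better).
womens$change <- womens$previous_rank - womens$rank

# Players who climbed, biggest movers first.
climbers <- womens[womens$change > 0, ]
climbers <- climbers[order(-climbers$change), ]
```

Here `climbers` holds El Hammamy and Blatchford Clyne (both up three places) followed by El Sherbini (up one).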
get_rankings() is useful for analyzing how player rankings have changed recently. Alternatively, we can look back on individual players’ ranking histories with get_player_rankings_history():
ph <- get_player_rankings_history(rank = 1:4, category = "womens")
This is excellent data to use when discussing the storylines of the sport. In this example, the current top 4 women’s players all began their careers before 2012. Since the end of 2014, they have all ranked within the top 20. We can see that Nouran Gohar had a meteoric rise through the rankings after her first PSA ranking in January 2011. The following graphic shows the last year of these players’ ranking histories.
How will these rankings shift over the next couple of years?
squashinformr provides useful functions for pulling tournament data at many levels of detail. This includes tournament metadata, players, matches, and games.
The most granular level is the individual games of a tournament, accessed via
get_tournament_games(). Let’s use this function to look at the JP Morgan Tournament of Champions, one of the premier tournaments in squash.

    ## Return game data for 2020's Tournament of Champions.
    (toc <- get_tournament_games("tournament of champions", year = 2020))

    ## # A tibble: 388 x 15
    ##    tournament_name category tournament_date round match  game player_1 player_2
    ##    <chr>           <chr>    <date>          <ord> <int> <int> <chr>    <chr>
    ##  1 JP Morgan Tour~ Men's    2020-01-17      F        64     4 Mohamed~ Tarek M~
    ##  2 JP Morgan Tour~ Men's    2020-01-17      F        64     3 Mohamed~ Tarek M~
    ##  3 JP Morgan Tour~ Men's    2020-01-17      F        64     2 Mohamed~ Tarek M~
    ##  4 JP Morgan Tour~ Men's    2020-01-17      F        64     1 Mohamed~ Tarek M~
    ##  5 JP Morgan Tour~ Women's  2020-01-17      F        62     3 Camille~ Nour El~
    ##  6 JP Morgan Tour~ Women's  2020-01-17      F        62     2 Camille~ Nour El~
    ##  7 JP Morgan Tour~ Women's  2020-01-17      F        62     1 Camille~ Nour El~
    ##  8 JP Morgan Tour~ Men's    2020-01-17      SF       63     5 Tarek M~ Ali Far~
    ##  9 JP Morgan Tour~ Men's    2020-01-17      SF       63     4 Tarek M~ Ali Far~
    ## 10 JP Morgan Tour~ Men's    2020-01-17      SF       63     3 Tarek M~ Ali Far~
    ## # ... with 378 more rows, and 7 more variables: game_winner <chr>,
    ## #   player_1_score <dbl>, player_2_score <dbl>, player_1_seed <dbl>,
    ## #   player_2_seed <dbl>, player_1_nationality <chr>, player_2_nationality <chr>

Let’s compare the consistency of players’ performances in the men’s tournament. We can ask: by what point margin did players win or lose, on average, throughout this tournament?
If we only look at the quarter-finalists, we can see that most were able to defeat their opponents with a comfortable lead of 4 or 5 points.
The averages for Karim Abdel Gawad, Simon Rösner, and Paul Coll are particularly high. This reflects how they handily defeated many of their opponents up to the third round of the tournament.
Conversely, two players stand out from a “defensive” perspective. Simon Rösner lost only 3 games, all by a 2- or 3-point margin, in a five-game defeat to Karim Abdel Gawad. Meanwhile, Mohamed Elshorbagy, the eventual champion, lost only one game. Tarek Momen took a game off Elshorbagy in the final match, but only by a margin of 2 points. Elshorbagy certainly came to play in this tournament!
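Average margins like the ones discussed above can be computed from game-level data along the following lines. This is a sketch, not the code behind this post: the column names come from the printed `toc` output, but the players and scores are invented placeholders, and `long` and `avg_margin` are illustrative names.

```r
# Hypothetical game-level rows using the column names printed in the toc output.
# Players and scores are invented for illustration only.
games <- data.frame(
  player_1       = c("Player A", "Player A", "Player A", "Player C"),
  player_2       = c("Player B", "Player B", "Player B", "Player A"),
  player_1_score = c(11, 9, 11, 8),
  player_2_score = c(7, 11, 5, 11),
  game_winner    = c("Player A", "Player B", "Player A", "Player A"),
  stringsAsFactors = FALSE
)

# Reshape to one row per player per game, with a signed point margin.
long <- data.frame(
  player = c(games$player_1, games$player_2),
  margin = c(games$player_1_score - games$player_2_score,
             games$player_2_score - games$player_1_score),
  won    = c(games$game_winner == games$player_1,
             games$game_winner == games$player_2),
  stringsAsFactors = FALSE
)

# Average margin in games won and in games lost, per player.
avg_margin <- aggregate(margin ~ player + won, data = long, FUN = mean)
```

Each player appears once per game regardless of whether they were listed as `player_1` or `player_2`, so wins and losses are averaged from that player’s point of view.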
In this post, I’ve shown only a small selection of the functions provided in
squashinformr. Install the package and try it out yourself! There is a huge amount of data on this excellent sport that has yet to be explored. Cheers!
The full code for this post is available here.
If you find a bug, please submit an issue on GitHub.
If you are interested in helping me extend the functionality of this package, fork the repository, make changes, and submit them as a pull request. The
squashinformr project is released with a Contributor Code of Conduct. By contributing to this project, you agree to its terms.
SquashInfo is a valuable resource for the international squash community. By creating and sharing this package, I do not intend to compete with SquashInfo or any of its stakeholders. The
squashinformr package was created to allow individuals to access data from SquashInfo in an efficient and responsible way, using
polite principles. SquashInfo currently offers full access to their data and extra features through a premium membership. Please consider signing up and subscribing to SquashInfo to support their work.
The squashinformr package is released under a GPL-3 license.