Earlier this year, I enrolled in a Data Management course as a part of my MSc in Business Analytics. The final project in this course involved presenting a business case for a database and building a MySQL database as a solution. As a squash enthusiast, I chose to base my project on the Professional Squash Association (PSA).
Long story short, I did not complete this project because it was cancelled due to the spread of COVID-19. Unfortunately, I had already put approximately 20 hours of coding into the project. So, being as stubborn as I am, I committed to reusing the code by building an R package. The result is squashinformr: a package for scraping player, tournament, and ranking data from SquashInfo.
Politeness on the web
squashinformr is built on the polite package and, therefore, inherits its principles of “being nice on the web”. In short, web scraping is a data collection method that can be harmful to websites and their users, if done carelessly. Therefore, it is important to use our manners:
- Seek permission to scrape before you begin.
- Take slowly, so as to not be mistaken for a DDoS attack.
- Never ask twice by using memoization.
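To make the last two principles concrete, here is a minimal base R sketch of a fetcher that waits between requests and caches what it has already retrieved. It is an illustration only, not how polite is implemented: make_polite_fetcher(), fake_fetch(), and the example URLs are hypothetical names, and the stand-in fetch function never touches the network.

```r
# Wrap any fetch function so that repeated URLs are served from a cache
# ("never ask twice") and new requests are spaced out ("take slowly").
make_polite_fetcher <- function(fetch, delay = 5) {
  cache <- new.env(parent = emptyenv())
  last_request <- NULL

  function(url) {
    # Cached already: return it without making a new request
    if (exists(url, envir = cache, inherits = FALSE)) {
      return(get(url, envir = cache))
    }
    # Otherwise wait until `delay` seconds have passed since the last request
    if (!is.null(last_request)) {
      elapsed <- as.numeric(difftime(Sys.time(), last_request, units = "secs"))
      if (elapsed < delay) Sys.sleep(delay - elapsed)
    }
    result <- fetch(url)
    last_request <<- Sys.time()
    assign(url, result, envir = cache)
    result
  }
}

# Hypothetical stand-in for a real HTTP request; it only counts its calls.
calls <- 0
fake_fetch <- function(url) {
  calls <<- calls + 1
  paste("page for", url)
}

fetch_page <- make_polite_fetcher(fake_fetch, delay = 0.2)
first  <- fetch_page("https://example.com/rankings")
again  <- fetch_page("https://example.com/rankings")  # served from the cache
second <- fetch_page("https://example.com/players")   # waits, then fetches
```

In practice you do not need to write any of this yourself; the sketch is only meant to show why repeated calls can return instantly while new ones take a few seconds.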
Following polite principles incurs mandatory delays (set by SquashInfo’s robots.txt file) on the scraping process. Therefore, it is important that users are patient when using squashinformr. SquashInfo currently offers full access to their data and extra features through a premium membership. Please consider signing up and/or subscribing to SquashInfo to support their work.
Installation
Install squashinformr via CRAN:
install.packages("squashinformr")
Install the development version of squashinformr from its GitHub repository via:
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("HaydenMacDonald/squashinformr")
Functions in v0.1.2
Generally, squashinformr functions fall into one of three families:
- Player functions for scraping player profile data
  - get_players()
  - get_player_recent_results()
  - get_player_recent_matches()
  - get_player_recent_games()
  - get_player_rankings_history()
  - get_matchup()
- Ranking functions for scraping current and historical rankings tables
  - get_rankings()
  - get_historical_rankings()
- Tournament functions for scraping tournament results data
  - get_tournaments()
  - get_tournament_players()
  - get_tournament_matches()
  - get_tournament_games()
Let’s take a closer look at some of these functions in action…
Player functions
Use get_players() to extract biographical information on players currently ranked within the top 500. Here, we’ll just scrape the current top 25 men’s players.
library(squashinformr)
(top_25_men <- get_players(top = 25, category = "mens"))
## # A tibble: 25 x 15
## rank first last age gender birthplace nationality residence height weight
## <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 1 Moha~ Elsh~ 29 Male Alexandria Egypt Bristol,~ 185 82
## 2 2 Ali Farag 28 Male Cairo Egypt Cairo 183 70
## 3 3 Karim Abde~ 28 Male Alexandria Egypt Giza 173 72
## 4 4 Tarek Momen 32 Male Cairo Egypt Cairo 174 66
## 5 5 Paul Coll 27 Male Greymouth New Zealand Amsterda~ 179 83
## 6 6 Diego Elias 23 Male Lima Peru Lima 188 82
## 7 7 Marw~ Elsh~ 26 Male Alexandria Egypt Bristol,~ 183 73
## 8 8 Simon Rösn~ 32 Male Wurzburg Germany Paderborn 190 89
## 9 9 Migu~ Rodr~ 34 Male Bogota Colombia Bogota 170 72
## 10 10 Joel Makin 25 Male Haverford~ Wales Birmingh~ 180 80
## # ... with 15 more rows, and 5 more variables: plays <chr>, racket <chr>,
## # joined_psa <dbl>, university <chr>, club <chr>
If you’re familiar with squash, you might be under the impression that players who are younger, taller, or lighter on their feet have a competitive advantage.
The data returned from SquashInfo seem to imply that this is not the case! Age, height, weight, and BMI do not seem to have any clear relationship with PSA rankings. There are many factors that contribute to a player’s success, including endurance, positioning, accuracy, and play-style.
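One rough way to check a claim like this is a rank correlation between each attribute and PSA rank. The sketch below uses only the ten rows printed above, typed in by hand, so it is not the analysis behind the statement about the full top 25. It assumes height is in centimetres and weight in kilograms, and the BMI formula is the standard one rather than anything provided by the package.

```r
# First ten rows of the get_players() output printed above
top_10 <- data.frame(
  rank   = 1:10,
  age    = c(29, 28, 28, 32, 27, 23, 26, 32, 34, 25),
  height = c(185, 183, 173, 174, 179, 188, 183, 190, 170, 180),  # cm (assumed)
  weight = c(82, 70, 72, 66, 83, 82, 73, 89, 72, 80)             # kg (assumed)
)

# BMI = weight (kg) / height (m)^2
top_10$bmi <- top_10$weight / (top_10$height / 100)^2

# Spearman rank correlation of each attribute with PSA rank
rank_cor <- sapply(
  top_10[c("age", "height", "weight", "bmi")],
  function(x) cor(x, top_10$rank, method = "spearman")
)
round(rank_cor, 2)
```

With only ten players, any coefficient here is noisy; the same code applied to the full top_25_men tibble would be a fairer test.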
Recent match data
Use get_player_recent_matches() to retrieve a player’s recent match data (since January 1, 2019). Let’s get Raneem El Welily’s recent match data.
(welily <- get_player_recent_matches(player = "Raneem El Welily", category = "womens"))
## # A tibble: 71 x 12
## rank player opponent result games_won games_lost match_time round date
## <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <date>
## 1 1 Ranee~ Nour El~ L 2 3 72 QF 2020-03-01
## 2 1 Ranee~ Yathreb~ W NA NA NA R3 2020-03-01
## 3 1 Ranee~ Alexand~ W 3 1 36 R2 2020-03-01
## 4 1 Ranee~ Nour El~ L 2 3 59 F 2020-03-01
## 5 1 Ranee~ Nouran ~ W 3 0 31 SF 2020-03-01
## 6 1 Ranee~ Olivia ~ W 3 0 29 QF 2020-03-01
## 7 1 Ranee~ Salma H~ W 3 0 22 R3 2020-03-01
## 8 1 Ranee~ Juliann~ W 3 0 26 R2 2020-03-01
## 9 1 Ranee~ Nouran ~ W 1 0 15 F 2020-02-01
## 10 1 Ranee~ Hania E~ W 3 0 40 SF 2020-02-01
## # ... with 61 more rows, and 3 more variables: event <chr>, country <chr>,
## # psa <chr>
El Welily’s 2019-2020 match data exemplify her general dominance of women’s competitions. She has a 5.45 win/loss ratio across the tournaments she participated in, and 65% of her match wins are 3-0s. Even when she does lose a match, she does not make it easy for her opponents. Her median match times in losses are longer regardless of the number of games played, although the sample size for El Welily’s losses is small.
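Summaries like these can be computed directly from the result, games_won, games_lost, and match_time columns. The sketch below uses only the ten rows printed above, typed in by hand, so its numbers will differ from the figures quoted for all 71 matches. Counting the win with missing scores in the denominator is my assumption, not necessarily how the original figures were calculated.

```r
# First ten rows of the get_player_recent_matches() output printed above
recent <- data.frame(
  result     = c("L", "W", "W", "L", "W", "W", "W", "W", "W", "W"),
  games_won  = c(2, NA, 3, 2, 3, 3, 3, 3, 1, 3),
  games_lost = c(3, NA, 1, 3, 0, 0, 0, 0, 0, 0),
  match_time = c(72, NA, 36, 59, 31, 29, 22, 26, 15, 40)
)

# Win/loss ratio
wins   <- sum(recent$result == "W")
losses <- sum(recent$result == "L")
win_loss_ratio <- wins / losses

# Share of wins that were 3-0 (the win with missing scores is not a sweep)
is_sweep   <- recent$result == "W" & recent$games_won == 3 & recent$games_lost == 0
pct_sweeps <- 100 * sum(is_sweep, na.rm = TRUE) / wins

# Median match time (minutes) for losses and wins
median_time <- tapply(recent$match_time, recent$result, median, na.rm = TRUE)
```

To reproduce the full-sample figures, the same three steps could be run on the welily tibble returned above.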
Given El Welily’s general dominance, how does she play versus the current #2, Nouran Gohar? We can use get_matchup() to get a quantitative take on their head-to-head performance.
welily_v_gohar <- get_matchup("Raneem El Welily", "Nouran Gohar", category = "womens")
El Welily currently holds a lead on Gohar by five matches since January 1, 2019, and a majority of her wins over Gohar have been decisive (3-1 or 3-0).
Raneem El Welily v Nouran Gohar: Head-to-Head Game Statistics, 2019-2020

Metric | Value |
---|---|
Games Played | 32 |
Player 1 Games Won | 20 |
Player 2 Games Won | 11 |
Player 1 Avg Advantage | 4.75 |
Player 2 Avg Advantage | 4.73 |
Avg Point Diff | 4.74 |
Player 1 Tiebreak Wins | 2 |
Player 2 Tiebreak Wins | 1 |
Pct Games Tiebreak | 6.25 |
But game-level data show that Gohar can be equally dominant in any individual game. Their approximately equal average point advantages reflect this. So it may be El Welily’s consistency across games that allows her to win more matches over Gohar.
Ranking functions
The PSA Rankings are the definitive international rankings that decide which players are the best in the world. They have a huge influence on how tournament organizers seed individual players. Using get_rankings(), let’s scrape the top 15 players from the most recent rankings tables (April 2020 at time of writing) for both competition categories.
apr <- get_rankings(top = 15, category = "both")
Men's PSA Rankings (April 2020)

Rank | Previous Rank | Change | Name | Highest Ranking | Highest Ranking Date | Country |
---|---|---|---|---|---|---|
1 | 1 | — | Mohamed Elshorbagy | 1 | 2014-11-01 | EGY |
2 | 2 | — | Ali Farag | 1 | 2019-03-01 | EGY |
3 | 4 | 1 | Karim Abdel Gawad | 1 | 2017-05-01 | EGY |
4 | 3 | −1 | Tarek Momen | 3 | 2019-02-01 | EGY |
5 | 5 | — | Paul Coll | 5 | 2019-04-01 | NZL |
6 | 6 | — | Diego Elias | 6 | 2020-01-01 | PER |
7 | 8 | 1 | Marwan Elshorbagy | 3 | 2018-05-01 | EGY |
8 | 7 | −1 | Simon Rösner | 3 | 2018-12-01 | GER |
9 | 9 | — | Miguel Rodriguez | 4 | 2015-06-01 | COL |
10 | 10 | — | Joel Makin | 10 | 2020-03-01 | WAL |
11 | 11 | — | Mohamed Abouelghar | 7 | 2019-06-01 | EGY |
12 | 12 | — | Fares Dessouky | 8 | 2017-11-01 | EGY |
13 | 13 | — | Saurav Ghosal | 10 | 2019-04-01 | IND |
14 | 14 | — | Mazen Hesham | 13 | 2015-12-01 | EGY |
15 | 17 | 2 | Omar Mosaad | 3 | 2016-06-01 | EGY |
Women's PSA Rankings (April 2020)

Rank | Previous Rank | Change | Name | Highest Ranking | Highest Ranking Date | Country |
---|---|---|---|---|---|---|
1 | 1 | — | Raneem El Welily | 1 | 2015-09-01 | EGY |
2 | 2 | — | Nouran Gohar | 2 | 2017-01-01 | EGY |
3 | 4 | 1 | Nour El Sherbini | 1 | 2016-05-01 | EGY |
4 | 3 | −1 | Camille Serme | 2 | 2017-02-01 | FRA |
5 | 5 | — | Nour El Tayeb | 3 | 2018-02-01 | EGY |
6 | 7 | 1 | Sarah-Jane Perry | 6 | 2017-09-01 | ENG |
7 | 10 | 3 | Hania El Hammamy | 7 | 2020-04-01 | EGY |
8 | 8 | — | Amanda Sobhy | 6 | 2016-10-01 | USA |
9 | 6 | −3 | Joelle King | 3 | 2019-02-01 | NZL |
10 | 9 | −1 | Tesni Evans | 9 | 2018-11-01 | WAL |
11 | 12 | 1 | Joshna Chinappa | 10 | 2016-07-01 | IND |
12 | 13 | 1 | Salma Hany | 12 | 2019-03-01 | EGY |
13 | 16 | 3 | Olivia Blatchford Clyne | 12 | 2017-12-01 | USA |
14 | 15 | 1 | Yathreb Adel | 14 | 2020-04-01 | EGY |
15 | 14 | −1 | Alison Waters | 3 | 2010-10-01 | ENG |
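The Change column in these tables is simply previous rank minus current rank, so a positive value means a player moved up. As a small sketch, the women’s table can be rebuilt by hand and sorted to find the biggest movers. The column names name, rank, and previous_rank are my own labels for this example, not necessarily the names returned by get_rankings().

```r
# Women's top 15 from the table above, typed in by hand
womens <- data.frame(
  name = c("Raneem El Welily", "Nouran Gohar", "Nour El Sherbini",
           "Camille Serme", "Nour El Tayeb", "Sarah-Jane Perry",
           "Hania El Hammamy", "Amanda Sobhy", "Joelle King",
           "Tesni Evans", "Joshna Chinappa", "Salma Hany",
           "Olivia Blatchford Clyne", "Yathreb Adel", "Alison Waters"),
  rank          = 1:15,
  previous_rank = c(1, 2, 4, 3, 5, 7, 10, 8, 6, 9, 12, 13, 16, 15, 14)
)

# Positive change = moved up the rankings, negative = moved down
womens$change <- womens$previous_rank - womens$rank

# Three largest moves in either direction
movers <- womens[order(-abs(womens$change)), ][1:3, c("name", "rank", "change")]
movers
```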
get_rankings() is useful for analyzing how player rankings have changed recently. Alternatively, we can look back on an individual player’s ranking history with get_player_rankings_history().
ph <- get_player_rankings_history(rank = 1:4, category = "womens")
This is excellent data to use when discussing the storylines of the sport. In this example, the current top 4 women’s players all began their careers before 2012. Since the end of 2014, they have occupied the top 20 rankings. We can see that Nouran Gohar had a meteoric rise through the rankings after her first PSA ranking in January 2011. The following graphic shows the last year of these players’ ranking histories.
How will these rankings shift over the next couple of years?
Tournament functions
squashinformr provides useful functions for pulling tournament data at many levels of detail. This includes tournament metadata, players, matches, and games.
The lowest level of detail is the tournament games, accessed via get_tournament_games(). Let’s use this function to look at the JP Morgan Tournament of Champions, one of the premier tournaments in squash.
## Return game data for 2020's Tournament of Champions.
(toc <- get_tournament_games("tournament of champions", year = 2020))
## # A tibble: 388 x 15
## tournament_name category tournament_date round match game player_1 player_2
## <chr> <chr> <date> <ord> <int> <int> <chr> <chr>
## 1 JP Morgan Tour~ Men's 2020-01-17 F 64 4 Mohamed~ Tarek M~
## 2 JP Morgan Tour~ Men's 2020-01-17 F 64 3 Mohamed~ Tarek M~
## 3 JP Morgan Tour~ Men's 2020-01-17 F 64 2 Mohamed~ Tarek M~
## 4 JP Morgan Tour~ Men's 2020-01-17 F 64 1 Mohamed~ Tarek M~
## 5 JP Morgan Tour~ Women's 2020-01-17 F 62 3 Camille~ Nour El~
## 6 JP Morgan Tour~ Women's 2020-01-17 F 62 2 Camille~ Nour El~
## 7 JP Morgan Tour~ Women's 2020-01-17 F 62 1 Camille~ Nour El~
## 8 JP Morgan Tour~ Men's 2020-01-17 SF 63 5 Tarek M~ Ali Far~
## 9 JP Morgan Tour~ Men's 2020-01-17 SF 63 4 Tarek M~ Ali Far~
## 10 JP Morgan Tour~ Men's 2020-01-17 SF 63 3 Tarek M~ Ali Far~
## # ... with 378 more rows, and 7 more variables: game_winner <chr>,
## # player_1_score <dbl>, player_2_score <dbl>, player_1_seed <dbl>,
## # player_2_seed <dbl>, player_1_nationality <chr>, player_2_nationality <chr>
Let’s compare the consistency of a player’s performance in the men’s tournament. We can ask, by what point margin did players win or lose on average throughout this tournament?
If we only look at the quarter-finalists, we can see that most were able to defeat their opponents with a comfortable lead of 4 or 5 points.
The averages for Karim Abdel Gawad, Simon Rösner, and Paul Coll are particularly high. This reflects how they handily defeated many of their opponents up to the third round of the tournament.
Conversely, two players stand out from a “defensive” perspective. Simon Rösner only lost 3 games, all by a 2 or 3 point margin, during a five-game defeat to Karim Abdel Gawad. Meanwhile, Mohamed Elshorbagy, the eventual champion, lost only one game. Tarek Momen took a game off Elshorbagy in the final match, but only by a margin of 2 points. Elshorbagy certainly came to play in this tournament!
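For readers who want to compute average margins like these themselves, here is a minimal base R sketch. The column names match the get_tournament_games() output printed above, but the players and scores are made up for illustration and are not real tournament results. On the real toc data you would first keep only the men’s rows.

```r
# Made-up game-level data using the same column names as toc
games <- data.frame(
  player_1       = c("Player A", "Player A", "Player A", "Player B", "Player B"),
  player_2       = c("Player B", "Player B", "Player B", "Player C", "Player C"),
  game_winner    = c("Player A", "Player B", "Player A", "Player C", "Player C"),
  player_1_score = c(11, 9, 11, 7, 10),
  player_2_score = c(6, 11, 8, 11, 12),
  stringsAsFactors = FALSE
)

# One row per player per game, with the point margin from that player's side
long <- rbind(
  data.frame(player = games$player_1, winner = games$game_winner,
             margin = games$player_1_score - games$player_2_score),
  data.frame(player = games$player_2, winner = games$game_winner,
             margin = games$player_2_score - games$player_1_score)
)
long$outcome <- ifelse(long$player == long$winner, "won", "lost")

# Average margin per player, separately for games won and games lost
avg_margin <- aggregate(margin ~ player + outcome, data = long, FUN = mean)
avg_margin
```

Splitting by outcome keeps the “offensive” view (how comfortably a player wins games) apart from the “defensive” view (how narrowly they lose them).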
Conclusion
In this post, I’ve shown only a small selection of the functions provided in squashinformr. Install the package and try it out yourself! There is a huge amount of data on this excellent sport that has yet to be explored. Cheers!
The full code for this post is available here.
Help
If you find a bug, please submit an issue on GitHub.
If you are interested in helping me extend the functionality of this package, fork the repository, make changes, and submit them as a pull request. The squashinformr project is released with a Contributor Code of Conduct. By contributing to this project, you agree to its terms.
Disclaimer
SquashInfo is a valuable resource for the international squash community. By creating and sharing this package, I do not intend to compete with SquashInfo or any of its stakeholders. The squashinformr package was created to allow individuals to access data from SquashInfo in an efficient and responsible way, using polite principles. SquashInfo currently offers full access to their data and extra features through a premium membership. Please consider signing up and subscribing to SquashInfo to support their work.
License
The squashinformr package is released under a GPL-3 license.