In the early days of the pandemic, I wrote squashinformr, an R package for web scraping professional squash data. I'm quite proud of the package, as it was my first real effort in creating open source. But when I wrote the initial version, I committed many cardinal programming sins.
At the time, my approach to writing code was very "script-like". I kept code for my analyses in single Rmarkdown files that I would run from top to bottom. It's clear looking at old squashinformr code that I expected functions to behave the same way. That is, I wrote the functions as if they were individual scripts that had no common behaviour or tasks. In reality, some functions contained identical code except for a few substituted variables. In short, the code was a mess and vulnerable to bugs.
Since then, I've learned software development best practices with the help of a mentor at work. Feeling more confident about my ability to write stable code, I refactored squashinformr earlier this year. I'd like to break down examples of where I applied these best practices as if I were speaking to my past self. If you haven't heard of these rules, I recommend you apply some of them in your work. That way, you or someone else does not have to rewrite your code in the future.
Determine requirements by drawing a diagram
It's easy to get inspired about a project idea and want to dive straight into making it. This happened to me when I wrote squashinformr. I was excited by the idea of writing an R package and I had already written code that achieved some of the basic functionality. So I just started writing code and putting hours of time into something I had not thought through. To quote Jurassic Park:
Yeah, yeah, but your scientists were so preoccupied with whether or not they could that they didn't stop to think if they should.
In other words, I started writing code before determining what the end product should be. I didn't think about how many functions I would be writing or what they should deliver. I didn't consider how those functions could share subtasks so I didn't write helper functions. I ended up wasting a lot of time because of this mistake. To be honest, if I had determined the package's requirements beforehand, I wouldn't have written this blog post.
Instead, I could have achieved a high-level vision of the package by drawing a diagram. Using three squashinformr functions as an example, the diagram could look like this:
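[Figure: diagram of three squashinformr functions, showing the steps each one takes and the logic they share]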
By creating this diagram, I've forced myself to think about how the final product works. I'm forced to think about the package functions as a whole and the steps that each function takes towards its desired behaviour. I also have to consider which functions share logic and which logic is unique (e.g. data cleaning). This exercise is informative even if you are the one who came up with the project idea. It helps you determine the scope of the project and leads you toward a refined vision of what you're about to do.
Development teams will have their own processes for determining requirements, which may be too formal for some personal projects. That's why I recommend this method. You can decide how formal or detailed you want to be so you don't exhaust your inspiration in the planning stage. At the same time, you don't dive in and hope for the best. At the very least, you have an image you can reference to keep you on track with the high-level purpose of your project.
Beware copy + paste
This is a tip that you might have heard before and, like me, ignored out of convenience. Copying and pasting code is a quick and easy solution in the moment. Unfortunately, it is also a sure way to accrue technical debt. It is an easy way to implement the logic you need, but it comes with a catch: it (at least) doubles the length of your code. I made this mistake many times over when writing squashinformr by doing something like this:
get_some_data <- function(category = c("mens", "womens")) {
  category <- match.arg(category)  # resolve the default vector to a single value
  if (category == "mens") {
    ## [Code to get men's data]
    ## [Data cleaning]
  } else if (category == "womens") {
    ## [Code to get women's data]
    ## [Data cleaning]
  }
  ## [Common data operations]
  return(result)
}
Andrew Tanenbaum explained the consequence of lengthening your code:
In addition, a substantial number of the problems are caused by buggy software, which occurs because vendors keep adding more and more features to their programs, which inevitably means more code and thus more bugs.
My pseudo-code function above is waiting for a bug to ruin it. In this scenario, the code that retrieves the men's data is nearly identical to the code that retrieves the women's data. If I want to introduce a change or if a bug stops the function from working, I would have to rewrite code in up to four separate places to ensure consistent results. It's a similar story when it comes to code commenting. I would have to rewrite the comments in up to four places if they were common between each workflow. This is why it's in our best interest to avoid copying and pasting code. Not only does it make code more vulnerable, but it also creates more work for you in the future.
So what is the alternative strategy to copying and pasting? Writing functions. When you are about to write code that repeats the same logic as other code you've written, it's time to consider whether you should start writing a function. Hadley Wickham wrote a good rule of thumb for this in his R for Data Science book:
You should consider writing a function whenever you've copied and pasted a block of code more than twice (i.e. you now have three copies of the same code).
While refactoring squashinformr, I wrote new helper functions that achieved subtasks within the package's main functions. The result looks something like this:
get_some_data <- function(category = c("mens", "womens")) {
  category <- match.arg(category)  # resolve the default vector to a single value
  ## Both categories now share one code path, so no branching is needed
  result <- helper_function(category = category)
  ## [Common data operations]
  return(result)
}

helper_function <- function(category = c("mens", "womens", "both")) {
  category <- match.arg(category)
  ## [Code to get data, given category]
  ## [Data cleaning]
  return(data)
}
The code that gets the data is now abstracted away from the main function. The result is succinct code that achieves the desired result without accruing technical debt.
By avoiding copying and pasting, we can make our code more robust and maintainable for the future.
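To make the pseudo-code above concrete, here is a minimal runnable sketch of the same idea. Everything in it is a hypothetical stand-in invented for illustration: the toy `rankings` data frame, the helper `clean_rankings()`, and the main function `get_rankings()` are not actual squashinformr functions, and no web scraping takes place.

```r
# Hypothetical toy data standing in for scraped results (not real squash data).
rankings <- data.frame(
  category = c("mens", "mens", "womens", "womens"),
  name     = c(" player a", "PLAYER B ", "player c", " Player D"),
  rank     = c("2", "1", "1", "2"),
  stringsAsFactors = FALSE
)

# Helper: the retrieval and cleaning logic lives in exactly one place.
clean_rankings <- function(category = c("mens", "womens")) {
  category <- match.arg(category)
  out <- rankings[rankings$category == category, ]
  out$name <- toupper(trimws(out$name))  # normalise whitespace and case
  out$rank <- as.integer(out$rank)       # ranks arrive as text
  out
}

# Main function: delegates to the helper, then applies the common operations.
get_rankings <- function(category = c("mens", "womens")) {
  category <- match.arg(category)
  result <- clean_rankings(category)
  result <- result[order(result$rank), ]  # common operation: sort by rank
  rownames(result) <- NULL
  result
}

get_rankings("womens")
```

With this shape, a fix to the cleaning step happens once, inside `clean_rankings()`, rather than once per category.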
Continuous testing
One thing I did well in the first version of squashinformr was testing. I wrote two main types of tests for every function using the testthat package: equality tests (i.e. "this function with these inputs should return exactly this result") and input error tests (e.g. "this function with these inputs should return an error"). These tests would tell me if a function was behaving as expected or return an error if it wasn't. So all was good while I developed the package, but I forgot to consider something important. How was I going to continue to test these functions after I released squashinformr to the world?
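As a rough sketch of those two kinds of tests, assuming the testthat package is installed: the function `count_matches()` and the `toy` data frame below are hypothetical stand-ins made up for this example, not part of squashinformr.

```r
library(testthat)

# Hypothetical function under test: counts the rows for one category.
count_matches <- function(data, category = c("mens", "womens")) {
  stopifnot(is.data.frame(data))
  category <- match.arg(category)
  sum(data$category == category)
}

toy <- data.frame(category = c("mens", "womens", "womens"))

# Equality test: these inputs should return exactly this result.
test_that("count_matches returns the expected count", {
  expect_equal(count_matches(toy, "womens"), 2)
})

# Input error test: invalid inputs should raise an error.
test_that("count_matches rejects bad input", {
  expect_error(count_matches("not a data frame", "mens"))
  expect_error(count_matches(toy, "juniors"))
})
```

In a package, tests like these would normally live in files under `tests/testthat/` so that they run as part of the package check.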
After releasing a package, you take on responsibility for it. Software packages need maintenance and you are the person people go to if it has issues. So, it's best that you are the first person to know when something is wrong. This is where continuous testing comes in. It's great to write tests, but it's not enough to run them every once in a while. You need to run tests on an interval that makes sense. If your software is niche and not integral to other people's work, you can afford to test less often. If it is important and widely used, you should be testing constantly.
So how do you conduct tests on a regular interval? There are many options for software testing at an enterprise level, but only one stand-out solution for everyday open source contributors. GitHub Actions is a flexible workflow system that can be configured to test almost any piece of software. If you host your project on GitHub, automated testing workflows can be achieved by adding a configuration file to a .github/workflows/ directory in your project. The good news is that the GitHub community already hosts useful preconfigured actions that complete all sorts of tasks "out of the box". For R users, the r-lib/actions repository serves as an extraordinary resource, particularly for testing. If you're feeling adventurous, GitHub also provides tools to create your own custom actions.
I added a modified version of the Standard CI Workflow to squashinformr. GitHub Actions now runs my tests every time I submit a change to the master branch and, in addition, every 24 hours. This ensures that I won't go more than a day without knowing something is wrong with the package. I also run these tests in a "matrix configuration", which is a fancy way of saying "across many operating systems". So I know my package is accessible to as many users as possible. Now that I've taken these precautions, squashinformr is more maintainable than ever. Barring any wild bus accidents, I'll be able to identify and fix issues as they arise.
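For readers who have not seen one, a workflow along these lines might look roughly like the sketch below. This is not the actual squashinformr configuration: the file name, the job name, the cron time, and the list of operating systems are assumptions chosen for illustration, loosely modelled on the standard check workflow in r-lib/actions.

```yaml
# Hypothetical file: .github/workflows/R-CMD-check.yaml
name: R-CMD-check

on:
  push:
    branches: [master]      # run on every change to the master branch
  schedule:
    - cron: "0 0 * * *"     # and once every 24 hours (00:00 UTC, an arbitrary choice)

jobs:
  R-CMD-check:
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        # The "matrix configuration": one run per operating system
        os: [ubuntu-latest, macos-latest, windows-latest]
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - uses: r-lib/actions/setup-r-dependencies@v2
        with:
          extra-packages: any::rcmdcheck
          needs: check
      - uses: r-lib/actions/check-r-package@v2
```

The `push` and `schedule` triggers cover the two cases described above, and the `matrix` block is what spreads the same checks across several operating systems.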
Conclusion
These are the three best practices I've been able to apply to my open source work with squashinformr. I determined software requirements by drawing high-level diagrams of the finished product. I reduced code redundancy and complexity by writing helper functions. Lastly, I integrated continuous testing through GitHub Actions to ensure timely maintenance and wider accessibility.
After all that, my work is not finished. squashinformr is not perfect; it still needs help! If you are looking to get involved in open source and this project interests you, check out the GitHub repository. Feel free to open an issue or send me an email (hmd@needleinthehay.ca) with your idea. Thanks for reading!