In the early days of the pandemic, I wrote squashinformr, an R package for web scraping professional squash data. I’m quite proud of the package, as it was my first real effort at creating open-source software. But when I wrote the initial version, I committed many cardinal programming sins.

At the time, my approach to writing code was very ‘script-like’. I kept code for my analyses in single R Markdown files that I would run from top to bottom. It’s clear from old squashinformr code that I expected functions to behave the same way. That is, I wrote the functions as if they were individual scripts that shared no common behaviour or tasks. In reality, some functions contained identical code except for a few substituted variables. In short, the code was a mess and vulnerable to bugs.

Since then, I’ve learned software development best practices with the help of a mentor at work. Feeling more confident about my ability to write stable code, I refactored squashinformr earlier this year. I’d like to break down examples of where I applied these best practices, as if I were speaking to my past self. If you haven’t heard of these rules, I recommend you apply some of them in your work. That way, you (or someone else) won’t have to rewrite your code in the future 😄

Determine requirements by drawing a diagram

It’s easy to get inspired by a project idea and want to dive straight into making it. This happened to me when I wrote squashinformr. I was excited by the idea of writing an R package, and I had already written code that achieved some of the basic functionality. So I just started writing code, putting hours of time into something I had not thought through. To quote Jurassic Park:

Yeah, yeah, but your scientists were so preoccupied with whether or not they could that they didn’t stop to think if they should.

In other words, I started writing code before determining what the end product should be. I didn’t think about how many functions I would be writing or what they should deliver. I didn’t consider how those functions could share subtasks, so I didn’t write helper functions. I ended up wasting a lot of time because of this mistake. To be honest, if I had determined the package’s requirements beforehand, I wouldn’t have written this blog post.

Instead, I could have formed a high-level vision of the package by drawing a diagram. Using three squashinformr functions as an example, the diagram could look like this:

Requirements Diagram Example
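
In rough text form, the idea looks something like this (the function names and steps are illustrative, not squashinformr’s actual exports):

get_rankings()    ──>  build URL  ──>  scrape page  ──>  clean data        ──>  return result
get_profiles()    ──>  build URL  ──>  scrape page  ──>  clean data        ──>  return result
get_tournaments() ──>  build URL  ──>  scrape page  ──>  clean match data  ──>  return result
                       [shared]        [shared]         [shared vs. unique]

The shared steps are candidates for helper functions, while the unique steps stay inside their main functions.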

By creating this diagram, I’ve forced myself to think about how the final product works. I have to think about the package’s functions as a whole and the steps each function takes towards its desired behaviour. I also have to consider which logic is shared between functions and which logic is unique (e.g. data cleaning). This exercise is informative even if you are the one who came up with the project idea. It helps you determine the scope of the project and leads you toward a refined vision of what you’re about to do.

Development teams have their own processes for determining requirements, which may be too formal for some personal projects. That’s why I recommend this method: you can decide how formal or detailed you want to be, so you don’t exhaust your inspiration in the planning stage. At the same time, you’re not diving in and hoping for the best. At the very least, you have an image you can reference to keep you on track with the high-level purpose of your project.

Beware copy + paste

This is a tip that you might have heard before and, like me, ignored out of convenience. Copying and pasting code is a quick and easy way to implement the logic you need in the moment, but it comes with a catch: it (at least) doubles the length of your code, which is a sure way to accrue technical debt. I made this mistake many times over when writing squashinformr, by doing something like this:

get_some_data <- function(category = c("mens", "womens")) {

  ## Validate the category argument against the allowed values
  category <- match.arg(category)

  if (category == "mens") {
    ## [Code to get men's data]
    ## [Data cleaning]
  } else if (category == "womens") {
    ## [Code to get women's data]
    ## [Data cleaning]
  }

  ## [Common data operations]
  return(result)

}

Andrew Tanenbaum explained the consequence of lengthening your code:

In addition, a substantial number of the problems are caused by buggy software, which occurs because vendors keep adding more and more features to their programs, which inevitably means more code and thus more bugs.

My pseudo-code function above is waiting for a bug to ruin it. In this scenario, the code that retrieves the men’s data is nearly identical to the code that retrieves the women’s data. If I want to introduce a change, or if a bug stops the function from working, I have to rewrite code in up to four separate places to ensure consistent results. It’s a similar story for code comments: if they were common between each workflow, I would have to rewrite them in up to four places too. This is why it’s in our best interest to avoid copying and pasting code. Not only does it make code more vulnerable, it also creates more work for you in the future.

So what is the alternative to copying and pasting? Writing functions. When you are about to write code that repeats the logic of other code you’ve written, it’s time to consider writing a function. Hadley Wickham offers a good rule of thumb in his book R for Data Science:

You should consider writing a function whenever you’ve copied and pasted a block of code more than twice (i.e. you now have three copies of the same code).

While refactoring squashinformr, I wrote new helper functions that handle subtasks within the package’s main functions. The result looks something like this:

get_some_data <- function(category = c("mens", "womens")) {

  category <- match.arg(category)

  ## The category-specific work now happens inside the helper
  result <- helper_function(category = category)

  ## [Common data operations]
  return(result)

}


helper_function <- function(category = c("mens", "womens", "both")) {

  category <- match.arg(category)

  ## [Code to get data, given category]
  ## [Data cleaning]
  return(data)

}

The code that gets the data is now abstracted away from the main function. The result is succinct code that achieves the same behaviour without accruing technical debt.
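
To make the pattern concrete, here is a minimal sketch of what a fleshed-out helper could look like for a scraping package, using rvest. The URL, selector, and cleaning steps are illustrative, not squashinformr’s actual code:

library(rvest)

## Hypothetical helper: fetch and clean one category's rankings table
helper_function <- function(category = c("mens", "womens")) {

  category <- match.arg(category)

  ## Build the category-specific URL (illustrative endpoint)
  url <- paste0("https://example.com/rankings/", category)

  ## Shared scraping and cleaning logic lives here, in one place
  page <- read_html(url)
  data <- html_table(html_node(page, "table"))
  names(data) <- tolower(names(data))

  return(data)
}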

By avoiding copying and pasting, we can make our code more robust and maintainable for the future.

Continuous testing

One thing I did well in the first version of squashinformr was testing. Using the testthat package, I wrote two main types of tests for every function: equality tests (i.e. “this function with these inputs should return exactly this result”) and input error tests (e.g. “this function with these inputs should return an error”). These tests would tell me whether a function was behaving as expected. So all was good while I developed the package, but I forgot to consider something important: how was I going to continue to test these functions after I released squashinformr to the world?
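
Those two kinds of tests might look something like this (the function, inputs, and expected values are illustrative, not squashinformr’s actual tests):

library(testthat)

## Equality test: known inputs should produce exactly this result
test_that("get_some_data() returns the expected number of rows", {
  result <- get_some_data(category = "mens")
  expect_equal(nrow(result), 50)  ## illustrative expected value
})

## Input error test: invalid inputs should fail loudly
test_that("get_some_data() rejects an unknown category", {
  expect_error(get_some_data(category = "juniors"))
})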

After releasing a package, you take on responsibility for it. Software packages need maintenance, and you are the person people come to when something goes wrong. So it’s best that you are the first person to know when something is wrong. This is where continuous testing comes in. It’s great to write tests, but it’s not enough to run them every once in a while. You need to run tests on an interval that makes sense. If your software is niche and not integral to other people’s work, you can afford to test less often. If it is important and widely used, you should be testing constantly.

So how do you run tests on a regular interval? There are many options for software testing at the enterprise level, but one standout solution for everyday open source contributors: GitHub Actions, a flexible workflow system that can be configured to test almost any piece of software. If you host your project on GitHub, you can automate testing workflows by adding a configuration file to a .github/workflows/ directory in your project. The good news is that the GitHub community already hosts useful preconfigured actions that complete all sorts of tasks “out of the box”. For R users, the r-lib/actions repository is an extraordinary resource, particularly for testing. If you’re feeling adventurous, GitHub also provides tools to create your own custom actions.

I implemented a modified version of the Standard CI Workflow in squashinformr. GitHub Actions runs my tests every 24 hours, in addition to every time I push a change to the master branch. This ensures that I won’t go more than a day without knowing something is wrong with the package. I also run these tests in a “matrix configuration”, which is a fancy way of saying “across several operating systems”, so I know the package works for as many users as possible. Now that I’ve taken these precautions, squashinformr is more maintainable than ever. Barring any wild bus accidents, I’ll be able to identify and fix issues as they arise.
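
For reference, a scheduled, matrixed check workflow might look roughly like this (e.g. saved as .github/workflows/R-CMD-check.yaml). The cron time and matrix entries here are illustrative, and the actual squashinformr configuration differs:

# Illustrative workflow sketch; not squashinformr's exact configuration
on:
  push:
    branches: [master]
  schedule:
    - cron: '0 6 * * *'   # run the test suite once every 24 hours

jobs:
  R-CMD-check:
    runs-on: ${{ matrix.config.os }}
    strategy:
      fail-fast: false
      matrix:
        config:
          - {os: ubuntu-latest,  r: 'release'}
          - {os: macos-latest,   r: 'release'}
          - {os: windows-latest, r: 'release'}
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
        with:
          r-version: ${{ matrix.config.r }}
      - uses: r-lib/actions/setup-r-dependencies@v2
        with:
          extra-packages: any::rcmdcheck
      - uses: r-lib/actions/check-r-package@v2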

Conclusion

These are the three best practices I’ve been able to apply to my open source work with squashinformr. I determined software requirements by drawing high-level diagrams of the finished product. I reduced code redundancy and complexity by writing helper functions. Lastly, I integrated continuous testing through GitHub Actions to ensure timely maintenance and wider accessibility.

After all that, my work is not finished. squashinformr isn’t perfect; it still needs help! If this project interests you and you’re looking to get involved in open source, check out the GitHub repository. Feel free to open an issue or send me an email with your idea. Thanks for reading!