This summer, I had the pleasure of working as a Data Analyst Intern at Voices.com as a part of my MSc in Business Analytics. Voices.com is a marketplace website that matches business clients with voice-over talent.
During my internship at Voices.com, I became involved in a variety of analytics projects. I committed to a customer segmentation analysis on voice-over talent, analyzed customer churn, built dashboards for core metrics, and designed A/B tests! Through these projects, I procured even more experience with the R programming language. In this post, I will showcase R tips that I learned this summer for data analysis and visualization. Hopefully, you can apply some of these tips to your own data analysis projects.
1. Removing similarly-named objects from R’s Global Environment
Say that you are in the middle of a data analysis and you need to test out a part of your code. If you're like me, you will test sections of code and assign the results to a new object named something like `test`. After I'm finished testing my code, I'm left with an RStudio environment that looks like this:
My first tip will allow us to selectively delete these `test` objects. This is accomplished through the following code:

```r
rm(list = ls()[str_detect(ls(), "string")])
```
Let’s break down what is going on in this code.
- `rm()` removes objects from the R environment.
- `ls()` lists all the objects in the R environment.
- `str_detect()` (from the stringr package) detects a specified pattern in a string.
- `"string"` is the pattern which our objects of interest have in common (e.g. "test").
The innermost part of the code, `str_detect(ls(), "string")`, detects the specified pattern within the environment. That code, enclosed in square brackets, indexes the list of object names in the environment (`ls()[str_detect(ls(), "string")]`). Finally, that code is entered as the `list` argument of `rm()`. Altogether, `rm(list = ls()[str_detect(ls(), "test")])` removes objects with the matching `"test"` pattern from the environment!
2. coord_cartesian() for zooming in on ggplots
Say that you are producing a histogram from your data that looks something like this:
```r
aq <- as_tibble(airquality)

ggplot(aq, aes(x = Ozone)) +
  geom_histogram(fill = customBlue, binwidth = 5) +
  theme_light() +
  labs(y = "Count", x = "Ozone")
```
For whatever reason, you are only interested in visualizing a subset of this data. For example, we only want to visualize the histogram between Ozone values 0 and 50. You can achieve this using `xlim()`:

```r
ggplot(aq, aes(x = Ozone)) +
  geom_histogram(fill = customBlue, binwidth = 5) +
  theme_light() +
  labs(y = "Count", x = "Ozone") +
  xlim(c(0, 50))
```
`xlim()` has resulted in an odd behaviour: our histogram looks different than the original above. This is because `xlim()` (a shorthand for `scale_x_continuous(limits = ...)`) has excluded all data that don't fall within the limits. In this example, `xlim()` excluded data points near 0 and 50. This is problematic if our goal is to visualize the same data without excluding data that lies on the edge of our criteria.
Instead, we can use `coord_cartesian()` to achieve a true "zoom in" on our original data visualization.

```r
ggplot(aq, aes(x = Ozone)) +
  geom_histogram(fill = customBlue, binwidth = 5) +
  theme_light() +
  labs(y = "Count", x = "Ozone") +
  coord_cartesian(xlim = c(0, 50))
```
Using the `xlim` argument within `coord_cartesian()` results in the desired behaviour. Note that the previous erroneous behaviour of `xlim()` isn't restricted to histograms! Remember this tip so that you don't accidentally exclude data from your visualizations!
3. Intra-figure legends in ggplot2
I agree with Edward Tufte's disdain for legends in data visualizations. Legends, when placed outside a visualization, oblige your audience to look back and forth to interpret what you are presenting. This means that your audience has to focus on the format of your visualization before they can begin interpreting the content or message you are trying to deliver. That's why I always try to use direct labels on my data visualizations instead of legends, like so:
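As a minimal sketch of what direct labelling can look like (the data subset and label placement here are illustrative, not the figure from the original post):

```r
library(ggplot2)
library(dplyr)

aq <- as_tibble(airquality) %>% filter(Month %in% c(6, 7))

# Place a label at the right-hand end of each line instead of using a legend
labels <- aq %>% group_by(Month) %>% filter(Day == max(Day))

ggplot(aq, aes(x = Day, y = Temp, group = Month, colour = factor(Month))) +
  geom_line(size = 1) +
  geom_text(data = labels, aes(label = month.abb[Month]),
            hjust = -0.2, show.legend = FALSE) +
  theme_light() +
  theme(legend.position = "none") +
  labs(y = "Mean Temperature (F)", x = "Day")
```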
But say you want to do the same direct labelling with a messier or more complicated visualization…
Now, I would hope that you would never have to present a spaghetti graph like this, but this is a situation where direct labels are not as easily implemented. Instead, we’ll use a legend. To reduce the amount of back-and-forth your audience has to do, we can place the legend within the figure as a compromise.
```r
aq %>%
  filter(Month != 5) %>%
  ggplot(aes(x = Day, y = Temp, group = Month, colour = factor(Month))) +
  geom_line(size = 1) +
  geom_point(size = 3) +
  scale_colour_manual(values = c(customBlue, customOrange, customYellow, customPurple)) +
  coord_cartesian(ylim = c(50, 100)) +
  labs(title = "Mean temperature by day grouped by month",
       y = "Mean Temperature (F)",
       colour = "Month") +
  theme_light() +
  theme(legend.background = element_blank(),  ## These lines here!
        legend.position = c(0.25, 0.2),       ## These lines here!
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 12))
```
This is accomplished by using `theme(legend.background = element_blank(), legend.position = c(0.25, 0.2), ...)` to make the legend background transparent and to move the legend into the figure. This intra-figure legend will reduce, but not eliminate, the figure's interpretability issues. At the very least, it is a good compromise when you can't directly label your data.
4. High resolution plot outputs via Cairo
Exporting graphics in R can be challenging. New users will typically use RStudio's Export button in the Plots window to save images to their working directory. In my experience, this method results in pixelated, low resolution graphics. There is nothing more demotivating than working on visualizations for hours, only to have them look mediocre when presenting them.
This is where the Cairo package comes in. The formula for generating Cairo outputs is as follows:

```r
CairoPNG(filename = "filename.png", res = 300, units = "in", width = 16, height = 9)

# ... your ggplot() code here ...

dev.off()
```
- The first line initializes a graphics device via Cairo to render a high quality image of your visualization in the file format of your choosing. Here, I specify a PNG output through `CairoPNG()`, but other formats include JPEG, TIFF, and PDF.
- The second portion of the script should contain your visualization code.
- The last part is specifying `dev.off()`, which turns the graphics device off.
After executing the code in order, a high quality output of your visualization will be found in your working directory.
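For instance, a complete script might look like this (the filename and dimensions are just examples; in a script you may need `print()` around the ggplot call so the plot actually renders on the device):

```r
library(Cairo)
library(ggplot2)

# Open a high-resolution PNG graphics device
CairoPNG(filename = "ozone_histogram.png", res = 300,
         units = "in", width = 16, height = 9)

# Explicitly print the plot so it is drawn on the Cairo device
print(
  ggplot(airquality, aes(x = Ozone)) +
    geom_histogram(binwidth = 5) +
    theme_light() +
    labs(y = "Count", x = "Ozone")
)

# Close the device; the file is written to your working directory
dev.off()
```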
Note that your visualization might be slightly different from your viewer pane preview. It is then up to you to adjust your visualization code before running the whole code chunk again to see the changes. Use this tip so that your visualizations are always high res and crisp for your audience!
5. Fuzzy joins on intervals with fuzzyjoin
The last tip is for dirty data joining! You may find yourself in a situation where you want to join two data frames / tables but you don’t have the necessary number of unique identifiers. For example, you are looking at event data in two separate data frames, where one data frame contains subscription data and the other contains cancellation case data.
```
## # A tibble: 8 x 4
##      id sub_id start_date end_date
##   <dbl>  <dbl> <date>     <date>
## 1     1     12 2019-09-05 2020-09-05
## 2     1     41 2018-09-05 2019-09-05
## 3     2     57 2019-07-01 2020-07-01
## 4     2     62 2018-07-01 2019-07-01
## 5     3     78 2019-02-01 2020-02-01
## 6     3     87 2018-02-01 2019-02-01
## 7     4     35 2018-05-25 2019-05-25
## 8     4     23 2017-05-25 2018-05-25
```
```
## # A tibble: 3 x 4
##      id case_open_date case_close_date retain
##   <dbl> <date>         <date>          <lgl>
## 1     1 2020-05-09     2020-05-12      TRUE
## 2     3 2019-11-30     2019-12-01      FALSE
## 3     4 2018-04-25     2018-04-27      FALSE
```
In this example, we can identify which members have subscriptions and cancellation cases through `id`. But we don't have enough information to know which subscription each cancellation case belongs to. To know this, we would need a unique subscription id variable in the cancellation case data. Unfortunately, we only have subscription ids in `memberships`. We can solve this problem with something called a fuzzy join.
In many programming languages, a left or right join allows you to link data frames or tables that contain a key identifier variable that is common between the two. This allows the analyst to link the data frames based on an exact match of the key id variables in each data frame. In our example, that would be the `id` variable. Fuzzy joins differ in that they allow you to link data frames based on inexact matching. This inexact matching technique is particularly useful in combination with an exact match.
In the following example, we will use `genome_left_join()` from the fuzzyjoin package to perform a fuzzy left join of `cancellations` onto `memberships`. This join will be based on criteria of an exact match of `id` and an inexact match of the subscription period (between `start_date` and `end_date`) that overlaps with the case period (between `case_open_date` and `case_close_date`).
```r
# Convert start_date and end_date to numeric
memberships <- memberships %>%
  mutate(start_date = as.numeric(start_date),
         end_date = as.numeric(end_date))

# Convert case_open_date and case_close_date to numeric
cancellations <- cancellations %>%
  mutate(case_open_date = as.numeric(case_open_date),
         case_close_date = as.numeric(case_close_date))

# Fuzzy join
mem_canc <- memberships %>%
  genome_left_join(cancellations,
                   by = c("id" = "id",
                          "start_date" = "case_open_date",
                          "end_date" = "case_close_date")) %>%
  # Then convert start_date, end_date, case_open_date, and case_close_date back to dates
  mutate(start_date = as_date(start_date),
         end_date = as_date(end_date),
         case_open_date = as_date(case_open_date),
         case_close_date = as_date(case_close_date))
```
The first two code chunks change our subscription and case dates to integers (which represent the number of days since 1970-01-01). The result of the fuzzy join is a single data frame containing all rows from `memberships` and the matched rows of `cancellations`. Any rows in `memberships` that do not have a matching row in `cancellations` are filled with NAs.
```
## # A tibble: 8 x 7
##      id sub_id start_date end_date   case_open_date case_close_date retain
##   <dbl>  <dbl> <date>     <date>     <date>         <date>          <lgl>
## 1     1     12 2019-09-05 2020-09-05 2020-05-09     2020-05-12      TRUE
## 2     1     41 2018-09-05 2019-09-05 NA             NA              NA
## 3     2     57 2019-07-01 2020-07-01 NA             NA              NA
## 4     2     62 2018-07-01 2019-07-01 NA             NA              NA
## 5     3     78 2019-02-01 2020-02-01 2019-11-30     2019-12-01      FALSE
## 6     3     87 2018-02-01 2019-02-01 NA             NA              NA
## 7     4     35 2018-05-25 2019-05-25 NA             NA              NA
## 8     4     23 2017-05-25 2018-05-25 2018-04-25     2018-04-27      FALSE
```
Now we can see which subscriptions the cancellation cases belong to. From here, I could further clean this data by changing the `end_date` of the subscription to match the cancellation `case_close_date`, if the subscription was successfully cancelled according to `retain`. This would allow me to calculate churn rates of the subscribers, which has huge implications for a subscription business' monthly and annual revenue.
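As a rough sketch of the kind of churn summary this join enables (treating `retain == FALSE` as a lost subscription is my assumption about the data, not code from the original analysis):

```r
library(dplyr)

# Among subscriptions that had a cancellation case, how many were lost?
mem_canc %>%
  filter(!is.na(retain)) %>%
  summarise(
    cases      = n(),            # subscriptions with a cancellation case
    churned    = sum(!retain),   # cases where the member was not retained
    churn_rate = mean(!retain)   # share of cases that ended in churn
  )
```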
Note: to perform a fuzzy join with `genome_left_join()` you will require the IRanges package as well. Installation instructions can be found here.
Thanks for reading this blog post! Hopefully, you’ll be able to use some of these tips to make your data analyses cleaner, crisper, and more accurate!
Feel free to share this content with others who may benefit from using any of these tips. As always, the full code for this post is available here. Happy coding!