library(knitr)
library(tidyverse)
library(dplyr)
Example analysis
Like many other graduate students, I am an avid consumer of coffee. I typically drink at least one cup of dark roast every morning with a splash of oat milk. I buy whole beans, grind them by hand, and brew them using the pour-over method. If I’m feeling fancy, I’ll go to a cafe and order a black iced Americano (objectively the best espresso drink).
Years of caffeine dependency have undoubtedly made me an authority on this topic, but in order to assess the current coffee zeitgeist, I will be analyzing a data set consisting of Walmart coffee listings from 500 different stores as well as consumer ratings of each coffee (Zub 2022). I will examine whether more expensive coffees are better rated than cheaper ones.
If you are interested in learning more about coffee, or just love to drink it, read on to find out which type might be your best bet!
First, let’s take a quick look at the data:
coffee <- read.csv("data/coffee-listings-from-all-walmart-stores.csv")
# Drop unused columns and keep only listings whose type mentions a roast
coffee <- coffee %>%
  select(-c(thumbnail, seller_name, weight_formatted_to_gramms)) %>%
  filter(grepl("roast", coffee_type))
# For each listing, keep the first comma-separated type that contains "roast"
roastTypes <- strsplit(coffee$coffee_type, split=",")
roastIndex <- lapply(roastTypes, grep, pattern="roast")
for (i in seq_along(roastTypes)){
  roastTypes[i] <- roastTypes[[i]][[roastIndex[[i]][[1]]]]
}
roastTypes <- unlist(roastTypes)
#print(roastTypes)
kable(head(coffee))
title | coffee_type | rating | reviews | price | weight |
---|---|---|---|---|---|
folgers classic roast ground coffee, 40.3-ounce | classic roast | 3.8 | 93 | 13.92 | 40.3-ounce |
café bustelo, espresso style dark roast ground coffee, vacuum-packed 10 oz. brick | espresso,dark roast | 4.7 | 914 | 3.76 | 10 oz |
folgers classic roast ground coffee, medium roast coffee, 25.9 ounce canister | medium roast,classic roast | 4.4 | 740 | 9.97 | 25.9 ounce |
great value classic roast medium ground coffee, value size, 48 oz | classic roast | 4.7 | 1598 | 9.98 | 48 oz |
great value classic roast medium ground coffee, 30.5 oz | classic roast | 4.2 | 263 | 7.98 | 30.5 oz |
caf bustelo, espresso style dark roast ground coffee, vacuum-packed 10 oz. brick | espresso,dark roast | 4.7 | 914 | 3.76 | 10 oz |
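The roast-type extraction in the code above (split each `coffee_type` on commas, then keep the first piece containing "roast") can also be written without an explicit loop. This is a minimal sketch in base R, not the code used for the analysis: the `coffee_type` vector is a small stand-in built from values shown in the table, and `first_roast` and `roast_types` are hypothetical names introduced only for illustration.

```r
# Stand-in for coffee$coffee_type, using values shown in the table above
coffee_type <- c("classic roast", "espresso,dark roast", "medium roast,classic roast")

# Hypothetical helper: split one entry on commas and return the first
# piece that mentions "roast" (NA if no piece does)
first_roast <- function(x) {
  parts <- strsplit(x, split = ",", fixed = TRUE)[[1]]
  parts[grep("roast", parts)[1]]
}

# Apply the helper to every entry and collect a plain character vector
roast_types <- vapply(coffee_type, first_roast, character(1), USE.NAMES = FALSE)
print(roast_types)
```

Note that an entry such as "medium roast,classic roast" is labelled by whichever roast is listed first, which is the same convention the loop follows.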
Data dictionary
title: the brand and name of the coffee
coffee_type: the type of roast of the coffee
rating: the average rating of the coffee
reviews: the number of reviews of the coffee
price: the retail price of the coffee
weight: the weight for one package of the coffee
Analysis
To keep things simple, we will subset the data to only look at the three most common roast types: light, medium and dark.
library(ggplot2)
# Attach the extracted roast type to each listing
coffee$roastTypes <- roastTypes
# Keep only the three most common roast types
coffee <- coffee %>%
  filter(roastTypes %in% c("dark roast", "medium roast", "light roast"))
p <- ggplot(coffee, aes(roastTypes))+
  geom_bar()+
  xlab("Roast Types")
p
Figure 1: A bar chart of the number of coffees in each roast type.
Let’s look at the distribution of the different ratings:
ggplot(coffee, aes(x=rating))+
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure 2: A histogram of the coffee ratings.
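The `stat_bin()` message above suggests choosing a bin width explicitly. A minimal sketch of how that might look is below. The `toy` data frame holds made-up ratings (assumed here to be on a 1 to 5 scale), and `binwidth = 0.25` is only an illustrative choice, not a value taken from the analysis.

```r
library(ggplot2)

# Hypothetical ratings standing in for coffee$rating
toy <- data.frame(rating = c(2.5, 3.8, 4.2, 4.4, 4.4, 4.7, 4.7, 4.7, 4.9, 5.0))

# Setting binwidth explicitly silences the stat_bin() message and makes
# the bar edges fall on predictable values
p_hist <- ggplot(toy, aes(x = rating)) +
  geom_histogram(binwidth = 0.25) +
  xlab("Rating")
p_hist
```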
Just for fun, let’s look at the mean rating for each type of roast:
ratingMeans <- coffee %>%
  group_by(roastTypes) %>%
  group_modify(~ as.data.frame(mean(.$rating)))
colnames(ratingMeans)[[2]] <- "avgRating"
ratingMeans
# A tibble: 3 × 2
# Groups: roastTypes [3]
roastTypes avgRating
<chr> <dbl>
1 dark roast 4.33
2 light roast 3.37
3 medium roast 4.31
We can see that dark and medium roasts have nearly identical mean ratings (4.33 and 4.31), and both are rated much higher than light roast coffees (3.37).
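For reference, the same kind of per-group mean can be computed with `summarise()`, which names the output column directly and makes it easy to report group sizes alongside the means. This is a sketch on a hypothetical `toy` data frame with made-up ratings, not a rerun of the analysis; `nCoffees` is a name introduced here for illustration.

```r
library(dplyr)

# Hypothetical toy data with the same two columns used in the analysis
toy <- data.frame(
  roastTypes = c("dark roast", "dark roast", "light roast", "medium roast"),
  rating     = c(4.5, 4.0, 3.0, 4.2)
)

# One row per roast type: mean rating and number of listings in the group
ratingMeans <- toy %>%
  group_by(roastTypes) %>%
  summarise(avgRating = mean(rating, na.rm = TRUE), nCoffees = n())
print(ratingMeans)
```

Reporting the group size next to each mean helps judge how much weight a difference between roast types deserves.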
To see whether more expensive coffees are better rated, we first need a price that is comparable across package sizes. Let’s compute a unit price by dividing each coffee’s price by the numeric part of its weight.
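# Keep the leading run of digits in the weight string (any decimal part is dropped),
# then divide the price by it to get a unit price
coffee <- coffee %>%
  mutate(numericWeight = as.numeric(gsub("([0-9]+).*$", "\\1", weight))) %>%
  mutate(unitPrice = price/numericWeight)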
Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
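The pattern used above keeps only the leading run of digits, so a weight such as "40.3-ounce" is read as 40, and strings that do not reduce to a number become `NA` (hence the coercion warning). The sketch below compares that pattern with one possible alternative that keeps the decimal part. The `weight` vector reuses strings from the table plus one made-up unparseable entry ("one pound"), and `first_number` is a hypothetical helper, not part of the original analysis.

```r
# Weight strings from the table above, plus one made-up entry with no digits
weight <- c("40.3-ounce", "10 oz", "25.9 ounce", "48 oz", "30.5 oz", "one pound")

# Pattern from the analysis: leading digits only (warns and gives NA for "one pound")
truncated <- suppressWarnings(as.numeric(gsub("([0-9]+).*$", "\\1", weight)))

# Hypothetical alternative: extract the first number, with an optional decimal part
first_number <- function(x) {
  m <- regexpr("[0-9]+(\\.[0-9]+)?", x)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)   # regmatches() returns only the matching entries
  as.numeric(out)
}
with_decimals <- first_number(weight)

print(data.frame(weight, truncated, with_decimals))
```

Neither version checks the unit, so both implicitly assume every weight is expressed in the same unit.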
Now, let’s plot the rating of the coffee against the price, faceted by the roast type.
I’m faceting by roast type because I think the three different roasts are not really interchangeable and thus should be considered individually.
p2 <- ggplot(coffee, aes(x=unitPrice, y=rating))+
  geom_point()+
  geom_smooth(method=lm)+
  facet_wrap(~roastTypes)+
  xlab("Unit Price")
p2
`geom_smooth()` using formula 'y ~ x'
Warning: Removed 48 rows containing non-finite values (stat_smooth).
Warning: Removed 47 rows containing missing values (geom_point).
Figure 3: Scatter plots of the coffee ratings against the unit price, faceted by the roast type.
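The lines drawn by `geom_smooth(method=lm)` in each facet are simple linear fits of rating on unit price, so their slopes can also be inspected numerically. The sketch below shows one way to do that in base R. The `toy` data frame contains made-up numbers chosen only to illustrate the mechanics; its slopes are not results from the coffee data.

```r
# Hypothetical toy data: four listings per roast type, with made-up values
toy <- data.frame(
  roastTypes = rep(c("dark roast", "light roast", "medium roast"), each = 4),
  unitPrice  = rep(c(0.2, 0.4, 0.6, 0.8), times = 3),
  rating     = c(4.6, 4.5, 4.3, 4.2,  3.6, 3.4, 3.3, 3.1,  4.5, 4.4, 4.4, 4.1)
)

# Fit rating ~ unitPrice separately within each roast type and keep the slope
slopes <- sapply(split(toy, toy$roastTypes), function(d) {
  coef(lm(rating ~ unitPrice, data = d))[["unitPrice"]]
})
print(slopes)
```

A negative slope corresponds to a downward-sloping line in the matching facet; `summary(lm(...))` would additionally give a standard error for judging whether the slope is distinguishable from zero.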
Across the three roast types, the fitted lines show a slight decreasing trend: coffees with a higher unit price tend to be rated slightly lower. In other words, a more expensive coffee is not necessarily a better-rated one! Unfortunately, as we saw from the histogram of coffee ratings, the distribution of ratings is heavily left-skewed, so we do not have much data on lower-rated coffees. The results of this analysis should therefore be taken with several grains of salt until we get more data. I hope you’ve enjoyed this pretentious coffee data analysis!
Functions used:
dplyr (Wickham et al. 2022): filter, mutate, select, group_by, group_modify
ggplot2 (Wickham 2016): geom_histogram, geom_point, geom_smooth, facet_wrap