Exploring Popular Productions and their Statistics
knitr::include_graphics("broadway image data science.jpeg")
In the world of theatre, countless plays and musicals have been performed since Broadway originated. Productions have entered and left Broadway, some going on to tour the country. Because Broadway is a popular tourist attraction in New York, it is natural to ask how successful particular productions have been over its history. Examining different plays and musicals raises many questions about how much impact a show has had on its audience. Do certain productions tend to bring in more income? In which year(s) has Broadway sold the most tickets or filled the most seats? Is there a correlation between the location of a theater and its attendance? These questions and more can be explored through exploratory data analysis of specific productions between 1990 and 2016, which helps clarify when Broadway’s engagement with tourists from around the world peaked.
For this project, we will be examining the Broadway Data Set from CORGIS (The Collection of Really Great, Interesting, Situated Datasets) (“Broadway CSV File,” n.d.). This data set includes 31,296 observations of 12 variables, covering dates, show names and types, statistics regarding gross and attendance, gross potential, number of performances per week, and the theaters that were used for performances.
To explain these variables further, the dates are broken into four variables: Date.Day, Date.Full, Date.Month, and Date.Year, all of which are categorical ordinal. For this project, we focus only on the year of each production and therefore use the Date.Year variable. Show.Name is a categorical nominal variable giving the name of each production in the data set, and Show.Type is also a categorical nominal variable that records whether the production is a Musical, Play, or Special. Show.Theatre is a categorical nominal variable that lists the theater in which a production was performed, while Statistics.Capacity (numerical discrete variable) lists the capacity in the theaters. Statistics.Attendance is a numerical discrete variable that lists the number of people who attended a production over that week. Statistics.Gross is a numerical discrete variable, measured in dollars, that shows how much a production made that week. Statistics.Gross.Potential (numerical discrete variable) is listed as a percentage: gross divided by gross potential, which is determined through ticket sales, capacity, etc. Where it could not be calculated, a 0 is listed. Finally, Statistics.Performances is a numerical discrete variable that shows how many performances occurred during the corresponding week in the data set.
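Because a 0 in Statistics.Gross.Potential marks a value that could not be calculated rather than a true zero, it may be worth recoding those entries as missing before averaging the column. The following is a minimal sketch of that idea, not a step from the original analysis; the tibble toy holds made-up numbers, and toy_clean and mean_potential are names introduced here for illustration.

```r
library(dplyr)

# Made-up values for illustration only; not taken from the CORGIS data.
toy <- tibble(
  Statistics.Gross.Potential = c(85, 0, 102, 0, 60)
)

# na_if() replaces every 0 with NA so the placeholder zeros
# no longer pull the average down.
toy_clean <- toy %>%
  mutate(Statistics.Gross.Potential = na_if(Statistics.Gross.Potential, 0))

# Average over the values that could actually be calculated.
mean_potential <- mean(toy_clean$Statistics.Gross.Potential, na.rm = TRUE)
mean_potential
```

On the real data, broadway would take the place of toy.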
To begin, we will explore which years between 1990 and 2016 have had the most attendance or engagement from audiences. Below is a visualization of which years had the most attendance for productions.
# Total attendance per year, sorted from highest to lowest
broadway_attend <- broadway %>%
  group_by(Date.Year) %>%
  summarize(attendance = sum(Statistics.Attendance)) %>%
  arrange(desc(attendance))

# Line plot of total attendance by year
ggplot(data = broadway_attend, aes(x = Date.Year, y = attendance)) +
  geom_line(color = "yellow3") +
  geom_point() +
  labs(x = "Year",
       y = "Attendance",
       title = "Total Attendance per Year") +
  theme_minimal()
We can see from the graph that attendance has fluctuated during the 21st century. However, it is also clear that attendance in the earliest years of the data set (1990-1995) is drastically lower than in later years. This suggests that attendance has risen overall in more recent years, despite the apparent drop in 2016. We will examine the top years in attendance more closely to get a better sense of which productions have had a proportionately larger impact on Broadway. This also sets aside productions that may have been popular in the 20th century but did not make a lasting impression over Broadway’s history.
We can filter the data set to keep the observations within these top years, which appear to be the roughly 20-year period from 1996 to 2016.
# Keep only observations from 1996 through 2016
years <- as.character(seq(1996, 2016, 1))
broadway_new <- broadway %>%
  filter(Date.Year %in% years)
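The same filter can also be written as a numeric range, which avoids matching years against character strings. This is a minimal sketch that assumes Date.Year is stored as a number; the tibble toy is made-up data standing in for broadway, and toy_new is a name introduced here.

```r
library(dplyr)

# Made-up rows for illustration only; not taken from the CORGIS data.
toy <- tibble(
  Date.Year = c(1990, 1995, 1996, 2005, 2016),
  Statistics.Attendance = c(100, 200, 300, 400, 500)
)

# between() keeps rows whose year lies in [1996, 2016], endpoints included.
toy_new <- toy %>%
  filter(between(Date.Year, 1996, 2016))

toy_new
```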
During their time on Broadway, which productions have been most successful? To define success, we will examine attendance and gross for each type of show. First, we will look at how the weekly observations are distributed across the three categories (musicals, plays, and specials) before comparing their attendance and gross.
# Count the weekly observations for each show type
broadway_musical <- broadway_new %>%
  filter(Show.Type == "Musical") %>%
  summarise(n = n())
n_musical <- broadway_musical$n

broadway_play <- broadway_new %>%
  filter(Show.Type == "Play") %>%
  summarise(n = n())
n_play <- broadway_play$n

broadway_special <- broadway_new %>%
  filter(Show.Type == "Special") %>%
  summarise(n = n())
n_special <- broadway_special$n

# Collect the three counts into one table for plotting
show_types <- tibble(type = c("Musical", "Play", "Special"),
                     n = c(n_musical, n_play, n_special))

# Pie chart of the share of observations by show type
ggplot(show_types, aes(x = "", y = n, fill = type)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  colorblindr::scale_fill_OkabeIto(name = "Show Type") +
  theme(panel.background = element_blank(),
        axis.title = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank())
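The three filter-and-summarise steps above can also be condensed into a single count(), and the counts can be drawn as bars rather than as a pie. The following is a minimal sketch of that alternative, not the code used for the figure; toy is made-up data standing in for broadway_new, and p_types is a name introduced here.

```r
library(dplyr)
library(ggplot2)

# Made-up rows for illustration only; not taken from the CORGIS data.
toy <- tibble(
  Show.Type = c("Musical", "Musical", "Musical", "Play", "Play", "Special")
)

# count() groups by Show.Type and tallies the rows in one step,
# returning the columns Show.Type and n.
show_types <- toy %>%
  count(Show.Type)

# Bars make the counts comparable by length.
p_types <- ggplot(show_types, aes(x = Show.Type, y = n, fill = Show.Type)) +
  geom_col() +
  labs(x = "Show Type", y = "Number of weekly observations") +
  theme_minimal() +
  theme(legend.position = "none")

p_types
```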
We can see that the majority of the data set is musicals. Since the number of Special shows is so minuscule compared to the size of the data set, we will be focusing primarily on musicals and plays. Below we can examine which category has the highest gross and attendance.
# Drop the small number of Special shows
broadway_new <- broadway_new %>%
  filter(Show.Type != "Special")

# Weekly gross by year, colored by attendance, one panel per show type
# (scale_color_viridis() comes from the viridis package)
ggplot(data = broadway_new, aes(x = Date.Year, y = Statistics.Gross)) +
  geom_jitter(alpha = 0.8, aes(color = Statistics.Attendance),
              position = position_jitter(width = .2, height = .2)) +
  scale_color_viridis(option = "C",
                      name = "Attendance") +
  facet_wrap(~Show.Type) +
  labs(title = "Attendance and Gross of Broadway Productions",
       x = "Year",
       y = "Gross") +
  theme_light()
We can see that musicals tend to score higher in attendance and gross than plays do. However, both exhibit an upward trend in attendance as the years go on. This is reasonable, considering that many people associate Broadway primarily with musicals, especially the most famous ones. Musicals on Broadway tend to bring in more money than plays, which may be due in part to that reputation. Plays continue to have a high gross, however, and likely reach a different audience than musicals, which would draw a larger portion of the population to pay to see these productions. With so many musicals and plays performed in the past, which ones had the highest gross and attendance? Below we plot the top 10 productions (for both plays and musicals) with the highest attendance and gross between 1996 and 2016. (“Color Palettes,” n.d.)
# Top 10 musicals by total gross
best_musical_gross <- broadway_new %>%
  filter(Show.Type == "Musical") %>%
  group_by(Show.Name) %>%
  summarise(gross = sum(Statistics.Gross)) %>%
  arrange(desc(gross)) %>%
  head(10)

# Top 10 musicals by total attendance
best_musical_attend <- broadway_new %>%
  filter(Show.Type == "Musical") %>%
  group_by(Show.Name) %>%
  summarise(attend = sum(Statistics.Attendance)) %>%
  arrange(desc(attend)) %>%
  head(10)

# Top 10 plays by total gross
best_play_gross <- broadway_new %>%
  filter(Show.Type == "Play") %>%
  group_by(Show.Name) %>%
  summarise(gross = sum(Statistics.Gross)) %>%
  arrange(desc(gross)) %>%
  head(10)

# Top 10 plays by total attendance
best_play_attend <- broadway_new %>%
  filter(Show.Type == "Play") %>%
  group_by(Show.Name) %>%
  summarise(attend = sum(Statistics.Attendance)) %>%
  arrange(desc(attend)) %>%
  head(10)
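The four summaries above share one pattern: filter by show type, total a statistic per show, and keep the largest totals. That pattern could be captured in a small helper. The sketch below is one possible way to do so and is not part of the original analysis; top_shows, its arguments, and the output column total are hypothetical names, and toy is made-up data.

```r
library(dplyr)

# Hypothetical helper: total one statistic per show for a given show type
# and keep the n shows with the largest totals.
top_shows <- function(data, type, stat, n = 10) {
  data %>%
    filter(Show.Type == type) %>%
    group_by(Show.Name) %>%
    summarise(total = sum({{ stat }}), .groups = "drop") %>%
    slice_max(total, n = n, with_ties = FALSE)
}

# Made-up rows for illustration only; not taken from the CORGIS data.
toy <- tibble(
  Show.Name = c("A", "A", "B", "C", "C"),
  Show.Type = c("Musical", "Musical", "Musical", "Play", "Play"),
  Statistics.Gross = c(10, 20, 50, 5, 5)
)

top_musicals <- top_shows(toy, "Musical", Statistics.Gross, n = 2)
top_musicals
```

With the real data, a call such as top_shows(broadway_new, "Play", Statistics.Attendance) would correspond to best_play_attend, apart from the column being named total rather than attend.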
# Horizontal bar chart: top 10 musicals by attendance
musical_attend <- ggplot(best_musical_attend,
                         aes(x = reorder(Show.Name, attend), y = attend)) +
  geom_col(aes(fill = reorder(Show.Name, attend))) +
  coord_flip() +
  scale_fill_manual(values = c("#f6f2ff", "#e8daff", "#d4bbff",
                               "#be95ff", "#a56eff", "#8a3ffc", "#6929c4",
                               "#491d8b", "#31135e", "#1c0f30")) +
  theme(legend.position = "none") +
  scale_x_discrete(labels = scales::label_wrap(13)) +
  labs(x = "",
       y = "Attendance",
       title = "Attendance for Musicals")

# Horizontal bar chart: top 10 plays by attendance
play_attend <- ggplot(best_play_attend,
                      aes(x = reorder(Show.Name, attend), y = attend)) +
  geom_col(aes(fill = reorder(Show.Name, attend))) +
  coord_flip() +
  scale_fill_manual(values = c("#edf5ff", "#d0e2ff", "#a6c8ff",
                               "#78a9ff", "#4589ff", "#0f62fe", "#0043ce",
                               "#002d9c", "#001d6c", "#001141")) +
  theme(legend.position = "none") +
  scale_x_discrete(labels = scales::label_wrap(20)) +
  labs(x = "",
       y = "Attendance",
       title = "Attendance for Plays")

# Horizontal bar chart: top 10 musicals by gross
musical_gross <- ggplot(best_musical_gross,
                        aes(x = reorder(Show.Name, gross), y = gross)) +
  geom_col(aes(fill = reorder(Show.Name, gross))) +
  coord_flip() +
  scale_fill_manual(values = c("#e5f6ff", "#bae6ff", "#82cfff",
                               "#33b1ff", "#1192e8", "#0072c3", "#00539a",
                               "#003a6d", "#012749", "#1c0f30")) +
  theme(legend.position = "none") +
  scale_x_discrete(labels = scales::label_wrap(13)) +
  labs(x = "",
       y = "Gross",
       title = "Gross for Musicals")

# Horizontal bar chart: top 10 plays by gross
play_gross <- ggplot(best_play_gross,
                     aes(x = reorder(Show.Name, gross), y = gross)) +
  geom_col(aes(fill = reorder(Show.Name, gross))) +
  coord_flip() +
  scale_fill_manual(values = c("#d9fbfb", "#9ef0f0", "#3ddbd9",
                               "#08bdba", "#009d9a", "#007d79", "#005d5d",
                               "#004144", "#022b30", "#081a1c")) +
  theme(legend.position = "none") +
  scale_x_discrete(labels = scales::label_wrap(20)) +
  labs(x = "",
       y = "Gross",
       title = "Gross for Plays")

# Show plays and musicals side by side
gridExtra::grid.arrange(play_attend, musical_attend, ncol = 2)
gridExtra::grid.arrange(play_gross, musical_gross, ncol = 2)
We can see that Broadway musicals have much higher gross and attendance totals than plays do, supporting our earlier assumption that Broadway is typically associated with its musicals. Among the top musicals in gross and attendance, The Lion King appears to be the most successful musical on Broadway during the busiest years, with Wicked and The Phantom of the Opera in second and third place. For plays, War Horse is the most successful overall, just as The Lion King is for musicals, with The Curious Incident Of The Dog In The Night-Time in second. However, Proof has the third highest attendance, while It’s Only A Play has the third highest gross.
It is interesting to see that, beyond the top spots, the rankings of musicals and plays differ between gross and attendance. Some productions have had higher attendance with lower gross, while others have had higher gross with lower attendance. Overall, we can see which musicals and plays have been the most successful during their time on Broadway.
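One way to make the gap between the two rankings concrete is to divide total gross by total attendance, which approximates the average amount paid per attendee. The following is a minimal sketch with made-up totals, not a result from the data set; toy, toy_price, and gross_per_attendee are names introduced here.

```r
library(dplyr)

# Made-up totals for illustration only; not taken from the CORGIS data.
toy <- tibble(
  Show.Name = c("Show A", "Show B"),
  gross = c(1000000, 900000),
  attend = c(10000, 12000)
)

# Gross divided by attendance approximates the average amount paid per seat.
toy_price <- toy %>%
  mutate(gross_per_attendee = gross / attend) %>%
  arrange(desc(gross_per_attendee))

toy_price
```

In this toy case Show B draws more people but Show A earns more per attendee, the kind of difference that would produce different orderings in the gross and attendance charts.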
Now that we have seen which productions have had the highest gross and attendance, we will examine where they have been performed within New York. Below we filter to find which theaters these productions have played in. A given production tends to stay in the same theater, which explains why there are few theaters relative to the number of years each production has run.
# Theaters used by the top musicals (union of the gross and attendance lists)
theaters_musical <- broadway_new %>%
  filter(Show.Type == "Musical") %>%
  filter(Show.Name %in% c("The Lion King", "Wicked",
                          "The Phantom of the Opera", "Jersey Boys",
                          "Mamma Mia!", "Chicago", "The Book of Mormon",
                          "Mary Poppins", "The Producers",
                          "Beauty And The Beast", "Rent")) %>%
  select(Show.Name, Show.Theatre) %>%
  filter(Show.Theatre != "Cadillac Winter Garden")

# Theaters used by the top plays (union of the gross and attendance lists)
theaters_play <- broadway_new %>%
  filter(Show.Type == "Play") %>%
  filter(Show.Name %in% c("War Horse", "Art",
                          "The Curious Incident Of The Dog In The Night-Time",
                          "August: Osage County", "God Of Carnage", "Doubt",
                          "The 39 Steps", "Proof", "The Odd Couple 05",
                          "It'S Only A Play", "Fish In The Dark",
                          "The Tale Of The Allergist'S Wife",
                          "The Graduate")) %>%
  select(Show.Name, Show.Theatre)
# Point plot pairing each top musical with the theaters it played in
p3 <- ggplot(data = theaters_musical, aes(x = Show.Name, y = Show.Theatre)) +
  geom_point(shape = 23, color = "dodgerblue") +
  scale_x_discrete(labels = scales::label_wrap(10)) +
  labs(x = "Show Name (Musicals)",
       y = "Theater") +
  coord_flip() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

# Point plot pairing each top play with the theaters it played in
p4 <- ggplot(data = theaters_play, aes(x = Show.Name, y = Show.Theatre)) +
  geom_point(shape = 23, color = "forestgreen") +
  scale_x_discrete(labels = scales::label_wrap(22)) +
  labs(x = "Show Name (Plays)",
       y = "Theater") +
  coord_flip() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

gridExtra::grid.arrange(p3, p4, ncol = 2)
We can see from the maps below that for both musicals and plays, the theaters they are performed in tend to be located within the theater district of New York. This is somewhat intuitive for the average tourist or anyone familiar with the city, since there is a clear cultural center which includes these theaters for Broadway productions. When considering the most popular musicals and plays, the locations of the theaters seem to be centered around the same portion of New York.
# Street addresses of the theaters that staged the top musicals
theater_locations_musicals <- tibble(locations = c(
  "1634 Broadway, NY, NY",
  "245 W 52nd St, NY, NY",
  "246 W 44th St, NY, NY",
  "225 W 44th St, NY, NY",
  "226 W 46th St, NY, NY",
  "1564 Broadway, NY, NY",
  "214 W 42nd St, NY, NY",
  "208 W 41st St, NY, NY",
  "200 W 45th St, NY, NY",
  "205 W 46th St, NY, NY",
  "222 W 51st St, NY, NY",
  "235 W 44th St, NY, NY",
  "245 W 52nd St, NY, NY",
  "219 W 49th St, NY, NY"))

# Street addresses of the theaters that staged the top plays
theater_locations_plays <- tibble(locations = c(
  "219 W 48th St, NY, NY",
  "150 W 65th St, NY, NY",
  "236 W 45th St, NY, NY",
  "242 W 45th St, NY, NY",
  "236 W 45th St, NY, NY",
  "239 W 45th St, NY, NY",
  "242 W 45th St, NY, NY",
  "249 W 45th St, NY, NY",
  "240 W 44th St, NY, NY",
  "243 W 47th St, NY, NY",
  "138 W 48th St, NY, NY",
  "256 W 47th St, NY, NY",
  "227 W 42nd St, NY, NY"))
# Geocode the musical theater addresses with OpenStreetMap
theater_locations_musicals_geo <- geo(
  address = theater_locations_musicals$locations, method = "osm",
  lat = latitude, long = longitude
)

# The Winter Garden address (row 1) kept geocoding to a different location
# with the same address, so its latitude and longitude are set manually.
theater_locations_musicals_geo$latitude[1] <- 40.761520
theater_locations_musicals_geo$longitude[1] <- -73.983490

# Geocode the play theater addresses with OpenStreetMap
theater_locations_plays_geo <- geo(
  address = theater_locations_plays$locations, method = "osm",
  lat = latitude, long = longitude
)
# Attach theater names; the order must match the address order above
theater_locations_musicals_geo <- theater_locations_musicals_geo %>%
  mutate(names = c("Winter Garden",
                   "Virginia",
                   "St. James",
                   "Shubert",
                   "Richard Rodgers",
                   "Palace",
                   "New Amsterdam",
                   "Nederlander",
                   "Minskoff",
                   "Lunt-Fontanne",
                   "Gershwin",
                   "Broadhurst",
                   "August Wilson",
                   "Ambassador"))

theater_locations_plays_geo <- theater_locations_plays_geo %>%
  mutate(names = c("Walter Kerr",
                   "Vivian Beaumont",
                   "Schoenfeld",
                   "Royale",
                   "Plymouth",
                   "Music Box",
                   "Jacobs",
                   "Imperial",
                   "Helen Hayes",
                   "Ethel Barrymore",
                   "Cort",
                   "Brooks Atkinson",
                   "American Airlines"))
# Interactive map of the theaters that staged the top musicals
p5 <- leaflet() %>%
  addTiles() %>%
  addMarkers(theater_locations_musicals_geo$longitude,
             theater_locations_musicals_geo$latitude,
             label = theater_locations_musicals_geo$names) %>%
  setView(-73.98, 40.76, zoom = 13)
p5
The theaters that staged musicals all tend to be located near one another. The question of how location is associated with attendance and gross is a bit hard to answer. Of the theaters that have shown The Lion King (Minskoff and New Amsterdam), Minskoff is located directly in the center of the Theater District while New Amsterdam lies slightly south. On the other hand, the lowest-scoring of the most popular musicals in attendance and gross (The Producers, which was performed in the St. James theater) played only a block away from Minskoff. In this case, location does not appear to be correlated with attendance or gross, considering that the theaters all sit in the same general area.
However, for plays this question is answered a bit differently. The top play in both gross and attendance, War Horse, was performed in the Vivian Beaumont theater, which is the outlier in terms of location on the map. While all other theaters that staged plays are within the theater district, the Vivian Beaumont theater is far away from them, yet its play had the highest attendance and gross. Perhaps the outlying location was easier to commute to or had less traffic during typical performance hours, which made the play easier to attend and raised demand for it. Because only one play supports this hypothesis, we cannot say with confidence that location determines the attendance and/or gross of a production. Still, it is an interesting observation in the context of these particular plays, and location may have been a confounding variable of some sort in the success of War Horse.
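A rough way to put a number on how far a theater sits from the rest is the great-circle (haversine) distance between its geocoded coordinates and a reference point. The sketch below is an illustration rather than part of the original analysis: haversine_km is a hypothetical helper written in base R, and treating the map center passed to setView() as the middle of the theater district is an assumption made here. Both sets of coordinates are taken from the code above.

```r
# Hypothetical helper: great-circle (haversine) distance in kilometers
# between two latitude/longitude points given in degrees.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(sqrt(a))
}

# Assumption: the setView() center stands in for the theater district center.
center_lat <- 40.76
center_lon <- -73.98

# Distance from that center to the corrected Winter Garden coordinates.
winter_garden_km <- haversine_km(center_lat, center_lon, 40.761520, -73.983490)
winter_garden_km
```

Because the function is vectorized, it could be applied to the latitude and longitude columns of theater_locations_plays_geo inside mutate() to compare every theater's distance from the same reference point.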
Overall, we can see which musicals and plays have had the highest attendance and gross over two decades of Broadway history. The top three musicals are The Lion King, Wicked, and The Phantom Of The Opera. The top three plays are War Horse, The Curious Incident Of The Dog In The Night-Time, and a split between Proof (attendance) and It’s Only A Play (gross). Further exploration could look at more productions to determine whether location truly has any correlation with attendance and/or gross. We saw a specific instance of this for War Horse, but it could in reality be a coincidence or reflect the play’s content capturing audiences’ attention. Another area of exploration could follow the upward trend of attendance through the 21st century. Will this trend continue as new productions are written and performed in the age of Covid-19? Or had Broadway’s popularity already peaked before the pandemic? In this exploration of the data set (“Broadway CSV File,” n.d.), after examining the highest attendance per year, the popularity of musicals vs. plays, and theater locations for performances, we saw the collective impact these productions have had over this period of Broadway’s history. It is no surprise that these productions have scored so well and continue to do so today.
Reviewer 1
The author conducted an investigation of popular Broadway performances during the 20th and early 21st centuries, in particular studying relationships between show attendance, profit, and location. The author sought to answer three questions: Do certain productions tend to raise more income for the economy? The conclusion is that musicals make more income than plays or special productions and is supported by figures 2-5. What year(s) has Broadway sold the most tickets or filled the most seats? The conclusion is that Broadway attendance was highest from roughly 2000 to 2016 and is supported by figures 1 and 3. Is there a correlation between location of a theater and its attendance? No strong conclusions could be drawn, though the outlier of War Horse at Vivian Beaumont theater is noted. This is supported by figures 7 and 8. Overall, I would say that the figures support the conclusions that the author made.
The report’s figures are legible and clearly communicate the trends of interest. Color choice is linear or not as appropriate to properly display the variables. Figures are well-labeled with appropriate axes, though a few could use titles for reference. Overall, the data is presented in a concise, appealing manner, and graphs do not distract from the observations that the reader is intended to make.
This report’s graphs are very appealing. I especially like the interactive maps. The report’s code is legible and well-structured. The written aspect of the report is also interesting and well-edited. Figure 2 (the pie chart) would have been better presented as a bar graph–pie charts are somewhat infamous for being difficult to read. Figure 6 (the categorical-categorical point plot) would have been better presented as a table–as both variables are unordered, relating them spatially makes the data more difficult to read without communicating any meaningful trends.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".