Plays and Musicals Throughout the Age of Broadway

Exploring Popular Productions and their Statistics

Evan Heintz (Data Science at Reed College) https://reed-statistics.github.io/math241-spring2022/
May 4, 2022
knitr::include_graphics("broadway image data science.jpeg")


Introduction

In the world of theatre, countless plays and musicals have been performed since Broadway originated. Productions have entered and left Broadway, some continuing on to tour across the country. Because Broadway is a popular tourist attraction for New York, it is natural to ask how successful particular productions have been throughout its history. When examining plays, musicals, and other shows, many questions arise about how much impact a show has had on its audience. Do certain productions tend to bring in more revenue? In which years has Broadway sold the most tickets or filled the most seats? Is there a correlation between the location of a theater and its attendance? These questions and more can be explored through exploratory data analysis of specific productions between 1990 and 2016, which helps clarify when Broadway’s engagement from tourists around the world peaked.

Broadway Data Set

For this project, we will be examining the Broadway Data Set from CORGIS (The Collection of Really Great, Interesting, Situated Datasets) (“Broadway CSV File,” n.d.). This data set includes 31,296 observations of 12 variables, which cover dates, show names and types, statistics on gross and attendance, gross potential, number of performances per week, and the theaters used for performances.

To explain these variables further: the dates are split into four variables, Date.Day, Date.Full, Date.Month, and Date.Year, all of which are categorical ordinal. For this project, we focus on the year of each production and therefore use the Date.Year variable. Show.Name is a categorical nominal variable giving the name of each production in the data set, and Show.Type is a categorical nominal variable that classifies the production as a Musical, Play, or Special. Show.Theatre is a categorical nominal variable that lists the theater in which a production was performed, while Statistics.Capacity (numerical discrete) lists the capacity in the theaters. Statistics.Attendance is a numerical discrete variable giving the number of people who attended the production that week. Statistics.Gross is a numerical discrete variable, measured in dollars, that shows how much a production made that week. Statistics.Gross.Potential (numerical discrete) is a percentage: the gross divided by the gross potential, which is determined from ticket sales, capacity, etc. Where it could not be calculated, a 0 is listed. Finally, Statistics.Performances is a numerical discrete variable giving the number of performances during the corresponding week.
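
Because a 0 in Statistics.Gross.Potential marks a value that could not be calculated rather than a true zero, it may be worth recoding those entries as missing before summarizing the column. The following is a minimal base-R sketch on invented rows; the `toy` data frame, its values, and the `gross_potential` column name are illustrative assumptions, not taken from the CORGIS file.

```r
# Toy rows that mimic two columns of the data set (values are invented).
toy <- data.frame(
  Show.Name = c("A", "A", "B", "B"),
  Statistics.Gross.Potential = c(80, 0, 95, 105)
)

# Recode the placeholder 0 as NA so it is not treated as a real percentage.
toy$gross_potential <- ifelse(toy$Statistics.Gross.Potential == 0,
                              NA_real_,
                              toy$Statistics.Gross.Potential)

# Compare the mean with the placeholder zeros and without them.
mean_with_zeros <- mean(toy$Statistics.Gross.Potential)
mean_without_zeros <- mean(toy$gross_potential, na.rm = TRUE)
```

In this toy case the placeholder zero pulls the average down noticeably, which is the kind of distortion the recoding is meant to avoid.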


Methods and Results

To begin, we will explore which years between 1990 and 2016 have had the most attendance or engagement from audiences. Below is a visualization of which years had the most attendance for productions.

broadway_attend <- broadway %>%
  group_by(Date.Year) %>%
  summarize(attendance = sum(Statistics.Attendance)) %>%
  arrange(desc(attendance))

ggplot(data = broadway_attend, aes(x = Date.Year, y = attendance)) +
  geom_line(color = "yellow3") +
  geom_point() +
  labs(x = "Year",
       y = "Attendance",
       title = "Total Attendance per Year") +
  theme_minimal() 

The graph shows that attendance has fluctuated during the 21st century. It is also clear that attendance in the earliest years of the data set (1990-1995) is drastically lower than in later years. Overall, attendance appears to have risen over time, despite the apparent drop in 2016. We will examine the top years in attendance more closely to get a better sense of which productions have had a proportionately larger impact on Broadway. This also sets aside productions that may have been popular in the 20th century but did not make a lasting impression on Broadway’s history.
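
One thing worth checking before reading much into the 2016 drop is whether every year contributes a comparable number of reporting weeks; the text does not say whether the final year is complete, so this is only a possible explanation. A minimal base-R sketch on invented rows (the `toy` data frame and its values are assumptions for illustration):

```r
# Toy weekly records (invented): 2015 has three weeks, 2016 only two.
toy <- data.frame(
  Date.Full = c("2015-01-04", "2015-01-11", "2015-01-18",
                "2016-01-03", "2016-01-10"),
  Date.Year = c(2015, 2015, 2015, 2016, 2016),
  Statistics.Attendance = c(100, 120, 110, 130, 125)
)

# Number of distinct reporting weeks in each year.
weeks_per_year <- tapply(toy$Date.Full, toy$Date.Year,
                         function(d) length(unique(d)))

# Total attendance per year, then attendance per reported week.
total_per_year <- tapply(toy$Statistics.Attendance, toy$Date.Year, sum)
per_week <- total_per_year / weeks_per_year
```

If a year turns out to have fewer weeks than the others, attendance per reported week is a fairer basis for comparison than the yearly total.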

We can filter the data set to keep the observations within these top years, which appear to be the roughly 20-year period from 1996 to 2016.

years <- as.character(seq(1996, 2016, 1))

broadway_new <- broadway %>%
  filter(Date.Year %in% years)
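
As a quick sanity check on the window, the same filter can be expressed as a numeric range. This is a base-R sketch on an invented `toy` data frame (an assumption for illustration), and it also counts how many calendar years the window covers when both endpoints are included.

```r
# Calendar years in the window, counting both endpoints.
years <- 1996:2016
n_years <- length(years)

# Toy rows (invented) with years inside and outside the window.
toy <- data.frame(Date.Year = c(1995, 1996, 2005, 2016, 2017))

# Keep only rows whose year falls inside the window.
kept <- toy[toy$Date.Year >= min(years) & toy$Date.Year <= max(years), ,
            drop = FALSE]
```

Counting both endpoints gives 21 calendar years, so "20-year period" is best read as an approximation.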

Plays vs. Musicals

During their time on Broadway, which productions have been most successful? We define success in terms of each show’s attendance and gross. First, we look at how many observations fall into each category (musical, play, or special).

broadway_musical <- broadway_new %>%
  filter(Show.Type == "Musical") %>%
  summarise(n = n())
n_musical <- broadway_musical$n

broadway_play <- broadway_new %>%
  filter(Show.Type == "Play") %>%
  summarise(n = n())
n_play <- broadway_play$n

broadway_special <- broadway_new %>%
  filter(Show.Type == "Special") %>%
  summarise(n = n())
n_special <- broadway_special$n

show_types <- tibble(type = c("Musical","Play","Special"),
                     n = c(n_musical,n_play,n_special))

ggplot(show_types, aes(x = "", y = n, fill = type)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  colorblindr::scale_fill_OkabeIto(name = "Show Type") +
  theme(panel.background = element_blank(),
        axis.title = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank())

We can see that the majority of the data set consists of musicals. Since the number of Special shows is so small relative to the size of the data set, we will focus primarily on musicals and plays. Below we examine which category has the higher gross and attendance.
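
To put numbers on how small the Special category is, the counts can be turned into percentage shares. A minimal base-R sketch on invented rows (the `toy` data frame is an assumption, and its shares say nothing about the real data set):

```r
# Toy show types (invented) standing in for the Show.Type column.
toy <- data.frame(
  Show.Type = c("Musical", "Musical", "Musical", "Play", "Play", "Special")
)

# Count the rows of each type, then convert the counts to percentages.
type_counts <- table(toy$Show.Type)
type_share <- round(100 * prop.table(type_counts), 1)
```

Applied to the filtered data, the same two lines would report the share of each show type alongside the pie chart.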

Highest Attendance and Gross

broadway_new <- broadway_new %>%
  filter(Show.Type != "Special")

ggplot(data = broadway_new, aes(x = Date.Year, y = Statistics.Gross)) +
  geom_jitter(alpha = 0.8, aes(color = Statistics.Attendance), 
              position = position_jitter(width = .2, height = .2)) +
  scale_color_viridis(option = "C",
                      name = "Attendance") +
  facet_wrap(~Show.Type) +
  labs(title = "Attendance and Gross of Broadway Productions",
       x = "Year",
       y = "Gross") +
  theme_light()

Musicals appear to score higher in attendance and gross than plays do, although both show an upward trend as the years go on. This is reasonable, considering that many people associate Broadway primarily with musicals, especially the most famous ones. Musicals on Broadway tend to bring in more money than plays, plausibly because of that reputation. Plays still earn a high gross, however, and likely reach a different audience than musicals, which broadens the portion of the population willing to pay to see these productions. With so many musicals and plays performed in the past, which ones had the highest gross and attendance? Below we plot the top 10 productions (for both plays and musicals) with the highest attendance and gross between 1996 and 2016. (“Color Palettes,” n.d.)

best_musical_gross <- broadway_new %>%
  filter(Show.Type == "Musical") %>%
  group_by(Show.Name) %>%
  summarise(gross = sum(Statistics.Gross)) %>%
  arrange(desc(gross)) %>%
  head(10)

best_musical_attend <- broadway_new %>%
  filter(Show.Type == "Musical") %>%
  group_by(Show.Name) %>%
  summarise(attend = sum(Statistics.Attendance)) %>%
  arrange(desc(attend)) %>%
  head(10)

best_play_gross <- broadway_new %>%
  filter(Show.Type == "Play") %>%
  group_by(Show.Name) %>%
  summarise(gross = sum(Statistics.Gross)) %>%
  arrange(desc(gross)) %>%
  head(10)

best_play_attend <- broadway_new %>%
  filter(Show.Type == "Play") %>%
  group_by(Show.Name) %>%
  summarise(attend = sum(Statistics.Attendance)) %>%
  arrange(desc(attend)) %>%
  head(10)


musical_attend <- ggplot(best_musical_attend, 
                         aes(x = reorder(Show.Name, attend), y = attend)) +
  geom_col(aes(fill = reorder(Show.Name, attend))) +
  coord_flip() +
  scale_fill_manual(values = c("#f6f2ff","#e8daff","#d4bbff",
                               "#be95ff","#a56eff","#8a3ffc","#6929c4",
                               "#491d8b","#31135e","#1c0f30")) +
  theme(legend.position = "none") +
  scale_x_discrete(labels = scales::label_wrap(13)) +
  labs(x = "",
       y = "Attendance",
       title = "Attendance for Musicals") 
  
play_attend <- ggplot(best_play_attend, aes(x = reorder(Show.Name, attend), 
                                            y = attend)) +
  geom_col(aes(fill = reorder(Show.Name, attend))) +
  coord_flip() +
  scale_fill_manual(values = c("#edf5ff","#d0e2ff","#a6c8ff",
                               "#78a9ff","#4589ff","#0f62fe","#0043ce",
                               "#002d9c","#001d6c","#001141")) +
  theme(legend.position = "none") +
  scale_x_discrete(labels = scales::label_wrap(20)) +
  labs(x = "",
       y = "Attendance",
       title = "Attendance for Plays") 

musical_gross <- ggplot(best_musical_gross, aes(x = reorder(Show.Name, gross), 
                                                y = gross)) +
  geom_col(aes(fill = reorder(Show.Name, gross))) +
  coord_flip() +
  scale_fill_manual(values = c("#e5f6ff","#bae6ff","#82cfff",
                               "#33b1ff","#1192e8","#0072c3","#00539a",
                               "#003a6d","#012749","#1c0f30")) +
  theme(legend.position = "none") +
  scale_x_discrete(labels = scales::label_wrap(13)) +
  labs(x = "",
       y = "Gross",
       title = "Gross for Musicals") 

play_gross <- ggplot(best_play_gross, aes(x = reorder(Show.Name, gross), 
                                          y = gross)) +
  geom_col(aes(fill = reorder(Show.Name, gross))) +
  coord_flip() +
  scale_fill_manual(values = c("#d9fbfb","#9ef0f0","#3ddbd9",
                               "#08bdba","#009d9a","#007d79","#005d5d",
                               "#004144","#022b30","#081a1c")) +
  theme(legend.position = "none") +
  scale_x_discrete(labels = scales::label_wrap(20)) +
  labs(x = "",
       y = "Gross",
       title = "Gross for Plays")

grid.arrange(play_attend, musical_attend, ncol = 2)

grid.arrange(play_gross, musical_gross, ncol = 2)

Broadway musicals have much higher gross and attendance totals than plays do, supporting our earlier observation that Broadway is typically associated with its musicals. Among the top musicals in gross and attendance, The Lion King appears to be the most successful on Broadway during the busiest years, with Wicked and The Phantom of the Opera in second and third place. For plays, War Horse is the most successful overall, leading both measures just as The Lion King does for musicals, with The Curious Incident Of The Dog In The Night-Time in second. Third place is split: Proof has the third highest attendance, while It’s Only A Play has the third highest gross.

Interestingly, beyond the top few productions, the rankings of musicals and plays differ between gross and attendance. Some productions have had higher attendance with lower gross, while others have had higher gross with lower attendance. Overall, the plots show which musicals and plays have been the most successful during their time on Broadway.
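
One way to see why the two rankings can disagree is to divide each show's total gross by its total attendance, which gives an average amount earned per attendee. This is a base-R sketch on invented rows; the `toy` data frame, its values, and the `gross_per_attendee` column are assumptions for illustration and are not part of the original analysis.

```r
# Toy weekly rows (invented) for two hypothetical shows.
toy <- data.frame(
  Show.Name = c("A", "A", "B", "B"),
  Statistics.Gross = c(1000, 1200, 1500, 1500),
  Statistics.Attendance = c(50, 60, 50, 50)
)

# Sum gross and attendance per show.
totals <- aggregate(cbind(Statistics.Gross, Statistics.Attendance) ~ Show.Name,
                    data = toy, FUN = sum)

# Average gross per attendee for each show.
totals$gross_per_attendee <- totals$Statistics.Gross /
  totals$Statistics.Attendance
```

Here show A has the higher attendance while show B has the higher gross, because B earns more per attendee; a similar difference could account for shows that swap places between the two plots.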

Mapping Theaters

Now that we have seen which productions have had the highest gross and attendance, we will examine where they have been performed within New York. Below we filter the data to find which theaters these productions have been in. A production tends to stay in the same theater throughout its run, which explains why there are few theaters relative to the number of years each production has been performed.

theaters_musical <- broadway_new %>%
  filter(Show.Type == "Musical") %>%
  filter(Show.Name %in% c("The Lion King", "Wicked", 
"The Phantom of the Opera", "Jersey Boys", "Mamma Mia!", "Chicago", 
"The Book of Mormon", "Mary Poppins", "The Producers", "Beauty And The Beast", "Rent")) %>%
  select(Show.Name, Show.Theatre) %>%
  filter(Show.Theatre != "Cadillac Winter Garden")

theaters_play <- broadway_new %>%
  filter(Show.Type == "Play") %>%
  filter(Show.Name %in% c("War Horse", "Art", 
"The Curious Incident Of The Dog In The Night-Time", "August: Osage County", 
"God Of Carnage", "Doubt", "The 39 Steps", "Proof", "The Odd Couple 05", 
"It'S Only A Play", "Fish In The Dark", "The Tale Of The Allergist'S Wife", 
"The Graduate")) %>%
  select(Show.Name, Show.Theatre)

p3 <- ggplot(data = theaters_musical, aes(x = Show.Name, y = Show.Theatre)) +
  geom_point(shape = 23, color = "dodgerblue") +
  scale_x_discrete(labels = scales::label_wrap(10)) +
  labs(x = "Show Name (Musicals)",
       y = "Theater") +
  coord_flip() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

p4 <- ggplot(data = theaters_play, aes(x = Show.Name, y = Show.Theatre)) +
  geom_point(shape = 23, color = "forestgreen") +
  scale_x_discrete(labels = scales::label_wrap(22)) +
  labs(x = "Show Name (Plays)",
       y = "Theater") +
  coord_flip() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

gridExtra::grid.arrange(p3, p4, ncol = 2)

We can see from the maps below that for both musicals and plays, the theaters they are performed in tend to be located within the theater district of New York. This is somewhat intuitive for the average tourist or anyone familiar with the city, since there is a clear cultural center which includes these theaters for Broadway productions. When considering the most popular musicals and plays, the locations of the theaters seem to be centered around the same portion of New York.

theater_locations_musicals <- tibble(locations = c(
  "1634 Broadway, NY, NY",
  "245 W 52nd St, NY, NY",
  "246 W 44th St, NY, NY",
  "225 W 44th St, NY, NY",
  "226 W 46th St, NY, NY",
  "1564 Broadway, NY, NY",
  "214 W 42nd St, NY, NY",
  "208 W 41st St, NY, NY",
  "200 W 45th St, NY, NY",
  "205 W 46th St, NY, NY",
  "222 W 51st St, NY, NY",
  "235 W 44th St, NY, NY",
  "245 W 52nd St, NY, NY",
  "219 W 49th St, NY, NY"))

theater_locations_plays <- tibble(locations = c(
  "219 W 48th St, NY, NY",
  "150 W 65th St, NY, NY",
  "236 W 45th St, NY, NY",
  "242 W 45th St, NY, NY",
  "236 W 45th St, NY, NY",
  "239 W 45th St, NY, NY",
  "242 W 45th St, NY, NY",
  "249 W 45th St, NY, NY",
  "240 W 44th St, NY, NY",
  "243 W 47th St, NY, NY",
  "138 W 48th St, NY, NY",
  "256 W 47th St, NY, NY",
  "227 W 42nd St, NY, NY"))

theater_locations_musicals_geo <- geo(
  address = theater_locations_musicals$locations, method = "osm",
  lat = latitude, long = longitude
) 

# The geocoder kept mapping the Winter Garden theater's address to a different
# location with the same address, so the first row is overwritten by hand
# with the theater's correct latitude and longitude.
theater_locations_musicals_geo$latitude[1] <- 40.761520
theater_locations_musicals_geo$longitude[1] <- -73.983490

theater_locations_plays_geo <- geo(
  address = theater_locations_plays$locations, method = "osm",
  lat = latitude, long = longitude
)

theater_locations_musicals_geo <- theater_locations_musicals_geo %>%
  mutate(names = c("Winter Garden",
             "Virginia",
             "St. James",
             "Shubert",
             "Richard Rodgers",
             "Palace",
             "New Amsterdam",
             "Nederlander",
             "Minskoff",
             "Lunt-Fontanne",
             "Gershwin",
             "Broadhurst",
             "August Wilson",
             "Ambassador"))

theater_locations_plays_geo <- theater_locations_plays_geo %>%
  mutate(names = c("Walter Kerr",
             "Vivian Beaumont",
             "Schoenfeld",
             "Royale",
             "Plymouth",
             "Music Box",
             "Jacobs",
             "Imperial",
             "Helen Hayes",
             "Ethel Barrymore",
             "Cort",
             "Brooks Atkinson",
             "American Airlines"))

p5 <- leaflet() %>%
  addTiles() %>%
  addMarkers(theater_locations_musicals_geo$longitude, theater_locations_musicals_geo$latitude, label = theater_locations_musicals_geo$names) %>%
  setView(-73.98, 40.76, zoom = 13)
p5

The theaters that performed musicals all tend to be located near one another, which makes the question of how location is associated with attendance and gross hard to answer. Of the theaters that have shown The Lion King (Minskoff and New Amsterdam), Minskoff is located directly in the center of the Theater District while New Amsterdam lies slightly to the south. On the other hand, the lowest-ranked of the most popular musicals in attendance and gross, The Producers, was performed in the St. James theater, which is only a block away from Minskoff. In this case, location does not appear to be correlated with attendance or gross, since all of these theaters sit in the same general area.

p6 <- leaflet() %>%
  addTiles() %>%
  addMarkers(theater_locations_plays_geo$longitude, theater_locations_plays_geo$latitude, label = theater_locations_plays_geo$names) %>%
  setView(-73.98, 40.76, zoom = 13)
p6

For plays, however, the answer is a bit different. The top play in both gross and attendance, War Horse, was performed in the Vivian Beaumont theater, which is the outlier on the map. While all other theaters that performed plays are within the theater district, the Vivian Beaumont theater is far away from them, yet its play had the highest attendance and gross. Perhaps the outlying location was easier to commute to or had less traffic during typical performance hours, making the play easier to attend and raising demand. Because only one play supports this hypothesis, we cannot say with confidence that location determines the attendance and/or gross of a production. Still, it is an interesting observation in the context of these particular plays, and location may have been a confounding variable of some sort in the success of War Horse.
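
Statements such as "far away" or "a block away" could be quantified from the geocoded coordinates with a great-circle distance. Below is a minimal base-R sketch of the haversine formula; `haversine_km` is a hypothetical helper written for this illustration, and the example uses only two points that already appear in the code above: the manually corrected Winter Garden coordinates and the map center passed to `setView()`.

```r
# Great-circle distance in kilometres between two latitude/longitude points.
# radius_km is the mean Earth radius, an approximation.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(sqrt(a))
}

# Winter Garden theater to the center of the map view.
d_km <- haversine_km(40.761520, -73.983490, 40.76, -73.98)
```

The same function could be applied to the `latitude` and `longitude` columns of the geocoded tibbles to compare the Vivian Beaumont theater's distance from the map center with that of the other theaters.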

Conclusion

Overall, we can see which productions in the categories of musicals and plays have had the highest attendance and gross throughout two decades of Broadway history. The top three musicals are The Lion King, Wicked, and The Phantom Of The Opera. The top three plays are War Horse, The Curious Incident Of The Dog In The Night-Time, and a split between Proof (attendance) and It’s Only A Play (gross). Further exploration could involve looking at more productions to determine whether location truly has any correlation with attendance and/or gross. We saw a specific instance of this for War Horse, but it could in reality be a coincidence, or the play’s content may simply have captured audiences’ attention. Another area of exploration could involve following the upward trend of attendance throughout the 21st century. Will this trend continue as new productions are written and performed in the age of Covid-19? Or did Broadway’s popularity already peak before the pandemic? In this exploration of the data set (“Broadway CSV File,” n.d.), after examining the highest attendance per year, the popularity of musicals vs. plays, and the theater locations for performances, we have seen the collective impact these productions have had throughout the age of Broadway. It is no surprise that these productions have scored so well and continue to do so today.

Class Peer Reviews

References

Broadway CSV file. (n.d.). In CORGIS Datasets Project. https://corgis-edu.github.io/corgis/csv/broadway/
Color palettes. (n.d.). In Color palettes – Carbon Design System. https://carbondesignsystem.com/data-visualization/color-palettes/#categorical-palettes

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".