Inference Testing on Probability Distributions of Chosen vs Given Names

Kenai Burton-Heckman (Data Science at Reed College) https://reed-statistics.github.io/math241-spring2022/
May 5, 2022

Introduction

How do chosen names differ from given names? Or more precisely: What is the probability distribution of chosen names for (prominent) transgender people? Is that distribution representative of the distribution of names given by parents to their children? This project aims to uncover some of the trends in how trans people choose names for themselves. Some of the nuance of the discussion has to be sacrificed due to the limitations of the data: while the population data does allow for names with some observations that are male and some that are female, there are no non-binary categories for gender. Additionally, manually determining the gender of each trans person in the sample would be infeasible, so instead each name was assigned a masculine or feminine connotation based on the relative prevalence of male-assigned or female-assigned people in the population (on the assumption that a name predominantly held by male-assigned people is a stereotypically masculine name).

Data on given names was taken from the University of California Irvine Machine Learning Repository, with information on 147 thousand different names, their relative prevalences in the populations of the (predominantly English-speaking) countries of the USA, the UK, Australia, and Canada, and the associated genders of those names at their respective frequencies. Of additional note is that this data is presumably from birth certificates, so the actual distribution of names in the population is certainly different, given that people are able to use their chosen names on a daily basis without changing their birth certificates (this is especially true in the context of this problem, since many trans people have a dead name, and changing the gender/name on your birth certificate is a hassle). However, given that trans people who haven’t changed their birth certificates are a fraction of a fraction of the population, I will assume that this gender data is approximately representative of the population. (Gender by Name, 2020)

In addition, Wikipedia has several lists of transgender people, under different categories. These lists were scraped and compiled to create a sample of transgender people’s names in order to make inferences about their probability distribution as compared to the population. Using these Wikipedia lists poses no significant threat to the privacy or well-being of the trans (or otherwise) people involved. Given that they have Wikipedia articles written about them, they are already prominent figures, so the minor exposure that this project would offer has very little potential to be harmful. Additionally, this is all publicly available data and no trans people’s names or other sensitive information are included in this report. (Wikipedia Lists of People by Gender, 2020)

After processing, every data set used for this project has 3 to 4 variables: a categorical variable name with approximately 134 thousand levels, a categorical variable gender that assigns a perceived masculine or feminine gender to each name (50 thousand masculine, 83 thousand feminine), a discrete numerical variable n that counts the number of people with that first name, and a continuous numerical variable prop that gives the probability of a randomly selected individual from the data having that given name (in other words, defining a frequency distribution).
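As a minimal sketch of that layout, the toy data frame below (made-up names and counts, not drawn from the real data) has the same four variables, with prop computed from n:

```r
# Toy counts, not taken from the real dataset.
toy <- data.frame(
  name   = c("Alex", "Blair", "Casey", "Dana"),
  gender = c("M", "M", "F", "F"),
  n      = c(50L, 10L, 30L, 10L)
)

# prop: probability that a randomly selected individual has each name
toy$prop <- toy$n / sum(toy$n)

toy
```

Because prop is n divided by the total count, the column sums to one and can be read as a frequency distribution over names.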

There is a possible difference in scope between Wikipedia and the dataset on the gender of names. While no country is monolithic, all four are English-speaking and therefore present a bias in favor of common English names. There may be a similar bias on English Wikipedia pages, but one would expect it to be diminished due to the website’s goal of containing encyclopedic knowledge. Furthermore, by merging multiple lists of transgender people, there may be individual biases associated with each subcategory of list. While they may all be prominent in some way, it is entirely possible or even likely that the distribution of names of transgender academics is different from the distribution of names of transgender actors. It is possible some or all of the lists of transgender people are not actually indicative of the distribution of names in the population of transgender people in predominantly English-speaking countries. Finally, there is probably a bias towards older trans people, because older people have had more time to get Wikipedia articles written about them, but this is not a bias to worry about, since older trans people are also more likely to be using a chosen name.

Methods/Results

rvest was used to scrape Wikipedia pages for the names of prominent transgender people. Four types of URLs were identified, and the HTML content was pulled from each page according to its XPath. These lists of names were processed to remove stopwords, such as Wikipedia’s single-letter subcategory headings, and raw text was parsed into full names. Abbreviations (initials) were dropped. The full names were used to check for uniqueness of Wikipedia articles, and then the first names were pulled and turned into frequency distributions. The raw population data was also turned into frequency distributions. For both chosen and given names, three frequencies were calculated: one total, one for feminine names, and one for masculine names. Before performing inference, some preliminary observations about the datasets were made.
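The name-processing steps described above can be sketched in base R on a handful of hypothetical scraped lines. The vector raw and the regular expression for initials are illustrative assumptions; the code actually used for the project is in the appendix.

```r
# Hypothetical scraped lines: category headings, full names, one duplicate.
raw <- c("A", "Alex Smith", "Alex Jones", "B", "Blair Lee",
         "J. Doe", "Alex Smith")

headings <- c(LETTERS, "*", "-")                 # single-character headings
full     <- unique(raw[!(raw %in% headings)])    # one entry per article title

first <- sub(" .*$", "", full)                   # first word of each full name
first <- first[!grepl("^[A-Z]\\.$", first)]      # drop initials such as "J."

# frequency distribution of first names
counts <- as.data.frame(table(name = first), responseName = "n",
                        stringsAsFactors = FALSE)
counts$prop <- counts$n / sum(counts$n)
counts
```

Deduplicating on the full name before extracting the first name keeps two different people named Alex while discarding the repeated article title.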

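# packages the code below relies on
library(tidyverse) # dplyr, tidyr, ggplot2, forcats, stringr, tibble
library(ggthemes)  # scale_fill_colorblind()
library(XNomial)   # xmonte()
library(rvest)     # read_html(), html_node(), html_table(), html_text2()

# data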
base           <- read.csv("name_gender_dataset.csv")
colnames(base) <- c("name","gender","n","prop")
given          <- read.csv("given.csv")
given_f        <- read.csv("given_f.csv")
given_m        <- read.csv("given_m.csv")
chosen         <- read.csv("chosen.csv")
chosen_f       <- read.csv("chosen_f.csv")
chosen_m       <- read.csv("chosen_m.csv")
base                    %>%
  filter(name == "Sam")
  name gender      n        prop
1  Sam      M 127192 0.000348092
2  Sam      F   1476 0.000004040

In the unprocessed given data, gender-neutral names have separate observations with their own frequencies. We can also see that there are no nonbinary genders in the unprocessed given data.

base                                    %>%
  filter(gender != "M" & gender != "F") %>%
  nrow()
[1] 0

There are still plenty of gender-neutral names, though; the 10 most popular are:

base               %>%
  group_by(name)   %>%
  summarize(
    prop = n/sum(n),
    sum = sum(n),
    gender = gender
  )                %>%
  slice_head()     %>%
  filter(
    prop > 0.45,
    prop < 0.55
  ) %>%
  arrange(desc(sum)) %>%
  head(10)
# A tibble: 10 × 4
# Groups:   name [10]
   name     prop    sum gender
   <chr>   <dbl>  <int> <chr> 
 1 Riley   0.514 211592 F     
 2 Jackie  0.537 171577 F     
 3 Kerry   0.505 106748 F     
 4 Frankie 0.549  75490 M     
 5 Quinn   0.511  67994 M     
 6 Emerson 0.543  46578 M     
 7 Robbie  0.508  44130 F     
 8 Justice 0.521  34549 M     
 9 Blair   0.513  32904 M     
10 Elisha  0.527  29179 F     

Overall, there are only a few thousand names that have similar counts of male and female observations, and many of them have few observations. If someone wrote code to scrape the gender identity of the trans people in my sample off of Wikipedia, then more in-depth analysis surrounding the distribution of genders between chosen and given names could be performed, but my current options are more limited.
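To make the majority-gender assignment concrete, here is a small base R sketch on toy rows. The "Sam" counts are the ones printed earlier; "Robin" and its counts are made up for illustration.

```r
# Toy rows in the shape of the raw data (one row per name-gender pair).
toy <- data.frame(
  name   = c("Sam", "Sam", "Robin", "Robin"),
  gender = c("M", "F", "M", "F"),
  n      = c(127192, 1476, 480, 520)
)

# share of each name's total held by each gender
toy$share <- toy$n / ave(toy$n, toy$name, FUN = sum)

# keep the row with the largest count per name
toy      <- toy[order(toy$name, -toy$n), ]
majority <- toy[!duplicated(toy$name), ]

# flag names with a near-even split (same 45-55% band as the chunk above)
majority$near_even <- majority$share > 0.45 & majority$share < 0.55
majority
```

Under this rule "Sam" is treated as masculine without ambiguity, while a name like the hypothetical "Robin" is assigned a gender on the strength of a narrow majority.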

base            %>%
  arrange(name) %>%
  head(1)
  name gender n     prop
1    A      F 2 5.47e-09
base            %>%
  arrange(name) %>%
  tail(1)
        name gender  n     prop
147269 Zzyzx      M 10 2.74e-08

There aren’t names beginning with non-alphabetical characters in the unprocessed given data. This could be due to cleaning of the data prior to my investigation of it, or it could also be due to regulations placed on the naming of babies on their birth certificates. Either way, this does exclude names that would be viewed as particularly unusual, which we might speculate are present at an inflated rate in a sample of chosen names, given the intrinsically liberating aspect of choosing a name for oneself. Certainly there are names in the sample data not present in the population of given names.
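A quick way to see which sampled names are missing from the population list is a set difference. The two vectors below are hypothetical stand-ins, not the real data.

```r
# Hypothetical name vectors; neither is drawn from the real data.
given_names  <- c("Alex", "Blair", "Casey", "Dana")
chosen_names <- c("Alex", "Casey", "Xylo", "Nyx")

unmatched <- setdiff(chosen_names, given_names)  # chosen names absent from given
unmatched
length(unmatched) / length(chosen_names)         # share of chosen names unmatched
```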

df <- given                          %>%
  group_by(gender)                   %>%
  summarize(given = n()/nrow(given)) %>%
  select(!gender)

chosen                                                %>%
  filter(n > 0)                                       %>%
  group_by(gender)                                    %>%
  summarize(chosen = n()/nrow(filter(chosen, n > 0))) %>%
  bind_cols(df)                                       %>%
  pivot_longer(
    cols = c("chosen", "given"),
    names_to = "group",
    values_to = "prop"
  )                                                   %>%
  ggplot(aes(x = group, y = prop, fill = gender)) +
  geom_col()                                      +
  scale_fill_colorblind()                         +
  labs(
    title = "proportions of feminine and masculine names by sample",
    y = "proportion",
    fill = "gender associated\nwith name"
  )

df <- given                              %>%
  group_by(gender)                       %>%
  summarize(given = sum(n)/sum(given$n)) %>%
  select(!gender)

chosen                                            %>%
  group_by(gender)                                %>%
  summarize(chosen = sum(n)/sum(chosen$n))        %>%
  bind_cols(df)                                   %>%
  pivot_longer(
    cols = c("chosen", "given"),
    names_to = "group",
    values_to = "prop"
  )                                               %>%
  ggplot(aes(x = group, y = prop, fill = gender)) +
  geom_col()                                      +
  scale_fill_colorblind()                         +
  labs(
    title = "proportions of individuals with feminine and masculine names by sample",
    y = "proportion",
    fill = "gender associated\nwith name"
  )

There is a similar ratio of feminine to masculine names in both the chosen and given datasets, favoring feminine names, but while the ratio of male to female people in the population is approximately 50-50 (as expected), there are relatively more people with feminine names in the sample of transgender people. This implies that the variance in given masculine names is lower relative to the variance of chosen masculine names, while the variance in given feminine names is higher relative to the variance of chosen feminine names.
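The two plots measure different things: the first counts distinct names per gender, the second counts individuals. A toy example with made-up counts shows how the two shares can diverge.

```r
# Made-up counts for illustration only.
toy <- data.frame(
  name   = c("Alex", "Blair", "Casey", "Dana", "Erin"),
  gender = c("M", "M", "F", "F", "F"),
  n      = c(60, 20, 10, 5, 5)
)

# share of distinct names per gender (as in the first plot)
share_names <- table(toy$gender) / nrow(toy)

# share of individuals per gender (as in the second plot)
share_people <- tapply(toy$n, toy$gender, sum) / sum(toy$n)

share_names
share_people
```

Here feminine names make up most of the distinct names but a minority of the individuals, because the masculine names are each held by more people.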

Given a null frequency distribution represented by the population of given names, and a sample of chosen names of transgender people, a chi-squared goodness-of-fit test is appropriate to determine whether the distributions by gender are significantly different from one another. This gave hypotheses

\[ \begin{aligned} H_0&\colon \vec{p}_c=\vec{p}_g\\ H_a&\colon \vec{p}_c\neq\vec{p}_g \end{aligned} \]

The conditions for the test are:

Simple random sample. Not violated: the trans people who become famous enough to receive Wikipedia pages should be a random sample of the population of trans people.

Sample size (whole table). Not violated (at least on initial impression, more on this later). Sample sizes are:

n_f <- chosen_f %>%
  uncount(n)    %>%
  nrow()

n_m <- chosen_m %>%
  uncount(n)    %>%
  nrow()

n_  <- n_f + n_m

n_
[1] 1474
n_f
[1] 992
n_m
[1] 482

For all, feminine, and masculine names, respectively.

Expected cell count. We want at least 80% of categories to have null counts of at least five, but clearly:

given                                             %>%
  mutate(expected = n_*prop, met = expected >= 5) %>%
  summarize(prop = sum(met)/n())
          prop
1 0.0002389825
given_f                                            %>%
  mutate(expected = n_f*prop, met = expected >= 5) %>%
  summarize(prop = sum(met)/n())
          prop
1 0.0002400298
given_m                                            %>%
  mutate(expected = n_m*prop, met = expected >= 5) %>%
  summarize(prop = sum(met)/n())
          prop
1 0.0002372573

The sample size is not nearly large enough for the number of categorical levels. Yates’s correction for continuity was also tried (Yates, 1934), but it had the opposite of the desired effect, increasing the test statistics by several orders of magnitude. At the very least, each name with fewer than 5 expected observations will contribute very little to the test statistic, as their probabilities are minuscule.

Independence. Assumed. While it’s true that some parents name children after themselves or relatives in the null frequency distribution, in terms of choosing a name for oneself, this is not a relevant practice.
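The expected cell count condition above is hard to meet for any heavy-tailed distribution. As a rough illustration, the sketch below applies the same check to a made-up distribution over 1000 hypothetical names whose weights fall off as 1/rank; only the sample size is taken from this report.

```r
k <- 1000                  # hypothetical number of names
p <- 1 / seq_len(k)        # made-up power-law-like weights
p <- p / sum(p)            # normalise to a probability distribution

n_sample <- 1474           # sample size reported above
expected <- n_sample * p   # expected count per name under the null

mean(expected >= 5)        # share of categories meeting the rule of thumb
```

Even with far fewer categories than the real data, only a small share of names reach an expected count of five, well short of the 80% guideline.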

This formula was used for the \(\chi^2\) statistic: \[ \chi^2=\sum_{i=1}^n\frac{(O_i-E_i)^2}{E_i} \]
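As a small worked example of this formula, the sketch below computes the statistic by hand for three made-up categories and compares it with base R's chisq.test(). The counts and probabilities are illustrative, and the degrees of freedom follow the textbook convention of categories minus one.

```r
obs <- c(18, 22, 10)            # made-up observed counts
p0  <- c(0.5, 0.3, 0.2)         # made-up null probabilities
ex  <- sum(obs) * p0            # expected counts under H0

stat <- sum((obs - ex)^2 / ex)  # chi-squared statistic
dof  <- length(obs) - 1         # categories minus one
pval <- pchisq(stat, df = dof, lower.tail = FALSE)

# base R's built-in goodness-of-fit test gives the same statistic
builtin <- chisq.test(obs, p = p0)
c(by_hand = stat, builtin = unname(builtin$statistic))
```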

Base R was used because infer could not allocate vectors of the size generated by the datasets.

# all names
obs     <- chosen$n                                         # observed counts
exp     <- 1410*given$prop                                  # expected counts (hard-coded total; n_ above is 1474)
chisq   <- sum((obs-exp)^2/exp)                             # test statistic
degf    <- nrow(chosen)                                     # degrees of freedom (number of categories)
cv      <- qchisq(0.95, df = degf)                          # critical value at the 5% level
p_val   <- pchisq(q = chisq, df = degf, lower.tail = FALSE) # p-value

# feminine names (hard-coded total; n_f above is 992)
obs_f   <- chosen_f$n
exp_f   <- 975*given_f$prop
chisq_f <- sum((obs_f-exp_f)^2/exp_f)
degf_f  <- nrow(chosen_f)
cv_f    <- qchisq(0.95, df = degf_f)
p_val_f <- pchisq(q = chisq_f, df = degf_f, lower.tail = FALSE)

# masculine names (hard-coded total; n_m above is 482)
obs_m   <- chosen_m$n
exp_m   <- 435*given_m$prop
chisq_m <- sum((obs_m-exp_m)^2/exp_m)
degf_m  <- nrow(chosen_m)
cv_m    <- qchisq(0.95, df = degf_m)
p_val_m <- pchisq(q = chisq_m, df = degf_m, lower.tail = FALSE)

tibble(
 subset          = c("all", "feminine", "masculine"),
 degrees_freedom = c(degf, degf_f, degf_m),
 test_statistic  = c(chisq, chisq_f, chisq_m),
 critical_value  = c(cv, cv_f, cv_m),
 p_value         = c(p_val, p_val_f, p_val_m)
)
# A tibble: 3 × 5
  subset    degrees_freedom test_statistic critical_value p_value
  <chr>               <int>          <dbl>          <dbl>   <dbl>
1 all                133901       7039542.        134753.       0
2 feminine            83323       2588891.         83996.       0
3 masculine           50578       5601883.         51102.       0

The test statistic is comically large (it reached upwards of 2 billion while using the continuity correction). Names contributing the most to the test statistic have one observation in both datasets. It’s very possible that these are in fact the same observation, and that the people in question had their names legally changed and that this was reflected in the null frequency distribution. However, I can’t in good conscience remove these observations, since they are still indicative of chosen names being skewed very far towards the tail of the null distribution. Without overzealous subsetting of the data, the sample of transgender people is just too small relative to the number of possible names. Name probabilities will almost always be so small that as soon as a single observation is noted, the squared difference in the numerator cannot offset the very small denominator, causing the contribution of that single observation to the \(\chi^2\) value to far exceed the degrees of freedom for the test, which doubles as a rough estimate of the critical value at such extremes. Given that the sample grossly failed to meet the assumptions for inference, several Monte Carlo simulations were run as an alternative. (Engels, 2015)
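The arithmetic behind that blow-up can be sketched with a single rare name. The proportion below is the one printed for a rare name in the raw data and the sample size is the total reported earlier; combining them this way is only an illustration, since the processed proportions differ slightly.

```r
n_sample <- 1474               # total sample size reported above
p_rare   <- 2.74e-08           # proportion printed for one rare name in the raw data

E <- n_sample * p_rare         # expected count under the null
contribution <- (1 - E)^2 / E  # chi-squared term for one observed individual

c(expected = E, contribution = contribution)
```

Because the expected count is so far below one, the term is approximately 1/E, so a single observed individual adds tens of thousands to the statistic.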

xmonte(obs = obs, expr = exp, ntrials = 5000, statName = "Prob")
xmonte(obs = obs_f, expr = exp_f, ntrials = 5000, statName = "Prob")
xmonte(obs = obs_m, expr = exp_m, ntrials = 5000, statName = "Prob")

The above code evaluates to three p-values of zero. For the sake of computational efficiency, it is not automatically run. Using a Monte Carlo method is about as good an approach as one can hope to use for this topic, as it is useful for more complex processes. While an exact goodness of fit test would be ideal, this isn’t applicable for probability distributions as large as that of the given and chosen names. And while a chi-squared test for goodness of fit is easily computed, its assumption of a sufficient number of expected values simply cannot be met because the sample size is too small relative to the number of possible levels. But there aren’t any assumptions that need to be made about the probability distribution for the Monte Carlo method to work, and it still gives significant results.
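For readers unfamiliar with the approach, here is a minimal base R sketch of a Monte Carlo goodness-of-fit test on made-up data. It is not the xmonte() implementation: it ranks simulated samples by the chi-squared statistic rather than by multinomial probability, and the helper chisq_stat is a name introduced only for this example.

```r
set.seed(1)                       # for reproducibility of the sketch

obs <- c(30, 5, 3, 2)             # made-up observed counts
p0  <- c(0.4, 0.3, 0.2, 0.1)      # made-up null probabilities
n   <- sum(obs)

# chi-squared statistic for a vector of counts
chisq_stat <- function(counts, p, n) sum((counts - n * p)^2 / (n * p))

stat_obs <- chisq_stat(obs, p0, n)

# draw multinomial samples under H0 and recompute the statistic for each
B        <- 2000
sims     <- rmultinom(B, size = n, prob = p0)  # one column per simulated sample
stat_sim <- apply(sims, 2, chisq_stat, p = p0, n = n)

# Monte Carlo p-value with the usual +1 correction
p_mc <- (1 + sum(stat_sim >= stat_obs)) / (B + 1)
p_mc
```

The p-value is simply the share of simulated samples at least as extreme as the observed one, so no minimum expected count is needed. Base R's chisq.test() offers a similar option through its simulate.p.value argument.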

Conclusions

While preparing the samples for analysis, some names (approximately 300) were dropped because they would have made it impossible to perform inference. In other words, these names were so unique that they were not even recorded once in the expansive list of given names. Given that these make up a significant proportion of the sample of transgender people, and because they would further cause the distribution of chosen names to differ from the distribution of given names, the p-values provided by the Monte Carlo methods are somewhat conservative. It is therefore safe to conclude that the distribution of transgender people’s chosen names differs from the distribution of given names; further, that the distribution of traditionally feminine chosen names differs from the distribution of traditionally feminine given names, and the same for traditionally masculine names. This can be visualized in the context of the distribution of given names:

given                                          %>%
  rename(prop_given = "prop")                  %>%
  bind_cols(select(chosen, prop))              %>%
  rename(prop_chosen = "prop")                 %>%
  filter(prop_given > 0.000056)                %>% # cutoff of approximately n > 20000
  mutate(name = fct_reorder(name, prop_given)) %>%
  ggplot()                                                  +
  geom_col(aes(x = name, y = prop_given, fill = "given"))   +
  geom_col(aes(x = name, y = prop_chosen, fill = "chosen")) +
  labs(
    title = "upper tail comparison of distributions of chosen and given names",
    x = "names in increasing frequency of appearance in population of given names",
    y = "proportion of sample",
    fill = "distribution"
  )                                                         +
  scale_fill_manual(values = c("salmon", "grey33"))         +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )

The distribution of chosen names is clearly quite different from the distribution of given names. Most notably, the absolute most common given names are much less prevalent than expected. While this graphic appears to show the distribution of chosen names varying greatly from a power law (which is clearly seen in the null distribution, and is frequently observed in complex human processes), the chosen names are still in a power law of their own:

chosen                                   %>%
  filter(n > 0)                          %>%
  mutate(name = fct_reorder(name, prop)) %>%
  ggplot()                                          +
  geom_col(aes(x = name, y = prop))                 +
  labs(
    title = "distribution of chosen names",
    x = "names in increasing frequency of appearance",
    y = "proportion of sample"
  )                                                 +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )

However, just because both distributions can be approximated with a power law does not make them remotely similar to one another. While the lower tails of both distributions were trimmed due to their undesired effect on the scale of the x-axis, we can further assume that the lower tail of given names is much more sparse relative to the tail of chosen names. I would speculate that given the intrinsically liberating aspect of naming oneself, transgender people choose names for themselves skewed away from the prototypical naming conventions of their parent generation. However, given the at times insular nature of transgender communities (especially in reference to the practice of shopping or searching for names, where one might ask other members of the community for prospective names), the most popular chosen names would be more likely to be circulated, making the power law an intuitive model for chosen names, even when their distribution differs significantly from that of given names.
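One informal way to check whether a set of counts looks like a power law is to regress log count on log rank and see whether the points fall near a straight line. The sketch below uses made-up counts proportional to 1/rank; it is an illustration of the idea, not an analysis of the project data, and a straight line on log-log axes is suggestive rather than conclusive.

```r
rank   <- 1:50
counts <- round(1000 / rank)         # made-up counts that follow roughly 1/rank

fit   <- lm(log(counts) ~ log(rank)) # straight line on log-log axes
slope <- unname(coef(fit)[2])        # estimated power-law exponent
slope
```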

Code Appendix

# first-time data acquisition
# the output of all this code is already included in the zip

# data loading
given           <- read.csv("name_gender_dataset.csv")
colnames(given) <- c("name","gender","n","prop")

# name abbreviations to remove
# they may be actual names in this context, but as the names from wikipedia
# are pulled from page titles, they are just initials in the context of my sample
# and therefore have to be excluded
stoplist <- paste0(LETTERS, ".") # "A." through "Z."

# chooses gender with most occurrences to assign a perceived gender to each name
given <- given                  %>%
  select(name, gender, n)       %>%
  group_by(name)                %>%
  arrange(desc(n))              %>%
  slice_head()                  %>%
  filter(!(name %in% stoplist)) %>%
  arrange(name)

 # we will treat the binary genders as separate distributions
given_f <- given        %>%
  filter(gender == "F") %>%
  select(!gender)

given_m <- given        %>%
  filter(gender == "M") %>%
  select(!gender)

# create frequency distributions
given <- given            %>%
  ungroup()               %>%
  mutate(prop = n/sum(n))

given_m <- given_m        %>%
  ungroup()               %>%
  mutate(prop = n/sum(n))

given_f <- given_f        %>%
  ungroup()               %>%
  mutate(prop = n/sum(n))
# defining functions for legibility

# grabs table from specified page and xpath
table_scraper <- function(url){
  # the last expression is returned, so no explicit return() is needed
  url                                                                    %>%
    read_html()                                                          %>%
    html_node(xpath = "/html/body/div[3]/div[3]/div[5]/div[1]/table[2]") %>%
    html_table(fill = TRUE)                                              %>%
    mutate(full = Name, .keep = "none")
}

# grabs text from specified page and xpath
# and does minimal processing to convert it to a df where each element
# in the list is a row of the df
text_scraper <- function(url){
  df <- url                                                             %>%
    read_html()                                                         %>%
    html_node(xpath = "/html/body/div[3]/div[3]/div[5]/div[2]/div/div") %>%
    html_text2()                                                        %>%
    str_split("\n")                                                     %>%
    as.data.frame()
  colnames(df) <- c("full")
  return(df)
}

# for alternative xpath
text_scraper_ <- function(url){
  df <- url                                                                %>%
    read_html()                                                            %>%
    html_node(xpath = "/html/body/div[3]/div[3]/div[5]/div[2]/div[2]/div") %>%
    html_text2()                                                           %>%
    str_split("\n")                                                        %>%
    as.data.frame()
  colnames(df) <- c("full")
  return(df)
}

# all pages that would imply a high likelihood of people from
# non-predominantly English-speaking countries were excluded
# a few pages were excluded because they would have been significantly harder to scrape
# every single wikipedia url i'm using, categorized by how it should be scraped

# table
urls_a <- c(
  'https://en.wikipedia.org/wiki/List_of_people_with_non-binary_gender_identities',
  'https://en.wikipedia.org/wiki/List_of_transgender_people'
)

# text with standard xpath
urls_b <- c(
  'https://en.wikipedia.org/wiki/Category:Non-binary_activists',
  'https://en.wikipedia.org/wiki/Category:American_non-binary_actors',
  'https://en.wikipedia.org/wiki/Category:Australian_non-binary_actors',
  'https://en.wikipedia.org/wiki/Category:British_non-binary_actors',
  'https://en.wikipedia.org/wiki/Category:Canadian_non-binary_actors',
  'https://en.wikipedia.org/wiki/Category:Non-binary_voice_actors',
  'https://en.wikipedia.org/wiki/Category:Non-binary_archaeologists',
  'https://en.wikipedia.org/wiki/Category:Non-binary_artists',
  'https://en.wikipedia.org/wiki/Category:Non-binary_comedians',
  'https://en.wikipedia.org/wiki/Category:Non-binary_computer_scientists',
  'https://en.wikipedia.org/wiki/Category:Non-binary_drag_performers',
  'https://en.wikipedia.org/wiki/Category:Non-binary_models',
  'https://en.wikipedia.org/wiki/Category:Non-binary_musicians',
  'https://en.wikipedia.org/wiki/Category:Non-binary_politicians',
  'https://en.wikipedia.org/wiki/Category:Non-binary_sportspeople',
  'https://en.wikipedia.org/wiki/Category:Non-binary_writers',
  'https://en.wikipedia.org/wiki/Category:Intersex_non-binary_people',
  'https://en.wikipedia.org/wiki/Category:Transgender_non-binary_people',
  'https://en.wikipedia.org/wiki/Category:Asexual_non-binary_people',
  'https://en.wikipedia.org/wiki/Category:Genderqueer_people',
  'https://en.wikipedia.org/wiki/Category:Two-spirit_people',
  'https://en.wikipedia.org/wiki/Category:Transgender_and_transsexual_men_musicians',
  'https://en.wikipedia.org/wiki/Category:Transgender_and_transsexual_female_adult_models',
  'https://en.wikipedia.org/wiki/Category:Transgender_and_transsexual_women_musicians',
  'https://en.wikipedia.org/wiki/Category:Transgender_and_transsexual_academics',
  'https://en.wikipedia.org/wiki/Category:Transgender_pornographic_film_actresses',
  'https://en.wikipedia.org/wiki/Category:Transgender_male_pornographic_film_actors',
  'https://en.wikipedia.org/wiki/Category:Transgender_and_transsexual_clergy',
  'https://en.wikipedia.org/wiki/Category:Transgender_and_transsexual_computer_programmers',
  'https://en.wikipedia.org/wiki/Category:Transgender_and_transsexual_DJs',
  'https://en.wikipedia.org/wiki/Category:Transgender_and_transsexual_comedians',
  'https://en.wikipedia.org/wiki/Category:Transgender_and_transsexual_lawyers',
  'https://en.wikipedia.org/wiki/Category:Transgender_and_transsexual_media_personalities',
  'https://en.wikipedia.org/wiki/Category:Transgender_and_transsexual_military_personnel',
  'https://en.wikipedia.org/wiki/Category:Transgender_and_transsexual_physicians',
  'https://en.wikipedia.org/wiki/Category:Transgender_and_transsexual_scientists',
  'https://en.wikipedia.org/wiki/Category:Transgender_prostitutes',
  'https://en.wikipedia.org/wiki/Category:Trans_and_gender_non-conforming_dramatists_and_playwrights',
  'https://en.wikipedia.org/wiki/Category:Transgender_and_transsexual_Jews',
  'https://en.wikipedia.org/wiki/Category:Transgender_and_transsexual_Muslims'
)

# text with alternative xpath
urls_c <- c(
  'https://en.wikipedia.org/wiki/Category:Non-binary_actors',
  'https://en.wikipedia.org/wiki/Category:Transgender_and_transsexual_people',
  'https://en.wikipedia.org/wiki/Category:Transgender_and_transsexual_actors',
  'https://en.wikipedia.org/wiki/Category:Transgender_and_transsexual_male_actors',
  'https://en.wikipedia.org/wiki/Category:Transgender_and_transsexual_actresses',
  'https://en.wikipedia.org/wiki/Category:Transgender_and_transsexual_female_models',
  'https://en.wikipedia.org/wiki/Category:Transgender_and_transsexual_artists',
  'https://en.wikipedia.org/wiki/Category:Transgender_drag_performers',
  'https://en.wikipedia.org/wiki/Category:Transgender_and_transsexual_entertainers',
  'https://en.wikipedia.org/wiki/Category:Transgender_and_transsexual_models',
  'https://en.wikipedia.org/wiki/Category:Transgender_and_transsexual_politicians',
  'https://en.wikipedia.org/wiki/Category:Transgender_and_transsexual_sex_workers',
  'https://en.wikipedia.org/w/index.php?title=Category:Transgender_and_transsexual_writers&pageuntil=Reitz%2C+Jennifer+Diane%0AJennifer+Diane+Reitz#mw-pages',
  'https://en.wikipedia.org/w/index.php?title=Category:Transgender_and_transsexual_writers&pagefrom=Reitz%2C+Jennifer+Diane%0AJennifer+Diane+Reitz#mw-pages',
  'https://en.wikipedia.org/w/index.php?title=Category:Transgender_and_transsexual_women&pageuntil=Grey%2C+Alexandra%0AAlexandra+Grey#mw-pages',
  'https://en.wikipedia.org/w/index.php?title=Category:Transgender_and_transsexual_women&pagefrom=Grey%2C+Alexandra%0AAlexandra+Grey#mw-pages',
  'https://en.wikipedia.org/w/index.php?title=Category:Transgender_and_transsexual_women&pagefrom=Paradeda%2C+Dina+Alma+de%0ADina+Alma+de+Paradeda#mw-pages'
)

#text with alternative xpath, top 2 observations must be removed
urls_d <- c(
  'https://en.wikipedia.org/wiki/Category:People_with_non-binary_gender_identities',
  'https://en.wikipedia.org/wiki/Category:Transgender_and_transsexual_sportspeople',
  'https://en.wikipedia.org/wiki/Category:Transgender_and_transsexual_men'
)

# stoplist to filter out alphabetical subcategories
stoplist <- c(LETTERS, '*', '-')
# scrape web pages, removing any duplicate (full) names
# (i am assuming that our sample is small enough that duplicate full names will not appear)
init <- F

for(url in urls_a){
  df <- table_scraper(url)        %>%
    filter(!(full %in% stoplist)) %>%
    mutate(name = word(full, 1))
  if(init == F){
    init   <- T
    chosen <- df
  } else {
    chosen <- chosen %>%
      bind_rows(df)  %>%
      unique()
  }
  Sys.sleep(1)
}

for(url in urls_b){
  df <- text_scraper(url)         %>%
    filter(!(full %in% stoplist)) %>%
    mutate(name = word(full, 1))
  chosen <- chosen %>%
    bind_rows(df)  %>%
    unique()
  Sys.sleep(1)
}

for(url in urls_c){
  df <- text_scraper_(url)        %>%
    filter(!(full %in% stoplist)) %>%
    mutate(name = word(full, 1))
  chosen <- chosen %>%
    bind_rows(df)  %>%
    unique()
  Sys.sleep(1)
}

for(url in urls_d){
  df <- text_scraper_(url)        %>%
    filter(!(full %in% stoplist)) %>%
    mutate(name = word(full, 1))  %>%
    tail(-2)
  chosen <- chosen %>%
    bind_rows(df)  %>%
    unique()
  Sys.sleep(1)
}
# scraped data processing

# count names
chosen <- chosen %>%
  count(name)

# drop names that would break my hypothesis test (no expected values)
# and assign assumed genders
chosen <- chosen                                             %>%
  filter(name %in% given$name)                               %>%
  mutate(gender = if_else(name %in% given_m$name, "M", "F"))

# expand the list of chosen names to include empty categories
df <- given                        %>%
  select(name, gender)             %>%
  filter(!(name %in% chosen$name)) %>%
  mutate(n = 0)

chosen <- chosen %>%
  bind_rows(df)  %>%
  arrange(name)

# create subset of masculine chosen names
chosen_m <- chosen      %>%
  filter(gender == "M") %>%
  select(!gender)

# create subset of feminine chosen names
chosen_f <- chosen      %>%
  filter(gender == "F") %>%
  select(!gender)

# create frequency distributions
chosen <- chosen          %>%
  mutate(prop = n/sum(n))

chosen_m <- chosen_m      %>%
  mutate(prop = n/sum(n))

chosen_f <- chosen_f      %>%
  mutate(prop = n/sum(n))
# creating csvs
write.csv(given, "given.csv", row.names = F)
write.csv(given_m, "given_m.csv", row.names = F)
write.csv(given_f, "given_f.csv", row.names = F)
write.csv(chosen, "chosen.csv", row.names = F)
write.csv(chosen_m, "chosen_m.csv", row.names = F)
write.csv(chosen_f, "chosen_f.csv", row.names = F)

Class Peer Reviews

References

Engels, B. (2015). XNomial: Exact goodness-of-fit test for multinomial data with fixed probabilities. https://cran.r-project.org/web/packages/XNomial/XNomial.pdf
Gender by name. (2020). UCI Machine Learning Repository. https://archive-beta.ics.uci.edu/ml/datasets/gender+by+name
Wikipedia lists of people by gender. (2020). Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Category:People_by_gender
Yates, F. (1934). Contingency tables involving small numbers and the χ² test. Supplement to the Journal of the Royal Statistical Society, 1(2), 217–235. http://www.jstor.org/stable/2983604

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".