Assignment 2 Answers

Question 1 [1 points]
Re-express the following small program using pipes.

x<-3; mu<-5; sigma <-7

(nrm <- ( 1 / (sigma * sqrt( 2 * pi ))) * exp( -(1/2) * ( (x - mu)/ sigma )^2))
## [1] 0.05471239

Answer: (Input example)

p <- c(x = 3, mu = 5, sigma = 7)

p %>%
  { ( 1 / (.[["sigma"]] * sqrt( 2 * pi ))) * exp( -(1/2) * ( (.[["x"]] - .[["mu"]])/ .[["sigma"]] )^2 ) }

## [1] 0.05471239

Alt answer: (Memory example)

exp( -(1/2) * ( (p[["x"]] - p[["mu"]])/ p[["sigma"]] )^2) %>%
  "*"( 1 / (p[["sigma"]] * sqrt( 2 * pi )))

## [1] 0.05471239

Bonus [1/2 point] What is this equation?
The equation is the probability density function of a normal distribution

Question 2 [3 points]
Re-express the following small program using pipes.

site <- 15
gsub(";", "/", paste(paste( paste( "~hallett", paste("raw", "tara", sep = ";"), sep = ";"), 
      paste("site", site, sep="_"), sep=";"), 
      "dna_seq.fa", sep=";"))
## [1] "~hallett/raw/tara/site_15/dna_seq.fa"

Answer: (Input example)

site %>%
  paste("site", ., sep="_") %>% 
  paste("dna_seq.fa", sep=";") %>%
  paste("raw", "tara", ., sep = ";") %>%
  paste("~hallett", ., sep = ";") %>%
  gsub(";", "/", .)

## [1] "~hallett/raw/tara/site_15/dna_seq.fa"

Alt answer: (Memory example)

paste("~hallett") %>%
  paste("raw", "tara", sep = ";") %>% 
  paste("site", sep = ";") %>%
  paste(site, sep="_") %>% 
  paste("dna_seq.fa", sep=";") %>%
  gsub(";", "/", .)

## [1] "~hallett/raw/tara/site_15/dna_seq.fa"

Question 3 [6 points]

3a [1 point] How many black or african women are included in the dataset?

small_brca %>%
  filter(gender == "FEMALE" & race == "BLACK OR AFRICAN AMERICAN") %>%
  distinct(participant) %>%
  summarise(count = n())

## # A tibble: 1 x 1
##   count
##   <int>
## 1   179

3b [1 point] How many black or african males?

small_brca %>%
  filter(gender == "MALE" & race == "BLACK OR AFRICAN AMERICAN") %>%
  summarise(count = n())

## # A tibble: 1 x 1
##   count
##   <int>
## 1     3

3c [3 points] Create a histogram of all black or african individuals grouped by age. Each bar in the histogram should have two colours corresponding to male and females. (For example, red might be used for the fraction of women in some age group while blue is used for men in that same age group). Choose the number of bins (or binsize) in a way that makes your plot informative.

small_brca %>%
  filter(race == "BLACK OR AFRICAN AMERICAN") %>%
  distinct(participant, .keep_all = TRUE) %>%
  mutate(age_at_diagnosis = as.numeric(age_at_diagnosis)) %>%
  ggplot(aes(x = age_at_diagnosis, fill = gender)) +
  geom_histogram(binwidth = 5)

3d [1 point] What was the rationale for the number of bins you used?
I chose bin = 5 because it made the distribution look the smoothest while keeping the most information.

Question 4 [7 points]
4a [2 point] How many participants are there? (Careful: not samples but participants.) Show code.

small_brca %>%
  select(participant) %>%
  n_distinct()

## [1] 1092

4b [3 points] List all of the that have more than one sample. Show code.

small_brca %>%
  group_by(participant) %>%
  summarise(count = n(), .groups = 'drop') %>%
  filter(count > 1)

## # A tibble: 115 x 2
##    participant count
##    <chr>       <int>
##  1 A0AU            2
##  2 A0AY            2
##  3 A0AZ            2
##  4 A0B3            2
##  5 A0B5            2
##  6 A0B7            2
##  7 A0B8            2
##  8 A0BA            2
##  9 A0BC            2
## 10 A0BJ            2
## # … with 105 more rows

4c [2 points] For the participants identfied in 4b, create a scatter plot with the tumor on the axis and the normal on the axis (as specified by the variable in the tibble) with points representing the expression of gene .

# Get participant list
( twoplus_participant_list <- small_brca %>%
  group_by(participant) %>%
  summarise(count = n(), .groups = 'drop',
            tumor,
            ESR1)%>%
  filter(count > 1) %>%
  select(-count) %>%
  pivot_wider(names_from = tumor, values_from = ESR1, values_fn = mean) %>%
  rename(tumor = "TRUE", no.tumor = "FALSE"))

## # A tibble: 115 x 3
##    participant tumor no.tumor
##    <chr>       <dbl>    <dbl>
##  1 A0AU        28081    14812
##  2 A0AY        14412    19388
##  3 A0AZ         9980     8657
##  4 A0B3          401    24325
##  5 A0B5        10286     1773
##  6 A0B7          215     9369
##  7 A0B8        70814     2166
##  8 A0BA        33383    14830
##  9 A0BC        53861     9048
## 10 A0BJ        15806     4406
## # … with 105 more rows

When I create the tibble twoplus_paricipant_list, I take the mean of the values in the categories. For example, patient A0DB has 4 samples, 3 which tumor == TRUE and 1 FALSE. The values_fn = mean parameter in pivot wider takes the mean of the 3 TRUE values.

Some of the participants do not have both tumor variables, but these will be removed when creating the scatter plot.

ggplot(data = twoplus_participant_list, mapping = aes(x = tumor, y = no.tumor)) +
  geom_point()

## Warning: Removed 3 rows containing missing values (geom_point).

Question 5 [7 points]

lava <- small_brca %>% 
  select( c(participant, "age_at_diagnosis", tumor, c("ESR1", "BCL2", "ERBB2", "EGFR", "MKI67", "PGR") )) %>%
  reshape2::melt(id.vars = c("participant", "age_at_diagnosis", "tumor"), measure.vars =c("ESR1", "BCL2", "ERBB2", "EGFR", "MKI67", "PGR") ) %>%
  as_tibble

# Change variable data type
lava$age_at_diagnosis <- as.numeric(lava$age_at_diagnosis)

# Plot
ggplot(data = lava, mapping = aes(x = value, y = age_at_diagnosis, color = variable), size=.1) + 
  geom_point() +
  xlab("log expression") +
  scale_x_continuous(trans='log')

## Warning: Removed 6 rows containing missing values (geom_point).

Question 6 [4 points] In the lecture I said that the violin plot has killed the boxplot. Maya, your classmate, contradicted me and suggested boxplots are, actually, cool. Well I thought then that the whole class should get another question to explore Maya’s suggestion. Let’s all thank Maya!!

6a [3 points] In one figure (hint: ggarrange using the log-transformed expression of ESR1, create (i) a boxplot, (ii) a violin plot, (iii) a boxplot with beeswarm.

# Box plot
bxplt <- ggplot(data = small_brca) + 
  geom_boxplot(mapping = aes(y = ESR1)) +
  xlab("Box Plot") +
  scale_x_discrete() +
  scale_y_continuous(trans = "log", labels = round) 

# Violin plot
vlnplt <- ggplot(data = small_brca) + 
  geom_violin(mapping = aes(x = "", y = ESR1)) +
  xlab("Violin Plot") +
  scale_x_discrete() +
  scale_y_continuous(trans = "log", labels = round) + 
  ylab(NULL)

# Beeswarm plot
beeplot <- ggplot(data = small_brca, aes(x = "", y = ESR1)) +
  geom_boxplot() +
  ggbeeswarm::geom_beeswarm(alpha = 0.2) +
  xlab("Beeswarm Plot") +
  scale_x_discrete() + 
  scale_y_continuous(trans = "log", labels = round)  + 
  ylab(NULL)

# ggarrange and produce figure
ggpubr::ggarrange(bxplt, vlnplt, beeplot, ncol = 3, nrow = 1)

6b [1 point] In your opinion - whatever you think now - which is the nicest?
Opinion, but wrong if box plot :)

Question 7 [5 points]
The variable in has several different possible values including “WHITE”, “ASIAN”, “BLACK OR AFRICAN AMERICAN”, “AMERICAN INDIAN OR ALASKA NATIVE”, “[Not Available]”, “[Not Evaluated]”, or NA.

It is better to simply use NA instead of“[Not Available]”, “[Not Evaluated]”, or NA. It will simplify your code in downstream analyses.

Write R code to create a new variable where all samples with “[Not Available]”, “[Not Evaluated]” are changed to NA (Note: you shouldn’t use “NA”, but NA. They are not the same)

( more_better_small_brca <- small_brca %>%
    mutate(race_modified = replace(race, race == "[Not Available]" | race == "[Not Evaluated]", NA)) )

## # A tibble: 1,215 x 80
##    id    tss   participant barcode bcr_patient_uuid form_completion…
##    <chr> <chr> <chr>       <chr>   <chr>            <chr>           
##  1 TCGA… E9    A1NF        TCGA-E… a8b1f6e7-2bcf-4… 2011-6-23       
##  2 TCGA… D8    A27M        TCGA-D… ae65baeb-6b78-4… 2011-7-17       
##  3 TCGA… BH    A0GZ        TCGA-B… 27dfb9d4-3a2c-4… 2010-11-10      
##  4 TCGA… BH    A18V        TCGA-B… 6b960b58-28e1-4… 2011-7-2        
##  5 TCGA… A7    A13G        TCGA-A… 6cd9baf5-bbe0-4… 2011-5-16       
##  6 TCGA… C8    A275        TCGA-C… ce887e90-0660-4… 2011-7-21       
##  7 TCGA… AN    A0XS        TCGA-A… 95e949d3-b1f3-4… 2011-3-28       
##  8 TCGA… OL    A5RW        TCGA-O… A82D0A57-4383-4… 2013-5-16       
##  9 TCGA… OL    A5RX        TCGA-O… 43C0BEC6-6FAC-4… 2013-5-16       
## 10 TCGA… E9    A1RD        TCGA-E… 5fe3c51f-65e2-4… 2011-7-28       
## # … with 1,205 more rows, and 74 more variables: birth_days_to <chr>,
## #   gender <chr>, menopause_status <chr>, race <chr>, ethnicity <chr>,
## #   tumor_status <chr>, vital_status <chr>, death_days_to <chr>,
## #   histologic_diagnosis_other <chr>, initial_pathologic_dx_year <chr>,
## #   age_at_diagnosis <chr>, micromet_detection_by_ihc <chr>,
## #   lymph_nodes_examined_count <chr>, ajcc_pathologic_tumor_stage <chr>,
## #   er_status_by_ihc <chr>, er_status_ihc_Percent_Positive <chr>,
## #   pr_status_by_ihc <chr>, pr_status_ihc_percent_positive <chr>,
## #   her2_fish_status <chr>, her2_copy_number <chr>, histological_type <chr>,
## #   metastatic_tumor_indicator <chr>, tumor <lgl>, ANLN <dbl>, FOXC1 <dbl>,
## #   CDH3 <dbl>, FGFR4 <dbl>, UBE2T <dbl>, NDC80 <dbl>, PGR <dbl>, BIRC5 <dbl>,
## #   ORC6 <dbl>, ESR1 <dbl>, PHGDH <dbl>, PTTG1 <dbl>, MELK <dbl>, NAT1 <dbl>,
## #   CXXC5 <dbl>, BCL2 <dbl>, RRM2 <dbl>, GPR160 <dbl>, EXO1 <dbl>, UBE2C <dbl>,
## #   TYMS <dbl>, KRT5 <dbl>, KRT14 <dbl>, MAPT <dbl>, CDC6 <dbl>, MMP11 <dbl>,
## #   MYBL2 <dbl>, SFRP1 <dbl>, CCNE1 <dbl>, BLVRA <dbl>, BAG1 <dbl>, MLPH <dbl>,
## #   CDC20 <dbl>, CENPF <dbl>, MIA <dbl>, KRT17 <dbl>, FOXA1 <dbl>,
## #   ACTR3B <dbl>, CCNB1 <dbl>, MDM2 <dbl>, MYC <dbl>, CEP55 <dbl>,
## #   SLC39A6 <dbl>, ERBB2 <dbl>, GRB7 <dbl>, KIF2C <dbl>, NUF2 <dbl>,
## #   EGFR <dbl>, MKI67 <dbl>, TMEM45B <dbl>, race_modified <chr>

Question 8 [5 points]

Samples may have differences in the distributions of expression across the genes. This might be caused by technical noise because often not all samples are processed at the same time or the samples might be prepared slightly differently. This difference in distributions can sometimes get in the way of our analysis, especially when we are looking at many genes at the same time. A common approach that sometimes helps is to median center the expression of each gene. For each observation (row), the median centering transforms transcript counts to where . Here is the number of genes (in our case ).

Show code to create a new tibble where all expression measurements are median centered.

( median_centered <- small_brca %>%
    rowwise() %>%
    mutate(median = median( c_across(ANLN:TMEM45B) )) %>%
    mutate_at(vars(ANLN:TMEM45B), ~ . - median) %>% 
    select(-median) )

## # A tibble: 1,215 x 79
## # Rowwise: 
##    id    tss   participant barcode bcr_patient_uuid form_completion…
##    <chr> <chr> <chr>       <chr>   <chr>            <chr>           
##  1 TCGA… E9    A1NF        TCGA-E… a8b1f6e7-2bcf-4… 2011-6-23       
##  2 TCGA… D8    A27M        TCGA-D… ae65baeb-6b78-4… 2011-7-17       
##  3 TCGA… BH    A0GZ        TCGA-B… 27dfb9d4-3a2c-4… 2010-11-10      
##  4 TCGA… BH    A18V        TCGA-B… 6b960b58-28e1-4… 2011-7-2        
##  5 TCGA… A7    A13G        TCGA-A… 6cd9baf5-bbe0-4… 2011-5-16       
##  6 TCGA… C8    A275        TCGA-C… ce887e90-0660-4… 2011-7-21       
##  7 TCGA… AN    A0XS        TCGA-A… 95e949d3-b1f3-4… 2011-3-28       
##  8 TCGA… OL    A5RW        TCGA-O… A82D0A57-4383-4… 2013-5-16       
##  9 TCGA… OL    A5RX        TCGA-O… 43C0BEC6-6FAC-4… 2013-5-16       
## 10 TCGA… E9    A1RD        TCGA-E… 5fe3c51f-65e2-4… 2011-7-28       
## # … with 1,205 more rows, and 73 more variables: birth_days_to <chr>,
## #   gender <chr>, menopause_status <chr>, race <chr>, ethnicity <chr>,
## #   tumor_status <chr>, vital_status <chr>, death_days_to <chr>,
## #   histologic_diagnosis_other <chr>, initial_pathologic_dx_year <chr>,
## #   age_at_diagnosis <chr>, micromet_detection_by_ihc <chr>,
## #   lymph_nodes_examined_count <chr>, ajcc_pathologic_tumor_stage <chr>,
## #   er_status_by_ihc <chr>, er_status_ihc_Percent_Positive <chr>,
## #   pr_status_by_ihc <chr>, pr_status_ihc_percent_positive <chr>,
## #   her2_fish_status <chr>, her2_copy_number <chr>, histological_type <chr>,
## #   metastatic_tumor_indicator <chr>, tumor <lgl>, ANLN <dbl>, FOXC1 <dbl>,
## #   CDH3 <dbl>, FGFR4 <dbl>, UBE2T <dbl>, NDC80 <dbl>, PGR <dbl>, BIRC5 <dbl>,
## #   ORC6 <dbl>, ESR1 <dbl>, PHGDH <dbl>, PTTG1 <dbl>, MELK <dbl>, NAT1 <dbl>,
## #   CXXC5 <dbl>, BCL2 <dbl>, RRM2 <dbl>, GPR160 <dbl>, EXO1 <dbl>, UBE2C <dbl>,
## #   TYMS <dbl>, KRT5 <dbl>, KRT14 <dbl>, MAPT <dbl>, CDC6 <dbl>, MMP11 <dbl>,
## #   MYBL2 <dbl>, SFRP1 <dbl>, CCNE1 <dbl>, BLVRA <dbl>, BAG1 <dbl>, MLPH <dbl>,
## #   CDC20 <dbl>, CENPF <dbl>, MIA <dbl>, KRT17 <dbl>, FOXA1 <dbl>,
## #   ACTR3B <dbl>, CCNB1 <dbl>, MDM2 <dbl>, MYC <dbl>, CEP55 <dbl>,
## #   SLC39A6 <dbl>, ERBB2 <dbl>, GRB7 <dbl>, KIF2C <dbl>, NUF2 <dbl>,
## #   EGFR <dbl>, MKI67 <dbl>, TMEM45B <dbl>

Question 9 [7 points]
9a [1 point]: Using online resources, define what the variance of a distribution is. Find a function in R to compute the variance.
Variance is a measurement of spread between numbers in a data set. More specifically, it is the average of the squared differences from the mean of the points in your distribution. You can compute the variance using var().

9b [4 points]: Compute the variance for each of the genes in our dataset, create a tibble with just the and variables along with the genes. Order the columns (variables) of the tibble from most to least variable.

( ordered_gene_var <- small_brca %>%
  select(ANLN:TMEM45B) %>%
  pivot_longer(cols = everything(), names_to = 'genes') %>%
  pivot_wider(names_from = genes, values_from = value, values_fn = var) %>%
  pivot_longer(cols = everything()) %>%
  arrange(desc(value)) )

## # A tibble: 50 x 2
##    name           value
##    <chr>          <dbl>
##  1 SLC39A6 17751328814.
##  2 ERBB2   12270026602.
##  3 KRT14    2341482535.
##  4 KRT5     2224823511.
##  5 SFRP1    1155552308.
##  6 ESR1     1053044772.
##  7 KRT17     682646996.
##  8 MMP11     674490386.
##  9 EGFR      592271929.
## 10 MLPH      394356506.
## # … with 40 more rows

( ordered_tbl <- small_brca %>%
  select(participant, race, ordered_gene_var$name) )

## # A tibble: 1,215 x 52
##    participant race  SLC39A6  ERBB2  KRT14   KRT5 SFRP1   ESR1 KRT17  MMP11
##    <chr>       <chr>   <dbl>  <dbl>  <dbl>  <dbl> <dbl>  <dbl> <dbl>  <dbl>
##  1 A1NF        WHITE   13526  12797     39    111   124  39073   131 108677
##  2 A27M        WHITE    6580   7923  43714   7532 31149    512 17415  89135
##  3 A0GZ        WHITE  138393  18371   2194  21347   711  66040  7448  31347
##  4 A18V        WHITE    5126  15060  93442 102256  8177    472 26567   8397
##  5 A13G        WHITE   53107  27780    721   1168   839 133812   669   6721
##  6 A275        ASIAN    7044 811281    362    717   321    359   971  33272
##  7 A0XS        WHITE   70814  20913   1496   1868  1744  40853  1511  57101
##  8 A5RW        BLAC…    6038   8814 198790  59412 10705     83 28556  12724
##  9 A5RX        BLAC…   53480  34954  20535  26435 24217  17928 38416   2260
## 10 A1RD        WHITE  305439  24657    710    720   622  91953   522  24545
## # … with 1,205 more rows, and 42 more variables: EGFR <dbl>, MLPH <dbl>,
## #   PGR <dbl>, FOXA1 <dbl>, NAT1 <dbl>, PHGDH <dbl>, GRB7 <dbl>, MAPT <dbl>,
## #   MDM2 <dbl>, CDH3 <dbl>, MYC <dbl>, CENPF <dbl>, BAG1 <dbl>, MKI67 <dbl>,
## #   MYBL2 <dbl>, BCL2 <dbl>, CXXC5 <dbl>, FOXC1 <dbl>, FGFR4 <dbl>, RRM2 <dbl>,
## #   BIRC5 <dbl>, CDC6 <dbl>, BLVRA <dbl>, UBE2C <dbl>, CDC20 <dbl>, ANLN <dbl>,
## #   GPR160 <dbl>, CCNB1 <dbl>, KIF2C <dbl>, TYMS <dbl>, NUF2 <dbl>,
## #   TMEM45B <dbl>, PTTG1 <dbl>, CCNE1 <dbl>, CEP55 <dbl>, UBE2T <dbl>,
## #   MELK <dbl>, NDC80 <dbl>, MIA <dbl>, ACTR3B <dbl>, EXO1 <dbl>, ORC6 <dbl>

9c [2 points]: Create a violin plot of the genes ordered from most to least variable.

ordered_tbl %>%
  reshape2::melt(id.vars = c("participant", "race"), measure.vars =  c(3:52)) %>% # All genes
  ggplot(aes(x = variable, y = value)) + 
  geom_violin(draw_quantiles = 0.5) + 
  scale_y_continuous(trans = "log", labels = round) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.9)) +
  xlab("Genes")

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Removed 90 rows containing non-finite values (stat_ydensity).

# You can also use log1p to keep the 0 values

Question 10 [5 points]
Pick one of the genes from uniformly randomly (see the function). Using the NCBI, provide the following information or note that it is not availalbe:

set.seed(42)
colnames(small_brca)[c(30:79)] %>%
  .[runif(1, 1, 50)]

## [1] "GRB7"

10a Full name of the gene and the official name according to the HGNC.
growth factor receptor bound protein 7

10b First time it was reported in the literature.
First time it was reported was October 1st, 1992 (doi: 10.1073/pnas.89.19.8894)

10c Where it is located in the genome.
Chromosome 17 from 39737938 to 39747285

10d Its accession code or codes (if it has been modified).
Current accession code: NC_000017.11
Previous accession code: NC_000017.10

10e The number of exons.
18 exons

10f The number of alternative transcripts that have been record.
4 splice variants

10g Its full amino acid sequence. Provide its protein ID.
Protein ID: Q14451.2-1 (isoform 1)

        10         20         30         40         50
MELDLSPPHL SSSPEDLCPA PGTPPGTPRP PDTPLPEEVK RSQPLLIPTT 
        60         70         80         90        100
GRKLREEERR ATSLPSIPNP FPELCSPPSQ SPILGGPSSA RGLLPRDASR 
       110        120        130        140        150
PHVVKVYSED GACRSVEVAA GATARHVCEM LVQRAHALSD ETWGLVECHP 
       160        170        180        190        200
HLALERGLED HESVVEVQAA WPVGGDSRFV FRKNFAKYEL FKSSPHSLFP 
       210        220        230        240        250
EKMVSSCLDA HTGISHEDLI QNFLNAGSFP EIQGFLQLRG SGRKLWKRFF 
       260        270        280        290        300
CFLRRSGLYY STKGTSKDPR HLQYVADVNE SNVYVVTQGR KLYGMPTDFG 
       310        320        330        340        350
FCVKPNKLRN GHKGLRIFCS EDEQSRTCWL AAFRLFKYGV QLYKNYQQAQ 
       360        370        380        390        400
SRHLHPSCLG SPPLRSASDN TLVAMDFSGH AGRVIENPRE ALSVALEEAQ 
       410        420        430        440        450
AWRKKTNHRL SLPMPASGTS LSAAIHRTQL WFHGRISREE SQRLIGQQGL 
       460        470        480        490        500
VDGLFLVRES QRNPQGFVLS LCHLQKVKHY LILPSEEEGR LYFSMDDGQT 
       510        520        530 
RFTDLLQLVE FHQLNRGILP CLLRHCCTRV AL

10h Its protein structure, if known.