Question 1 [1 points]
Re-express the following small program using pipes.
x<-3; mu<-5; sigma <-7
(nrm <- ( 1 / (sigma * sqrt( 2 * pi ))) * exp( -(1/2) * ( (x - mu)/ sigma )^2))
## [1] 0.05471239
Answer: (Input example)
p <- c(x = 3, mu = 5, sigma = 7)
p %>%
{ ( 1 / (.[["sigma"]] * sqrt( 2 * pi ))) * exp( -(1/2) * ( (.[["x"]] - .[["mu"]])/ .[["sigma"]] )^2 ) }
## [1] 0.05471239
Alt answer: (Memory example)
exp( -(1/2) * ( (p[["x"]] - p[["mu"]])/ p[["sigma"]] )^2) %>%
"*"( 1 / (p[["sigma"]] * sqrt( 2 * pi )))
## [1] 0.05471239
Bonus [1/2 point] What is this equation?
The equation is the probability density function of a normal distribution
Question 2 [3 points]
Re-express the following small program using pipes.
site <- 15
gsub(";", "/", paste(paste( paste( "~hallett", paste("raw", "tara", sep = ";"), sep = ";"),
paste("site", site, sep="_"), sep=";"),
"dna_seq.fa", sep=";"))
## [1] "~hallett/raw/tara/site_15/dna_seq.fa"
Answer: (Input example)
site %>%
paste("site", ., sep="_") %>%
paste("dna_seq.fa", sep=";") %>%
paste("raw", "tara", ., sep = ";") %>%
paste("~hallett", ., sep = ";") %>%
gsub(";", "/", .)
## [1] "~hallett/raw/tara/site_15/dna_seq.fa"
Alt answer: (Memory example)
paste("~hallett") %>%
paste("raw", "tara", sep = ";") %>%
paste("site", sep = ";") %>%
paste(site, sep="_") %>%
paste("dna_seq.fa", sep=";") %>%
gsub(";", "/", .)
## [1] "~hallett/raw/tara/site_15/dna_seq.fa"
Question 3 [6 points]
3a [1 point] How many black or african women are included in the \({\tt small\_brca}\) dataset?
small_brca %>%
filter(gender == "FEMALE" & race == "BLACK OR AFRICAN AMERICAN") %>%
distinct(participant) %>%
summarise(count = n())
## # A tibble: 1 x 1
## count
## <int>
## 1 179
3b [1 point] How many black or african males?
small_brca %>%
filter(gender == "MALE" & race == "BLACK OR AFRICAN AMERICAN") %>%
summarise(count = n())
## # A tibble: 1 x 1
## count
## <int>
## 1 3
3c [3 points] Create a histogram of all black or african individuals grouped by age. Each bar in the histogram should have two colours corresponding to male and females. (For example, red might be used for the fraction of women in some age group while blue is used for men in that same age group). Choose the number of bins (or binsize) in a way that makes your plot informative.
small_brca %>%
filter(race == "BLACK OR AFRICAN AMERICAN") %>%
distinct(participant, .keep_all = TRUE) %>%
mutate(age_at_diagnosis = as.numeric(age_at_diagnosis)) %>%
ggplot(aes(x = age_at_diagnosis, fill = gender)) +
geom_histogram(binwidth = 5)
3d [1 point] What was the rationale for the number of bins you used?
I chose bin = 5 because it made the distribution look the smoothest while keeping the most information.
Question 4 [7 points]
4a [2 point] How many participants are there? (Careful: not samples but participants.) Show code.
small_brca %>%
select(participant) %>%
n_distinct()
## [1] 1092
4b [3 points] List all of the \({\tt participant}\) that have more than one sample. Show code.
small_brca %>%
group_by(participant) %>%
summarise(count = n(), .groups = 'drop') %>%
filter(count > 1)
## # A tibble: 115 x 2
## participant count
## <chr> <int>
## 1 A0AU 2
## 2 A0AY 2
## 3 A0AZ 2
## 4 A0B3 2
## 5 A0B5 2
## 6 A0B7 2
## 7 A0B8 2
## 8 A0BA 2
## 9 A0BC 2
## 10 A0BJ 2
## # … with 105 more rows
4c [2 points] For the participants identfied in 4b, create a scatter plot with the tumor on the \(x\) axis and the normal on the \(y\) axis (as specified by the \({\tt tumor}\) variable in the tibble) with points representing the expression of gene \({\tt ESR1}\).
# Get participant list
( twoplus_participant_list <- small_brca %>%
group_by(participant) %>%
summarise(count = n(), .groups = 'drop',
tumor,
ESR1)%>%
filter(count > 1) %>%
select(-count) %>%
pivot_wider(names_from = tumor, values_from = ESR1, values_fn = mean) %>%
rename(tumor = "TRUE", no.tumor = "FALSE"))
## # A tibble: 115 x 3
## participant tumor no.tumor
## <chr> <dbl> <dbl>
## 1 A0AU 28081 14812
## 2 A0AY 14412 19388
## 3 A0AZ 9980 8657
## 4 A0B3 401 24325
## 5 A0B5 10286 1773
## 6 A0B7 215 9369
## 7 A0B8 70814 2166
## 8 A0BA 33383 14830
## 9 A0BC 53861 9048
## 10 A0BJ 15806 4406
## # … with 105 more rows
When I create the tibble twoplus_paricipant_list, I take the mean of the values in the categories. For example, patient A0DB has 4 samples, 3 which tumor == TRUE and 1 FALSE. The values_fn = mean parameter in pivot wider takes the mean of the 3 TRUE values.
Some of the participants do not have both tumor variables, but these will be removed when creating the scatter plot.
ggplot(data = twoplus_participant_list, mapping = aes(x = tumor, y = no.tumor)) +
geom_point()
## Warning: Removed 3 rows containing missing values (geom_point).
Question 5 [7 points]
lava <- small_brca %>%
select( c(participant, "age_at_diagnosis", tumor, c("ESR1", "BCL2", "ERBB2", "EGFR", "MKI67", "PGR") )) %>%
reshape2::melt(id.vars = c("participant", "age_at_diagnosis", "tumor"), measure.vars =c("ESR1", "BCL2", "ERBB2", "EGFR", "MKI67", "PGR") ) %>%
as_tibble
# Change variable data type
lava$age_at_diagnosis <- as.numeric(lava$age_at_diagnosis)
# Plot
ggplot(data = lava, mapping = aes(x = value, y = age_at_diagnosis, color = variable), size=.1) +
geom_point() +
xlab("log expression") +
scale_x_continuous(trans='log')
## Warning: Removed 6 rows containing missing values (geom_point).
Question 6 [4 points] In the lecture I said that the violin plot has killed the boxplot. Maya, your classmate, contradicted me and suggested boxplots are, actually, cool. Well I thought then that the whole class should get another question to explore Maya’s suggestion. Let’s all thank Maya!!
6a [3 points] In one figure (hint: ggarrange using the log-transformed expression of ESR1, create (i) a boxplot, (ii) a violin plot, (iii) a boxplot with beeswarm.
# Box plot
bxplt <- ggplot(data = small_brca) +
geom_boxplot(mapping = aes(y = ESR1)) +
xlab("Box Plot") +
scale_x_discrete() +
scale_y_continuous(trans = "log", labels = round)
# Violin plot
vlnplt <- ggplot(data = small_brca) +
geom_violin(mapping = aes(x = "", y = ESR1)) +
xlab("Violin Plot") +
scale_x_discrete() +
scale_y_continuous(trans = "log", labels = round) +
ylab(NULL)
# Beeswarm plot
beeplot <- ggplot(data = small_brca, aes(x = "", y = ESR1)) +
geom_boxplot() +
ggbeeswarm::geom_beeswarm(alpha = 0.2) +
xlab("Beeswarm Plot") +
scale_x_discrete() +
scale_y_continuous(trans = "log", labels = round) +
ylab(NULL)
# ggarrange and produce figure
ggpubr::ggarrange(bxplt, vlnplt, beeplot, ncol = 3, nrow = 1)
6b [1 point] In your opinion - whatever you think now - which is the nicest?
Opinion, but wrong if box plot :)
Question 7 [5 points]
The variable \({\tt race}\) in \({\tt small\_brca}\) has several different possible values including “WHITE”, “ASIAN”, “BLACK OR AFRICAN AMERICAN”, “AMERICAN INDIAN OR ALASKA NATIVE”, “[Not Available]”, “[Not Evaluated]”, or NA.
It is better to simply use NA instead of“[Not Available]”, “[Not Evaluated]”, or NA. It will simplify your code in downstream analyses.
Write R code to create a new \({\tt race\_modified}\) variable where all samples with “[Not Available]”, “[Not Evaluated]” are changed to NA (Note: you shouldn’t use “NA”, but NA. They are not the same)
( more_better_small_brca <- small_brca %>%
mutate(race_modified = replace(race, race == "[Not Available]" | race == "[Not Evaluated]", NA)) )
## # A tibble: 1,215 x 80
## id tss participant barcode bcr_patient_uuid form_completion…
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 TCGA… E9 A1NF TCGA-E… a8b1f6e7-2bcf-4… 2011-6-23
## 2 TCGA… D8 A27M TCGA-D… ae65baeb-6b78-4… 2011-7-17
## 3 TCGA… BH A0GZ TCGA-B… 27dfb9d4-3a2c-4… 2010-11-10
## 4 TCGA… BH A18V TCGA-B… 6b960b58-28e1-4… 2011-7-2
## 5 TCGA… A7 A13G TCGA-A… 6cd9baf5-bbe0-4… 2011-5-16
## 6 TCGA… C8 A275 TCGA-C… ce887e90-0660-4… 2011-7-21
## 7 TCGA… AN A0XS TCGA-A… 95e949d3-b1f3-4… 2011-3-28
## 8 TCGA… OL A5RW TCGA-O… A82D0A57-4383-4… 2013-5-16
## 9 TCGA… OL A5RX TCGA-O… 43C0BEC6-6FAC-4… 2013-5-16
## 10 TCGA… E9 A1RD TCGA-E… 5fe3c51f-65e2-4… 2011-7-28
## # … with 1,205 more rows, and 74 more variables: birth_days_to <chr>,
## # gender <chr>, menopause_status <chr>, race <chr>, ethnicity <chr>,
## # tumor_status <chr>, vital_status <chr>, death_days_to <chr>,
## # histologic_diagnosis_other <chr>, initial_pathologic_dx_year <chr>,
## # age_at_diagnosis <chr>, micromet_detection_by_ihc <chr>,
## # lymph_nodes_examined_count <chr>, ajcc_pathologic_tumor_stage <chr>,
## # er_status_by_ihc <chr>, er_status_ihc_Percent_Positive <chr>,
## # pr_status_by_ihc <chr>, pr_status_ihc_percent_positive <chr>,
## # her2_fish_status <chr>, her2_copy_number <chr>, histological_type <chr>,
## # metastatic_tumor_indicator <chr>, tumor <lgl>, ANLN <dbl>, FOXC1 <dbl>,
## # CDH3 <dbl>, FGFR4 <dbl>, UBE2T <dbl>, NDC80 <dbl>, PGR <dbl>, BIRC5 <dbl>,
## # ORC6 <dbl>, ESR1 <dbl>, PHGDH <dbl>, PTTG1 <dbl>, MELK <dbl>, NAT1 <dbl>,
## # CXXC5 <dbl>, BCL2 <dbl>, RRM2 <dbl>, GPR160 <dbl>, EXO1 <dbl>, UBE2C <dbl>,
## # TYMS <dbl>, KRT5 <dbl>, KRT14 <dbl>, MAPT <dbl>, CDC6 <dbl>, MMP11 <dbl>,
## # MYBL2 <dbl>, SFRP1 <dbl>, CCNE1 <dbl>, BLVRA <dbl>, BAG1 <dbl>, MLPH <dbl>,
## # CDC20 <dbl>, CENPF <dbl>, MIA <dbl>, KRT17 <dbl>, FOXA1 <dbl>,
## # ACTR3B <dbl>, CCNB1 <dbl>, MDM2 <dbl>, MYC <dbl>, CEP55 <dbl>,
## # SLC39A6 <dbl>, ERBB2 <dbl>, GRB7 <dbl>, KIF2C <dbl>, NUF2 <dbl>,
## # EGFR <dbl>, MKI67 <dbl>, TMEM45B <dbl>, race_modified <chr>
Question 8 [5 points]
Samples may have differences in the distributions of expression across the genes. This might be caused by technical noise because often not all samples are processed at the same time or the samples might be prepared slightly differently. This difference in distributions can sometimes get in the way of our analysis, especially when we are looking at many genes at the same time. A common approach that sometimes helps is to median center the expression of each gene. For each observation (row), the median centering transforms transcript counts \(G = (g_1, g_2, \ldots, g_n)\) to \(G'= (g'_1, g'_2, \ldots, g'_n)\) where \(g'_i = g_i - median(G)\). Here \(n\) is the number of genes (in our case \(n=50\)).
Show code to create a new tibble where all expression measurements are median centered.
( median_centered <- small_brca %>%
rowwise() %>%
mutate(median = median( c_across(ANLN:TMEM45B) )) %>%
mutate_at(vars(ANLN:TMEM45B), ~ . - median) %>%
select(-median) )
## # A tibble: 1,215 x 79
## # Rowwise:
## id tss participant barcode bcr_patient_uuid form_completion…
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 TCGA… E9 A1NF TCGA-E… a8b1f6e7-2bcf-4… 2011-6-23
## 2 TCGA… D8 A27M TCGA-D… ae65baeb-6b78-4… 2011-7-17
## 3 TCGA… BH A0GZ TCGA-B… 27dfb9d4-3a2c-4… 2010-11-10
## 4 TCGA… BH A18V TCGA-B… 6b960b58-28e1-4… 2011-7-2
## 5 TCGA… A7 A13G TCGA-A… 6cd9baf5-bbe0-4… 2011-5-16
## 6 TCGA… C8 A275 TCGA-C… ce887e90-0660-4… 2011-7-21
## 7 TCGA… AN A0XS TCGA-A… 95e949d3-b1f3-4… 2011-3-28
## 8 TCGA… OL A5RW TCGA-O… A82D0A57-4383-4… 2013-5-16
## 9 TCGA… OL A5RX TCGA-O… 43C0BEC6-6FAC-4… 2013-5-16
## 10 TCGA… E9 A1RD TCGA-E… 5fe3c51f-65e2-4… 2011-7-28
## # … with 1,205 more rows, and 73 more variables: birth_days_to <chr>,
## # gender <chr>, menopause_status <chr>, race <chr>, ethnicity <chr>,
## # tumor_status <chr>, vital_status <chr>, death_days_to <chr>,
## # histologic_diagnosis_other <chr>, initial_pathologic_dx_year <chr>,
## # age_at_diagnosis <chr>, micromet_detection_by_ihc <chr>,
## # lymph_nodes_examined_count <chr>, ajcc_pathologic_tumor_stage <chr>,
## # er_status_by_ihc <chr>, er_status_ihc_Percent_Positive <chr>,
## # pr_status_by_ihc <chr>, pr_status_ihc_percent_positive <chr>,
## # her2_fish_status <chr>, her2_copy_number <chr>, histological_type <chr>,
## # metastatic_tumor_indicator <chr>, tumor <lgl>, ANLN <dbl>, FOXC1 <dbl>,
## # CDH3 <dbl>, FGFR4 <dbl>, UBE2T <dbl>, NDC80 <dbl>, PGR <dbl>, BIRC5 <dbl>,
## # ORC6 <dbl>, ESR1 <dbl>, PHGDH <dbl>, PTTG1 <dbl>, MELK <dbl>, NAT1 <dbl>,
## # CXXC5 <dbl>, BCL2 <dbl>, RRM2 <dbl>, GPR160 <dbl>, EXO1 <dbl>, UBE2C <dbl>,
## # TYMS <dbl>, KRT5 <dbl>, KRT14 <dbl>, MAPT <dbl>, CDC6 <dbl>, MMP11 <dbl>,
## # MYBL2 <dbl>, SFRP1 <dbl>, CCNE1 <dbl>, BLVRA <dbl>, BAG1 <dbl>, MLPH <dbl>,
## # CDC20 <dbl>, CENPF <dbl>, MIA <dbl>, KRT17 <dbl>, FOXA1 <dbl>,
## # ACTR3B <dbl>, CCNB1 <dbl>, MDM2 <dbl>, MYC <dbl>, CEP55 <dbl>,
## # SLC39A6 <dbl>, ERBB2 <dbl>, GRB7 <dbl>, KIF2C <dbl>, NUF2 <dbl>,
## # EGFR <dbl>, MKI67 <dbl>, TMEM45B <dbl>
Question 9 [7 points]
9a [1 point]: Using online resources, define what the variance of a distribution is. Find a function in R to compute the variance.
Variance is a measurement of spread between numbers in a data set. More specifically, it is the average of the squared differences from the mean of the points in your distribution. You can compute the variance using var().
9b [4 points]: Compute the variance for each of the \(50\) genes in our dataset, create a tibble with just the \({\tt participant}\) and \({\tt race}\) variables along with the \(50\) genes. Order the columns (variables) of the tibble from most to least variable.
( ordered_gene_var <- small_brca %>%
select(ANLN:TMEM45B) %>%
pivot_longer(cols = everything(), names_to = 'genes') %>%
pivot_wider(names_from = genes, values_from = value, values_fn = var) %>%
pivot_longer(cols = everything()) %>%
arrange(desc(value)) )
## # A tibble: 50 x 2
## name value
## <chr> <dbl>
## 1 SLC39A6 17751328814.
## 2 ERBB2 12270026602.
## 3 KRT14 2341482535.
## 4 KRT5 2224823511.
## 5 SFRP1 1155552308.
## 6 ESR1 1053044772.
## 7 KRT17 682646996.
## 8 MMP11 674490386.
## 9 EGFR 592271929.
## 10 MLPH 394356506.
## # … with 40 more rows
( ordered_tbl <- small_brca %>%
select(participant, race, ordered_gene_var$name) )
## # A tibble: 1,215 x 52
## participant race SLC39A6 ERBB2 KRT14 KRT5 SFRP1 ESR1 KRT17 MMP11
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A1NF WHITE 13526 12797 39 111 124 39073 131 108677
## 2 A27M WHITE 6580 7923 43714 7532 31149 512 17415 89135
## 3 A0GZ WHITE 138393 18371 2194 21347 711 66040 7448 31347
## 4 A18V WHITE 5126 15060 93442 102256 8177 472 26567 8397
## 5 A13G WHITE 53107 27780 721 1168 839 133812 669 6721
## 6 A275 ASIAN 7044 811281 362 717 321 359 971 33272
## 7 A0XS WHITE 70814 20913 1496 1868 1744 40853 1511 57101
## 8 A5RW BLAC… 6038 8814 198790 59412 10705 83 28556 12724
## 9 A5RX BLAC… 53480 34954 20535 26435 24217 17928 38416 2260
## 10 A1RD WHITE 305439 24657 710 720 622 91953 522 24545
## # … with 1,205 more rows, and 42 more variables: EGFR <dbl>, MLPH <dbl>,
## # PGR <dbl>, FOXA1 <dbl>, NAT1 <dbl>, PHGDH <dbl>, GRB7 <dbl>, MAPT <dbl>,
## # MDM2 <dbl>, CDH3 <dbl>, MYC <dbl>, CENPF <dbl>, BAG1 <dbl>, MKI67 <dbl>,
## # MYBL2 <dbl>, BCL2 <dbl>, CXXC5 <dbl>, FOXC1 <dbl>, FGFR4 <dbl>, RRM2 <dbl>,
## # BIRC5 <dbl>, CDC6 <dbl>, BLVRA <dbl>, UBE2C <dbl>, CDC20 <dbl>, ANLN <dbl>,
## # GPR160 <dbl>, CCNB1 <dbl>, KIF2C <dbl>, TYMS <dbl>, NUF2 <dbl>,
## # TMEM45B <dbl>, PTTG1 <dbl>, CCNE1 <dbl>, CEP55 <dbl>, UBE2T <dbl>,
## # MELK <dbl>, NDC80 <dbl>, MIA <dbl>, ACTR3B <dbl>, EXO1 <dbl>, ORC6 <dbl>
9c [2 points]: Create a violin plot of the \(50\) genes ordered from most to least variable.
ordered_tbl %>%
reshape2::melt(id.vars = c("participant", "race"), measure.vars = c(3:52)) %>% # All genes
ggplot(aes(x = variable, y = value)) +
geom_violin(draw_quantiles = 0.5) +
scale_y_continuous(trans = "log", labels = round) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.9)) +
xlab("Genes")
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 90 rows containing non-finite values (stat_ydensity).
# You can also use log1p to keep the 0 values
Question 10 [5 points]
Pick one of the \(50\) genes from \({\tt small\_brca}\) uniformly randomly (see the \({\tt runif}\) function). Using the NCBI, provide the following information or note that it is not availalbe:
set.seed(42)
colnames(small_brca)[c(30:79)] %>%
.[runif(1, 1, 50)]
## [1] "GRB7"
10a Full name of the gene and the official name according to the HGNC.
growth factor receptor bound protein 7
10b First time it was reported in the literature.
First time it was reported was October 1st, 1992 (doi: 10.1073/pnas.89.19.8894)
10c Where it is located in the genome.
Chromosome 17 from 39737938 to 39747285
10d Its \({\tt gi}\) accession code or codes (if it has been modified).
Current accession code: NC_000017.11
Previous accession code: NC_000017.10
10e The number of exons.
18 exons
10f The number of alternative transcripts that have been record.
4 splice variants
10g Its full amino acid sequence. Provide its protein ID.
Protein ID: Q14451.2-1 (isoform 1)
10 20 30 40 50
MELDLSPPHL SSSPEDLCPA PGTPPGTPRP PDTPLPEEVK RSQPLLIPTT
60 70 80 90 100
GRKLREEERR ATSLPSIPNP FPELCSPPSQ SPILGGPSSA RGLLPRDASR
110 120 130 140 150
PHVVKVYSED GACRSVEVAA GATARHVCEM LVQRAHALSD ETWGLVECHP
160 170 180 190 200
HLALERGLED HESVVEVQAA WPVGGDSRFV FRKNFAKYEL FKSSPHSLFP
210 220 230 240 250
EKMVSSCLDA HTGISHEDLI QNFLNAGSFP EIQGFLQLRG SGRKLWKRFF
260 270 280 290 300
CFLRRSGLYY STKGTSKDPR HLQYVADVNE SNVYVVTQGR KLYGMPTDFG
310 320 330 340 350
FCVKPNKLRN GHKGLRIFCS EDEQSRTCWL AAFRLFKYGV QLYKNYQQAQ
360 370 380 390 400
SRHLHPSCLG SPPLRSASDN TLVAMDFSGH AGRVIENPRE ALSVALEEAQ
410 420 430 440 450
AWRKKTNHRL SLPMPASGTS LSAAIHRTQL WFHGRISREE SQRLIGQQGL
460 470 480 490 500
VDGLFLVRES QRNPQGFVLS LCHLQKVKHY LILPSEEEGR LYFSMDDGQT
510 520 530
RFTDLLQLVE FHQLNRGILP CLLRHCCTRV AL
10h Its protein structure, if known.