Using the S. cerevisiae (Baker’s yeast) data that we imported into R in Lectures 11 and 12, show R code of how you would estimate the frequency of A, C, G, T nucleotides in coding regions only.
Should the mitochondrial chromosome be included or not. Yes/no with 1 sentence max.
library(tidyverse)
library(Biostrings)
scerevisiae_genome <- readRDS("/cloud/project/data/Scerevisiae_vers_1.0.rds")
( cds <- scerevisiae_genome@annotations %>%
filter(type == "CDS") %>%
mutate(chromosome_num = str_remove(.$seqid, "chr")) %>% # Extract roman numerals
filter(chromosome_num != "mt") # Remove mt
)
## # A tibble: 6,999 x 10
## seqid source type start end score strand phase attributes chromosome_num
## <fct> <fct> <fct> <int> <int> <dbl> <fct> <fct> <chr> <chr>
## 1 chrI SGD CDS 335 649 NA + 0 Parent=YAL0… I
## 2 chrI SGD CDS 538 792 NA + 0 Parent=YAL0… I
## 3 chrI SGD CDS 1807 2169 NA - 0 Parent=YAL0… I
## 4 chrI SGD CDS 2480 2707 NA + 0 Parent=YAL0… I
## 5 chrI SGD CDS 7235 9016 NA - 0 Parent=YAL0… I
## 6 chrI SGD CDS 10091 10399 NA + 0 Parent=YAL0… I
## 7 chrI SGD CDS 11565 11951 NA - 0 Parent=YAL0… I
## 8 chrI SGD CDS 12046 12426 NA + 0 Parent=YAL0… I
## 9 chrI SGD CDS 13363 13743 NA - 0 Parent=YAL0… I
## 10 chrI SGD CDS 21566 21850 NA + 0 Parent=YAL0… I
## # … with 6,989 more rows
# Change roman numerals to numeric
cds$chromosome_num <- as.numeric(as.roman(cds$chromosome_num))
subset_chrom <- function(genome, chrom, start, end) {
# Input: Genome object, chromosome number and positions of desired seq
# Output is a DNAStringSet object of the desired sequence
return(DNAStringSet(genome@chromosomes[[chrom]][start:end]))
}
Testing function
subset_chrom(genome = scerevisiae_genome, chrom = 1, 1, 10)
## DNAStringSet object of length 1:
## width seq
## [1] 10 CCACACCACA
# Return list of DNAStringSet objects
seq_list <- list()
for (row in 1:nrow(cds)) {
# Named vector for organization
tmp <- setNames(c(cds[[row, "chromosome_num"]],
cds[[row, "start"]],
cds[[row, "end"]]),
c("chrom", "start", "end"))
seq_list[[row]] <- subset_chrom(scerevisiae_genome, tmp[["chrom"]], tmp[["start"]], tmp[["end"]])
}
# Collapse objects
seq_list <- do.call(c, seq_list)
head(alphabetFrequency(seq_list, baseOnly = TRUE, as.prob = TRUE))
## A C G T other
## [1,] 0.2793651 0.3238095 0.11111111 0.2857143 0
## [2,] 0.3372549 0.2705882 0.08235294 0.3098039 0
## [3,] 0.2451791 0.1983471 0.29752066 0.2589532 0
## [4,] 0.2543860 0.2017544 0.21052632 0.3333333 0
## [5,] 0.3350168 0.2037037 0.15712682 0.3041526 0
## [6,] 0.3495146 0.1877023 0.10032362 0.3624595 0
You should not include the mitochondrial chromosome because it has a very high mutation rate compared to the genome (mtDNA instability).