Using the S. cerevisiae (Baker’s yeast) data that we imported into R in Lectures 11 and 12, show R code of how you would estimate the frequency of A, C, G, T nucleotides in coding regions only.

Should the mitochondrial chromosome be included or not. Yes/no with 1 sentence max.

library(tidyverse)
library(Biostrings)
scerevisiae_genome <- readRDS("/cloud/project/data/Scerevisiae_vers_1.0.rds")

Draw out the plan

1. Get CDS entries in annotations

( cds <- scerevisiae_genome@annotations %>%
    filter(type == "CDS") %>%
    mutate(chromosome_num = str_remove(.$seqid, "chr")) %>% # Extract roman numerals
    filter(chromosome_num != "mt") # Remove mt
  )
## # A tibble: 6,999 x 10
##    seqid source type  start   end score strand phase attributes   chromosome_num
##    <fct> <fct>  <fct> <int> <int> <dbl> <fct>  <fct> <chr>        <chr>         
##  1 chrI  SGD    CDS     335   649    NA +      0     Parent=YAL0… I             
##  2 chrI  SGD    CDS     538   792    NA +      0     Parent=YAL0… I             
##  3 chrI  SGD    CDS    1807  2169    NA -      0     Parent=YAL0… I             
##  4 chrI  SGD    CDS    2480  2707    NA +      0     Parent=YAL0… I             
##  5 chrI  SGD    CDS    7235  9016    NA -      0     Parent=YAL0… I             
##  6 chrI  SGD    CDS   10091 10399    NA +      0     Parent=YAL0… I             
##  7 chrI  SGD    CDS   11565 11951    NA -      0     Parent=YAL0… I             
##  8 chrI  SGD    CDS   12046 12426    NA +      0     Parent=YAL0… I             
##  9 chrI  SGD    CDS   13363 13743    NA -      0     Parent=YAL0… I             
## 10 chrI  SGD    CDS   21566 21850    NA +      0     Parent=YAL0… I             
## # … with 6,989 more rows
# Change roman numerals to numeric
cds$chromosome_num <-  as.numeric(as.roman(cds$chromosome_num))

2. Extract sequences from chromosomes attribute based on location

subset_chrom <- function(genome, chrom, start, end) {
  # Input: Genome object, chromosome number and positions of desired seq
  # Output is a DNAStringSet object of the desired sequence
  return(DNAStringSet(genome@chromosomes[[chrom]][start:end]))
}

Testing function

subset_chrom(genome = scerevisiae_genome, chrom = 1, 1, 10)
## DNAStringSet object of length 1:
##     width seq
## [1]    10 CCACACCACA
# Return list of DNAStringSet objects
seq_list <- list()

for (row in 1:nrow(cds)) {
  # Named vector for organization 
  tmp <- setNames(c(cds[[row, "chromosome_num"]], 
                    cds[[row, "start"]],
                    cds[[row, "end"]]), 
                  c("chrom", "start", "end"))
  
  seq_list[[row]] <- subset_chrom(scerevisiae_genome, tmp[["chrom"]], tmp[["start"]], tmp[["end"]])
  }

# Collapse objects
seq_list <- do.call(c, seq_list)

3. Get frequency

head(alphabetFrequency(seq_list, baseOnly = TRUE, as.prob = TRUE))
##              A         C          G         T other
## [1,] 0.2793651 0.3238095 0.11111111 0.2857143     0
## [2,] 0.3372549 0.2705882 0.08235294 0.3098039     0
## [3,] 0.2451791 0.1983471 0.29752066 0.2589532     0
## [4,] 0.2543860 0.2017544 0.21052632 0.3333333     0
## [5,] 0.3350168 0.2037037 0.15712682 0.3041526     0
## [6,] 0.3495146 0.1877023 0.10032362 0.3624595     0

You should not include the mitochondrial chromosome because it has a very high mutation rate compared to the genome (mtDNA instability).