CpG islands are regions of a genome enriched for CG dinucleotides. CpG islands play an important role in epigenetic regulation. They typically span several hundred to several thousand base pairs. Describe an HMM to recognize CpG islands. Remember to specify the three components of the Markov model and also the emission probabilities.
Describe how you parameterize this model.
The three components of a Markov model (MM) are:
1. States
2. Starting probabilities
3. Transition probabilities (transition matrix)
The addition of emission probabilities makes it a hidden Markov model (HMM).
In our example, the HMM could be represented as such:
Let’s use the following diagram as an example:
The states in this model are CpG (abbreviated C below) and not-CpG (abbreviated N).
The initial probability for either state is 0.5.
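As a rough illustration, the four pieces of the model can be held in one small container. This is only a sketch: the class name `HMM`, its field names, and the `validate` helper are hypothetical, and the uniform transition and emission values are placeholders. Only the two states and the 0.5 starting probabilities come from the text.

```python
from dataclasses import dataclass
import math


@dataclass
class HMM:
    states: tuple
    start: dict        # state -> probability of starting in that state
    transition: dict   # state -> {next state -> probability}
    emission: dict     # state -> {symbol -> probability}

    def validate(self):
        """Return True if every probability distribution in the model sums to 1."""
        rows = [self.start, *self.transition.values(), *self.emission.values()]
        return all(math.isclose(sum(row.values()), 1.0) for row in rows)


# C = CpG, N = not-CpG. The transition and emission values below are
# uniform placeholders, to be replaced by estimates from training data.
model = HMM(
    states=("C", "N"),
    start={"C": 0.5, "N": 0.5},
    transition={s: {"C": 0.5, "N": 0.5} for s in ("C", "N")},
    emission={s: {nt: 0.25 for nt in "ACGT"} for s in ("C", "N")},
)
print(model.validate())
```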
To calculate the transition matrix, we count how often each transition occurs and divide by the total number of transitions that leave the same starting state.
For example:
Using this diagram, we can calculate the probability of each transition:
Note: The denominator is the total number of transitions that start from that state.
\[ p(C \rightarrow C) = \dfrac{11 + 9}{22} = 0.91 \]
\[ p(C \rightarrow N) = \dfrac{1 + 1}{22} = 0.09 \]
\[ p(N \rightarrow N) = \dfrac{10 + 8 + 8}{28} = 0.93 \]
\[ p(N \rightarrow C) = \dfrac{1 + 1}{28} = 0.07 \]
Therefore:
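The normalization used for the transition probabilities can be sketched in Python. The counts are the sums that appear in the numerators of the equations; the dictionary layout (outer key is the current state, inner key is the next state) and the function name `normalize_rows` are illustrative assumptions.

```python
# Transition counts copied from the numerators of the equations:
# C -> C: 11 + 9, C -> N: 1 + 1, N -> N: 10 + 8 + 8, N -> C: 1 + 1
TRANSITION_COUNTS = {
    "C": {"C": 11 + 9, "N": 1 + 1},
    "N": {"C": 1 + 1, "N": 10 + 8 + 8},
}


def normalize_rows(counts):
    """Divide each count by the total number of transitions leaving that state."""
    probs = {}
    for state, row in counts.items():
        total = sum(row.values())
        probs[state] = {nxt: n / total for nxt, n in row.items()}
    return probs


transition = normalize_rows(TRANSITION_COUNTS)
for state, row in transition.items():
    print(state, {nxt: round(p, 2) for nxt, p in row.items()})
```

Each row of the resulting matrix sums to 1, because every transition out of a state must lead to one of the two states.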
The emissions in this example are the nucleotides A, C, T, and G. We therefore expect eight emission probabilities: one for each of the four nucleotides in each of the two states. To estimate them, we need a positive and a negative training set in which the states are known, so that the probabilities can be calculated from the nucleotide frequencies in each state.
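A minimal sketch of this counting step is shown here, assuming the training data is available as a nucleotide string with a parallel string of state labels (C = CpG, N = not-CpG). The function name `emission_probabilities` and the toy sequence are made up for illustration and are not the training data used in this example.

```python
from collections import Counter


def emission_probabilities(sequence, labels, alphabet="ACGT"):
    """Estimate per-state emission probabilities from a labelled sequence."""
    counts = {}
    for nt, state in zip(sequence, labels):
        counts.setdefault(state, Counter())[nt] += 1
    return {
        state: {nt: c[nt] / sum(c.values()) for nt in alphabet}
        for state, c in counts.items()
    }


# Hypothetical toy input: each nucleotide carries a known state label.
toy_sequence = "ATCGCGTA"
toy_labels = "NNCCCCNN"
probs = emission_probabilities(toy_sequence, toy_labels)
print(probs["N"])
print(probs["C"])
```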
For example:
In this case, the frequencies are:
\[
\begin{array}{ll}
f^N_A = 16 & f^C_A = 2 \\
f^N_C = 3 & f^C_C = 8 \\
f^N_G = 1 & f^C_G = 11 \\
f^N_T = 9 & f^C_T = 1
\end{array}
\]
Therefore, the emission probabilities are obtained by dividing each frequency by the total number of nucleotides observed in that state (N = 29, C = 22):
\[
\begin{array}{ll}
e_N(A) = 0.55 & e_C(A) = 0.09 \\
e_N(C) = 0.10 & e_C(C) = 0.36 \\
e_N(G) = 0.03 & e_C(G) = 0.50 \\
e_N(T) = 0.31 & e_C(T) = 0.05
\end{array}
\]
Therefore:
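With all parameters in hand, the model can be used to label a sequence. The sketch below is one possible way to do this, using Viterbi decoding in log space. The decoding algorithm is not described in the text above, so it is an assumption about how the model would be applied. The starting probabilities, transition counts, and nucleotide counts are the ones given in this example; the function name `viterbi` and the test sequence are illustrative. The code uses the unrounded fractions and assumes a non-empty sequence over A, C, G, T.

```python
import math

STATES = ("C", "N")  # C = CpG, N = not-CpG

START = {"C": 0.5, "N": 0.5}
TRANS = {
    "C": {"C": 20 / 22, "N": 2 / 22},
    "N": {"C": 2 / 28, "N": 26 / 28},
}
# Nucleotide counts per state, as listed in the frequency table.
COUNTS = {
    "C": {"A": 2, "C": 8, "G": 11, "T": 1},
    "N": {"A": 16, "C": 3, "G": 1, "T": 9},
}
EMIT = {
    state: {nt: n / sum(row.values()) for nt, n in row.items()}
    for state, row in COUNTS.items()
}


def viterbi(seq):
    """Return the most probable state path for seq as a string of C/N labels."""
    # Log probabilities avoid numerical underflow on long sequences.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for nt in seq[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            # Best previous state for arriving in state s at this position.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointer[s] = prev
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][nt])
            )
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("AAAAAGCGCGCGAAAAA"))
```

On this illustrative input, the G/C-rich stretch in the middle is labelled C and the flanking A runs are labelled N, which is the behaviour expected from the emission probabilities.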