# Derivation and implementation of Baum Welch Algorithm for Hidden Markov Model

An important and somewhat complicated part of the Hidden Markov Model is the Learning Problem. Even though it can be used as an unsupervised technique, the more common approach is to use supervised learning just for defining the number of hidden states. In this article we will go through the step-by-step derivation of the Baum-Welch Algorithm (a.k.a the Forward-Backward Algorithm) and then implement it using both Python and R.

This is the third part of the Introduction to Hidden Markov Model tutorial series. So far we have gone through the intuition of HMM, and the derivation and implementation of the Forward and Backward Algorithm. In case you want a refresher, please refer to Part 2 of the tutorial series.

Forward and Backward Algorithm in Hidden Markov Model

• The objective of the Learning Problem is to estimate \( a_{ij} \) and \( b_{jk} \) using the training data.
• The standard algorithm for Hidden Markov Model training is the Forward-Backward or Baum-Welch Algorithm.
• This algorithm uses a special case of the Expectation Maximization (EM) Algorithm.

### Example using Maximum Likelihood Estimate:

Now let's try to build some intuition using an example of Maximum Likelihood Estimation. Consider training a Simple Markov Model where the hidden state is visible.

We will use the example from the programming section (you should already have it if you followed Part 2), where we had 2 hidden states [A,B] and 3 visible states [1,2,3]. (Assume that in this example the hidden states are also known.)

As you can see here, we have four different sets of sequences (each in different colors).

```
Visible: 3 2 2 | 1 1 3 | 1 2 3 | 2 1 1
Hidden : B B A | A A B | B A B | B A A
```

Now we will compute the HMM parameters by Maximum Likelihood Estimation using the sample data above.

#### Estimate Initial Probability Distribution

We will initialize \( \pi \) using the probabilities derived from the above sequences. In the example above, one of the four sequences started with A and the remaining three with B. We can define,

\[
\pi_A = 1/4, \quad \pi_B = 3/4
\]

#### Estimate Transition Probabilities:

Let's define our Transition Probability Matrix first as:

\[
\hat{A} = \begin{bmatrix}
p(A|A) & p(B|A) \\
p(A|B) & p(B|B)
\end{bmatrix}
\]

We can calculate the probabilities from the example as follows (ignore the last hidden state of each sequence, since there is no state to transition to):

\[
\hat{A} = \begin{bmatrix}
2/4 & 2/4 \\
3/4 & 1/4
\end{bmatrix}
\]

#### Estimate Emission Probabilities:

In the same way, the following will be our Emission Probability Matrix.

\[
\hat{B} = \begin{bmatrix}
p(1|A) & p(2|A) & p(3|A) \\
p(1|B) & p(2|B) & p(3|B)
\end{bmatrix}
\]

Here are the calculated probabilities:

\[
\hat{B} = \begin{bmatrix}
4/6 & 2/6 & 0/6 \\
1/6 & 2/6 & 3/6
\end{bmatrix}
\]
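The counting above can be sketched in a few lines of Python/NumPy. This is a minimal illustration, not the tutorial's own script: I assume a 0-based encoding (hidden states A=0, B=1; visible symbols 1,2,3 stored as 0,1,2), and the variable names are my own.

```python
import numpy as np

# The four (hidden, visible) training sequences, encoded 0-based.
hidden  = [[1, 1, 0], [0, 0, 1], [1, 0, 1], [1, 0, 0]]   # B B A | A A B | B A B | B A A
visible = [[2, 1, 1], [0, 0, 2], [0, 1, 2], [1, 0, 0]]   # 3 2 2 | 1 1 3 | 1 2 3 | 2 1 1

M, K = 2, 3                      # number of hidden states / visible symbols
pi = np.zeros(M)
A = np.zeros((M, M))
B = np.zeros((M, K))

for h, v in zip(hidden, visible):
    pi[h[0]] += 1                        # count initial hidden states
    for t in range(len(h) - 1):
        A[h[t], h[t + 1]] += 1           # transition counts (last state has no successor)
    for t in range(len(h)):
        B[h[t], v[t]] += 1               # emission counts

pi /= pi.sum()                           # -> [0.25, 0.75]
A /= A.sum(axis=1, keepdims=True)        # -> [[2/4, 2/4], [3/4, 1/4]]
B /= B.sum(axis=1, keepdims=True)        # -> [[4/6, 2/6, 0/6], [1/6, 2/6, 3/6]]
```

Normalizing each row of the count matrices is exactly the Maximum Likelihood Estimate: observed transitions (or emissions) divided by the total number of times the source state occurred.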

## Baum-Welch Algorithm:

The above Maximum Likelihood Estimate works only when the sequence of hidden states is known. However, that's not the case for us. Hence we need to find another way to estimate the Transition and Emission Matrices.

This algorithm is also known as the Forward-Backward or Baum-Welch Algorithm, and it is a special case of the Expectation Maximization (EM) algorithm.

### High Level Steps of the Algorithm (EM):

Let's first understand what we need in order to get an estimate for the parameters of the HMM. Here are the high level steps:

1. Start with initial probability estimates [A,B]. Initially set equal probabilities or define them randomly.
2. Compute the expectation of how often each transition/emission has been used. We will estimate the latent variables \( \xi, \gamma \) (this is the standard approach for the EM Algorithm).
3. Re-estimate the probabilities [A,B] based on those estimates (the latent variables).
4. Repeat until convergence.

### How to solve the Baum-Welch Algorithm?:

There are two main ways we can solve the Baum-Welch Algorithm.

• Probabilistic Approach: HMM is a Generative model, hence we can solve Baum-Welch using a probabilistic approach.
• Lagrange Multipliers: The Learning Problem can be defined as a constrained optimization problem, hence it can also be solved using Lagrange Multipliers.

The final equations for both A and B will look the same irrespective of the approach, since both A and B can be defined using joint and marginal probabilities. Let's look at their formal definitions:

#### Estimate for \( a_{ij} \):

\[
\hat{a}_{ij} = \frac{\text{expected number of transitions from hidden state } i \text{ to state } j}{\text{expected number of transitions from hidden state } i}
\]

#### Estimate for \( b_{jk} \):

\[
\hat{b}_{jk} = \frac{\text{expected number of times in hidden state } j \text{ and observing } v(k)}{\text{expected number of times in hidden state } j}
\]

The above definitions are just the generalized view of the Maximum Likelihood example we went through. Let's use the probabilistic approach and find out how we can estimate the parameters A and B.

### Probabilistic Approach:

#### Derivation of \( \hat{a}_{ij} \):

If we know the probability of a given transition from i to j at time step t, then we can sum over all T time steps to estimate the numerator in our equation for \( \hat{A} \).

By the way, \( \hat{A} \) is just the matrix representation of \( \hat{a}_{ij} \), so don't be confused.

We can define this as the probability of being in state i at time t and in state j at time t+1, given the observation sequence and the model.

Mathematically,

\[
p(s(t)=i, s(t+1)=j \mid V^T, \theta)
\]

We already know from basic probability theory that,

\[
\begin{align}
p(X, Y \mid Z) &= p(X \mid Y, Z) \, p(Y \mid Z) \\
p(X \mid Y, Z) &= \frac{p(X, Y \mid Z)}{p(Y \mid Z)}
\end{align}
\]

We can now say,

\[
p(s(t)=i, s(t+1)=j \mid V^T, \theta) = \frac{p(s(t)=i, s(t+1)=j, V^T \mid \theta)}{p(V^T \mid \theta)}
\]

The numerator of the equation can be expressed using Forward and Backward probabilities (refer to the diagram below):

\[
p(s(t)=i, s(t+1)=j, V^T \mid \theta) = \alpha_i(t) \, a_{ij} \, b_{jk\,v(t+1)} \, \beta_j(t+1)
\]

The denominator \( p(V^T \mid \theta) \) is the probability of the observation sequence \( V^T \) by any path, given the model \( \theta \). It can be expressed as the marginal probability:

\[
p(V^T \mid \theta) = \sum_{i=1}^{M} \sum_{j=1}^{M} \alpha_i(t) \, a_{ij} \, b_{jk\,v(t+1)} \, \beta_j(t+1)
\]

We will define \( \xi \) as the latent variable representing \( p(s(t)=i, s(t+1)=j \mid V^T, \theta) \). We can now define \( \xi_{ij}(t) \) as:

\[
\xi_{ij}(t) = \frac{\alpha_i(t) \, a_{ij} \, b_{jk\,v(t+1)} \, \beta_j(t+1)}{\sum_{i=1}^{M} \sum_{j=1}^{M} \alpha_i(t) \, a_{ij} \, b_{jk\,v(t+1)} \, \beta_j(t+1)}
\]
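As an illustration, \( \xi_{ij}(t) \) for a single time step is one vectorized expression in NumPy, given the forward/backward tables. This is a sketch under my own conventions (`alpha`, `beta` as (T, M) arrays, observations as 0-based symbol indices), not the tutorial's implementation:

```python
import numpy as np

def xi_at(t, alpha, beta, a, b, v):
    """Compute the M x M matrix xi_ij(t) from forward/backward tables.

    alpha, beta: (T, M) forward and backward probabilities,
    a: (M, M) transitions, b: (M, K) emissions,
    v: observation sequence as 0-based symbol indices.
    """
    # numerator[i, j] = alpha_i(t) * a_ij * b_{j, v(t+1)} * beta_j(t+1)
    num = alpha[t][:, None] * a * (b[:, v[t + 1]] * beta[t + 1])[None, :]
    return num / num.sum()   # divide by the double sum, i.e. p(V^T | theta)
```

Because the result is divided by the sum of all its entries, the returned matrix always sums to 1 over (i, j): it is a joint posterior distribution over transitions at time t.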

The \( \xi_{ij}(t) \) defined above is only for one time step; we need to sum over all T to get the total joint probability of all the transitions from hidden state i to hidden state j. This will be the numerator of the equation for \( \hat{a}_{ij} \).

For the denominator, we need the marginal probability, which can be expressed as follows,

\[
\sum_{t=1}^{T-1} \sum_{j=1}^{M} \xi_{ij}(t)
\]

Now we can define \( \hat{a}_{ij} \) as,

\[
\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_{ij}(t)}{\sum_{t=1}^{T-1} \sum_{j=1}^{M} \xi_{ij}(t)} \qquad \ldots \ldots (1)
\]

#### Probabilistic view of the Denominator:

Before we move on to estimating B, let's understand more about the denominator of \( \hat{a}_{ij} \). The denominator is the probability of being in state i at time t, which can be expressed as:

\[
\begin{align}
p(s(t)=i \mid V^T, \theta) &= \frac{p(s(t)=i, V^T \mid \theta)}{p(V^T \mid \theta)} \\
&= \frac{p(v(1) \ldots v(t), s(t)=i \mid \theta) \; p(v(t+1) \ldots v(T) \mid s(t)=i, \theta)}{p(V^T \mid \theta)} \\
&= \frac{\alpha_i(t) \, \beta_i(t)}{p(V^T \mid \theta)} \\
&= \frac{\alpha_i(t) \, \beta_i(t)}{\sum_{i=1}^{M} \alpha_i(t) \, \beta_i(t)} = \gamma_i(t)
\end{align}
\]

If we use the above equation to define our estimate of A, it will be,

\[
\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_{ij}(t)}{\sum_{t=1}^{T-1} \gamma_i(t)} \qquad \ldots \ldots (2)
\]

This is the same equation as \( (1) \) we derived earlier.

However, since

\[
\gamma_i(t) = \sum_{j=1}^{M} \xi_{ij}(t)
\]

we can just use \( \xi_{ij}(t) \) to define \( \hat{a}_{ij} \). This will save some computation.
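The identity \( \gamma_i(t) = \sum_j \xi_{ij}(t) \) is easy to check numerically. The sketch below builds a tiny HMM with arbitrary, made-up parameters, runs the forward and backward recursions from Part 2, and compares the two expressions for \( \gamma_i(t) \):

```python
import numpy as np

# Tiny made-up HMM purely for checking gamma_i(t) == sum_j xi_ij(t);
# all parameter values below are arbitrary.
a = np.array([[0.6, 0.4], [0.3, 0.7]])            # transitions
b = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])  # emissions
pi = np.array([0.4, 0.6])
v = [0, 2, 1, 0]                                  # observations, 0-based
T, M = len(v), 2

# Forward and backward recursions.
alpha = np.zeros((T, M))
beta = np.zeros((T, M))
alpha[0] = pi * b[:, v[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ a) * b[:, v[t]]
beta[T - 1] = 1.0
for t in range(T - 2, -1, -1):
    beta[t] = a @ (b[:, v[t + 1]] * beta[t + 1])

t = 1
num = alpha[t][:, None] * a * (b[:, v[t + 1]] * beta[t + 1])[None, :]
xi = num / num.sum()                              # xi_ij(t)
gamma = alpha[t] * beta[t] / (alpha[t] * beta[t]).sum()
print(np.allclose(gamma, xi.sum(axis=1)))         # True
```

The check works because the backward recursion gives \( \beta_i(t) = \sum_j a_{ij} \, b_{jk\,v(t+1)} \, \beta_j(t+1) \), which is exactly what summing \( \xi_{ij}(t) \) over j produces in the numerator.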

In summary, if you see the estimate of \( a_{ij} \) written using either equation, don't be confused; both \( (1) \) and \( (2) \) are the same, even though the representations are different.

#### Derivation of \( \hat{b}_{jk} \):

\( b_{jk} \) is the probability of a given symbol \( v_k \) from the observations V, given a hidden state j.

We already know the probability of being in state j at time t:

\[
\gamma_j(t) = \frac{\alpha_j(t) \, \beta_j(t)}{\sum_{j=1}^{M} \alpha_j(t) \, \beta_j(t)}
\]

We can compute \( \hat{b}_{jk} \) using \( \gamma_j(t) \),

\[
\hat{b}_{jk} = \frac{\sum_{t=1}^{T} \gamma_j(t) \, \mathbb{1}(v(t)=k)}{\sum_{t=1}^{T} \gamma_j(t)}
\]

where \( \mathbb{1}(v(t)=k) \) is the indicator function.
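This update is a short function in NumPy, where a boolean mask plays the role of the indicator function. A sketch under my own naming (`gamma` as a (T, M) array of state posteriors), not the tutorial's code:

```python
import numpy as np

def reestimate_b(gamma, v, K):
    """M-step update for the emission matrix.

    gamma: (T, M) posteriors p(s(t)=j | V^T, theta),
    v: length-T observation sequence (0-based symbol indices),
    K: number of visible symbols.
    """
    v = np.asarray(v)
    M = gamma.shape[1]
    b_hat = np.zeros((M, K))
    for k in range(K):
        # numerator: gamma summed over the times where v(t) == k
        b_hat[:, k] = gamma[v == k].sum(axis=0)
    # denominator: expected total time spent in each state j
    return b_hat / gamma.sum(axis=0)[:, None]
```

Since every time step contributes to exactly one symbol k, each row of the result sums to 1, as a proper emission distribution must.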

### Final EM Algorithm:

• initialize A and B
• iterate until convergence
  • E-Step
    • \( \xi_{ij}(t) = \frac{\alpha_i(t) \, a_{ij} \, b_{jk\,v(t+1)} \, \beta_j(t+1)}{\sum_{i=1}^{M} \sum_{j=1}^{M} \alpha_i(t) \, a_{ij} \, b_{jk\,v(t+1)} \, \beta_j(t+1)} \)
    • \( \gamma_i(t) = \sum_{j=1}^{M} \xi_{ij}(t) \)
  • M-Step
    • \( \hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_{ij}(t)}{\sum_{t=1}^{T-1} \sum_{j=1}^{M} \xi_{ij}(t)} \)
    • \( \hat{b}_{jk} = \frac{\sum_{t=1}^{T} \gamma_j(t) \, \mathbb{1}(v(t)=k)}{\sum_{t=1}^{T} \gamma_j(t)} \)
• return A, B
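The loop above can be sketched end to end in Python/NumPy. This is a minimal, unscaled version for a single observation sequence, with my own function and variable names rather than the tutorial's script; a production version would work with scaled or log probabilities to avoid underflow on long sequences:

```python
import numpy as np

def baum_welch(v, a, b, pi, n_iter=100):
    """Sketch of the EM loop above for one observation sequence.

    v: 0-based observation indices; a: (M, M); b: (M, K); pi: (M,).
    No scaling is applied, so long sequences will underflow.
    """
    v = np.asarray(v)
    T, M = len(v), a.shape[0]
    a, b = a.copy(), b.copy()
    for _ in range(n_iter):
        # Forward and backward passes.
        alpha = np.zeros((T, M))
        beta = np.zeros((T, M))
        alpha[0] = pi * b[:, v[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ a) * b[:, v[t]]
        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = a @ (b[:, v[t + 1]] * beta[t + 1])
        # E-step: xi has T-1 entries, gamma has T.
        xi = np.zeros((T - 1, M, M))
        for t in range(T - 1):
            num = alpha[t][:, None] * a * (b[:, v[t + 1]] * beta[t + 1])[None, :]
            xi[t] = num / num.sum()
        gamma = xi.sum(axis=2)                    # gamma_i(t) = sum_j xi_ij(t)
        last = alpha[T - 1] * beta[T - 1]         # append the T'th time step
        gamma = np.vstack([gamma, last / last.sum()])
        # M-step.
        a = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(b.shape[1]):
            b[:, k] = gamma[v == k].sum(axis=0)   # indicator via boolean mask
        b /= gamma.sum(axis=0)[:, None]
    return a, b
```

Note that \( \gamma \) at the last time step cannot come from \( \xi \) (which only runs to T-1), so it is computed directly from \( \alpha_i(T)\beta_i(T) \) and appended.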

### Lagrange Multipliers:

We can represent the Learning Problem as a constrained optimization problem and define it as,

\[
\begin{align}
\text{Optimize } & p(V^T \mid \theta) \\
\text{where } \theta &= \{ \pi, A, B \} \\
\text{subject to }
& \begin{cases}
\sum_{i=1}^{M} \pi_i = 1 \\
\sum_{j=1}^{M} a_{ij} = 1, & \forall i \in \{1, \ldots, M\} \\
\sum_{k=1}^{M} b_{jk} = 1, & \forall j \in \{1, \ldots, M\}
\end{cases}
\end{align}
\]

We can then solve this using Lagrange Multipliers by taking derivatives. We are not going through the details of that derivation here; however, if you are interested, let me know and I can expand this section if needed.

## R-Script:

Here is the implementation of the algorithm.

• In lines 23-24, we append the T'th entry to \( \gamma \), since \( \xi \)'s length is T-1.
• We are using \( \xi \) to derive \( \gamma \).
• The indicator function has been implemented using `which` in line 26.

Here is the full code.
