An important and complicated part of the Hidden Markov Model is the Learning Problem. Even though HMM can be used as an unsupervised technique, the more common approach is to use supervised learning just to define the number of hidden states. In this Derivation and Implementation of Baum Welch Algorithm for Hidden Markov Model article we will go through the step-by-step derivation of the Baum-Welch Algorithm (a.k.a. the Forward-Backward Algorithm) and then implement it using both Python and R.
This is the third part of the Introduction to Hidden Markov Model tutorial. So far we have gone through the intuition of HMM and the derivation and implementation of the Forward and Backward Algorithm. If you need a refresher, please refer to Part 2 of the tutorial series.
Forward and Backward Algorithm in Hidden Markov Model
 The objective of the Learning Problem is to estimate \( a_{ij} \) and \( b_{jk} \) using the training data.
 The standard algorithm for Hidden Markov Model training is the Forward-Backward, or Baum-Welch, Algorithm.
 This algorithm uses a special case of the Expectation Maximization (EM) Algorithm.
Example using Maximum Likelihood Estimate:
Now let's try to get an intuition using an example of Maximum Likelihood Estimation. Imagine training a simple Markov Model where the hidden states are visible.
We will use the example from the programming section (you should already have it if you followed Part 2), where we had 2 hidden states [A, B] and 3 visible states [1, 2, 3]. (Assume that in this example the hidden states are also known.)
As you see here, we have 4 different sets of sequences (each in a different color).
3  2  2  1  1  3  1  2  3  2  1  1 
B  B  A  A  A  B  B  A  B  B  A  A 
Now we will compute the HMM parameters by Maximum Likelihood Estimation using the sample data above.
Estimate Initial Probability Distribution
We will initialize \( \pi \) using the probabilities derived from the above sequences. In the example above, one of the four sequences started with A and the remaining three with B. We can define,
\[
\pi_A = 1/4, \quad \pi_B = 3/4
\]
Estimate Transition Probabilities:
Let's first define our Transition Probability Matrix as:
\[
\hat{A} = \begin{bmatrix}
p(A|A) & p(B|A) \\
p(A|B) & p(B|B)
\end{bmatrix}
\]
We can calculate the probabilities from the example as (ignore the final hidden state of each sequence, since there is no state to transition to):
\[
\hat{A} = \begin{bmatrix}
2/4 & 2/4 \\
3/4 & 1/4
\end{bmatrix}
\]
Estimate Emission Probabilities:
In the same way, the following will be our Emission Probability Matrix:
\[
\hat{B} = \begin{bmatrix}
p(1|A) & p(2|A) & p(3|A) \\
p(1|B) & p(2|B) & p(3|B)
\end{bmatrix}
\]
Here are the calculated probabilities:
\[
\hat{B} = \begin{bmatrix}
4/6 & 2/6 & 0/6 \\
1/6 & 2/6 & 3/6
\end{bmatrix}
\]
Baum-Welch Algorithm:
The above Maximum Likelihood Estimate works only when the sequence of hidden states is known. However, that is not the case for us. Hence we need to find another way to estimate the Transition and Emission Matrices.
This algorithm is also known as the Forward-Backward or Baum-Welch Algorithm; it is a special case of the Expectation Maximization (EM) algorithm.
High-Level Steps of the Algorithm (EM):
Let's first understand what we need in order to estimate the parameters of the HMM. Here are the high-level steps:
 Start with initial probability estimates [A, B]. Initially set equal probabilities or define them randomly.
 Compute the expectation of how often each transition/emission has been used. We will estimate the latent variables \( \xi \) and \( \gamma \). (This is the standard approach for the EM Algorithm.)
 Re-estimate the probabilities [A, B] based on those expectations (latent variables).
 Repeat until convergence.
How to solve the Baum-Welch Algorithm?:
There are two main ways we can solve the Baum-Welch Algorithm.
 Probabilistic Approach: HMM is a generative model, hence we can solve Baum-Welch using a probabilistic approach.
 Lagrange Multipliers: The Learning Problem can be defined as a constrained optimization problem, hence it can also be solved using Lagrange Multipliers.
The final equations for A and B will look the same irrespective of the approach, since both A and B can be defined using joint and marginal probabilities. Let's look at their formal definitions:
Estimate for \( a_{ij} \):
\[
\hat{a}_{ij} = \frac{\text{expected number of transitions from hidden state } i \text{ to state } j}{\text{expected number of transitions from hidden state } i}
\]
Estimate for \( b_{jk} \):
\[
\hat{b}_{jk} = \frac{\text{expected number of times in hidden state } j \text{ and observing } v(k)}{\text{expected number of times in hidden state } j}
\]
The above definitions are just a generalized view of the Maximum Likelihood example we went through. Let's use the probabilistic approach and find out how we can estimate the parameters A and B.
Probabilistic Approach:
Derivation of \( \hat{a}_{ij} \):
If we know the probability of a given transition from i to j at time step t, then we can sum over all T time steps to estimate the numerator of our equation for \( \hat{A} \).
By the way, \( \hat{A} \) is just the matrix representation of \( \hat{a}_{ij} \), so don't be confused.
We can define this as the probability of being in state i at time t and in state j at time t+1, given the observation sequence and the model.
Mathematically,
\[
p(s(t) = i, s(t+1) = j \mid V^T, \theta)
\]
We already know from basic probability theory that,
\[
\begin{align}
p(X, Y \mid Z) &= p(X \mid Y, Z)\, p(Y \mid Z) \\
p(X \mid Y, Z) &= \frac{p(X, Y \mid Z)}{p(Y \mid Z)}
\end{align}
\]
We can now say,
\[
p(s(t) = i, s(t+1) = j \mid V^T, \theta) = \frac{p(s(t) = i, s(t+1) = j, V^T \mid \theta)}{p(V^T \mid \theta)}
\]
The numerator of the equation can be expressed using Forward and Backward Probabilities (refer to the diagram below):
\[
p(s(t) = i, s(t+1) = j, V^T \mid \theta) = \alpha_i(t)\, a_{ij}\, b_{jk\, v(t+1)}\, \beta_j(t+1)
\]
The denominator \( p(V^T \mid \theta) \) is the probability of the observation sequence \( V^T \) over all possible paths, given the model \( \theta \). It can be expressed as the marginal probability:
\[
p(V^T \mid \theta) = \sum_{i=1}^M \sum_{j=1}^M \alpha_i(t)\, a_{ij}\, b_{jk\, v(t+1)}\, \beta_j(t+1)
\]
We will define \( \xi \) as the latent variable representing \( p(s(t) = i, s(t+1) = j \mid V^T, \theta) \). We can now define \( \xi_{ij}(t) \) as:
\[
\xi_{ij}(t) = \frac{\alpha_i(t)\, a_{ij}\, b_{jk\, v(t+1)}\, \beta_j(t+1)}{\sum_{i=1}^M \sum_{j=1}^M \alpha_i(t)\, a_{ij}\, b_{jk\, v(t+1)}\, \beta_j(t+1)}
\]
The \( \xi_{ij}(t) \) defined above is only for one time step; we need to sum over all T to get the total joint probability of all the transitions from hidden state i to hidden state j. This will be the numerator of the equation for \( \hat{a}_{ij} \).
For the denominator, we need the marginal probability, which can be expressed as follows:
\[
\sum_{t=1}^{T-1} \sum_{j=1}^M \xi_{ij}(t)
\]
Now we can define \( \hat{a}_{ij} \) as,
\[
\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_{ij}(t)}{\sum_{t=1}^{T-1} \sum_{j=1}^M \xi_{ij}(t)} \quad \cdots (1)
\]
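To make equation (1) concrete, here is a small numpy sanity check; the \( \alpha \), \( \beta \), a, and b values below are arbitrary made-up numbers, not from the article. It builds \( \xi_{ij}(t) \) for a single time step and confirms that the double sum in the denominator normalizes it into a joint distribution over the state pair (i, j):

```python
import numpy as np

# Made-up values for M = 2 hidden states and 3 symbols (not from the article).
a = np.array([[0.6, 0.4], [0.3, 0.7]])            # transition probabilities
b = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])  # emission probabilities
alpha_t = np.array([0.20, 0.05])                  # forward values at time t
beta_t1 = np.array([0.30, 0.10])                  # backward values at time t+1
v_t1 = 2                                          # symbol index observed at t+1

# Numerator: alpha_i(t) * a_ij * b_{j,v(t+1)} * beta_j(t+1) for every (i, j).
numerator = alpha_t[:, None] * a * b[:, v_t1][None, :] * beta_t1[None, :]

# The denominator is the double sum over i and j, so xi_t sums to 1.
xi_t = numerator / numerator.sum()
assert np.isclose(xi_t.sum(), 1.0)

# Summing out j recovers gamma_i(t), the quantity used in equation (2).
gamma_t = xi_t.sum(axis=1)
```

The broadcasting in `numerator` is doing exactly what the double subscript says: row i carries \( \alpha_i(t) \), column j carries \( b_{j\,v(t+1)} \beta_j(t+1) \).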
Probabilistic view of the Denominator:
Before moving on to estimating B, let's understand more about the denominator of \( \hat{a}_{ij} \). The denominator is the probability of being in state i at time t, which can be expressed as:
\[
\begin{align}
p(s(t)=i \mid V^T, \theta) &= \frac{p(s(t)=i, V^T \mid \theta)}{p(V^T \mid \theta)} \\
&= \frac{p(v(1) \dots v(t), s(t)=i \mid \theta)\, p(v(t+1) \dots v(T) \mid s(t)=i, \theta)}{p(V^T \mid \theta)} \\
&= \frac{\alpha_i(t)\, \beta_i(t)}{p(V^T \mid \theta)} \\
&= \frac{\alpha_i(t)\, \beta_i(t)}{\sum_{i=1}^M \alpha_i(t)\, \beta_i(t)} = \gamma_i(t)
\end{align}
\]
If we use the above equation to define our estimate of A, it will be,
\[
\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_{ij}(t)}{\sum_{t=1}^{T-1} \gamma_i(t)} \quad \cdots (2)
\]
This is the same equation as (1), which we derived earlier.
However, since
\[
\gamma_i(t) = \sum_{j=1}^M \xi_{ij}(t)
\]
we can just use \( \xi_{ij}(t) \) to define \( \hat{a}_{ij} \). This will save some computation.
In summary, if you see the estimate of \( a_{ij} \) written with either equation, don't be confused: (1) and (2) are the same, even though the representations are different.
Derivation of \( \hat{b}_{jk} \):
\( b_{jk} \) is the probability of a given symbol \( v_k \) from the observations V, given a hidden state j.
We already know the probability of being in state j at time t:
\[
\gamma_j(t) = \frac{\alpha_j(t)\, \beta_j(t)}{\sum_{j=1}^M \alpha_j(t)\, \beta_j(t)}
\]
We can compute \( \hat{b}_{jk} \) using \( \gamma_j(t) \),
\[
\hat{b}_{jk} = \frac{\sum_{t=1}^{T} \gamma_j(t)\, 1(v(t)=k)}{\sum_{t=1}^{T} \gamma_j(t)}
\]
where \( 1(v(t)=k) \) is the indicator function.
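As a quick illustration of the indicator function in the \( \hat{b}_{jk} \) update, here is a numpy sketch with placeholder values (the gamma matrix below is random, not computed from a real model): the indicator \( 1(v(t)=k) \) simply becomes a boolean mask selecting the time steps where symbol k was observed.

```python
import numpy as np

# Placeholder gamma (M x T) standing in for real posteriors; assumed shapes only.
M, T, K = 2, 6, 3
rng = np.random.default_rng(0)
gamma = rng.random((M, T))            # gamma_j(t) for j = 1..M, t = 1..T
V = np.array([0, 2, 1, 2, 0, 1])      # observation symbol indices in 0..K-1

b_hat = np.zeros((M, K))
for k in range(K):
    # Numerator: sum of gamma_j(t) over only the t where v(t) = k.
    b_hat[:, k] = gamma[:, V == k].sum(axis=1)

# Denominator: sum of gamma_j(t) over all t, making each row a distribution.
b_hat /= gamma.sum(axis=1, keepdims=True)

assert np.allclose(b_hat.sum(axis=1), 1.0)
```

Because every time step falls into exactly one symbol bucket, the K numerators partition the denominator, so the rows of \( \hat{B} \) always sum to 1 by construction.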
Final EM Algorithm:
 initialize A and B
 iterate until convergence
  E-Step
   \( \xi_{ij}(t) = \frac{\alpha_i(t)\, a_{ij}\, b_{jk\, v(t+1)}\, \beta_j(t+1)}{\sum_{i=1}^M \sum_{j=1}^M \alpha_i(t)\, a_{ij}\, b_{jk\, v(t+1)}\, \beta_j(t+1)} \)
   \( \gamma_i(t) = \sum_{j=1}^M \xi_{ij}(t) \)
  M-Step
   \( \hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_{ij}(t)}{\sum_{t=1}^{T-1} \sum_{j=1}^M \xi_{ij}(t)} \)
   \( \hat{b}_{jk} = \frac{\sum_{t=1}^{T} \gamma_j(t)\, 1(v(t)=k)}{\sum_{t=1}^{T} \gamma_j(t)} \)
 return A, B
Lagrange Multipliers:
We can represent the Learning Problem as a constrained optimization problem and define it as,
\[
\begin{align}
\text{Optimize } & p(V^T \mid \theta) \\
\text{where } \theta &= \{ \pi, A, B \} \\
\text{Subject to } &
\begin{cases}
\sum_{i=1}^M \pi_i = 1 \\
\sum_{j=1}^M a_{ij} = 1, & \forall i \in \{1, \dots, M\} \\
\sum_{k=1}^K b_{jk} = 1, & \forall j \in \{1, \dots, M\}
\end{cases}
\end{align}
\]
We can then solve this using Lagrange Multipliers by taking derivatives. We are not going through the details of that derivation here; however, if you are interested, let me know and I can expand this section.
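To give at least a sketch of how that route starts (showing only the first two constraint terms; the \( b_{jk} \) term follows the same pattern), we would form the Lagrangian
\[
\mathcal{L} = p(V^T \mid \theta) - \lambda_{\pi} \Big( \sum_{i=1}^M \pi_i - 1 \Big) - \sum_{i=1}^M \lambda_i \Big( \sum_{j=1}^M a_{ij} - 1 \Big) - \dots
\]
Setting \( \partial \mathcal{L} / \partial \pi_i = 0 \), solving for \( \lambda_{\pi} \), and simplifying recovers \( \hat{\pi}_i = \gamma_i(1) \); the same steps for \( a_{ij} \) and \( b_{jk} \) recover equation (1) and the \( \hat{b}_{jk} \) formula above, which is why both approaches give identical re-estimation equations.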
R Script:
Here is the implementation of the algorithm.
 We append the T'th data point to \( \gamma \), since \( \xi \)'s length is T-1.
 We use \( \xi \) to derive \( \gamma \).
 The indicator function has been implemented using which().
BaumWelch = function(v, a, b, initial_distribution, n.iter = 100){

  for(i in 1:n.iter){
    T = length(v)
    M = nrow(a)
    K = ncol(b)
    alpha = forward(v, a, b, initial_distribution)
    beta = backward(v, a, b)
    xi = array(0, dim=c(M, M, T-1))

    for(t in 1:(T-1)){
      denominator = ((alpha[t,] %*% a) * b[,v[t+1]]) %*% matrix(beta[t+1,])
      for(s in 1:M){
        numerator = alpha[t,s] * a[s,] * b[,v[t+1]] * beta[t+1,]
        xi[s,,t] = numerator/as.vector(denominator)
      }
    }

    xi.all.t = rowSums(xi, dims = 2)
    a = xi.all.t/rowSums(xi.all.t)

    gamma = apply(xi, c(1, 3), sum)
    gamma = cbind(gamma, colSums(xi[, , T-1]))
    for(l in 1:K){
      b[, l] = rowSums(gamma[, which(v==l)])
    }
    b = b/rowSums(b)
  }
  return(list(a = a, b = b, initial_distribution = initial_distribution))
}
Here is the full code.
forward = function(v, a, b, initial_distribution){

  T = length(v)
  M = nrow(a)
  alpha = matrix(0, T, M)

  alpha[1, ] = initial_distribution*b[, v[1]]

  for(t in 2:T){
    tmp = alpha[t-1, ] %*% a
    alpha[t, ] = tmp * b[, v[t]]
  }
  return(alpha)
}

backward = function(v, a, b){
  T = length(v)
  M = nrow(a)
  beta = matrix(1, T, M)

  for(t in (T-1):1){
    tmp = as.matrix(beta[t+1, ] * b[, v[t+1]])
    beta[t, ] = t(a %*% tmp)
  }
  return(beta)
}

BaumWelch = function(v, a, b, initial_distribution, n.iter = 100){

  for(i in 1:n.iter){
    T = length(v)
    M = nrow(a)
    K = ncol(b)
    alpha = forward(v, a, b, initial_distribution)
    beta = backward(v, a, b)
    xi = array(0, dim=c(M, M, T-1))

    for(t in 1:(T-1)){
      denominator = ((alpha[t,] %*% a) * b[,v[t+1]]) %*% matrix(beta[t+1,])
      for(s in 1:M){
        numerator = alpha[t,s] * a[s,] * b[,v[t+1]] * beta[t+1,]
        xi[s,,t] = numerator/as.vector(denominator)
      }
    }

    xi.all.t = rowSums(xi, dims = 2)
    a = xi.all.t/rowSums(xi.all.t)

    gamma = apply(xi, c(1, 3), sum)
    gamma = cbind(gamma, colSums(xi[, , T-1]))
    for(l in 1:K){
      b[, l] = rowSums(gamma[, which(v==l)])
    }
    b = b/rowSums(b)
  }
  return(list(a = a, b = b, initial_distribution = initial_distribution))
}

data = read.csv("data_r.csv")

M=2; K=3

A = matrix(1, M, M)
A = A/rowSums(A)

B = matrix(1:6, M, K)
B = B/rowSums(B)

initial_distribution = c(1/2, 1/2)

(myout = BaumWelch(data$Visible, A, B, initial_distribution, n.iter = 100))
Output:
$a
          [,1]      [,2]
[1,] 0.5381634 0.4618366
[2,] 0.4866444 0.5133556

$b
          [,1]      [,2]      [,3]
[1,] 0.1627751 0.2625807 0.5746441
[2,] 0.2514996 0.2778097 0.4706907

$initial_distribution
[1] 0.5 0.5
Validate Result:
Let's validate our result against the HMM R package.
library(HMM)
hmm = initHMM(c("A", "B"), c(1, 2, 3),
              startProbs = initial_distribution,
              transProbs = A, emissionProbs = B)

true.out = baumWelch(hmm, data$Visible, maxIterations=100, pseudoCount=0)
true.out$hmm
Here is the output, which is exactly the same as ours.
$States
[1] "A" "B"

$Symbols
[1] 1 2 3

$startProbs
  A   B 
0.5 0.5 

$transProbs
    to
from         A         B
   A 0.5381634 0.4618366
   B 0.4866444 0.5133556

$emissionProbs
      symbols
states         1         2         3
     A 0.1627751 0.2625807 0.5746441
     B 0.2514996 0.2778097 0.4706907
Python:
Here is the Python code for the Baum-Welch algorithm; the logic is the same as we used in R.
def baum_welch(V, a, b, initial_distribution, n_iter=100):
    M = a.shape[0]
    T = len(V)

    for n in range(n_iter):
        alpha = forward(V, a, b, initial_distribution)
        beta = backward(V, a, b)

        xi = np.zeros((M, M, T - 1))
        for t in range(T - 1):
            denominator = np.dot(np.dot(alpha[t, :].T, a) * b[:, V[t + 1]].T, beta[t + 1, :])
            for i in range(M):
                numerator = alpha[t, i] * a[i, :] * b[:, V[t + 1]].T * beta[t + 1, :].T
                xi[i, :, t] = numerator / denominator

        gamma = np.sum(xi, axis=1)
        a = np.sum(xi, 2) / np.sum(gamma, axis=1).reshape((-1, 1))

        # Add additional T'th element in gamma
        gamma = np.hstack((gamma, np.sum(xi[:, :, T - 2], axis=0).reshape((-1, 1))))

        K = b.shape[1]
        denominator = np.sum(gamma, axis=1)
        for l in range(K):
            b[:, l] = np.sum(gamma[:, V == l], axis=1)

        b = np.divide(b, denominator.reshape((-1, 1)))

    return {"a": a, "b": b}
Here is the full code:
import pandas as pd
import numpy as np


def forward(V, a, b, initial_distribution):
    alpha = np.zeros((V.shape[0], a.shape[0]))
    alpha[0, :] = initial_distribution * b[:, V[0]]

    for t in range(1, V.shape[0]):
        for j in range(a.shape[0]):
            # Matrix Computation Steps
            #                  ((1x2) . (1x2))      *     (1)
            #                          (1)          *     (1)
            alpha[t, j] = alpha[t - 1].dot(a[:, j]) * b[j, V[t]]

    return alpha


def backward(V, a, b):
    beta = np.zeros((V.shape[0], a.shape[0]))

    # setting beta(T) = 1
    beta[V.shape[0] - 1] = np.ones((a.shape[0]))

    # Loop in backward way from T-1 to 1
    # Due to python indexing the actual loop will be T-2 to 0
    for t in range(V.shape[0] - 2, -1, -1):
        for j in range(a.shape[0]):
            beta[t, j] = (beta[t + 1] * b[:, V[t + 1]]).dot(a[j, :])

    return beta


def baum_welch(V, a, b, initial_distribution, n_iter=100):
    M = a.shape[0]
    T = len(V)

    for n in range(n_iter):
        alpha = forward(V, a, b, initial_distribution)
        beta = backward(V, a, b)

        xi = np.zeros((M, M, T - 1))
        for t in range(T - 1):
            denominator = np.dot(np.dot(alpha[t, :].T, a) * b[:, V[t + 1]].T, beta[t + 1, :])
            for i in range(M):
                numerator = alpha[t, i] * a[i, :] * b[:, V[t + 1]].T * beta[t + 1, :].T
                xi[i, :, t] = numerator / denominator

        gamma = np.sum(xi, axis=1)
        a = np.sum(xi, 2) / np.sum(gamma, axis=1).reshape((-1, 1))

        # Add additional T'th element in gamma
        gamma = np.hstack((gamma, np.sum(xi[:, :, T - 2], axis=0).reshape((-1, 1))))

        K = b.shape[1]
        denominator = np.sum(gamma, axis=1)
        for l in range(K):
            b[:, l] = np.sum(gamma[:, V == l], axis=1)

        b = np.divide(b, denominator.reshape((-1, 1)))

    return {"a": a, "b": b}


data = pd.read_csv('data_python.csv')

V = data['Visible'].values

# Transition Probabilities
a = np.ones((2, 2))
a = a / np.sum(a, axis=1)

# Emission Probabilities
b = np.array(((1, 3, 5), (2, 4, 6)))
b = b / np.sum(b, axis=1).reshape((-1, 1))

# Equal Probabilities for the initial distribution
initial_distribution = np.array((0.5, 0.5))

print(baum_welch(V, a, b, initial_distribution, n_iter=100))
Output:
Here is the output of our code. It is the same as the previous one; however, the precision is different.
{'a': array([[0.53816345, 0.46183655],
       [0.48664443, 0.51335557]]),
 'b': array([[0.16277513, 0.26258073, 0.57464414],
       [0.2514996 , 0.27780971, 0.47069069]])}
We went through the details of the Learning Algorithm of HMM here. I hope this article helped you understand the concept.
Click on the link to get the code:
Also, here is the list of all the articles in this series:
 Introduction to Hidden Markov Model
 Forward and Backward Algorithm in Hidden Markov Model
 Derivation and implementation of Baum Welch Algorithm for Hidden Markov Model
 Implement Viterbi Algorithm in Hidden Markov Model using Python and R