In the previous post, we saw some theory behind Markov chains. In this one, we're going to define the data structures and the process involved in the creation and usage of one.
Cover phto by Maria Orlova on Pexels
Important: Although some code chunks in this post might look like JS of Python, they are all just pseudo-code.
Data structures
Markov chains are directed graphs; the usual way to represent these is using an adjacency matrix. Each row will represent a word A and each column a word B; each cell will store the probability of B being the next word after A.
So if we had the following text: "How much wood would a woodchuck chuck if a woodchuck could chuck wood", the matrix would like like this:
\ | a | chuck | could | how | if | much | wood | woodchuck | would |
a | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
chuck | 0 | 0 | 0 | 0 | 0.5 | 0 | 0.5 | 0 | 0 |
could | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
how | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
if | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
much | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
wood | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
woodchuck | 0 | 0.5 | 0.5 | 0 | 0 | 0 | 0 | 0 | 0 |
would | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Do you notice something special about this matrix? Most of its entries are zero! So instead of implementing a matrix as, for example, an array of arrays and wasting a lot of space in zeroes, we should use a sparse matrix implementation with a dictionary or tuples and lists. Something like this:
[
(a, [(woodchuck, 1)]),
(chuck, [(if, 0.5), (wood, 0.5)]),
(could, [(chuck, 1)]),
(how, [(much, 1)]),
(if, [(a, 1)]),
(much, [(wood, 1)]),
(wood, [(would, 1)]),
(woodchuck, [(chuck, 0.5), (could, 0.5)])
(would, [(a, 1)])
]
and finally, we could compress it even more if we tokenize the text, replacing words with code numbers. In this case, the tokens will be the positions in the sorted list of unique words(a
is 1
, chuck
is 2
, etc), like this:
[
[(8, 1)],
[(5, 0.5), (7, 0.5)],
[(2, 1)],
[(6, 1)],
[(1, 1)],
[(7, 1)],
[(9, 1)],
[(2, 0.5), (3, 0.5)],
[(1, 1)]
]
Much better than the first matrix, right? Now let's see how we get the chain from the text
Building the chain
Preprocess the text
The first step consists on cleaning the text so we can get more out of it. Some relevant steps here would be converting the whole text to lowercase, expanding contractions and performing tokenization (replacing words with their number of first appearance, for example). Steps like removing punctuation may be optional in this case, and other usual steps like lemmatization or removing stop words make no sense here.
An example of this would be:
Before preprocessing:
I don't think Albert will come. I think he said he wouldn't.
After expansion and conversion to lowercase:
i do not think albert will come. i think he said he would not.
After tokenization, use the following regex to define tokens:
([^\W_]+)|([,.;:?!]+)
(Tokens will be words, numbers and punctuation signs).token_dictionary = { 'i': 1, 'do': 2, 'not': 3, 'think': 4, 'albert': 5, 'will': 6, 'come': 7, '.': 8, 'he': 9, 'said': 10, 'would': 11 } tokenized_text = [-1, 1, 2, 3, 4, 5, 6, 7, 8, 1, 4, 9, 10, 9, 11, 3, -2]
Wait, what are those -1
and -2
doing at the beginning and the end of the list? Well, they are special tokens reserved to mark just that: when a sentence starts and when it ends. They will be important when generating sentences with the chain.
Make a list of words and their followers
Now that we have a list with the tokenized text, we have to iterate it storing the number of appearances of each word and the frequency that other words appear after it:
build_markov(tokenized_text):
appearances = {}
matrix = {}
it = 1
while it < tokenized_text.length
current = tokenized_text[it - 1]
next = tokenized_text[it]
// EXISTING WORD
if current in appearances.keys
appearances[current]++
// EXISTING PAIR
if next in matrix[current].keys
matrix[current][next]++
// NEW PAIR
else
matrix[current] = {next: 1}
// NEW WORD
else
appearances[current] = 1
matrix[current] = {next: 1}
it = it + 1
// FINALLY, NORMALIZE THE PROBABILITIES
for word in matrix.keys
for follower in matrix[word].keys
matrix[word][follower] = matrix[word][follower] / appearances[word]
return matrix
And your Markov chain would be ready to use!
Generating text from the chain
Now, to generate the chain we will need a roulette selection function to pick an element from a word-probability dictionary.
roulette (words):
selection = random_number() // floating between 0 and 1
for candidate in words.keys
probability = words[candidate]
selection = selection - probability
if selection < 0
return candidate
And now, using that function we can use the Markov chain to generate a new word token given the previous:
generate_next_token (matrix, token):
return roulette(matrix[token])
Remember the special tokens -1
and -2
? This is where they come in handy. Using them we are able to make the function that generates a full sentence:
generate_sentence_tokens (matrix):
tokens = [-1]
while tokens.last != -2
tokens.push(generate_next_token(matrix, token.last))
return tokens
Finally, we just have to translate back tokens to words, ignoring the first and the last ones. You might also want to do some postprocessing to put uppercase where is due and fix the whitespaces, for example before punctuation signs. I leave that to you.
Conclusion
The pseudo-code I've shown here works for a first order Markov chain. In the next posts, I will explain how to do a chain of order N and implement it in several languages. I hope you found this useful, please leave your thoughts in the comments!