Building and using Markov chains - AI for text generation - Part II

Building and using Markov chains - AI for text generation - Part II

·

5 min read

In the previous post, we saw some theory behind Markov chains. In this one, we're going to define the data structures and the process involved in the creation and usage of one.

Cover phto by Maria Orlova on Pexels

Important: Although some code chunks in this post might look like JS of Python, they are all just pseudo-code.

Data structures

Markov chains are directed graphs; the usual way to represent these is using an adjacency matrix. Each row will represent a word A and each column a word B; each cell will store the probability of B being the next word after A.

So if we had the following text: "How much wood would a woodchuck chuck if a woodchuck could chuck wood", the matrix would like like this:

\achuckcouldhowifmuchwoodwoodchuckwould
a000000010
chuck00000.500.500
could010000000
how000001000
if100000000
much000000100
wood000000001
woodchuck00.50.5000000
would100000000

Do you notice something special about this matrix? Most of its entries are zero! So instead of implementing a matrix as, for example, an array of arrays and wasting a lot of space in zeroes, we should use a sparse matrix implementation with a dictionary or tuples and lists. Something like this:

[
(a, [(woodchuck, 1)]),
(chuck, [(if, 0.5), (wood, 0.5)]),
(could, [(chuck, 1)]),
(how, [(much, 1)]),
(if, [(a, 1)]),
(much, [(wood, 1)]),
(wood, [(would, 1)]),
(woodchuck, [(chuck, 0.5), (could, 0.5)])
(would, [(a, 1)])
]

and finally, we could compress it even more if we tokenize the text, replacing words with code numbers. In this case, the tokens will be the positions in the sorted list of unique words(a is 1, chuck is 2, etc), like this:

[
[(8, 1)],
[(5, 0.5), (7, 0.5)],
[(2, 1)],
[(6, 1)],
[(1, 1)],
[(7, 1)],
[(9, 1)],
[(2, 0.5), (3, 0.5)],
[(1, 1)]
]

Much better than the first matrix, right? Now let's see how we get the chain from the text

Building the chain

Preprocess the text

The first step consists on cleaning the text so we can get more out of it. Some relevant steps here would be converting the whole text to lowercase, expanding contractions and performing tokenization (replacing words with their number of first appearance, for example). Steps like removing punctuation may be optional in this case, and other usual steps like lemmatization or removing stop words make no sense here.

An example of this would be:

  • Before preprocessing:

      I don't think Albert will come. I think he said he wouldn't.
    
  • After expansion and conversion to lowercase:

      i do not think albert will come. i think he said he would not.
    
  • After tokenization, use the following regex to define tokens: ([^\W_]+)|([,.;:?!]+) (Tokens will be words, numbers and punctuation signs).

      token_dictionary = {
        'i': 1,
        'do': 2,
        'not': 3,
        'think': 4,
        'albert': 5,
        'will': 6,
        'come': 7,
        '.': 8,
        'he': 9,
        'said': 10,
        'would': 11
      }
      tokenized_text = [-1, 1, 2, 3, 4, 5, 6, 7, 8, 1, 4, 9, 10, 9, 11, 3, -2]
    

Wait, what are those -1 and -2 doing at the beginning and the end of the list? Well, they are special tokens reserved to mark just that: when a sentence starts and when it ends. They will be important when generating sentences with the chain.

Make a list of words and their followers

Now that we have a list with the tokenized text, we have to iterate it storing the number of appearances of each word and the frequency that other words appear after it:

build_markov(tokenized_text):
    appearances = {}
    matrix = {}
    it = 1
    while it < tokenized_text.length 
        current = tokenized_text[it - 1]
        next = tokenized_text[it]

        // EXISTING WORD
        if current in appearances.keys
            appearances[current]++
            // EXISTING PAIR
            if next in matrix[current].keys
                matrix[current][next]++
            // NEW PAIR
            else
                matrix[current] = {next: 1}

        // NEW WORD
        else
            appearances[current] = 1
            matrix[current] = {next: 1}
        it = it + 1
    // FINALLY, NORMALIZE THE PROBABILITIES
    for word in matrix.keys
        for follower in matrix[word].keys
            matrix[word][follower] = matrix[word][follower] / appearances[word]
    return matrix

And your Markov chain would be ready to use!

Generating text from the chain

Now, to generate the chain we will need a roulette selection function to pick an element from a word-probability dictionary.

roulette (words):
    selection = random_number() // floating between 0 and 1
    for candidate in words.keys
        probability = words[candidate]
        selection = selection - probability
        if selection < 0
            return candidate

And now, using that function we can use the Markov chain to generate a new word token given the previous:

generate_next_token (matrix, token):
    return roulette(matrix[token])

Remember the special tokens -1 and -2? This is where they come in handy. Using them we are able to make the function that generates a full sentence:

generate_sentence_tokens (matrix):
    tokens = [-1]
    while tokens.last != -2
        tokens.push(generate_next_token(matrix, token.last))
    return tokens

Finally, we just have to translate back tokens to words, ignoring the first and the last ones. You might also want to do some postprocessing to put uppercase where is due and fix the whitespaces, for example before punctuation signs. I leave that to you.

Conclusion

The pseudo-code I've shown here works for a first order Markov chain. In the next posts, I will explain how to do a chain of order N and implement it in several languages. I hope you found this useful, please leave your thoughts in the comments!