In this lab we will experiment with recurrent neural networks. These are a useful type of model for predicting sequences or handling sequences as inputs. We will implement them in Keras + TensorFlow, but many other implementations and variants can be found online. Here are installation instructions for Keras: https://keras.io/#installation, and here are installation instructions for TensorFlow: https://github.com/tensorflow/tensorflow#download-and-setup. You should also be able to run both from a Docker container.
We will take a set of 10,000 image descriptions from the MS-COCO dataset (which contains around 400,000 sentences) and train our recurrent network to compose new sentences character by character. You can download the data here: http://www.cs.virginia.edu/~vicente/recognition/captions_train.txt.zip
First, let's import libraries and make sure you have everything properly installed.
import tensorflow as tf
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, LSTM
from keras.optimizers import RMSprop
from keras.layers.wrappers import TimeDistributed
We will first read the sentences and map each character to a unique identifier so that we can treat each sentence as an array of character ids. The code below loads the captions from a text file and places them inside a caption tensor of size numCaptions x maxCaptionLength x charVocabularySize. We also create a second tensor that contains the same sentences shifted by one character, which will serve as the prediction targets. Each character is mapped to an incremental ID, so we keep two hashmaps to convert from character to id and back (a small round-trip example using these two mappings is shown after the code below).
# Read captions into a python list.
maxSamples = 10000
captions = []
fopen = open('captions_train.txt', 'r')
iterator = 0
for line in fopen:
    if iterator < maxSamples:
        captions.append(line.lower().strip())
    iterator += 1
fopen.close()
# Compute a char2id and id2char vocabulary.
char2id = {}
id2char = {}
charIndex = 0
for caption in captions:
    for char in caption:
        if char not in char2id:
            char2id[char] = charIndex
            id2char[charIndex] = char
            charIndex += 1
# Add a special starting and ending character to the dictionary.
char2id['S'] = charIndex; id2char[charIndex] = 'S' # Special sentence start character.
char2id['E'] = charIndex + 1; id2char[charIndex + 1] = 'E' # Special sentence ending character.
# Place captions inside tensors.
maxSequenceLength = 1 + max([len(x) for x in captions])
# inputChars has one-hot encodings for every character, for every caption.
inputChars = np.zeros((len(captions), maxSequenceLength, len(char2id)), dtype=np.bool)
# nextChars has one-hot encodings for every character for every caption (shifted by one).
nextChars = np.zeros((len(captions), maxSequenceLength, len(char2id)), dtype=np.bool)
for i in range(0, len(captions)):
    inputChars[i, 0, char2id['S']] = 1
    nextChars[i, 0, char2id[captions[i][0]]] = 1
    for j in range(1, maxSequenceLength):
        if j < len(captions[i]) + 1:
            inputChars[i, j, char2id[captions[i][j - 1]]] = 1
            if j < len(captions[i]):
                nextChars[i, j, char2id[captions[i][j]]] = 1
            else:
                nextChars[i, j, char2id['E']] = 1
        else:
            inputChars[i, j, char2id['E']] = 1
            nextChars[i, j, char2id['E']] = 1
print("input:")
print(inputChars.shape) # Print the size of the inputCharacters tensor.
print("output:")
print(nextChars.shape) # Print the size of the nextCharacters tensor.
print("char2id:")
print(char2id) # Print the character to ids mapping.
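As a quick sanity check of the two vocabulary mappings, you can round-trip a caption through char2id and id2char (a small sketch reusing the variables defined above):
# Convert a caption to character ids and back; the round trip should recover the original text.
sampleIds = [char2id[c] for c in captions[0]]
print(sampleIds)                                  # The caption as a list of character ids.
print(''.join([id2char[i] for i in sampleIds]))   # Should print back the original caption.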
Note: In order to clearly show how inputChars and nextChars store the sequences, let's print a sentence back from its stored format in these two arrays.
trainCaption = inputChars[25, :, :] # Pick some caption
labelCaption = nextChars[25, :, :] # Pick what we are trying to predict.
def printCaption(sampleCaption):
    charIds = np.zeros(sampleCaption.shape[0])
    for (idx, elem) in enumerate(sampleCaption):
        charIds[idx] = np.nonzero(elem)[0].squeeze()
    print(np.array([id2char[x] for x in charIds]))
printCaption(trainCaption)
printCaption(labelCaption)
In the output above, you will notice that the sequences are indeed shifted. This is because we are going to predict the next character at each time step. The first input character is 'S', which marks the start of the sentence, so the first target character should be 'a', the first actual character of the sentence. At later time steps the model will also use the "history" of all previous characters to decide what comes next.
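To make the shifting concrete, here is a tiny worked example using a made-up two-word caption (not taken from the dataset) and a made-up maximum length, shown before one-hot encoding:
# Toy example: the caption "a dog" with a maximum sequence length of 8.
# Input sequence:  S  a  ' '  d  o  g  E  E
# Target sequence: a  ' '  d  o  g  E  E  E
toyCaption = "a dog"
toyLength = 8
toyInput = ['S'] + list(toyCaption) + ['E'] * (toyLength - len(toyCaption) - 1)
toyTarget = list(toyCaption) + ['E'] * (toyLength - len(toyCaption))
print(toyInput)   # ['S', 'a', ' ', 'd', 'o', 'g', 'E', 'E']
print(toyTarget)  # ['a', ' ', 'd', 'o', 'g', 'E', 'E', 'E']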
Next we will create a recurrent neural network using Keras that takes a batch of one-hot encoded character sequences of size (batch_size, maxSequenceLength, charVocabularySize); the output of the network is a tensor of the same size (batch_size, maxSequenceLength, charVocabularySize). However, the output does not contain one-hot encodings: it contains a probability distribution (the output of a softmax) over the character vocabulary for every time step in the sequence. We will see in section 4 how to decode a sequence from these distributions; you can simply take the character corresponding to the index with the maximum probability at every time step.
print('Building training model...')
hiddenStateSize = 128
hiddenLayerSize = 128
model = Sequential()
# The output of the LSTM layer are the hidden states of the LSTM for every time step.
model.add(LSTM(hiddenStateSize, return_sequences = True, input_shape=(maxSequenceLength, len(char2id))))
# Two things to notice here:
# 1. The Dense Layer is equivalent to nn.Linear(hiddenStateSize, hiddenLayerSize) in Torch.
# In Keras, we often do not need to specify the input size of the layer because it gets inferred for us.
# 2. TimeDistributed applies the linear transformation from the Dense layer to every time step
# of the output of the sequence produced by the LSTM.
model.add(TimeDistributed(Dense(hiddenLayerSize)))
model.add(TimeDistributed(Activation('relu')))
model.add(TimeDistributed(Dense(len(char2id)))) # Add another dense layer with the desired output size.
model.add(TimeDistributed(Activation('softmax')))
# We also specify here the optimization we will use, in this case we use RMSprop with learning rate 0.001.
# RMSprop is commonly used for RNNs instead of regular SGD.
# See this blog for info on RMSprop (http://sebastianruder.com/optimizing-gradient-descent/index.html#rmsprop)
# categorical_crossentropy is the same loss used for classification problems using softmax. (nn.ClassNLLCriterion)
model.compile(loss='categorical_crossentropy', optimizer = RMSprop(lr=0.001))
print(model.summary()) # Convenient function to see details about the network model.
# Test a simple prediction on a batch for this model.
print("Sample input Batch size:"),
print(inputChars[0:32, :, :].shape)
print("Sample input Batch labels (nextChars):"),
print(nextChars[0:32, :, :].shape)
outputs = model.predict(inputChars[0:32, :, :])
print("Output Sequence size:"),
print(outputs.shape)
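Since each time step of the output is the result of a softmax, every row should sum to (approximately) one, even for this untrained model. A quick check reusing the outputs tensor from above:
# Each outputs[i, j, :] is a probability distribution over the character vocabulary,
# so summing over the last axis should give values very close to 1.0.
print(outputs[0].sum(axis = -1))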
Keras already implements a generic training loop through the model.fit function, but it also provides model.train_on_batch if you want to write the training loop yourself (a sketch of such a loop is shown after the model.fit call below). For more information about the Keras model functionality, see: https://keras.io/models/model/
If you installed TensorFlow with GPU support, this will automatically run on the GPU.
model.fit(inputChars, nextChars, batch_size = 128, nb_epoch = 10)
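If you prefer the manual route mentioned above, a minimal sketch of an equivalent training loop using model.train_on_batch (assuming the inputChars and nextChars tensors defined earlier) could look like this:
# Manual training loop: iterate over fixed-size batches and call train_on_batch.
batchSize = 128
numEpochs = 10
for epoch in range(0, numEpochs):
    epochLoss = 0.0
    numBatches = 0
    for start in range(0, len(captions), batchSize):
        batchInputs = inputChars[start : start + batchSize]
        batchLabels = nextChars[start : start + batchSize]
        epochLoss += model.train_on_batch(batchInputs, batchLabels)
        numBatches += 1
    print('Epoch %d, average loss %.4f' % (epoch, epochLoss / numBatches))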
Here we take an arbitrary caption from the training set (one-hot encoded), compute the output of the trained model, and decode this output back into a character array. Ideally we should see the same input caption shifted by one character. However, you would need to run the training code for around 24 hours straight to get the model close to that point (it is OK if you only train for 10 epochs for the purposes of this lab).
# Test a simple prediction on a batch for this model.
captionId = 132
inputCaption = inputChars[captionId:captionId+1, :, :]
outputs = model.predict(inputCaption)
printCaption(inputCaption[0])
print([id2char[x.argmax()] for x in outputs[0, :, :]])
We verified in the previous section that the model is somewhat working on training data. However, we want to be able to create new sentences from scratch with this model, using the parameters of the trained model to produce text character by character. Here we build such a model and simply copy the parameters from the trained model above. The following section (section 6) shows how to produce sentences using this inference_model. Please pay attention to the comments in the code below to see how it differs from the model used at training time.
# The only difference with the "training model" is that here the input sequence has
# a length of one because we will predict character by character.
print('Building Inference model...')
inference_model = Sequential()
# Two differences here.
# 1. The inference model only takes one sample in the batch, and it always has sequence length 1.
# 2. The inference model is stateful, meaning it inputs the output hidden state ("its history state")
# to the next batch input.
inference_model.add(LSTM(hiddenStateSize, batch_input_shape=(1, 1, len(char2id)), stateful = True))
# Since the above LSTM does not output sequences, we don't need TimeDistributed anymore.
inference_model.add(Dense(hiddenLayerSize))
inference_model.add(Activation('relu'))
inference_model.add(Dense(len(char2id)))
inference_model.add(Activation('softmax'))
# Copy the weights of the trained network. Both should have the same exact number of parameters (why?).
inference_model.set_weights(model.get_weights())
# Given the start Character 'S' (one-hot encoded), predict the next most likely character.
startChar = np.zeros((1, 1, len(char2id)))
startChar[0, 0, char2id['S']] = 1
nextCharProbabilities = inference_model.predict(startChar)
# print the most probable character that goes next.
print(id2char[nextCharProbabilities.argmax()])
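To answer the "why?" in the comment above: the LSTM and Dense weight matrices do not depend on the batch size or the sequence length, so both models have exactly the same parameter shapes. A quick check using Keras' count_params:
# Both models should report the same number of parameters.
print(model.count_params())
print(inference_model.count_params())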
Now that we have our inference_model working, we can start producing new sentences by randomly sampling from the predicted next-character probabilities one step at a time. We rely on numpy's np.random.multinomial function; please check its documentation and make sure you understand what it does: http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.multinomial.html
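As a small standalone example (with a made-up three-character distribution): drawing a single sample from np.random.multinomial returns a one-hot vector, and its argmax is the sampled character id.
# Draw one sample from a toy distribution over 3 characters.
toyProbs = [0.1, 0.7, 0.2]
draw = np.random.multinomial(1, toyProbs)  # e.g. array([0, 1, 0])
print(draw)
print(draw.argmax())  # The sampled id; most often 1 for this distribution.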
inference_model.reset_states() # This makes sure the initial hidden state is cleared every time.
startChar = np.zeros((1, 1, len(char2id)))
startChar[0, 0, char2id['S']] = 1
for i in range(0, maxSequenceLength):
    nextCharProbs = inference_model.predict(startChar)
    # In theory we should be able to feed nextCharProbs directly to np.random.multinomial.
    nextCharProbs = np.asarray(nextCharProbs).astype('float64')  # Weird type cast issues if not doing this.
    nextCharProbs = nextCharProbs / nextCharProbs.sum()  # Re-normalize for float64 to make it sum to exactly 1.0.
    nextCharId = np.random.multinomial(1, nextCharProbs.squeeze(), 1).argmax()
    print(id2char[nextCharId], end='')  # end='' avoids printing a newline after each character.
    startChar.fill(0)
    startChar[0, 0, nextCharId] = 1
Notice how the model learns to always predict 'E' once it has predicted the first 'E', and does not produce any other character after that. In practice we can stop the for loop as soon as 'E' appears; this effectively produces sentences of arbitrary length, meaning our model has learned when to finish a sentence. The sentences might not be perfect at this point in training, but the model has probably already learned to produce basic words like "a", "the", "and" or "with", even though it still produces pseudo-words that look like words but are not actual words. Try running the above code several times; the sentences will probably sound funny if you read them. If you keep training the model for longer, it should get better and better.
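A minimal variation of the sampling loop that stops as soon as 'E' is drawn (a sketch reusing the variables defined above) could look like this:
# Sample characters until the model produces the end-of-sentence character 'E'.
inference_model.reset_states()
startChar = np.zeros((1, 1, len(char2id)))
startChar[0, 0, char2id['S']] = 1
generated = []
for i in range(0, maxSequenceLength):
    nextCharProbs = inference_model.predict(startChar)
    nextCharProbs = np.asarray(nextCharProbs).astype('float64')
    nextCharProbs = nextCharProbs / nextCharProbs.sum()
    nextCharId = np.random.multinomial(1, nextCharProbs.squeeze(), 1).argmax()
    if id2char[nextCharId] == 'E':
        break  # Stop as soon as the end-of-sentence character is sampled.
    generated.append(id2char[nextCharId])
    startChar.fill(0)
    startChar[0, 0, nextCharId] = 1
print(''.join(generated))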