In this lab we will experiment with basic image and text retrieval. We will be using a subset of the COCO Dataset (http://mscoco.org/), a popular dataset for many tasks involving images and text. It consists of images paired with descriptions written by people. Our subset contains 50k images for training and 10k images for validation/development. Each image is associated with a single description (the full dataset contains 5 descriptions/captions per image). The images are resized and center-cropped to 256 by 256 pixels. We will build a simple retrieval system based on image similarities.
import pickle
import numpy as np
import matplotlib.pyplot as plt
from scipy.misc import imread, imresize
%matplotlib inline
When you work with images and vision, it is always helpful to see your data. The first step in any project should be spending some time analyzing what your data looks like. Here we display one image from the dataset; however, I encourage you to look at a large number of images and captions before proceeding to the next step.
# Load data and show some images.
data = pickle.load(open('mscoco_small.p', 'rb'))
train_data = data['train']
val_data = data['val']
# Pick an image and show the image.
sampleImageIndex = 290 # Try changing this number and visualizing some other images from the dataset.
plt.figure()
plt.imshow(imread('mscoco/%s' % train_data['images'][sampleImageIndex]))
print(sampleImageIndex, train_data['captions'][sampleImageIndex])
Answer the following questions for this section:
This is perhaps the simplest idea for computing the similarity between two images: use a distance between the raw pixel values of the two images. However, it is a good idea (and not only to save memory) to first scale the images down to a low resolution. Here we resize them to 16x16x3 and flatten the pixels into a vector. First, let's visualize the image above at this resolution.
# Pick an image and show the image.
image = imread('mscoco/%s' % train_data['images'][sampleImageIndex])
tiny_image = imresize(image, (16, 16), interp = 'nearest')
plt.imshow(tiny_image)
Answer the following questions for this section:
Now we compute a matrix of size 50000 x 768, where each row stores the low-resolution pixels of one image in our MSCOCO 50k training set.
# Compute features for the training set.
train_features = np.zeros((len(train_data['images']), 768), dtype=np.float) # 768 = 16 * 16 * 3
for (counter, image_id) in enumerate(train_data['images']):
    image = imread('mscoco/%s' % image_id)
    tiny_image = imresize(image, (16, 16), interp = 'nearest')
    train_features[counter, :] = tiny_image.flatten().astype(np.float) / 255
    if (1 + counter) % 10000 == 0:
        print('Computed features for %d train-images' % (1 + counter))
# Compute features for the validation set.
val_features = np.zeros((len(val_data['images']), 768), dtype=np.float) # 768 = 16 * 16 * 3
for (counter, image_id) in enumerate(val_data['images']):
    image = imread('mscoco/%s' % image_id)
    tiny_image = imresize(image, (16, 16), interp = 'nearest')
    val_features[counter, :] = tiny_image.flatten().astype(np.float) / 255
    if (1 + counter) % 10000 == 0:
        print('Computed features for %d val-images' % (1 + counter))
pickle.dump(train_features, open('train_features.p', 'wb')) # Store in case this notebook crashes.
pickle.dump(val_features, open('val_features.p', 'wb')) # Store in case this notebook crashes.
We first divide our dataset into train, validation, and test sets. We will then compute distances between images in our validation set and images in our training set. We will use the validation set to decide how many nearest neighbors to use, which distance metric to use, and which features to use, depending on what works best. We leave the test set aside until the very end, when we have settled on our best set of parameters, and use it only to report our performance. Once we use the test set, we should not make any more changes to the algorithm.
from scipy.spatial.distance import cdist
# Try changing this image.
sampleTestImageId = 197
# Retrieve the feature vector for this image.
sampleImageFeature = val_features[sampleTestImageId : sampleTestImageId + 1, :]
# Compute distances between this image and the training set of images.
distances = cdist(sampleImageFeature, train_features, 'correlation')
# Compute ids for the closest images in this feature space.
nearestNeighbors = np.argsort(distances[0, :]) # Retrieve the nearest neighbors for this image.
# Show the image and nearest neighbor images.
plt.imshow(imread('mscoco/%s' % val_data['images'][sampleTestImageId])); plt.axis('off')
plt.title('query image:')
fig = plt.figure()
for (i, neighborId) in enumerate(nearestNeighbors[:5]):
    fig.add_subplot(1, 5, i + 1)
    plt.imshow(imread('mscoco/%s' % train_data['images'][neighborId]))
    plt.axis('off')
We can also use the above nearest neighbors to return the captions from these neighbors. The assumption is that if the images are similar enough to the query image, then the captions are likely to also be descriptive of the query image.
print('Query Image Description: ' + val_data['captions'][sampleTestImageId] + '\n')
for (i, neighborId) in enumerate(nearestNeighbors[:5]):
    print('(' + str(i) + ')' + train_data['captions'][neighborId])
How can we measure the quality of our set of returned images? One idea is to use the text returned along with the images and measure how many words in those descriptions match words in the query image's description, for example as the fraction of words in the returned (candidate) caption that also appear in the query (reference) caption.
One such metric is called BLEU; it measures the similarity between two sentences based on how many substrings match, subject to a penalty for returning sentences that are too short. You can read the details in the paper that proposed this metric (Papineni et al., 2002).
We will split the sentences into words using the NLTK library and compute a very simple version of the BLEU score, computing only the fraction of words in the top candidate caption that also appear in the reference caption.
from nltk import word_tokenize
reference = [w.lower() for w in word_tokenize(val_data['captions'][sampleTestImageId])]
candidate = [w.lower() for w in word_tokenize(train_data['captions'][nearestNeighbors[0]])]
print ('ref', reference)
print ('cand', candidate)
bleu_score = float(len(set(reference) & set(candidate))) / len(candidate)
print("BLEU-1 score = ", bleu_score)
In this lab we computed the most similar images for a single test image in our validation set, and we computed the BLEU score of the top returned "candidate" caption (the one corresponding to the most similar image) against the "reference" caption associated with the query image. Compute the average of this score across all images in the validation set. (1 pt) Hint: in principle this only requires writing a for-loop around the code provided in this lab, but that would take very long to compute; note that cdist can also efficiently compute the distances between two sets of vectors at once, as shown in the sketch below.
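As a starting point for the hint above, here is a minimal sketch of the batched cdist call; the variable name val_to_train_distances is illustrative and not part of the provided code.
# One call computes the distance from every validation image to every training image.
# Rows index validation images, columns index training images.
val_to_train_distances = cdist(val_features, train_features, 'correlation')
# The most similar training image for validation image i is then
# np.argmin(val_to_train_distances[i, :]).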
Repeat the above experiment, but this time use a random caption from the training set as the "candidate" caption. How does this number compare to the number obtained in step 1? (2 pts)
The feature we have used so far is just a vector containing the raw image pixels for each image at a 16x16 resolution. A more robust feature for returning similar images is the Histogram of Oriented Gradients (HOG), which uses gradient (edge) information rather than color information. Use this feature instead and report the BLEU score on the entire validation set. Feel free to use the scikit-image package to compute HOG features (http://scikit-image.org/docs/dev/auto_examples/plot_hog.html); a sketch is given below. How does this number compare to the previous two? (2 pts)
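Below is a minimal sketch of computing a HOG descriptor for a single image with scikit-image (assuming scikit-image is installed); the resolution and HOG parameters shown are illustrative choices, not a prescribed setting.
from skimage.color import rgb2gray
from skimage.feature import hog
# HOG operates on a single-channel image, so convert to grayscale first.
gray = rgb2gray(imresize(image, (64, 64), interp='nearest'))
# One fixed-length descriptor of gradient orientations per image.
hog_feature = hog(gray, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
print(hog_feature.shape)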
Put in the table below the numbers obtained in 1, 2, and 3.
|        | random | color-feature | HoG-feature |
|--------|--------|---------------|-------------|
| BLEU-1 | 0.00   | 0.00          | 0.00        |