Image Captioning-Video Summarizer

Overview
Image captioning using an attention-based encoder-decoder model, following the idea discussed in Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Recurrent Neural Networks (RNNs) are used in a wide range of applications, including machine translation. The encoder-decoder architecture suits settings where a variable-length input sequence is mapped to a variable-length output sequence, and the same architecture can also be used for image captioning. We use a ResNet with pretrained weights as the encoder to produce feature vectors from the input images, and a GRU (a variant of the RNN) as the decoder.

For video summarization, we use the OpenCV library to capture frames from the video at a fixed interval (1 frame per 2 seconds). We generate captions for all of these frames with the image captioning model above and retain only those captions whose similarity score with the immediately preceding caption is below a threshold of 0.5. We then perform abstractive summarization of the retained captions using the T5-base Transformer model.
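A minimal sketch of how these pieces fit together; `extract_frames`, `generate_caption`, `similarity`, and `summarize` are placeholders for the components described in the Implementation section below, not functions from the notebook:

```python
# Sketch of the end-to-end pipeline; the helper functions stand in for the
# frame extractor, captioning model, similarity metric and T5 summarizer.
def summarize_video(video_path, frame_interval_s=2, sim_threshold=0.5):
    frames = extract_frames(video_path, frame_interval_s)   # OpenCV, 1 frame / 2 s
    kept_captions = []
    for frame in frames:
        caption = generate_caption(frame)                    # encoder-decoder captioner
        # Keep a caption only if it differs enough from the previous kept caption.
        if not kept_captions or similarity(caption, kept_captions[-1]) < sim_threshold:
            kept_captions.append(caption)
    return summarize(" ".join(kept_captions))                # T5-base abstractive summary
```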
Implementation
In image_captioning_.ipynb we download the dataset and perform all of the preprocessing, training, and evaluation.
Dataset used: MS-COCO (a subset containing 15,000 randomly shuffled images)
Vocabulary: A mapping between words and indices. We limited the vocabulary size to 5,000 words instead of the 10,000 discussed in the paper, to save memory.
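A sketch of how such a capped vocabulary can be built with the Keras tokenizer; `train_captions` is an assumed list of `<start> ... <end>` caption strings, and the exact preprocessing lives in image_captioning_.ipynb:

```python
import tensorflow as tf

# Build a word<->index mapping capped at the 5,000 most frequent words (illustrative).
top_k = 5000
tokenizer = tf.keras.preprocessing.text.Tokenizer(
    num_words=top_k, oov_token="<unk>",
    filters='!"#$%&()*+.,-/:;=?@[\\]^_`{|}~ ')
tokenizer.fit_on_texts(train_captions)                 # assumed list of caption strings
train_seqs = tokenizer.texts_to_sequences(train_captions)
cap_vector = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')
```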
Encoder: A ResNet with pretrained weights and the final classification layer removed. We could also try training the encoder instead of only loading pretrained weights.
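A sketch of the feature-extractor setup, assuming a Keras ResNet50; the actual variant and input pipeline are defined in the notebook:

```python
import tensorflow as tf

# Pretrained ResNet50 with the classification head removed; used only to
# extract spatial feature maps, so its weights stay frozen here (assumed setup).
image_model = tf.keras.applications.ResNet50(include_top=False, weights='imagenet')
feature_extractor = tf.keras.Model(image_model.input, image_model.output)

def extract_features(img_batch):
    # (batch, 7, 7, 2048) -> (batch, 49, 2048): one feature vector per spatial location
    feats = feature_extractor(tf.keras.applications.resnet50.preprocess_input(img_batch))
    batch = tf.shape(feats)[0]
    return tf.reshape(feats, (batch, -1, feats.shape[-1]))
```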
Decoder: A GRU (gated recurrent unit) with Bahdanau attention. With the attention-based architecture we can observe which parts of the image were attended to when generating each word of the caption. Two GRUs are stacked on top of each other, followed by three fully connected layers for the predictions, with dropout of 0.25 at every stage of the decoder.
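A condensed TensorFlow/Keras sketch of what this attention-augmented decoder might look like; layer sizes and exact wiring are assumptions, and the notebook has the authoritative version:

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features: (batch, 49, feat_dim), hidden: (batch, units)
        hidden_with_time = tf.expand_dims(hidden, 1)
        score = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time)))
        attention_weights = tf.nn.softmax(score, axis=1)       # weight per image region
        context_vector = tf.reduce_sum(attention_weights * features, axis=1)
        return context_vector, attention_weights

class Decoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super().__init__()
        self.units = units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru1 = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)
        self.gru2 = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)
        self.dropout = tf.keras.layers.Dropout(0.25)
        self.fc1 = tf.keras.layers.Dense(units)
        self.fc2 = tf.keras.layers.Dense(units)
        self.fc3 = tf.keras.layers.Dense(vocab_size)
        self.attention = BahdanauAttention(units)

    def call(self, x, features, hidden, training=False):
        # Attend over image regions, then feed [context; word embedding] to the GRUs.
        context_vector, attention_weights = self.attention(features, hidden)
        x = self.embedding(x)                                   # (batch, 1, embedding_dim)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        x, _ = self.gru1(x)
        x, state = self.gru2(x)
        x = self.dropout(self.fc1(x), training=training)
        x = self.dropout(self.fc2(x), training=training)
        x = tf.reshape(x, (-1, x.shape[2]))
        return self.fc3(x), state, attention_weights            # logits over the vocabulary
```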
Caption generation: Greedy search, i.e. the highest-probability word is chosen at each step.
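A sketch of the greedy decoding loop, reusing the `decoder`, `tokenizer`, and `extract_features` objects from the sketches above (all assumed):

```python
import tensorflow as tf

# Greedy decoding: at every step feed back the most probable word.
def greedy_caption(image_batch, max_length=40):
    features = extract_features(image_batch)                   # (1, 49, 2048)
    hidden = tf.zeros((1, decoder.units))
    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    result = []
    for _ in range(max_length):
        predictions, hidden, _ = decoder(dec_input, features, hidden)
        predicted_id = int(tf.argmax(predictions[0]))           # highest-probability token
        word = tokenizer.index_word.get(predicted_id, '<unk>')
        if word == '<end>':
            break
        result.append(word)
        dec_input = tf.expand_dims([predicted_id], 0)
    return ' '.join(result)
```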
Training: Teacher forcing is used to reduce training time for the RNN.
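A sketch of a teacher-forced training step under the same assumptions; `loss_object` and `optimizer` (e.g. sparse categorical cross-entropy and Adam) are assumed to be defined elsewhere:

```python
import tensorflow as tf

# Teacher forcing: the ground-truth word, not the model's own prediction,
# is fed as the next decoder input at every time step.
def train_step(img_features, target):
    loss = 0.0
    hidden = tf.zeros((target.shape[0], decoder.units))
    dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * target.shape[0], 1)
    with tf.GradientTape() as tape:
        for t in range(1, target.shape[1]):
            predictions, hidden, _ = decoder(dec_input, img_features, hidden, training=True)
            loss += loss_object(target[:, t], predictions)
            dec_input = tf.expand_dims(target[:, t], 1)          # teacher forcing
    grads = tape.gradient(loss, decoder.trainable_variables)
    optimizer.apply_gradients(zip(grads, decoder.trainable_variables))
    return loss / int(target.shape[1])
```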
Score: The maximum cosine similarity between the predicted caption and the 5 ground-truth captions. Mean cosine similarity over 50 random images: 0.82622829.
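An illustrative version of this metric using bag-of-words vectors; the notebook may represent the captions differently (e.g. with learned embeddings), so treat this only as a sketch of the "max over 5 references" idea:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def caption_score(predicted, references):
    # Vectorize the predicted caption together with the reference captions,
    # then take the best cosine similarity among the references.
    vectors = CountVectorizer().fit_transform([predicted] + list(references))
    sims = cosine_similarity(vectors[0], vectors[1:])
    return float(sims.max())

score = caption_score("a dog runs on the beach",
                      ["a dog running along the beach", "a brown dog on sand",
                       "a dog plays near the ocean", "a puppy on a beach",
                       "dog running on the shore"])
```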
Video to frames: Using OpenCV
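A sketch of the frame extraction described in the overview (1 frame every 2 seconds); the function name is illustrative:

```python
import cv2

def extract_frames(video_path, interval_s=2):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30            # fall back if FPS is unavailable
    step = int(round(fps * interval_s))              # frames to skip between samples
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))   # OpenCV reads BGR
        idx += 1
    cap.release()
    return frames
```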
Transformer used (for summarization): T5-base
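A sketch of abstractive summarization of the retained captions with T5-base via Hugging Face Transformers; the generation parameters here are illustrative, not the notebook's exact values:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer_t5 = T5Tokenizer.from_pretrained("t5-base")
model_t5 = T5ForConditionalGeneration.from_pretrained("t5-base")

def summarize(text, max_len=150):
    # T5 expects a task prefix; "summarize:" triggers its summarization behaviour.
    inputs = tokenizer_t5("summarize: " + text, return_tensors="pt",
                          truncation=True, max_length=512)
    ids = model_t5.generate(inputs["input_ids"], max_length=max_len,
                            num_beams=4, early_stopping=True)
    return tokenizer_t5.decode(ids[0], skip_special_tokens=True)
```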