Week 12: Blog Post #4

Model Architecture

Posted by Yi on November 10, 2022

Vision and Language Transformer (ViLT)

The goal of this project is to explore the potential of multimodal models on the lip reading task. We will deploy the Vision and Language Transformer (ViLT), which is by far the simplest architecture for a vision-and-language model: it relies on the transformer module itself to extract and process visual features instead of a separate deep visual embedder. Because the design of ViLT uses no convolutions and no region supervision, it achieves significant runtime and parameter efficiency compared to other models. The detailed architecture of ViLT is shown in the following figure.

[Figure: ViLT model architecture]
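To get a feel for the model before adapting it, here is a minimal sketch of running a pre-trained ViLT on an image/text pair through the Hugging Face transformers library; the dandelin/vilt-b32-mlm checkpoint and the toy image and caption are assumptions for illustration, not part of the final pipeline.

```python
# Minimal sketch: forward one (image, text) pair through a pre-trained ViLT.
# Assumes the Hugging Face `transformers` ViLT implementation and the publicly
# released "dandelin/vilt-b32-mlm" checkpoint; the image URL and caption are toy inputs.
import requests
from PIL import Image
from transformers import ViltProcessor, ViltModel

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-mlm")
model = ViltModel.from_pretrained("dandelin/vilt-b32-mlm")

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
text = "a person saying a short phrase"

# The processor handles the BERT tokenization and the image resizing/normalization;
# the 32x32 patch projection happens inside the model.
inputs = processor(images=image, text=text, return_tensors="pt")
outputs = model(**inputs)

# One contextualized embedding per text token and image patch (plus special tokens).
print(outputs.last_hidden_state.shape)
```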


Model Details

This section discusses the implementation details and experimental setup of ViLT for this project.


Word Embedder: BERT Tokenizer

BERT uses what is called a WordPiece tokenizer. It works by splitting a word either into its full form (one word becomes one token) or into word pieces, where a single word can be broken into multiple tokens.
For example:

Word          Token(s)
snow          ['snow']
snowing       ['snow', '##ing']
snowboard     ['snow', '##board']
snowboarding  ['snow', '##board', '##ing']
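This behavior can be checked directly with the Hugging Face tokenizer; the short snippet below assumes the standard bert-base-uncased vocabulary, with which the output should match the table above.

```python
# Quick check of WordPiece tokenization; assumes the "bert-base-uncased" vocabulary.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

for word in ["snow", "snowing", "snowboard", "snowboarding"]:
    # Non-initial word pieces are marked with the "##" prefix.
    print(word, tokenizer.tokenize(word))
```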


Visual Embedder: Patch Projection

The idea of patch projection was introduced by the Vision Transformer (ViT); ViLT uses a 32 × 32 patch projection as its visual embedder.
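In code, patch projection amounts to cutting the image into non-overlapping 32 × 32 patches and linearly projecting each one to the transformer's hidden size. Below is a minimal sketch; the 768-dimensional hidden size and the use of a strided convolution (equivalent to a per-patch linear layer) are illustrative assumptions.

```python
# Sketch of a 32x32 patch projection visual embedder.
# A Conv2d with kernel_size == stride == patch size is equivalent to slicing the
# image into non-overlapping patches and applying one shared linear projection.
import torch
import torch.nn as nn

patch_size = 32
hidden_dim = 768  # assumed transformer hidden size

patch_embed = nn.Conv2d(3, hidden_dim, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 384, 640)          # (batch, channels, height, width)
patches = patch_embed(image)                 # (1, 768, 12, 20)
tokens = patches.flatten(2).transpose(1, 2)  # (1, 240, 768): one token per patch
print(tokens.shape)
```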

Pre-training Datasets:

  • MSCOCO
  • Visual Genome
  • SBU Captions
  • Google Conceptual Captions

Pre-training Tasks:

  • Image Text Matching
  • Masked Language Modeling
  • Word Patch Alignment

Dataset: MIRACL-VC1

We will use MIRACL-VC1, a dataset recorded with a Microsoft Kinect in which 15 people each say 10 words and 10 phrases, 10 times each, giving 3,000 instances in total. Each instance is a sequence of 640×480-pixel color and depth images.

[Figure: sample color and depth frames from MIRACL-VC1]
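For reference, one instance could be loaded as a stack of frames roughly as follows; the directory layout and file names in this sketch are assumptions about how we will organize the downloaded data, not part of the dataset's documentation.

```python
# Sketch: load one MIRACL-VC1 utterance as an array of color frames.
# The folder structure and file naming here (e.g. F01/words/01/01/color_001.jpg)
# are assumptions about how we organize the download.
from pathlib import Path

import numpy as np
from PIL import Image

def load_instance(instance_dir):
    """Return the color frames of one utterance as a (T, 480, 640, 3) array."""
    frame_paths = sorted(Path(instance_dir).glob("color_*.jpg"))
    frames = [np.asarray(Image.open(p).convert("RGB")) for p in frame_paths]
    return np.stack(frames)

frames = load_instance("MIRACL-VC1/F01/words/01/01")  # hypothetical path
print(frames.shape)  # (T, 480, 640, 3), where T is the number of frames
```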


Loss Function: Cross Entropy Loss

Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label.

$-\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})$
  • M - number of classes (dog, cat, elephant)
  • log - the natural log
  • y - binary indicator (0 or 1) of whether class label (c) is the correct classification for observation (o)
  • p - predicted probability that observation (o) is of class (c)
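To make the formula concrete, the sketch below computes the loss for a single three-class observation by hand and then with PyTorch's CrossEntropyLoss; the class scores and probabilities are made-up toy values.

```python
# Toy example of the multi-class cross-entropy formula above (M = 3 classes).
import numpy as np
import torch
import torch.nn as nn

# One observation whose true class is "cat" (index 1).
y = np.array([0.0, 1.0, 0.0])   # binary indicators y_{o,c}
p = np.array([0.2, 0.7, 0.1])   # predicted probabilities p_{o,c}
print(-np.sum(y * np.log(p)))   # -log(0.7) ~= 0.357

# In training we would use the PyTorch module, which takes raw logits
# (log-softmax is applied internally) and integer class indices.
criterion = nn.CrossEntropyLoss()
logits = torch.tensor([[0.5, 1.8, -0.3]])  # unnormalized scores for one sample
target = torch.tensor([1])                 # index of the true class
print(criterion(logits, target).item())
```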

Experiment Design:

We will use MIRACL-VC1 to fine-tune the pre-trained ViLT and compare its performance with a traditional encoder-decoder baseline (a CNN encoder with an LSTM decoder). For each word or phrase per speaker, eight instances will be used for fine-tuning and two for testing. If time permits, we will also record some new instances to test how the proposed model performs on unseen speakers.
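The split can be written down directly over repetition indices; the sketch below assumes each instance is keyed by (speaker, utterance, repetition) with repetitions numbered 1-10, which is our own bookkeeping convention rather than anything imposed by the dataset.

```python
# Sketch of the 8/2 fine-tuning/testing split per speaker and utterance.
# Assumes instances are keyed as (speaker, utterance, repetition), repetitions 1..10.

def split_instances(instance_keys, train_repetitions=range(1, 9)):
    """Repetitions 1-8 of every (speaker, utterance) go to fine-tuning, 9-10 to testing."""
    train, test = [], []
    for speaker, utterance, repetition in instance_keys:
        (train if repetition in train_repetitions else test).append(
            (speaker, utterance, repetition))
    return train, test

# 15 speakers x 20 utterances (10 words + 10 phrases) x 10 repetitions = 3,000 keys
keys = [(s, u, r) for s in range(1, 16) for u in range(1, 21) for r in range(1, 11)]
train, test = split_instances(keys)
print(len(train), len(test))  # 2400 600
```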