Generating Video From Text

Ajay Halthor, Cem Birler, Dhvani Vora, Shreyas Kolpe, Tayyab Anwar

What is the problem?

We aim to create video frames using generative models that depend solely on a text input, by identifying the objects and actions in the given sentence. We do this by building on previous work on video frame prediction and adding components that let those models operate on text alone.

But why is it important?

There are countless applications of a text-to-video system. Even though they are beyond the scope of this project, here are a few of them:

1. In education, long descriptions in textbooks can be converted into simple animations to enhance learning.

2. In the movie industry, novels could be used to generate movies. This would be a huge leap, as films could be produced far more easily, saving time and millions of dollars.

3. In social media, text-to-image/video can be used to create a new way of communication.

Previous Work

Recent advances in generative modeling have been driven by two fundamental architectures: Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs). In “Unsupervised Learning for Physical Interaction through Video Prediction”, Finn et al. use Convolutional Dynamic Neural Advection (CDNA) to predict the tᵗʰ frame of a video from the previous t-1 frames: CDNA predicts a set of transformations that are applied to the previous frame to produce candidate frames, and a masked weighted average of these candidates becomes the next predicted frame. However, the predictions are deterministic, which lowers the quality of the generated video.
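To make the combination step concrete, here is a minimal numpy sketch of how CDNA-style candidate frames could be merged with masks. The shapes and variable names are our own illustration, not code from Finn et al.

```python
import numpy as np

# Hypothetical shapes for illustration: K candidate frames of size H x W,
# each produced by applying one predicted transformation to frame t-1.
K, H, W = 10, 64, 64
candidates = np.random.rand(K, H, W)      # candidate next frames
mask_logits = np.random.rand(K, H, W)     # per-pixel mask logits

# Softmax across the K candidates so the masks sum to 1 at every pixel.
masks = np.exp(mask_logits) / np.exp(mask_logits).sum(axis=0, keepdims=True)

# The predicted frame t is the per-pixel masked weighted average.
predicted_frame = (masks * candidates).sum(axis=0)
assert predicted_frame.shape == (H, W)
```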

Stochastic Variational Video Prediction (SV2P) (Babaeizadeh et al., 2018) fixes this by feeding a stochastic latent variable into the CDNA model by means of a VAE architecture. Stochastic Adversarial Video Prediction (SAVP) (Lee et al., 2018) is a modification of SV2P that uses a GAN generator in place of CDNA. This architecture is able to produce good-quality videos given contextual frames.

How Is Our Model Different?

Previous research has shown promising results in text-to-image generation; our project extends a VAE-based approach to generating multiple, coherent frames from the provided textual description. We used the SV2P architecture as a base.

We added an “action-conditioning” component to SV2P: the action vector is passed to the CDNA alongside the latent vector coming from the VAE encoder. This vector acts as a stand-in for the text, with future work focusing on text representations such as sentence embeddings and embeddings with attention.
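As a rough illustration of what we mean by action conditioning, the sketch below concatenates an action vector with the VAE latent before it reaches the frame predictor. The function and tensor names are hypothetical and only show where the extra signal enters.

```python
import numpy as np

def condition_on_action(latent_z, action_vec):
    """Concatenate the stochastic latent with the per-frame action vector.

    latent_z:   (batch, z_dim) sample from the VAE encoder
    action_vec: (batch, a_dim) e.g. a one-hot direction of motion
    The result is what a CDNA-style predictor would receive in place of the
    latent alone (the action acts as a stand-in for a text representation).
    """
    return np.concatenate([latent_z, action_vec], axis=-1)

# Toy usage with made-up dimensions.
z = np.random.randn(32, 8)               # latent from the encoder
a = np.zeros((32, 4)); a[:, 0] = 1.0     # "move up" as a one-hot action
conditioned = condition_on_action(z, a)  # shape (32, 12)
```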

Experiments and Results

We chose the Moving MNIST dataset based on its use in the literature for video generation and prediction tasks. We used an existing script to generate this dataset and added vertical, horizontal and circular motion; our program also modifies this script to generate captions. In the end, 2000 Moving MNIST frames were generated; 1700 were used for training and validation and 300 for testing.
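Below is a minimal sketch of the kind of trajectory logic we describe, with horizontal, vertical and circular motion. The digit input, canvas size and caption template are placeholders, not the exact script we used.

```python
import numpy as np

def trajectory(motion, num_frames, radius=10):
    """Return (x, y) offsets of a digit's top-left corner for each frame."""
    t = np.arange(num_frames)
    if motion == "horizontal":
        return t * 2, np.zeros_like(t)
    if motion == "vertical":
        return np.zeros_like(t), t * 2
    if motion == "circular":
        angle = 2 * np.pi * t / num_frames
        return (radius * np.cos(angle)).astype(int), (radius * np.sin(angle)).astype(int)
    raise ValueError(motion)

def render(digit, motion, num_frames=20, canvas=64):
    """Paste a 28x28 digit array onto a blank canvas along the trajectory."""
    xs, ys = trajectory(motion, num_frames)
    frames = np.zeros((num_frames, canvas, canvas))
    for i, (dx, dy) in enumerate(zip(xs, ys)):
        x = int(dx) % (canvas - 28)
        y = int(dy) % (canvas - 28)
        frames[i, y:y + 28, x:x + 28] = digit
    return frames

# A caption can be generated from the same parameters, for example
# f"the digit {label} is moving {motion}ly".
```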

Generated video frames from moving MNIST dataset
Ground truth from moving MNIST dataset

1. Baseline

We set a baseline by training the basic SV2P model with 10 context frames, set to predict the next 20 frames. The loss was used to analyze the improvements made by action conditioning. Tables 1 & 2 show the recorded L2 losses in comparison to the other models.

Tables 1 & 2. Comparison of losses after 1000 steps.

2. Action Supported

In this experiment, we provided the model with the action in the form of a vector that defined the direction of movement of the digits. Our experiments showed an improvement in the model's predictions when the action was included. This model was tested with 1 and 10 contextual frames. The GIFs below show our results after 1000 steps with and without actions.

Ground truth of frames with and without actions
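To give a sense of what the action input looks like, here is a hedged sketch of how a per-frame direction could be encoded as a vector; the set of directions and the one-hot layout are our own illustrative choices.

```python
import numpy as np

DIRECTIONS = ["up", "down", "left", "right"]

def action_vector(direction, num_frames):
    """One-hot encode the direction of motion, repeated for every frame."""
    one_hot = np.zeros(len(DIRECTIONS))
    one_hot[DIRECTIONS.index(direction)] = 1.0
    return np.tile(one_hot, (num_frames, 1))   # shape (num_frames, 4)

actions = action_vector("left", num_frames=20)
```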

3. Action and State Supported

Apart from actions, we performed experiments providing the digit identity to the model using a state vector. The L2 loss decreased once the state came into the picture.
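As a minimal illustration (again with our own made-up layout), the digit identity can be encoded as a one-hot state vector per frame, to be used next to the action vector:

```python
import numpy as np

def state_vector(digit_label, num_frames, num_classes=10):
    """One-hot encode which digit appears in the video, repeated per frame."""
    one_hot = np.zeros(num_classes)
    one_hot[digit_label] = 1.0
    return np.tile(one_hot, (num_frames, 1))    # shape (num_frames, 10)

# Example: the state for "digit 4" across a 20-frame clip; in our setup this
# sits alongside the per-frame action vector as conditioning.
state = state_vector(4, num_frames=20)
```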

4. Action and State Without Context Frames

Our ultimate target was to train the model to not depend on any context frame and to condition the prediction only on the actions and states provided. We compared the results with actions, with actions and a state, and with an empty context frame (1 frame with all-zero values) plus actions and states. As expected, the loss was lowest when both actions and states were given. Below are the loss results for the 3 models:

Look at that generation of the number 4 moving left and right, morphing into the shape of a cauliflower :)
The time when we invented a new number!

5. Context Frames without Actions

As another experiment, we trained our network without any actions while varying the number of context frames. Below are the results for this experiment:

Change of losses, recorded every 100 steps, for 3 different models. Each model uses a different number of context frames without any actions.
0 context frames (all zeros), without actions: generated and ground truth
1 context frame, without actions: generated and ground truth
10 context frames, without actions: generated and ground truth

Without actions, the model (the base SV2P model) does not produce consistent results: we expected the loss to decrease as we went from 0 context frames to 10, but the loss is highest with 10 context frames and lowest with 1 context frame.

One possible explanation: with 1 context frame, the model does not have to learn to generate the digit from scratch and only has to learn how the motion works. With 10 context frames, there is more motion to learn, since SV2P uses the (t-1)ᵗʰ frame to predict the tᵗʰ frame; having many more (t-1)ᵗʰ frames makes the next frame harder to learn, increasing the loss. With 0 context frames, the model has to learn both how to generate the first frame from scratch and how the digit moves, which also increases the loss.

Generation of 1 moving up from 0 context frame (all zeros). Left without actions, middle with actions, right with actions and states.

In all of the models, the learning rate was set to 0.001. During training, the Adam optimizer was used with a batch size of 32. Training optimized a KL-divergence + L2 loss; however, we chose to plot only the L2 loss because it is easier to interpret.
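For reference, here is a hedged numpy sketch of that objective: an L2 reconstruction term plus the KL divergence between the encoder's Gaussian and a standard normal prior (the standard VAE formulation; the shapes and weighting are illustrative, not our exact implementation).

```python
import numpy as np

def vae_video_loss(pred_frames, true_frames, mu, log_var, kl_weight=1.0):
    """L2 reconstruction + KL(q(z|x) || N(0, I)), averaged over the batch.

    pred_frames, true_frames: (batch, T, H, W)
    mu, log_var:              (batch, z_dim) from the VAE encoder
    """
    l2 = np.mean(np.sum((pred_frames - true_frames) ** 2, axis=(1, 2, 3)))
    kl = np.mean(-0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var), axis=1))
    return l2 + kl_weight * kl, l2, kl   # we plot only the L2 term
```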

Future Work

Sentence Embeddings

Instead of directly feeding our model an action vector (for every frame), we could feed it an English sentence. This would extend our model's ability to handle complete, complex sentences.
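As a first step, even something as simple as mean-pooled word vectors could stand in for the action vector. The vocabulary and embedding matrix below are random placeholders, not a trained encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = {"the": 0, "digit": 1, "four": 2, "is": 3, "moving": 4, "left": 5}
EMBED = rng.normal(size=(len(VOCAB), 16))   # placeholder word embeddings

def sentence_embedding(sentence):
    """Mean-pool word embeddings into one fixed-size vector per sentence."""
    ids = [VOCAB[w] for w in sentence.lower().split() if w in VOCAB]
    return EMBED[ids].mean(axis=0)

# This vector would replace the hand-crafted action vector as conditioning.
cond = sentence_embedding("the digit four is moving left")
```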

Attention

Another possible area of improvement is the use of attention on the input sentence. An attention mechanism would allow the model to focus on different parts of the sentence while generating each frame. This could lead to more distinct, realistic frames.
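A minimal sketch of the idea, using scaled dot-product attention where a per-frame query attends over the word embeddings of the sentence; all shapes and names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(frame_query, word_embeddings):
    """Scaled dot-product attention of one frame's query over the sentence.

    frame_query:     (d,)  e.g. derived from the decoder state at frame t
    word_embeddings: (num_words, d)
    Returns a context vector that could condition the generation of frame t.
    """
    scores = word_embeddings @ frame_query / np.sqrt(frame_query.shape[0])
    weights = softmax(scores)            # which words matter for this frame
    return weights @ word_embeddings

context = attend(np.random.randn(16), np.random.randn(6, 16))
```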

Our poster during the presentation day
