Yuge Shi from Oxford University will give a talk titled "Multimodal Learning with Deep Generative Models".

Title: Multimodal Learning with Deep Generative Models

Abstract: In this talk, I will present two of my works on multi-modal representation learning using deep generative models. In these works, we focus on multi-modal scenarios that occur naturally in the real world and depict common concepts, such as image-caption, photo-sketch, and video-audio pairs. In the first work, we propose using a mixture-of-experts posterior in a VAE to achieve balanced representation learning across modalities; by doing so, the model is able to leverage the commonality between modalities to learn more robust representations and achieve better generative performance. In addition, we propose four criteria (with evaluation metrics) that multi-modal deep generative models should satisfy. In the second work, we design a contrastive-ELBO objective for multi-modal VAEs that greatly reduces the amount of paired data needed to train such models. We show that our objective is effective on multiple SOTA multi-modal VAEs and on different datasets, and that only 20% of the data is needed to achieve performance similar to that of a model trained on the original objective.

