Multi-Modal Knowledge Representation Learning
via Webly-Supervised Relationships Mining


Introduction


Knowledge representation encodes enormous structured information about entities and relations into a continuous low-dimensional semantic space. Most conventional methods focus solely on learning knowledge representation from a single modality and neglect the complementary information from others. This paper proposes a novel multi-modal knowledge representation learning (MM-KRL) framework that handles knowledge from both textual and visual web data. It consists of two stages: (i) webly-supervised multi-modal relationship mining and (ii) bi-enhanced cross-modal knowledge representation learning.
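A hedged toy example of the kind of step the first stage performs, turning free-form web text into structured (subject, relation, object) triplets, is sketched below. This snippet does not describe the paper's actual mining pipeline; the spaCy-based extraction, the model name, and the dependency labels are assumptions for illustration only.

```python
# Illustrative sketch (not the paper's pipeline): naive subject-verb-object
# extraction from a web caption with spaCy, yielding textual relationship triplets.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def extract_triplets(sentence):
    """Return naive (subject, relation, object) triplets found in one sentence."""
    doc = nlp(sentence)
    triplets = []
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "obj", "attr")]
            for s in subjects:
                for o in objects:
                    triplets.append((s.lemma_, token.lemma_, o.lemma_))
    return triplets

print(extract_triplets("A man rides a horse on the beach."))
# expected: [('man', 'ride', 'horse')]
```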

Compared with existing knowledge representation methods, our framework has several advantages:

  1. It can automatically and effectively mine multi-modal knowledge with structured textual and visual relationships from the web.
  2. It learns a common knowledge space that is independent of both task and modality via the proposed Bi-enhanced Cross-modal Deep Neural Network (BC-DNN).
  3. It can represent unseen multi-modal relationships by transferring the knowledge learned from seen, isolated entities and relations to unseen relationships.

We build a large-scale multi-modal relationship dataset (MMR-D), and the experimental results show that our framework achieves superior performance in zero-shot multi-modal retrieval and visual relationship recognition.
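Below is a minimal sketch of the idea behind advantages 2 and 3: two modality-specific branches project textual and visual relationship features into one common knowledge space, so a relationship unseen during training can still be embedded through either branch. This is not the paper's BC-DNN; the layer sizes, the alignment loss, and every name in the snippet are illustrative assumptions.

```python
# Minimal two-branch cross-modal encoder mapping textual and visual relationship
# features into a shared knowledge space (illustrative only, not the BC-DNN).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalEncoder(nn.Module):
    def __init__(self, text_dim=300, visual_dim=4096, common_dim=256):
        super().__init__()
        # Modality-specific branches projecting into the common space.
        self.text_branch = nn.Sequential(
            nn.Linear(text_dim, 512), nn.ReLU(), nn.Linear(512, common_dim))
        self.visual_branch = nn.Sequential(
            nn.Linear(visual_dim, 512), nn.ReLU(), nn.Linear(512, common_dim))

    def forward(self, text_feat, visual_feat):
        # L2-normalise so that retrieval can later use cosine similarity.
        t = F.normalize(self.text_branch(text_feat), dim=-1)
        v = F.normalize(self.visual_branch(visual_feat), dim=-1)
        return t, v

def alignment_loss(t, v):
    # Pull paired textual/visual embeddings of the same relationship together.
    return F.mse_loss(t, v)
```

Once such a model is trained on paired relationship instances, either branch alone can embed a query from its own modality, which is the property that makes zero-shot cross-modal retrieval possible.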

Motivation


The differences from previous work:


Fig.1 Illustration of the differences among conventional textual KRL, visual KRL, and the proposed MM-KRL.

  1. Textual KRL: textual knowledge representation learning.
  2. Visual KRL: visual knowledge representation learning.
  3. MM-KRL: multi-modal knowledge representation learning.

Framework


Fig.2 Proposed framework for multi-modal knowledge learning.

Proposed Bi-enhanced Cross-modal DNN (BC-DNN) method



Fig.3 Bi-enhanced cross-modal knowledge representation.

Dataset


Training data 115.0 GB
Training data includes 597299 instances.

Test data 17.6 GB
Test data includes 90690 instances.
  • list of all relationships

  • Statistics:
      #Multi-modal relationships: 20726
      #Textual relationship instances: 20726
      #Visual relationship instances: 687784

    Source Code


    tripletRelationship.py
    Source code for textual knowledge representation learning based on the Triplet Relationship strategy.
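    For reference, a hedged sketch of a margin-based triplet objective over (subject, relation, object) embeddings, in the spirit of translation-based models such as TransE, is given below; tripletRelationship.py may differ, and the dimensionality, margin, and negative-sampling scheme are assumptions for illustration only.

```python
# Illustrative margin-based triplet objective for textual KRL (not the released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TripletKRL(nn.Module):
    def __init__(self, n_entities, n_relations, dim=100, margin=1.0):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)
        self.rel = nn.Embedding(n_relations, dim)
        self.margin = margin

    def score(self, s, r, o):
        # Lower score = more plausible triplet: ||s + r - o||_2 (TransE-style).
        return (self.ent(s) + self.rel(r) - self.ent(o)).norm(p=2, dim=-1)

    def forward(self, pos, neg):
        # pos / neg: tuples of (subject, relation, object) index tensors,
        # where neg is a corrupted copy of pos (random subject or object).
        pos_score = self.score(*pos)
        neg_score = self.score(*neg)
        return F.relu(self.margin + pos_score - neg_score).mean()
```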

    caffe-multilabel.zip    solver.prototxt    train_val.prototxt
    Source code for visual knowledge representation learning based on the deep multivariate regression strategy.
    caffe-multilabel.zip: modified Caffe code that supports deep multivariate regression and requires only a single LMDB file.
    solver.prototxt & train_val.prototxt: parameters used in the visual knowledge representation learning stage.

    Training Log Files
    Training log files (iteration and training loss) from the multi-modal knowledge representation learning stage.

    Results



    Text-Text Retrieval 910 KB
    Results of zero-shot text-text retrieval.

    Image-Image Retrieval 3.10 MB
    Results of zero-shot image-image retrieval.

    Text-Image Retrieval 2.93 MB
    Results of zero-shot text-image cross-modal retrieval.
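    For context, zero-shot retrieval in a learned common space typically reduces to nearest-neighbour search; the snippet below is a minimal sketch assuming cosine similarity, not the evaluation code that produced the result files above.

```python
# Illustrative nearest-neighbour retrieval in the common knowledge space.
import numpy as np

def retrieve(query_emb, candidate_embs, top_k=10):
    """Return indices of the top_k candidates most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    sims = c @ q                      # cosine similarity to every candidate
    return np.argsort(-sims)[:top_k]  # highest similarity first
```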


    Last updated on 2017/03/29