Multi-Modal Knowledge Representation Learning
via Webly-Supervised Relationships Mining


Knowledge representation encodes large amounts of structured information, in the form of entities and relations, into a continuous low-dimensional semantic space. Most conventional methods learn knowledge representations from a single modality and neglect the complementary information available in others. This paper proposes a novel multi-modal knowledge representation learning (MM-KRL) framework that attempts to handle knowledge from both textual and visual web data. It consists of two stages: webly-supervised multi-modal relationship mining, and bi-enhanced cross-modal knowledge representation learning.

Compared with existing knowledge representation methods, our framework has several advantages:

  1. It can automatically and effectively mine multi-modal knowledge, with structured textual and visual relationships, from the web.
  2. It is able to learn a common knowledge space that is independent of both task and modality, using the proposed Bi-enhanced Cross-modal Deep Neural Network (BC-DNN).
  3. It is able to represent unseen multi-modal relationships by transferring the knowledge learned from seen, isolated entities and relations to unseen relationships.

We build a large-scale multi-modal relationship dataset (MMR-D), and the experimental results show that our framework achieves superior performance in zero-shot multi-modal retrieval and visual relationship recognition.


Differences from previous work:

Fig.1 Illustration of the differences between conventional textual KRL, visual KRL, and the proposed MM-KRL.

  1. Textual KRL: textual knowledge representation learning.
  2. Visual KRL: visual knowledge representation learning.
  3. MM-KRL: multi-modal knowledge representation learning.


Fig.2 Proposed framework for multi-modal knowledge learning.

Proposed Bi-enhanced DNN method

Fig.3 Bi-enhanced cross-modal knowledge representation.


Training data 115.0 GB
Training data includes 597299 instances.

Test data 17.6 GB
Test data includes 90690 instances.
  • list of all relationships

     #Multi-modal relationships   #Textual relationship instances   #Visual relationship instances
     20726                        20726                             687784

    Source Code
    Source code for textual knowledge representation learning based on the Triplet Relationship strategy. Files: solver.prototxt, train_val.prototxt.
    Source code for visual knowledge representation learning based on the deep multivariate regression strategy. The Caffe code is modified to support deep multivariate regression using only one LMDB file.
    solver.prototxt & train_val.prototxt: parameters for the visual knowledge representation learning stage.
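
To make the idea of multivariate regression concrete, here is a minimal sketch of a vector-valued regression loss. This page does not name the loss used in the released code; the sketch assumes a Euclidean (sum-of-squares) objective of the form 1/(2N) * sum ||p_n - t_n||^2, which is the form Caffe's EuclideanLoss layer computes. The function name `euclidean_loss` and the plain-list inputs are hypothetical and are not taken from the released source.

```python
def euclidean_loss(predictions, targets):
    """Return 1/(2N) * sum_n ||p_n - t_n||^2 over a batch of N vectors.

    Assumption: each prediction is a visual-branch output vector and each
    target is the embedding it should regress to. The page itself does not
    specify this pairing or this loss.
    """
    if not predictions or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and equal in length")

    total = 0.0
    for pred, target in zip(predictions, targets):
        if len(pred) != len(target):
            raise ValueError("each prediction must match its target's dimension")
        # Squared L2 distance between one predicted vector and its target.
        total += sum((p - t) ** 2 for p, t in zip(pred, target))

    # Average over the batch, with the conventional factor of 1/2.
    return total / (2.0 * len(predictions))
```

Because the target is a whole vector rather than a single class label, every output dimension contributes to the loss, which is what distinguishes multivariate regression from ordinary classification training.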

    Training Log Files
    Training log files (iteration and training loss) from the multi-modal knowledge representation learning stage.


    Text-Text Retrieval 910 KB
    Results of zero-shot text-text retrieval.

    Image-Image Retrieval 3.10 MB
    Results of zero-shot image-image retrieval.

    Text-Image Retrieval 2.93 MB
    Results of zero-shot text-image cross-modal retrieval.
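
Retrieval in a common knowledge space can be illustrated with a short sketch: once text and image items are embedded in the same space, a query from either modality is ranked against a gallery from either modality by vector similarity. The page does not state which similarity measure produced the results above; cosine similarity is an assumption here, and the names `cosine_similarity`, `retrieve`, and the gallery dictionary are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, gallery, top_k=5):
    """Rank gallery items by similarity to the query embedding.

    query   : list of floats, an embedding in the shared space
    gallery : dict mapping item id -> embedding (text or image items alike)
    Returns up to top_k (item_id, score) pairs, most similar first.
    """
    scored = [(cosine_similarity(query, vec), item_id) for item_id, vec in gallery.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(item_id, score) for score, item_id in scored[:top_k]]
```

The same function covers text-text, image-image, and text-image retrieval, since the modality of the query and the gallery only affects how the embeddings were produced, not how they are compared.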

    Last updated on 2017/03/29