Knowledge representation encodes large-scale structured information about entities and relations into a continuous low-dimensional semantic space. Most conventional methods focus solely on learning
knowledge representations from a single modality and neglect the complementary information available in others. This paper proposes a novel multi-modal knowledge representation learning (MM-KRL) framework
that handles knowledge from both textual and visual web data. It consists of two stages, i.e., webly-supervised multi-modal relationship mining and bi-enhanced
cross-modal knowledge representation learning.
Compared with existing knowledge representation methods, our framework has several advantages:
- It automatically and effectively mines multi-modal knowledge with structured textual and visual relationships from the web.
- It learns a common knowledge space that is independent of both task and modality via the proposed Bi-enhanced Cross-modal Deep Neural Network (BC-DNN); a minimal illustrative sketch follows this list.
- It can represent unseen multi-modal relationships by transferring the knowledge learned from seen, isolated entities and relations to their unseen compositions.
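
The sketch below is not the paper's BC-DNN; it is a minimal illustration, under assumed layer sizes and names, of the general idea behind the second and third points: modality-specific encoders project textual and visual features into one shared knowledge space, and a relationship embedding is composed from its subject, predicate, and object parts so that unseen combinations of seen components can still be represented.

```python
# Illustrative sketch only; all architectural choices below are assumptions,
# not the BC-DNN described in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalKnowledgeSpace(nn.Module):
    def __init__(self, text_dim=300, visual_dim=2048, common_dim=512):
        super().__init__()
        # Modality-specific projections into a common knowledge space.
        self.text_encoder = nn.Sequential(
            nn.Linear(text_dim, common_dim), nn.ReLU(),
            nn.Linear(common_dim, common_dim),
        )
        self.visual_encoder = nn.Sequential(
            nn.Linear(visual_dim, common_dim), nn.ReLU(),
            nn.Linear(common_dim, common_dim),
        )

    def embed_text(self, text_feat):
        return F.normalize(self.text_encoder(text_feat), dim=-1)

    def embed_visual(self, visual_feat):
        return F.normalize(self.visual_encoder(visual_feat), dim=-1)

    @staticmethod
    def compose_relationship(subj, pred, obj):
        # Compose a <subject, predicate, object> embedding from its parts;
        # unseen triples can be built from embeddings of seen components.
        return F.normalize(subj + pred + obj, dim=-1)

def alignment_loss(text_emb, visual_emb, margin=0.2):
    # A simple ranking objective (illustrative): pull matched text/visual pairs
    # together in the common space and push mismatched pairs apart.
    scores = text_emb @ visual_emb.t()          # pairwise cosine similarities
    positives = scores.diag().unsqueeze(1)      # matched-pair scores
    cost = (margin + scores - positives).clamp(min=0)
    cost.fill_diagonal_(0)                      # ignore the matched pairs themselves
    return cost.mean()
```

In such a setup, the textual and visual embeddings of the same relationship should land close together in the common space, which is what would support cross-modal retrieval and zero-shot composition of unseen relationships.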
We build a large-scale multi-modal relationship dataset (MMR-D), and experimental results show that our framework achieves superior performance on zero-shot multi-modal retrieval
and visual relationship recognition.