CrossOSN-U: CASIA Cross-OSN dataset based on overlapped Users

Introduction

In today's social media landscape, huge amounts of web data are distributed among different OSNs (Online Social Networks). Data from these different sources share an overlapped user base, i.e., the individuals who are simultaneously involved in different OSNs for data generation and consumption. Analyzing cross-OSN data based on overlapped users provides an important way to connect and exploit the isolated social media data islands. To advance research on this topic, CrossOSN-U is released with the overlapped users' behavioral and social relation data on different OSNs (e.g., Google+, YouTube, Twitter, Flickr, Instagram, Tumblr).

The CrossOSN-U dataset is constructed as follows: (1) The first step is to obtain the userIDs of the same individual (overlapped user) on different OSNs. Third-party social media aggregation tools like About.me and social network sites like Google+ encourage users to disclose their userIDs on other OSNs, from which we collect the overlapped users' cross-OSN userIDs. (2) The respective APIs are then leveraged to crawl each userID's available data on the corresponding OSN. The current CrossOSN-U consists of several sub-datasets, enabling the exploration of overlapped users' cross-OSN data from different views and towards different applications.



The pipeline to collect overlapped users' cross-OSN data.
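
Step (1) of the pipeline above can be illustrated with a minimal sketch. It is not the authors' crawler: the record layout (profile id, OSN name, userID), the sample values, and the helper names `link_accounts` and `overlapped_users` are all assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical disclosure records: (aggregation-profile id, OSN name, userID on that OSN).
disclosures = [
    ("p1", "Twitter", "alice_tw"),
    ("p1", "YouTube", "alice_yt"),
    ("p2", "Twitter", "bob_tw"),
    ("p3", "Flickr", "carol_fl"),
    ("p3", "Twitter", "carol_tw"),
]


def link_accounts(records):
    """Group the disclosed userIDs by the individual who disclosed them."""
    accounts = defaultdict(dict)
    for person, osn, user_id in records:
        accounts[person][osn] = user_id
    return dict(accounts)


def overlapped_users(accounts, osn_a, osn_b):
    """Keep only the individuals who have a userID on both OSNs."""
    return {
        person: (ids[osn_a], ids[osn_b])
        for person, ids in accounts.items()
        if osn_a in ids and osn_b in ids
    }


accounts = link_accounts(disclosures)
twitter_youtube = overlapped_users(accounts, "Twitter", "YouTube")
print(twitter_youtube)
```

Each resulting pair of userIDs would then be handed to the corresponding OSN's API in step (2); that crawling step is not sketched here because it depends on the individual services.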


CrossOSN-U: Hetero

[Overview]

This sub-dataset consists of users' heterogeneous behavioral data (i.e., interacting with objects of different modalities) on Twitter and YouTube. Specifically, the dataset contains user profile and historical video behaviors on YouTube; and user profile, social relation, and historical tweeting data on Twitter. The metadata for all the involved YouTube videos are also included.

Overlapped users' heterogeneous data on Twitter and YouTube.

Note that the original tweeting data are not released due to the Twitter data policy. The tweeting data are instead provided as each user's topical distribution (modeled by LDA over 39,659 Twitter users). The basic statistics of the dataset are presented in the table below:

 #YouTube users | #Twitter users | #Overlapped users | #Videos   | #Average videos per YouTube user | #Average friends per Twitter user
 38,377         | 39,659         | 11,687            | 2,280,129 | 93.60                            | 891.1


[Download]
Twitter Data
YouTube Data

Detailed description of the data format is available at: readme.pdf

Please cite the following papers if this dataset helps your research:

  • Ming Yan, Jitao Sang, Changsheng Xu.
    Mining Cross-network Association for YouTube Video Promotion. [paper] [bibtex]  [project]
    ACM Multimedia 2014: 557-566.
    
    @inproceedings{DBLP:conf/mm/YanSX14,
      author    = {Ming Yan and
                   Jitao Sang and
                   Changsheng Xu},
      title     = {Mining Cross-network Association for YouTube Video Promotion},
      booktitle = {ACM Multimedia},
      pages     = {557--566},
      year      = {2014}
    }
    
    
  • Ming Yan, Jitao Sang, Changsheng Xu.
    Unified YouTube Video Recommendation via Cross-network Collaboration. [paper] [bibtex]  [project]
    ACM ICMR 2015: 19-26.   [Best Student Paper]
    
    @inproceedings{DBLP:conf/mir/YanSX15,
      author    = {Ming Yan and
                   Jitao Sang and
                   Changsheng Xu},
      title     = {Unified YouTube Video Recommendation via Cross-network Collaboration},
      booktitle = {ACM ICMR},
      pages     = {19--26},
      year      = {2015}
    }
    
    

CrossOSN-U: Homo

[Overview]

In addition to cross-OSN heterogeneous behaviors, there also exist cross-OSN homogeneous behaviors, where the interacted objects are of the same modality. These homogeneous behaviors can carry significantly different meanings on different OSNs even though they involve the same modality of objects, which is one important difference between cross-OSN computing and cross-media computing. This sub-dataset consists of overlapped users' homogeneous behaviors regarding videos on YouTube and Google+.


Overlapped users' homogeneous behaviors regarding videos on Google+ and YouTube.

In particular, the video-related behaviors of uploading, adding to a playlist, favoriting, rating, and commenting on YouTube, and those of sharing and commenting on Google+, are collected for the overlapped users. The videos are from a unique video pool on YouTube. The basic statistics of the dataset are presented in the table below:

 #YouTube users | #Google+ users | #Overlapped users | #Videos
 9,560          | 9,728          | 8,492             | 1,620,404


[Download]
User Profile Data
Video-related Interaction Data
Video Metadata

The following paper provides example research based on this sub-dataset: quantifying the significance of cross-OSN homogeneous behaviors in reflecting user interests.



CrossOSN-U: SN

[Overview]

In this sub-dataset, the overlapped users between Twitter and Flickr are socially connected (i.e., following or being followed) on both OSNs. Users' social networks on different OSNs can be analyzed and compared based on this sub-dataset. (Note that both Twitter and Flickr are unidirectional social networks; a bidirectional social network can be constructed for analysis by examining the reciprocal relations.)


Overlapped users are connected on both Twitter and Flickr.
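
The note on reciprocal relations can be made concrete with a small sketch. It assumes each OSN's links are available as (follower, followee) pairs and that userIDs on both OSNs have already been mapped to a common per-individual identifier; the sample edges and the function name `reciprocal_edges` are hypothetical and do not reflect the released file format (see readme.pdf for that).

```python
def reciprocal_edges(follow_edges):
    """Return undirected edges {a, b} for which both a->b and b->a exist."""
    directed = set(follow_edges)
    return {
        tuple(sorted((src, dst)))  # canonical order so (a, b) and (b, a) collapse
        for src, dst in directed
        if src != dst and (dst, src) in directed
    }


# Hypothetical (follower, followee) pairs keyed by a common individual id.
twitter_follows = [("u1", "u2"), ("u2", "u1"), ("u1", "u3"), ("u3", "u4"), ("u4", "u3")]
flickr_contacts = [("u1", "u2"), ("u2", "u1"), ("u3", "u4")]

twitter_mutual = reciprocal_edges(twitter_follows)
flickr_mutual = reciprocal_edges(flickr_contacts)

# Pairs of individuals who are mutually connected on both OSNs.
shared_mutual = twitter_mutual & flickr_mutual
print(sorted(twitter_mutual), sorted(flickr_mutual), sorted(shared_mutual))
```

Keeping each undirected edge as a sorted tuple makes the two networks directly comparable with ordinary set operations.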

Specifically, on Twitter, users’ unidirectional social links and their user profiles are provided; on Flickr, users’ friend lists, user profiles, and shared photo/group id collections are given. The basic statistics of the dataset are presented in the table below:

 #Overlapped users | #Average followers per Twitter user | #Average friends per Twitter user | #Average friends per Flickr user
 7,118             | 1,808                               | 1,032                             | 101

[Download]
Twitter Data
Flickr Data

Detailed description of the data format is available at: readme.pdf

More details about the dataset and relevant analysis can be found at:

  • Ming Yan, Jitao Sang, Tao Mei, Changsheng Xu
    Friendtransfer: Cold-start Friend Recommendation with Cross-platform Transfer Learning of Social Knowledge. [paper] [bibtex]  [project]
    ICME 2013, oral paper.
    
    @inproceedings{DBLP:conf/icmcs/YanSMX13,
      author    = {Ming Yan and
                   Jitao Sang and
                   Tao Mei and
                   Changsheng Xu},
      title     = {Friend transfer: Cold-start friend recommendation with cross-platform
                   transfer learning of social knowledge},
      booktitle = {IEEE ICME},
      pages     = {1--6},
      year      = {2013}
    }
    
    


CrossOSN-U: Event

[Overview]

This sub-dataset is constructed based on overlapped users' behaviors around common events between Twitter and YouTube. Twenty Google trending events from the year 2012 that have wide coverage on both Twitter and YouTube are selected. The events are listed as follows:

We identified the overlapped users who were involved in at least one of the selected events. For each event, the number of users involved in one or both OSNs is summarized below:


Overlapped users involved in different events on both Twitter and YouTube.
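
The per-event summary described above amounts to comparing, for each event, the set of users involved on Twitter with the set involved on YouTube. A minimal sketch follows, assuming the per-event user lists have been loaded into Python sets keyed by a common individual id; the event names, user ids, and the function name `involvement_counts` are made up for illustration.

```python
def involvement_counts(twitter_users, youtube_users):
    """Count users involved in an event on one OSN only versus on both."""
    return {
        "twitter_only": len(twitter_users - youtube_users),
        "youtube_only": len(youtube_users - twitter_users),
        "both": len(twitter_users & youtube_users),
    }


# Hypothetical events: (users involved on Twitter, users involved on YouTube).
events = {
    "event_a": ({"u1", "u2", "u3"}, {"u2", "u3", "u4"}),
    "event_b": ({"u1"}, set()),
}

summary = {
    name: involvement_counts(tw, yt) for name, (tw, yt) in events.items()
}
for name, counts in summary.items():
    print(name, counts)
```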

In total, 8,540 overlapped users between Twitter and YouTube are examined. This sub-dataset contains these users' historical video behaviors on YouTube and their historical tweeting behaviors on Twitter. The textual metadata for the involved YouTube videos are also included. The basic statistics of the dataset are presented in the table below:

 #Events | #Overlapped users | #Average video behaviors per YouTube user | #Average tweets per Twitter user
 20      | 8,540             | 82.7                                      | 998

[Download]
Twitter Data
  • Users’ historical tweeting behaviors on Twitter.
    As the original Twitter data cannot be released publicly, please send an email to ming.yan AT nlpr.ia.ac.cn or cheney8023 AT gmail.com if interested. Note that anyone who obtains this dataset must guarantee that the data is used for research purposes only.
  • The involved Twitter user list for each event.
YouTube Data

Detailed description of the data format is available at: readme.pdf

The following papers provide example research based on this sub-dataset: examining overlapped users' responses to the same events in different OSNs.

  • Ming Yan, Zhengyu Deng, Jitao Sang, Changsheng Xu
    User-Oriented Social Analysis across Social Media Sites. [paper] [bibtex]
    ICIAP 2013, oral paper.
    
    @inproceedings{DBLP:conf/iciap/YanDSX13,
      author    = {Ming Yan and
                   Zhengyu Deng and
                   Jitao Sang and
                   Changsheng Xu},
      title     = {User-Oriented Social Analysis across Social Media Sites},
      booktitle = {ICIAP},
      pages     = {482--490},
      year      = {2013}
    }
    
    
  • Zhengyu Deng, Ming Yan, Jitao Sang, Changsheng Xu
    Twitter is Faster: Personalized Time-aware Video Recommendation from Twitter to YouTube. [paper] [bibtex] [project]
    TOMM 11(2): 31:1-31:23, 2014.
    
    @article{DBLP:journals/tomccap/DengYSX14,
      author    = {Zhengyu Deng and
                   Ming Yan and
                   Jitao Sang and
                   Changsheng Xu},
      title     = {Twitter is Faster: Personalized Time-Aware Video Recommendation from
                   Twitter to YouTube},
      journal   = {{TOMM}},
      volume    = {11},
      number    = {2},
      pages     = {31:1--31:23},
      year      = {2014}
    }