This page contains links to selected dataset collections that I’ve found. Feel free to email me if you have any suggestions!
Social Network Analysis
Stanford Large Network Dataset Collection
[SNAP is the best!] A substantial collection of data sets describing large networks.
Datasets for Social Network Analysis (Aminer.org)
Microblogging networks, patent data set, online social networks, knowledge linking dataset, mobile dataset, etc.
Network Data Repository
“The first interactive data and network repository with real-time analytics.”
konect - The Koblenz Network Collection
“KONECT (the Koblenz Network Collection) is a project to collect large network datasets of all types in order to perform research in network science and related fields, collected by the Institute of Web Science and Technologies at the University of Koblenz–Landau. KONECT contains over a hundred network datasets of various types, including directed, undirected, bipartite, weighted, unweighted, signed and rating networks.” (From the website.)
Social Computing Data Repository at ASU - Datasets
Network datasets collected from famous websites including BlogCatalog, Buzznet, Delicious, Digg, Douban, Flickr, Flixster, Last.fm, Twitter, YouTube and so on. Some datasets contain both the contact network and selected group membership information. (Most datasets contain around 100k nodes.)
Datasets | Tore Opsahl
Datasets collected by Tore Opsahl (in tnet format, some also in UCINET format). The networks are small, ranging from 32 to 16,726 nodes.
BGU Social Networks Security Research Group
The OSN dataset collection of the BGU Social Networks Security Research Group. It contains directed networks (Anybeat, Academia.edu, Google+), undirected networks (TheMarker Cafe), multigraph networks (Students Network, WikiTree), and several Facebook datasets.
Social Computing Research @ MPI-SWS
- Flickr, LiveJournal, Orkut, YouTube (users, links, groups, group members)
Other sources of network data
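Many of the collections above distribute their graphs as plain-text edge lists, one edge per line. As a rough sketch of how such a file might be read, here is a small Python parser. The comment prefixes ("#" and "%"), the whitespace-separated column layout, and the sample data are my assumptions rather than a documented format for any particular repository, so check the README of the dataset you download.

```python
import io


def read_edge_list(lines, comment_prefixes=("#", "%")):
    """Parse whitespace-separated edge-list lines into an adjacency dict.

    Assumes the first two columns are source and target node IDs;
    any further columns (weights, timestamps) are ignored.
    """
    adjacency = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and header/comment lines.
        if not line or line.startswith(comment_prefixes):
            continue
        source, target = line.split()[:2]
        adjacency.setdefault(source, set()).add(target)
        # Register the target too, so nodes without outgoing edges are counted.
        adjacency.setdefault(target, set())
    return adjacency


# Hypothetical sample data, made up for illustration.
sample = io.StringIO(
    "# Directed graph: hypothetical example\n"
    "# FromNodeId\tToNodeId\n"
    "0\t1\n"
    "0\t2\n"
    "1\t2\n"
)

graph = read_edge_list(sample)
num_nodes = len(graph)
num_edges = sum(len(targets) for targets in graph.values())
print(num_nodes, num_edges)  # prints "3 3"
```

For a real file, pass an open file object instead of the sample, e.g. `with open(path) as f: graph = read_edge_list(f)`. The sketch treats every edge as directed; for an undirected network you would also add the reverse edge.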
Causal Inference
- The famous Lalonde dataset: available almost everywhere. For example, load it in R with data(lalonde, package="MatchIt").
- Right Heart Catheterization Dataset
- Datasets for the Atlantic Causal Inference Conference Competition (2016/2017): The GitHub repository contains code to generate the datasets.
- National Collaborative Perinatal Project: A study that was conducted on pregnant women and their children with the aim of identifying causal factors leading to developmental disorders. There are 6,700 data items on the approximately 58,000 study pregnancies.
Last updated: 2019/5/21