In my research about influence maximization, I need datasets with learned influence probabilities. This post is about two things.

Tools that I have been using.
Related datasets.
How I obtain the influence propbabilities on edges.

Github repository: yishilin14/learn-influence-prob (codes and datasets)

Background

Goal: Learn influence probabilities in social networks.
Paper: Goyal, A., Bonchi, F., & Lakshmanan, L. V. (2010, February). Learning influence probabilities in social networks. In _Proceedings of the third ACM international conference on Web search and data mining_ (pp. 241-250). ACM.

Source Codes:

http://www.cs.ubc.ca/~goyal/code-release.php
Download codes for this paper: Amit Goyal, Francesco Bonchi, Laks V.S. Lakshmanan, _A Data-based approach to Social Influence Maximization_, To appear in PVLDB 2012.
For more details about the software, please refer to the readme file inside. I will only focus on how to use the tool to learn influence probabilities on edges.

Compilation:

“Make” (I am using Archlinux & GCC5.3.0)
Solution to the error “‘getpid’ was not declared in this scope”: Add #include<unistd.h>

Bernoulli distribution under the static model:

Static model: independent of time and simply to learn
Bernoulli distribution under the static model: $p_{v,u}= A_{v2u} / A_{v}$
- $p_{v,u}$: learned influence probability of v on u
- $A_{v2u}$: the number of actions propagated from v to u
- $A_{v}$: the number of actions v performed
- The first scan of the tool outputs $A_{v2u}$ and $A_{v}$.

Now, we try to learn influence probabilities for some public datasets (with dirty codes).

Flixster Dataset

Download here: http://www.cs.ubc.ca/~jamalim/datasets/ (Update on June 6,2017: The link is no longer available. If you know the new link, please let me know. I put the dataset that I downloaded before in my Git repo.)
- Download the main dataset
- Download “The ratings in Flixster are associated with timestamps”
Prepare the file “graph.txt”
- Append a zero column to links.txt
- Desired format: Each line contains “user_from user_to 0”.
Prepare the file “actions_log.txt”
- Preprocess Ratings.time.txt as follows
- Remove the first line
- Remove the third column (rating)
- Convert the date column (the last column) to a column of timestamps
- Sort action logs on action-ids and tuples on an action are chronologically ordered.
- Desired format: Each line contains “user_id action_id timestamp”.
Prepare the file “actions_ids.txt”
- Desired format: Each line contains an action id
Misc.
- Clean the file: iconv -f utf-8 -t utf-8 -c Ratings.timed.txt > ratings.timed.txt

The preprocessing step

#!/usr/bin/env python
import time
from datetime import date, datetime
from igraph import *

# Input: links.txt (undirected)
# Output grpah.txt
# Read the graph
g = Graph.Read_Edgelist("raw/links.txt", directed=False)
print("Number of nodes: ", g.vcount())
print("Number of undirected edges: ", g.ecount())
# Simplify
g = g.simplify()
g.es["weight"] = 0  # dummy weight
print("After simplify:")
print("Number of nodes: ", g.vcount())
print("Number of undirected edges: ", g.ecount())
# Output the undirected graph (with weights)
g.write_ncol("learn/graph_undir.txt", names=None)
g.to_directed(mutual=True)
# Output the directed graph (with weights)
print("Number of directed edges: ", g.ecount())
g.write_ncol("learn/graph_dir.txt", names=None)

# Input: ratings.timed.txt
# Output: actions.txt
fin_rating = open("raw/ratings.timed.txt", "r")
line = fin_rating.readline()  # skip the first line
movie_set = set()
action_logs = []
for line in fin_rating:
    arr = line.replace('\00', '').split()
    if len(arr) == 5 :
        user = int(arr[0])
        movie = int(arr[1])
        movie_set.add(movie)
        timestamp = int(datetime.strptime(arr[3], '%Y-%m-%d').strftime("%s"))
        action_logs.append([user, movie, timestamp])
fin_rating.close()
print("Number of logs: ", len(action_logs))
print("Number of action ids: ", len(movie_set))

# Sort action logs and output
print("Sorting action_logs...")
action_logs.sort(key = lambda t:(t[1],t[0],t[2]))
print("Writing action_logs...")
with open("learn/action_logs.txt", "w") as fout_action:
    for line in action_logs:
        if line[2] > 1000000000:
            fout_action.write("%d %d %d\n" % (line[0], line[1], line[2]))

# Output all action-ids
with open("learn/action_ids.txt", "w") as fout_actionid:
    for id in movie_set:
        fout_actionid.write("%d\n" % id)

Codes for converting the output file

Run:./InfluenceModels -c config.txt
We convert file "edgesCounts.txt" to a file "inf_prob.txt" containing a set of directed edges with probabilities, using codes as follows.

My config file for InfluenceModels

# Config file for learning paramters (Scan 1)

phase : 1
computeUserInf : 0
graphFile : learn/graph_undir.txt
actionsFile : learn/action_logs.txt
outdir : learn 
maxTuples : 0
trainingActionsFile : learn/action_ids.txt

Now, we try to get a weighted directed graph. The weights are the learned influence probabilities. BTW, I only keep the largest weakly connected component.

#!/usr/bin/env python
from igraph import *

# Read the directed graph (output of preprocess.py)
g = Graph.Read_Ncol("learn/graph_dir.txt", directed=True)
g.es["weight"] = 0.0
print("Number of nodes: ", g.vcount())
print("Number of directed edges: ", g.ecount())

# Read actions performed by each user
print("Loading usersCounts...")
f_nodes = open("learn/usersCounts.txt", "r")
user_action = {}
for line in f_nodes:
    v, a_v = [int(x) for x in line.split()]
    user_action[v] = a_v
f_nodes.close()

# Read edges, and replace the weights in g by influence probabilities
print("Loading edgesCounts...")
f_edges = open("learn/edgesCounts.txt", "r")
for line in f_edges:
    line = line.split()
    u, v, a_u2v, a_v2u = [int(x) for x in line[0:4]]
    if u in user_action.keys() and v in user_action.keys():
        # u -> v
        uid = g.vs.find(str(u))
        vid = g.vs.find(str(v))
        eid_uv = g.get_eid(uid, vid, error=False)
        if eid_uv >=0 and user_action[u] > 0:
            g.es[eid_uv]["weight"] = round(float(a_u2v) / user_action[u], 6)
        # v -> u
        eid_vu = g.get_eid(vid, uid, error=False)
        if eid_vu >= 0 and user_action[v] > 0:
            g.es[eid_vu]["weight"] = round(float(a_v2u) / user_action[v], 6)
f_edges.close()

# Delete edges with zero influence probabilities
g.delete_edges(g.es.select(weight_le=0))
print("Number of edges after deletion: ", g.ecount())

lwcc = g.clusters(mode='weak').giant()
lwcc.vs["name"] = range(lwcc.vcount())
print("Largest weakly connected component")
print("Number of nodes: ", lwcc.vcount(), float(lwcc.vcount()) / g.vcount())
print("Number of directed edges: ", lwcc.ecount())

# Output to files
with open("flixster_learn/attribute.txt", "w") as f_attr:
    f_attr.write("n=%d\n" % lwcc.vcount())
    f_attr.write("m=%d\n" % lwcc.ecount())

lwcc.write_ncol("flixster_learn/graph_ic.inf")
with open("flixster_learn/graph_ic_nm.inf", "w") as f_graph_nm:
    f_graph_nm.write("%d %d\n" % (lwcc.vcount(), lwcc.ecount()))
    with open("flixster_learn/graph_ic.inf", "r") as f_tmp:
        f_graph_nm.write(f_tmp.read())

Other datasets

Here are several datasets that both user links and user action logs are available. You will find corresponding scripts in my Github repo.

Digg 2009: http://www.isi.edu/integration/people/lerman/downloads.html
Twitter 2010: http://www.isi.edu/~lerman/downloads/twitter/twitter2010.html
Flickr 2009: http://socialnetworks.mpi-sws.org/data-www2009.html