Learning Influence Probabilities In Social Networks

In my research about influence maximization, I need datasets with learned influence probabilities. This post is about two things.

  • Tools that I have been using.
  • Related datasets.
  • How I obtain the influence propbabilities on edges.

Github repository: yishilin14/learn-influence-prob (codes and datasets)

Background

Goal: Learn influence probabilities in social networks. Paper: Goyal, A., Bonchi, F., & Lakshmanan, L. V. (2010, February). Learning influence probabilities in social networks. In Proceedings of the third ACM international conference on Web search and data mining (pp. 241-250). ACM.

Source Codes:

  • http://www.cs.ubc.ca/~goyal/code-release.php
  • Download codes for this paper: Amit Goyal, Francesco Bonchi, Laks V.S. Lakshmanan, A Data-based approach to Social Influence Maximization, To appear in PVLDB 2012.
  • For more details about the software, please refer to the readme file inside. I will only focus on how to use the tool to learn influence probabilities on edges.

Compilation:

  • “Make” (I am using Archlinux & GCC5.3.0)
  • Solution to the error “‘getpid’ was not declared in this scope”: Add #include<unistd.h>

Bernoulli distribution under the static model:

  • Static model: independent of time and simply to learn
  • Bernoulli distribution under the static model: $p_{v,u}= A_{v2u} / A_{v}$

    • $p_{v,u}$: learned influence probability of v on u
    • $A_{v2u}$: the number of actions propagated from v to u
    • $A_{v}$: the number of actions v performed
    • The first scan of the tool outputs $A_{v2u}$ and $A_{v}$.

Now, we try to learn influence probabilities for some public datasets (with dirty codes).

Flixster Dataset

  • Download here: http://www.cs.ubc.ca/~jamalim/datasets/ (Update on June 6,2017: The link is no longer available. If you know the new link, please let me know. I put the dataset that I downloaded before in my Git repo.)
    • Download the main dataset
    • Download “The ratings in Flixster are associated with timestamps”
  • Prepare the file “graph.txt”
    • Append a zero column to links.txt
    • Desired format: Each line contains “user_from user_to 0”.
  • Prepare the file “actions_log.txt”
    • Preprocess Ratings.time.txt as follows
    • Remove the first line
    • Remove the third column (rating)
    • Convert the date column (the last column) to a column of timestamps
    • Sort action logs on action-ids and tuples on an action are chronologically ordered.
    • Desired format: Each line contains “user_id action_id timestamp”.
  • Prepare the file “actions_ids.txt”
    • Desired format: Each line contains an action id
  • Misc.
    • Clean the file: iconv -f utf-8 -t utf-8 -c Ratings.timed.txt > ratings.timed.txt

The preprocessing step

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
#!/usr/bin/env python
import time
from datetime import date, datetime
from igraph import *

# Input: links.txt (undirected)
# Output grpah.txt
# Read the graph
g = Graph.Read_Edgelist("raw/links.txt", directed=False)
print("Number of nodes: ", g.vcount())
print("Number of undirected edges: ", g.ecount())
# Simplify
g = g.simplify()
g.es["weight"] = 0 # dummy weight
print("After simplify:")
print("Number of nodes: ", g.vcount())
print("Number of undirected edges: ", g.ecount())
# Output the undirected graph (with weights)
g.write_ncol("learn/graph_undir.txt", names=None)
g.to_directed(mutual=True)
# Output the directed graph (with weights)
print("Number of directed edges: ", g.ecount())
g.write_ncol("learn/graph_dir.txt", names=None)

# Input: ratings.timed.txt
# Output: actions.txt
fin_rating = open("raw/ratings.timed.txt", "r")
line = fin_rating.readline() # skip the first line
movie_set = set()
action_logs = []
for line in fin_rating:
arr = line.replace('\00', '').split()
if len(arr) == 5 :
user = int(arr[0])
movie = int(arr[1])
movie_set.add(movie)
timestamp = int(datetime.strptime(arr[3], '%Y-%m-%d').strftime("%s"))
action_logs.append([user, movie, timestamp])
fin_rating.close()
print("Number of logs: ", len(action_logs))
print("Number of action ids: ", len(movie_set))

# Sort action logs and output
print("Sorting action_logs...")
action_logs.sort(key = lambda t:(t[1],t[0],t[2]))
print("Writing action_logs...")
with open("learn/action_logs.txt", "w") as fout_action:
for line in action_logs:
if line[2] > 1000000000:
fout_action.write("%d %d %d\n" % (line[0], line[1], line[2]))

# Output all action-ids
with open("learn/action_ids.txt", "w") as fout_actionid:
for id in movie_set:
fout_actionid.write("%d\n" % id)

Codes for converting the output file

  • Run:./InfluenceModels -c config.txt
  • We convert file "edgesCounts.txt" to a file "inf_prob.txt" containing a set of directed edges with probabilities, using codes as follows.

My config file for InfluenceModels

1
2
3
4
5
6
7
8
9
# Config file for learning paramters (Scan 1)

phase : 1
computeUserInf : 0
graphFile : learn/graph_undir.txt
actionsFile : learn/action_logs.txt
outdir : learn
maxTuples : 0
trainingActionsFile : learn/action_ids.txt

Now, we try to get a weighted directed graph. The weights are the learned influence probabilities. BTW, I only keep the largest weakly connected component.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
#!/usr/bin/env python
from igraph import *

# Read the directed graph (output of preprocess.py)
g = Graph.Read_Ncol("learn/graph_dir.txt", directed=True)
g.es["weight"] = 0.0
print("Number of nodes: ", g.vcount())
print("Number of directed edges: ", g.ecount())

# Read actions performed by each user
print("Loading usersCounts...")
f_nodes = open("learn/usersCounts.txt", "r")
user_action = {}
for line in f_nodes:
v, a_v = [int(x) for x in line.split()]
user_action[v] = a_v
f_nodes.close()

# Read edges, and replace the weights in g by influence probabilities
print("Loading edgesCounts...")
f_edges = open("learn/edgesCounts.txt", "r")
for line in f_edges:
line = line.split()
u, v, a_u2v, a_v2u = [int(x) for x in line[0:4]]
if u in user_action.keys() and v in user_action.keys():
# u -> v
uid = g.vs.find(str(u))
vid = g.vs.find(str(v))
eid_uv = g.get_eid(uid, vid, error=False)
if eid_uv >=0 and user_action[u] > 0:
g.es[eid_uv]["weight"] = round(float(a_u2v) / user_action[u], 6)
# v -> u
eid_vu = g.get_eid(vid, uid, error=False)
if eid_vu >= 0 and user_action[v] > 0:
g.es[eid_vu]["weight"] = round(float(a_v2u) / user_action[v], 6)
f_edges.close()

# Delete edges with zero influence probabilities
g.delete_edges(g.es.select(weight_le=0))
print("Number of edges after deletion: ", g.ecount())

lwcc = g.clusters(mode='weak').giant()
lwcc.vs["name"] = range(lwcc.vcount())
print("Largest weakly connected component")
print("Number of nodes: ", lwcc.vcount(), float(lwcc.vcount()) / g.vcount())
print("Number of directed edges: ", lwcc.ecount())

# Output to files
with open("flixster_learn/attribute.txt", "w") as f_attr:
f_attr.write("n=%d\n" % lwcc.vcount())
f_attr.write("m=%d\n" % lwcc.ecount())

lwcc.write_ncol("flixster_learn/graph_ic.inf")
with open("flixster_learn/graph_ic_nm.inf", "w") as f_graph_nm:
f_graph_nm.write("%d %d\n" % (lwcc.vcount(), lwcc.ecount()))
with open("flixster_learn/graph_ic.inf", "r") as f_tmp:
f_graph_nm.write(f_tmp.read())

Other datasets

Here are several datasets that both user links and user action logs are available. You will find corresponding scripts in my Github repo.