In my research on influence maximization, I need datasets with learned influence probabilities. This post covers three things.

• Tools that I have been using.
• Related datasets.
• How I obtain the influence probabilities on edges.

Github repository: yishilin14/learn-influence-prob (code and datasets)

# Background

Goal: Learn influence probabilities in social networks. Paper: Goyal, A., Bonchi, F., & Lakshmanan, L. V. (2010, February). Learning influence probabilities in social networks. In Proceedings of the third ACM international conference on Web search and data mining (pp. 241-250). ACM.

Source code:

• http://www.cs.ubc.ca/~goyal/code-release.php
• Download the code for this paper: Amit Goyal, Francesco Bonchi, Laks V.S. Lakshmanan, A Data-based Approach to Social Influence Maximization, To appear in PVLDB 2012.
• For more details about the software, please refer to the readme file inside. I will only focus on how to use the tool to learn influence probabilities on edges.

Compilation:

• Run `make` (I am using Arch Linux & GCC 5.3.0)
• Solution to the error “‘getpid’ was not declared in this scope”: add `#include <unistd.h>`

Bernoulli distribution under the static model:

• Static model: independent of time and simple to learn
• Bernoulli distribution under the static model: $p_{v,u}= A_{v2u} / A_{v}$

• $p_{v,u}$: learned influence probability of v on u
• $A_{v2u}$: the number of actions propagated from v to u
• $A_{v}$: the number of actions v performed
• The first scan of the tool outputs $A_{v2u}$ and $A_{v}$.
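The estimate is just a ratio of the two counts. A minimal Python sketch (the count values here are made-up placeholders, not real tool output):

```python
# Bernoulli estimate under the static model: p_{v,u} = A_{v2u} / A_v.
# The counts below are made-up placeholders standing in for the
# A_{v2u} and A_v values produced by the tool's first scan.
A_v = {"v": 10}            # number of actions each user performed
A_v2u = {("v", "u"): 3}    # number of actions propagated along each edge

def influence_prob(v, u):
    """Return p_{v,u}, or 0.0 if v performed no actions."""
    if A_v.get(v, 0) == 0:
        return 0.0
    return A_v2u.get((v, u), 0) / A_v[v]

print(influence_prob("v", "u"))  # 0.3
```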

Now, we try to learn influence probabilities for some public datasets (with some quick-and-dirty scripts).

# Flixster Dataset

• Prepare the file “graph.txt”
• Append a zero column to links.txt
• Desired format: Each line contains “user_from user_to 0”.
• Prepare the file “actions_log.txt”
• Preprocess Ratings.timed.txt as follows
• Remove the first line
• Remove the third column (rating)
• Convert the date column (the last column) to a column of timestamps
• Sort the action log by action id, keeping the tuples of each action in chronological order.
• Desired format: Each line contains “user_id action_id timestamp”.
• Prepare the file “actions_ids.txt”
• Desired format: Each line contains an action id
• Misc.
• Clean the file: iconv -f utf-8 -t utf-8 -c Ratings.timed.txt > ratings.timed.txt
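The preprocessing steps above can be sketched in Python. The input column order (user, movie, rating, date) and the date format `%Y-%m-%d` are assumptions about Ratings.timed.txt, so adjust them to the actual file:

```python
import time

def to_timestamp(date_str):
    """Convert a date like 2006-01-19 to a Unix timestamp (local time).
    The %Y-%m-%d format is an assumption about the dataset."""
    return int(time.mktime(time.strptime(date_str, "%Y-%m-%d")))

def preprocess(ratings_path, log_path, ids_path):
    """Build actions_log.txt and actions_ids.txt from the ratings file."""
    rows = []
    with open(ratings_path) as f:
        next(f)  # remove the first (header) line
        for line in f:
            parts = line.split()
            if len(parts) < 4:
                continue  # skip malformed lines
            user, movie, _rating, date = parts[:4]  # drop the rating column
            rows.append((user, int(movie), to_timestamp(date)))
    # sort by action id; tuples of the same action end up in chronological order
    rows.sort(key=lambda r: (r[1], r[2]))
    with open(log_path, "w") as out:
        for user, movie, ts in rows:
            out.write(f"{user} {movie} {ts}\n")
    with open(ids_path, "w") as out:
        for movie in sorted({r[1] for r in rows}):
            out.write(f"{movie}\n")

# usage: preprocess("ratings.timed.txt", "actions_log.txt", "actions_ids.txt")
```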

## Codes for converting the output file

• Run: `./InfluenceModels -c config.txt`
• We then convert the output file “edgesCounts.txt” into a file “inf_prob.txt” containing a set of directed edges with their learned probabilities.
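A hedged sketch of that conversion, assuming each line of “edgesCounts.txt” is “v u A_v2u A_v” (the actual column layout may differ; check the tool's readme):

```python
def convert(counts_path, out_path):
    """Write "v u p" lines, where p = A_{v2u} / A_v (Bernoulli, static model).
    Assumes an input line layout of "v u A_v2u A_v", which may not match
    the real edgesCounts.txt -- verify against the tool's readme."""
    with open(counts_path) as f, open(out_path, "w") as out:
        for line in f:
            parts = line.split()
            if len(parts) < 4:
                continue  # skip malformed lines
            v, u = parts[0], parts[1]
            a_v2u, a_v = int(parts[2]), int(parts[3])
            if a_v > 0:  # skip users with no recorded actions
                out.write(f"{v} {u} {a_v2u / a_v:.6f}\n")

# usage: convert("edgesCounts.txt", "inf_prob.txt")
```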

My config file for InfluenceModels

Now, we build a weighted directed graph whose edge weights are the learned influence probabilities. Note that I keep only the largest weakly connected component.
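Extracting the largest weakly connected component needs no external libraries; a small BFS sketch over the weighted edge list (the triple format here is an illustration, not the exact file format):

```python
from collections import defaultdict, deque

def largest_wcc(edges):
    """Keep only the edges whose endpoints lie in the largest
    weakly connected component. `edges` is a list of (v, u, prob)."""
    adj = defaultdict(set)
    for v, u, _p in edges:
        adj[v].add(u)
        adj[u].add(v)  # treat edges as undirected for weak connectivity
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        seen.add(start)
        while queue:  # BFS over the undirected view
            x = queue.popleft()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    comp.add(y)
                    queue.append(y)
        if len(comp) > len(best):
            best = comp
    return [e for e in edges if e[0] in best and e[1] in best]
```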

# Other datasets

Here are several datasets for which both user links and user action logs are available. You will find the corresponding scripts in my Github repo.