Social Network Analysis and
Graph Algorithms for the Web

List of accepted papers :

  • Mapping the Invocation Structure of Online Political Interaction
    Authors: Manish Raghavan, Ashton Anderson and Jon Kleinberg

    Keywords: online polarization, online news, reply graphs

    The surge in political information, discourse, and interaction has been one of the most important developments in social media over the past several years. Much of the contention takes place not just in the positioning of content at diverse points on an ideological spectrum, but also in the interaction among different viewpoints on this spectrum. However, we still have only a limited analytical vocabulary for expressing the ways in which these viewpoints interact. In this paper, we develop network-based methods that operate on a superposition of users and the content that they share; we construct emph{invocation graphs} on Web domains showing the extent to which links from one domain are invoked by users to reply to posts containing links from other domains. When the domains are located on a political spectrum induced from the data, we obtain an embedded graph showing how these interaction links span different distances on the spectrum. The structure of this embedded network, and its evolution over time, helps us derive macro-level insights about how political interaction unfolded through 2016, leading up to the US Presidential election. In particular, we find that the domains invoked in replies spanned increasing distances on the spectrum over the months approaching the election, and that there was clear asymmetry between the left-to-right and right-to-left patterns of linkage.

  • Modeling Success and Engagement for the App Economy
    Authors: Haim Mendelson and Ken Moon

    Keywords: Apps, Diffusion, Experience curve, Hidden Markov model, Time series, User adoption, User engagement and retention

    While apps on the whole increasingly drive and shape present-day Web usage, they individually thrive and fall based on rich patterns of user adoption and long-term engagement. In studying these patterns at the population level, researchers remain curtailed by practical difficulties in accessing consistently detailed data covering large, representative cross-sections of apps. We address this challenge by proposing an empirical framework for analyzing an app’s patterns of adoption and engagement. Importantly, our approach requires obtaining only an app’s daily, weekly, and monthly usership time-series data, which are popular metrics tracked and made available for many apps. Modeling how the nuanced co-dynamics of an app’s daily, weekly, and monthly usership measures (DAU, WAU, and MAU) reliably reveal user adoption and repeat engagement, we extend and demonstrably improve on the predictive performance of prior install- or DAU-based methods. To further study the mechanisms by which successful apps cultivate their userbases, we position apps along the dual axes of viral adoption and retentive engagement as mechanisms to potentially explain success. Applying our approach to data on Facebook apps, we show that these apps over time became less viral and more engaging. Interestingly, despite their diminished virality, their improved user retention subtly and counter-intuitively boosted these apps’ rates of new user adoptions through user-activity-based word of mouth, because an engaged user contributes to word of mouth over a longer period of time. In a final case application, we introduce evidence that developers of apps that learn to successfully retain their users carry this valuable experience over into their new apps.

  • Minimizing Latency in Online Ride and Delivery Services
    Authors: Abhimanyu Das, Sreenivas Gollapudi, Anthony Kim, Debmalya Panigrahi and Chaitanya Swamy

    Keywords: vehicle routing problems, minimum latency problems, online ride services, online delivery services

    We study natural variants of the classical multi-vehicle minimum latency problems where the objective is to route a set of vehicles located at depots to serve requests located at different points in a metric space so as to minimize the total latency. In this paper, we consider point-to-point requests that come with source-destination pairs and release-time constraints that restrict when each request can be served. The point-to-point requests and release-time constraints model taxi rides and deliveries. For all the variants considered, we show constant-factor approximation algorithms based on a linear programming framework. To the best of our knowledge, these are the first set of results for the aforementioned variants of the minimum latency problems. Furthermore, we provide an empirical study of heuristics based on our theoretical algorithms on a real data set of taxi rides.

  • Minimizing Polarization and Disagreement in Social Networks
    Authors: Cameron Musco, Christopher Musco and Charalampos Tsourakakis

    Keywords: Social Networks, Opinion Dynamics, Polarization, Optimization

    The rise of social media and online social networks has been a disruptive force in society. Opinions are increasingly shaped by interactions on online social media, and social phenomena including disagreement and polarization are now tightly woven into everyday life. In this work we initiate the study of the following question: “ Given n agents, each with its own initial opinion that reflects its core value on a topic, and an opinion dynamics model, what is the structure of a social network that minimizes polarization and disagreement simultaneously? ” This question is central to recommender systems: should a recommender system prefer a link suggestion between two online users with similar mindsets in order to keep disagreement low, or between two users with different opinions in order to expose each to the others viewpoint of the world, and decrease overall levels of polarization? Such decisions have an important global effect on society williams2007social. Our contributions include a mathematical formalization of this question as an optimization problem and an exact, time-efficient algorithm. We also prove that there always exists a graph with O(n/epsilon^2) edges that is a (1+epsilon) approximation to the optimum. Our formulation is an instance of optimization over graph topologies, also considered e.g., in boyd2004fastest,daitch2009fitting,sun2006fastest. For a fixed graph, we additionally show how to optimize our objective function over the agents’ innate opinions in polynomial time. We perform an empirical study of our proposed methods on synthetic and real-world data that verify their value as mining tools to better understand the trade-off between of disagreement and polarization. We find that there is a lot of space to reduce both polarization and disagreement in real-world networks; for instance, on a Reddit network where users exchange comments on politics, our methods achieve a ?60,000-fold reduction in polarization and disagreement. Our code is available at

  • Fast Exact CoSimRank Search on Evolving and Static Graphs
    Authors: Weiren Yu and Fan Wang

    Keywords: CoSimRank, similarity search, link analysis, algorithm, optimisation, evolving graphs

    In real Web applications, CoSimRank has been proposed as a powerful measure of node-pair similarity based on graph topologies. It recursively captures the notion that “two nodes are considered as similar if their in-neighbours are similar”, but the similarity of each node with itself is not constantly 1, which is different from SimRank. However, existing work on CoSimRank is restricted to static graphs, where each node-pair CoSimRank score is retrieved from the sum of the dot product of two Personalised PageRank vectors. When the graph is updated with new edges arriving over time, this approach becomes impractical, due to its cost-inhibitive overheads for recomputing CoSimRank scores from scratch. In this study, we propose a fast dynamic scheme, D-CoSim, for accurate CoSimRank search over evolving graph streams. Based on D-CoSim, we also propose a fast scheme, F-CoSim, that greatly accelerates CoSimRank search over static graphs. Our theoretical analysis shows that D-CoSim and F-CoSim guarantee the exactness of CoSimRank scores. On the static graph G, to efficiently retrieve CoSimRank scores S, F-CoSim is based on three ideas: (i) It first finds a “spanning polytree” T over G. (ii) On T, a fast algorithm is designed to compute the CoSimRank scores ST over the “spanning polytree”. (iii) On G, D-CoSim is employed to compute the changes of S(T) in response to the delta graph (G-T). Experimental evaluations verify the superiority of D-CoSim over evolving graphs, and the fast speedup of F-CoSim on large-scale static graphs against its competitors, without any loss of accuracy.

  • Listing $k$-cliques in Sparse Real-World Graphs
    Authors: Maximilien Danisch, Oana Denisa Balalau and Mauro Sozio

    Keywords: cliques, parallel algorithms, dense subgraphs

    Motivated by recent studies in the data mining community, we develop the most efficient parallel algorithm for listing all k-cliques in a graph. Our theoretical analysis shows that our algorithm boasts the best asymptotic upper bound on the running time for the case when the input graph is sparse. Our experimental evaluation on large real-world graphs demonstrates that our parallel algorithm is faster than state-of-the-art algorithms, while boasting an excellent degree of parallelism. In particular, we are able to list all k-cliques (for any value of k) in graphs containing up to tens of millions of edges as well as all 10-cliques in graphs containing billions of edges, within a few minutes and a few hours respectively. We show how it can be employed as an effective subroutine for finding the k-clique core decomposition and an approximate k-clique densest subgraphs in very large real-world graphs.

  • TIMES: Temporal Information Maximally Extracted from Structures
    Authors: Abram Magner, Jithin Sreedharan, Ananth Grama and Wojciech Szpankowski

    Keywords: statistical inference, dynamic networks, random graphs

    Inferring the node arrival sequence from a snapshot of a dynamic network is an important problem, with applications ranging from identifying sources of contagion to flow of capital in financial transaction networks. Variants of this problem have received significant recent research attention, including results on infeasibility of solution for prior formulations. We present a new formulation of the problem that admits probabilistic solutions for broad classes of dynamic network models. Instantiating our framework for a preferential attachment model, we present effectively computable and practically tight bounds on the tradeoff curve between optimal achievable precision and density/recall. We also present efficient algorithms for partial recovery of node arrival orders and derive theoretical and empirical performance bounds on the precision and density/recall of our methods in comparison to the best possible. We validate our methods through experiments on both synthetic and real networks to show that their performance is robust to model changes, and that they yield excellent results in practice. We also demonstrate their utility in the context of a novel application in analysis of the human brain connectome to draw new insights into the functional and structural organization and evolution of the human brain.

  • Any-k: Anytime Top-k Tree Pattern Retrieval in Labeled Graphs
    Authors: Xiaofeng Yang, Deepak Ajwani, Wolfgang Gatterbauer, Patrick K. Nicholson, Mirek Riedewald and Alessandra Sala

    Keywords: top-k, heterogeneous information networks, graph pattern matching, database

    Many problems in areas as diverse as recommendation systems, social network analysis, semantic search and distributed root cause analysis can be modeled as graph query patterns on heterogeneous information networks. A fundamental challenge is, given a large network and a graph query pattern with node and edge label constraints, to find the top-k matches according to a monotone ranking function over edge and node weights. We focus on sum-of-weights for acyclic queries over arbitrary graphs. Furthermore, we consider the more challenging scenario where k is not given in advance, but the user interactively requests matching patterns in order of ranking. Sub-graph isomorphism is hard in general, even for path queries. We are therefore interested in practical algorithms that effectively exploit (1) properties of heterogeneous networks, in particular highly selective constraints on nodes and edges, and (2) that the user will often explore only a fraction of the top-ranked results. Our solution, {ouralgorithmname}, carefully integrates aggressive pruning that leverages the acyclic nature of the query, and incremental guided search. It enables us to prove strong non-trivial time and space guarantees, which is generally considered very hard for this type of non-trivial graph search problem. Through experimental studies we show that {ouralgorithmname} achieves running times in the order of milliseconds for tree patterns on large networks with millions of nodes and edges.

  • Dual Graph Convolutional Networks for Graph-Based Semi-Supervised Classification
    Authors: Chenyi Zhuang and Qiang Ma

    Keywords: Graph convolutional networks, Semi-supervised learning, Graph diffusion, Adjacency matrix, Pointwise mutual information

    The problem of extracting meaningful data through graph analysis spans a range of different fields, such as the internet, social networks, biological networks, and many others. The importance of being able to effectively mine and learn from such data continues to grow as more and more structured data become available. In this paper, we present a simple and scalable semi-supervised learning method for graph-structured data in which only a very small portion of the training data are labeled. To sufficiently embed the graph knowledge, our method performs graph convolution from different views of the raw data. In particular, a dual graph convolutional neural network method is devised to jointly consider the two essential assumptions of semi-supervised learning: (1) local consistency and (2) global consistency. Accordingly, two convolutional neural networks are devised to embed the local-consistency-based and global-consistency-based knowledge, respectively. Given the different data transformations from the two networks, we then introduce an unsupervised temporal loss function for the ensemble. In experiments using both unsupervised and supervised loss functions, our method outperforms state-of-the-art techniques on different datasets.

  • RaRE: Social Rank Regulated Large-scale Network Embedding
    Authors: Yupeng Gu, Yizhou Sun, Yanen Li and Yang Yang

    Keywords: network embedding, social rank, representation learning

    Network embedding algorithms that map nodes in a network into a low-dimensional vector space are prevalent in recent years, due to their superior performance in many network-based tasks, such as clustering, classification, and link prediction. The main assumption of existing algorithms is that the learned latent representation for nodes should preserve the structure of the network, in terms of first-order or higher-order connectivity. In other words, nodes that are more similar will have higher probability to connect to each other. This phenomena is typically explained as homophily in network science. However, there is another factor usually neglected by the existing embedding algorithms, which is the popularity of a node. For example, celebrities in a social network usually receive numerous followers, which cannot be fully explained by the similarity of the two users. We denote this factor with the terminology “social rank”. We then propose a network embedding model that considers both of the two factors in link generation, and learn proximity-based embedding and social rank-based embedding separately. Rather than simply treating these two factors independent with each other, a carefully designed link generation model is proposed, which explicitly models the interdependency between these two types of embeddings. Experiments on several real-world datasets across different domains demonstrate the superiority of our novel network embedding model over the state-of-the-art methods.

  • Low rank spectral network alignment
    Authors: Huda Nassar, Nate Veldt, Shahin Mohammadi, Ananth Grama and David F. Gleich

    Keywords: Network Alignment, Graph Matching, Low rank matrix, Eigenvector, Bipartite Matching

    Network alignment or graph matching is the classic problem of finding matching vertices between two graphs with applications in network de-anonymization and bioinformatics. There exist a wide variety of algorithms for it, but a challenging scenario for all of the algorithms is aligning two networks without any information about which nodes might be good matches. In this case, the vast majority of principled algorithms demand quadratic memory in the size of the graphs. We show that one such method—the recently proposed and theoretically grounded EigenAlign algorithm—admits a novel implementation which requires memory that is linear in the size of the graphs. The key step to this insight is identifying low-rank structure in the node-similarity matrix used by EigenAlign for determining matches. With an exact, closed-form low-rank structure, we then solve a maximum weight bipartite matching problem on that low-rank matrix to produce the matching between the graphs. For this task, we show a new, a-posteriori, approximation bound for a simple algorithm to approximate a maximum weight bipartite matching problem on a low-rank matrix. The combination of our two new methods then enables us to tackle much larger network alignment problems than previously possible and to do so quickly. Problems that take hours with existing methods take only seconds with our new algorithm. We thoroughly validate our low-rank algorithm against the original EigenAlign approach. We also compare a variety of existing algorithms on problems in bioinformatics and social networks. Our approach can also be combined with existing algorithms to improve their performance and speed.

  • Spectral Algorithms for Temporal Graph Cuts
    Authors: Arlei Silva, Ambuj Singh and Ananthram Swami

    Keywords: Graph Mining, Spectral Theory, Dynamic Networks

    The sparsest cut problem consists of identifying a small set of edges that breaks the graph into balanced sets of vertices. The normalized cut problem balances the total degree, instead of the size, of the resulting sets. Applications of graph cuts include community detection and computer vision. However, cut problems were originally proposed for static graphs, an assumption that does not hold in many modern applications where graphs are highly dynamic. In this paper, we introduce the sparsest and normalized cut problems in temporal graphs, which generalize their standard definitions by enforcing the smoothness of cuts over time. We propose novel formulations and algorithms for computing temporal cuts using spectral graph theory, multiplex graphs, divide-and-conquer and low-rank matrix approximation. Furthermore, we extend our formulation to dynamic graph signals, where cuts also capture node values. Experiments show that our solutions are accurate and scalable, enabling the discovery of dynamic communities and the analysis of dynamic graph processes.

  • Fast and Accurate Random Walk with Restart on Dynamic Graphs with Guarantees
    Authors: Minji Yoon, Woojeong Jin and U Kang

    Keywords: Graph mining, Online algorithm, Dynamic Random Walk with Restart

    Given a time-evolving graph, how can we track similarity between nodes in a fast and accurate way, with theoretical guarantees on the convergence and the error? Random Walk with Restart (RWR) is a popular measure to estimate the similarity between nodes and has been exploited in numerous applications. Many real-world graphs are dynamic with frequent insertion/deletion of edges; thus, tracking RWR scores on dynamic graphs in an efficient way has aroused much interest among data mining researchers. Recently, dynamic RWR models based on the propagation of scores across a given graph have been proposed, and have succeeded in outperforming previous other approaches to compute RWR dynamically. However, those models fail to guarantee exactness and convergence time for updating RWR in a generalized form. In this paper, we propose OSP, a fast and accurate algorithm for computing dynamic RWR with insertion/deletion of nodes/edges in a directed/undirected graph. When the graph is updated, OSP first calculates offset scores around the modified edges, propagates the offset scores across the updated graph, and then merges them with the current RWR scores to get updated RWR scores. We prove the exactness of OSP and introduce OSP-T, a version of OSP which regulates a trade-off between accuracy and computation time by using error tolerance ?. Given restart probability c, OSP-T guarantees to return RWR scores with O(?/c) error in O(log(?/2)/log(1-c)) iterations. Through extensive experiments, we show that OSP tracks RWR exactly up to 4605x faster than existing static RWR method on dynamic graphs, and OSP-T requires up to 15x less time with 730x lower L1 norm error and 3.3x lower rank error than other state-of-the-art dynamic RWR methods.

  • Fully Dynamic k-Center Clustering
    Authors: T.-H. Hubert Chan, Arnaud Guerquin and Mauro Sozio

    Keywords: k-center clustering, dynamic algorithms, social networks

    Static and dynamic clustering algorithms are a fundamental tool in any machine learning library. Most of the efforts in developing dynamic machine learning and data mining algorithms have been focusing on the sliding window model (where at any given point in time only the most recent data items are retained) or more simplistic models. However, in many real-world applications one might need to deal with arbitrary deletions and insertions. For example, one might need to remove data items that are not necessarily the oldest ones, because they have been flagged as containing inappropriate content or due to privacy concerns. Clustering trajectory data might also require to deal with more general update operations. We develop a (2+epsilon)-approximation algorithm for the k-center clustering problem, which requires polylogarithmic amortized cost under a fully dynamic adversarial model. In such a model, points can be added or removed arbitrarily, provided that the adversary does not have access to the random choices of our algorithm. Our theoretical results are complemented with an extensive experimental evaluation on dynamic data from Twitter, Flickr, as well as trajectory data, demonstrating the effectiveness of our approach.

  • VERSE: Versatile Graph Embeddings from Similarity Measures
    Authors: Anton Tsitsulin, Davide Mottin, Panagiotis Karras and Emmanuel Müller

    Keywords: graph embedding, vertex similarity, graph sampling

    Embedding a web-scale information network into a low-dimensional vector space facilitates tasks such as link prediction, classification, and visualization. Past research has addressed the problem of extracting such embeddings by adopting methods from words to graphs, without defining a clearly comprehensible graph-related objective. Yet, as we show, the objectives used in past works implicitly utilize similarity measures among graph nodes. In this paper, we carry the similarity orientation of previous works to its logical conclusion; we propose VERtex Similarity Embeddings (VERSE), a simple, versatile, and memory-efficient method that derives graph embeddings explicitly calibrated to preserve the distributions of a chosen vertex-to-vertex similarity measure. While the default, scalable version of VERSE learns such embeddings by sampling the graph to train a single-layer neural network, we also develop a full variant that does not rely on sampling. Our experimental study on standard benchmarks and real-world datasets demonstrates that VERSE, instantiated with diverse similarity measures, outperforms state-of-the-art methods in terms of precision and recall in major data mining tasks and supersedes them in time and space efficiency, while the scalable sampling-based variant achieves equally good results as the non-scalable full variant.

  • Co-Regularized Deep Multi-Network Embedding
    Authors: Jingchao Ni, Shiyu Chang, Xiao Liu, Wei Cheng, Haifeng Chen, Dongkuan Xu and Xiang Zhang

    Keywords: Information network, Deep learning, Network representation learning, Multi-network, Social network

    Network embedding aims to learn a low-dimensional vector representation (or embedding) for each node in the social and information networks, with the constraint to preserve network structures. Most existing methods focus on single network embedding, ignoring the relationship between multiple networks. In many real-world applications, however, related multiple networks (e.g., social networks from different platforms) may contain complementary information which can lead to further refined node embeddings. Thus, in this paper, we propose a novel multi-network embedding method, DMNE. DMNE is flexible, which allows different networks to have different sizes, to be (un)weighted and (un)directed. It leverages multiple networks via cross-network relationships between nodes in different networks, which may form many-to-many node mappings, and be associated with weights. To model the non-linearity of the network data, we develop DMNE to have a new deep learning architecture, which coordinates multiple neural networks (one for each input network data) with a co-regularized loss function to manipulate cross-network relationships. With multiple layers of non-linear mappings, DMNE progressively transforms each input network into a highly non-linear latent space, and in the meantime, adapts different latent spaces to each other through a co-regularized learning schema. Extensive experimental results on four real-life datasets demonstrate the effectiveness of our method.

  • Demarcating Endogenous and Exogenous Opinion Diffusion Process on Social Networks
    Authors: Abir De, Sourangshu Bhattacharya and Niloy Ganguly

    Keywords: Opinion diffusion, Social networks, Machine learning, Temporal process

    The networked opinion diffusion in online social networks (OSN) is governed by the two genres of opinions–endogenous opinions that are driven by the influence of social contacts between users, and exogenous opinions which are formed by external effects like news, feeds etc. Such duplex opinion dynamics is led by users belonging to two categories– organic users who generally post endogenous opinions and extrinsic users who are susceptible to externalities, and mostly post the exogenous messages. Precise demarcation of endogenous and exogenous messages offers an important cue to opinion modeling, thereby enhancing its predictive performance. On the other hand, accurate user selection aids to detect extrinsic users, which in turn helps in opinion shaping. In this paper, we design CherryPick, a novel learning machinery that classifies the opinions and users by solving a joint inference task in message and user set, from a temporal stream of sentiment messages. Furthermore, we validate the efficacy of our proposal from both modeling and shaping perspectives. Moreover, for the latter, we formulate the opinion shaping problem in a novel framework of stochastic optimal control, in which the selected extrinsic users optimally post exogenous messages so as to guide the opinions of others in desired way. On five datasets crawled from Twitter, CherryPick offers a significant accuracy boost in terms of opinion forecasting, against several competitors. Furthermore, it can precisely determine the quality of a set of control users, which together with the proposed online shaping strategy, consistently steers the opinion dynamics more effectively than several state-of-the-art baselines.

  • Collective Classification of Spam Campaigners on Twitter: A Hierarchical Meta-Path Based Approach
    Authors: Srishti Gupta, Abhinav Khattar, Arpit Gogia, Ponnurangam Kumaraguru and Tanmoy Chakraborty

    Keywords: spammers, Twitter, Phone number, Heterogeneous networks, meta-path, active learning, hierarchical structure, Online Social Networks

    Cybercriminals have leveraged the popularity of a large user base available on Online Social Networks (OSNs) to spread spam campaigns by propagating phishing URLs, attaching malicious contents, etc. However, another kind of spam attacks using phone numbers has recently become prevalent on OSNs, where spammers advertise phone numbers to attract users’ attention and convince them to make a call to these phone numbers. The dynamics of phone number based spam is different from URL-based spam due to an inherent trust associated with a phone number. While previous work has proposed strategies to mitigate URL-based spam attacks, phone number based spam attacks have received less attention. In this paper, we aim to detect spammers that use phone numbers to promote campaigns on Twitter. To this end, we collected information (tweets, user meta-data, etc.) about 3, 370 campaigns spread by 670, 251 users. We model the Twitter dataset as a heterogeneous network by leveraging various interconnections between different types of nodes present in the dataset. In particular, we make the following contributions – (i) We propose a simple yet effective metric, called Hierarchical Meta-Path Score (HMPS) to measure the proximity of an unknown user to the other known pool of spammers. (ii) We design a feedback-based active learning strategy and show that it significantly outperforms three state-of-the-art baselines for the task of spam detection. Our method achieves 6.9% and 67.3% higher F1-score and AUC, respectively compared to the best baseline method. (iii) To overcome the problem of less training instances for supervised learning, we show that our proposed feedback strategy achieves 25.6% and 46% higher F1-score and AUC respectively than other oversampling strategies. Finally, we perform a case study to show how our method is capable of detecting those users as spammers who have not been suspended by Twitter (and other baselines) yet.

  • Mining Tours and Paths in Activity Networks
    Authors: Sofia Maria Nikolakaki, Charalampos Mavroforakis, Alina Ene and Evimaria Terzi

    Keywords: activity network, prize-collecting subgraphs, group steiner tree

    The proliferation of online social networks and the spread of smart mobile devices enable the collection of information related to a multitude of users’ activities. These networks, where every node is associated with a type of action and a frequency, are usually referred to as activity networks. Examples of such networks include city road networks, where the nodes are intersections and the edges are road segments. Each node is associated with a number of geolocated actions that users of an online platform took in its vicinity. In these networks, we define a prize-collecting subgraph to be a connected set of nodes, which is compact, i.e., the nodes are close to each other, and exhibits high levels of activity. The k-PCSubgraphs problem we address in this paper is defined as follows: given an activity network and an integer k, identify k non-overlapping and connected subgraphs of the network such that the nodes of each subgraph are close to each other, and the number of actions that they are associated with is high. Here, we define and study two new variants of the k-PCSubgraphs problem, where the subgraphs of interest are tours and paths. Since both these problems are NP-hard, we provide approximate algorithms that solve them in time almost linear to the number of edges. In our experiments, we use real activity networks obtained by combining actual city road networks and projecting on them user activity from Twitter and Flickr. Our experimental results demonstrate both the efficiency and the practical utility of our methods.

  • When Online Dating Meets Nash Social Welfare: Achieving Efficiency and Fairness
    Authors: Yongzheng Jia, Xue Liu, Maria Renhui Zhang and Wei Xu

    Keywords: online dating, user modeling, data science, online algorithm, game theory

    Mobile dating applications such as Coffee Meets Bagel, Tantan, and Tinder, have become significant for young adults to meet new friends and discover romantic relationships. From a system designer’s perspective, in order to achieve better user experience in these applications, we should take both the efficiency and fairness of a dating market into consideration, so as to increase the overall satisfaction for all users. Towards this goal, we investigate the nature of diminishing marginal returns for online dating markets (i.e., captured by the submodularity), and trade-off between the efficiency and fairness of the market with Nash social welfare. We further design effective online algorithms to the apps. We verify our models and algorithms through sound theoretical analysis and empirical studies by using real data and show that our algorithms can significantly improve the ecosystems of the online dating applications.

  • Measuring and Improving the Core Resilience of Networks
    Authors: Ricky Laishram, Ahmet Erdem Sar?yüce, Tina Eliassi-Rad, Ali Pinar and Sucheta Soundarajan

    Keywords: k-core, core resilience, complex networks

    k-cores have emerged as an important concept for understanding the global structure of networks, as well as for identifying central or important nodes within a network. It is often valuable to understand the resilience of the k-cores of a network to attacks and dropped edges (i.e., damaged communications links). We provide a formal definition of a network’s core resilience, and examine the problem of characterizing core resilience in terms of the network’s structural features: in particular, which structural properties cause a network to have high or low core resilience? To measure this, we introduce two novel node properties – Core Strength and Core Influence, which measure the resilience of individual nodes’ core numbers and their influence on other nodes’ core numbers. Using these properties, we propose algorithms to add edges to increase the core resilience of a network. We consider two attack scenarios, randomly deleted edges and randomly deleted nodes. Through experiments on a variety of technological and infrastructure network datasets, we verify the efficacy of our node-based resilience measures at predicting the resilience of a network, and evaluate our algorithms.

  • Preferential Attachment as a Unique Equilibrium
    Authors: Chen Avin, Avi Cohen, Pierre Fraigniaud, Zvi Lotker and David Peleg

    Keywords: Preferential Attachment, Network Formation Games, Social Networks, Young’s Lattice

    This paper demonstrates that the Preferential Attachment rule naturally emerges in the context of evolutionary network formation, as the unique Nash equilibrium of a simple natural game. In this game, each node aims at maximizing its degree in the future, representing its social capital in the “society” formed by the nodes and their connections. This result provides additional formal support to the commonly used Preferential Attachment model, initially designed to capture the “rich get richer” aphorism. In the process of establishing our result, we expose new connections between Preferential Attachment, random walks, and Young’s Lattice.

  • SIDE: Representation Learning in Signed Directed Networks
    Authors: Junghwan Kim, Haekyu Park, Ji-Eun Lee and U Kang

    Keywords: Network embedding, Representation learning, Signed directed network

    Given a signed directed network, how can we learn node representations which fully encode structural information of the network including sign and direction of edges? Node representation learning or network embedding learns a mapping of each node to a vector. The mapping encodes structural information on network, providing low-dimensional dense node features for general machine learning and data mining frameworks. Since many social networks allow trust (friend) and distrust (enemy) relationships described by signed and directed edges, generalizing network embedding method to learn from sign and direction information in networks is crucial. In addition, social theories are critical tool in signed network analysis. However, none of the existing methods supports all of the desired properties: considering sign, direction, and social theoretical interpretation. In this paper, we propose SIDE, a general network embedding method that represents both sign and direction of edges in the embedding space. SIDE carefully formulates and optimizes likelihood over both direct and indirect signed connections. We provide socio-psychological interpretation for each component of likelihood function. We prove linear scalability of our algorithm and propose additional optimization techniques to reduce the training time and improve accuracy. Through extensive experiments on real-world signed directed networks, we show that SIDE effectively encodes structural information into the learned embedding.

  • Provable and Practical Approximations for the Degree Distribution using Sublinear Graph Samples
    Authors: Talya Eden, Shweta Jain, Ali Pinar, Dana Ron and C. Seshadhri

    Keywords: degree distribution, graphs, sublinear algorithms, sampling

    The degree distribution is one of the most fundamental properties used in the analysis of massive graphs. There is a large literature on graph sampling, where the goal is to estimate properties (especially the degree distribution) of a large graph through a small, random sample. The degree distribution estimation poses a significant challenge, due to its heavy-tailed nature and the large variance in degrees. We design a new algorithm, SADDLES, for this problem, using recent mathematical techniques from the field of sublinear algorithms. The SADDLES algorithm gives provably accurate outputs for all values of the degree distribution. For the analysis, we define two fatness measures of the degree distribution, called the h-index and the z-index. We prove that SADDLES is sublinear in the graph size when these indices are large. A corollary of this result is a provably sublinear algorithm for any degree distribution bounded below by a power law. We deploy our new algorithm on a variety of real datasets and demonstrate its excellent empirical behavior. In all instances, we get extremely accurate approximations for all values in the degree distribution by observing at most 1% of the vertices. This is a major improvement over the state-of-the-art sampling algorithms, which typically sample more than 10% of the vertices to give comparable results. We also observe that the h and z-indices of real graphs are large, validating our theoretical analysis.

  • Deep Collective Classification in Heterogeneous Information Networks
    Authors: Yizhou Zhang, Yun Xiong, Xiangnan Kong and Yangyong Zhu

    Keywords: Collective Classification, Graph Convolution, Heterogeneous Information Networks, Graph Mining, Deep Learning

    Collective classification has attracted considerable attention in the last decade, where the labels within a group of instances are correlated and should be inferred collectively, instead of independently. Conventional approaches on collective classification mainly focus on exploiting simple relational features (such as count and exists aggregators on neighboring nodes). However, many real-world applications involve complex dependencies among the instances, which are obscure/hidden in the networks. To capture these dependencies in collective classification, we need to go beyond simple relational features and extract deep dependencies between the instances. In this paper, we study the problem of deep collective classification in Heterogeneous Information Networks (HINs), which involves different types of autocorrelations, from simple to complex relations, among the instances. Different from conventional autocorrelations, which are given explicitly by the links in the network, complex autocorrelations are obscure/hidden in HINs, and should be inferred from existing links in a hierarchical order. This problem is highly challenging due to the multiple types of dependencies among the nodes and the complexity of the relational features. In this study, we proposed a deep convolutional collective classification method, called GraphInception, to learn the deep relational features in HINs. The proposed method can automatically generate a hierarchy of relational features with different complexities. Extensive experiments on four real world networks demonstrate that our approach can improve the collective classification performance by considering deep relational features in HINs.

  • On Exploring Semantic Meanings of Links for Embedding Social Networks
    Authors: Linchuan Xu, Xiaokai Wei, Jiannong Cao and Philip S Yu

    Keywords: Representation Learning, Social Networks, Data Mining

    There are increasing interests in learning low-dimensional and dense node representations from the network structure which is usually high-dimensional and sparse. However, most existing methods equally treat all social links which may have different semantic meanings. Links have different semantic meanings because the similarities between two nodes can be different, e.g., two nodes share common neighbors and two nodes share similar interests which are demonstrated in node-generated content. In this paper, the former type of links is referred to as structure-close links while the latter type is referred to as content-close links. These two types of links naturally indicate there are two types of characteristics that nodes expose in a social network. Hence, we propose to learn two representations for each node, and render each representation responsible for encoding the corresponding type of links, which is achieved by jointly embedding the network structure and inferring the type of each link. In the experiments, the proposed method is demonstrated to be more effective than five recent mehtods on four social networks through applications including visualization, link prediction and multi-label classification.

  • A Correlation Clustering Framework for Community Detection
    Authors: Nate Veldt, David F. Gleich and Anthony Wirth

    Keywords: graph clustering, community detection, network analysis, correlation clustering, sparsest cut, cluster deletion

    Graph clustering, or community detection, is the task of identifying groups of closely related objects in a large network. In this paper we introduce a new community detection framework called LambdaCC that is based on a specially weighted version of correlation clustering. A key component in our methodology is a clustering resolution parameter, ?, which implicitly controls the size and structure of clusters formed by our framework. We show that, by increasing this parameter, our objective effectively interpolates between two different strategies in graph clustering: finding a sparse cut and forming dense subgraphs. Our methodology unifies and generalizes a number of other important clustering quality functions including modularity, sparsest cut, and cluster deletion, and places them all within the context of an optimization problem that has been well studied from the perspective of approximation algorithms. Our approach to clustering is particularly relevant in the regime of finding dense clusters, as it leads to the first constant-factor approximation for the cluster deletion problem. We use our approach to cluster several real world graphs, including a number of large collaboration networks from online databases.

  • SIR-Hawkes: Linking Epidemic Models and Hawkes Processes to Model Diffusions in Finite Populations
    Authors: Marian-Andrei Rizoiu, Swapnil Mishra, Quyu Kong, Mark Carman and Lexing Xie

    Keywords: equivalence epidemic and point process models, stochastic information diffusion models, probability distribution of final cascade size, diffusion model in finite population

    Among the statistical tools for online information diffusion, both epidemic models and Hawkes point processes are popular choices. The former originate from epidemiology, and consider information as a viral contagion which spreads into a population of online users. The latter have roots in geophysics and finance, view individual actions as discrete events in continuous time, and modulates the event rate according to social principles. Here, we establish a novel connection between these two frameworks. Namely, the rate of events in an extended Hawkes model is identical to the rate of new infections in the Susceptible-Infected-Recovered (SIR) model after marginalizing out recovery events – which are unobserved in a Hawkes process. This result paves the way to apply tools developed for SIR to Hawkes, and vice versa. We make three further contributions in this work. First, we propose HawkesN, a generalization of the basic Hawkes model to account for a finite population size. Second, we show that HawkesN can explain real retweet cascades better than state-of-the-art Hawkes processes. The size of the population can also be estimated from data. Third, we derive the final size distribution for a partially observed HawkesN cascade, inspired by methods in stochastic SIR. Such distributions provide nuanced explanations to the general unpredictability of popularity: the distribution for diffusion cascade sizes tends to have two maxima, one corresponding to large cascade sizes and another one around zero.