Program of track Posters

Posters will be displayed in the exhibition area during the three days of the conference, and will be presented during breaks.

List of accepted posters:

  • The Impact of Semantic Context Cues on the User Acceptance of Tag Recommendations: An Online Study
    Authors: Dominik Kowald, Paul Seitlinger, Tobias Ley and Elisabeth Lex

    Keywords: Tag Recommendations, Online Evaluation, 3Layers, Most Popular

    Abstract:
    In this paper, we present the results of an online study that aims to shed light on the impact that semantic context cues have on the user acceptance of tag recommendations. To this end, we conducted a work-integrated social bookmarking scenario with 17 university employees in order to compare the user acceptance of a context-aware tag recommendation algorithm called 3Layers with the user acceptance of a simple popularity-based baseline. In this scenario, we validated and verified the hypothesis that semantic context cues have a higher impact on the user acceptance of tag recommendations in a collaborative tagging setting than in an individual tagging setting. With this paper, we contribute to the sparse line of research presenting results of online tag recommendation studies.

  • Higher-order Network Representation Learning
    Authors: Ryan Rossi, Nesreen Ahmed and Eunyee Koh

    Keywords: network representation learning, network motifs, higher-order graph mining

    Abstract:
    This paper describes a general framework for learning Higher-Order Network Embeddings (HONE) from graph data based on network motifs. The HONE framework is highly expressive and flexible with many interchangeable components. The experimental results demonstrate the effectiveness of learning higher-order network representations. In all cases, HONE outperforms recent embedding methods that are unable to capture higher-order structures with a mean relative gain in AUC of 19% (and up to 75% gain) across a wide variety of networks and embedding methods.

  • Identifying Your Representative Work Based on Credit Allocation
    Authors: Peng Bao and Jiahui Wang

    Keywords: citation network, representative work, credit allocation

    Abstract:
    With the rapid development of scientific impact quantification in the field of science of success, the ability to identify the representative work of a researcher has important implications in a wide range of areas, including hiring, funding, and promotion systems. In this paper, we propose a two-step credit allocation algorithm (TSCA) for identifying the representative work of a researcher. This algorithm explicitly captures the importance of a paper, its relevance to other papers, and the unequally distributed contribution of each citation. We validate TSCA by applying it on the citation data from American Physical Society (APS) in the scenario of identifying the Nobel prize winning papers of the Nobel laureates. Experiments demonstrate that the proposed algorithm can significantly outperform the existing methods.

  • AttAE-RL²: Attention based Autoencoder for Rap Lyrics Representation Learning
    Authors: Hongru Liang, Qian Li, Haozheng Wang, Hang Li, Jin-Mao Wei and Zhenglu Yang

    Keywords: rap lyrics, representation learning, autoencoder, natural language processing

    Abstract:
    Learning rap lyrics is an important area of music information retrieval because it is the basis of many applications, such as recommendation systems, automatic categorization, and so on. In this paper, we tackle the issue pertaining to the lack of an effective approach to aggregate various features of lyrics by proposing an attention-based autoencoder for rap lyrics representation learning (AttAE-RL²). The proposed method can appropriately integrate the semantic and prosodic features of rap lyrics. Experimental results demonstrate that our approach outperforms the state-of-the-art ones by a large margin.

  • Predicting Website Abuse Using Update Histories
    Authors: Yuta Takata, Mitsuaki Akiyama, Takeshi Yagi, Kunio Hato and Shigeki Goto

    Keywords: Website Abuse, Software Update, Internet Archive, CMS

    Abstract:
    Threats of abusing websites that webmasters have stopped updating have increased. In this poster, we propose a method of predicting potentially abusable websites by retrospectively analyzing updates of software that composes websites. The method captures webmaster behaviors from accesses to archived snapshots of a website and analyzes the changes of web servers and web applications used in the past as update histories. A classifier that predicts website abuses is finally built by using update histories from snapshots of known malicious websites before the detections. Evaluation results showed that the classifier could predict various website abuses, such as drive-by downloads, phishes, and defacements, with accuracy: a 76% true positive rate and a 26% false positive rate.

  • Network Embedding Based Recommendation Method in Social Networks
    Authors: Yufei Wen, Lei Guo, Zhumin Chen and Jun Ma

    Keywords: Social Recommendation, Network Embedding, Matrix Factorization

    Abstract:
    With the advent of online social networks, the use of information hidden in social networks for recommendation has been extensively studied. Unlike previous work that regarded social influence as regularization terms, we take advantage of network embedding techniques and propose an embedding-based recommendation method. Specifically, we first pre-train a network embedding model on the users’ social network to map each user into a low-dimensional space, and then incorporate the resulting embeddings into a matrix factorization model, which combines both latent and pre-learned features for recommendation. The experimental results on two real-world datasets indicate that our proposed model is more effective and can reach better performance than other related methods.

  • An Improved Sampler for Bayesian Personalized Ranking by Leveraging View Data
    Authors: Jingtao Ding, Fuli Feng, Xiangnan He, Guanghui Yu, Yong Li and Depeng Jin

    Keywords: Bayesian Personalized Ranking, Recommendation, Sampler, View Data

    Abstract:
    Bayesian Personalized Ranking (BPR) is a representative pairwise learning method for optimizing recommendation models. It is widely known that the performance of BPR depends largely on the quality of the negative sampler. In this short paper, we make two contributions with respect to BPR. First, we find that sampling negative items from the whole space is unnecessary and may even degrade the performance. Second, focusing on the purchase feedback of the E-commerce domain, we propose a simple yet effective sampler for BPR by leveraging the additional view data. Compared to the vanilla BPR that applies a uniform sampler on all candidates, our view-aware sampler enhances BPR with a relative improvement of 48.45% in recommendation performance on average (27.36% and 69.54% on two datasets, respectively).

  • That Makes Sense: Joint Sense Retrofitting from Contextual and Ontological Information
    Authors: Ting-Yu Yen, Yang-Yin Lee, Hen-Hsen Huang and Hsin-Hsi Chen

    Keywords: Semantic relatedness, sense embedding, joint sense retrofitting

    Abstract:
    While recent word embedding models demonstrate their abilities to capture syntactic and semantic information, the demand for sense-level embedding is growing. In this study, we propose a novel joint sense embedding learning model that retrofits the word representation into sense representation from contextual and ontological information. The experiments show the effectiveness and robustness of our model, which outperforms previous approaches on four publicly available benchmark datasets.

  • PDSM: Pregel-Based Distributed Subgraph Matching on Large Scale RDF Graphs
    Authors: Qiang Xu and Xin Wang

    Keywords: subgraph matching, RDF graph, Pregel

    Abstract:
    This paper presents a novel Pregel-based Distributed Subgraph Matching method PDSM to answer subgraph matching queries on big RDF graphs. In our method, the query graph is transformed to a spanning tree based on the breadth-first search (BFS). Two optimization techniques are proposed to filter out part of the unpromising intermediate results and postpone the Cartesian product operations in the Pregel iterative computation. The extensive experiments on both synthetic and real-world datasets show that PDSM outperforms the state-of-the-art methods by an order of magnitude.

  • P3RPQ: Pregel-Based Parallel Provenance-Aware Regular Path Query Processing on Large RDF Graphs
    Authors: Yueqi Xin, Bingyi Zhang and Xin Wang

    Keywords: regular path queries, provenance-aware, RDF graphs, Pregel

    Abstract:
    This paper proposes a novel method for answering Pregel-based Parallel Provenance-aware Regular Path Queries (P3RPQ) on large RDF graphs. Our method is developed using the Pregel framework, which utilizes Glushkov automata to keep track of the matching process of RPQs in parallel. Meanwhile, four optimization strategies are devised, which can reduce the response time of the basic algorithm dramatically and overcome the counting paths problem to some extent. Extensive experiments are conducted to efficiently evaluate the algorithms on both synthetic and real-world datasets.

  • Detecting Personal Life Events from Twitter by Multi-Task LSTM
    Authors: An-Zi Yen, Hen-Hsen Huang and Hsin-Hsi Chen

    Keywords: lifelogging, personal event detection, social media

    Abstract:
    People are used to logging their lives on social media platforms. Life events can be expressed explicitly or implicitly in a text description. However, a description does not always contain life events related to a specific individual. Telling whether any life events exist, and further identifying their categories, is indispensable for event retrieval. This paper explores various LSTM models to detect and classify life events in tweets. Experiments show that the proposed Multi-Task LSTM model with attention achieves the best performance.

  • Characterizing efficient referrals in social networks
    Authors: Reut Apel, Elad Yom Tov and Moshe Tennenholtz

    Keywords: Social networks, Filter bubble, Web mining

    Abstract:
    Users of social networks often focus on specific areas of that network, leading to the well-known “filter bubble” effect. Connecting people to a new area of the network in a way that will cause them to become active in that area could help alleviate this effect and improve social welfare. Here we present a preliminary analysis of network referrals, that is, attempts by users to connect peers to other areas of the network. We classify these referrals by their efficiency, i.e., the likelihood that a referral will result in a user becoming active in the new area of the network. We show that by using features describing past experience of the referring author and the content of their messages we are able to predict whether a referral will be effective, reaching an AUC of 0.87 for those users who are most experienced in writing efficient referrals. Our results represent a first step towards being able to algorithmically construct efficient referrals with the goal of mitigating the “filter bubble” effect pervasive in online social networks.

  • Automatic Matching of Resource Needs and Availabilities in Microblogs for Post-Disaster Relief
    Authors: Moumita Basu, Anurag Shandilya, Kripabandhu Ghosh and Saptarshi Ghosh

    Keywords: Microblogs, Disaster relief, Resource needs, Resource availabilities, Matching

    Abstract:
    During a disaster event, two types of information that are especially useful for coordinating relief operations are needs and availabilities of different types of resources. Microblogging sites like Twitter are increasingly being used for aiding post-disaster relief operations, and there have been prior studies on identifying tweets which inform about resource needs and availabilities (termed as need-tweets and availability-tweets respectively). However, there has not been much attempt to effectively utilise such tweets. In this context, we introduce the problem of automatically matching need-tweets with appropriate availability-tweets, which is practically important for effective coordination of post-disaster relief operations. We also experiment with several methodologies for automatically matching need-tweets and availability-tweets.

  • Which Algorithm Performs Best: Algorithm Selection for Community Detection
    Authors: Gaoyang Guo, Chaokun Wang and Xiang Ying

    Keywords: algorithm selection, community detection, classification

    Abstract:
    Myriads of community detection methods, which detect communities according to specific features of networks, have been developed in diverse disciplines, such as physics, sociology, and computer science. Consequently, we have to face the problem of Algorithm Selection for Community Detection (ASCD): Given a specific network, which algorithm should we select to reveal latent community structures on the network? In this paper, we propose a model called CYDES to address the ASCD problem. CYDES mainly consists of two parts, network feature matrix construction and algorithm classification. We combine three effective feature extraction methods with the idea of the BOW model to construct a fixed-size feature matrix. After a nonlinear transformation of the feature matrix, a softmax regression model is utilized to generate a classification label representing the best community detection algorithm we select. Extensive experimental results demonstrate that CYDES has high algorithm selection quality for community detection in networks.

  • A Fresh Look at Understanding News Events Evolution
    Authors: Longtao Huang, Shangwen Lv, Liangjun Zang, Yipeng Su, Jizhong Han and Songlin Hu

    Keywords: event evolution, storyline, document understanding

    Abstract:
    This paper proposes a novel approach to retrieve news articles related to a specific event and generate a storyline to help people understand the event evolution. First, a similarity calculation method is proposed to retrieve news articles related to the specific event, which combines textual similarity, temporal similarity and entity similarity. Then a multi-view attribute graph is constructed to represent the relationship between retrieved articles. Finally, a community detection algorithm is developed to segment and chain subevents in the graph. Experimental results on real-world datasets demonstrate that the proposed approach achieves better results than existing methods.

  • Comparison of users’ and designers’ differences in mobile shopping app interface preferences and designs
    Authors: Yu Fu, Dongliang Zhang and Hao Jiang

    Keywords: Mobile shopping, User interface design, Perception difference, Users and designers’ preferences

    Abstract:
    Besides usability, visual appearance also plays an important role in influencing users’ attitudes towards mobile shopping apps. This article presents a pilot study that explores users’ and designers’ preferences for mobile shopping app interfaces. The study consisted of two phases. (1) Eliciting participants’ perception of interface similarity by sorting; using DISTATIS and cluster analysis, the perceptual space of interface similarity was identified. (2) Eliciting participants’ overall preference by rating. The results identified three typical interfaces and the distribution of ideal preference interfaces for users and designers. Last, users’ and designers’ differences in interface preference were discussed and guidelines were proposed.

  • Handling Confounding for Realistic Off-Policy Evaluation
    Authors: Saurabh Sohoney, Nikita Prabhu and Vineet Chaoji

    Keywords: Off-policy evaluation, Counterfactual-analysis, Inverse Propensity Score estimator, Confounding

    Abstract:
    Inverse Propensity Score estimator (IPS) is a basic, unbiased, off-policy evaluation technique to measure the impact of a user-interactive system without serving live traffic. We present our work on applying IPS to real-world settings by addressing some of the practical challenges. In particular, we show that accurate off-policy evaluation can be impossible in the absence of a complete context. We describe a systematic way of choosing the right context when it is not well-defined and show results highlighting its efficacy.

  • Hierarchical Type Constrained Topic Entity Detection for Knowledge Base Question Answering
    Authors: Yunqi Qiu, Manling Li, Yuanzhuo Wang, Yantao Jia and Xiaolong Jin

    Keywords: Question Answering, Topic Entity Detection, Hierarchical Types

    Abstract:
    Topic entity detection is to find out the main entity asked in a question, which is a significant task in question answering. Traditional methods ignore the information of entities, especially entity types and their hierarchical structures, restricting the performance. To take full advantage of KB and detect topic entities correctly, we propose a deep neural model to leverage type hierarchy and relations of entities in KB. Experimental results demonstrate the effectiveness of the proposed method.

  • Identifying Time Intervals for Knowledge Graph Facts
    Authors: Dhruv Gupta and Klaus Berberich

    Keywords: Temporal Expressions, Knowledge Graphs, Temporal Facts

    Abstract:
    Knowledge graphs capture very little temporal information associated with facts. In this work, we address the problem of identifying time intervals of knowledge graph facts from large document collections annotated with temporal expressions. Prior approaches in this direction have leveraged limited metadata associated with documents in large collections (e.g., publication dates) or have limited techniques to model the uncertainty and dynamics of temporal expressions. Our approach to identifying time intervals for time-sensitive facts in knowledge graphs leverages a time model that incorporates uncertainty and models temporal expressions at different levels of granularity (e.g., day, month, and year). Evaluation on a temporal fact benchmark using two large news archives amounting to more than eleven million documents shows the quality of our results.

  • User Type Affinity Estimation Using Gamma-Poisson Model
    Authors: Fei Wu, Yanen Li and Ning Xu

    Keywords: Recommendation, User modeling, User type affinity

    Abstract:
    The affinity of a user to a type of item (e.g., stories from the same publisher, or movies of the same genre) is an important signal reflecting the user’s interests. An accurate estimation of the user type affinity has various applications in ranking and recommendation systems. For frequent users, simply dividing the number of interactions with content (e.g., clicks) by the number of impressions (e.g., the number of times the content is presented to each user) would be a good estimate. However, such estimates are erroneous for users who have sparse interaction history (e.g., new users). To alleviate the problem, feature-based approaches aim to learn functions predicting the affinity score using only non-click features, such as user demographics, locations, and interests. However, such approaches do not take full advantage of the interaction history of frequent users. Motivated by the limitations of the two approaches, we propose a Gamma-Poisson model that aims at utilizing the interaction history of frequent users, as well as leveraging a feature-based model for infrequent users. Our intuition is that we should rely more on the interaction history when estimating affinity for frequent users, and weigh the feature-based model more heavily for infrequent users. We present experimental results on large-scale real-world data in a publisher content click prediction task to demonstrate the effectiveness of user type affinity scores estimated by the proposed method.

  • Ranking-based Method for News Stance Detection
    Authors: Qiang Zhang, Emine Yilmaz and Shangsong Liang

    Keywords: fake news, stance detection, ranking

    Abstract:
    The wide spread of fake news on social media has been treated as a new cyber-threat around the world. The study of fake news detection on social media has recently drawn considerable attention. A valuable step towards news veracity assessment is to understand stance from different information sources, a process known as stance detection. Specifically, stance detection is to detect four kinds of stances (“agree”, “disagree”, “discuss” and “unrelated”) of the news towards a claim. Existing methods tried to tackle the stance detection problem with classification-based algorithms. However, classification-based algorithms make a strong assumption that there is a clear distinction between any two stances, which may not hold in the context of stance detection. Accordingly, we frame the detection problem as a ranking problem and propose a ranking-based method to improve detection performance. Compared with the classification-based methods, the ranking-based method compares the true stance and false stances and maximizes the difference between them. Experimental results show a 5.7% evaluation score improvement compared with the state-of-the-art methods. Accuracy improvements as high as 10.86% and 187.86% are achieved for “agree” and “disagree”, respectively, the two stances that are believed to be the most difficult and the most relevant to fake news detection.

  • Retrieving Information from Multiple Sources
    Authors: Anurag Roy, Kripabandhu Ghosh, Moumita Basu, Parth Gupta and Saptarshi Ghosh

    Keywords: Multi-view retrieval, Word embedding, Deep learning

    Abstract:
    The Web today has several information sources on which an ongoing event is discussed. To get a complete picture of the event, it is important to retrieve information from all the sources (views). In this work, we propose a novel neural network based model which integrates the embeddings from multiple sources, and thus retrieves information from them jointly, as opposed to combining multiple retrieval results. The importance of the proposed model is that no document-aligned comparable data is needed. Experiments on a real-world dataset comprising posts related to a particular event from three different sources – Facebook, Twitter and WhatsApp – exhibit the efficacy of the proposed model.

  • Metadata vs. Ground-truth: A Myth behind the Evolution of Community Detection Methods
    Authors: Tanmoy Chakraborty, Zhe Cui and Noseong Park

    Keywords: Metadata, Community detection, Ground-truth, Community evaluation

    Abstract:
    A community detection (CD) method is usually evaluated by the extent to which it is able to discover the ‘ground-truth’ community structure of a network. A certain ‘node-centric metadata’ is used to define the ground-truth partition. However, nodes in real networks often have multiple metadata types (e.g., occupation, location); each can potentially form a ground-truth partition. Our experiment with 10 CD methods on 5 datasets (having multiple metadata-based ground-truth partitions) shows that the metadata-based evaluation is misleading because there is no single CD method that can outperform others by detecting all types of metadata-based partitions. We further show that the community structure obtained from the CD methods is usually topologically stronger than any metadata-based partitions. Finally, we suggest a new task-based evaluation framework for CD methods and show that a certain type of CD method is useful for a certain type of task.

  • LAAN: A Linguistic-Aware Attention Network for Sentiment Analysis
    Authors: Zeyang Lei, Yujiu Yang and Yi Liu

    Keywords: Sentiment analysis, linguistic knowledge, interactive attention, dynamic semantic attention, word-level semantics, phrase-level linguistic structure, sentiment-specific sentence representation

    Abstract:
    Sentiment analysis of social media and comment data is an important issue in opinion monitoring. Although deep neural networks have gained remarkable success recently for this task, these approaches do not fully exploit linguistic knowledge. In this paper, we propose a Linguistic-Aware Attention Network (LAAN) to enhance the performance of sentiment analysis, which integrates sentiment linguistic knowledge into the deep neural network. LAAN adopts a two-stage strategy to model the sentiment-specific sentence representation. First, an interactive attention mechanism is designed to model word-level semantics. Second, to capture phrase-level linguistic structure, a dynamic semantic attention is proposed to select the crucial phrase chunks in the sentence. The experiments demonstrate that LAAN has robust superiority over competitors and has reached state-of-the-art performance.

  • Educational Migration from Russia to the Nordic Countries, China and Middle East. Social Media Data
    Authors: Daniel Alexandrov, Ilya Musabirov, Viktor Karepin and Daria Chuprina

    Keywords: educational migration, international students, social media, webometrics

    Abstract:
    We use social media and WWW data to analyze international educational migration from Russia. We find substantial regional differences in migration patterns for three contrasting directions: Scandinavia, China, and the Middle East. We build a model of migration flows with geographic distances to destination countries, various socio-demographic data and institutional characteristics of educational institutions.

  • Optimal vehicle dispatching for ride-sharing platforms via dynamic pricing
    Authors: Mengjing Chen, Weiran Shen, Pingzhong Tang and Song Zuo

    Keywords: vehicle dispatching, ride-sharing, dynamic pricing

    Abstract:
    Over the past few years, ride-sharing has been proven to be an effective way to relieve urban traffic congestion, as evidenced by several emerging ride-sharing platforms such as Uber and Didi. A key economic problem for these platforms is to design a revenue optimal (or welfare-optimal) pricing scheme and a corresponding vehicle dispatching policy that incorporates geographic information, and more importantly, dynamic supply and demand. In this paper, we aim to solve this problem by introducing a unified model that takes into account both travel time and driver redirection. We tackle the non-convexity problem using the “ironing” technique and formulate the optimization problem as a Markov decision process (MDP), where the states are the driver distributions and the decision variables are the prices. Our main finding is to give an efficient algorithm that computes the exact revenue (or welfare) optimal randomized pricing schemes. We characterize the optimal solutions of the MDP by primal-dual analysis of a convex program. We also conduct empirical analyses of our solution with real data of a major ride-sharing platform and show its significant advantages over fixed pricing schemes as well as those prevalent surge-based pricing schemes.

  • A Portfolio Optimization Approach for Splitting Budget Across Multiple Advertising Campaigns
    Authors: Deguang Kong, Konstantin Shmakov and Jian Yang

    Keywords: bid, advertising, optimization

    Abstract:
    Bid optimization, which aims to find the competitive bid to achieve the best performance for the advertiser, is an important problem in online advertising. The optimal bid recommendation enables the advertisers to make informed decisions without actually spending the budget. In this paper, we consider a bid optimization scenario in which the advertiser’s budget can be split across multiple campaigns. To achieve the optimal performance, we formalize the bid optimization problem as a constrained portfolio optimization problem, and derive an effective method to solve it. Experimental studies on real-world ad campaigns demonstrate the effectiveness of our method.

  • Multi-task Learning for Author Profiling with Hierarchical Features
    Authors: Jiang Zhile, Min Yang, Qiang Qu, Junyu Luo and Juncheng Liu

    Keywords: User profiling, Multi-task learning, Hierarchical Features

    Abstract:
    Author profiling is an important but challenging task. In this paper, we propose a novel Multi-Task learning framework for Author Profiling (MTAP), in which a document modeling module is shared across three different author profiling tasks (i.e., age, gender and job classification tasks). To further boost author profiling, we integrate hierarchical features learned by different models. Concretely, we employ a CNN, an LSTM and a topic model to learn the character-level, word-level and topic-level features, respectively. MTAP thus leverages the benefits of supervised deep neural networks as well as an unsupervised probabilistic generative model to enhance the document representation learning. Experimental results on a real-life blog dataset show that MTAP has robust superiority over competitors and sets a new state of the art for all three author profiling tasks.

  • Open Information Extraction with Global Structure Constraints
    Authors: Qi Zhu, Xiang Ren, Jingbo Shang, Yu Zhang, Frank F. Xu and Jiawei Han

    Keywords: Open Information Extraction, knowledge acquisition, distant supervision

    Abstract:
    Extracting entities and their relations from text is an important task for understanding massive text corpora. In this paper, we propose a novel open IE system, called ReMine, which integrates local context signal and global structural signal in a unified framework with distant supervision. The new system can be efficiently applied to different domains as it uses facts from external knowledge bases as supervision; and can effectively score sentence-level tuple extractions based on corpus-level statistics. Specifically, we design a joint optimization problem to unify (1) segmenting entity/relation phrases in individual sentences based on local context; and (2) measuring the quality of sentence-level extractions with a translating-based objective. Experiments on real-world corpora from different domains demonstrate the effectiveness and robustness of ReMine when compared to other open IE systems.

  • GPSP: Graph Partition and Space Projection based Approach for Heterogeneous Network Embedding
    Authors: Wenyu Du, Shuai Yu, Min Yang and Qiang Qu

    Keywords: Network Embedding, Network representation learning, Graph partition, Space projection

    Abstract:
    In this paper, we propose GPSP, a novel Graph Partition and Space Projection based approach, to learn the representation of the heterogeneous network that consists of multiple types of nodes and links. Concretely, we first partition the heterogeneous network into homogeneous and bipartite subnetworks. Then the projective relations hidden in bipartite subnetworks are extracted by learning the projective embedding vectors. Finally, we concatenate the projective vectors from bipartite subnetworks with the ones learned from homogeneous subnetworks to form the final representation of the heterogeneous network. Extensive experiments are conducted on a real-life dataset. The experimental results demonstrate that GPSP outperforms state-of-the-art baselines in two key network mining tasks: node classification and clustering.

  • RealGraph: A Graph Engine Leveraging the Power-Law Distribution of Social Networks
    Authors: Yong-Yeon Jo, Myung-Hwan Jang, Hyungsoo Jung and Sang-Wook Kim

    Keywords: Social network, graph processing, big data

    Abstract:
    Existing single-machine-based graph engines do not leverage the characteristic of social networks following the power-law degree distribution. We propose a new graph engine, RealGraph, that is tailored for huge social networks by exploiting the power-law degree property and aims at processing and analyzing them efficiently.

  • Spam2Vec: Learning Biased Embeddings for Spam Detection in Twitter
    Authors: Suman Kalyan Maity, Santosh K C and Arjun Mukherjee

    Keywords: Spam detection, Biased embedding, Biased Random walks

    Abstract:
    In this paper, we propose a semi-supervised framework, Spam2Vec, to identify spammers on Twitter. This algorithmic framework learns spam representations of the nodes in the network by leveraging biased random walks. Our spammer detection method yields an AUC of 0.54 with a precision@100 of 0.12, and performs significantly better than the best-performing baseline, with a 7.77% increase in AUC and a 2.4 times improvement in precision.

  • Attention Network for Information Diffusion Prediction
    Authors: Zhitao Wang, Chengyao Chen and Wenjie Li

    Keywords: Information Diffusion, Attention Network, Diffusion Dependency

    Abstract:
    In this paper, we propose an attention network for the diffusion prediction problem. The developed diffusion attention module can effectively explore the implicit user-to-user diffusion dependency among information cascade users. Additionally, the user-to-cascade importance and the time-decay effect are also captured and utilized by the model. The superiority of the proposed model over state-of-the-art methods is demonstrated by diffusion prediction experiments on real diffusion data.

  • Path Ranking with Path Difference Sets for Maintaining Knowledge Base Integrity
    Authors: Po-Cheng Huang, Hen-Hsen Huang and Hsin-Hsi Chen

    Keywords: Knowledge base completion, Knowledge base integrity, Path ranking

    Abstract:
    Knowledge base completion (KBC) involves discovering missing facts. However, knowledge changes over time. Some facts need to be removed from the knowledge base (KB) to maintain knowledge base integrity (KBI) while new facts are inserted or old facts are deleted. This paper proposes a path-based learning model to learn the dependency of dynamic relations automatically. In this way, we can eliminate conflicting facts and keep the KB clean. That would be a significant benefit for KBC and other tasks that use a KB.

  • LinCa: A Page Loading Time Optimization Approach for Users Subject to Internet Access Restriction
    Authors: Chen Ling, Lei Wang, Jun Lang, Qiufen Xia, Guoxuan Chang, Kun Wang and Peng Zhao

    Keywords: Page Loading Time Optimization, Invalid Link, Internet Access Restriction, User Experience, Rule Base, LinCa

    Abstract:
    Internet access restriction is ubiquitous and will only intensify. Because of such restriction, unblocked websites or web pages are often accompanied by a couple of fated-to-fail invalid links to blocked servers, and users wait a long time for the contents of these invalid links to load. Therefore, it is better to directly show expired links to users rather than making users wait excessively. To this end, page loading time should be optimized to give users a better experience. Existing strategies optimize page loading time without considering the latency caused by Internet access restriction. In this paper, we present LinCa (Links Catcher), a novel approach that reduces page loading time on the client side by parsing all requests when a navigation starts and intercepting all invalid links, with the aim of optimizing page loading time by fully considering Internet access restriction. To this end, we first create and maintain a Rule Base to store invalid links under given access restriction rules. We then update the Rule Base periodically to cover as many invalid links as possible and to remove links that become valid. We finally demonstrate the effectiveness of LinCa through experiments by building a Chrome extension. Experimental results show that LinCa can efficiently reduce page loading time subject to Internet access restriction.

  • Tink: a temporal graph analytics library for Apache Flink
    Authors: Wouter Lightenberg, Yulong Pei, George Fletcher and Mykola Pechenizkiy

    Keywords: Temporal Graph Analytics, Apache Flink, Gelly, Tink

    Abstract:
    We introduce the Tink library for distributed temporal graph analytics. Increasingly, reasoning about temporal aspects of graph-structured data collections is an important aspect of analytics. For example, in a communication network, time plays a fundamental role in the propagation of information within the network. Whereas existing tools for temporal graph analysis are built stand alone, Tink is a library in the Apache Flink ecosystem, thereby leveraging its advanced mature features such as distributed processing and query optimization. Furthermore, Flink requires little effort to process and clean the data without having to use different tools before analyzing the data. Tink focuses on interval graphs in which every edge is associated with a starting time and an ending time. The library provides facilities for temporal graph creation and maintenance, as well as standard temporal graph measures and algorithms. Furthermore, the library is designed for ease of use and extensibility.

  • Contextual Word Embedding: A Case Study in Clustering Tweets about Emergency Situations
    Authors: Debasis Ganguly and Kripabandhu Ghosh

    Keywords: Word embedding, Tweets, Clustering

    Abstract:
    Effective clustering of short documents, such as tweets, is difficult because of the lack of enough semantic context. Word embedding is a technique that can be applied to address this lack of semantic context. However, the process of word vector embedding, in turn, relies on the availability of sufficient contexts to learn the word associations. To get around this problem, we propose a novel word vector training approach that takes into account topically similar tweets to better learn the word associations. We test our proposed word embedding approach by clustering a collection of tweets on disasters. We observe that the proposed method improves clustering effectiveness by up to 14%.

  • The Effects of Real-world Events on Music Listening Behavior: An Intervention Time Series Analysis
    Authors: Markus Schedl, Christine Bauer and Eelco Wiechert

    Keywords: intervention time series analysis, music listening behavior, real-world events, Last.fm, Google Trends

    Abstract:
    We address the research question of whether real-world events, such as sport events or product launches, influence music consumption behavior. To this end, we consider events of different categories from Google Trends and model listening events as time series using Last.fm data. Performing an auto-regressive integrated moving average analysis to decompose the signal and subsequently an intervention time series analysis, we find significant signal discontinuities, in particular for the Google news category. For the categories news and events, an upward jump is found, while tech events cause less music to be played.

  • Editorial Algorithms: Optimizing Recency, Relevance and Diversity for Automated News Curation
    Authors: Abhijnan Chakraborty, Mohammad Luqman, Sidhartha Satapathy and Niloy Ganguly

    Keywords: Editorial Algorithms, Recency, Relevancy, Diversity, News Recommendation

    Abstract:
    With a humongous amount of information getting published online, news websites need to curate interesting news stories for their readers. Although traditionally news curation was the sole domain of human editors, the volume of information has led many media outlets to turn to editorial algorithms. However, such algorithms are often proprietary, and smaller media outlets may not have the resources to build them from scratch. In this paper, we present a novel framework, Brahma, to automatically curate news stories by optimizing recency, relevancy and diversity of the selected stories. Extensive evaluations over two real-world news datasets show that Brahma outperforms several state-of-the-art baselines in matching the news curation performed by human editors.

  • Trusternity: auditing Transparent-log server with blockchain
    Authors: Long Nguyen, Claudia-Lavinia Ignat and Olivier Perrin

    Keywords: blockchain, key transparency, audit, Ethereum

    Abstract:
    A public key server is a simple yet effective means of key management in secure end-to-end communication. To ensure the trustworthiness of a public key server, transparent log systems such as CONIKS employ a tamper-evident data structure on the server and a gossiping protocol among clients in order to detect compromised servers. However, due to lack of incentive and vulnerability to malicious clients, a gossiping protocol is hard to implement in practice. Meanwhile, alternative solutions such as EthIKS are not scalable. This paper presents Trusternity, an auditing scheme relying on the Ethereum blockchain that is easy to implement, scalable and inexpensive to operate.

  • MapSQ: A Plugin-based MapReduce Framework for SPARQL Queries on GPU
    Authors: Jiaming Song, Xiaowang Zhang, Peng Peng, Zhiyong Feng and Lei Zou

    Keywords: RDF, SPARQL, MapReduce, GPU, Parallel computing

    Abstract:
    In this poster, we present MapSQ, a plugin-based framework with three parts that uses the high performance of GPUs to accelerate SPARQL query answering in a convenient way. The Selector chooses a suitable join ordering according to the characteristics of the data and queries. The Executor answers subqueries and returns intermediate solutions, and GPU Computing obtains the join of the intermediate solutions through MapReduce. Finally, we evaluate MapSQ built on gStore and RDF-3X on the LUBM benchmark and YAGO datasets (over 200 million triples). The experimental results show that MapSQ significantly improves the performance of SPARQL query engines, with a speedup of up to 33.

  • Demystifying Advertising Campaign for CPA Goal Optimization
    Authors: Deguang Kong, Konstantin Shmakov and Jian Yang

    Keywords: optimization, advertising, CPA

    Abstract:
    In cost-per-click (CPC) or cost-per-impression (CPM) advertising campaigns, advertisers always run the risk of spending the budget without getting enough conversions. Moreover, the bidding on advertising inventory has little connection to the propensity to reach cost-per-acquisition (CPA) goals. To address this problem, this paper presents a bid optimization scenario to achieve the desired CPA goals for advertisers. In particular, we build an optimization engine that makes decisions by solving a constrained optimization problem. The proposed model can naturally recommend a bid that meets the advertisers’ expectations by making inferences over historical auction behaviors. The bid optimization model outperforms the baseline methods on real-world campaigns, and can be applied to a wide range of scenarios for performance improvement and revenue lift.

  • Siamese Cookie Embedding Networks for Cross-Device User Matching
    Authors: Ugo Tanielian, Anne-Marie Tousch and Flavian Vasile

    Keywords: cross-device, cookie matching, convolutional networks, sequence embedding, similarity learning

    Abstract:
    Over the last decade, the number of devices per person has increased substantially. This poses a challenge for cookie-based personalization applications, such as online search and advertising, as it narrows the personalization signal to a single device environment. A key task is to find which cookies belong to the same person to recover a complete cross-device user journey. Recent work on the topic has shown the benefits of using unsupervised embeddings learned on user event sequences. In this paper, we extend this approach to a supervised setting and introduce the Siamese Cookie Embedding Network (SCEmNet), a siamese convolutional architecture that leverages the multi-modal aspect of sequences, and show significant improvement over the state-of-the-art.

  • Matching Resumes to Jobs via Deep Siamese Network
    Authors: Saket Maheshwary and Hemant Misra

    Keywords: Resume-Job Matching, Deep Learning, Natural Language Processing

    Abstract:
    In this paper we investigate the important and challenging task of finding appropriate jobs for job-seeking candidates by matching semi-structured resumes of candidates to job descriptions. To perform this task we propose to use a siamese adaptation of a convolutional neural network. The proposed approach effectively captures the underlying semantics, thus enabling us to project similar resumes and job descriptions closer to each other, and to make dissimilar resumes and job descriptions distant from each other in the semantic space. Our experimental results on a set of 1314 resumes and a set of 3809 job descriptions (5,005,026 resume-job description pairs) demonstrate that our approach is better than the current state-of-the-art approaches.

  • Visualizing the Flow of Discourse with a Concept Ontology
    Authors: Baoxu Shi and Tim Weninger

    Keywords: Wikipedia, Concept Ontology, Discourse Visualization

    Abstract:
    Understanding and visualizing human discourse has long been a challenging task. Although recent work on argument mining has shown success in classifying the role of various sentences, the task of recognizing concepts and understanding the ways in which they are discussed remains challenging. Given an email thread, online discussion, or a transcript of a group discussion, our task is to extract the relevant concepts and understand how they are referenced and re-referenced throughout the discussion. In the present work, we present a preliminary approach for extracting and visualizing group discourse by adapting Wikipedia’s category hierarchy to be an external concept ontology. From a user study, we found that our method achieved better results than 4 strong alternative approaches, and we illustrate our visualization method based on the extracted discourse flows.

  • An Inflection Point Approach for Advertising Bid Optimization
    Authors: Deguang Kong, Konstantin Shmakov and Jian Yang

    Keywords: inflection, optimization, bid

    Abstract:
    In online advertising, a common objective for advertisers is to get the maximum return on investment given the budget. On one hand, if the bid is too high, the advertiser pays more than necessary for the same number of clicks. On the other hand, if the bid is too low, the advertiser cannot win in auctions and therefore loses the opportunity. A challenging problem is how to recommend the bid that achieves the maximum value for advertisers. In this paper, we present an inflection point approach for bid recommendation that discovers the bid price of the click(bid) function at which the function changes from significant increase (i.e. concave downward) to slow increase (convex upward). We derive the optimal solution using sparse and noisy historical observations given the budget limit. In real-world advertising campaign evaluations, the proposed bid recommendation scenario brings a 15.47% bid increase and a 30.24% click increase over the baselines.

  • Patterns of volunteer behaviour across online citizen science
    Authors: Helen Spiers, Ali Swanson, Lucy Fortson, Brooke Simmons, Laura Trouille, Samantha Blickhan and Chris Lintott

    Keywords: Citizen Science, Volunteer Behaviour, Project Design

    Abstract:
    An increasingly applied data reduction technique is to develop human-computer systems, such as citizen science platforms like the Zooniverse. These platforms function as social machines, combining volunteer efforts with automated processes to enable distributed data analysis. The rapid growth of this approach is increasing the need to understand how we can improve volunteer interaction and engagement. Here, we utilize the most comprehensive collection of online citizen science data gathered to date to examine multiple variables across 63 Zooniverse projects. Our analyses reveal how subtle design changes can influence many facets of volunteer interaction, generating insights that have implications for the design and study of citizen science projects, and future research.

  • Crowd-Machine Collaboration for Item Screening
    Authors: Evgeny Krivosheev, Bahareh Harandizadeh, Fabio Casati and Boualem Benatallah

    Keywords: crowdsourcing, machine learning, hybrid systems, classification

    Abstract:
    In this paper we describe how crowd and machine classifiers can be efficiently combined to screen items that satisfy a set of predicates. We show that this is a recurring problem in many domains, present machine-human (hybrid) algorithms that screen items efficiently, and estimate the gain over human-only or machine-only screening in terms of performance and cost.

  • Fairness of Extractive Text Summarization
    Authors: Anurag Shandilya, Kripabandhu Ghosh and Saptarshi Ghosh

    Keywords: Extractive text summarization, Fairness, Fair summarization, Adverse impact

    Abstract:
    We propose to evaluate extractive summarization algorithms from a completely new perspective. Considering that an extractive summarization algorithm selects a subset of the textual units in the input for inclusion in the summary, we investigate whether this selection is fair. We use several summarization algorithms over datasets that have a certain distribution of a sensitive attribute (e.g., gender, political affiliation), and find that the generated summaries often have very different distributions of the said attribute. Specifically, some classes of the textual units are under-represented in the summaries according to the fairness notion of adverse impact. To our knowledge, this is the first work on fairness of text summarization, and is likely to open up interesting research problems.

  • Text-Enriched Representations for News Image Classification
    Authors: Elias Moons, Tinne Tuytelaars and Marie-Francine Moens

    Keywords: News image classification, Deep learning, Limited training data

    Abstract:
    Images have a prominent role in the communication of news on the Web. We propose a novel method for image classification with subject categories when limited annotated images are available for training the classifier. A neural network based encoder learns image representations from paired news images and their texts. Once trained, this encoder transforms any image to a text-enriched representation of the image, which is then used as input for the classifier that categorizes an image according to its subject category. In our experiments we have trained classifiers with different amounts of annotated images and have found that the image classifier that uses the text-enriched image representations outperforms a baseline model that only uses image features, especially in cases where the size of the annotated training dataset is small.

  • User Fairness in Recommender Systems
    Authors: Jurek Leonhardt, Avishek Anand and Megha Khosla

    Keywords: fairness, user fairness, recommender system, post-processing, collaborative filtering

    Abstract:
    Recent works in recommendation systems have focused on diversity in recommendations as an important aspect of recommendation quality. In this work we argue that the post-processing algorithms aimed at only improving diversity among recommendations lead to discrimination among the users. We introduce the notion of user fairness which has been overlooked in literature so far and propose measures to quantify it. Our experiments on two diversification algorithms show that an increase in aggregate diversity results in increased disparity among the users.

  • FORank: Fast ObjectRank for Large Heterogeneous Graphs
    Authors: Tomoki Sato, Hiroaki Shiokawa, Yuto Yamaguchi and Hiroyuki Kitagawa

    Keywords: ObjectRank, Graph data mining, Link structure analysis

    Abstract:
    ObjectRank is one of the popular graph mining methods that enables us to evaluate the importance of each vertex on heterogeneous graphs. However, it is computationally expensive to apply it to large graphs since ObjectRank needs to compute the importance of all vertices iteratively. In this work, we present a fast ObjectRank algorithm, FORank, that accurately approximates the keyword search results. FORank iteratively prunes vertices whose convergence score likely has less impact on the results during iterative computation. The experiments showed that FORank runs 7 times faster than ObjectRank computation with over 90% approximation accuracy.

  • Contextual Web Summarization: A Supervised Ranking Approach
    Authors: Amit Sarkar and G Srinivasaraghavan

    Keywords: Summarization, Context, Salience ranking, Information retrieval

    Abstract:
    We investigate the task of contextual web summarization, where the goal is to extract information relevant to the current reading context from a cited web article. In certain linked document sets, such as Wikipedia articles, scientific papers, and news and blogs, such contextual summaries can help the reader by providing additional related information. In this work, we tried to identify the significant components that contribute to building up the reading context. We built a supervised model for ranking sentences from the cited document according to their contextual salience. Initial evaluation based on an annotated dataset of web articles shows that our ranking model performs better than the baseline generic summaries as well as context-sensitive summaries.

  • Identifying Task Boundaries in Digital Assistants
    Authors: Madian Khabsa, Ahmed El Kholy, Ahmed Awadallah, Imed Zitouni and Milad Shokouhi

    Keywords: task boundaries, digital assistants, session

    Abstract:
    Digital assistants are becoming more prevalent in our daily lives. In interacting with these assistants, users may engage in multiple tasks within a short period of time. Identifying task boundaries and isolating them within a session is critical for measuring the performance of the system on each individual task. In this paper we aim to automatically identify sequences of interactions within a session that together constitute a task. To this end, we sample interactions from a real-world digital assistant and use crowd judges to segment a session into multiple tasks. After that, we use a machine-learned model to identify task boundaries. Our learned model with its features significantly outperforms the baselines. To the best of our knowledge, this is the first work that aims to identify tasks within digital assistant sessions.

  • Homophily at Academic Conferences
    Authors: Martin Atzmueller and Florian Lemmerich

    Keywords: Homophily, Hypotheses, Network Analysis, Bayesian Statistics

    Abstract:
    Academic conferences are a backbone for the exchange of ideas in scientific communities. However, so far little is known about the communication networks emerging at those venues. Besides personal knowledge, network homophily has been identified as a driving factor for establishing contacts and followerships in social networks, i.e., people are more likely to engage with others if they are similar with respect to certain attributes. In this paper, we describe work in progress on investigating homophily at four academic conferences based on face-to-face (F2F) contact data collected using wearable sensors between conference participants. In particular, we study which personal attributes are predictive for face-to-face contacts. For that purpose, we obtained diverse personal attributes from online sources in order to elicit a variety of hypotheses, which can then be compared using descriptive statistics and a Bayesian method for comparing hypotheses in networks. Our results suggest that personal knowledge (as derived from DBLP and ResearchGate networks) and homophilic behavior with respect to several attributes, e.g., gender or country of origin, are important factors for contacts at academic conferences.

  • Practical Privacy-Preserving Friend Recommendations on Social Networks
    Authors: William Brendel, Fangqiu Han, Luis Marujo, Aleksandra Korolova and Roger Jie Luo

    Keywords: Friend Recommendations, Privacy, Social Networks

    Abstract:
    Making friend recommendations is an important task for social networks, as having more friends typically leads to a better user experience. Most current friend recommendation systems grow the existing network at the cost of privacy. In particular, any given user’s friend graph may be directly or indirectly leaked as a result of such recommendations. In many situations this is not desirable, as the friend list may reveal much about the user, from their identity to their sexual orientation and interests. In this work, we focus on the “cold start” problem (our framework can also be used for warm start) of making friend recommendations for new users while raising the bar on protecting the privacy of the friend list of all users. We propose a practical real-time friend recommendation framework, tested on the Snapchat social network, that preserves the privacy of users’ friends lists with respect to brute-force attacks and scales to millions of users.

  • The Equilibrium of IA-Select Diversification
    Authors: Yingying Wu, Yiqun Liu, Ke Zhou, Xiaochuan Wang, Min Zhang and Shaoping Ma

    Keywords: Diversified search, Subtopic retrieval, Heuristic algorithms

    Abstract:
    Diversifying search results to satisfy as many users’ intents as possible is NP-hard. There has been a plethora of studies on the result diversification problem; some employ a pruned exhaustive search and some use the greedy algorithm. However, the objective function of the result diversification problem adopts the cascade assumption, which assumes users’ information needs will drop once their subtopic search intents are satisfied. As a result of this assumption, the intent distribution of diversified results deviates from the actual distribution of user intentions until each subtopic is chosen equally. Such a selection is unreasonable, especially when the original distribution of user intent is unbalanced. In this paper, we prove that having the standard deviation of subtopic distribution approach zero is a necessary and sufficient condition for the diversification equilibrium and provide empirical evidence for the equilibrium.

  • Deep Modeling of the Evolution of User Preferences and Item Attributes in Dynamic Social Networks
    Authors: Peizhi Wu, Yi Tu and Zhenglu Yang

    Keywords: user modeling, social networks, MLP, RNN

    Abstract:
    Modeling the evolution of user preferences and item attributes in a dynamic social network is important because it is the basis for many applications, including recommendation systems and user behavior analysis. The majority of previous methods in this area only considered incomplete aspects of the problem, such as dynamic preferences of users, dynamic attributes of items, evolution of a social network, or partial integration. This study proposes a comprehensive general neural framework with several optimal strategies to jointly model the evolution of user preferences and item attributes in dynamic social networks. Experimental results conducted on two real-world datasets demonstrate that our model performs better than the state-of-the-art methods.

  • Machine Learning for the Peer Assessment Credibility
    Authors: Yingru Lin, Soyeon Han and Byeong-Ho Kang

    Keywords: Online Peer Assessment, Educational Data Mining, Credibility Assessment, Machine Learning, Web based Educational System

    Abstract:
    The peer assessment approach is considered to be one of the best solutions for scaling both assessment and peer learning to global classrooms, such as MOOCs. However, some academic staff hesitate to use a peer assessment approach for their classes due to concerns about its credibility and reliability. The focus of our research is to detect the credibility level of each assessment performed by students during peer assessment. We found three major scopes in assessing the credibility level of evaluations: 1) Informativity, 2) Accuracy, and 3) Consistency. We collect assessments, including comments and grades provided by students during the peer assessment process, and then each feedback-and-grade pair is labeled with its credibility level by Mechanical Turk evaluators. We extract relevant features from each labeled assessment and use them to build a C5.0 Decision Tree classifier that attempts to automatically assess its level of credibility. The evaluation results show that the model can be used to automatically classify peer assessments as credible or non-credible, with accuracy in the range of 88%.

  • A Multi-Attention based Neural Network Model for Predicting Story Ending
    Authors: Qian Li, Ziwei Li, Zhenglu Yang and Yanhui Gu

    Keywords: natural language processing, neural network, story cloze

    Abstract:
    Predicting the ending of a story is an interesting issue that has attracted considerable attention, as in the case of the ROC Story Cloze Task (SCT). Although several studies have addressed this issue, the performance remains unsatisfactory due to ineffective story comprehension. In this paper, we propose to construct a multi-attention neural network model (MANN) with well-designed optimizations such as Highway Network, and concatenated features with embedding representations. The effectiveness of our model is experimentally evaluated and demonstrated to be superior to that of state-of-the-art approaches.

  • An Effective Joint Framework of Extractive and Abstractive Summarization with Adversarial Learning
    Authors: Min Gui, Zhengkun Zhang and Zhenglu Yang

    Keywords: Abstractive Summarization, Extractive summarization, Sequence-to-Sequence Framework

    Abstract:
    Document summarization is an important research issue and has attracted much attention from the academe. The approaches for document summarization can be classified as extractive and abstractive. As far as we know, however, no study has integrated these two methods. In this work, we introduce an effective joint framework that integrates extractive and abstractive summarization models, with optimal strategies such as adversarial learning borrowed from LeakGAN to address the problem. Experiments on a real benchmark dataset demonstrate that our model is competitive with state-of-the-art methods by up to 2 ROUGE points.