Program
of
track
Web Search and Mining

List of accepted papers :

  • Joint User- and Event- Driven Stable Social Event Organization
    Authors: Xin Wang, Wenwu Zhu, Chun Chen and Martin Ester

    Keywords: social event organization, stability, user happiness, organizer happiness

    Abstract:
    The problem of social event organization (SEO) rises with the advent of online
    web services such as Meetup and Plancast, which plays an important role in
    helping users discover new offline events.
    To date existing work on SEO only assumes that different users have different
    preferences towards different events, ignoring the fact that
    each event (its organizer) may have a separate preference towards every user.
    In this paper, we investigate joint user- and event- driven SEO by simultaneously
    considering user preferences (towards events) and event preferences (towards users).
    A risen challenging problem is that this joint consideration may suffer
    instabilities between
    users and events which are NP-hard to handle in SEO.
    Stability is a desired property that needs to be maintained in SEO,
    otherwise participants will incline towards
    changing to other events and trust less the organizer.
    To the best of our knowledge, we are the first to study SEO with
    both preferences from users to events
    and preferences from events to users being considered.
    In this work, we formulate the stable social event organization (Stable-SEO) problem
    and prove the NP-hardness of Stable-SEO.
    We then propose an efficient greedy heuristic
    algorithm to solve Stable-SEO, taking both user preferences and event preferences
    into account.
    The proposed approach is able to find an assignment of a group of users
    to a set of events for an event organization,
    in which there is a minimized number of users involved in
    user-event pairs such that both the user and event in one such pair will be
    better off when reassigning the user to this event.
    Our experiments on two real-world datasets
    demonstrate the strong superiority of our proposed approach over existing methods.

  • Learning on Partial-Order Hypergraphs
    Authors: Fuli Feng, Xiangnan He, Yiqun Liu, Liqiang Nie and Tat-Seng Chua

    Keywords: Hypergraph, Partial-Order Hypergraph, Graph-based Learning, University Ranking, Popularity Prediction

    Abstract:
    Graph-based learning methods explicitly consider the relations between two entities (i.e., vertices) for learning the prediction function. They have been widely used in semi-supervised learning, manifold ranking, and clustering, among other tasks. Enhancing the expressiveness of simple graphs, hypergraphs formulate an edge as a link to multiple vertices, so as to model the higher-order relations among entities.
    Typically, hyperedges in a hypergraph are used to encode certain kinds of similarity among vertices.
    To the best of our knowledge, all existing hypergraph structures represent the hyperedge as an unordered set of vertices, without considering the possible ordering relationship among vertices. In real-world data, the ordering relationships commonly exist, such as in graded categorical features (e.g., users’ ratings on movies) and numerical features (e.g., monthly income of customers). When constructing a hypergraph, ignoring such ordering relationships among entities will lead to severe information loss, resulting in suboptimal performance of the subsequent learning algorithms.
    In this work, we address the inherent limitation of existing hypergraphs by proposing a new data structure named Partial-Order Hypergraph, which specifically injects the partially ordering relationships of vertices into a hyperedge. We develop regularization-based learning theories for partial-order hypergraphs, generalizing conventional hypergraph learning by incorporating logical rules that encode the partial-order relations. We apply our proposed method to two applications: university ranking from Web data and popularity prediction of online content. Extensive experiments demonstrate the superiority of our proposed partial-order hypergraphs, which consistently improve over conventional hypergraphs.

  • “Satisfaction with Failure” or “Unsatisfied Success”: Investigating the Relationship between Search Success and User Satisfaction
    Authors: Mengyang Liu, Yiqun Liu, Jiaxin Mao, Cheng Luo, Min Zhang and Shaoping Ma

    Keywords: search evaluation, user satisfaction, search success

    Abstract:
    User satisfaction has been paid much attention to in recent Web search evaluation studies. Although satisfaction is often considered as an important symbol of search success, it doesn’t guarantee success in many cases, especially for complex search task scenarios. In this study, we investigate the differences between user satisfaction and search success, and try to adopt the findings to predict search success in complex search tasks. To achieve these research goals, we conduct a laboratory study in which search success and user satisfaction are annotated by domain expert assessors and search users, respectively. We find that both “Satisfaction with Failure” and “Unsatisfied Success” cases happen in these search tasks and together they account for as many as 40.3% of all search sessions. The factors (e.g. document readability and credibility) that lead to the inconsistency of search success and user satisfaction are also investigated and adopted to predict whether one search task is successful. Experimental results show that our proposed prediction method is effective in predicting search success.

  • TEM: Tree-enhanced Embedding Model for Explainable Recommendation
    Authors: Xiang Wang, Xiangnan He, Fuli Feng, Liqiang Nie and Tat-Seng Chua

    Keywords: Explainable Recommendation, Tree-based Model, Embedding-based Model, Neural Attention Network

    Abstract:
    While collaborative filtering is the dominant technique in personalized recommendation, it models user-item interactions only and cannot provide concrete reasons for a recommendation. Meanwhile, the rich side information affiliated with user-item interactions (e.g., user demographics and item attributes), which provide valuable evidence that why a recommendation is suitable for a user, has not been fully explored in providing explanations.
    On the technical side, embedding-based methods, such as Wide&Deep and neural factorization machines, provide state-of-the-art recommendation performance. However, they work like a black-box, for which the reasons underlying a prediction cannot be explicitly presented. On the other hand, tree-based methods like decision trees predict by inferring decision rules from data. While being explainable, they cannot generalize to unseen feature interactions thus fail in collaborative filtering applications.
    In this work, we propose a novel solution named Tree-enhanced Embedding Method that combines the strengths of embedding-based and tree-based models. We first employ a tree-based model to learn explicit decision rules (a.k.a., cross features) from the rich side information. We next design an embedding model that can incorporate explicit cross features and generalize to unseen cross features on user ID and item ID. At the core of our embedding method is an easy-to-interpret attention network, making the recommendation process fully transparent and explainable. We conduct experiments on two datasets of tourist attraction and restaurant recommendation, demonstrating the superior performance and explainability of our solution.

  • Ad Hoc Table Retrieval using Semantic Similarity
    Authors: Shuo Zhang and Krisztian Balog

    Keywords: table retrieval, table search, semantic similarity

    Abstract:
    We introduce and address the problem of ad hoc table retrieval: answering a keyword query with a ranked list of tables. This task is not only interesting on its own account, but is also being used as a core component in many other table-based information access scenarios, such as table completion or table mining.
    The main novel contribution of this work is a method for performing semantic matching between queries and tables.
    Specifically, we (i) represent queries and tables in multiple semantic spaces (both discrete sparse and continuous dense vector representations) and (ii) introduce various similarity measures for matching those semantic representations.
    We consider all possible combinations of semantic representations and similarity measures and use these as features in a supervised learning model.
    Using a purpose-built test collection based on Wikipedia tables, we demonstrate significant and substantial improvements over a state-of-the-art baseline.

  • Query Suggestion With Feedback Memory Network
    Authors: Bin Wu, Chenyan Xiong, Maosong Sun and Zhiyuan Liu

    Keywords: Query Suggestion, Feeback Memory Network, User Model

    Abstract:
    This paper presents, Feedback Memory Network (FMN), which models user interactions with the search engine for query suggestion.
    Besides modeling the queries issued by the user, FMN also considers user’s feedback on the search results.
    It converts user’s browsing and click actions to the attention over the top-ranked documents, combines them into the feedback memories of the query, and better models the user’s information needs.
    The feedback memories and the query sequence are then combined to suggest queries by the sequence-to-sequence neural network.
    Modeling user feedback makes it possible to suggest diverse queries for the same query sequence if users have preferred different search results, which indicate different information needs.
    Our experiments on the search log from a Chinese commercial search engine showed the stable and robust advantages of FMN.
    Especially when the feedback is richer or more informative, FMN provides more diverse and accurate suggestions, which is exceptionally helpful for ambiguous sessions where more information is needed to infer the search intents.

  • A Sparse Topic Model for Extracting Aspect-Specific Summaries from Online Reviews
    Authors: Vineeth Rakesh, Weicong Ding, Aman Ahuja, Nikhil Rao, Yifan Sun and Chandan K. Reddy

    Keywords: topic model, generative model, aspect summarization, text processing, information retrieval, approximate inference, gibbs sampling

    Abstract:
    Online reviews have become an inevitable part of a consumer’s decision making process, where the likelihood of purchase not only depends on the product’s overall rating, but also on the description of it’s aspects. Therefore, websites such as Amazon, Walmart, and Netflix constantly encourage users to write good quality reviews and categorically summa- rize different facets of the product. However, despite such efforts, it takes a significant effort to skim through thousands of reviews and look for answers that addresses the query of consumers. For example, a gamer might be interested in buying a monitor with fast refresh rates, support for Gsync and Freesync technologies etc., while a photographer might be interested in aspects such as color reproduction and Delta-e scores. Therefore, in this paper, we propose a generative aspect summarization model called APSUM that is capable of providing fine-grained summaries of on- line reviews. To overcome the inherent problem of aspect sparsity, we jointly constraint both the document-topic and the word-topic distribution by introducing a semi-supervised variation of the spike-and-slab prior. Using rigorous set of experiments, we show that the proposed model is capable of outperforming other state-of-the-art aspect-topic models over a variety of datasets and deliver intuitive fine-grained summaries that could simplify the purchase decisions of customers.

  • Neural Attentional Rating Regression with Review-level Explanations
    Authors: Chong Chen, Min Zhang, Yiqun Liu and Shaoping Ma

    Keywords: Recommender systems, Rating prediction, Attention model, Recommendation explanation, Review usefulness

    Abstract:
    Reviews information is dominant for users to make online purchasing decisions in e-commerces. However, the usefulness of reviews is varied. We argue that less-useful reviews hurt model’s performance, and are also less meaningful for user’s reference. While some existing models utilize reviews for improving the performance of recommender systems, few of them consider the usefulness of reviews for recommendation quality. In this paper, we introduce a novel attention mechanism to explore the usefulness of reviews, and propose a Neural Attentional Regression model with Review-level Explanations (NARRE) for recommendation. Specifically, NARRE can not only predict precise ratings, but also learn the usefulness of each review simultaneously. Therefore, the highly-useful reviews are obtained which provide review-level explanations to help users make better and faster decisions. Extensive experiments on benchmark datasets of Amazon and Yelp on different domains show that the proposed NARRE model consistently outperforms the state-of-the-art recommendation approaches, including PMF, NMF, SVD++, HFT, and DeepCoNN in terms of rating prediction, by the proposed attention model that takes review usefulness into consideration. Furthermore, the selected reviews are shown to be effective when taking existing review-usefulness ratings in the system as ground truth. Besides, crowd-sourcing based evaluations reveal that in most cases, NARRE achieves equal or even better performances than system’s usefulness rating method in selecting reviews. And it is flexible to offer great help on the dominant cases in real e-commerce scenarios when the ratings on review-usefulness are not available in the system.

  • Learning from Multi-View Multi-Way Data via Structural Factorization Machines
    Authors: Chun-Ta Lu, Lifang He, Hao Ding, Bokai Cao and Philip S. Yu

    Keywords: multi-way interaction, tensor factorization, multi-view learning

    Abstract:
    Real-world relations among entities can often be observed and determined by different perspectives/views. For example, the decision made by a user on whether to adopt an item relies on multiple aspects such as the contextual information of the decision, the item’s attributes, the user’s profile and the reviews given by other users. Different views may exhibit multi-way interactions among entities and provide complementary information. In this paper, we introduce a multi-tensor-based approach that can preserve the underlying structure of multi-view data in a generic predictive model. Specifically, we propose structural factorization machines (SFMs) that learn the common latent spaces shared by multi-view tensors and automatically adjust the importance of each view in the predictive model. Furthermore, the complexity of SFMs is linear in the number of parameters, which make SFMs suitable to large-scale problems. Extensive experiments on real-world datasets demonstrate that the proposed SFMs outperform several state-of-the-art methods in terms of prediction accuracy and computational cost.

  • Scalable Supervised Discrete Hashing for Large-Scale Search
    Authors: Xin Luo, Ye Wu and Xin-Shun Xu

    Keywords: Learning-to-Hash, Supervised Hashing, Scalable Search, Discrete Optimization

    Abstract:
    Supervised hashing methods have attracted much attention in these years. However, most existing supervised hashing algorithms have some of the following problems. First, most of them leverage the pairwise similarity matrix, whose size is quadratic to the number of training samples, to supervise the learning of hash codes. Thus, they are not scalable when dealing with large data. Second, most of them relax the discrete constraints for easy optimization and then quantize the learnt real-valued solution to binary hash codes. Therefore, the quantization error caused by the relaxation may lead to a decline of retrieval performance. To address these issues and make the supervised method scalable to large datasets, we present a novel hashing method, named Scalable Supervised Discrete Hashing (SSDH). Specifically, based on a new loss function, SSDH bypasses the direct optimization on the n by n pairwise similarity matrix. In addition, SSDH adopts no relaxation optimization scheme in the learning procedure and avoids the large quantization error problem. Moreover, during learning, it leverages both the pairwise similarity matrix and label matrix; thus, more semantic information can be embedded to the learning of hash codes. Extensive experiments are conducted on six benchmark datasets including two large-scale datasets, i.e., NUSWIDE and ImageNet. The results show that SSDH can outperform state-of-the-art baselines on these datasets, demonstrating its effectiveness and efficiency.

  • Subgraph-augmented Path Embedding for Semantic User Search on Heterogeneous Social Network
    Authors: Zemin Liu, Vincent W. Zheng, Zhou Zhao, Hongxia Yang, Kevin Chang, Minghui Wu and Jing Ying

    Keywords: semantic user search, heterogeneous social network, subgraph-augmented path embedding

    Abstract:
    Semantic user search is an important task on heterogeneous social network. Its core problem is to measure the proximity between two user objects in the network w.r.t. certain semantic user relation. The state-of-the-art solutions often take a path-based approach, which uses the sequences of objects connecting a query user and a target user to measure their proximity. Despite the success of this path-based approach, we argue that path as a low-order structure is insufficient to capture the rich semantics between two users. Therefore, in this paper we introduce a novel concept of subgraph-augmented path for semantic user search. Specifically, we consider sampling a set of object paths from a query user to a target user; then in each object path, we replace the linear object sequence between its every two neighboring users with their shared subgraph instances. Such subgraph-augmented paths are expected to leverage both path’s distance awareness and subgraph’s high-order structure. As it is non-trivial to model such subgraph-augmented paths, we develop a Subgraph-augmented Path Embedding (SPE) framework to accomplish the task. We evaluate our solution on six semantic user relations in three real-world public data sets, and show it outperforms the baselines.

  • Leveraging Fine-Grained Wikipedia Categories for Entity Search
    Authors: Denghao Ma, Yueguo Chen, Kevin Chang and Xiaoyong Du

    Keywords: entity search, category matching, language model

    Abstract:
    Ad-hoc entity search, which is to retrieve a ranked list of relevant entities in response to a query of natural language question, has been widely studied. It has been shown that category matching of entities, especially when matching to fine-grained entity categories, is critical to the performance of entity search. However, the potentials of fine-grained Wikipedia categories, has not been well exploited by existing studies. Based on the observation of how people describe entities of a specific type, we propose a headword-and-modifier model to deeply interpret both queries and fine-grained entity categories. Probabilistic generative models are designed to effectively estimate the relevance of headwords and modifiers as a pattern-based matching problem, taking the Wikipedia type taxonomy as an important input to address the ad-hoc representations of concepts/entities in queries. Extensive experimental results on three widely-used test sets: INEX-XER 2009, SemSearch-LS and TREC-Entity, show that our method achieves a significant improvement of the entity search performance over the state-of-the-art methods.

  • Online Compact Convexified Factorization Machine
    Authors: Xiao Lin, Wenpeng Zhang, Min Zhang, Peilin Zhao, Wenwu Zhu, Jian Pei and Junzhou Huang

    Keywords: Factorization Machine, Online Learning, Online Convex Optimization, Conditional Gradient

    Abstract:
    Factorization Machine (FM) is a supervised learning approach with a powerful capability of feature engineering. It yields state-of-the-art performance in various batch learning tasks where all the training data is available before training. However, due to the unavoidable expensive re-training cost, it is clearly impractical for many real-world applications where the data arrives sequentially in a streaming manner. We address this limitation by developing an online learning algorithm for Factorization Machine. The primal challenge of this problem stems from the requirements in Online Convex Optimization, the paramount framework for online learning algorithm design. As no previous formulations for FM meet the requirements in OCO, we first invent a new convexification scheme and the resulting Compact Convexified FM (CCFM) seamlessly meets the requirements. For learning Compact Convexified FM in the online learning setting, most existing algorithms suffer from the expensive projection operation, which is another challenge of the problem. To tackle this challenge, we follow the general projection-free algorithmic framework of Online Conditional Gradient and propose an Online Compact Convex Factorization Machine (OCCFM) algorithm that eschews the projection operation with efficient linear optimization steps. We prove that the proposed algorithm achieves a sub-linear regret bound, which provides theoretical guarantee for its usefulness. To evaluate the empirical performance of OCCFM, we conduct extensive experiments on 6 real-world datasets for recommendation and binary classification tasks. The experimental results show that OCCFM outperforms the state-of-art online learning algorithms.

  • Understanding and Predicting Delay in Reciprocal Relations
    Authors: Jundong Li, Jiliang Tang, Yilin Wang, Yali Wan, Yi Chang and Huan Liu

    Keywords: Reciprocal Relations, Time Delay, Dynamic Networks

    Abstract:
    Reciprocity in directed networks points to user’s willingness to return favors in building mutual interactions. High reciprocity has been widely observed in many directed social media networks such as following relations in Twitter and Tumblr. Therefore, reciprocal relations between users are often regarded as a basic mechanism to create stable social ties and play a crucial role in the formation and evolution of networks. Each reciprocity relation is formed by two parasocial links in a back-and-forth manner with a time delay. Hence, understanding the delay can help us gain better insights into the underlying mechanisms of network dynamics. Meanwhile, the accurate prediction of delay has practical implications in advancing a variety of real-world applications such as friend recommendation and marketing campaign. For example, by knowing when will users follow back, service providers can focus on the users with a potential long reciprocal delay for effective targeted marketing. This paper presents the initial investigation of the time delay in reciprocal relations. Our study is based on a large-scale directed network from Tumblr that consists of 62.8 million users and 3.1 billion user following relations with a timespan of multiple years (from 31 Oct 2007 to 24 Jul 2013). We reveal a number of interesting patterns about the delay that motivate the development of a principled learning model to predict the delay in reciprocal relations. Experimental results on the above mentioned dynamic networks corroborate the effectiveness of the proposed delay prediction model.

  • Hierarchical Variational Memory Network for Dialogue Generation
    Authors: Hongshen Chen, Zhaochun Ren, Jiliang Tang, Yihong Eric Zhao and Dawei Yin

    Keywords: Dialogue generation, Hierarchical Variational Memory Network, Recurrent Encoder-Decoder Model

    Abstract:
    Dialogue systems, also known as interactive conversational agents, are widely used in applications aiming to interact naturally and intelligently with humans. Dialogue generation is about generating utterance given previous utterances as contexts. Among di?erent spectrums of dialogue generation approaches, end-to-end neural generation models surge in recent researches. Unlike other methods, those end-to-end neural generation models are capable of generating natural sounding sentences with a unifed neural network encoder-decoder structure. The end-to-end structure sequentially encodes each word in an input context and generate the response word-by-word deterministically when decoding. However, the lack of variation and the limited ability in capturing long-term dependencies between utterances still make the dialogue generation a challenging research problem. In this paper, we propose a novel neural network encoder-decoder based model—the hierarchical variational memory network (HVMN), to take advantage of both the hierarchical structure and the variational memory network, which not only emulates human-to-human dialogues, but also latently captures the high-level abstract variations and long-term memories during dialogue tracking. Extensive experiments conducted on two benchmark datasets in English and Chinese, and an e-commerce customer service dataset show the e?ectiveness of our proposed method. Moreover, we fnd that HVMN outperforms state-of-the-art baselines in all three datasets signifcantly

  • Strategies for geographical scoping and improving a gazetteer
    Authors: Sanket Kumar Singh and Davood Rafiei

    Keywords: Geographical scope, Bounding box detection, Gazetteer improvement, Gazetteer enrichment, Geotagging

    Abstract:
    Many applications that use geographical databases (a.k.a. gazetteers) rely on the accuracy of the information in the database. However, poor data quality is an issue when data is integrated from multiple sources with different quality constraints and sometimes with little information about the sources. One major consequence of this is that the geographical scope of a location and/or its position may not be known or may not be accurate.
    In this paper, we study the problem of detecting the scope of locations in a geographical database and its applications in identifying inconsistencies and improving the quality of a gazetteer. We develop novel strategies, including probabilistic and geometric approaches, to accurately derive the geographical scope of places based on the spatial hierarchy of a gazetteer as well as other public information (such as area) that may be available. We show how the boundary information derived here can be useful in identifying inconsistencies, enhancing the location hierarchy and improving the applications that rely on gazetteers. Our experimental evaluation on two public-domain gazetteers reveals that the proposed approaches significantly outperform, in terms of the accuracy of the geographical bounding boxes, a baseline that is based on the parent-child relationship of a gazetteer. Among applications, we show that the boundary information derived here can move more than 20% of locations in a public gazetteer to better positions in the hierarchy and that the accuracy of those moves is over 90%.

  • “Detecting Crowdturfing “”Add to Favorites”” Activities in Online Shopping”
    Authors: Ning Su, Yiqun Liu, Zhao Li, Yuli Liu, Min Zhang and Shaoping Ma

    Keywords: Online shopping, Crowdsourcing Manipulation, Spam detection

    Abstract:
    “”Add to Favorites”” is a popular function in online shopping sites which helps users to make a record of potentially interesting items for future purchases. It is usually regarded as a type of explicit feedback signal for item popularity and therefore also adopted as a ranking signal by many shopping search engines. With the increasing usage of crowdsourcing platforms, some malicious online sellers also organize crowdturfing activities to increase the numbers of “”Add to Favorites”” for their items. By this means, they expect the items to gain higher positions in search ranking lists and therefore boost sales. This kind of newly-appeared malicious activity proposes challenges to traditional search spam detection efforts because it involves the participation of many crowd workers who are normal online shopping users in most of the times, and these activities are composed of a series of behaviors including search, browse, click and add to favorites.
    To shed light on this research question, we are among the first to investigate this particular spamming activity by looking into both the task organization information in crowdsourcing platforms and the user behavior information from online shopping sites. With a comprehensive analysis of some ground truth spamming activities from the perspective of behavior, user and item, we propose a factor graph based model to identify this kind of spamming activity. Experimental results based on data collected in practical shopping search environment show that our model helps detect malicious “”Add to Favorites”” activities effectively.

  • Search Process as Transitions Between Neural States
    Authors: Yashar Moshfeghi and Frank Pollick

    Keywords: Search Process, fMRI Study, Neural State, Brain Activity, Functional Brain Network, Information Need, Query Formulation, Query Submission, Relevance Judgment, Satisfaction

    Abstract:
    Search is one of the most performed activities on the World Wide Web. Various conceptual models postulate that the search process can be broken down into distinct emotional and cognitive states of searchers while they engage in a search process. These models significantly contribute to our understanding of the search process. However, they are typically based on self-report measures, such as surveys, questionnaire, etc. and therefore, only indirectly monitor the brain activity that supports such a process. With this work, we take one step further and directly measure the brain activity involved in a search process. To do so, we break down a search process into five time periods: a realisation of Information Need, Query Formulation, Query Submission, Relevance Judgment and Satisfaction Judgment. We then investigate the brain activity between these time periods. Using functional Magnetic Resonance Imaging (fMRI), we monitored the brain activity of twenty-four participants during a search process that involved answering questions carefully selected from the TREC-8 and TREC 2001 Q/A Tracks. This novel analysis that focuses on transitions rather than states reveals the contrasting brain activity between time periods — which enables the identification of the distinct parts of the search process as the user moves through them. This work, therefore, provides an important first step in representing the search process based on the transitions between neural states. Discovering more precisely how brain activity relates to different parts of the search process will enable the development of brain-computer interactions that better support search and search interactions, which we believe our study and conclusions advance.

  • StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow
    Authors: Ziyu Yao, Daniel Weld, Wei-Peng Chen and Huan Sun

    Keywords: Question Answering, Question-Code Pairs, Deep Neural Networks, Stack Overflow

    Abstract:
    Stack Overflow has been a great source of natural language questions and their code solutions (i.e., question-code pairs), which are critical for many tasks including retrieving code snippets based on a keyword query and code annotation using natural language. In most existing research, question-code pairs were collected heuristically and tend to have low quality: For example, given a question post and its answer post, one simply pairs the question title with every code snippet (if any) in the answer post, which is problematic since a code snippet may not serve as a solution to the question.
    In this paper, we investigate a new problem of systematically mining question-code pairs from Stack Overflow (in contrast to heuristically collecting them). The task is formulated as predicting whether or not a code snippet is a standalone solution to a question. We propose a novel Bi-view Hierarchical Neural Network which can capture both the programming content and the textual context of a code snippet (i.e., two views) to make a prediction. On two manually annotated datasets in Python and SQL domain, our framework substantially outperforms heuristic methods with at least 15% higher F1 and accuracy. Furthermore, we obtain StaQC (Stack Overflow Question-Code pairs), the largest dataset to date of 148K Python and 120K SQL question-code pairs, automatically mined from SO using our framework. Under various case studies, we demonstrate that StaQC can greatly help develop data-hungry models for associating natural language with programming language.

  • Finding Subcube Heavy Hitters in Data Streams
    Authors: Branislav Kveton, Muthu Muthukrishnan, Hoa Vu and Yikun Xian

    Keywords: data streams, heavy hitters, algorithms, graphical models, high dimensional data

    Abstract:
    We address the problem of finding subcube heavy hitters in high dimensional data streams. Formally, the data stream consists of d-dimensional items, and a subcube is a subset of coordinates. The goal is to report all heavy hitters of an arbitrary query subcube correctly with high probability.
    We show that the sampling approach uses space that matches the lower bound given by Liberty et al. up to polylogarithmic factors. This lower bound implies a quadratic dependency on the number of dimensions d in the worst case. Our main contribution is to circumvent this quadratic bottleneck via a model-based approach. In particular, we assume that the dimensions are related to each other via the Naive Bayes model. We present a new two-pass algorithm for our problem that uses space that is linear in the number of dimensions d. Furthermore, we exhibit a fast polynomial time algorithm for reporting all heavy hitters of a query subcube.
    We also perform empirical study with a synthetic dataset as well as real datasets from Adobe and Yandex. We show that our algorithm achieves the least error in finding subcube heavy hitters compared to a one-pass variant or the sampling approach in small space. Our work shows the potential of model-based approach to data stream analysis.

  • Conversational Query Understanding Using Sequence to Sequence Modeling
    Authors: Gary Ren, Xiaochuan Ni, Manish Malik and Qifa Ke

    Keywords: search, deep learning, conversations, query understanding, sequence to sequence

    Abstract:
    Understanding conversations is crucial to enabling conversational search in technologies such as chatbots, digital assistants, and smart home devices that are becoming increasingly popular. Conventional search engines are powerful at answering open domain queries but are mostly capable of stateless search. In this paper, we define a conversational query as a query that depends on the context of the current conversation, and we formulate the conversational query understanding problem as context aware query reformulation, where the goal is to reformulate the conversational query into a search engine friendly query in order to satisfy users’ information needs in conversational settings. Such context aware query reformulation problem lends itself to sequence to sequence modeling. We present a large scale open domain dataset of conversational queries and various sequence to sequence models that are learned from this dataset. The best model correctly reformulates over half of all conversational queries, showing the potential of sequence to sequence modeling for this task.

  • Privacy and Efficiency Tradeoffs for Multiword Top K Search with Linear Additive Rank Scoring
    Authors: Daniel Agun, Jinjin Shao, Shiyu Ji, Stefano Tessaro and Tao Yang

    Keywords: Privacy-Preserving Query Processing, Additive Rank Scoring, Top K Search

    Abstract:
    This paper proposes a privacy-aware ranking scheme with
    linear additive scoring for efficient top keyword search on modest-sized
    cloud datasets. This scheme strikes for tradeoffs between privacy
    and efficiency by proposing single-round client-server collaboration
    with server-side partial ranking based on blinded feature weights with
    random masks. Client-side preprocessing includes query decomposition
    with chunked postings to facilitate earlier range intersection and
    fast access of server-side key-value stores, and server-side query
    processing deals with feature vector sparsity through optional feature
    matching and enables result filtering for queries that yield too many
    matched documents. This paper provides details on indexing and run-time
    conjunctive query processing and presents an evaluation that assesses
    the accuracy, efficiency, and privacy tradeoffs of this scheme through
    five datasets with various sizes.

  • Manifold Learning for Rank Aggregation
    Authors: Shangsong Liang, Ilya Markov, Zhaochun Ren and Maarten de Rijke

    Keywords: Web Search, Rank Aggregation, Data Fusion, Ad Hoc Search

    Abstract:
    We propose manifold learning aggregation approaches, ManX and v-ManX. ManX regularizes document fusion scores, so that documents that appear within a manifold, receive similar scores, whereas v-ManX first generates \emph{virtual} adversarial documents and then regularizes the fusion scores of both original and virtual adversarial documents. Since aggregation methods built on the cluster hypothesis are computationally expensive, we adopt an optimization method that uses the top-k documents as \emph{anchors} and considerably reduces the computational complexity of manifold-based methods, resulting in two efficient aggregation approaches, a-ManX and a-v-ManX. We assess the proposed approaches experimentally and show that they significantly outperform the state-of-the-art aggregation approaches, while a-ManX and a-v-ManX run faster than ManX, v-ManX, respectively.