Web Content Analysis,
Semantics and Knowledge

List of accepted papers :

  • Time Expression Recognition Using a Time-related Tagging Scheme
    Authors: Xiaoshi Zhong and Erik Cambria

    Keywords: Time expression recognition, Constituent-based tagging scheme, Inconsistent tag assignment, Position-based tagging scheme, Named entity recognition

    We analyze four datasets for the characteristics of time expressions, finding that time expressions are formed by loose structure and that the words used to express time information can differentiate time expressions from common text. The findings drive us to design a learning method named TOMN to model time expressions. TOMN defines a time-related tagging scheme named TOMN scheme with four tags, namely T, O, M, and N, indicating the constituents of time expression, namely Time token, Modifier, Numeral, and the words Outside time expression. Essentially, our constituent-based TOMN scheme overcomes the problem of inconsistent tag assignment that is caused by the conventional position-based tagging schemes (e.g., BIO scheme and BILOU scheme). In modeling, TOMN assigns a word with a TOMN tag under a framework of conditional random fields with minimal features. Experiments show that TOMN is equally or more effective than state-of-the-art methods on various datasets, and much more robust on cross-datasets. Moreover, our analysis can help explain many empirical observations in other works about time expression recognition and named entity recognition.

  • Parabel: Partitioned Label Trees for Extreme Classification with Application to Dynamic Search Advertising
    Authors: Yashoteja Prabhu, Anil Kag, Shrutendra Harsola, Rahul Agrawal and Manik Varma

    Keywords: Extreme Classification, Dynamic Search Advertising, Extreme Multi-Label Learning

    This paper develops the Parabel algorithm for extreme multi-label learning where the objective is to learn classifiers that can annotate a data point with the most relevant subset of labels from an extremely large label set. The state-of-the-art DiSMEC and PPDSparse algorithms are the most accurate but take weeks for training and prediction as they learn and apply an independent linear classifier per label. Consequently, they do not scale to large datasets with millions of labels. Parabel addresses these limitations by: (a) cutting down the training time to a few hours on a single core of a standard desktop by learning a hierarchy of coarse to fine label classifiers, each trained on a small subset of datapoints and (b) by cutting down the prediction time to a few milliseconds per test point by leveraging the classifier hierarchy for logarithmic time prediction in number of labels. This allows Parabel to scale to tasks considered infeasible for DiSMEC and PPDSparse such as predicting the subset of search engine queries that might lead to a click on a given ad-landing page for dynamic search advertising.
    Experimental results demonstrated that Parabel could train in 80 hours on a proprietary dataset with 7 million labels which is beyond the scale of both DiSMEC and PPDSparse. Results on some of the largest publically available datasets revealed that Parabel could be 1,000x faster at training than both DiSMEC and PPDSparse, as well as 10,000x and 40x faster at prediction than DiSMEC and PPDSparse without any significant loss in prediction accuracy. Moreover, Parabel was also found to be much more accurate than other tree based extreme classifiers and could be more than 10x faster at training with a 10x smaller model. Finally, Parabel was demonstrated to significantly improve dynamic search advertising on Bing by more than doubling the ad coverage, as well as improving the click-through rate by 20%.

  • Dynamic Embeddings for Language Evolution
    Authors: Maja Rudolph and David Blei

    Keywords: word embeddings, dynamic modeling, probabilistic modeling

    Word embeddings are a powerful approach for unsupervised analysis
    of language. Recently, Rudolph et al. developed exponential
    family embeddings, which cast word embeddings in a probabilistic
    framework. Here, we develop dynamic embeddings, building on
    exponential family embeddings to capture how the meanings of words
    change over time. We use dynamic embeddings to analyze three large
    collections of historical texts: the U.S. Senate speeches from 1858
    to 2009, the history of computer science ACM abstracts from 1951
    to 2014, and machine learning papers on the ArXiv from 2007 to
    2015. We find dynamic embeddings provide better fits than classical
    embeddings and capture interesting patterns about how language

  • HighLife: Higher-arity Fact Harvesting
    Authors: Patrick Ernst, Amy Siu and Gerhard Weikum

    Keywords: Higher-arity relation extraction, Knowledge graphs, Partial fact observation, Knowledge Base Construction, Text-based knowledge harvesting, Health, Tree pattern learning, Distant supervision

    Text-based knowledge extraction methods, for populating knowledge bases, have focused on binary facts: relationships between two entities. However, in advanced domains such as health, it is often crucial to consider ternary and higher-arity relations. An example is to capture which drug is used for which disease at which dosage (e.g. 2.5 mg/day) for which kinds of patients (e.g., children vs. adults). In this work, we present an approach to harvest higher-arity facts from textual sources. Our method is distantly supervised by seed facts, and uses the fact-pattern duality principle to gather fact candidates with high recall. For high precision, we devise a constraint-based reasoning method to eliminate false candidates. A major novelty is in coping with the difficulty that higher-arity facts are often expressed only partially in texts. For example, one sentence may refer to a drug, a disease and a group of patients, whereas another sentence talks about the drug, its dosage and the target group without mentioning the disease. Our methods cope well with such partially observed facts, at both pattern-learning and constraint-reasoning stages. Experiments with news articles and with health-related documents demonstrate the viability of our method.

  • Content Attention Model for Aspect Based Sentiment Analysis
    Authors: Qiao Liu, Haibin Zhang, Yifu Zeng, Ziqi Huang and Zufeng Wu

    Keywords: Sentiment Analysis, Aspect Based, Attention Mechanism

    Aspect based sentiment classification is a crucial task for sentiment analysis. Recent advances in the neural attention model demonstrate that it can be helpful in aspect based sentiment classification task, in that it can help identify the focus words of the human’s statements. However, according to our empirical study, prevalent content attention mechanisms proposed for aspect based sentiment classification mostly focused on identifying the sentiment words or shifters, without considering the relevance of such words with respect to the given aspects in the sentence. Therefore, it is usually insufficient for dealing with multi-aspect sentences and the syntactically complex sentence structures. To solve this problem, we propose a novel content attention based aspect based sentiment classification model, with two attention enhancing mechanisms: the sentence-level content attention mechanism is capable of capturing the important information about given aspect from a global perspective; the context attention mechanism is responsible for taking into account the order of the words and their correlations to the given aspects simultaneously, by embedding them into a series of customized memories. Experimental results on three benchmark datasets demonstrate the proposed model outperforms the state-of-the-art, in which the proposed mechanisms play a key role.

  • Financing the Web of Data with Delayed-Answer Auctions
    Authors: Tobias Grubenmann, Abraham Bernstein, Dmitry Moor and Sven Seuken

    Keywords: Web of Data, Slot Auctions, Delay, Sponsored Search Results, Vickrey-Clarke-Groves mechanism

    The World Wide Web is a massive network of interlinked documents. One of the reasons the World Wide Web is so successful is the fact that most content is available free of any charge. Inspired by the success of the World Wide Web, the Web of Data applies the same strategy of interlinking to data. To this point, most of data in the Web of Data is also free of charge. The fact that the data is freely available raises the question of financing these services, however. As we will discuss in this paper, advertisement and donations cannot easily be applied to this new setting.
    To create incentives to subsidize data providers, we propose that sponsors should pay the providers to promote sponsored data. In return, sponsored data will be privileged over non-sponsored data. Since it is not possible to enforce a certain ordering on the data the user will receive, we propose to split up the data into different batches and deliver these batches with different delays. In this way, we can privilege sponsored data without withholding any non-sponsored data from the user.
    In this paper, we introduce a new concept of a delayed-answer auction, where sponsors can pay to prioritize their data. We introduce a new model which captures the particular situation when a user access data in the Web of Data. We show how the weighted Vickrey-Clarke-Groves auction mechanism can be applied to our scenario and we discuss how certain parameters can influence the nature of our auction. With our new concept, we build a first step to a free yet financial sustainable Web of Data.

  • Estimating the Cardinality of Conjunctive Queries over RDF Data Using Graph Summarisation
    Authors: Giorgio Stefanoni, Boris Motik and Egor V. Kostylev

    Keywords: RDF, cardinality estimation, synopsis, query processing, SPARQL

    Accurately estimating the cardinality (i.e., the number of answers) of conjunctive queries is a core problem in data management. Common solutions to this problem are susceptible to high estimation errors that compound exponentially with the number of joins; hence, existing techniques can be inaccurate in data models such as RDF where queries are navigational and often involve many (self-)joins. In this paper we present a new technique for
    estimating the cardinality of conjunctive queries in RDF. We use a summary of an RDF graph as a synopsis that we interpret using a possible world semantics. We formalise the estimation problem as computing the expectation of query cardinality over all RDF graphs represented by the summary, and we present a closed-form formula for computing the expectation of arbitrary queries. We also discuss approaches to RDF graph summarisation. Finally, we show empirically that our cardinality technique is more accurate and more consistent, often by orders of magnitude, than several state-of-the-art approaches.

  • Never-Ending Learning for Open-Domain Question Answering over Knowledge Bases
    Authors: Abdalghani Abujabal, Rishiraj Saha Roy, Mohamed Yahya and Gerhard Weikum

    Keywords: Question Answering, Continuous Learning, Knowledge Bases

    Translating natural language questions to semantic representations such as
    SPARQL is a core challenge in open-domain question answering over knowledge bases (KB-QA).
    Existing methods rely on a clear separation between an offline training phase, where a model is learned, and an online phase
    where this model is deployed to answer users’ questions.
    Two major shortcomings are that such methods require access to a large training set,
    which is not always readily available, and that they fail on questions from domains not seen during training.
    To overcome these limitations, this paper presents a continuous learning paradigm for KB-QA, called NEQA. Offline, NEQA automatically learns templates from a small number of training question-answer pairs.
    Once deployed, continuous learning is triggered on cases where templates
    are insufficient. Using a semantic similarity function
    between questions and by judicious invocation of non-expert user feedback,
    NEQA learns new templates that capture previously-unseen syntactic structures.
    This way, NEQA gradually extends and improves its template repository.
    NEQA periodically re-trains its underlying models, allowing it to adapt to the language used after deployment.
    Our experiments demonstrate NEQA’s practical viability, with steady improvement in answering quality over time, and the gains on questions from previously-unobserved domains.

  • Large-Scale Hierarchical Text Classification with Recursively Regularized Deep Graph-CNN
    Authors: Hao Peng, Jianxin Li, Yu He, Yaopeng Liu, Mengjiao Bao, Lihong Wang, Yangqiu Song and Qiang Yang

    Keywords: Hierarchical Text Classification, Recursive Regularization, Graph-of-words, Deep Learning, Convolutional Neural Networks

    Text classification to a hierarchical taxonomy of topics is a common and practical problem. Traditional approaches simply use bag-of-words and have achieved good results.However, when there are a lot of labels with different topical granularities, bag-of-words representation may not be enough.Deep learning models have been proven to be effective to automatically learn different levels of representations for image data.It is interesting to study what is the best way to represent texts.In this paper, we propose a graph-CNN based deep learning model to first convert texts to graph-of-words, and then use graph convolution operations to convolve the word graph.Graph-of-words representation of texts has the advantage of capturing non-consecutive and long-distance semantics.CNN models have the advantage of learning different level of semantics.To further leverage the hierarchy of labels, we regularize the deep architecture with the dependency among labels.Our results on both RCV1 and NYTimes datasets show that we can significantly improve large-scale hierarchical text classification over traditional hierarchical text classification and existing deep models.

  • Estimating Rule Quality for Knowledge Base Completion with the Relationship between Coverage Assumption
    Authors: Kaja Zupanc and Jesse Davis

    Keywords: Rule mining, Knowledge base, ILP, Open world assumption

    Currently, there are many large, automatically constructed knowledge bases (KBs). One interesting task is learning from a KB to generate new knowledge either in the form of inferred facts or rules that define regularities. One challenge for learning is that KBs are necessarily open world: we cannot assume anything about the truth values of facts not included in the KB. From a learning perspective, this means we lack negative examples. To address this problem, we propose a novel score function for evaluating the quality of a first-order definite clause learned from a KB. Our metric attempts to include information about the facts not in the KB when evaluating the quality of a potential rule. Empirically, we find that our metric results in more precise predictions than previous approaches.

  • Improving Word Embedding Compositionality using Lexicographic Definitions
    Authors: Thijs Scheepers, Evangelos Kanoulas and Efstratios Gavves

    Keywords: Word Embeddings, Compositionality, Deep Learning, Distributional Semantics, Representation Learning

    We present an in-depth analysis of various popular word embeddings (Word2Vec, GloVe, fastText and Paragram) in terms of their compositionality, as well as a method to tune them towards better compositionality. We find that training the embeddings to compose lexicographic definitions improves their performance in this task significantly, while also getting similar or better performance in both word similarity evaluations and sentence embedding evaluations.
    Word embeddings are tuned using a simple neural network architecture with definitions and lemmas from WordNet. Since dictionary definitions can be composed into the lemmas they define, they are also suitable for tuning, as well as evaluating for compositionality. Our architecture allows for the embeddings to be composed using simple arithmetic operations, which makes these embeddings specifically suitable for production applications such as web search and data mining. We call our model structure: CompVec.
    In our analysis, we evaluate original embeddings, as well as tuned embeddings, using existing word similarity and sentence embedding evaluation methods. Aside from these evaluation methods used in related work, we also evaluate embeddings using a ranking method which tests composed vectors using the lexicographic definitions already mentioned. In contrast to other evaluation methods, ours is not invariant to the magnitude of the embedding vector—which we show is important for composition. We consider this new evaluation method (CompVecEval) to be a key contribution.

  • Browserless Web Data Extraction: Challenges and Opportunities
    Authors: Ruslan Fayzrakhmanov, Emanuel Sallinger, Ben Spencer, Tim Furche and Georg Gottlob

    Keywords: web data extraction, scraping, deep web, web automation, HTTP, AJAX

    Most modern web scrapers use an embedded browser to render
    web pages and to simulate user actions. Such scrapers (or wrappers)
    are therefore expensive to execute, in terms of time and network
    traffic. In contrast, it is magnitudes more resource-efficient to
    use a “”browserless”” wrapper which directly accesses a web server
    through HTTP requests, and takes the desired data directly from
    the raw replies. However, creating and maintaining browserless
    wrappers of high precision requires specialists, and is prohibitively
    labor-intensive at scale. In this paper, we demonstrate the principle
    feasibility of automatically translating visual wrappers into
    “”browserless”” wrappers. We present the first algorithm and system
    performing such an automated translation on suitably restricted
    types of web sites. This system works in the vast majority of test
    cases and produces very fast and extremely resource-saving wrappers.
    We discuss research challenges for extending our approach to
    a general method applicable to a yet larger number of cases.

  • Short-Text Topic Modeling via Nonnegative Matrix Factorization Enriched with Local Word-Context Correlations from Word Embedding
    Authors: Tian Shi, Kyeongpil Kang, Jaegul Choo and Chandan K. Reddy

    Keywords: Topic Modeling, Short Texts, Non-negative Matrix Factorization, Word Embedding

    Being a prevalent form of social communications on the Internet, billions of short texts are generated everyday. Discovering knowledge from them has gained a lot of interest from both industry and academia. The short texts have a limited contextual information, and they are sparse, noisy and ambiguous, and hence, automatically learning topics from them remains an important challenge. To tackle this problem, in this paper, we propose a semantics-assisted non-negative matrix factorization (SeaNMF) model to discover topics for the short texts. It effectively incorporates the word-context semantic correlations into the model, where the semantic relationships between the words and their contexts are learned from the skip-gram view of the corpus. The SeaNMF model is solved using a block coordinate decent algorithm. We also develop a sparse variant of the SeaNMF model which can achieve a better model interpretability. Extensive quantitative evaluations on various real-world short text datasets demonstrate the superior performance of the proposed models over several other state-of-the-art methods in terms of topic coherence and classification accuracy. The qualitative semantic analysis demonstrates the interpretability of our models by discovering meaningful and consistent topics. With a simple formulation and the superior performance, SeaNMF can be an effective standard topic model for short texts.

  • Are All People Married? Determining Obligatory Attributes in Knowledge Bases
    Authors: Jonathan Lajus and Fabian M. Suchanek

    Keywords: Knowledge Bases, Completeness, Classes, Attributes

    In the absence of counter-evidences, people generalize as an effective way to predict real-world facts. In this paper we discuss methods to automatically mine valid generalizations from a Knowledge Base (KB). In particular we want to determine if all instances of a class must have an attribute in the real world, if the attribute is mandatory for the class. For example, has-Birth-Date is an obligatory attribute for the class Person, while has-Spouse is not. We introduce a new way to model incompleteness and derive a method to automatically determine obligatory attributes with a precision of up to 90%.

  • Socioeconomic dependencies of linguistic patterns in Twitter: a multivariate analysis
    Authors: Jacob Levy Abitbol, Márton Karsai, Jean-Philippe Magué, Jean-Pierre Chevrot and Eric Fleury

    Keywords: computational sociolinguistics, Twitter data, socioeconomic status inference, social network analysis, spatiotemporal data

    Our usage of language is not solely reliant on cognition but is arguably determined by myriad external factors leading to a global variability of linguistic patterns. This issue, which lies at the core of sociolinguistics and is backed by many small-scale studies on face-to-face communication, is addressed here by constructing a dataset combining the largest French Twitter corpus to date with detailed socioeconomic maps obtained from national census in France. We show how key linguistic variables measured in individual Twitter streams depend on factors like socioeconomic status, location, time, and the social network of individuals. We found that (i) people of higher socioeconomic status, active to a greater degree during the daytime, use a more standard language; (ii) the southern part of the country is more prone to use more standard language than the northern one, while locally the used variety or dialect is determined by the spatial distribution of socioeconomic status; and (iii) individuals connected in the social network are closer linguistically than disconnected ones, even after the effects of status homophily have been removed. Our results inform sociolinguistic theory and may inspire novel learning methods enabling the inference of socioeconomic status of people from the way they tweet.

  • An Attention Factor Graph Model for Tweet Entity Linking
    Authors: Chenwei Ran, Wei Shen and Jianyong Wang

    Keywords: Twitter, Knowledge graph, Entity linking, Factor model, Attention model

    The rapid expansion of Twitter has attracted worldwide attention. With more than 500 million tweets posted per day, Twitter becomes an invaluable information and knowledge source. Many Twitter related tasks have been studied, such as event extraction, hashtag recommendation, and topic detection. A critical step in understanding and mining information from Twitter is to disambiguate entities in tweets, i.e., tweet entity linking. It is a challenging task because tweets are short, noisy, and fresh. Many tweet-specific signals have been found to solve the tweet entity linking problem, such as user interest, temporal popularity, location information and so on. However, two common weaknesses exist in previous work. First, most proposed models are not flexible and extendable to fit new signals. Second, their scalability is not good enough to handle the large-scale social network like Twitter. In this work, we formalize the tweet entity linking problem into a factor graph model which has shown its effectiveness and efficiency in many other applications. We also propose selective attention over entities to increase the scalability of our model, which brings linear complexity. To adopt the attention mechanism in the factor graph, we propose a new type of nodes called pseudo-variable nodes to solve the asymmetry attention problem caused by the undirected characteristic of the factor graph. We evaluated our model on two different manually annotated tweet datasets. The experimental results show that our model achieves better performance in terms of both effectiveness and efficiency compared with the state-of-the-art approaches.

  • Find the Conversation Killers: a Predictive Study of Thread-ending Posts
    Authors: Yunhao Jiao, Cheng Li, Fei Wu and Qiaozhu Mei

    Keywords: Social conversations, Conversation prediction, Deep learning

    How to improve the quality of conversations in online communities has attracted considerable attention recently. Having engaged, urbane, and reactive online conversations have a critical effect on the social life of Internet users. In this study, we are particularly interested in identifying a post in a multi-party conversation that is unlikely to be further replied to, which therefore kills that thread of the conversation. For this purpose, we propose a deep learning model called the ConverNet. ConverNet is attractive due to its capability of modeling the internal structure of a long conversation and its appropriate encoding of the contextual information of the conversation, through effective integration of attention mechanisms. Empirical experiments on real-world datasets demonstrate the effectiveness of the proposed model. For the widely concerned topic, our analysis also offers implications for improving the quality and user experience of online conversations.

  • Semantics and Complexity of GraphQL
    Authors: Olaf Hartig and Jorge Pérez

    Keywords: GraphQL, Graph Databases, Semantics, Complexity

    GraphQL is a recently proposed, and increasingly adopted, conceptual framework for providing a new type of data access interface on the Web. The framework includes a new graph query language whose semantics has been specified informally only. This has prevented the formal study of the main properties of the language.
    We embark on the formalization and study of GraphQL. To this end, we first formalize the semantics of GraphQL queries based on a labeled-graph data model. Thereafter, we analyze the language and show that it admits really efficient evaluation methods. In particular, we prove that the complexity of the GraphQL evaluation problem is NL-complete. Moreover, we show that the enumeration problem
    can be solved with constant delay. This implies that a server can answer a GraphQL query and send the response byte-by-byte while spending just a constant amount of time between every byte sent.
    Despite these positive results, we prove that the size of a GraphQL response might be prohibitively large for an internet scenario. We present experiments showing that current practical implementations suffer from this issue. We provide a solution to cope with this problem by showing that the total size of a GraphQL response can be computed in polynomial time. Our results on polynomial-time size computation plus the constant-delay enumeration can help developers to provide more robust GraphQL interfaces on the Web.

  • Sentiment Analysis by Capsules
    Authors: Yequan Wang, Aixin Sun, Jialong Han, Ying Liu and Xiaoyan Zhu

    Keywords: Sentiment Analysis, Capsule, Recurrent Neural Network, Attention

    In this paper, we propose RNN-Capsule, a capsule model based on Recurrent Neural Network (RNN) for sentiment analysis. For a given problem, one capsule is built for each sentiment category e.g., ‘positive’, ‘neutral’, and ‘negative’. Each capsule has an attribute, a state, and three modules: representation module, probability module, and reconstruction module. The attribute of a capsule is the assigned sentiment category. Given an instance encoded in hidden vectors by a typical RNN, the representation module builds capsule representation by the attention mechanism. Based on capsule representation, the probability module computes the capsule’s state probability. A capsule’s state is active if its state probability is the largest among all capsules for the given instance, and inactive otherwise. On two benchmark datasets (i.e., Movie Review and Stanford Sentiment Treebank) and one proprietary dataset (i.e., Hospital Feedback), we show that RNN-Capsule achieves state-of-the-art performance on sentiment classification. More importantly, without using any linguistic knowledge, RNN-Capsule is capable of outputting words with sentiment tendencies reflecting capsules’ attributes. The words well reflect the domain specificity of the dataset. To the best of our knowledge, this is the first capsule model for sentiment analysis.

  • Modelling Dynamics in Semantic Web Knowledge Graphs with Formal Concept Analysis
    Authors: Larry Gonzalez and Aidan Hogan

    Keywords: Semantic Web, Schema, Knowledge Graph, Dynamics, FCA

    In this paper, we propose a novel data-driven schema for large-scale heterogeneous knowledge graphs inspired by Formal Concept Analysis (FCA). We first extract the sets of properties associated with individual entities; these property sets (aka. characteristic sets) are annotated with cardinalities and used to induce a lattice based on set-containment relations, forming a natural hierarchical structure describing the knowledge graph. We then propose an algebra over such schema lattices, which allows to compute diffs between lattices (for example, to summarise the changes from one version of a knowledge graph to another), to add lattices (for example, to project future changes), and so forth. While we argue that this lattice structure (and associated algebra) may have various applications, we currently focus on the use-case of modelling and predicting the dynamic behaviour of knowledge graphs. Along those lines, we instantiate and evaluate our methods for analysing how versions of the Wikidata knowledge graph have changed over a period of 11 weeks. We propose algorithms for constructing the lattice-based schema from Wikidata, and evaluate their efficiency and scalability. We then evaluate use of the resulting schema(ta) for predicting how the knowledge graph will evolve in future versions.

  • Scalable Instance Reconstruction in Knowledge Bases via Relatedness Affiliated Embedding
    Authors: Richong Zhang, Junpeng Li, Jiajie Mei and Yongyi Mao

    Keywords: Knowledge Base Completion, Embedding, Link Prediction

    The knowledge base (KB) completion problem is usually formulated as a link prediction problem. Such formulation is incapable of capturing certain application scenarios when the KB contains multi-fold relations. In this paper, we present a new class of KB completion problems, called instance reconstruction, complementing the scope of link prediction. Unlike its link-prediction counterpart, which has linear complexity in KB size, this problem has its complexity behave as a high-degree polynomial. This presents significant challenges in developing scalable reconstruction algorithms. In this paper, we present a novel KB embedding model (RAE) and build on it an instance reconstruction algorithm (SIR). The SIR algorithm utilizes schema-based filtering as well as “relatedness” filtering for complexity reduction. Here relatedness refers to the likelihood that two entities co-participate in a common instance, and the relatedness metric is learned from the RAE model. We show experimentally that SIR significantly reduces computation complexity without sacrificing reconstruction performance. The complexity reduction corresponds to reducing the KB size by 100 to 1000 folds.

  • Leveraging Social Media Signals for Record Linkage
    Authors: Andrew Schneider, Arjun Mukherjee and Eduard Dragut

    Keywords: Record Linkage, Social Media, Data Cleaning

    Many data-intensive applications collect (structured) data from a variety of sources. A key task in this process is record linkage, which is the problem of determining the records from these sources that refer to the same real-world entities.
    Traditional approaches use the record representation of entities to accomplish this task. With the nascence of social media, entities on the Web are now accompanied by user generated content. We present a method for record linkage that uses this hitherto untapped source of entity information. We use document-based distances, with an emphasis on word embedding document distances, to determine if two entities match.
    Our rationale is that user evaluations of entities converge in semantic content, and hence in the word embedded space, as the number of user evaluations grows. We analyze the effectiveness of the proposed method both as a stand-alone method and in combination with the record-based record linkage methods.
    Experimental results using real-world reviews demonstrate the high effectiveness of our approach. To our knowledge, this is the first work exploring the use of user generated content accompanying entities in the record linkage task.

  • A Structured Approach to Understanding Recovery and Relapse in AA
    Authors: Yue Zhang, Arti Ramesh, Jennifer Golbeck, Dhanya Sridhar and Lise Getoor

    Keywords: alcoholism, alcoholics anonymous, social media analysis, hinge-loss Markov random fields, Twitter

    Alcoholism, also known as Alcohol Use Disorder (AUD) is a serious problem affecting millions of people worldwide. Recovery from AUD is known to be challenging and often leads to relapse at various points after enrolling in a rehabilitation program such as Alcoholics Anonymous (AA). In this work, we take a structured approach to understand recovery and relapse from AUD using social media data. To do so, we combine linguistic and psychological attributes of users with relational features that capture useful structure in the user interaction network. We evaluate our models on AA-attending users extracted from the Twitter social network and predict recovery at two different points—90-days and 1 year after the user joins AA, respectively. Our experiments reveal that our structured approach is helpful in predicting recovery in these users. We perform extensive quantitative analysis of different groups of features and dependencies among them. Our analysis sheds light on the role of each feature group and how they combine to predict recovery and relapse. Finally, we present qualitative analysis of different reasons behind users relapsing to AUD. Our models and analysis are helpful in making meaningful predictions in scenarios where only a subset of features are available. Our analysis can potentially be helpful in identifying and preventing relapse early.

  • Facet Annotation Using Reference Knowledge Bases
    Authors: Riccardo Porrini, Matteo Palmonari and Isabel Cruz

    Keywords: facet annotation, semantic analysis, table annotation, faceted search, data lifting, eCommerce

    Faceted interfaces are omnipresent on the web to support data exploration and filtering. A facet is a triple: a domain (e.g., book), a property (e.g., author, language), and a set of values (e.g., {Austen, Beauvoir, Coelho, Dostoevsky, Eco, Kerouac, Su?skind,. . .}, {French, English, German, Italian, Portuguese, Russian,. . .}). Given a property, a set of homogeneous values can be used to select those domain entities whose property values match the given set of values (e.g., given English and Italian, the book entities that have authors Austen, Eco, Kerouac,…, are selected). Multiple properties and respective values can be considered at the same time or applied successively. To implement faceted interfaces in a way that is scalable to very large datasets, it is necessary to automate the process of extracting facets. Prior work considers the problem of associating a set of homogeneous values with a facet domain, but does not annotate a facet property. In this paper, we annotate a facet property with a predicate from a reference Knowledge Base (KB) so as to maximize the semantic similarity between the property and the predicate. We define semantic similarity in terms of three new metrics: specificity, coverage, and frequency. Our experimental evaluation uses the DBpedia and YAGO KBs and shows that for the facet annotation problem, we obtain better results than a state-of-the-art approach for the annotation of web tables as modified to annotate a set of values.

  • MemeSequencer: Sparse Matching for Embedding Image Macros
    Authors: Abhimanyu Dubey, Esteban Moro, Manuel Cebrian and Iyad Rahwan

    Keywords: online imagery, social media, memes, image virality, computer vision, sparse representation, sparse matching, social networks, image understanding, semantic understanding

    The analysis of the creation, mutation, and propagation of social media content on the Internet is an essential problem in computational social science, affecting areas ranging from marketing to political mobilization. A first step towards understanding the evolution of images online is the analysis of rapidly modifying and propagating memetic imagery or `memes’. However, a pitfall in proceeding with such an investigation is the current incapability to produce a robust semantic space for such imagery, capable of understanding differences in Image Macros. In this study, we provide a first step in the systematic study of image evolution on the Internet, by proposing an algorithm based on sparse representations and deep learning to decouple various types of content in such images and produce a rich semantic embedding. We demonstrate the efficacy of our approach on a variety of tasks pertaining to memes and Image Macros, image clustering, image retrieval, topic prediction and virality prediction, surpassing the existing methods on each. In addition to its utility on quantitative tasks, our method opens up the possibility of obtaining the first large-scale understanding of the evolution and propagation of memetic imagery.

  • Matching Natural Language Sentences with Hierarchical Sentence Factorization
    Authors: Bang Liu, Ting Zhang, Fred X. Han, Di Niu, Kunfeng Lai and Yu Xu

    Keywords: Hierarchical Sentence Factorization, Sentence Reordering, Semantic Matching, Ordered Word Mover’s Distance, Abstract Meaning Representation

    Semantic matching of natural language sentences or identifying the relationship between two sentences is a core research problem underlying many natural language tasks.
    Prior research has proposed both unsupervised distance-based schemes, when training data is not available, and deep learning schemes for sentence matching, given training data.
    However, previous approaches either omit or fail to fully utilize the ordered, hierarchical, and flexible structures of language objects, as well as the interaction between them.
    In this paper, we propose \textit{Hierarchical Sentence Factorization}—a technique that is able to factorize a sentence into a hierarchical representation, with the components at each different scale reordered into a “predicate-argument” form. The proposed sentence factorization technique leads to the invention of: 1) a new unsupervised distance metric which calculates the semantic distance between a pair of text snippets by solving a penalized optimal transport problem while preserving the logical relationship of words in the reordered sentences, and 2) new multi-scale deep learning models for supervised semantic training, based on factorized sentence hierarchies.
    We apply our techniques to text-pair similarity estimation and text-pair relationship classification tasks, based on multiple datasets such as STSbenchmark, the Microsoft Research paraphrase identification (MSRP) dataset, the SICK dataset, etc. Extensive experiments show that the proposed hierarchical sentence factorization can be used to significantly improve the performance of existing unsupervised distance-based metrics as well as multiple supervised deep learning models based on the convolutional neural network (CNN) and long short-term memory (LSTM).

  • Why Reinvent the Wheel – Let’s Build Question Answering Systems Together
    Authors: Kuldeep Singh, Arun Sethupat Radhakrishna, Andreas Both, Saeedeh Shekarpour, Ioanna Lytra, Ricardo Usbeck, Akhilesh Vyas, Akmal Khikmatullaev, Dharmen Punjani, Christoph Lange, Maria-Esther Vidal, Jens Lehmann and Sören Auer

    Keywords: Question Answering, Software Reusability, Semantic Web, Semantic Search, QA Framework

    Modern question answering (QA) systems need to flexibly integrate a number of components specialised to fulfil specific tasks in a QA pipeline. Key QA tasks include Named Entity Recognition and Disambiguation, Relation Extraction, and Query Building. Since a number of different software components exist, implementing different strategies for each of these tasks, a major challenge when building QA systems, is how to select and combine the most suitable components into a QA system, given the characteristics of a question. We study this optimisation problem and train Classifiers, which take features of a question as input and have the goal of optimising the selection of QA components based on those features. We then devise a greedy algorithm to identify the pipelines that include the suitable components and can effectively answer the given question. We implement this model within Frankenstein, a QA framework able to select QA components and compose QA pipelines. We evaluate the effectiveness of the pipelines generated by Frankenstein using the QALD and LC-QuAD benchmarks. These results not only suggest that Frankenstein precisely solves the QA optimisation problem, but also enables the automatic composition of optimised QA pipelines, which outperform the static Baseline QA pipeline. Thanks to this flexible and fully automated pipeline generation process, new QA components can be easily included in Frankenstein, thus improving the performance of the generated pipelines.

  • Weakly-supervised Relation Extraction by Pattern-enhanced Embedding Learning
    Authors: Meng Qu, Xiang Ren, Yu Zhang and Jiawei Han

    Keywords: Relation Extraction, Weakly Supervised, Co-training

    Extracting relations from text corpora is an important task with wide applications. However, it becomes particularly challenging when focusing on weakly-supervised relation extraction, that is, utilizing a few relation instances (i.e., a pair of entities and their relation) as seeds to extract from corpora more instances of the same relation. Existing distributional approaches leverage the corpus-level co-occurrence statistics of entities to predict their relations, and require large amount of labeled instances to learn effective relation classifiers. Alternatively, pattern-based approaches perform boostrapping or apply neural networks to model the local contexts, but still rely on large amount of labeled instances to build reliable models. In this paper, we study the integration of distributional and pattern-based methods in a weakly-supervised setting such that the two kinds of methods can provide complementary supervision for each other to build an effective, unified model. We propose a novel framework with a distributional module and a pattern module. During training, the distributional module helps the pattern module discriminate between the informative patterns and other patterns, and the pattern module generates some highly-confident instances to improve the distributional module. The whole framework can be effectively optimized by iterating between improving the pattern module and updating the distributional module. We conduct experiments on two tasks: knowledge base completion and corpus-level relation extraction. Experimental results prove the effectiveness of our framework over many competitive baselines.

  • Finding Needles in an Encyclopedic Haystack: Detecting Classes Among Wikipedia Articles
    Authors: Marius Pasca

    Keywords: knowledge acquisition, concepts, classes, unstructured text, topic classification, knowledge repositories, open-domain information extraction

    A lightweight method distinguishes articles within Wikipedia that are classes (“”Novel””, “”Book””) from other articles (“”Three Men in a Boat””, “”Diary of a Pilgrimage””). It exploits clues available within the article text and within categories associated with articles in Wikipedia, while not requiring any linguistic preprocessing tools. Experimental results show that classes can be identified among Wikipedia articles in multiple languages, at aggregate precision and recall above 0.9 and 0.6 respectively.

  • User-guided Hierarchical Attention Network for Multi-modal Social Image Popularity Prediction
    Authors: Wei Zhang, Wen Wang, Jun Wang and Hongyuan Zha

    Keywords: Social Image Popularity, Multi-modal Analysis, Attention Network

    Popularity prediction for the growing social images has opened unprecedented opportunities for wide commercial applications, such as precision advertising and recommender system. While a few studies have explored this significant task, little research has addressed its unstructured properties of both visual and textual modalities, and further considered to learn effective representation from multi-modalities for popularity prediction. To this end, we propose a model named User-guided Hierarchical Attention Network (UHAN) with two novel user-guided attention mechanisms to hierarchically attend both visual and textual modalities. It is capable of not only learning effective representation for each modality, but also fusing them to obtain an integrated multi-modal representation under the guidance of user embedding. As no benchmark dataset exists, we extend a publicly available social image dataset by adding the descriptions of images. Comprehensive experiments have demonstrated the rationality of our proposed UHAN and its better performance than several strong alternatives.

  • A Coherent Unsupervised Model for Toponym Resolution
    Authors: Ehsan Kamalloo and Davood Rafiei

    Keywords: toponym resolution, geolocation extraction, unsupervised disambiguation, context-bound hypotheses, spatial hierarchies

    Toponym Resolution, the task of assigning a location mention in a document to a geographic referent (i.e. latitude/longitude), plays a pivotal role in analyzing location-aware content. However, the ambiguities of natural language and a huge number of possible interpretations for toponyms constitute insurmountable hurdles for this task. In this paper, we study the problem of toponym resolution with no additional information other than a gazetteer and no training data. We demonstrate that a dearth of large enough annotated data makes supervised methods less capable of generalizing. Our proposed method estimates the geographic scope of documents and leverages the connections between nearby place names as evidence to resolve toponyms. We explore the interactions between multiple interpretations of mentions and the relationships between different toponyms in a document to build a model that finds the most coherent resolution. Our model is evaluated on three news corpora, two from the literature and one collected and annotated by us; then, we compare our methods to the state-of-the-art unsupervised and supervised techniques. We also examine three commercial products including Reuters OpenCalais, Yahoo! YQL Placemaker, and Google Cloud Natural Language API. The evaluation shows that our method outperforms the unsupervised technique as well as Reuters OpenCalais and Google Cloud Natural Language API on all three corpora; also, our method shows a performance close to that of the state-of-the art supervised method and outperforms it when the test data has 40% or more toponyms that are not seen in the training data.

  • Inferring Missing Categorical Information in Noisy and Sparse Web Markup
    Authors: Nicolas Tempelmeier, Elena Demidova and Stefan Dietze

    Keywords: Web Markup, Supervised Learning, Information Inferring

    Embedded markup of Web pages has seen widespread adoption
    throughout the past years, driven by standards such as RDFa and
    Microdata and initiatives such as, where recent studies
    show an adoption by 38% of all Web pages already in 2016. While
    this constitutes a significant source of information aiding tasks such
    as Web search, Web page classification or knowledge graph augmentation,
    individual markup nodes are usually sparsely described,
    where essential information usually is not provided. For instance,
    63% of nodes provide less than three statements/properties while
    from 24 million nodes describing events only 244 thousand (0.9%)
    provide specific event types. However, given the scale and diversity
    of markup data, in particular for categorical properties as in the
    previous example, there exist sufficiently large amounts of nodes,
    which provide the sought after information. These nodes constitute
    potential training data from which to build supervised models for
    inferring missing information to significantly augment sparsely
    described nodes. In this work, we introduce a supervised approach
    for inferring missing categorical properties in Web markup. Our
    results, conducted on properties of events and movies show a performance
    of 83% F1 score, outperforming existing
    baselines significantly.

  • Towards Annotating Relational Data on the Web with Language Models
    Authors: Matteo Cannaviccio, Denilson Barbosa and Paolo Merialdo

    Keywords: Web table annotation, Language Models, Knowledge Graph Augmentation

    Tables and structured lists on Web pages have long been identified as a potential source of valuable information, and several methods have been proposed to annotate such tables and lists with semantics that can be leveraged for search, question answering and information extraction. This paper is concerned with the specific problem of finding and ranking relations from a given Knowledge Graph (KG) that hold over the pairs of entities juxtaposed in a table or structured list. The current state-of-the-art for this task is to attempt to link the entities mentioned in the table cells to objects in the KG and rank the relations that hold for those linked objects. As a result, these methods are hampered by the incompleteness and uneven coverage in even the best knowledge graphs available today. The alternative described here relies on ranking relations using generative language models derived from Web-scale corpora instead of entity linking. As such, it can produce high quality results even when the entities in the table are missing from the graph. The experimental validation, designed to expose the challenges posed by KG incompleteness, shows the proposed approach is robust and effective on a wide range of domains.

  • CESI: Canonicalizing Open Knowledge Bases using Embeddings and Side Information
    Authors: Shikhar Vashishth, Prince Jain and Partha Talukdar

    Keywords: Knowledge Graphs, Open Knowledge Bases, Canonicalization, Knowledge Graph Embeddings

    Open Information Extraction (OpenIE) methods extract (noun phrase, relation phrase, noun phrase) triples from text, resulting in the construction of large open Knowledge Bases (KBs). The noun phrases (NPs) and relation phrases in such open KBs are not canonicalized, leading to the storage of redundant and ambiguous facts. Recent research has posed canonicalization of open KBs as clustering over manually-defined feature spaces. Manual feature engineering is expensive and often sub-optimal. In order to overcome this challenge, we propose CESI — a novel approach which performs canonicalization over learned embeddings of Open KBs. CESI extends recent advances in KB embedding by incorporating relevant NP and relation phrase side information in a principled manner. Through extensive experiments on multiple real-world datasets, we demonstrate CESI’s effectiveness. We plan to make publicly available all the data and code used in the paper.