Web Content Analysis,
Semantics and Knowledge

List of accepted papers:

  • Large-Scale Hierarchical Text Classification with Recursively Regularized Deep Graph-CNN
    Authors: Hao Peng, Jianxin Li, Yu He, Yaopeng Liu, Mengjiao Bao, Lihong Wang, Yangqiu Song and Qiang Yang

    Keywords: Hierarchical Text Classification, Recursive Regularization, Graph-of-words, Deep Learning, Convolutional Neural Networks

    Text classification to a hierarchical taxonomy of topics is a common and practical problem. Traditional approaches simply use bag-of-words and have achieved good results. However, when there are a lot of labels with different topical granularities, bag-of-words representation may not be enough. Deep learning models have been proven to be effective to automatically learn different levels of representations for image data. It is interesting to study what is the best way to represent texts. In this paper, we propose a graph-CNN based deep learning model to first convert texts to graph-of-words, and then use graph convolution operations to convolve the word graph. Graph-of-words representation of texts has the advantage of capturing non-consecutive and long-distance semantics. CNN models have the advantage of learning different levels of semantics. To further leverage the hierarchy of labels, we regularize the deep architecture with the dependency among labels. Our results on both RCV1 and NYTimes datasets show that we can significantly improve large-scale hierarchical text classification over traditional hierarchical text classification and existing deep models.
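
    As an illustration of the graph-of-words idea mentioned in this abstract, the following minimal sketch links words that co-occur within a sliding window. The function name, the window size, and the count-based edge weights are assumptions made for this example; the paper's actual construction may differ.

```python
from collections import defaultdict


def graph_of_words(tokens, window=3):
    """Build an undirected co-occurrence graph from a token list.

    Two distinct words are linked if they appear within `window`
    consecutive tokens; the edge weight counts how often that happens.
    (Hypothetical helper, not taken from the paper.)
    """
    edges = defaultdict(int)
    for i, word in enumerate(tokens):
        # Look ahead at the next (window - 1) tokens.
        for j in range(i + 1, min(i + window, len(tokens))):
            other = tokens[j]
            if other != word:
                # Sort the pair so (a, b) and (b, a) share one edge.
                edges[tuple(sorted((word, other)))] += 1
    return dict(edges)


graph = graph_of_words("deep graph cnn for deep text".split(), window=3)
```

    Because edges connect words that are not adjacent, such a graph can relate terms that a plain n-gram model would treat as unrelated.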

  • Weakly-supervised Relation Extraction by Pattern-enhanced Embedding Learning
    Authors: Meng Qu, Xiang Ren, Yu Zhang and Jiawei Han

    Keywords: Relation Extraction, Weakly Supervised, Co-training

    Extracting relations from text corpora is an important task with wide applications. However, it becomes particularly challenging when focusing on weakly-supervised relation extraction, that is, utilizing a few relation instances (i.e., a pair of entities and their relation) as seeds to extract from corpora more instances of the same relation. Existing distributional approaches leverage the corpus-level co-occurrence statistics of entities to predict their relations, and require a large amount of labeled instances to learn effective relation classifiers. Alternatively, pattern-based approaches perform bootstrapping or apply neural networks to model the local contexts, but still rely on a large amount of labeled instances to build reliable models. In this paper, we study the integration of distributional and pattern-based methods in a weakly-supervised setting such that the two kinds of methods can provide complementary supervision for each other to build an effective, unified model. We propose a novel framework with a distributional module and a pattern module. During training, the distributional module helps the pattern module discriminate between the informative patterns and other patterns, and the pattern module generates some highly-confident instances to improve the distributional module. The whole framework can be effectively optimized by iterating between improving the pattern module and updating the distributional module. We conduct experiments on two tasks: knowledge base completion and corpus-level relation extraction. Experimental results prove the effectiveness of our framework over many competitive baselines.

  • Scalable Instance Reconstruction in Knowledge Bases via Relatedness Affiliated Embedding
    Authors: Richong Zhang, Junpeng Li, Jiajie Mei and Yongyi Mao

    Keywords: Knowledge Base Completion, Embedding, Link Prediction

    The knowledge base (KB) completion problem is usually formulated as a link prediction problem. Such a formulation is incapable of capturing certain application scenarios when the KB contains multi-fold relations. In this paper, we present a new class of KB completion problems, called instance reconstruction, complementing the scope of link prediction. Unlike its link-prediction counterpart, which has linear complexity in KB size, this problem has a complexity that behaves as a high-degree polynomial. This presents significant challenges in developing scalable reconstruction algorithms. In this paper, we present a novel KB embedding model (RAE) and build on it an instance reconstruction algorithm (SIR). The SIR algorithm utilizes schema-based filtering as well as “relatedness” filtering for complexity reduction. Here relatedness refers to the likelihood that two entities co-participate in a common instance, and the relatedness metric is learned from the RAE model. We show experimentally that SIR significantly reduces computation complexity without sacrificing reconstruction performance. The complexity reduction corresponds to reducing the KB size by a factor of 100 to 1000.

  • Improving Word Embedding Compositionality using Lexicographic Definitions
    Authors: Thijs Scheepers, Evangelos Kanoulas and Efstratios Gavves

    Keywords: Word Embeddings, Compositionality, Deep Learning, Distributional Semantics, Representation Learning

    We present an in-depth analysis of various popular word embeddings (Word2Vec, GloVe, fastText and Paragram) in terms of their compositionality, as well as a method to tune them towards better compositionality. We find that training the embeddings to compose lexicographic definitions improves their performance in this task significantly, while also getting similar or better performance in both word similarity evaluations and sentence embedding evaluations. Word embeddings are tuned using a simple neural network architecture with definitions and lemmas from WordNet. Since dictionary definitions can be composed into the lemmas they define, they are also suitable for tuning, as well as evaluating for compositionality. Our architecture allows for the embeddings to be composed using simple arithmetic operations, which makes these embeddings specifically suitable for production applications such as web search and data mining. We call our model structure: CompVec. In our analysis, we evaluate original embeddings, as well as tuned embeddings, using existing word similarity and sentence embedding evaluation methods. Aside from these evaluation methods used in related work, we also evaluate embeddings using a ranking method which tests composed vectors using the lexicographic definitions already mentioned. In contrast to other evaluation methods, ours is not invariant to the magnitude of the embedding vector—which we show is important for composition. We consider this new evaluation method (CompVecEval) to be a key contribution.
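
    To illustrate why a magnitude-sensitive evaluation can matter for composition, here is a minimal sketch that composes word vectors by summation and ranks candidate lemmas by Euclidean distance. The toy vectors, the function names, and the choice of Euclidean distance are assumptions made for this example only; the abstract does not specify the distance or ranking procedure used by CompVecEval.

```python
import math


def compose(vectors):
    """Compose word vectors by element-wise summation."""
    return [sum(components) for components in zip(*vectors)]


def rank_lemmas(composed, lemma_vectors):
    """Rank lemmas by Euclidean distance to the composed vector.

    Unlike cosine similarity, Euclidean distance changes when a
    vector is rescaled, so it is sensitive to magnitude.
    """
    return sorted(
        lemma_vectors,
        key=lambda lemma: math.dist(composed, lemma_vectors[lemma]),
    )


# Hypothetical 2-d embeddings, invented for this illustration.
definition = [[1.0, 0.0], [0.0, 1.0]]   # e.g. the words "young", "dog"
lemmas = {"puppy": [1.0, 1.0], "cat": [2.0, 2.0]}

ranking = rank_lemmas(compose(definition), lemmas)
```

    Both candidate vectors point in the same direction, so a cosine-based measure could not separate them, whereas the distance-based ranking places "puppy" first.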

  • CESI: Canonicalizing Open Knowledge Bases using Embeddings and Side Information
    Authors: Shikhar Vashishth, Prince Jain and Partha Talukdar

    Keywords: Knowledge Graphs, Open Knowledge Bases, Canonicalization, Knowledge Graph Embeddings

    Open Information Extraction (OpenIE) methods extract (noun phrase, relation phrase, noun phrase) triples from text, resulting in the construction of large open Knowledge Bases (KBs). The noun phrases (NPs) and relation phrases in such open KBs are not canonicalized, leading to the storage of redundant and ambiguous facts. Recent research has posed canonicalization of open KBs as clustering over manually-defined feature spaces. Manual feature engineering is expensive and often sub-optimal. In order to overcome this challenge, we propose CESI — a novel approach which performs canonicalization over learned embeddings of Open KBs. CESI extends recent advances in KB embedding by incorporating relevant NP and relation phrase side information in a principled manner. Through extensive experiments on multiple real-world datasets, we demonstrate CESI’s effectiveness. We plan to make publicly available all the data and code used in the paper.

  • Estimating Rule Quality for Knowledge Base Completion with the Relationship between Coverage Assumption
    Authors: Kaja Zupanc and Jesse Davis

    Keywords: Rule mining, Knowledge base, ILP, Open world assumption

    Currently, there are many large, automatically constructed knowledge bases (KBs). One interesting task is learning from a KB to generate new knowledge either in the form of inferred facts or rules that define regularities. One challenge for learning is that KBs are necessarily open world: we cannot assume anything about the truth values of facts not included in the KB. From a learning perspective, this means we lack negative examples. To address this problem, we propose a novel score function for evaluating the quality of a first-order definite clause learned from a KB. Our metric attempts to include information about the facts not in the KB when evaluating the quality of a potential rule. Empirically, we find that our metric results in more precise predictions than previous approaches.

  • Are All People Married?: Determining Obligatory Attributes in Knowledge Bases
    Authors: Jonathan Lajus and Fabian M. Suchanek

    Keywords: Knowledge Bases, Completeness, Classes, Attributes

    In the absence of counter-evidence, people generalize as an effective way to predict real-world facts. In this paper we discuss methods to automatically mine valid generalizations from a Knowledge Base (KB). In particular, we want to determine whether all instances of a class must have an attribute in the real world, i.e., whether the attribute is obligatory for the class. For example, has-Birth-Date is an obligatory attribute for the class Person, while has-Spouse is not. We introduce a new way to model incompleteness and derive a method to automatically determine obligatory attributes with a precision of up to 90%.

  • Never-Ending Learning for Open-Domain Question Answering over Knowledge Bases
    Authors: Abdalghani Abujabal, Rishiraj Saha Roy, Mohamed Yahya and Gerhard Weikum

    Keywords: Question Answering, Continuous Learning, Knowledge Bases

    Translating natural language questions to semantic representations such as SPARQL is a core challenge in open-domain question answering over knowledge bases (KB-QA). Existing methods rely on a clear separation between an offline training phase, where a model is learned, and an online phase where this model is deployed to answer users’ questions. Two major shortcomings are that such methods require access to a large training set, which is not always readily available, and that they fail on questions from domains not seen during training. To overcome these limitations, this paper presents a continuous learning paradigm for KB-QA, called NEQA. Offline, NEQA automatically learns templates from a small number of training question-answer pairs. Once deployed, continuous learning is triggered on cases where templates are insufficient. Using a semantic similarity function between questions and by judicious invocation of non-expert user feedback, NEQA learns new templates that capture previously-unseen syntactic structures. This way, NEQA gradually extends and improves its template repository. NEQA periodically re-trains its underlying models, allowing it to adapt to the language used after deployment. Our experiments demonstrate NEQA’s practical viability, with steady improvement in answering quality over time and gains on questions from previously-unobserved domains.

  • Short-Text Topic Modeling via Non-negative Matrix Factorization Enriched with Local Word-Context Correlations
    Authors: Tian Shi, Kyeongpil Kang, Jaegul Choo and Chandan K. Reddy

    Keywords: Topic Modeling, Short Texts, Non-negative Matrix Factorization, Word Embedding

    As a prevalent form of social communication on the Internet, billions of short texts are generated every day. Discovering knowledge from them has gained a lot of interest from both industry and academia. Short texts have limited contextual information, and they are sparse, noisy and ambiguous; hence, automatically learning topics from them remains an important challenge. To tackle this problem, in this paper, we propose a semantics-assisted non-negative matrix factorization (SeaNMF) model to discover topics for short texts. It effectively incorporates the word-context semantic correlations into the model, where the semantic relationships between the words and their contexts are learned from the skip-gram view of the corpus. The SeaNMF model is solved using a block coordinate descent algorithm. We also develop a sparse variant of the SeaNMF model which can achieve better model interpretability. Extensive quantitative evaluations on various real-world short text datasets demonstrate the superior performance of the proposed models over several other state-of-the-art methods in terms of topic coherence and classification accuracy. The qualitative semantic analysis demonstrates the interpretability of our models by discovering meaningful and consistent topics. With a simple formulation and superior performance, SeaNMF can be an effective standard topic model for short texts.

  • Estimating the Cardinality of Conjunctive Queries over RDF Data Using Graph Summarisation
    Authors: Giorgio Stefanoni, Boris Motik and Egor V. Kostylev

    Keywords: RDF, cardinality estimation, synopsis, query processing, SPARQL

    Accurately estimating the cardinality (i.e., the number of answers) of conjunctive queries is a core problem in data management. Common solutions to this problem are susceptible to high estimation errors that compound exponentially with the number of joins; hence, existing techniques can be inaccurate in data models such as RDF where queries are navigational and often involve many (self-)joins. In this paper we present a new technique for estimating the cardinality of conjunctive queries in RDF. We use a summary of an RDF graph as a synopsis that we interpret using a possible world semantics. We formalise the estimation problem as computing the expectation of query cardinality over all RDF graphs represented by the summary, and we present a closed-form formula for computing the expectation of arbitrary queries. We also discuss approaches to RDF graph summarisation. Finally, we show empirically that our cardinality technique is more accurate and more consistent, often by orders of magnitude, than several state-of-the-art approaches.

  • HighLife: Higher-arity Fact Harvesting
    Authors: Patrick Ernst, Amy Siu and Gerhard Weikum

    Keywords: Higher-arity relation extraction, Knowledge graphs, Partial fact observation, Knowledge Base Construction, Text-based knowledge harvesting, Health, Tree pattern learning, Distant supervision

    Text-based knowledge extraction methods, for populating knowledge bases, have focused on binary facts: relationships between two entities. However, in advanced domains such as health, it is often crucial to consider ternary and higher-arity relations. An example is to capture which drug is used for which disease at which dosage (e.g. 2.5 mg/day) for which kinds of patients (e.g., children vs. adults). In this work, we present an approach to harvest higher-arity facts from textual sources. Our method is distantly supervised by seed facts, and uses the fact-pattern duality principle to gather fact candidates with high recall. For high precision, we devise a constraint-based reasoning method to eliminate false candidates. A major novelty is in coping with the difficulty that higher-arity facts are often expressed only partially in texts. For example, one sentence may refer to a drug, a disease and a group of patients, whereas another sentence talks about the drug, its dosage and the target group without mentioning the disease. Our methods cope well with such partially observed facts, at both pattern-learning and constraint-reasoning stages. Experiments with news articles and with health-related documents demonstrate the viability of our method.

  • Browserless Web Data Extraction: Challenges and Opportunities
    Authors: Ruslan Fayzrakhmanov, Emanuel Sallinger, Ben Spencer, Tim Furche and Georg Gottlob

    Keywords: web data extraction, scraping, deep web, web automation, HTTP, AJAX

    Most modern web scrapers use an embedded browser to render web pages and to simulate user actions. Such scrapers (or wrappers) are therefore expensive to execute, in terms of time and network traffic. In contrast, it is orders of magnitude more resource-efficient to use a “browserless” wrapper which directly accesses a web server through HTTP requests, and takes the desired data directly from the raw replies. However, creating and maintaining browserless wrappers of high precision requires specialists, and is prohibitively labor-intensive at scale. In this paper, we demonstrate the feasibility in principle of automatically translating visual wrappers into “browserless” wrappers. We present the first algorithm and system performing such an automated translation on suitably restricted types of web sites. This system works in the vast majority of test cases and produces very fast and extremely resource-saving wrappers. We discuss research challenges for extending our approach to a general method applicable to a yet larger number of cases.

  • A Coherent Unsupervised Model for Toponym Resolution
    Authors: Ehsan Kamalloo and Davood Rafiei

    Keywords: toponym resolution, geolocation extraction, unsupervised disambiguation, context-bound hypotheses, spatial hierarchies

    Toponym Resolution, the task of assigning a location mention in a document to a geographic referent (i.e., latitude/longitude), plays a pivotal role in analyzing location-aware content. However, the ambiguities of natural language and a huge number of possible interpretations for toponyms constitute insurmountable hurdles for this task. In this paper, we study the problem of toponym resolution with no additional information other than a gazetteer and no training data. We demonstrate that a dearth of large enough annotated data makes supervised methods less capable of generalizing. Our proposed method estimates the geographic scope of documents and leverages the connections between nearby place names as evidence to resolve toponyms. We explore the interactions between multiple interpretations of mentions and the relationships between different toponyms in a document to build a model that finds the most coherent resolution. Our model is evaluated on three news corpora, two from the literature and one collected and annotated by us; then, we compare our methods to the state-of-the-art unsupervised and supervised techniques. We also examine three commercial products including Reuters OpenCalais, Yahoo! YQL Placemaker, and Google Cloud Natural Language API. The evaluation shows that our method outperforms the unsupervised technique as well as Reuters OpenCalais and Google Cloud Natural Language API on all three corpora; also, our method shows a performance close to that of the state-of-the-art supervised method and outperforms it when the test data has 40% or more toponyms that are not seen in the training data.

  • Towards Annotating Relational Data on the Web with Language Models
    Authors: Matteo Cannaviccio, Denilson Barbosa and Paolo Merialdo

    Keywords: Web table annotation, Language Models, Knowledge Graph Augmentation

    Tables and structured lists on Web pages have long been identified as a potential source of valuable information, and several methods have been proposed to annotate such tables and lists with semantics that can be leveraged for search, question answering and information extraction. This paper is concerned with the specific problem of finding and ranking relations from a given Knowledge Graph (KG) that hold over the pairs of entities juxtaposed in a table or structured list. The current state-of-the-art for this task is to attempt to link the entities mentioned in the table cells to objects in the KG and rank the relations that hold for those linked objects. As a result, these methods are hampered by the incompleteness and uneven coverage in even the best knowledge graphs available today. The alternative described here relies on ranking relations using generative language models derived from Web-scale corpora instead of entity linking. As such, it can produce high quality results even when the entities in the table are missing from the graph. The experimental validation, designed to expose the challenges posed by KG incompleteness, shows the proposed approach is robust and effective on a wide range of domains.

  • Inferring Missing Categorical Information in Noisy and Sparse Web Markup
    Authors: Nicolas Tempelmeier, Elena Demidova and Stefan Dietze

    Keywords: Web Markup, Supervised Learning, Information Inferring

    Embedded markup of Web pages has seen widespread adoption throughout the past years, driven by standards such as RDFa and Microdata and initiatives such as, where recent studies show an adoption by 38% of all Web pages already in 2016. While this constitutes a significant source of information aiding tasks such as Web search, Web page classification or knowledge graph augmentation, individual markup nodes are usually sparsely described, and essential information is often not provided. For instance, 63% of nodes provide fewer than three statements/properties, while from 24 million nodes describing events only 244 thousand (0.9%) provide specific event types. However, given the scale and diversity of markup data, in particular for categorical properties as in the previous example, there exist sufficiently large amounts of nodes which provide the sought-after information. These nodes constitute potential training data from which to build supervised models for inferring missing information to significantly augment sparsely described nodes. In this work, we introduce a supervised approach for inferring missing categorical properties in Web markup. Our experiments, conducted on properties of events and movies, show a performance of 83% F1 score, outperforming existing baselines significantly.

  • Facet Annotation Using Reference Knowledge Bases
    Authors: Riccardo Porrini, Matteo Palmonari and Isabel Cruz

    Keywords: facet annotation, semantic analysis, table annotation, faceted search, data lifting, eCommerce

    Faceted interfaces are omnipresent on the web to support data exploration and filtering. A facet is a triple: a domain (e.g., book), a property (e.g., author, language), and a set of values (e.g., {Austen, Beauvoir, Coelho, Dostoevsky, Eco, Kerouac, Süskind, ...}, {French, English, German, Italian, Portuguese, Russian, ...}). Given a property, a set of homogeneous values can be used to select those domain entities whose property values match the given set of values (e.g., given English and Italian, the book entities that have authors Austen, Eco, Kerouac, ..., are selected). Multiple properties and respective values can be considered at the same time or applied successively. To implement faceted interfaces in a way that is scalable to very large datasets, it is necessary to automate the process of extracting facets. Prior work considers the problem of associating a set of homogeneous values with a facet domain, but does not annotate a facet property. In this paper, we annotate a facet property with a predicate from a reference Knowledge Base (KB) so as to maximize the semantic similarity between the property and the predicate. We define semantic similarity in terms of three new metrics: specificity, coverage, and frequency. Our experimental evaluation uses the DBpedia and YAGO KBs and shows that for the facet annotation problem, we obtain better results than a state-of-the-art approach for the annotation of web tables as modified to annotate a set of values.

  • Why Reinvent the Wheel: Let’s Build Question Answering Systems Together
    Authors: Kuldeep Singh, Arun Sethupat Radhakrishna, Andreas Both, Saeedeh Shekarpour, Ioanna Lytra, Ricardo Usbeck, Akhilesh Vyas, Akmal Khikmatullaev, Dharmen Punjani, Christoph Lange, Maria-Esther Vidal, Jens Lehmann and Sören Auer

    Keywords: Question Answering, Software Reusability, Semantic Web, Semantic Search, QA Framework

    Modern question answering (QA) systems need to flexibly integrate a number of components specialised to fulfil specific tasks in a QA pipeline. Key QA tasks include Named Entity Recognition and Disambiguation, Relation Extraction, and Query Building. Since a number of different software components exist, implementing different strategies for each of these tasks, a major challenge when building QA systems is how to select and combine the most suitable components into a QA system, given the characteristics of a question. We study this optimisation problem and train classifiers, which take features of a question as input and have the goal of optimising the selection of QA components based on those features. We then devise a greedy algorithm to identify the pipelines that include the suitable components and can effectively answer the given question. We implement this model within Frankenstein, a QA framework able to select QA components and compose QA pipelines. We evaluate the effectiveness of the pipelines generated by Frankenstein using the QALD and LC-QuAD benchmarks. These results not only suggest that Frankenstein precisely solves the QA optimisation problem, but also that it enables the automatic composition of optimised QA pipelines, which outperform the static baseline QA pipeline. Thanks to this flexible and fully automated pipeline generation process, new QA components can be easily included in Frankenstein, thus improving the performance of the generated pipelines.

  • Sentiment Analysis by Capsules
    Authors: Yequan Wang, Aixin Sun, Jialong Han, Ying Liu and Xiaoyan Zhu

    Keywords: Sentiment Analysis, Capsule, Recurrent Neural Network, Attention

    In this paper, we propose RNN-Capsule, a capsule model based on Recurrent Neural Network (RNN) for sentiment analysis. For a given problem, one capsule is built for each sentiment category, e.g., ‘positive’, ‘neutral’, and ‘negative’. Each capsule has an attribute, a state, and three modules: representation module, probability module, and reconstruction module. The attribute of a capsule is the assigned sentiment category. Given an instance encoded in hidden vectors by a typical RNN, the representation module builds the capsule representation by the attention mechanism. Based on the capsule representation, the probability module computes the capsule’s state probability. A capsule’s state is active if its state probability is the largest among all capsules for the given instance, and inactive otherwise. On two benchmark datasets (i.e., Movie Review and Stanford Sentiment Treebank) and one proprietary dataset (i.e., Hospital Feedback), we show that RNN-Capsule achieves state-of-the-art performance on sentiment classification. More importantly, without using any linguistic knowledge, RNN-Capsule is capable of outputting words with sentiment tendencies reflecting capsules’ attributes. The words reflect the domain specificity of the dataset well. To the best of our knowledge, this is the first capsule model for sentiment analysis.
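
    The activation rule stated in this abstract (a capsule is active if its state probability is the largest for the instance) can be sketched as follows. The function name and the example probabilities are hypothetical, and breaking ties in favour of the first capsule is an assumption of this sketch, since the abstract does not address ties.

```python
def capsule_states(state_probs):
    """Map each capsule's attribute to True (active) or False (inactive).

    `state_probs` maps a sentiment category to the state probability
    computed by that capsule's probability module. Exactly one capsule
    is marked active: the one with the largest probability (the first
    such capsule if several are tied).
    """
    top = max(state_probs, key=state_probs.get)
    return {label: label == top for label in state_probs}


# Hypothetical probabilities for one input instance.
states = capsule_states({"positive": 0.7, "neutral": 0.2, "negative": 0.1})
```

    The predicted sentiment of the instance is then the attribute of the single active capsule.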

  • Content Attention Model for Aspect Based Sentiment Analysis
    Authors: Qiao Liu, Haibin Zhang, Yifu Zeng, Ziqi Huang and Zufeng Wu

    Keywords: Sentiment Analysis, Aspect Based, Attention Mechanism

    Aspect based sentiment classification is a crucial task for sentiment analysis. Recent advances in the neural attention model demonstrate that it can be helpful in the aspect based sentiment classification task, in that it can help identify the focus words of a person’s statements. However, according to our empirical study, prevalent content attention mechanisms proposed for aspect based sentiment classification mostly focus on identifying the sentiment words or shifters, without considering the relevance of such words with respect to the given aspects in the sentence. Therefore, they are usually insufficient for dealing with multi-aspect sentences and syntactically complex sentence structures. To solve this problem, we propose a novel content attention based aspect based sentiment classification model, with two attention enhancing mechanisms: the sentence-level content attention mechanism is capable of capturing the important information about a given aspect from a global perspective; the context attention mechanism is responsible for taking into account the order of the words and their correlations to the given aspects simultaneously, by embedding them into a series of customized memories. Experimental results on three benchmark datasets demonstrate that the proposed model outperforms the state-of-the-art, in which the proposed mechanisms play a key role.

  • User-guided Hierarchical Attention Network for Multi-modal Social Image Popularity Prediction
    Authors: Wei Zhang, Wen Wang, Jun Wang and Hongyuan Zha

    Keywords: Social Image Popularity, Multi-modal Analysis, Attention Network

    Popularity prediction for the growing volume of social images has opened unprecedented opportunities for wide commercial applications, such as precision advertising and recommender systems. While a few studies have explored this significant task, little research has addressed the unstructured properties of both visual and textual modalities or considered learning effective representations from multiple modalities for popularity prediction. To this end, we propose a model named User-guided Hierarchical Attention Network (UHAN) with two novel user-guided attention mechanisms to hierarchically attend to both visual and textual modalities. It is capable of not only learning an effective representation for each modality, but also fusing them to obtain an integrated multi-modal representation under the guidance of user embedding. As no benchmark dataset exists, we extend a publicly available social image dataset by adding the descriptions of images. Comprehensive experiments have demonstrated the rationality of our proposed UHAN and its better performance than several strong alternatives.

  • Modelling Dynamics in Semantic Web Knowledge Graphs with Formal Concept Analysis
    Authors: Larry Gonzalez and Aidan Hogan

    Keywords: Semantic Web, Schema, Knowledge Graph, Dynamics, FCA

    In this paper, we propose a novel data-driven schema for large-scale heterogeneous knowledge graphs inspired by Formal Concept Analysis (FCA). We first extract the sets of properties associated with individual entities; these property sets (a.k.a. characteristic sets) are annotated with cardinalities and used to induce a lattice based on set-containment relations, forming a natural hierarchical structure describing the knowledge graph. We then propose an algebra over such schema lattices, which allows computing diffs between lattices (for example, to summarise the changes from one version of a knowledge graph to another), adding lattices (for example, to project future changes), and so forth. While we argue that this lattice structure (and associated algebra) may have various applications, we currently focus on the use-case of modelling and predicting the dynamic behaviour of knowledge graphs. Along those lines, we instantiate and evaluate our methods for analysing how versions of the Wikidata knowledge graph have changed over a period of 11 weeks. We propose algorithms for constructing the lattice-based schema from Wikidata, and evaluate their efficiency and scalability. We then evaluate use of the resulting schema(ta) for predicting how the knowledge graph will evolve in future versions.
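
    The first step described in this abstract, extracting characteristic sets with cardinalities and relating them by set containment, can be sketched as below. The function names and the toy triples are hypothetical, and the sketch lists every strict-containment pair rather than only the covering relations a lattice diagram would show; it is not the paper's construction algorithm.

```python
from collections import Counter


def characteristic_sets(triples):
    """Count how many entities share each set of properties.

    `triples` is an iterable of (subject, property, object) tuples.
    Returns a Counter mapping frozenset(properties) -> cardinality.
    """
    props_by_entity = {}
    for subject, prop, _obj in triples:
        props_by_entity.setdefault(subject, set()).add(prop)
    return Counter(frozenset(props) for props in props_by_entity.values())


def containment_pairs(char_sets):
    """Return all (smaller, larger) pairs where smaller is a strict subset."""
    sets = list(char_sets)
    return {(a, b) for a in sets for b in sets if a < b}


# Hypothetical toy graph, invented for this illustration.
triples = [
    ("e1", "name", "Ada"), ("e1", "birthDate", "1815"),
    ("e2", "name", "Bob"),
    ("e3", "name", "Eve"), ("e3", "birthDate", "1990"),
]
sets_with_counts = characteristic_sets(triples)
pairs = containment_pairs(sets_with_counts)
```

    Here two entities share the set {name, birthDate} and one has only {name}, so the smaller set sits below the larger one in the containment order.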

  • Dynamic Embeddings for Language Evolution
    Authors: Maja Rudolph and David Blei

    Keywords: word embeddings, dynamic modeling, probabilistic modeling

    Word embeddings are a powerful approach for unsupervised analysis of language. Recently, Rudolph et al. developed exponential family embeddings, which cast word embeddings in a probabilistic framework. Here, we develop dynamic embeddings, building on exponential family embeddings to capture how the meanings of words change over time. We use dynamic embeddings to analyze three large collections of historical texts: the U.S. Senate speeches from 1858 to 2009, the history of computer science ACM abstracts from 1951 to 2014, and machine learning papers on the ArXiv from 2007 to 2015. We find dynamic embeddings provide better fits than classical embeddings and capture interesting patterns about how language changes.

  • Semantics and Complexity of GraphQL
    Authors: Olaf Hartig and Jorge Pérez

    Keywords: GraphQL, Graph Databases, Semantics, Complexity

    GraphQL is a recently proposed, and increasingly adopted, conceptual framework for providing a new type of data access interface on the Web. The framework includes a new graph query language whose semantics has been specified informally only. This has prevented the formal study of the main properties of the language. We embark on the formalization and study of GraphQL. To this end, we first formalize the semantics of GraphQL queries based on a labeled-graph data model. Thereafter, we analyze the language and show that it admits very efficient evaluation methods. In particular, we prove that the complexity of the GraphQL evaluation problem is NL-complete. Moreover, we show that the enumeration problem can be solved with constant delay. This implies that a server can answer a GraphQL query and send the response byte-by-byte while spending just a constant amount of time between every byte sent. Despite these positive results, we prove that the size of a GraphQL response might be prohibitively large for an internet scenario. We present experiments showing that current practical implementations suffer from this issue. We provide a solution to cope with this problem by showing that the total size of a GraphQL response can be computed in polynomial time. Our results on polynomial-time size computation plus the constant-delay enumeration can help developers to provide more robust GraphQL interfaces on the Web.
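
    To see why a response can be huge while its size stays cheap to compute, consider the following sketch. It is not the paper's algorithm or a real GraphQL engine: the toy graph, the nested-tuple query encoding, and the choice to measure size as the number of scalar values are all assumptions made for illustration. The point is only that memoizing on (node, subquery) pairs avoids materializing the response.

```python
from functools import lru_cache

# Hypothetical toy data graph: each node has "friend" edges to the other two.
GRAPH = {
    "a": {"friend": ("b", "c")},
    "b": {"friend": ("a", "c")},
    "c": {"friend": ("a", "b")},
}

def nested_query(depth):
    """Build friend { friend { ... { name } } } as nested (field, subquery) tuples."""
    query = (("name", None),)          # None marks a scalar field
    for _ in range(depth):
        query = (("friend", query),)
    return query

@lru_cache(maxsize=None)
def response_size(node, query):
    """Count the scalar values in the response without building the response."""
    total = 0
    for field, sub in query:
        if sub is None:
            total += 1                 # one scalar value for this field
        else:
            total += sum(response_size(n, sub) for n in GRAPH[node][field])
    return total

print(response_size("a", nested_query(3)))
print(response_size("a", nested_query(60)))
```

    In this toy setting the response doubles with every nesting level, yet the counter visits each (node, subquery) pair only once, so a server could use such a count to reject a query before evaluating it.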

  • Matching Natural Language Sentences with Hierarchical Sentence Factorization
    Authors: Bang Liu, Ting Zhang, Fred X. Han, Di Niu, Kunfeng Lai and Yu Xu

    Keywords: Hierarchical Sentence Factorization, Sentence Reordering, Semantic Matching, Ordered Word Mover’s Distance, Abstract Meaning Representation

    Semantic matching of natural language sentences or identifying the relationship between two sentences is a core research problem underlying many natural language tasks. Prior research has proposed both unsupervised distance-based schemes, when training data is not available, and deep learning schemes for sentence matching, given training data. However, previous approaches either omit or fail to fully utilize the ordered, hierarchical, and flexible structures of language objects, as well as the interaction between them. In this paper, we propose Hierarchical Sentence Factorization, a technique that is able to factorize a sentence into a hierarchical representation, with the components at each different scale reordered into a “predicate-argument” form. The proposed sentence factorization technique leads to the invention of: 1) a new unsupervised distance metric which calculates the semantic distance between a pair of text snippets by solving a penalized optimal transport problem while preserving the logical relationship of words in the reordered sentences, and 2) new multi-scale deep learning models for supervised semantic training, based on factorized sentence hierarchies. We apply our techniques to text-pair similarity estimation and text-pair relationship classification tasks, based on multiple datasets such as STSbenchmark, the Microsoft Research paraphrase identification (MSRP) dataset, the SICK dataset, etc. Extensive experiments show that the proposed hierarchical sentence factorization can be used to significantly improve the performance of existing unsupervised distance-based metrics as well as multiple supervised deep learning models based on the convolutional neural network (CNN) and long short-term memory (LSTM).

  • Find the Conversation Killers: A Predictive Study of Thread-ending Posts
    Authors: Yunhao Jiao, Cheng Li, Fei Wu and Qiaozhu Mei

    Keywords: Social conversations, Conversation prediction, Deep learning

    How to improve the quality of conversations in online communities has attracted considerable attention recently. Having engaged, urbane, and reactive online conversations has a critical effect on the social life of Internet users. In this study, we are particularly interested in identifying a post in a multi-party conversation that is unlikely to be further replied to, which therefore kills that thread of the conversation. For this purpose, we propose a deep learning model called ConverNet. ConverNet is attractive due to its capability of modeling the internal structure of a long conversation and its appropriate encoding of the contextual information of the conversation, through effective integration of attention mechanisms. Empirical experiments on real-world datasets demonstrate the effectiveness of the proposed model. On this topic of wide concern, our analysis also offers implications for improving the quality and user experience of online conversations.

  • Detecting Absurd Conversations from Intelligent Assistant Logs by Exploiting User Feedback Utterances
    Authors: Chikara Hashimoto and Manabu Sassano
    Presentation moved from track Intelligent and Autonomous systems on the Web

    Keywords: Intelligent assistant, User feedback, Absurdity detection, Natural language processing

    Intelligent assistants, such as Siri, are expected to converse comprehensibly with users. To facilitate improvement of their conversational ability, we have developed a method that detects absurd conversations recorded in intelligent assistant logs by identifying user feedback utterances that indicate users’ favorable and unfavorable evaluations of intelligent assistant responses; e.g., “great!” is favorable, whereas “what are you talking about?” is unfavorable. Assuming that absurd/comprehensible conversations tend to be followed by unfavorable/favorable utterances, our method extracts some absurd/comprehensible conversations from the log to train a conversation classifier that sorts all the conversations recorded in the log as either absurd or not. The challenge is that user feedback utterances are often ambiguous; e.g., a user may give an unfavorable utterance (e.g., “don’t be silly!”) to a comprehensible conversation in which the intelligent assistant was attempting to make a joke. An utterance classifier is thus used to score the feedback utterances in accordance with how unambiguously they indicate absurdity. Experiments showed that our method significantly outperformed baselines that lacked a conversation and/or utterance classifier, indicating the effectiveness of the two classifiers. Our method only requires user feedback utterances, which would be independent of domains. Experiments focused on chitchat, web search, and weather domains indicated that our method is likely domain-independent.

  • Socioeconomic Dependencies of Linguistic Patterns in Twitter: a Multivariate Analysis
    Authors: Jacob Levy Abitbol, Márton Karsai, Jean-Philippe Magué, Jean-Pierre Chevrot and Eric Fleury

    Keywords: computational sociolinguistics, Twitter data, socioeconomic status inference, social network analysis, spatiotemporal data

    Our usage of language is not solely reliant on cognition but is arguably determined by myriad external factors leading to a global variability of linguistic patterns. This issue, which lies at the core of sociolinguistics and is backed by many small-scale studies on face-to-face communication, is addressed here by constructing a dataset combining the largest French Twitter corpus to date with detailed socioeconomic maps obtained from the national census in France. We show how key linguistic variables measured in individual Twitter streams depend on factors like socioeconomic status, location, time, and the social network of individuals. We found that (i) people of higher socioeconomic status, active to a greater degree during the daytime, use a more standard language; (ii) the southern part of the country is more prone to use more standard language than the northern one, while locally the variety or dialect used is determined by the spatial distribution of socioeconomic status; and (iii) individuals connected in the social network are closer linguistically than disconnected ones, even after the effects of status homophily have been removed. Our results inform sociolinguistic theory and may inspire novel learning methods enabling the inference of socioeconomic status of people from the way they tweet.

  • Time Expression Recognition Using a Constituent-based Tagging Scheme
    Authors: Xiaoshi Zhong and Erik Cambria

    Keywords: Time expression recognition, Constituent-based tagging scheme, Inconsistent tag assignment, Position-based tagging scheme, Named entity recognition

    We analyze four datasets for the characteristics of time expressions, finding that time expressions are formed by loose structure and that the words used to express time information can differentiate time expressions from common text. The findings drive us to design a learning method named TOMN to model time expressions. TOMN defines a time-related tagging scheme, the TOMN scheme, with four tags (T, O, M, and N) that indicate the constituents of a time expression: Time token (T), Modifier (M), Numeral (N), and the words Outside the time expression (O). Essentially, our constituent-based TOMN scheme overcomes the problem of inconsistent tag assignment that is caused by the conventional position-based tagging schemes (e.g., BIO scheme and BILOU scheme). In modeling, TOMN assigns each word a TOMN tag under a framework of conditional random fields with minimal features. Experiments show that TOMN is as effective as or more effective than state-of-the-art methods on various datasets, and much more robust on cross-datasets. Moreover, our analysis can help explain many empirical observations in other works about time expression recognition and named entity recognition.
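
    The contrast between position-based and constituent-based tags can be made concrete with a small sketch. It only illustrates the two labelling schemes, not the TOMN model itself (which the abstract says uses conditional random fields): the tiny lexicons, the function names, and the example sentences are assumptions made for illustration.

```python
# Hypothetical tiny lexicons, for illustration only.
TIME_TOKENS = {"september", "monday", "year"}
MODIFIERS = {"last", "next", "early"}

def bio_tags(in_timex):
    """Position-based tags: B starts a time expression, I continues it."""
    tags, prev = [], False
    for inside in in_timex:
        tags.append("O" if not inside else ("I" if prev else "B"))
        prev = inside
    return tags

def tomn_tags(words, in_timex):
    """Constituent-based tags: the tag depends on the word type, not its position."""
    tags = []
    for word, inside in zip(words, in_timex):
        w = word.lower()
        if not inside:
            tags.append("O")
        elif w in TIME_TOKENS:
            tags.append("T")
        elif w in MODIFIERS:
            tags.append("M")
        elif w.isdigit():
            tags.append("N")
        else:
            tags.append("O")
    return tags

s1, m1 = ["in", "September", "2006"], [False, True, True]
s2, m2 = ["on", "1", "September", "2006"], [False, True, True, True]

print(bio_tags(m1), bio_tags(m2))
print(tomn_tags(s1, m1), tomn_tags(s2, m2))
```

    Under BIO, "September" is tagged B in the first sentence and I in the second; under the constituent-based scheme it is tagged T in both, which is the kind of consistency the abstract refers to.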

  • Finding Needles in an Encyclopedic Haystack: Detecting Classes Among Wikipedia Articles
    Authors: Marius Pasca

    Keywords: knowledge acquisition, concepts, classes, unstructured text, topic classification, knowledge repositories, open-domain information extraction

    A lightweight method distinguishes articles within Wikipedia that are classes (“Novel”, “Book”) from other articles (“Three Men in a Boat”, “Diary of a Pilgrimage”). It exploits clues available within the article text and within categories associated with articles in Wikipedia, while not requiring any linguistic preprocessing tools. Experimental results show that classes can be identified among Wikipedia articles in multiple languages, at aggregate precision and recall above 0.9 and 0.6 respectively.

  • Leveraging Social Media Signals for Record Linkage
    Authors: Andrew Schneider, Arjun Mukherjee and Eduard Dragut

    Keywords: Record Linkage, Social Media, Data Cleaning

    Many data-intensive applications collect (structured) data from a variety of sources. A key task in this process is record linkage, which is the problem of determining the records from these sources that refer to the same real-world entities. Traditional approaches use the record representation of entities to accomplish this task. With the nascence of social media, entities on the Web are now accompanied by user generated content. We present a method for record linkage that uses this hitherto untapped source of entity information. We use document-based distances, with an emphasis on word embedding document distances, to determine if two entities match. Our rationale is that user evaluations of entities converge in semantic content, and hence in the word embedded space, as the number of user evaluations grows. We analyze the effectiveness of the proposed method both as a stand-alone method and in combination with the record-based record linkage methods. Experimental results using real-world reviews demonstrate the high effectiveness of our approach. To our knowledge, this is the first work exploring the use of user generated content accompanying entities in the record linkage task.

  • MemeSequencer: Sparse Matching for Embedding Image Macros
    Authors: Abhimanyu Dubey, Esteban Moro, Manuel Cebrian and Iyad Rahwan

    Keywords: online imagery, social media, memes, image virality, computer vision, sparse representation, sparse matching, social networks, image understanding, semantic understanding

    The analysis of the creation, mutation, and propagation of social media content on the Internet is an essential problem in computational social science, affecting areas ranging from marketing to political mobilization. A first step towards understanding the evolution of images online is the analysis of rapidly modifying and propagating memetic imagery or ‘memes’. However, a pitfall in proceeding with such an investigation is the current inability to produce a robust semantic space for such imagery, capable of understanding differences in Image Macros. In this study, we provide a first step in the systematic study of image evolution on the Internet, by proposing an algorithm based on sparse representations and deep learning to decouple various types of content in such images and produce a rich semantic embedding. We demonstrate the efficacy of our approach on a variety of tasks pertaining to memes and Image Macros, such as image clustering, image retrieval, topic prediction, and virality prediction, surpassing the existing methods on each. In addition to its utility on quantitative tasks, our method opens up the possibility of obtaining the first large-scale understanding of the evolution and propagation of memetic imagery.

  • An Attention Factor Graph Model for Tweet Entity Linking
    Authors: Chenwei Ran, Wei Shen and Jianyong Wang

    Keywords: Twitter, Knowledge graph, Entity linking, Factor model, Attention model

    The rapid expansion of Twitter has attracted worldwide attention. With more than 500 million tweets posted per day, Twitter has become an invaluable information and knowledge source. Many Twitter-related tasks have been studied, such as event extraction, hashtag recommendation, and topic detection. A critical step in understanding and mining information from Twitter is to disambiguate entities in tweets, i.e., tweet entity linking. It is a challenging task because tweets are short, noisy, and fresh. Many tweet-specific signals have been found useful for solving the tweet entity linking problem, such as user interest, temporal popularity, location information, and so on. However, two common weaknesses exist in previous work. First, most proposed models are not flexible and extendable to fit new signals. Second, their scalability is not good enough to handle a large-scale social network like Twitter. In this work, we formalize the tweet entity linking problem into a factor graph model, which has shown its effectiveness and efficiency in many other applications. We also propose selective attention over entities to increase the scalability of our model, which brings linear complexity. To adopt the attention mechanism in the factor graph, we propose a new type of node called pseudo-variable nodes to solve the asymmetry attention problem caused by the undirected characteristic of the factor graph. We evaluated our model on two different manually annotated tweet datasets. The experimental results show that our model achieves better performance in terms of both effectiveness and efficiency compared with the state-of-the-art approaches.