PhD Symposium

PhD Symposium Chairs:

  • Sylvie Calabretto (Université de Lyon, France)
  • Lalana Kagal (MIT CSAIL, US)
  • Maria Maleshkova (Karlsruhe Institute of Technology, Germany)

List of selected papers:

  • Web Performance Automation for The People
    Authors: Robin Marx and Peter Quax

    Keywords: Web Performance, WPO, Network optimization, User context

    Web performance is important for the user experience and can heavily influence web page revenues. While there are many established Web Performance Optimization (WPO) methods, our work so far has clearly shown that new network protocols, optimized browsers and cutting-edge web standards can have a significant impact on known best practices. Additionally, there is still low-hanging fruit to be exploited, in the form of personalizing performance based on user context (i.e., current device, network, browser) and user preferences (e.g., text reading vs multimedia experience). In our PhD project, we strive to integrate this user-specific metadata into dynamic configurations for both existing and new automated WPO techniques. An intermediate server can (pre)generate optimized versions of a web page, which are then selected based on user context and preferences. Additional metadata is also passed along to the browser, enabling improvements on that side, and used to steer new network protocols to speed up the incremental delivery of page resources. We use the Speeder platform to perform and evaluate full-factorial objective measurements and use subjective user studies across a range of groups to assess the applicability of our methods to end users. Our aim is to provide insights into how WPO can be tweaked for specific users, in the hope of leading to new web standards that enable this behavior.

  • Modeling formation of online temporal communities
    Authors: Isa Inuwa-Dutse and Ioannis Korkontzelos

    Keywords: Social networks mining, community formation analysis and detection in Social networks, temporal communities, online social media, Twitter

    The advent of social media networks can be viewed as a break from the early two-step flow model, in which influential individuals act as intermediaries between the media and the general public for information diffusion. Social media platforms enable users to both generate and consume online content. Users continuously engage in and disengage from discussions with varying degrees of interaction, leading to the formation of distinct online communities. Such communities are formed at a high level, based either on metadata, such as hashtags on Twitter, or on popular content triggered by a few influential users. Models of online communities based on such popular content often do not reflect true connectivity and lack the cohesiveness required of a community. In this study, we propose to investigate the real-time formation of temporal communities of users at a microscopic level on Twitter. Our approach mimics a real-life event center scenario to effectively cluster users in real time into distinct and cohesive communities. Community membership relies on intrinsic tweet properties, which define similarity functions categorised into ContSim, MetSim, and AggSim. The ability to operate in real time makes the approach potentially capable of reflecting and dealing with the dynamism of communities on Twitter over time. We describe the model formulation, including a proposed setting of parameters, and some preliminary results. We also present the evaluation process and a baseline model to use for comparison. If successful, this approach will have applications in enhancing local event monitoring and clique-based marketing, among other benefits.

  • Detection of Strength and Causal Agents of Stress and Relaxation for tweets
    Authors: Reshmi Gopalakrishna Pillai

    Keywords: Stress, Relaxation, Sentiment Analysis, Word Sense Disambiguation, Word vectors

    The ability to automatically detect human stress and relaxation is central to the timely diagnosis of stress-related diseases, ensuring customer satisfaction in services and managing human-centric applications such as traffic management. Traditional methods employ stress measuring scales or physiological monitoring, which may be intrusive and inconvenient. Instead, the ubiquitous nature of social media can be leveraged to identify stress and relaxation, since people habitually share their recent life experiences through social networking sites. In this PhD research, we introduce an improved method to detect expressions of stress and relaxation in social media content. It uses word sense vectors for word sense disambiguation to improve the performance of TensiStrength, the first lexicon-based stress/relaxation detection algorithm. Experimental results show that incorporating word sense disambiguation substantially improves the performance of the original TensiStrength. It also outperforms state-of-the-art machine learning methods in terms of Pearson correlation and percentage of exact matches. We also suggest a novel, word-vector-based approach for detecting causes of stress and relaxation in social media content. We demonstrate the feasibility of the proposed approach through a pilot experiment.

  • Privacy Preserving Distributed Analysis of Social Networks
    Authors: Varsha Bhat

    Keywords: social network analysis, privacy, multiparty computation

    Social networks have been a popular choice of study, given the surge of online data on friendship networks, communication networks, collaboration networks, etc. This popularity, however, does not extend to all types of social networks. In the current work, we draw the reader’s attention to a class of social networks that has been investigated only to a limited extent: distributed sensitive social networks. This class consists of networks where the presence or absence of edges is known in a distributed manner to a set of parties, who regard this information as their private data. Examples include supply chain networks and informal networks such as trust, advice, and enmity networks. A major reason for the lack of any substantial study on these networks has been the unavailability of data. As a solution, we propose a privacy-preserving approach to investigating these networks. We show the feasibility of using secure multiparty computation techniques to perform the required analysis, while preserving the privacy of individual data. We discuss possible approaches to designing efficient secure protocols, such as efficient circuit design, ORAM-based secure computation, and the use of oblivious data structures. We also present the results obtained so far on secure network analysis algorithms.

  • Concept Embedded Topic Modeling Technique
    Authors: Dakshi Kapugama Geeganage

    Keywords: Topic modeling, semantics, concepts

    With the digitization of data, text content has proliferated: new content is transmitted through many sources, generating a large volume of information that spreads all over the world through different communication media. Text data is therefore available everywhere, and reading, understanding and analysing it have become a main activity in daily routines. With the increase in the volume and variety of information, organizing and searching for the required information has become vital. Topic modeling is the state of the art for organizing information and for understanding and extracting content. Most of the prevailing topic models use probabilistic approaches and consider frequency and co-occurrence to discover topics from collections of documents. The proposed research aims to address the existing problems of topic modeling by introducing a concept embedded topic model, which generates the most relevant and meaningful topics by understanding the content. The research includes approaches to understand the semantic elements of the content, identify the domains of concepts, and provide the most suitable topics without requiring the user to specify the number of topics beforehand. The significance of this research lies in capturing the semantics of document collections and generating the most related set of topics according to the actual meaning.

  • A User Centred Perspective on Structured Data Discovery
    Authors: Laura Koesten

    Keywords: data search, human data interaction, data discovery, data portals

    Structured data is becoming critical in every domain and its availability on the web is increasing rapidly. Despite its abundance and variety of applications, we know very little about how people find data, understand it, and put it to use. This work aims to inform the design of data discovery tools and technologies from a user-centred perspective. We aim to better understand what type of information supports people in finding and selecting data relevant for their respective tasks. We conducted a mixed-methods study looking at the workflow of data practitioners when searching for data and looked at search result presentation on data portals. From that, we identified textual summaries as a main element that supports the decision-making process in information seeking activities for data. Based on these results, we performed a mixed-methods study to identify attributes that people consider important when summarising a dataset. We found that such text summaries are laid out according to common structures, contain four main information types, and cover a set of dataset features. We describe potential follow-up studies that are planned to validate these findings and to evaluate their applicability in a dataset search scenario.

  • Automatic Translation of Competency Questions into SPARQL-OWL Queries
    Authors: Dawid Wiśniewski

    Keywords: ontology, competency question, SPARQL-OWL, word embedding, machine translation, ontology authoring

    The process of ontology authoring is inseparably connected with the quality assurance phase. One can verify the maturity and correctness of a given ontology by evaluating how many competency questions give correct answers. Competency questions are defined as a set of questions expressed in natural language that the finished ontology should be able to answer correctly. Although this method can easily indicate the development status of an ontology, one has to translate competency questions from natural language into an ontology query language. This task is very hard and time-consuming. To overcome this problem, my PhD thesis focuses on methods that automatically check the answerability of competency questions for a given ontology and propose SPARQL-OWL queries for those questions for which a query can be created. Because the task of automatic translation from competency questions to SPARQL-OWL queries is a novel one, besides a method, we have proposed a new benchmark to evaluate such translation.

  • Query for streaming information: dynamic processing and incremental maintenance of RDF stream
    Authors: Xuanxing Yang

    Keywords: RDF stream, Continuous query, Stream processing, Incremental maintenance

    Recently, with dynamic information being ubiquitous on the Web, there have been efforts to extend RDF and SPARQL to represent streaming information and to support continuous querying, respectively. While existing works focus on the formalization and implementation of the continuous querying process over RDF streams, little attention has been paid to the problem of querying the complex temporal correlations among RDF stream tuples and to the effective, scalable implementation of RDF stream processing systems. To fill this gap, in this paper we propose CT-SPARQL and IMRS, a language for querying the compositional stream patterns and an architecture for incremental maintenance of RDF stream tuples defined in CT-SPARQL. We believe that this work will benefit a wide range of real-time analysis and prediction applications.

  • Compromised account detection based on clickstream data
    Authors: Tobias Weller

    Keywords: Clickstream Fraud Detection, Anomaly Detection, Clickstream Analysis, Machine Learning

    The number of users of the World Wide Web is constantly increasing. However, this also increases the risks. Other users may illegally gain access to a user’s account on social networks, web shops or other web services. Previous work uses graph-based methods to identify hijacked or compromised accounts. In social networks, posts are most often used to detect fraud. However, not every compromised account is used to spread propaganda or carry out phishing attacks. Therefore, we restrict ourselves to the clickstreams from the accounts. In order to identify compromised accounts by means of clickstreams, we will also consider a temporal aspect, since the preferences of a user change over time. We choose a hybrid approach consisting of methods from subsymbolic and symbolic AI to detect fraud in clickstreams. We will also take into account the experience of domain experts. Our approach can be used to identify not only compromised accounts but also shared accounts, for instance on streaming sites.

  • Truth or Lie? Automatically fact checking news
    Authors: Lucas Azevedo

    Keywords: Fact Checking, Deception Detection, Linguistic Cues, Survey, Machine Learning

    In the current scenario of ever-growing data consumption speed and quantity, factors like the decentralization of news sources, citizen journalism and the democratization of the media make the task of manually checking and correcting disinformation across the internet impractical and not always feasible. There is thus an imperative need for a fast and reliable way to account for the veracity of what is produced and spread as information: automatic fact-checking. In this work we present the problem of fact-checking in the era of big data and post-truth. Some existing approaches for this task are presented and their main features discussed and compared. Finally, a new approach inspired by the best components of the existing ones is presented.

  • Monetization Strategies for the Web of Data
    Authors: Tobias Grubenmann

    Keywords: Web of Data, Monetization, Marketplace, Integer Programming, Auction

    Inspired by the World Wide Web, the Web of Data is a network of interlinked data fragments. One of the main advantages of the Web of Data is that all of its content is processable by machines. However, this also has its drawbacks when it comes to monetization of the content: advertisements and donations–two important financial motors in the World Wide Web–do not translate into the Web of Data, as they rely on exposing the user to advertisements or calls for donations. To remedy this situation, we propose two different monetization strategies for the Web of Data. The first strategy involves a marketplace where users can buy data in an integrated way. The second strategy allows third parties to promote certain data. In return, the sponsors pay money whenever a user follows a link contained in the sponsored data. We identified two different kinds of data–commercial and sponsored data–which can benefit from the two respective monetization strategies. With our work, we propose first solutions to the problem of financing the creation and publishing of content in the Web of Data.

  • Mining the Web of Life Sciences Linked Open Data for Mechanism-Based Pharmacovigilance
    Authors: Maulik R. Kamdar

    Keywords: Semantic Web, Linked Data, Pharmacovigilance, Federated Querying, Association Mining

    The vision of the Semantic Web has stimulated development of Web-scale architectures for discovering implicit associations from multiple heterogeneous data and knowledge sources. In biomedicine, using W3C-established standards and Linked Data principles, data publishers have transformed and linked several datasets to create a huge web of Life Sciences Linked Open Data (LSLOD). However, mining the LSLOD cloud is still very difficult and often impossible for biomedical researchers due to several challenges: structural heterogeneity, lack of vocabulary reuse, inconsistencies, and incompleteness. To discover drug–adverse reaction associations and their mechanistic explanations, we have developed a novel architecture that incorporates a pattern-based query-federation module for information retrieval, and a graph-analytics module for association discovery. The query federation module relies on RDF graph patterns that are observed in LSLOD sources and mapped to a common data model. The proposed architecture demonstrates favorable AUROC statistics against baseline methods in pharmacovigilance, along with confidence values on underlying biological mechanisms. We quantify the challenges associated with mining the LSLOD cloud for biomedical applications through an empirical analysis of more than 40 different sources. Ideally, our architecture can be extended to other biomedical domains to realize the goal of implicit association discovery.