2018 Awards and Distinctions
Seoul Test of Time Award!
- YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia
Suchanek, Kasneci and Weikum, Max-Planck-Institut
This paper was presented at WWW 2007.
Conference Awards Winners
- Best Paper Winner
- Best Poster Winner
- Best Demo Winner
The awards were presented during the closing ceremony, and the winners are:
Research Paper Honorable Mentions
A Short-term Intervention for Long-term Fairness in the Labor Market
Authors: Lily Hu and Yiling Chen
Keywords: fairness, game theory, labor market
The persistence of racial inequality in the U.S. labor market against a general backdrop of formal equality of opportunity is a troubling phenomenon that has significant ramifications on the design of hiring policies. In this paper, we show that current group disparate outcomes may be immovable even when hiring decisions are bound by an input-output notion of “individual fairness.” Instead, we construct a dynamic reputational model of the labor market that illustrates the reinforcing nature of asymmetric outcomes resulting from groups’ divergent access to resources and as a result, investment choices. To address these disparities, we adopt a dual labor market composed of a Temporary Labor Market (TLM), in which firms’ hiring strategies are constrained to ensure statistical parity of workers granted entry into the pipeline, and a Permanent Labor Market (PLM), in which firms hire top performers as desired. Individual worker reputations produce externalities for their group; the corresponding feedback loop raises the collective reputation of the initially disadvantaged group via a TLM fairness intervention that need not be permanent. We show that such a restriction on hiring practices induces an equilibrium that, under particular market conditions, Pareto-dominates those arising from strategies that employ statistical discrimination or a “group-blind” criterion. The enduring nature of equilibria that are both inequitable and Pareto suboptimal suggests that fairness interventions beyond procedural checks of hiring decisions will be of critical importance in a world where machines play a greater role in the employment process.
Aesthetic-based Clothing Recommendation
Authors: Wenhui Yu, Huidi Zhang, Xiangnan He, Xu Chen, Li Xiong and Zheng Qin
Keywords: Clothing recommendation, side information, aesthetic features, tensor factorization, dynamic collaborative filtering
Recently, product images have gained increasing attention in clothing recommendation, since the visual appearance of items has a significant impact on consumers’ decisions. Existing models usually extract conventional features, such as convolutional neural network (CNN) features, scale-invariant feature transform (SIFT) features, and color histograms, to represent item image characteristics and capture users’ visual preferences. However, one important feature, the aesthetic feature, is typically ignored. It is vital in recommendation since users’ decisions depend largely on whether the clothing matches their aesthetics, which conventional image features cannot portray directly. To bridge this gap, we propose to introduce aesthetic information, which is more closely related to users’ preferences, into the field of clothing recommender systems. To do so, we first extract aesthetic features with a pre-trained neural network, a brain-inspired deep structure trained for the aesthetic assessment task. Considering that aesthetic preference varies across people and over time, we propose a novel tensor factorization model as a basic model and then incorporate the aesthetic features into it. Finally, extensive experiments on real-world datasets demonstrate that our approach can capture the aesthetic preferences of consumers and significantly outperform several state-of-the-art models.
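The tensor factorization at the model's core can be pictured with a minimal CP-style sketch. This is a generic, hypothetical illustration of a user x item x time factorization; the paper's actual model additionally injects the aesthetic features, which are omitted here.

```python
import numpy as np

# Generic CP-style user x item x time factorization (illustrative only;
# the paper's model further incorporates aesthetic features).
rng = np.random.default_rng(0)
n_users, n_items, n_times, k = 4, 5, 3, 2
U = rng.normal(size=(n_users, k))  # user latent factors
V = rng.normal(size=(n_items, k))  # item latent factors
T = rng.normal(size=(n_times, k))  # time latent factors

def predict(u: int, i: int, t: int) -> float:
    """Predicted preference of user u for item i at time t:
    the inner product of the three factor vectors."""
    return float(np.sum(U[u] * V[i] * T[t]))
```

The time mode is what lets the model express that the same user may rank the same item differently in different seasons.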
An Automated Approach to Auditing Disclosure of Third-Party Data Collection in Website Privacy Policies
Authors: Timothy Libert
Keywords: Web Privacy, Web Security, Internet Policy, Internet Regulation
A dominant regulatory model for web privacy is “notice and choice”. In this model, users are notified of data collection and provided with options to control it. To examine the efficacy of this approach, this study presents the first large-scale audit of disclosure of third-party data collection in website privacy policies. Data flows on one million websites are analyzed and over 200,000 websites’ privacy policies are audited to determine if users are notified of the names of the companies which collect their data. Policies from 25 prominent third-party data collectors are also examined to provide deeper insights into the totality of the policy environment. Policies are additionally audited to determine if the choice expressed by the “Do Not Track” browser setting is respected. It is found that third-party data collection is widespread, but only 15% of attributed data flows are disclosed. The third parties most likely to be disclosed are those with consumer services users may be aware of; those without consumer services are less likely to be mentioned. Policies are difficult to understand, and the average time required to read both a given site’s policy and the associated third-party policies exceeds 85 minutes. Only 7% of first-party site policies mention the “Do Not Track” signal, and the majority of such mentions are to specify that the signal is ignored. Among the third-party policies examined, none offer unqualified support for the “Do Not Track” signal. Findings indicate that current implementations of “Notice and Choice” fail to provide notice or respect choice.
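One of the audit steps, checking whether a policy mentions the “Do Not Track” signal and whether it states that the signal is ignored, can be sketched in a few lines. The cue phrases below are illustrative assumptions, not the study's actual coding scheme.

```python
# Hypothetical sketch of one audit step: does a policy mention the
# "Do Not Track" signal, and if so, does it say the signal is ignored?
# The cue phrases are illustrative, not the study's coding scheme.
IGNORE_CUES = ("do not respond", "does not respond", "not respond to",
               "do not honor", "does not honor", "ignore")

def classify_dnt(policy_text: str) -> str:
    text = policy_text.lower()
    if "do not track" not in text:
        return "no mention"
    if any(cue in text for cue in IGNORE_CUES):
        return "mentioned, signal ignored"
    return "mentioned"

classify_dnt("We do not respond to Do Not Track browser signals.")
# → "mentioned, signal ignored"
```

A real audit at the paper's scale would of course need crawling, policy extraction, and far more careful text analysis than substring matching.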
Did You Really Just Have a Heart Attack?: Towards Robust Detection of Personal Health Mentions in Social Media
Authors: Payam Karisani and Eugene Agichtein
Keywords: Social Media Classification, Health Tracking in Social Media, Representation learning for text classification
Millions of users share their experiences on social media sites, such as Twitter, which in turn generates potentially valuable data for public health monitoring, digital epidemiology, and other analyses of population health at global scale. The first, basic task for any of these applications is classifying whether a personal health event was mentioned. Thus, identifying actual personal health mentions (PHMs) is critical. This task is challenging for many reasons, including the typically short length of posts, inventive spelling and lexicons, and figurative language, including hyperbole using diseases like “heart attack” or “cancer” for emphasis and not as a health self-report. This problem is even more challenging for rarely reported, or frequent but ambiguously expressed, conditions such as “stroke”. To address this problem, we propose a general, robust method for detecting PHMs in social media, which we call WESPAD, that combines lexical, syntactic, word embedding-based, and context-based features. WESPAD is able to generalize from few examples by automatically distorting the word embeddings to most effectively detect the true health mentions. Unlike previously proposed state-of-the-art supervised and deep learning techniques, WESPAD requires relatively little training data, which makes it possible to adapt with minimal effort to each new disease and condition. We evaluate WESPAD both on an established, publicly available flu detection benchmark and on a new dataset that we have constructed with mentions of multiple health conditions, which we plan to share with the research community. Our experiments show that WESPAD consistently and significantly improves upon baseline methods in a variety of settings, and consistently outperforms all state-of-the-art deep neural network methods. Most importantly, we show that WESPAD outperforms the lexical baseline by a large margin when the number and proportion of true health mentions in the training data are small. These properties make our method particularly valuable for extending online public health methods to an ever-expanding set of diseases and conditions.
Facebook (A)Live?: Are live social broadcasts really broadcasts?
Authors: Aravindh Raman, Gareth Tyson and Nishanth Sastry
Keywords: user generated broadcast, broadcast demographics, live videos
The era of live broadcast is back, but with two major changes. First, unlike traditional TV broadcasts, content is now streamed over the Internet, enabling it to reach a wider audience. Second, thanks to various user-generated content platforms, it has become possible for anyone to get involved, streaming their own content to the world. This emerging trend of going live usually happens via social platforms, where users perform live social broadcasts predominantly from their mobile devices, allowing their friends (and the general public) to engage with the stream in real time. With the growing popularity of such platforms, the burden on the current Internet infrastructure is expected to multiply. With this in mind, we explore one such prominent platform, Facebook Live. With one month of global data, we explore the characteristics of live social broadcasts, from which we infer a smarter way to alleviate the network burden. We then dissect global and hyper-local properties of the video while on air, by capturing the geography of the broadcasters (the users who produce the video) and the viewers (the users who interact with it). Finally, we study the social engagement while the video is live and distinguish the key aspects when the same video goes on-demand. A common theme throughout the paper is that, despite its name, many attributes of Facebook Live deviate from both the concepts of live and broadcast.
Fully Dynamic k-Center Clustering
Authors: T.-H. Hubert Chan, Arnaud Guerquin and Mauro Sozio
Keywords: k-center clustering, dynamic algorithms, social networks
Static and dynamic clustering algorithms are a fundamental tool in any machine learning library. Most of the effort in developing dynamic machine learning and data mining algorithms has focused on the sliding window model (where at any given point in time only the most recent data items are retained) or on more simplistic models. However, in many real-world applications one might need to deal with arbitrary deletions and insertions. For example, one might need to remove data items that are not necessarily the oldest ones, because they have been flagged as containing inappropriate content or due to privacy concerns. Clustering trajectory data might also require dealing with more general update operations. We develop a (2+epsilon)-approximation algorithm for the k-center clustering problem, which requires polylogarithmic amortized cost under a fully dynamic adversarial model. In such a model, points can be added or removed arbitrarily, provided that the adversary does not have access to the random choices of our algorithm. Our theoretical results are complemented by an extensive experimental evaluation on dynamic data from Twitter and Flickr, as well as trajectory data, demonstrating the effectiveness of our approach.
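For context, the static baseline that the dynamic (2+epsilon) result should be compared against is the classic greedy farthest-point 2-approximation for k-center. A minimal sketch, shown as background only, not the paper's dynamic algorithm:

```python
import math

def gonzalez_k_center(points, k):
    """Greedy farthest-point traversal: the classic 2-approximation for
    static k-center (background; not the paper's dynamic algorithm)."""
    centers = [points[0]]                      # arbitrary first center
    dist = [math.dist(p, points[0]) for p in points]
    while len(centers) < k:
        # next center: the point farthest from its nearest chosen center
        far = max(range(len(points)), key=lambda j: dist[j])
        centers.append(points[far])
        dist = [min(dist[j], math.dist(points[j], points[far]))
                for j in range(len(points))]
    return centers, max(dist)                  # centers and covering radius

centers, radius = gonzalez_k_center([(0, 0), (0, 1), (10, 0), (10, 1)], 2)
# → radius 1.0: every point lies within distance 1 of some center
```

The dynamic setting is harder precisely because this traversal depends on the whole point set: a single deletion can invalidate the chain of chosen centers.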
HighLife: Higher-arity Fact Harvesting
Authors: Patrick Ernst, Amy Siu and Gerhard Weikum
Keywords: Higher-arity relation extraction, Knowledge graphs, Partial fact observation, Knowledge Base Construction, Text-based knowledge harvesting, Health, Tree pattern learning, Distant supervision
Text-based knowledge extraction methods, for populating knowledge bases, have focused on binary facts: relationships between two entities. However, in advanced domains such as health, it is often crucial to consider ternary and higher-arity relations. An example is to capture which drug is used for which disease at which dosage (e.g. 2.5 mg/day) for which kinds of patients (e.g., children vs. adults). In this work, we present an approach to harvest higher-arity facts from textual sources. Our method is distantly supervised by seed facts, and uses the fact-pattern duality principle to gather fact candidates with high recall. For high precision, we devise a constraint-based reasoning method to eliminate false candidates. A major novelty is in coping with the difficulty that higher-arity facts are often expressed only partially in texts. For example, one sentence may refer to a drug, a disease and a group of patients, whereas another sentence talks about the drug, its dosage and the target group without mentioning the disease. Our methods cope well with such partially observed facts, at both pattern-learning and constraint-reasoning stages. Experiments with news articles and with health-related documents demonstrate the viability of our method.
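The partial-observation difficulty can be pictured with a toy compatibility check: two partial facts may be merged only when their overlapping slots agree. This is an illustrative sketch with made-up slot names and values, not HighLife's actual constraint-reasoning method.

```python
# Illustrative sketch (not HighLife's actual method): merge two partially
# observed higher-arity facts when their overlapping slots agree.
def merge_partial_facts(f1: dict, f2: dict):
    shared = set(f1) & set(f2)
    if any(f1[s] != f2[s] for s in shared):
        return None            # conflicting slots: cannot be the same fact
    return {**f1, **f2}        # union of the observed slots

# One sentence mentions drug/disease/group, another drug/dosage/group:
a = {"drug": "metformin", "disease": "type 2 diabetes", "group": "adults"}
b = {"drug": "metformin", "dosage": "500 mg/day", "group": "adults"}
merge_partial_facts(a, b)
# → {'drug': 'metformin', 'disease': 'type 2 diabetes',
#    'group': 'adults', 'dosage': '500 mg/day'}
```

The hard part the paper addresses is deciding when two partial observations really describe the same underlying fact, which requires learned patterns and constraints rather than exact slot matching.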
Large-Scale Analysis of Style Injection by Relative Path Overwrite
Authors: Sajjad Arshad, Seyed Ali Mirheidari, Tobias Lauinger, Bruno Crispo, Engin Kirda and William Robertson
Keywords: Relative Path Overwrite, Scriptless Attack, Style Injection
Relative Path Overwrite (RPO) is a recent technique to inject style directives into websites even when no style sink or markup injection vulnerability is present. It exploits differences in how browsers and web servers interpret relative paths (i.e., path confusion) to make an HTML page reference itself as a stylesheet; a simple text injection vulnerability, along with browsers’ leniency in parsing CSS resources, results in an attacker’s ability to inject style directives that will be interpreted by the browser. Even though style injection may appear a less serious threat than script injection, it has been shown to enable a range of attacks, including secret exfiltration. In this paper, we present the first large-scale study of the Web to measure the prevalence and significance of style injection using RPO. Our work shows that around 9% of the websites in the Alexa Top 10,000 contain at least one vulnerable page, out of which more than one third can be exploited. We analyze in detail various impediments to successful exploitation, and make recommendations for remediation. In contrast to script injection, relatively simple countermeasures exist to mitigate style injection. However, there appears to be little awareness of this attack vector, as evidenced by a range of popular Content Management Systems (CMSes) that we found to be exploitable.
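The path-confusion ingredient can be sketched with standard URL resolution (the URLs are hypothetical): browsers resolve relative references against the path they requested, so if a server returns the same page for a crafted, longer path, a relative stylesheet reference in that page resolves elsewhere, in the self-reference case to the page itself.

```python
from urllib.parse import urljoin

# Sketch of RPO's path-confusion step (hypothetical URLs): a browser
# resolves relative references against the requested URL path, so a
# crafted trailing path changes where "style.css" points.
page = "https://example.com/article/42"
crafted = "https://example.com/article/42/anything/x"
# Many servers still serve the same page for the crafted path.

urljoin(page, "style.css")
# → "https://example.com/article/style.css"
urljoin(crafted, "style.css")
# → "https://example.com/article/42/anything/style.css"
# If the page itself is what the server returns at that resolved URL,
# the page ends up loading itself as a stylesheet, and any injected
# text in it may be parsed as CSS.
```

This is only the resolution step; the attack additionally relies on a text injection point in the page and on lenient CSS parsing.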
Me, My Echo Chamber, and I: Introspection on Social Media Polarization
Authors: Nabeel Gillani, Ann Yuan, Martin Saveski, Soroush Vosoughi and Deb Roy
Keywords: political polarization, randomized experiment, social networks
Homophily—our tendency to surround ourselves with others who share our perspectives and opinions about the world—is both a part of human nature and an organizing principle underpinning many of our digital social networks. However, when it comes to politics or culture, homophily can amplify tribal mindsets and produce “echo chambers” that degrade the quality, safety, and diversity of discourse online. While several studies have empirically proven this point, few have explored how making users aware of the extent and nature of their political echo chambers influences their subsequent beliefs and actions. In this paper, we introduce Social Mirror, a social network visualization tool that enables a sample of Twitter users to explore the politically-active parts of their social network. We use Social Mirror to recruit Twitter users with a prior history of political discourse to a randomized experiment where we evaluate the effects of different treatments on participants’ i) beliefs about their network connections, ii) the political diversity of who they choose to follow, and iii) the political alignment of the URLs they choose to share. While we see no effects on average political alignment of shared URLs, we find that recommending accounts of the opposite political ideology to follow reduces participants’ beliefs in the political homogeneity of their network connections but still enhances their connection diversity one week after treatment. Conversely, participants who enhance their belief in the political homogeneity of their Twitter connections have less diverse network connections 2-3 weeks after treatment. We explore the implications of these disconnects between beliefs and actions on future efforts to promote healthier exchanges in our digital public spheres.
Minimizing Latency in Online Ride and Delivery Services
Authors: Abhimanyu Das, Sreenivas Gollapudi, Anthony Kim, Debmalya Panigrahi and Chaitanya Swamy
Keywords: vehicle routing problems, minimum latency problems, online ride services, online delivery services
We study natural variants of the classical multi-vehicle minimum latency problems where the objective is to route a set of vehicles located at depots to serve requests located at different points in a metric space so as to minimize the total latency. In this paper, we consider point-to-point requests that come with source-destination pairs and release-time constraints that restrict when each request can be served. The point-to-point requests and release-time constraints model taxi rides and deliveries. For all the variants considered, we show constant-factor approximation algorithms based on a linear programming framework. To the best of our knowledge, these are the first set of results for the aforementioned variants of the minimum latency problems. Furthermore, we provide an empirical study of heuristics based on our theoretical algorithms on a real data set of taxi rides.
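The objective itself is easy to state concretely. Below is a toy computation of total latency for requests served in a fixed order on a line metric with release times; it illustrates the objective only, not the paper's LP-based approximation algorithms.

```python
def total_latency(route, speed=1.0):
    """route: list of (release_time, source, destination) on the real
    line, served in order by one vehicle starting at position 0.
    Returns the sum of request completion times (the total latency)."""
    t = pos = latency = 0.0
    for release, src, dst in route:
        t += abs(src - pos) / speed   # drive to the pickup point
        t = max(t, release)           # cannot serve before release time
        t += abs(dst - src) / speed   # carry out the ride/delivery
        pos = dst
        latency += t                  # this request completes at time t
    return latency

total_latency([(0.0, 0.0, 2.0), (1.0, 2.0, 3.0)])
# → 5.0 (first request completes at t=2, second at t=3)
```

Minimizing this sum over the choice of vehicle assignments and service orders is what makes the problem hard; the toy above only evaluates a given order.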
Poster Honorable Mentions
- Rediet Abebe
- Jeremy Blackburn
- Tanmoy Chakraborty
- Gianmarco De Francisci Morales
- Dina Demner-Fushman
- Djellel Eddine Difallah
- Emilio Ferrara
- Vanessa Frias-Martinez
- Ran Gilad-Bachrach
- David Laniado
- Afra Mashhadi
- Michael Mathioudakis
- Preston McAfee
- Luke McDowell
- Davide Mottin
- Geppino Pucci
- Kirk Roberts
- Mahadev Satyanarayanan
- Markus Strohmaier