Our keynote speaker is Prof. Michael J. Franklin, Professor of Computer Science at the University of California, Berkeley. Prof. Franklin is a world-renowned researcher in large-scale data management infrastructure and applications. He is also the Director of the Algorithms, Machines and People Lab (AMPLab), a highly-successful industry and government-supported collaboration across data management, cloud computing, and statistical machine learning.
Big Data software has created quite a stir, largely driven by open source environments such as Hadoop and Spark. In this talk, I'll begin by giving an overview of one such environment, the Berkeley Data Analytics Stack (BDAS), which we have been building over the past 6 years at the UC Berkeley AMPLab. BDAS has served as the launching platform for Spark, Mesos, Tachyon, GraphX, MLlib and other popular systems. I will then survey some recent trends in such software including some or all of the following: real-time analytics, machine learning model serving, internet of things, cloud-hosted analytics and the potential convergence of high-performance computing and big data processing. We can then try to predict which of these might actually take off and discuss the open research questions in some of the more promising areas.
Michael Franklin is the Thomas M. Siebel Professor of Computer Science and former Chair of the Computer Science Division at the University of California, Berkeley. Prof. Franklin is also the Director of the Algorithms, Machines, and People Laboratory (AMPLab), an NSF CISE Expedition in Computing center at UC Berkeley. The AMPLab currently works with nearly 30 industrial sponsors including founding sponsors Amazon Web Services, Google, IBM, and SAP. AMPLab is well-known for creating a number of popular systems in the Open Source Big Data ecosystem. Prof. Franklin is a co-PI and Executive Committee member for the Berkeley Institute for Data Science, part of a multi-campus initiative to advance Data Science Environments, and a PI of the NSF Western Region Big Data Innovation Hub. He is an ACM Fellow, a two-time winner of the ACM SIGMOD "Test of Time" award, has several recent "Best Paper" awards and two CACM Research Highlights selections, and is a recipient of the Outstanding Advisor Award from the Computer Science Graduate Student Association at Berkeley. In summer 2016 he will be joining the University of Chicago to initiate a major new effort in Data Science and to serve as Chair of Computer Science.
Organizational studies have been focusing on understanding human factors that influence the ability of an individual to perform a task, or a set of tasks, alone or in collaboration with others, for over 40 years. The reason crowdsourcing platforms have been so successful is that tasks are small and simple, and do not require a long engagement from workers. The crowd is typically volatile, its arrival and departure asynchronous, and its levels of attention and accuracy diverse. Today, crowdsourcing platforms have plateaued and, despite high demand, they are not adequate for emerging applications such as citizen science and disaster management. I will argue that workers need to be brought back into the loop by enabling worker-centric crowdsourcing. My current research seeks to verify how human factors such as skills, expected wage, and motivation contribute to making crowdsourcing take off again. In particular, I will discuss team formation for collaborative tasks, adaptive task assignment, and task composition to help workers find useful tasks.
This is joint work with Senjuti Basu Roy from the University of Washington and Dongwon Lee from Penn State University.
Data series are one of the most common types of data, and are present in virtually every scientific and social domain, such as biology, astronomy, entomology, the web, and others. It is not unusual for applications to involve numbers of data series in the order of hundreds of millions to billions, which are oftentimes not analyzed in their full detail due to their sheer size. In this work, we discuss the state-of-the-art data series indexing approaches that can cope with the data deluge. We briefly review the iSAX2+ index, the first index specifically designed for very large collections of data series, which uses novel algorithms for efficient bulk loading. We also describe the first adaptive indexing approach, ADS+, where the index is built incrementally and adaptively, resulting in a very fast initialization process. We experimentally validate the proposed algorithms, including the first published experiments to consider datasets of up to one billion data series, showing that we can deliver orders of magnitude improvements in the time required to build the index and to start answering queries. Furthermore, we observe that there is currently no system that can inherently accommodate, manage, and support complex analytics for data series. Therefore, we articulate the necessity for rigorous work on data series management systems, able to cope with the large volume of data series collections, their heterogeneity (in terms of properties and characteristics), and possible uncertainty in their values.
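To give a flavor of the symbolic representations that underlie iSAX-style indexes, here is a minimal sketch of the classic SAX discretization (z-normalization, piecewise aggregate approximation, then symbol assignment via Gaussian breakpoints). The parameters and function names are illustrative assumptions, not the iSAX2+ implementation.

```python
import numpy as np

# Standard SAX breakpoints for a 4-symbol alphabet (quartiles of N(0, 1)).
BREAKPOINTS = [-0.6745, 0.0, 0.6745]

def sax(series, n_segments=8, breakpoints=BREAKPOINTS):
    """Toy SAX: z-normalize, average over equal-width segments (PAA),
    then map each segment mean to an alphabet symbol index."""
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-9)        # z-normalize
    segments = np.array_split(x, n_segments)      # PAA segments
    paa = np.array([s.mean() for s in segments])
    return np.digitize(paa, breakpoints)          # symbols in 0..3

print(sax(np.sin(np.linspace(0, 6.28, 128))))
```

iSAX-style indexes then organize such symbol words hierarchically, refining the cardinality of individual symbols to split index nodes.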
We introduce a model for collaborative data-driven workflows based on the concept of “abstraction”. From a data viewpoint, this is simple: the peer sees only some of the global data (in a local-as-view style). But the peer also sees only part of what is going on globally.
We study the problem of explaining to a particular peer what is going on in the global system, and in particular what the other peers are doing. We would like to explain the events (which typically result in data updates) to the peer. At the global level, such an event is a rule instantiation. To explain the event, one can recursively explain how the positive and negative atoms in its body were obtained (one can think of this as the provenance of the event). This provides an explanation of the event. A particular peer sees only an “abstraction of the explanation”.
We formalize these various notions, and study the semantic and algorithmic issues they raise. A specific goal is to be able to construct, for a particular peer, a local workflow that is the appropriate abstraction of the global one.
Modern information- and knowledge-centric applications produce, store, integrate, query, analyze and visualize rapidly growing data sets. Traditional data processing technologies (data management and warehousing systems) are inadequate for processing this data and responding to the new big data challenges. From the data processing point of view, the first important challenge concerns physical scalability, which is achieved through generic, massively distributed parallel data processing architectures such as Hadoop, Spark, and Flink. A second challenge concerns the definition and implementation of flexible semi-structured / semantic data models and languages for facilitating complex semantic data representation, integration and access as a complement to traditional structured SQL database systems (NoSQL). The Resource Description Framework (RDF) is one of the first data models for modeling and processing this kind of data. Relevant application examples are the Semantic Web's Linked Open Data (LOD) cloud, which contains over 60 billion RDF triples, or the data originating from Schema.org annotations in Web documents.
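As a small illustration of the RDF data model mentioned above, the following sketch builds a tiny in-memory graph with the rdflib library and runs a SPARQL query over it. The vocabulary and resource names are invented for the example; this is not tied to any particular LOD dataset.

```python
from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/")       # hypothetical vocabulary
g = Graph()
g.add((EX.alice, RDF.type, EX.Person))
g.add((EX.alice, EX.worksFor, EX.acme))
g.add((EX.acme, RDF.type, EX.Company))

# SPARQL over the in-memory graph: who works for a company?
q = """
SELECT ?p ?c WHERE {
    ?p <http://example.org/worksFor> ?c .
    ?c a <http://example.org/Company> .
}"""
for person, company in g.query(q):
    print(person, company)
```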
The Big Data revolution is creating a new business environment with unprecedented opportunities, as well as new risks and constraints, for the insurance industry. More specifically, thanks to data, insurers can create new ways to interact with consumers and customers, new opportunities for growth, and new approaches to drive efficiency. But in order to reap these benefits, several challenges need to be addressed, some specific to insurance, others more generic.
At AXA, to bring some momentum and meet these challenges, the Data Innovation Lab was created in January 2014.
Philippe Marie-Jeanne, Group Chief Data Officer, founder and head of the Data Innovation Lab of AXA, will share his first-hand experience on these topics and explain the role and missions of the Data Innovation Lab, as well as the real challenges, both technical and strategic, faced today by a company like AXA on its journey to transform its business through data.
Before becoming Group Chief Data Officer of AXA and Head of the Data Innovation Lab (AXA Group), which he created in January 2014, Philippe Marie-Jeanne had been Group Head of P&C Retail since 2011. Before joining AXA Global P&C, he had been P&C Technical Head at AXA France since February 2007. From 1998 to 2007, Philippe held different positions at SMABTP, a leading insurer in the building and construction sector. From 2000 to 2011, Philippe Marie-Jeanne was a member of the CPABR (FFSA Plenary Commission). He was also President of the FFSA Statistician Committee from September 2008 to March 2011. Philippe started his career in 1989 at UAP Incendie et Accidents, where he held different positions until 1997, when he joined Tillinghast – Towers Perrin as Senior Consultant.
Philippe Marie-Jeanne is a graduate engineer of the École Polytechnique, also graduated from the National School of Statistics and Economic Administration (ENSAE), and is a member of the French Institute of Actuaries (IAF). He also followed an INSEAD executive program in 2006 (Programme supérieur pour Dirigeants).
Post-marketing drug surveillance is largely based on signals found in spontaneous reports from patients and health care providers. The use of web-based data (such as query logs and social media) is emerging among regulators (FDA and EMA), industry, and academia. Sanofi is involved in several projects whose aim is to assess the ability to identify signals from web-based data. Three examples will be provided in this presentation.
The travel industry generates huge volumes of data coming from different sources such as search engines, airline systems, online travel agencies, traveler reviews, and other possibly related data such as weather. At Amadeus, we are interested in leveraging all of these data to improve the overall experience for travelers. Combining data coming from diverse sources creates new business opportunities that had never even been imagined in the past. This not only raises some very interesting and complex technological challenges, but legal aspects also have to be carefully considered in this context. During this presentation we will present real-life challenges we have encountered in our daily work when building data-driven applications for the travel industry.
Dr. Acuna-Agost has more than 10 years of technical, scientific, and management experience solving complex decision-making problems by applying a wide range of techniques, from traditional operations research to more recent data science methods. He is currently Head of the Analysis and Research department, where he manages the scientific research activities of Amadeus. The main outcomes include innovation prototypes, development of core components of Amadeus’ products (e.g., optimizers), big-data analysis, and fundamental research related to the travel industry. Dr. Acuna-Agost’s work has been recognized by several awards over his career, including most recently the Best Industrial Application of Operations Research delivered by the French Community of Operations Research in 2015. Dr. Acuna-Agost maintains strong relationships with academia, including published papers, more than 40 presentations at international conferences, reviewing for several OR journals, and work as a Visiting Professor at Universidad de Concepción (Chile).
In this talk, I will present YAGO, one of the oldest and largest KBs on the Semantic Web, with more than 10 million entities and 120 million facts about them. The project is driven jointly by Telecom ParisTech and the Max Planck Institute for Informatics in Germany. YAGO is constructed automatically from Wikipedia, WordNet, GeoNames, and other resources. I will explain how the knowledge is extracted from multilingual Wikipedias, cleaned, and distilled into a KB of 95% accuracy. All in all, the original YAGO paper has been cited more than 1600 times.
Understanding customer buying patterns is of great interest in the retail industry. Applications include targeted advertising, optimized product placement, and cross-promotions. Association rules, expressed as A → B (if A then B), are a common and easily understandable way to represent buying patterns. While the problem of mining such rules has received considerable attention over the past years, most of the approaches proposed have only been evaluated on relatively small datasets, and struggle at large scale. In the context of the Datalyse project, Intermarché, our industrial partner, has given us access to 2 years of sales data: 3.5B sales records, 300M tickets, 9M customers, 200k products. This constitutes an opportunity to revisit the problem of association rule mining in the context of “big data”. In the remainder of this paper, I will first give an overview of our work on designing mining algorithms adapted to long-tailed datasets. Then, I will describe our evaluation of quality measures for ranking association rules. Finally, I will present the system architecture deployed to apply mining in production at Intermarché.
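For concreteness, here is a minimal sketch of how the support and confidence of a rule A → B are computed from a set of tickets. The tiny transaction list is made up, and this is of course not the production algorithm deployed at Intermarché.

```python
tickets = [{"beer", "chips"}, {"beer", "diapers"},
           {"beer", "chips", "diapers"}, {"milk", "chips"}]

def support(itemset, transactions):
    """Fraction of transactions containing the whole itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Conditional frequency of the consequent given the antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

antecedent, consequent = frozenset({"beer"}), frozenset({"chips"})
print("support:", support(antecedent | consequent, tickets))      # 0.5
print("confidence:", confidence(antecedent, consequent, tickets))  # ~0.67
```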
Big Data is a new term used to identify datasets that we cannot manage with current methodologies or data mining software tools due to their large size and complexity. Big Data mining is the capability of extracting useful information from these large datasets or streams of data. New mining techniques are necessary due to the volume, variability, and velocity of such data. MOA is a software framework with classification, regression, and frequent pattern methods, and the new Apache SAMOA is a distributed streaming software framework for mining data streams.
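As one illustration of the kind of frequent-pattern computation that stream miners such as MOA and Apache SAMOA support, here is a single-machine sketch of the classic Misra-Gries heavy-hitters summary, which finds frequent items in one pass with bounded memory. It is illustrative only and not taken from either system.

```python
def misra_gries(stream, k=3):
    """Keep at most k-1 counters; items with frequency > n/k are guaranteed to survive."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:                                  # decrement all counters, drop zeros
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

print(misra_gries(list("aababcabdaae"), k=3))   # 'a' and 'b' dominate the stream
```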
Big data is dominated by textual information. The bag-of-words model has been the dominant approach to text mining, assuming word independence and using frequencies as the main feature for feature selection and query-to-document similarity. Despite its long and successful usage, bag-of-words ignores the order of and distance between words within the document, weakening the expressive power of the distance metrics. We propose graph-of-word, an alternative approach that capitalizes on a graph representation of documents and challenges the word independence assumption by taking into account words’ order and distance. We applied graph-of-word in various tasks such as ad-hoc Information Retrieval, Keyword Extraction, Text Categorization and Sub-event Detection in Textual Streams. In all of these tasks, the graph-of-word approach, assisted by degeneracy at times, outperforms the state-of-the-art baselines.
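A minimal sketch of the graph-of-word representation described above: terms become nodes and an edge links terms that co-occur within a fixed-size sliding window. The window size and the unweighted co-occurrence counting are illustrative choices, not the exact settings used in our experiments.

```python
import networkx as nx

def graph_of_words(tokens, window=3):
    """Link each term to the terms following it within the sliding window."""
    g = nx.Graph()
    for i, term in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != term:
                w = g.get_edge_data(term, other, {"weight": 0})["weight"]
                g.add_edge(term, other, weight=w + 1)   # accumulate co-occurrences
    return g

g = graph_of_words("big data is dominated by textual information".split())
print(g.edges(data=True))
print(nx.core_number(g))   # k-core numbers, the degeneracy-based scores mentioned above
```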
We present MEDUSA, a project on the exploitation of satellite remote sensing data for monitoring the urban environment. In this project, the new context of big data in remote sensing is considered both a challenge and an opportunity. The investigated techniques are co-registration, machine learning (in particular deep learning), super-resolution, and change detection algorithms. The covered applications for urban monitoring are urban extension monitoring, traffic monitoring, ground and building deformation, and heat islands.
Typical analysis processes in the Life Sciences are complex, multi-staged, and large. One of the most important challenges is to properly represent, manage, and execute such in-silico experiments. As a response to these needs, scientific workflow management systems have been introduced. They provide an environment to guide a scientific analysis process from design to execution. This area is largely driven by the bioinformatics community and also attracts attention in fields like geophysics or climate research. In a scientific workflow, the analysis processes are represented at a high level of abstraction which enhances flexibility, reuse, and modularity while allowing for optimization, parallelization, provenance tracking, debugging etc. Differences with business and ETL workflows have been studied extensively: scientific workflows have building blocks which are complex user-defined functions rather than relational operators and they are focused on data transformations.
These developments, accompanied by the growing availability of analytical tools wrapped as (web) services, were driven by a series of expectations: end users of scientific workflow systems, without any programming skills, are empowered to develop their own pipelines; reuse of services is enhanced by easier integration into custom workflows; the time necessary for developing analysis pipelines decreases; etc. However, despite all efforts, scientific workflows have not yet found widespread acceptance among their intended audience.
In the meantime, it has become possible to share, search, and compare scientific workflows, opening the door to the exchange of mature and specialized data integration solutions. For example, myExperiment is a portal that hosts more than two thousand scientific workflows, while BioCatalogue is a repository of more than one thousand web services to be combined in workflows.
We argue that a wider adoption of scientific workflow systems would be highly beneficial for users but can only be achieved if at least the following two points are considered.
First, provenance in scientific workflows is a key concept and should be treated as a first-class citizen in scientific workflow systems. The importance of replication and reproducibility has been critically exemplified through studies showing that scientific papers commonly leave out experimental details essential for reproduction, studies showing difficulties with replicating published experimental results, an increase in retracted papers, and a high number of failing clinical trials. Provenance supports reproducibility and allows assessing the quality of results. Research questions for workflow provenance include comparing workflow runs based on their provenance data and querying provenance information, which can in turn be used to assess the similarity of workflows.
Second, since the targeted users are mainly non-programmers, they may not want to design workflows from scratch. The focus of research should thus be placed on searching, adapting, and reusing existing workflows. Only through this shift can scientific workflow systems reach out to the mass of domain scientists actually performing scientific analyses – who have little interest in developing workflows themselves. To this end, scientific workflow systems need to be combined with community-wide workflow repositories allowing users to find solutions for their scientific needs (coded as workflows). Moreover, to be reused by others, workflows should remain simple: a complex workflow composed of dozens of intertwined tasks is, in general, not much easier to understand than a well-structured program performing the same analysis.
Our talk will outline the contributions we made to these research questions and draw research opportunities for the database community.
Data science is a novel discipline, concerned with the design of automated methods to analyze massive and complex data in order to extract information. Data science projects require expertise from a vast spectrum of scientific fields ranging from research on methods (statistics, signal processing, machine learning, data mining, data visualization) through software building and maintenance to the mastery of the scientific domain where the data originate from. To tackle challenges arising from managing such a multidisciplinary landscape, a number of universities launched data science initiatives (DSIs) in the last couple of years. The goal of this talk is to raise and partially answer some of the questions these initiatives are facing, through the experience we accumulated at the Paris-Saclay Center for Data Science: What is the scope of a DSI? How is the data science ecosystem structured? Who are the players of the ecosystem? Where are the bottlenecks? What motivates the players, and how to manage the misaligned incentives? What existing tools do we have for managing deeply multidisciplinary projects, and what tools should we develop?
Directeur de recherche (Research Director) in Computer Science at CNRS, data scientist with more than 100 scientific papers. Wide experience at the interfaces of data science and scientific data (physics, biology, Earth sciences, macroeconomics). Since 2014, head of the Paris-Saclay Center for Data Science.
Uplift modeling is a branch of machine learning which aims at predicting the causal effect of an action on a given individual. It aims to predict not the class itself, but the difference in the behavior of the class variable between two groups. By using uplift modeling for recommender systems, we can differentiate between the effects of two treatments and select the best treatment based on its impact on customer behavior. We applied uplift modeling algorithms to a marketing campaign dataset, measured the real impact of each treatment, and optimized the recommender system through sub-targeting and personalization.
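The following sketch illustrates one standard way to estimate uplift, the so-called two-model approach: fit one response model on the treated group and one on the control group, and score each individual by the difference of predicted probabilities. The synthetic data and the scikit-learn models are illustrative assumptions; this is not the exact algorithm applied to the campaign dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                         # customer features (synthetic)
treated = rng.integers(0, 2, size=1000).astype(bool)   # who received the treatment
# synthetic outcome: the treatment only helps customers with a high first feature
y = (X[:, 0] + treated * (X[:, 0] > 0) + rng.normal(scale=0.5, size=1000)) > 0

model_t = LogisticRegression().fit(X[treated], y[treated])     # response under treatment
model_c = LogisticRegression().fit(X[~treated], y[~treated])   # response under control

uplift = model_t.predict_proba(X)[:, 1] - model_c.predict_proba(X)[:, 1]
print("customers to target first:", np.argsort(-uplift)[:5])
```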
Artificial neural networks and the deeper versions of these networks have become a highly active field in recent years due to the availability of huge amounts of data and computational resources. Deep learning is the new nomenclature for this collection of computational approaches to solve large-scale real-world problems, with significant economic and social impact on society, using learning capabilities closer to those of humans. There has already been a significant research effort on improving the accuracy of the recognition capabilities of deep learning systems in computer vision, speech processing and natural language processing. However, the major problem with deep learning algorithms is that they consume too much power. This prevents them from being deployed on a large scale. Currently, they need to run on server farms and cannot be integrated into mobile and handheld devices. In order to overcome this issue, we work on deep learning algorithms implemented in hardware for making sense of real-world data in different modalities such as image, audio and text. Recently, IBM announced its energy-efficient TrueNorth chip designed for spiking neural network architectures, which are a subset of deep neural networks. Our project is a collaboration with a hardware research group and is supported by STMicroelectronics. Our solution is not limited to spiking neural networks; it can be produced as a supporting hardware card and can handle large amounts of data very efficiently. The benefits of this solution are twofold. First, we process the data faster than a purely software approach, and second, we use power-efficient hardware instead of power-hungry servers, which allows us to deploy our algorithms ubiquitously. The most important research problem with an energy-efficient hardware solution is that it requires reduced precision in computation. We research efficient deep learning algorithms that allow reduced precision without compromising performance significantly. To this end, we work on deep learning algorithms with binary and ternary weights. With such a constraint on the weights, the problem becomes combinatorial and cannot be solved to optimality, especially in larger and deeper networks. Therefore, we develop variational techniques and sampling algorithms for approximate solutions that are also highly accurate.
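A minimal numpy sketch of the kind of weight binarization discussed above: the forward pass uses only the sign of the real-valued weights, while gradient updates are applied to the latent full-precision weights (the common "straight-through" trick). It illustrates the reduced-precision constraint only, not our variational or sampling algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)
W_real = rng.normal(scale=0.1, size=(4, 3))     # latent full-precision weights

def binarize(w):
    return np.where(w >= 0, 1.0, -1.0)          # weights constrained to {-1, +1}

def forward(x, w_real):
    return x @ binarize(w_real)                 # inference uses binary weights only

# Straight-through update: the gradient is computed through the binary weights
# but applied to the latent real-valued weights.
x, grad_out = rng.normal(size=(1, 4)), rng.normal(size=(1, 3))
grad_W = x.T @ grad_out
W_real -= 0.01 * grad_W
print(forward(x, W_real))
```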
We introduce a censored mixture model for durations, and develop maximum likelihood techniques for the joint estimation of the incidence and latency regression parameters in this model. We derive the necessary steps of the Quasi-Newton Expectation Maximization (QNEM) algorithm used for inference. The properties and robustness of the method are then examined by means of an extensive Monte Carlo simulation study, where we compare its performance with state-of-the-art methods in this framework. The model is finally illustrated on different real datasets, where we try to predict a marker of rehospitalisation or death based on high-dimensional clinical variables.
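For reference, the incidence/latency decomposition behind such censored mixture models for durations is usually written as the standard mixture cure model below, with a logistic incidence component; the notation is generic and the exact parametrization estimated by the QNEM algorithm may differ.

```latex
% Population survival: cured fraction plus susceptible fraction times latency survival
S(t \mid x) = \bigl(1 - \pi(x)\bigr) + \pi(x)\, S_u(t \mid x),
\qquad
\pi(x) = \frac{1}{1 + e^{-x^\top \beta}}
```

Here \(\pi(x)\) is the incidence (probability of being susceptible to the event) and \(S_u(t \mid x)\) is the latency survival function of the susceptible sub-population.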
There is an increasingly pressing need, by several applications in diverse domains, for developing techniques able to index and mine very large collections of sequences, or data series. Examples of such applications come from biology, astronomy, entomology, the web, and other domains. It is not unusual for these applications to involve numbers of data series in the order of hundreds of millions to billions, which are oftentimes not analyzed in their full detail due to their sheer size. None of the relevant techniques proposed in the literature so far has considered data collections much larger than one million time series. In this work, we describe the iSAX2 and ADS families of algorithms, designed for indexing and mining truly massive collections of data series. We show how our methods allow mining on datasets that would otherwise be completely untenable, including the first published experiments using one billion data series.
Query answering in the presence of semantic constraints must cope with both explicit and implicit data. Two families of query answering techniques have been proposed in the literature: saturation-based techniques, which add all the implicit knowledge to the explicit graph, which is then queried in a classical way, and reformulation-based techniques, which do not alter the set of explicit data, but reformulate the original query under the given constraints in such a way that completeness with respect to implicit knowledge is ensured. Different reformulation algorithms have been proposed in the literature. Union of conjunctive queries (UCQ) reformulation applies to various fragments of RDF, ranging from the Description Logics (DL) fragment up to the Database (DB) fragment. Another notable reformulation technique computes semi-conjunctive query (SCQ) reformulations and can be applied to the DL fragment of RDF. Join of unions of conjunctive queries (JUCQ) reformulation, a generalization of both UCQ and SCQ, has been proposed for the DB fragment of RDF, and has been shown to improve the execution performance of reformulated queries over both UCQs and SCQs. To improve the performance of reformulated query evaluation even further, in the context of query answering for RDF, we investigate SPJUM, which allows plans composed of semi-join, projection, join, union and materialization operators, and thus strictly subsumes JUCQ. Join and union operators can appear in the computed plans at any level, differently from all previous proposals in the literature. Semi-join operators are explored for improving performance, while materialization saves extra work when intermediate results present worthwhile reuse opportunities. In this context, we propose cost-based algorithms, which compute low-cost plans for executing the query.
YAGO is a knowledge base with more than 10 million entities (such as people, companies, and cities) and more than 120 million facts about these entities. It knows, e.g., birth dates of people, creation dates of companies, and names and sizes of cities. YAGO is extracted from Wikipedia in different languages. Human evaluation showed an accuracy of over 95%, which distinguishes it from its competitors.
Knowledge bases like YAGO enhance the possibilities of text analysis. Entities that are mentioned in a text can be linked to the corresponding entities of the knowledge base. This is called entity linking, and enables applications such as automatic link generation for websites, trend analysis, or fact checking. An analysis by Huet et al. on the French newspaper 'Le Monde' illustrates this by analyzing mentions of men and women and their age over time.
Traditional approaches restrict themselves to cases where the entity name is mentioned explicitly (such as “Barack Obama”). Our goal is to extend entity linking to nominal anaphoras such as “the president”, where it is clear from the context which entity is meant. By also covering nominal anaphoras, we hope to drastically increase the number of annotations, and thus ultimately their usefulness.
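To make the task concrete, here is a deliberately naive sketch of dictionary-based entity linking extended to nominal anaphoras: explicit names are matched against the KB, and a nominal mention such as "the president" is resolved to the most recently linked entity of a compatible type. The tiny dictionaries are invented for illustration; this is not our actual system.

```python
KB_NAMES = {"Barack Obama": "yago:Barack_Obama"}            # toy name dictionary
KB_TYPES = {"yago:Barack_Obama": {"president", "person"}}   # toy type dictionary
NOMINALS = {"the president": "president"}                   # nominal phrase -> expected type

def link(mentions):
    linked, last_of_type = [], {}
    for m in mentions:
        if m in KB_NAMES:                                   # explicit name mention
            entity = KB_NAMES[m]
            for t in KB_TYPES[entity]:
                last_of_type[t] = entity
            linked.append((m, entity))
        elif m in NOMINALS:                                 # nominal anaphora
            linked.append((m, last_of_type.get(NOMINALS[m])))
    return linked

print(link(["Barack Obama", "the president"]))
```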
When important events unfold, social media users engage in intense social activity by sharing relevant content in the form of text, pictures, videos. By efficiently analysing the large amount of data produced in social media such as Twitter, one could find and study events, in an automatic fashion. We present and evaluate a technique for finding events in Twitter based on finding overlapping dense subgraphs in a carefully defined graph.
The rise in the use of social networks in recent years has resulted in an abundance of information on different aspects of everyday social activities being available online, with the most prominent and timely source of such information being Twitter. This has resulted in a proliferation of tools and applications that can help end-users and large-scale event organizers to better plan and manage their activities. In this process of analyzing information originating from social networks, an important aspect is that of the geographic coordinates, i.e., the geolocalisation, of the relevant information, which is necessary for several applications (e.g., on trending venues, traffic jams, etc.). Unfortunately, only a very small percentage of Twitter posts are geotagged, which significantly restricts the applicability and utility of such applications. In this work, we address this problem by proposing a framework for geolocating tweets that are not geotagged. Our solution is general, and estimates the location from which a post was generated by exploiting the similarities in content between this post and a set of geotagged tweets, as well as their time-evolution characteristics. Contrary to previous approaches, our framework aims at providing accurate geolocation estimates at fine grain (i.e., within a city) and in real time. The experimental evaluation with real data demonstrates the efficiency and effectiveness of our approach.
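A highly simplified sketch of the content-similarity idea: the location of a non-geotagged post is estimated as the similarity-weighted average of the coordinates of the most similar geotagged posts. The toy data and the plain TF-IDF/cosine pipeline are illustrative assumptions only; our framework additionally exploits time-evolution characteristics, which are not shown here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

geotagged = [("traffic jam on the ring road", (48.85, 2.35)),
             ("sunset over the old port", (43.30, 5.37)),
             ("huge queue at the ring road exit", (48.86, 2.34))]
query = "stuck in a traffic jam near the ring road"   # non-geotagged post

texts = [t for t, _ in geotagged]
vec = TfidfVectorizer().fit(texts + [query])
sims = cosine_similarity(vec.transform([query]), vec.transform(texts))[0]

coords = np.array([c for _, c in geotagged])
weights = sims / sims.sum()                            # similarity-weighted average
print("estimated location:", weights @ coords)
```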
Over-indebtedness is a significant (growing) problem of modern societies, with impact at personal and societal levels. Using big data, credit risk can be estimated through ongoing, up-to-date borrower assessments. In this project, we are interested in the scalability of data mining and predictive algorithms. We focus on the accuracy of existing tools for over-indebtedness risk prediction, and we aim to propose new solutions that improve the scalability of the most accurate approaches. The objectives of this project are: (1) Propose a data centric technique that can be used for the measurement of the factors of over-indebtedness; (2) apply existing data mining techniques to predict over-indebtedness on real datasets at least 6 months before the actual over-indebtedness occurs, and evaluate their accuracy and efficiency; (3) develop new ideas to handle the scalability issues of the most accurate techniques. This project is developed as a collaboration between the Groupe BPCE and the Big Data and Market Insights Research Chair.
As one of the foundations of this thesis, reinforcement learning is reviewed first. Reinforcement learning is learning from interaction with an environment in order to achieve a goal. The learner discovers which actions produce the greatest reward by trying them out. The system then estimates how good a certain action is in a given state. Reinforcement learning aims to maximize the total reward in the long run. For this, the value of a state (or action) is crucial when making decisions, because the highest value brings about the greatest amount of reward over the long run. Rewards are given immediately after selecting an action, but values must be estimated from the experience of the agent. Estimating values is the most important task in reinforcement learning. The optimal estimated value of each action is obtained by iterative evaluation and improvement. To maximize the total reward, the agent must select the action with the highest value (exploitation), but to discover such an action it has to try actions not selected before (exploration). Exploration lets the agent experience actions not taken before, and it may increase the total reward in the long run because better actions may be discovered. The trade-off between exploitation and exploration is one of the central challenges in reinforcement learning.
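A minimal sketch of the exploration/exploitation trade-off discussed above, using epsilon-greedy action selection on a toy multi-armed bandit with incremental value estimates. The reward probabilities and the value of epsilon are invented for illustration.

```python
import random

true_reward_prob = [0.2, 0.5, 0.8]           # unknown to the agent
values = [0.0] * 3                           # estimated value of each action
counts = [0] * 3
epsilon = 0.1

for _ in range(10_000):
    if random.random() < epsilon:            # explore: pick a random action
        a = random.randrange(3)
    else:                                    # exploit: pick the best estimated action
        a = max(range(3), key=lambda i: values[i])
    reward = 1.0 if random.random() < true_reward_prob[a] else 0.0
    counts[a] += 1
    values[a] += (reward - values[a]) / counts[a]   # incremental mean update

print([round(v, 2) for v in values])         # estimates approach [0.2, 0.5, 0.8]
```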
Providing personalized point-of-interest (POI) recommendation has become a major issue with the rapid emergence of location-based social networks (LBSNs). Unlike traditional recommendation settings, the LBSN application domain comes with significant geographical and temporal dimensions. Moreover, most traditional recommendation algorithms fail to cope with the specific challenges implied by these two dimensions. Fusing geographical and temporal influences for better recommendation accuracy in LBSNs remains unexplored, as far as we know. We show how matrix factorization can serve POI recommendation, and propose a novel attempt to integrate both geographical and temporal influences into matrix factorization. Specifically, we present GeoMF-TD, an extension of geographical matrix factorization with temporal dependencies. Our experiments on a real dataset show up to 20% benefit in recommendation precision.
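To ground the discussion, the following sketch shows plain matrix factorization trained by stochastic gradient descent on a toy check-in matrix; GeoMF-TD adds geographical and temporal terms on top of this basic model, which are not shown here, and the data and hyperparameters are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_pois, k = 5, 7, 3
# toy implicit feedback: (user, poi, visit count)
checkins = [(0, 1, 3.0), (0, 4, 1.0), (2, 1, 2.0), (3, 6, 5.0), (4, 0, 1.0)]

P = rng.normal(scale=0.1, size=(n_users, k))    # user latent factors
Q = rng.normal(scale=0.1, size=(n_pois, k))     # POI latent factors
lr, reg = 0.05, 0.01

for _ in range(200):
    for u, i, r in checkins:
        err = r - P[u] @ Q[i]
        pu = P[u].copy()
        P[u] += lr * (err * Q[i] - reg * P[u])  # SGD update for the user factors
        Q[i] += lr * (err * pu - reg * Q[i])    # SGD update for the POI factors

print("predicted score for user 0, POI 1:", P[0] @ Q[1])
```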
Our work focuses on the theory of query evaluation on probabilistic instances, and asks which structural restrictions on such instances guarantee the tractability of this problem. We show a dichotomy result that fully answers this question. On the one hand, we show that the probabilistic query evaluation problem is always tractable on input instance families whose treewidth is bounded by a constant: in fact, this tractability extends beyond the simple formalism of tuple-independent databases and unions of conjunctive queries commonly studied in this context, all the way to guarded second-order queries on relational and XML models with expressive (but bounded-treewidth) correlations. On the other hand, on arity-two signatures, and under mild constructibility assumptions, we show that this problem is intractable on any family of input instances of unbounded treewidth, no matter which other conditions we choose to impose on the instances.
The Resource Description Framework (RDF) is the W3C’s graph data model for Semantic Web applications. We study the problem of RDF graph summarization: given an input RDF graph G, find an RDF graph S which summarizes G as accurately as possible, while being possibly orders of magnitude smaller than the original graph. Our summaries are intended as a help for query formulation and optimization; in particular, querying a summary of a graph should reflect whether the query has some answers against this graph. We introduce two summaries: a baseline which is compact and simple and satisfies certain accuracy and representativeness properties, but may oversimplify the RDF graph, and a refined one which trades some of these properties for more accuracy in representing the structure.
The aim of biological data ranking is to help users cope with huge amounts of data and choose between alternative pieces of information. This is particularly important in the context of querying biological data, where very simple queries to the huge repositories of biological data can return thousands of answers. Ranking biological data is a difficult task: data may be associated with various degrees of confidence; data are not independent of each other; and various ranking criteria can be considered (the most well-known data ranked first, or the freshest, or the most surprising, etc.). Rank aggregation techniques, which consist in aggregating several rankings into one consensual ranking, are very promising in this context. However, such approaches are intrinsically complex. A plethora of algorithmic approximations and heuristics have thus been designed, making the choice of the approach to follow very difficult for the user. In the RankaBio project we have carefully studied the problem of rank aggregation for biological data, both practically and fundamentally, by (i) performing a comparative study of rank aggregation algorithms, (ii) providing new results on the complexity of the problem and (iii) designing concrete tools able to efficiently rank biological data obtained as answers to queries on major biological data repositories (http://conqur-bio.lri.fr/).
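As a concrete (and deliberately simple) example of rank aggregation, the sketch below aggregates three rankings with the classical Borda count heuristic; the gene names are made up, and this is one textbook heuristic rather than the method recommended by RankaBio.

```python
from collections import defaultdict

rankings = [["geneA", "geneB", "geneC"],
            ["geneB", "geneA", "geneC"],
            ["geneB", "geneC", "geneA"]]

def borda(rankings):
    """Each item earns points inversely proportional to its position in each ranking."""
    scores = defaultdict(int)
    for ranking in rankings:
        for position, item in enumerate(ranking):
            scores[item] += len(ranking) - position
    return sorted(scores, key=scores.get, reverse=True)

print(borda(rankings))   # consensus ranking: ['geneB', 'geneA', 'geneC']
```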
Web-extracted knowledge bases (KBs) such as YAGO, DBpedia or the Knowledge Graph store up to billions of machine-readable facts about real-world entities. Such a plethora of information offers the opportunity to discover interesting patterns or rules in the data. For example, we can find that if a married person has a child, then most likely the spouse is also a parent of the child. Finding such rules automatically is known as rule mining. It is challenging for two reasons. First, KBs operate under the Open World Assumption (OWA), and therefore traditional rule mining techniques cannot distinguish between unknown information and negative evidence. Second, the size of current KBs poses scalability challenges for any rule mining method. In this poster we present a system called AMIE (http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/amie/), which can efficiently learn logical rules on large KBs under the OWA. In addition, we show some of the numerous applications of rule mining in KBs. For instance, rules can be used to predict facts with high precision in order to increase the recall of KBs. They can also be used in schema alignment, in the canonicalization of open KBs, and for the semantification of Wikipedia links.
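A toy sketch of the support/confidence-style measures that rule mining over a KB relies on, evaluated on a handful of invented triples. It computes only standard confidence and omits AMIE's OWA-aware (PCA) confidence as well as all of its scalability machinery.

```python
triples = {("anna", "marriedTo", "bob"), ("anna", "hasChild", "carl"),
           ("bob", "hasChild", "carl"), ("dave", "marriedTo", "eve"),
           ("dave", "hasChild", "fred")}

def facts(rel):
    return {(s, o) for s, r, o in triples if r == rel}

# Rule: marriedTo(x, y) & hasChild(x, z)  =>  hasChild(y, z)
body = [(x, y, z)
        for x, y in facts("marriedTo")
        for x2, z in facts("hasChild") if x == x2]
support = sum((y, z) in facts("hasChild") for x, y, z in body)
print("support:", support, "confidence:", support / len(body))   # 1 and 0.5
```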
The poster presents SoMap, a web-based platform that provides new scalable methods to aggregate, analyse and valorise large collections of heterogeneous social data in urban contexts. The platform relies on geotagged data extracted from social networks and microblogging applications such as Instagram, Flickr and Twitter and on Points Of Interest gathered from OpenStreetMap. It could be very insightful and interesting for data scientists and decision-makers.
SoMap enables dynamic clustering of filtered social data in order to display it on a map in a combined form. The key components of this platform are the clustering module, which relies on a scalable algorithm, and the ranking algorithm that combines the popularity of the posts, their location and their link to the points of interest found in the neighbourhood. The system further detects mobility patterns by identifying and aggregating trajectories for all the users.
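As a minimal illustration of density-based clustering of geotagged posts, the sketch below runs scikit-learn's DBSCAN on toy coordinates; the actual SoMap clustering module uses its own scalable algorithm and ranking, not this one, and the points and parameters are invented.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# toy (latitude, longitude) coordinates of social posts in a city
points = np.array([[48.8600, 2.3400], [48.8602, 2.3398], [48.8599, 2.3405],
                   [48.8530, 2.3499], [48.8531, 2.3497], [48.9000, 2.4000]])

labels = DBSCAN(eps=0.001, min_samples=2).fit_predict(points)
print(labels)   # two dense clusters plus one noise point (label -1)
```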
Social content such as blogs, tweets, news, etc. is a rich source of interconnected information. We present a new data model, called S3, which is the first to capture social, structured, and semantic features together and to allow searching them in a top-k fashion. We briefly present our data model, our algorithm, and its results.
Answering queries over Semantic Web data, i.e., RDF graphs, must account for both explicit and implicit data, the latter entailed by the explicit data and the semantic constraints holding on them. Two main query answering techniques have been devised: saturation-based (SAT), which precomputes and adds to the graph all implicit information, and reformulation-based (REF), which reformulates the query based on the graph constraints, so that evaluating the reformulated query directly against the explicit data (i.e., without considering the constraints) produces the query answer. While SAT is well understood, REF has received less attention so far. In particular, reformulated queries often perform poorly if the query is complex. Our work considers a large set of REF techniques, including but not limited to one we proposed recently. The audience will be able to analyze and understand the performance challenges they raise. In particular, we show how a cost-based REF approach allows avoiding reformulation performance pitfalls.
Temporal data is pervasive. Store receipts, tweets or temperature measures generated by weather sensors are just some examples. Temporal semantics bring the opportunity to connect data using time and leverage temporal connections between data items. In the first part of this PhD thesis, we investigate how to efficiently connect large amounts of temporal intervals using a new particular kind of join, coined Ranked Temporal Join (RTJ) [1]. RTJ features predicates that compare time intervals and a scoring function associated with each predicate to quantify how well it is satisfied. RTJ queries are prevalent in a variety of applications such as network traffic monitoring or tweet analysis and are often best interpreted as top-k queries where only the best matches are returned. We show how to exploit the nature of temporal predicates and the properties of their associated scoring semantics to design TKIJ, an efficient query evaluation approach on a distributed Map-Reduce architecture. TKIJ relies on an offline statistics computation used for workload assignment to reducers. This aims at reducing data replication, to limit I/O cost. Additionally, high-scoring results are distributed evenly to enable each reducer to prune unnecessary results. We conduct extensive experiments on both synthetic data and real network traffic logs. We show that TKIJ outperforms state-of-the-art competitors and provides very good performance on a range of realistic n-ary RTJ queries on large collections of temporal intervals. In the second part of this PhD thesis, we explore how to leverage temporal semantics of crowdsourced tasks to design an adaptive motivation-centric crowdsourcing framework. In our approach, tasks are assigned to workers in multiple iterations over time. At each iteration, we aim at capturing the motivational factors of each worker by leveraging the temporal connections holding in sequences of completed tasks. Our investigations include machine learning techniques on temporal data and on-the-fly task assignment approaches.
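To give a flavor of the RTJ semantics, here is a single-machine sketch: two collections of intervals are joined on an overlap predicate, each result is scored by the amount of overlap, and only the top-k pairs are returned. The interval data are invented, and the distributed TKIJ evaluation, its statistics-based workload assignment, and its pruning are not shown.

```python
import heapq

left = [("net-flow-1", 0, 10), ("net-flow-2", 5, 20)]
right = [("alert-a", 8, 12), ("alert-b", 18, 30)]

def overlap(a_start, a_end, b_start, b_end):
    """Scoring function: length of the intersection of the two intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

def ranked_temporal_join(left, right, k=2):
    results = []
    for l_id, ls, le in left:
        for r_id, rs, re in right:
            score = overlap(ls, le, rs, re)
            if score > 0:                     # overlap predicate satisfied
                results.append((score, l_id, r_id))
    return heapq.nlargest(k, results)         # keep only the top-k matches

print(ranked_temporal_join(left, right))
```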
[1] J. Pilourdault, V. Leroy, S. Amer-Yahia. Distributed Evaluation of Top-k Temporal Joins. To Appear (2016).
The Semantic Web is the vision that data can be shared across the boundaries of applications and websites. With the Linked Open Data (LOD) project, this vision has become much more concrete: RDF data can be published, accessed, and linked to in a distributed manner. More than 1000 datasets carrying rich semantic information are available this way. The initiative still has a long way to go, but we believe that the time has come to think beyond it: what if all Web data, whatever its source, type, or access mode, were available on the Semantic Web? How would we uniformly query all these resources? In our work we mainly focus on uniformly querying and integrating data coming from different datasets with heterogeneous structures, e.g., RDF datasets or Web service APIs. The first contribution of our work is a system called DORIS that enables uniform access to Web service sources with the purpose of enriching a target knowledge base. The key idea of our approach is to exploit the intersection of Web service call results with a knowledge base and with other call results. Second, we propose an online instance-based relation alignment approach between RDF datasets. The alignment may be performed during query execution and requires only partial information from the datasets. We align relations to a target dataset using association rule mining approaches.
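A toy sketch of the instance-based intuition behind relation alignment: two relations become candidates for alignment when the sets of entity pairs they connect overlap strongly, measured here with a simple confidence-style ratio. The datasets and threshold are invented, and both DORIS and our rule-mining approach involve considerably more than this.

```python
source = {("paris", "locatedIn", "france"), ("rome", "locatedIn", "italy"),
          ("berlin", "locatedIn", "germany")}
target = {("paris", "isIn", "france"), ("rome", "isIn", "italy"),
          ("oslo", "isIn", "norway")}

def pairs(triples, rel):
    """Entity pairs connected by a given relation."""
    return {(s, o) for s, r, o in triples if r == rel}

src_pairs, tgt_pairs = pairs(source, "locatedIn"), pairs(target, "isIn")
confidence = len(src_pairs & tgt_pairs) / len(src_pairs)
print("locatedIn -> isIn alignment confidence:", confidence)   # 2/3
```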