Probabilistic databases revolve around so-called possible worlds semantics. An uncertain database is actually a set of possible databases, also termed possible worlds. To run a query on the uncertain database, you run the query (in principle) in each of the possible worlds, and you get a set of possible answers. Define a probability distribution over the possible worlds, and you get a distribution over the possible answers; This pair of an uncertain database and a probability distribution is what is commonly referred to as a probabilistic database.
Uncertain data typically appears in one of three forms: Row-level uncertainty, where a tuple may or may not be in a given table, Attribute-level uncertainty, where the exact value of an attribute is not known, and Open-world uncertainty, where the set of tuples in a given result table is not known ahead of time. Note the distinction between row-level and open-world uncertainty: With the former, you can describe precisely, and in a finite way which tuples could be in the table, while with the latter you can not.
Morgan & Claypool: Probabilistic Databases A solid, theory-focused survey of techniques for probabilistic databases. A good starting point for anyone working in the space.
Theoretical Foundations of Uncertain Databases
This section outlines foundational work on probabilistic and uncertain databases in theory.
The C-Tables data representation is a way to compactly encode row- and attribute-level uncertainty in a classical deterministic database. The idea is to allow the use of labeled nulls, and to tag each row in an uncertain relation with a so-called 'local condition' The local condition is a boolean formula who's atoms are comparisons over the labeled nulls. Every possible assignment of values to the local nulls defines a possible world; Rows are only part of possible worlds who's valuation causes the row's local condition to evaluate to true.
- On representing incomplete information in a relational data base (no longer available on the internet?)
- Incomplete Information in Relational Databases (JACM article)
Somewhat related to uncertainty is a concept called three-valued logic, which extends classical boolean logic's True and False with a third "Unknown" value. This is the type of logic used in SQL when NULL values appear in queries, and it has the potential to mess all sorts of things up.
In general, query processing in a probabilistic database is NP-Hard. This means that for any meaningfully sized query, good luck getting precise results before the heat death of the universe. However, there are a number of cases where you can take shortcuts. For example, if you can prove that query results are independent (P(X | Y) = P(X), P(Y | X) = P(Y)) or disjoint (P(X|Y) = 0, P(Y|X) = 0), you can compute the probability of a disjunction significantly faster. There's been a bunch of theoretical work on trying to distinguish between queries for which such tricks exist, and queries where they don't. These are typically called 'dichotomy results'.
- The dichotomy of probabilistic inference for unions of conjunctive queries
- The complexity of causality and responsibility for query answers and non-answers
- Dichotomies for Queries with Negation in Probabilistic Databases
A few UPenn folks sat down to formally define a general model for provenance and came up with this beauty. There was already a well known link between database computations and semirings (Union & Aggregation are +, Joins are *). The authors here noted that many existing provenance models fit nicely slotted into the semiring model as well. The key idea is that you can express provenance as a polynomial expression (e.g.,
a + b*c) and then depending on what you slot in for the +, *, and base type, you get a different provenance model. This lets you build one system that supports all of those models. The crucial insight (with respect to Probabilistic Databases at least), is that the semiring model is equivalent to the condition columns of C-Tables. (NB: the paper incorrectly claims that they subsume all of C-Tables, although they don't capture the behavior of a C-Table's labeled nulls)
Aggregation in a probabilistic database is hard. The short reason is that you could potentially end up with a different aggregate value in each and every possible world. Thus, exact algorithms for aggregation must at least scale with the number of possible aggregate values, which could be exponential in the input size in the worst case. In practice though, that's not always the case. COUNT(*) for example is quite friendly to exact computation, since the number of distinct aggregate outputs is bounded by the number of input rows. SUM() can also be friendly on integers or low-precision floats (or more generally when your aggregate inputs are binned), as noted by the paper below out of INRIA.
Probabilistic Database Systems
This section outlines complete end-end systems for probabilistic query processing. A number of these are purely academic systems, and not all have been released.
MCDB, or the Monte-Carlo Data Base introduced to probabilistic databases the idea of describing a probability distribution by using a function that can compute a sample from the distribution. VG-Functions are table-generating functions that output a random sample from the table's possible worlds. MCDB processes queries by (1) generating a set of sampled possible worlds, (2) factorizing the possible worlds into a more compact representation, and (3) running the query over each of those possible worlds (conceptually) in parallel. Conveniently, the factorized representation also admits a more efficient query evaluation strategy. The main advantage of this approach is that it's simple and expressive: If you can generate samples of an uncertain table, you can use MCDB over it. In contrast to many of the other systems here, MCDB can support open-world uncertainty. Sampling upfront, however, can limit the accuracy of the query results, particularly if you have an extremely selective filtering predicate over the data.
The central idea behind MayBMS is a practical implementation of Probabilistic C-Tables called U-Relations (in fact, it's not uncommon to discuss U-Relations, calling them C-Tables). The idea is to avoid labeled nulls (which most databases do not support) and instead focus entirely on row-level uncertainty. As it turns out, if you're considering only finite, discrete (i.e., categorical) distributions, row-level uncertainty can encode attribute level uncertainty as well. By further limiting condition columns to conjunctions of boolean equalities (which is still sufficient to capture a significant class of queries), MayBMS can use a classical deterministic database engine to evaluate probabilistic queries.
- MayBMS: Managing incomplete information with probabilistic world-set decompositions
- 10^(10^6) worlds and beyond: efficient representation and processing of incomplete information
- MayBMS: a probabilistic database management system
- A Compositional Query Algebra for Second-Order Logic and Uncertain Databases
- On Query Algebras for Probabilistic Databases
- On APIs for Probabilistic Databases
- From Complete to Incomplete Information and Back
- DEMO: Query language support for incomplete information in the MayBMS system
- BOOK CHAPTER: MayBMS: A system for managing large uncertain and probabilistic databases
- MANUAL: MayBMS: A Probabilistic Database System.
Sprout grew out of the MayBMS project, and has functionally replaced it. It uses a similar data model, but work on sprout has focused largely on optimizing query evaluation. In particular, much of the work involves clever tricks for making it easier/faster to compute reliable approximations of boolean formulas -- the main bottleneck in most probabilistic database computations.
- Approximate Confidence Computation in Probabilistic Databases
- SPROUT: Lazy vs. Eager Query Plans for Tuple-Independent Probabilistic Databases
- A dichotomy for non-repeating queries with negation in probabilistic databases
- Anytime Approximation in Probabilistic Databases
- Aggregates in Probabilistic Databases via Knowledge Compilation
- Ranking in Probabilistic Databases: Complexity and Efficient Algorithms
- Using OBDDs for Efficient Query Evaluation on Probabilistic Databases
Some of the original dichotomy and complexity results in probabilistic databases came out of the MystiQ project.
- MYSTIQ: a system for finding more answers by using probabilities
- TECH REPORT: Efficient query evaluation on probabilistic databases
Orion 2.0 (UMD)
Orion 2.0 was the first attempt to support continuous probability distributions in a probabilistic database engine. Although it limits itself to a fixed set of probability distributions, by doing so it is able to compute exact, closed-form solutions to probability mass computations in some cases.
- PrDB: Managing and Exploiting Rich Correlations in Probabilistic Databases.
- Lineage Processing Over Correlated Probabilistic Databases
Though not technically a probabilistic database, Orchestra shares many of the same characteristics. Orchestra was designed as a data-sharing system for biologists, and as a result most of the uncertainty that they deal with comes from data exchange. Under the hood, they share a lot of ideas with C-Tables-based systems like Pip and MayBMS. They also leverage a lot of this infrastructure for provenance computations.
- ORCHESTRA: facilitating collaborative data sharing
- ORCHESTRA: Rapid, Collaborative Sharing of Dynamic Data
- The ORCHESTRA Collaborative Data Sharing System
- TECH REPORT: Provenance in ORCHESTRA
Trio was pretty much the first major system for provenance management, and was also one of the first who's authors recognized the connection with uncertainty (like provenance semi-rings above). Unfortunately, Trio restricted itself to lineage as a provenance model (Why provenance), and as a result was only able to handle comparatively limited classes of uncertainty and queries.
- Exploiting Lineage for Confidence Computation in Uncertain and Probabilistic Databases
- Working Models for Uncertain Data
- ULDBs: databases with uncertainty and lineage
- Databases with uncertainty and lineage
- Trio-One:Layering Uncertainty and Lineage on a Conventional DBMS
- TECH REPORT: Continuous Uncertainty in Trio
- TECH REPORT: Trio: A System for Integrated Management of Data, Accuracy, and Lineage
- TECH REPORT: An Introduction to ULDBs and the Trio System
Pip was an early attempt to fully implement a database engine supporting C-Tables using user-defined types and a front-end query rewriter. Pip builds on MCDB, allowing VG-Functions to be defined alongside secondary metadata functions, consequently models are defined using what the paper refers to as a 'Grey-Box' function. Although not explicitly called out in the paper, one of the key contributions is the idea of using expressions as variable identifiers – this makes it possible to perform extended projections over probabilistic data without needing to allocate fresh variable identifiers during query evaluation.
Mimir is the first effort to virtualize uncertainty-based computations. In comparison to other systems which define an on-disk representation of uncertainty, Mimir borrows the VG-Terms of Pip and defines uncertain relations by the query used to construct them. Then, depending on the specific requirements of the user, Mimir selects one of several intermediate representations best suited for the workload. In addition to exploring virtualized query processing, the Mimir group has also explored techniques for to presenting uncertain query results to users.
- On-Demand Query Result Cleaning
- Detecting the Temporal Context of Queries
- Lenses: an on-demand approach to ETL
- Communicating Data Quality in On-Demand Curation
- Mimir: Bringing CTables into Practice
Jigsaw is a variant of MCDB. The underlying system implementation basically follows the same pattern. The goal here is to address the fact that VG-Functions are black-boxes and identify scenarios where two different parameterizations are going to create similar results. The trick is to generate a bunch of samples from the function and use that as a fingerprint. When approximating query results in a first-pass Jigsaw avoids repeated invocation of a VG Function by calling only once per fingerprint.
- Jigsaw: Efficient optimization over uncertain enterprise data
- DEMO: Fuzzy prophet: parameter exploration in uncertain enterprise scenarios
Querying Machine Learning Models
Virtually all probabilistic database systems adopt a data model based on tuples. A number of efforts have come up looking at how to use similar techniques to directly query data defined by a graphical model and/or how to represent graphical models in a database.
Use databases to evaluate graphical models. Now part of MADLib
- BayesStore: managing large, uncertain data repositories with probabilistic graphical models
- THESIS: Extracting and Querying Probabilistic Information in BayesStore
Create views that represent data in a continuous domain.
Database for fitting and obtaining metrics over Markov Chain Simulations.
Machine Learning toolkit for information extraction
Spinoff of DeepDive and BigDansig
Ihab Ilyas' Projects
- Descriptive and Prescriptive Cleaning
- Sampling from Repairs
- Data Civilizer
System for fitting and querying markov chains.
- Challenges for Event Queries over Markovian Streams
- Event Queries on Correlated Probabilistic Streams
- Access Methods for Markovian Streams
- Lahar Demonstration: Warehousing Markovian Streams (demo)
- Lineage for Markovian Stream Event Queries
Database with Active Learning.
- The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox
Signal Processing Database
Integration for Spark/SciKit that allows users to store, query, retrieve, and reuse training parameters for models.
MCMC using IVM
Probabilistic Programming Languages
This section outlines a few languages specially aimed at allowing programmers to write turing-complete programs (or at least imperative-looking programs) where data/variables are defined as probability distributions.
- ENFrame: A Framework for Processing Probabilistic Data
- Declarative Probabilistic Programming with Datalog
Formal semantics for a language for defining hierarchical linear regressions, based on similar language extensions found in many R packages. The authors integrate the language functionality into Tabular, a DSL for programming Excel spreadsheets.
A number of efforts have taken a step back from probabilistic query processing and come up with some clever things to do when all you have is a weaker description of uncertainty in your data. One particular class of efforts explored in this section is incomplete results --- cases where you can precisely characterize what fragment of your input data is missing (e.g., when one partition goes down in a distributed system, the other partition generally know exactly shards of data were on that partition)
Partial Results in Database Systems develops a classification system for completeness. As in most work on PQP, they adopt notions of value- and row-level uncertainty, which they model by indicating that a particular attribute/table is complete or incomplete (and/or providing statistical measures over the resulting attributes). They then show how completeness propagates through a query.
m-tables: Representing Missing Data is the theoretical answer to Partial Results in Database Systems. The idea is to model tuples that could be missing along with a hypothetical distribution over their multiplicities.
Identifying the Extent of Completeness of Query Answers over Partially Complete Databases does something almost identical. The underlying idea is to provide a set of query patterns and show how these patterns propagate through a query. This allows them to relate annotated fragments of query inputs to query outputs, which in turn makes it possible to detect what results are incomplete.
- Indexing Metric Uncertain Data for Range Queries, in which the authors propose an indexing scheme for spatial data with an uncertain position. One of several papers that adopts a similar key trick: Index one point in the probability distribution. Then, when you query, ask the user to provide you with a confidence bound/threshold and use the threshold to define a sufficiently large radius around the area you're querying to ensure that you hit any point that could be possibly in the result with at least that threshold.
- Supporting Data Uncertainty in Array Databases, in which the authors address positional uncertainty in an array database, a premise quite similar to spatial uncertainty but on a discrete domain. The key challenge the paper addresses is the tradeoff between replication (casting attribute-level uncertainty as tuple-level uncertainty as in MayBMS) and the overhead of relaxing query constraints on indexes to detect non-replicated records. Interestingly, more replication increases query cost due to redundant tuple copies getting returned.
Lineage & Why/Why Not Queries
- Sensitivity Analysis and Explanations for Robust Query Evaluation in Probabilistic Databases
- Meiliu/Gatterbaur/Suciu: "Why or why no"?
- Approximate Lineage for Probabilistic Databases
- Sensitivity Analysis and Explanations for Robust Query Evaluation in Probabilistic Databases
- Lenses: An On-Demand Approach to ETL
- Scorpion: Explaining away outliers in aggregate queries.
- A formal approach to finding explanations for database queries.
- Explaining Query Answers with Explanation-Ready Databases
Training Probabilistic Databases
Classically, PDBs assume that you come to them with data already annotated with probabilities. This is, however, not always the case. Papers in this section explore how probabilistic databases can 'learn' the underlying distribution over time.
A popular model for probabilistic databases is called the Tuple-Independent model (creating TI-PDBs for short). Tuple-independent probabilistic databases annotate each input tuple with a Bernoulli-distributed random variable. That is, we assume that each row of the input data is effectively present according to a random coin-flip. In a Beta-PDB, this is instead a Beta-Bernoulli distribution. It's still a coin flip, but the bias of the coin comes from training data given by two parameters (typically called a, b). Naively, these parameters represent samples: You flip a coin a+b times, and it comes up with a heads, that corresponds to a beta-distribution with parameters a, b. Propagating this training data through queries turns out to be surprisingly harder, which is the subject of this paper.
A Graph-ish database with missing values that prioritizes triples for cleaning based on the anticipated number of added results the triple could produce.