Probabilistic databases revolve around so-called possible worlds semantics. An uncertain database is actually a set of possible databases, also termed possible worlds. To run a query on the uncertain database, you run the query (in principle) in each of the possible worlds, and you get a set of possible answers. Define a probability distribution over the possible worlds, and you get a distribution over the possible answers; This pair of an uncertain database and a probability distribution is what is commonly referred to as a probabilistic database.
Uncertain data typically appears in one of three forms: Row-level uncertainty, where a tuple may or may not be in a given table, Attribute-level uncertainty, where the exact value of an attribute is not known, and Open-world uncertainty, where the set of tuples in a given result table is not known ahead of time. Note the distinction between row-level and open-world uncertainty: With the former, you can describe precisely, and in a finite way which tuples could be in the table, while with the latter you can not.
This section outlines foundational work on probabilistic and uncertain databases in theory.
The C-Tables data representation is a way to compactly encode row- and attribute-level uncertainty in a classical deterministic database. The idea is to allow the use of labeled nulls, and to tag each row in an uncertain relation with a so-called 'local condition' The local condition is a boolean formula who's atoms are comparisons over the labeled nulls. Every possible assignment of values to the local nulls defines a possible world; Rows are only part of possible worlds who's valuation causes the row's local condition to evaluate to true.
Somewhat related to uncertainty is a concept called three-valued logic, which extends classical boolean logic's True and False with a third "Unknown" value. This is the type of logic used in SQL when NULL values appear in queries, and it has the potential to mess all sorts of things up.
In general, query processing in a probabilistic database is NP-Hard. This means that for any meaningfully sized query, good luck getting precise results before the heat death of the universe. However, there are a number of cases where you can take shortcuts. For example, if you can prove that query results are independent (P(X | Y) = P(X), P(Y | X) = P(Y)) or disjoint (P(X|Y) = 0, P(Y|X) = 0), you can compute the probability of a disjunction significantly faster. There's been a bunch of theoretical work on trying to distinguish between queries for which such tricks exist, and queries where they don't. These are typically called 'dichotomy results'.
A few UPenn folks sat down to formally define a general model for provenance and came up with this beauty. There was already a well known link between database computations and semirings (Union & Aggregation are +, Joins are *). The authors here noted that many existing provenance models fit nicely slotted into the semiring model as well. The key idea is that you can express provenance as a polynomial expression (e.g., a + b*c) and then depending on what you slot in for the +, *, and base type, you get a different provenance model. This lets you build one system that supports all of those models. The crucial insight (with respect to Probabilistic Databases at least), is that the semiring model is equivalent to the condition columns of C-Tables. (NB: the paper incorrectly claims that they subsume all of C-Tables, although they don't capture the behavior of a C-Table's labeled nulls)
Aggregation in a probabilistic database is hard. The short reason is that you could potentially end up with a different aggregate value in each and every possible world. Thus, exact algorithms for aggregation must at least scale with the number of possible aggregate values, which could be exponential in the input size in the worst case. In practice though, that's not always the case. COUNT(*) for example is quite friendly to exact computation, since the number of distinct aggregate outputs is bounded by the number of input rows. SUM() can also be friendly on integers or low-precision floats (or more generally when your aggregate inputs are binned), as noted by the paper below out of INRIA.
This section outlines complete end-end systems for probabilistic query processing. A number of these are purely academic systems, and not all have been released.
MCDB, or the Monte-Carlo Data Base introduced to probabilistic databases the idea of describing a probability distribution by using a function that can compute a sample from the distribution. VG-Functions are table-generating functions that output a random sample from the table's possible worlds. MCDB processes queries by (1) generating a set of sampled possible worlds, (2) factorizing the possible worlds into a more compact representation, and (3) running the query over each of those possible worlds (conceptually) in parallel. Conveniently, the factorized representation also admits a more efficient query evaluation strategy. The main advantage of this approach is that it's simple and expressive: If you can generate samples of an uncertain table, you can use MCDB over it. In contrast to many of the other systems here, MCDB can support open-world uncertainty. Sampling upfront, however, can limit the accuracy of the query results, particularly if you have an extremely selective filtering predicate over the data.
The central idea behind MayBMS is a practical implementation of Probabilistic C-Tables called U-Relations (in fact, it's not uncommon to discuss U-Relations, calling them C-Tables). The idea is to avoid labeled nulls (which most databases do not support) and instead focus entirely on row-level uncertainty. As it turns out, if you're considering only finite, discrete (i.e., categorical) distributions, row-level uncertainty can encode attribute level uncertainty as well. By further limiting condition columns to conjunctions of boolean equalities (which is still sufficient to capture a significant class of queries), MayBMS can use a classical deterministic database engine to evaluate probabilistic queries.
Sprout grew out of the MayBMS project, and has functionally replaced it. It uses a similar data model, but work on sprout has focused largely on optimizing query evaluation. In particular, much of the work involves clever tricks for making it easier/faster to compute reliable approximations of boolean formulas -- the main bottleneck in most probabilistic database computations.
Orion 2.0 was the first attempt to support continuous probability distributions in a probabilistic database engine. Although it limits itself to a fixed set of probability distributions, by doing so it is able to compute exact, closed-form solutions to probability mass computations in some cases.
PrDB: Managing and Exploiting Rich Correlations in Probabilistic Databases.
Lineage Processing Over Correlated Probabilistic Databases
Though not technically a probabilistic database, Orchestra shares many of the same characteristics. Orchestra was designed as a data-sharing system for biologists, and as a result most of the uncertainty that they deal with comes from data exchange. Under the hood, they share a lot of ideas with C-Tables-based systems like Pip and MayBMS. They also leverage a lot of this infrastructure for provenance computations.
Trio was pretty much the first major system for provenance management, and was also one of the first who's authors recognized the connection with uncertainty (like provenance semi-rings above). Unfortunately, Trio restricted itself to lineage as a provenance model (Why provenance), and as a result was only able to handle comparatively limited classes of uncertainty and queries.
Pip was an early attempt to fully implement a database engine supporting C-Tables using user-defined types and a front-end query rewriter. Pip builds on MCDB, allowing VG-Functions to be defined alongside secondary metadata functions, consequently models are defined using what the paper refers to as a 'Grey-Box' function. Although not explicitly called out in the paper, one of the key contributions is the idea of using expressions as variable identifiers – this makes it possible to perform extended projections over probabilistic data without needing to allocate fresh variable identifiers during query evaluation.
Mimir is the first effort to virtualize uncertainty-based computations. In comparison to other systems which define an on-disk representation of uncertainty, Mimir borrows the VG-Terms of Pip and defines uncertain relations by the query used to construct them. Then, depending on the specific requirements of the user, Mimir selects one of several intermediate representations best suited for the workload. In addition to exploring virtualized query processing, the Mimir group has also explored techniques for to presenting uncertain query results to users.
Jigsaw is a variant of MCDB. The underlying system implementation basically follows the same pattern. The goal here is to address the fact that VG-Functions are black-boxes and identify scenarios where two different parameterizations are going to create similar results. The trick is to generate a bunch of samples from the function and use that as a fingerprint. When approximating query results in a first-pass Jigsaw avoids repeated invocation of a VG Function by calling only once per fingerprint.
Virtually all probabilistic database systems adopt a data model based on tuples. A number of efforts have come up looking at how to use similar techniques to directly query data defined by a graphical model and/or how to represent graphical models in a database.
Use databases to evaluate graphical models. Now part of MADLib
This section outlines a few languages specially aimed at allowing programmers to write turing-complete programs (or at least imperative-looking programs) where data/variables are defined as probability distributions.
Formal semantics for a language for defining hierarchical linear regressions, based on similar language extensions found in many R packages. The authors integrate the language functionality into Tabular, a DSL for programming Excel spreadsheets.
A number of efforts have taken a step back from probabilistic query processing and come up with some clever things to do when all you have is a weaker description of uncertainty in your data. One particular class of efforts explored in this section is incomplete results --- cases where you can precisely characterize what fragment of your input data is missing (e.g., when one partition goes down in a distributed system, the other partition generally know exactly shards of data were on that partition)
Partial Results in Database Systems develops a classification system for completeness. As in most work on PQP, they adopt notions of value- and row-level uncertainty, which they model by indicating that a particular attribute/table is complete or incomplete (and/or providing statistical measures over the resulting attributes). They then show how completeness propagates through a query.
m-tables: Representing Missing Data is the theoretical answer to Partial Results in Database Systems. The idea is to model tuples that could be missing along with a hypothetical distribution over their multiplicities.
Indexing Metric Uncertain Data for Range Queries,
in which the authors propose an indexing scheme for spatial data with an uncertain position. One of several papers that adopts a similar key trick: Index one point in the probability distribution. Then, when you query, ask the user to provide you with a confidence bound/threshold and use the threshold to define a sufficiently large radius around the area you're querying to ensure that you hit any point that could be possibly in the result with at least that threshold.
Supporting Data Uncertainty in Array Databases,
in which the authors address positional uncertainty in an array database, a premise quite similar to spatial uncertainty but on a discrete domain. The key challenge the paper addresses is the tradeoff between replication (casting attribute-level uncertainty as tuple-level uncertainty as in MayBMS) and the overhead of relaxing query constraints on indexes to detect non-replicated records. Interestingly, more replication increases query cost due to redundant tuple copies getting returned.
Lineage & Why/Why Not Queries
Sensitivity Analysis and Explanations for Robust Query Evaluation in Probabilistic Databases
Classically, PDBs assume that you come to them with data already annotated with probabilities. This is, however, not always the case. Papers in this section explore how probabilistic databases can 'learn' the underlying distribution over time.
A popular model for probabilistic databases is called the Tuple-Independent model (creating TI-PDBs for short). Tuple-independent probabilistic databases annotate each input tuple with a Bernoulli-distributed random variable. That is, we assume that each row of the input data is effectively present according to a random coin-flip. In a Beta-PDB, this is instead a Beta-Bernoulli distribution. It's still a coin flip, but the bias of the coin comes from training data given by two parameters (typically called a, b). Naively, these parameters represent samples: You flip a coin a+b times, and it comes up with a heads, that corresponds to a beta-distribution with parameters a, b. Propagating this training data through queries turns out to be surprisingly harder, which is the subject of this paper.