Probabilistic databases revolve around so-called possible worlds semantics. An uncertain database is really a set of possible databases, also termed possible worlds. To run a query on the uncertain database, you run the query (in principle) in each of the possible worlds, and you get a set of possible answers. Define a probability distribution over the possible worlds, and you get a distribution over the possible answers. This pair of an uncertain database and a probability distribution is what is commonly referred to as a probabilistic database.
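The semantics can be sketched in a few lines. Here is a toy example (the relation and probabilities are purely illustrative): enumerate the worlds, run a deterministic query in each, and accumulate probability mass on identical answers.

```python
from fractions import Fraction

# A toy probabilistic database: each possible world is a set of tuples,
# paired with its probability. The data here is made up for illustration.
worlds = [
    ({("alice", 23), ("bob", 31)}, Fraction(1, 2)),
    ({("alice", 23)},              Fraction(1, 4)),
    ({("bob", 31)},                Fraction(1, 4)),
]

def query(db):
    """A deterministic query: names of people older than 25."""
    return frozenset(name for name, age in db if age > 25)

# Run the query in every possible world; identical answers accumulate
# probability, yielding a distribution over possible answers.
answers = {}
for db, p in worlds:
    ans = query(db)
    answers[ans] = answers.get(ans, Fraction(0)) + p

# answers == {frozenset({'bob'}): 3/4, frozenset(): 1/4}
```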
Uncertain data typically appears in one of three forms: row-level uncertainty, where a tuple may or may not be in a given table; attribute-level uncertainty, where the exact value of an attribute is not known; and open-world uncertainty, where the set of tuples in a given result table is not known ahead of time. Note the distinction between row-level and open-world uncertainty: with the former, you can describe precisely and finitely which tuples could be in the table, while with the latter you cannot.
The C-Tables data representation is a way to compactly encode row- and attribute-level uncertainty in a classical deterministic database. The idea is to allow the use of labeled nulls, and to tag each row in an uncertain relation with a so-called 'local condition'. The local condition is a boolean formula whose atoms are comparisons over the labeled nulls. Every possible assignment of values to the labeled nulls defines a possible world; a row is part of exactly those possible worlds whose valuation causes the row's local condition to evaluate to true.
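A minimal sketch may help: the relation, variables, and domain below are all hypothetical, but the mechanics (labeled nulls in value positions, a local condition per row, one possible world per valuation) follow the C-Tables idea.

```python
from itertools import product

# A toy C-Table over one relation R(name, city). 'x' and 'y' are labeled
# nulls; each row carries a local condition over them.
# Row format: (values, local_condition), where a value may be a labeled
# null (here, a string naming the variable) and local_condition maps a
# valuation (a dict) to a bool.
ctable = [
    (("alice", "x"), lambda v: True),             # alice is always present
    (("bob", "NYC"), lambda v: v["x"] == "NYC"),  # bob present iff x = NYC
    (("carol", "y"), lambda v: v["y"] != "LA"),   # carol present iff y != LA
]

domain = {"x": ["NYC", "LA"], "y": ["NYC", "LA"]}

def instantiate(values, valuation):
    """Replace labeled nulls in a row's values using the valuation."""
    return tuple(valuation.get(val, val) for val in values)

# Each valuation of the labeled nulls defines one possible world.
possible_worlds = []
for choice in product(*domain.values()):
    valuation = dict(zip(domain.keys(), choice))
    world = {instantiate(vals, valuation)
             for vals, cond in ctable if cond(valuation)}
    possible_worlds.append(world)
```

Note how a single table of three conditioned rows stands in for four distinct possible worlds; this compactness is the whole point of the representation.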
Somewhat related to uncertainty is a concept called three-valued logic, which extends classical boolean logic's True and False with a third "Unknown" value. This is the logic SQL uses when NULL values appear in queries, and it has the potential to mess all sorts of things up.
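The truth tables are those of Kleene's strong three-valued logic, which can be written down directly (representing Unknown as Python's None):

```python
# Kleene's strong three-valued logic, the semantics behind SQL's NULL
# comparisons. Unknown is represented as None.
def and3(a, b):
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def or3(a, b):
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

def not3(a):
    return None if a is None else not a

unknown = None                        # e.g., the result of comparing with NULL
assert and3(unknown, False) is False  # False "wins" over Unknown
assert or3(unknown, True) is True     # True "wins" over Unknown
assert not3(unknown) is None          # negating Unknown stays Unknown
```

The classic surprise: a SQL WHERE clause keeps a row only when its predicate is True, so both `x = NULL` and `NOT (x = NULL)` filter the row out.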
MCDB, or the Monte-Carlo Data Base, introduced to probabilistic databases the idea of describing a probability distribution by a function that can compute samples from it. VG-Functions are table-generating functions that output a random sample from the table's possible worlds. MCDB processes queries by (1) generating a set of sampled possible worlds, (2) factorizing the possible worlds into a more compact representation, and (3) running the query over each of those possible worlds (conceptually) in parallel. Conveniently, the factorized representation also admits a more efficient query evaluation strategy. The main advantage of this approach is that it's simple and expressive: if you can generate samples of an uncertain table, you can use MCDB over it. In contrast to many of the other systems discussed here, MCDB can support open-world uncertainty. Sampling up front, however, can limit the accuracy of the query results, particularly if an extremely selective filtering predicate is applied to the data.
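The sample-then-query loop can be sketched as follows. The table, distributions, and function names below are illustrative assumptions, not MCDB's actual API:

```python
import random

# A toy VG-Function in the spirit of MCDB: it generates one sampled
# instance of an uncertain table. Schema and distributions are made up.
def vg_customer_spend(rng):
    """Sample one possible world of the 'spend' table: each customer's
    spend is drawn from a (here, Gaussian) distribution."""
    return [("alice", rng.gauss(100, 15)),
            ("bob", rng.gauss(60, 10))]

def query(table):
    """A deterministic aggregate: total spend above 80."""
    return sum(spend for _, spend in table if spend > 80)

# Monte Carlo evaluation: run the query in n sampled worlds and average.
rng = random.Random(42)
n = 10_000
estimate = sum(query(vg_customer_spend(rng)) for _ in range(n)) / n
```

The weakness mentioned above shows up here directly: if the predicate were `spend > 200`, almost no sampled world would satisfy it, and the estimate would rest on a handful of samples.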
The central idea behind MayBMS is a practical implementation of probabilistic C-Tables called U-Relations (in fact, it's not uncommon to see U-Relations discussed under the name C-Tables). The idea is to avoid labeled nulls (which most databases do not support) and instead focus entirely on row-level uncertainty. As it turns out, if you consider only finite, discrete (i.e., categorical) distributions, row-level uncertainty can encode attribute-level uncertainty as well. By further limiting condition columns to conjunctions of boolean equalities (which is still sufficient to capture a significant class of queries), MayBMS can use a classical deterministic database engine to evaluate probabilistic queries.
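The encoding trick is easy to see in a small sketch (the relation and probabilities are invented for illustration): an uncertain attribute becomes several alternative rows, each tagged with a condition column of variable-equals-value atoms.

```python
# A toy U-Relation in the spirit of MayBMS: attribute-level uncertainty is
# flattened into alternative rows, each tagged with a condition column of
# (variable = value) equalities.
# Suppose bob's city is uncertain: NYC with prob 0.7, LA with prob 0.3.
# Introduce a discrete variable x with P(x=1)=0.7, P(x=2)=0.3 and store
# one row per alternative:
u_relation = [
    # (tuple,           condition: list of (var, value) equalities)
    (("alice", "NYC"), []),              # certain row: empty condition
    (("bob", "NYC"),   [("x", 1)]),
    (("bob", "LA"),    [("x", 2)]),
]
prob = {("x", 1): 0.7, ("x", 2): 0.3}

def world(assignment):
    """The possible world selected by an assignment of variables."""
    return {t for t, cond in u_relation
            if all(assignment.get(var) == val for var, val in cond)}
```

Because conditions are just extra columns holding conjunctions of equalities, joins and selections can manipulate them with ordinary relational operators, which is what lets a stock database engine do the heavy lifting.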
Sprout grew out of the MayBMS project, and has functionally replaced it. It uses a similar data model, but work on Sprout has focused largely on optimizing query evaluation. In particular, much of the work involves clever tricks for making it easier/faster to compute reliable approximations of the probabilities of boolean formulas -- the main bottleneck in most probabilistic database computations.
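To see why this is the bottleneck: each result row's condition is a boolean formula over independent random variables, and computing its exact probability is #P-hard in general. A naive Monte Carlo estimator makes the problem concrete (this is a generic sketch, not Sprout's actual algorithm, which uses much smarter decompositions and bounds):

```python
import random

# Estimating P(formula) for a DNF formula over independent boolean
# variables by naive Monte Carlo sampling. Variables and probabilities
# here are illustrative.
var_prob = {"a": 0.5, "b": 0.3, "c": 0.8}   # P(var = True)
dnf = [["a", "b"], ["c"]]                    # (a AND b) OR c

def sample_and_eval(rng):
    assignment = {v: rng.random() < p for v, p in var_prob.items()}
    return any(all(assignment[v] for v in clause) for clause in dnf)

rng = random.Random(7)
n = 100_000
estimate = sum(sample_and_eval(rng) for _ in range(n)) / n
# Exact value, by independence: 0.15 + 0.8 - 0.15 * 0.8 = 0.83
```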
Orion 2.0 was the first attempt to support continuous probability distributions in a probabilistic database engine. Although it limits itself to a fixed set of probability distributions, by doing so it is able to compute exact, closed-form solutions to probability mass computations in some cases.
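The payoff of restricting to known distribution classes is that selection probabilities reduce to CDF evaluations. A hedged sketch in the spirit of that approach (the attribute and parameters are invented; this is not Orion's code):

```python
import math

# Closed-form probability computation for a continuous uncertain
# attribute. If an attribute's value is Gaussian(mu, sigma), a selection
# predicate like `value < c` has an exact probability via the Gaussian
# CDF -- no sampling required.
def gaussian_cdf(c, mu, sigma):
    return 0.5 * (1.0 + math.erf((c - mu) / (sigma * math.sqrt(2.0))))

# P(temperature < 25) for temperature ~ Gaussian(20, 5):
p = gaussian_cdf(25.0, mu=20.0, sigma=5.0)
# p == Phi(1.0), approximately 0.8413
```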
Pip was an early attempt to fully implement a database engine supporting C-Tables, using user-defined types and a front-end query rewriter. Pip builds on MCDB, allowing VG-Functions to be defined alongside secondary metadata functions; the resulting models are what the paper refers to as 'Grey-Box' functions. Although not explicitly called out in the paper, one of the key contributions is the idea of using expressions as variable identifiers --- this makes it possible to perform extended projections over probabilistic data without needing to allocate fresh variable identifiers during query evaluation.
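The expressions-as-identifiers idea can be illustrated with a small sketch (this is my own illustration, not Pip's implementation): uncertain values are expression trees over base variables, and an extended projection like `SELECT x + y` simply builds a new expression, whose identity *is* the expression, instead of minting a fresh variable.

```python
# Illustrative sketch of expressions-as-identifiers: projecting over
# uncertain values composes expression trees; no fresh variable
# identifiers are allocated during query evaluation.
class Var:
    def __init__(self, name):
        self.name = name
    def __add__(self, other):
        return Add(self, other)
    def eval(self, valuation):
        return valuation[self.name]
    def __repr__(self):
        return self.name

class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right
    def __add__(self, other):
        return Add(self, other)
    def eval(self, valuation):
        return self.left.eval(valuation) + self.right.eval(valuation)
    def __repr__(self):
        return f"({self.left} + {self.right})"

x, y = Var("x"), Var("y")
projected = x + y            # SELECT x + y: the result's identifier
                             # *is* the expression itself
```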
Formal semantics for a language for defining hierarchical linear regressions, based on similar language extensions found in many R packages. The authors integrate the language functionality into Tabular, a DSL for programming Excel spreadsheets.