## Summary

Probabilistic databases revolve around so-called possible worlds semantics. An uncertain database is actually a set of possible databases, also termed possible worlds. To run a query on the uncertain database, you run the query (in principle) in each of the possible worlds, and you get a set of possible answers. Define a probability distribution over the possible worlds, and you get a distribution over the possible answers; This pair of an uncertain database and a probability distribution is what is commonly referred to as a probabilistic database.

Uncertain data typically appears in one of three forms: Row-level uncertainty, where a tuple may or may not be in a given table, Attribute-level uncertainty, where the exact value of an attribute is not known, and Open-world uncertainty, where the set of tuples in a given result table is not known ahead of time. Note the distinction between row-level and open-world uncertainty: With the former, you can describe precisely, and in a finite way which tuples could be in the table, while with the latter you can not.

## Benchmarks

## Surveys

Morgan & Claypool: Probabilistic Databases A solid, theory-focused survey of techniques for probabilistic databases. A good starting point for anyone working in the space.

## Formal Systems for Incomplete Information

#### C-Tables

The C-Tables data representation is a way to compactly encode row- and attribute-level uncertainty in a classical deterministic database. The idea is to allow the use of labeled nulls, and to tag each row in an uncertain relation with a so-called 'local condition' The local condition is a boolean formula who's atoms are comparisons over the labeled nulls. Every possible assignment of values to the local nulls defines a possible world; Rows are only part of possible worlds who's valuation causes the row's local condition to evaluate to true.

- On representing incomplete information in a relational data base (no longer available on the internet?)
- Incomplete Information in Relational Databases (JACM article)

#### Three-Valued Logic

Somewhat related to uncertainty is a concept called three-valued logic, which extends classical boolean logic's True and False with a third "Unknown" value. This is the type of logic used in SQL when NULL values appear in queries, and it has the potential to mess all sorts of things up.

## Probabilistic Database Systems

#### MCDB (UFL/Rice/IBM)

MCDB, or the Monte-Carlo Data Base introduced to probabilistic databases the idea of describing a probability distribution by using a function that can compute a sample from the distribution. VG-Functions are table-generating functions that output a random sample from the table's possible worlds. MCDB processes queries by (1) generating a set of sampled possible worlds, (2) factorizing the possible worlds into a more compact representation, and (3) running the query over each of those possible worlds (conceptually) in parallel. Conveniently, the factorized representation also admits a more efficient query evaluation strategy. The main advantage of this approach is that it's simple and expressive: If you can generate samples of an uncertain table, you can use MCDB over it. In contrast to many of the other systems here, MCDB can support open-world uncertainty. Sampling upfront, however, can limit the accuracy of the query results, particularly if you have an extremely selective filtering predicate over the data.

#### MayBMS (Cornell)

The central idea behind MayBMS is a practical implementation of Probabilistic C-Tables called U-Relations (in fact, it's not uncommon to discuss U-Relations, calling them C-Tables). The idea is to avoid labeled nulls (which most databases do not support) and instead focus entirely on row-level uncertainty. As it turns out, if you're considering only finite, discrete (i.e., categorical) distributions, row-level uncertainty can encode attribute level uncertainty as well. By further limiting condition columns to conjunctions of boolean equalities (which is still sufficient to capture a significant class of queries), MayBMS can use a classical deterministic database engine to evaluate probabilistic queries.

- MayBMS: Managing incomplete information with probabilistic world-set decompositions
- 10
^{106}worlds and beyond: efficient representation and processing of incomplete information - MayBMS: a probabilistic database management system
- A Compositional Query Algebra for Second-Order Logic and Uncertain Databases
- On Query Algebras for Probabilistic Databases
- On APIs for Probabilistic Databases
- From Complete to Incomplete Information and Back
- DEMO: Query language support for incomplete information in the MayBMS system
- BOOK CHAPTER: MayBMS: A system for managing large uncertain and probabilistic databases
- MANUAL: MayBMS: A Probabilistic Database System.

#### Sprout (Oxford)

Sprout grew out of the MayBMS project, and has functionally replaced it. It uses a similar data model, but work on sprout has focused largely on optimizing query evaluation. In particular, much of the work involves clever tricks for making it easier/faster to compute reliable approximations of boolean formulas -- the main bottleneck in most probabilistic database computations.

- Approximate Confidence Computation in Probabilistic Databases
- SPROUT: Lazy vs. Eager Query Plans for Tuple-Independent Probabilistic Databases
- A dichotomy for non-repeating queries with negation in probabilistic databases
- Anytime Approximation in Probabilistic Databases
- Aggregates in Probabilistic Databases via Knowledge Compilation
- Ranking in Probabilistic Databases: Complexity and Efficient Algorithms
- Using OBDDs for Efficient Query Evaluation on Probabilistic Databases

#### MystiQ (UWash)

Some of the original dichotomy and complexity results in probabilistic databases came out of the MystiQ project.

- MYSTIQ: a system for finding more answers by using probabilities
- TECH REPORT: Efficient query evaluation on probabilistic databases

#### Orion 2.0 (UMD)

Orion 2.0 was the first attempt to support continuous probability distributions in a probabilistic database engine. Although it limits itself to a fixed set of probability distributions, by doing so it is able to compute exact, closed-form solutions to probability mass computations in some cases.

#### PrDB (UMD)

- PrDB: Managing and Exploiting Rich Correlations in Probabilistic Databases.

- Lineage Processing Over Correlated Probabilistic Databases

#### Orchestra (UPenn)

- ORCHESTRA: facilitating collaborative data sharing
- ORCHESTRA: Rapid, Collaborative Sharing of Dynamic Data
- The ORCHESTRA Collaborative Data Sharing System
- TECH REPORT: Provenance in ORCHESTRA

#### Trio (Stanford)

- Exploiting Lineage for Confidence Computation in Uncertain and Probabilistic Databases
- Working Models for Uncertain Data
- ULDBs: databases with uncertainty and lineage
- Databases with uncertainty and lineage
- Trio-One:Layering Uncertainty and Lineage on a Conventional DBMS
- TECH REPORT: Continuous Uncertainty in Trio
- TECH REPORT: Trio: A System for Integrated Management of Data, Accuracy, and Lineage
- TECH REPORT: An Introduction to ULDBs and the Trio System

#### Pip (Cornell)

Pip was an early attempt to fully implement a database engine supporting C-Tables using user-defined types and a front-end query rewriter. Pip builds on MCDB, allowing VG-Functions to be defined alongside secondary metadata functions, consequently models are defined using what the paper refers to as a 'Grey-Box' function. Although not explicitly called out in the paper, one of the key contributions is the idea of using expressions as variable identifiers --- this makes it possible to perform extended projections over probabilistic data without needing to allocate fresh variable identifiers during query evaluation.

#### Mimir (UB)

- On-Demand Query Result Cleaning
- Detecting the Temporal Context of Queries
- Lenses: an on-demand approach to ETL

#### Jigsaw (Cornell/Microsoft)

- Jigsaw: Efficient optimization over uncertain enterprise data
- DEMO: Fuzzy prophet: parameter exploration in uncertain enterprise scenarios

## Model Database Systems

#### BayesStore (Berkeley/UFL)

Use databases to evaluate graphical models. Now part of MADLib

- BayesStore: managing large, uncertain data repositories with probabilistic graphical models
- THESIS: Extracting and Querying Probabilistic Information in BayesStore

#### MauveDB

Create views that represent data in a continuous domain.

#### SimSQL

Database for fitting and obtaining metrics over Markov Chain Simulations.

#### DeepDive

Machine Learning toolkit for information extraction

#### Lahar

System for fitting and querying markov chains.

- Challenges for Event Queries over Markovian Streams
- Event Queries on Correlated Probabilistic Streams
- Access Methods for Markovian Streams
- Lahar Demonstration: Warehousing Markovian Streams (demo)
- Lineage for Markovian Stream Event Queries

#### Velox (Berkeley)

Database with Active Learning.

#### Plato (UCSD)

Signal Processing Database

#### ModelDB

Integration for Spark/SciKit that allows users to store, query, retrieve, and reuse training parameters for models.

## Probabilistic Programming Languages

#### ENFrame

- ENFrame: A Framework for Processing Probabilistic Data
- Declarative Probabilistic Programming with Datalog

#### Fabular

Formal semantics for a language for defining hierarchical linear regressions, based on similar language extensions found in many R packages. The authors integrate the language functionality into Tabular, a DSL for programming Excel spreadsheets.

## Markov Processes

#### Lahar

- Access Methods for Markovian Streams
- Approximation trade-offs in a Markovian stream warehouse: An empirical study
- THESIS: Lahar: Warehousing Markovian Streams
- DEMO: Lahar demonstration: warehousing Markovian streams

#### SimSQL

#### MCMC using IVM

## Lineage & Why/Why Not Queries

- Sensitivity Analysis and Explanations for Robust Query Evaluation in Probabilistic Databases
- Meiliu/Gatterbaur/Suciu: "Why or why no"?
- Approximate Lineage for Probabilistic Databases

## Sensitivity

- Sensitivity Analysis and Explanations for Robust Query Evaluation in Probabilistic Databases
- Lenses: An On-Demand Approach to ETL

## Explanations

- Scorpion: Explaining away outliers in aggregate queries.
- A formal approach to finding explanations for database queries.
- Explaining Query Answers with Explanation-Ready Databases