Introduces dynamic (visual) queries based on principles of direct manipulation: (1) visual presentation of the query’s components; (2) visual presentation of results; (3) rapid, incremental, and reversible control of the query; (4) selection by pointing, not typing; and (5) immediate and continuous feedback.
Users interact dramatically better with visual widgets than with either natural language (too imprecise) or structured queries (too cumbersome).
Interactive data exploration requires minimal preparation plus online data processing. Key insights: intermediate data must be represented probabilistically; special indexes are needed to create visualizations; and because exploration often happens incrementally, caching and updating past results is helpful.
Example Frontends for Approximate Data Exploration
Run queries at each level of the plan in parallel. While joining, allow operators at the same level to communicate. For example, for a nested-loop join, estimate a 4-way NLJ by periodically sampling the 4-way join results from the state available in memory at the time. Follow-up work re-spins this as a pipelined version that somewhat mirrors currents. Incremental state is cached and used for downstream computations.
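The sampling-estimate idea can be sketched for a toy two-way case (a hypothetical `ripple_join_estimate` in the spirit of ripple joins, not the paper's implementation): alternately draw one tuple from each input, join each new tuple against everything seen so far, and scale the running match count up to the full cross-product.

```python
import random

def ripple_join_estimate(R, S, steps, seed=0):
    """Toy square ripple join over two inputs: after k steps we have seen a
    k-by-k corner of the |R| x |S| cross product; scale the match count found
    there up by (|R|*|S|)/(k*k) to estimate the full join cardinality."""
    rng = random.Random(seed)
    R, S = list(R), list(S)
    rng.shuffle(R)
    rng.shuffle(S)  # stand-in for randomly ordered input streams
    seen_r, seen_s, matches = [], [], 0
    for i in range(min(steps, len(R), len(S))):
        r, s = R[i], S[i]
        matches += sum(1 for x in seen_s if x == r)  # new R tuple vs. seen S
        matches += sum(1 for x in seen_r if x == s)  # new S tuple vs. seen R
        matches += (r == s)                          # the new pair itself
        seen_r.append(r)
        seen_s.append(s)
    k = len(seen_r)
    return matches * (len(R) * len(S)) / (k * k)
```

When `steps` covers both inputs entirely, every cross-product pair is counted exactly once, so the estimate becomes exact; at small `steps` it is an unbiased estimate that sharpens as the sampled corner grows.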
"The problem with traditional database engines in this context lies with the fact that relational operations are treated as black boxes. This abstraction renders accurate statistical estimation impossible because it hides interme- diate results as well as internal state from the remainder of the system. If intermediate results are not externally visible, it is impossible for the system to guess the final answer to the query because no entity has access to informa- tion about every input relation." (Systems paper)
System designed for approximate query answering based on join synopses. Challenge: join sampling suffers from the birthday paradox: a quadratic (not linear) reduction in the number of join results at low sampling rates. Observation: foreign-key constraints imply a 1-to-1 relationship between source-relation tuples and join outputs. Solution: sample from the source relations, then store and use the full joins of those samples. Congressional samples extend this with an index-striding-like approach to group-by attributes. Each paper builds a statistical analysis of the results of aggregates over such joins.
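The sample-then-join trick can be sketched for a single foreign-key join (names like `build_join_synopsis` are illustrative, not AQUA's API): because each fact tuple matches exactly one dimension tuple, joining a uniform sample of the fact table with the full dimension table yields a uniform sample of the join.

```python
import random

def build_join_synopsis(fact, dim, fk, rate, seed=0):
    """Sample the source (fact) relation uniformly at `rate`, then join each
    sampled tuple with its unique foreign-key match in the dimension table.
    The 1-to-1 FK relationship makes the result a uniform sample of the full
    join, avoiding the quadratic loss of sampling both inputs."""
    rng = random.Random(seed)
    dim_index = {d["id"]: d for d in dim}  # FK value -> dimension tuple
    sample = [f for f in fact if rng.random() < rate]
    return [{**f, **dim_index[f[fk]]} for f in sample]

def estimate_sum(synopsis, attr, fact_size):
    """Scale a SUM over the synopsis up by the inverse sampling fraction."""
    if not synopsis:
        return 0.0
    return sum(t[attr] for t in synopsis) * fact_size / len(synopsis)
```

Sampling both inputs at rate p would retain only about p^2 of the join results; sampling only the fact side retains a fraction p, which is the point of the observation above.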
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data
Scale up data processing by sampling. Maintain (offline) not only purely random samples, but also samples biased around combinations of columns. The goal is to identify rare, long-tail groups and keep them available for query processing that relies on those samples. Gimmick: stratified (grouped) sampling with tiers. Rather than sampling the raw distribution, group on one or more attributes and build one set of samples where each group is capped at X instances. Overflow samples go into a second set, capped at X' instances; overflow from that goes into a third set, and so on. At query time, use the first N sets, where N is determined by the query's accuracy and time-constraint bounds.
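The tiered, capped stratification could look roughly like this (a sketch with a single cap for every tier; BlinkDB's actual caps, column-set selection, and storage differ):

```python
from collections import defaultdict

def tiered_stratified_samples(rows, key, cap):
    """Group rows by `key`; each group keeps at most `cap` rows in tier 0,
    its overflow spills into tier 1 (again capped per group), then tier 2,
    and so on. A query then scans tiers 0..N, with N chosen by its
    error/latency budget: tier 0 alone guarantees every group is represented
    while bounding the largest group's footprint."""
    tiers = []
    counts = defaultdict(int)
    for row in rows:
        g = row[key]
        tier = counts[g] // cap  # which tier this group's next row lands in
        while len(tiers) <= tier:
            tiers.append([])
        tiers[tier].append(row)
        counts[g] += 1
    return tiers
```

Because every group is capped in tier 0, a rare group's few rows survive even when a dominant group would otherwise crowd them out of a uniform sample.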
Maintain a set of incrementally maintained "sample" views along the lines of DBO, integrated directly into the underlying data management system. The main gimmick of the sampling strategy is that it tries to capture the variance of the attributes of interest: if a group (potentially the entire relation) has high variance, we need more samples from it than from a low-variance group. E.g., compare two classes, one where all students get 4.0s and one where grades are distributed uniformly from 2.5 to 4.0; the latter needs more samples to accurately capture the variance in grades.
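Variance-driven allocation along these lines can be sketched with Neyman allocation, a standard stratified-sampling formula (the paper's exact strategy may differ): give each group a share of the sample budget proportional to its size times its standard deviation.

```python
import statistics

def neyman_allocation(groups, total_budget):
    """Split `total_budget` samples across groups proportionally to
    group_size * group_stddev (Neyman allocation). High-variance groups get
    more samples; a constant group needs essentially none."""
    weights = {g: len(v) * statistics.pstdev(v) for g, v in groups.items()}
    z = sum(weights.values())
    if z == 0:
        # Every group is constant: one sample per group pins down its value.
        return {g: 1 for g in groups}
    return {g: round(total_budget * w / z) for g, w in weights.items()}
```

On the grades example above, the all-4.0 class has zero standard deviation and receives no budget, while the uniformly spread class absorbs all of it.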
Sampling-based evaluation tied directly into the standard query evaluation process. Parallel sampling-based processor. Extensions to the definition of aggregate operators that allow estimates of arbitrary (linear) aggregates.
A language for declaratively specifying visualizations over relational data. Emphasis on applying transaction-processing techniques (reordering, consistency, view serializability) to enable faster, interactive-latency responses to user interactions.
A data structure/operator roughly analogous to a priority queue, but buffered, out-of-core, supporting changing priorities, and designed to support prioritized sampling: "Here, you'll find this interesting."
Pick randomly selected data values from a B+ Tree without storing extra information in it. Challenge: under a naive random root-to-leaf descent, leaves below nodes with higher fan-out are sampled at lower rates. Use acceptance/rejection (A/R) sampling at each node to rebalance the sampling rate.
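A minimal sketch of the A/R random descent, modeling the tree as nested lists rather than a real B+ Tree: a walk through nodes with fan-outs f1, f2, ... reaches its leaf with probability prod(1/fi), so accepting the walk with probability prod(fi)/F^depth (F = maximum fan-out) makes every accepted leaf equally likely.

```python
import random

def ar_sample_leaf(root, max_fanout, rng):
    """One descent attempt: pick a uniformly random child at each internal
    node, accumulating the acceptance probability prod(fanout/max_fanout).
    Returns the leaf if accepted, else None (caller retries)."""
    node, accept_p = root, 1.0
    while isinstance(node, list):
        f = len(node)
        accept_p *= f / max_fanout
        node = node[rng.randrange(f)]
    return node if rng.random() < accept_p else None

def uniform_leaf_sample(root, max_fanout, n, seed=0):
    """Draw n uniform leaf values by retrying rejected descents."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        leaf = ar_sample_leaf(root, max_fanout, rng)
        if leaf is not None:
            out.append(leaf)
    return out
```

Rejections cost extra descents, but each accepted leaf lands with identical probability 1/F^depth regardless of the fan-outs along its path.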
The bootstrap: run the query N times over N resamples of the data, each drawn with replacement from the original sample. The spread of the sampled query results gives you confidence bounds on the actual result. Precursor to MCDB.
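A percentile-bootstrap sketch for an aggregate query (the helper name `bootstrap_ci` is illustrative): recompute the aggregate on each resample and take empirical quantiles of the resulting distribution as the confidence interval.

```python
import random
import statistics

def bootstrap_ci(sample, stat, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: evaluate `stat` on n_boot resamples, each drawn
    with replacement from `sample`, then read the (alpha/2, 1-alpha/2)
    quantiles of those results off as a (1-alpha) confidence interval."""
    rng = random.Random(seed)
    n = len(sample)
    stats = sorted(
        stat([sample[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

The appeal for approximate query processing is that the same recipe works for aggregates with no closed-form error formula, at the cost of re-running the query n_boot times.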
Seminal paper. The initial work assumes randomness in the input streams and that the query can always be streamed (i.e., all joins are one-sided). The original work also proposes index striding to support GROUP BY more efficiently.
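The running-estimate idea can be sketched for AVG with a CLT-based confidence interval (a sketch assuming tuples arrive in random order; the paper's estimators are more general). Welford's incremental mean/variance keeps each update O(1):

```python
import math

def online_avg(stream, z=1.96):
    """Yield (running mean, +/- confidence half-width) after every tuple,
    instead of one answer at the end. Under random input order the running
    mean is unbiased and the CLT gives a shrinking ~95% interval (z=1.96)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # Welford update for sum of squared devs
        half = z * math.sqrt(m2) / n if n > 1 else float("inf")
        yield mean, half
```

A frontend can render the current (estimate, interval) pair continuously, letting the user stop as soon as the interval is tight enough, which is exactly the interaction model the paper argues for.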