Motivation for AQP and Data Exploration

Dynamic queries for visual information seeking
 Introduces dynamic (visual) queries based on principles of direct manipulation: (1) visual presentation of the query’s components; (2) visual presentation of results; (3) rapid, incremental, and reversible control of the query; (4) selection by pointing, not typing; and (5) immediate and continuous feedback.
 Users interact dramatically better with visual widgets than either natural language (too unspecific) or structured queries (too cumbersome).

Interactive data analysis: the control project
 Survey of the area, proposes a system built out of the hellerstein online agg work.

From databases to dataspaces: A new abstraction for information management
 An approach to semistructured data. Structured queries over data that may not fit any specific schema, without trying to shoehorn it. Proposes many of the challenges we're looking at.

MAD skills: new analysis practices for big data
 Adapting statistical tools for use with big data systems.

The case for interactive data exploration accelerators (IDEAs)
 Interactive data exploration requires minimal prep + online data processing. Key insights: Intermediate data must be represented probabilistically, special indexes are needed to create visualizations, data exploration often happens incrementally, so caching/updating past results is helpful.
Example Frontends for Approximate Data Exploration
Systems
Fuzzy Prophet
 Fuzzy Prophet: Parameter Exploration in Uncertain Enterprise Scenarios
 Jigsaw: efficient optimization over uncertain enterprise data
Precursor to this system, based on parameter exploration. Precompute for different parameter values. Attempt to reexploit computations already performed for previous operators.
SqlShare
UIfocused data management system. Intuitive approach to specifying joins, views, queries, etc... Append only, etc...
DBO
Run queries at each level of the plan in parallel. As you're joining, allow operators at the same level to communicate. F'rex for a NLJ, estimate a 4way NLJ by periodically sampling the 4way join results from state available inmemory at the time. Followup work respins it using a pipelined version that mirrors currents somewhat. Incremental state is cached and used for downstream computations.
"The problem with traditional database engines in this context lies with the fact that relational operations are treated as black boxes. This abstraction renders accurate statistical estimation impossible because it hides interme diate results as well as internal state from the remainder of the system. If intermediate results are not externally visible, it is impossible for the system to guess the final answer to the query because no entity has access to informa tion about every input relation." (Systems paper)
Aqua
 Join synopses for approximate query answering
 The Aqua approximate query answering system
 Congressional samples for approximate answering of groupby queries
System designed for approximate query answering based on join synopses; Challenge: Join sampling suffers from the birthday paradox; quadratic (not linear) reduction in number of join results at low sampling rates. Observation: Foreign key constraints imply a 1to1 relationship between source relation and outputs. Solution: sample from source relations, and store and use the full joins of those samples. Congressional joins extends this to use an indexstridinglike approach to groupby vars. Each paper builds a statistical analysis of the results of aggregates over such joins.
Blink
 Blink and it's done: interactive queries on very large data
 BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data Scale up data processing by sampling. Maintain (offline) not only purely random samples, but also samples biased (by grouping?) around combinations of columns. Goal is to identify rare/longtail samples and keep them available for query processing that relies on those samples. Gimmicks: Stratified (grouped) sampling with hierarchies: Don't sample from the distribution: group on an attribute(s) and build one set of samples where each group is capped to X instances. Overflow samples go into a second set, which is capped at X' instances. Overflow from that goes into a third set, etc... When processing use the first N sets, where N is determined by accuracy and time constraint bounds.
Derby
 Derby/S: a DBMS for samplebased query answering
 Sample synopses for approximate answering of groupby queries
Maintain a set of incrementally maintained "sample" views along the lines of DBO. Integrated directly into the underlying data management system. The main gimmick of the sampling strategy is that they try to capture variance in the attributes of interest. If a group (potentially the entire relation) has a high variance, we need more samples from it than if it has low variance. E.g., Compare two classes, one where all students get 4.0s and one where there's a uniform distribution from 2.5 to 4.0. The latter needs more samples to accurately capture the variance in grades.
MapReduce Online (HOP)

 Extend map reduce semantics to support pipelined operation. Easy for Map, Reduce trickier but still doable. Supports continuous and online queries. Contrast with storm?

Online Aggregation for Large MapReduce Jobs
 Online aggregation paper discusses ways to counteract batching, etc... behaviors introduced by map reduce in the estimators.

Early Accurate Results for Advanced Analytics on MapReduce
 Extend HOP with a feedback loop for repeated sampling/estimation
PFOLA / GLADE
 PFOLA: A High Performance Framework for Parallel Online Aggregation
 GLADE: big data analytics made easy (demo)
 GLADE: a scalable framework for efficient analytics
Samplingbased evaluation tied directly into the standard query evaluation process. Parallel samplingbased processor. Extensions to the definition of aggregate operators that allow estimates of arbitrary (linear) aggregates.
Eddies
Incremental, stepwise query evaluation. Each operator processes one "chunk" at a time.
Use eddies for stream processing. Tag tuples with lineage to allow parallel multiquery outoforder evaluation (MCDB parallelism)
Octopus
Besteffort utilities for data extraction and query processing. Hybrid automation under human direction query processing system.
QWiK
Alternative interfaces for interactive data exploration.
dbTouch
Data migration / Data exploration
DeVIL (Declarative Visual Interaction Language)
A language for declaratively specifying visualizations over relational data. Emphasis on using transactional processing techniques (reordering, consistency, literal view serializability) to enable faster (interactivelatency) responses to user interactions.
Incremental/Partial Processing Strategies

Online Dynamic Reordering for Interactive Data Processing
 A datastructure/operator roughly analogous to a priority queue, but buffered, outofcore, supporting changing priorities, and designed to support prioritized sampling. "Here, you'll find this interesting"

Distributed Online Aggregation
 Distributed DHTbased implementation of the Hellerstein paper. Varies number of processing unit, and persists state for use in future computations.
 Sets
 Blakeley,J.A.,Larson,P.A ̊.,Tompa,F.W.:Efficientlyupdating materialized views. In: SIGMOD, pp. 61–71 (1986)
 Buneman,P.,Clemons,E.K.:Efficientlymonitoringrelationaldatabases. ACM TODS 4(3), 368–382 (1979)
 Bags
 Chaudhuri, S., Krishnamurthy, R., Potamianos, S., Shim, K.: Op timizing queries with materialized views. In: ICDE, pp. 190–200 (1995)
 Griffin, T., Libkin, L.: Incremental maintenance of views with duplicates. In: SIGMOD, pp. 328–339 (1995)
 Aggregates
 Palpanas, T., Sidle, R., Cochrane, R., Pirahesh, H.:Incremental maintenance for nondistributive aggregate functions. In: VLDB, pp. 802–813 (2002)
 Lazy Evaluation
 Colby,L.S.,Griffin,T.,Libkin,L.,Mumick,I.S.,Trickey,H.:Algorithms for deferred view maintenance. In: SIGMOD, pp. 469–480 (1996)
 Zhou,J.,Larson,P.A ̊.,Elmongui,H.G.:Lazymaintenanceofmaterialized views. In: VLDB, pp. 231–242 (2007)
 Asynchronous
 Salem,K.,Beyer,K.S.,Cochrane,R.,Lindsay,B.G.:Howtorollajoin: Asynchronous incremental view maintenance. In: SIGMOD,pp. 129–140 (2000)
 Map/Reduce
 Largescale incremental processing with MapReduce
Sampling Strategies
 Survey Paper: Random sampling from databases: A survey
 Good overview of work on sampling via datastructures, and commutators for relational operators with the sampling operator
 Summary of F. Olken's Thesis work. (LBL tech report LBL32883 which is itself seminal)

Simple Random Sampling from Relational Databases
 Summary of how sampling interacts with standard relational database operators, and how to push the sampling operator through.

Random Sampling from B+ Trees
 Without storing extra information in a B+ Tree, pick randomly selected data values; Challenge: Sample rates for leaves below nodes with higher fanouts are going to be lower. Use A/R sampling at each node to rebalance the sampling rate.

Random Sampling from PseudoRanked B+ Trees
 Attach upperbound metadata to each node of a tree to minimize rejection rates; This also limits the rate at which the inner nodes must be updated.

Random Sampling from hash files
 Similar solution as above; Handle unknown data sizes in a block by using A/R sampling.

Pig latin: a notsoforeign language for data processing
 Section 5 describes sampling for joins, selection predicates/etc...

Relational Confidence Bounds Are Easy With The Bootstrap
 The bootstrap: Run the query N times over N different samples of the data. The sampled query results give you a range of results that you can use to get confidence bounds on the actual result. Precursor to MCDB.

The analytical bootstrap: a new method for fast error estimation in approximate query processing
 Bootstrap + Provenance Semirings + A Semiring of Databases + PIP
Estimators and Accuracy Analyses

Online aggregation
 Seminal paper. Initial paper assumes randomness from the input streams and that it's always possible to stream the query (i.e., all joins are onesided). Original work also proposes index striding to support groupby more efficiently.

Largesample and deterministic confidence intervals for online aggregation
 Detailed analysis of confidence bounds for the Hellerstein Online Aggregation work.

Samplingbased estimation of the number of distinct values of an attribute
 As titled. Strategies for using result samples to estimate COUNT DISTINCT.

Sampling Estimators for Parallel Online Aggregation
 An overview of estimators for summationbased aggregate functions

Statistical estimators for aggregate relational algebra queries
 Estimation of count (and by extension, count distinct), in the face of setprojection, setunion, and setdifference.

Robust estimation with sampling and approximate preaggregation
 Store aggregate results, and to compute pergroup aggregate values, rescale the samplebased estimate so that it conforms to the overall aggregate. Account for variances as well, using MLE.

Online estimation for subsetbased SQL queries
 Dealing with Nested Subqueries (exists, not exists, etc...); Tiered estimation: Estimate the probability of the nested subquery condition first.

Uncertainty Aware Query Execution Time Prediction
 Costbased optimization taking into account uncertainty about predicted query execution times.
Online Joins

Ripple joins for online aggregation
 Followup work for online aggregation adds support for incremental join processing, but relies on it being inmemory (though suggest distribution as a solution).

PRJoin: A NonBlocking Join Achieving Higher Early Result Rate with Statistical Guarantees
 Advanced Ripple Join, hashpartitions input data on the fly to achieve better performance

On Random Sampling over Joins
 Strategies for join sampling without resorting to A/R  Different variations on the idea of sample from one of the input relations, join with the other.

On producing join results early
 Definition of sortmergejoin, an online implementation of merge join (looks like sorting is assumed)
Creating Data Synopses

Fast incremental maintenance of approximate histograms
 Maintaining variable boundary histograms (e.g., equidepth and compressed histograms)

Approximate computation of multidimensional aggregates of sparse data using wavelets
 A compact encoding for data cubes based on wavelets.
 Flajolet/Martin Sketches
 Synopsis of countdistinct
 Count Sketch
 Synopsis of topk count.

Sketching Sampled Data Streams
 Extending sketchbased guarantees to sketches built on sampled data.
Query evaluation Over Synopses

Histogrambased techniques for approximating setvalued query answers
 Distance Metric for sets (e.g., how good is an estimate for a query that returns a set), and a "histogram relational algebra" that allows query evaluation over histograms.

Approximate query processing using wavelets
 Query processing directly on wavelets
Precomputation

Wave Computing in the Cloud
 Capture recurring queries by examining a queries executed repeatedly on later instantiations of the same dataset. Use shared scans, etc...
Distributed and Continuous Workflow Systems
 Nova: continuous Pig/Hadoop workflows
 Dryad: distributed dataparallel programs from sequential building blocks
 Storm

The Design of the Borealis Stream Processing Engine
 Store intermediate results. Allow modification, reuse, dynamic restructuring of the query graph.

Partial Results in Database Systems
 Logic for describing and reasoning about partial results from a distributed computation.
Predictive, Anticipatory Query Suggestions

Matchup: Autocompletion for mashups
 Autocompletion of structures for mashups, analogous to what we're trying to do.

VisGets: Coordinated Visualizations for Webbased Information Exploration and Discovery
 Propose a set of appropriate visualizations for the available data.

Logos: a system for translating queries into narratives
 Translate a query into a humanreadable string.