Motivation for AQP and Data Exploration
Dynamic queries for visual information seeking
- Introduces dynamic (visual) queries based on principles of direct manipulation: (1) visual presentation of the query’s components; (2) visual presentation of results; (3) rapid, incremental, and reversible control of the query; (4) selection by pointing, not typing; and (5) immediate and continuous feedback.
- Users interact dramatically better with visual widgets than either natural language (too unspecific) or structured queries (too cumbersome).
Interactive data analysis: the control project
- Survey of the area, proposes a system built out of the hellerstein online agg work.
From databases to dataspaces: A new abstraction for information management
- An approach to semistructured data. Structured queries over data that may not fit any specific schema, without trying to shoehorn it. Proposes many of the challenges we're looking at.
MAD skills: new analysis practices for big data
- Adapting statistical tools for use with big data systems.
The case for interactive data exploration accelerators (IDEAs)
- Interactive data exploration requires minimal prep + online data processing. Key insights: Intermediate data must be represented probabilistically, special indexes are needed to create visualizations, data exploration often happens incrementally, so caching/updating past results is helpful.
Example Frontends for Approximate Data Exploration
- Fuzzy Prophet: Parameter Exploration in Uncertain Enterprise Scenarios
- Jigsaw: efficient optimization over uncertain enterprise data
Precursor to this system, based on parameter exploration. Precompute for different parameter values. Attempt to re-exploit computations already performed for previous operators.
UI-focused data management system. Intuitive approach to specifying joins, views, queries, etc... Append only, etc...
Run queries at each level of the plan in parallel. As you're joining, allow operators at the same level to communicate. F'rex for a NLJ, estimate a 4-way NLJ by periodically sampling the 4-way join results from state available in-memory at the time. Followup work re-spins it using a pipelined version that mirrors currents somewhat. Incremental state is cached and used for downstream computations.
"The problem with traditional database engines in this context lies with the fact that relational operations are treated as black boxes. This abstraction renders accurate statistical estimation impossible because it hides interme- diate results as well as internal state from the remainder of the system. If intermediate results are not externally visible, it is impossible for the system to guess the final answer to the query because no entity has access to informa- tion about every input relation." (Systems paper)
- Join synopses for approximate query answering
- The Aqua approximate query answering system
- Congressional samples for approximate answering of group-by queries
System designed for approximate query answering based on join synopses; Challenge: Join sampling suffers from the birthday paradox; quadratic (not linear) reduction in number of join results at low sampling rates. Observation: Foreign key constraints imply a 1-to-1 relationship between source relation and outputs. Solution: sample from source relations, and store and use the full joins of those samples. Congressional joins extends this to use an index-striding-like approach to group-by vars. Each paper builds a statistical analysis of the results of aggregates over such joins.
- Blink and it's done: interactive queries on very large data
- BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data Scale up data processing by sampling. Maintain (offline) not only purely random samples, but also samples biased (by grouping?) around combinations of columns. Goal is to identify rare/long-tail samples and keep them available for query processing that relies on those samples. Gimmicks: Stratified (grouped) sampling with hierarchies: Don't sample from the distribution: group on an attribute(s) and build one set of samples where each group is capped to X instances. Overflow samples go into a second set, which is capped at X' instances. Overflow from that goes into a third set, etc... When processing use the first N sets, where N is determined by accuracy and time constraint bounds.
- Derby/S: a DBMS for sample-based query answering
- Sample synopses for approximate answering of group-by queries
Maintain a set of incrementally maintained "sample" views along the lines of DBO. Integrated directly into the underlying data management system. The main gimmick of the sampling strategy is that they try to capture variance in the attributes of interest. If a group (potentially the entire relation) has a high variance, we need more samples from it than if it has low variance. E.g., Compare two classes, one where all students get 4.0s and one where there's a uniform distribution from 2.5 to 4.0. The latter needs more samples to accurately capture the variance in grades.
MapReduce Online (HOP)
- Extend map reduce semantics to support pipelined operation. Easy for Map, Reduce trickier but still doable. Supports continuous and online queries. Contrast with storm?
- Online aggregation paper discusses ways to counteract batching, etc... behaviors introduced by map reduce in the estimators.
- Extend HOP with a feedback loop for repeated sampling/estimation
PF-OLA / GLADE
- PF-OLA: A High Performance Framework for Parallel Online Aggregation
- GLADE: big data analytics made easy (demo)
- GLADE: a scalable framework for efficient analytics
Sampling-based evaluation tied directly into the standard query evaluation process. Parallel sampling-based processor. Extensions to the definition of aggregate operators that allow estimates of arbitrary (linear) aggregates.
Incremental, stepwise query evaluation. Each operator processes one "chunk" at a time.
Use eddies for stream processing. Tag tuples with lineage to allow parallel multi-query out-of-order evaluation (MCDB parallelism)
Best-effort utilities for data extraction and query processing. Hybrid automation under human direction query processing system.
Alternative interfaces for interactive data exploration.
Data migration / Data exploration
DeVIL (Declarative Visual Interaction Language)
A language for declaratively specifying visualizations over relational data. Emphasis on using transactional processing techniques (reordering, consistency, literal view serializability) to enable faster (interactive-latency) responses to user interactions.
Incremental/Partial Processing Strategies
Online Dynamic Reordering for Interactive Data Processing
- A datastructure/operator roughly analogous to a priority queue, but buffered, out-of-core, supporting changing priorities, and designed to support prioritized sampling. "Here, you'll find this interesting"
Distributed Online Aggregation
- Distributed DHT-based implementation of the Hellerstein paper. Varies number of processing unit, and persists state for use in future computations.
- Blakeley,J.A.,Larson,P.A ̊.,Tompa,F.W.:Efficientlyupdating materialized views. In: SIGMOD, pp. 61–71 (1986)
- Buneman,P.,Clemons,E.K.:Efficientlymonitoringrelationalda-tabases. ACM TODS 4(3), 368–382 (1979)
- Chaudhuri, S., Krishnamurthy, R., Potamianos, S., Shim, K.: Op- timizing queries with materialized views. In: ICDE, pp. 190–200 (1995)
- Griffin, T., Libkin, L.: Incremental maintenance of views with du-plicates. In: SIGMOD, pp. 328–339 (1995)
- Palpanas, T., Sidle, R., Cochrane, R., Pirahesh, H.:Incremental maintenance for non-distributive aggregate functions. In: VLDB, pp. 802–813 (2002)
- Lazy Evaluation
- Colby,L.S.,Griffin,T.,Libkin,L.,Mumick,I.S.,Trickey,H.:Al-gorithms for deferred view maintenance. In: SIGMOD, pp. 469–480 (1996)
- Zhou,J.,Larson,P.A ̊.,Elmongui,H.G.:Lazymaintenanceofma-terialized views. In: VLDB, pp. 231–242 (2007)
- Salem,K.,Beyer,K.S.,Cochrane,R.,Lindsay,B.G.:Howtorollajoin: Asynchronous incremental view maintenance. In: SIGMOD,pp. 129–140 (2000)
- Large-scale incremental processing with MapReduce
- Survey Paper: Random sampling from databases: A survey
- Good overview of work on sampling via datastructures, and commutators for relational operators with the sampling operator
- Summary of F. Olken's Thesis work. (LBL tech report LBL-32883 which is itself seminal)
Simple Random Sampling from Relational Databases
- Summary of how sampling interacts with standard relational database operators, and how to push the sampling operator through.
Random Sampling from B+ Trees
- Without storing extra information in a B+ Tree, pick randomly selected data values; Challenge: Sample rates for leaves below nodes with higher fan-outs are going to be lower. Use A/R sampling at each node to rebalance the sampling rate.
Random Sampling from Pseudo-Ranked B+ Trees
- Attach upper-bound metadata to each node of a tree to minimize rejection rates; This also limits the rate at which the inner nodes must be updated.
Random Sampling from hash files
- Similar solution as above; Handle unknown data sizes in a block by using A/R sampling.
Pig latin: a not-so-foreign language for data processing
- Section 5 describes sampling for joins, selection predicates/etc...
Relational Confidence Bounds Are Easy With The Bootstrap
- The bootstrap: Run the query N times over N different samples of the data. The sampled query results give you a range of results that you can use to get confidence bounds on the actual result. Precursor to MCDB.
The analytical bootstrap: a new method for fast error estimation in approximate query processing
- Bootstrap + Provenance Semirings + A Semiring of Databases + PIP
Estimators and Accuracy Analyses
- Seminal paper. Initial paper assumes randomness from the input streams and that it's always possible to stream the query (i.e., all joins are one-sided). Original work also proposes index striding to support group-by more efficiently.
Large-sample and deterministic confidence intervals for online aggregation
- Detailed analysis of confidence bounds for the Hellerstein Online Aggregation work.
Sampling-based estimation of the number of distinct values of an attribute
- As titled. Strategies for using result samples to estimate COUNT DISTINCT.
Sampling Estimators for Parallel Online Aggregation
- An overview of estimators for summation-based aggregate functions
Statistical estimators for aggregate relational algebra queries
- Estimation of count (and by extension, count distinct), in the face of set-projection, set-union, and set-difference.
Robust estimation with sampling and approximate pre-aggregation
- Store aggregate results, and to compute per-group aggregate values, rescale the sample-based estimate so that it conforms to the overall aggregate. Account for variances as well, using MLE.
Online estimation for subset-based SQL queries
- Dealing with Nested Subqueries (exists, not exists, etc...); Tiered estimation: Estimate the probability of the nested subquery condition first.
Uncertainty Aware Query Execution Time Prediction
- Cost-based optimization taking into account uncertainty about predicted query execution times.
Ripple joins for online aggregation
- Follow-up work for online aggregation adds support for incremental join processing, but relies on it being in-memory (though suggest distribution as a solution).
PR-Join: A Non-Blocking Join Achieving Higher Early Result Rate with Statistical Guarantees
- Advanced Ripple Join, hash-partitions input data on the fly to achieve better performance
On Random Sampling over Joins
- Strategies for join sampling without resorting to A/R -- Different variations on the idea of sample from one of the input relations, join with the other.
On producing join results early
- Definition of sort-merge-join, an online implementation of merge join (looks like sorting is assumed)
Creating Data Synopses
Fast incremental maintenance of approximate histograms
- Maintaining variable boundary histograms (e.g., equi-depth and compressed histograms)
Approximate computation of multidimensional aggregates of sparse data using wavelets
- A compact encoding for data cubes based on wavelets.
- Flajolet/Martin Sketches
- Synopsis of count-distinct
- Count Sketch
- Synopsis of top-k count.
Sketching Sampled Data Streams
- Extending sketch-based guarantees to sketches built on sampled data.
Query evaluation Over Synopses
Histogram-based techniques for approximating set-valued query answers
- Distance Metric for sets (e.g., how good is an estimate for a query that returns a set), and a "histogram relational algebra" that allows query evaluation over histograms.
Approximate query processing using wavelets
- Query processing directly on wavelets
Wave Computing in the Cloud
- Capture recurring queries by examining a queries executed repeatedly on later instantiations of the same dataset. Use shared scans, etc...
Distributed and Continuous Workflow Systems
- Nova: continuous Pig/Hadoop workflows
- Dryad: distributed data-parallel programs from sequential building blocks
The Design of the Borealis Stream Processing Engine
- Store intermediate results. Allow modification, re-use, dynamic restructuring of the query graph.
Partial Results in Database Systems
- Logic for describing and reasoning about partial results from a distributed computation.
Predictive, Anticipatory Query Suggestions
Matchup: Autocompletion for mashups
- Autocompletion of structures for mashups, analogous to what we're trying to do.
VisGets: Coordinated Visualizations for Web-based Information Exploration and Discovery
- Propose a set of appropriate visualizations for the available data.
Logos: a system for translating queries into narratives
- Translate a query into a human-readable string.