Berkeley + Columbia
A few nice insights from this paper. One obvious one: data cleaning is iterative; you don't just fix everything in one go. The paper raises an interesting concern that cleaning might bias results: how do you stop an analyst from cleaning until they get the results they want? One of the paper's main goals is to drive researchers to develop principled, iterative ways to clean data, measure quality, and approach curation. There's also an informal survey of (what seem to be industrial partners) that identifies M/R and Python as the tools of choice for curation. Finally, the paper identifies a disconnect between data engineers (who implement the curation pipelines) and analysts (who actually use the curated data).
When is My Bus
- When (ish) is My Bus? User-centered Visualizations of Uncertainty in Everyday, Mobile Predictive Systems
- The VisTrails ecosystem
- Fuzzy Prophet
- Interactive information visualization of a million items
- HV Jagadish's Usability Reading List
- Semantics of Interactive Visualization
- Software: LightTable, Stenci.la
- ORCHESTRA: Facilitating collaborative data sharing
- ORCHESTRA: Rapid, Collaborative Sharing of Dynamic Data.
- The orchestra collaborative data sharing system
- Provenance in ORCHESTRA
A comprehensive system for data sharing, particularly in biological settings, with an emphasis on issues related to data sharing, exchange, and integration: schema mappings, trust mappings, conflicting data, and continuously changing data (pushing transactions simultaneously across multiple systems).
- Cooperative update exchange in the Youtopia system
- Coordination through querying in the Youtopia system
Techniques for estimating the impact of missing records on aggregate functions when your datasets don't have full coverage (aka species estimation). The key idea treats the input data as being sampled with replacement: the fraction of samples that occur exactly once (as opposed to multiple times) gives you a good estimate of how much of the data you haven't yet seen. They also propose a few tricks to counteract dataset collection biases (e.g., correlations between a record's sampling probability and the record's contribution to the aggregate value).
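The coverage estimate alluded to above can be sketched with the classic Good-Turing estimator (a hedged illustration; the paper's actual estimators are more elaborate):

```python
from collections import Counter

def sample_coverage(samples):
    """Good-Turing estimate of sample coverage: the fraction of the
    population's probability mass observed so far.  Records seen
    exactly once (singletons) are evidence of data still unseen."""
    counts = Counter(samples)
    f1 = sum(1 for c in counts.values() if c == 1)  # singleton count
    n = len(samples)
    return 1.0 - f1 / n if n else 0.0
```

With mostly-repeated samples the coverage approaches 1; a sample full of singletons suggests large parts of the dataset are still missing.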
- Magellan: toward building entity matching management systems
- Magellan: Toward Building Entity Matching Management Systems (Tech Report)
- Corleone: hands-off crowdsourcing for entity matching
Magellan is an end-to-end system for doing entity matching. System may not be the right word, as it's more of a toolkit embedded into the Python Data Science stack (SciPy, NumPy, Pandas, etc...) and an associated "How-To" guide. A general theme throughout the toolkit is that there are multiple stages in a typical evaluation pipeline (sampling, blocking, matching, etc...), and for each stage there are a variety of different algorithms available. Magellan helps users identify the right algorithm/procedure for each stage through a few resources: (1) The How-To guide outlines the space, (2) Debugging tools help users rapidly validate and iterate over possibilities in the space, and (3) For several stages, they have developed automated training procedures that interactively gather labels from users to select the algorithm/tool best suited for the user's needs. The final challenge is metadata: Incorporating Magellan into the Python Data Science stack requires using Data Frames. Data Frames lack support for schema-level metadata (e.g., key attributes or foreign keys), so they developed an external metadata manager to track the associations. They also rewrote many of the existing tools in the stack to propagate this information if available. Unfortunately, propagation isn't guaranteed, so they adopt a validate-and-warn approach if metadata-derived constraints are broken.
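The external metadata manager idea can be illustrated with a minimal sketch (the names here are hypothetical, not Magellan's actual API; plain lists of row dicts stand in for Data Frames):

```python
class MetadataCatalog:
    """Side catalog of schema-level metadata for tables that cannot
    carry it themselves, keyed by object identity."""
    def __init__(self):
        self._meta = {}

    def set_key(self, table, key_attr):
        self._meta[id(table)] = key_attr

    def key_of(self, table):
        return self._meta.get(id(table))

    def check(self, table):
        """Validate-and-warn: return True if the recorded key
        attribute is still unique after transformations ran."""
        key = self.key_of(table)
        if key is None:
            return True  # metadata didn't survive propagation
        vals = [row[key] for row in table]
        return len(vals) == len(set(vals))
```

The validate-and-warn style matters because a transformation (say, an unkeyed concat) can silently break a key constraint the catalog still records.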
The challenge is parsing sequences of strings (logs, lists, etc...) into tabular data. The basic approach is to (1) use common separators (e.g., ',' or ' ') to tokenize each string, and then (2) align the tokens into columns.
Step 2 is made more complicated by the fact that you can't tell upfront whether two tokens (and their conjoining separator) are actually part of the same column. The paper proposes a coherence metric based on term co-occurrence elsewhere in the corpus, bounds checking, etc... that evaluates whether the elements of a column belong together, and solves a bin-packing-style optimization problem to maximize coherence in the extracted table.
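Step 1 above can be sketched as separator-preserving tokenization (a simplified illustration; the paper's separator inference is richer):

```python
import re

# Candidate separators; keeping them in the token stream lets the
# alignment step reason about which separator delimits which column.
SEPARATORS = ",;| \t"

def tokenize(line):
    # A capturing group makes re.split retain the separator runs.
    pattern = "([" + re.escape(SEPARATORS) + "]+)"
    return [t for t in re.split(pattern, line) if t]
```

The alignment step then has to decide, for each separator token, whether it is a column boundary or part of a value.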
Use knowledge-bases and crowdsourcing to clean relational data. Given a table of input data, look the column headers up in the knowledge base to figure out relationships between them. These relationships and corresponding entries already in the KB provide ground truth and a sanity check. Crowdsourcing fills in the blanks.
Sanity checking messy data happens using constraint queries. These queries are often super expensive, as they involve things like NOT EXISTS, negated implication, or UDFs. BigDansing is a distributed system for processing these types of queries efficiently on large input data.
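As a toy example of the kind of constraint involved, here is a functional-dependency (denial-constraint) check in Python. Expressed naively it is a quadratic self-join, which is the cost a system like BigDansing is built to distribute (this sketch is illustrative, not BigDansing's API):

```python
def fd_violations(rows, lhs, rhs):
    """Find rows violating the FD lhs -> rhs, i.e. rows that agree
    with an earlier row on `lhs` but disagree on `rhs`.  Hash-based
    grouping replaces the naive quadratic self-join."""
    first_seen = {}
    violations = []
    for row in rows:
        k, v = row[lhs], row[rhs]
        if k in first_seen and first_seen[k] != v:
            violations.append(row)
        else:
            first_seen.setdefault(k, v)
    return violations
```

The same grouping trick is what makes such constraints amenable to distribution: rows can be shuffled by the `lhs` value and checked per-partition.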
QOCO assumes that the input database is messy (contains incorrect tuples and is missing correct tuples). As users query it, they can provide feedback of the form: This result should or should not be in the result set, and QOCO comes up with a resolution plan. The idea, loosely put, is to use crowdsourcing to clean the data. The number of crowd queries is minimized by computing the minimal number of edits required to insert/remove the desired result tuple.
A broad term roughly equivalent to Information Integration. Information Fusion work typically approaches the process of merging data sources from a domain-specific perspective. As a result, such approaches tend to be more accurate but less general.
- A Methodology to Evaluate Important Dimensions of Information Quality in Systems
- Query Time Data Integration (Thesis)
- Human Performance and Data Fusion Based Decision Aids
- Exploiting Lineage for Confidence Computation in Uncertain and Probabilistic Databases
- Working Models for Uncertain Data
- ULDBs: databases with uncertainty and lineage
- Databases with uncertainty and lineage
- Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS
- TECH REPORT: Continuous Uncertainty in Trio
- TECH REPORT: Trio: A System for Integrated Management of Data, Accuracy, and Lineage
- TECH REPORT: An Introduction to ULDBs and the Trio System
Generic model of provenance built over the semiring structure of relational algebra. The result is a pluggable provenance model that can replicate most major types of provenance over relational data (why-provenance, lineage, etc...) depending on which specific operators are plugged in. For example, using set-union as both the addition (union) and multiplication (join) operations gives you the set of all tuples that participate in a result. The paper also demonstrates that the resulting model is equivalent to C-Tables without labeled nulls (i.e., C-Tables without attribute-level uncertainty).
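The pluggable-semiring idea can be sketched as a join whose annotations are combined by user-supplied operations (hypothetical names, not the paper's notation):

```python
def semiring_join(R, S, key, add, mul):
    """R and S are lists of (row_dict, annotation) pairs.  Matching
    rows combine annotations with `mul` (join); duplicate output
    rows merge annotations with `add` (union/projection)."""
    out = {}
    for r, ra in R:
        for s, sa in S:
            if r[key] == s[key]:
                row = frozenset({**r, **s}.items())
                ann = mul(ra, sa)
                out[row] = add(out[row], ann) if row in out else ann
    return out

# Lineage-style instantiation: set union for both addition and
# multiplication yields, per result, the set of contributing tuples.
R = [({"k": 1, "a": "x"}, frozenset({"r1"}))]
S = [({"k": 1, "b": "y"}, frozenset({"s1"}))]
result = semiring_join(R, S, "k", frozenset.union, frozenset.union)
```

Swapping in polynomial addition/multiplication instead of set union would give how-provenance rather than lineage, with no change to the join itself.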
- A Generic Provenance Middleware for Database Queries, Updates, and Transactions
- Formal Foundations of Reenactment and Transaction Provenance
- Heuristic and Cost-based Optimization for Provenance Computation
- Perm: Processing Provenance and Data on the same Data Model through Query Rewriting
- Provenance for Nested Subqueries
- TRAMP: Understanding the Behavior of Schema Mappings through Provenance
- Using SQL for Efficient Generation and Querying of Provenance Information
A multi-granularity model of provenance, and a visualization tool for that model. Groups operations into a hierarchy of modules and allows provenance to be viewed at different levels of aggregation. Concrete implementation on top of Pig Latin.
- Meilieu/Gatterbauer/Suciu: Sensitivity
- Kanagal/Deshpande: Influence
An overview of a crawler developed and run internally at Google to help users track and find datasets. Interesting IMO, because of the breadth of metadata it collects about the data. In addition to trivialities like timestamps and formats, the system collects: (1) Provenance, relying on logs from M/R and other bulk data processing systems to establish links between data files, (2) Schema information --- a particularly nifty trick here is using protocol buffer code checked into Google's repo to identify candidates, (3) Frequent tokens, keywords, etc..., (4) Metadata from the filename (e.g., date, version, etc...), (5) Semantic information, where it can be extracted from code comments, dataset content, or other details.
Versioned Data Management
GIT for Sci Data
Cleaning/Modeling/Extraction/Integration (to be categorized)
- Jermaine: MCDB/SimSQL
- Wang/Hellerstein: BayesDB
- Deshpande/Madden: MauveDB
- Crankshaw: Velox
- Papakonstantinou: Plato
- Duggan: Hephaestus
Semistructured Data Management
- Idreos: Cracking, ... / Adaptive data management
- Ailamaki/Idreos: NoDB/VirtualDB
- Widom: Dataguides
- Wolfram Language
- Arnab's work -- GestureDB etc...