A few nice insights from this paper. One obvious one: data cleaning is iterative; you don't fix everything in one go. The paper raises an interesting concern that cleaning might bias results: how do you stop an analyst from cleaning until the data produce the results they want? One of the paper's main goals is to push researchers toward principled, iterative ways to clean data, measure quality, and approach curation. There's also an informal survey (seemingly of industrial partners) that identifies M/R and Python as the tools of choice for curation. Finally, the paper identifies a disconnect between data engineers (who implement the curation pipelines) and analysts (who actually use the curated data).
A comprehensive system for data sharing, particularly in biological settings, with an emphasis on issues related to data sharing, exchange, and integration: schema mappings, trust mappings, conflicting data, and continuously changing data (pushing transactions simultaneously across multiple systems).
Techniques for estimating the impact of missing records on aggregate functions when your datasets don't have full coverage (a.k.a. species estimation). The key idea is to treat the input data as sampled with replacement: the ratio of single-occurrence samples to multiple-occurrence samples gives you a good estimate of the fraction of the data you've seen so far. They also propose a few tricks to counteract dataset-collection biases (e.g., correlations between a record's sampling probability and its contribution to the aggregate value).
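The singleton-ratio idea above can be sketched in a few lines. This is my framing using the classic Good-Turing-style estimate (coverage ≈ 1 - f1/n, where f1 is the number of single-occurrence samples), not necessarily the paper's exact estimator; the function name is illustrative.

```python
from collections import Counter

def coverage_estimate(samples):
    """Estimate the fraction of the underlying population's mass seen so far,
    treating `samples` as drawn with replacement. Many singletons means we are
    still seeing mostly-new records, i.e., low coverage."""
    counts = Counter(samples)
    f1 = sum(1 for c in counts.values() if c == 1)  # single-occurrence samples
    n = len(samples)
    return 1.0 - f1 / n

# Example: one singleton ('c') out of five draws -> estimated 80% coverage.
print(coverage_estimate(['a', 'a', 'b', 'b', 'c']))
```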
Magellan is an end-to-end system for entity matching. System may not be the right word, as it's more of a toolkit embedded into the Python Data Science stack (SciPy, NumPy, Pandas, etc...) plus an associated "How-To" guide. A general theme throughout the toolkit is that a typical entity-matching pipeline has multiple stages (sampling, blocking, matching, etc...), and for each stage a variety of algorithms is available. Magellan helps users identify the right algorithm/procedure for each stage through a few resources: (1) the How-To guide outlines the space, (2) debugging tools help users rapidly validate and iterate over possibilities in the space, and (3) for several stages, they have developed automated training procedures that interactively gather labels from users to select the algorithm/tool best suited to the user's needs. The final challenge is metadata: incorporating Magellan into the Python Data Science stack requires using Data Frames, which lack support for schema-level metadata (e.g., key attributes or foreign keys), so they developed an external metadata manager to track these associations. They also rewrote many of the existing tools in the stack to propagate this information when available. Unfortunately, propagation isn't guaranteed, so they adopt a validate-and-warn approach when metadata-derived constraints are broken.
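A rough sketch of the external-metadata idea: a store outside the DataFrame tracks a key-attribute annotation, and a validate-and-warn check flags the constraint if a downstream operation broke it. All names here are hypothetical illustrations, not Magellan's actual API.

```python
import warnings
import pandas as pd

# Hypothetical external metadata store, keyed by DataFrame identity,
# since DataFrames themselves carry no schema-level metadata.
_METADATA = {}

def set_key(df, col):
    """Record that `col` is supposed to be a key attribute of `df`."""
    _METADATA[id(df)] = {'key': col}

def validate_key(df):
    """Validate-and-warn: return False (with a warning) if the recorded
    key attribute no longer holds, e.g., after a lossy transformation."""
    key = _METADATA.get(id(df), {}).get('key')
    if key is None:
        return True  # no metadata recorded; nothing to check
    if df[key].duplicated().any() or df[key].isna().any():
        warnings.warn(f"Column '{key}' is no longer a valid key")
        return False
    return True

df = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})
set_key(df, 'id')
validate_key(df)  # True: 'id' is still unique and non-null
```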
A broad term roughly equivalent to Information Integration. Information Fusion work typically approaches the process of merging data sources from a domain-specific perspective. As a result, such approaches tend to be more accurate but less general.
A generic model of provenance built over the semiring structure of relational algebra. The result is a pluggable provenance model that can replicate most major types of provenance over relational data (why-provenance, lineage, etc...) depending on which specific operators are plugged in. For example, using set-union as both the addition and multiplication operators gives you the set of all tuples that participate in a result. The paper also demonstrates that the resulting model is equivalent to C-Tables without labeled nulls (i.e., C-Tables without attribute-level uncertainty).
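A minimal sketch of the pluggable idea, under simplifying assumptions of my own: each tuple is a (key, annotation) pair, join combines annotations with the multiplication operator, and duplicate results are merged with the addition operator. Plugging frozenset-union in for both operations yields the set of all contributing tuples; the join-on-first-attribute convention and all names are illustrative.

```python
def join(r, s, plus, times):
    """Natural join on the key, with provenance annotations combined by the
    plugged-in semiring operators: `times` for joining, `plus` for merging
    duplicate output tuples."""
    out = {}
    for (key_r, ann_r) in r:
        for (key_s, ann_s) in s:
            if key_r == key_s:
                combined = times(ann_r, ann_s)
                out[key_r] = plus(out[key_r], combined) if key_r in out else combined
    return set(out.items())

# Plug in set-union for both operations: each output tuple is annotated
# with the set of input tuples that participate in it.
union = lambda a, b: a | b
r = {(1, frozenset({'r1'})), (2, frozenset({'r2'}))}
s = {(1, frozenset({'s1'}))}
result = join(r, s, plus=union, times=union)
# result annotates key 1 with {'r1', 's1'}; key 2 has no match and is dropped
```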
A multi-granularity model of provenance, plus a visualization tool for that model. It groups operations into a hierarchy of modules and allows provenance to be viewed at different levels of aggregation. A concrete implementation is provided on top of Pig Latin.
An overview of a crawler developed and run internally at Google to help users track and find datasets. Interesting, IMO, because of the breadth of metadata it collects about the data. In addition to trivialities like timestamps and formats, the system collects: (1) provenance, relying on logs from M/R and other bulk data-processing systems to establish links between data files; (2) schema information --- a particularly nifty trick here is using protocol buffer code checked into Google's repo to identify candidate schemas; (3) frequent tokens, keywords, etc...; (4) metadata from the filename (e.g., date, version, etc...); and (5) semantic information, where it can be extracted from code comments, dataset content, or other details.