A few nice insights from this paper. One obvious one: data cleaning is iterative. You don't just fix everything in one go. The paper raises an interesting concern that cleaning might bias results: how do you stop an analyst from cleaning until they get the results they want? One of the main goals of the paper is to drive researchers to develop principled, iterative ways to clean data, measure quality, and approach curation. There's also an informal survey (of what seem to be industrial partners) that identifies M/R and Python as the tools of choice for curation. Finally, the paper identifies a disconnect between data engineers (who implement the curation pipelines) and analysts (who actually use the curated data).
Comprehensive system for data sharing, particularly in biological settings. Emphasis on issues related to data sharing, exchange, and integration. Schema mappings, Trust mappings, Conflicting data. Continuously changing data: pushing transactions simultaneously across multiple systems.
Techniques for estimating the impact of missing records on aggregate functions when your datasets don't have full coverage (aka species estimation). The key idea is based on treating input data as being sampled with replacement: The ratio of single-occurrence samples to multiple-occurrence samples gives you a good idea of the fraction of the data that you've seen so far. They also propose a few tricks to counteract dataset collection biases (e.g., correlations between a record's sampling probability and the record's contribution to the aggregate value).
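As a rough illustration of the singleton-based intuition, the sketch below computes two classic estimators from a bag of observed records: the Good-Turing sample coverage and the bias-corrected Chao1 richness estimate. These are stand-ins chosen for brevity and are not necessarily the estimators the paper uses; the function name `coverage_estimates` and the toy sample are assumptions made for this example.

```python
from collections import Counter

def coverage_estimates(samples):
    """Estimate how much of the underlying population a sample has seen.

    Returns (coverage, richness):
      coverage: Good-Turing estimate of the probability mass already observed
      richness: bias-corrected Chao1 estimate of the number of distinct records
    """
    n = len(samples)  # total number of draws, duplicates included
    if n == 0:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    distinct = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)  # records seen exactly once
    f2 = sum(1 for c in counts.values() if c == 2)  # records seen exactly twice
    # Good-Turing: the unseen probability mass is roughly f1 / n.
    coverage = 1.0 - f1 / n
    # Bias-corrected Chao1: many singletons relative to doubletons
    # suggests many records that have not been observed yet.
    richness = distinct + f1 * (f1 - 1) / (2 * (f2 + 1))
    return coverage, richness

# Toy sample (hypothetical): "a" and "b" were drawn twice, "c" and "d" once.
coverage, richness = coverage_estimates(["a", "a", "b", "b", "c", "d"])
```

Note that the coverage value is a fraction of probability mass, not of distinct records, so frequently drawn records count for more. Correcting an aggregate such as SUM would additionally need an estimate of the values of the unseen records, which this sketch does not attempt.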
Magellan is an end-to-end system for doing entity matching. System may not be the right word, as it's more of a toolkit embedded into the Python Data Science stack (SciPy, NumPy, Pandas, etc...) and an associated "How-To" guide. A general theme throughout the toolkit is that there are multiple stages in a typical entity matching pipeline (sampling, blocking, matching, etc...), and for each stage there are a variety of different algorithms available. Magellan helps users identify the right algorithm/procedure for each stage through a few resources: (1) The How-To guide outlines the space, (2) Debugging tools help users rapidly validate and iterate over possibilities in the space, and (3) For several stages, they have developed automated training procedures that interactively gather labels from users to select the algorithm/tool best suited for the user's needs. The final challenge is metadata: Incorporating Magellan into the Python Data Science stack requires using Data Frames. Data Frames lack support for schema-level metadata (e.g., key attributes or foreign keys), so they developed an external metadata manager to track this information outside the Data Frame. They also rewrote many of the existing tools in the stack to propagate this information if available. Unfortunately, propagation isn't guaranteed, so they adopt a validate-and-warn approach if metadata-derived constraints are broken.
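The external-metadata idea can be sketched in a few lines. This is not Magellan's API: `MetadataCatalog` and its methods are hypothetical names, and a plain list of dicts stands in for a Data Frame so that the example needs only the standard library. The catalog records a declared key column per table object and re-checks it on demand, warning rather than failing when the constraint no longer holds.

```python
import warnings

class MetadataCatalog:
    """Hypothetical side table: maps a table object to its declared key column."""

    def __init__(self):
        # Keyed by id() because the table itself cannot carry the metadata.
        # Caveat: ids can be reused once a table is garbage collected.
        self._keys = {}

    def set_key(self, table, column):
        self._keys[id(table)] = column

    def get_key(self, table):
        return self._keys.get(id(table))

    def copy_metadata(self, source, target):
        """Propagate metadata to a derived table (the caller must remember to)."""
        if id(source) in self._keys:
            self._keys[id(target)] = self._keys[id(source)]

    def validate(self, table):
        """Return True if the declared key still holds; warn and return False if not."""
        column = self.get_key(table)
        if column is None:
            return True  # nothing declared, nothing to check
        values = [row.get(column) for row in table]
        ok = None not in values and len(set(values)) == len(values)
        if not ok:
            warnings.warn(f"column {column!r} is no longer a valid key", stacklevel=2)
        return ok

# Hypothetical data: the derived table accidentally duplicates a row.
catalog = MetadataCatalog()
people = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
catalog.set_key(people, "id")
derived = people + [dict(people[0])]
catalog.copy_metadata(people, derived)
```

Here `catalog.validate(people)` passes, while `catalog.validate(derived)` emits a warning and returns `False`, which is the validate-and-warn behaviour in miniature.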
The challenge is parsing sequences of strings (Logs, Lists, etc...) into tabular data. The basic approach is (1) use common separators (e.g., ',' or ' ') to tokenize each string, and then (2) Align the tokens into columns.
Step 2 is made more complicated by the fact that you can't tell upfront whether two tokens (and their conjoining separator) are actually part of the same column. The paper proposes a coherence metric based on term co-occurrence elsewhere in the corpus, bounds checking, etc... that evaluates whether the elements of a column belong together, and solves a bin-packing-style optimization problem to maximize coherence in the extracted table.
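The two steps above can be sketched as follows, under heavy simplifications. The coherence score here is only a coarse type-consistency check, standing in for the paper's richer metric, and the alignment enumerates every way of merging adjacent tokens instead of solving the paper's optimization problem. All function names and the sample lines are assumptions made for this example.

```python
import re
from itertools import combinations

def tokenize(line, seps=r"[,\s]+"):
    """Step 1: split on common separators (commas and whitespace)."""
    return [tok for tok in re.split(seps, line.strip()) if tok]

def coarse_type(value):
    """Very rough value type, used as a stand-in coherence signal."""
    if re.fullmatch(r"-?\d+(\.\d+)?", value):
        return "number"
    if re.fullmatch(r"[A-Za-z]+( [A-Za-z]+)*", value):
        return "words"
    return "other"

def segmentations(tokens, k):
    """Yield every way to merge adjacent tokens into exactly k cells."""
    n = len(tokens)
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        # Merged tokens are re-joined with a space; the original separator is lost.
        yield [" ".join(tokens[a:b]) for a, b in zip(bounds, bounds[1:])]

def align(lines, k):
    """Step 2: pick, per row, the segmentation most consistent with clean rows."""
    rows = [tokenize(line) for line in lines]
    # Rows that already split into exactly k tokens serve as the reference.
    reference = [row for row in rows if len(row) == k]
    col_types = [{coarse_type(row[i]) for row in reference} for i in range(k)]

    def score(cells):
        return sum(coarse_type(c) in col_types[i] for i, c in enumerate(cells))

    table = []
    for row in rows:
        if len(row) < k:
            table.append(row + [""] * (k - len(row)))  # too few tokens: pad
        else:
            table.append(max(segmentations(row, k), key=score))
    return table

# Hypothetical input: "New York" contains a separator inside one cell.
table = align(["Boston, 617, 4.2", "New York, 212, 8.4", "Chicago, 312, 2.7"], k=3)
```

In this toy input the second line tokenizes into four pieces, and the type check prefers merging "New" and "York" over the other two ways of reducing it to three cells. Exhaustive enumeration grows quickly with the number of tokens, which is one reason a real system would need a proper optimization formulation.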
Use knowledge-bases and crowdsourcing to clean relational data. Given a table of input data, look the column headers up in the knowledge base to figure out relationships between them. These relationships and corresponding entries already in the KB provide ground truth and a sanity check. Crowdsourcing fills in the blanks.
Sanity checking messy data happens using constraint queries. These queries are often super expensive, as they involve things like NOT EXISTS, negated implication, or UDFs. BigDansing is a distributed system for processing these types of queries efficiently on large input data.
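To make the cost concrete, here is a minimal single-machine sketch of one such constraint: a functional dependency, which in SQL would be phrased as "there do NOT EXIST two rows that agree on the left-hand side but differ on the right-hand side". This is not BigDansing's API; `fd_violations` and the sample rows are assumptions. Grouping rows by the left-hand side first avoids comparing every pair of rows in the table.

```python
from collections import defaultdict
from itertools import combinations

def fd_violations(rows, lhs, rhs):
    """Return index pairs (i, j) of rows that agree on `lhs` but differ on `rhs`."""
    # Only rows sharing an lhs value can violate the dependency,
    # so group by lhs instead of comparing all O(n^2) pairs.
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[row[lhs]].append(i)

    violations = []
    for members in groups.values():
        for i, j in combinations(members, 2):
            if rows[i][rhs] != rows[j][rhs]:
                violations.append((i, j))
    return violations

# Hypothetical data: zip should determine city, but row 1 has a typo.
rows = [
    {"zip": "14260", "city": "Buffalo"},
    {"zip": "14260", "city": "Bufalo"},
    {"zip": "10001", "city": "New York"},
]
violations = fd_violations(rows, lhs="zip", rhs="city")
```

The grouping trick only helps for equality-based constraints; constraints involving inequalities or UDFs do not partition this neatly, which is part of what makes the general problem expensive.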
QOCO assumes that the input database is messy (contains incorrect tuples and is missing correct tuples). As users query it, they can provide feedback of the form: This result should or should not be in the result set, and QOCO comes up with a resolution plan. The idea, loosely put, is to use crowdsourcing to clean the data. The number of crowd queries is minimized by computing the minimal number of edits required to insert/remove the desired result tuple.
A broad term roughly equivalent to Information Integration. Information Fusion work typically approaches the process of merging data sources from a domain-specific perspective. As a result, such approaches tend to be more accurate, but less general.
Generic model of provenance built over the semiring structure of relational algebra. The result is a pluggable provenance model that can replicate most major types of provenance over relational data (why, how, lineage, etc...) depending on which specific operators are plugged in. For example, using set-union as both addition (union, projection) and multiplication (join) gives you the set of all tuples that participate in a result. The paper also demonstrates that the resulting model is equivalent to C-Tables without labeled nulls (i.e., C-Tables without attribute-level uncertainty).
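The pluggable structure can be illustrated with a small sketch in which relations are dicts from tuples to annotations, and the query operators are parameterized by an "addition" and a "multiplication". The names `Semiring`, `join`, `project`, and `query` are assumptions made for this example, and the zero and one elements of the semiring are omitted because absent tuples are simply not stored.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]   # combines alternative derivations (union, projection)
    times: Callable[[Any, Any], Any]  # combines joint derivations (join)

# Bag semantics: annotations are multiplicities.
COUNTING = Semiring(plus=lambda a, b: a + b, times=lambda a, b: a * b)
# Lineage: set-union plays both roles, collecting every contributing tuple.
LINEAGE = Semiring(plus=lambda a, b: a | b, times=lambda a, b: a | b)

def _accumulate(out, row, annotation, k):
    out[row] = k.plus(out[row], annotation) if row in out else annotation

def join(r, s, r_col, s_col, k):
    """Equi-join on r[r_col] == s[s_col]; annotations are multiplied."""
    out = {}
    for t1, a1 in r.items():
        for t2, a2 in s.items():
            if t1[r_col] == t2[s_col]:
                _accumulate(out, t1 + t2, k.times(a1, a2), k)
    return out

def project(r, cols, k):
    """Projection; annotations of rows that collapse together are added."""
    out = {}
    for row, a in r.items():
        _accumulate(out, tuple(row[c] for c in cols), a, k)
    return out

def query(r, s, k):
    """Join R and S on R's second column, keep only S's second column."""
    return project(join(r, s, 1, 0, k), [3], k)

# Hypothetical relations R(a, b) and S(b, c), annotated two different ways.
counts = query({("a", "x"): 1, ("b", "x"): 1}, {("x", "1"): 1}, COUNTING)
lineage = query(
    {("a", "x"): frozenset({"r1"}), ("b", "x"): frozenset({"r2"})},
    {("x", "1"): frozenset({"s1"})},
    LINEAGE,
)
```

The same `query` function yields a multiplicity under `COUNTING` and the set of contributing input tuples under `LINEAGE`; only the plugged-in operators change.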
A multi-granularity model of provenance, and a visualization tool for that model. Groups operations into a hierarchy of modules and allows provenance to be viewed at different levels of aggregation. Concrete implementation on top of Pig Latin.
An overview of a crawler developed and run internally at Google to help users track and find datasets. Interesting IMO, because of the breadth of metadata it collects about the data. In addition to trivialities like timestamps and formats, the system collects: (1) Provenance, relying on logs from M/R and other bulk data processing systems to establish links between data files, (2) Schema information: a particularly nifty trick here is using protocol buffer code checked into Google's repo to identify candidates, (3) Frequent tokens, keywords, etc..., (4) Metadata from the filename (e.g., date, version, etc...), (5) Semantic information, where it can be extracted from code comments, dataset content, or other details.