Workshop

The Data Behind AI: A Hands-On Workshop on Dataset Documentation

Join Data Nutrition Project co-founder Kasia Chmielinski for an interactive session on the role of dataset documentation in promoting more trustworthy and transparent AI systems. Explore common documentation formats—including the Dataset Nutrition Label—and collaborate with other workshop participants on creating your own.

Key Takeaways

Overview

The workshop emphasized that data documentation is fundamental to any use of data, especially for training AI systems. Participants, who represented a diverse range of fields including academia, law, tech, and data science, explored the critical importance of understanding and documenting data sources. The discussions highlighted that while data is essential, its poor or incorrect use, often stemming from inadequate documentation, can lead to problematic and biased systems. As one participant noted about the Enron dataset, there were "lots of dead links" and limited documentation, making it challenging to assess its quality or original context. This led to a discussion on the need for vigilance and critical inquiry when working with data.

1. Data Fuels Today's Technology

The workshop underscored that AI's effectiveness is directly tied to the data it's trained on. Participants discussed how the Enron dataset, originally compiled for a legal investigation, was repurposed for many other uses, including training the prototype of Gmail's "smart compose" feature. This illustrates how a single dataset can have widespread, unintended influence on technology. As one participant noted, the biggest thing that surprised them was "how broadly it was distributed and used for so many purposes when the original use was for a legal investigation." This shift in context is a key takeaway.

Discussion Question: Think about a technology you use daily. Where might the data it was trained on have come from, and what might be the original context of that data?

2. Beware of Poor Data Use

Incorrect or unrepresentative data can lead to serious issues, including the infringement of privacy rights and the creation of systems that perpetuate harmful biases. A participant shared an example of a leader who averaged student course ratings to rank faculty, an action that could lead to unfair evaluations. Another participant noted that the natural language patterns of "shady business" from the Enron emails might be "deep in our models and training sets now so what have we internalized without awareness of how these words were used?" This highlights how poor data use can embed negative patterns and biases.

Discussion Question: Can you recall an instance where poor data use led to a problematic outcome? What could have been done to prevent it?
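The course-rating example above hinges on a simple statistical trap: a mean can hide the shape of the underlying distribution. The sketch below uses hypothetical ratings (the instructor names and numbers are invented for illustration) to show how two very different rating patterns can produce identical averages.

```python
from statistics import mean

# Hypothetical course ratings on a 1-5 scale for two instructors.
# Both sets average to exactly 3.0, but they tell different stories.
instructor_a = [3, 3, 3, 3, 3]   # uniformly lukewarm
instructor_b = [1, 1, 3, 5, 5]   # sharply polarized

print(mean(instructor_a))  # 3.0
print(mean(instructor_b))  # 3.0

# A ranking built on the mean alone treats these as identical,
# hiding that instructor_b received as many 1s as 5s.
share_low_b = sum(r <= 2 for r in instructor_b) / len(instructor_b)
print(share_low_b)  # 0.4
```

Any single summary statistic used for ranking invites this kind of distortion, which is one reason documentation about how a number was computed matters as much as the number itself.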

3. Transparency Through Documentation

Documentation is a powerful tool for mitigating issues related to data quality and use. The workshop highlighted the stark contrast in documentation between the Enron and Common Crawl datasets. In the workshop, the latter was praised for its "much better metadata," including file size and clear organizational ownership. This transparency allows users to make more informed decisions about how they use the data. The documentation also revealed that the Common Crawl's crawler code is proprietary, which is a key piece of information for users to consider.

Discussion Question: How can clear documentation help you better understand a dataset's limitations and potential biases?

4. Always Look for Trustworthy Sources

Participants agreed that assessing the trustworthiness of a dataset is crucial. Factors like the reputation of the reporting agency or individual, the authority of the source (e.g., academic or peer-reviewed sites), and the history of past trusted projects (like those from Kaggle or CMU) were identified as key indicators. As one participant put it, "I wouldn’t trust any of the links shared. But we trusted you." Another participant pointed out that "a peer-reviewed source would be best" for assessing quality.

Discussion Question: What criteria do you use to determine if a data source is trustworthy?

5. Ask Critical Questions

Before using a dataset, it's essential to ask critical questions about its origins, contents, and purpose. The workshop introduced frameworks like Datasheets for Datasets and Dataset Nutrition Labels as guides for this process. The discussion around the Enron and Common Crawl datasets raised questions about consent, data lineage, and potential PII leakage. Sheer scale was also flagged as a problem in its own right; as one participant noted, "The size of the dataset is an issue: researchers do not know what is in their own data."

Discussion Question: What questions would you ask a data creator before using their dataset for a new project?
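The critical questions above can be made concrete as a checklist. The sketch below is a minimal, illustrative stub; the field names are hypothetical, loosely inspired by the kinds of prompts in Datasheets for Datasets, and are not an official schema from either framework.

```python
# Illustrative required fields for a datasheet-style record.
# These names are hypothetical, not an official schema.
REQUIRED_FIELDS = [
    "motivation",          # why, and by whom, was the dataset created?
    "composition",         # what do the instances represent?
    "collection_process",  # how was the data gathered?
    "consent",             # did subjects consent to collection and reuse?
    "pii",                 # does it contain personally identifiable info?
    "license",             # who owns it, and under what terms?
]

def missing_documentation(datasheet: dict) -> list:
    """Return the required fields a datasheet leaves unanswered."""
    return [f for f in REQUIRED_FIELDS if not datasheet.get(f)]

# A sparsely documented dataset, in the spirit of the Enron discussion:
enron_like = {"composition": "corporate emails", "license": None}
print(missing_documentation(enron_like))
# ['motivation', 'collection_process', 'consent', 'pii', 'license']
```

Even a lightweight check like this makes gaps visible before a dataset is reused, which is the core argument the workshop made for documentation.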

6. Be Vigilant! It's Still the Wild West of Data

The current data landscape is largely unregulated, making vigilance essential. Participants noted that for datasets like Common Crawl, there's no clear process for "appeals or rescinding content from older crawls," which can lead to ongoing privacy concerns. As one person pointed out about the Enron dataset, "The fact that it was pulled together by FERC...[meant] consent seems to be automatic," highlighting the power dynamics inherent in data collection by governmental bodies.

Discussion Question: What personal actions can you take to promote ethical data practices and demand better documentation?

7. Good Documentation Is CLeAR

The workshop concluded by introducing the CLeAR framework: good documentation should be Comparable, Legible, Actionable, and Robust. These principles help practitioners create documentation that makes data easier to understand and use responsibly, mitigating risks and promoting responsible innovation. The framework was shared via a link from the Shorenstein Center on Media, Politics and Public Policy, underscoring its academic and practical relevance.

Discussion Question: How does the CLeAR framework apply to a dataset you've used recently? Which aspects of its documentation are strong, and which could be improved?

Discussion Questions

  • How can we overcome resistance to routinely assessing data quality, especially when it requires significant effort?
  • How can we design systems to be more transparent about the data they're trained on?
  • What are the ethical implications of using large-scale, web-scraped datasets without explicit consent?
  • What is the responsibility of a data scientist or AI practitioner when they discover a dataset has a problematic history or contains biases?

Sign up for our newsletter to hear about our future events!
