Content provided by O'Reilly Media. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by O'Reilly Media or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://podcastplayer.com/legal.

Labeling, transforming, and structuring training data sets for machine learning

40:51

In this episode of the Data Show, I speak with Alex Ratner, project lead for Stanford’s Snorkel open source project; Ratner also recently garnered a faculty position at the University of Washington and is currently working on a company supporting and extending the Snorkel project. Snorkel is a framework for building and managing training data. Based on our survey from earlier this year, labeled data remains a key bottleneck for organizations building machine learning applications and services.

Ratner was a guest on the podcast a little over two years ago, when Snorkel was a relatively new project. Since then, Snorkel has added more features, expanded into computer vision use cases, and now counts Google, Intel, IBM, and other organizations among its users. Along with his thesis advisor, Professor Chris Ré of Stanford, Ratner and his collaborators have long championed the importance of building tools aimed squarely at helping teams build and manage training data. With today’s release of Snorkel version 0.9, we are a step closer to having a framework that enables the programmatic creation of training data sets.

Snorkel pipeline for data labeling. Source: Alex Ratner, used with permission.
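
To make "programmatic creation of training data" concrete, here is a minimal sketch of the idea in plain Python. It does not use Snorkel's API: the function names, the toy spam/ham task, and the heuristics are all hypothetical, and the simple majority vote is a stand-in for Snorkel's learned label model, which estimates how much to trust each labeling function.

```python
from collections import Counter

# Hypothetical label values for a toy spam-detection task.
ABSTAIN, HAM, SPAM = -1, 0, 1


# Each labeling function encodes one noisy heuristic and may abstain.
def lf_contains_link(text):
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_message(text):
    return HAM if len(text.split()) <= 4 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]


def majority_label(text):
    """Combine labeling-function votes by majority; abstain on no votes or a tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # the heuristics disagree evenly
    return ranked[0][0]


def build_training_set(texts):
    """Keep only the examples that received a non-abstain label."""
    labeled = [(t, majority_label(t)) for t in texts]
    return [(t, y) for t, y in labeled if y != ABSTAIN]
```

Calling `build_training_set` on a list of unlabeled messages returns `(text, label)` pairs that could be fed to any downstream classifier; messages on which the heuristics are silent or evenly split are left out.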

We had a great conversation spanning many topics, including:

  • Why he and his collaborators decided to focus on “data programming” and tools for building and managing training data.
  • A tour through Snorkel, including its target users and key components.
  • What’s in the newly released version (v0.9) of Snorkel.
  • Common use cases for the project, whose user base has grown considerably since we last spoke.
  • Data lineage, AutoML, and end-to-end automation of machine learning pipelines.
  • HoloClean and other projects focused on data quality and data programming.
  • The need for tools that can ease the transition from raw data to derived data (e.g., entities), insights, and even knowledge.

