Exploring LAION-5B: a dataset of 5 billion image-text pairs

LAION-5B is a free, open dataset of more than 5 billion image-text pairs. Today’s video is an interview with three of its creators. We dive into the mechanics and challenges of operating at such a large scale, how to keep costs low, what new possibilities open datasets like this enable, and how best to handle safety and legal concerns.

OUTLINE:

  • 0:00 – Intro
  • 1:30 – Start of Interview
  • 2:30 – What is LAION?
  • 11:10 – What are the effects of CLIP filtering?
  • 16:40 – How big is this dataset?
  • 19:05 – Does the text always come from the alt-property?
  • 22:45 – What does it take to work at scale?
  • 25:50 – When will we replicate DALL-E?
  • 31:30 – The surprisingly efficient pipeline
  • 35:20 – How do you cover the S3 costs?
  • 40:30 – Addressing safety & legal concerns
  • 55:15 – Where can people get started?

Frank

#DataScientist, #DataEngineer, Blogger, Vlogger, Podcaster at http://DataDriven.tv . Back @Microsoft to help customers leverage #AI. Opinions mine. #武當派 fan. I blog to help you become a better data scientist/ML engineer. Opinions are mine. All mine.