Did you ever wonder how much further AI can scale?

In this session, Nidhi Chappell (Head of Product, Specialized Azure Compute at Microsoft) and Christopher Berner (Head of Compute at OpenAI) share their perspectives on how the Microsoft-OpenAI partnership is taking significant steps to remove the barriers to scaling AI workloads.

Of specific interest is OpenAI’s new GPT-3 natural language processing model, which has 175 billion parameters.

In this episode of Machine Learning Street Talk, Tim Scarfe, Yannic Kilcher and Connor Shorten discuss their takeaways from OpenAI’s GPT-3 language model.

OpenAI trained a 175 BILLION parameter autoregressive language model. The paper demonstrates that self-supervised language modelling at this scale yields a model that can perform many downstream tasks without any fine-tuning.
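To make the “no fine-tuning” point concrete, here is a minimal sketch of few-shot prompting with an autoregressive language model: the task is specified entirely in the prompt, and the frozen model simply continues the text. GPT-3 itself is only reachable through OpenAI’s API, so GPT-2 from Hugging Face transformers stands in below purely to illustrate the prompt format (GPT-2 is far too small to do this reliably), and the translation demonstrations are illustrative.

```python
# Few-shot "in-context learning" sketch: no gradient updates, the task is
# described and demonstrated inside the prompt, and the model continues it.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in for GPT-3
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Task description, a few demonstrations, then the query to complete.
prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=5,                     # only a short continuation is needed
    do_sample=False,                      # greedy decoding for a deterministic sketch
    pad_token_id=tokenizer.eos_token_id,
)
# Print only the newly generated tokens (the model's "answer").
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:]))
```

Dropping the demonstrations gives the zero-shot setting; the paper’s few-shot results simply pack as many demonstrations into the prompt as fit in the context window.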

Paper Links:

  • Language Models are Few-Shot Learners: https://arxiv.org/abs/2005.14165
  • GitHub: https://github.com/openai/gpt-3

Content index:

  • 00:00:00 Intro
  • 00:00:54 ZeRO 1+2 (model + data parallelism) [GPT-3 DOES *NOT* USE THIS] (Connor)
  • 00:03:17 Recent history of NLP (Tim)
  • 00:06:04 Yannic “Light-speed” Kilcher’s brief overview of GPT-3
  • 00:14:25 Reviewing Yannic’s YT comments on his GPT-3 video (Tim)
  • 00:20:26 Main show intro
  • 00:23:03 Is GPT-3 reasoning?
  • 00:28:15 Architecture discussion and autoregressive (GPT*) vs denoising autoencoder (BERT) (see the sketch after this index)
  • 00:36:18 Utility of GPT-3 in industry
  • 00:43:03 Can GPT-3 do math? (reasoning/system 1/system 2)
  • 00:51:03 Generalisation
  • 00:56:48 Esoterics of language models
  • 00:58:46 Architectural trade-offs
  • 01:07:37 Memorization machines and interpretability
  • 01:17:16 Nearest neighbour probes / watermarks
  • 01:20:03 YouTube comments on GPT-3 video
  • 01:21:50 GPT-3 news article generation issue
  • 01:27:36 Sampling data for language models / bias / fairness / politics
  • 01:51:12 Outro
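As a companion to the 00:28:15 architecture discussion, here is a rough PyTorch sketch of the two pre-training setups being contrasted: a causal attention mask for autoregressive (GPT-style) models versus random token corruption for denoising autoencoders (BERT-style). The masking rate, token ids and sequence length are illustrative, and the real BERT recipe adds further details (e.g. sometimes keeping or randomising the selected tokens).

```python
import torch

def causal_attention_mask(seq_len: int) -> torch.Tensor:
    # Autoregressive (GPT-style): position i may only attend to positions <= i,
    # and the training target is simply the next token.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def bert_style_corruption(token_ids: torch.Tensor, mask_token_id: int,
                          mask_prob: float = 0.15):
    # Denoising autoencoder (BERT-style): replace a random subset of tokens
    # with [MASK]; the model sees the full bidirectional context and is
    # trained to reconstruct only the corrupted positions.
    corrupted = token_ids.clone()
    is_masked = torch.rand(token_ids.shape) < mask_prob
    corrupted[is_masked] = mask_token_id
    return corrupted, is_masked

print(causal_attention_mask(4))                       # lower-triangular mask
print(bert_style_corruption(torch.arange(10), 103))   # arbitrary ids, [MASK]=103
```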

How far can you go with ONLY language modeling?

Can a large enough language model perform NLP tasks out of the box?

OpenAI takes on these and other questions by training a transformer that is an order of magnitude larger than anything built before, and the results are astounding.

Yannic Kilcher explores.
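“An order of magnitude larger” can be made concrete with a back-of-envelope parameter count. A decoder-only transformer has roughly 12 · n_layers · d_model² weights (about 4·d² in the attention projections plus 8·d² in the feed-forward block, ignoring embeddings, biases and layer norms); plugging in GPT-3’s published configuration of 96 layers with d_model = 12288 recovers the headline figure. A minimal sketch:

```python
def approx_params(n_layers: int, d_model: int) -> int:
    # Rough decoder-only transformer count: ~4*d^2 for the attention
    # projections plus ~8*d^2 for the feed-forward block, per layer.
    return 12 * n_layers * d_model ** 2

print(f"{approx_params(96, 12288) / 1e9:.0f}B")  # ~174B: the "175B" GPT-3 model
print(f"{approx_params(48, 1600) / 1e9:.1f}B")   # ~1.5B: GPT-2 XL, for comparison
```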

Paper: https://arxiv.org/abs/2005.14165
GitHub: https://github.com/openai/gpt-3

Time index:

  • 0:00 – Intro & Overview
  • 1:20 – Language Models
  • 2:45 – Language Modeling Datasets
  • 3:20 – Model Size
  • 5:35 – Transformer Models
  • 7:25 – Fine Tuning
  • 10:15 – In-Context Learning
  • 17:15 – Start of Experimental Results
  • 19:10 – Question Answering
  • 23:10 – What I think is happening
  • 28:50 – Translation
  • 31:30 – Winograd Schemas
  • 33:00 – Commonsense Reasoning
  • 37:00 – Reading Comprehension
  • 37:30 – SuperGLUE
  • 40:40 – NLI
  • 41:40 – Arithmetic Expressions
  • 48:30 – Word Unscrambling
  • 50:30 – SAT Analogies
  • 52:10 – News Article Generation
  • 58:10 – Made-up Words
  • 1:01:10 – Training Set Contamination (see the sketch after this index)
  • 1:03:10 – Task Examples
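For the training-set contamination discussion at 1:01:10, the basic idea is to flag evaluation examples that share long n-grams with the training corpus. Below is a heavily simplified sketch of that idea; the paper’s actual procedure (13-gram matching plus additional filtering) is more involved, and the example strings here are placeholders.

```python
def ngrams(text: str, n: int = 13) -> set:
    # Whitespace-tokenised, lower-cased n-grams; a real pipeline would
    # normalise punctuation and use the model's tokeniser instead.
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(example: str, training_ngrams: set, n: int = 13) -> bool:
    # An evaluation example is flagged if any of its n-grams also occurs
    # in the training data.
    return not ngrams(example, n).isdisjoint(training_ngrams)

# Usage sketch: build the n-gram set once over the training corpus, then
# filter or separately report benchmark examples that overlap with it.
training_ngrams = ngrams("...training corpus text would go here...")
print(is_contaminated("some evaluation question text", training_ngrams))
```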