Now I Really Won That AI Bet

In June 2022, I bet a commenter $100 that AI would master image compositionality by June 2025.

DALL-E2 had just come out, showcasing the potential of AI art. But it couldn’t follow complex instructions; its images only matched the “vibe” of the prompt. For example, here were some of its attempts at “a red sphere on a blue cube, with a yellow pyramid on the right, all on top of a green table”.

At the time, I wrote:

I’m not going to make the mistake of saying these problems are inherent to AI art. My guess is a slightly better language model would solve most of them…for all I know, some of the larger image models have already fixed these issues. These are the sorts of problems I expect to go away with a few months of future research.

Commenters objected that this was overly optimistic. AI was just a pattern-matching “stochastic parrot”. It would take a deep understanding of grammar to get a prompt exactly right, and that would require some entirely new paradigm beyond LLMs. For example, from Vitor:

Why are you so confident in this? The inability of systems like DALL-E to understand semantics in ways requiring an actual internal world model strikes me as the very heart of the issue. We can also see this exact failure mode in the language models themselves. They only produce good results when the human asks for something vague with lots of room for interpretation, like poetry or fanciful stories without much internal logic or continuity.

Not to toot my own horn, but two years ago you were naively saying we'd have GPT-like models scaled up several orders of magnitude (100T parameters) right about now (https://readscottalexander.com/posts/ssc-the-obligatory-gpt-3-post#comment-912798).

I'm registering my prediction that you're being equally naive now. Truly solving this issue seems AI-complete to me. I'm willing to bet on this (ideas on operationalization welcome).

So we made a bet!

All right. My proposed operationalization of this is that on June 1, 2025, if either of us can get access to the best image generating model at that time (I get to decide which), or convince someone else who has access to help us, we'll give it the following prompts:

1. A stained glass picture of a woman in a library with a raven on her shoulder with a key in its mouth

2. An oil painting of a man in a factory looking at a cat wearing a top hat

3. A digital art picture of a child riding a llama with a bell on its tail through a desert

4. A 3D render of an astronaut in space holding a fox wearing lipstick

5. Pixel art of a farmer in a cathedral holding a red basketball

We generate 10 images for each prompt, just like DALL-E2 does. If at least one of the ten images has the scene correct in every particular on 3/5 prompts, I win, otherwise you do. Loser pays winner $100, and whatever the result is I announce it on the blog (probably an open thread). If we disagree, Gwern is the judge.

Some image models of the time refused to draw humans, so we agreed that robots could stand in for humans in pictures that required them.
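The win condition boils down to a simple tally. As a minimal sketch (hypothetical code, not anything from the post; it assumes a human judge has already marked each of the ten images per prompt as fully correct or not), the scoring rule amounts to:

```python
# Hypothetical sketch of the bet's scoring rule: for each of the 5 prompts,
# 10 images are generated and judged pass/fail; Scott wins if at least one
# image is correct in every particular on at least 3 of the 5 prompts.
from typing import Dict, List


def scott_wins(judgments: Dict[str, List[bool]], prompts_needed: int = 3) -> bool:
    """judgments maps each prompt to ten booleans, one per generated image,
    where True means the image got the scene correct in every particular."""
    prompts_passed = sum(1 for images in judgments.values() if any(images))
    return prompts_passed >= prompts_needed


# Example: three prompts each have at least one fully correct image, so Scott wins.
example = {
    "stained glass / raven / key": [False] * 9 + [True],
    "oil painting / cat / top hat": [True] + [False] * 9,
    "llama / bell / desert": [False] * 10,
    "astronaut / fox / lipstick": [False, False, True] + [False] * 7,
    "farmer / cathedral / basketball": [False] * 10,
}
print(scott_wins(example))  # True
```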

In September 2022, I got some good results from Google Imagen and announced I had won the three-year bet in three months. Commenters yelled at me, saying that Imagen still hadn’t gotten them quite right and my victory declaration was premature. The argument blew up enough that Edwin Chen of Surge, an “RLHF and human LLM evaluation platform”, stepped in and asked his professional AI data labelling team to adjudicate. Their verdict was clear: the AI was bad and I was wrong. Rather than embarrass myself further, I agreed to wait out the full length of the bet and re-evaluate in June 2025.

The bet is now over, and official judge Gwern agrees I’ve won. Before I gloat, let’s look at the images that got us here.

https://www.astralcodexten.com/p/now-i-really-won-that-ai-bet
