Agents—The Next Step in AI with Shelby Heinecke

Duration: 27:24
 
Join Shelby Heinecke, senior research manager at Salesforce, and Ben Lorica as they talk about agents, AI models that can take action on behalf of their users. Are they the future—or at least the hot topic for the coming year? Where are we with smaller models? And what do we need to improve the agent stack? How do you evaluate the performance of models and agents?

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Points of Interest

  • 0:29: Introduction—Our guest is Shelby Heinecke, senior research manager at Salesforce.
  • 0:43: The hot topic of the year is agents. Agents are increasingly capable of GUI-based interactions. Is this my imagination?
  • 1:20: The research community has made tremendous progress toward making this happen. We’ve made progress on function calling: we’ve trained LLMs to call the correct functions to perform tasks like sending emails. My team has built large action models that, given a task, write a plan and the API calls to execute it (see the function-calling sketch after this list). That’s one piece. A second piece, for when you don’t know the functions a priori, is giving the agent the ability to reason about images and video.
  • 3:07: We released multimodal action models. They take an image and text and produce API calls. That makes navigating GUIs a reality.
  • 3:34: A lot of knowledge work relies on GUI interactions. Is this just robotic process automation rebranded?
  • 4:05: We’ve been automating forever. What’s special is that automation is driven by LLMs, and that combination is particularly powerful.
  • 4:32: The earlier generation of RPA was very tightly scripted. With multimodal models that can see the screen, agents can really understand what’s happening. Now we’re beginning to see reasoning-enhanced models. Inference scaling will be important.
  • 5:52: Multimodality and reasoning-enhanced models will make agents even more powerful.
  • 6:00: I’m very interested in how much reasoning we can pack into a smaller model. Just this week, DeepSeek also released smaller distilled versions of its reasoning model.
  • 7:08: Every month, the capabilities of smaller models get pushed further. Smaller models may not compare to large models right now, but this year we can push the boundaries.
  • 7:38: What’s missing from the agent stack? You have the model, some notion of memory, tools the agent can call, and agent frameworks. You need monitoring and observability. Everything depends on the model’s capabilities. There’s a lot of fragmentation, and the vocabulary is still unclear. Where do agents usually fall short?
  • 9:00: There’s a lot of room for improvement with function calling and multistep function calling. Earlier in the year it was just single-step; now there’s multistep, which expands our horizons (see the agent-loop sketch after this list).
  • 9:59: We need to think about deploying agents that solve complex tasks that take multiple steps. We will need to think more about efficiency and latency. With increased reasoning abilities, latency increases.
  • 10:45: This year, we’ll see small language models and agents come together.
  • 10:58: At the end of the day, this is an empirical discipline, and you need to come up with your own benchmarks and eval tools (see the evaluation sketch after this list). What are you doing in terms of benchmarks and evals?
  • 11:36: This is the most critical piece of applied research. You’re deploying models for a purpose, so you still need an evaluation set for that use case. As we work with a variety of products, we co-create evaluation sets with our partners.
  • 12:38: We’ve released the CRM benchmark. It’s open. We’ve created CRM-style datasets with CRM-type tasks, and you can see how open source models and small models perform on these leaderboards.
  • 13:16: How big do these datasets have to be?
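
To make the function-calling discussion at 1:20 concrete, here is a minimal sketch of the pattern: the model is shown JSON schemas for the available functions, returns a structured call, and the application dispatches it. The `call_llm` stub and the `send_email` tool are hypothetical stand-ins for illustration, not Salesforce’s actual large action models.

```python
import json

# Hypothetical tool the agent may invoke; any side-effecting function works.
def send_email(to: str, subject: str, body: str) -> str:
    return f"email sent to {to}"

TOOLS = {"send_email": send_email}

# Schema advertised to the model, in the style used by most function-calling APIs.
TOOL_SCHEMAS = [{
    "name": "send_email",
    "description": "Send an email on the user's behalf.",
    "parameters": {
        "type": "object",
        "properties": {
            "to": {"type": "string"},
            "subject": {"type": "string"},
            "body": {"type": "string"},
        },
        "required": ["to", "subject", "body"],
    },
}]

def call_llm(task: str, schemas: list) -> str:
    # Stand-in for a real model call; a trained action model would choose
    # the function and its arguments itself. Here we fake its JSON output.
    return json.dumps({
        "function": "send_email",
        "arguments": {"to": "team@example.com",
                      "subject": "Q3 report",
                      "body": "Draft attached."},
    })

def dispatch(task: str) -> str:
    call = json.loads(call_llm(task, TOOL_SCHEMAS))
    fn = TOOLS[call["function"]]       # look up the named function
    return fn(**call["arguments"])     # execute with model-chosen args

print(dispatch("Email the Q3 report to the team"))
```

A multimodal action model follows the same contract but takes a screenshot plus text as input and emits the API call, which is what makes GUI navigation tractable.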
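
The stack pieces at 7:38 and the multistep function calling at 9:00 fit together in a loop roughly like this sketch: the model proposes one tool call per turn, results accumulate in memory, and a logger stands in for monitoring and observability. `plan_next_step` is a scripted stand-in for a real model, and no particular agent framework is assumed.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent")  # minimal observability hook

# Toy tools; real agents would call CRM APIs, search, and so on.
TOOLS = {
    "lookup_account": lambda name: {"account": name, "owner": "s.heinecke"},
    "create_task": lambda account, note: f"task created for {account}: {note}",
}

def plan_next_step(goal: str, memory: list):
    # Stand-in for the model: emit one tool call per turn until done.
    # A reasoning-enhanced model would decide this from goal + memory.
    if not memory:
        return {"tool": "lookup_account", "args": {"name": "Acme"}}
    if len(memory) == 1:
        return {"tool": "create_task",
                "args": {"account": "Acme", "note": goal}}
    return None  # signal that the goal is complete

def run_agent(goal: str, max_steps: int = 5):
    memory = []  # the "notion of memory": prior tool results
    for _ in range(max_steps):
        step = plan_next_step(goal, memory)
        if step is None:
            break
        result = TOOLS[step["tool"]](**step["args"])
        log.info("tool=%s result=%s", step["tool"], result)
        memory.append(result)
    return memory

run_agent("follow up on renewal")
```

Each extra step compounds latency, which is why the discussion at 9:59 pairs multistep agents with efficiency concerns and smaller models.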
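
On benchmarks and evals (10:58 through 12:38), a use-case evaluation set can start as simple task/expected-call pairs scored by exact match. The `predict_call` wrapper below is a hypothetical model interface; this is a sketch of the shape such an eval can take, not the CRM benchmark mentioned in the episode.

```python
# Tiny evaluation set: task text paired with the expected function call.
EVAL_SET = [
    {"task": "Email Dana the renewal quote",
     "expected": ("send_email", {"to": "dana@example.com"})},
    {"task": "Log a follow-up task on Acme",
     "expected": ("create_task", {"account": "Acme"})},
]

def predict_call(task: str):
    # Hypothetical model wrapper; returns (function_name, arguments).
    return ("send_email", {"to": "dana@example.com"})

def evaluate(eval_set):
    correct = 0
    for example in eval_set:
        name, args = predict_call(example["task"])
        exp_name, exp_args = example["expected"]
        # Exact match on function name and on the required arguments.
        if name == exp_name and all(args.get(k) == v for k, v in exp_args.items()):
            correct += 1
    return correct / len(eval_set)

print(f"accuracy: {evaluate(EVAL_SET):.2f}")  # 0.50 on this toy set
```

Co-created evaluation sets like the ones described at 11:36 follow the same structure, just with tasks and expected calls drawn from the partner’s real use case.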
