
870: OpenAI’s “Deep Research”: Get Days of Human Work Done in Minutes

2025/3/14

Super Data Science: ML & AI Podcast with Jon Krohn

People
Jon Krohn
Topics
Jon Krohn: I consider OpenAI's Deep Research to be the most powerful research tool in the world right now. It automates deep literature reviews, synthesizing hundreds of online sources into a single coherent, well-cited report. Deep Research accomplishes this by breaking a complex query into smaller tasks, searching for relevant information, and iteratively synthesizing the results into a report. It is like an expert researcher working around the clock, crunching through data at a speed no human can match and completing in minutes tasks that would otherwise take days. Deep Research was trained using end-to-end reinforcement learning and can handle tasks across a wide range of domains. It can browse user-uploaded files, plot graphs with Python, embed images in its responses, and provide sources with specific citations. On the recently released AI evaluation Humanity's Last Exam, Deep Research achieved a striking result, reaching 27% accuracy, far above other AI models, which signals major progress in AI's ability to solve complex problems. I personally subscribe to OpenAI Pro and use Deep Research every day. It saves me an enormous amount of time; for example, I recently used it to quickly create a workshop syllabus, finishing in minutes work that would otherwise have taken hours. Given my requirements and the examples I provided, Deep Research generated a high-quality syllabus, title, and abstract. It even asked constructive questions, about things like the target audience, the programming focus, and the intended tone, which helped me refine the workshop structure. Of course, Deep Research has limitations, such as occasional hallucinations or incorrect citations, but OpenAI is working to improve these. Overall, Deep Research is transforming the research process by automating information gathering, analysis, and synthesis, delivering enormous value to data science and other fields. It lowers the bar to high-quality research so that more people can benefit. As AI technology continues to advance, tools like Deep Research will further augment human capabilities, automate more routine work, and drive innovation and progress.


Chapters
This chapter introduces OpenAI's Deep Research, a tool that automates literature reviews. It explains its functionality, the technology behind it (multi-step reasoning models and reinforcement learning), and its impressive capabilities, including web browsing, data synthesis, graph plotting, and citation generation.
  • Automates deep dive literature reviews
  • Synthesizes hundreds of online sources
  • Uses multi-step reasoning models
  • Trained using end-to-end reinforcement learning
  • Can browse user-uploaded files, use Python, embed images, and provide citations

Transcript


This is episode number 870, on Deep Research. Welcome back to the Super Data Science Podcast. I'm your host, Jon Krohn. I'm on a ski holiday in Switzerland this week with my family, so I'm going to skip past the preamble and jump right to the meat of today's Five-Minute Friday-style episode. And that meat is all about Deep Research, specifically...

While other firms like Google and Perplexity have also released tools called deep research in recent weeks, we'll be primarily focused on OpenAI's deep research in this episode because it's the clear frontrunner in this space at this time. Note that OpenAI doesn't sponsor me in any way. This is my independent opinion. So...

First off, what does Deep Research do? It automates deep-dive literature reviews remarkably well and synthesizes hundreds of online sources into a coherent, well-cited report. Using multi-step, quote-unquote "reasoning" models like the OpenAI model I covered recently in episode number 864,

Deep Research breaks your complex query into smaller tasks, and then it searches the web for each piece of those smaller tasks that it identified, and then it iteratively synthesizes the results into a report, pivoting its research trajectory as it learns new information.

In practical terms, it's like having an expert researcher on call 24-7 crunching through data at speeds no human could ever match. Tasks that could take a human researcher hours or days are now completed for you tremendously well within minutes.
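The loop just described, where the model decomposes a query, searches each sub-task, and iteratively synthesizes a report, can be sketched in plain Python. This is a hypothetical illustration only: OpenAI has not published Deep Research's internals, and the `decompose()` and `search()` functions here are stand-ins for the model's planning step and its live web browsing.

```python
# Hypothetical sketch only: decompose() and search() stand in for the
# model's planning step and its live web browsing, which OpenAI has not
# published details of.

def decompose(query: str) -> list[str]:
    """Break a complex query into smaller sub-tasks (stubbed)."""
    return [
        f"{query} -- background",
        f"{query} -- recent results",
        f"{query} -- open questions",
    ]

def search(task: str) -> str:
    """Stand-in for a web search over one sub-task."""
    return f"findings for: {task}"

def research(query: str, max_rounds: int = 2) -> str:
    """Iteratively search each sub-task and synthesize a report."""
    report: list[str] = []
    tasks = decompose(query)
    for _ in range(max_rounds):
        next_tasks: list[str] = []
        for task in tasks:
            report.append(search(task))
            # A real agent would pivot here: new information can spawn
            # new sub-tasks, which next_tasks would collect.
        tasks = next_tasks
        if not tasks:
            break
    return "\n".join(report)

print(research("transformer architectures"))
```

The key structural idea is the outer loop: findings from one round can reshape the task list for the next, which is what lets the research trajectory pivot as new information comes in.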

OpenAI trained deep research using end-to-end reinforcement learning on challenging web browsing and reasoning tasks across a range of domains. Through that training, it learned to plan and execute a multi-step trajectory to find the data it needs, backtracking and reacting to real-time information where necessary.

The model is also able to browse over user-uploaded files. It can use Python to plot graphs. It can embed images into its responses, including images from the websites it searched during its research, as well as graphs it generated for you. And it provides citations, identifying the specific sentences or passages

that served as the source of each piece of information. So that's pretty cool. All of that's really cool. And as a result of this reinforcement learning training and all these capabilities together, OpenAI's Deep Research reaches new highs on a number of public evaluations focused on real-world problems. To wit...

Deep research set a dramatically high new benchmark on a recently released AI evaluation called Humanity's Last Exam. I've got a link in the show notes to the Humanity's Last Exam website so you can check it out in detail. But it's a comprehensive assessment consisting of 3,000 multiple choice and short answer questions across over 100 subjects ranging from rocket science to linguistics.

This is a widely respected new benchmark that was supposed to be really challenging for AI models to tackle. The hope was that we had finally created an evaluation that would take

years for AI models to get any kind of traction on. But on this Humanity's Last Exam benchmark, Deep Research all of a sudden makes it clear that AI is getting traction on this supposedly very, very challenging set of tasks. So for example, OpenAI o1 had only a 9.1% accuracy on Humanity's Last Exam.

DeepSeek R1 was about the same, with a 9.4% accuracy. And now along comes OpenAI's Deep Research, and it completely blows those numbers out of the water, getting a 27% accuracy. That's still nowhere near 100%, but on every other kind of benchmark we've seen

in the past, like software engineering benchmarks and math benchmarks, as soon as AI models make a jump like this, from 9% to 27% accuracy,

we end up seeing them take big chunks out of the benchmark in the coming months, or at least the coming years. So I wouldn't be surprised if Humanity's Last Exam is conquered by AI before long. In the video version of today's episode, I've got a table showing the performance of all of the leading models on Humanity's Last Exam at the time of recording.

And so models like GPT-4o, which is kind of the leading OpenAI model for streaming out answers immediately without step-by-step reasoning, performs at only 3% accuracy on Humanity's Last Exam. That's comparable to Grok 2 from xAI and Claude 3.5 Sonnet, which score around the 4% mark.

And then Google's effort here, Gemini Thinking, scores about 6%. So o1 and DeepSeek R1 were doing much better at 9%. o3-mini came along, getting up to 13% accuracy with that reasoning model set to high, using a large amount of computation. But yeah, OpenAI's Deep Research absolutely crushes

all of its competitors, both competitor models and competitor companies, with this 27% accuracy on Humanity's Last Exam. So definitely something to watch there.
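For reference, the Humanity's Last Exam accuracies quoted above can be gathered into a quick leaderboard. The figures are approximate and come straight from this episode, as of the time of recording:

```python
# Humanity's Last Exam accuracies as quoted in this episode
# (approximate figures at the time of recording).
scores = {
    "GPT-4o": 3.0,
    "Grok 2": 4.0,
    "Claude 3.5 Sonnet": 4.0,
    "Gemini Thinking": 6.0,
    "OpenAI o1": 9.1,
    "DeepSeek R1": 9.4,
    "o3-mini (high)": 13.0,
    "OpenAI Deep Research": 27.0,
}

# Print a simple leaderboard, best first.
for model, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:<22} {acc:>5.1f}%")
```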

The big jump in performance on Humanity's Last Exam, of course, does translate into real-world value. I have an OpenAI Pro subscription, which is 200 US dollars a month, but it's easily worth it for me given how much time it saves me and the value of its insights. I've been using Deep Research near daily as part of that Pro subscription and have been continuously impressed.

For example, I used deep research recently to accelerate the development of a syllabus for an upcoming four-hour agentic AI workshop that I'll be providing at the Open Data Science Conference East, ODSC East, in May in Boston.

And so I had a fair bit of information on this agentic AI syllabus already, and I provided that detail to the model. I told it, you know, I'm going to have four modules. The first module feels pretty complete, but feel free to add something to it. Module two is what requires the most work; I had an external reference URL that I wanted it to check out for that module, but that's all I had.

And then module three was empty. And I said, you know, you can leave it empty because I know exactly what I'm going to put there. Or to be more specific, my co-presenter, Ed Donner, at this workshop, I knew exactly what he was going to put there.

And then I said, I've started on module four, but it probably needs a bullet or two more. And then I provided the information that I already had. So for module one, which I thought was pretty much done, I provided the syllabus points. Again, for module two, I just provided a link.

And module four was kind of incomplete. So I provided all that information to the model. And then I said, to help you nail my style for the title and the abstract that I'd like you to create for this as well... So I'm asking it for a syllabus, kind of a bullet-by-bullet breakdown of what I'll be doing at this agentic AI workshop.

And then I said, I'm also going to want a title and an abstract, and here are examples of titles and abstracts that I wrote for Open Data Science Conference workshops I've given in the past. So I provided that context as well. And yeah, so there were the two examples and those instructions.

And the model came back to me and asked me questions. I'm also noticing something: if you're watching the YouTube version of this, I'm actually showing the specific query that I provided, and in the top-left corner you'll see that GPT-4o is the selected model. That's just because I'm now looking at the history of this conversation, and in the history view I'm not dropped right back into the Deep Research session.

I could continue the conversation, but instead of it being Deep Research automatically, it's set to GPT-4o. But you can ignore that. This was an o1 pro mode conversation that I had with Deep Research mode on. And yeah, it's very, very easy to turn Deep Research on,

which is something I probably should have said right at the beginning of this explanation. But basically, there's just a button that you toggle right below the query box in ChatGPT; it turns blue, and you're in Deep Research mode. Anyway, so once I provided all the information to Deep Research, as I described earlier,

the modules that I already had, my instructions for which ones to fill in more, and examples of titles and abstracts that I've delivered at ODSC in the past. It came back and asked me for more information, which was a new experience for me with an LLM, especially at this level of detail. It asked about the target audience: what's the expected level of the attendees? Beginner, intermediate, or advanced?

Is there a programming focus? Will there be hands-on coding with Python or other specific frameworks? And what's the tone? Should the title and abstract lean more towards a practical hands-on feel or a conceptual thought leadership style? And then it says, once I have this, I can refine the workshop structure accordingly. And these were great questions. So I provided detailed answers on the target audience I'm looking for, on yes, it being a hands-on coding workshop in Python.

and to lean toward a more practical, hands-on feel. And from there,

Deep Research spent three minutes and looked across eight different sources to come up with my results. And you can actually click in the history to see the chain of thought that Deep Research went through: summaries of the step-by-step process it followed in order to come up with a conclusion for me. It was even looking at some material on my website, jonkrohn.com,

to try to come up with a great title, abstract, and syllabus for me. And it provides links to all of the sources it used for information. So all really cool stuff, all very easy to see in a pretty clean ChatGPT interface. But what you're probably most interested in is the results, and they were outstanding. I mean, I certainly made some small changes prior to

providing this to ODSC as my abstract, title, and syllabus, but it saved me hours and hours of time by creating a great draft in my style, because I had provided examples of my style from the past. It also provided lots of great ideas for my syllabus outline, which I was able to quickly summarize into bullets for the Open Data Science Conference.

And yeah, so really, really cool. I hope that gives you a sense and kind of a deep dive into deep research with a specific example.

In your case, imagine something like this: you're exploring the latest advances in transformer architectures. Rather than spending days scanning arXiv, conference proceedings, and technical blogs, you could simply ask Deep Research for a summary of recent breakthroughs. The tool would extract key points, such as improvements in training algorithms, scaling techniques, and performance metrics, and present you with a clear, structured overview complete with citations.

This not only saves tremendous time, but also minimizes the risk of overlooking critical studies. Of course, as I mentioned at the outset of this episode, OpenAI isn't alone in this space. Google and Perplexity, for example, have also rolled out their own deep research capabilities. Google's approach, powered by its Gemini LLMs,

leverages its vast search infrastructure to pull in a broad array of documents. The tool typically presents a user-guided research plan, outlining sub-questions before diving in. This method results in a comprehensive report that's reliable, but sometimes it stops short of the nuanced analysis that deep research delivers.

Then there's Perplexity, which offers a fast and free deep research mode. Perplexity turns out a high-level overview in just a few minutes, making it great for quick snapshots. However, that speed can come at the cost of depth and iterative reasoning. For quick queries or free ones, Perplexity works well. But for mission-critical analysis, OpenAI's more methodical and transparent approach

clearly has the edge, even if it is relatively expensive. Regardless of which company is behind the innovation, and by the way, I have links to more information on both Google's and Perplexity's

deep research technologies as well. But regardless of which company is behind the innovation, looking ahead, the implications are profound. Deep Research redefines how we approach problem-solving in data science and beyond. It democratizes access to high-quality research by lowering the bar to entry, whether you're a seasoned expert in some field or just starting out.

As these systems continue to improve, we might soon see research assistants embedded directly in our development environments, ready to pull insights from the latest publications or internal data stores that we have in our company or personally. Paired with AI agents that can take real-world action with increasing reliability, tools like deep research will enable more and more human abilities to be augmented and more routine work to be automated.

The implications are profound. If you roll this forward a few years and assume, safely I think, that the capabilities of things like Deep Research and AI agents will continue to improve dramatically, then I encourage you to take advantage of this unique moment in human history

to consider how the increasingly capable autonomous systems of the coming years can improve your life and the lives of those around you, including on socially beneficial projects and just plain old commercially impactful ones.

Today, there are, of course, still limitations to be aware of. Like any LLM-based tool, Deep Research could hallucinate or make incorrect references, although I haven't caught any of these myself yet. And OpenAI's internal evaluations apparently show markedly lower hallucination rates with Deep Research than with any of their previous tools.

The biggest risk for you is that Deep Research could present rumor as authoritative fact. But OpenAI is aware of this occasional issue, and you can anticipate that in the coming months and years, this overconfidence problem will become vanishingly rare. I haven't noticed it in my own use of Deep Research yet.

So yeah, what's the catch to all this? Well, Deep Research is expensive, especially from OpenAI. I'm paying, again, 200 US dollars per month as a Pro user to get just 100 queries per month. That's a little over three queries per day, but you get comprehensive answers, so that's actually quite a bit of work.
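That pricing works out as a quick back-of-envelope calculation (assuming a 30-day month):

```python
# Back-of-envelope on the Pro-tier numbers quoted in the episode:
# 200 US dollars per month for 100 Deep Research queries.
monthly_fee_usd = 200
monthly_queries = 100

cost_per_query = monthly_fee_usd / monthly_queries  # dollars per report
queries_per_day = monthly_queries / 30              # assuming a 30-day month

print(f"${cost_per_query:.2f} per query, about {queries_per_day:.1f} queries per day")
```

So each comprehensive report effectively costs about two dollars, which is the comparison to weigh against the hours of manual research it replaces.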

And as OpenAI figures out engineering efficiencies, and how to use smaller models like o3-mini more effectively for deep research, you can anticipate that more and more deep research queries per month will become available to all paying users, and eventually, I'm sure, it will be available for free, like Perplexity's deep research already is. In summary, OpenAI's Deep Research is transforming the research process by automating the heavy lifting of information gathering, analysis, and synthesis.

With its impressive benchmark performance on Humanity's Last Exam, transparent chain of thought, and iterative reasoning process, Deep Research provides a level of depth and reliability that stands out even against competitors like Google and Perplexity. As we continue integrating AI into our workflows, tools like these will be key in turning raw data into actionable insights

and in allowing agentic AI models to be completely autonomous downstream, empowering us to push the boundaries of innovation in data science and everything else. All right, that's it for today's episode. If you enjoyed it or know someone who might, consider sharing this episode with them. Leave a review of the show on your favorite podcasting platform, tag me in a LinkedIn or Twitter post with your thoughts, and if you aren't already, subscribe to the show.

The most important thing, though, is that you just keep on tuning in. Until next time, keep on rocking it out there, and I'm looking forward to enjoying another round of the Super Data Science Podcast with you very soon.