People
Andrew Parsons
Narrator
A podcast host and content creator focused on electric vehicles and energy.
Topics
As AI technology continues to develop, competition between models is no longer limited to comparisons of performance metrics. Product and user experience, the ability to customize for specific tasks, access to particular data, and integration with existing enterprise workflows will all become key competitive factors. Anthropic's Claude model performed strongly in an AI research test, indicating significant potential for AI self-improvement, though it still falls short of top human researchers. Google's Gemini model took the lead in benchmark testing, but its coding ability still needs improvement. One study shows that an AI tool on its own can effectively improve diagnostic accuracy, but having human doctors use the AI tool collaboratively actually reduced accuracy, highlighting the importance of training doctors in how to use AI tools.

Deep Dive

Chapters
A study compares the diagnostic accuracy of doctors using ChatGPT Plus versus conventional methods, revealing interesting insights about AI's potential in medical diagnosis.
  • Doctors using ChatGPT Plus slightly outperformed those using conventional methods.
  • ChatGPT Plus alone achieved over 92% accuracy, suggesting potential for AI in medical diagnostics.
  • Real-life clinical reasoning involves more complex factors, cautioning against fully relying on AI.

Transcript


Today, on the AI Daily Brief, how Anthropic and OpenAI both performed in a test of AI that performs AI research. Before that, in the headlines, can ChatGPT out-diagnose doctors? The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. To join the conversation, follow the Discord link in our show notes.

Welcome back to the AI Daily Brief Headlines Edition, all the daily AI news you need in around five minutes. Here's a really interesting study published in ScienceDaily today: does AI improve doctors' diagnoses? A study puts it to the test. The study came out of UVA Health and took fifty physicians, of whom half were randomly assigned to use ChatGPT Plus to diagnose complex cases, with the other half relying on more conventional methods, including medical reference sites.

The researchers then compared the results to each other as well as to ChatGPT alone. So what actually happened? Well, doctors using ChatGPT Plus slightly outperformed the physicians using conventional methods.

Still, it was very close: diagnostic accuracy for the doctors using ChatGPT Plus was seventy-six point three percent, while the conventional-approach physicians were at seventy-three point seven percent. The ChatGPT group also apparently reached their diagnoses slightly more quickly, about forty-five seconds faster. That said, when ChatGPT Plus was tasked with making the same diagnoses alone, its accuracy was more than ninety-two percent.

Does this mean that ChatGPT is unreservedly better and that we should just be turning everything over to robot doctors? Not necessarily. This was a controlled setting.

And in real life, the researchers caution, there are many other aspects of clinical reasoning that come into play, especially, as they write, in determining the downstream effects of diagnoses and treatment decisions. Still, the fact that ChatGPT alone outperformed the doctors using ChatGPT suggests to some that doctors need better training on how to use these tools. Study lead Andrew Parsons said: our study shows that AI alone can be an effective and powerful tool for diagnosis.

We were surprised to find that adding a human physician to the mix actually reduced diagnostic accuracy, though it improved efficiency. These results likely mean that we need more formal training in how best to use AI. Moving over to AI giant NVIDIA.

Some concerning news for the company recently: according to The Information, NVIDIA has asked suppliers to change the design of server racks multiple times to deal with an overheating issue. The Blackwell GPUs overheat when connected together in server racks designed to hold up to seventy-two chips. NVIDIA refused to comment on whether an updated design has been finalized. Still, this is extremely late in the production process to be making such major changes.

Reportedly, NVIDIA hasn't alerted customers to any delays related to the redesign. A company spokesperson told Reuters: NVIDIA is working with leading cloud service providers as an integral part of our engineering team and process; the engineering iterations are normal and expected. Basically a non-answer denial.

This is, unfortunately for NVIDIA, not the only issue with Blackwell. In August, the company discovered a design fault that impacted manufacturing yields and delayed the release by at least a quarter. CEO Jensen Huang has recently claimed the Blackwell units will begin shipping in Q4, but with just weeks remaining, NVIDIA could be cutting it close to hit that target.

Over in adoption land, ESPN is testing an AI-generated broadcaster on the Saturday college football show SEC Nation. Named FACTS, the gen AI avatar is intended to promote, quote, education and fun around sports analytics, writes The Verge. We haven't seen the avatar in action, but it sounds like a bot-ified version of a stats-and-analytics commentator.

This isn't ESPN's first foray: the network had already gone deep on AI, adding AI-generated game recaps to its website back in September. The feature was used to expand coverage of less-followed sports like women's soccer. Of course, commentary at the time focused on the gaps, including a failure to recognize the occasion of a player's retirement game, as well as bland commentary, but that's sort of to be expected as these things roll out.

Anticipating backlash around this, ESPN made clear that the avatar is absolutely not meant to replace journalists; rather, they write, FACTS is designed to test innovations out in the market and the outlook for ESPN fans in an engaging and enjoyable segment.

Lastly today, a fun one: a UK telco's novel approach to using AI to fight fraud. Mobile phone carrier O2 has introduced a voice-enabled chatbot they're calling an AI granny, designed to waste scammers' time and trained to mimic an elderly woman. The chatbot engages in rambling discussion, keeping scammers on the line for as long as possible. Named Daisy, the AI granny can feed fake bank details to the scammers to keep them interested while going on long tangents about knitting, the weather, or her cat.

The chatbot isn't for use by customers; it's being deployed directly on the phone network and used to answer calls from a list of known scam numbers. Introduced to mark International Fraud Awareness Week this week, O2 claims the chatbot has kept numerous fraudsters on calls for forty minutes at a time. One write-up called it the best use of AI yet, and they might not be wrong. That, however, is going to do it for today's AI Daily Brief Headlines Edition.

Next up, the main episode. Today's episode is brought to you by Vanta. Whether you're starting or scaling your company's security program, demonstrating top-notch security practices and establishing trust is more important than ever. Vanta automates compliance for ISO 27001, SOC 2, GDPR, and leading AI frameworks like ISO 42001 and the NIST AI Risk Management Framework, saving you time and money while helping you build customer trust. Plus, you can streamline security reviews by automating questionnaires and demonstrating your security posture with a customer-facing trust center.

All powered by Vanta AI. Over eight thousand global companies like Langchain, Leela AI, and Factory AI use Vanta to demonstrate AI trust and prove security in real time. Learn more at vanta.com/nlw. That's vanta.com/nlw. Today's episode is brought to you, as always, by Superintelligent.

Have you ever wanted an AI Daily Brief but totally focused on how AI relates to your company? Is your company struggling with AI adoption, either because you're getting stalled figuring out what use cases will drive value, or because the AI transformation that is happening is isolated to individual teams, departments, and employees and not able to change the company as a whole? Superintelligent has developed a new custom internal podcast product that inspires your teams by sharing the best AI use cases from inside and outside your company.

Think of it as an AI Daily Brief, but just for your company's AI use cases. If you'd like to learn more, go to besuper.ai/partner and fill out the information request form. I am really excited about this product, so I will personally get right back to you.

Again, that's besuper.ai/partner. Welcome back to the AI Daily Brief. Today we are discussing kind of the shape and texture of what state of the art looks like. We've got a story about Google Gemini outperforming other models on the leaderboard, and we're kicking off with this story about Anthropic and OpenAI and this AI research comparison.

But really, I want to take a step back and contextualize this in terms of how individuals and enterprises are thinking about AI right now. Over the last couple of weeks, a huge part of the conversation has been dedicated to the idea, or question, of whether there is a slowdown in the rate of AI model improvement; that's why we've talked about some alternative scaling methods and what the labs are doing to try to deal with this. In many ways, what I think we're going to see is that even if that plateau happens, the competition for model supremacy is going to be about more than just sheer state-of-the-art performance. It's going to be about product and user experience.

It's going to be about customization and specialization for tasks, and it's going to be about access to particular data and knowledge of specific workflows within the enterprise that make certain tools work better than others. Basically, I think that we're about to see an expansion of the way that we think about the competition for gen AI supremacy. And so that's just a little bit of context and background before we get into this.

The Information headline reads: Anthropic beats OpenAI in testing AI that performs AI research. This came from independent researchers at Model Evaluation and Threat Research, a nonprofit group, which is publishing later this week an evaluation of how LLMs from both OpenAI and Anthropic performed when they were asked to solve a set of seven AI research problems. This is more than just an idle test. As The Information puts it:

Since the early days of the field, developers have been captivated by the prospect of AI powerful enough to improve itself. OpenAI has already developed an internal AI research assistant, a tool to help its researchers work faster, a possible first step in the development of AI that can conduct AI research on its own.

Now, for AI safety advocates, self-improving AI is an indicator of something else entirely. But the point is that people are very interested in this question of whether AI can be used to improve AI. According to The Information, in five of the seven tests that were run as part of this experiment, Claude 3.5 Sonnet outperformed o1-preview.

They also note that Claude won by what they call a wide margin in two of those seven tests; of the two that o1-preview won, one was also what they call decisive.

One thing for those who are trying to gauge how far along the path to AGI we are: The Information also reports that both models were no match for the top human researchers who took the same tests, who scored more than twice as high as the models on average. Claude was, quote, basically as good as the average human researcher in two of the seven problems.

And o1-preview was about as good as an average researcher in another problem. So what are the types of problems? The examples they give: one of the problems involved writing code for a language model from scratch without using division or exponents, which are usually essential for that task.
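
The episode doesn't describe what that problem looked like in practice, but as a quick illustration of why division and exponents are "usually essential" in language-model code, here is the standard softmax that turns a model's raw logits into next-token probabilities, written as a minimal Python sketch. It's purely illustrative, not the evaluation's actual task or grading harness.

```python
import numpy as np

def softmax(logits):
    """Standard softmax over a vector of logits.

    Note that it leans on exactly the two operations the task reportedly
    forbids: exponentiation and division by the normalizing sum.
    """
    z = logits - np.max(logits)   # shift by the max for numerical stability
    exps = np.exp(z)              # exponentiation step
    return exps / exps.sum()      # division step (normalization)

print(softmax(np.array([2.0, 1.0, 0.1])))  # e.g. -> [0.659, 0.242, 0.099]
```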

Another problem involved experimenting with traditional AI scaling laws, just like an employee at OpenAI might do, but using only a small amount of computing power. The tests are, in part, designed to give us a beacon and a benchmark for how far along AI development really is. Again, The Information writes:

These tests are designed to put human participants at a disadvantage. That way, even if AI models catch up to humans on these tests, it would still mean the models are less capable than top human researchers overall, and give the AI firms time to make adjustments to improve their safety.
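
As for the scaling-laws problem mentioned a moment ago, a typical low-compute version of that kind of exercise is to train a handful of small models and fit a power law of the form L(N) = a·N^(−α) + c to their losses. The sketch below uses made-up numbers and a standard curve-fitting routine to show the general shape of that workflow; it's an assumption about what such an experiment might look like, not the evaluation's actual problem.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (parameter count, eval loss) pairs from a few small training runs.
n_params = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
losses   = np.array([4.10, 3.72, 3.35, 3.05, 2.80])

def power_law(n, a, alpha, c):
    # Kaplan/Chinchilla-style form: L(N) = a * N^(-alpha) + c
    return a * n ** (-alpha) + c

(a, alpha, c), _ = curve_fit(power_law, n_params, losses, p0=[10.0, 0.1, 2.0])
print(f"fitted exponent alpha ~ {alpha:.3f}, irreducible loss ~ {c:.2f}")

# Extrapolate the fitted law to a model far larger than anything trained above.
print(f"predicted loss at 1e9 params: {power_law(1e9, a, alpha, c):.2f}")
```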

So again, summing up for those keeping track at home: AI is still not as good as the top human researchers at AI research, but it is starting to, in certain cases, match average human researchers. Now, one other small thing from Anthropic while we're on the topic.

Anthropic has been pushing really hard to get away from the world of prompt engineering and just build tools that help people improve their prompts automatically. At the end of last week, they announced, quote, the ability to improve prompts and manage examples directly in the Anthropic Console. These features, they say, make it easier to leverage prompt engineering best practices and build more reliable AI applications.

The prompt improver allows developers to take existing prompts and leverage Claude to automatically refine them using advanced prompt engineering techniques. This is ideal for adapting prompts that were originally written for other AI models, as well as for optimizing handwritten prompts. Somewhat connected, in the sense that increasingly we're seeing people ask the AI to help them use the AI.
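
The prompt improver itself is a feature of the Anthropic Console UI, and the episode doesn't describe its API, but the general pattern of asking the AI to help you use the AI is easy to sketch with Anthropic's standard Messages API: hand Claude a draft prompt and ask it to rewrite that prompt following best practices. The instructions and model name below are illustrative assumptions, not the actual improver.

```python
# pip install anthropic  -- assumes ANTHROPIC_API_KEY is set in the environment
import anthropic

client = anthropic.Anthropic()

draft_prompt = "Summarize this support ticket and tell me if it's urgent."

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # illustrative model choice
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": (
            "Rewrite the prompt below using prompt engineering best practices: "
            "give the model a clear role, specify the output format, and add "
            "one worked example.\n\n"
            f"<draft_prompt>\n{draft_prompt}\n</draft_prompt>"
        ),
    }],
)

print(response.content[0].text)  # the refined prompt
```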

Now, one more story, which was from the end of last week as well. Google DeepMind's latest experimental model has leapt to the front of the benchmarking charts. Known as Gemini-Exp-1114, the model has undergone testing on the crowdsourced benchmarking website Chatbot Arena.

Over the past week, it consistently scored better than ChatGPT-4o, jumping forty ranks from the previous Gemini model to the top of the leaderboard. It is now ranked first in both technical and creative domains, topping the charts for both math and creative writing. It also overtook GPT-4o as the best vision model. The only category where it wasn't the best model was coding, where it ranked number three behind GPT-4o and the o1 reasoning models.

Notably, this is the first time a Gemini model has taken the lead by this benchmarking standard. The model is currently available as a preview on the Google AI Studio website. Logan Kilpatrick, the product lead for Google AI Studio, posted that Gemini is super duper smart, along with some market research on new model names.

Referring to Sam Altman's habit of quickly snatching the limelight back, scientist Cass Hanson wrote: what a great way to find out OpenAI will release o1 within twenty-four hours. Professor Ethan Mollick wrote:

Why are people confused about which models are the best choice for hard problems? I mean, don't the names GPT-4o-latest-2024-09, Gemini-Exp-1114, and o1-preview make it obvious? Stop naming AI like files on my hard drive.

As for the model itself, though, he wrote: this was pretty impressive from the new Gemini model launched today. I gave it one of my papers and asked it to review the tables and comment on the methods. It did a better job than the previous Gemini Pro, though that wasn't bad.

Claude was close but didn't zoom out as well. The bigger picture, of course, is that there are now multiple models that are remarkably good at understanding complex academic papers and the underlying quantitative methods. Reading a paper like a PhD seems like a pretty impressive feat for us to just take in stride as, of course, something AI can do.

Part of what matters about that analysis, by the way, is that Ethan, among others, has suggested that part of the reason it looks like AI performance is slowing down is that our benchmarks are just basically saturated at this point; once you get up into the nineties, there's just not that much room to run. And part of the question is, do we need better benchmarks? Still, overall, it's hard not to feel like we are in a more incremental-improvement sort of time in the AI field.

I would suggest that rather than be concerned about this, especially if you are trying to integrate AI into your business, you use this breather as a chance to actually figure out how to use what's already available, which is transformative enough in itself. I have a feeling that we will not be in this sort of moment for very long, and that punctuated equilibrium will be back in no time flat. For now, though, that's going to do it for today's AI Daily Brief. Until next time, peace.