People
NLW
Well-known podcast host and analyst, focused on cryptocurrency and macroeconomic analysis.
NotebookLM
Topics
NLW points out that the traditional approach to scaling AI models appears to have hit a bottleneck, with performance gains falling short of expectations. NotebookLM introduces a new MIT study exploring a method called Test-Time Training (TTT). The core idea of TTT is to give an AI additional training right before it performs a specific task, similar to practicing before an exam. The researchers applied TTT to ARC (the Abstraction and Reasoning Corpus), a collection of visual puzzles designed to test an AI's abstract reasoning ability. The results show that a medium-size language model using TTT achieved a 25% performance improvement on ARC, and that combining TTT with a hybrid approach (neural networks plus symbolic reasoning) even reached average human performance. TTT's effectiveness stems from three key factors: structural similarity between the initial training and the target task, augmented task formats and data, and training a separate adapter for each puzzle. In addition, TTT works best on models that have not been trained on AI-generated data, suggesting that AI-generated data may lack real-world complexity. TTT not only improves AI performance more efficiently than blindly scaling up models; it also emphasizes making AI more intelligent and adaptable. TTT has enormous application potential in fields such as scientific research, software development, and education. NotebookLM argues that TTT signals a shift in how AI is designed and used: the future may lie in smaller, more specialized AI systems that can learn and adapt to specific tasks and environments, acting more like partners than tools. TTT could make AI more personalized, more attuned to human needs, and more integrated into everyday life. NLW adds that AI's ability to learn and adapt rapidly raises concerns about control and predictability, and that we need to ensure AI is used safely and responsibly.

Deep Dive

Chapters
ChatGPT's new feature allows it to read from certain apps on Mac, enhancing its usefulness as a coding copilot and potentially paving the way for more general applications of this capability.
  • ChatGPT can now read from leading developer-focused coding applications.
  • The feature uses Apple's accessibility API to read and translate the screen.
  • OpenAI sees this as a key building block towards creating agentic systems.

Transcript


Today on the AI Daily Brief, a new MIT paper about test-time training, which has been a big part of the recent discussion of the limitations of scaling. And before that, in the headlines, ChatGPT can now read from apps on Mac. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.

To join the conversation, follow the Discord link in our show notes. Welcome back to the AI Daily Brief headlines edition, all the daily AI news you need in around five minutes.

We kick off today with an update from OpenAI, where ChatGPT can now read directly from certain other apps. ChatGPT's desktop app for macOS can now read from a handful of leading developer-focused coding applications, including VS Code, Xcode, TextEdit, Terminal, and iTerm2. What this means practically is that developers no longer need to copy and paste their code into ChatGPT while using it as a coding copilot.

Instead, when this new Work with Apps feature is enabled, the section of code you're working on will automatically be sent as context alongside your prompt. ChatGPT won't be able to write code directly into a developer app the way Cursor or GitHub Copilot can, but the feature is more about building a test case for more general applications of this capability.

OpenAI said the ability to understand other apps is a key building block towards creating agentic systems. The feature draws on Apple's accessibility API to read and translate the screen. This means the technique will only work with text-based apps.

However, it also avoids using vision-based inputs, which are prohibitively expensive for heavy use. Still, the feature sends surrounding lines of code as context, so it's going to be using a lot of tokens. It's unclear at this stage how OpenAI plans to make the feature compatible with apps that don't work with Apple's screen reader.
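For readers who want to see the mechanics, here is a minimal sketch of reading on-screen text through that accessibility layer. It assumes the pyobjc bindings for Apple's AXUIElement API and an app that has been granted accessibility permission; it is illustrative only, not OpenAI's implementation, and it shows why text-based apps are the limit: an app that exposes no text value through this API simply returns nothing.

```python
# Minimal sketch: read the focused element's text via macOS accessibility.
# Assumes `pip install pyobjc` and Accessibility permission for the caller.
# Illustrative only; not OpenAI's implementation.
from AppKit import NSWorkspace
from ApplicationServices import (
    AXUIElementCreateApplication,
    AXUIElementCopyAttributeValue,
    kAXFocusedUIElementAttribute,
    kAXValueAttribute,
)

def read_focused_text():
    """Return the text of the frontmost app's focused UI element, if any."""
    app = NSWorkspace.sharedWorkspace().frontmostApplication()
    ax_app = AXUIElementCreateApplication(app.processIdentifier())
    err, focused = AXUIElementCopyAttributeValue(
        ax_app, kAXFocusedUIElementAttribute, None)
    if err != 0 or focused is None:
        return None  # nothing exposed: effectively a non-text-based app
    err, value = AXUIElementCopyAttributeValue(focused, kAXValueAttribute, None)
    return value if err == 0 else None

if __name__ == "__main__":
    print(read_focused_text())
```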

We've seen Anthropic go all the way in the other direction, using an approach that takes constant screenshots for context rather than relying on accessibility APIs. OpenAI desktop product lead Alexander Embiricos said, "This isn't meant to be an agent.

It's a way to collaborate with coding tools to start, and there will be more tools coming soon. On the side of agents, I think this is a really key building block, this idea that ChatGPT understands or can work with all the content that you have, so it can help with that."

The feature is already available to Plus and Team users and is rolling out to Enterprise and Education tiers in the next few weeks. Next up, the latest from Elon and xAI. We had heard previously that they were raising up to six billion dollars, but we've gotten some new details. The latest reports suggest that it's happening at a fifty-billion-dollar valuation and could close as early as next week.

CNBC suggests that it's going to be a combination of five billion from sovereign funds in the Middle East and one billion from other investors. Now, of course, most of this is going to end up in Jensen Huang's pocket, because the money is going to be used to acquire one hundred thousand Nvidia chips, according to CNBC sources. We'll keep an eye out to see if this deal actually closes.

Speaking of data centers, consulting firm Gartner has warned that the AI energy crisis could arrive soon. In a new report, the firm said that power shortages would restrict forty percent of AI data centers within a few years. Bob Johnson, VP analyst at Gartner, said the explosive growth of new hyperscale data centers to implement GenAI is creating an insatiable demand for power that will exceed the ability of utility providers to expand their capacity fast enough.

In turn, this threatens to disrupt energy availability and lead to shortages, which will limit the growth of new data centers for GenAI and other uses from 2026, Gartner said. The new servers deployed last year required one hundred and ninety-five terawatt-hours of electricity, which is as much as eighteen million U.S.

households. By 2027, they believe that just the new facilities will demand five hundred terawatt-hours. And this, my friends, is why, of course, all of the big AI labs are so focused on energy and energy solutions.
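The household comparison checks out as rough arithmetic. Assuming a typical U.S. household uses on the order of 10.8 megawatt-hours of electricity per year (roughly the EIA average; an assumption here, not a figure from the Gartner report):

```python
# Back-of-envelope check on the Gartner comparison (assumed household figure).
new_ai_server_demand_twh = 195      # TWh used by new AI servers last year
household_mwh_per_year = 10.8       # assumed average annual US household use
households = new_ai_server_demand_twh * 1_000_000 / household_mwh_per_year
print(f"{households / 1e6:.1f} million households")  # -> about 18.1 million
```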

Over in the world of staffing moves, prominent AI developer François Chollet is leaving Google after close to a decade at the company. He is credited with creating Keras, a high-level open-source API for creating AI models and tackling machine learning tasks. The platform boasts over two million users and underpins several high-profile products, including Waymo's self-driving algorithms as well as the recommendation engines for YouTube, Netflix, and Spotify. In a post on X, he said, "I'm very grateful for my decade at Google. In that time, deep

learning went from a niche academic topic to a massive industry employing millions. Keras went from a small library used by a few thousand enthusiasts to a state-of-the-art framework used by two million developers." He says he plans to, quote, go start a new company with a friend, but didn't give any further details. Aside from Keras, Chollet published the Abstraction and Reasoning Corpus for AGI in 2019. The ARC-AGI benchmark, which, by the way, features prominently in today's main episode, measures the ability of AI systems to solve novel reasoning problems.

It's viewed as one of the most recognizable signposts that a model has achieved true AGI. This year, in collaboration with others, he launched the ARC Prize, awarding one million dollars to the first team to achieve eighty-five percent on the benchmark. The prize remains unwon, with the closest score coming in at forty-two percent.

Chollet has also taken a firm view on the scaling issues that have recently returned to prominence. He has often argued that the current approach of feeding ever more data and compute resources into training models is unlikely to achieve AI that is as smart as humans. Instead, he believes that methods that involve reasoning are more likely to yield results.

Now, that I think is a perfect segue into what our main discussion is going to be today, which is a new paper out of MIT that puts a little more juice around this idea of new strategies like test-time compute. That's going to do it for the AI Daily Brief headlines edition. Next up, the main episode. Today's episode is brought to you by Vanta.

Whether you're starting or scaling your company's security program, demonstrating top-notch security practices and establishing trust is more important than ever. Vanta automates compliance for ISO 27001, SOC 2, GDPR, and leading AI frameworks like ISO 42001 and the NIST AI Risk Management Framework, saving you time and money while helping you build customer trust. Plus, you can streamline security reviews by automating questionnaires and demonstrate your security posture with a customer-facing trust center, all powered by Vanta AI. Over eight thousand global companies like LangChain, Leela AI, and Factory AI use Vanta to demonstrate AI trust and improve security in real time. Learn more at vanta.com/nlw. That's vanta.com/nlw.

Today's episode is brought to you, as always, by Superintelligent. Have you ever wanted an AI Daily Brief but totally focused on how AI relates to your company? Is your company struggling with AI adoption, either because you're stalled figuring out what use cases will drive value, or because the AI transformation that is happening is isolated to individual teams, departments, and employees and not able to change the company as a whole? Superintelligent has developed a new custom internal podcast product that inspires your teams by sharing the best AI use cases from inside and outside your company.

Think of it as an AI Daily Brief, but just for your company's AI use cases. If you'd like to learn more, go to besuper.ai/partner and fill out the information request form. I am really excited about this product, so I will personally get right back to you. Again, that's besuper.ai/partner.

Welcome back to the AI Daily Brief. Today we're doing something a little bit different, and I think it's going to be pretty fun. One of the big points of conversation for the last week or two has been this question of whether labs are hitting some limits with their previous approach to scaling LLMs.

Basically, reports are coming out that the next version of Gemini, as well as the next version of ChatGPT, OpenAI's next GPT model, just don't reflect as big a jump as previous state-of-the-art leaps represented. If you want more background on that, you can go listen to a couple of shows from earlier this week that are all about those limitations. However, the important thing to note here is that it's not that AI can't get any more performance from here.

It's about what techniques and strategies are needed to actually make the next leap. One of the strategies that seems really promising is something called test-time compute. This is part of what's been built into the o1 reasoning model that OpenAI has released.

And the new paper this week from MIT was called "The Surprising Effectiveness of Test-Time Training for Abstract Reasoning." Now, the challenge when any of these papers are released is that they are extremely dense and extremely technical. And yet we have ourselves some assets to better understand this. And so from here, I'm going to turn the AI Daily Brief over to Google's NotebookLM, where I've worked with it to create a conversational podcast about this paper.

I'm not looking to put myself out of a job here, but I think you'll agree that this is a really powerful use of this new Google capability, one that makes really abstract and dense information like what's in this paper a lot more accessible. So with that, I'm going to turn it over to NotebookLM. Appreciate you guys listening, as always. And here, once again, is a discussion of the surprising effectiveness of test-time training.

Feels like for a while there, AI was just advancing so fast, especially with those large language models from companies like OpenAI and Google. It seemed like every few months, bam, some new mind-blowing capability would drop. But lately, uh, I don't know, there's this sense that maybe that initial rush of progress is kind of starting to slow down. Like maybe just making these models bigger isn't the whole story.

Yeah. What's interesting is there are signs that this bigger-is-better approach to AI might be hitting some limits. We saw that with OpenAI's releases, GPT-4 being the big recent one, and even with Google's Gemini, right?

And you'd think that with each new version, these jumps in capability would just get bigger and bigger. But I read something really interesting that said that while GPT-4 was definitely a huge leap over GPT-3, the improvements from GPT-4 to its successor, Orion, are actually much smaller. It's like maybe they're bumping up against some kind of fundamental barrier.

Yeah, that's a question a lot of people are asking. If just throwing more data and computing power at the problem isn't the answer, then what is? And that's why I thought this new research coming out of MIT was so interesting. They're exploring a completely different approach to making AI smarter, something they call test-time training.

Okay, test-time training. That sounds kind of counterintuitive. Isn't the whole point of training an AI to get it ready to perform before you use it? So what's the idea here?

So think of it this way. Imagine you have a big exam coming up. You've studied the material, you've got a good foundation, but then you do a few practice problems right before the test just to really sharpen your skills and focus on the specific types of questions you're likely to see. That's kind of what test-time training, or TTT,

does for AI. So instead of just relying on that initial training, they're giving the AI a little extra boost right before it has to tackle the

specific task. Exactly. And the MIT researchers applied this idea to a particularly challenging set of problems. It's called the ARC, the Abstraction and Reasoning Corpus. It's basically a collection of visual puzzles that are meant to test an AI's ability to solve problems it's never seen before, to really stretch its capacity for abstract reasoning.
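To make the idea concrete, here is a toy sketch of a test-time training loop. It is not the MIT team's code, which fine-tunes an eight-billion-parameter language model; this stand-in uses a tiny network and treats one task as a handful of input-output pairs, fine-tuning on those demonstrations right before predicting the held-out test input.

```python
# Toy test-time training (TTT): briefly fine-tune on a single task's own
# demonstration pairs immediately before answering its test input.
# Illustrative stand-in; the paper adapts an 8B language model instead.
import copy
import torch
import torch.nn as nn

def test_time_train(base_model, demos, test_input, steps=50):
    """demos: list of (input, output) tensor pairs for ONE task."""
    model = copy.deepcopy(base_model)   # keep the shared base model intact
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    model.train()
    for _ in range(steps):              # the "cram session"
        for x, y in demos:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    model.eval()
    with torch.no_grad():               # now answer the actual question
        return model(test_input)

# Usage with dummy 9-cell "grids" flattened to vectors:
base = nn.Sequential(nn.Linear(9, 64), nn.ReLU(), nn.Linear(64, 9))
demos = [(torch.randn(9), torch.randn(9)) for _ in range(3)]
prediction = test_time_train(base, demos, torch.randn(9))
```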

I really took a look at some of those ARC puzzles, and wow, they are not messing around. Talk about mental gymnastics.

Right. They involve things like pattern recognition and applying logical rules, even some spatial reasoning. They're designed to be tough, even for us humans.

So how did the AI do with this test-time training? Did those practice problems actually help?

Oh, you bet they did. The researchers found that by using TTT with a fairly modest-size language model, one with about eight billion parameters, they actually achieved a twenty-five percent improvement over the previous best results on ARC, which is a significant jump. But here's what's even more remarkable. By combining TTT with a hybrid approach that uses both neural networks and symbolic reasoning, they actually managed to match average human performance on these puzzles.

Hold on. They got an AI to perform as well as an average person on these really complex visual reasoning tasks just by giving it, like, a little cram session beforehand? That's incredible.

It is pretty mind-blowing, isn't it? And it really challenges some of our assumptions about how AI learns and adapts. But the real question is, what's the secret? What makes test-time training so effective, especially in this case? Well, the MIT researchers actually identified three key ingredients that seem to make TTT particularly potent. First, they found that it really helps if the AI model is initially trained on tasks that share some underlying structure with the target task, in this case the ARC puzzles.

Ah, so it's like giving the AI a head start, a foundation of knowledge that it can then build on during the test-time training.

Exactly. That initial training can kind of prime the model to learn quickly and efficiently during the test-time phase. The second ingredient is what they call augmented task format and data. Basically, they very cleverly create new training data from the test input itself.
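As a rough illustration of that second ingredient: because each ARC task comes with a few demonstration pairs, you can multiply them by applying the same rule-preserving transformation to both grids in a pair. The exact augmentation recipe in the paper may differ; this is just the flavor.

```python
# Sketch of augmenting one ARC demonstration pair: apply the SAME
# rotation, flip, and color relabeling to input and output, so the
# hidden rule is preserved while the surface form changes.
import numpy as np

def augment_pair(inp, out, rng):
    k = int(rng.integers(4))                   # 0-3 quarter turns
    inp, out = np.rot90(inp, k), np.rot90(out, k)
    if rng.random() < 0.5:                     # mirror half the time
        inp, out = np.fliplr(inp), np.fliplr(out)
    perm = rng.permutation(10)                 # relabel ARC's 10 colors
    return perm[inp], perm[out]

rng = np.random.default_rng(0)
demo_in = np.array([[1, 0], [0, 2]])
demo_out = np.array([[2, 0], [0, 1]])
augmented = [augment_pair(demo_in, demo_out, rng) for _ in range(8)]
```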

So instead of just throwing the AI into the deep end with a totally novel problem, you're giving it a few practice laps in the pool first.

Yeah, that's a great way to put it. It's like providing the AI with a set of really tailored exercises that help it zero in on the specific patterns and analogies that are relevant to the task at hand. And then the third key ingredient is that instead of using one generic adapter for all the tasks, they actually train individual adapters for each specific puzzle.

So it's like customizing the AI's thinking for each challenge, making it hyper-focused on that specific problem at hand. No wonder it performs so well.
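Per-puzzle specialization is affordable because the whole model isn't retrained: a small low-rank adapter is trained for each task on top of a frozen base. A minimal sketch with the Hugging Face peft library follows; the base model name, rank, and target modules here are illustrative assumptions, not the paper's exact settings.

```python
# Sketch: one fresh LoRA adapter per puzzle over a frozen base model.
# Model name and hyperparameters are assumptions, not the paper's settings.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

def lora_cfg():
    return LoraConfig(r=8, lora_alpha=16,
                      target_modules=["q_proj", "v_proj"],  # attn projections
                      task_type="CAUSAL_LM")

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16)
model = get_peft_model(base, lora_cfg())       # base weights stay frozen

tasks = {"puzzle_001": []}                     # task_id -> augmented demos
for task_id, demos in tasks.items():           # one adapter per puzzle
    model.add_adapter(task_id, lora_cfg())     # fresh low-rank weights only
    model.set_adapter(task_id)                 # route training to this task
    # ...fine-tune on `demos` as in the TTT loop sketched earlier...
    model.save_pretrained(f"adapters/{task_id}")
```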

Right. And what's even more interesting is that they found that TTT was most effective when used with models that hadn't been trained on synthetic data that was generated by another AI. Which, you know, makes you wonder if there's something about that AI-generated data that's missing some element of real-world complexity that's crucial for truly robust learning.

It's like those AI-generated images you see. They're technically very impressive, but they often have these little glitches or inconsistencies that can give them away. Maybe there's a similar issue with AI-generated training data, like it's not quite capturing the full richness and messiness of the real world.

That's a really astute observation, and it definitely points to a potential area for future research. But let's zoom out for a second and think about the bigger picture here. What these findings suggest is that TTT could actually be a more efficient path to improving AI than just blindly scaling up models.

Because it's not just about making AI bigger. It's about making it smarter and more adaptable.

And the potential applications of this are huge. Imagine if we could apply TTT to fields like scientific research, where AI could help scientists analyze really complex data sets and generate hypotheses in real time.

Or in software development, where AI could help programmers debug code or design more efficient algorithms, adapting to the specific challenges of each project.

Or even in education, where personalized AI tutors could tailor their teaching methods to each student's individual learning style and pace, adapting on the fly to their strengths and weaknesses.

Now you're talking. That's the kind of AI that could really transform lives. But I also have to wonder, if AI can learn and adapt this quickly, doesn't that raise some concerns about control and predictability? What's to stop it from learning things we don't want it to learn, or behaving in ways that we

didn't anticipate? That's a totally valid concern, and it's one that researchers are very actively exploring. One of the key challenges with any really powerful technology is ensuring that it's used safely and responsibly. With TTT, it's about striking a balance between adaptability and robustness, making sure that the AI remains aligned with our goals and values even as it learns and evolves.

It's a delicate dance for sure, but the potential benefits are so compelling that it seems worth exploring further. What excites you most about the possibilities of TTT?

You know, for me, it's the idea that we might be on the verge of a truly transformative shift in how we design and use AI. We've been so focused on building these massive, all-knowing models, but maybe that's not the only way, or even the best way. Perhaps the future of AI lies in smaller, more specialized systems that can learn and adapt to specific tasks and contexts. Systems that are more like partners or collaborators than just tools.

I like that. AI as a partner, working alongside us to solve these complex problems and unlock new possibilities. But I'm also curious, what are some of the limitations of TTT? Is it a silver bullet? Or are there certain types of problems or situations where it's not as effective? It's fascinating to think about all the ways this could change how we interact with AI.

For a while, I felt like AI was this distant force, something happening in research labs or powering big tech platforms. But TTT, that brings it closer, makes it more personal, more tailored to our individual needs.

Yeah, I totally agree. If you think about the evolution of technology, it often starts with these big centralized systems, and then it gradually becomes more distributed, more accessible, more integrated into our daily lives. And I think TTT could be a catalyst for that kind of shift in the world of AI.

Like moving from mainframe computers to personal computers and on to smartphones and wearable devices. What was once exclusive and specialized becomes ubiquitous and personalized.

Exactly. And as AI becomes more integrated into our lives, the ability for it to learn and adapt on the fly is going to be essential. We're going to need AI systems that can understand the nuances of our individual preferences, our work styles, our learning patterns.

Can you give me a concrete example of what that might look like?

Sure. Imagine a world where your smartphone isn't just a device. It's a true AI companion that learns from your interactions and anticipates your needs, helps you navigate your day with incredible efficiency. You're working on a complex project, and your AI assistant proactively gathers relevant information, suggests potential solutions, and even helps you draft emails or presentations, all tailored to your specific style and the context of the project.

That's a pretty compelling vision. It's not just about automation. It's about augmentation, about AI amplifying our capabilities and helping us reach our full potential.

TTT could be a really crucial part of making that vision a reality. By allowing AI systems to specialize on the fly, to adapt to the specific challenges and opportunities of each moment, we can create a future where AI is not just powerful, but also truly useful and truly human-centered.

So as we wrap up this deep dive into test-time training, I want to leave you, the listener, with this thought: what areas of your life could benefit from this kind of AI? What tasks or challenges could you delegate or collaborate on with an AI partner that can learn and adapt as quickly as you can?

The future of AI is being written right now, and technologies like TTT are giving us a glimpse of what's possible. It's up to all of us to imagine and shape that future, to ensure that AI is used to empower and uplift humanity, and not to replace or diminish it.

Thanks for joining us on this exploration of test-time training. We hope this deep dive has given you a new perspective on the evolving landscape of AI and sparked your curiosity about the incredible possibilities that lie ahead. Until next time, keep learning, keep questioning, and keep pushing the boundaries of what's possible.