
Why Alibaba’s ZeroSearch Might Beat Google with Next-Gen Tech

2025/6/12

No Priors AI

People
Jaeden Schafer
Topics
Jaeden Schafer: Alibaba has introduced a brand-new method for generating AI model responses called Zero Search, which trains AI models by simulating Google search results without ever calling the real Google API. I think this technique is revolutionary, because it not only dramatically lowers the cost of training AI models but also improves the quality of search results. Zero Search generates multiple simulated web pages and has the AI model filter the best answer out of them, achieving higher accuracy and efficiency. It's like having the model do chain-of-thought reasoning to reach a better result. Its biggest advantage is that it can replace the expensive Google API by training on synthetic data. Because AI models have already ingested huge amounts of internet data, synthetic data from an older model can be used to train a new one, cutting costs dramatically. I'm very excited about this and believe the technique will have a far-reaching impact on the whole AI field. Jaeden Schafer: Zero Search generally outperforms models trained on real search engine data across multiple question-answering datasets. A 7-billion-parameter retrieval model reaches the same performance level as Google search, and a 14-billion-parameter model actually performs better than Google search. This shows Zero Search has enormous potential to replace traditional search engines. I believe that as LLMs keep improving, Google could be displaced, because LLMs can provide accurate data without sending users off to other websites. Twitter's dataset is extremely valuable, and Grok could do very well in this new world, even building its own search engine; a combination of Twitter and news could replace Google and its API. In short, I think the way we access information is undergoing a fundamental shift, and Alibaba's Zero Search technique is leading that change. I'm very impressed by Alibaba's Zero Search tool and the new training concept, especially the cost savings and the performance.

Chapters
Alibaba's Zero Search is a new AI model training technique that significantly reduces costs and potentially surpasses Google's search capabilities. It simulates search results instead of using an expensive Google API, generating multiple "fake" websites to train its models and improve response quality.
  • Reduces training costs by 88%
  • Simulates Google search results to train AI models
  • Uses synthetic data instead of an expensive Google API
  • Higher quality results compared to traditional methods

Transcript


In what I view as an absolutely wild turn of events for AI, Alibaba has come up with a brand-new way of generating high-quality AI model responses. This isn't something you've heard before; they just dropped a research paper on it, and it's called Zero Search. Essentially, what it's doing is allowing

an AI model to essentially Google itself, without using any sort of real search engine. And it's cutting training costs by about 88%. So that's the big headline: this is cutting training costs a ton. I expect to see a lot of AI models copy this template. But this is absolutely fascinating. So

researchers out at Alibaba came up with this. We're going to be diving into all of it, but before we do, I wanted to mention that my startup AIbox has officially launched. We have our beta at AIbox.ai for our playground, which essentially lets you use all the top AI models, text, image, and audio, in the same chat for $20 a month. So you don't have to have subscriptions to everything; for $20 a month you can access all the top AI models from Anthropic, OpenAI, Meta, DeepSeek,

ElevenLabs for audio, Ideogram for image, all of these top ones,

and you can chat with them all in the same chat. One of the features I love about the playground is the ability to ask a question to a certain model and then rerun the chat with another model. A lot of times I'll get ChatGPT to write a document for me, or help me with an email, or change some wording, and I'm like, I just don't like the tone of that. I rerun it with Claude and find a better result. Or sometimes I want to be a little bit edgier, so I run it with Grok. So you have all the different options there.

And then you have a little tab where you can open up all of the responses side by side and compare them, and see which one you like the best. So if you're interested, check it out at AIbox.ai. The link is in the description. All right, let's get back to what's going on over at Alibaba. So this new technique they've unveiled, like I mentioned, is called Zero Search. Essentially, it's allowing them to develop what they're calling advanced search capabilities, but what they're really doing is simulating

search result data. So you ask it a question and it creates a simulated Google response page. When you do a search on Google, you get 20 links to websites you could go look at; this is generating 20 fake, AI-generated websites that it thinks would be

commonly shown for that question. Then it has an algorithm that runs through them, picks which ones are high quality and which are low quality, and picks the best responses. And this is essentially helping it give you a good

answer. And this is so fascinating to me. At first I was like, why would they do this? This seems so weird. Why generate multiple results? They're accomplishing a couple of things. Number one, higher quality results. It's kind of like when we came up with chain of thought and told the model to walk through its thought process, and all of a sudden it started getting higher quality results. This is really cool because

it's generating 20 pages, going through and looking at the 20 different results, and determining what the best answer is. So it's generating the same thing kind of 20 times, and you're getting better responses there. But the other interesting thing they point out is the cost.
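The generate-and-filter loop just described (generate many simulated pages, score them, keep the best) can be sketched in a few lines of Python. This is a toy illustration only; `fake_generate` and `fake_score` are hypothetical stand-ins for real model calls, not anything from Alibaba's paper.

```python
import random

# Toy sketch: instead of calling a real search engine, an LLM generates
# several simulated "web pages" for a query, a scorer ranks them, and only
# the best ones are kept as the search result.

def fake_generate(query: str, n: int = 20) -> list[str]:
    """Stand-in for an LLM producing n simulated pages for a query."""
    return [f"Simulated page {i} about: {query}" for i in range(n)]

def fake_score(document: str) -> float:
    """Stand-in for a learned quality/relevance score per document."""
    rng = random.Random(document)  # deterministic toy score per document
    return rng.random()

def simulated_search(query: str, top_k: int = 5) -> list[str]:
    """Generate candidate documents, then keep the top_k highest scoring."""
    docs = fake_generate(query)
    return sorted(docs, key=fake_score, reverse=True)[:top_k]

results = simulated_search("why does chain of thought help LLMs?")
print(len(results))  # 5
```

In a real system the scorer would be a model judgment rather than a toy hash, but the shape of the loop, generate many candidates and filter down, is the idea the episode is describing.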

This replaces having an expensive API to Google search. So Google search gives you an API, and if you want to train an AI model off all the data on the internet, you grab the Google API, run it through, and train your model off all that content. But that is really expensive, and you're paying Google a ton of money for it. So they've essentially replaced that Google API with synthetic data. It sounds crazy. It sounds impossible, but it's actually not

that far off. And the interesting thing is that these AI models already have pretty much all of the data on the whole internet; they've already slurped up all the data from Wikipedia and every data set they can grab, so they really have all the responses already. If they've already gone and scraped everything from Google, they don't need to re-scrape it again just because they're training a new model. They can use synthetic data from an old model to

essentially create new data to train on. It sounds kind of crazy, but this is what they said specifically about it: "Reinforcement learning training requires frequent rollouts, potentially involving hundreds of thousands of search requests, which incur substantial API costs and severely constrain scalability."

"To address these challenges, we introduce Zero Search, a reinforcement learning framework that incentivizes the search capabilities of LLMs without interacting with real search engines." This is just so fascinating to me, such an interesting concept. And what they found while doing this is that it's actually outperforming Google.
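That rollout setup can be sketched roughly like this, assuming a simulator LLM stands in for the search engine during training. Every function name here is a hypothetical stand-in for illustration, not Alibaba's actual code.

```python
# One RL training rollout where "search" hits a simulator instead of a paid
# search API, so there is no per-query API cost no matter how many rollouts
# training needs.

def simulate_docs(query: str) -> list[str]:
    # In Zero Search this would be a fine-tuned LLM producing both useful
    # and deliberately noisy documents; here it is just a stub.
    return [f"useful doc about {query}", f"noisy doc about {query}"]

def policy_answer(question: str, docs: list[str]) -> str:
    # Stand-in for the policy model reading the retrieved docs and answering.
    return f"answer to '{question}' using {len(docs)} docs"

def outcome_reward(answer: str, gold: str) -> float:
    # Stand-in for an outcome reward, e.g. string match on a QA dataset.
    return 1.0 if gold in answer else 0.0

def rollout(question: str, gold: str) -> float:
    docs = simulate_docs(question)       # simulated search, no API charge
    answer = policy_answer(question, docs)
    return outcome_reward(answer, gold)
```

The point of the quote is that this inner loop runs hundreds of thousands of times during reinforcement learning, which is exactly where replacing a metered API with a simulator pays off.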

One thing that they also mentioned, they said, "Our key insight is that LLMs have acquired extensive world knowledge during large-scale pre-training and are capable of generating relevant documents to a given search query. The primary difference between a real search engine and a simulated LLM lies in the contextual style of

the returned content." So like they mentioned, they already have all the data from their pre-training, and when they actually go to train, they don't want to query Google again and pay all that money all over again. So how good is the quality of the output? This was kind of my big question, and I was blown away.

So they did a bunch of experiments across seven different question-answer data sets, and Zero Search, their new method, not only matched but often beat the performance of a model trained with real search engine data. They have a 7 billion parameter retrieval model, which is not very huge,

and it actually achieved the same performance as a Google search. So when you go do a search on Google, the combined quality of the information in those first 20 links was the same as what the 7 billion parameter model could produce. And that's a fairly small model.

And then they bumped it up to a 14 billion parameter model, which still isn't the biggest model; I think Meta has something like a 400 billion parameter model, which might be their best. So there are way bigger models out there. But their 14 billion parameter model

actually outperformed the Google search. So at 7 billion parameters, an LLM was on par with Google search, and at 14 billion parameters it was better. And the cost savings are absolutely huge: about 64,000 search queries through Google's search API

That would cost them about $586, while simulating with their 14 billion parameter LLM on A100 GPUs costs about $70. So from $586 down to $70 for this training run; that is an 88% reduction.
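The arithmetic behind that headline number checks out, using the approximate figures quoted in the episode rather than exact billing:

```python
# Savings calculation: ~64,000 queries at roughly $586 through a paid search
# API versus roughly $70 of A100 GPU time for the simulated version.

api_cost = 586.0        # approx. cost of ~64,000 real search API queries, USD
simulated_cost = 70.0   # approx. GPU cost to simulate the same queries, USD

reduction = 1 - simulated_cost / api_cost
print(f"{reduction:.0%}")  # 88%
```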

In their paper, they said, quote, "This demonstrates the feasibility of using a well-trained LLM as a substitute for real search engines in reinforcement learning setups." And I would argue we'll get to the point where it replaces search engines altogether, in a real, literal way. We're seeing ChatGPT pretty much do this already; people are just using ChatGPT instead of Google. But I think

the need for Google will be gone as all the data on Google gets sucked into these models. As they get better and better at spitting out the data without hallucinating, and giving it in a real way, Google as we know it won't really need to exist and send people to places. Now,

I know what you're thinking: how could you possibly replace Google? There's all this new information coming out. This article, for example, is new information that isn't in their model, but it is on Google. So I think there's always going to be a place for quote-unquote news, new information. You're probably going to need an API to

wherever that news or new information breaks, which is social media. Of course, Facebook is completely locked down, so that's off, except that, I guess, Meta has access. But then you have something like Twitter or Reddit, and maybe Twitter even more, because it's got a lot of firsthand journalism and video kind of stuff. So Twitter slash X, whatever you want to call it, I think that data set is incredibly valuable, and I think Grok is going to do very, very well in this new world. They could essentially create their own search engine tying together the information on Grok,

which will link out to news articles and other things. So they really have everything you need. And then, of course, news articles are kind of the other piece: you want news, and you can see OpenAI is obviously aware of this, because they're making deals with Axel Springer and all these different news organizations to get their data. So,

journalists making all these news articles is great, but oftentimes they're grabbing it from Twitter. So I think a Twitter-and-news combo tied to an LLM means you essentially don't need Google anymore. You don't need that API; you can run without it. And for companies like Meta that have access to Facebook,

they're probably just good to go on their own, because users are sharing news; they can grab what's trending there and add it to their LLM, and boom, they're good to go. Then, of course, Twitter, where a lot of stuff is getting uploaded firsthand, should be good. Reddit could maybe even make a play, or keep licensing their stuff to Google, so I think that partnership is probably going to be between Reddit and Google. But this is fascinating. This is completely shifting the way we are looking at information,

for better or for worse, because I'm sure tons of people whose websites have been scraped, and whose information is no longer needed now that it's baked in, are unhappy about it. So it's going to be interesting to see where this goes. I've been blown away by the cost savings, and blown away by how they're able to outperform Google on this. So this is a

very, very interesting tool coming out of Alibaba, and a fascinating new training concept. Thank you so much for tuning into the podcast today. If you enjoyed it, make sure to leave a rating and review. And if you're looking for a way to cut down on 20 different subscriptions to different AI models, check out AIbox.ai. We have a ton of exciting new features

coming soon. And we have access to the top 30 AI models, all on there, that you can use for $20 a month. So a ton of fun. Thank you so much for tuning in, and I will catch you next time.