People
Loubna Ben Allal
Topics
Loubna Ben Allal: In 2024, the use of synthetic data in large language models expanded enormously, covering nearly the entire pipeline from pre-training to fine-tuning and evaluation. Synthetic data is cheap and fast to produce, and the quality and type of generated data can be controlled effectively, which improves model performance. Although there is a risk of model collapse, it can be avoided if synthetic data is used sensibly and curated carefully.

In pre-training, synthetic data can replace or supplement web data, as with Hugging Face's Cosmopedia and FineWeb datasets. High-quality synthetic datasets can be produced by rephrasing existing web pages or by building better classifiers to filter web data.

In fine-tuning, synthetic data can improve a model's performance on specific skills, as with Microsoft's AgentInstruct dataset and Allen AI's Tulu 3 SFT mixture. Cohere's Multilingual Data Arbitrage paper proposes a method for generating multilingual synthetic datasets with multiple models.

For evaluation, LLMs can be used as judges, as in MT-Bench and AlpacaEval.

Small models also made remarkable progress in 2024. Llama 3.2 1B scores close to Llama 2 13B on the LMSYS Arena, and some small models now exceed previously released large models on the MMLU benchmark. People are starting to realize that improving model efficiency matters more than simply scaling up model size. Models with 3B+ parameters can now run on mobile devices such as the iPhone, unlocking more on-device use cases. Training small models for longer improves their performance; Meta's MobileLLM paper studies how different architectures affect small-model performance, Apple Intelligence's tech report demonstrates the effectiveness of pruning and distillation for training small models, and Nvidia's hybrid-model paper shows the potential of hybrid architectures for training efficient small models. The SmolLM2 model family achieves state-of-the-art performance at every size it covers. Small vision models such as SmolVLM and Moondream have also advanced significantly. Small models can be fine-tuned to adapt to specific tasks such as text extraction, and structured generation can force a model to follow a specific JSON schema without any fine-tuning.

In short, synthetic data and small models advanced rapidly in 2024, opening new possibilities for applying large language models. Going forward, domain-specific synthetic data and the specialization of small models will become even more important.

Deep Dive

Key Insights

Why is synthetic data becoming increasingly popular in the AI pipeline?

Synthetic data is popular because it is cheaper, faster, and more controllable than human annotations. It allows for precise data generation tailored to specific needs, and with powerful models and efficient inference frameworks, generating large amounts of synthetic data has become feasible.

What are the key concerns surrounding the use of synthetic data in AI models?

The main concerns are model collapse and data pollution. Studies suggest that models trained iteratively on their own synthetic outputs can degrade in quality, and the increasing presence of synthetic data on the web raises questions about its impact on model performance.

How does synthetic data impact model performance on NLP benchmarks?

Surprisingly, models trained on web dumps containing synthetic data often perform better on NLP benchmarks compared to those trained on earlier, cleaner data. This suggests that synthetic data, when properly curated, can enrich the training process rather than degrade it.

What is the role of synthetic data in pre-training large language models?

Synthetic data is increasingly used in pre-training to replace or supplement web data. It allows for controlled generation of diverse, high-quality datasets, such as textbooks and educational content, which can improve model performance on specific tasks like MMLU and OpenBookQA.

What is the significance of Hugging Face's Cosmopedia dataset?

Cosmopedia is a synthetic dataset of textbooks and educational content generated by large language models. It aims to replicate and improve upon Microsoft's Phi 1.5 dataset, offering a diverse and high-quality corpus for training smaller models, with public access for transparency.

How does rephrasing web content contribute to synthetic data generation?

Rephrasing involves using large language models to rewrite existing web content into different formats, such as Q&A or Wikipedia-style passages. This approach improves data quality and diversity without requiring extensive knowledge, making it scalable and effective for generating synthetic datasets.

What is the impact of small models on on-device AI applications?

Small models enable on-device AI by being lightweight and efficient, allowing them to run on consumer hardware like smartphones. This enhances privacy and accessibility, as data remains local, and opens up new use cases like specialized text extraction and on-device chatbots.

What trends are emerging in the training of small models?

A key trend is training smaller models for longer durations, as seen with Meta's Llama 3 models, which were trained on 15 trillion tokens compared to 1 trillion for earlier versions. This approach improves performance without increasing model size, making it more cost-effective for inference.

What are the benefits of using small models for specific tasks?

Small models can be fine-tuned for specific tasks like text extraction, achieving performance close to larger models at a fraction of the cost. This makes them ideal for niche applications where efficiency and privacy are critical, such as on-device AI and specialized workflows.

What is the future of synthetic data and small models in AI?

The future involves more domain-specific synthetic data generation, such as for math and specialized tasks, and the continued specialization of small models through fine-tuning. On-device frameworks and applications will also grow, making AI more accessible and privacy-focused.

Chapters
This chapter explores the increasing prevalence of synthetic data in large language model (LLM) pipelines, from post-training to pre-training stages. It also addresses concerns regarding model collapse and examines whether synthetic data negatively impacts model performance.
  • Synthetic data is now used throughout LLM pipelines.
  • Concerns exist regarding model collapse from synthetic data.
  • Studies show synthetic data doesn't necessarily worsen model performance.

Shownotes Transcript

We're back at Latent Space Live, our first mini-conference held at NeurIPS 2024 in Vancouver. This is Charlie, your AI co-host. As a special treat this week, we're recapping the best of 2024 going domain by domain. We sent out a survey to the over 900 of you who told us what you wanted and then invited the best speakers in the Latent Space Network to cover each field.

200 of you joined us in person throughout the day, with over 2,200 watching live online. Our next keynote covers the state of synthetic data and small models, with Loubna Ben Allal of Hugging Face. We last commented on the synthetic data trend at last year's NeurIPS pod, and my goodness did it explode this year across pre-training, post-training and evals.

We are very honoured to have Loubna, who not only worked on Cosmopedia, Hugging Face's open reproduction of Microsoft's Phi 1.5 synthetic textbook-quality dataset, and FineWeb, Hugging Face's new 15 trillion token Common Crawl subset, but she also leads SmolLM, Hugging Face's implementation of Meta's MobileLLM paper, which made waves for its shared matrices and shared weights architecture.

There's been lots of movement this year on small models, from Apple Foundation models rolling out to every iPhone and MacBook on Earth, to Google introducing Gemini Nano in the Chrome browser and Microsoft embedding RWKV into Windows. As always, don't forget to check our show notes for all the selected best papers of 2024 and for the YouTube link to their talk. Watch out and take care.

I'm very happy to be here. Thank you for the invitation. So I'm going to be talking about synthetic data in 2024, and then I'm going to be talking about small on-device models. So I think the most interesting thing about synthetic data this year is that like now we have it everywhere in the large language models pipeline.

I think initially, synthetic data was mainly used just for post-training because naturally that's the part where we needed human annotators to show the models how they should answer instructions, how they should be helpful and not toxic. And when we had LLMs that were really performant, we replaced the human annotators just with the synthetic data.

And then after that, we realized that we don't really have good benchmarks to measure if models follow instructions well, if they are creative enough or if they are chatty enough. So we also started using LLMs as judges. And I think this year and towards the end of last year, we also went to the pre-training parts.

And we started generating synthetic data for pre-training to kind of replace some parts of the web. And the motivation behind that is that you have a lot of control over synthetic data. You can control your prompt and basically also the kind of data that you generate. So instead of just trying to filter the web, you could try to get the LLM to generate what you think the best web pages could look like and then train your models on that. So this is how we went from not having synthetic data at all in the LLM pipeline to having it everywhere.

And so the cool thing is like today you can train an LLM with like an entirely synthetic pipeline. For example, you can use our Cosmopedia data sets and you can train a 1B model on like 150 billion tokens that are 100% synthetic. And those are also of good quality. And then you can instruction tune the model on a synthetic SFT data set. You can also do DPO on a synthetic data set. And then to evaluate if the model is good, you can use a benchmark that uses LLMs as a judge, for example, MTBench.

or AlpacaEval. So I think this is really mind-blowing, because just a few years ago we wouldn't have thought this was possible. And I think there's a lot of concerns about model collapse, and I'm going to talk about that later. But we'll see that if we use synthetic data properly and we curate it carefully, that shouldn't happen.

And the reason synthetic data is very popular right now is that we have really strong models, both open and closed. It is really cheap and fast to use compared to human annotations, which cost a lot and take a lot of time. And also for open models right now, we have some really good inference frameworks. So if you have enough GPUs, it's really easy to spin up these GPUs and generate a lot of synthetic data. Some examples are vLLM, TGI, and TensorRT-LLM.
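
To give a sense of what that generation loop looks like in practice, here is a minimal sketch using vLLM's offline batch API; the model name and prompts are placeholders, not anything from the talk.

```python
# Minimal sketch: batch-generating synthetic completions with vLLM's offline API.
# The model name and prompts below are placeholders; any open instruct model works.
from vllm import LLM, SamplingParams

prompts = [
    "Write a short textbook-style explanation of gradient descent for high school students.",
    "Rewrite the following paragraph as a Q&A pair: ...",
]

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder checkpoint
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=512)

outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```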

Now let's talk about the elephant in the room, model collapse. Is this the end? If you look at the media and, for example, some papers in Nature, it's really scary, because there's a lot of synthetic data out there on the web. And naturally, we train on the web, so we're going to be training on a lot of synthetic data. And if model collapse is going to happen, we should really take that seriously. And the other issue is that, as I said, a lot of people think the web is polluted because there's a lot of synthetic data.

And for example, when we were building the FineWeb dataset here, Guilherme and Hynek were interested in how much synthetic data there is on the web. But there isn't really a method to properly measure the amount of synthetic data or to say whether a webpage is synthetic or not.

But one thing we can do is to look for proxy words, for example expressions like "as a large language model" or words like "delve" that we know are often generated by ChatGPT. We can measure the frequency of these words in our datasets and compare it across years. For example, here we measured the ratio of these words in different dumps of Common Crawl. And we can see that the ratio really increased after ChatGPT's release.
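
As a rough illustration of that kind of proxy measurement, here is a hypothetical sketch; the phrase list and the way dumps are grouped are illustrative assumptions, not the actual FineWeb tooling.

```python
# Hypothetical sketch: estimate how often ChatGPT-flavoured "proxy phrases" appear
# per Common Crawl dump. The phrase list and data layout are illustrative only.
import re

PROXY_PATTERNS = [
    r"\bas a large language model\b",
    r"\bas an ai language model\b",
    r"\bdelve\b",
]

def proxy_word_ratio(docs_by_dump: dict[str, list[str]]) -> dict[str, float]:
    """Return, for each dump, the fraction of documents containing a proxy phrase."""
    ratios = {}
    for dump, docs in docs_by_dump.items():
        hits = sum(
            1 for doc in docs
            if any(re.search(p, doc, flags=re.IGNORECASE) for p in PROXY_PATTERNS)
        )
        ratios[dump] = hits / max(len(docs), 1)
    return ratios

# Usage: ratios = proxy_word_ratio({"CC-2022-40": docs_a, "CC-2024-10": docs_b})
```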

So if the amount of synthetic data had not changed, you would expect this ratio to stay constant, which is not the case.

So there's probably a lot of synthetic data on the web, but does this really make models worse? What we did is we trained different models on these different dumps and then computed their performance on popular NLP benchmarks, and then we computed the aggregated score. And surprisingly, you can see that the latest dumps are actually even better than the earlier dumps. So if there's some synthetic data in there, at least it did not make the models worse.

Yeah, which is really encouraging. So personally, I wouldn't say the web is polluted with synthetic data. Maybe it's even making it richer.

And the issue with the model collapse studies is that they were done at a small scale: you would ask the model to complete, for example, a Wikipedia paragraph, then train it on these new generations, and do that iteratively. I think if you take that approach, it's normal to observe this kind of behavior, because the quality is going to be worse since the model is already small. And then if you train it just on its own generations, you shouldn't expect it to become better.

But what we're really doing here is that we take a model that is very large and we try to distill its knowledge into a model that is smaller. And in this way, you can expect to get better performance for your small model. Using synthetic data for pre-training became really popular after the Textbooks Are All You Need paper, where Microsoft basically trained a series of small models on textbooks that were generated using a large LLM. And they found that these models were actually better than models that are much larger.

So this was really interesting. It was the first of its kind, but it was also met with a lot of skepticism, which is a good thing in research. It pushes you to question things, because the dataset that they trained on was not public. So people were not really sure whether these models are really good or whether there's just some data contamination. It was really hard to check if you only have the weights of the models.

And at Hugging Face, because we like open source, we tried to reproduce what they did. So this is our Cosmopedia dataset. We basically tried to follow a similar approach to what they documented in the paper. And we created a synthetic dataset of textbooks and blog posts and stories that had almost 30 billion tokens. And we tried to train some models on that.

And we found that the key ingredient to getting a good synthetic dataset is trying as much as possible to keep it diverse. Because if you just throw the same prompt at your model, like "generate a textbook about linear algebra", then even if you change the temperature, the textbooks are going to look alike. So there's no way you could scale to millions of samples.

And the way you do that is by creating prompts that have some seeds that make them diverse. In our case, we would ask the model to generate a textbook, but make it related to an extract from a webpage, and we also try to frame it so it stays within a topic. For example, here we put an extract about cardiovascular bioimaging, and then we ask the model to generate a textbook related to medicine that is also related to this webpage.
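
To make the idea concrete, here is a hypothetical sketch of such a seed-conditioned prompt; the template wording, topic, and audience parameter are illustrative, not the actual Cosmopedia prompts.

```python
# Hypothetical sketch of a seed-conditioned, Cosmopedia-style prompt.
# The template wording is illustrative, not the prompt actually used for Cosmopedia.
def build_textbook_prompt(topic: str, web_extract: str, audience: str = "college students") -> str:
    return (
        f"Write a textbook chapter for {audience} about a subject related to {topic}.\n"
        f"Use the following web extract as inspiration so the chapter stays on topic,\n"
        f"but do not copy it verbatim:\n\n"
        f"---\n{web_extract}\n---\n\n"
        f"The chapter should be self-contained, rigorous, and include worked examples."
    )

prompt = build_textbook_prompt(
    topic="medicine",
    web_extract="Cardiovascular bioimaging techniques such as ...",
)
```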

And this is a really nice approach because there are so many web pages out there, so you can be sure that your generations are going to be diverse when you change the seed example. One thing that's challenging with this is that you want the seed samples to be related to your topics.

So we use a search tool to go through the FineWeb dataset and find the pages that are related to the topics we're interested in. And then we also do a lot of experiments with the type of generations we want the model to produce. For example, we ask it for textbooks for middle school students or textbooks for college students. And we found that some generation styles help on some specific benchmarks while others help on other benchmarks.

For example, college textbooks are really good for MMLU, while middle school textbooks are good for benchmarks like OpenBookQA and PIQA. This is a sample from our search tool. For example, you have a top category, which is a topic, then you have some subtopics, and then you have the topic hits, which are basically the web pages in FineWeb that belong to these topics.

And here you can see the comparison to FineWeb. We had two versions of Cosmopedia, V1 and V2, in blue and red. And as you can see, throughout training, training on Cosmopedia was consistently better. So we managed to get a dataset that was actually good to train these models on. It's of course much smaller than FineWeb, only 30 billion tokens, but that's the scale Microsoft's dataset was at. So we kind of managed to reproduce a bit of what they did.

And the data set is public, so everyone can go there and check if everything is all right. And now this recent paper from Nvidia, Nemotron CC, they took things a bit further and they generated not a few billion tokens, but 1.9 trillion tokens, which is huge. And we can see later how they did that. It's more of like rephrasing the web.

So we can see today that there are some really huge synthetic datasets out there and they're public. So you can try to filter them even further if you want to get higher quality corpora. As for this rephrasing-the-web idea, the approach was suggested in this paper by Pratyush,

where basically they take some samples from the C4 dataset, and then they use an LLM to rewrite these samples into a better format. For example, they ask an LLM to rewrite the sample into a Wikipedia passage or into a Q&A page.

And the interesting thing about this approach is that you can use a model that is small, because rewriting doesn't require much knowledge. It's just rewriting a page into a different style, so the model doesn't need extensive knowledge of what it is rewriting, compared to asking a model to generate a brand new textbook without giving it any ground truth. So here they rewrite some samples from C4 into Q&A and into Wikipedia style, and they find that doing this works better than training just on C4.
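
Here is a hypothetical sketch of what such rephrasing prompts might look like; the style templates are illustrative rather than the prompts used in the paper.

```python
# Hypothetical sketch of web-rephrasing prompts in the spirit of "Rephrasing the Web".
# The style templates are illustrative; the paper's actual prompts differ.
REPHRASE_STYLES = {
    "wikipedia": "Rewrite the following text as a neutral, well-structured encyclopedia passage:",
    "qa": "Rewrite the following text as a list of question-and-answer pairs covering its content:",
}

def build_rephrase_prompt(style: str, document: str) -> str:
    instruction = REPHRASE_STYLES[style]
    return f"{instruction}\n\n{document}\n\nRewritten version:"

# Usage: feed build_rephrase_prompt("qa", c4_sample) to a small instruct model,
# then mix the rewrites with the original corpus for pre-training.
```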

And what they did in Nemotron-CC is a similar approach. They rewrite some pages from Common Crawl for two reasons. One is to improve pages that are low quality, so they rewrite them into, for example, a Wikipedia-style page so they look better.

And another reason is to create more diverse datasets. So they have a dataset that they already heavily filtered, and then they take the pages that are of high quality and ask the model to rewrite them in question-and-answer format, either as open-ended questions or as multiple-choice questions. This way they can reuse the same page multiple times without worrying about duplicates, because it's the same information but rewritten differently.

So I think that's also a really interesting approach for generating synthetic data just by rephrasing the pages that you already have. There's also this approach called ProX,

where they start from a web page and then generate a program which specifies how to rewrite that page to make it better and less noisy. For example, here you can see that there's some leftover metadata in the web page and you don't necessarily want to keep that for training your model. So they train a model that can generate programs that normalize the page and remove extraneous lines. I think this approach is also interesting, but it's maybe less scalable than the approaches I presented before.

So that was it for rephrasing and generating new textbooks. Another approach that I think is really good and becoming really popular for using synthetic data for pre-training

is basically building better classifiers for filtering the web. For example, here we released the dataset called FineWeb-Edu. And the way we built it is by taking Llama 3 and asking it to rate the educational content of web pages from zero to five. So for example, if a page is like a really good textbook that could be useful in a school setting, it would get a really high score. And if a page is just an advertisement or promotional material, it would get a lower score.

And then after that, we take these synthetic annotations and we train a classifier on them, a classifier like a BERT model. And then we run this classifier on all of FineWeb, which is a 15 trillion token dataset, and we only keep the pages that have a score higher than three. So in our case, we went from 15 trillion tokens to just 1.5 trillion tokens, which are the really highly educational ones.
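
A minimal sketch of that annotate-then-filter step might look like the following; the checkpoint name, label format, and threshold are assumptions for illustration, not the actual FineWeb-Edu pipeline.

```python
# Hypothetical sketch of classifier-based filtering in the FineWeb-Edu spirit.
# Assumes a placeholder checkpoint whose predicted label is the 0-5 educational score.
from transformers import pipeline

scorer = pipeline("text-classification", model="your-org/edu-quality-classifier")  # placeholder

def keep_page(text: str, threshold: float = 3.0) -> bool:
    """Keep a page only if the predicted educational score clears the threshold."""
    result = scorer(text[:2000])[0]   # crude truncation; e.g. {"label": "4", "score": 0.9}
    return float(result["label"]) >= threshold

corpus = ["Photosynthesis converts light energy into ...", "BUY NOW!!! Limited offer ..."]
filtered = [doc for doc in corpus if keep_page(doc)]
```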

And as you can see here, FineWeb-Edu outperforms all the other public web datasets by a large margin on a couple of benchmarks. Here I show the aggregated score. And you can see that this approach is really effective for filtering web datasets to get better corpora for training your LLMs.

Others have also tried this approach. There's, for example, the DCLM dataset, where they also train a classifier, but not to detect educational content. Instead, they trained it on the OpenHermes dataset, which is a dataset for instruction tuning, and also on the ExplainLikeImFive subreddit. And they also get a really high quality dataset, which is very information dense and can help you train some really good LLMs.

And then for Nemotron-CC, they also took this approach, but instead of using one classifier, they used an ensemble of classifiers. So they use, for example, the DCLM classifier and also classifiers like the ones we used in FineWeb-Edu. And then they combine these scores with an ensemble method to only retain the best high quality pages. And they get a dataset that works even better than the ones we developed.
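
A hypothetical sketch of that kind of classifier ensemble, with the scorers and the voting rule as illustrative assumptions:

```python
# Hypothetical sketch of ensembling several quality classifiers, as described for
# Nemotron-CC. The scorer callables and combination rule are illustrative assumptions.
from typing import Callable

Scorer = Callable[[str], float]  # maps a document to a quality score in [0, 1]

def ensemble_keep(doc: str, scorers: list[Scorer], min_votes: int = 2, cutoff: float = 0.5) -> bool:
    """Keep a document if enough individual classifiers rate it above the cutoff."""
    votes = sum(1 for scorer in scorers if scorer(doc) >= cutoff)
    return votes >= min_votes

# Usage: ensemble_keep(page_text, [dclm_score, edu_score, fluency_score])
```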

So that was it for synthetic data for pre-training. Now we can go back to post-training. I think there are a lot of interesting post-training datasets out there. One that was released recently is AgentInstruct by Microsoft, where they basically try to target some specific skills and improve the performance of models on them. For example, here you can see code, brain teasers, open domain QA. They fine-tuned Mistral 7B on it, and the result outperforms the original instruct model that was released by Mistral.

And as I said, to get good synthetic data, you really have to have a framework to make sure that your data is diverse. For example, they always seed the generations on either source code or raw text documents, and then they rewrite them to make it easier to generate instructions from them. And then they use that for their instruction data generation.

There's also the Tulu 3 SFT mixture, which was released recently by Allen AI. It's also really good quality and it covers a wide range of tasks. And the way they make sure that this dataset is diverse is by using personas from the PersonaHub dataset, which is basically a dataset of, I think, over a million personas.

And for example, in the Tulu mixture, to generate a new code snippet, they would give the model a persona, for example a machine learning researcher interested in neural networks, and then ask it to generate a coding problem. This way you make sure that your dataset is really diverse, and then you can further filter the datasets, for example, using reward models.
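
Here is a hypothetical sketch of persona-seeded generation; the personas and prompt wording are illustrative, not taken from PersonaHub or the Tulu 3 pipeline.

```python
# Hypothetical sketch of persona-seeded instruction generation in the spirit of
# PersonaHub / the Tulu 3 mixture. Personas and wording are illustrative only.
import random

personas = [
    "a machine learning researcher interested in neural networks",
    "a high school physics teacher preparing lab exercises",
    "a backend engineer who cares about database performance",
]

def build_coding_prompt(persona: str) -> str:
    return (
        f"You are {persona}. Write one realistic, self-contained coding problem "
        f"you might encounter, then provide a correct Python solution."
    )

prompt = build_coding_prompt(random.choice(personas))
# Send `prompt` to a strong teacher model, then filter the outputs with a reward model.
```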

We also released a dataset called SmolTalk, where we also tried to cover a wide range of tasks. And as you can see here, when fine-tuning Mistral 7B on this dataset, we also outperformed the original Mistral instruct on a number of benchmarks, notably on mathematics and on instruction following with IFEval.

Another paper that's really interesting I wanted to mention is this one called Multilingual Data Arbitrage by Cohere. And basically, they want to generate a data set for post-training that is multilingual. And they have a really interesting problem. It's the fact that there isn't like one model that's really good at all the languages they wanted.

So what they do is use not just one teacher model, but multiple teachers. And then they have a router which basically sends the prompts they have to all these models. Then they get the completions, and they have a reward model that rates all these generations and only keeps the best one.
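
A minimal sketch of that route-score-keep loop, with the teacher and reward model calls as placeholder callables:

```python
# Hypothetical sketch of multi-teacher "data arbitrage": route a prompt to several
# teacher models, score the candidates with a reward model, keep the best one.
# The callables below stand in for real model and reward-model APIs.
from typing import Callable

Teacher = Callable[[str], str]               # prompt -> completion
RewardModel = Callable[[str, str], float]    # (prompt, completion) -> score

def arbitrage(prompt: str, teachers: dict[str, Teacher], reward: RewardModel) -> tuple[str, str]:
    """Return (teacher_name, completion) with the highest reward score."""
    candidates = {name: teacher(prompt) for name, teacher in teachers.items()}
    return max(candidates.items(), key=lambda item: reward(prompt, item[1]))

# Usage: arbitrage("Translate to Swahili: ...", {"model_a": gen_a, "model_b": gen_b}, rm_score)
```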

And this is like arbitrage in finance. So I think what's interesting here is that it shows synthetic data doesn't have to come from a single model. And because we have so many good models now, you could pool these models together and get a dataset that's really high quality, diverse, and that covers all your needs. I was supposed to put a meme there, but yeah, so that was it for synthetic data. Now we can go see what's happening in the small models field in 2024.

I don't know if you know, but now we have some really good small models. For example, Llama 3.2 1B matches Llama 2 13B, which was released last year, on the LMSYS Arena, which is basically the default go-to leaderboard for evaluating models using human evaluation. And as you can see here, the scores of the models are really close. So I think we've made a huge leap forward in terms of small models.

Of course, that's just one data point, but there's more. For example, if you look at this chart from the Qwen 2.5 blog post, it shows that today we have some really good models that are only 3 or 4 billion parameters. They score really high on MMLU, which is a really popular benchmark for evaluating models. And you can see here that the blue dots have more than 65 on MMLU.

And the gray ones have less. For example, Llama 33B had less. So now we have a 3B model that outperforms a 33B model released earlier on the MMLU benchmark. So I think now people are starting to realize that we shouldn't just scale and scale models, but we should try to make them more efficient.

I don't know if you knew, but you can also chat with a 3B+ model on your iPhone. For example, here, this is an app called PocketPal, where you can go and select a model from Hugging Face. It has a large selection. For example, here we loaded Phi 3.5, which is 3.8 billion parameters, on this iPhone, and we can chat with it. And you can see that even the latency is acceptable. For example, here I asked it to give me a joke about NeurIPS. So let's see what it has to say.

Okay, why did the neural network attend NeurIPS? Because it heard there would be a lot of layers and fun and it wanted to train its sense of humor. So not very funny, but at least it can run on device. Yeah, so I think now we have good small models, but we also have like good frameworks and tools to use these small models. So I think we're really close to having like really on edge and on device models that are really good. And I think for a while we've had this narrative that just training larger models is better.

Of course, this is supported by the scaling laws. As you can see here, for example, when we scale the model size, the loss is lower and obviously you get a better model. And we can see this, for example, in the GPT family of models, how we went from just 100 million parameters to more than a trillion parameters. And of course, we all observed the performance improvement when using the latest models.

But one thing that we shouldn't forget is that when we scale the model, we also scale the inference costs and time. And so the largest models are going to cost so much more.

So I think now instead of just building larger models, we should be focusing on building more efficient models. It's no longer a race for the largest models since these models are really expensive to run and they require like a really good infrastructure to do that and they cannot run on, for example, consumer hardware. And when you try to build more efficient models that match larger models, that's when you can really unlock some really interesting on-device use cases.

And I think a trend that we're noticing now is the trend of training smaller models for longer. For example, if you compare how long Llama 1 was trained compared to Llama 3, there is a huge increase in the pre-training length. Llama 1 was trained on 1 trillion tokens, but Llama 3 8B was trained on 15 trillion tokens.

So Meta managed to get a model that's the same size but performs so much better, by choosing to spend more compute during training. Because as we know, training is a one-time cost, but inference is something that's ongoing.
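
As a rough, back-of-the-envelope illustration of that trade-off, the sketch below uses the common approximations of about 6ND FLOPs for training and 2N FLOPs per generated token for inference; the model sizes and token counts are illustrative, not figures from the talk.

```python
# Back-of-the-envelope sketch of the train-once / infer-forever trade-off, using the
# rough rules of thumb of ~6*N*D FLOPs for training and ~2*N FLOPs per token served.
def training_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

def inference_flops(n_params: float, tokens_served: float) -> float:
    return 2 * n_params * tokens_served

small = 8e9    # an 8B model trained very long (illustrative)
big = 70e9     # a 70B model (illustrative)

# Training the small model on 15T tokens costs more up front than on 2T tokens...
print(f"{training_flops(small, 15e12):.2e} vs {training_flops(small, 2e12):.2e}")
# ...but every token served is roughly 9x cheaper than serving the 70B model.
print(f"{inference_flops(small, 1e12):.2e} vs {inference_flops(big, 1e12):.2e}")
```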

If you want to see what the small model trends were in 2024, I think this MobileLLM paper by Meta is interesting. They study different models that have less than 1 billion parameters and try to find which architecture makes the most sense for these models.

For example, they find that depth is more important than width. So it's more important to have models with more layers than to just make them wider. They also find that GQA helps and that tying embeddings helps. So I think it's a nice study overall for models that are just a few hundred million parameters.

There's also the Apple Intelligence tech report, which is interesting. For Apple Intelligence, they had two models, one that was on the server and another model that was on device, which had 3 billion parameters. And I think the interesting part is that they trained this model using pruning and then distillation. For example, they have this table where they show that using pruning and distillation works much better than training from scratch. And they also have some interesting insights about how they specialize their models on specific tasks,

like, for example, summarization and rewriting. There's also this paper by Nvidia that was released recently. I think you've already had a talk about hybrid models, which was also interesting. In this model, they use a hybrid architecture between state space models and transformers, and they managed to train a 1B model that's really performant without needing to train it on a lot of tokens.

And regarding our work, we just recently released SmolLM2. It's a series of three models, which are best in class at each model size. For example, our 1.7B model outperforms Llama 3.2 1B and also Qwen 2.5 1.5B. And the way we managed to train this model is that we spent a lot of time trying to curate the pre-training dataset.

We did a lot of ablations trying to find which datasets are good and also how to mix them. We also created some new math and code datasets that we're releasing soon. We basically spent a lot of time trying to find the best mixture to train these models on, and then we also trained these models for very long. For example, SmolLM1 was trained on only 1 trillion tokens.

But this model was trained on 11 trillion tokens, and we saw that the performance kept improving. The models didn't really plateau mid-training, which I think is really interesting. It shows that you can train such small models for very long and keep getting performance gains. What's interesting about SmolLM2 is that it's fully open. We also released the pre-training code base, the fine-tuning code and datasets, and also the evaluation code in this repository.

There are also really interesting small models not just for text but also for vision. For example, here you can see SmolVLM, which is a 2B model that's really efficient. It doesn't consume a lot of RAM and it also has good performance.

There's also Moondream 0.5B, which was released recently. It's the smallest visual language model. And as you can see, there isn't a big trade-off compared to Moondream 2B. So now I've shown you that we have some really good small models, and we also have the tools to use them, but why should you consider using small models, and when?

I think small models are really interesting because of the on-device angle. Because these models are small and they can run fast, you can basically run them on your laptop but also on your mobile phone. And this means that your data stays local. You don't have to send your queries to third parties, and this really enhances privacy. That was, for example, one of the big selling points for Apple Intelligence.

Also, right now we really have so many frameworks to do on-device inference. For example, there's MLX, MLC, llama.cpp, Transformers.js. So we have a lot of options and each of them has great features. Small models are also really powerful if you choose to specialize them. For example, there's a startup called NuMind which took SmolLM and then fine-tuned it on text extraction datasets,

and they managed to get a model that's not very far from models that are much larger. So I think text extraction is like one use case where small models can be really performant and it makes sense to use them instead of just using larger models.
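
Running one of these models locally really is only a few lines with any of the frameworks above; here is a minimal sketch assuming llama-cpp-python and a quantized GGUF checkpoint, with the file path as a placeholder.

```python
# Minimal on-device chat sketch using llama-cpp-python with a quantized GGUF model.
# The model path is a placeholder; any small instruct model exported to GGUF works.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/smollm2-1.7b-instruct-q4_k_m.gguf",  # placeholder path
    n_ctx=2048,    # context window
    n_threads=4,   # CPU threads; tune for your device
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Tell me a joke about NeurIPS."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```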

You can also chat with these models in the browser. For example, here you can go there, load the model, even turn off your internet, and just start chatting with the model locally. Speaking of text extraction, if you don't want to fine-tune the models, there's a really good method called structured generation, where you can basically force the model to follow a JSON schema that you define.

For example, here we try to force the model to follow a schema for extracting key information from GitHub issues. So you can input free text, which is a complaint about a GitHub repository, something not working. You paste it in there, and the model extracts anything that is relevant for creating a GitHub issue: for example the priority, here it's high, the type of the issue, here a bug, and then a title and an estimate of how long it will take to fix.
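
A hypothetical sketch of the schema side of that workflow: define the issue fields with Pydantic and hand the resulting JSON schema to a constrained-decoding backend such as Outlines or a server's guided-JSON mode. The field names are illustrative, not those used in the demo.

```python
# Hypothetical sketch: a Pydantic schema for GitHub-issue extraction. The field
# names are illustrative; pass the JSON schema to a structured-generation backend.
from enum import Enum
from pydantic import BaseModel

class Priority(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"

class GitHubIssue(BaseModel):
    title: str
    issue_type: str        # e.g. "bug" or "feature"
    priority: Priority
    estimated_hours: float

schema = GitHubIssue.model_json_schema()
# Hand `schema` to the constrained decoder so every completion parses as:
# GitHubIssue.model_validate_json(model_output)
```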

You can just do this in the browser and transform your text into a GitHub issue that's properly formatted. So what's next for synthetic data and small models?

I think that domain-specific synthetic data is already important and is going to become even more important. For example, generating synthetic data for math, I think this would really help improve the reasoning of a lot of models. And a lot of people are doing it, for example Qwen 2.5 Math, and everyone's trying to reproduce o1. So for synthetic data, trying to specialize it on some domains is going to be really important.

And then for small models, I think specializing them through fine tuning is also going to be really important because I think a lot of companies are just trying to use these large models because they are better. But on some tasks, I think you can already get decent performance with small models. So you don't need to pay like a cost that's much larger just to make your model better at your task by a few percent. And this is not just for text. And I think it also applies for other modalities like vision and audio.

And I think you should also watch out for on-device frameworks and applications. For example, the app I showed, PocketPal, and Ollama, all these frameworks are becoming really popular, and I'm pretty sure that we're going to get more of them in 2025. And users really like that.

Maybe for other, I should also say a hot take. I think that like in AI, we just started like with fine tuning, for example, trying to make BERT work on some specific use cases and really struggling to do that. And then we had some models that are much larger. So we just switched to like prompt engineering to get the models to solve our tasks.

And I think we're going back to fine tuning where we realize these models are really costly. It's better to use just a small model. We try to specialize it. So I think it's a little bit of a cycle and we're going to start to see like more fine tuning and less of just like prompt engineering the models. So that was my talk. Thank you for following. And if you have any questions, we can take them now.