
863: TabPFN: Deep Learning for Tabular Data (That Actually Works!), with Prof. Frank Hutter

2025/2/18

Super Data Science: ML & AI Podcast with Jon Krohn


People

Frank Hutter

Jon Krohn

Topics

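Jon Krohn: Deep learning has made remarkable progress on images, audio, and natural language, but progress on tabular data has been slow. TabPFN offers a new way to address this. Frank Hutter: Tabular data differs from other kinds of data: data sets are usually small and diverse, and the features are typically predefined. Deep learning excels at feature extraction, but tabular data does not need that kind of feature extraction. TabPFN uses a GPT-like transformer architecture that can do in-context learning: it takes the entire training set and the test set as input and directly predicts the test outputs, without explicitly learning features. TabPFN is trained on synthetic data, which is produced by defining a prior distribution over what data sets might look like. Prior-data fitted networks (PFNs) build on Bayesian inference: by sampling data from the prior and running supervised learning, they directly approximate the posterior predictive distribution and avoid complex Bayesian inference computations. Compared with v1, TabPFN v2 is significantly better at handling data types, missing values, outliers, and data scale, which broadens its applicability. TabPFN v2 achieves state-of-the-art performance on time series forecasting without being trained specifically on time series data. Prior Labs aims to commercialize TabPFN and to build products that are easier for a broad audience to use.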


This is episode number 863 with Professor Frank Hutter, co-founder and CEO of Prior Labs. Today's episode is brought to you by ODSC, the Open Data Science Conference.

Welcome to the Super Data Science Podcast, the most listened-to podcast in the data science industry. Each week, we bring you fun and inspiring people and ideas exploring the cutting edge of machine learning, AI, and related technologies that are transforming our world for the better. I'm your host, Jon Krohn. Thanks for joining me today. And now, let's make the complex simple.

Welcome back to the Super Data Science Podcast. Today's episode is an excellent one with the renowned machine learning professor, Dr. Frank Hutter. Frank is a tenured professor of machine learning and head of the machine learning lab at the University of Freiburg, although he has been on leave since May to focus on his fellowship on AutoML and tabular foundation models at the ELLIS Institute in Tübingen, Germany, as well as becoming co-founder and CEO of Prior Labs, a German startup that provides a commercial counterpart to his tabular deep learning research and open-source projects. And the company has just announced a huge 9 million euro, so about $9 million, pre-seed funding round. Wow.

In addition to that, he holds a PhD in computer science from the University of British Columbia, and his research has been extremely impactful. It has been cited over 87,000 times. Today's episode is on the technical side and will largely appeal to hands-on practitioners like data scientists, AI or ML engineers, software developers, or statisticians, especially Bayesian statisticians.

So for a bit of context on the topic of today's episode, pretty much everyone works with tabular data, either primarily or occasionally. Tabular data, I'm sure you're familiar with them once I describe them, are data stored in a table format. So they're structured into rows and columns, like in a spreadsheet, where the columns might be different data types. Say some columns are numeric, some are categorical, and some are text.

For a decade, deep learning has ushered in the AI era by making huge advancements across many kinds of data, pixels from cameras, sounds from microphones, and of course, natural language. But through all of this AI revolution, deep learning has struggled to be impactful on highly ubiquitous tabular data until now.

In today's episode, Professor Frank Hutter details how his revolutionary transformer architecture, TabPFN, has finally cracked the code on using deep learning for tabular data and is outperforming traditionally leading approaches like gradient-boosted trees on tabular datasets. In this episode, he talks about how version 2 of TabPFN, released last month to much fanfare thanks to its publication in the prestigious journal Nature, is a massive advancement, allowing it to handle orders of magnitude more training data.

He also talks about how embracing Bayesian principles allowed TabPFN version 2 to work out of the box on data it wasn't even trained on, like time series data, beating specialized models, and setting a new state of the art on the key time series analysis benchmark.

And he also talks about the breadth of verticals that TabPFN has already been applied to and how you can now get started with this conveniently open-source project on your tabular data today. All right, you ready for this horizon-expanding episode? Let's go.

Frank, welcome to the Super Data Science Podcast. It's awesome to have you on the show. Where are you calling in from today?

Thanks so much for having me. I'm in Freiburg. It's a beautiful city in the south of Germany, close to France and Switzerland. It's a beautiful university town, and now it's actually sort of turned into the new German center of foundation models. There is Black Forest Labs here. We're here building TabPFN. So yeah, I'm super excited to be on the show, but also about Freiburg being really on the rise.

Nice, that is exciting. Do you end up having a lot of in-person interaction? Does your lab, does your company meet in person, and all these kinds of people from Black Forest Labs, do you actually rub shoulders with them in person?

We do go for coffee every now and then, but we're both pretty busy. But there's meetups and so on.

Nice, it is great to have that kind of community.

So, yeah, let's talk about TabPFN. You mentioned it just there. TabPFN is something that's been exciting to me for a couple of years. When version one came out, I took notice of it as really the only tabular data deep learning framework that I've noticed. So it definitely made a splash.

So there's a few different things that I want to talk about. First of all, we're going to talk about what the name means. And we'll talk about... Everybody mispronounces it. FPN? And yeah, so it stands for Tabular Prior-Data Fitted Network.

What does that mean? Break it down for us. Tell us what it means to have prior data fitted into the network. And I guess something that I can even explain very easily, I mean, you can expand on it, is this tabular idea: most deep learning models are optimized for dealing with data that have a lot of spatial patterns, so things like machine vision, natural language processing.

But I mean, I've been teaching deep learning for almost a decade now. And very frequently, I would have students, say, finance students, who have some tabular data. And they think, I'd love to train a deep learning model on this. And they would always find disappointing results relative to things like boosted trees or often just plain old regression models. And so, yeah, tell us about what makes this TabPFN architecture different, why it made such a big splash a few years ago when version one came out, and then now, with this brand new release of version two, what the differences are.

All right. Yeah, that's a lot of questions to unpack.

It is a lot of questions. I can remember them all.

Maybe let's start with tabular. So what is tabular data? And why is it so different from vision data or speech data or text data and so on? So...

Tabular data is super common in the enterprise. It's tables: think Excel sheets, relational databases. There's so much information stored in these tables. And you have applications in all kinds of domains, like health care, finance, business analytics, insurance, retail, whatnot. And there's your typical classification and regression problems, which you learn sort of in machine learning 101, the stuff you fit a random forest to and so on. And there's also time series data, there's recommender systems, and all of these really work with tabular data.

And one of the properties of tabular data is that typically, actually, most data sets are relatively small. And there's a lot of these relatively small data sets. And each of these data sets is very different. So if you have a data set from health care, let's say you want to predict, based on some omics, blood work, whether a patient has early-stage Alzheimer's.

Then you collect some data; maybe you have 5,000 patients that you had over the last couple of years. And you know whether they had early-stage Alzheimer's or not. And then you get a new one and, well, you want to predict whether they have it or not. You can wait a couple of years and then you know whether they had it, but then it's too late to treat it. So you want to predict it. And so there your features are these omics blood values.

And the prediction variable is whether they have early-stage Alzheimer's or not. And then take another data set from, I don't know, banking, let's say fraud detection. Then you have as features all kinds of different transactions that the person had before. And then maybe how much money is in the transaction, who is it going to, et cetera.

And that has just nothing to do, in terms of the features, with omics blood values. So how are you going to learn a model from all of these different tabular data sets? That is actually very tricky.

In particular, if you compare it to, for example, vision: there you have these spatial patterns. Regardless of what you're looking at in terms of an image, there is some spatial regularity that makes it actually an image rather than just some noisy thing to look at. And so from that, we had convolutional neural networks, et cetera, picking up the spatial structure and learning features from the data. That's what deep learning has been enormously strong at: learning successively more abstract representations of your data. And then you have this high-level representation on which you can just fit some sort of very simple linear model in the final layer.

And tabular data, on the other hand, is something where people typically have put some thought into these features. Like, what is this blood marker? What is the amount of money spent? That is actually a feature where it doesn't get much more high-level than that. And so you don't need to discover these features. You already have them. And so the power of deep learning hasn't really reached tabular because it wasn't needed there. You don't need to learn these features; you actually just have these features to start with.

Rather than these more low-level feature engineering or feature generation methods that you get from deep learning, you have higher-level feature engineering, like what data scientists are great at. You look at a particular application and you're like, ah, we're in medicine. We have the height of the patient and the weight of the patient, and we want to classify some disease; maybe it's useful to know whether they're obese. So let's compute BMI by using weight and height. So you have, what is it, weight divided by height squared. And that's a new feature that would be pretty hard for a network to learn off the bat. Of course it can learn it, but it doesn't know that this would actually be a particularly good feature for this particular application, because it doesn't know the context, et cetera. Typically in tabular data, all that is actually fed into models like random forest, et cetera, is the features and the target variables, so the X and the Y, and none of these typical machine learning methods like random forest, XGBoost, et cetera, even look at the column header.

So that's something that, for example, language models would be great at: looking at the context and understanding what's going on and then understanding, ah, that's this column, therefore I could actually generate something like BMI. But that's not part of the problem description of standard tabular machine learning. Therefore, it's really exciting if we can actually build a deep learning method that does do a good job at just the tabular core, because when we have that, then we can combine it with all the power of deep learning, with language models and so on, and build something that's much greater.

But the first step that we took is really to go for an apples-to-apples comparison on the problem that the traditional methods use, and not use any of the column headers, et cetera, and still beat XGBoost, random forest, et cetera, on their own turf. So that it's not just better because we use additional information, but it's already very strong to start with. And then we can, on top of that, include all of this other information. All right. That was a long-winded answer to tabular.
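As a small illustration of the hand-crafted feature mentioned in this answer, the sketch below computes BMI as weight divided by height squared. The units (kilograms and metres), the sample rows, and the obesity cut-off of 30 are assumptions made for the example; none of them come from the episode.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight divided by height squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


# Hypothetical patient rows: (weight in kg, height in m).
patients = [(70.0, 1.75), (95.0, 1.70), (54.0, 1.62)]

# Append the derived feature plus a binary flag. The BMI >= 30 cut-off
# is a common convention and is an assumption here.
rows = [(w, h, bmi(w, h), bmi(w, h) >= 30.0) for w, h in patients]
```

A tree ensemble that only sees the raw weight and height columns would have to approximate this ratio from many axis-aligned splits, which is the point being made about domain knowledge.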


But it was such a great answer. That was excellent: you provide a great scope on this problem and how, with deep learning, we're typically concerned with extracting features from data. With tabular data, we don't typically need to be extracting those features from raw pixels or from raw sound files or from raw natural language. Instead, we typically already have some curated features. But there's a huge opportunity in those curated features to be, quote unquote, thinking thoughtfully about how maybe those features could be recombined.

And so it sounds like what you're saying, and maybe this is the answer you're about to get into with the prior data part, is that the transformer architecture part of this model is able, unlike gradient-boosted trees or linear regression, to take into account the column header to understand what that means and automatically cook up something like, oh, you know, the model then, quote unquote, knows what height is, knows what weight is, and it can automatically calculate BMI. That's really, really cool. I almost swore. I almost said it's really effing cool.

Yeah, and what's super cool actually is that we haven't even done that. And once we do that, it's going to be so much better. But what we have done so far is really only use the same information as XGBoost, et cetera: the X and the Y, the raw numeric values, the raw category labels, et cetera. And we can put it together with all the power of language models, et cetera. And of course, we're working on that and have some initial results.

But yeah, so I mentioned that deep learning isn't really needed for tabular data for generating these features because we already have the features, but, well, we did actually come up with a deep neural network here. So what is different? What is different is that we actually use a transformer, very similar in a sense to GPT, to a standard language model, in the sense that we can do in-context learning. In-context learning is a term that was introduced in the GPT-2 paper. It's this phenomenon where you can tell GPT in the prompt what it should be doing. So you can, for example, prompt it to do a translation task without telling it that it should translate, but just give two languages, like dog is Hund, cat is Katze, mother is, and then it gives the German Mutter. And so, from just these two or three examples, it figures out, ah, I'm supposed to translate, let's do that. And so it basically...

GPT has learned to encode an algorithm that first figures out what the problem is and then solves it. And just like that, actually, we have learned an algorithm that can do tabular learning. And so what we do in our architecture is we feed in the entire X train, Y train, and X test as part of the prompt. And the output is going to be the Y test.
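The interface described here, where the training set and the test inputs go in together and the test predictions come out of one call with no weight updates, can be sketched as follows. This only illustrates the input/output contract: the distance-weighted vote is a stand-in chosen for brevity, not the TabPFN transformer, and the function name `in_context_predict` and the toy data are hypothetical.

```python
import numpy as np


def in_context_predict(X_train, y_train, X_test, temperature=1.0):
    """Return class probabilities for X_test given (X_train, y_train) as context.

    Stand-in for a trained in-context learner: a softmax over negative
    squared distances, loosely resembling attention from each test row
    to the training rows. Nothing is fitted or stored between calls.
    """
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)
    classes = np.unique(y_train)
    # Pairwise squared distances, shape (n_test, n_train).
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)
    # One-hot labels, shape (n_train, n_classes).
    onehot = (y_train[:, None] == classes[None, :]).astype(float)
    return classes, attn @ onehot


X_train = [[0.0, 0.0], [0.1, 0.2], [3.0, 3.0], [3.2, 2.9]]
y_train = [0, 0, 1, 1]
X_test = [[0.05, 0.1], [3.1, 3.0]]
classes, proba = in_context_predict(X_train, y_train, X_test)
y_pred = classes[proba.argmax(axis=1)]
```

In the real model the mapping from context to prediction is learned rather than hand-written, but the call shape is the same: the "prompt" is the data set itself.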

And so one data set is basically a data point for training our model. So we take the X train, Y train, X test, feed it in. The network outputs something. And whatever it outputs, how similar is that to the true Y test? We take the loss between these, the simple cross-entropy loss, take the gradient, and optimize for the outputs of this network to be as similar as possible to the true Y test. Does it make sense?

OK. For sure.

And so I mentioned a data set is a data point. So if we had trillions of data sets, just like GPT is trained on trillions of tokens from the internet, then we could just say, well, we have trillions of data sets from the real world. We just fit a foundation model that does precisely this machine learning task, like classification, for example, on all of these data sets, and we're done. So once we have learned to do that on a trillion data sets, then we can do it on the next data set. That makes a lot of sense. It's very much like a standard language model: you can predict the next word because you've just learned to predict the next word.

But we don't have a trillion data sets. In contrast to language models, there are really very few high-quality data sets on the internet. So there is a bunch of tables. For example, if you go to Wikipedia, there's tables of, well, this basketball player has this number on their back. That's not a machine learning task. It's maybe a retrieval task. But you can't learn anything from that. You can't learn classification or regression from that.

But what you need is really these properly formatted data sets in order to actually train your algorithm on. And when there is lots of noise and missing values and garbage, then yeah, it's pretty hard to actually learn on that. And what we did instead is to actually generate all of our training data synthetically. This is a trend that is also happening in language models, to partially train on synthetic data. But I believe our paper is the first one that really succeeded in only using synthetic data and coming up with a state-of-the-art model. And so the key is really that we needed to generate a prior over what we believe data sets might look like and what types of data sets we want to work well on. And so that's where we're getting to prior-data fitted networks.

We'll take a step back now and first explain what these prior-data fitted networks are, and then come back to TabPFN, which is a prior-data fitted network on tabular data. So basically, first explain the theory of PFNs, prior-data fitted networks. And then the step to TabPFN is just actually creating a prior that generates tabular data.

Yeah, so the PFNs, prior-data fitted networks: there was actually a paper already from 2022. It was called "Transformers Can Do Bayesian Inference."

There we basically showed that if you have a prior that you can sample from, then you can draw many data sets from this prior, and draw many data points from each of these data sets, and fit them just like we just explained with TabPFN. And the resulting model will actually have learned to encapsulate the prior, so that when it's fed real data, it will give you an approximation of the posterior predictive distribution under that prior for that data. And when you train this with an arbitrarily large transformer, arbitrarily well, so that the cross-entropy loss really goes as far down as possible, then you are actually exact, and your posterior prediction is exactly what the true posterior prediction should be. So if you, for example, take a Gaussian process prior or a linear regression, then you get the true Bayesian linear regression out, or the true posterior Gaussian process. Maybe I should pause here for some...

Yeah, so I'm just going to quickly pop in with a couple of hopefully short questions and clarifications. So something that I didn't realize from my initial research, and maybe it isn't the case with all PFNs, with all prior-data fitted networks, but it sounds like fundamental to prior-data fitted networks is Bayesian inference. Is that always the case?

Yes, so they do compute the Bayesian posterior predictive distribution. That's basically what the optimization objective is.

Nice. So there's a Bayesian learning process happening on, in this case, a transformer architecture project.
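The optimization objective referred to here can be written out. The notation below is an assumption introduced for illustration and is not from the episode: $p(\mathcal{D})$ is the prior over data sets, $q_\theta$ is the network with parameters $\theta$, and $(x, y)$ is a held-out point from the same sampled data set.

```latex
\ell(\theta) \;=\;
\mathbb{E}_{\mathcal{D} \cup \{(x, y)\} \sim p(\mathcal{D})}
\bigl[\, -\log q_\theta(y \mid x, \mathcal{D}) \,\bigr]
```

This is the cross-entropy between the network's prediction and the held-out label, averaged over data sets drawn from the prior. Its minimizer over all possible predictors is the posterior predictive distribution $p(y \mid x, \mathcal{D})$, which is why fitting it with ordinary supervised learning amounts to approximating Bayesian prediction.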

That is very cool. Because, you know, we sometimes come across hearing that Bayesian inference is going to be useful with deep learning architectures like transformer architectures. But this sounds like a very concrete use case of it. And yeah, it sounds like a powerful application of it.

Yeah, it is super powerful because, I mean, something like Bayesian linear regression, you can do in closed form. So maybe I should still explain it to set the scene. There, the prior would just be that the data is just a line. A line has some axis intercept and some slope. And so you basically just want a posterior over these two parameters. And if you do the math, you can compute this Bayesian posterior predictive distribution in closed form. For lines, this is fine. For Gaussian processes, for example, it is also fine. But once you go a step further, for example, a Gaussian process where you don't know the hyperparameters, then you have to do Markov chain Monte Carlo or variational inference. Or once you have a neural network and you want to be Bayesian over the weights of your neural network, then it becomes extremely complicated, and you can do all kinds of approximations, or SGLD or Hamiltonian Monte Carlo or whatnot. And the math is really hairy.
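The closed-form case mentioned above, Bayesian linear regression over an intercept and a slope, can be sketched in a few lines. The zero-mean Gaussian prior, the known noise level, and the specific numbers are assumptions chosen to keep the example small; the function name is hypothetical.

```python
import numpy as np


def blr_posterior_predictive(x, y, x_new, noise_std=0.5, prior_std=2.0):
    """Closed-form posterior predictive for y = intercept + slope * x + noise.

    Assumes a zero-mean Gaussian prior N(0, prior_std^2) on intercept and
    slope and Gaussian noise with known standard deviation noise_std.
    Returns predictive means and standard deviations at x_new.
    """
    x, y, x_new = (np.asarray(a, dtype=float) for a in (x, y, x_new))
    Phi = np.column_stack([np.ones_like(x), x])              # design matrix
    Phi_new = np.column_stack([np.ones_like(x_new), x_new])
    # Posterior over (intercept, slope) is Gaussian: N(mean, cov).
    precision = np.eye(2) / prior_std**2 + Phi.T @ Phi / noise_std**2
    cov = np.linalg.inv(precision)
    mean = cov @ Phi.T @ y / noise_std**2
    # Predictive variance = noise variance + parameter uncertainty.
    pred_mean = Phi_new @ mean
    pred_var = noise_std**2 + np.einsum("ij,jk,ik->i", Phi_new, cov, Phi_new)
    return pred_mean, np.sqrt(pred_var)


rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=50)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=50)   # true line: 1 + 2x
mu, sd = blr_posterior_predictive(x, y, x_new=[0.0, 10.0])
```

The predictive standard deviation grows for the query far outside the observed range, which is the kind of calibrated uncertainty a PFN is trained to reproduce without any of this algebra.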

And Markov chain Monte Carlo typically takes a long time to converge. There's a saying, "while my chain is smoothly sampling"; it takes a week or something like that until you get going. And the other possibility is variational inference, which is also often quite hairy in the math and also typically has approximation errors. In contrast to that, prior-data fitted networks are just so incredibly simple. All you do is you sample from your prior. You get a bunch of lines.

And then you feed in data points from one of these lines, and another data point from the same line as a test example. And you want to output the true predictive distribution for that data point; that is the optimization objective for that one line. And you sample millions of these lines and just learn over millions of these lines, by just standard supervised learning, to predict the missing values.

And naturally, you learn to actually compute the predictive distribution because, well, typically there is noise in this data, et cetera. And since we optimize this cross-entropy loss, if we're far off and certain, then that's bad, and that's penalized. So you automatically learn something that is regularized to be exactly the right thing to predict. If you take arbitrarily many curves from this prior that, at the data points, look like your sample, and integrate over all of them, you get the best predictive distribution. And that's precisely the right predictive distribution.
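The recipe just described, sampling lines from a prior, turning each into a small supervised task, and scoring predictive distributions by log-likelihood, can be sketched as below. The standard-normal prior on intercept and slope, the noise level, and the use of a Gaussian negative log-likelihood as the loss are all assumptions for illustration; the network that would be trained on these tasks is omitted.

```python
import numpy as np


def sample_line_task(rng, n_context=20, noise_std=0.3):
    """Draw one synthetic regression task from a prior over lines."""
    intercept, slope = rng.normal(0.0, 1.0, size=2)   # one draw from the prior
    x = rng.uniform(-2.0, 2.0, size=n_context + 1)
    y = intercept + slope * x + rng.normal(0.0, noise_std, size=n_context + 1)
    # The first n_context points are the context; the last is the held-out query.
    return (x[:-1], y[:-1]), (x[-1], y[-1])


def gaussian_nll(y_true, mean, std):
    """Negative log-likelihood of y_true under N(mean, std^2)."""
    return 0.5 * np.log(2 * np.pi * std**2) + 0.5 * ((y_true - mean) / std) ** 2


rng = np.random.default_rng(0)
tasks = [sample_line_task(rng) for _ in range(1000)]   # millions in practice

# Why the loss rewards calibrated uncertainty: for a prediction that is
# off by 1.0, a confident model (std 0.1) is penalized far more than a
# cautious one (std 1.0).
confident_wrong = gaussian_nll(y_true=1.0, mean=0.0, std=0.1)
cautious_wrong = gaussian_nll(y_true=1.0, mean=0.0, std=1.0)
```

Each element of `tasks` plays the role of one training example for the network: context points in, held-out point as the label.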

And just by sampling from the prior and then running supervised learning, you learn to approximate Bayesian inference to an arbitrarily strong degree. And that's really powerful. And so there's a lot of applications outside of tabular data where this is also really cool. For a neural network, for example, typical Bayesian neural networks give you a posterior distribution over the weights of the neural network and then integrate over that in order to give you a posterior predictive distribution.

But what they don't do, for example, is consider, well, which is the right architecture? They just say, well, what are the right weights of this particular architecture? But you don't know which is the right architecture to explain this data. So with PFNs, you can just say, well, I have this distribution over architectures. And then for each architecture, I have this distribution over its weights. You sample from it. It's trivial.

And then you just get the posterior predictive distribution, which is the right architecture for this data. So you kind of do some Bayesian neural architecture search in a forward pass, which is just really cool.
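A prior over architectures and weights of the kind described here is easy to sample from, which is all a PFN needs. The sketch below draws a small MLP architecture, then its weights, then a data set generated by that network. The choice of depths, widths, `tanh` activations, and weight scaling are assumptions for illustration, and the function name is hypothetical.

```python
import numpy as np


def sample_mlp_task(rng, n_points=32, n_features=3):
    """Sample an architecture, then weights, then a data set from that network."""
    depth = int(rng.integers(1, 4))            # number of hidden layers: 1 to 3
    width = int(rng.choice([4, 8, 16]))        # hidden units per layer
    sizes = [n_features] + [width] * depth + [1]
    weights = [rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
               for m, n in zip(sizes[:-1], sizes[1:])]
    X = rng.normal(size=(n_points, n_features))
    h = X
    for W in weights[:-1]:
        h = np.tanh(h @ W)                     # hidden layers
    y = (h @ weights[-1]).ravel()              # linear output layer
    return {"depth": depth, "width": width}, X, y


rng = np.random.default_rng(0)
arch, X, y = sample_mlp_task(rng)
```

Training on many such draws never reveals which architecture generated a given data set; the network only learns to predict well on average over all of them.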

There are some limitations.

That is really cool. We've been talking a lot, obviously, about priors, posteriors, posterior predictive distributions. I want to quickly break down for our listeners who aren't Bayesian what these kinds of things broadly mean. I'm going to give a really simple toy example. With your expertise, you can tell me what I get wrong in my explanation. Basically, I could have a simple linear model where all that I have is the slope of the line and some y-intercept. So this is kind of like the simplest kind of regression model.

With a Bayesian approach, I could assign some kind of prior distribution to both of those variables, to the y-intercept and to the slope of the line. So I could say, you know, based on my experience with these kinds of data, with this kind of problem, I think that there's going to be a slight slope, and I think the y-intercept will be around zero. If I'm very confident, I can assign a really narrow variance to my distribution. Or if I don't have much prior understanding of what this process should be like, what this regression model should be like, I could have a very wide distribution, which will allow these kinds of learning approaches, like you said, there's Markov chain Monte Carlo, some kind of Hamiltonian process, there's lots of different kinds of solvers for Bayesian approaches, to search gradually. Like you said, it was kind of like "While My Guitar Gently Weeps": while my Markov chain gently converges. What did you say?

Yeah, well, "gently mixes" is probably the right thing to say.

And so a big advantage of this Bayesian approach is that it allows you to incorporate prior information. You don't have to have your model be learning from scratch necessarily, although you could have some parameters, or maybe even all of the parameters, basically learned from scratch. And then after this learning process happens for a while, like Markov chain Monte Carlo, like a Hamiltonian method, we end up in the end with posterior distributions that represent what we've learned from the data.

So we start with a prior distribution that could be a highly informed prior distribution or a relatively uninformed prior distribution. And then we use the training data that we have to converge upon some posterior distributions that incorporate the prior information as well as the data that we have trained on. And you were just giving lots of cool examples there where we can use posterior distributions to be finding the weights of a deep learning network, for example.

How did I do? Yeah, no, actually, very good. The one thing I wanted to say, we do have a limitation. Actually, we don't get the posterior over the weights of the neural network. That is what you get with standard methods like MCMC and variational inference.

We bypass that step. We just directly go to the posterior predictive distribution. So the Y given the X. And we don't have a posterior over the W, the weights. With MCMC and with variational inference, you actually integrate out over all of the weights in order to get your predictive distribution. But

We bypass that. We can't tell you which is the right architecture, actually. We just tell you the Bayesian integral over all the possible architectures that might have explained the data.
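The distinction drawn in this answer, between a posterior over weights and the posterior predictive distribution, can be made explicit. The symbols are assumptions introduced for illustration: $w$ denotes the weights, $\mathcal{D}$ the training data, and $w_s$ a posterior sample.

```latex
p(y \mid x, \mathcal{D})
  \;=\; \int p(y \mid x, w)\, p(w \mid \mathcal{D})\, \mathrm{d}w
  \;\approx\; \frac{1}{S} \sum_{s=1}^{S} p(y \mid x, w_s),
  \qquad w_s \sim p(w \mid \mathcal{D})
```

MCMC and variational inference first represent $p(w \mid \mathcal{D})$ (by samples or by an approximating family) and then average over it, as on the right-hand side. A PFN instead outputs an approximation of the left-hand side directly in one forward pass, so no representation of $p(w \mid \mathcal{D})$ is ever produced.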

Nice. Okay, so I think we've now covered what prior-data fitted networks are, what PFNs are. So now I think we're probably at a point where we can move to TabPFN, so a PFN specifically designed for tabular data.

Yeah, exactly. That is what TabPFN is. And we talked about PFNs: you need a prior to express what kinds of assumptions you have about the data. So we would have a prior that creates tabular data sets in order to express our assumptions on what data sets we might be facing. So the first author of the paper, Noah Hollmann, came up with this pretty ingenious method to sample structural causal models.

A structural causal model is basically a model that samples a graph, and then the features are nodes in that graph, and the target variable is also a node in that graph. And then you sample connections in this graph. And you don't quite know, does a target variable cause some of the features? Do the features cause the target? Do some of the features jointly cause a target? Does a feature cause a target, and that in turn causes another feature?

There's this huge set of possible structural causal models that could explain the data.
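To make the idea of sampling structural causal models concrete, here is a minimal sketch: draw a random acyclic graph, push noise through it, then pick one node as the target and treat the rest as features. This is not the actual TabPFN prior; the edge probability, the `tanh` mechanism, the median-split label, and the function name are all assumptions for illustration.

```python
import numpy as np


def sample_scm_dataset(rng, n_nodes=6, n_rows=100, edge_prob=0.4):
    """Sample a random causal graph, push noise through it, and pick a target.

    Nodes are ordered so that edges only go from lower to higher index,
    which guarantees the graph is acyclic.
    """
    # adj[i, j] != 0 means node i is a direct cause of node j (i < j).
    mask = np.triu(rng.random((n_nodes, n_nodes)) < edge_prob, k=1)
    adj = mask * rng.normal(0.0, 1.0, size=(n_nodes, n_nodes))
    values = np.zeros((n_rows, n_nodes))
    for j in range(n_nodes):                      # topological order
        noise = rng.normal(0.0, 1.0, size=n_rows)
        values[:, j] = np.tanh(values @ adj[:, j]) + noise
    # Any node may be the target, so it can sit upstream or downstream
    # of the features in the causal graph.
    target = int(rng.integers(n_nodes))
    X = np.delete(values, target, axis=1)
    y = (values[:, target] > np.median(values[:, target])).astype(int)
    return X, y, adj, target


rng = np.random.default_rng(0)
X, y, adj, target = sample_scm_dataset(rng)
```

Because the target is chosen at random among the nodes, some sampled data sets have features that cause the label and others have a label that causes the features, matching the range of cases listed above.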

If you could identify the right structural causal model for your data at hand, then you would get much better predictions. But you can't: you just get the data, you don't actually get the structural causal model. So what we do with TabPFN in the end is to build the Bayesian posterior over all the possible structural causal models that could be explaining the data.

And so you could have one structural causal model that's completely wrong for the data that gets a very low probability. And so the predictions from that model would be really low weighted. And then you can have a model that matches really well with this data that gets a higher probability and is weighted more highly. So that's what the true Bayesian posterior would do.
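The weighting described here is Bayesian model averaging: each candidate model's prediction counts in proportion to its posterior probability given the data. A minimal sketch follows, with three hypothetical candidate models whose log-likelihoods and predictions are made-up numbers for illustration.

```python
import numpy as np


def model_average(log_prior, log_lik, predictions):
    """Average per-model predictions, weighted by posterior model probability.

    log_prior[k]   : log prior probability of candidate model k
    log_lik[k]     : log-likelihood of the observed data under model k
    predictions[k] : model k's predicted probability for the test point
    """
    log_post = np.asarray(log_prior, dtype=float) + np.asarray(log_lik, dtype=float)
    log_post -= log_post.max()                 # stabilise before exponentiating
    weights = np.exp(log_post)
    weights /= weights.sum()                   # posterior over models
    return weights, float(weights @ np.asarray(predictions, dtype=float))


# Three hypothetical candidate models with equal prior probability.
# The second explains the observed data far better than the others.
weights, p_avg = model_average(
    log_prior=np.log([1 / 3, 1 / 3, 1 / 3]),
    log_lik=[-40.0, -12.0, -15.0],
    predictions=[0.10, 0.80, 0.60],
)
```

The model that fits the data badly receives a weight close to zero, so its prediction barely moves the average.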

But of course, TabPFN doesn't get to store all the 130 million structural causal models that we used to generate it; it just gets the raw data. And it has learned to interpolate over all these possible models and has learned to approximate this Bayesian posterior in a forward pass.

A big strength here with the TabPFN approach is using generated data. So it sounded like you said over 100 million generated data sets. Unlike, say, natural language data, where when you're training something like a GPT kind of architecture you have trillions of tokens that you can train your model over, we don't have anywhere near that kind of scale in terms of high-quality, well-structured tabular data sets. And so you've gone ahead and simulated over 100 million of them.

Yes, exactly. And so we can really control what's going in. So we have no data leakage, because we actually didn't put any real data in.

So there's not any possibility that we've memorized the test datasets or something. Actually, a fun fact, I should mention this. The very first time we submitted TabPFN v1, it was rejected because the reviewer said, well, the performance is just far too good. You must be doing something wrong.

And what they thought we were doing wrong is: you must be tuning on your test set. Because we actually had some complicated stuff in there that was doing some gradient-based updates, looking at some real datasets. And we just dropped all that. It got an epsilon worse. And we just never touched any real data during training. That made it so much easier to defend, and the next time it went through with flying colors.

Yeah.

And so now, you mentioned version one there, unless I'm jumping the gun too much. It was a couple of years ago that version one came out, and that's what I was talking about at the onset of the episode. That is when I first noticed TabPFN. And it is still today the only tabular deep learning approach that is on my radar.

But in January, you guys had a paper. You mentioned Noah Hollmann earlier; he's the first author on this Nature paper that you published. It's called "Accurate predictions on small data with a tabular foundation model." And I'm, of course, going to have a link to that in the show notes. We're going to spend a fair bit of time now talking about this version two release and the associated paper. Something to answer maybe quickly at the outset: when you're coming up with a venue to publish something like TabPFN in, how did you think of Nature? It's one of the world's most popular journals, and it's general; it's designed to give a broad overview across all disciplines.

So you can let me know why you chose Nature, but it's amazing, first of all, to get published in Nature at all, and amazing to even have the audacity to submit to something like Nature. I'm guessing that the reason you would pick something like Nature is because tabular data are so ubiquitous across so many different scientific fields. I mean, that's literally your opening sentence in the abstract: from biomedicine to particle physics to economics and climate science, tabular data, which are spreadsheets organized in rows and columns, are ubiquitous across scientific fields. So I guess I understand. Well, I've spoken way too much. You can tell me about this Nature paper and what led to it.

Yeah, absolutely.

So yes, tabular data is super ubiquitous. And so we did want to reach a really broad audience. That's, of course, one of the nice things we do get with Nature. But actually, already when we submitted the first version, TabPFN v1, we submitted it to ICLR, for which the papers are directly online. And also if you retract them, they stay online. And literally the day after we submitted, I was like,

this is a real breakthrough. This changes everything. The fact that we can now actually have a deep learning model that does in-context learning and learn across tabular data sets, there's so incredibly much potential in there. And I was like,

Look, if I was at DeepMind, we would have sent this to Nature. Because DeepMind does publish there a lot. We read a lot of these papers, and whenever they have a new paper we're like, wow, this is so amazing. Everybody talks about them, everybody reads the papers. I read the papers, and they're really great. And I was like, hey, this is of the same caliber.

But we had submitted to ICLR, so if you retract, it's still online. That would have been a problem. So we're like, OK, fine. We can't publish in Nature. But let's go for a next version, and let's make this really, really strong. Because the first version was extremely limited. It only worked on numerical data. It didn't do missing values. It didn't handle outliers. It didn't handle imbalanced data. It didn't even handle categorical values, and of course, with tabular data, it's all categorical. It also didn't do regression; it only did classification. What it really did do for me is it was an eye-opener in that this is possible, and we, quote, "just" need to scale up and make this more general and so on. Of course, we had a bunch of extensions and improvements on the architecture and so on. But at the core, going from TabPFN v1 to v2, it's very much the same in-context learning, just made to work really well. And so

actually, it's a much better paper for Nature. Because with TabPFN v1 it was like, yeah, great, you can do this on datasets of up to a thousand data points with, what did we have, a hundred features, only numerical data, none of the stuff that you have in data science, none of the issues. So not a whole lot of people used it because of that. We have a repo with some applications, like 15 different papers that used it and showed that it's awesome in different domains. But yeah, 15, not thousands. And so that was the rightful criticism of the community on TabPFN v1. I said, hey, this is so amazing. And then they're like, well, why is nobody using it on Kaggle? This is not really a breakthrough in terms of the impact yet. But that is really what changed with TabPFN v2, because it's just so darn generic now. It can tackle any type of tabular machine learning problem, just like XGBoost can, with still some limitations. I definitely need to be very clear here. It still has a size limitation. So small, and that is in the title of the Nature paper, small tabular datasets. In particular, what we evaluated was up to 10,000 data points and up to 500 features.

And so we already scaled up a fair bit from the thousand from before. And yeah, I'm pretty confident that based on a combination of different approaches, we can also scale up to a hundred thousand or a million or something like that. Once you have billions of data points, you don't really need to be Bayesian about your data; then you have enough data to just let the data speak. But when you have a hundred data points and you fit a neural network or an XGBoost or something, it will typically overfit the data a lot. If you have a strong prior that emphasizes smoothness and so on, then you overfit a lot less. In particular, by using cross-entropy loss on the test portions of the sampled datasets, it has learned not to overfit. And so it just doesn't overfit as much as a standard method.
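The training signal alluded to here can be written compactly. As a sketch following the general prior-data fitted network formulation (the symbols are notation introduced for illustration, not quoted from the episode): sample a dataset from the prior, split it into a training portion $D_{\text{train}}$ and held-out test points, and minimize the cross-entropy of the network's predictions $q_\theta$ on the held-out points.

```latex
\mathcal{L}(\theta) =
  \mathbb{E}_{(D_{\text{train}},\, x_{\text{test}},\, y_{\text{test}}) \sim p(\mathcal{D})}
  \left[ -\log q_\theta\!\left( y_{\text{test}} \mid x_{\text{test}}, D_{\text{train}} \right) \right]
```

Because the loss is only ever measured on points that were not given labels in the context, memorizing the context does not help, which is one way to read the claim that the model has learned not to overfit. Minimizing this loss also pushes $q_\theta$ toward the posterior predictive distribution under the prior.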

And yeah, so it was a breakthrough, but not really in terms of methodological improvements. Yes, we have a new architecture that is nice. That could have been a paper by itself. We could have written individual papers on, hey, let's do this for missing values. We can do a paper for imbalance. We can do a paper on just regression, et cetera, et cetera. So we could have papers on all of these, but

We didn't go for that because we would have had to have ablations, comparing against all kinds of different approaches, particularly for that. We just wanted an all-encompassing framework that just works for all kinds of data. And Nature is a great venue for these types of papers where just the end result counts. It's not the individual contributions in terms of methodology to get there, but

what do you have now, like AlphaFold, for example. Yeah, there were also some methodological contributions there, but they weren't mind-boggling. It's just that this whole thing put together really worked. And so we're also of that category, and that's why we did have the audacity to try for Nature, and it did work.

Did you know that the number one thing hiring managers look at are the projects you've completed? That's why building a strong portfolio in machine learning and AI is crucial to your success.

At Super Data Science, you'll learn how to start your portfolio on platforms like Hugging Face and GitHub, filling it with diverse projects. In expert-led live labs, you'll complete an exciting new project every week. Plus, through community-driven projects, you'll tackle real-world multi-week assignments while working in a team. Get hands-on experience with projects like retail demand forecasting, building an AI model from scratch, deploying your own LLM in the cloud, and many more. Start your 14-day free trial today and build your portfolio with superdatascience.com.

Yeah, it's very cool. So this version 2, relative to version 1, to kind of summarize some of the key attributes, you can now handle, well, it's well-tested on up to 10,000 data points, 500 features, which is quite a few features.

And it can handle different kinds of data, not just numeric data. It can handle text data even, correct? It now can in the API, but actually not in the paper. And yeah, it handles missing values. It handles outliers. This is very cool.

I think I already said "very cool" once, but I don't mind repeating it because this is something that is going to be a game changer, particularly, as you say, in situations where we have tabular data but not huge amounts of it, where we don't have billions of rows, where we have hundreds, thousands, tens of thousands, maybe hundreds of thousands of data points. Having these kinds of Bayesian approaches allows the priors to fit the data much better than other kinds of approaches out there. Before we get on to specific real-world examples of TabPFN, I understand that in addition to working with tabular data, you also recently had a breakthrough with time series data.

Yeah, so this is really...

It's really mind-boggling. It is the same model that we have in the Nature paper. We also tried it for time series data. A time series is, you can think of a univariate time series, just a signal over time, such as maybe a trend. Basically, you have a time signal and then the size of the signal.

All we do is take the time index, basically saying, well, this is the time of day, this is the day of the week, this is the day of the month, do some sine and cosine features of that, and cast it as a tabular problem. So basically, each timestamp gets these six features, including the timestamps in the future.

And then you have for each of the known time steps, you have your x train. And for the future time steps, you have the x test. And so this works for the next time step or for like 17,000 time steps in the future. You can encode each of these just as one new data point. And so you can predict as far ahead as you want just in like not auto-regressively, but just directly in one forward pass.
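The featurization described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' exact pipeline: it assumes the six features are the sine and cosine of time of day, day of week, and day of month, and the helper names `time_features` and `make_tabular_problem` are hypothetical.

```python
import math
from datetime import datetime, timedelta


def time_features(ts):
    """Encode a timestamp as six cyclical features (assumed feature set)."""
    def cyc(value, period):
        angle = 2.0 * math.pi * value / period
        return [math.sin(angle), math.cos(angle)]

    hour = ts.hour + ts.minute / 60.0
    # time of day, day of week, day of month -> sine and cosine of each
    return cyc(hour, 24.0) + cyc(ts.weekday(), 7.0) + cyc(ts.day - 1, 31.0)


def make_tabular_problem(timestamps, values, horizon, step):
    """Cast a univariate series as (X_train, y_train, X_test) for a tabular model."""
    X_train = [time_features(t) for t in timestamps]
    y_train = list(values)
    # Every future timestamp is just one more row, so any horizon can be
    # predicted directly in one call rather than autoregressively.
    future = [timestamps[-1] + step * (k + 1) for k in range(horizon)]
    X_test = [time_features(t) for t in future]
    return X_train, y_train, X_test, future
```

Any tabular regressor with a `fit`/`predict` interface could then be fit on `X_train` and `y_train` and asked to predict all rows of `X_test` at once.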

The mind-boggling thing is that this model that we had in Nature, which is trained only on synthetic data and has never seen a time series and has never seen a real dataset in the first place, actually is the best on the public benchmarks on time series. It's better than all the foundation models that are trained specifically for time series, that are trained on synthetically generated time series, real time series, et cetera. And with this model, we didn't even try. It just works out of the box. So as of today, there's this benchmark, GIFT-Eval, which was in a NeurIPS D&B track paper just a couple of months ago. And yeah, so this is the standard benchmark for time series. And it's number one on there, outperforming Chronos. And Chronos is from Amazon. It's a really cool paper. And this just goes to show how much there is to gain here. There's

Once we fine tune for time series and we iterate on this or we have a time series prior, the sky's the limit. So I'm super excited about this and really looking forward to building more there. State of the art, out of the box. That is a nice outcome. Wow.

Great. Yeah, so very exciting, all of these big updates from version one to version two. With version one, as you mentioned, there was relatively limited applicability of TabPFN, but nevertheless, there were still some great use cases that came out of it.

One of them was a Science paper. So in addition to Nature, which you published in, there's one other big general, broad science journal out there, and it's called Science. I'm not even going to try to get into the biology of what this means, but we'll include the paper in the show notes. It's called "Large-scale chemoproteomics expedites ligand discovery and predicts ligand behavior in cells."

And so I can't really explain what this is all about. It's something to do with determining protein structure. But the key thing is that TabPFN was used as a part of the inferences that they made in that paper. And I'll also have a link in the show notes to a GitHub repo called Awesome TabPFN that lists about a dozen existing applications

of TabPFN across health insurance and factory fault classification. There are financial applications, a wildfire propagation paper, and a number of biological papers in there. So yeah, clearly lots of different applications out there, even for v1. I don't know if you want to talk about them in any specific detail, Frank, but I know that you are, of course, looking for more people to try out TabPFN, especially now that version two can handle so many more kinds of data types, can handle missing data, can handle outliers, and can handle larger datasets.

So listeners, if you've got tabular data out there, you can head to the TabPFN GitHub repo that we also have a link to in the show notes and you can get started right away.

Yeah, awesome. Thank you so much for mentioning this Awesome TabPFN repo. I literally created this today. So I hope that by the time the show actually goes out, there are a lot more than a dozen applications there. And yeah,

please, whenever you have an application or use case, either send us a note or, actually, this is one of these repos where you can just do a pull request with your own application, put in your own paper, and we'll basically advertise it. Also, if there are cool applications, we'd love to have blog posts or just retweet your content and so on. I think we really want to build this community of people who love TabPFN and build on top of it. And the open source community has already picked this up. Within a couple of days of the Nature paper, there is this repo, shapiq, that's all about interpretability, and they directly put TabPFN in there. And so yeah, it's really amazing to see the speed at which the open source community moves, and I'm really looking forward to what else people will build with this. One cool thing about the Science paper I wanted to mention is, yeah, I also know nothing about chemical proteomics, but that's kind of the neat thing. I can still work on this because, well, we have this really generic method, and if there is data from chemical proteomics out there, then we can fine-tune on that and get something that's even better for this use case. And so those are the types of

things that I'm really excited about doing for all kinds of use cases. There's also already something out there on predicting... Algal blooms! Yeah, algae! Yeah, algae, I know, and algal blooms sort of take care of that. Things that are good for the environment and so on. I think I'm really excited about those types of applications. There's lots and lots of applications in medicine

There are not that many published papers on applications in finance and so on because, well, typically people don't publish these types of applications as much. But in medicine and so on, there's a lot. And yeah, really hoping for a lot of people to use it to do good things for the world.

Yeah, fantastic. Very cool. So yeah, we've got the TabPFN repo available for you to access this Python library right away. It's been downloaded almost a million times at the time of recording, which is pretty cool. And then we'll of course also have a link to this Awesome TabPFN repo that has all of the applications. And speaking of applications, you are spinning out a startup to help spread the good word, and presumably applications, of TabPFN and associated technologies. And appropriately, given how much we've talked about Bayesian inference and priors and posteriors, your new company that you're co-founder and CEO of is called Prior Labs. So tell us a bit about Prior Labs and how it complements or how it's different from the research that you're doing at Tübingen and Freiburg.

Yeah, so I'm super excited about the startup. I've been wanting to build something for many years now. But

Really, for the last 10, 12 years, I co-started and have been co-leading the AutoML community, so this community on automated machine learning. That's all about democratizing machine learning, making it easy for everyone to get state-of-the-art machine learning by not having to worry about picking the right hyperparameters, picking the right method, et cetera.

And we've had a lot of great research and many, many really nice papers. We've also had some tools; in particular, Auto-sklearn was our most widely used and widely known tool. It wraps around scikit-learn and allows you to figure out the right method in scikit-learn: the right pre-processing, the right algorithm, the right classifier, the right hyperparameters, etc. And it made that much easier.

But coming from the university and being at the university, having just a few research engineers who happen to want to work in a university setting, we were never really in a position to build something for the masses. We've always built something that's sort of good for our research friends and good for ourselves to do our research with. And

Yeah, if you want to reach a broader set of people, of course, we need a commercial entity for that. And also, with TabPFN really being this breakthrough that will allow so much cool new stuff, yeah, we just need more workforce. We need really strong engineers to build amazing products. And so that's what we will be doing in the startup. I will keep an academic co-affiliation, and in the university I will focus very much on tabular data as well, on research about tabular data, things like interpretability: what does this network do? It is the best learned algorithm, but how precisely does this algorithm work? How precisely does it change when you change the priors? What are the failure modes? Where is it particularly good? How can we improve it further? There are so many avenues to do research on. And of course, with the startup, we also want to push the boundaries of what's possible in terms of capabilities.

With a university hat on, we'll be able to focus more on maybe some moonshots, things that might turn out or might not work. It's good to have this open-endedness of research, and that you can really only have in an academic setting.

So I'm really excited about combining the two and also providing the PhD students an opportunity to have amazing engineers actually build products out of what the PhDs publish. And so I'm really excited about these synergies and the future of Prior Labs.

Fantastic. And I know that you are hiring, at least PhD students, because you posted about this recently on LinkedIn. So I'll include a link to that in the show notes. And I wonder also, I mean, it sounds like you're also hiring engineers at Prior Labs.

Yeah, we're hiring a lot of people actually at Prior Labs. I haven't posted about that on LinkedIn yet because we'll have our funding announcement two days after we tape the show, but by the time it goes out, it will have long happened. And yes, we are hiring sort of full blast: AI scientists, ML engineers, backend engineers, community people, and at some point in the future also sales, but we're actually really not focusing on that now. We're focusing on building the community and building amazing tech.

Nice. So I usually have my last question be how to follow you, but I'm actually just going to jump to that right now because we were just talking about how you're going to have this big funding announcement, which will be live by the time this episode is published, and you'll be announcing more hiring on LinkedIn and that kind of thing. So how should people follow you to get the latest on TabPFN, but also maybe opportunities to be involved in the open source community or even as a paid employee?

Yeah, so I'm on Twitter slash X and LinkedIn, ever more on LinkedIn. I also want to at some point start on Bluesky if I ever have time. But yeah, then of course we have the GitHub repo you mentioned. So there's a TabPFN repo. There's a TabPFN API repo. And there's a TabPFN extensions repo. And particularly these TabPFN extensions, that's a repo where we strongly encourage the community to push extensions, push cool things people have done with TabPFN, such as work on interpretability, work on doing better hyperparameter optimization, post hoc ensembling, stuff like that. AutoTabPFN is in there already. So we strongly encourage interactions there. If you're interested in applying TabPFN to your particular domain, like fine-tuning, et cetera, do reach out to us, particularly on our Discord channel. We have a Discord channel that is particularly for TabPFN. We already have over 200 people in there. I'm starting to build this community, so I'm super excited that that is working. I already did an AMA there last week, and yeah, great questions. It looks like it's going to be a really cool community.

Nice. Yeah, no doubt. It's interesting. I hadn't noticed this before, but I can see on the GitHub repo for TabPFN how many people are online in the Discord channel right now. There's 55 online, which is an interesting little widget included in there. Nice. Yeah, so fantastic. I'm sure you'll get a lot of interest from this podcast episode and from just how amazing this project is in general. It really is transformative.

It's been so exciting for me to have you on the show because of my longstanding interest in TabPFN. Before I let you go, I need a book recommendation from you.

Book recommendation? Let's see. I really like Asimov. The Robot series, the Foundation series. If you haven't read them, I strongly recommend them.

That's a great recommendation, especially at this time.

Thank you so much, Frank, for taking the time, you know, busy between getting a startup going and your university responsibilities. It's amazing that you can take the time to be on a show like this. So we really appreciate it. And yeah, wish you all the best. Yeah, this is super exciting. I love your show. And yeah.

I'm really honored to be here, actually. So I'm super excited. Thank you.

Yeah, it's mutual. Thank you as well. All right. Yeah, maybe we can check in again in a few years and see how the TabPFN journey and the Prior Labs journey are coming along.

Absolutely. Love it.

What a fascinating and practical episode with Professor Frank Hutter today. In it, he covered how TabPFN is a deep learning model specifically designed for tabular data that uses a transformer architecture combined with Bayesian principles to make accurate predictions even with limited data. He talked about how version two of TabPFN significantly expanded its capabilities, now handling up to 10,000 data points, up to 500 features, missing values and outliers, numerical and categorical data, and through their API only at this time, text data as well.

The model was trained entirely on synthetic data, over 100 million generated datasets, eliminating any potential data leakage while ensuring robust performance. He talked about how TabPFN version 2 unexpectedly achieved state-of-the-art performance on time series prediction without any specific time series training, outperforming Amazon's Chronos and other specialized time series models. And he talked about how Prior Labs, his new startup, has been created to commercialize TabPFN technology and build products that make the TabPFN breakthrough accessible to a broader audience while academic research continues.

As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Frank's social media profiles, as well as my own at superdatascience.com/863.

And if you'd like to meet in person as opposed to online, I'll be giving the opening keynote at the RVA Tech Data and AI Summit in Richmond, Virginia on March 19th. Tickets are a bargain, frankly. So if you're in the Richmond area especially, come on down and see me on March 19th. It'd be awesome to meet you there.

Thanks, of course, to everyone on the Super Data Science Podcast team: our podcast manager, Sonja Brajovic; our media editor, Mario Pombo; our partnerships manager, Natalie Ziajski; our researcher, Serg Masís; our writers, Dr. Zara Karschay and Silvia Ogweng; and our founder, Kirill Eremenko. Thanks to all of them for producing another horizon-expanding episode for us today. For enabling that super team to create this free podcast for you, we are, of course, super grateful to our sponsors. You can support the show by checking out our sponsors' links, which are in the show notes.

And if you are interested in sponsoring an episode yourself, you can find out how to do that by going to jonkrohn.com/podcast. Otherwise, share this episode with people who would love to be applying deep learning to tabular data, review this episode on your favorite podcasting app or on YouTube, and subscribe, obviously, if you're not a subscriber. Feel free to edit our videos into shorts to your heart's content. But most importantly, just keep on tuning in. I'm so grateful to have you listening, and I hope I can continue to make episodes you love for years and years to come. Until next time, keep on rocking it out there, and I'm looking forward to enjoying another round of the Super Data Science Podcast with you very soon.
