This is episode number 868, our In Case You Missed It in February episode. Welcome back to the Super Data Science Podcast. I am your host, John Krohn. This is an In Case You Missed It episode that highlights the best parts of conversations we had on the show over the past month.
So one thing that you mentioned in that transition from being a data analyst focused more on visualization tools like Tableau, and moving a little bit up the stack to analytics engineering, is that you were getting more into dbt, right? Yes. Yes. So tell us about dbt: why a company would use it, how you interact with it, what it does. Yes. So dbt is a data build tool, and essentially dbt
Oh, you already know that. Yeah. So I actually looked this up a few months ago, because I wanted to double-check. I went to the dbt conference this year, which was a ton of fun, and someone had asked me, and I was like, well, I believe it stands for data build tool, but I wanted to double-check. I guess it used to, but now they've transitioned to just dbt. But I mean, it's still, yeah, it's still data build tool.
And so essentially, if we already had the data in our database but it was in a raw format, and we needed to expose it or surface it in a way our stakeholders could use in a visualization tool, we would go in, take the raw source of it, and essentially write SQL, more or less. dbt also has Python-style coding within it that lets you do kind of SQL on steroids, a little bit. So you get to do some different things with it. Especially if there's something redundant in your code, they have different functions that let you consolidate it and speed things up, which is really, really nice.
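To make that concrete, here is a minimal sketch of what a dbt Python model might look like. Note that this is an illustration, not her actual code: dbt models are more commonly written in SQL, with Jinja macros handling the redundant-code consolidation she mentions, and the model and column names here (stg_orders, status, order_id) are hypothetical.

```python
# Minimal sketch of a dbt Python model, illustrating the raw-to-surfaced
# workflow described above. Names are hypothetical. On Snowflake,
# dbt.ref() returns a Snowpark DataFrame; we convert to pandas here
# purely for familiar syntax.

def model(dbt, session):
    dbt.config(materialized="table")  # persist the result as a warehouse table

    # Pull the upstream staging model that holds the raw-ish data
    orders = dbt.ref("stg_orders").to_pandas()

    # Embed the agreed-upon business logic, e.g. what counts as "completed"
    completed = orders[orders["status"] == "completed"]
    completed = completed.drop_duplicates(subset=["order_id"])

    # Whatever the function returns is what dbt materializes downstream
    return completed
```

In SQL models, the same consolidation would typically be done with reusable Jinja macros, which is the "SQL on steroids" feature she alludes to.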
And so essentially that's what I was doing day to day. So I was going in, and I would say initially I wasn't
building out models from scratch, but I would be adding to models that were already there: adding different fields, adding a lot of business logic, because that was really important too. And what I really loved about dbt was that you can embed your business logic, your agreed-upon business logic with your stakeholders, and then have it documented. And that was a really, really big thing that I loved. So dbt has a documentation part of their tool, dbt docs,
that allows you to have definitions. And actually, they have an automated version of this now, which is fantastic. Because I'm not going to lie, when you're building out a model, especially from scratch, and let's say you have
30, 40 fields in your model, the last thing you want to do is build out a document and then write out the definitions of all of these things, especially when you feel that, for example, if it says created date, you're like, well, it's created date. Do I really need to define that? But a lot of times you probably do need that definition, because created date might not mean what you think it means all the time.
And so... It sounds like that's a generative AI assist coming in there. Yes. And it sounds like, you know, actually a great use of it. There are all kinds of places where generative AI is showing up in my life today. For example, have you ever used, or do you use, a WHOOP for tracking your sleep? So I have an Oura Ring, but similar. Yeah, yeah. Well, WHOOP, in the gen AI mania of 2024, incorporated this daily gen AI guidance thing that I looked at one time and was like, oh, please never show me that again. You don't need that in your life; you're getting all the information you need. I don't need this kind of generalized advice about fitness, which I guess maybe somebody out there is finding useful, but it's like, you know, make sure you get lots of water in your day. Yeah. Yeah.
Yeah, it's interesting. I mean, I'm sure we could go off on a tangent there, but it is interesting how AI is so powerful and very cool, and yet it doesn't need to be everywhere for certain things. But especially for... Oh, sorry, you were going to say... But yeah, in dbt, automatically creating documentation for all the fields that you have in a data file: that sounds like something where you should still be reading over it and making sure that it's accurate, but it saves you the blank-page problem. Yes. Well, and also now they have a feature where you can identify...
a field, I believe they're called doc blocks, where you can define a field, let's say in your initial staging-layer model, and then reference that definition throughout the lineage, wherever that column or field goes in your downstream models. Which is so nice, because it's just efficiency, and you can spend more time working on the actual model itself. I loved that component. And I also think it was really, really good because it forces people on the data team to work with your stakeholders and really come to terms with: hey, this is our logic, and this is our definition of this thing. Because, especially if you're wanting to change how a team speaks about what, let's say, active customer means, if your data isn't representing that, it's going to be really hard to make that change through your conversations.
So having that not only for your stakeholders to reference back to, but also for when you have new engineers coming in, that's huge. When they're trying to learn what the data means and what granularity your models are at and all of that stuff, that's just infinitely helpful. And so being able to be on a tech stack like that was amazing.
So, so nice. And again, it can get kind of tedious, but now that we have this generative process, it's not taking long at all. It still was so nice to have that as: no, this is just what we do. Any model that's created, everything has to have a definition. So essentially that's what we were doing. And then, you know, I started progressing to building out
models from scratch when we were getting new data in and then exposing that into tools like Tableau or Sigma and then building out dashboards
for our teams to be able to use. What's Sigma? So Sigma is another data visualization tool. And it was really nice that, well, Tableau I guess you can run a few ways, but essentially, Sigma sits right on top of Snowflake and works really, really well with it. So that ingestion process was kind of removed: anything we had in Snowflake, we could build off of in Sigma.
And it was really good. I would say that initially, when I opened up Sigma, it definitely can look a little bit like Excel, which gave me pause. I was like, oh, I don't know. Especially because, and I'm sure lots of people in data can relate to this, a lot of times you're trying to get people out of Excel. Not all the time, sometimes it's necessary, but for the most part, you want it consolidated. You want everyone to be looking at the same numbers, so there's consistency when you're making business decisions. But it was great, because I think it has the capacity for people in data analytics who are trying to do really high-level analytics and insights with it, and it was also great for PMs and other people who are really familiar with Excel to get in there. So they're like, oh, great.
This is great. It's just like Excel, plus it has a lot of other features. So that was really great as far as adoption goes for both teams, which was really cool. It's so important for data scientists to think about how teams that might not be particularly tech-literate will adopt the tools they put forward. In episode 859, I spoke to the Y Combinator-backed entrepreneur Vaibhav Gupta about the kinds of tools that help prevent mistakes like interns bringing down entire production applications.
Here, Vaibhav and I talk about BAML, the Basic Ask Machine Learning programming language he created that helps companies protect themselves from well-meaning mistakes. So you're the CEO and co-founder of Boundary, which is the creator of BAML, B-A-M-L, a programming language. And it's an expressive language for text generation specifically. We probably have a ton of listeners out there who are calling LLMs, fine-tuning them for various purposes, and BAML is designed for them. So tell us about BAML: what the acronym means, and why you decided to create it. Yeah, so let's start with the acronym first. BAML stands for Basic Ask Machine Learning, but if you tell your boss, you can say Basically A Made-up Language. So...
But the premise of BAML really came from this theory around how web development started out. When we all started coding, at least for me, when I started coding websites, it was all a bunch of PHP and HTML kind of hacked together to make the website work. And then I remember interning at Meta, and they were the ones that made React. I think part of the reason they made React was that their code base was starting to get atrocious to maintain.
Imagine having a bunch of strings concatenated into your HTML syntax, and now an intern comes in, like myself, forgets a closing div, and now your newsfeed is busted. That's not really the way we want to write code, where multi-billion-dollar corporations rely on an intern closing strings correctly. And it's not really even the intern's fault, because how could they really read a giant blob? I can barely read essays. How could the intern do that? But a compiler like React could actually correct for those mistakes. If you merge HTML and JavaScript into the same syntax by creating a new syntax, those ideas become much more easily expressed. And now, in two milliseconds, you get a red squiggly line saying: unclosed div tag. And in that web development loop, it just reframed the way we all started thinking about web development. Instead of assuming things are going to be broken, we could do state management because React handled it for us.
We could do things like hot-reloading a single component and having the state around it persist, because React did that for us. It was tastefully done, even though it required learning something new. And we asked: in this AI world that we're all headed towards, we think a few things are going to be true. One, every code base will have more prompts in each subsequent year than it had in the previous year. And if that is true, we probably don't want all these unclosed-div types of mistakes sticking around forever. And when you say prompt, you mean like an LLM prompt? Yeah, like an LLM, yeah, calling an LLM of some kind. And LLMs I think are one start, but I think all models in general are going to be used long-term. Models are only going to become easier to use for people who know nothing about machine learning.
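As a hedged illustration of that "unclosed div" failure mode for prompts (this is plain Python with Pydantic, not BAML's actual syntax), here is how schema validation can turn a silently malformed LLM response into a loud, early error:

```python
# Illustration only (not BAML): validating an LLM's raw string output
# against a schema so malformed responses fail loudly, up front.
# The Invoice schema and field names are hypothetical.
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):
    vendor: str
    total_usd: float

def parse_llm_output(raw_json: str) -> Invoice:
    # A raw string from the model is the prompt-era equivalent of the blob
    # of concatenated HTML; validating it plays the role of the compiler's
    # red squiggly line, catching the "missing closing div" of LLM output.
    try:
        return Invoice.model_validate_json(raw_json)
    except ValidationError as err:
        raise ValueError(f"Malformed Invoice from the model: {err}") from err

print(parse_llm_output('{"vendor": "Acme", "total_usd": 12.5}'))
```

BAML's pitch, as Vaibhav describes it, is to move this kind of checking into the language and its tooling, so the feedback arrives at edit time rather than at runtime.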
So, yeah, we've done episodes recently (for example, people can listen to episode 853) where we talked about this generalization of LLMs to foundation models more broadly, maybe a vision model, for example, where you don't necessarily need to have a language input or output. But even with that kind of model, even in a vision use case, it could be helpful. It could make things easier for people calling on that vision model if, instead of having to write code, they can use a natural-language prompt. And so I 100% agree with you. More and more often, with the models that we're calling, whether they're big foundation models, specifically LLMs, or smaller models, having natural-language prompts in there lets you very easily get what you're looking for, maybe even just out of a plot. Yeah, exactly. And I think the thing that we have to think about as this stuff becomes more and more prevalent is actually the developer tooling that has to come with it. Just like how React had to exist for Next.js, TypeScript, and all these other things to come out and make our lives a lot better
in the web development world, we ask what has to exist in the world of LLMs and AI models generally for developers: not the people producing the models, because that's a different world, but the people consuming the models. And no matter how good the models get, at some point you have to write bits on a machine that flip, and that's code. And it has to plug into your code base in a way that makes sense.
And just like JavaScript sucks and TypeScript is way better because of the type safety and static analysis that we get, we wanted to do a bunch of algorithmic work that reframes the problem for users when we made BAML. We stay on the subject of the latest tools in tech with my next clip. In episode 863, I talked to Professor Frank Hutter about the huge steps that TabPFN is making in science and medicine. For the uninitiated, TabPFN is a foundation model for modeling tabular data impressively well, a feat that deep learning models had struggled with until now.
The phenomenal results of some notable TabPFN applications have already been reported in leading peer-reviewed journals like Nature and Science. I asked Frank which TabPFN applications excite him most and how listeners can get started with TabPFN for their own use cases. Yeah, so very exciting, all of these big updates from version one to version two. With version one, as you mentioned, there was relatively limited applicability of TabPFN, but nevertheless, there were still some great use cases that came out of it.
One of them was a Science paper. So in addition to Nature, the journal that you published in, there's one other big, broad, general science journal out there, and it's called Science. And so there's this paper. I'm not even going to try to get into the biology of what it means, but we'll include the paper in the show notes. It's called "Large-scale chemoproteomics expedites ligand discovery and predicts ligand behavior in cells." And so I can't really explain what this is all about; it's something to do with determining protein structure. But the key thing is that TabPFN was used as part of the inferences that they made in that paper. And I'll also have a link in the show notes to a GitHub repo called Awesome TabPFN that lists about a dozen existing applications
of TabPFN: across health insurance, factory fault classification, financial applications, a wildfire-propagation paper, and a number of biological papers in there. So yeah, clearly lots of different applications out there, even for V1. I don't know if you want to talk about them in any specific detail, Frank, but I know that you are, of course, looking for more people trying out TabPFN, especially now that version two can handle so many more kinds of data types, can handle missing data, can handle outliers, and can handle larger data sets. So
So listeners, if you've got tabular data out there, you can head to the TabPFN GitHub repo, which we also have a link to in the show notes, and you can get started right away. Yeah, awesome. Thank you so much for mentioning this Awesome TabPFN repo. I literally created it today, so I hope that by the time the show goes out, there are a lot more than a dozen applications there. And yeah, please, whenever you have an application or use case, either send us a note or, actually, this is one of these repos where you can just do a pull request with your own application, put in your own paper, and we'll basically advertise it. Also, if there are cool applications, we'd love to do blog posts or just retweet your content and so on. I think we really want to build this community of people who love TabPFN and build on top of it. And the open-source community has already picked this up. Within a couple of days of the Nature paper, there was this repo, shapiq, that's all about interpretability, and they directly put TabPFN in there. And so yeah, it's really amazing to see the speed at which the open-source community
works, and I'm really looking forward to what else people will build with this. One cool thing about the Science paper I wanted to mention: yeah, I also know nothing about chemoproteomics, but that's kind of the neat thing. I can still work on this, because we have this really generic method, and if there is data from chemoproteomics out there, then we can fine-tune on that and get something that's even better for this use case. So those are the types of things that I'm really excited about doing for all kinds of use cases. There's also already something out there on predicting... Algal blooms! Yeah, algae! Yeah, algae, I know, and algal blooms sort of... But yeah, the sorts of things that are good for the environment and so on. I think I'm really excited about those types of applications. There are lots and lots of applications in medicine; there are not that many published papers on applications in finance and so on, because, well, typically people don't publish those types of applications as much. But medical and so on, there's a lot. And yeah, I'm really hoping for a lot of people to use it to do good things for the world.
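For listeners who want to follow the getting-started advice above, here is a minimal sketch, assuming the current `tabpfn` Python package API (installed with `pip install tabpfn`) and using synthetic data purely for illustration:

```python
# Minimal TabPFN getting-started sketch; synthetic data for illustration only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier  # pip install tabpfn

# Fabricate a small tabular classification problem
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()      # pretrained foundation model; no training loop
clf.fit(X_train, y_train)     # "fit" stores context rows for in-context learning
print(clf.predict_proba(X_test)[:5])  # class probabilities for new rows
```

The sklearn-style fit/predict interface is the point: because TabPFN is pretrained, "fitting" is effectively instant, and predictions come from in-context inference over your rows.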
It's incredible to see how TabPFN has gone from strength to strength in a relatively short space of time. So how does anyone go about setting up a successful tech company like that? In episode 865, I talked to Cal Al-Dhubaib about how to start and scale a data science consultancy, using his wildly successful company, Pandata, as a case in point.
Let's talk about the kinds of things that made Pandata so successful. We already have this "make it boring" idea: making the data science that you're delivering easy for your clients to understand. What are the other keys to scaling a successful data science consultancy? So something that I didn't quite nail in my first startup, and that really stuck with me, is this notion of product-market fit.
And anyone who's in the space of entrepreneurship will hear this term bandied about. For those of you who haven't been in the field of entrepreneurship, what it means is that you've found a pain point that someone is willing to spend something on solving; there are enough of those people at enough scale; you know how to reach them; and you can consistently deliver the thing that they're willing to pay for. And... Clients vote with their money. And I found early on, because I bootstrapped, which meant I didn't raise any capital, that the only source of growth I had was when a customer was willing to pay for it. And so it's one thing when somebody says, hey, that's a great idea. It's another thing when they're willing to sign a big check for you to solve that problem. And then they come back to you to solve that same problem, or a similar problem, again and again and again.
So product market fit and listening to what people were willing to spend on was a really big part of Pandata. My first year, all I had to do was say, "Hey, we can do data science things." And I was able to land a few contracts here or there, but it was a rotating window. I'd work with one enterprise and then they'd go away. Another enterprise would come, and that's a very common story for consulting companies. There were maybe one or two clients that stuck around or kept coming back to us.
And I remember having a conversation with my stakeholder there. I finally worked up the guts and I said, not that I want you to question the situation at all, but why are you coming back? And I was like really trying to do some market research and understand. And it turns out that they really liked that we were approachable, right? That was one of our core values is hold back the jargon, always speak plainly.
And then there were a couple of formulaic things that we accidentally ended up doing. We have this process called discovery and design that now is a mandatory requirement. Anybody that hires us to do any work, I say you have to do this up front or I won't work with you. With those clients, we accidentally did it. And that's where we spent just
30 days, six weeks, diving into a problem, trying to figure out: where are the skeletons? Is this solvable? How can we approach it? What are the unknown unknowns? That last one is a really big part of solving problems that have not been solved before with pattern-matching algorithms, to simplify it. And
So I tried to recreate that magic. So there were these attributes that we had that became our core values. We had five core values that I can talk about later. And then there are these processes. And one of these processes was discovery and design. Now, the funny thing is I decided, all right, I'm now no longer going to work with any client that doesn't want to do this. And we're going to charge an arbitrary amount of money.
That engagement size is now $50,000; at that time, it was a measly $12,000. And as a first-time entrepreneur, I was really nervous about throwing that number around. But I'd say, hey, you know what? Unless you're willing to spend this, I don't even want to work with you.
And it helped me weed out two things. One, clients that weren't serious, if they weren't willing to pay that, they definitely weren't willing to pay for the rest of the engagement. And two, if they didn't philosophically agree with the importance of that step, then I knew that they were likely to be a client that was consistently disappointed by the results because they didn't quite get the data science process.
So I went from spending a lot of time talking to a lot of people who at first seemed interested in data science to getting no, no, no, no; my pipeline started to dry up. And this is one of three times that Pandata's bank account reached less than a month's worth of expenses. And I was like, this is the end; this was maybe the dumbest idea. And
Within that same period of time, I landed three of the biggest clients I had ever engaged, two of which remained clients until Pandata's exit. So over a period of about six years. And that process became a part of how we were able to scale so much larger than most small solopreneur consulting shops. Right. So the key was...
having this 30-day discovery and design initial engagement at the beginning of trying to consult with somebody, and you'd say, you know, there's going to be this $50,000 price point to do that initial 30-day engagement. And so that initially seemed to put you in peril, where your pipeline dried up and everyone was saying no, but then it ultimately led to discovering solid long-term clients that were with you for six-plus years. Cool. Well, so I would use this tactic, and I still use this tactic, to scare off non-serious people. And it actually allows me to save them time, and it allows me to save my time. And then I find the companies and the groups that say,
heck yeah, that sounds amazing, I love how you think about this. And there are a lot of fish in the sea, and it's all about this matchmaking process. One of the counterintuitive lessons I learned was the art of saying no, or ruling others out by saying no to them. It really allows you to spend more time on the bigger things, the higher-value things, and this is a common tactic I see among most of my friends who are wildly successful. Right, right. That is tricky. It's very hard to say no to smaller or more challenging projects, because you remember those times when you got down to only a month of expenses' worth of value left in your bank account: well, I guess I'd better say yes to everything. But ultimately that slows you down. You have the death by a thousand cuts
of all of these low-value touchpoints. Well, it's funny: when we were going through due diligence on this acquisition, there were about three points on the balance sheet in the financials that they had practically circled. And they're like, we want to talk about this,
this and this. We don't like that. I said, I didn't like those either. Those were really bad moments for me, too. All right, that's it for today's In Case You Missed It episode. To make sure you don't miss any of our exciting upcoming episodes, subscribe to this podcast if you haven't already. But most importantly, I just hope you'll keep on listening. Until next time, keep on rocking it out there, and I'm looking forward to enjoying another round of the Super Data Science Podcast with you very soon.