
885: Python Polars: The Definitive Guide, with Jeroen Janssens and Thijs Nieuwdorp

2025/5/6

Super Data Science: ML & AI Podcast with Jon Krohn

People: Jeroen Janssens, Thijs Nieuwdorp

Topics
Jeroen Janssens: I stumbled upon Polars at work and was immediately drawn in by its powerful performance and concise syntax. I realized that Polars was a library with enormous potential, worth a book of its own. While writing, I found that Polars' strengths lie in its fast computation and efficient use of memory, which give it a significant advantage when processing large datasets. Its declarative programming style also makes code easier to read and maintain. By co-writing the book with Thijs, we could complement each other and get the work done together. We also learned a great deal from real projects and folded that experience into the book. In our project with Alliander, we successfully brought Polars into a production environment, with striking results: by converting the Pandas code to Polars, we reduced memory usage from 500 GB to 40 GB and doubled the processing speed, which demonstrates Polars' value in real applications. We also explored data visualization with Polars and found that the Great Tables package can style tables effectively without modifying the underlying data, letting us create more attractive, easier-to-understand tables. In a collaboration with NVIDIA and Dell, we benchmarked Polars' GPU acceleration, and the results showed that Polars runs much faster on GPU than on CPU, which further extends its reach to even larger datasets.

Thijs Nieuwdorp: My writing style complements Jeroen's: he excels at polishing, I excel at drafting. Throughout the writing process we kept learning and refining, drawing on experience from real projects. In the Alliander project, we faced the challenge of processing very large datasets. Using Polars, we solved that problem and significantly improved data-processing efficiency; Polars' optimizer and engine let us process data efficiently without worrying too much about low-level details. For package management, we moved from Poetry to uv, because it is faster, more reliable, and easier to use. uv is built on Rust, and its performance advantage is substantial, letting us set up and tear down environments quickly, which made benchmarking convenient. For data visualization, we used the Great Tables package, which styles tables without modifying the underlying data, letting us create more attractive, easier-to-understand tables. In the collaboration with NVIDIA and Dell, we benchmarked Polars' GPU acceleration, and the results showed that Polars runs much faster on GPU than on CPU, further extending its reach to larger datasets.

Transcript

This is episode number 885 with Jeroen Janssens and Thijs Nieuwdorp, authors of Python Polars: The Definitive Guide. Today's episode is brought to you by Trainium2, the latest AI chip from AWS; by Adverity, the conversational analytics platform; and by the Dell AI Factory with NVIDIA.

Welcome to the Super Data Science Podcast, the most listened-to podcast in the data science industry. Each week, we bring you fun and inspiring people and ideas exploring the cutting edge of machine learning, AI, and related technologies that are transforming our world for the better. I'm your host, Jon Krohn. Thanks for joining me today. And now, let's make the complex simple.

Welcome back to the Super Data Science Podcast. We've got two guests in today's episode, unusually for the second week in a row. It's got to be the first time that's ever happened.

Jeroen Janssens is our first guest. He is Senior Developer Relations Engineer at Posit. Previously, he was Senior Machine Learning Engineer at Xomnia, the largest Dutch data and AI consulting company. He wrote the invaluable O'Reilly book, Data Science at the Command Line, and holds a PhD in Machine Learning from Tilburg University. Our second guest today is Thijs Nieuwdorp, who leads Data Science at Xomnia.

And he holds a degree in AI from Radboud University. My apologies to any of our Dutch listeners. I am surely butchering every single Dutch name that I try to say. So the reason why Jeroen and Thijs are on the show today is because they are the authors of Python Polars: The Definitive Guide, which was published by O'Reilly just a couple of weeks ago.

Regular listeners will know that I often hold book raffles when I have guests on the show who have written a popular book, particularly if they wrote it recently. And typically, I administer those book raffles personally. Well, this week, Jeroen and Thijs have upped the ante. Not only can you receive a free physical copy of their book, Python Polars, they are kindly taking the admin out of my hands so that they can both sign and ship your physical copy to you wherever you are in the world.

That's free and signed and shipped to you. So Jeroen and Thijs are giving away three free copies of these signed Python Polars books. Head to polarsguide.com slash SDS before the end of the day this Sunday, May 11th. That's polarsguide.com slash SDS to get a, well, to be in the raffle, to get a free signed copy of Python Polars.

We've got, of course, a link for you in the show notes so that you can do that easily.

Today's episode will be particularly appealing to hands-on data science, machine learning, and AI practitioners, but Jeroen and Thijs are tremendous storytellers and, frankly, very funny. So this episode can probably be enjoyed by anyone interested in data and AI. In today's episode, Jeroen and Thijs detail why Pandas users are rapidly switching to Polars for data frame operations in Python, the inside story of how O'Reilly rejected four book proposals on Polars before accepting the fifth,

the moment when an innocuous GitHub pull request forced a complete rewrite of an entire book chapter, and a previously secret collaboration with NVIDIA and Dell that revealed remarkable GPU acceleration benchmarks for Polars. All right, you ready for this laugh-filled episode? Let's go. ♪

Jeroen and Thijs, welcome to the Super Data Science Podcast. You guys are together. Have I ever had this situation before? I can't think, off the top of my head, of ever having had two guests who are co-located. Where are you two co-located? Thanks, Jon. It's great to be here again. Good to see you again.

And we're calling in from Rotterdam, the Netherlands. Nice. And that voice for the listeners out there, for the people not watching the YouTube version, that was Jeroen. That's his voice.

And for people watching the YouTube version, he's the one in the pink shirt. Oh, also his mouth was just moving and sound was coming out of his face. But whatever is easier for you to track. And then in a charming and matching, or complementary, forest green shirt, we have Thijs Nieuwdorp. Thijs, what does your voice sound like? Thanks so much for having me, Jon. This is what my voice sounds like. Oh, nice. I'm...

It would be helpful if one of you didn't have a Dutch accent, but fine. We'll have to just work with this. We have accents? What do you mean we have accents? That's funny. So, Jeroen, you've been on the show before. You were on in episode 531 back in December 2021. That was the very end of my first year hosting this show. We had a great time.

on that podcast. It's a great episode that people can listen to. But Thijs, am I correct in understanding this is your first podcast ever? It is. Yes. I've never podcasted before. It's just since the book is taking off, we're finally getting into that marketing and we're kicking off with the best. So I have no clue what to do after this. I hate to let you know, it's going terribly so far. This is a bad episode. It's

It's too bad. We'll just start with my biggest fear and just break it down from there, right? Your biggest fear, yeah, exactly. We'll start with your biggest fear and then we're going to move on to your biggest beer. If you could grab one of those, I think it might smooth things along. Until February 2025, you were co-workers together at Xomnia, which is the leading data and AI consulting company in the Netherlands

and makers of the open source data frame library Polars. And then Jeroen, you recently took a DevRel job at Posit. Seems like a lot of people are moving over to Posit. A lot of big names are there now. Makers of RStudio and lots of other great open source tools. We've had episodes on Posit before in the past, so we don't need to get into that too much. But

what I'd like to speak about most in this episode is your new book, brand new, released actually the same day that we were recording this, which is April 1st. So now that this episode is live in May, hopefully it'll actually be available again, because right now, in a lot of locations around the world at least, if you try to buy Python Polars: The Definitive Guide by our guests today, Jeroen and Thijs, you wouldn't be able to get it

on April 1st because it is sold out. But O'Reilly are very good. They do on-demand printing, and so they should be able to resolve that pretty quickly. It's not like with a lot of other publishers, where it could potentially be months; a lot of publishing companies do print runs in the 5,000s, but I'm pretty sure O'Reilly can do printing on demand, which is cool. So anyway, that should be resolved soon. Very popular book, very excited to have you on the show.

We've had Polars on the podcast before. We've had Ritchie Vink, its creator. We've had another key contributor to the Polars project, Marco Gorelli. But the Polars library has grown a lot since then. It's about to pass Pandas in popularity, if we measure that in number of GitHub stars, if that's a measure of popularity. And yeah, and now it has this great O'Reilly book, thanks to the two of you. So what...

what spurred you guys on to write the book? What was the experience like? Oh, and I've got to tell the listeners about this, that you're doing a book giveaway. So I think we'll give them until Sunday. What do you think, up until Sunday?

Sounds good to me. Yeah. Sweet. So there's a URL. You can say what the URL is and what the free book giveaway is. We do free book giveaways on the show a lot, physical books, but there's something special about your book giveaway that we've never done before. So I'll let you guys fill the audience in. Yeah. For your listeners, we wanted to give away hard copies that are signed by the both of us.

So in order to be eligible for a copy, you go to polarsguide.com slash SDS and you fill in your name and email address, and then you enter the raffle. And then by Sunday, we'll let you know. Even if you don't win, you'll still get the first chapter for free. Awesome. That's such a cool giveaway.

I'll encourage more guests to do that. And so both of you will sign it. I guess if you're co-located, it makes it easier. Did I say three copies? I'm not sure if I said that. I don't think you did, but yeah, three copies. We'll give away three copies. Nice. And people can be anywhere in the world. Anywhere. Anywhere. We'll take care of the shipping. Yeah. Sweet. Yeah. Super generous. Thank you very much for doing that. So yeah, polarsguide.com slash SDS. You have until Sunday

to submit yourself into the raffle and get a signed copy of Python Polars: The Definitive Guide from both of its authors, our guests today, Jeroen and Thijs. Super, super cool. All right. So yeah. So now with that out of the way, tell us what caused you to write this book and what the process was like. Yeah. It started with you, right? Exactly. So let's start with the origin story here.

I joined Xomnia in January 2022. Is that right? Yeah. No? It sounds like he's asking somebody, but that's Jeroen asking himself. It sounds like he's having a conversation, maybe with Thijs. With me. It doesn't really matter when. So when I started, I was just getting to know everybody working in the office, and there was this one guy

really focused, working behind his laptop. Everybody was going to lunch, but he would just stay working. Turned out that was Ritchie Vink, the creator of Polars, I learned later. And I didn't know anything about Polars, but I didn't have an assignment yet. I had some time to explore a data set, and I decided, let's try out Polars. And I was immediately hooked.

Of course. And I immediately figured, okay, this is so cool. This deserves a book. This is going to be a big thing.

But I already knew, having written Data Science at the command line before, right? Twice. Twice, yeah. That I never wanted to write a book by myself anymore. So I needed... Is that the tragic story of how you wrote it once and then accidentally burned it and... The dog ate his homework. I wish. No, so I needed another victim, right? Someone to share the pain with.

And Thijs, so very shortly after that, I got assigned at a client with a large code base and Thijs was also working in that same team. So we were not only colleagues, we were also working for the same client in the same team. And so I felt like, hey, Thijs, he seems to be good at this. He likes to write. Why don't I ask him?

And to which his answer was, obviously, yes. Yeah. And so I had a meeting with O'Reilly anyway about whether I could do anything else for them, maybe a video course or something related to data science command line. But that's when I asked him, like, hey, have you heard of this thing called Polars?

And they said, yeah, yeah, we've had four proposals so far, but we've rejected them all. And I was like, oh wow, four proposals already. And so that's when I knew that we had to write a serious proposal. So we wrote one, over 15 pages. We brought in all the stats that we could. And, uh,

Of course, by then, O'Reilly hesitantly said yes. But after a few months, they realized like, oh, wait a minute. This is actually going to be a big thing. We want to have this book now. They started feeling the pressure. Started asking like, okay, about that deadline, like, is everything going all right? Yeah. That's really scary because writing a book is, it is torture. It's

When I wrote Deep Learning Illustrated, it was the worst experience of my life. The only thing that came close was writing a PhD dissertation. But with a PhD dissertation, there's not that much pressure because two people are going to read it. You're going to have your PhD examined, like the board. They're the only people that are going to read it. And then also there's this amusing thing of a number of girlfriends that I've had since doing the PhD, this early stage of dating when they're really excited about

having met me, that goes away quickly. But there's this very brief window. No, but they see it on the shelf and they're like, I'm going to read that. And I'm like, you're not.

You're really not. It's really not readable. It's designed for a really specific niche of individual in the world that you have to spend many years for this to make any sense. But writing a book like Deep Learning Illustrated, the idea was hopefully more than two people would read it.

And I don't know the whole time, I don't know if you feel differently about this, Jeroen, having now written several books before. So maybe you feel like, you know what, I can write a bestseller. I know the process, I know what to do. But at least for me, I've only released that one book so far. And for me, the whole time I was writing it, I was filled with this deep concern that it would come out and everyone would realize that I was a fraud. That I had no idea what I was talking about.

I don't know if you've had anything like that. - Yeah, I recognize that, especially with the first edition of Data Science at the Command Line, which I wrote right after I finished my PhD thesis. I was in this groove, but I really felt like an imposter during that entire time, especially since everybody and their dog seems to have an opinion about Linux and Unix and which tools to use. And so a lot of opinions

opinionated people there, which made it all the worse. But by the second edition, I realized like, hey,

You said bestseller? Well, I'm not sure that our book is going to be a bestseller. I'm pretty sure it's not, but... The Polars book? I mean, it might not be a New York Times bestseller, but I bet in some Amazon categories it will be. Maybe. Yeah, it's funny how Amazon assigns these categories. Number one in database design. Exactly. Graph database execution for children. Exactly.

Oh, wow. But what I have learned is that

I and Thijs, I knew that Thijs, we can definitely write a book, right? You don't have to know everything. That's what a lot of people think is that you have to be an expert in the topic. No, that's not true. You think maybe you think that you're an expert, but as you start writing, you'll realize that you have a lot of gaps in your knowledge. And that's when you start learning more and more about the topic.

So, by then, when we started writing Python Polars, I was pretty confident that as long as we would stay one step ahead, and what definitely helped is that we were able to implement the things that we learned at our client, and we can talk more about this later, how we actually put this into production,

But you'll figure things out along the way. You put the book into production? Not the book in production, but we have... I think that our client is probably... It's one of the first companies that has actually had Polars code running in production. It has had that for over a year now, since before the 1.0 release of Polars. So, yeah, so...

quite confident. And I guess the biggest takeaway here is that you don't have to know everything when you start writing a book. You'll figure things out along the way. Yeah, it turns out that the imposter syndrome is a natural part of the writing process.

Curious about Trainium2, the latest AI chip purpose-built by AWS for large-scale training and inference? Each Trainium2 instance packs a punch with 20.8 petaflops of compute power, but here's where things get really exciting. The new Trainium2 UltraServers combine 64 chips to deliver a massive 83 petaflops in a single node. These Trainium2 instances deliver 30 to 40% better price performance relative to GPU alternatives.

Major players in AI like Anthropic and Databricks, along with innovative startups like Poolside, have teamed up with AWS to power their next-gen AI projects on Trainium2. Want to see what Trainium2 can do for your AI workloads? Check out the links in the show notes. All right, now back to the show. Nice. And Thijs, what is it like working with a tyrant like Jeroen? Yeah.

You want me to leave the room? Blink twice. Exactly. In my opinion, it went very naturally. I think quite early on we already noticed that we have a relatively complementary writing style that I just start putting words on paper and start restructuring and moving it around and refine it more.

something that stems from the time I was still writing a thesis where I couldn't get anything on paper because I was so judging everything you put down like, nah, that's not quite it. And you kind of get stuck in that, right? So I learned to just get stuff out on paper and it may not be proper in the right format and the right semantics, not exactly the nuance you want to catch, but ultimately it gets you to where you want to be. It's just like the first 80% needs to come first and that's not perfect yet.

And one of Jeroen's qualities is that he can use his perfectionism in such a way that he's very good at the refining phase. So when I've put some meat in the chapters already, he comes and moves stuff around: "Have you thought about this?" or "Shouldn't you word it like this?" And that really dots the i's. So in that sense, not necessarily a tyrant, just a very effective perfectionist.

Thank you. It's a fine line. Yeah, there is a fine line. And I am very well aware that there is such a thing as preparing too much, as overthinking things. And it really helps when there is already something on the page

So for example, that could be text written by Thijs or by myself. What I sometimes do as a trick whenever I feel I'm stuck, because this is a book that involves a lot of code, I'll first write all the code cells, all the code chunks, so that I can then fill in the gaps with text along the way. That's one of a couple of tricks that I could apply here.

Very nice. How long do you think it'll be before book writing will just be completely... My next question is kind of a joke, but it's also just such an annoying question and I'm even regretting that it's going to come out of my mouth. I'm going to do it because now I've started down this road, but I was going to have that classic thing. It's like when I'm having...

When I'm out for drinks with friends who are, say, not data scientists or AI engineers or software developers, but maybe they listen to the podcast. And so, for example, I have a friend who's like, oh, you're really lucky that you got your book, Deep Learning Illustrated, out before the ChatGPT era, so that people know you really wrote it. And so I was going to have this dumb question, which is kind of trite, and you don't have to spend much time answering this. We can just get into some Polars topics next. But,

Do you think there will be a time in the foreseeable future where O'Reilly just asks a machine for a proposal, it creates a 15-page proposal, and then it says, you know, this is it. And then it goes and writes. If you want to keep on regurgitating existing knowledge, then yeah, using a stochastic parrot is great. But if you want to produce actual new knowledge, then I believe that humans are very much indispensable here.

Nice. Good answer. That was nice. Really a much, much better answer than I was anticipating, especially given the quality of the guests. But anyway, April 1st. Yeah, we are recording on April Fools' Day.

And I guess it's still the morning in Hawaii or something at the time that we're recording. So nice, let's talk about Polars. I do think your book is going to be a bestseller, because there is a Polars moment right now, as I talked about at the outset of this episode, with the popularity, at least in terms of GitHub stars, probably going to surpass the number of stars that Pandas has this year.

Ritchie Vink's and Marco Gorelli's episodes of this podcast last year were very popular in terms of both listens as well as social media reactions. And by the way, if people are interested,

because we've been careful with this episode with the topics that we've curated, we won't be overlapping with Polars topics from Ritchie's or Marco's episodes. So if you want even more Polars after this episode, you can check out 827 with Ritchie or 815 with Marco, or both. And I mean, they're outstanding episodes. Both are highly technical people, just like Jeroen and Thijs. And so these are all complementary episodes,

covering different aspects of the library. But yeah, very popular episodes, very popular social media reaction. I would not be surprised at all if your book sold like hotcakes. Everyone loves hotcakes. So let's talk about the grammar. So like R's Tidyverse, which they make at Posit, where you now work, Jeroen,

with Polars, there's also a grammar and a naming convention that is encouraged to preserve semantic clarity, which means that not only can you understand your code better when you come back to it later, but other people that you're working with can understand it more easily as well. In the book,

you two compare expressions in Polars to recipes. So specifically, I'm going to read a little snippet of your book here: "If you think of an expression as a recipe, then the operations would be the steps, and the functions and methods would be the cooks." So how does this metaphor shape your philosophy about best practices with data transformation design in order to deliver clean, readable pipelines, especially in large collaborative projects? Big chunk.

The short answer is no more brackets. When you read Pandas code, there are many brackets in there. And in a lot of cases, it's very difficult to reason about what the code is actually doing.

And so with Polars, you take a different approach, not only with those expressions, which are indeed the building blocks, those small recipes, but also the part where you use the expression, namely in the entire query. So it's almost like you're writing a paragraph, right? To come back to book writing, you're writing a paragraph of things that you want to do. It's a logical...

element in your entire pipeline. And that's much easier to reason about. Yeah, and I think one of the things I've always liked most about Polars is the very declarative approach to what you're writing down. So in Pandas, it can be the case that you're

very focused on specific operations in parts of your data frame, which can make it hard to follow what exactly is going on under the hood. But with Polars, you declare what you want as the end result. You just leave the specific processing and optimization to the engine. And that makes it way easier to read. Maybe this is a good moment also to clarify that we are very appreciative of pandas.

We're not here to bash pandas at all. That's what you sometimes see online, is these comparisons that are not done in a very... Elegant manner. Not in an elegant manner. That's not us. Without pandas, there wouldn't be Polars. So we are...

very much appreciative of what Wes McKinney and his team have done. - Absolutely. - Yeah. - The standing on the shoulders of giants, right? - Wes is now also, so Wes, the creator of the Pandas Library, which for people who aren't already practicing data scientists, you may not be aware that Pandas has been for some time now, for a decade at least,

the de facto standard for working with data frames, which are a kind of data structure like what you'd imagine in a spreadsheet, an Excel kind of tool, where you can have columns that represent different kinds of data, so you're not restricted to having just a matrix of numbers, of float values. For example, with a data frame, you could have, similar to this idea of column names, one column that's string information, like company names,

and another one that's number information, like how much revenue those companies had in a year. And so for a decade or more, Wes McKinney's Pandas library has been the standard for working with data frames in Python. They're hugely important because, as a data scientist or data analyst, you're constantly working with different types of data like that. Working with them in Pandas has been key, but Polars has taken off

again, like hotcakes. It's burst onto the scene recently. Ritchie has led development of this, Ritchie Vink. I realize that you guys don't... It's not nice to bash pandas, but why...

Why are so many people switching over to Polars today? What's the kind of nuanced argument that even maybe Wes McKinney himself would forward? So I think, over the many times we talked with Ritchie in the course of writing the book, one of the main experiences that shaped how he wanted Polars to work

was some frustrations that he had when running his pipeline, where only 20 minutes in you run into some trouble and it crashes. And that's not something you could have seen up front. So this is one of the experiences that helped him shape what Polars ultimately became. And there's also a lot of good things that he saw in how Pandas works that he wanted to take and put in Polars.

But generally, Pandas became a big inspiration, both good and bad, for Polars and also other libraries, like Spark. You can especially see that Spark's syntax is a lot like how Polars turned out.

And there's other elements, like for example from the Rust language, that Ritchie took to implement in Polars, because it just made it work so nicely. Yeah, he talks a lot about Rust in his episode. Cool. All right. So that kind of gives us a bit of a foundation around your book and around Polars and why people are using it more and more for DataFrame operations today.

Earlier in the episode, you mentioned about a real-world implementation of Polars. Maybe, as you said, the first ever production instance of Polars. Am I right in understanding that's Alliander? I'm probably butchering the pronunciation of that.

Yeah, Alliander. It's a power grid provider in the Netherlands. Also, they provide the infrastructure for both electricity and gas in a third to a half of the Netherlands, I believe. So the largest utility company in the Netherlands, therefore. I can't even say Netherlands. That's how bad I am at Dutch pronunciation. Nederlands. That's actually easier, isn't it? Yeah.

For us it is. Oh, that's what you're talking about. I was wondering. Where are these Netherlands? That ain't no country I ever heard of. And yeah, so tell us about that project and what it was like. And actually, it'd be interesting to know when there was overlap between working on the book and working on that project, and whether working on a Polars book helped with a real-world implementation. Anyway, that's kind of an interesting side question. Yeah. Yeah. So the origin story here is that

Thijs and I, we were both very excited about Polars. We were writing a book about it. And then all of a sudden, it became clear that at Alliander, we needed to speed up the pipeline. We needed to lower cost. We needed to process much more data.

in the current state that just wasn't possible. It was a combination of not only Python and pandas, but also R code. So it was very inefficient. To give you an idea, we were running this on a single AWS instance that had over 700 gigs of RAM.

700 gigs of RAM. And so, yeah, we can provide you a link with more backstory to this with some actual numbers. But we were very excited and we were like, hey, let's try this out. Let's do this. At first, the team was very hesitant, right? We're there, two people or three, actually. We had another colleague, three people.

promoting Polars, which is being developed at Xomnia. So they were very skeptical, understandably. So what we did in order to convince them was to just take on a very small piece of code, some low-hanging fruit, reimplement the Pandas code in Polars, benchmark it, and then just show the numbers.

And by then they were immediately convinced, right, this is indeed way faster, uses way less memory. Let's try this out. Let's take on this huge code base piece by piece by translating, not one-to-one because you can't do that. You really have to reason about the inputs and the outputs and then do it in an idiomatic way, right? You cannot just translate pandas to Polars.

And, you know, I think it took us, well, what, six months? A year? I don't even remember. But eventually I left that client at that time, but there was a moment like, okay, we can

we can now get rid of R and pandas as a dependency of this project. And it's been running smooth ever since. Yeah, definitely. Yeah, I think ultimately the size of jobs at the beginning was about 500 gigabytes for just that task.

doing just one calculation. And we shrunk it down, both as a consequence of implementing Polars, but also, as we were going, by rehashing some of the code structure that we were using in the project. We brought it all the way down from 500 to 40 gigabytes, which makes the calculations a lot more doable. Yeah.

And so the second part of your question was like, okay, how did this influence each other? The book writing and putting it into production. And this was, yeah, it was a perfect match because when you're just writing

when you actually need to put it into production, when you have a real problem to solve, that's also when you start to notice the limits, right? Or maybe inconsistencies or missing functionality. For example,

there was this random sampling with weights, right? That's something that you can do in pandas. You just give it another column that indicates the weights for the sampling. That's something, maybe even up until this point, that Polars doesn't have. Luckily, that was for an ad hoc analysis that we had to do. But at that point, you know, it becomes clear what Polars can and cannot do. Also,

when you write, you start to look at things from a little bit of a higher level. So sometimes we noticed inconsistencies in naming or missing methods. Like, hey, why is there no inline operator for the XOR operation? That's something that nobody ever thinks about. But when you need to put in a table in your book,

and you need to fill in all the pieces, that's when you start noticing these kinds of things. So we were able to also submit some issues, maybe even a few pull requests, to Polars itself along the way.

This episode is sponsored by Adverity, an integrated data platform for connecting, managing, and using your data at scale. Imagine being able to ask your data a question just like you would a colleague and getting an answer instantly. No more digging through dashboards, waiting on reports, or dealing with complex BI tools. Just the insights you need, right when you need them. With Adverity's AI-powered Data Conversations, marketers will finally talk to their data in plain English, get instant answers,

make smarter decisions, collaborate more easily, and cut reporting time in half. What questions will you ask? To learn more, check out the show notes or visit www.adverity.com. That's A-D-V-E-R-I-T-Y dot com.

Very cool. So you're actually influencing the library itself as you're writing the Polars book, as you're working on consulting projects, bringing Polars into the real world, getting huge benefits in terms of memory footprint. That 10x figure that you gave there, 500 gigs of memory down to 40, that's massive. Definitely makes it a lot easier to be working with the data.

Do you happen to have, and no pressure if you don't, but it was nice to get that 10x improvement for memory. Do you happen to know what it was for compute time? Was it about 10x as well, kind of thing? I think ultimately the compute time, because along the way, one of the things why we had to optimize the code was because

the requirements for the number of samples that we were running for a certain simulation were supposed to hit 50 samples. That was what the stakeholders asked us to strive for. And the 500-gigabyte instances were already at 25 samples. So we couldn't push it higher because it just stacked higher and higher. And in the end, ultimately, we were able to do those 50 samples in the same timeframe that it took to do the 25 samples at the beginning. Cool.

Very nice. I want to move on to a slightly different topic from your book, but it's related to this idea that you just mentioned around improving code bases, improving the Polars library and the flexibility, the full breadth of capabilities that it has. In your book, you introduce a way to style tables with a package called great underscore tables. Great Tables. But yeah, if you're typing it out, the package is great underscore tables.

And in a talk you were in, you actually mentioned that tables are underappreciated in visualization. So could you elaborate on why this Great Tables package was created, what it does, and maybe what its advantages are over existing approaches out there? In hindsight, so now that this package exists, right, Great Tables, it's strange that there wasn't already a package.

Because tables are everywhere, especially when people are working with Excel. A lot of people really like to add styling to this in order to make it presentable to stakeholders. Add some color, use

currencies, what have you. Maybe some mini graphs in there, right? So now that it's there, it's so obvious that there should be a package for this. So Rich, I'm actually not sure how to pronounce his last name, the creator of Great Tables, Rich Iannone, I'm butchering that.

But I do know the co-creator. Well, they're both my colleagues. I should know, but I just call them Rich and Michael Chow. Great folks. You should have them on the show as well. They created the Great Tables package. And just a few days ago, I saw a post by someone about Polars advocating, or actually recommending, that, okay, it's useful to add in the dollar sign when you're presenting currency.

But what he was doing, he was actually changing the underlying data. I'm like, wait a minute, that's not the way to do it. You want to change how it's represented, right? This layer on top of it. That's what you need to do. And that's what great tables can provide. So you're not changing those floats or integers to strings in order to format it. That's not the way to do it. No. So there should be another layer and

Python has a myriad of data visualization packages, but when it comes to producing tables, well, I only know of one, and that's Great Tables.
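The principle Jeroen is describing, keeping the data numeric and applying formatting only at the presentation layer, is exactly what Great Tables does for you. As a minimal standard-library illustration of the idea (the function name and data here are made up for this sketch):

```python
def fmt_currency(values, symbol="$", decimals=2):
    """Render numbers for display; the underlying data is never mutated."""
    return [f"{symbol}{v:,.{decimals}f}" for v in values]

revenues = [1999.5, 3.25, 120000.0]
display = fmt_currency(revenues)
print(display)  # ['$1,999.50', '$3.25', '$120,000.00']

# The floats are untouched, so later calculations still work
assert revenues == [1999.5, 3.25, 120000.0]
assert sum(revenues) == 122002.75
```

Great Tables works the same way: a call like `GT(df).fmt_currency(...)` attaches a formatting spec to the rendered table while leaving the data frame's floats and integers intact.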

So with Polars, you can indeed style data frames using the Great Tables package created by Rich and Michael Chow. So you use the df.style accessor, and that will then use the Great Tables package under the hood. There you go. And so, maybe I'll try to explain back to you the example you just gave me there with the dollar signs, and you can tell me if I'm getting this right. Basically, you're saying

If you have this huge, it doesn't matter if you have a very small table. If you think about a spreadsheet with 100 rows, it doesn't really matter if you write some kind of find-and-replace that goes and adds in dollar signs at the beginning of every number in a column. But if you have a gigantic piece of data, then trying to

edit each of the individual items in that gigantic column would be very expensive, computationally and in memory. And so with Great Tables, you have this abstraction above, where you don't need to individually change that information in all the rows. It's just like an attribute of the column that changes.

Yeah, I wasn't really hinting at the performance issues right there. It just doesn't feel right to me that you're changing the actual data. You want to keep the data the data. Because you never know what you want to do after that. Maybe you want to have a subsequent calculation going on. Then you have to strip the dollar sign again. Yeah. Also, when you want to have rounded numbers. There are so many instances where you just...

want to change the representation and keep the underlying data intact. Nice. Okay, yeah, great explanation. Cool. So your book, listeners should definitely check that out, the Python Polars Definitive Guide, or I should say it properly, Python Polars: The Definitive Guide. Nice.

The Definitive Guide.

We have been talking in this episode, obviously, about Polar's a lot, which is a popular Python package. But another Python package that is really taking off recently is UV. And so it's a Python package and project manager that climbed from zero GitHub stars to, yeah, I actually don't have the number in front of me right now, but a very large number. Lots of people are talking about UV.

Um, and it has exceeded Poetry, another longtime favorite for package management. So, um,

Thijs, in a blog post, you mentioned ditching Poetry for UV. You talk about increased speed, reliability, and ease of use as the reasons for that. Do you want to tell us more about UV and Poetry, and whether people should be making the switch? Gladly. Yeah, so this is also one of the things that we decided to do for the book.

We started out with Poetry and did all the management of Python versions with pyenv and other tools around it. But ultimately, when we were prepping the repo that can be used by the readers of the book, which contains all the different notebooks that come with the chapters, so you can follow along and execute the code yourself, play around with it.

Obviously, you need to set up an environment easily that can work on many different systems that all your readers might have. So in the beginning, we were thinking maybe to go for Docker because that generally is the easiest way to make something run on different kinds of configs. But as we were writing the book, UV became bigger. And at one point, I just started experimenting a little bit with UV to see how easy it is to set up.

And it boiled down to installing UV and then running UV sync, and it sets up everything. It sets up the right Python version. It just finds the right dependencies for your system. Everything just clicks. So that's ultimately what we went for as the final solution for that repo to allow people to just install UV and just make it work.

And one of the reasons I started playing around with UV was mostly because it goes with the trend of Rust-based tooling, which shows that the performance of tooling is very much a feature in itself. It's one of the things that Polars showcases, and it clicked very well. UV has the same kind of thing going for it, also being Rust-based tooling, which is leagues faster. It can be more than 10 times faster.

that, combined with the single-command setup, just made it a very quick and easy win. - Maybe you can say a few things about the regression that you found in Polars, where UV came in handy. - Yeah, so at some point, UV is so fast that you can set up an environment on the fly, like an ephemeral environment that's set up for just that command and then torn down again. And with that, I was playing around to benchmark the different versions of Polars, to see what the speed is on different queries, different kinds of setups,

and iterating over the versions and just bumping it every time to see what happened. At one point I found, I think in version 1.2-point-something, that there was suddenly a regression. The queries started taking 10% longer to run the full benchmark, and it didn't really go down again. And drilling down, we were able to pinpoint the two specific queries of the benchmark that we were running that just spiked up on a certain version.

And because UV just sets it up so quickly, at one point, with a script for git bisect, which allows you to pinpoint the exact commit in the Polars repo where it started occurring,

allowed us to find which specific commit caused this regression. And funny enough, when I communicated it to the Polars guys that week, they hit the same code, and for some reason they couldn't quite figure out quickly what exactly caused it, but they hit the same code and refactored it, and it resolved itself. So ultimately, all was good.
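The bisect workflow Thijs describes can be automated end to end with `git bisect run`, which replays a test script across candidate commits. A self-contained toy sketch, with a fabricated repository and a marker file standing in for the real Polars benchmark:

```shell
# Toy reconstruction: build a throwaway repo in which one commit "introduces
# a regression" (a marker file standing in for a slow benchmark), then let
# git bisect run find that commit automatically. Names are illustrative.
set -eu
workdir=$(mktemp -d)
cd "$workdir"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git init -q

# Five commits; the fourth introduces the "regression"
for i in 1 2 3 4 5; do
  if [ "$i" -eq 4 ]; then echo slow > marker; git add marker; fi
  git commit -q --allow-empty -m "commit $i"
done

first=$(git rev-list --max-parents=0 HEAD)

# Exit 0 means "good", non-zero means "bad"; in the real case this script
# would run the Polars benchmark and compare timings against a threshold.
git bisect start HEAD "$first" >/dev/null
git bisect run sh -c '! test -f marker' >/dev/null
culprit=$(git rev-parse bisect/bad)
git bisect reset >/dev/null 2>&1
echo "regression introduced in: $(git log -1 --format=%s "$culprit")"
```

With a fast package manager, the test script can even recreate a fresh environment per commit, which is what makes bisecting a performance regression across released versions practical.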

But it was interesting to finally have a package manager that was able to be used so quickly that you can start using it for complete new use cases that you couldn't have thought of before. Nicely said. Yeah, very cool. I haven't been using UV myself yet, but it sounds like I should be.

I can definitely recommend it.

So your whole previous episode on this show, 531, was all about data science at the command line. Obviously, you've written two editions of the book, as discussed. You've written R packages that make the command line more interactive and playful. I don't know if I can pronounce them properly. There's Rayliber. R-A-Y-L-I-B-E-R. Rayliber.

RayLibber. Well, RayLibber is a wrapper around RayLib, which has nothing to do with the command line, but that's a C library to create video games. To create video games. To create video games. Yeah, yeah, yeah. And actually, I have a talk. I've given a talk at MYR a couple of years ago where I advocate for some of the things that video game programming offers, like...

2D and 3D graphics and interactivity, how that can be used for doing data science. Right, so that's RayLibber. That was a fun project, right? That you can actually create 3D environments from R.

but it has nothing to do with the command line. So let's talk about the command line. Yeah, so let's talk about the command line. I mean, yeah, I don't know if any of the others, rexpect or tmuxr, if those have anything to do with the command line. Oh, yeah, yeah. So tmuxr, that's a wrapper around tmux, you know, the terminal multiplexer. If you want to run multiple terminal sessions at once, have these sessions, you can interact with that programmatically from R using the tmuxr package.

Now, that's actually a triad of packages, and you just mentioned two of them. So there's tmuxr, there's rexpect, which wraps expect. I'm not sure if you're familiar with the expect tool. I'm not, no. That allows you to automate things on the command line. So, log in to a server automatically and then do certain things based on certain outputs.

I wrote a wrapper for that, and then there's also knitter-active. Now, I'm probably the only one who has used these packages at all, but I needed those. That's of course why you should write software in the first place: for yourself. But I needed those three packages in order to be able to write a book about the command line using Bookdown, using knitr, which is the system that I used at the time.

So a lot of, what's that called, yak shaving or bike shedding. Lots of work, not actual writing, but lots of work. And we had some of that for the Polars book as well.

But yeah, I mean, as when you're an engineer, when you're a developer and you're writing a book about development, there's always some kind of developing that you need to do on the side for the book, whether that's just to get in the groove or whether it's actually helpful. Just to make life easier. Yeah. Yeah. This episode of Super Data Science is brought to you by the Dell AI Factory with NVIDIA, helping you fast track your AI adoption from the desktop to the data center.

The Dell AI Factory with NVIDIA provides a simple development launchpad that allows you to perform local prototyping in a safe and secure environment. Next, develop and prepare to scale by rapidly building AI and data workflows with container-based microservices, and then deploy and optimize in the enterprise with a scalable infrastructure framework. Visit www.dell.com slash superdatascience to learn more. That's dell.com slash superdatascience.

Thank you for that extra context and explaining what those packages do. All of that was basically just to bolster, in a few minutes, your expertise in doing data science at the command line, as a real expert in that. Something that I want to highlight here is that in your course, Embrace the Command Line, which folks can check out, it's online. There's more information at jeroenjanssens.com slash embrace.

- Well, sorry to interrupt you. - No, please do. - This was a course that I've given a couple of times. This was a cohort-based course using Maven, and I've given it a couple of times and I no longer do it. So unfortunately, it's not available online. We can cut that out. - No, it's okay, you just leave it in. I don't mind my mistakes being on air. - Neither do I.

But nevertheless, in that course, or at least in the course information, you say that the command line is as powerful as it is intimidating. So for our listeners out there who maybe haven't crossed that emotional barrier, maybe they do program, they use Python, maybe they use R, or whatever programming languages they use,

but they haven't crossed that threshold, that emotional barrier to start using the command line. What do you recommend to students to kind of, to get past that emotional barrier and see the command line shell as a great creative space for data science and software development? - Yeah, it's unfortunate that, you know, when you first see this window, this terminal, this blinking cursor with a prompt waiting for your commands,

It's such a shame that this is indeed so intimidating. But that's how it was, of course, when Unix was first created in the 60s and 70s. At that time, they didn't even have screens, so things had to be terse. They weren't very flashy.

So there is indeed a hurdle for you to take, for you to embrace the command line. And there are certain tricks that you can apply, certain changes that you can make in order to make the command line a more pleasant environment, a more forgiving environment. So things that I always like to do are, let me try to come up with a couple of them,

First of all, use colors that you like, use a font that you like, add aliases so that you don't have to remember these long commands, these long incantations, by heart. So you make the experience more ergonomic. It also helps to work in an isolated environment, so that you know that you won't be able to break anything. Docker can be used for this.

And I think if you do these kind of things, experiment with the command line every day for a little bit. Don't try to do everything all at once. I mean, I don't. I just use it here and there as a complementary set of tools in addition to, well, all the other data science tools that you want to use.

And then, yeah, you'll gradually build up more and more appreciation of the command line. You'll be able to embrace it more and more, make it your own. Very nice. I love that. Nice. And so I realize now I've strayed. I've already kind of switched gears and taken us away from your Polars book. But I remember now that there were a couple of stories

that we discussed before coming on air that I really wanted to cover before this episode ended. So we're going to have this grating experience for the audience of going back to your book. But maybe that's a nice place to end anyway.

And so first what I wanted to talk about, and so you guys may or may not be aware of this, but in 2025, two of the biggest sponsors of this podcast, to whom we're very grateful because it allows us to keep the lights on and make this show for everyone, are Dell and NVIDIA.

And it sounds like for the appendix of your book, Dell and NVIDIA, you had some kind of partnership with them that allowed you to do more. Explain how they're involved with your book. Yeah. So at a certain point, I got a LinkedIn message from NVIDIA. It was something about being an influencer. And at first, I didn't think much of it.

After a week or two, I decided to reply, like, "Alright, I'm interested, let's chat." And it turned out that they actually wanted to collaborate with us. They were quite eager to send us some hardware so that we could benchmark Polars on the GPU. And we were like, "Great!"

Only thing is, we don't have anything to put that video card in. So that's when they brought in their partner, Dell, and Dell was able to supply the rest of the hardware. So that was a fantastic collaboration. And the way we did this is-- Thijs can say more about the software side of things, but in terms of hardware,

It was all in the States. So Dell had this laboratory where they had a beefy machine and they were able to swap out different NVIDIA video cards. So we did the RTX 6000. No, the Ada generation. The Ada generation. So these were all professional video cards, not the consumer. Yeah, the workstation variants. Yeah. So yeah, we were...

It was very important for us that we were able to benchmark things ourselves. That we wouldn't just copy numbers from some leaflet, some promotional material. We wanted to produce these numbers ourselves if we were going to put them in our book. And that was all fine. NVIDIA and Dell thought it was a great idea. And so eventually we were able to try out five different video cards

for a number of different settings and packages.

And that's all reported now in the appendix of the book. But Thijs, maybe you can say something about how you actually benchmark. Yeah. So to start off with a little more context: NVIDIA has a team called RAPIDS, which is working on creating all kinds of general-purpose computing packages that can run on the CUDA platform. And CUDA is the computation platform that NVIDIA opens up so you can run

any kind of calculation effectively on the GPU. And the difference between a normal CPU and a GPU is that a GPU has many relatively dumb, simple processors, just many of them. So if you are able to bend a problem, a calculation problem, into something that the GPU can run, it oftentimes accelerates by a lot, by a factor of up to 10.

So they also did this for packages like pandas. They have cuDF, is what their package is called. It's a DataFrame library, but it runs on the GPU. And they wanted to collaborate with Polars as well. But since Polars has this layered architecture, where a query runs through an optimizer first and only then gets sent to an engine,

it would be a waste to just put the Polars API on cuDF and translate it to normal cuDF functions, because a lot of the performance enhancements from Polars come from its optimizations. So instead, RAPIDS worked together with Polars and designed a GPU engine that gets its input from that optimization layer.

And because they recently opened up a beta for this new package, they got in contact with us to ask, "Hey, you guys are working on the book on Polars. Do you want to collaborate?" And we said, well, on the terms that we could test stuff ourselves and benchmark ourselves, we definitely said yes. And it turned out to be a lovely collaboration as well. That's a cool story.

Yeah, definitely. What were the results? Is that what you're going to tell me now? Please tell me the results. It's a lot faster. Yeah, it is. Yeah, so we already noticed in the beginning that the promotional material was a bit careful about what size of dataset would benefit from it. And it turned out from the tests that we were doing that it kicks in quite quickly. Because data needs to be transferred to the GPU, you get a small overhead.

So you start seeing the difference when the dataset size grows, but it's already from one gigabyte and up. So that's relatively quick, because most data that you would work with in a professional setting tends to grow a lot. And, uh,

We also noticed that even the relatively smaller GPU cards, with fewer processors, already benefit a lot from this, already get a big speedup from just using the GPU engine.

Very nice. Cool project, great results, unsurprising results given everything that we know about Polars already, including the examples that you gave at Alliander earlier in this episode. But cool to have that kind of comprehensive benchmarking there on five different NVIDIA cards. And cool that Dell supplied the server for you to be doing all that benchmarking yourself.

All right. And then one final story that I want to get in here. So I mentioned already how Marco Gorelli, so Marco Gorelli was our first ever Polars episode on this podcast. So that was episode 815. And then he introduced me to Ritchie, the creator of Polars, who came in not long after that, a couple of months later, in episode 827. And now you guys are, you're the final episode in the trilogy

on Polars. Well, probably not the final. We'll have more. But for this, it's like the original Star Wars: four, five, and six. You're episode six of the original Star Wars. So I understand that there's an amusing story involving Marco somehow sabotaging your book and forcing you to rewrite an entire chapter. Yeah, it's amusing now. But it wasn't. So...

We have to go back a little bit further. I was at a Christmas party organized by Xomnia where Ritchie was also present. And Ritchie was like, "Yeah, Polars is going to have data visualization capabilities." I'm like, "What? Python doesn't need another package to do data visualization. There are already two dozen, so many out there." So at first I was like, "Man, they keep expanding the library. We just want to finish this book."

So I was quite upset at first. After a while, I started to realize, okay, maybe it's not so bad. I mean, if the book has a chapter about data visualization, maybe it'll sell better if it has pretty pictures. So I started writing. I was quite happy to find out that

Polars itself doesn't do any data visualization. It has the df.plot namespace, but every method in that namespace calls out to another package, hvPlot. And I wasn't familiar with hvPlot yet. It's this meta package which can target Matplotlib and Plotly and another one, Bokeh. Thank you.

And so, okay, I really had to get into hvPlot, but I didn't just want to write about hvPlot. I also wanted to include Great Tables, right? I was a big fan of that. And you could argue that presenting a table is also a form of data visualization. I'm a big fan of plotnine. So it was going to be a huge chapter.

So I'd written this, and then all of a sudden I see on GitHub this pull request by Marco Gorelli. He was like, "Okay, I'm going to change out hvPlot for Altair."

I'm like, "What? Now I need to rewrite the entire chapter, or at least a big portion of it. Marco, what are you doing?" Now I know that Altair is a very good choice for this, especially when you are working in a browser and you want to create interactive data visualizations. That is something that plotnine, for example, doesn't support. Altair definitely has its use cases.

And I should have known better as well. hvPlot at the time, or the whole plotting functionality in Polars, was marked unstable. So I should have known better. I was just too eager to get it out there. And you know what? Marco and I, we get along really well. We collaborate now on getting the Narwhals project into plotnine, so that plotnine better supports Polars as well as pandas.

So, but yeah, that was the story of how I had to rewrite nearly everything inside chapter 16, Visualizing Data.

Nice. Great story. And it is funny to imagine Marco kind of sabotaging your book because he's an extremely nice, because we actually, the episode with Marco, I don't know if you know this or not, but I recorded it with him in person in London and he took a train. Yeah. He took a train from Cardiff to London, which is like three hours or something to come and record the episode. And yeah,

And then so we went for dinner afterward as well. And you really get this impression of a man who is exceedingly kind and cautious. Yeah, I hate it. I hate it. I wish he wasn't like that. No, he's a very generous and kind person. Definitely a pleasure to work with.

And I definitely love his dry sense of British humor. It's perfect. Every time he speaks at a conference, he tries to incorporate an expression from that country.

So when he was presenting at PyData Amsterdam, he used the expression "Helaas pindakaas", which translates to "too bad, peanut butter". Doesn't make any sense if you're not Dutch. But that's what I... And in Germany, he talks about the concept of "Lüften", opening up all your windows to air the place out.

I think at PyData in Paris, he had a dataset about the pronunciation of pain au chocolat. At some point, the further south you go, it changes to something else. Something like that worked in. He's a funny guy. Definitely.

That's really funny. Yeah, so people want some more of that humor. And he was also extremely technical. I mean, the depth. You know, in this episode, we haven't gone into that, and that wasn't really the point of the episode. In that one, as well as Ritchie's episode, so in 815 with Marco Gorelli or 827 with Ritchie Vink,

in different ways, you get into the nitty-gritty of why Polars is so fast under the hood. If people want to check those out, we'll have links to those in the show notes.

All right, and so that brings us to the end of this episode, basically. It's been awesome having both of you on the show. Jeroen, welcome back. It's been a pleasure. Yeah, hopefully we'll have you back again soon. Thijs, I hope your first podcast experience wasn't too painful. Great.

Um, Thijs, let us know if you have a book recommendation for us, to wrap things up here. I do. Yeah. So, normally I'm more into the fantasy side of things, to escape, but one of the things I love is when someone is able to explain something complex in a way that has a proper story.

And one of the books I recently read was "Immune" by Philipp Dettmer, the main writer of Kurzgesagt, which is also a YouTube channel with many explanation videos on all kinds of intense topics. But in this book, he dives into the immune system, which is exceedingly complex.

Yet he still is able to explain it very, very well, in a way that's both informative and entertaining. So that's definitely one of the books I recently finished that I loved reading. I love that recommendation. It has been advertised to me at the end of a number of Kurzgesagt videos. Kurzgesagt is one of the few YouTube channels that I subscribe to, and it is excellent. I mean, if I could... Yeah, it's perfect.

If I could go back in time and somehow be responsible for one YouTube channel, it would probably be Kurzgesagt. I think it's amazing. Good choice. And for our listeners out there who don't speak German, Kurzgesagt means shortly said. And so it is just like you described his immune book being...

you know, well-spoken, easy to understand on a complex topic. That's what all the videos on the channel aim to do. And I highly recommend Kurzgesagt. So many fascinating topics, scientific ones, but also philosophical ones. It really gets into the big questions of life in the universe. It's interesting. Nice. And Jeroen, do you have a book recommendation for us? Sure, I do. I am currently enjoying Unix: A History and a Memoir by Brian Kernighan. It's a short book. I haven't finished it just yet,

But, you know, as we have talked about, Unix can be a very dry topic, but it is so interesting to learn about the history and the people behind it and the politics and the things that went on in developing such a

profound piece of software. Yeah, truly revolutionary. That doesn't overstate it. If you're watching the video version, you can see what the book looks like. But yeah, truly revolutionary, Unix is. I mean, it's staggering to think of the shoulders of the giants that we now get to stand on,

inventing our relative trivialities in computing. Although, interestingly, even though they're relative trivialities compared to, say, Unix, because of what's to come in terms of this age of intelligence that we're emerging into with intelligent machines,

maybe in decades they'll be looking back and thinking about Great Tables and the big difference in the history that the Great Tables book made. Guys, it's been so great having you on the show.

If folks want more of your humor and brilliant insights, how can they follow you after the program? Well, the easiest way is probably to go to the same website that we already mentioned, PolarSky.com. And from there, you can find links to our LinkedIn pages and other places where we are. That's probably easiest. Cool, yeah. And so basically you would say LinkedIn is the main social medium for you both? Yeah, yeah. I'm trying out Bluesky.

I've been pretty active or I've been enjoying Twitter for a long time, but that has changed. LinkedIn, we seem to get a lot of response on LinkedIn. Whenever we post something about the book, there's a good vibe. I mean, there's a lot of...

other stuff going on on LinkedIn, but yeah, it works. It works. Yeah. LinkedIn is working these days, which we can't say for every social media platform out there. That's definitely where you can find me the most active as well.

Awesome. Thanks so much. This was another great episode. Really had an awesome time with you guys. Appreciate you co-locating yourselves in the well-appointed perfectionist tyrant, your own studio over there. And yeah, I look forward to welcoming you guys on the show again sometime. Thanks, Jon. Thanks so much for having us.

What a tremendous episode with Jeroen and Thijs. In it, they covered Python Polars and how it's a high-performance data frame library that uses a declarative approach, allowing users to state what they want as an end result while the engine handles optimization. We talked about how the book Python Polars: The Definitive Guide by Jeroen and Thijs provides comprehensive coverage of Polars and includes benchmarks showing Polars can reduce memory usage and compute time by up to 10x compared to pandas, the standard,

well, or at least until recently, until Polars came along, the standard data frame operations library. Polars uses a grammar where expressions function like recipes, operations are steps, and functions or methods are the cooks,

creating readable code without excessive brackets. We talked about how, when implemented at Alliander, a Dutch power grid provider, Polars reduced memory requirements from 500 gigabytes to 40 gigabytes, so a 10x reduction, and doubled processing capacity.

We also talked about things other than Polars. We talked, for example, about how the UV package manager is emerging as a faster, Rust-based alternative to Poetry, allowing for quick environment setup and teardown for benchmarking. Or whatever. It doesn't need to be for benchmarking. It can be for whatever reason you would need Python packages. And finally, we talked about Great Tables and how it provides styling capabilities for data tables, allowing presentation-ready formatting without modifying the underlying data.

As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Jeroen and Thijs's social media profiles, as well as my own, at superdatascience.com slash 885. And if you'd like to engage with me in person, as opposed to just through social media, I'd love to meet you in real life next week

at the Open Data Science Conference, ODSC East, running from May 13th to 15th in Boston. Boston. I'll be hosting the keynote sessions and along with my longtime friend and colleague, the extraordinary Ed Donner, I'll be delivering a four-hour hands-on training in Python to demonstrate how you can design, train, and deploy cutting-edge multi-agent AI systems for real-life applications. That is going to be fun.

Thanks, of course, to everyone on the Super Data Science podcast team: our podcast manager, Sonia Breivich; media editor, Mario Pombo; Nathan Daly and Natalie Zheisky on partnerships; our researcher, Serge Massis; our writer, Dr. Zahra Karche; and our founder, Kirill Eremenko.

Thanks to all of them for producing another laugh-filled episode for us today. For enabling that super team to create this free podcast for you, we are, of course, deeply grateful to our sponsors. You, listener, can support this show by checking out our sponsors' links, which are in the show notes. And if you yourself are interested in sponsoring an episode, you can get the details on how to do that at jonkrohn.com/podcast. Otherwise, share, review, subscribe, edit videos into shorts to your heart's content,

But most importantly, just keep on tuning in. I'm so grateful to have you listening and hope I can continue to make episodes you love for years and years to come. Until next time, keep on rocking it out there. And I'm looking forward to enjoying another round of the Super Data Science Podcast with you very soon.