
862: In Case You Missed It in January 2025

2025/2/14

Super Data Science: ML & AI Podcast with Jon Krohn

People
Azeem Azhar
Brooke Hopkins
Florian Neukart
Hadelin de Ponteves
Jon Krohn
Kirill Eremenko
Topics
Jon Krohn: I think AI development will continue to advance rapidly in 2025, but whether all that capability is sustainable is a question worth considering. We need to pay attention to the pace of technological change and prepare for the future. Azeem Azhar: I think humans find it hard to grasp exponential growth and often view problems through a linear lens. Businesses and decision-makers should understand exponential processes, avoid linear thinking, and return to first-principles reasoning to better prepare for future technological shifts. When planning and modeling, they should use dynamic percentages rather than linear increments and account for feedback loops. Reflecting on how you yourself have adapted to exponential technological change can help with planning and forecasting. We need to recognize change and adapt to it, and encouraging feedback loops is a great way to make sure models and algorithms move in the right direction.


Transcript


This is episode number 862, our In Case You Missed It in January episode. Happy Valentine's Day, listeners, and welcome back to the Super Data Science Podcast. I'm your amiable host, Jon Krohn. This is an In Case You Missed It episode that highlights the best parts of conversations we had on the show over the past month.

With a new year, 2025, starting in January, my conversations on the show focused on glimpsing into what the years ahead may bring. Developments in AI continue to rocket forward, but is all that power sustainable? In episode 855, I asked Azeem Azhar, the famed futurist, how exponential changes in tech could radically impact our future. So following on from this idea of exponential growth, I'm going to talk a little bit about

Humans, and maybe turkeys as well, seem to be poor at being able to

imagine that they're on this exponential curve. And so Ray Kurzweil, for example, another famous futurist, said that our intuition about the future is linear, but the reality of it, as we've already been discussing in this episode, Azeem, is exponential. And you similarly, in your book,

You talked in chapter three about how, for example, the COVID pandemic, when it was unfurling around the world in 2020, was experiencing exponential growth.

And I experienced that in real time, you know, many times every day, probably a hundred times a day, refreshing how many more infections there were in New York state. And it was very difficult for me, even as somebody with a lot of statistical background, you know, who's been a data scientist for a decade,

even for me, it was difficult to kind of process how this exponential change was happening. So

Yeah. So given the difficulties that even experts face in predicting exponential growth or being able to have intuitions about exponential growth, how can businesses, policymakers, our listeners better prepare for future technological shifts? I agree. It's really difficult to normalize and rationalize in your head the speed of that change. I don't

I do think that it's quite commonplace. A very simple exponential process is compound interest. And virtually all of us start saving for our pensions or 401ks or whatever it happens to be too late. The right time to start is when you're 23 and you just put 10 bucks a month away, knowing it's going to compound. And I think many of us are guilty of that. I am as well. I think the

There are companies who have internalized this possibility. And I think the technology industry, as it comes out of the Bay Area, has very much done that. They have relied on understanding that Moore's law keeps driving prices down and that you aren't really going to

systemically run out of capacity or compute. You may have crunch periods where you can't onboard the machines or the hard drives or the storage fast enough, but in general, you won't do that. So I think that one of the ways that you have to understand this is understand the processes and understand that these processes absolutely exist. And I think it's really unhelpful for

when you're trying to make sense of this world, for people to think in linear terms. And I still see it. And I'm sure you may see it when you're helping clients or people at work and you see their business plan and it shows a sort of fixed increment of growth, and nothing grows that way. Everything follows a phase of a logistic S curve, where you have an exponential phase that tails off. Nothing is linear except for our birthdays, one to two to three to four.
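To make that contrast concrete, here is a minimal sketch, not from the episode, comparing a fixed-increment forecast with compounding growth and a logistic S-curve; the starting value, growth rate, and carrying capacity are invented for illustration.

```python
# Illustrative comparison of fixed-increment (linear) growth, compounding
# (exponential) growth, and a logistic S-curve. All numbers are made up.

def linear(start, increment, periods):
    """Fixed increment per period, e.g. 'it'll go up by 20 every year'."""
    return [start + increment * t for t in range(periods + 1)]

def compounding(start, rate, periods):
    """Dynamic percentage growth: each period multiplies by (1 + rate)."""
    return [start * (1 + rate) ** t for t in range(periods + 1)]

def logistic(start, rate, capacity, periods):
    """Exponential phase that tails off as it approaches a carrying capacity."""
    values = [start]
    for _ in range(periods):
        x = values[-1]
        values.append(x + rate * x * (1 - x / capacity))
    return values

if __name__ == "__main__":
    periods = 10
    print("linear     ", [round(v) for v in linear(100, 20, periods)])
    print("compounding", [round(v) for v in compounding(100, 0.4, periods)])
    print("logistic   ", [round(v) for v in logistic(100, 0.4, 2000, periods)])
```

For the first couple of periods the three series stay close; a few periods later the compounding ones have pulled far ahead, which is exactly the intuition gap being described.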

So I think a lot of the tools are to hand, but it is very difficult. And what you need to do at these moments is perhaps go back to first principles thinking and perhaps say, look, the heuristics we've used were just that. They were really helpful in a world that doesn't move as quickly. But in a world that moves this quickly, we have to go back to heuristics. Sorry, pardon me, first principle thinking. And the thing that's so funny, Jon, is that

Most people who are listening to this podcast will have, beyond their experience with COVID, they will have lived through exponential technologies because they will have lived through upgrading their iPhone or their Android phone every two years and getting twice as much compute for the dollar they spend. They will have lived through, if they're data scientists,

their data array or their data lake going from a gigabyte to 100 gigabytes to 10 terabytes to a petabyte and beyond, right? They've literally witnessed it. And yet it still becomes quite difficult. I think going back to first principles is a really helpful way of doing that. Yeah, yeah, yeah. And so in terms of something that people could be doing, this idea of first principles in this instance here, so that's literally thinking about sketching for yourself

those kinds of changes and thinking about how you adapted to those changes and making projections based on that? Yeah, I think that's a really good way of doing it. I mean, when I do my own planning and build models of where the business might go or where usage might go, and I've done this for more than 20 years, I've never...

put in linear increases, like it'll go up by 20, it'll go up by 20. I've always gone in and put in a dynamic percentage because a percentage compounds. And one of the things that drives these sorts of exponentials is feedback loops. So the reason something accelerates, I mean, let's think about silicon chips, right? Why did chips during the 80s and the 90s and the 2000s

get better and faster. It was because there was a feedback loop. When Intel came out with a new chip, it allowed Microsoft to deliver better tooling on Windows

which gave people an incentive to upgrade their computers, which put money in the system, which allowed Intel to develop a new chip, which allowed Microsoft to push out more features. And that feedback loop accelerates. And so sometimes when I do my planning, I will also try to put those types of feedback loops in because an outcome of a feedback loop will often be a

a curve that ultimately has that sort of quality of taking off. And in a lot of places, you end up with these linear forecasts. And if you're sitting there and you're thinking, listen, I need to put in my budget request for next year for storage on S3. And I also need to give some indication of what's going to happen the year after and the year after that and the year after that.
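As a rough sketch of the planning approach described here, assuming invented growth rates and an invented feedback coefficient, the forecast below uses a dynamic percentage rather than a fixed increment, and lets the growth rate itself creep up each year to mimic a feedback loop.

```python
# Hypothetical storage-budget forecast: compounding growth plus a simple
# feedback term (more usage -> more value delivered -> slightly faster
# growth next year). All parameters are illustrative assumptions.

def forecast_storage_tb(start_tb=50.0, base_growth=0.35, feedback=0.02, years=4):
    """Project storage (in TB) per year under compounding growth whose
    rate itself increases a little each year via a feedback term."""
    projections = [start_tb]
    growth = base_growth
    for _ in range(years):
        projections.append(projections[-1] * (1 + growth))
        growth += feedback  # feedback loop: growth accelerates slightly each year
    return projections

if __name__ == "__main__":
    linear_plan = [50 + 10 * year for year in range(5)]        # fixed increment
    compound_plan = [round(tb, 1) for tb in forecast_storage_tb()]
    print("linear plan   (TB):", linear_plan)
    print("compound plan (TB):", compound_plan)
```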

If it's growing linearly, I think you're making incredibly extreme assumptions based on what evidence has shown us. So you have to go back and start to say, how do I put in more realistic assumptions, even if it's going to freak the financial, the CFO out, because that's what history has shown us. As Azeem says, it's super important to recognize change so that we can adapt to it. Encouraging feedback loops is one great way to make sure your models and algorithms are moving in the right direction. Now,

Quantum machine learning is opening up even more ways to solve computationally challenging problems and model the world. In episode 851, I talked to Dr. Florian Neukart about how quantum computing can keep pushing the envelope. Now we have a bit of a sense of the theory and the special things that you can do with quantum computing. So can you provide an example of a practical, maybe optimization problem? That seems like the kind of thing that you guys do at Terra Quantum a bit. So like some kind of

some kind of practical problem that is intractable for classical computers, but where, with some quantum computing as well, typically a hybrid system it sounds like, we can have a real-world application that provides some value. Yes. So there are so many. So these three branches that we look into, as everyone who does quantum computing does, are machine learning, and, as you said, optimization, and then simulation.

One problem in optimization that sounds boring at first is scheduling. But it is impossible to tackle no matter how powerful a classical computer you have. So the challenge is manifold. Scheduling appears in production. Scheduling appears in

hospitals, when you have to do plans for the nurses and the doctors. Scheduling appears in computers in electric vehicles, when you want to optimize the subroutines for power consumption.

One of the things that we did with an automotive company, with Volkswagen in that case, was a scheduling problem for production. So imagine you have vehicles coming out of the production line, then all of these vehicles must undergo a couple of tests.

Ideally, I can test every vehicle for everything. But the reality is you don't have enough time, you don't have enough people, and not all of these people doing vehicle tests have the same skills. Especially if it's emissions testing. I mean, then you've really got to skip a few cars.

Yeah, that one. So some of these tests, of course, you can plan because you get reports, field errors. The workshops will report, well, I have these couple customers complaining about water damage. So anytime it rains, it gets wet inside the vehicle. So then you do water tests. But then there are 250 something test classes and each of these test classes has subtests. So

So the question now is, given the staff, the personnel in production with the skills available today, how can I maximize the number of tests for all of these vehicles that come out? And that is a very complex scheduling problem. But the same algorithm can be applied, as I said before, for scheduling subroutines in vehicles.

in electric vehicles, so you want to minimize power consumption, so then maybe you have two subroutines that use the same data. So instead of loading into memory, deleting it and loading it again, maybe I can execute these subroutines in sequence and access the data in sequence before I delete it. So these are things where this can be applied.

They don't sound very exciting at the beginning, and you would wonder, is this really something where I would need quantum computing? But you do, because in the end, with classical, non-quantum algorithms, the only thing you can do is use heuristics and make approximations. So you can never be sure: is this really

the best solution I can find. I must admit also with a quantum computer, you cannot be sure, but what you can do then is you just compare the classical and the quantum algorithm. And if the quantum algorithm gives me a better solution, then that's the one.

that I take. Other problems are in logistics. We did many logistics optimization problems. So imagine you have a fleet of vehicles that have to transport goods through a network of hubs. For example, food, which can decay: you have to have vehicle number one at a certain hub between 1 and 3 p.m., otherwise there is a problem with the food, for example.

So how do you optimize the number of vehicles that I have in my transportation fleet, minimize the number of vehicles that I need to transport all the goods efficiently through the network? Or in other ways, how do I reduce the empty miles? The empty miles meaning I have trucks that just go from A to B but don't have any load. So how do I avoid that? So this is also one of the things, one of the problems that we have solved with a customer.

And then it ranges from optimization of satellite constellations, which we did, financial optimization. So you want to predict market behavior, you want to do collateral optimization, you want to do exotic options pricing, you want to do machine learning, you want to learn better, do better image classification. So all of these things benefit from hybrid quantum computing.
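As a toy illustration of the scheduling problems described above (and not Terra Quantum's actual algorithm), the sketch below greedily assigns vehicle tests to testers with matching skills and limited hours; the tests, skills, and hours are invented, and the greedy heuristic is exactly the kind of classical approximation that offers no guarantee of finding the best schedule.

```python
# Toy vehicle-test scheduling: assign tests to qualified testers with
# limited hours, trying to maximize how many tests get done. The data and
# the greedy heuristic are purely illustrative.

from dataclasses import dataclass

@dataclass
class Test:
    name: str
    skill: str
    hours: float

@dataclass
class Tester:
    name: str
    skills: set
    hours_left: float

def greedy_schedule(tests, testers):
    """Assign shortest tests first to any qualified tester with time left."""
    assignments = []
    for test in sorted(tests, key=lambda t: t.hours):
        for tester in testers:
            if test.skill in tester.skills and tester.hours_left >= test.hours:
                tester.hours_left -= test.hours
                assignments.append((test.name, tester.name))
                break
    return assignments

if __name__ == "__main__":
    tests = [Test("water ingress", "water", 2.0),
             Test("emissions", "emissions", 3.0),
             Test("brake check", "mechanical", 1.0),
             Test("infotainment", "software", 1.5)]
    testers = [Tester("A", {"water", "mechanical"}, 3.0),
               Tester("B", {"emissions", "software"}, 4.0)]
    schedule = greedy_schedule(tests, testers)
    print(f"scheduled {len(schedule)} of {len(tests)} tests:", schedule)
```

Here the greedy pass fits only three of the four tests into the available hours; with hundreds of test classes, sub-tests, and skill constraints, checking assignments exhaustively becomes intractable for classical machines, which is where heuristics, and the hybrid quantum approaches discussed above, come in.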

From quantum computing, which is super cool but still relatively niche today, we now turn to a challenge that lots of this show's listeners are facing immediately and on a daily basis because of all the new foundation models, such as large language models, that are being released every day. In episode 853, I sat down with my Super Data Science colleagues Kirill Eremenko and Hadelin de Ponteves to run through a checklist they provided to help business owners choose the perfect AI models for their needs.

Earlier, I talked about how large language models are a subset of all the foundation models out there. So it sounds like for that kind of medical application, unless it also needs to have vision to be able to read cancer scans, well, let's just assume that it sounded like that initial application was just going to be natural language in and out of the foundation model. So in that case, we could be like, okay, I can use a large language model. How do you choose

So maybe that kind of vaguely narrows you down within the space of all the possible foundation models you could select. There might be some kinds of things like that where you can say, okay, if I want text in and text out, I want an LLM. But more specifically, how do you choose from all of the available foundation models out there? Within the category of LLMs alone, there are thousands of possible options. How do you pick the right one for your application?

Absolutely right, Jon. So interesting how we're so spoiled for choice now, even though two and a half years ago, there was no such thing, right? Even two years ago, there was no such thing as...

or they were just starting: foundation models, LLMs, and so on. Now there are thousands, as you said. Well, there are a lot of factors, and we're going to highlight 12. You don't need to remember them off by heart, but see which ones you relate to as a listener, which ones you relate to the most, which ones will be most important for your business. So the first factor that you probably need to think about is cost, because there is a cost associated with using these models, and

they have different pricing. So you want to look at that as a starting point. Then there's modality, which, Jon, you alluded to. What kind of...

What kind of data are we talking about? We're talking about text data, video data, image data, and so on. So what outputs, what inputs do you want? What outputs do you want? Things like that. So different models are designed for different things. You need to check that one off right away as well. Customization options. So we'll talk about customization further down in this session.

Once you're aware of the customization options, once we've talked about them, you will know which ones you would need for your business. And then you would look at which ones the foundation model offers or supports.

Inference options. Inference is basically once you've deployed the model. So there's training, which is the first three steps, and then there's fine-tuning, which is also considered training. But then there's inference. Once you've deployed the model, how is it used? Is it used right away, instantly? If you're developing a gaming application, you want a foundation model to be integrated in your real-time game where users are playing with each other, for some user experience thing.

you want it to be producing outputs right away. There cannot be even like a second delay. So that's one option. Then there's maybe asynchronous inference, where you give the model some data and then it gives you an answer back in five minutes. And maybe there's like a batch transformation where it's done in the background later on. So we'll talk more about that in this session as well. Basically, you need to be aware of inference options that are relevant to your use case.

Then there's latency. Generally speaking, it's kind of tied in with inference options, but basically it's the delay that the users will get and how quickly the model responds.

With latency, if you want to be speaking in real time to the foundation model, it would need to have very low latency so that it feels like a natural conversation, for example. Yeah, exactly. That's a great example. Architecture is a bit more advanced. In some cases, you might need knowledge about the underlying architecture because that will affect how you're customizing the model or what performance you can get out of it. Usually, that's a more technical consideration for more technical users.

Performance benchmarks. So for these models, there's lots of score leaderboards, scoreboards. Ed Donner was on the show a few episodes ago, episode 847, and he was talking about leaderboards. What did he say? He's a leaderboard. I laughed at that. Yeah. So there's lots of leaderboards and lots of benchmarks that these models are compared against even before you customize them. We're not talking about your evaluation of the

fine-tuned or customized model. We're talking about the evaluation of that cake, that bottom layer of the cake. Even they have their own evaluations. How well do they perform on general language and general image tasks and things like that? So you might want to consider those. So you might want really high-performance models

but that's going to cost you a lot of money. You might be okay in your use case with average performance because it's not critical, business critical, or you don't need that super high level of accuracy. Then you might be able to get a cheaper model because you're not requiring this super high accuracy. You also need to consider language. If you're using a language model, what languages does it support, like human languages?

Then there's the size and complexity, also how many parameters. Small language models are becoming more popular these days. Can you use a small language model? Do you need to use a large language model? There's another consideration that's a bit more technical as well: the ability to scale a model. That's an important consideration that, I would imagine, business users who

are not as technically savvy might overlook. And that basically means, okay, you will deploy a model now and you can use it for your 10,000 users, but what if your business grows to a hundred thousand? How are you going to scale it? Are you going to scale it by

spending more money? Are you going to scale up the size of the underlying server, or is there a way to scale it by fine-tuning it and changing the underlying architecture somehow? That's a very technical consideration, but it can be a bottleneck for growth for businesses.

And the final two are, last but not least, compliance and licensing agreements. Very important as well. In certain jurisdictions, there are certain compliance requirements for data

or how data is processed, or even AI. There are more and more regulations coming out around AI. And licensing: of course, these models come with licenses. How are you going to make sure that your use is aligned with the license that you're getting from the provider? And the final consideration is environmental considerations. It might sound strange, but if you think about it,

these models, to pre-train them, there's a lot of compute required; a lot of energy is used up training these models. So you might want to look into, okay, well, am I supporting an organization that is environmentally conscious? Are they using the right chips? We'll have some comments on chips later down in the course.

And even the inference of this model: is this model efficient during inference? Am I going to be using a lot of electricity, or not as much electricity

as I could be with another model. So there you go. Those are the 12 considerations; maybe not all of them are applicable to your business, your use case, but those are the main ones that businesses tend to look out for when selecting a foundation model. Thanks, Kirill. At the end there, you let slip again "later on in this course," because I think you've been recording so many courses lately. But yeah, later in this episode, in fact, we'll be talking about chips.

And yeah, so to recap those 12 criteria for foundation model selection, you had cost, modality, customization, inference options, latency, architecture, performance benchmarks, language, size and complexity, ability to scale, compliance and licensing agreements, and finally the environmental considerations at the end. There's a ton there.
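One hedged way to start narrowing the field, offered as an illustration rather than anything prescribed in the episode, is to treat a few of the criteria as hard filters and the rest as a weighted score; the model names, scores, and weights below are entirely hypothetical.

```python
# Illustrative shortlisting of foundation models against a subset of the 12
# criteria: hard requirements (modality, licensing) filter candidates out,
# then weighted scores rank the rest. All names and numbers are made up.

CANDIDATES = {
    "model-a": {"modalities": {"text"}, "cost": 0.9, "latency": 0.8, "benchmarks": 0.6, "license_ok": True},
    "model-b": {"modalities": {"text", "image"}, "cost": 0.5, "latency": 0.6, "benchmarks": 0.9, "license_ok": True},
    "model-c": {"modalities": {"text"}, "cost": 0.7, "latency": 0.9, "benchmarks": 0.7, "license_ok": False},
}

WEIGHTS = {"cost": 0.4, "latency": 0.3, "benchmarks": 0.3}  # reflect what matters for your use case

def shortlist(candidates, required_modalities, weights):
    """Drop models that fail hard requirements, then rank the rest by weighted score."""
    ranked = []
    for name, attrs in candidates.items():
        if not required_modalities <= attrs["modalities"] or not attrs["license_ok"]:
            continue  # hard filters: modality support and licensing are non-negotiable
        score = sum(weights[criterion] * attrs[criterion] for criterion in weights)
        ranked.append((round(score, 2), name))
    return sorted(ranked, reverse=True)

if __name__ == "__main__":
    print(shortlist(CANDIDATES, required_modalities={"text"}, weights=WEIGHTS))
```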

I'd love to hear your thoughts on this. And particularly if there's, you know, some way across all of these dimensions, I mean, like, where do you start? How do you, how do you start to narrow down the world? I mean, I feel like now that I know these 12 dimensions,

criteria for making selections, I feel like I'm even more lost in the woods than before. - Yes, that's right. I was feeling the same at first when I was starting and building a new application of generative AI and I had to pick a foundation model.

In my experience, it had a lot to do with the dataset format, because different foundation models expect different dataset formats, especially when you fine-tune them. So for example, I'll tell you about my recent experience. I did another fine-tuning experiment.

I think it was on one of the Amazon Titan models. Yes, so it's one of the foundation models by Amazon, which, by the way, just released their brand new foundation models called Nova. So I can't wait to test them out. But yes, at the time, I chose the Amazon Titan foundation models because the data set that I used...

to augment, once again, the knowledge of the foundation model, was fitting perfectly to the Amazon Titan model.

So I chose this one. It could have been a different one if it was a different dataset format. But yes, it really depends on the experiment that you're working on. It depends on the goal. So that's kind of an extra criterion that you need to consider, take into account. And when I created this chatbot doctor, this time, yes, as I said before, it was a Llama model. And I chose this one once again for a format concern.
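To make the dataset-format point concrete, here is a minimal sketch of converting raw question-answer records into a JSON Lines fine-tuning file; the prompt/completion field names are a common pattern rather than the documented schema of any particular provider's model, so always check what format the foundation model you choose actually expects.

```python
# Illustrative conversion of raw Q&A records into a JSON Lines fine-tuning
# file. The "prompt"/"completion" keys are a generic convention here, not a
# documented schema for Titan, Llama, or any specific provider.

import json

raw_records = [
    {"question": "What are common symptoms of dehydration?",
     "answer": "Thirst, dark urine, fatigue, and dizziness are common signs."},
    {"question": "How long does a typical cold last?",
     "answer": "Most colds resolve within 7 to 10 days."},
]

def to_jsonl(records, path):
    """Write one JSON object per line in a prompt/completion style."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            row = {"prompt": record["question"], "completion": record["answer"]}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    to_jsonl(raw_records, "finetune_data.jsonl")
    with open("finetune_data.jsonl", encoding="utf-8") as f:
        print(f.read())
```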

So, yeah, in my experience, you know, in practical experience, it will have a lot to do with the dataset that you're using to implement the knowledge or to do fine-tuning, or even RAG, which we'll talk about later in this episode. Yeah. And this will sound like I'm giving you guys a boost, and I am giving you guys a boost, but I'm not doing it just because of this. But this kind of difficult decision, trying to figure out what kind of foundation model you should be using.

Making that selection effectively could depend a lot on people like you, the two of you, who are staying abreast of all the latest in foundation models. And so it's the perfect kind of opportunity to be working with your new company, with Bravo Tech, to be able to, you know, that three hours, for example, that you were offering up front at the top of the episode, a lot of that could be spent on just figuring out what kind of foundation model to be using for this particular use case. Definitely.

Fantastic. Yeah, thanks, Jon.

is custom metrics. So there could be complex scenarios where standard metrics, just plain old accuracy, aren't useful. I mean, actually, that would be something. How do you, in a scenario where

This isn't like a math test. Scoring a conversation isn't like a math test where there's a correct answer. You just get to some integer or some float and you're like, okay, that is the correct answer. Nice work algorithm. When you have an agent handling a complex task, there's an effectively infinite amount of variability involved.

where, you know, there's an infinite number of ways that it could be right. Not even, you know, not even including the infinite number of ways that it could also be wrong.

So what kinds of metrics do you use to evaluate whether an agent is performing correctly? And then maybe building on that, what kinds of custom metrics might your clients need? I think you're exactly right that it's really hard to find the line between this is objectively a good conversation and this is objectively a failing conversation; rather, it's a spectrum.

And so what we find works really well is layering metrics. So being able to run a whole suite of metrics and then looking at trends within those metrics. And this allows you to make trade-offs as well. So maybe you're a little bit worse at instruction following, but you get the cases that you care about most 100% correct. Because the distribution of how well you do on all these cases isn't like machine learning where you just care about

you know, getting 99% of examples right. Because if you're getting the single most often used case wrong, it doesn't matter if you get the other 99% right, because when someone tries to book an appointment, they fail. And so we see that these patterns of what matters, of what counts as correct, are different than in other traditional software applications or machine learning applications or even robotics. And the other piece of this is being able to show

By having a variety of metrics, you can create a whole picture of how the system is behaving. So for example, a short conversation isn't inherently bad, but a short conversation where the goal wasn't achieved and the steps that the agent was supposed to take were not executed, that's an objectively bad conversation. So you can filter down by which ones are potentially true failures or false positives or false failures, et cetera; you can basically figure out which ones are worth looking into through filtering by these metrics. So I think while we aim to provide all automated metrics for things like, did it follow the workflow? Was the conversation successfully completed? Were all the right function calls called with the right arguments?

There's also always going to be space, I think, for human review and really diving into those examples. And the question is, how can you use that time most effectively? So it's not that you never look at all these examples, but you're looking at the most interesting examples. Nice. Very cool. That's a great example of kind of what to prioritize. Are you able to give concrete examples of metrics? Like what are the most common metrics for evaluating performance?

Yeah, so we have a metric that allows you to determine if you're following a workflow. So for a given workflow described in JSON, which is pretty common in a lot of these different voice platforms,

Can you determine if you're following these steps outlined in that workflow and determine when you're not meeting those in the conversation? And this is super useful, I think, especially for objective-oriented agents where they're trying to complete a task. Often, if they miss a step in that workflow, it's a really good indicator that the task wasn't completed correctly.

So, for example, if you're booking an appointment, just to use a consistent example, if you're booking an appointment and it asks for the email and the day that they want to book the appointment for, but they forget to ask for the phone number, that task has been completed technically, but hasn't been completed correctly because it missed this key step in the workflow.
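Here is a minimal, hypothetical sketch of that kind of workflow-adherence metric: a workflow defined in JSON lists required steps, and the check flags any step that never shows up among the events extracted from the conversation. The JSON shape and step names are invented for illustration and are not the guest's actual implementation.

```python
# Hypothetical workflow-adherence check: given a workflow spec in JSON and
# the steps actually observed in a conversation, report which required
# steps were missed. The spec format and step names are illustrative only.

import json

WORKFLOW_JSON = """
{
  "name": "book_appointment",
  "required_steps": ["collect_email", "collect_phone", "collect_date", "confirm_booking"]
}
"""

def missed_steps(workflow_json, observed_steps):
    """Return required workflow steps that never appeared in the conversation."""
    workflow = json.loads(workflow_json)
    observed = set(observed_steps)
    return [step for step in workflow["required_steps"] if step not in observed]

if __name__ == "__main__":
    # The agent asked for the email and date and confirmed, but forgot the phone number.
    observed = ["collect_email", "collect_date", "confirm_booking"]
    missing = missed_steps(WORKFLOW_JSON, observed)
    print("workflow followed:", not missing)
    print("missed steps:", missing)
```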

Another interesting thing that we do: we also dynamically create these workflows in monitoring, so that you can see what workflows your agents are actually going through in production and see how often that matches with your expectations, or where you're seeing new use cases or new patterns of user behavior. We also have metrics around function calling. So, yeah,

you know, were the right arguments called for these different tool calls, and that's all custom configurable.

What's interesting here is I think we try to make all of our metrics reference-free. There's two types of metrics. There's reference-based and reference-free. Reference-based is metrics where you have an expected output and you must curate that expected output with a golden dataset and maintain that as your agent behavior changes. Reference-free, we infer what the correct answer should be based on the context of the conversation.

And I think for LLMs in general, reference-free evaluation is really helpful because of the non-deterministic nature, whereas traditional unit testing and software is all reference-based, right? It's easy to make some assertions about what an API call should look like.

But even more so with voice and chat agents, the conversations can go so many different ways. And this changes when you change your prompt, when you change the models, when you change your infrastructure. So having reference-free metrics or at least a strong subset and test sets that rely on those is really important for being able to iterate really quickly.

So we tried to create reference-free evaluation for function calling. So we say, for example, if we're taking the order, can we confirm that the right function call was made based on what was described in the order from the user? Those two things should match based on a prompt and a set of heuristics. So this gives you, the users, a lot more flexibility.
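A hedged sketch of that reference-free idea for function calls follows: instead of comparing against a curated golden answer, infer the expected arguments from the conversation context (here a trivial keyword heuristic stands in for the prompt- or model-based judge) and check that the tool call the agent actually made matches. The menu, tool name, and heuristic are all invented for illustration.

```python
# Hypothetical reference-free check of a function call: infer the expected
# order from the user's utterance (a toy keyword heuristic stands in for an
# LLM judge) and compare it with the arguments the agent actually passed.

MENU = {"latte", "espresso", "croissant", "bagel"}

def infer_expected_items(user_utterance):
    """Toy extraction: any menu word the user mentions is expected in the order."""
    words = {word.strip(".,!?").lower() for word in user_utterance.split()}
    return words & MENU

def check_function_call(user_utterance, tool_call):
    """Does the place_order call match what the user asked for?"""
    if tool_call["name"] != "place_order":
        return False, "wrong tool called"
    expected = infer_expected_items(user_utterance)
    actual = set(tool_call["args"].get("items", []))
    return expected == actual, f"expected {sorted(expected)}, got {sorted(actual)}"

if __name__ == "__main__":
    utterance = "Hi, can I get a latte and a croissant please?"
    call = {"name": "place_order", "args": {"items": ["latte"]}}  # the agent dropped the croissant
    ok, detail = check_function_call(utterance, call)
    print("pass:", ok, "|", detail)
```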

So those are just two examples. We actually will build, we've been building out a lot of metrics for new use cases and kind of pulling them from all over the map of using off-the-shelf models, drawing from inspiration in self-driving of like, can we measure, for example, the agent performance against the human performance? Like if it took the agent longer to perform a task

or shorter to perform a task. That's interesting intel. It's not necessarily good or bad when it stands alone, but if the agent takes significantly longer to perform a task and then ultimately doesn't or is repeating itself a lot, it's a good indication that your agent is going in circles. All right, that's it for today's In Case You Missed It episode. Be sure not to miss any of our exciting upcoming episodes.

Subscribe to this podcast if you haven't already. But most importantly, just keep on listening. Until next time, keep on rocking it out there. And I'm looking forward to enjoying another round of the Super Data Science Podcast with you very soon.