
Getting AI Apps Past the Demo // Vaibhav Gupta // #319

2025/5/30

MLOps.community

Topics
Vaibhav Gupta: I think writing prompts with Langchain is morally wrong, because every single piece of code I've seen is ugly. AI pipeline code feels like demo code: you use it once and throw it away. Even with pull requests and process, AI code still looks like it was written on a whim. Every company has a CI/CD system, which includes a linter; what matters is that the company has a process and an opinionated system for writing code, which helps it build pipelines that anyone can jump into and iterate on. Whether it's an AI agent or a human writing the code, you need that rigor, and AI pipelines seem to lack it. They feel glued together, and that's what inspired BAML. BAML is a programming language, and it grew out of a crazy idea my co-founder and I had two years ago: we believed you couldn't possibly write prompts with Langchain. BAML was designed with the same philosophy as JSON. We give you all the LLM tooling you want, then we build a compatibility layer with every language, so you don't have to keep rediscovering all that tooling. Building another language may be one of the worst startup ideas ever; we sat in a corner writing code for eight months before we got our first few users.


Chapters
Vaibhav Gupta, CEO of BoundaryML, discusses the prevalence of AI demos over production features. He points to a lack of rigorous engineering practices in AI pipeline code, describing it as 'vibe code' that lacks the structure and process of traditional software development. This leads to pipelines that are difficult to maintain and scale.
  • AI applications are disproportionately found in demos rather than production.
  • AI pipeline code often lacks the rigor and structure of traditional software development.
  • The concept of 'vibe coding' contributes to this issue, leading to hard-to-maintain pipelines.

Transcript

- Like full expressive systems, like if statements, for loops, while loops, where you can conditionally call an agent, abort to a human worker, human annotation system, come back to it, that kind of system. - Wow. - Yeah.

Hey everyone, I'm Vaibhav. I'm one of the co-creators of BAML. I work for a company called Boundary as the founder and we do not really drink coffee. Actually, no one on the team drinks coffee out here. I mostly drink water and every now and then I will have some orange juice in the morning. What the fuck is BAML and why do I keep hearing all about it?

What is BAML? I mean, what it tangibly is, is a programming language. What it really is, is a crazy idea me and my co-founder started out with about two years ago, about a year and a half ago now. It just started with the idea that we can't possibly write prompts using Langchain. Like, that just feels wrong, ethically. Because you were doing it a lot? No, it's just because every single piece of code that I saw was ugly.

And I think so many people I talk to — we talk to everyone — at least everyone I've talked to has always said that AI pipeline code feels like demo code. It's like you throw it away, and with this vibe coding thing that's going on, everyone's just like, vibe code it and then ship it. And that works if you're starting from zero. And for front-end engineering, where everything is strongly componentized, that works really well. But if you're building a backend that has like 15 intricacies with software you wrote five years ago...

It's not that it's bad. It's just, I wouldn't let a junior engineer or new hire submit code without reading it carefully or talking about it with them, and people just don't have that natural tendency when they vibe code. So I think if you add pull requests, add process, you get to that. But anyway, the point is: every single piece of prompt engineering code I've seen, even before vibe coding, felt like it was vibe coded. It looked like it was vibe coded. It was a phenomenon before it was a name.

Yeah. And I think ugly code is one of those things that not a lot of people put a lot of thought into. But there's a reason that every single company in the world has a CI/CD system, which includes a linter.

Because it doesn't even matter if your linter is right or bad. All that matters is that you as a company have a process and an opinionated system for how you write code. Because that helps you build pipelines that anyone who just joined you from a different company can hop into and iterate on. And whether it's an AI agent writing that code or whether it's a human, you need that. And AI pipelines just don't seem to have that level of rigor. They just feel glued together. And that felt wrong. And that's kind of what inspired BAML. Okay, so...

You saw a bunch of Langchain examples. You thought, man, we can't Langchain-prompt our way to a better future. What besides the linting was it? The linting, the ugly code, the vibe-coding-ness about it. What was it that made you think that? Take web development, for example. When I first started writing websites, this was like 2010,

2012 kind of time, React didn't really exist. TypeScript didn't really exist. The way you wrote a website is your backend returned a string. And like you put it in there and that's what you shipped. And that was not bad, but there's a reason we all went away from that to writing React.

React does a lot of implicit stuff for us automatically that I thought was really helpful. Like, one: if you forget a closing tag — you open a button tag and forgot the slash-button at the end — it yells at you. You can't even push code. But if you're returning it from your backend as a string, you can easily forget the slash-button, and now your whole website breaks. I remember when I was first applying for internships back then,

an infinite scrolling newsfeed was a hard interview problem, right? And that was because interactivity wasn't a thing that was built in. You had to write all this code and somehow merge jQuery, with a bunch of dollar signs in your code, with your actual HTML. And it just looked bad. And then eventually React came around and said, what if we just pulled

the syntax of HTML and the syntax of CSS into your JavaScript and made it native? So now not only did it get prettier — because the syntax highlighting rendered a button in blue, or whatever color your theme uses, differently from the actual content of the button — but it gave you command-click, it gave you compile-time checking, it gave you DOM re-rendering. The syntax that React exposed allowed it to do a lot of novel things.

So we kind of have the same approach. If you're going to write a prompt, you probably don't want to write a bunch of like English all through your code base that is not verifiable in any way. So what if we could verify the English? What if we could turn the English into something that looked a little bit more like code, but preserve the flexibility of English?

So we do a lot of error correction for you under the hood. We do retries and other things as minimally as possible; they're effectively under your control. And most importantly, we can build tooling for you that allows you to do things that are probably best seen and not heard. But if I were to describe it to someone that's just listening: imagine if you could see a prompt before you send it out to the model, like a markdown preview.

No one writes markdown files without actually looking at what they render to. No one ships a website without opening the browser and looking at it. Why is it that today, in every single framework that exists, the only way to see the prompt is to run your code and somehow monitor the web activity that's actually being sent out, because there are like 50 layers of abstractions under your prompt? So we just ripped that out and said, what if you could just see your prompt live, in real time, in VS Code or Cursor while you coded?

Small tooling changes like that have dramatically helped iteration loops. Now, you said something there that I would love to understand better, which is verifiable English. What the hell does that mean? Verifiable English. Yeah, what does that mean? That's a great question. Today, everyone treats LLMs like this magical box that somehow gets the answer right. And if it doesn't get the answer right, it's your fault.

We kind of like to take a step back and just think of the LLM as just a really, really damn good calculator. So what is a calculator? A calculator takes two numbers and an operator, and then based on the operator, it transforms those two numbers into a new number, right? You add the plus sign, it adds them. You add the multiply sign, it multiplies them. Pretty simple.

Now, you might use different calculators for different jobs. Like, you might use an abacus if you're in the 1600s, or like the 1200s. Or Montessori. Or Montessori, or a few other things, exactly. But you might use a scientific calculator today. Or you might use MATLAB for something really numerically complex. But they're all just calculators under the hood. It's just that different calculators have different trade-offs.

So what if we treat an LLM like a calculator? You give it some parameters of whatever your data ends up being, and you just say, I want the LLM to guarantee it spits out something specific. So let's say I'm building off a doctor-patient transcript: you want some audit control over whether the doctor gave medicine and whether the patient confirmed it, and all I have is the audio transcript. Really, what that looks like is: I want a calculator that can take in an audio file

or a text of the transcript and spit back out a list of questions the doctor asked about medicine, which medicine it was, and whether or not the patient confirmed it. That's a calculator. Now I can implement that calculator using a really powerful device, which is an LLM. And based on the LLM I use, I'll get different levels of precision. If I use a small model, I'll get good precision. If I use a big model, I'll get much better precision.
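
As a rough sketch of that contract in code (Python here; the type and function names are illustrative, not anything BAML actually generates):

```python
# A sketch of the "calculator" contract described above: transcript in,
# structured audit data out. All names here are illustrative.
from dataclasses import dataclass

@dataclass
class MedicineQuestion:
    question: str          # what the doctor asked about the medicine
    medicine: str          # which medicine it concerned
    patient_confirmed: bool

def extract_medicine_questions(transcript: str) -> list[MedicineQuestion]:
    """The 'calculator': implemented by an LLM call in practice, so the
    model you pick trades cost against precision, but the signature —
    the contract — stays fixed."""
    raise NotImplementedError
```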

But it's just a calculator, and now the prompt I use is kind of the operator that goes into the calculator; it's like the plus sign. And if we view it that way, then the English in that prompt becomes completely verifiable, because all I have to verify is that the English I put in will produce the data model that I want out. Okay, so... I'm going to show something, because I think it's going to be easier, and then we can talk through it for the people that aren't

perhaps able to see it. So this is like a 15-page Notion doc that we made for a four-hour workshop we ran with a bunch of YC founders. I think when we talk about code, it's sometimes easier to show code rather than talk about it orally. So I'll show a couple of snippets of what I mean by verifiable English. Nice. So in this case, what I have is a function called classify message.

And classify message takes in a message, which is a string, so it could come in from a review on Amazon, a tweet, or really anything I have. A text from my girlfriend, it doesn't matter. And all I want to know is its sentiment, as one of three things. So I want to build a calculator that, no matter what the message is, will return one of either positive, neutral, or negative. Now, the calculator of my choice here is going to be OpenAI GPT-4o, but I could use a Llama model, I could use an Anthropic model, Gemini, it doesn't really matter.

And then that operator I was talking about becomes this prompt. So now I'm able to verify the English in this prompt very easily by doing a couple of things. Because this is plain English, all I have to do is ask: is this English going to lead this message to the right answer for a couple of different test cases? So if my girlfriend sends me a text that says, I am incredibly upset with you, it should return negative. This calculator should calculate to negative.
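
A hedged sketch of what that kind of test might look like from the Python side, assuming the generated-client layout BAML's docs describe (a `baml_client` package produced by codegen); the function and enum names mirror the example but are illustrative:

```python
# "Verifiable English" as a test case: the prompt is the operator, and this
# test pins down what the calculator must compute for a known input.
from baml_client import b                # assumed: generated by BAML codegen
from baml_client.types import Category   # assumed: enum from the .baml file

def test_upset_text_is_negative():
    result = b.ClassifyMessage("I am incredibly upset with you")
    assert result == Category.Negative
```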

And when we view things like this, it becomes much easier, much simpler, to compose them into complex systems and know exactly what contract I've built into the system. That leads to code that is usable by static analyzers, code where an AI can detect when things are going wrong, and, most importantly, a strong iteration loop. Because if for whatever reason that message comes back as neutral,

I know that there are two things that could be wrong. Either I need a different calculator, so maybe upgrade the model to something like Gemini 2.5, or I need to go update my prompt and change the operator that this calculator is using. Does that make sense? Yeah. Yeah, I see that. And now, why can't you do that today with the tools that we have, in languages that are fairly common, like Python?

Yeah, well, I think there are two parts to this. One, not every piece of software is Python. If you can only do this in Python, that means MySQL, which is written in C, can't support AI pipelines now, except by spinning up a Python microservice. That just seems wrong. All the code in the world that exists in Java will eventually want to use LLMs as well, because they're, again, really powerful calculators. And every computer, every system, every software application will want to take advantage of that calculator.

So that's reason number one: we should be able to support all these languages. But then you get into really simple things, like, what would this code even look like in Python? One thing you'll notice in what I'm showing you over here is that this code has a string which is completely indented. And maybe it's easier to see how it actually looks in static analysis: you're able to highlight this in a very easy way, just like in React.

So I'm looking at VS Code right now, and I'm showing the same code that I had over there, just written out with the syntax highlighting that we offer. And the first thing that you'll notice is if I were to do this in Python, then my code will end up actually looking something more like this. So, uh-huh.

And this is a really, really small nit that ends up happening. Yeah. But it goes back to what you were saying earlier about the ugliness of the code, and almost this readability factor. Exactly. So if I have a prompt, now every single prompt, which is a long string, is going to somehow be dedented all the way back to column zero. So if it's four layers deep in an if statement, suddenly your string will be pushed all the way to the left of your code. And when you scroll, you won't know whether the if statement exists or not, simply because you have a long string.

And most code bases aren't designed to have globally available long strings across them. And I think that's something people don't really think about: when I have a lot of constant strings, how have we typically used string variables in our code base? We either load them from a database, or

we have them as very short strings that we put in some constants.py file that we then load in. But now literally my code, my business logic, is embedded in these strings. And they need to be co-located in the area where they're used. And even worse, I build these strings in really dynamic ways. Sometimes I add if statements to conditionally add statements to my string. Sometimes I add for loops to dynamically build them.

Sometimes I have to change what string I have because the model is Gemini 50% of the time and OpenAI the other 50% of the time. That is not what strings are designed for — and strings aren't designed for that in any programming language. So how do you have a string that can be both extremely dynamic and extremely flexible? The only language that gets close to this is React.
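
For listeners, a hypothetical Python sketch of the pattern he's describing — business logic living in a long string that must sit at column zero and gets assembled with control flow:

```python
# Hypothetical sketch: a prompt embedded as an ordinary Python string,
# dedented to column zero inside nested control flow and built dynamically.
def build_prompt(reviews: list[str], model: str) -> str:
    if model.startswith("gemini"):
        # The triple-quoted string falls back to column zero, far left of
        # the `if` it lives in, or it carries stray indentation into the prompt.
        prompt = """
You are a careful analyst. Classify each review below as Positive, Neutral, or Negative.
"""
    else:
        prompt = """
Classify each review below as Positive, Neutral, or Negative.
"""
    for i, review in enumerate(reviews):   # dynamically building the string
        prompt += f"\nReview {i + 1}: {review}"
    return prompt
```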

Okay. Well, it does feel like strings, in that regard, are for a different paradigm. You never would really use strings to control the computer, or to control the outcome. But now, with the advent of LLMs, you are using strings as a control mechanism. And so it is part of the code. So I think, if I'm understanding your philosophy, you're saying that

strings should be treated like first-class citizens, the way we treat other pieces of code. And they should be treated very cautiously, almost. You need caution around your strings. If I'm Delta Airlines and I have a bunch of AI chatbots that I'm building, not just one,

how do I guarantee that an intern isn't going to come in and accidentally forget to add "you work for Delta Airlines" as the first part of the system message in every chatbot? Yeah. That's not a thing we want to leave up to people. That's a thing we want a process to go do. And there's no mechanism in any language today to statically analyze a string in that way. And so you're saying, hey, this kind of...

process is added natively with the programming language. It's something that only a programming language can do, because I think that's the other question. It goes back to this whole idea: why can't we do this with the tools we have today, with the languages that are already out there, right?

So, we could do it in Python. What you could do is make a special class called a prompt string, and now everywhere you use strings, you have to actually use a prompt string, not a string. But even if you do that, you can't dedent strings.
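
A hypothetical sketch of that workaround in plain Python — nothing here is a BAML API:

```python
# A dedicated string type for prompts, as a library instead of a language
# feature. All names are hypothetical.
import textwrap

class PromptString(str):
    """Marks text that is destined for an LLM."""
    def __new__(cls, raw: str) -> "PromptString":
        # Best-effort dedent; a language-level feature could enforce this,
        # a library class can only hope developers remember to use it.
        return super().__new__(cls, textwrap.dedent(raw).strip())

def send_to_llm(prompt: str) -> None:
    # Only a runtime check is possible; no static analyzer enforces it.
    if not isinstance(prompt, PromptString):
        raise TypeError("prompts must be PromptString, not plain str")
```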

And you rely on developers everywhere in your company to use prompt string instead of string. And let's remember, developers are really lazy. Well, I'm lazy. Maybe other people don't know better, but I'm just lazy. Even if I knew, I wouldn't do it, because I don't want to type extra when I don't have to. And then you're exposing yourself to risk. Yeah, I can see that. Exactly. And I think the whole caveat of this whole system is that we have to decide where we want the burden to be.

Do we want the burden to be on developers, or do we want it to be on a process? And it's fine to put the burden on developers sometimes, but tangibly, I think the best thing to do is this: if an action becomes frequent, you put the burden on the process.

And because these are such powerful calculators, it's really powerful to put that into the process rather than on a developer. And then the other big caveat, like I mentioned earlier, is that we want every single language to support these calculators, these LLMs. I haven't seen a good framework in Java yet. I haven't seen a good framework in Go yet. And OpenAI themselves barely support those languages.

And that says a couple of things. Either they strongly believe that Python and TypeScript are the only future that exists here, and, you know what, screw all the other ones. Or they believe that eventually someone will deal with it and they don't have to think about it. But my belief is that everyone should be able to take advantage of these systems. Kubernetes is not built in Python. Some of the most foundational systems that we have are not built in Python.

And it feels like a pity to me if those systems will be the last ones to be able to take advantage of AI pipelines because they're the most useful. How does BAML then sync up with all these? How does it interplay with all the different languages? So BAML's philosophy as a language is we do some things as a language, but then what we do is we do code generation into every other language of your choice.

So you write your code in BAML, and then we do codegen to give you a Python function. So if you wrote a function called classify message in BAML, we would make an identical function, def classify_message, in Python, that actually calls the BAML code under the hood. And the BAML runtime is actually written in Rust, so it's extremely fast and it runs on every system. For those people that work in Python, it's like NumPy. NumPy is not written in Python. It's written in C.

And that's because to do really good numerical processing, you can't let Python do it. You write in C, make it way faster and expose it to Python via Python interface. That's what we do. We take the BAML code and we expose it to you via a type safe auto-completable interface in every language of your choice. So it feels like you wrote in Python. It just happens that you happen to do it in BAML.
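
A sketch of that workflow, assuming the generated-client layout described in BAML's docs; exact module and function names depend on your .baml source:

```python
# Using the code-generated, type-safe interface from ordinary Python.
from baml_client import b                # assumed: produced by the BAML code generator
from baml_client.types import Category   # assumed: enum declared in the .baml file

def handle_review(message: str) -> Category:
    # Looks and type-checks like ordinary Python; under the hood it calls
    # into the Rust core that BAML ships.
    return b.ClassifyMessage(message)
```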

So did I understand that correctly? Basically, you're writing BAML code, and it just translates it into your language of choice on the other side. Exactly. That's literally how it works. So you open up a .baml file. Yeah. And then you press Command-S in VS Code or Cursor. And we run a CLI command called baml-cli generate,

which will turn the .baml files into .py files or .ts files or .go files or .java files, or whatever the heck you want. And then you just use it natively. There's no internet dependency. It runs locally on your machine. And we're able to pack in a bunch of algorithms for you automatically. And because every language is taking advantage of the same exact core under the hood, we don't actually have the same type of bugs that a lot of other frameworks have. Most other frameworks implement once in Python.

Then they're like, oh my God, I guess people use TypeScript. So they re-implement everything in TypeScript. And then they kind of forget about it, and they keep adding features to Python, and people are like, when is this being added to TypeScript? But because BAML is the same everywhere, we actually support every language with every feature, always, by default. Wow. Okay. Yeah, that idea reminds me a lot of

when you have apps that are specific to either Android or iPhone. Yeah. It's like, when is this app going to be available? Normally they go after iPhone first, and then it's like, when can I get this app on Android? Right. And it's that same kind of idea: implemented in Python, and then, oh yeah, we've got to go do this in TypeScript too. Yeah. All right. Let's see.

Exactly. And then that's kind of why, like, I think React Native got a huge push. And that's why Dart and Flutter and all these other things had a huge push because you could implement it once and get it for web, get it for Android, get it for iOS all at once because they did the groundwork so that your team only has to focus on it once.

Yeah, it's a common pattern. And again, going back to the persona that you're serving, it's going to be much more comfortable if you're just building once and it can work wherever it needs to work. Yeah. And if you think about it from a larger organization standpoint, most large companies use more than one language. Yeah. So what's nice is they can actually use many, right? It just naturally happens.

So if you live in that world and you're a large company, it's great, because all my LLM code can be shared. All the techniques that one team discovers can be used by another team, even if they use different languages, and they get them for free. So they cross-pollinate.

It's like JSON. Why is JSON so powerful? Because every single language has a JSON.loads, a JSON.parse, a JSON.serialize. It has those methods built in, because every language adopted it. JSON is compatible with every language. BAML was designed with the same philosophy. We give you all the tooling that you want for LLMs, but then we build a compat layer with every language, so you don't have to rediscover all that tooling all the time. So you must have tried, at least, to not

build another language. You must have thought really hard about how you could do this without building a language, because, as you said earlier, I'm lazy. And what you are trying to do by building a language is like playing the game not on hard mode, but on extremely hard mode. Yeah, I think this is probably the worst idea, one of the worst ideas for a startup ever,

is what I would say. Not a lot of companies have been able to do this. There are a few, but very few. And almost no company has done it for business logic, ever. And a lot of startups had this thing where you could vibe code it and ship something into the hands of your customers in like a day.

We just sat in a corner and coded for eight months before we even got our first few users. Wow. Because, imagine — I want you to put yourself in the shoes of our first few devs who used BAML. Yeah. And back then it was called Glue, so it was a different thing. That's funny. We got sued over our name, so we had to rebrand. And I think we have a much better name now. Yeah. But...

The first few devs that used BAML used a compiler that was really flaky and had a bunch of bugs, which we wrote in C++, not Rust. They had no syntax highlighting. They had no autocomplete. We barely supported Python. We definitely didn't support any other language. Our type system wasn't very complete yet.

And then going from four — I don't mean companies, four developers — to five developers took us three months. From five developers to eight took another three months. And then we got to 10, like two months after that. And then we started booming. Now we probably have like 700 or so companies deploying BAML into production. Nice. Including some Fortune 500s, which has been really, really surprising. So cool. But,

I think this is one of those projects and technologies that... you know those rides at the roller coaster where you have to be this tall to ride? Yeah. A language, as it turns out, you need to be like 6'2" to get in the door, maybe 6'8" to even have a developer consider you. And I don't think we understood the dauntingness of the task we took on at the very beginning. We knew it was hard. Programming languages — there's not a lot of them.

But one thing we underappreciated back then was what it would mean to build something stable. Once you got to a point of stability and generality, the number of use cases that would be unlocked was really, really surprising to us. Okay. And the way that the tooling could evolve. So one of the things we do is this concept I was talking about, being able to see the prompt, for example.

So typically when people want to run a prompt, like they usually have to run some CLI command to go run tests. So right now what I'm showing is I'm showing VS Code and I'm showing our VS Code extension. Because we own the entire stack, I can legitimately show you the prompt in real time and you can watch it edit as I'm typing. So if I type something in here, it pops in right over here. If I change my test case,

it pops in, and I can highlight my test case differently than my base prompt in the rendering of the prompt. And that matters a lot, because if I want to quickly glance at this thing, I can easily see where the bug is. I can easily read the thing that's being sent out to the model. I can see the underlying tokens that are actually being sent to the model. Depending on the model, the tokenizer changes.

And you just did that as you were typing. And with that tokenizer view, what you're seeing there is each word being highlighted in a different color as a token, right? Yeah, because this is what the model is actually seeing, right? The model isn't seeing your English. It's seeing the tokens. So the strawberry question is a very common one people use to talk about this.

Why is the model not able to count the R's in strawberry? It's because strawberry is like two tokens. But if you do S, space, T, space, R, and spell out strawberry with spaces, it has a really easy time answering how many R's there are in strawberry, because now it can see each R as its own token, right? But that is really hard to explain to someone, especially a junior dev, if you don't have the right tooling. Here you can just show it as you're debugging and understand better what the model is doing.
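
You can see the token split he's describing with OpenAI's tiktoken library (assuming it's installed; the encoding choice here is illustrative):

```python
# Inspect token boundaries the way the model "sees" them.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print([enc.decode([t]) for t in enc.encode("strawberry")])
# a few chunks, e.g. ['str', 'aw', 'berry'] -- no individual letters
print([enc.decode([t]) for t in enc.encode("s t r a w b e r r y")])
# spelled out, each letter lands in its own token, so counting R's gets easy
```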

Wow. With the raw curl mode that we have: under the hood, we all know every framework is at some point going to put a web request together. So if I swap from OpenAI to Anthropic, the raw web request itself changes. So we make that a zero-click thing you can see ahead of time, not after the fact. And I think the biggest unlock is

what I call the hot reload loop of prompts, like in React. Hot reload. I like that name. Right. So in React, what is your hot reload loop? You go to your React file, your TS file or TSX file. You press Command-S. You look at the browser. If it doesn't match what you want, you go back to the file. You edit it. You go back. That's a really fast hot reload loop. And you can't really do that without React, because of

a whole bunch of things about how web component state works and all that stuff; I won't go into that. And prompting is kind of similar. You want to change your prompt. You want to run your test case. And what I did right now is I just pressed the run-test-case button, and you can legitimately see exactly what the model did. And if it doesn't match your expectations, you just go and edit the prompt, edit your data model, edit your input, and rerun it. You get into a really fast hot reload loop, so you quickly converge on a good prompt for your test cases.

And because you never leave your editor, you don't have to go to a web browser. You're using tooling that you're already familiar with. You can use Cursor, you can use Claude Code or anything else you want, without having to log into a SaaS of any kind. So this is really helping you speed up that workflow, by giving you the rendering on one side of your screen of what you're working on on the other side. Exactly. It's how I do web development.

And because web development is so experimental, we needed to do that. Why is Jupyter Notebook so successful for data science? Because data science work is very experimental. I need the visual feedback to go do this. I don't want to have to run the whole program every single time from scratch. Agents are also very experimental and you need that. And like the question we have to ask is, is every language going to build that experimental tooling individually?

Or can we just use something like BAML and then plug into every language, with autocomplete and all the benefits, with a codegen kind of layer? Well, I was envisioning it as a tool that you would use alongside those. If we're talking about prompting specifically, there's a bunch of these different prompting tools. I think PromptLayer is one of them, or Opik, or MLflow even does it these days. Right. So...

How do you see those two things playing together? Is it still that I would have MLflow to, like, version my prompts? Why? I think it's just: why would you reinvent version control? We have Git. It's a battle-tested version control system that has worked for decades for the most complex software out there. Why reinvent it? It's beautiful. It's really, really damn good.

There are only a few companies that reinvented Git, and that's Google, Microsoft, and Facebook. And that's because their code bases are so massive that Git was too slow, so they made improvements to make it better. And they have used Mercurial as well. Why would you use anything different? But isn't it because you would have so many different prompts

for certain things? Like, what you were showing me there was one test case, but I assume you're going to have a whole test suite when you're using this. Yeah, we have that for regular software too. I think if you go back to a time when software was a lot smaller in volume of lines of code, people would have said, oh, this works for 10 functions, but is Git really going to work when you have 100,000-plus functions? It turns out it works when you have 100,000-plus functions.

And I think it's the same thing with your test cases and everything else too. It's just code. And the best place to have code is your code base. Now you might want to load tests from a database. That's fine. We know how to write tests that load from a database. We've done that before. You create a database call, you call it, and then you run the test case. We know how to write PyTest that does that. So it's true you might store some instances in a database and some instances locally. That makes sense. But it's just code.
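
A minimal pytest sketch of that pattern — some cases in code, some from a database; the loader function and the BAML client names are illustrative:

```python
# Test cases are just code; a database is just another source of cases.
import pytest
from baml_client import b                # assumed: generated by BAML codegen
from baml_client.types import Category   # assumed: enum from the .baml file

LOCAL_CASES = [("I am incredibly upset with you", Category.Negative)]

def load_cases_from_db() -> list[tuple[str, Category]]:
    # stand-in for a real query, e.g. SELECT message, expected FROM eval_cases
    return [("Order arrived on time.", Category.Neutral)]

@pytest.mark.parametrize("message,expected", LOCAL_CASES + load_cases_from_db())
def test_classify_message(message, expected):
    assert b.ClassifyMessage(message) == expected
```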

Now, as you're working with different folks that are using BAML, what are they telling you that's surprising you? I think the scariest thing that I heard was... They're using it in production? Well, no, that was fine. That stopped being scary maybe seven, eight months ago. It was scary seven, eight months ago, because every time someone shipped — because we ship for so many different languages — I was like, oh my God, did we break something? Did we break something on, like...

what's it called, Debian? The slim version of Debian that people use. And Alpine — there's a container called Alpine that people use for deployments because it's very, very small, and we broke it once. There are compat layers, small things like that, that used to give me anxiety, but almost all of those bugs have been addressed now. Nice. But I think the scariest thing is when I heard someone having like 25,000 lines of BAML code in their repo.

And I was like, that's a real code base. That's scary to me. That's legit, dude. Yeah. And they were like, can I please have namespaces? Because we don't have namespaces right now in BAML. And I was like, why do you need namespaces? And then they showed me their code, and I was like, okay, I guess you need namespaces. So we don't have namespaces yet. It takes a while for us to build out some features. Yeah.

Just because they're so core and primitive. Yeah. So features do take a little bit; some features are fast, some are more work. How do you work that out? That was probably... Go ahead. Oh, no, no, sorry. Keep going. Keep going. No, I was going to say, 25,000 lines of code scared me. I think the first time I met someone that wrote 1,000 lines of code, that was scary. We hired one of the first people that wrote 3,000 lines of BAML code. Nice. Into our company. They ended up joining us. But...

Yeah, volume of code is probably the scariest thing. It's like, oh shit, people depend on this for real. And how do you look at where to go next? Is it by talking to folks and seeing there are namespaces you need to do, and I'm sure there's a laundry list of other things that people are asking you for? I think one of the best things about building a programming language is that, in the end, we're building for ourselves. There are very, very few tools that developers can truly build for themselves.

One of them is editors, like Cursor or VS Code. You can just go do that. It turns out another one is a programming language, because you as a developer kind of know what it feels like to write code, and you have an idea of what feels right, what is necessary. So we do listen to our customers to add different features, but we kind of just know what we have to do. I think if we had asked anyone whether we should make BAML, they would have said no.

They would have said, why would I want this? And some of our earliest users were like, why the heck are you making me write this code? And we were just like, trust us. And we hoped that we were right. But most features usually start in our heads. We go talk internally on the team, we write a lot of pseudocode, and then we go test the concept with our community. So we have a community of over a thousand people now. Nice. And they've been really helpful in guiding the direction of where we should be going.

But we usually don't leverage them to have the first inception of an idea because they're busy thinking about how to build the best applications. And what we see is sometimes, like I'll give you an example, streaming. A lot of SDKs and frameworks around LLMs don't give you a great streaming visualization.

And what I mean by that... In what way? Streaming is a really nuanced concept. So I'm going to show you something, and maybe we can try describing it to everyone else. Let's take this recipe generator, for example. Everyone knows we can go to ChatGPT and ask it to spit out a recipe. It'll dump out something. But when I use streaming, I can do something really incredible. What if I could make it interactive while it was loading?

And if you take a look at my screen, you can actually see the loader icon telling me exactly what it's working on at any given moment, versus what's done. And the thing about building this kind of application is that it's possible with today's SDKs. It's just really, really damn hard. And so what we're seeing here is that you've got a little slider button at the top. As the recipe is rendering, you can move the slider and get...

an interactive experience of that recipe being updated, depending on the number of people you're trying to feed with it. But you also see one of those little spinning wheels, so you know exactly which part the LLM is working on. Yeah. Yeah. And that's just not something that I think a lot of people spend effort on. Like, can you do this today?

Yes, you can do this today with today's LLMs. Remember, we don't modify the LLM at all. We use every model as is, without any modification. But the problem that I often see is not that you can't do this today. It's just that it takes a lot of code to do it. What if you could do it with one line of code in BAML? That is the nuance of what we offer. And I think that making something that needs to be common very easy is an undervalued thing.
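
A sketch of that streaming pattern, assuming the stream interface BAML's docs describe (b.stream.&lt;Function&gt;() yielding progressively filled partials); the recipe function and renderer are hypothetical:

```python
# Render partial structured output as it streams, then the final result.
from baml_client import b   # assumed: generated by BAML codegen

def render_partial(recipe) -> None:
    # stand-in renderer: a real UI would show a spinner next to fields
    # that are still None and real values for completed ones
    print(recipe)

def show_recipe(request: str) -> None:
    stream = b.stream.GenerateRecipe(request)   # hypothetical BAML function
    for partial in stream:
        render_partial(partial)                 # progressively filled object
    final = stream.get_final_response()         # fully typed, validated result
    render_partial(final)
```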

But I think a perfect example that a lot of people probably know is Tailwind. Tailwind fundamentally changed the game of how styles should be done, and things became more standardized once you used Tailwind. Because CSS is one of those things where we all want it to be perfect. And we all thought the right way to do CSS was to style out CSS files with a bunch of classes that we link. But it turned out CSS is very hyperlocal.

I want to modify just the div I'm right at, and I don't want to have to hover over and see what class something is. But the problem is I don't want to write raw CSS there, because raw CSS is hard to read. There are some attributes in there that are just, what the heck is that? I cannot human-read that. So Tailwind did something very simple: they added a string that was easy to read and programmatically possible to generate, and they added it in a spatially local place. So they added a new syntax

for defining CSS that they then just run through React to render the actual CSS under the hood. And that allows them to do optimizations for free as well, like only including the styles that you actually use in your code in the rendered output. Your style sheet isn't super long, like Bootstrap and other systems used to be.

But more importantly, it became more ergonomic. And that allowed people to build perfect UIs for each part of their website, because it became easy to do it frequently in my code. BAML tries to do the same thing with streaming. Instead of having to think really hard about whether I want to add 100 extra lines here to make this thing stream perfectly, you add one line of code, and now it just streams the way you want it to stream. Now, what are you seeing...

users of BAML create? What are some cool projects? There are a couple ones. There's kind of things all over the domain. So there's some in like the government industry that I thought were really interesting. We have a lot of like government RFP generators that are built off BAML. We have some in the medical space, like analyzing doctor-patient conversations for all sorts of EMR stats, admin work,

We have agents that operate in the RFP automation space. We've all seen those Chrome extensions that extract data from webpages into a spreadsheet-like view; I've seen those very generic, dynamic systems built in BAML. And SQL chatbots,

kind of all over the place, RAG systems, a little bit of everything, which has been really surprising. And is there stuff that folks are asking for besides like the streaming, besides the namespaces that you had said? What else is next on your list that you're thinking about tackling? More language support is probably like a huge one for us.

So we technically support every language today using OpenAPI. We support Python, Ruby, and TypeScript natively, and then all the rest are available via OpenAPI. But we're going to add Go support soon, and following that will be native Java support as well. Now that we've figured out static languages, we should be able to unlock every other language without having to have a sidecar kind of system.

And then the next big one is actually an orchestration system that we're going to announce pretty soon. Yeah, so not just prompts, but full workflows and orchestration, with a debugging experience unlike anything people have ever seen before for agents. And when you say orchestration, you're talking about what, specifically?

Like full expressive systems, like if statements, for loops, while loops, where you can conditionally call an agent, abort to a human worker, human annotation system, come back to it, that kind of system. Wow. Yeah. And all that will still be exposed to every language of your choice. Oh, man. So now, as you're thinking through a mindshare perspective, you like to...

garner more attention, or garner more developers using BAML — how are you looking at that? Like, how do you think this is ultimately going to be something that's not just a flash in the pan? Yeah, we spent a lot of time thinking about this, because since we started, which was like November or October of 2023, there have been a ton of frameworks that have come out.

And a lot of them have died very fast. They've all been flashes in the pan. I think one thing that is good is that people have sustainably continued: our users of BAML have increased over time without much churn. One thing I was worried about is, are people going to grow out of BAML? But like I talked about, the company with 25,000 lines of BAML code clearly seems to be adding more BAML, not less, over time.

So that sort of stuff, I think, has assuaged a lot of our worries around, will people grow out of it? Maybe they will, but we'll add namespaces. We'll add all the things that people expect out of a language. But I have a view on DevTools that I think is very different from a lot of people's; I was inspired to it by the TypeScript team. As a company, you have a finite set of resources. You choose where you want to deploy your money and your time: you deploy it into marketing or into engineering.

We as a company are just saying, screw it, we're just going to write a shit-ton of code. We're just going to keep writing code and keep shipping. Our code base is almost half a million lines of code or something. And we'll just keep going, because that's what we as a team love doing. And we can do some incredible features. What we want is for you as a developer to never think of BAML as a bottleneck. We're just shipping, and you're just like, cool, I have that feature. Before you even imagine you need it, we've already thought of it and added it by the time you need it.

And what we hope is that if people keep using BAML and they keep loving it, they tell their friends. Developers are horrible to sell your product to. They're just the worst buyers ever. But they are the best referral system ever. If a dev loves your tool and swears by it, they will tell every single one of their friends how much they love it. Yes. And I want to earn the trust of developers everywhere.

That, hey, we will look out for them, and we will do our best to make sure that their problems go away. So rather than spending time on marketing or all these other things, let's just ship good code. Shipping is the best marketing. And that's what we want to keep doing. How do you think developers talk about BAML when they talk to each other? Yeah.

I could read out some tweets, some messages. I get messages almost every week now from someone somewhere around the world that discovered BAML for the first time, and it's literally like, thank you. Or, this is amazing, I don't know how you did it, but it's really good. And I had someone message me — I think the message that stuck out was: I explored it three months ago, and I didn't want to go learn it because I didn't have time and I had to ship.

But I tried it again this weekend because I was tired of seeing your posts. And I regret not having switched earlier. It would have saved me so much time. It ended up taking about two and a half hours to learn. Nice. But I think it's that sentiment of regret that they felt. I felt bad that I couldn't have said something earlier to save them that time.

But I'm really glad that they found value and it wasn't a waste of time for them. I want to highlight that you said he was tired of seeing your posts. And this is you not in marketing mode. I cannot imagine what the world will look like when you are in marketing mode. Yeah, I do post about BAML on LinkedIn. I think I'm really proud of what the team has done. And the things I mostly post are things that we ship.

Yeah. So I also talk about features, and that's less for marketing and more so that our users know about the new features we release. Like, I should probably send an email chain out. I feel like people do that, but I don't really know how to set that up. So we collect emails, but we've sent maybe four emails in the lifetime of the company. And I try my best not to do that, because I, as a developer, hate emails. Yeah.

So we do different things. Any Google Groups going on? That feels like... Oh my God, I would die. I hope not. I mean, let's talk about Discord versus Slack. Why do we use Discord? I hate Discord. Like, I play games and stuff, and it's sort of for social, but Slack is more commercial and work-based. Yeah. But one of the things we thought was,

I don't want to give up my name and identity just to ask a question. That feels like raising a barrier. So we just said, what if we make a Discord, no login needed, just go and ask a question? And that's kind of the philosophy we always take: reduce the barrier to entry as much as possible, because we already have one huge barrier, which is you've got to spend two hours on a Saturday to play around with it. So every other barrier we try to remove out of the way. And what are some things that you feel like the...

folks that are using BAML highlight when they talk about BAML? The biggest one is probably our parsing. So we have a way to do structured outputs that is better than OpenAI's, better than Anthropic's, better than every other model provider's, on almost every benchmark. It's a new algorithm we created called Schema-Aligned Parsing. People love that thing. And they love it because it allows you to do chain-of-thought and structured outputs from the model in one shot.

And you don't have to think about it; we just do error correction on the model's output. A lot of other frameworks do retries whenever the model doesn't give you the exact thing you asked for. We solve that same problem with milliseconds of work, with an algorithm. And it's been battle-tested in millions of calls now, easily over tens of millions of calls, on all different data types. That's one thing people love.
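
This is not BAML's actual algorithm, but a toy Python illustration of the idea — repair common model mistakes (markdown fences, trailing commas) in milliseconds instead of re-calling the model:

```python
# Toy "error correction" pass: fix up almost-JSON before parsing, rather
# than paying for a retry against the LLM.
import json
import re

def lenient_parse(raw: str) -> dict:
    text = raw.strip()
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text)   # strip code fences
    text = re.sub(r",\s*([}\]])", r"\1", text)             # drop trailing commas
    return json.loads(text)

print(lenient_parse('```json\n{"sentiment": "Negative",}\n```'))
# -> {'sentiment': 'Negative'}
```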

The other thing that they really love, once they get past that, is the iteration speed. The iteration speed of BAML is unlike anything else. The way I put it is: if it takes you five minutes to test one prompt at a time, and you need to test 50 prompts to find the right answer, it'll take you 250 minutes. If it takes you five seconds to test a prompt, it will take you 250 seconds. And there's just no comparison when it's that much faster.

That run test button I showed earlier from our VS Code Playground is probably the single best feature that we've shipped. I am blown away by this now.