OCaml is suited for Jane Street's trading systems due to its balance of performance, correctness, and dynamism. It is memory-safe, has a strong type system for catching bugs, and is performant enough to handle large data feeds and real-time processing. Its conciseness and readability, combined with type inference, make it ideal for rapid iteration and maintaining correctness in a high-stakes environment.
Jane Street's trading operations demand high performance to handle millions of transactions per second, speed of implementation to adapt to dynamic markets, and correctness to ensure software accurately executes trading strategies. These requirements stem from the need to process real-time data, iterate quickly on new ideas, and avoid costly errors in automated trading systems.
OCaml's type system, particularly its type inference, eliminates the need for explicit type declarations, reducing boilerplate code. This allows developers to write concise, readable code while maintaining type safety. The language's abstraction tools also enable efficient reuse of code, further minimizing repetitive patterns.
Concurrency in OCaml refers to managing multiple tasks that may not execute simultaneously but need to handle uncertainty in task timing, such as network communication. Parallelism, on the other hand, involves executing tasks simultaneously across multiple hardware cores. OCaml has strong concurrency support through libraries like Async but lacks shared-memory parallelism, so parallel systems are built with message passing between separate processes.
Correctness is critical in automated trading systems because these systems directly interact with financial markets and can commit to transactions without human oversight. Errors in such systems can lead to significant financial losses, especially if they occur in fast loops. Ensuring correctness involves rigorous testing, type safety, and careful modeling of the trading environment.
Jane Street emphasizes code readability and rigorous code review. They use a custom code review system called Iron, which helps catch bugs, share institutional knowledge, and improve the overall quality of the codebase. This focus on readability ensures that code is easier to modify, understand, and evolve, fostering a culture of high-quality engineering.
Jane Street mitigates software failure risks through institutional humility, redundant engineering, and the ability to quickly shut down systems. They have physical and logical mechanisms, like a 'big red button,' to halt trading when anomalies are detected. This approach prioritizes safety over availability, ensuring that systems can be turned off during critical failures to prevent catastrophic losses.
Type inference in OCaml allows the compiler to automatically determine the types of expressions without requiring explicit type annotations from the programmer. This reduces boilerplate code, improves readability, and maintains type safety. It enables developers to write concise, expressive code while still benefiting from the robustness of a static type system.
Jane Street avoids shared-memory parallelism in OCaml because it is difficult to reason about and can introduce subtle bugs, especially in systems with mutable state. Instead, they use message-passing between separate processes, which simplifies concurrency management and allows for distributed systems that span multiple machines.
Yaron Minsky's perspective on probability has shifted to emphasize the importance of expectancy and avoiding absolute certainty. He now views assigning zero probability to an event as irrational, as it disregards potential evidence to the contrary. This mindset, influenced by financial decision-making, encourages more careful and rational assessments of risks and outcomes.
Yaron Minsky is the head of quantitative research and technology at Jane Street Capital. Jane Street is unique both because of its incredible success, but also for its almost exclusive use of OCaml, a functional programming language. Yaron, welcome to Software Engineering Daily.
Thanks. It's good to be here. It's worth maybe correcting. So I'm the head of technology, no longer the head of quantitative research. That part of the world has kind of moved on and there are a lot more people at Jane Street than there used to be when I had both of those titles. Oh, okay. Interesting. Well, so Jane Street is a trading company. How do the trading requirements drive the technological requirements of the company? Yeah.
So it drives it in a number of ways. I think it obviously puts a lot of pressure on performance in that trading applications are dominated by large amounts of data that you need to consume in real time in order to have any kind of responsive system. Market data feeds generating millions of transactions a second is not out of the realm of what you can get, and you're pulling data from multiple different sources.
And that costs you both in building real-time systems that can respond to data, but also in collecting data, getting accurate timestamps so you understand when things happened in the past, and being able to do kind of bulk research where you go over and analyze large amounts of historical data to understand things that have happened in the past. All of that requires good performance.
I think another thing that matters a lot is speed of implementation, the ability to get new things done. The markets are extremely dynamic. And so you constantly have to be coming up with new ideas and iterating quickly. A good invariant for the world of trading is that everything you're doing is going away.
If you keep on doing the same job you're doing now at the same level of quality and execute these strategies going forward, you're going to see them get worse and worse. And the only way to kind of build a good, sustainable business is to constantly be looking for new opportunities and looking to get better at the things you're already doing. And that puts a lot of pressure on dynamism, on the ability to make new things quickly.
So how do these – oh, go ahead. Oh, and finally, the third one I think is extremely important is correctness. Trading is in some sense for a programmer a terrifying thing to contemplate, automated trading, because you are writing a program which has access to your wallet. And if you don't find that terrifying, you're not thinking about it carefully enough. We're all terrible programmers. We're forgetful. We make lots of mistakes and those mistakes end up in our code and –
You have to put a lot of effort and engineering into making sure that the software you build correctly carries out your intent. And that both involves kind of ordinary notions of software correctness that you see everywhere, but also careful thinking about the way in which the software that you're – the model of the world that you're building your software against, whether that's really right and the ways in which the deviations between the model you expect and the world that's really there can cause you trouble. So like three things.
performance, correctness, and dynamism. Those are kind of the three big pressures from trading. Yes. And so before we get into how OCaml encapsulates those three traits well, let's talk about some languages that do not encapsulate those. What are some languages that exhibit the opposite of this correctness and agility and performance?
Sure. So I think one of the things that you'll find is that it's not that the languages
that people use are strictly bad at everything, right? What happens is they make different sets of trade-offs. I mean, there are some languages that are just really badly designed and kind of became popular because they happen to have good libraries in some niche. But usually widely used languages are pretty good. They're reasonable, but they're not good at all exactly the same set of things that other languages are good at. So for example, C, which is in many ways a beautiful language, is...
gives you a programming model that's very close to the machine. It gives you lots of control over memory representation, and it lets you write performant code in a very natural way. But it's also very hard to reason about complicated C code because the compiler and the language do very little to protect you from the kind of ordinary mistakes that we fallible humans make. So that's a language that kind of fails on one side of the set of trade-offs, and that
The safety story is not so great, and that makes the correctness story that we care about for our programs harder to provide. Not impossible, but I think significantly harder. Another lovely language that fails in a different way is Python. So Python is a very user-friendly language, a language that was designed with, I think, real attention towards how ordinary people think.
And one of the reasons why I think Python is great is you look at a piece of Python code and it's almost pseudocode in many cases. So there's something lovely about Python, but it is crushingly slow, which makes it unsuitable for most of these applications. And I think the lack of type safety in a language like Python means that for larger code bases, it's very hard to use. And I think it's hard in part because...
refactoring becomes very difficult in a dynamically typed language, because the type system, especially of a language like OCaml, for reasons I guess we'll get into later, makes refactoring a lot easier. The complete lack of a type system means that if you have some library that has many, many clients and you want to change the semantics of that library,
It's very hard to find and correctly fix all the places where that is used. And it makes certain kinds of transformations very difficult. Actually, a good example of this in the Python community is the move from Python 2.7 to Python 3, which Python 3 has been around forever. And there are a bunch of reasons, sociological and other, that people haven't fully made the flip to Python 3. But I think one of them is that you get very little help from the type system in doing that move.
So there are some tools to help migrations, but the basic fact is the dynamically typed nature of the language just makes this significantly harder. Sure. So let's get into talking about OCaml. Why is OCaml such an appealing language given the technological requirements of Jane Street? So OCaml has a number of nice properties that give it a really comfortable spot in a kind of set of trade-offs between all the different things we've been talking about. So OCaml
For example, it is simultaneously safe, meaning it's got memory safety and lots of good support in the type system for helping you catch bugs. And at the same time, it's still quite performant.
I think lots of high-level languages that give you ease of expressiveness and safety have very large compromises in the performance side. And for OCaml, it's much less so. Your OCaml code typically won't be as fast as C code, but you can get it within spitting distance. Certainly, if you're willing to write the core of your program in a careful way, you can get OCaml code that's quite close to the speed of C. It's also...
a language which is simultaneously concise and, once you get used to reading a functional language, very readable, but gets the kind of concision of a dynamic language with the full type safety that you expect from a language with a static type system. So in a few different ways it kind of sits in a sweet spot in the space of programming language design. I think another aspect of the language which is very nice is
It lets you write things in a pretty declarative way, meaning declarative is a funny word. It's not clear exactly what it means, but roughly I mean you can write things that are pretty simple and pretty easy to read and understand that at the same time have quite understandable performance behavior. So to give an example of a language which is very different in this regard is SQL, where you write your SQL statement and it's in this nice higher form of it's pretty easy to understand what you're trying to achieve.
And then the query optimizer performs a miracle or doesn't perform a miracle. And so, you know, your performance of a given SQL query can vary by a couple of orders of magnitude depending on what choices the query optimizer made. And it's in many ways out of your control.
So there's this trade-off in SQL where you give up precise control over performance in exchange for having a very high-level language. And OCaml gives you a quite high-level language with quite precise control over performance, which is an unusual trade-off.
So you've said that two useful characteristics of OCaml are brevity and types. How does OCaml exemplify brevity and types so importantly in a way that is so useful to trading systems like Jane Street? So first, it's maybe worth saying that I think these –
These properties are valuable whether or not you're building trading systems. I think type systems and concise code is kind of universally a good thing to have in your programming language. But the key thing that lets you have those two in the same language is type inference. So you get types. Every expression you write down in OCaml, the type of that expression is precisely known by the compiler. But
It's known by inference mostly, rather than by you having to expressly say what the types are. So if you go back to the history of programming languages, back to the 1950s, the first two languages that kind of had any real success were Lisp and Fortran. And Lisp was untyped and Fortran was typed. And the key reason Fortran was typed was for performance. And Lisp was untyped and it gave you sort of better expressiveness and brevity.
And one of the reasons why types were so expensive in a language like Fortran is the way of expressing them was very simple, very concrete. Like every time you had a variable, you had to say what the type of that variable is. And OCaml and more broadly the kind of ML family of languages have type inference built in in a very universal way so that
Almost all of the types that you care about can be inferred automatically just by looking at the concrete things that you want to do and inferring. You've added these things together. They must be numbers. And that kind of simple inference woven together over many different parts of a program lets the complete type of your code be inferred. Is there a performance penalty to that type inference? No, because it's all compile time.
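As a small illustration of that inference at work (a sketch; the function names here are invented for the example):

```ocaml
(* No type annotations anywhere; the compiler works everything out. *)
let average x y = (x +. y) /. 2.0
(* +. and /. operate on floats, so OCaml infers:
   val average : float -> float -> float *)

let firsts pairs = List.map fst pairs
(* Inferred fully polymorphically, with no annotations:
   val firsts : ('a * 'b) list -> 'a list *)
```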
So there's some cost at compile time for doing it, but actually OCaml is a quite fast compiler. Way, way faster, for example, than the C++ compiler. There's a variety of reasons for that, but there's basically not a material performance penalty there.
Which is interesting because the type inference algorithm that's used by ML happens to be doubly exponential in its kind of asymptotic. So you can write a 17-line program that will not type check until the heat death of the universe. But it turns out that programs that are hard to type check for the algorithm are also hard to think about for people. So nobody ever writes them.
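To give a feel for the mechanism (a sketch; these particular definitions still check instantly, but they show how inferred types can blow up):

```ocaml
(* Each composition doubles the size of the inferred type. *)
let pair x = (x, x)        (* 'a -> 'a * 'a *)
let p2 x = pair (pair x)   (* 'a -> ('a * 'a) * ('a * 'a) *)
let p4 x = p2 (p2 x)       (* 16 leaves in the result type *)
let p8 x = p4 (p4 x)       (* 256 leaves; a few more such lines and
                              the printed type is astronomically large *)
```

Nothing here is slow to compile; the pathological cases arise when this kind of doubling feeds further inference, which is why the worst case is so bad in theory yet essentially never bites in practice.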
Weirdly, even though in theory the asymptotics are terrible, in practice the compilers are very fast. Yeah, very interesting. So let's talk more about the type system. I went through this YouTube talk that you gave, and you mentioned all these interesting properties of the type system. There's parametric polymorphism, algebraic data types, phantom types, and...
You said that it's really the simple features that are actually the things that are more appealing rather than the more complex features. So maybe you could talk about the simple features of the type system that are really important.
Great. So I think you've mentioned two of the most important ones. So the first being parametric polymorphism, which even though it has a really fancy sounding name, is actually a simple feature. And the other is the presence of variants, which are a key part of what we call algebraic data types. So let me first talk about parametric polymorphism. It is more or less the same feature that you have in a language like Java, called generics in Java.
And this is the ability to have some value or object or whatever entity you have in the language that is parameterized over some other type. So, for example, almost every language you can think of has an array type. And you typically can instantiate the array type at multiple different
types for the container. So you could have an integer array or a float array or an array that has pairs of integers and strings or whatever. There's a free type parameter. And parametric polymorphism is the default kind of polymorphism in OCaml, as opposed to a language like Java where the default kind of polymorphism is what we call subtyping: the kind of is-a relationship, this is a that.
And I think it turns out that most of the time the kind of polymorphism you want is this thing we call parametric polymorphism and not subtyping. They're both useful, but if you have to pick a default, parametric polymorphism is the more useful one.
And it comes up a ton in things like containers. So, for example, if you think about the... if you want a function that is going to iterate over a container and, say, find the sum of all the values or maybe combine all the values given some function that you pass in to do so, the natural type of that is kind of a type parameter.
which is to say you want to be able to do this given, say, any function that can combine things of the type in the container class, the type of the things in the array, say, and in array with that same type of element, you want to be able to kind of fold over the array and combine all the elements into a single value. Maybe the fold function is summation or maybe it's max or whatever happens to be for your particular application. But this kind of generic iterator
is naturally expressed parametrically, meaning the type we're talking about when we talk about the container type is always the same. Whereas in an object-oriented language where the normal thing is subtyping, the different things that come up, the different instances of the thing that are generic, you say, well, it's some subtype, but maybe not the same subtype.
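The iterator he's describing is just a fold; here is a from-scratch sketch (the standard library's `List.fold_left` plays the same role):

```ocaml
(* One generic function; the compiler infers the parametric type
   val fold : ('a -> 'b -> 'a) -> 'a -> 'b list -> 'a *)
let rec fold f acc = function
  | [] -> acc
  | x :: rest -> fold f (f acc x) rest

(* The same fold, instantiated at different combining functions: *)
let sum xs = fold ( + ) 0 xs
let max_of xs = fold max min_int xs
```

The container's element type is a single type parameter 'b: whatever it turns out to be, it is the same throughout, which is exactly the parametric (rather than subtyping) flavor of polymorphism.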
So that's kind of the fundamental difference between parametric polymorphism and subtyping, where subtyping is all about, I want something that matches this interface, but anything that matches the interface will do. And parametric polymorphism is all about, there's something in here, I don't know what it is, but once I decide on what it is, it's always that same thing.
In some sense, this whole description sounds sort of fancy and I'm getting my type theory hat on and it all sounds more complicated than it really is in practice. Well, so to clarify for people who are wondering maybe what the root is of all this type... What is so beneficial about these type systems? Is it that these type systems just allow you to catch more errors at compilation time? So I think there are two things that having...
Well, let's back up. There's having a type system at all, and there's having a really rich, expressive type system. So I think that one of the things that having even a very basic type system gives you, which people often don't think about, is tooling. Just the basic IDE-like features of being able to look up the values of things and look up the types of things just works much better in a language with a type system.
So that's kind of one advantage that you get even before you get to a language like ML. Just having good types is very valuable there. Another kind of tooling aspect of types, which is very nice, is that when you have informative types that tell you what your functions might do or how your objects might be constructed, the types themselves serve as a form of documentation. And unlike every other form of documentation you find in a big code base, it's documentation that doesn't lie to you. The thing the type system says is actually true because the compiler checks it.
So then the question is, what do you get by moving from a fairly simple and not terribly expressive type system like Java's to a somewhat richer one like the one you find in a language like OCaml? And I think the key advantage there, well, there are really two of them. One is it makes lots of things simpler to do.
There are lots of cases where you have to work around Java's type system to get the behavior you want and have to do lots of kind of complicated declarations of things to make it clear what you're trying to do. Like a really stupid but simple example is if you think about old Java, Java before generics, which the original Java in '95 didn't have generics. It took some Haskell programmers actually to bring generics into Java.
And Java before generics, if you wanted to return two things, like what did you do? Well, you could like create a new class for expressing that there are these two separate things that you want to return. In somewhat richer languages that have parametric polymorphism, you could say a general thing. I have a notion of a pair.
I don't have to declare it every time. It's reusable. I can use any different kind of type inside of it. So sometimes I return a pair of an int and a float, and sometimes I return a pair of characters. I can return a pair of whatever I want to. And I don't have to sort of say extra things to the type system to allow it to let me do that. So having more expressive types lets you do what you want to do without having to fill out quite so many forms. You can do things with less boilerplate.
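A quick sketch of the point about pairs (the values here are invented for illustration):

```ocaml
(* The built-in pair type 'a * 'b is declared once, for all types. *)
let min_and_max xs =
  (List.fold_left min max_int xs, List.fold_left max min_int xs)
(* Inferred: val min_and_max : int list -> int * int *)

(* A pair at a completely different type, with no new declarations: *)
let name_and_score = ("alice", 97.5)
```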
Sure. Go ahead. And then the other thing is they do, in fact, catch bugs. They catch lots more bugs. I think that there's a lie that ML programmers are always saying, which is that when you write your program in OCaml or a similar language, once it compiles, it basically works.
And this is in some sense an absurd lie. There are lots and lots of kinds of bugs that aren't caught by the type system. But it's also shockingly close to true. A wide swath of trivial bugs that you're used to having to sweat over just vanish in a system like OCaml. I think the one that's maybe most readily understandable is null pointer errors.
I think it's hard for people to fully take on the fact that when you program in a language like OCaml, you basically don't get null pointer errors. There's enough support in the language to more or less with the type system prove that those can't happen. And that's a surprisingly effective tool. And null pointer errors are just one kind of error. There's a whole other rich set of errors that...
disciplined and careful use of the type system can make vanish and make vanish with relatively little effort.
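For example, OCaml's option type makes possible absence part of the type itself, which is how null pointer errors disappear; a sketch, with an invented helper:

```ocaml
(* The result type says, loudly, that there may be no answer. *)
let find_index x xs =
  let rec go i = function
    | [] -> None
    | y :: rest -> if y = x then Some i else go (i + 1) rest
  in
  go 0 xs
(* val find_index : 'a -> 'a list -> int option *)

(* Callers must handle both cases; omitting the None branch draws a
   compile-time inexhaustive-match warning, not a runtime crash. *)
let describe x xs =
  match find_index x xs with
  | Some i -> "found at " ^ string_of_int i
  | None -> "absent"
```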
Well, this property of avoiding null pointer errors within functional languages seems to be a theme. We did a show about Elm, which is kind of like this Haskell, it's like Haskell meets JavaScript, basically. And, you know, people were talking about rebuilding their JavaScript applications within Elm and never getting undefined errors. And that's kind of...
you know, unique because in JavaScript, generally you're writing code and you get all undefined all the time. Undefined is not a variable or whatever. Undefined is not a function. That's what it says. So yeah, that's, it definitely resonates as a motif there. So you mentioned this, this boilerplate avoidance that you get through using OCaml. And an interesting quote is,
I heard from you is you cannot pay people enough to carefully review dull code. And you were kind of saying this, like boilerplate code is boring to read. Like people don't want to read it. And so they won't read it. And then so you can end up with like errors in the boilerplate. And it's just like, you know, and then it screws you somehow. So...
Could you clarify this reduction in boilerplate code? How does that work? Are you saying that OCaml makes the code just more interesting to read? What do you mean by that? So I guess there's two kinds of boilerplate that are worth distinguishing. One is kind of type system boilerplate, which is where you have to make a whole bunch of declarations to explain to the compiler what it is that you're doing.
And that definitely makes the code harder to read because it kind of inserts all this noise around the logic. One of the nice things about OCaml is if you write, you get code that is about as concise as Python, but with the performance and correctness properties of a type safe language. And that concision is valuable in part because it makes things easier to read.
So that's one kind of boilerplate that you avoid. But more generally, one thing that OCaml is really good at is it gives you good abstraction tools, good lightweight ways in the language to take something that you need to do multiple times in a row in the same way and parameterize it. Say, you know, rather than
rewriting the same code multiple times, I'll just take a little function that I define locally that takes a few different arguments and then does the appropriate thing, and then call that function a few times. And now the different invocations of the function clarify what is different between the different times you had to do it and make explicit what's the same. The body of the function is the part that's preserved, and the arguments are the part that changed from invocation to invocation.
So this is almost a form of compression, right, where you can take the repeated parts and thoughtfully factor them out so you don't have to say the same thing over and over. Just the well-known, you know, don't repeat yourself principle. Got it. But different languages support that to different degrees, and you want ideally to have a language which lets you do that effectively. A language which doesn't give you much in the way of abstraction tools like this is C, right?
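A minimal sketch of that factoring, with invented labels and numbers:

```ocaml
(* The repetitive version: the same pattern spelled out three times. *)
let report_repetitive () =
  [ "trades: " ^ string_of_int 120;
    "orders: " ^ string_of_int 75;
    "fills: " ^ string_of_int 60 ]

(* Factored: the body of [line] is what every invocation shares;
   the arguments are what changes from call to call. *)
let line label count = label ^ string_of_int count

let report_factored () =
  [ line "trades: " 120; line "orders: " 75; line "fills: " 60 ]
```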
When you talk to really good C programmers about how they deal with avoiding repetition, they sometimes resort to prolific use of the macro system, using pound-defines and things like that, to get around the fact that C's abstractions are relatively weak in this regard. But that has a whole set of problems on its own. Macros are much harder to reason about than functions. And so it just makes your code more... It solves some problems, but it creates others in its wake. What is the difference between a macro and a function?
That's a good question. You can think of a macro as a function that transforms your program, right? It generates code rather than just doing something. You can think of a function as something that has an abstract definition, and then the users of the function don't have to think about what's going on on the inside. Whereas when you're doing syntactic transformations, there's a lot of complicated things that can happen that are hard to think about. One of the basic ones to be concerned about is variable capture, right?
Where if you have a macro that takes in some names of variables and then expands some code, generates some code around them, one of the concerns you have is that one of the variables that you might reference in your generated code might happen to be also referenced by the variables that are passed into the macro. And because you're just writing something that writes code, it can easily write bugs in that way and falsely identify variables that are not supposed to be identified.
Whereas if you think about ordinary programming, when you call a function you're never exposed to this kind of confusion. The name of the parameter on the other side, say, or the name of the argument you pass to the function, is an irrelevant detail of the syntax which never interferes with the meaning of the code that you're writing.
So we've talked about a lot of the benefits of OCaml, and I want to touch on some of the weaknesses and how those weaknesses change the programming model at Jane Street. What is the concurrency support for OCaml? So it's maybe worth separating between two confusing and very similar words, concurrency and parallelism. Parallelism is about
taking advantage of multiple pieces of hardware at the same time, doing things physically in parallel. And concurrency is in some sense about programming in the presence of uncertainty, specifically uncertainty about when things might happen. So a concurrent program is one where you have lots of things going on kind of logically that are being tracked by the program and
in some sense at the same time, but not necessarily using multiple pieces of hardware to do it. So a good example of a program that can be concurrent but not parallel is you might write a program that listens on the network and has connections to many different servers elsewhere and is sending and receiving messages from all of them.
And you want to logically describe different parts of the program that are sending messages to and waiting for responses from different things all at the same time. And it's uncertain about which one's going to happen next. And for that, you need some good model for writing these complicated concurrent programs. And then at the same time, parallelism is...
is just the story where you want more performance out of this. And you can use concurrency to implement parallelism. You can have lots of things that are kind of sending messages to each other and waiting for input one from the other. And that's a way of building a distributed or parallel system. But concurrency and parallelism are different concepts. OCaml has a very straightforward and good concurrency story, which is there are all sorts of useful libraries which give you beautiful abstractions for building concurrent programs.
There's a library that we have called Async.
which gives you, it's a model based on what are sometimes called futures, where a future is a value that may not yet be determined. And by using these little kind of combining functions, often called combinators, that let you string together computations based on these futures, where you can say, I'm going to call this function, which gives me back a value as a future, and then I'm going to attach to that another computation, which when that future is filled in,
The next function will be run. You can get these sensible sequences of operations, and you can have multiple of these interleaved in the same program. And because you have control over the interleaving, when these things kind of switch control from one to the other, it's a lot easier to reason about the interactions between the different pieces. So the normal problem you run into where you have
Lots of concurrent threads walking over the same memory space, and you have to use mutexes and condition variables and semaphores. That kind of code is incredibly hard to reason about, and abstractions like async give you really good ways of writing concurrent programs that are really easy to think about. But where OCaml's story is less good is parallelism, which is to say OCaml doesn't have any...
Any way of running multiple threads that share the same memory space that can use different hardware at the same time. So you can't saturate, you know, the eight cores on your box with eight OCaml threads running inside the same OCaml program.
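To make the futures-and-combinators description above concrete, here is a toy, single-threaded sketch in plain OCaml. This is not Jane Street's actual Async API, just an illustration of attaching computations to values that may not be determined yet:

```ocaml
(* A toy future: a cell that is either determined or holding callbacks. *)
type 'a state = Determined of 'a | Waiting of ('a -> unit) list
type 'a future = { mutable state : 'a state }

let create () = { state = Waiting [] }

(* Determine the future and run everything that was waiting on it. *)
let fill t v =
  match t.state with
  | Determined _ -> failwith "already determined"
  | Waiting ks ->
    t.state <- Determined v;
    List.iter (fun k -> k v) ks

(* Run [k] on the value, now or whenever it becomes determined. *)
let upon t k =
  match t.state with
  | Determined v -> k v
  | Waiting ks -> t.state <- Waiting (k :: ks)

(* The bind combinator: once [t] is determined, start [f]'s
   computation, and determine the result when that finishes. *)
let bind t f =
  let r = create () in
  upon t (fun v -> upon (f v) (fill r));
  r
```

Real libraries like Async layer scheduling, error handling, and many more combinators on top of this basic shape, but the core idea of stringing computations together on not-yet-determined values is the same.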
What you need to do there is you have to run multiple OCaml programs that communicate via message passing. And this is the way in which we've built our systems here. And it's actually been really good. We really like this approach. And there's actually work going on right now actively pushing to have a multi-core garbage collector so that you could have multiple threads inside the same OCaml process.
And oddly, we're in the spot of thinking this is a good thing for the community because people care about it. But it's not clear to us how we're going to use it. Because we can and do build highly parallel systems that can use massive amounts of computing infrastructure. But we just do it in different ways. And one of the nice things about building things with message passing is you're not restricted to a physical box. You can write systems that span multiple boxes using the same set of abstractions. So...
So we don't have shared memory parallelism. That's just not a feature of the OCaml ecosystem right now. My guess is a year from now, we'll probably have that as part of the ecosystem. But it's not clear to me that it's a thing you should want. So in particular, within finance, is there some risk to the shared memory that you get to avoid by not having that level of parallelism?
Or rather to achieving the parallelism in a different way. And I think, yeah, there is a lot of risk that you avoid because writing code that uses the same shared memory...
and kind of interacts with it in an efficient and correct way is extremely difficult. It's not that we don't have any programs that share memory. In fact, we often have programs which will set up explicit shared memory segments, which different programs will write to and read from to communicate with each other. But we don't have it as part of the warp and weft of a program, where everything might be shared.
When we have sharing, we have explicit regions cordoned off in which we do the sharing. And thinking about those is actually really hard. The last time we went to build one of these abstractions, we literally dug out some papers about how the x86 memory model works to try and understand it better, because it's actually really subtle. And if you go and read the Intel manuals, they will tell you a bunch of hard-to-understand lies that don't actually correspond to how the hardware works.
And that's because the memory models you can implement efficiently are not simple in their – like the Intel one is pretty good. It's relatively simple as these go. But the naive memory model that everyone has of like, oh, there's a bunch of memory and then different things are reading and writing from it and it all happens in some order.
Like that just doesn't work. That's not how things happen when there are caches. I might write to something, and you might write to something, and there isn't one well-defined order in which you can imagine the whole thing happening; events can look different from the point of view of the two different actors on the system. So shared memory concurrency is complicated and hard to think about if you're working in an imperative context where you're actually doing mutation.
OCaml, even though it's a functional language, we write a lot of imperative code in OCaml. And OCaml, unlike, say, Haskell, is a language where you can write imperative code anywhere. There's no constraints in the type system stopping you from doing it. In that kind of environment, having lots of threads that share the same memory space, it's very hard to reason about. And I think...
Anyone who's really concerned about correctness should think twice about programming in a kind of free-threaded model, where everyone has to be very disciplined and put all the locks in the right place to make things work well. That's very hard to get right.
Hmm. So now that we're kind of on the topic of distributed systems, I'd love to get a better idea of the larger architecture at Jane Street. And maybe if you have any interesting distributed systems stories or some problems that you've been working to solve, it would be great to hear an anecdote or two. Sure. So I guess one interesting... So the whole question of how distributed systems should be designed is one that is...
of particular interest to me, because my academic background is actually in distributed systems. I went to grad school and had lots of plans to go off and be a professor, and
I learned a lot about all sorts of clever algorithms out there for building strictly consistent replicated systems, which is one of the core things that the distributed systems literature talks about. And if you look at companies like Google and Facebook and so on and so forth, they actually use these primitives in their own systems. So, for example, Google has a system called Chubby, which is their lock server.
And it's a distributed system. You have multiple different Chubby instances, but they try and present the illusion of one consistent system. And they use a bunch of fancy algorithms. In particular, they have a variation on an algorithm called Paxos, which is a well-known algorithm from the distributed systems literature, and which is legendarily complicated and hard to get right. So one amusing anecdote about this: years back, one of our interns, a very smart guy named Ed Yang,
After having done some other kind of more directly useful stuff, we thought kind of as a flyer, why don't we go and use our nice concurrent programming library to go and build a Paxos implementation? Let's see how well it works. And he went off to build one and he in the end came up with a nice implementation that we could experiment with. But along the way, there were some questions about exactly how to put the algorithm together. And so we looked at one of the papers that Google had published on the topic and
and they had a nice trick they described for making some technical aspect easier, having to do with group membership of the set of hosts involved in the distributed system. And there was a part of the algorithm they proposed that didn't quite make sense to Ed, and he said, I don't think this is right. And I looked at it, and I thought,
yeah, I don't think that's right. I don't see why that would be correct. And I emailed one of the guys I had worked with back at Cornell when I was a grad student, who's written a number of papers in the area. And he said, no, I don't think that's right. And then I emailed the author of the paper at Google, who I want to emphasize is no fool. He was a guy whose papers I read when I was a grad student, on the same topic. And I pointed it out to him, and he was like, yeah, you're right. That's a bug. And kudos to them for having correctly described in their paper, well enough, what was going on
that you could really understand the implementation well enough to find a bug in it. That was pretty good. As opposed to the original Paxos paper. Right. So I think the original Paxos paper, I don't know that there are any – I don't know of any bugs in it. It's a beautiful, charming paper by Leslie Lamport. I recommend it. It is completely incomprehensible. It's called The Part-Time Parliament and it's sort of – it's an extended allegory of a distributed system as if it were a rowdy parliament on the Greek island of Paxos. Right.
It's hilarious, but very hard to understand. And then he wrote another paper called Paxos Made Simple. And then a guy I know from Cornell, Robbert van Renesse, wrote a paper called Paxos Made Moderately Complex. And there's a long series of papers about trying to understand Paxos. Anyway... If I recall, the first Paxos paper, he submitted it to ACM and they just didn't accept it, because they were like, this guy is clearly crazy. Yeah.
I hadn't heard that story. And he's like, actually, I figured out how to do distributed systems. So I think an interesting question here is whether or not the whole idea of Paxos is as broadly applicable as people think it is. It's very hard to get Paxos correct. And the basic thing you're trying to implement with something like Paxos is what's called state machine replication, where the idea is that you have some well-defined notion of the sequence of things that happened, a linear order on all the events.
And you have a well-defined way of decoding those events and building up a state. That's the part that's the state machine. And if you can have a way of distributing these events in the same order to everyone reliably, then all the replicas can just follow along decoding the messages as they go. And you have a very easy to think about way of replicating the system.
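State machine replication as just described can be sketched in a few lines. This Python sketch (illustrative names, not any real system) has a sequencer that assigns a total order to events and replicas that deterministically apply the log:

```python
class Sequencer:
    """Decides the order of events for everything."""

    def __init__(self):
        self.log = []  # the single, totally ordered event log

    def submit(self, event):
        self.log.append(event)
        return len(self.log) - 1  # the event's sequence number

class Replica:
    """Follows along, decoding events in log order to build its state."""

    def __init__(self):
        self.state = {}   # e.g. account -> balance
        self.applied = 0  # how many log entries have been applied

    def catch_up(self, log):
        # Applying the same events in the same order is deterministic,
        # so every replica that catches up reaches the same state.
        for account, delta in log[self.applied:]:
            self.state[account] = self.state.get(account, 0) + delta
            self.applied += 1
```

The fragility the discussion turns to next is visible here: everything hinges on the single sequencer, so the hard question is what happens when it dies.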
The fundamental problem that Paxos is trying to solve is the problem of what happens when the guy who's calling the tune, the part of the system that's deciding the order of events, what happens when that thing dies? And one of the experiences we've had here is we built a system along these same lines that is built on state machine replication where
You have a part of the system called the sequencer, which decides on the order of events for everything.
But when the sequencer dies, what happens is that a human has to actually step in and manually do some checking: make sure that the thing that died isn't half alive and still generating messages, make sure it's actually good and dead before flipping over to a backup. And so it means that you have a few minutes of downtime as someone thinks through this somewhat complicated situation. And that's the downside: when the master dies, you have some extra work to do that you have to be careful about to recover. Right.
And there's a question of at what point the extra complexity of implementing these fancy algorithms is worth it, given that if you buy a nice high-spec machine, its median time to failure is something like two to three years. So if I can take a downtime of three minutes every two or three years, it's not clear to me that all of the complexity of Paxos is worth doing. And even beyond that,
the number of bugs you expect from the Paxos implementation being wrong is, I think, probably larger than the number of bugs you expect from your nice, high-quality, dual-power, dual-networked machine crashing. So this is safety versus liveness that you're talking about. That's right. And I think part of what's going on with Paxos, which I absolutely think is important in some application areas, is that
But it's a kind of trade where you make the system more complicated and thereby implicitly trade away safety.
In exchange for a system that's more alive and really more alive kind of in an automated way. If you look at a Paxos system, when the leader dies, because in the end they do have a leader that is making the decisions as to what the next transaction is. When the leader dies, there is a little bit of latency as the leader election algorithms fire and try and pick a new leader. It does take, I don't know, 10 seconds or something in a lot of these systems to fail over from one leader to another. So it's not like you get no downtime. It's just the downtime is less.
And importantly, it takes less operational work and less operational know-how to do the failover. You don't have to have a smart person thinking about it. It just kind of happens in the background. And yes, you were out for a few seconds, but that's not that big of a deal.
So is the movement towards Raft kind of a win-win there? Because as I've heard about Raft, you kind of get a Paxos-like system, but with more simplicity. So Raft seems like an honorable direction to try. I haven't read the papers myself, but talking to people about it, I think it is somewhat simpler than Paxos, but still not radically simpler.
I still think it's the case that this is a complicated thing that's hard to get right. And even with systems like Raft, there's an enormous performance trade-off. Systems that are built with a single sequencer can do millions of transactions a second. You can write extremely fast network processors in OCaml without going totally crazy; you don't have to
decide to rewrite everything using FPGAs or something. With ordinary OCaml code on basically stock ordinary hardware, you can write things that can process millions of messages a second and get messages in and out of your box wire to wire in a handful of microseconds. It's just not that big of a deal. And you cannot do this in a Paxos-like system. There's just no way. So you're making...
Again, I don't want to say this kind of stuff is totally useless. Far from that. I think there's a lot of sensible applications for it. And it may be that things like Chubby are totally sensible uses and also Zookeeper and things like that. And in fact, we're starting to use Zookeeper a little bit internally. But my guess is that people over-apply things like Paxos and think it's relevant to more domains than it really is. And I think they're really...
What I'm kind of surprised by is it seems like you're kind of advocating more liveness, less safety. And we started off this conversation emphasizing the need for safety, particularly in your domain. So, no, no. I think the thing I'm advocating is actually more safety and less liveness. Oh, okay. Because what happens is the bad thing that happens when the leader fails is that the system stops. It stops cold. Right. So that's a liveness problem, not a safety problem.
And I think you get more safety out of this in a subtle way. The platonic ideal of the algorithms, at least, is totally safe, right? There's nothing wrong with a correct Paxos implementation. But as a practical software engineering matter, generating a correct Paxos implementation is not easy. I think it's a little bit like the cryptography story: it's not one of these things you should consider rolling on your own.
It takes a shocking amount of engineering to get a Paxos implementation right. And it's not clear to me that I would feel good about a Paxos implementation that didn't have a formal proof of correctness behind it. It's that subtle. Well, I guess it makes sense that you're using ZooKeeper then, since so many eyes have covered ZooKeeper. You're probably a lot less likely to run into that kind of subtle unsafety when there are so many eyes on it. That's true. And we're not using ZooKeeper for
things that are absolutely correctness-critical, right? We build lots of kinds of infrastructure. We build stuff that sends orders to the markets, and we also build build systems, and monitoring tools, and things for deploying servers. The infrastructure is generally important and should be of high quality, but not everything is equally terrifying to get wrong. And we apply different engineering techniques to different parts of the system.
Okay, interesting. Well, this seems like a good segue to talking about interacting with the broader market. You've said that it's terrifying to write software that interfaces with the broader market. Why is that? So there are a couple of things. One is just the fact that these
These systems that are trading on your behalf are committing you to transactions and you could lose a lot of money on those transactions. And if you have a system that makes the same stupid decision over and over in a fast loop, you're going to lose a lot of money.
You can lose a shocking amount of money before you've noticed. So this is in some sense less about the messy details of interacting with the outside world, and more about the core semantics of what we're doing, right? We're trading. Trading is scary. Having this happen without human oversight, driven by some program that poor programmers like us wrote and tried hard to get right, that's a scary thing on its own. It's all made much worse by the fact that the world is really complicated and hard to model.
There's a big gap between the naive model you might have of how the markets work and how they really work. You'd like to rely on the fact that you're trading the instrument you think you're trading, that the market data you're getting is good, that you have a correct understanding of the instrument's current meaning, and that all the different corporate actions that could affect the meaning of a security today versus yesterday are clear. And then you have complex instruments that are related by arbitrage relationships. That's a bit of jargon I should maybe explain, but the obvious examples are things like ETFs, where an ETF is a security that represents a basket of other securities. You may be trading that based on some understanding of what that ETF means, and it may change in a way that you don't understand and somebody else does. So there's lots of ways in which the gap between your understanding and what's really going on can mess you up.
And that adds a lot to the complexity of it. And it's made worse by the fact that a lot of financial software is really terrible. We trade in markets all over the world, and some of those markets are not super well engineered. And so, especially when you try and break into new and more obscure things,
you discover all sorts of weird stuff in the technology where you want hard guarantees. If I send an order for 5,000 shares at a price no worse than $33.30, I want to know that I don't get filled for more than that number of shares, or at any worse price. It seems like a simple, basic guarantee to want. And it's not one that every exchange everywhere successfully honors. They have bugs. They have problems. They make mistakes too.
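The guarantee described here is easy to state as code. This is a hypothetical Python checker (not a real exchange interface) for the invariant that a buy order's fills never exceed its quantity or breach its limit price:

```python
def fills_honor_order(max_qty, limit_price, fills):
    """Check a buy order's fills: fills is a list of (qty, price).

    The order asked for at most max_qty shares at a price no worse
    (i.e. no higher, for a buy) than limit_price.
    """
    total_qty = 0
    for qty, price in fills:
        if price > limit_price:   # filled at a worse price than allowed
            return False
        total_qty += qty
    return total_qty <= max_qty   # never filled for more than requested

# The example from the discussion: 5,000 shares at no worse than $33.30.
```

The point of the anecdote is that not every venue actually honors even this simple invariant, so a defensive check of this shape arguably belongs on the client side as well.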
So how do you hedge against this? Do you just have to have an institutional humility? Because it sounds like there's just so many Black Swan type of things you have to be prepared for, whether it's an ETF doesn't trade the way you want it to, or there's a glitch in a system. How do you institutionalize being prepared for this type of stuff? So I think institutional humility is a great way of putting it.
I think that risk management extends all the way from the technology stack to the trading decisions. And you need to be afraid of the various things that can go wrong. And we try and be very careful and try and think hard about the kind of things that we don't expect might happen, about different ways in which
The different risks we're worried about turn out to be correlated with each other. That's, I think, one of the scariest things about the financial world is when things go wrong, they sometimes all go wrong at the same time. Like exactly the time when the exchanges' bugs are likely to be exposed is when things are very busy, right? And when things are very busy, that's often because all sorts of other assumptions that you have about the world may also have been violated.
So you get all sorts of things. And here's another terrifying thing to imagine: when the markets get really busy, you know what else might happen? Your monitoring might fail. So to be simultaneously exposed to weird behavior in the financial markets, and large amounts of trading, and failures in your monitoring, and then weird failures in the semantics of the underlying systems you're dealing with: you just have to be afraid of all of those things. And think carefully, when things are going wrong,
about what might be the cause of it. And when things look like they're going right, you should be a little worried that maybe they're going wrong and you just don't realize it. Do you have like a big red button that you can press to shut down the system? We have lots of ways of turning things off, including a literal, physical big red button. Oh. Yeah, I like to say that, yes, it's red. It cuts the power to a thing that heartbeats other things and keeps them alive.
I like to think that one of the most carefully engineered abilities of our trading systems is the ability to not trade. And we try and redundantly engineer that into the systems. Oh, okay. So maybe you can have like a big red button next to every fire alarm like people could pull. Do you have like an inverse of that, like a big green button that you can press in moments of severe volatility? So, no. No.
So to back up: I think one of the things that is a pervasive part of our risk thinking around software is the idea that not trading is safe. And so there's a default that when things look bad, you should turn off. Now, you have to be careful with this, because from the point of view
of doing a good job at the actual business, you want to be up as often as possible. There's a service that you are providing to the markets. You are being the other side of people's trades. And when things are complicated and crazy, that's when people need your services the most and are willing to pay the most for your services. So you are very well compensated for being up and trading in crazy times.
But even so, we feel strongly that it's important to engineer the systems to turn off when things go bad enough. So if you have some crazy situation in your code, you're like, well, what if this happens and then the other happens? I get this weird corner that should never really happen. Well, we generally say, well, the right thing to do is to turn off.
And this is very different from, I don't know, Facebook, where for them, I think, availability is king. If they show you the wrong set of likes, it doesn't matter that much, but if they're not up, everybody's angry. So I should clarify what I meant by volatility. I meant, let's say, a war breaks out somewhere totally unexpected, or a plane goes down with a president in it. And it's the type of situation where it's probably not going to lead to a
market glitch or something, but it's the type of thing where maybe you've got a bunch of positions that benefit from market volatility and then you would want to press the big green button and take advantage of that. Maybe that's something that just doesn't make sense. I don't know what that green button would do, but one thing I think is worth saying from a... I want a big button that buys everything. It's like, I don't think that helps. One thing I'll say is that I think
The day that there's some terrible news that makes the whole markets go crazy, that is a day where you expect bad things to happen on the software side. Because you'll have undue load. You'll have weird situations that you don't normally have. So these things are really correlated. Another good example of correlation – this is not a bad event, but Facebook's IPO –
Right.
And I think this is the way of things. The extreme circumstances from a financial sense are also the extreme circumstances from a technological sense. Interesting. So you're talking about this institutional humility. I'm curious about the other ways that Jane Street works as a technology organization, potentially for ways that listeners might be able to apply to their own companies. What are some unique principles at Jane Street?
So I don't know if I'd quite elevate it to a principle, but we have a very strong focus on readability of code and on code review. We put a lot of effort into tools for code review: we have our own code review system called Iron that is used for basically all the development that we do here. And one of the reasons we care a lot about code review is just this:
It's a way of catching bugs. But I think it's actually much more important than that. It's not just about making sure that things are correct. Code review is a way of doing a lot of good things to your code base. It's a way of sharing institutional knowledge, improving your bus factor. If multiple people are reading the code, then the one person getting run over by a bus does not completely destroy all of your institutional knowledge of the systems in question. It's a great way of training people. And I think more than anything else,
it improves the engineering of your systems, because it obligates you to write systems that can actually be read. And readable code, I think, is very different from, and in functional ways often much better than, code that's unreadable. Code that is simple enough to be read is also likely to be easier to modify, more likely to do what the people who wrote it intended, and easier to evolve to do new things.
So I think code review is a big part of our institutional kind of approach to software and something that I think even though lots of places have code review, I think we spend more time on it and think about it more carefully than any organization I know about. Got it. So to begin to close off, I'd like to ask kind of a philosophical question. How has your definition of money changed since you began working at Jane Street on these financial technologies?
So I don't know that my definition of money has changed exactly, but I do think very differently about the idea of what I'll call expectancy: thinking about what is the value of a certain choice that you make, in monetary or in other terms. Let me tell you a short anecdote about this. So when I started at Jane Street (in fact, before I started, I was doing some consulting for the firm for about a year, half time, while I was doing a postdoc), one of the early people at the firm was talking to me about what was going on. He said, you know, what's the chance that you want to leave early, before the year is out? And I said, ah, for various details of the academic job market, it's not very likely that I'd want to leave early. And he said, what's the chance you want to stay on permanently? And I said, zero.
His immediate response was to kind of laugh and say, I'll buy that, which is to say he'd be happy to make a bet with me that there was some probability that I would stay. And it's interesting to compare how I think about these kinds of things now versus then. I would now think of it as almost a violation of my professional ethics to describe the probability of something as being zero.
In the sense that saying something has zero probability of being true is, if you think of it in a Bayesian sense, like saying: no matter how much evidence you show me on the other side, I'm going to still believe that it's not true. Right. God himself could descend from heaven and say, you are going to continue working at Jane Street after the following year, and I would continue not to believe it.
And this kind of shift in focus, I think, comes from thinking about the monetary implications of your decisions. One thing that is a very clear effect we've seen over and over is that when you ask someone to estimate how confident they are about something, how likely something is to be true, they'll do a terrible job of it, very often being massively overconfident.
On the other hand, if you say, how much money would you bet on it, and at what odds? You suddenly step way back and think much more carefully about it, and say things that make a lot more sense. So I think there is a real way in which thinking about the monetary consequences of your actions helps you think about probability in a more rational way.
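The idea of expectancy, and why assigning probability zero is such a strong claim, can be made concrete with a small sketch (the numbers here are made up for illustration):

```python
def expectancy(outcomes):
    """Probability-weighted value of a choice.

    outcomes: list of (probability, payoff); probabilities sum to 1.
    """
    assert abs(sum(p for p, _ in outcomes) - 1.0) < 1e-9
    return sum(p * payoff for p, payoff in outcomes)

# A bet that pays $100 with probability 0.3 and loses $40 otherwise
# has positive expectancy even though it usually loses:
#   0.3 * 100 + 0.7 * (-40) = 30 - 28 = 2
```

Assigning probability zero to an outcome makes its payoff irrelevant at any odds, which is exactly the Bayesian sense in which no amount of evidence could ever move you.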
Yeah, I definitely remember feeling that way. I used to play a lot of poker. And to your point, I remember reading this quote from Dan Harrington, who's a poker player. He said the minimum probability that your opponent is bluffing is 5% in any given situation. And kind of this floor on the probability that your opponent is just full of it,
a floor that is not zero, was always kind of interesting to me. And I definitely found it to be a useful heuristic, but it's not necessarily an intuitive one. You would assume that grandma is never bluffing, or something like that. But in fact, it's a useful heuristic. So I really liked that probabilistic model of thinking. And grandma's a sharp cookie, right? Who exactly knows her table image?
Anyway, well, Yaron Minsky, thanks for coming on Software Engineering Daily. It's been really awesome and interesting talking to you. The time flew by. Great. Thanks so much.