In Athens, 2013, Maria sits in a cafe. She's 32 and she's been jobless for three years. Like many Greeks, she feels stuck. Each month tougher than the one before. Newspapers drone on about austerity. There's cuts, there's layoffs, there's pensions slashed. Life is closing in on her. That morning though, she spots an unusual headline. It's a strange story from across the ocean about two Harvard economists,
Carmen Reinhart and Kenneth Rogoff. Their 2010 paper claimed that when a country's debt tops 90% of its GDP, economic growth takes a hit. This idea became a key reason for austerity policies worldwide, including those imposed on Greece. But the news article Maria reads is an update. A graduate student,
working with his professors, found a critical error in Reinhart and Rogoff's analysis, a simple spreadsheet mistake. A miscalculated formula left out significant data, leading to plain factual inaccuracies. Instead of economies shrinking when debt topped 90% of GDP, as the original paper had claimed, the corrected figures showed average growth rates of around 2%.
This wasn't just an academic blunder. It had real-world fallout. Governments, misled by the flawed study, where the authors had simply not selected all the rows, rolled out austerity measures. And because of that, there were prolonged recessions. There was soaring unemployment. There was social unrest. In Greece, it was a national crisis. Unemployment over 27%. Public services falling apart. The lives of people like Maria in chaos.
It's unsettling, this idea that a simple spreadsheet error, a coding mistake, could steer global economic policy, could change the lives of millions of people. Maria's story is in fact fictional, just a composite, but the people affected by this error were real. And it makes you wonder, what other unseen mistakes or unintentional deviations in code are quietly shaping our world? And what happens when those lines of code are thrust into the spotlight?
Welcome to CoRecursive. I'm Adam Gordon Bell. Today we're exploring the invisible code that quietly shapes the world around us. Code most of us never think about at all. Here's an example. My mom doesn't own a computer, unless you count her flip phone. She's nowhere near the Silicon Valley bro stereotype. But she did write code once in university. She wrote a lot of code.
She studied psychometrics, measuring intelligence and cognitive skills. And back then, writing her research up meant running statistical calculations, which meant writing programs on punch cards and submitting them to batch processors to calculate correlations. A lot of the world's most important code is like that, or like that GDP spreadsheet.
It's just some simple calculations tucked away somewhere in academia, sitting on a co-author's machine that's only pulled up when a diagram needs to be regenerated or a constant needs tweaking or when somebody requests it. It's invisible code, but it's powerful. It affects policies. Often, this hidden code stays unnoticed until something goes wrong or until a single line out of context gets thrust into the spotlight. That's what today's story is about.
A story about more than just scientific data. A story about the human side of data analysis and the pressures on those who do it. I'm talking about ClimateGate. Does anyone remember ClimateGate? It was all over the news back in 2009, 2010, about 15 years ago. And I vaguely recall it being a really big deal.
I remembered something about leaked emails and climate change. It was one of those scandals that just happened in my past. And if I thought hard about it, I recall hearing about scientists getting caught red-handed fudging data to make global warming look worse than it was. They were supposed to be truth seekers, but they were twisting the numbers to fit their agenda. I think there was a hack or a leak. At least that's what I remember.
But then I looked into it and I realized it all boiled down to a single file, a single piece of code. A file called briffa_sep98_e.pro. Rolls off the tongue, doesn't it? ClimateGate was like that spreadsheet error, but on a massive scale because it shifted how people saw scientists. And it sparked distrust in science for some, maybe for many. And that trend continues today. But here's the thing, we can find the truth ourselves.
Today, I'm going to go download the actual leaked ClimateGate files, open up the controversial code, and dig through it step by step, all to answer one big question. Was ClimateGate evidence of scientific fraud? Or was it something else entirely? To answer this, we're going to take some detours. We'll explore strange files with cryptic names, decipher obscure programming languages like IDL,
venture into unrelated scandals like the Alzheimer's research scandal, and at times it might feel like I'm getting lost in the details, but trust me, we're always chasing the same goal: uncovering exactly what happened in ClimateGate. Because I think it matters. Because I think we live in a world where science itself is increasingly under attack, where misinformation spreads faster than actual explanations, and where trust in experts is super, super fragile.
So no matter what we uncover, the act of careful investigation itself is an essential skill. It will help us figure out what and who to trust in a moment when it feels like the stakes on the truth have never been higher. It all started on November 17th, 2009. Something was wrong at the Climate Research Unit at the University of East Anglia in Norwich, England, a city of about 150,000 people.
A backup server holding years of emails and research data had been breached. The university called it a sophisticated and carefully orchestrated attack. 160 megabytes of data were copied, emails, documents, code, everything. And then there were some whispers online and a curious upload to Real Climate. And then anonymous posts hinting at secrets, suggesting that climate science was too important to be kept under wraps.
By November 19th, whispers became a roar. An archive file with everything was copied across the internet and spreading fast. Suddenly, thousands of private emails and documents were out there. Climate change denial blogs jumped on it, claiming the truth was finally being revealed. In just a few days, the media picked up the story, and headlines ran about leaked emails about a brewing scandal, and this was all just weeks before the Copenhagen Climate Summit.
The University of East Anglia confirmed the breach, and the police got involved, and the world watched as Climategate erupted. No one knew the full impact yet, but it was clear something big had just happened, and the world of climate science was about to be shaken. When these files first leaked, James Delingpole at The Telegraph reported that global warming was based on one massive lie. Now I just want to say,
I believe in global warming, and I believe that it's caused by humans. My intent here is not to give a platform to the science deniers, but I do want to explore how we can move beyond just trusting the experts, how we can look at things ourselves, how we can investigate what is the truth using our own minds and, you know, some effort. That's why I found these leaked files.
And I downloaded them. It's a zip file. F-O-I-A dot zip.
And it's packed with documents. It's split into two folders, documents and mail. The mail folder is like 11,060 text files with names like 125423285.txt. And if you open it, it's just a plain text email between two researchers, usually talking about a paper they're working on. The key file of our story, briffa_sep98_e.pro,
is in the documents folder in a directory called Harris Tree. And this file is considered the smoking gun that triggered the controversy and led to an entire university lab being investigated by the UK House of Commons, led to eight official inquiries, articles in the New York Times, articles in the Washington Post that claimed the climate scientists were lying, that claimed that they were hiding things, that the world was actually getting cooler.
And it's just one file. It's just a small file. It's 150 lines of what turns out to be IDL, a programming language that's kind of like MATLAB or like NumPy, but with Fortran-style syntax. I guess IDL is mainly used in science for number crunching and graphing.
It's imperative code. It's like set this variable, then load this one, loop over these. And it's pretty heavily commented. Although in IDL, the comments start with a semicolon, which I find a bit confusing, but I got used to it. Anyways, in this file, right at the top, in all caps with asterisks before and after to set it off as a heading, it says applies a very artificial correction for decline.
artificial correction. It's right there in the code, just two lines down from the top of the file. And then a list of values, and the values are labeled fudge factor, fudge factor, artificial correction. This wasn't sophisticated climate modeling jargon, right? This sounds like they were just making stuff up. But to get to why this artificial correction stirred things up,
you kind of need to know what was going on at the time. What was happening in the 90s, the late 90s, and the early 2000s, and about the hockey stick graph. In the late 90s,
Climate scientist Michael Mann, along with some others, Raymond Bradley, Malcolm Hughes, introduced the hockey stick graph. It showed global temperatures holding steady for a thousand years and then shooting up sharply in the late 19th and early 20th centuries. Picture a hockey stick lying flat on the ground and then suddenly at the end, the blade curving upward. That's the shape. That was temperature, worldwide temperature.
for the globe, and for the Northern Hemisphere in particular. The graph wasn't just scientific trivia, right? It exploded into the public view. It became this shorthand for the urgency of climate change. Al Gore held it up. It was a big moment in An Inconvenient Truth. And suddenly this image, this shape was everywhere, a symbol of the crisis that was going on.
But that power made it a target. Almost immediately, it faced fierce scrutiny. Skeptics didn't just question it, they attacked, claiming the data was manipulated to exaggerate warming. To them, this wasn't science revealed, it was a political weapon forged to justify drastic policies. So when Climategate erupted and phrases like artificial correction and fudge factor popped up in the leaked code, skeptics thought they'd hit the jackpot.
They had proof of fraud. They had proof that they didn't have to worry. Here's the critical question. Was the hockey stick graph genuinely compromised? Was somebody misrepresenting things? Or was this controversy more about misunderstanding? Were people misunderstanding the scientific process? Thankfully, we have the code. Now, I just need to figure out how IDL works. And I need to find the data and understand what's going on here.
The fudge factor is actually pretty straightforward. It's a series of numbers from 1400 to 1992. It starts at zero. So we have a zero value from 1400 to 1904. Then it dips negative into the 30s. And then it shoots up in the 50s all the way through the 70s. And then finally leveling off. I couldn't actually figure out how to run IDL. So I did what any developer would do. I just converted it to Python.
If you graph those values, you see a long flat line, the shaft of the hockey stick, and then from 1950 onwards the blade, tilting sharply upward. The code does more than just graph that fudge factor though. It reads in climate data and applies a low pass filter to it, basically smoothing it out, and then it applies that fudge factor over top.
So I did the same thing. I made up random climate data from 1400 to now, and then I applied the very artificial correction. And then I can graph both with the correction and without. And without, it's a very straight line, but with, it turns into a hockey stick. The fudge factor completely overshadows the real data. I can see why the skeptics were concerned. When this surfaced, Eric S. Raymond, a well-known open source advocate,
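Here's roughly what my Python conversion looks like, a minimal sketch: the yrloc and valadj numbers follow the widely quoted snippet from the leaked IDL file, the smoothing is just a moving average, and the climate data is random noise standing in for the real series.

```python
# Minimal Python sketch of briffa_sep98_e.pro's "very artificial correction".
# yrloc/valadj follow the widely quoted snippet from the leaked file; the
# input series here is just random noise, not real proxy data.
import numpy as np
import matplotlib.pyplot as plt

# Anchor years and the "fudge factor" adjustment values from the IDL source
yrloc = np.concatenate(([1400], 1904 + 5 * np.arange(19)))
valadj = 0.75 * np.array([0., 0., 0., 0., 0., -0.1, -0.25, -0.3, 0., -0.1,
                          0.3, 0.8, 1.2, 1.7, 2.5, 2.6, 2.6, 2.6, 2.6, 2.6])

years = np.arange(1400, 1995)
fudge = np.interp(years, yrloc, valadj)       # interpolate to yearly values

# Stand-in for the real proxy series: flat noise around zero
rng = np.random.default_rng(0)
proxy = rng.normal(0.0, 0.15, size=years.size)

# Crude low-pass filter (moving average) to mimic the smoothing step
window = 21
smoothed = np.convolve(proxy, np.ones(window) / window, mode="same")

plt.plot(years, smoothed, label="smoothed data, no correction")
plt.plot(years, smoothed + fudge, label="with 'very artificial correction'")
plt.legend()
plt.xlabel("year")
plt.show()
```

Run that and the uncorrected line just wobbles around zero, while the corrected one bends up into a blade after 1950.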
the guy who wrote The Cathedral and the Bazaar, and also a well-known social conservative, he saw it too. He did this same process and found some of the same issues. He wrote: this is blatant data cooking, plain and simple. It flattens the warm temperatures of the 1930s and 40s, see those negative coefficients, then it adds a positive multiplier to create a dramatic hockey stick. This isn't just a smoking gun, this is a siege cannon with the barrel still hot.
Eric Raymond's a vivid blogger, right? Siege cannon, barrel still hot. That's powerful imagery. And it was coming from an expert in software. So it was hard to dismiss. He wasn't just some random internet crank. Eric at the time, he had a big book out. He was a respected figure in the tech world and he definitely understood code. His vivid take on the situation helped shape how people first saw the code.
He posited that this was an error cascade. The people at CRU had manipulated climate data with this hockey stick fudge factor, other researchers had built on their results, and soon the world was buying into a big lie. Until this leak happened and the deception came to light. Some claimed that climate change was fake, and this fudge factor in this code was proof. Climate change, of course, wasn't fake, but that didn't mean that scientists weren't nudging the numbers.
Both could be true. So what was really happening? With any good investigation, you can't stop at the first piece of evidence that fits the narrative. You have to keep digging, especially when the accusations are this big. And the deeper I dug, the more I kept seeing another infamous phrase, one that seems like a direct confession that kept coming up. Hide the decline. And in reference to the original hockey stick graph published in Nature, there was an email in this leak that talked about Mike's nature trick.
To bloggers and to the mainstream media, this felt like a confession. Some thought this hack was an inside job. Maybe someone at CRU was fed up with all the lies, and so they leaked this data out. But before jumping to conclusions, we need to understand what this code is really doing. You see, climate science is actually pretty complicated. You can't just read the file. You need to understand the context. So heads up, we're about to do a deep dive. But stick with it. I think it's worth it.
All right, imagine this: you're on call, it's 2am, and your pager goes off. The main transaction system is throwing errors. Latency is spiking. You dive in but something's off. The detailed performance logs, the granular stuff you need, they only go back six hours. Before that you just have daily averages, nothing useful for debugging this spike. You can see the system is acting weird now.
But the crucial question is, is this spike completely unprecedented, or is this just Tuesday? Maybe that's when the batch jobs run, and they always throw alerts like this, and you should ignore them. Without that historical context, without those older logs, you're flying blind, trying to figure out the root cause. Climate science is exactly like this, but the system is planet Earth and the stakes are considerably higher.
We have solid detailed data on the Earth's climate. Thermometer readings from weather stations, from ships, from lots of places going back roughly 150 years. This is the instrumental record, and it tells a clear story: the planet's average surface temperature has risen by about one and a half degrees Celsius over the last century. Just like that production system with only six hours of logs, 150 years is a blink of an eye in climate terms.
So is that degree and a half of warming normal for the Earth, or is it abnormal? Is it outside of the natural variability? We've got this huge blind spot before the late 1800s. We just don't know. So to answer this question, scientists need to become data detectives. They need to find ways to reconstruct climate history from before the time of widespread measurement.
But this isn't like restoring logs from archives. Nature doesn't keep clean, standardized, you know, JSON files. The data logs scientists have to work with are things like the width of tree rings or the chemistry of ancient ice layers drilled up from Greenland or the skeletons of corals,
or even the temperature profiles found deep underground in boreholes. These are called climate proxies, and they're imperfect, they're noisy, they measure climate indirectly. They're sparsely located around the globe, and they sometimes record things other than temperature. And also they have gaps, and they come in completely different formats. Piecing together the Earth's climate history from fragmented and messy data is a huge challenge. Climate science is actually a lot like data archaeology.
You're using complex statistical modeling and a painstaking process to try to figure out if the picture you're assembling is an accurate representation using all this proxy data. So let's look at some of the main types of data used. It's really the only way to get an understanding of what's happening in that file. The most famous temperature proxies are the tree rings. This is central to the story because this is actually what CRU focuses on.
Trees grow a new layer each year and how thick or dense that layer is often depends on the conditions during the growing season. Maybe how warm the summer was or how much rain fell. So you find some really old trees and you drill out a core and you count the rings back in time measuring their properties. It sounds simple but it's actually messy. Trees only grow in the mid latitudes of the globe and you won't find any trees in the ocean or in Antarctica.
And even where they do grow, it's not just climate affecting them. Younger trees grow faster. Trees get diseases. Maybe a nearby tree falls, giving the tree you're measuring more sunlight. It's like a performance metric that's being affected by random GC pauses or network hiccups that you weren't tracking or the amount of work available to do and a million other factors. But there are actually so many trees out there.
You get lots of data, and hopefully with that much data, the individual noise can cancel out and you can find the signal: the aggregate growth rate, year upon year, for the area the trees are in, going back as far as those trees do. And actually even further, we'll get to that. Next up are ice cores. You drill deep into an ice sheet on Greenland or Antarctica or in a high mountain glacier.
And you can get a lot of data out because as snow falls and compresses into ice year after year, it traps tiny bubbles of atmosphere from that specific year. And scientists can measure the CO2 concentration from hundreds or thousands of years ago. Ice cores are how we know that today's CO2 levels are unprecedented.
The ice itself, the frozen water molecules, also hold clues. The ratio of heavy oxygen isotopes to light ones changes depending on the temperature when the snow originally formed. So that's another proxy. But it's not perfect. The isotope ratio can be thrown off by where the snow came from,
not just the local temperature. And the deeper you go, the more the ice layers get compressed together. So the yearly resolution gets fuzzier and fuzzier. It's like a log file where older entries are being aggressively compressed.
For oceans, and especially in the tropics, scientists look at corals. Corals build skeletons out of calcium carbonate, and they add layers year by year, sort of like tree rings. So corals give us this precious data that we were missing from the vast ocean areas where trees don't grow.
And then there are other types of proxies. There's layers of sediment washed into lakes each year that can tell you about the levels of snow melt. And you can use that to infer temperature. Fossils in deep ocean mud give clues about temperature over millennia, though often really fuzzy in terms of what year it is. And you can even measure temperature down in boreholes that are drilled deep into the earth's crust. I don't totally get how that one works.
But the point is, you've got all these different types of proxies. Tree rings measure summer temperatures in North America. Coral skeletons record sea surface temperatures in the tropical Pacific. Ice cores log polar temperatures. Lake mud will tell you about the spring snowmelt.
So they're all recording something about climate, but they're all indirect and they're all noisy. And they all have different time resolutions, some annual, some spanning decades, some spanning centuries. And the dating isn't always perfect. Someone is piecing this data together by hand. Also, they all cover different parts of the globe and different seasons. Some stop abruptly. Some end up with weird glitches in them. So how do you take this mix of messy, scattered, imperfect data
and turn it into a clear picture of the climate over time? How do you pull together data from systems that are so different, that are barely documented, and that are only sometimes reliable, and get out of that a reliable view of the system's past, of the Earth's past? The first problem you have to overcome is uneven data distribution.
You may have hundreds of tree ring records from North America, but only a few crucial ice core records from the Arctic, and also a few coral records from the tropics. So you pick a year, you have hundreds of values from different proxies and locations.
but most of it's tree rings. If you just toss all this raw data into a model, the tree rings would dominate, skewing the results to reflect only the mid-latitude forests and ignoring all this vital polar and ocean data. That's not ideal for a global temperature view. So before we put together a model, we need to pre-process the data. We have to transform that chaotic mix of raw proxy measurements into a smaller, more structured, and more representative set of features.
We do this with principal component analysis. It works like this. Imagine, again, that you're monitoring a massive microservice deployment. You've got hundreds, maybe thousands of metrics streaming in: CPU load, memory usage, request latency, error counts, database connections, for every single service instance.
So at one moment, you capture a snapshot. You've got 500 CPU metrics from your web tier. You have 10 latency metrics from your database cluster, five error rate metrics from your authentication service. So you have 515 numbers describing your system state at one particular moment in time. But looking at all 515 of those raw numbers is overwhelming and not helpful. And many of those 500 CPU metrics are probably telling you the exact same thing.
If the cluster is under heavy load, most of these CPUs will be high. In other words, they're all highly correlated variables. And you don't necessarily care about tiny variations between CPU 101 and CPU 102. You care about the overall pattern of the load on that web tier.
So principal component analysis, PCA, is the algorithm that spots these patterns or themes in your sea of metrics. It would scan all 500 CPU metrics and say, the biggest variation is here. The main signal is whether the whole group is generally high or low. And we'll call that PC1, principal component one, for the web tier.
It might capture another pattern, like front-end servers are busy but back-end servers are idle, as principal component two, PC2. PCA creates these new synthetic variables, principal components, which are each made up of mixes of the underlying metrics.
The cool thing about principal component analysis is it figures out patterns without needing to know what's what. It's an unsupervised learning method that extracts correlated information from the data. And crucially, these principal components are ordered by how much of the total information in the original data they explain. And each principal component is uncorrelated with the others.
So back to the climate data. For a given year, you have these 500 tree ring measurements and a few ice cores and coral values. Instead of tossing all 500 noisy correlated tree ring values in the main model, you first extract the principal components. PCA finds the main shared patterns of tree growth across that network. The first few principal components might capture 80 or 90% of the meaningful variation.
The first component could literally represent the overall good growing conditions of the season, while the 100th might just reflect something like rainfall in one very small area of North America. PCA allows you to zero in on the big consistent patterns in tree growth, cutting through the noise of the individual trees.
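If you want to see the mechanics without any real climate data, here's a toy version in Python using scikit-learn: a few hundred synthetic tree ring series that all share one regional growth signal, reduced to a handful of principal components. Everything here is made up for illustration.

```python
# Toy PCA step: many noisy, correlated "tree ring" series sharing one
# regional growth signal, reduced to a few principal components.
# The data is entirely synthetic.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n_years, n_trees = 500, 200

regional_signal = np.cumsum(rng.normal(0, 0.1, n_years))   # shared growth pattern
sensitivity = rng.uniform(0.5, 1.5, n_trees)               # each tree tracks it differently
rings = regional_signal[:, None] * sensitivity + rng.normal(0, 1.0, (n_years, n_trees))

pca = PCA(n_components=5)
pcs = pca.fit_transform(rings)        # rows are years, columns are PC1..PC5

print(pca.explained_variance_ratio_)                   # PC1 carries most of the variance
print(np.corrcoef(pcs[:, 0], regional_signal)[0, 1])   # PC1 tracks the shared signal (up to sign)
```

The individual trees are mostly noise, but PC1 comes out almost perfectly correlated with the shared regional signal, which is the whole point of the step.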
So PCA doesn't give us a final temperature map from our tree rings. Instead, it gives us a neat, simplified data set. Gives us just a couple of data points to look at. And the cool thing is, it's all here in the data leak. While many climate models mix various metrics together for the most accuracy, our briffa_sep98 file is just based on tree ring data.
And if you look around, it's not too hard to find the PCA file. It's in documents, under osborn-tree6, in a file that starts with pca. It's another IDL file. But getting that data ready for principal component analysis is no small feat either, because there's another file in osborn-tree6, rd_all_mxd1.pro, that does a lot of the heavy lifting to process this raw data.
It's nice though that it's all here. Now that I kind of am starting to understand IDL and how these climate models work, I can look through the files and kind of see what they're doing. So now that we've got our refined proxy features for each year, we can focus on calibration. Calibration depends upon the overlap in time between when we have actual temperature readings and when we have tree core measurements.
In our data, this overlap period is from 1856 to 1990. That's when our tree rings overlap with temperature data. Although that's not quite true, and you'll see why as we go. But yeah, that is the period where we have both processed proxy features and reliable thermometer temperatures. That overlap is our ground truth for our climate model.
We're building a statistical model to link patterns in our proxy features with those in the known temperature records from this overlap period. Think of it like training a machine learning model.
I mean, in this case, it's actually not a machine learning model. It's more like simple statistics. But the idea is the same. You give it the processed proxy features as inputs and the instrumental temperatures as known outputs. The algorithm figures out the complex correlations and the weights, the best way to basically map from those inputs to the output temperatures during that time period. In our data leak, this process is done alongside the principal component analysis.
Ian Harris, known as Harry throughout this leak, checks the principal components that are extracted against rainfall records. Rainfall is the strongest non-temperature signal that we have records for. This lets him extract the temperature component, which is the non-rainfall component, which is then used in the graph in the briffa_sep98 file in question. Now here's where it gets interesting. I feel like this is the part that the skeptics missed.
Harry calibrated his statistical model using the overlapping data, and the PCA helped him pull out the signal. So when you feed your trained model only the proxy data from before the thermometers existed, from like 1000 AD to the start of our measurement era, the model, using the relationships it learned during calibration, gives its best estimate of the temperature for those years.
And just like that, you have a curve stretching back centuries showing the estimated ups and downs of the past temperature. You might ask, as I did, and I had to look into this, how can you have tree rings that go back to 1000 AD? Well, this tree ring data set is the MXD data set
And it actually uses live, very old trees, but also dead preserved trees that can be exactly dated. And they can be exactly dated via their correlation to live trees. It's more detective work, but basically high altitude, very old dead wood can be found and can be precisely dated. But yeah, building and running the algorithm is just the start. The next question is, does this work?
Is this reconstruction solid or did we just create a complicated statistical illusion? That's where the verification step comes in. The verification step uses holdout validation. Remember that overlap period where we had both proxy data and thermometer readings? Instead of using all of that to train the statistical model, you deliberately hold back a chunk of the thermometer data and then you can test against that to see if your model is working.
If the reconstruction can successfully predict the temperatures in the period that you held out, it boosts your confidence that the relationships it learned are real. It's like using a separate validation data set in machine learning. Model validation is the key. And we have a lot of files in this data leak, calpr, bantemp.pro, calibratebantemp.pro, and so on, many files all aimed at validating the data.
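Here's the shape of that calibrate-then-verify step as a minimal Python sketch, using made-up numbers rather than the actual CRU files: fit a simple regression from a proxy-derived feature to temperature over part of the overlap period, then score it on the thermometer years that were held back.

```python
# Calibrate-then-verify on synthetic data: regress temperature on a proxy
# feature over part of the overlap period, then test on the held-out years.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
years = np.arange(1856, 1991)
temp = 0.005 * (years - 1856) + rng.normal(0, 0.1, years.size)   # fake instrumental record
pc1 = 0.8 * temp + rng.normal(0, 0.1, years.size)                # fake proxy feature tracking it

calibrate = years >= 1911        # calibrate on the later part of the overlap...
verify = ~calibrate              # ...and verify on the held-out 1856-1910 block

model = LinearRegression().fit(pc1[calibrate].reshape(-1, 1), temp[calibrate])
pred = model.predict(pc1[verify].reshape(-1, 1))

rmse = np.sqrt(np.mean((pred - temp[verify]) ** 2))
print(f"held-out RMSE: {rmse:.3f} degrees")
```

A small held-out error tells you the relationship the model learned isn't just an artifact of the calibration window.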
And it's actually in this validation step that we find the answer to hide the decline. The controversial phrase that led to the reporting that climate scientists were hiding the truth. But before we dive into those emails and what hiding the decline is, there's another layer to consider. Because the past climate data isn't just about pinning down a single global temperature.
It's a complicated web. The Earth's climate isn't a simple thermostat that slowly goes up or down. It's a chaotic system that's fluctuating on multiple timescales that are all layered on top of each other.
You have events like El Nino and La Nina that pop up every few years and they warm or cool big parts of the Pacific and shake up weather patterns around the world. You have big volcanic eruptions that send aerosols into the atmosphere and these particles reflect sunlight and they cause global cooling for a year or two. And that's just two of the timescales at play. There's many more. And the big challenge for climate scientists is pulling apart all these overlapping signals.
It's much more complicated than just a global yearly average temperature. But okay, all right, we've circled back. Hopefully you made it through all my background. With all that proxy data, with all those proxies, and with all that data complexity in mind, let's tackle these infamous phrases. Let's break them down.
First, let's break down Mike's nature trick. This sparked huge controversy, right? Was Mike Mann publishing something incorrect? Was he hiding things? Then we'll cover hide the decline, the so-called smoking gun that caused ABC and CBC and the New York Times and the Washington Post all to accuse climate scientists of misleading the public.
But yeah, first, Mike's nature trick. Mike Mann is the man behind the iconic climate change graph. He's the one behind the original hockey stick graph, the one from Al Gore's Inconvenient Truth. And while Mike's nature trick sounds like something from a spy novel, it's not about secret manipulation. It's about taking all this complex data and turning it into a simple graph.
Mike had these reconstructions, right? The proxy data and what it implied about past temperatures. And he also had real temperature data, thermometer readings, the straightforward stuff where no crazy stats are needed. You just check the thermometer. His trick was to put both types of data on one graph.
Mike used two separate lines, one for real measured temperatures from 1860 to today, and another for proxy temperatures reaching far back in time, which he also added error bars to. The proxy data is complex, but it's the real temperature data, the part that shoots up like the hockey stick's blade, that gives the graph its punch.
The thing is, that blade was never in doubt. It's just the yearly average temperature. Any weather station could tell you that. Now, the folks at the CRU made a somewhat intentional, misleading choice. Instead of using two separate lines, they combined them into one line, the instrumental record and the reconstruction. Now, climatologists would understand that when the line hits modern times and the error bars go to zero, it's showing real data and not a proxy estimate.
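To make that concrete, here's a toy sketch of the graphing choice, again in Python with made-up numbers: the same two series plotted as two labeled lines, versus spliced into a single line where the instrumental values simply overwrite the proxy values wherever the two overlap.

```python
# Toy version of the graphing choice: two labeled lines versus one spliced
# line. All of the data here is invented, purely for illustration.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
years = np.arange(1000, 2000)
proxy = rng.normal(0.0, 0.05, years.size)            # fake, roughly flat reconstruction
instr_years = np.arange(1860, 2000)
instr = 0.6 * (instr_years - 1860) / 140 + rng.normal(0, 0.02, instr_years.size)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# Option 1: two clearly labeled lines
ax1.plot(years, proxy, label="proxy reconstruction")
ax1.plot(instr_years, instr, label="instrumental record")
ax1.set_title("two lines")
ax1.legend()

# Option 2: one spliced line, instrumental values overwrite the proxy
# wherever the two overlap. Visually cleaner, but easier to misread.
spliced = proxy.copy()
spliced[years >= 1860] = np.interp(years[years >= 1860], instr_years, instr)
ax2.plot(years, spliced)
ax2.set_title("one spliced line")

plt.show()
```

The spliced version looks tidier, but unless you already know the convention, nothing on the chart tells you where the proxies end and the thermometers begin.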
But not everybody would get that. So that is a little bit misleading, but there's no lies involved. But the real kicker, the real thing that upset people, was the emails that said, hide the decline. You know, you would have a cold winter or a snowstorm and politicians would show up trying to cast a suspicious light on global warming with snowballs. Where's global warming now? So when somebody said, hide the decline, they're like, yes, yes,
I get it. They were hiding the fact that it's actually getting cold. But as I said, it's easy to verify that the world wasn't getting colder. The world was in fact warming. 1998, the year this data came from, was the hottest year on record at the time. So here's the deal. Hiding the decline wasn't about covering up a drop in global temperatures. It was about a decision to leave out unreliable post-1960s data.
You see, for centuries, tree ring data matched up well with temperature. Warmer conditions meant denser wood formed late in the growing season. But around 1960, this relationship broke down.
This is known as the divergence problem, and it does seem like a real issue. We have this tree ring data being used as a proxy to project backwards and tell us the temperature a thousand years back, and yet it doesn't even work in known periods, like from 1960 to present. How solid is our past reconstruction if these proxies seem flawed?
And here's the thing. I actually found an answer for that. Me, just somebody who downloaded this data leak, started poking through, and read a book or two to fill in some information. I figured it out. It was pretty exciting for me. And it involved reading lots of this IDL code. But first, before I share what I found, I want to say, you know, that questioning this data, looking carefully at this code, even if I assume that climate change is a given,
it's still a good thing. It's not anti-science to check their work. Critical examination, that impulse that I feel to look closer, is a vital thing, even when it's uncomfortable. No field is immune to bad intentions. Sometimes even foundational work warrants a second look. Somebody needs to check it. And a big reminder of this is a major ongoing investigation in a completely different field, Alzheimer's research.
So before I tell you what I found in the data, let me tell you about Alzheimer's research. In it, the dominant theory for decades was the amyloid hypothesis. It's the idea that these sticky amyloid-beta plaques in the brain were what caused the disease.
In 2006, Sylvain Lesné and his team published a paper in Nature that seemed to back the amyloid hypothesis. They identified this protein, A-beta-star-56, and suggested that it caused memory issues in rats. And this paper became a cornerstone. It was cited thousands of times. And it ended up directing billions of dollars in research funding and drug development towards amyloid,
targeting these amyloid plaques. But over the years, things didn't quite add up. Top Alzheimer's labs tried to replicate his findings, but often they couldn't do it consistently. And that was a big warning sign. And yet, some labs managed to replicate the results, and those led to more research, and then there was drug development based on those findings. Then enters Matthew Schrag. He wasn't digging through emails or private messages. He wasn't trying to read IDL files like me.
He was focused on the science. He was scrutinizing published papers in Alzheimer's research, and he spotted some anomalies, especially in the images included in the papers. It started with some offshoot papers, but the more he dug, the more it led back to Lesné's 2006 Nature paper. Basically, he was able to tell that the images were photoshopped.
Somebody had used a cloning tool, and you could see mismatched backgrounds or lines that appeared too clean. And this wasn't just online talk posted on a blog. No, Schrag's work fed into a major investigation that was published in Science magazine in 2022. It wasn't just misunderstood jargon or internal debates. In this case, it was actually the integrity of visual evidence in peer-reviewed studies.
It had a huge fallout. The fallout is actually still ongoing. Lesné's university launched an investigation. Nature issued a cautionary editor's note on the original paper. All these things feel pretty mild, but what's now known is that these results don't hold up. This was fraud. The process of retraction is messy and slow because no one wants to admit they've been chasing a lie. It's huge damage done to the field.
But there's also a chance for science to self-correct. Scientists are human, right? And some will cheat.
And Schrag's investigation shows the danger of, yeah, of a real error cascade. That 2006 paper wasn't just a study, right? It was a foundation. Thousands of studies built on it. Billions of funding followed. Patients took drugs based on faulty research, drugs that were costly, drugs that had side effects and that even led to deaths.
Drugs that ultimately failed to cure or help with Alzheimer's. An entire field pouring resources down a path that led nowhere, all because of some fraud in a key study. I just mention this because this investigation reminds us that the skepticism is vital. Questioning these findings, even influential ones, is crucial. This impulse to dig deeper is sound.
That's why I think I need to apply this skeptical spirit to ClimateGate and to this briffa_sep98_e.pro file. But yeah, I think we can now understand what's happening in that file. The startling comment that caused such a stir, applies a very artificial correction for the decline, followed by the fudge factor array: we can now explain what those are. At first glance,
you know, skeptics like Eric Raymond said that this was a siege cannon. It seemed super damaging. It looked like clear evidence of data manipulation to force that hockey stick shape. But now we know the decline is not about global temperatures dropping. It's about certain proxies, like the tree ring data, no longer being reliable indicators. Here's how I know. Here's what I found. Remember those calibration files I mentioned, like calibratebantemp.pro?
They're really crucial. When you run the whole process, PCA, correlate, and then validate, on this tree ring data, the predictions that come out are pretty noisy. There's something in the data, especially from the overlap period, that's causing noise and making the predictions inaccurate. So Harry, or the team, or whoever, dug into the data, and the issue became clear:
the post-1960 tree data. For centuries, these rings matched up with the temperature readings. Warmer summers meant denser rings, but after 1960, that link broke. The thermometer showed warming, but the rings suggested cooling. Something changed.
Something changed with how trees were growing on Earth. Maybe the extra CO2 from global warming. Maybe the trees just don't grow the same forever. Maybe pollution, maybe chemicals. We don't know. But the trees weren't matching predictions. But they found a way to work around this. They would skip the post-1960 data in the principal component analysis. By focusing in on the data before 1960,
they could better extract the signal. If they remove that 1960s data, they could better estimate the temperatures going backwards. So that gave them a better ability to project backwards. But it led to a problem, right? When they feed that data forward to the post-1960s, the model predicted lower temperatures. So if the global temperature was 14 degrees in 1972, the model would say 12.
They found a way to build a model that predicts past temperatures well, but shows a decline just as the world heats up. That is the divergence, right? That's the failure of the specific proxy data post-1960. That's the decline that they are hiding. The reason it diverges is because of the way they built the model to ignore whatever changed post-1960.
It's actually all in the leak. If you look through the calibration attempts, you can find them performing these experiments. They used the data from 1911 to 1960 to build the model and then verified it against data from 1856 to 1910. And that worked better than if they used 1911 to present day. This wasn't a secret. They in fact published a paper on the divergence problem in Nature back in 1998. It was a known issue.
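Here's a small synthetic demonstration of why that produces a "decline", in the same Python style as before. None of these numbers come from the real data; the proxy is just made to track temperature until 1960 and then drift low, the way the tree ring density series does.

```python
# Synthetic divergence demo: the proxy tracks temperature until 1960, then
# the relationship breaks down. A model calibrated only on pre-1960 data
# then "reconstructs" a post-1960 decline the thermometers never showed.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
years = np.arange(1856, 1995)
temp = 0.008 * (years - 1856) + rng.normal(0, 0.1, years.size)   # fake instrumental record

proxy = temp + rng.normal(0, 0.1, years.size)
late = years > 1960
proxy[late] -= 0.02 * (years[late] - 1960)      # the post-1960 breakdown

pre1960 = years <= 1960
model = LinearRegression().fit(proxy[pre1960].reshape(-1, 1), temp[pre1960])
recon = model.predict(proxy.reshape(-1, 1))

# After 1960 the reconstruction drifts below the (synthetic) real temperatures
print("mean post-1970 offset:", round(np.mean(recon[years >= 1970] - temp[years >= 1970]), 2))
```

The reconstruction is good everywhere the proxy behaves, and wrong in exactly the years where the proxy broke, which is why the sensible choice was to stop plotting it there and show the instrumental record instead.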
But it's fascinating to me that you can dive into the code and you can see how they derived this. It doesn't clear up everything, right? As I said, when a key proxy method goes wonky, just as we have better tools to check it, it does raise real questions about how reliable the method is. But the puzzle here is just about the limits of this specific proxy. It's not about a lie.
And then going further, if we look at our file, our briffa_sep98_e.pro, the file name is telling. The _e is actually some old-school version control. There are versions a through d as well. And these are all found in a personal folder named Harris Tree, for Ian Harry Harris, the programmer. And that fudge factor, those hard-coded numbers that look like a hockey stick graph,
in the context of the divergence problem, it's pretty clear what this is. This is actually Harry manually mixing in the instrumental data, the real-world temperature data. As I said, ideally you'd show these as two separate lines. Harry was just trying to manually hack the instrumental data into his graph. But here's the real kicker. This wasn't the code that was used for the paper.
In the leaked files, there's a whole different directory where the actual published data is. There's briffa_sep98_decline1 and decline2. These files are quite similar, but they tackle the divergence problem differently. They don't have a hard-coded fudge factor. They don't mention an artificial decline. Instead, they read the actual instrumental data from files. There's no fudge factor. There's just reading in the temperature,
and adding it to the graph. The actual methods used later just use temperature data from a real public source. So the core accusation that scientists were literally inventing numbers to fake warming doesn't hold up when you actually look at what the files were. It's also crucial to just zoom out and remember what this data set is. This is the CRU high latitude tree ring density data.
This is the stuff with the divergence problem. It's just one single thread in a vast tapestry of climate science. The overall conclusion that the earth is warming and that humans are the primary cause doesn't rest on this file, or in fact, on this leak. It comes from the convergence of many independent lines of evidence gathered and analyzed by all kinds of scientists worldwide.
In fact, the graph that Al Gore used was based on ice core samples, not this data at all. So there's no error cascade here. The CRU data matters, especially for reconstructing detailed temperature maps of the northern hemisphere land temperatures over the last millennium. But that's just a part of the story. The attackers who leaked these files and the bloggers who spread the story weren't actually doing a thorough review of the CRU's work. Perhaps that's not surprising.
Likely, they just ran keyword searches for terms like trick or hide or artificial. And in this massive dump of emails and files, they found some juicy snippets in one file that was never used for a published paper. And they took them out of context and claimed that they found a lie and that they found a conspiracy.
Here's where the Alzheimer's story stands out as being quite different, right? Matthew Schrag wasn't sifting through stolen emails for dirt. He was carefully examining published scientific evidence one by one. He was questioning its integrity through complicated visual analysis. This was skepticism aimed at the science itself, leading to potential corrections for the field. In fact, he did it because he wanted to get the field back on track.
Climategate was driven by a specific code file. It used out-of-context chatter. It used experimental code to target scientists and to sow doubt, rather than engaging with the full body of the published work. And in fact, it was timed for this all to happen right before the Copenhagen Climate Conference. So there's some pretty strong hints that there was a political agenda here. Find a lie, and then you can say that they're lying about everything.
But here's the cool part. Maybe the real story in the ClimateGate files isn't about conspiracy or fraud at all. Maybe it's about something far more mundane, yet I think profoundly important. The unglamorous, often frustrating reality of being a programmer trying to make sense of messy scientific data.
Because Ian Harry Harris, the CRU programmer whose name is on this folder, Harris Tree, left another file in the leak, a massive text document, 15,000 lines long, called HARRY_READ_ME.txt. And it's basically Harry's personal log stretching over the years, documenting his day-to-day struggles to maintain and update and debug these climate datasets and to work on the code that's used to process them.
And reading it is like, well, if you've ever worked on a legacy code base, or if you've ever tried to integrate data from dozens of different inconsistent sources, I think you can feel a deep sense of empathy for Harry. Harry wasn't writing about grand conspiracies. He was writing about the grind of data wrangling and the challenges of software archaeology. He writes about an Australian data set being a complete mess, about so many stations having been introduced that he can't keep track of them.
He complains a lot about Tim and Tim's code. And I assume that Tim is somebody who came before him and didn't sufficiently document what he did. Sometimes he just writes, oh, fuck this, all in caps. As in, oh, fuck this, it's Sunday night and I've worked all weekend, and just when I thought it was done, I'm hitting yet another problem that's based on the hopeless state of our databases. There's no data integrity. It's just a catalog of issues that keeps growing on and on. Reading Harry's log,
you don't see this cunning manipulator working to hide inconvenient truths. You see just an overworked programmer, likely under-resourced, grappling
with complex, messy, real-world data and imperfect legacy code. And he's just, he's doing his best to make sense of it all. He's dealing with inconsistent formats. He's dealing with missing values, undocumented changes. It's just the kind of stuff that data scientists and that legacy software engineers everywhere deal with daily.
And he leaves all these exasperated comments, and they don't sound like admissions of fraud, just like the slightly cynical remarks of someone deep in the trenches, doing the difficult work of climate science.
Maybe this is the real story of ClimateGate. It's not a scientific scandal, but a human one. A story about the immense and often invisible technical labor required to turn noisy observations into scientific understanding. And the pressures faced by those tasked with doing it, often without recognition or even the resources they need. And then after all that, they get attacked and their private work files become the hot topic on ABC News.
So where does all this leave us? After all the sound and the fury and the investigations and the accusations, I mean, what did Climategate really reveal? At first, the media jumped on this idea that this was a smoking gun. Nobody wanted to deal with global warming. I mean, nobody still wants to deal with it. Al Gore called it an inconvenient truth. So there was hope.
There was hope that it was all a mistake or a fraud, and people ran with that. Newspapers churned out stories of deception for weeks after the leak. And the investigations came much slower. There were eight official inquiries. Yes, eight. And all came to the same conclusion. No fraud. No scientific misconduct. Climate science's core findings stood firm. The hockey stick graph
could be debated for some of its statistical details. You can debate the limits of some of these proxies, but it's backed by many other studies that use different methods and use different data. The trick wasn't a deception, it was just a graphical choice. And hiding the decline wasn't hiding a global cooling trend, it was about dealing with a known issue. Climategate
wasn't proof that climate change was a hoax. It was more like a case study in how internal scientific discussions and informal language and experimental messy code can be twisted when leaked into a charged climate where people are looking to create doubt. If I were to take a lesson from the Climategate saga, it would be about the necessity of transparency in science, especially things like climate science.
What if from the start all the raw data and code and statistical methods were out there? What if they were publicly accessible to begin with? I imagine them on GitHub, ready for anyone to run and critique. And actually, as a result of all this, CRU now has the instrumental data available under an open government license. And while Eric Raymond's initial take on the code file is what caused a big stir, he was right about one thing.
Because he demanded that they open source the data. And I feel like that's a principle I can agree with him on. Climate science, with its global stakes and complexities, should embrace open source, should embrace open access as much as possible. Science isn't always neat. It's a human process full of debate and messy data and evolving methods. But...
Like software development, it gets stronger and more robust and more trustworthy when the process is open, when the data is shared, when the code is available for review. That's my takeaway from the whole affair. It's not about a conspiracy revealed, but a powerful argument for doing science in the open. We live in a world in which science is more than ever under attack and underfunded and being questioned and being politicized.
I think the best defense against that is to be open. That was the show. How many people made it this far? I don't know. Honestly, I started by diving into this Climategate code and it got more interesting as I went along, but I'm still pretty unsure about how interesting it is for others. There's like a lot of interesting tangents I went on that I had to cut as well, but I came away with one big idea. Climate science is kind of interesting. It's a little bit like data science.
Except in climate science you're dealing with messier data and you often have to gather it and label it yourself. But you get to work with a community that's striving for shared knowledge. ClimateGate makes it sound like it's all about global warming models and politics, but really it's more about diving deep into specific issues like how the layers of sediment in this certain data set can affect the feedback cycle in the Atlantic Ocean temperatures.
Harry's exasperated, cynical grievances notwithstanding, it actually sounds pretty interesting. But yeah, let me know what you think of this episode. And until next time, thank you so much for listening.