We hear all the time that ChatGPT and all of these different AI models are trained on an enormous corpus of text data. But what exactly is this data, right? A lot of people come up with different things like, oh, it's a third of the internet, or it's ninety percent of the internet, or eighty percent. Today on the podcast we're going to talk about what exactly is inside the black box known as ChatGPT. Google, Facebook, all of these different AI companies are training their AI models on data, and we're going to be talking about what exactly this data is, what's inside them, like what the actual websites are. This is an incredibly interesting podcast, in my opinion, and you're going to want to listen closely because, other than just talking about the specific websites, it's going to give you a really good idea of why ChatGPT and all these different AI models are really good at some things and not really good at others, because we're talking about what goes into them, and it's going to give you some ideas about why it says some of the things that it says. So first off, a lot of what we'll be talking about is coming from some research out of the Allen Institute for AI. They decided to dig a little bit into what's called Google's C4 dataset.
So it's essentially just a massive snapshot of content from fifteen million different websites that was used to construct some of the most high-profile English-language AIs, and that would be Google's T5 and Facebook's LLaMA. OpenAI doesn't disclose what datasets they are using to train the models backing ChatGPT, but we can assume it's some pretty similar things, and you'll see why in a little bit. Suffice it to say, this is the data that's in Google Bard and that's behind a lot of the Facebook products as well, and this is probably what's in ChatGPT as well. Now, that being said, I would say it's important to know that all of the content we're going to talk about today is still only a small amount of data. This is what was essentially scraped in April twenty nineteen by one company.
It's a nonprofit called Common Crawl. As for Google's C4 dataset, a lot of different AIs just use this giant dataset. And so I think that while it is really huge, you know, fifteen million websites with all of the content on them, and some people call it the giant dataset, it's still probably about forty times smaller than what GPT-3 was trained on.
So while this might seem big, it's actually still a lot smaller than GPT-3's data, and the assumption is that GPT-4 is much bigger than this, although they haven't actually released how many parameters are in GPT-4. So what's important to know in all of this is that one of the biggest content sections in this entire dataset is just business in general. Business and industrial websites made up about sixteen percent of the entire dataset, and the number one website in that was fool.com, which, if you don't know, is a financial investment kind of website.
So that is kind of interesting, as a lot of people are experimenting with different use cases for investments and other areas like that with ChatGPT. Not very far behind fool.com was kickstarter.com, which is crowdfunding, you know, raising money for different businesses and that kind of stuff.
Below that a little bit, they had Patreon, which was a pretty big chunk of that. If you don't know, it just helps creators collect monthly fees from subscribers for exclusive content. So it's actually pretty interesting that they were able to get Patreon on the list, with what people are essentially selling. Between Fool, Kickstarter, and Patreon, that's kind of three different industries online that process a lot of money. And so I'd be really curious to see whether the data gathered from there will be useful in generating business ideas in general.
Kickstarter and Patreon might give the AI access to a lot of different ideas for marketing and technology, and just a lot of really interesting ideas. The next biggest area behind finance that these models were trained on would appear to be the news. I think about half of the top ten overall in the entire dataset were news outlets. So the New York Times was number four, the LA Times was number six, the Guardian number seven, then Forbes, HuffPost, and the Washington Post. And so I think, like artists and creators, a lot of these news organizations have criticized tech companies for using their content without authorization or compensation, right? Essentially they are just scraping all of this news data, and a lot of these news companies are complaining about that.
So it's going to be interesting that all of that is getting sucked into this as well. I think they found that several different media outlets that rank low on NewsGuard's independent scale for trustworthiness showed up in there, and the Washington Post did a news article about this whole thing, and they specifically commented on this. This is also interesting, as a lot of these different trustworthiness or fact-checking organizations have come under scrutiny themselves. They've had a lot of flak.
I think overall online a lot of people don't really love the whole fact-checking going on on posts. I know on Twitter that's been replaced with Community Notes. So it just allows anyone to go and, you know, if a community contributor is ranked high enough, then they'll be able to post a high-quality link disputing a specific claim, which is kind of interesting because that, in a sense, democratizes it. I think in the past a lot of fact-checking websites were selected essentially by Reuters or other organizations, and people complained about who got to pick them. So I think democratizing it a little bit was good, in any case.
The Washington Post wasn't happy that the Russian state-backed news site RT.com was included in the list of media outlets, and they were also complaining that Breitbart.com, which is a right-wing news and opinion website, was on there. And it's going to be interesting. Regardless of anyone's political opinions, right or left or whatever, I think it's really important to have news from different perspectives. And all this kind of content is kind of a big debate in AI. A lot of opinion pieces, in the Washington Post specifically, talk about, you know, why would we have untrustworthy training data put into this that's going to propagate bias, propaganda, misinformation, and that sort of stuff?
I actually think it's pretty important to have a wide variety of opinions, false and real, that this is trained on, because these represent the opinions of a wide range of people in the world. And anyone that says their opinions or their biases, because everyone has biases, are the exclusive right ones lacks a lot of perspective, obviously, because people have a lot of different opinions and a lot of different perspectives. So I think it's important to kind of encapsulate all that, and you can pick what you believe or what you don't believe. But I saw that the Washington Post, in that same article, was complaining that there's a list, essentially, of words that get blacklisted.
So if one of these words is in an article, it's not supposed to get added to the training data. One of those words was swastika, obviously a reference to Nazis and Hitler and all that kind of bad stuff. But they were complaining that the word swastika still showed up in this giant training dataset over seventy five thousand times, even though it was a blacklisted word. And that kind of got me thinking that it's not a very good idea to have completely blacklisted words. Even though obviously the swastika represents a political party that did a lot of horrible things in the world, why would we want to remove that word, right? Why wouldn't we just want to say, you know, swastikas and Nazi Germany are bad? It's something that happened in history, right? We can't just erase that and hope that the AI never talks about it. I think it actually could cause more harm than good, because it's important to have these AI models ingest words that might be deemed bad, and then you can train them and tell them that obviously swastikas and Nazism and killing people are horrible.
But I think it's important that that's in there, because it needs to know all of those different concepts. I get really nervous looking at how AI models are trained when people are trying to put any sort of bias on the model, or remove different segments, or remove different blacklisted words. It just doesn't feel very good. It feels like censorship.
And I think it's fine to have all the content in there, and then people can choose what they believe. And you could put safeguards in there and say X, Y, Z topics are bad. And I think that's probably what Google and OpenAI are doing here, contrary to the complaints of the journalists, by including words like swastika in there that obviously could be a red flag.
But I'm assuming, or I would hope, that they have worked this in by referencing that, you know, obviously the swastika is a symbol that represents something not good, right? But I do not think you want to just pull it out completely, because then the model doesn't know anything about a very important atrocity that happened in the world. So I don't see why you would want to remove that completely.
In any case, that's my two cents there. I think one of the other really big areas that is in this is religious sites. I think about five percent of all the content was religious sites.
Obviously, this makes sense. There are over a billion Muslims in the world. There are over a billion Christians in the world. These are, you know, a very large percentage of the population. So I don't think that's shocking, despite some commentators having different opinions on that.
In any case, I also think that's really useful for anyone that would like to learn about different cultures or religions or people, to have all of that different content on there. So it would seem like, it being a big part of the world, it would be a good thing to have incorporated into these AI models. Another area that seems to have made up a pretty big chunk of this, fifteen percent, is personal blogs.
So it's actually the second largest category. Sorry, what did I say it was before? I said news. The news section is number three. Second biggest is personal blogs. And oh, the one other really interesting thing about this that I forgot to mention at the beginning is that thirty percent of all the content that is in this giant dataset that's used for so many different AI models is currently not available online anymore.
Meaning those were websites that expired. They got taken down, people removed them, changed the URL, whatever it is. And the reason why I bring this up, and why I think this is so important, is because, if you think about it, for Google, which collected this giant dataset, thirty percent of all the content they collected is now essentially exclusive to them, right? Thirty percent of it is gone now.
So they are the only ones with it. And it's going to be interesting, because people are talking about claiming rights to their data. Reddit recently said they're going to start charging companies to train models off of Reddit, which was a really big part of OpenAI's dataset for ChatGPT. If companies like Reddit are going to start charging, this data is obviously getting more valuable.
And now Google has this massive, I'd call it a black hole of data. Thirty percent of all the data is exclusive to them because it's gone off the internet, but they have access to it. So that's really interesting. I wonder how valuable that is, because no one else will have access to that data. And if it's off the internet, probably no one will ever claim copyright or ownership of it down the line in the future.
Another really interesting thing is that, it has been said, a lot of these AI models, when they're categorizing all of the data that they're collecting, are not really categorizing the authors and different things like that. I'm kind of worried about personal data that's getting sucked in, which we'll talk about in a second, because there is a lot of personal data that has been sucked into these models, and I'm curious how that's being used or pulled out or scraped or integrated. But in any case, as we were saying, the number two biggest chunk of this entire thing is personal blogs, and that includes a lot of different platforms like sites.google.com, which can be anything from, you know, a Catholic high school in New Jersey to some club in New York, whatever.
So these are all just random things, and that's a really big chunk of it. I think more than half a million personal blogs were pulled into that, which represents about four percent of the total categorized tokens, so of the actual text. It's fifteen percent of all the sites, but four percent of the actual text that was input to train things on. So it's pretty interesting. A lot of this is WordPress, Tumblr, Blogspot, and LiveJournal.
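To make the difference between "share of sites" and "share of text" concrete, here is a minimal sketch. The site names and token counts are made-up toy numbers, not real C4 figures, and `category_shares` is a hypothetical helper; the only point is that a category can be a large fraction of the sites and still a small fraction of the tokens.

```python
from collections import defaultdict

# Hypothetical toy data: (site, category, token_count). Not real C4 figures.
SITES = [
    ("blog-a.example", "personal blog", 200),
    ("blog-b.example", "personal blog", 300),
    ("blog-c.example", "personal blog", 500),
    ("news.example", "news", 9000),
    ("patents.example", "patents", 15000),
]

def category_shares(sites):
    """Return {category: (share_of_sites, share_of_tokens)} as fractions."""
    site_counts = defaultdict(int)
    token_counts = defaultdict(int)
    for _site, category, tokens in sites:
        site_counts[category] += 1
        token_counts[category] += tokens
    total_sites = len(sites)
    total_tokens = sum(token_counts.values())
    return {
        category: (
            site_counts[category] / total_sites,
            token_counts[category] / total_tokens,
        )
        for category in site_counts
    }

shares = category_shares(SITES)
for category, (site_share, token_share) in shares.items():
    print(f"{category}: {site_share:.0%} of sites, {token_share:.0%} of tokens")
```

In this toy example the personal blogs are most of the sites but only a few percent of the tokens, because each blog contributes very little text compared with the large sites.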
And I think this is a really important chunk, because it really gives a lot of perspective on people's feelings and thoughts around a vast array of different topics. I think that's a really valuable part to have. Like I was talking about before, a lot of companies like Google really heavily filter the data before feeding it to AI. C4, which is what the dataset is called, stands for Colossal Clean Crawled Corpus.
So in addition to removing all the duplicate text out of the whole thing, so it's not training on the same text twice, like I mentioned earlier, Google also uses a list of what they call dirty, naughty, obscene, or otherwise bad words, which is about four hundred and two words in English and one emoji. Companies typically use high-quality datasets to fine-tune the models, and essentially they're just trying to shield the users from unwanted content. ChatGPT does that too. So I think it pulls out a lot of that kind of stuff. And like we mentioned earlier, there's a lot of controversy that goes around that, I would say, from both sides of the political spectrum.
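The two cleaning steps described here, dropping duplicate text and dropping pages that contain a blocklisted word, can be sketched roughly as follows. This is a simplified illustration, not Google's actual pipeline: it only removes exact duplicate pages after whitespace and case normalization, the `BLOCKLIST` entries are placeholders rather than words from the real list, and `clean_corpus` is a hypothetical name.

```python
import re

# Placeholder entries standing in for the real word list (assumption).
BLOCKLIST = {"badword", "worseword"}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def clean_corpus(pages, blocklist=BLOCKLIST):
    """Drop exact duplicates, then drop any page containing a blocklisted word."""
    seen = set()
    kept = []
    for page in pages:
        # Normalize whitespace and case so trivially different copies match.
        normalized = " ".join(page.split()).lower()
        if normalized in seen:
            continue  # duplicate of a page already processed
        seen.add(normalized)
        if any(word in blocklist for word in tokenize(page)):
            continue  # the whole page is discarded, not just the word
        kept.append(page)
    return kept

pages = [
    "A page about patents.",
    "A  page about patents.",  # same text with extra whitespace
    "A history essay that mentions badword once.",
    "A personal blog post.",
]
cleaned = clean_corpus(pages)
print(cleaned)
```

Note that in this sketch a single occurrence of a listed word removes the entire page, whatever the surrounding context is, which is the kind of trade-off the controversy is about.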
I saw the Washington Post recently was criticizing the fact that, you know, obviously they were happy that it pulled out racial slurs and obscenities, but they said they were disappointed that it eliminated some non-sexual LGBTQ content by pulling out obscenities. So they were complaining about that, and then the other gripe is that it includes the word swastika. So it would seem like they would like it to exclude fewer things for obscenities and more things related to the swastika. And I just think that, at the end of the day, probably the less bias or the less fine-tuning, the better. We'll let the users do that. In my opinion, I think those are going to be the companies that win in the end.
And it would appear that Google and OpenAI are trying to stay true to that. Sam Altman has discussed really letting people use this the way they want, obviously not making this a horrible racist hellscape, but making it so it's safe but has a lot of variety of opinions and thoughts, et cetera, on there. So I'm sure we've talked about that enough, but in any case, it's really interesting.
If you look at the Washington Post, check out the article that they came out with today, where you can look and see if your website was part of this AI training data. It shows you the top sites, with patents.google.com as the number one website that was used here. So that's all of the patents that have ever been written about anything, which is really, really interesting, and that is not just in America but patent documents from all over the world. So it's pretty interesting that all of that has been pulled in there, a lot of really cutting-edge stuff as well, which makes me think you could probably ask it, if I was Apple and wanted to write a patent for X, Y, Z, what would I do? That's a really interesting topic.
I think patents might be like a treasure, might be a gold nugget from some of this research, the fact that patents are the biggest dataset in there, and you could get a lot of content and ideas out of that. The second is wikipedia.org. Obviously, Wikipedia is a ton of information, open source, kind of the perfect dataset, to be honest, for training.
The third is called Scribd, which is essentially audiobooks and digital books. So just a lot of the content of books that have been written. Number four is the New York Times. Then they have journals.plos.org, which is science and health.
Then they've got the LA Times, the Guardian, Forbes, and HuffPost. Number ten was patents.com, so just more patents in there. Number twelve was Coursera, which is interesting.
I would be curious to see exactly what that entails. But Coursera has a lot of courses teaching you about a lot of different topics, so that might be why ChatGPT is really pretty good at teaching things. Then there's fool.com for business and industrial, and there's a handful of others.
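A ranking of top sites like this one can be produced by adding up how much text each domain contributes and sorting. Here is a minimal sketch of that idea, under the assumption that the ranking is by amount of text per domain. The URLs and text are invented examples, the whitespace word count is a stand-in for a real tokenizer, and `rank_domains` is a hypothetical helper.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical documents as (url, text) pairs. Not real C4 records.
DOCS = [
    ("https://patents.example/doc/1", "one two three four five six"),
    ("https://www.encyclopedia.example/wiki/A", "one two three four"),
    ("https://patents.example/doc/2", "one two three"),
    ("https://news.example/story", "one two"),
]

def rank_domains(docs):
    """Return [(domain, token_count), ...] sorted from most to fewest tokens."""
    totals = Counter()
    for url, text in docs:
        domain = urlparse(url).netloc.lower()
        if domain.startswith("www."):
            domain = domain[4:]  # treat www.site and site as the same domain
        # Whitespace word count as a rough stand-in for a tokenizer.
        totals[domain] += len(text.split())
    return totals.most_common()

ranking = rank_domains(DOCS)
for position, (domain, tokens) in enumerate(ranking, start=1):
    print(position, domain, tokens)
```

A domain with many long documents rises to the top of a ranking like this, which would be consistent with patent and encyclopedia sites placing so high.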
So it's really interesting to see what's going to happen in the future. Like I said, Reddit recently said that they are going to start charging to train on their data, and I'm expecting to see a lot of other websites do that. So I believe that the people that got in early may actually have a good advantage, in the sense that they were able to get all this data for free. ChatGPT, or OpenAI, trained off a lot of Twitter data before Twitter shut off the API.
And now we see that Elon is going to make an AI play, probably training off of a lot of Twitter data as well.
So it'll be really interesting to see what happens in the future in this whole industry. This was a long podcast today for me. Usually I try to do ten-minute bites, and we're at over twenty minutes. So I'll leave you guys here, but I hope you have a wonderful rest of your day.