Hello and welcome to another episode of App Stories. I'm John Voorhees and with me is Federico Viticci. How are you doing, Federico? Oh, hello, John. How are you? I'm doing very well. I want to say thank you right up front to our wonderful sponsor, Notion, and dive right into topics. But we do have a preliminary to mention to you folks. Okay.
The preliminary is that we have a feedback form now. This is 2025. It's time for a feedback form for App Stories and all the great shows on MacStories.net. If you go to macstories.net/podcasts,
you'll see a wonderful page with artwork from all the shows, all six shows, as well as a link to a feedback form. And if you open that up, what you'll find is a Google form that allows you to pick a show and leave the hosts some feedback. So we here at App Stories would love to hear from people who are listening to the show. If you have...
ideas for episodes, if you have a comment about something we say on the show, whatever it happens to be, we'd love to hear from you. We'd like you to be nice. It's always good for people to be nice. Ideally. Politeness goes a long way, even if you disagree with us. But yeah, we'd love to hear from people. Yeah.
That being said, today we're taking another dive. It's been a few months and we're taking another dive into the crazy, wild, sometimes useful, sometimes useless world of LLMs, of large language models. And specifically, I wanted to sort of circle back to the episode that we did in late 2024 about...
you know, our relationship with assistive AI tools and take a broader look at how we use large language models in practice on a day-to-day basis for MacStories operations. And, as people, how do we use LLMs beyond MacStories? I mean, you see these reports of,
OpenAI having half a billion, with a B, that's 500 million weekly users of ChatGPT. And I mean, I think it's, you know, the thing I will say upfront is that it seems clear that
Products like ChatGPT, especially, they're not going anywhere anytime soon. When you are approaching those kinds of numbers, half a billion people, I saw some reports saying that OpenAI believes they can hit a billion weekly users by the end of 2025. I mean, when you're reaching that critical mass, it's quite clear that this is a product that
that people are going to be using for the foreseeable future. Yeah, those are like basically Facebook numbers because doesn't Facebook have roughly like a billion users?
Yeah, yeah. So it's clear that there's something about LLMs that is resonating with people that maybe goes beyond the workplace, that maybe goes beyond productivity, and maybe they're onto something else. So I thought, at least this is how I was sort of thinking about this episode, I've been thinking about the big three, sort of the big three LLMs that I've been using the most personally. I've been using Claude,
ChatGPT, and Google Gemini. And this relationship with these three has evolved multiple times over the course of the past five months or so, at least for me, to the point where I dip in and out constantly of these LLMs. You know, and obviously it doesn't help, or maybe it helps content-wise, that these companies are constantly one-upping each other. You know, this is, we've been talking about this, you and I, like,
Right now is a moment of the year with very little happening in terms of Apple news, especially after the delay of Apple Intelligence.
And, you know, developers, and especially indie developers, may be holding off until WWDC and this rumored redesign of iOS 19. All that to say there aren't that many new apps launching now. And it seems like most of the tech news cycle is dominated by AI news. And there's a lot of hype. There's a lot of grifters, as John likes to call them. There's a lot of clickbait. There's a lot of unjustified hype, I think, for a lot of these products.
But there is something also to the products. And that's sort of what I wanted to unpack with you. Yeah, I think that's a good way to frame it because, yeah, for us, I think it's been interesting, exciting in a way because it's hard to keep up with. And there's always something new and interesting around the corner. A little frustrating at times in the sense that you feel like you get settled into a product eventually.
And all of a sudden there's something that really demands your attention somewhere else, which is probably the opposite of productivity. However, I mean, I think that we have kind of settled around a series of tasks and things that we each use these for that
works for us and has real benefits, and not just, you know, slop generation or hype that doesn't really exist, because we've been trying a lot of it. What do you think? Maybe we should go like LLM by LLM? Do you think that's a good way to do it? Yeah, I think that's a good way to do it. And I think
You know, also as a preface to this discussion, I think what we've also been observing lately is this trend toward consolidation. By that, I'm referring to consolidation when it comes to certain features of the LLMs. Like, for example, how everybody is calling the feature where you can create a little HTML thing or a little document a canvas.
or how all these companies, they're calling the research modes deep research. It seems like there's consolidation in terms of the naming of the features and the features themselves. Obviously, the bigger sort of macro trend has been the rise of these reasoning models. So these models that take more time and more
compute to give you back an answer. So, you know, if you were to take a look at the landscape, you would have your regular models and your reasoning models. And then you have all these features and all of these companies, Google, OpenAI, Anthropic, even Meta, I think. They're all sort of settling on this naming scheme for their various features like deep research, like Canvas. They're basically all the same thing, but they're all in different LLMs.
Now, I wanted to start with Claude by Anthropic. This was my original sort of main LLM and main interest, and I'm still using it with you because we do have a team account. Right.
But as time has gone on, I feel like my relationship with Anthropic and Claude has been changing. And I think especially over the past couple of weeks, I think Anthropic is feeling the pressure more and more from the two sides, you know, the ChatGPT side and the Google Gemini side. Right.
So Anthropic, when I first started using Claude, it was 3.5 Sonnet last year. And there was something about Sonnet, you know, the way that Claude has its own personality. We mentioned this before: the way it talks, the way Claude is, you know, compared to others, still very good when it comes to CSS and style in general, both visual style and language style. Then
I was very conflicted when 3.7 Sonnet came out a few months ago. So 3.7 comes in two flavors, a regular model and a thinking model. So that would be the reasoning model that spends more time thinking about your requests and spends more time coming back with an output, with a response. Now, I've got to tell you, I've been testing these LLMs both for personal usage
and for research, for looking up facts or asking technical questions, but also for some personal vibe coding, so to speak, like making myself little scripts and little Obsidian plugins, that sort of stuff. I'm not a developer. I have not attempted to vibe code, as the kids say, like a full-on web app or mobile app, but I've done small things. And
My takeaway from 3.7 is that it tends to be a little too eager sometimes, especially the thinking version, in adding a lot of cruft: a lot of changes to your text or a lot of changes to your code. And, you know, I've also been doing a lot of reading,
trying to understand what's actually going on here. And it seems to be that the consensus is that 3.7 thinking, and reasoning models in general, tend to add a lot of text, because they're thinking, and that thinking process bleeds into the original request and basically overcomplicates everything.
It overcomplicates your code. It adds a whole bunch of stuff that's not supposed to be there. And then it sort of feeds back into your original request and everything becomes bloated. That seems to be the consensus now; you know, I've been trying to keep an eye on what actual developers are saying. It seems to be that
if you are a developer and you're using Cursor or Windsurf or one of these modern IDEs with the assistance of AI, most developers these days are not using the thinking version of Claude. That being said, Claude still has... you know, Anthropic as a company, I think they're going for that market. They're going for the developer market. They're going for the enterprise market. Yes. And that is also, I think, highlighted by the latest integrations,
like Google Workspace, for example, which was announced a few days ago. So Claude can now plug into your organization's Google Drive, into your Gmail, into your Google Calendar.
And that brings me to the last two points. The first one is that Claude, I think compared to others, is very good when it comes to external tools. So the integration of tools, whether that is search, or research now, or MCP, which is a whole other beast that I think we'll talk about at some point. But basically that would be the integration with third-party external APIs or external commands. So that's the first advantage that Claude has,
you know, beyond the style and your general preference. But the second one, I think, is the team aspect. So the ability in Claude to create a project.
And a project essentially is a space where there's a common prompt and common reference knowledge. And you can share that space with somebody else. Like you and I, we share a bunch of Claude projects. And that feature, the way it's designed, I think it still has an advantage over competing products. What do you think about that?
Yeah, I think so. I mean, to me, that's one of the most valuable aspects of Claude, because if I come up with something, I can say, hey, Federico, I've got this thing, does this sound like something you'd be interested in? I can share it with you or vice versa. The thing that I find frustrating about Claude, though, is that
it has two modes you can kind of flip between when you're looking at projects: the projects that you've created just for yourself, and the projects that are part of your team. And if I create one that's just for me and then I want to share it with Federico, I can't just do that. They don't have a way to do that. I literally have to take the prompts and the information that I've uploaded and
recreate a whole new project that's shared and then delete my original individual one, which is kind of frustrating. But it is a nice feature. I mean, to give people some practical examples of the kinds of things that I've been using it for (some of these I don't share with Federico because he doesn't do them): I do things like reformatting Markdown into a YouTube-friendly format. I use Claude to generate tag clouds, basically, for our YouTube videos. And this one Federico helped me build into Claude: we upload an SRT file, I have historically done it with Claude, and get back promotional clip suggestions for videos. Now, we've moved that over to Gemini since then, but that's where it started, on Claude. And then we share some things like proofreading, because not only does it do a fairly good job, I think,
with proofreading in general, just giving it very specific instructions not to change text, but to only find grammatical or spelling errors. But we have uploaded the Apple Style Guide to that project so that things like capitalization of technical terms that are Apple-specific
are all highlighted as well. And it's very interesting to me: having used it for quite a while now, I find new things all the time that I didn't know were in the style guide. And some of them are just, like,
things that aren't even technical terms; they're just ways that Apple does stuff in its writing that I don't necessarily agree with and don't change. But it is good to kind of have that as part of the resource. The second thing that I use Claude for is coding; it is very good at it. And recently I created some Zapier Zaps.
And one of the things I wanted to do was I was building a spreadsheet and there was information being fed into the spreadsheet that I wanted to distribute via an email to a bunch of people on the Mac Stories team. And I wanted only the most recent data to be sent.
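A filter like that, keeping only the newest row, can be sketched as a small function. This is a hypothetical illustration, not the actual Zap: in a real Zapier Code step the rows would arrive via `inputData`, and the field names here are made up.

```javascript
// Hypothetical sketch of the "only send the most recent data" step.
// In a real Zapier Code step the rows would come in via inputData;
// here they are a plain array, and the field names are invented.
function latestRow(rows) {
  // keep the row whose date parses as the most recent
  return rows.reduce((newest, row) =>
    new Date(row.date) > new Date(newest.date) ? row : newest
  );
}

const rows = [
  { date: "2025-04-10", signups: 42 },
  { date: "2025-04-14", signups: 57 },
  { date: "2025-04-12", signups: 49 },
];

console.log(latestRow(rows).date); // prints 2025-04-14, the newest entry
```

The actual code Claude and the Zapier AI produced was more involved, but this is the shape of the problem being solved.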
And it turned out the best way to do that in Zapier for exactly what I wanted was some JavaScript. And I know nothing about JavaScript. And I was working with the AI that's built into Zapier, which is also very good because it has...
It's pretty good. Yeah, it has deep knowledge about all of the functions that are built into the product. So it's very good at guiding you through how to create an automation. But it was having some trouble coming up with the JavaScript. So I went back to Claude,
which kind of pointed me in the right direction. It didn't get it 100% right either, but taking what I got from Claude, putting it into Zapier, and asking the Zapier AI to reformat it specifically for the Zapier
product, it came up with something that worked. It took a little bit of messing around to get that up and running, but it helped a lot. And having read it, because it was thoroughly commented code, I could understand what was going on, what kind of variables were being used, and what kind of functions were being called. And it has turned out to be, you know, a useful tool for that as well. Oh, you know what, Federico, I should mention real quick something about Claude that you don't have yet, which is the web integration.
I was going to say that you are in the U.S. and you do have access to the latest web stuff. Yes, yes. So currently I'm actually on two different subscriptions to Claude because I have a personal one that I had before we had our team account. And the personal one is a U.S. account and has access to the web integration. And while it's not as good as Gemini, I think Gemini is...
And we'll get to Gemini in a minute. It's a good place to start with a lot of research in certain contexts. But the thing that can be frustrating about Claude is that it historically had zero
knowledge of what's going on on the internet. There was no way to pull any information into a request that way. And even if it's not as good as Gemini, just having a little bit of the ability to do things like take a list of apps and link them in Markdown format, it can pull in information from the web and combine it with some of the text things that I've been talking about that work so well with Claude
and bring them together in a way that's a lot more useful. So I think you're going to really like the web integration when it comes out, Federico. And I'm sure they'll continue to refine it. It's not nearly as good as Gemini yet, but just having any web integration in Claude has made a big difference. Yeah, yeah. I feel like with Claude, the biggest limiting factors now,
some of the reasons why I feel like Anthropic is feeling the pressure on both sides, there's a bunch, actually. Claude continues to be very limited when it comes to the context window. So if you're a power user and you want to do the kinds of things that I've been doing, like, you know,
feed a long transcript of a long YouTube video, or even index the contents of my Obsidian Vault, or even search the entire archive of my iOS reviews, which are things that I can do with Gemini, that I can do with GPT. We'll talk about that in a few minutes because it's a very recent change.
Claude continues to be limited to 200,000 input tokens, compared to the 2 million that you can do with Gemini 2.5 Pro or the 1 million that you can now do with GPT-4.1. So the context window and the output window continue to be a problem. The cost
is obviously another concern. Claude continues to be very expensive compared to other models. And the reliability of the service: just anecdotally speaking, I've had so many API outages,
or outages when using the Claude app, the Claude interface itself, you know, constantly running into limits even though we have a team plan. Or having an API outage error when I was using the llm command-line tool on my Mac. That sort of stuff doesn't exactly inspire a lot of confidence. And the other part is that Anthropic, you know, compared to these other companies, they tend to have a slower response.
Maybe you can say more methodical, maybe you can say safer, but the reality is that they're just slower compared to Google and OpenAI. And so, for example, they roll out web search, and I cannot use it, even though I'm a paid customer, just because they have decided that it's limited to the US and the UK. They cannot launch in the EU. This is a
common problem with OpenAI and some of the latest features too, but I feel like Anthropic has been the slowest of all these companies in terms of international rollouts. And now they have even
launched their super expensive plans, with the Max tier that is like $100 or $200 per month if you want to have access to the highest possible tier. But even if you do pay for those tiers, you still don't have a bigger context window, for example. So depending on the sort of intensive workflow that you want to have with Claude, maybe that's not for you. And it kind of,
honestly, kind of sucks, because I do like the way Claude talks, and the style, and I do like the whole safe approach of Anthropic. I read all the research that they publish. I think they're doing a really good job, and they have a lot of really smart people working there. I just wish they were cheaper and faster. If OpenAI or Google got a team plan with projects, how quickly would you jump ship?
I was going to say, I'm just kind of watching because I kind of feel we're at the point that if we got similar features on one of the other ones, we would probably switch. Yeah, yeah, yeah. I agree. I think...
It'll be interesting to see. I think Anthropic, if they play their cards right, they have sort of cornered the developer and enterprise market so far. But they have this very, very strong competition coming from Google and Gemini 2.5 Pro.
And OpenAI, as shown with GPT-4.1 (again, we'll get to that in a minute), they also want to play ball in that field. So I think Anthropic has a chance to differentiate with things like external tools and integrations. I think they've done a really good job with MCP,
which is the Model Context Protocol. We'll do another episode about that, maybe. And also these native integrations that they have with Google, that they have with GitHub, for example. They have their Claude Code integration if you're a developer on macOS. So I think they should lean into
that aspect and invest in that aspect of Claude even more, because it's exactly where Gemini and ChatGPT are lacking. ChatGPT basically doesn't have any third-party integrations, and Gemini basically only has, like...
built-in Google-only integrations. So maybe that's an aspect that Anthropic could consider as a strong differentiator in the future. And Google, too. The funny thing about Gemini is I think they're very inconsistent in their integration of their own products across different things. Like you have, for instance...
the ability to generate audio from prompts in certain places in Gemini, but it's different than the audio that's generated if you use NotebookLM, for instance. And you can get to a lot of these tools through the apps themselves, whether you're in Google Drive or Google Sheets. They have all these big buttons all over the place now where they want you to summarize a folder in Google Drive or analyze a spreadsheet or something. And...
I haven't really used much of that yet. I mostly just use Gemini, the chatbot at this point. This episode of App Stories is brought to you by Notion. There is no shortage of helpful AI tools out there, but using them means switching back and forth between another digital tool. So instead of simplifying your workflow, it becomes more complicated. Unless, of course, you're using Notion.
I've been using Notion on and off for a long time. There are a lot of things to love about Notion. I think its biggest advantage is that there's so much that you can do within it. It's incredibly flexible. And when you consider that now they're offering email services along with a calendar, and you can track your tasks inside it, take notes.
create documentation, you can get basically all of the information that comprises your personal or work life in one place. And when you have that all in one place like that, it's fantastic because you can use Notion AI to quickly summarize things, pull those tasks out of meeting notes,
and perform all sorts of other magic, finding connections between ideas and the information tucked away in your Notion documents. Notion combines your notes, documents, and projects all in one place that's simple and beautifully designed.
It's your one place to connect with teams, tools, and knowledge so you're empowered to do your most meaningful work. And the fully integrated Notion AI helps you work faster, write better, and think bigger, doing tasks that normally would take you hours in just seconds.
Notion is used by over half of the Fortune 500 companies, and those teams send less email, cancel more meetings, and save time searching for work. Plus, they reduce spending on tools, which helps keep everyone on the same page. Try Notion for free when you go to notion.com slash appstories. That's all lowercase letters.
Notion.com slash appstories to try the powerful, easy-to-use Notion AI today. And when you use our link, you're supporting our show. That's Notion.com slash appstories. Our thanks to Notion for their support of the show.
Yeah. So speaking of Gemini, this has been, I think, my surprise over the past few months in that I think Google has shown that they are building and shipping very fast with very good performance. I mean, 2.5 Pro is arguably the best model in existence right now. It's really good. It's consistently topping all the leaderboards. Now, there's a whole other conversation to have about leaderboards and benchmarks. Are they actually accurate? Yeah.
I think there's nothing like actual everyday experience. And I mean, Google Gemini and especially 2.0 Flash and 2.5 Pro, and I think 2.5 Flash is also coming soon. 2.5 Pro is a reasoning model, but even then it doesn't take forever to come back with an answer. And...
I think, you know, performance-wise, cost-wise, it's faster and cheaper than Anthropic. And it's the only model right now that supports a true 2 million context window for input tokens. It's the one truly multimodal
large language model, in that not only can you feed images and PDFs to it, both via the chatbot UI on the Google Gemini website and via the API, but you can also give it all kinds of video files or audio files. And that's been something that I've been doing a lot. Gemini 2.5 Pro is my go-to model for video and
audio transcription, both for my personal voice recordings, which, you know, hopefully by the time this episode comes out, there's a story about that. If not, it'll be ready very soon. I use it for transcribing YouTube videos, and it doesn't matter how long they are, because it's got a giant context window. So it can accept all kinds of videos, long or short. It doesn't matter.
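To put some rough numbers on why the context window matters here, a quick back-of-the-envelope check works, using the limits quoted in this episode and the common (very approximate) rule of thumb of about 4 characters of English text per token:

```javascript
// Rough sanity check: will a transcript fit in a model's context window?
// Limits are the input-token figures quoted in this episode; the
// ~4 characters per token figure is only an approximation for English.
const CONTEXT_WINDOWS = {
  "claude-3.7-sonnet": 200_000,
  "gpt-4.1": 1_000_000,
  "gemini-2.5-pro": 2_000_000,
};

function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

function fitsIn(text, model) {
  return estimateTokens(text) <= CONTEXT_WINDOWS[model];
}

// A multi-hour video transcript can easily run over a million characters:
const transcript = "x".repeat(1_600_000); // ~400,000 estimated tokens

console.log(fitsIn(transcript, "claude-3.7-sonnet")); // false
console.log(fitsIn(transcript, "gemini-2.5-pro"));    // true
```

A transcript that long simply cannot be sent to Claude in one request, which is the practical difference being described.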
Also, there's that aspect of actually working with Gemini: it's very good at following instructions. I forgot, I should have mentioned this. The big reason why I got into Claude last year was because, in my experience in late 2024, Claude was the only model that could reliably follow directions. Anthropic was the first company that,
I don't want to say pioneered, but codified, maybe, a style of prompting that involves passing a complex prompt with multiple step-by-step directions using an XML syntax. And Claude was doing a really good job at following that kind of prompt, you know, especially when you're
giving the LLM commands that need to happen in succession, or stylistic guidelines, that sort of stuff. But now Gemini 2.5 Pro is also very good at following directions. It doesn't support the XML-style syntax, at least natively, but it works pretty well with Markdown instructions.
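To make the XML-style prompting concrete, here's an illustrative sketch. The tag names below are our own invention, not an official schema; the point is just that each section of the prompt, the step-by-step directions, the style rules, and the input, is clearly delimited:

```javascript
// Illustrative sketch of the XML-style prompt structure described above.
// The tag names (<instructions>, <style_guide>, <input>) are examples,
// not an official schema; the idea is to delimit each section clearly.
function buildPrompt(steps, styleRules, inputText) {
  const numbered = steps.map((step, i) => `${i + 1}. ${step}`).join("\n");
  return [
    "<instructions>", numbered, "</instructions>",
    "<style_guide>", styleRules, "</style_guide>",
    "<input>", inputText, "</input>",
  ].join("\n");
}

const prompt = buildPrompt(
  [
    "Read the input text.",
    "Flag every spelling or grammar error.",
    "Do not rewrite any sentence.",
  ],
  "Capitalize Apple product names the way Apple does.",
  "Apple intelligence was delayed untill next year."
);

console.log(prompt);
```

The same assembled string can be sent to any chatbot; the observation in the episode is that Claude followed this structure reliably first, while Gemini prefers the equivalent instructions written in Markdown.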
That advantage is also disappearing for Anthropic, because ChatGPT, and specifically GPT-4.1, can also follow XML-style complex prompts. But yeah, Gemini. So there's that kind of usage that I do for complex workflows: transcribe this, process this video, put it together as a Markdown document, or maybe transcribe my voice recordings and create an Obsidian note, that sort of stuff. But also, like,
everyday usage. On Android, obviously, Gemini is my default assistant. I think what they have done with Talk Live with Gemini, and especially the latest integration where you can share your screen or the video camera live with Gemini as you're talking, is excellent. And Gemini is just very good for Google search; it's obviously pulling directly from the decades of Google search experience. And so when Gemini 2.5 Pro is,
as they call it, grounding its response in Google search, you get a very good factual response with very, very few hallucinations, at least in my experience. Now, obviously I should also mention the integrations. Something that I keep mentioning about Google is that they are the true competitor to a future Apple Intelligence, if it actually works,
because Google is doing what Apple promised they would do. They have an assistant that supports multiple modalities, text, screen, video, audio, images, documents. Yep.
and they have the integrations with the Google ecosystem. So you can save a note, you can save a reminder, you can take a complex conversation and save it to Google Docs or something. That aspect is basically what Apple was promising with Apple Intelligence, but it's only for Google apps. So time will tell if they can have a third-party ecosystem story. And the thing that I will say about Gemini is
is that, and this is more of a taste thing, I don't know, it feels a little too sterile. Maybe sometimes it's a little too cold, a little too... It doesn't really have a personality. I know that it's a silly thing to say, because these are not really personalities. But compare it to Claude or compare it to ChatGPT.
It doesn't have style. It doesn't have a style, which I think is the difference. I think that that's absolutely true. And I also think that it does seem to pad out its generated text with a lot of stuff that you really don't need. I feel like it could use a conciseness dial for things like deep research. I recently had it create a report on...
on all the different tools and methods for running LLMs locally on a Mac, and it was 26 pages long. It was like a treatise. A lot of it, though, was just fluff.
There were like two or three sections that were really interesting and helpful to me, and a lot of things that I could just skim over, because they were maybe repeated two or three times within the thing in various ways. But at the same time, there's the kind of workflow that Google
allows you to have, where you do the research in Gemini and you get a Google Doc. And then you take that Google Doc and you go to NotebookLM, which is this other AI product made by Google, which is excellent for people who do a lot of research. And you give it that Google Doc and you say, okay, take this Google Doc as your source. And mind you, that would be the thing: I wouldn't be able to research 500 different websites myself. I mean, I am very online as a person, but I am not that
online, you know? But you take that Google document and you say to NotebookLM, okay, take this document now; I can either ask you questions, or you can make me a fake podcast about this topic. I think it's an excellent resource if you just want to catch up. I've been able to read and learn more about the most random things possible,
like who invented coffee, these stupid topics that my brain thinks about sometimes. And you just do the deep research in Gemini, and then you take the Google Doc and you give it to NotebookLM and you're like, okay, now I'm going to ask you questions. I think it's a really good feature, and all of it is based on Gemini. But the one thing that I will also say about Google in general: I feel like they are often unable
to capitalize on the really good features and the really good performance that they have, because they are not as good as OpenAI at making viral products. I don't know if this makes sense, but Gemini 2.5 is amazing, especially 2.5 Pro. NotebookLM is really good. All the features that Google has are really good.
But most people are still talking about ChatGPT. And that's because, I feel like, from a product perspective, OpenAI is able to craft these viral moments, especially on social media, whether it's the image generation, which I don't care about, or the memory stuff. They know how to fine-tune their features so that they are understandable by common people.
Yeah, I think that's true. And I think Google's problem is that they're only able to generate viral moments from things like, you know, glue on pizza or eating rocks. I mean, those are the viral moments that Google had and they haven't had any since. And I do think there are an awful lot of people who are still looking at Gemini and thinking about the early days of Bard, which it was called before, and how many mistakes it made and the controversies that were raised.
But it's come a long way since then. And I, for one, I mean, for me, all I have to say is I've never paid for Google products until Gemini 2.5 came along. And now, or 2.0, I guess, is when I signed up for a Google One account because...
I didn't care about the Google Drive space. I didn't care about the other features, really. But I wanted the advanced Gemini features. I wanted NotebookLM. I wanted to have the highest-tier models. And that has pushed me,
through the integrations to use Google products more than I probably ever have because I'm using Google Sheets, I'm using Google Forms, I'm using Google Drive now for some things, I'm using Google Docs, of course. And one of the things that I, as you said, you know, you come up with these things you want to research.
The kinds of things that I think it's particularly good at, and I'm okay with Gemini and the way deep research does these reports because they do list out all of their sources. And sometimes it can be a list. I did one the other day. It had almost 100 websites.
in the footnotes. And I actually go to those sites and find more information. I mean, it's really helpful to have that. But the things that it's particularly good at is one day I was thinking, I have this fiber service from AT&T and I have this Wi-Fi 7 router.
What are the best ways to optimize my local network, my Wi-Fi network? Those are the kinds of things that are buried in forums, on Reddit, and in technical documentation on support pages from the manufacturers of different products. You can find it, but it's the kind of thing where you can spend hours and hours combing through all the data trying to find a
particular answer. Whereas I just used deep research and got a really nice report that gave me some very good leads: oh, here are the two or three things that I should try. And then when I was thinking, oh, well, I've already done this one, but maybe I'll try this one, I could go and tap through on the links and see the Reddit posts and see whether they seemed legit to me or not. And yeah,
It was just a really good starting point. And that's how I feel with a lot of the research I do in Gemini. It's starting points for things that I don't know a lot about.
or it's a starting point for something that I know is going to have information spread far and wide across the internet. The other day, I did another one for a video game where I was trying to figure out just kind of like the development history of a game, what platforms it was on, what the general critical reaction to it was. And I want to gather that into one place because it wasn't going to be the focus of an article I was writing. It was more context
and understanding things before I got down to what I was going to write about in particular. And it could give me that head start. I literally will do things like, I'll run that deep research query and then I'll
throw my iPad on my bed, go take a shower. And by the time I come out of the shower, the report is ready and I can plug in my headphones to listen to it. Because a lot of times what I'll do with these is I will turn them into a PDF and listen to them in Readwise Reader, or I'll do the
NotebookLM thing and listen to it there. So there are a lot of options for how you listen to it. And I do like, for some of this stuff, just to listen to it, because especially if I'm out for a walk or something anyway, then
the fact that it maybe repeats itself a little bit isn't really a bad thing, because I'm listening and maybe it reinforces it a little bit more because I'm maybe not paying quite as much attention as I would if I were reading it, where that would be more annoying. Whereas listening to it, it's okay for me to kind of get it that way. Yeah. And that's a sort of usage of an LLM
that I like, in the sense that you're not relying on it to, like, replace an artist. Like, neither of us really cares about image generation, for example. Right. But in this case, it acts as a multiplier for the
the stuff that you want to know. And so like, whether it's preparing for a podcast or an interview or like just having context about facts that you want to know, like you're discovering websites, you're getting that context and then you get back to work and, you know, hopefully you do better work. The other things that I wanted to say about Gemini 2.5 Pro, it is the model that I've been using to have conversations about the contents of my entire Obsidian vault using Obsidian Copilot and, um,
Obviously, 2.5 Pro with its giant context window doesn't care. But there is a new model in town. There is. And that would be GPT 4.1. So my relationship with ChatGPT has been kind of weird. So obviously, it is the most popular AI service in the world. And like I mentioned at the beginning, half a billion people can't be wrong. There must be something there.
And I mean, let's face it, GPT 4o, it's pretty good. Like 4o is incredibly popular. That's what half a billion people use.
Most people, I would bet, don't pay for ChatGPT. My girlfriend just uses it for free without even being logged in. That's how most people use ChatGPT. 4o is pretty good. It's got style. It's got its own personality. 4o has been this fascinating model in that it has gone through lots of evolutions
over the past year. In late March, they shipped this highly improved version of 4o without changing the name of the model, but they completely changed the way that it talks. And I got to tell you, John, especially in Italian, I have been so surprised by how well and just more naturally 4o talks and writes.
It's been really good, and they have incorporated some of the changes of the recently announced GPT 4.1. So 4.1 will be API-only because, according to OpenAI, they have shipped many of the same improvements for natural language, following directions, context window, document analysis, code, and math in 4o. But in the API, you can use 4.1. Now,
why wouldn't that just be part of the API for 4o then? I mean, why is there a different model just for the API? It doesn't make sense to me. Well, that... you're going to charge more, I guess, because it's more expensive than 4o in the API. I think also because, like, I've seen some people mention, there are obviously two distinct
teams within OpenAI, like the product team doing the chatbot and the API team doing the platform. And eventually those will have to converge, you know, big picture, but we'll see. But I think that's the reason why, you know, you can probably scale 4.1 better and charge more for it if it's a separate thing in the API. Got it. But 4.1, I've been testing it for the past couple of days.
And I've been really impressed. So it's got a 1 million context window. So not the 2 million of Google Gemini 2.5 Pro, but still big enough to support my entire archive of iOS reviews. Yes, which for context, I think is worth saying is something like 900,000 tokens, right? Almost 1 million tokens.
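That 900,000-token figure is easy to sanity-check yourself. A common rough approximation for English text is about four characters per token (the real count depends on the model's tokenizer), so a back-of-the-envelope sketch looks like this:

```python
# Rough back-of-the-envelope token estimate. The ~4 characters per token
# ratio is an approximation for English text, not an exact count; for
# exact numbers you'd use the model's actual tokenizer.

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Return an approximate token count for `text`."""
    return int(len(text) / chars_per_token)

def fits_in_context(text: str, context_window: int = 1_000_000) -> bool:
    """Check whether `text` plausibly fits in a model's context window."""
    return estimate_tokens(text) <= context_window

# Example: a ~3.6 million character archive is roughly 900,000 tokens,
# just under a 1M-token window.
archive = "x" * 3_600_000
print(estimate_tokens(archive))   # 900000
print(fits_in_context(archive))   # True
```

A ten-year archive of long reviews lands right in that ballpark, which is why it squeaks under GPT 4.1's window but would overflow a much smaller one.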
- Yeah, so it's just-- - Right there. - Right, it's just below the 4.1 threshold, about half of Gemini and like five times what you can get with Claude. - Yeah, yeah. Now, the thing about the context window of OpenAI
Which hasn't been necessarily... It has been kind of true with Gemini 2.5 Pro, I can tell you. It hasn't been true for Llama 4. A lot of people are making fun of Llama 4 and Meta's claim of like a 10 million context window. The problem is that nobody has been able to really test that 10 million context window because all of the services that expose Llama 4 via the API don't support it.
Like if you go to OpenRouter or LibreChat or these services that allow you to talk to a hosted version of Llama 4, they don't support a 10 million context window. So there's no way to test it, unless you have a super powerful computer where you can run it locally. But even then, it seems like it's not really a perfect 10 million context retrieval window. But
OpenAI claims essentially a 100% rate in retrieving the so-called needle in a haystack for the 1 million context window, which means being able to retrieve a specific detail, like a specific fact across 1 million tokens. Now, from my practical experience, I can tell you that that's accurate. Like I've been able to ask very specific questions to 4.1 across my 10 years of iOS reviews.
as you said, almost filling the entire context window and it hasn't failed. Like every single time it has pinpointed the specific detail of my iOS reviews. Like for example, when
I had a random question at some point, like, when did Apple bring the Apple News integration inside the Stocks app? Now, that's a very specific detail. I bet that there's people inside Apple who don't even know when they did it. But I know that I covered it in one of my iOS reviews because it was so weird to have Apple News integration inside Stocks. I'm going to guess 2019 is my guess.
Do you remember? Yeah, I think you're probably right. I asked this question yesterday. I think you're probably right. And it got it. Like, it found it. And it found, like...
It said, you covered it in your iOS something review, and this is what you said. And the incredible thing is that it also said, this was your first instance of this mention, but later in other iOS reviews, you also referenced this feature. Like, I've had a really good experience with 4.1, but get this.
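The needle-in-a-haystack check Federico describes can be sketched as a small harness: bury one specific fact at a random depth in a long stretch of filler text, send the whole thing to the model, and check whether the answer surfaces the fact. This is a minimal illustrative sketch, not OpenAI's actual benchmark; the filler sentence and the "needle" fact are made up, and the model call itself is stubbed out:

```python
import random

# Illustrative needle-in-a-haystack harness. The filler sentence and
# the needle fact are invented for illustration; a real test would use
# a document archive and an actual model call (stubbed out below).
FILLER = "The quick brown fox jumps over the lazy dog."
NEEDLE = "The magic number buried in this document is 7413."

def build_haystack(total_sentences: int = 10_000, seed: int = 42) -> str:
    """Return a long document with the needle buried at a random depth."""
    rng = random.Random(seed)
    sentences = [FILLER] * total_sentences
    sentences.insert(rng.randrange(total_sentences), NEEDLE)
    return " ".join(sentences)

def retrieval_succeeded(model_answer: str) -> bool:
    """Check whether the model's answer surfaced the buried fact."""
    return "7413" in model_answer

haystack = build_haystack()
# In a real run you would send `haystack` plus a question such as
# "What is the magic number?" to the model and pass its answer to
# retrieval_succeeded(); here we only verify the harness itself.
print(NEEDLE in haystack)  # True
```

Asking when Apple News integration arrived in Stocks across ten years of reviews is exactly this shape of test, just with a real archive instead of filler.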
I've had a really good experience with 4.1 Mini. Now, 4.1 Mini is the cheaper, faster version of 4.1. 4.1 Mini matches the performance of the existing 4o, so the model that is on ChatGPT, but it's based on the bigger context window and knowledge of 4.1, with much faster performance and lower latency than the regular vanilla 4.1. So I've been able to use
4.1 mini with my entire Obsidian Vault. I was going to say that sounds like a really good solution for Copilot and an Obsidian Vault. Exactly. It's a really good solution because you don't need to wait 30 seconds for an answer. Maybe you just need to wait 10 seconds or 15 seconds, which is like, you know, half the time or even more like a third of the time. Like it's been really good for that. And now all things being equal, if I have two services, two LLMs that support a large context window,
Which one am I going to use? Now, Gemini continues to lead, in my experience, in terms of code, but especially in terms of multiple modalities. So Gemini continues to be the only one that supports an audio attachment,
an MP3 or a WAV file or an MP4. It's the only model. GPT 4.1 doesn't let me do it. No, they still have Whisper, which is their open source model, but that has not been updated in years now. Oh, but they also have that new model that they launched like a few weeks ago, 4o
audio or something. Oh, I didn't see that. But it's very limited. It doesn't support long audio recordings. That's the problem. Like if I want to give it a long YouTube video, it's not going to work and it's very expensive. Yeah. Whereas with Gemini 2.5, like I can just give it, you know, a long video, like a two hour YouTube video. It doesn't care. So for that, for me at least, when it comes to transcription attachments that are not images,
Gemini does a better job. But I will say that when it comes to text, when it comes to image OCR, I prefer 4.1 because 4.1 talks in a nicer way than Gemini 2.5 Pro and they have a high quality mode for images.
When I give Gemini 2.5 Pro a long screenshot of one of my shortcuts, and I'm like, describe this shortcut to me, the image gets compressed, and Gemini consistently hallucinates everything. Like every single thing. I've run into that a lot with shortcuts, yeah.
With 4.1, if you enable the high quality mode, it does hallucinate, but much, much, much less. Yeah, I've noticed with shortcuts that all of the LLMs tend to make up things like actions that don't exist and all that. I think it's worth maybe mentioning really briefly what we mean when we say expensive, because I do think that there is quite a variety in terms of the API usage costs here, but
By and large, unless you are like a programmer who is trying to vibe code your way through an entire Mac app or iOS app, generally speaking,
It's pretty inexpensive to run these things. I mean, just for instance, I think in a lot of the experiments I've been running recently with Gemini over the past, I don't know, three or four weeks, I think I spent like 60 or 70 cents. And I know you sent me a screenshot where I know you've been doing an awful lot. You've been doing even more than me. And maybe you spent like 60 or 50 cents in euros. No, I mean, I am now up to four bucks. Oh, okay. Four whole euros. Yeah.
Yeah, but yeah, euros actually, not dollars. But like I've been doing these things where like just today I gave OpenAI like 4 million tokens. You know, like I'm doing these things where I'm like testing my iOS review archives or my entire Obsidian vault. So they are very inexpensive. But if you are a developer or if you are, you know, a small development shop and you're doing like
things at scale, you know what I would say? Go to this website, artificialanalysis.ai. It is the best website to visualize price, intelligence, and speed for all the major LLMs. They have these very, very good comparisons of like, okay, what kind of intelligence are you getting? How much are you spending? With what kind of latency? What kind of performance?
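Because API pricing is quoted per million tokens, the order-of-magnitude math is simple to do yourself. A quick sketch (the prices below are illustrative placeholders, not current list prices; check the provider's pricing page or artificialanalysis.ai for real figures):

```python
# Per-million-token prices in dollars. These numbers are illustrative
# placeholders; real prices change often, so check the provider's
# pricing page (or artificialanalysis.ai) for current figures.
PRICES = {
    "big-model":  {"input": 2.00, "output": 8.00},
    "mini-model": {"input": 0.40, "output": 1.60},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate a request's cost in dollars from per-million-token prices."""
    p = PRICES[model]
    return (input_tokens / 1_000_000) * p["input"] + \
           (output_tokens / 1_000_000) * p["output"]

# Feeding a ~900,000-token archive and getting a 2,000-token answer back:
print(round(estimate_cost("big-model", 900_000, 2_000), 4))   # 1.816
print(round(estimate_cost("mini-model", 900_000, 2_000), 4))  # 0.3632
```

That's why casual experiments, even big ones that fill the context window, add up to a few dollars rather than a scary bill; it's sustained, at-scale usage that gets expensive.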
That's sort of a... I mean, right now you can see that, you know, Claude 3.7 Sonnet continues to be leading the chart in terms of, you know, cost. So, yeah. Yeah. To me, it's like, I think for a lot of what we do, once you, um,
Once you get beyond the experimentation, you get to the everyday tasks, it tends to be fairly inexpensive. But yes, you do have to be careful because if you're doing coding and some other heavy lifting type of projects, you can get into expensive bills pretty quickly. So all this to say, from my end, I will say that right now I'm in between LLMs.
But I think where things stand right now, and I mean, obviously, you know, these things are changing on a week-to-week basis. They really are. So at this moment in time, April 16th, when we are recording this, I feel that I could, if I wanted to, consolidate around Gemini 2.5 Pro and ChatGPT, because for a variety of reasons, I don't see the advantages that Claude used to have.
The one thing that is still sort of left with Claude is the projects and team sharing. But there are ways to, you know, to live without those, you know, if I wanted to. So I feel like the combination of just the raw power and intelligence of 2.5 Pro from Google
And the improvements of ChatGPT, you know, not to mention in ChatGPT, I mean, if you are into that sort of stuff, image generation is objectively pretty good. I'm just, I'm not into it, but it's pretty good. Voice mode is also very good. The memories feature that you can use because you are in America, John, I cannot use it because of the European Union, but that also seems to be pretty good. Like it's a true assistant that gets to know you the more you use it.
That also, by the way, creates an incentive for OpenAI to... you know, there is an argument to be made for memories in ChatGPT being like an advantage, a competitive advantage. Like, why would you go to another LLM
if what you discuss with that LLM does not get saved into your main ChatGPT memory. - Right, it becomes a lock-in type of thing. - Yeah, yeah, which is why I wouldn't be surprised to see in a few years
This is going to sound silly now, maybe. I think Ben Thompson has been writing about this. I could see a scenario in which in a few years you have, you know how on the web you have those buttons like login with Google or login with Facebook? I could see login with OpenAI as a thing, where you use your OpenAI account as a way to feed back into your memories and basically treat your personal context as this portable data storage that you're just using across a whole variety of services.
I could see that. I could see it too. It's also a little scary when you think about it, having all that personal knowledge kind of consolidated under one roof with one company. Yeah, I haven't used 4.1 yet. I actually discontinued my ChatGPT subscription about...
almost about a month ago now, and things have been moving so fast that I probably shouldn't have canceled it, because I think it actually ran out like a week ago and I'm already thinking about resubscribing. So I don't think it saved me any money by canceling it. But I canceled because I found that Gemini 2.5 was doing such a good job for me,
combined with Claude that I was pretty satisfied with where things stand. But what I have seen so far of 4.1 has intrigued me. And I do still have, although I don't have an account for the chatbot, I do still have an OpenAI API account that I can use. So I can use it with some other tools in the interim, such as Copilot
on Obsidian, but I probably will end up back with an OpenAI chatbot subscription as well. And I do see a similar trend that you have. It'll be interesting to see. We'll have to make a call at some point about our team account. I do feel like one of the problems with the way things are moving so quickly is I do feel like I'm resetting up core
prompts and projects over and over again. And this is why this is my top tip. This is why I've been keeping all those prompts in a folder as text files. Yeah. So like I'm just, I basically have templates now. And so that allows me to switch between things without losing all that work that has gone into putting together those projects. It's the only way to live.
Especially if you want to try everything, the only way to do it is to keep a folder with your prompts and your stuff, and you can just reuse them no matter the service that you're trying. Right. Then you have consistency between experiments as well, too, because you're using the exact same prompts, and it's not affecting the outcome. Well, interesting. Interesting. Yeah.
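Federico's tip of keeping core prompts as plain text files in a folder is easy to script, too. A minimal sketch, where the folder path, file names, and the `chat()` call are all hypothetical illustrations:

```python
from pathlib import Path

def load_prompts(folder: str) -> dict[str, str]:
    """Load every .txt file in `folder` as a named prompt template."""
    return {p.stem: p.read_text(encoding="utf-8")
            for p in sorted(Path(folder).expanduser().glob("*.txt"))}

# Hypothetical usage: write each core prompt once as a text file,
# then reuse it with whatever service you're testing that week.
# prompts = load_prompts("~/Prompts")
# chat(service="gemini", system_prompt=prompts["ios-review-qa"])
```

Because the templates live outside any one chatbot's projects feature, switching services is just a matter of pasting the same file into a new project.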
Wow. It's a lot. I feel like we could be going on for two more hours, but that's the episode. Yeah, no, that's it. Well, we are going to talk about some more AI stuff in the post-show for App Stories Plus members. John has an idea. This is more on the complain side, which is... Kind of, maybe. I mean, kind of. Prediction adjacent. Yeah.
Yes. We'll see. Yes. Where I think the entire AI industry is going in terms of how people are charged and the way these things are turned into products. So that'll be for the post-show. But that's it for the main episode. I want to thank our sponsor again. That was Notion. You can find me and Federico over at MacStories.net. We're both on social media too. Just look for at Viticci. That's V-I-T-I-C-C-I. And I'm at John Voorhees.
J-O-H-N-V-O-O-R-H-E-E-S. Talk to you next week, Federico. Ciao, John.