ChatGPT recently released a new feature, called Deep Research, that allows ChatGPT to “use reasoning” and process large amounts of online information. In his newsletter Platformer, Casey Newton reported on the new feature. To judge the quality and functionality of Deep Research, Newton prompted ChatGPT to output a report of 5k words using Deep Research, and compared this to a similar report made by Google’s Gemini. Notably for this blog, Newton asked ChatGPT for a report about how the fediverse could benefit publishers. A Fediverse Report, you could call it. Newton does not spend a lot of time analysing the results: he says that the output hits on the requirements of the prompt, and that compared to Gemini, Deep Research “blows it out of the water”. Not all tech writers are as impressed with Deep Research. AI doomer king Ed Zitron wrote another long article, ‘The Generative AI Con’, in which he pushes back against AI hype. He takes on Deep Research by reading through the same report about the fediverse that Newton published in Platformer. Zitron is not impressed by OpenAI’s new feature, saying that “the citations in this ‘deep research’ are flimsy at best”, and that “this thing isn’t well-researched at all.”
A report on the fediverse is right up the alley of a blog named Fediverse Report. So let’s do some deep research on Deep Research’s fediverse report. I’ll go over ChatGPT’s output in detail, analysing what information ChatGPT gives, and which information is missing.
ChatGPT’s output gets quite a lot of information correct, and more importantly, it structures the information well. It pulls in relevant examples, and also manages to find relevant obscure information. Deep Research’s ability to show what information the output is based on gives insight into how an LLM’s output gets constructed. It also allows the sources to be analysed, and it turns out there is a lot of interesting information you can learn by looking at the sources that Deep Research uses for the output. The report also hits on most of the relevant points that a publisher who is curious about the fediverse needs to know.
A common critique of LLMs is that they will give factually incorrect information, often simply called hallucinations. This problem is also visible in Deep Research’s output, which contains factual mistakes. The problem of factual (in)correctness of LLMs is well-known, and not a debate I want to rehash here. What I am interested in is the analysis and research part of Deep Research: which sources does the output use? Are those sources any good? And just as importantly: which relevant sources should have been part of the report, but aren’t? Judging whether the output of an LLM is correct or incorrect is reasonably straightforward. But judging whether the output leaves out information that reasonably should have been included is much harder. This goes doubly for emerging fields like the fediverse, where there is no authoritative base of knowledge to rely upon.
In ChatGPT’s output that Newton uses to get a sense of the performance of Deep Research, I find that there are three types of issues relating to data and analysis. There are issues with the quality of the sources that ChatGPT cites, and there are sources missing that I expect to have been cited. But the most intriguing part for me is when ChatGPT cites a source correctly, but the resulting output is still lacking, because the needed information to get to a good understanding is not actually available on the internet.
Source quality
One issue that Deep Research struggles with is the quality of the data sources it cites. The clearest example is a specific Reddit post, which ChatGPT cites six different times as a source in the section on monetization in the fediverse. The post is titled ‘monetization’ and was posted on /r/fediverse. This Reddit post is the second search result on Google, as well as on Kagi and DuckDuckGo, for the search query ‘fediverse monetization’.
ChatGPT heavily focuses on WebMonetization by Interledger in this section, and how it integrates with Castopod. Interledger is a (non-crypto) payment network that allows people to send microtransactions to creators. There is indeed a WebMonetization integration with Castopod, but ChatGPT’s output gives no indication of how unrepresentative this is for the fediverse. With all respect to Castopod and Interledger and what they are building, WebMonetization is in no way a meaningful part of the fediverse. In fact, both organisations presented their integration in 2025 during FOSDEM. The room consisted of the most in-the-know in-crowd of fediverse developers, and as far as I can tell this potential fediverse integration with WebMonetization was new information for most if not all of them. ChatGPT provides a reasonable summary of Interledger’s position on monetization in its output, but it gives no context at all on how early in the adoption stage WebMonetization is. As a result, the output does not represent the state of the fediverse as it is currently used by most people.
An article by TwipeMobile is the prime source that ChatGPT uses throughout the report. TwipeMobile is a software development company that builds apps for newspapers. The company also publishes research papers about related topics, and one is called “What is the Fediverse? A guide for publishers and the uninitiated.” With a title like that it is no surprise that ChatGPT likes the article. The quality of the article is mediocre, however, and it lives in the twilight zone where it is impossible to tell for sure whether the article is generated by an LLM or written by a human.
The TwipeMobile article has some major issues with factuality, and as a result, ChatGPT’s output suffers as well. This is especially noticeable in ChatGPT’s comparison between ActivityPub and ATProto. The TwipeMobile article describes ActivityPub as ‘widely adopted’ and having an ‘established user base’, and ATProto as early in its growth. The article was published on 6 December 2024, and at that date the fediverse had 1.1 million monthly active users, while Bluesky had around 11 million monthly active users. That ATProto is 10x the size in terms of active users compared to ActivityPub does not seem particularly clear from the language used in the TwipeMobile article, which seems to imply that ActivityPub is more active than ATProto. As ChatGPT’s output relies so heavily on TwipeMobile’s article, this misconception is reflected in ChatGPT’s advice as well, which describes ActivityPub as having an ‘established audience’ and ATProto as being in an ‘early growth stage’.
Missing sources
ChatGPT does miss a few relevant sources that would help publishers get a good understanding of the state of the fediverse and whether the network is relevant for them. One missing source concerns monetisation and sub.club. Sub.club was a platform that let fediverse creators offer paid subscriptions and premium content, using existing fediverse infrastructure. It allowed people to set up a fediverse account and share content with the rest of the fediverse. Creators could then set a paywall on posts if they wanted, and sub.club provided the payment infrastructure. Sub.club shut down in December 2024, only a few months after it launched, having managed to onboard only 150 people. For publishers that are interested in monetisation on the fediverse, sub.club’s struggle to gain traction is a relevant data point. Sub.club got a fair amount of media attention (1, 2) from well-known outlets, so there was no lack of sources for ChatGPT to cite.
Other missing sources are not only a matter of ChatGPT failing to link to articles: the line between ‘relevant information that is missing from ChatGPT’s output’ and ‘relevant information that is not covered in well-known news publications’ is thin. Some of the most relevant information that is missing in the output is also missing in articles that rank high in search engines. A notable example is the statistics that Heise editor Martin Holland regularly publishes, which compare traffic to their site from Mastodon, Bluesky, Threads and X over a longer time period. The original prompt by Newton asks for a report on how the fediverse could benefit publishers, and a time series of traffic data is one of the best ways of showing the concrete benefits for publishers.
As best I can tell this data series is not published in news media, which explains why it does not show up. ChatGPT also seems to prefer English-language sources. For example, there is no information on how ZDF, one of Germany’s largest public broadcasters, has had their own Mastodon server for years. Heise being a German news outlet likely also contributes to the data series not being reported on in English-language media, and not being found by ChatGPT.
On data availability
One limitation of ChatGPT’s output is that it is dependent on available information. But what if the information is not available online, and there is no clear indication that any data is missing at all? Let’s take a look at how ChatGPT describes Medium as an example of how publishers can use a Mastodon instance to amplify authors’ reach. The output cites Medium’s announcement blog post and correctly summarises why Medium started their Mastodon server. On a surface level ChatGPT’s output is good: it found relevant information and summarised the important points.
What’s missing here is the follow-up: Medium launched their Mastodon server more than 2 years ago. So did their plans actually work out? That seems pretty relevant information for a potential publisher to know. In 2024, Medium barely posted about their Mastodon server on their blog. CEO Tony Stubblebine mentions Medium’s Mastodon server once in his ‘State of Medium’ post, also saying that he finds Threads to be a better place for self-promotion. Those two additional data points suggest that Medium’s experience with launching a Mastodon server is mixed: the me.dm server has 1.7k MAU, so it clearly provides some benefit to the Medium user base. But there are no signals that it is an overwhelming success either.
The issue here is that neither Medium nor anyone else has published an analysis or statement with a follow-up on Medium’s Mastodon server that asks the question: “are the benefits of running a Mastodon server by a publishing platform good enough that it is recommended for other publishers to do so as well?” Newton’s original prompt asks for “high-level strategic analysis for a digital-native publisher”, and this is the type of analysis that is needed for a publisher to actually make a decision. For a publisher, only knowing that a project exists is not the point: a publisher needs to know whether that project is successful and whether they should consider it as well. This type of analysis is hard for ChatGPT to properly execute: LLMs are fundamentally about processing data, and they cannot process data that doesn’t exist.
Some more notes
I find ChatGPT’s recommendations to be quite good. The first three points of advice (Establish an Authentic Presence, Integrate Your Website with ActivityPub, Engage with the Community) are sound at a high level, and are advice that I would give to publishers as well. The advice to “Stay Adaptive with Protocols” is also good, but the description is wrong on some pretty important parts. Then again, this is because the quoted source (TwipeMobile again) is wrong on this, so at least ChatGPT cited a bad source correctly.
ChatGPT spends another section on discoverability and engagement, writing a comparison between ActivityPub and ATProto. It correctly notes that discoverability on the fediverse happens through community sharing and hashtags, and that ATProto has space for algorithmic discovery. ChatGPT’s tendency to equate both sides is in full play here: it correctly describes the difference between the networks, but refrains from drawing a material conclusion from it, describing them both as equal but different. Again, ChatGPT’s biggest fault here is not in what it writes, but in what it does not write. There is no mention that search is opt-in on the fediverse, that around 5% of accounts have opted into being discovered, and that as a result search and discovery work significantly less well on the fediverse than they do on Bluesky.
Finally, ChatGPT offers the advice to use “third-party analytics services to gauge engagement.” There are indeed tools available for analytics, but not all of them are easy to find, and it is hard to know which one to use. ChatGPT does not tell you which tools you can potentially use, instead offering very basic advice on tracking engagement. ChatGPT’s output would have been better here with some more ‘deep research’.
Another general note on ChatGPT’s output: Zitron does not like Deep Research’s tone of writing, and I agree with what he writes here: “I don’t like reading it! I don’t know how else to say this — there is something deeply unpleasant about how Deep Research reads! It’s uncanny valley, if the denizens of said valley were a bit dense and lazy. It’s quintessential LLM copy — soulless and almost, but not quite, right.”
Over the last few years, how you feel about LLMs and generative AI has quickly become an identity marker for many people, with opinions ranging from LLMs being the new way to build a machine god to LLMs being a torment nexus designed to strip workers of their power. For this article I am not aiming to argue for a specific position in the debate about the values and impacts of LLMs, and instead I’m aiming for a smaller goal. OpenAI released a new mode that says it can do deep research, and a prominent tech writer used a single prompt to get a sense of how good this new mode is. This prompt happened to be on a subject I do know something about. I wanted to know how much ‘deep research’ was in ChatGPT’s output, so to that end, I simply did some deep research of my own.
 
								