So, there’s a lot of talk lately about SEO evolving into something else—specifically, Generative Engine Optimization (GEO).
But after digging into Semrush’s study of 80 million ChatGPT clickstream records, it’s clear most of us in the industry, including many top thought leaders and publications, don’t demonstrate even a cursory understanding of how LLMs really work.
I write this with a light hand because SEO gave me, a community college dropout, a path I love. But the end of SEO as we know it has come, and however much we love it, we have to face that reality.
Our industry has misled thousands by equating SEO with GEO. That’s not the central premise of this piece, but it’s where we’ll start: how that mapping leads to false data narratives by analysing Semrush’s Investigating ChatGPT Search: Insights from 80 Million Clickstream Records study.
From there, we’ll explore the real story—cutting-edge techniques that go beyond SEO. The stuff you’ll see is wild, and it’s changing everything.
- How Semrush’s Search Intent Mapping Missed 70% of Prompts—Ignoring Basic LLM Utterance Classification
- Semrush Completely Missed The Significant Role of Adaptive Retrieval and Its Role in What Drives ChatGPT Citations
- Semrush Tried to Map Singular Intents to Long Prompts; Multiple Intents Should Have Been Discussed, Along With Multi-hop Retrieval/RAG
- The Worst Sin of All: No Mention of Basic LLM Vocabulary
- Some of the Advanced Methods and Tools We Are Building at Flying V Group
- Deep Research and User Behavior
- Influencing LLMs Through Brute Force Optimisation
- In Conclusion
How Semrush’s Search Intent Mapping Missed 70% of Prompts—Ignoring Basic LLM Utterance Classification
The most glaring issue—and the single most illustrative point of why SEO can’t just be mapped onto LLMs—is that Semrush relied on search intent methods of classification to try to map user utterances.
But here’s the thing: conversation interfaces don’t work like that. They consist of utterances and turns, equating to multi-turn conversations. That’s what we are ACTUALLY optimizing for in GEO.
Any string of text “uttered” by the LLM or user is referred to as an utterance. There are two types:
1. User Utterances (strings of text “uttered” by the user)
2. LLM Utterances (strings of text “uttered” by the LLM)
An utterance from a user with an utterance from the LLM equates to a turn:
The full picture looks like this:
Ultimately, these prompts should have been classified using what are called utterance classifications. Had that been done, the insights would have been far more valuable.
Utterance classification is a fundamental technique used in every serious LLM and chatbot system. It provides much richer insights than traditional search intent models. Even at a broad level, it offers classifications such as “opinion requesting” and “clarification seeking,” which are more nuanced and descriptive than standard search intents.
They can become highly contextual and nuanced, enabling a deeper understanding of user journeys and multi-turn conversations:
Utterance classifications at a basic level consist of three approaches: zero-shot classification, intent classification, and hybrid classification.
Zero-shot classification applies when there’s no explicit training data, allowing the model to predict the category of an utterance on the fly. Intent classification identifies what the user wants to achieve with their statement, while hybrid classification combines multiple approaches to provide a more nuanced understanding:
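To make the idea concrete, here is a toy utterance classifier. It only illustrates the *shape* of utterance classification; real systems use zero-shot or fine-tuned LLM classifiers, and the labels and cue phrases below are my own hypothetical examples, not from Semrush or any production system.

```python
# Toy utterance classifier: illustrates the structure of utterance
# classification, not a production model. Labels and cue phrases
# are hypothetical.

UTTERANCE_LABELS = {
    "opinion_requesting": ("what do you think", "should i", "is it worth"),
    "clarification_seeking": ("what do you mean", "can you explain"),
    "instruction": ("write", "generate", "summarize", "create"),
    "information_seeking": ("what is", "how does", "when did"),
}

def classify_utterance(utterance: str) -> str:
    """Return the first label whose cue phrases match, else 'other'."""
    text = utterance.lower()
    for label, cues in UTTERANCE_LABELS.items():
        if any(cue in text for cue in cues):
            return label
    return "other"

prompts = [
    "What do you think of this landing page copy?",   # opinion_requesting
    "Can you explain what adaptive retrieval means?",  # clarification_seeking
    "Write a meta description for my pricing page",    # instruction
]
for p in prompts:
    print(p, "->", classify_utterance(p))
```

Even this crude version shows why the categories are richer than search intents: “opinion requesting” and “clarification seeking” simply have no equivalent in the informational/navigational/transactional model.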
Utterance classification isn’t new. It has been around since the 1990s and played a central role in early voice search systems, ironically the very voice search the SEO industry frequently discusses.
If utterance classification had been applied to this dataset, Semrush could have labeled and categorized EVERY prompt. We could have understood specific intents down to an industry level, user request instructions, and much more.
Instead, because Semrush applied SEO classifications to something that is NOT A SEARCH ENGINE, we got an outrageous graph with 70% of the data unlabelled. You can view the chart here.
Pioneering new ways? No, they’re just observing a new technology that can be described through methods that have been around since the ’90s: UTTERANCE CLASSIFICATION.
I’m not trying to be harsh or negative, but this is a serious misstep. It makes it clear that SEO is not the same as GEO. When that distinction is overlooked, it leads to flawed narratives that can seriously harm the marketing ecosystem.
This is a real-world mistake that misleads both businesses and SEO professionals, sending them in the wrong direction at the exact moment they need clarity the most.
If utterance classifications had been used, the data narrative would have looked entirely different, with insights that could have been truly groundbreaking. Instead, this was a misapplication of SEO tools to a domain they were never designed for, and it should finally put to rest the idea that SEO can be directly mapped to LLMs or GEO.
Semrush Completely Missed The Significant Role of Adaptive Retrieval and Its Role in What Drives ChatGPT Citations
The article focused on which prompts in ChatGPT’s primary interface triggered citations, but didn’t mention adaptive retrieval at all.
That’s the fundamental mechanism here: adaptive retrieval is what decides whether to augment a response with external information.
Every single citation in the primary interface of ChatGPT comes from adaptive retrieval scoring:
Adaptive retrieval had existed in the LLM literature for many months prior to publication. I would be lying if I said with a straight face that it is not concerning that one of the leading industry publications couldn’t perform a simple Google search before writing this:
Why was this Google search not performed? Because SEOs already think they are knowledgeable: why research when you can observe and infer? Unfortunately, this is not a search engine, and SEO is NOT GEO. The danger is that they interpreted these queries without a basic understanding, then used a powerful platform to broadcast that misunderstanding to thousands and thousands of people, signalling that our industry doesn’t understand how these things work.
And you know what? It was the RIGHT MESSAGING. SEOs don’t understand how these systems work because they are not search engines, however, emerging leaders in GEO like Michael King DO.
While we’re here, let’s take a moment to explore Adaptive Retrieval and how it works. To the best of my knowledge (though I could be wrong), RAGate was the first research paper to gain mainstream attention for addressing adaptive retrieval. If you’re interested, you can read the original paper here.
At its core, a simplified version of RAGate functions as a binary classifier that determines when an LLM should or should not augment its response. In more detail, RAGate uses a gating function to decide when to trigger augmentation. It includes three variants and incorporates a trained BERT ranker alongside TF-IDF as rankers and retrievers.
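The gating idea can be sketched in a few lines. This is a minimal illustration of a RAGate-style binary gate, assuming a toy confidence scorer in place of the paper’s trained BERT ranker; the threshold and cue words are invented for illustration, not taken from the paper.

```python
# Minimal sketch of a RAGate-style binary gate. The scorer below is
# a hypothetical stand-in for a learned model (e.g. a BERT ranker).

def toy_confidence(utterance: str) -> float:
    """Stand-in for a learned model: low confidence for utterances
    that reference fresh/external facts, high for chit-chat."""
    needs_facts = ("latest", "price", "news", "2025", "compare")
    hits = sum(1 for w in needs_facts if w in utterance.lower())
    return max(0.0, 1.0 - 0.4 * hits)

def should_augment(utterance: str, threshold: float = 0.7) -> bool:
    """Gate: retrieve external documents only when the model's
    confidence in answering unaided falls below the threshold."""
    return toy_confidence(utterance) < threshold

print(should_augment("Tell me a joke"))                  # -> False
print(should_augment("Compare the latest CRM pricing"))  # -> True
```

The point is that citation behavior is a *decision made by a classifier*, not a property of “content types” in the SEO sense.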
There’s an even more technical breakdown available, but we’ll skip that for now. Below is the original framework:
The answers were grounded using the KETOD dataset. While the process may seem complex at first, the core idea is simple: these classifiers are what trigger augmentation. It is essential for companies to understand them and apply them correctly.
The problem with Semrush framing these insights through the lens of SEO is that it leads to broad generalizations. People start saying things like “news triggers results,” or “trending topics perform well, so we should invest there,” and “informational content does not, so we shouldn’t invest in it.”
These generalizations are harmful. Running observational studies without understanding the underlying context of what you are observing—especially in a system that is already adaptive—is not just misleading. It creates risk.
There is no need to guess or rely on surface-level observations. Dynamic and adaptive classifiers already exist. For example, the Gemini 2.0 API offers access to dynamic thresholds:
These thresholds can be applied to even basic keywords to give brands an unambiguous guide as to which types of keywords WILL and WILL NOT result in getting cited by an LLM.
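As a sketch of what applying such a threshold to a keyword list looks like in practice: the prediction scores below are made up for illustration; in a real workflow they would come from API response metadata (for example, Gemini’s grounding output).

```python
# Sketch: applying a dynamic-retrieval threshold to a keyword list.
# Scores are invented stand-ins for real API prediction scores.

keyword_scores = {
    "best crm for startups 2025": 0.91,
    "what is a crm": 0.22,
    "acme crm pricing today": 0.84,
    "define customer relationship management": 0.15,
}

def split_by_threshold(scores: dict, threshold: float = 0.3):
    """Keywords at/above the threshold are predicted to trigger
    retrieval (and thus citations); the rest answer from memory."""
    cited = sorted(k for k, s in scores.items() if s >= threshold)
    uncited = sorted(k for k, s in scores.items() if s < threshold)
    return cited, uncited

cited, uncited = split_by_threshold(keyword_scores)
print("likely cited:", cited)
print("unlikely to be cited:", uncited)
```

Fresh, specific queries score high (retrieval is needed); definitional queries the model can answer from memory score low.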
The entire study would have changed, and so would the understanding it conveyed.
Semrush Tried to Map Singular Intents to Long Prompts; Multiple Intents Should Have Been Discussed, Along With Multi-hop Retrieval/RAG
Am I surprised that Semrush didn’t mention multi-hop retrieval in the context of this study? Not at all. The only time I’ve seen the community even begin to touch on the topic was when analyzing the patent related to AI and fanout queries.
Here’s the reality: multi-hop retrieval techniques have existed since before 2022. Google is not ahead in LLM development. Their strength lies in information retrieval, not in large language models. Query fanouts are not a novel innovation. While fanout queries are distinct, they’re conceptually similar to multi-hop retrieval and clearly influenced by recent advances in the field.
What’s surprising is that the idea of multiple intents wasn’t even considered—especially given Google’s own internal principle known as “Every Query Deserves Diversity.”
This principle reflects Google’s long-standing understanding that a single query can carry multiple intents. Why it took the broader industry nearly two decades to catch up remains unclear.
Even more concerning is that longer prompts, which almost always contain multiple intents and instructions (as recognized by utterance classification), were overlooked. This was a major publication. Multiple intents have precedent in SEO and should have been part of the discussion.
Had multi-hop concepts been taken into account and applied properly, we could have labeled the dataset with a high degree of accuracy. We could have identified which prompts likely triggered multi-hop retrieval, inferred system-level instructions and intents, and even used regressive logic to map interaction paths.
Multi-hop RAG and retrieval methods are designed to handle this kind of complexity. They solve the challenge of deeper research by performing multiple retrievals for a single query, offering a more comprehensive response. Below is a simplified visual of what multi-hop retrieval looks like when applied to a complex prompt with multiple intents:
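Alongside that visual, here is a toy code sketch of the same idea. The corpus, the entity extraction, and the decomposition logic are all hypothetical stand-ins for an LLM planner plus a real retriever.

```python
# Toy multi-hop retrieval: one complex prompt decomposed into
# sub-queries, with each hop's answer feeding the next retrieval.
# Corpus and decomposition are hypothetical.

CORPUS = {
    "best project management tools": "Asana and Trello lead the category.",
    "asana pricing": "Asana's paid tier starts at $10.99/user/month.",
    "trello pricing": "Trello's paid tier starts at $5/user/month.",
}

def retrieve(query: str) -> str:
    return CORPUS.get(query.lower(), "no document found")

def multi_hop(prompt: str) -> list:
    # Hop 1: resolve the broad intent to candidate entities.
    hop1 = retrieve("best project management tools")
    # Hop 2+: one follow-up retrieval per entity surfaced in hop 1.
    entities = [e for e in ("Asana", "Trello") if e in hop1]
    followups = [retrieve(f"{e.lower()} pricing") for e in entities]
    return [hop1] + followups

for step in multi_hop("What are the best PM tools and what do they cost?"):
    print(step)
```

One prompt, three retrievals: the second and third queries only exist because of what the first hop returned. Single-intent labeling has no way to represent this.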
The other application is to a more singular topic, using Chain-of-Thought reasoning, at least in more sophisticated systems. This is what CoT RAG is all about.
From Google’s release video for AI Mode, we can quite literally see the multi-hops taking place, coupled with Chain-of-Thought reasoning, to accomplish tasks:
Right now, Google is taking established concepts and terminology from the research community, rebranding them, and filing patents.
Our community then analyzes these patents, adopts the new terminology, and in doing so, creates competing language with the terminology already used in the LLM community. This fragmentation makes it harder for the GEO community to reach a shared understanding.
We’ve seen this before. What the research community calls adaptive retrieval, Google now refers to as dynamic retrieval. In general, I encourage everyone in the field to study the original research papers that precede consumer-facing implementations. This helps us anticipate what’s coming and better understand how these commercial versions differ from the original concepts.
In the case of Semrush, had they not tried to force single-intent frameworks onto multi-intent prompts, they could potentially have labeled the dataset accurately and surfaced far richer insights.
The Worst Sin of All: No Mention of Basic LLM Vocabulary
Perhaps the most concerning issue is the lack of adoption of proper terminology. When search engines emerged, we adopted the concept of search intents and for good reason. The search engines themselves defined those terms.
We clearly need to take the same approach with LLMs.
No one should be offering SEO or GEO guidance if they haven’t even adopted the basic vocabulary. And the truth is, the vocabulary isn’t that difficult.
- Utterance = a string of text from an LLM or user
- Turn = a user utterance + an LLM utterance
- Multiple turns = a multi-turn conversation
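The vocabulary above maps directly onto a simple data structure, which is roughly how conversational systems represent it internally (the class names here are my own illustration):

```python
# The basic LLM vocabulary as a data structure: utterances pair
# into turns, and a list of turns is a multi-turn conversation.

from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: str  # "user" or "llm"
    text: str

@dataclass
class Turn:
    user: Utterance
    llm: Utterance

conversation = [  # multi-turn conversation = multiple turns
    Turn(Utterance("user", "What is GEO?"),
         Utterance("llm", "Generative Engine Optimization.")),
    Turn(Utterance("user", "How is it different from SEO?"),
         Utterance("llm", "It optimizes for LLM responses, not SERPs.")),
]

print(len(conversation), "turns,", 2 * len(conversation), "utterances")
```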
Is there a single mention of utterances in Semrush’s literature? Nope:
If a leading publication can’t take the time to perform a basic Google search before launching studies and building tools, what can I even say?
If it’s truly the case that a major publication and tool provider is building products and conducting in-depth studies without first understanding how these systems work, then the marketing world is in serious trouble, and it’s clear SEO tools should stay in their lane.
I wish I could say this is true only of Semrush, but it is not. It’s true of the whole community.
Has Ahrefs mentioned utterances? Nope:
I wish I could say the above is not true, but it is. To me, this is one of the worst disasters in marketing history, though that is just my opinion.
How much money went into these tools? How many businesses were misled? This is akin to building an SEO tool without knowing what search intent is.
My Final Verdict: The Industry Needs a Serious Reality Check
Will I catch some flack for writing this? Probably. But I don’t mind. It needed to be said—for the good of the marketing industry.
I strongly urge major tools and publications to re-evaluate their methodologies and revisit their datasets. Start with the basics. Even a simple Google search would have made a difference.
Just imagine how powerful that study could have been if utterance classification had been applied. The insights could have been groundbreaking—arguably more impactful than any study conducted in this space to date.
It was a great dataset, but the analysis missed the mark. It’s unfortunate, but it’s the truth. I encourage Semrush to take another look, analyze the dataset properly, and consider releasing an updated version.
Some of the Advanced Methods and Tools We Are Building at Flying V Group
At Flying V Group, we’re not just following trends — we’re building the next generation of GEO methodologies from the ground up. Below are some of the advanced methods and tools we’re developing to push the boundaries of what’s possible in this space.
Identifying Keywords That Will Result in Citations With a High Degree of Confidence
The industry can use Google’s Gemini grounding confidence intervals to batch-identify which queries (and query rewrites) will result in content being served to LLMs.
This enables us to determine with high confidence, at a macro scale, which queries and content will result in citations and be pulled by LLMs.
Adaptive retrieval logic decides whether ChatGPT will pull external documents. These adaptive thresholds can be applied to large keyword datasets to identify which types of keywords will generate citations and which will not.
This gives companies a clear guide on what content they should and should not invest in, with a high degree of confidence. The underlying logic is demonstrated in Google’s dynamic threshold example below:
Simulating and Mapping Multi-turn Conversations Based on Persona
We’ve developed an early tool that allows us, with some degree of efficacy, to simulate full multi-turn conversations based on persona. Below is a sneak peek:
Identifying Low-competition Topics by Applying Grounding Scores
Identifying which content will be cited by LLMs is important, but finding low-competition keywords is equally valuable.
By combining dynamic thresholds with grounding scores, we can determine both the types of content likely to be cited and pinpoint low-hanging opportunities.
Grounding scores measure how much an answer improves through retrieval. A low grounding score suggests that retrieval had little impact on the answer. While there can be various reasons for this, one key indicator is content scarcity, which means there is limited existing content on the topic and signals low competition.
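Here is the heuristic as a sketch. The grounding score is treated as “how much retrieval improved the answer” on a 0–1 scale; the topics, scores, and cutoff below are invented for illustration, and in practice the numbers would come from real API metadata.

```python
# Sketch of the low-competition heuristic: low grounding scores can
# indicate content scarcity. All topics and scores are hypothetical.

topic_grounding = {
    "ai crm for dental labs": 0.08,
    "best crm software": 0.92,
    "crm for niche 3d-printing shops": 0.11,
    "what is crm": 0.85,
}

def low_competition(scores: dict, cutoff: float = 0.2) -> list:
    """Flag topics where retrieval barely moved the answer, one
    possible signal that little content exists to retrieve."""
    return sorted(t for t, s in scores.items() if s < cutoff)

print(low_competition(topic_grounding))
# -> ['ai crm for dental labs', 'crm for niche 3d-printing shops']
```

As the section notes, a low score has several possible causes, so this is a screening signal to investigate, not a verdict.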
Identifying Important Citations Through Thousands of Conversational Simulations by ICP
While we’re proud to have achieved the ability to simulate multi-turn conversations based on persona, a recent discussion with a highly knowledgeable prospective client made it clear that each persona can follow multiple conversational paths.
So how do we identify which content is most likely to be shown to a given persona? Our hypothesis is that we can solve this by generating thousands of permutations by ICP, then analyzing which citations are pulled most frequently across those variations. Below is the conceptual logic:
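The counting side of that hypothesis is simple to sketch. Here, each simulated conversation is stubbed with weighted random draws standing in for real LLM runs; the sources, weights, and run count are all invented for illustration.

```python
# Conceptual sketch: run many simulated persona conversations and
# tally which sources are cited most often. The simulation below is
# a weighted-random stub, not a real LLM call.

import random
from collections import Counter

SOURCES = ["vendor-blog.example", "review-site.example",
           "forum.example", "docs.example"]

def simulate_conversation(rng: random.Random) -> list:
    """Stand-in for one persona conversation: returns the sources
    'cited' in that run, biased toward the first two."""
    return rng.choices(SOURCES, weights=[0.4, 0.3, 0.2, 0.1], k=3)

rng = random.Random(42)  # seeded so the tally is reproducible
tally = Counter()
for _ in range(1000):
    tally.update(simulate_conversation(rng))

for source, count in tally.most_common(2):
    print(source, count)
```

The frequency table is the deliverable: across thousands of permutations, the sources that keep surfacing are the ones most likely to be shown to that persona.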
It’s early days for this, but we expect to fold it into our approach within the next 30 days.
Deep Research and User Behavior
Though deep research is the frontier, optimising for it largely comes down to understanding CoT RAG coupled with ToT reasoning. Chain-of-Thought is a progressive, linear form of reasoning, while Tree-of-Thoughts is a recursive form that allows the LLM to reverse course and pick the path most likely to achieve the best outcome.
CoT RAG coupled with ToT and a large context window is what enables sophisticated research assistance. Our understanding of how this works is quite strong and we’re working on some pretty cool approaches.
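The CoT/ToT distinction can be shown with a toy search. A CoT system would commit to one greedy path; the ToT-style recursion below explores every branch and keeps the best-scoring complete path (the “thoughts” and scores are hypothetical).

```python
# Toy Tree-of-Thoughts search: explore branches recursively and
# return the best-scoring path. Thoughts and scores are invented.

THOUGHT_TREE = {
    "start": [("outline first", 0.6), ("draft directly", 0.8)],
    "outline first": [("expand outline", 0.9)],
    "draft directly": [("revise draft", 0.5)],
}

def best_path(node: str, score: float = 0.0) -> tuple:
    """Recursive search: backtracking is implicit in the recursion,
    since every branch is scored before one is committed to."""
    children = THOUGHT_TREE.get(node, [])
    if not children:
        return score, [node]
    candidates = [best_path(child, score + s) for child, s in children]
    top_score, top_path = max(candidates)
    return top_score, [node] + top_path

score, path = best_path("start")
print(round(score, 1), "->", " / ".join(path))
# -> 1.5 -> start / outline first / expand outline
```

Note that the greedy (CoT-like) first step, “draft directly” at 0.8, leads to a worse total than the initially weaker “outline first” branch. That ability to back out of a locally attractive path is the whole point of ToT.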
Influencing LLMs Through Brute Force Optimisation
Another area we’re experimenting with is letting AI make tweaks to internal web content, then using an AI judge to evaluate how the changed content alters LLM outputs after retrieval. Surprisingly, we can index and iterate quite quickly. This logic allows brands to sit back while AI optimizes content to influence LLM outputs on autopilot.
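The loop itself is a plain hill-climb: propose a tweak, score the result, keep only improvements. In the sketch below a toy keyword scorer stands in for the AI judge of retrieved LLM outputs; the target terms and content are hypothetical.

```python
# Hill-climbing sketch of the brute-force optimization loop. The
# judge is a toy keyword scorer standing in for an AI judge; terms
# and content are hypothetical.

TARGET_TERMS = ["pricing", "integrations", "support"]

def judge(content: str) -> int:
    """Toy judge: how many target terms the content now covers."""
    return sum(1 for t in TARGET_TERMS if t in content.lower())

def optimize(content: str) -> str:
    """Greedy loop: propose one tweak per term, keep it only if
    the judge's score improves."""
    best = judge(content)
    for term in TARGET_TERMS:
        candidate = content + " " + term
        if judge(candidate) > best:
            content, best = candidate, judge(candidate)
    return content

result = optimize("Acme CRM helps teams close deals.")
print(judge(result))  # -> 3
```

A real version would mutate content with an LLM and score against actual retrieval outputs, but the accept/reject structure is the same.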
In Conclusion
As I close this piece, I want to make something clear: I wrote this not to tear down Semrush or to fuel more division in our already fragmented marketing world. I wrote it because I care deeply about the future of this industry, a future I believe is being stunted by misunderstandings, misapplications, and a refusal to evolve.
I am not claiming to have all the answers, but one thing is certain: Generative Engine Optimization is not SEO. It demands new thinking, new techniques, and, above all, a willingness to question old narratives. What I see across the board, SEOs clinging to dated frameworks while powerful LLMs upend everything around them, is a heartbreaking waste of potential.
SEO gave me, a community college dropout, a path. I have seen it give thousands of others the same. But if we want to honor that gift, we have to let go of what is comfortable and embrace what is next. That means understanding utterances, adaptive retrieval, multi-hop logic, even if it feels like we are starting over.
Flying V Group is trying to do that. We are not perfect, but we are doing the hard work to map this new terrain. My hope is that others will join us in rethinking the very foundations of search, marketing, and how we engage with the world’s most powerful AI systems.
Let’s move beyond the old playbook. Let’s build a new one, together.