A new buzzword is making waves in the tech world, and it goes by several names: large language model optimization (LLMO), generative engine optimization (GEO) or generative AI optimization (GAIO).
At its core, GEO is about optimizing how generative AI applications present your products, brands, or website content in their results. For simplicity, I’ll refer to this concept as GEO throughout this article.
I’ve previously explored whether it’s possible to shape the outputs of generative AI systems. That discussion was my initial foray into the topic of GEO.
Since then, the landscape has evolved rapidly, with new generative AI applications capturing significant attention. It’s time to delve deeper into this fascinating area.
Platforms like ChatGPT, Google AI Overviews, Microsoft Copilot and Perplexity are revolutionizing how users search and consume information and transforming how businesses and brands can gain visibility in AI-generated content.
A quick disclaimer: no proven methods exist yet in this field.
It’s still too new, reminiscent of the early days of SEO when search engine ranking factors were unknown and progress relied on testing, research and a deep technological understanding of information retrieval and search engines.
Understanding the landscape of generative AI
Understanding how natural language processing (NLP) and large language models (LLMs) function is critical in this early stage.
A solid grasp of these technologies is essential for identifying future potential in SEO, digital brand building and content strategies.
The approaches outlined here are based on my research of scientific literature, generative AI patents and over a decade of experience working with semantic search.
How large language models work
Core functionality of LLMs
Before engaging with GEO, it’s essential to have a basic understanding of the technology behind LLMs.
As with search engines, understanding the underlying mechanisms helps you avoid chasing ineffective hacks or false recommendations.
Investing a few hours to grasp these concepts can save resources by steering clear of unnecessary measures.
What makes LLMs revolutionary
LLMs, such as GPT models, Claude or LLaMA, represent a transformative leap in search technology and generative AI.
They change how search engines and AI assistants process and respond to queries, moving beyond simple text matching to deliver nuanced, contextually rich answers.
Research such as Microsoft’s “Large Search Model: Redefining Search Stack in the Era of LLMs” documents these capabilities in language comprehension and reasoning.
Core functionality in search
The core functionality of LLMs in search is to process queries and produce natural language summaries.
Instead of just extracting information from existing documents, these models can generate comprehensive answers while maintaining accuracy and relevance.
This is achieved through a unified framework that treats all (search-related) tasks as text generation problems.
What makes this approach particularly powerful is its ability to customize answers through natural language prompts. The system first generates an initial set of query results, which the LLM refines and improves.
If additional information is needed, the LLM can generate supplementary queries to collect more comprehensive data.
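To make that flow concrete, here is a minimal toy sketch of such a loop. Every function is an illustrative stand-in I defined for this example, not any platform’s actual pipeline or API:

```python
# Toy sketch of the loop described above: generate an initial result set,
# let the model draft an answer and issue a supplementary query if needed.
# All functions are illustrative stand-ins, not a real product's pipeline.

def search(query: str) -> list[str]:
    corpus = {
        "best running shoes": ["Review: cushioned trainers", "Guide: stability shoes"],
        "running shoes heavy runners": ["Test: shoes for heavier runners"],
    }
    return corpus.get(query, [])

def needs_supplementary_query(docs: list[str]):
    # Stand-in for the LLM deciding that more specific data is required.
    return "running shoes heavy runners" if len(docs) < 3 else None

def generate_answer(prompt: str, docs: list[str]) -> str:
    # Stand-in for the LLM summarizing the retrieved documents.
    return f"Answer to '{prompt}', grounded in {len(docs)} retrieved documents."

docs = search("best running shoes")          # initial set of query results
follow_up = needs_supplementary_query(docs)  # does the model need more data?
if follow_up:
    docs += search(follow_up)                # supplementary query
print(generate_answer("best running shoes", docs))
```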
The underlying processes of encoding and decoding are key to their functionality.
The encoding process
Encoding involves processing and structuring training data into tokens, which are fundamental units used by language models.
Tokens can represent words, n-grams, entities, images, videos or entire documents, depending on the application.
It’s important to note, however, that LLMs do not “understand” in the human sense – they process data statistically rather than comprehending it.
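A quick way to see tokenization in practice is OpenAI’s open-source tiktoken library (assuming `pip install tiktoken`); the exact token boundaries and IDs depend on the model’s tokenizer:

```python
# Tokenize a sentence with the cl100k_base tokenizer used by several OpenAI models.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Best running shoes for heavy runners")
print(tokens)                             # token IDs (exact values depend on the tokenizer)
print([enc.decode([t]) for t in tokens])  # the text piece behind each token
```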
Transforming tokens into vectors
In the next step, tokens are transformed into vectors, forming the foundation of Google’s transformer technology and transformer-based language models.
This breakthrough was a game changer in AI and is a key factor in the widespread adoption of AI models today.
Vectors are numerical representations of tokens, with the numbers capturing specific attributes that describe the properties of each token.
These properties allow vectors to be positioned within semantic spaces and related to other vectors; such vector representations are known as embeddings.
The semantic similarity and relationships between vectors can then be measured using methods like cosine similarity or Euclidean distance.
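As a minimal sketch of how that similarity is measured, here is cosine similarity computed over toy embedding vectors (real embeddings have hundreds or thousands of dimensions):

```python
import numpy as np

# Toy 4-dimensional "embeddings"; production embeddings are much larger.
running_shoes = np.array([0.9, 0.1, 0.4, 0.7])
jogging_shoes = np.array([0.8, 0.2, 0.5, 0.6])
coffee_maker  = np.array([0.1, 0.9, 0.2, 0.1])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(running_shoes, jogging_shoes))  # high: semantically close
print(cosine_similarity(running_shoes, coffee_maker))   # low: unrelated concepts
```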
The decoding process
Decoding is about interpreting the probabilities that the model calculates for each possible next token (word or symbol).
The goal is to create the most sensible or natural sequence. Different methods, such as top-k sampling or top-p (nucleus) sampling, can be used during decoding.
Each candidate next token is assigned a probability score. Depending on how much creative latitude the model is given, only the top k tokens are considered as possible next words.
Models configured with a broader sampling scope can also pick tokens beyond the single highest-probability candidate, making the output more creative.
This also explains why the same prompt can produce different results. Models configured to sample strictly will return very similar results every time.
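A minimal sketch of both sampling strategies over a toy next-token distribution shows why a wider sampling scope produces more varied output:

```python
import numpy as np

# Toy probability distribution over candidate next tokens.
tokens = ["shoes", "sneakers", "boots", "sandals", "slippers"]
probs  = np.array([0.50, 0.25, 0.15, 0.07, 0.03])

def top_k_sample(tokens, probs, k=3):
    # Keep only the k most probable tokens, renormalize, then sample.
    idx = np.argsort(probs)[::-1][:k]
    p = probs[idx] / probs[idx].sum()
    return np.random.choice(np.array(tokens)[idx], p=p)

def top_p_sample(tokens, probs, p=0.9):
    # Keep the smallest set of tokens whose cumulative probability reaches p.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    idx = order[:cutoff]
    q = probs[idx] / probs[idx].sum()
    return np.random.choice(np.array(tokens)[idx], p=q)

print(top_k_sample(tokens, probs, k=1))  # "strict": always the top-1 token
print(top_k_sample(tokens, probs, k=3))  # more "creative": varies between runs
print(top_p_sample(tokens, probs, p=0.9))
```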
Beyond text: The multimedia capabilities of generative AI
The encoding and decoding processes in generative AI rely on natural language processing.
By using NLP, the context window can be expanded to account for grammatical sentence structure, enabling the identification of main and secondary entities during natural language understanding.
Generative AI extends beyond text to include multimedia formats like audio and, occasionally, visuals.
However, these formats are typically transformed into text tokens during the encoding process for further processing. (This discussion focuses on text-based generative AI, which is the most relevant for GEO applications.)
Dig deeper: How to win with generative engine optimization while keeping SEO top-tier
Challenges and advancements in generative AI
Major challenges for generative AI include ensuring information remains up-to-date, avoiding hallucinations, and delivering detailed insights on specific topics.
Basic LLMs are often trained on superficial information, which can lead to generic or inaccurate responses to specific queries.
To address this, retrieval-augmented generation (RAG) has become a widely used method.
Retrieval-augmented generation: A solution to information challenges
RAG supplies LLMs with additional topic-specific data, helping them overcome these challenges more effectively.
In addition to documents, topic-specific information can also be integrated using knowledge graphs or entity nodes transformed into vectors.
This enables the inclusion of ontological information about relationships between entities, moving closer to true semantic understanding.
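As a minimal sketch, assuming a toy keyword-overlap retriever in place of a real search backend, the “augmentation” step is essentially prompt assembly:

```python
# Minimal RAG sketch: retrieve topic-specific passages, then prepend them
# to the prompt so the LLM answers from that context instead of relying
# only on its training data. The retriever here is a toy keyword-overlap scorer.

documents = [
    "Max-cushion trainers are often recommended for heavier runners.",
    "Carbon-plated racing shoes target experienced marathon runners.",
    "A knowledge graph can store that 'Brand X' manufactures 'Model Y'.",
]

def retrieve(query: str, docs: list[str], top_n: int = 2) -> list[str]:
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return scored[:top_n]

question = "Which running shoes suit heavier runners?"
context = "\n".join(retrieve(question, documents))
augmented_prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(augmented_prompt)  # this is what would be sent to the LLM
```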
RAG offers potential starting points for GEO. While determining or influencing the sources in the initial training data can be challenging, GEO allows for a more targeted focus on preferred topic-specific sources.
The key question is how different platforms select these sources, which depends on whether their applications have access to a retrieval system capable of evaluating and selecting sources based on relevance and quality.
The critical role of retrieval models
Retrieval models play a crucial role in the RAG architecture by acting as information gatekeepers.
They search through large datasets to identify relevant information for text generation, functioning like specialized librarians who know exactly which “books” to retrieve on a given topic.
These models use algorithms to evaluate and select the most pertinent data, enabling the integration of external knowledge into text generation. This enhances context-rich language output and expands the capabilities of traditional language models.
Retrieval systems can be implemented through various mechanisms (see the sketch after this list), including:
Vector embeddings and vector search.
Document index databases using techniques like BM25 and TF-IDF.
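For the document-index approach, scikit-learn’s TfidfVectorizer is enough to rank a small corpus against a query (assuming `pip install scikit-learn`); BM25 works on the same principle with a different scoring formula:

```python
# Rank documents against a query with TF-IDF weighting and cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Best running shoes for heavy runners in 2024",
    "7 best long distance running shoes",
    "How to descale your coffee maker",
]
query = "running shoes for heavy runners"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform([query])

scores = cosine_similarity(query_vector, doc_vectors).ravel()
for doc, score in sorted(zip(docs, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.2f}  {doc}")
```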
Retrieval approaches of major AI platforms
Not all systems have access to such retrieval systems, which presents challenges for RAG.
This limitation may explain why Meta is now working on its own search engine, which would allow it to leverage RAG within its LLaMA models using a proprietary retrieval system.
Perplexity claims to use its own index and ranking systems, though there are accusations that it scrapes or copies search results from other engines like Google.
Claude’s approach remains unclear regarding whether it uses RAG alongside its own index and user-provided information.
Gemini, Copilot and ChatGPT differ slightly. Microsoft and Google leverage their own search engines for RAG or domain-specific training.
ChatGPT has historically used Bing search, but with the introduction of SearchGPT, it’s uncertain if OpenAI operates its own retrieval system.
OpenAI has stated that SearchGPT employs a mix of search engine technologies, including Microsoft Bing.
“The search model is a fine-tuned version of GPT-4o, post-trained using novel synthetic data generation techniques, including distilling outputs from OpenAI o1-preview. ChatGPT search leverages third-party search providers, as well as content provided directly by our partners, to provide the information users are looking for.”
Microsoft is one of ChatGPT’s partners.
When ChatGPT is asked about the best running shoes, there is some overlap between the top-ranking pages in Bing search results and the sources used in its answers, though the overlap is significantly less than 100%.
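One way to check that overlap yourself is to compare the domains cited by ChatGPT with the top-ranking Bing domains for the same query. The lists below are placeholders, not measured data:

```python
# Compare cited source domains against top-ranking search result domains.
# Both sets are illustrative placeholders, not actual measurements.
bing_top_domains = {"runnersworld.com", "runrepeat.com", "wired.com", "nytimes.com"}
chatgpt_cited_domains = {"runnersworld.com", "runrepeat.com", "roadrunnersports.com"}

overlap = bing_top_domains & chatgpt_cited_domains
share = len(overlap) / len(chatgpt_cited_domains)
print(f"{share:.0%} of cited sources also rank in the top Bing results: {overlap}")
```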
Evaluating the retrieval-augmented generation process
A number of factors and metrics can be used to evaluate the RAG pipeline, including the following (a short sketch computing MAP and MRR follows this list):
Faithfulness: Measures the factual consistency of generated answers against the given context.
Answer relevancy: Evaluates how pertinent the generated answer is to the given prompt.
Context precision: Assesses whether relevant items in the contexts are ranked appropriately, with scores from 0 to 1.
Aspect critique: Evaluates submissions based on predefined aspects like harmlessness and correctness, with the ability to define custom evaluation criteria.
Groundedness: Measures how well answers align with and can be verified against source information, ensuring claims are substantiated by the context.
Source references: Having citations and links to original sources allows verification and helps identify retrieval issues.
Distribution and coverage: Ensuring balanced representation across different source documents and sections through controlled sampling.
Correctness/Factual accuracy: Whether generated content contains accurate facts.
Mean average precision (MAP): Evaluates the overall precision of retrieval across multiple queries, considering both precision and document ranking. It calculates the mean of average precision scores for each query, where precision is computed at each position in the ranked results. A higher MAP indicates better retrieval performance, with relevant documents appearing higher in search results.
Mean reciprocal rank (MRR): Measures how quickly the first relevant document appears in search results. It’s calculated by taking the reciprocal of the rank position of the first relevant document for each query, then averaging these values across all queries. For example, if the first relevant document appears at position 4, the reciprocal rank would be 1/4. MRR is particularly useful when the position of the first correct result matters most.
Stand-alone quality: Evaluates how context-independent and self-contained the content is, scored 1-5 where 5 means the content makes complete sense by itself without requiring additional context.
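As a minimal sketch of the two ranking metrics above, here is how MAP and MRR fall out of per-query relevance judgments over ranked results:

```python
# Compute mean average precision (MAP) and mean reciprocal rank (MRR)
# from per-query relevance judgments (1 = relevant, 0 = not relevant),
# listed in the ranked order of the retrieved documents.
results = [
    [1, 0, 1, 0],  # query 1: relevant docs at positions 1 and 3
    [0, 0, 0, 1],  # query 2: first relevant doc at position 4 -> RR = 1/4
]

def average_precision(rels):
    hits, total = 0, 0.0
    for i, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / i  # precision at each relevant position
    return total / hits if hits else 0.0

def reciprocal_rank(rels):
    for i, rel in enumerate(rels, start=1):
        if rel:
            return 1 / i
    return 0.0

print("MAP:", sum(average_precision(r) for r in results) / len(results))
print("MRR:", sum(reciprocal_rank(r) for r in results) / len(results))
```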
Prompt vs. query
A prompt is more complex and more closely aligned with natural language than typical search queries, which are often just a series of key terms.
Prompts are typically framed with explicit questions or coherent sentences, providing greater context and enabling more precise answers.
It is important to distinguish between optimizing for AI Overviews and AI assistant results.
AI Overviews, a Google SERP feature, are generally triggered by search queries, whereas AI assistants rely on more complex natural language prompts.
To bridge this gap, the RAG process must convert the prompt into a search query in the background, preserving critical context to effectively identify suitable sources.
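As a toy illustration of that conversion step (production systems typically let the LLM itself rewrite the prompt into a query), one crude approach is to strip filler words and keep the entities and attributes that matter for retrieval:

```python
# Toy prompt-to-query rewrite: drop filler words, keep entities and attributes.
# Real RAG pipelines usually have the LLM generate the search query instead.
STOPWORDS = {"i", "am", "and", "are", "the", "what", "for", "me", "to", "go", "a"}

prompt = ("I am 47, weigh 95 kilograms, and am 180 cm tall. I go running three "
          "times a week, 6 to 8 kilometers. What are the best jogging shoes for me?")

terms = [w.strip(",.?") for w in prompt.lower().split()]
query = " ".join(w for w in terms if w and w not in STOPWORDS)
print(query)
# -> "47 weigh 95 kilograms 180 cm tall running three times week 6 8 kilometers best jogging shoes"
```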
Goals and strategies of GEO
The goals of GEO are not always clearly defined in discussions.
Some focus on having their own content cited in referenced source links, while others aim to have their name, brand or products mentioned directly in the output of generative AI.
Both goals are valid but require different strategies.
Being cited in source links depends on your content being selected and referenced as a source, whereas mentions in AI output rely on increasing the likelihood of your entity – whether a person, organization or product – being included in relevant contexts.
A foundational step for both objectives is to establish a presence among preferred or frequently selected sources, as this is a prerequisite for achieving either goal.
Do we need to focus on all LLMs?
The varying results of AI applications demonstrate that each platform uses its own processes and criteria for recommending named entities and selecting sources.
In the future, it will likely be necessary to work with multiple large language models or AI assistants and understand their unique functionalities. For SEOs accustomed to Google’s dominance, this will require an adjustment.
Over the coming years, it will be essential to monitor which applications gain traction in specific markets and industries and to understand how each selects its sources.
Why are certain people, brands or products cited by generative AI?
In the coming years, more people will rely on AI applications to search for products and services.
For example, a prompt like:
“I am 47, weigh 95 kilograms, and am 180 cm tall. I go running three times a week, 6 to 8 kilometers. What are the best jogging shoes for me?”
This prompt provides key contextual information, including age, weight, height and distance as attributes, with jogging shoes as the main entity.
Products frequently associated with such contexts have a higher likelihood of being mentioned by generative AI.
Testing platforms like Gemini, Copilot, ChatGPT and Perplexity can reveal which contexts these systems consider.
Based on the headings of the cited sources, all four systems appear to have deduced from the attributes that I am overweight, generating information from posts with headings like:
Best Running Shoes for Heavy Runners (August 2024)
7 Best Running Shoes For Heavy Men in 2024
Best Running Shoes for Heavy Men in 2024
Best running shoes for heavy female runners
7 Best Long Distance Running Shoes in 2024
Copilot
Copilot considers attributes such as age and weight.
Based on the referenced sources, it identifies an overweight context from this information.
All cited sources are informational content, such as tests, reviews and listicles, rather than ecommerce category or product detail pages.
ChatGPT
ChatGPT takes attributes like distance and weight into account. From the referenced sources, it derives an overweight and long-distance context.
All cited sources are informational content, such as tests, reviews and listicles, rather than typical shop pages like category or product detail pages.
Perplexity
Perplexity considers the weight attribute and derives an overweight context from the referenced sources.
The sources include informational content, such as tests, reviews, listicles and typical shop pages.
Gemini
Gemini does not directly provide sources in the output. However, further investigation reveals that it also processes the contexts of age and weight.