A vector database is a collection of data where each piece of data is stored as a (numerical) vector. A vector represents an object or entity, such as an image, a person, or a place, in an abstract N-dimensional space.
Vectors, as explained in the previous chapter, are crucial for identifying how entities are related and can be used to find their semantic similarity. This can be applied in several ways for SEO – such as grouping similar keywords or content (using kNN).
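As a quick refresher, here is a minimal Python sketch of how the cosine similarity between two embedding vectors can be computed with NumPy (the short vectors below are made-up toy values; real embeddings have hundreds of dimensions):

import numpy as np

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their magnitudes
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" for two related keywords
keyword_a = np.array([0.12, 0.87, 0.33, 0.05])
keyword_b = np.array([0.10, 0.80, 0.40, 0.09])

print(cosine_similarity(keyword_a, keyword_b))  # values close to 1.0 indicate semantic similarity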
In this article, we are going to learn a few ways to apply AI to SEO, including finding semantically similar content for internal linking. This can help you refine your content strategy in an era where search engines increasingly rely on LLMs.
You can also read a previous article in this series about how to find keyword cannibalization using OpenAI’s text embeddings.
Let’s dive in here to start building the basis of our tool.
Understanding Vector Databases
If you have thousands of articles and want to find the closest semantic match for your target query, you can't generate vector embeddings for all of them on the fly each time you compare, as that would be highly inefficient.
Instead, we need to generate the vector embeddings only once and keep them in a database we can query to find the closest-matching article.
And that is what vector databases do: They are special types of databases that store embeddings (vectors).
When you query the database, unlike traditional databases, it performs a similarity match (in our case, using cosine similarity) and returns the vectors (in this case, articles) closest to the vector being queried (in this case, a keyword phrase).
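To illustrate what happens under the hood, here is a brute-force sketch of that matching logic in plain Python, again with made-up vectors; a vector database performs this kind of nearest-neighbor search for you, efficiently and at scale:

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical pre-computed article embeddings (toy values)
articles = {
    "/keyword-research-guide": np.array([0.11, 0.82, 0.35]),
    "/ppc-bidding-basics": np.array([0.75, 0.10, 0.20]),
}

query = np.array([0.12, 0.80, 0.30])  # embedding of the keyword phrase being queried

# Rank articles by cosine similarity to the query and take the closest one
ranked = sorted(articles, key=lambda url: cosine(query, articles[url]), reverse=True)
print(ranked[0])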
Here is what it looks like:
Text embedding record example in the vector database.
In the vector database, vectors are stored alongside their metadata, which we can easily query using a programming language of our choice.
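Conceptually, a single record looks something like this (a made-up illustration: the permalink, values, and metadata below are hypothetical, and the embedding is truncated for readability):

# A simplified picture of one stored record: a unique ID, the embedding, and metadata
record = {
    "id": "https://example.com/keyword-research-guide",  # hypothetical permalink used as the unique ID
    "values": [0.021, -0.013, 0.044],  # truncated; ada-002 embeddings have 1,536 dimensions
    "metadata": {
        "title": "Keyword Research Guide",
        "category": "SEO",
        "type": "evergreen",
        "publish_year": 2024,
    },
}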
In this article, we will be using Pinecone due to its simplicity and ease of use, but there are other providers, such as Chroma, BigQuery, or Qdrant, you may want to check out.
Let’s dive in.
1. Create A Vector Database
First, register an account at Pinecone and create an index with the 'text-embedding-ada-002' configuration and 'cosine' as the metric to measure vector distance. You can name the index anything; we will name it 'article-index-all-ada'.
Creating a vector database.
This helper UI is only for assisting you during setup. If you want to store Vertex AI vector embeddings, you need to set 'Dimensions' to 768 manually in the config screen to match the default dimensionality of Vertex AI text vectors (you can set the dimension value anywhere from 1 to 768 to save memory).
In this article, we will learn how to use OpenAI's 'text-embedding-ada-002' and Google Vertex AI's 'text-embedding-005' models.
Once the index is created, we need an API key and the host URL of the vector database to be able to connect to it.
Generate an API key
Host URL of vector database
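By the way, once you have the API key, you can also create such an index programmatically rather than through the UI. Here is a minimal sketch using the pinecone-client package; the ServerlessSpec cloud and region values are assumptions you should adjust to your own account:

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")

# 1536 dimensions matches OpenAI's text-embedding-ada-002;
# use 768 instead if you plan to store Google Vertex AI embeddings
pc.create_index(
    name="article-index-all-ada",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)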
Next, you will need to use Jupyter Notebook. If you don’t have it installed, follow this guide to install it and run this command (below) afterward in your PC’s terminal to install all necessary packages.
pip install openai google-cloud-aiplatform google-auth pandas pinecone-client tabulate ipython numpy
And remember, ChatGPT is very useful when you encounter issues during coding!
2. Export Your Articles From Your CMS
Next, we need to prepare a CSV export file of articles from your CMS. If you use WordPress, you can use a plugin to do customized exports.
As our ultimate goal is to build an internal linking tool, we need to decide which data should be pushed to the vector database as metadata. Essentially, metadata-based filtering acts as an additional layer of retrieval guidance, aligning the tool with the general RAG framework by incorporating external knowledge, which helps to improve retrieval quality.
For instance, if we are editing an article on "PPC" and want to insert a link for the phrase "Keyword Research," we can specify in our tool that "Category=PPC" so it queries only articles within the "PPC" category, ensuring accurate and contextually relevant linking. Or, we may want to link the phrase "most recent Google update" and limit the match to news articles published this year by using "Type" and the publish year.
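To make this concrete, here is a sketch of what such a metadata-filtered query could look like in Pinecone once the records are stored (the API key placeholder and the zero-filled query vector are stand-ins; in practice, the query vector would be the embedding of the anchor phrase):

from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("article-index-all-ada")

query_vector = [0.0] * 1536  # placeholder for the anchor phrase embedding

results = index.query(
    vector=query_vector,
    top_k=3,  # return the three closest articles
    include_metadata=True,
    filter={
        "category": {"$eq": "PPC"},  # only match articles in the PPC category
        # for the news example, filter instead on something like:
        # "type": {"$eq": "news"}, "publish_year": {"$eq": 2024},
    },
)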
In our case, we will be exporting:
Title.
Category.
Type.
Publish Date.
Publish Year.
Permalink.
Meta Description.
Content.
To help return the best results, we will concatenate the title and meta description fields, as together they are the most focused representation of the article that we can vectorize, and they are ideal for embedding and internal linking purposes.
Using the full article content for embeddings may reduce precision and dilute the relevance of the vectors.
This happens because a single large embedding tries to represent multiple topics covered in the article at once, leading to a less focused and less relevant representation. Chunking strategies (splitting the article by natural headings or semantically meaningful segments) would need to be applied, but these are not the focus of this article.
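For reference only, a naive version of such chunking might split the content on heading lines before embedding each chunk separately. This sketch assumes markdown-style headings and is a simplification of real chunking strategies:

import re

def chunk_by_headings(article_text):
    # Split wherever a line starts with one to three '#' characters (a markdown heading)
    chunks = re.split(r"\n(?=#{1,3} )", article_text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]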
Here’s the sample export file you can download and use for our code sample below.
3. Inserting OpenAI's Text Embeddings Into The Vector Database
Assuming you already have an OpenAI API key, this code will generate vector embeddings from the text and insert them into the vector database in Pinecone.
import pandas as pd
from openai import OpenAI
from pinecone import Pinecone
from IPython.display import clear_output

# Set up your OpenAI and Pinecone API keys
openai_client = OpenAI(api_key='YOUR_OPENAI_API_KEY')  # Instantiate OpenAI client
pinecone = Pinecone(api_key='YOUR_PINECONE_API_KEY')

# Connect to an existing Pinecone index
index_name = "article-index-all-ada"
index = pinecone.Index(index_name)

def generate_embeddings(text):
    """
    Generates an embedding for the given text using OpenAI's API.
    Returns None if text is invalid or an error occurs.
    """
    try:
        if not text or not isinstance(text, str):
            raise ValueError("Input text must be a non-empty string.")

        result = openai_client.embeddings.create(
            input=text,
            model="text-embedding-ada-002"
        )

        clear_output(wait=True)  # Clear output for a fresh display

        if hasattr(result, 'data') and len(result.data) > 0:
            print("API Response:", result)
            return result.data[0].embedding
        else:
            raise ValueError("Invalid response from the OpenAI API. No data returned.")

    except ValueError as ve:
        print(f"ValueError: {ve}")
        return None
    except Exception as e:
        print(f"An error occurred while generating embeddings: {e}")
        return None

# Load your articles from a CSV
df = pd.read_csv('Sample Export File.csv')

# Process each article
for idx, row in df.iterrows():
    try:
        clear_output(wait=True)

        # Concatenate the title and meta description, as discussed above
        text_to_embed = f"{row['Title']}. {row['Meta Description']}"
        vector = generate_embeddings(text_to_embed)

        if vector is None:
            print(f"Skipping article '{row['Permalink']}' due to an empty or invalid embedding.")
            continue

        index.upsert(vectors=[
            (
                row['Permalink'],  # Unique ID
                vector,  # The embedding
                {
                    'title': row['Title'],
                    'category': row['Category'],
                    'type': row['Type'],
                    'publish_date': row['Publish Date'],
                    'publish_year': row['Publish Year']
                }
            )
        ])
    except Exception as e:
        clear_output(wait=True)
        print(f"Error processing article '{row['Permalink']}': {str(e)}")

print("Embeddings are successfully stored in the vector database.")
You need to create a notebook file, copy and paste the code above into it, and then upload the CSV file 'Sample Export File.csv' to the same folder.
Jupyter project.
Once done, click the Run button, and it will start pushing all the text embedding vectors into the index 'article-index-all-ada' we created in the first step.
Running the script.
You will see an output log of embedding vectors. Once finished, it will display a message confirming successful completion. Now go and check your index in Pinecone, and you will see your records are there.
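If you prefer verifying from the notebook itself, a quick sanity check is to print the index statistics, reusing the index connection from the code above:

stats = index.describe_index_stats()
print(stats)  # 'total_vector_count' should match the number of articles you upserted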