Latent Semantic Indexing: Why It Won’t Benefit Your SEO

June 17, 2024
Lachlan Perry

Lachlan is the Founding & Managing Director of Endpoint Digital. He has years of experience as an in-house senior SEO specialist and digital strategist, and has worked with many businesses across a diverse range of industry verticals. With a graduate degree in Data Science, he is a data-driven expert who leverages statistical insights to craft personalised strategies.

If you love SEO or want to learn more about how to create high-quality, user-focused content that ranks, you’ve more than likely come across the term Latent Semantic Indexing (LSI).

A common claim in the SEO world is that Latent Semantic Indexing is part of Google’s core ranking algorithm. However, there is no genuine evidence to support this claim at the time of writing.

It is a highly popular term in SEO circles, often pushed by marketing gurus who suggest that applying “LSI keywords” will enrich your content with the types of semantic phrasing that Google is looking for and, as such, rank your web pages higher.

Before we consider whether or not LSI has any effect on your SEO, it’s important to understand the history of Latent Semantic Indexing and its applications in natural language processing.

What Is Latent Semantic Indexing?

The first seminal paper on LSI was published in 1988 by a team of researchers that included Susan Dumais, who later joined Microsoft.

Before she joined Microsoft, Dumais carried out her research on LSI at Bell Labs, publishing ‘Indexing by Latent Semantic Analysis’ in 1990 with fellow researchers including Scott Deerwester and Richard Harshman. She continued to develop LSI throughout her time there.

Latent Semantic Indexing, also referred to as latent semantic analysis (LSA), is an information retrieval method designed to improve the process of finding relevant documents. It analyses the relationships between a set of documents and the terms they contain, and generates a set of concepts associated with those documents and terms.

LSI leverages a mathematical technique known as Singular Value Decomposition (SVD), which reduces the dimensionality of term-document matrices. Its authors argued that straightforward term-matching schemes have shortcomings that can cause relevant information to be missed, given that people often describe the same topic using different words and the same word can have different meanings.

This SVD reduction uncovers latent semantic structures within the data, allowing LSI to capture the underlying meaning and associations between terms, even if those terms do not explicitly appear together in documents. LSI assumes that words that are close in meaning will occur in similar pieces of text.
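To make the idea concrete, here is a minimal sketch of the SVD step on a tiny, made-up corpus. The four documents, the choice of two latent dimensions (`k = 2`) and the helper name `cosine` are all assumptions for illustration, and the example uses NumPy rather than anything from the original LSI research.

```python
import numpy as np

# Toy corpus (hypothetical): two documents about cars, two about cooking.
docs = [
    "car engine repair",
    "automobile engine service",
    "pasta sauce recipe",
    "tomato sauce recipe",
]

# Term-document count matrix: one row per term, one column per document.
vocab = sorted({word for doc in docs for word in doc.split()})
A = np.array(
    [[doc.split().count(term) for doc in docs] for term in vocab],
    dtype=float,
)

# Decompose A, then keep only the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # number of latent "concepts" to keep (an arbitrary choice here)
term_vectors = U[:, :k] * s[:k]  # one k-dimensional vector per term


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


car = vocab.index("car")
automobile = vocab.index("automobile")
pasta = vocab.index("pasta")

# "car" and "automobile" never share a document, so their raw rows do not overlap.
print("raw car vs automobile:   ", cosine(A[car], A[automobile]))
# In the reduced space they end up close together, because both co-occur with "engine".
print("latent car vs automobile:", cosine(term_vectors[car], term_vectors[automobile]))
print("latent car vs pasta:     ", cosine(term_vectors[car], term_vectors[pasta]))
```

In this toy case the raw similarity between “car” and “automobile” is zero, while their similarity in the reduced space is close to one. That is the behaviour LSI was designed to produce, shown here on a corpus small enough to check by hand.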

This is generally where Latent Semantic Indexing is misunderstood when it comes to SEO applications, as the use of related words and phrases is often attributed to LSI without strong evidence, when Google is likely using more advanced and modern language models.

What LSI Set Out To Understand & Solve

The key advantage of LSI is in its ability to address the problems of synonymy and polysemy.

Some words have multiple, different meanings (polysemy), while other words mean the same thing as each other (synonymy).

Traditionally, information retrieval (IR) systems were largely one-dimensional, and based on simple term-matching algorithms.

As humans, we understand that language is complex, and for the most relevant information to be surfaced, a retrieval system needs a deep understanding of natural language.

Synonymy is often the cause of mismatches in vocabulary used by the authors of documents and the users of information retrieval systems.
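The sketch below shows both problems with a deliberately naive exact-match search. The documents, the queries and the `term_match` function are hypothetical and only illustrate the kind of term-matching system described above.

```python
# Hypothetical mini-collection of documents.
documents = {
    "doc1": "how to repair a car engine",
    "doc2": "automobile servicing guide",
    "doc3": "jaguar habitat in the rainforest",
    "doc4": "jaguar dealership opening hours",
}


def term_match(query, docs):
    """Return the ids of documents that contain every query term exactly."""
    terms = query.lower().split()
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if all(term in text.lower().split() for term in terms)
    )


# Synonymy: the query says "automobile", so the "car" document is missed.
synonymy_result = term_match("automobile", documents)
print(synonymy_result)  # ['doc2']

# Polysemy: "jaguar" matches both the animal and the car brand.
polysemy_result = term_match("jaguar", documents)
print(polysemy_result)  # ['doc3', 'doc4']
```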

Back in the 1980s when LSI was first patented, there wasn’t necessarily a focus on optimising documents for natural language processing, and the onus generally fell on the user’s query to surface the document.

If you are familiar with PageRank, you could draw some comparisons. PageRank served as the pivotal function for search engines to rank documents according to their importance, or how many times they were ‘cited’ by other websites.

Websites that were cited by many other websites, or by another highly important website, would often be ranked highly in an information retrieval system for a user’s query, because simple term-matching algorithms on their own were providing users with less than desirable results.
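For readers who want to see the citation idea in action, the following is a simplified textbook-style sketch of PageRank using power iteration. The four-page link graph, the damping factor of 0.85 and the iteration count are assumptions for illustration, and this is not a description of how Google computes rankings in production.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over a dict of page -> outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: pages a, b and d all "cite" page c.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Page `c` comes out on top because it is cited by the most pages, and page `a` follows closely because it is cited by the highly ranked `c`.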

What Are LSI Keywords?

Before we deep-dive into the relationship between LSI and SEO, it’s important to define what LSI is and how it differs from LSI keywords.

Latent Semantic Indexing, as discussed earlier, is a real technique in distributional semantics, which aims to understand and analyse the relationships between documents in a corpus, and the relationship between the words that appear in these documents.

“LSI keywords” is not a term used in any mathematical or analytical discussion of LSI.

It is a term coined by marketers and SEO “gurus” that falsely implies LSI is being applied when related words and phrases are introduced to your content.

However, the shortcoming is that these LSI tools provide no tangible evidence of how LSI is actually being applied.

LSI is a scientific approach to indexing and information retrieval, and is not a method for optimising the content of documents such as web pages.

Google is certainly looking for all those words that enrich your content, such as synonyms, related phrases and semantically related words, but it is likely using more modern and advanced natural language processing than LSI.

There are no strong indicators that using these LSI tools will greatly enhance the ranking of your pages in any search engine, as the tools cannot accurately explain how they are using LSI.

If you are interested in what Google has to say about this topic, John Mueller, a Search Advocate at Google, provided some insight and stated that LSI keywords are not real.

There's no such thing as LSI keywords — anyone who's telling you otherwise is mistaken, sorry.
— 🍌 John 🍌 (@JohnMu) July 30, 2019

Latent Semantic Indexing & Its Effect On Your SEO

To make a long story short, there is no evidence that Google is using latent semantic indexing to better understand the context of a page, or is using LSI to determine where you should rank regarding a keyword because you might be using synonyms.

Another shortcoming of LSI for SEO purposes is that it was patented before the World Wide Web came to fruition.

An indexing and information retrieval system that was never designed to account for the scale of the web is unlikely to be something that Google, or any other major search engine, would use as part of its core ranking algorithm.

The original patent described an example set of 9 documents, which is obviously considerably smaller than the world wide web.

Google’s index has grown to well over 100 billion documents and is updated frequently, which makes this patented technology too dated to be a useful consideration for ranking documents in such a large index.

There has long been debate about the importance of LSI and its effect on how well your SEO campaigns perform, but Google itself has never published a patent on the use of LSI in its ranking algorithm.

There are LSI patents that appear on Google Patents. However, none of them seem to indicate that latent semantic indexing plays a vital role in how Google looks for synonyms and related words. They do talk about semantics and the use of phrase co-occurrence, but as far as LSI goes, there isn’t much to support the claim.

If you are looking to overhaul your content strategy, you might have looked at LSI keyword generators or tools to help you find semantically related keywords.

As we mentioned before, there is no science behind using “LSI keywords”, given that LSI is unlikely to be responsible for how Google uses and understands synonyms and related keywords.

It has been suggested that these LSI tools are operating on phrase-based indexing rather than latent semantic indexing, and this is generally where the confusion may lie.

This might be because phrase-based indexing relies on phrases that are “valid” or “good”, and could take into consideration how frequently these phrases are used and whether they are related to each other.

When we think about the keywords we want to rank for, spinning off as many synonyms as we can think of gives us the impression that we’re ticking all the boxes, and this might be what these LSI tools are doing. Again, though, this is not LSI.

The phrase-based indexing patent invented by Anna Lynn Patterson identifies phrases in documents on the internet and indexes them according to those particular phrases.

When a user submits a search query to the search engine, it attempts to provide the most relevant results from its repository of information, whilst looking for phrases in the user’s search term.

The information retrieval system will then rank the results it aims to show the user, using phrases to influence the ranking order (that is, which results are the most relevant).

It’s a very interesting patent, and given that there are several related patents assigned to Google, it suggests this may be something that Google is fully utilising.
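The general shape of that process can be sketched in a few lines. This is a toy illustration only and is not the method described in the patent: treating every two-word sequence as a “phrase”, the sample documents and the function names `phrases`, `build_phrase_index` and `search` are all assumptions made for the example.

```python
from collections import defaultdict


def phrases(text, n=2):
    """Split text into overlapping n-word phrases (two-word phrases by default)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_phrase_index(docs):
    """Map each phrase to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for phrase in phrases(text):
            index[phrase].add(doc_id)
    return index


def search(query, index):
    """Rank documents by how many of the query's phrases they contain."""
    scores = defaultdict(int)
    for phrase in phrases(query):
        for doc_id in index.get(phrase, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical documents.
docs = {
    "doc1": "the prime minister spoke at parliament house",
    "doc2": "the minister of a prime number society",
    "doc3": "parliament house tours run daily",
}
index = build_phrase_index(docs)
results = search("prime minister parliament house", index)
print(results)  # ['doc1', 'doc3']
```

Note that `doc2` contains both “prime” and “minister” as individual words but is not returned, because it never contains them as a phrase.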

Forget LSI – Here’s What You Should Do Instead

The advice of using LSI keywords is to “help Google better understand the context of your pages”, but the reality is there are much more efficient ways to do this.

Google is likely utilising a wide range of natural language processing techniques, not just for indexing pages but for ranking them too.

Latent Semantic Indexing, while still useful in principle, is likely too dated to be used in a modern search engine’s core indexing and ranking system.

Google has become considerably better at understanding the context of a document, including its content and links, over the past two decades, but providing it with additional information about your web pages is still an important part of achieving better SEO results.

Focus On Structured Data

Structured data, commonly known as schema markup, is organised information that the crawlers of search engines such as Google and Bing digest to better understand the context of your web pages.

So if you do want to implement some semantic technology, structured data is something Google both understands and uses.

There are plenty of ways you can create organised data for your web pages, and the list of different schema types is available on the Schema.org website.

For eCommerce websites, you might want to consider using review rating markup to provide reviews on products that you might sell, as well as listing their price and stock availability.
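As a rough illustration, the sketch below builds that kind of product markup as JSON-LD and wraps it in the script tag that would sit in a page’s HTML. The product name, rating figures, price and currency are placeholder values, so swap in your own data and check the result against Google’s structured data documentation before relying on it.

```python
import json

# Placeholder product details (hypothetical values for illustration).
product_markup = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example Running Shoe",
    "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": "4.6",
        "reviewCount": "128",
    },
    "offers": {
        "@type": "Offer",
        "price": "149.95",
        "priceCurrency": "AUD",
        "availability": "https://schema.org/InStock",
    },
}

# Serialise the markup and wrap it in a JSON-LD script tag for the page.
json_ld = json.dumps(product_markup, indent=2)
script_tag = f'<script type="application/ld+json">\n{json_ld}\n</script>'
print(script_tag)
```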

When you add schema to your web pages, you can sometimes trigger rich results. Rich results enhance the way your website looks amongst the search results, making it more enticing for users to click on your listing rather than your competitors’.

Google will show rich results on websites at its own discretion, so don’t be disheartened if they don’t show up for your website – the important thing is that you’re aiding their crawlers to understand what your web pages are about.

An image of a rich result for nike.com in Google search results.

Make Use Of High Quality Related Content

Context is important in the grand scheme of identifying intent from a user’s search, and as such, it’s important to provide words that accurately disclose the meaning or context of your pages.

When talking about related content, it’s important to note that we are not talking about synonyms.

Huge knowledge bases such as Wikipedia and other encyclopaedias are incredibly helpful resources for finding contextual vocabulary that can help Google better understand the meaning of your content.

If you do want to make use of real semantic search, writing about attributes that are contextually related to your parent topic is a step in the right direction.

For example, if you’re writing content about car racing, you might head over to Wikipedia or another trusted knowledge base and use related terms such as famous tracks and famous drivers. This indicates to Google that you are talking about car racing and not any other variation of racing.

Entity recognition is also a large part of semantic search. Identifying and understanding the entities mentioned in a user’s query or on a web page is highly important for optimising your web pages to cater to all types of relevant queries.

Entities, which can include people, places, organisations and concepts, are critical to building Google’s Knowledge Graph. You can act on this by looking through trusted knowledge bases on your topic and noting the entities they mention.

Diversifying your content with related words and entities is a step in the right direction.

Take Note Of The Concept Of Co-occurrence

As discussed earlier with the phrase-based indexing patent, the concept of co-occurrence is becoming increasingly important as search engines try to understand how certain words relate to each other.

If Google does indeed use phrase-based indexing, then the significance of these related words and how frequently they are mentioned with one another can help Google better understand the context of your pages.

This is more than just using synonyms – this is about using phrases related to your topic.

If you have a blog post that talks about “Australian politics”, you would expect phrases like “parliament house” and “prime minister” to be included in the document.

As such, the appearance of these phrases can also help predict the presence of other phrases.

Google also moved beyond rewarding “keyword stuffing” more than a decade ago, so be selective about the number of related phrases you include. You don’t want to dilute the quality of your content by repeating unnecessary or meaningless phrases over and over.
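Co-occurrence is simple to measure on your own content. The sketch below counts how often pairs of phrases appear in the same document. The sample documents and the list of tracked phrases are made up for illustration, and this is a basic count rather than anything Google has confirmed using.

```python
from collections import Counter
from itertools import combinations

# Hypothetical documents and the phrases we want to track.
documents = [
    "australian politics parliament house prime minister",
    "prime minister addresses parliament house",
    "australian politics election results",
    "pasta recipe with tomato sauce",
]
tracked = ["australian politics", "parliament house", "prime minister", "tomato sauce"]

# Count every pair of tracked phrases that appears in the same document.
pair_counts = Counter()
for text in documents:
    present = sorted(phrase for phrase in tracked if phrase in text)
    pair_counts.update(combinations(present, 2))

for pair, count in pair_counts.most_common():
    print(pair, count)
```

Here “parliament house” and “prime minister” appear together in two documents, while “tomato sauce” never appears alongside any of the politics phrases, which is the kind of signal that separates one topic from another.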

Pay Attention To Technical SEO

John Mueller also had some further advice for SEOs at SMX Virtual, stating that websites that are “technically better” will have a small advantage, but overall, content is still king.

Sites that are "technically better" (I'm assuming technical SEO here) have an advantage. Sometimes that's a small advantage, but can be bigger depending on the niche. It's good to get that advantage. Remember, content is king, but strive for strong technical SEO. pic.twitter.com/bg3M4Vuc9y
— Glenn Gabe (@glenngabe) December 8, 2020

Keeping on top of your technical SEO gives search engine crawlers a streamlined way to crawl and index your website with little technical overhead.

Key tasks include:

Ensuring your canonical URLs are set properly

Setting noindex to low-quality or thin content pages that serve no purpose to the user

Avoiding the mistaken use of URL parameters or session IDs

Double-checking there are no crawling or indexing problems with your sitemap and robots.txt file

Triple-checking that Googlebot can render your pages on mobile, as Google moves all websites to a mobile-first index

Ensuring that your pages are not returning 404 errors and, if they have been moved to a new address, that appropriate 301 redirects have been set up
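A few of these checks are easy to script. The sketch below uses only Python’s standard library to read a page’s canonical URL and robots meta tag, and to test URLs against robots.txt rules. The HTML snippet, the robots.txt rules, the example.com URLs and the `HeadAudit` class name are all made up for illustration. A real audit would fetch live pages and cover far more than this.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class HeadAudit(HTMLParser):
    """Collects the canonical URL and any noindex directive from a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.noindex = "noindex" in (attrs.get("content") or "").lower()


# Hypothetical page source.
html = """<html><head>
<link rel="canonical" href="https://example.com/shoes/">
<meta name="robots" content="noindex, follow">
</head><body></body></html>"""

audit = HeadAudit()
audit.feed(html)
print("canonical:", audit.canonical)
print("noindex:", audit.noindex)

# Hypothetical robots.txt rules, parsed from a list of lines.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /cart/"])
can_crawl_product = robots.can_fetch("Googlebot", "https://example.com/shoes/")
can_crawl_cart = robots.can_fetch("Googlebot", "https://example.com/cart/checkout")
print("product page crawlable:", can_crawl_product)
print("cart page crawlable:", can_crawl_cart)
```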

The Wrap Up

LSI is a technology that was invented in the late 1980s and likely does not have any significant bearing on the indexing and ranking of web pages in search engines such as Google.

There isn’t any harm in using synonyms throughout your content for the purpose of enrichment, but to suggest that the use of LSI is responsible for how Google uses semantically related words and synonyms is not necessarily helpful or correct.

It is important to make use of entities and semantically related phrases, and to understand how polysemy and synonyms in user queries may affect how your page is ranked.

Dispelling the misinformation regarding LSI is ideal if we want to create trust amongst our readers and digital marketers.

If you do want to make strides with semantic approaches, use methods that Google has documented and understands, such as structured data, related words, context and word vectors, and make sure the technical aspects of your site are up to scratch.

Google is making use of natural language processing models in its search engine that are likely far more advanced than LSI. As such, we cannot rely on Latent Semantic Indexing to give us clarity on how Google may surface documents from its index.

Heavily edited for clarity & freshness: 17th June 2024.

