4 min read

Semantic models in vector search

Artificial intelligence (AI) has become a constant presence in people’s professional and personal lives, increasingly woven into daily tasks through hundreds of thousands of AI-based applications.

Natural language processing (NLP) and natural language generation (NLG) programs are AI systems built to understand, for example, how humans talk or write. These systems have enabled semantic models that learn to perform specific tasks, like predicting the end of the sentence you’re typing. Thanks to NLP and NLG, a revolution has come to the shoppers’ search experience: they can leverage an advanced, vector-based search whose algorithms understand the semantic meaning and context of search queries and documents.

However, building these systems requires considerable time and resources, as each specific task demands a large, well-labeled dataset. That’s why the starting point for vector search is the foundation model: a large AI model trained on a vast corpus of unlabeled data, often using self-supervised learning, that can power a wide variety of downstream tasks.
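At its core, vector search compares the embedding of a query with the embeddings of documents or products and ranks by closeness. The sketch below illustrates the idea with cosine similarity over toy vectors; the example vectors and product names are purely illustrative, not Empathy Platform data.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings; real semantic models produce hundreds of dimensions.
query_vec = [0.9, 0.1, 0.3]
doc_vecs = {
    "running shoes": [0.8, 0.2, 0.4],
    "coffee maker": [0.1, 0.9, 0.2],
}

# Rank documents by semantic closeness to the query.
ranked = sorted(doc_vecs.items(),
                key=lambda kv: cosine_similarity(query_vec, kv[1]),
                reverse=True)
```

Because similarity is computed in embedding space, a query can match a product even when they share no keywords, which is what makes vector search semantic rather than lexical.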


Wondering how these foundation models are fine-tuned and labeled? Keep on reading the section Fine-tuning semantic models.

Foundation models in Empathy Platform

Like in the software world, there are two types of foundation models depending on their source:

  • Closed-source foundation models: typically delivered as end-to-end applications or as API integrations built on top of these models.
  • Open-source foundation models: typically hosted on model hubs, on top of which applications or APIs can be built.


Empathy Platform leverages open-source foundation models to create vectorized search experiences, mainly for data privacy and integrity reasons, establishing privacy and consent controls that reinforce customers’ trust and safeguard brands’ reputations.

Fine-tuning semantic models

Semantic models can extend into any domain through the next step: fine-tuning. They are trained with proprietary tuning data, i.e. specific, well-labeled domain information, to adapt the model to specific tasks.

At Empathy Platform, open-source foundation models are trained with query-click and query-product combinations to create semantic associations, based on consent-backed, anonymous, session-based customer interactions. The resulting domain-based model therefore ensures data privacy and integrity.
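Turning interaction logs into fine-tuning data can be sketched as pairing each query with the product it led to. The snippet below is a minimal, assumed illustration of that idea: the field names and the negative-sampling strategy are hypothetical, not Empathy Platform’s actual pipeline.

```python
import random

# Hypothetical anonymized, session-based click log (field names are illustrative).
click_log = [
    {"query": "trainers", "clicked_product": "running shoes"},
    {"query": "espresso machine", "clicked_product": "coffee maker"},
]

catalogue = ["running shoes", "coffee maker", "wool scarf"]

def build_training_pairs(log, catalogue, seed=0):
    """Turn query-click interactions into labeled tuning pairs:
    clicks become positives (label 1); a random non-clicked
    product is sampled as a negative (label 0)."""
    rng = random.Random(seed)
    pairs = []
    for event in log:
        pairs.append((event["query"], event["clicked_product"], 1))
        negatives = [p for p in catalogue if p != event["clicked_product"]]
        pairs.append((event["query"], rng.choice(negatives), 0))
    return pairs

pairs = build_training_pairs(click_log, catalogue)
```

Pairs like these are what a fine-tuning procedure would consume to pull query and clicked-product embeddings closer together while pushing unrelated ones apart.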

Protecting privacy and integrity

Creating a foundation model requires a huge amount of training data, which raises consent and privacy problems. It is not possible to ensure and track consent and privacy when working with pre-trained models. As these models hold and process the underlying information, the individuals whose actions feed the models must have given their consent (via banners, pop-ups, and roll-downs).

Empathy is establishing privacy controls as a firewall against legal and reputational risks for brands. If there is no data subject consent, there is no data integrity. Therefore, Empathy Platform strives to verify data integrity, user confidentiality, and consent, even though the origin of the data the models were trained on cannot be controlled.

To avoid the impact of dirty datasets in the foundation models used, single-domain content integrity should first be ensured by tuning the models to the specific use case. Beyond fine-tuning the model training, the weighting of the foundation model can be reduced, minimizing the impact of potentially non-integral data while strategies to validate the integrity of these sources are developed. Finally, an ePrivacy stress test is executed to safeguard the reputation of retailers and brands, leveraging AI opportunities based on confidentiality that do not compromise trust.

Leveraging semantic models with Semantics API

Built on NLP foundation models fine-tuned with the customer’s proprietary domain datasets, the Empathy Platform Semantics API surfaces semantic similarities between queries and, by extension, between products in the merchandiser’s product catalogue. The Semantics API is leveraged to create vector-based search experiences that complement keyword search with faster and more relevant results.

Vector search overcomes the most frustrating shopper search experiences, such as zero or partial results, misspellings, or low-relevance results.


Check out real use cases to get more insights about how Empathy Platform helps you manage these situations.

Semantic models are also used to improve search effectiveness and relevance by combining the strengths of keyword-based and vector-based indexing. The product catalogue can thus be enhanced with attribute enrichment at index time, which helps avoid potential issues from applying vectors at search time and improves search performance at query time.
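Index-time attribute enrichment can be pictured as attaching semantically related terms to each catalogue entry before indexing, so that plain keyword matching benefits later without any vector work at query time. The mapping and field names below are assumptions for illustration, not the actual enrichment schema.

```python
# Hypothetical mapping from a semantic model: products to related terms.
related_terms = {
    "running shoes": ["trainers", "sneakers"],
    "coffee maker": ["espresso machine"],
}

def enrich(product):
    """Return an index document whose searchable text also includes
    semantically related attributes (from the assumed mapping above)."""
    terms = related_terms.get(product["name"], [])
    return {**product, "search_text": " ".join([product["name"], *terms])}

# Enrichment happens once, at index time, not per query.
doc = enrich({"name": "running shoes", "price": 79.0})
```

Because the enriched text is stored in the index, a shopper searching for “trainers” can match “running shoes” through ordinary keyword retrieval.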

Combining the strengths of keyword and vector search allows the development of a hybrid search solution that effectively addresses long-tail scenarios and enhances the relevance of search results.
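One common way to blend the two signals is a weighted sum of a normalized keyword score and a vector similarity score. The sketch below shows that idea; the blending weight `alpha` and the scores are hypothetical illustrations, not Empathy Platform parameters.

```python
def hybrid_score(keyword_score, vector_score, alpha=0.5):
    """Blend a normalized keyword score and a vector similarity score.
    alpha is a hypothetical tuning weight, not an actual product setting."""
    return alpha * keyword_score + (1 - alpha) * vector_score

# Example: a long-tail query matches no keywords for product A,
# but product A is semantically close to the query.
results = {
    "product A": hybrid_score(keyword_score=0.0, vector_score=0.9),
    "product B": hybrid_score(keyword_score=0.6, vector_score=0.2),
}
best = max(results, key=results.get)
```

In this toy case the semantically close product wins despite having no keyword match, which is exactly the long-tail scenario hybrid search is meant to cover.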


Read more about how vector and keyword search as a unified index can enhance your shoppers' search experience.

See it in action

Want to see how Empathy Platform approaches semantic models based on privacy and consent integrity? Watch this vector search recap and discover the path towards an ever-closer hybrid search experience.