AI-powered search, the combination of artificial intelligence technologies and search engines, enables semantic and similarity search that goes beyond keyword matching to understand the intent and context of a query.
The integration of machine learning with Elasticsearch revolutionizes search and data analysis. Elasticsearch, known for its real-time search capabilities, can be powered by machine learning algorithms to provide context-sensitive and intelligent results.
In this article, you will discover how this powerful combination gives users the ability to efficiently index large text datasets and perform multilingual semantic-similarity searches in Elasticsearch using a pretrained model.
The article's step-by-step guide assumes you have a Jupyter notebook or Python environment in which to run the code, as well as a running, accessible Elasticsearch instance. There are no platform restrictions, but we recommend using Red Hat OpenShift Container Platform with the Red Hat OpenShift AI and Elasticsearch (ECK) operators. We also recommend creating a sandbox environment through Red Hat Developer, where you can get started quickly and at no cost.
All the code can be found in the tarcis-io/ml_elasticsearch repository.
1. Machine learning
In machine learning and Natural Language Processing (NLP), models take vectors (arrays of numbers) as input. Embedding is a technique for representing words, phrases, images, audio, or other entities as vectors of real numbers in a high-dimensional space (as illustrated in Figure 1). These embeddings capture semantic relationships and contextual information, allowing algorithms to better understand and process language.
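As a toy illustration of how embeddings are compared (the vectors below are made-up, low-dimensional values, not the output of a real model), semantically related inputs should produce vectors that point in similar directions, which can be measured with cosine similarity, the same metric used later in this article:
import numpy as np

# Made-up three-dimensional embeddings, for illustration only
vector_king  = np.array([0.8, 0.3, 0.1])
vector_queen = np.array([0.7, 0.4, 0.1])
vector_stone = np.array([0.1, 0.9, 0.2])

def cosine_similarity(a, b):
    # Ranges from -1.0 (opposite directions) to 1.0 (same direction)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(vector_king, vector_queen))  # higher: related meanings
print(cosine_similarity(vector_king, vector_stone))  # lower: unrelated meanings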
1.1. Universal Sentence Encoder Multilingual
The Universal Sentence Encoder is a model developed by Google that produces fixed-size embeddings for input sentences or short texts. These embeddings are designed to capture semantic information about the meaning of the input text, making them useful for a variety of natural language processing tasks, such as text classification, text clustering and semantic search.
The Multilingual Universal Sentence Encoder module (shown in Figure 2) is an extension of the Universal Sentence Encoder that includes training on multiple tasks across different languages.
2. Elasticsearch
Elasticsearch is an open source distributed search and analytics engine designed for scalability and real-time search. It can handle large volumes of data of different types, including structured, unstructured, and geospatial data. Elasticsearch is commonly used in various applications, including log and event data analysis, monitoring, business intelligence, and search engines. Its versatility and scalability make it a popular choice for organizations dealing with large and complex datasets.
2.1. Index
In Elasticsearch, an index is a collection of documents that share a similar structure and are stored together for efficient searching and retrieval. It serves as the primary unit for organizing and managing data (Figure 3).
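For example, the bbc_news index built later in this article stores each news record as a document containing the text, its embedding, and some metadata. An illustrative document (with made-up values; the real vector holds 512 numbers) could look like this as a Python dictionary:
document = {
    'text' : 'european economic growth ...',   # the raw news text
    'vector' : [0.06, 0.02, -0.03],            # embedding of the text (512 values in practice)
    'metadata' : {
        'label' : 2,                   # numeric category (illustrative value)
        'label_text' : 'business',     # human-readable category
        'dataset_type' : 'train'       # dataset split the record came from
    }
}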
3. Show me the code!
3.1. Index the dataset
The first step is to download the BBC News dataset and index it in Elasticsearch. The notebook that implements these tasks is 01_index_dataset.ipynb.
3.1.1. Install and import the required packages
To perform indexing tasks, install and import the following packages:
import tensorflow_text
from datasets import load_dataset
from elasticsearch import Elasticsearch
from IPython import display
from tensorflow_hub import load
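If these packages are not yet available in the notebook environment, they can be installed first. The following is a minimal sketch using the usual PyPI package names (tensorflow itself is required by tensorflow-text and tensorflow-hub); pin versions as appropriate for your environment:
%pip install tensorflow tensorflow-text tensorflow-hub datasets elasticsearch ipython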
3.1.2. Create the Elasticsearch client
Create the client that will be responsible for executing actions in Elasticsearch. At a minimum, you must specify the host and some form of authentication, such as a username and password:
es_host = '<elasticsearch_host>'
es_username = '<elasticsearch_username>'
es_password = '<elasticsearch_password>'
es = Elasticsearch(
    hosts = es_host,
    basic_auth = (es_username, es_password),
    verify_certs = False  # skip TLS certificate verification (suitable for test environments only)
)
es.info()
If the connection has been established, the es.info() method should return something similar to this:
ObjectApiResponse({ "tagline":"You Know, for Search" })
3.1.3. Download the BBC News dataset
In this article, we will use the BBC News dataset, which contains more than 2,000 categorized news articles in text format. We can download it simply by calling the load_dataset method from the datasets package:
bbc_news_dataset = load_dataset('SetFit/bbc-news')
The output should be an object of type DatasetDict:
DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 1225
    })
    test: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 1000
    })
})
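Each split behaves like a list of dictionaries exposing the text, label, and label_text features listed above. For example, to inspect the first training record (the exact content depends on the dataset's ordering):
sample = bbc_news_dataset['train'][0]

print(sample.keys())         # dict_keys(['text', 'label', 'label_text'])
print(sample['label_text'])  # the record's category name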
3.1.4. Download the Multilingual Universal Sentence Encoder model
In machine learning, a model is a mathematical representation that learns patterns from data to make predictions or decisions without being explicitly programmed.
The Multilingual Universal Sentence Encoder is a model with a Convolutional Neural Network (CNN) architecture trained on multiple tasks across multiple languages. It maps text into a single embedding space shared by the 16 languages it was trained on and shows strong performance in multilingual retrieval.
Using the tensorflow-hub package, we can load this model with the following expression:
model = load('https://www.kaggle.com/models/google/universal-sentence-encoder/frameworks/TensorFlow2/variations/multilingual-large/versions/2')
The model's input is text, and its output is an array of 512 values representing the embedding of the input text. For example, for the input:
model('Hello World, ML Elasticsearch!')[0].numpy()
The output should be:
array([ 6.04993142e-02, 2.46977881e-02, 3.43629159e-02, -3.45645919e-02,
-2.19440423e-02, 1.74631681e-02, 4.07335758e-02, -7.56947622e-02,
-4.24392615e-03, -9.12883040e-03, -7.34170526e-02, -8.02445412e-02,
7.20020905e-02, 3.64579372e-02, -4.03201990e-02, 3.09278686e-02,
-4.60638112e-04, 2.28298176e-03, 8.05738419e-02, 1.25325136e-02,
8.55052546e-02, -4.09091152e-02, 7.49715557e-03, 1.39080118e-02,
4.30730805e-02, -3.43654901e-02, 5.11647621e-03, 2.94517819e-02,
1.54668670e-02, 6.75039068e-02, 1.02604687e-01, -3.43576036e-02,
1.10444212e-02, -5.98613322e-02, -2.86747441e-02, 3.58597264e-02,
1.37920845e-02, -1.31028690e-04, -5.92646468e-03, -7.39952251e-02,
3.12727243e-02, -5.23758633e-03, -4.90117408e-02, -2.00900845e-02,
7.94764757e-02, 1.69147346e-02, -7.27028772e-03, 4.55966964e-02,
2.82147657e-02, -1.79359596e-02, 3.01514324e-02, -4.47459966e-02,
-6.71745390e-02, 4.77596521e-02, 7.86093343e-03, -1.41343456e-02,
-3.58230583e-02, -1.85324792e-02, 4.14996557e-02, -2.08834168e-02,
8.51072446e-02, 4.16630059e-02, -5.32974862e-02, -4.40437198e-02,
-4.28032987e-02, -8.95944908e-02, 3.19887549e-02, 6.05730340e-02,
-3.23659391e-03, 7.54942596e-02, -5.46579855e-03, 3.77340917e-03,
5.46114445e-02, -4.40792646e-03, -3.59019917e-03, 6.38055429e-02,
1.04503930e-02, 5.62766846e-03, -3.87495980e-02, 4.47553173e-02, ... ])
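The embedding dimensionality can also be checked programmatically; it must match the dims value used when creating the index in the next step:
embedding = model('Hello World, ML Elasticsearch!')[0].numpy()

print(embedding.shape)  # (512,)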
3.1.5. Create the index for the dataset
We can then create the index that will store the records from the BBC News dataset using the Elasticsearch client. Adjust the settings and mappings according to your needs:
bbc_news_index = 'bbc_news'
es.indices.create(
    index = bbc_news_index,
    settings = {
        'number_of_shards' : 2,
        'number_of_replicas' : 1
    },
    mappings = {
        'properties' : {
            'text' : { 'type' : 'text' },
            'vector' : { 'type' : 'dense_vector', 'dims' : 512, 'index' : True },  # 512 matches the model's embedding size
            'metadata' : {
                'properties' : {
                    'label' : { 'type' : 'integer' },
                    'label_text' : { 'type' : 'text' },
                    'dataset_type' : { 'type' : 'text' }
                }
            }
        }
    }
)
The result should be an acknowledgment message, for example:
ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'bbc_news'})
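Optionally, you can verify that the mapping was created as expected, in particular the dense_vector field, before indexing any data:
es.indices.get_mapping(index = bbc_news_index)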
3.1.6. Index the BBC News dataset in Elasticsearch
Now, index the BBC News dataset in Elasticsearch using the client created previously. The dataset is divided into train and test splits, and we record that split in the dataset_type field. We also store the embedding of each record's text field in the vector field, which will be used to perform the multilingual semantic-similarity search:
for dataset_type in bbc_news_dataset:
    dataset = bbc_news_dataset[dataset_type]
    size = len(dataset)
    for index, item in enumerate(dataset, start = 1):
        display.clear_output(wait = True)
        print(f'Indexing BBC News { dataset_type } dataset : { index } / { size }')
        document = {
            'text' : item['text'],
            'vector' : model(item['text'])[0].numpy(),  # 512-dimensional embedding of the news text
            'metadata' : {
                'label' : item['label'],
                'label_text' : item['label_text'],
                'dataset_type' : dataset_type
            }
        }
        es.index(index = bbc_news_index, document = document)
A counter displays the indexing progress. When it finishes, you can check the total number of records in the index by calling the es.count() method:
es.count(index = bbc_news_index)
ObjectApiResponse({'count': 2225, '_shards': {'total': 2, 'successful': 2, 'skipped': 0, 'failed': 0}})
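Indexing documents one at a time works well for a dataset of this size, but each es.index() call is a separate request to Elasticsearch. For larger datasets, a sketch like the following, using the bulk helper from the elasticsearch package with the same index, model, and dataset objects as above, can reduce the number of round trips:
from elasticsearch.helpers import bulk

def generate_actions():
    # Yield one bulk action per news record, mirroring the document structure used above
    for dataset_type in bbc_news_dataset:
        for item in bbc_news_dataset[dataset_type]:
            yield {
                '_index' : bbc_news_index,
                '_source' : {
                    'text' : item['text'],
                    'vector' : model(item['text'])[0].numpy(),
                    'metadata' : {
                        'label' : item['label'],
                        'label_text' : item['label_text'],
                        'dataset_type' : dataset_type
                    }
                }
            }

bulk(es, generate_actions())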
3.2. Search
With the data from the BBC News dataset indexed in our running Elasticsearch instance, we can perform searches using the text field for similarity search and the vector field for multilingual semantic-similarity search.
The code for this section can be found in the notebook 02_search.ipynb. We will skip the parts about importing packages, creating the Elasticsearch client, and downloading the Multilingual Universal Sentence Encoder model, since they were shown in the previous steps.
3.2.1. Create the base search function
The search() function will be the basis for running queries and presenting the top result:
def search(query):
    result = es.search(index = bbc_news_index, query = query, size = 1)
    result = result['hits']['hits']
    if len(result) == 0:
        print('no results found...')
        return
    result = result[0]
    print(f"score : { result['_score'] }")
    print(f"label : { result['_source']['metadata']['label_text'] }")
    print(f"text : { result['_source']['text'] }")
3.2.2. Create the similarity_search function
Let's define the function that performs the similarity search. It uses a standard match query, the usual way of searching indexed documents in Elasticsearch, against the text field:
def similarity_search(text):
    query = {
        'match' : {
            'text' : text
        }
    }
    search(query)
We can test the function with the same text in three different languages:
text_english = 'european economic growth'
text_spanish = 'crecimiento económico europeo'
text_portuguese = 'crescimento econômico europeu'
similarity_search(text_english)
similarity_search(text_spanish)
similarity_search(text_portuguese)
When searching for the text in English, we get the following result:
score : 12.638578
label : business
text : newest eu members underpin growth the european union s newest members will bolster europe s economic growth in 2005 according to a new report. the eight central european states which joined the eu last year will see 4.6% growth the united nations economic commission for europe (unece) said. in contrast the 12 euro zone countries will put in a lacklustre performance generating growth of only 1.8%. the global economy will slow in 2005 the unece forecasts due to widespread weakness in consumer demand. it warned that growth could also be threatened by attempts to reduce the united states huge current account deficit which in turn might lead to significant volatility in exchange rates. unece is forecasting average economic growth of 2.2% across the european union in 2005. however total output across the euro zone is forecast to fall in 2004 from 1.9% to 1.8%. this is due largely to the faltering german economy which shrank 0.2% in the last quarter of 2004. on monday germany s bdb private banks association said the german economy would struggle to meet its 1.4% growth target in 2005. separately the bundesbank warned that germany s efforts to reduce its budget deficit below 3% of gdp presented huge risks given that headline economic growth was set to fall below 1% this year. publishing its 2005 economic survey the unece said central european countries such as the czech republic and slovenia would provide the backbone of the continent s growth. smaller nations such as cyprus ireland and malta would also be among the continent s best performing economies this year it said. the uk economy on the other hand is expected to slow in 2005 with growth falling from 3.2% last year to 2.5%. consumer demand will remain fragile in many of europe s largest countries and economies will be mostly driven by growth in exports. in view of the fragility of factors of domestic growth and the dampening effects of the stronger euro on domestic economic activity and inflation monetary policy in the euro area is likely to continue to wait and see the organisation said in its report. global economic growth is expected to fall from 5% in 2004 to 4.25% despite the continued strength of the chinese and us economies. the unece warned that attempts to bring about a controlled reduction in the us current account deficit could cause difficulties. the orderly reversal of the deficit is a major challenge for policy makers in both the united states and other economies it noted.
However, for the same text in Portuguese and Spanish, no results are found, because the match query depends on keyword overlap with the indexed English documents:
no results found...
3.2.3. Create the multilingual_semantic_similarity_search function
Finally, let's define the function that performs the multilingual semantic-similarity search.
In this function, the query scores each document by the cosine similarity between the embedding of the input text (generated, again, with the pretrained Multilingual Universal Sentence Encoder model) and the embedding indexed in the vector field:
def multilingual_semantic_similarity_search(text):
    query = {
        'script_score' : {
            'query' : { 'match_all' : {} },
            'script' : {
                # cosineSimilarity returns values in [-1, 1]; adding 1.0 keeps the score non-negative,
                # as required by script_score queries
                'source' : "cosineSimilarity(params.vector, 'vector') + 1.0",
                'params' : { 'vector' : model(text)[0].numpy() }
            }
        }
    }
    search(query)
Lastly, we can use this function for the same examples defined previously:
multilingual_semantic_similarity_search(text_english)
multilingual_semantic_similarity_search(text_spanish)
multilingual_semantic_similarity_search(text_portuguese)
For all three examples, the same text in different languages, we should get back the same document, with only small differences in the score calculated from the cosine similarity:
score : 1.4061848
label : business
text : newest eu members underpin growth the european union s newest members will bolster europe s economic growth in 2005 according to a new report. the eight central european states which joined the eu last year will see 4.6% growth the united nations economic commission for europe (unece) said. in contrast the 12 euro zone countries will put in a lacklustre performance generating growth of only 1.8%. the global economy will slow in 2005 the unece forecasts due to widespread weakness in consumer demand. it warned that growth could also be threatened by attempts to reduce the united states huge current account deficit which in turn might lead to significant volatility in exchange rates. unece is forecasting average economic growth of 2.2% across the european union in 2005. however total output across the euro zone is forecast to fall in 2004 from 1.9% to 1.8%. this is due largely to the faltering german economy which shrank 0.2% in the last quarter of 2004. on monday germany s bdb private banks association said the german economy would struggle to meet its 1.4% growth target in 2005. separately the bundesbank warned that germany s efforts to reduce its budget deficit below 3% of gdp presented huge risks given that headline economic growth was set to fall below 1% this year. publishing its 2005 economic survey the unece said central european countries such as the czech republic and slovenia would provide the backbone of the continent s growth. smaller nations such as cyprus ireland and malta would also be among the continent s best performing economies this year it said. the uk economy on the other hand is expected to slow in 2005 with growth falling from 3.2% last year to 2.5%. consumer demand will remain fragile in many of europe s largest countries and economies will be mostly driven by growth in exports. in view of the fragility of factors of domestic growth and the dampening effects of the stronger euro on domestic economic activity and inflation monetary policy in the euro area is likely to continue to wait and see the organisation said in its report. global economic growth is expected to fall from 5% in 2004 to 4.25% despite the continued strength of the chinese and us economies. the unece warned that attempts to bring about a controlled reduction in the us current account deficit could cause difficulties. the orderly reversal of the deficit is a major challenge for policy makers in both the united states and other economies it noted.
4. Conclusion
In conclusion, the integration of machine learning with Elasticsearch unveils a transformative synergy, enhancing the capabilities of data retrieval, analysis, and decision-making.
This article presented one way to apply this combination, multilingual semantic-similarity search, and compared it with traditional search. The same approach can be carried out for data in other formats, such as videos and texts, and adapted to your business needs while maintaining a high level of performance.
Last updated: August 27, 2024