Multilingual semantic-similarity search with Elasticsearch

Search large datasets across multiple languages.

January 10, 2024
Tarcisio Oliveira
Related topics: Artificial intelligence, Data Science
Related products: Red Hat OpenShift AI


    AI-powered search, the combination of artificial intelligence technologies and search engines, enables semantic and similarity search that goes beyond keyword matching to understand the intent and context of a query.

    Integrating machine learning with Elasticsearch extends what search and data analysis can do. Elasticsearch, known for its real-time search capabilities, can be combined with machine learning models to return context-sensitive results.

    In this article, you will discover how this powerful combination gives users the ability to efficiently index large text datasets and perform multilingual semantic-similarity searches in Elasticsearch using a pretrained model.

    The step-by-step guide assumes you have a Jupyter notebook or Python environment to run the code, and a running Elasticsearch instance that the environment can reach. No particular platform is required, but we recommend Red Hat OpenShift Container Platform with the Red Hat OpenShift AI and Elasticsearch (ECK) operators. We also recommend creating a sandbox environment through Red Hat Developer, where you can get started quickly at no cost.

    All the code can be found in the tarcis-io/ml_elasticsearch repository.

    1. Machine learning

    In machine learning and natural language processing (NLP), models take vectors (arrays of numbers) as input. Embedding is a technique that represents words, phrases, images, audio, or other entities as vectors of real numbers in a high-dimensional space (as illustrated in Figure 1). These embeddings capture semantic relationships and contextual information, so that items with similar meaning lie close together and algorithms can compare them numerically.

    Figure 1: Embedding model.
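
To make the idea of closeness in an embedding space concrete, here is a minimal sketch that compares vectors with cosine similarity. The three-dimensional vectors are invented for illustration; real embeddings come from a trained model and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", made up for this example only.
toy_embeddings = {
    'economic growth':       [0.9, 0.1, 0.0],
    'crecimiento económico': [0.8, 0.2, 0.1],
    'football match':        [0.0, 0.2, 0.9],
}

def most_similar(query_text, embeddings):
    """Return the other text whose vector is closest to the query's vector."""
    query_vector = embeddings[query_text]
    candidates = (text for text in embeddings if text != query_text)
    return max(candidates, key=lambda text: cosine_similarity(query_vector, embeddings[text]))

print(most_similar('economic growth', toy_embeddings))
```

In this toy space the Spanish phrase sits closest to the English one, which is the behavior a multilingual embedding model aims to provide.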

    1.1. Universal Sentence Encoder Multilingual

    The Universal Sentence Encoder is a model developed by Google that produces fixed-size embeddings for input sentences or short texts. These embeddings are designed to capture semantic information about the meaning of the input text, making them useful for a variety of natural language processing tasks, such as text classification, text clustering and semantic search.

    The Multilingual Universal Sentence Encoder module (shown in Figure 2) is an extension of the Universal Sentence Encoder that includes training on multiple tasks across different languages.

    Figure 2: Multilingual Universal Sentence Encoder.

    2. Elasticsearch

    Elasticsearch is an open source distributed search and analytics engine designed for scalability and real-time search. It can handle large volumes of data of different types, including structured, unstructured, and geospatial data. Elasticsearch is commonly used in various applications, including log and event data analysis, monitoring, business intelligence, and search engines. Its versatility and scalability make it a popular choice for organizations dealing with large and complex datasets.

    2.1. Index

    In Elasticsearch, an index is a collection of documents that share a similar structure and are stored together for efficient searching and retrieval. It serves as the primary unit for organizing and managing data (Figure 3).

    Figure 3: Elasticsearch actions.

    3. Show me the code!

    3.1. Index the dataset

    The first step is to download the BBC News dataset and index it in Elasticsearch. The notebook that implements these tasks is 01_index_dataset.ipynb.

    3.1.1. Install and import the required packages

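    To perform the indexing tasks, install the required packages (for example, with pip) and import them:

    # Assumed install command; pin versions as appropriate for your environment:
    #   pip install tensorflow tensorflow-text tensorflow-hub datasets elasticsearch
    import tensorflow_text  # imported for its side effect: registers the text ops the model needs
    
    from datasets       import load_dataset
    from elasticsearch  import Elasticsearch
    from IPython        import display
    from tensorflow_hub import load

    3.1.2. Create the Elasticsearch client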

    Create the client that will be responsible for executing actions in Elasticsearch. At a minimum, you must specify the host and some form of authentication, such as a username and password:

    es_host     = '<elasticsearch_host>'
    es_username = '<elasticsearch_username>'
    es_password = '<elasticsearch_password>'
    
    es = Elasticsearch(
        hosts        = es_host,
        basic_auth   = (es_username, es_password),
        verify_certs = False  # disables TLS certificate verification; use only in test environments
    )
    
    es.info()

    If the connection has been established, the es.info() method should return something similar to this:

    ObjectApiResponse({ "tagline":"You Know, for Search" })
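
Rather than pasting credentials into a notebook, the connection settings can be read from the environment. The following is a minimal sketch of that idea; the variable names ES_HOST, ES_USERNAME, ES_PASSWORD, and ES_VERIFY_CERTS are hypothetical and not part of the original notebook.

```python
import os

def es_client_settings(env=os.environ):
    """Build keyword arguments for Elasticsearch(...) from environment variables.

    The variable names used here are assumptions; rename them to match
    your deployment.
    """
    return {
        'hosts': env['ES_HOST'],
        'basic_auth': (env['ES_USERNAME'], env['ES_PASSWORD']),
        # Verify TLS certificates unless the variable is set to "false".
        'verify_certs': env.get('ES_VERIFY_CERTS', 'true').lower() != 'false',
    }

# Example with an explicit mapping instead of the real environment:
example_env = {
    'ES_HOST': 'https://localhost:9200',
    'ES_USERNAME': 'elastic',
    'ES_PASSWORD': 'changeme',
}
print(es_client_settings(example_env)['verify_certs'])

# Usage, assuming the elasticsearch package is installed:
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch(**es_client_settings())
```

Keeping certificate verification on by default means it has to be switched off deliberately, for example in a sandbox with self-signed certificates.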

    3.1.3. Download the BBC News dataset

    In this article we use the BBC News dataset, which contains over 2,000 categorized news articles in text format. We can download it by calling the load_dataset function from the datasets package:

    bbc_news_dataset = load_dataset('SetFit/bbc-news')

    The output should be an object of type DatasetDict:

    DatasetDict({
        train: Dataset({
            features: ['text', 'label', 'label_text'],
            num_rows: 1225
        })
        test: Dataset({
            features: ['text', 'label', 'label_text'],
            num_rows: 1000
        })
    })

    3.1.4. Download the Multilingual Universal Sentence Encoder model

    In machine learning, a model is a mathematical representation that learns patterns from data to make predictions or decisions without being explicitly programmed.

    The Multilingual Universal Sentence Encoder is trained on multiple tasks across 16 languages and maps text from all of them into a single shared embedding space, showing strong performance in multilingual retrieval. The model is published in a Convolutional Neural Network (CNN) variant and a larger Transformer-based variant; the multilingual-large URL used below refers to the latter.

    From the tensorflow-hub package, we can easily use this model with the following expression:

    model = load('https://www.kaggle.com/models/google/universal-sentence-encoder/frameworks/TensorFlow2/variations/multilingual-large/versions/2')

    The model takes text as input and returns a 512-dimensional vector, the embedding of that text. For example, for the input:

    model('Hello World, ML Elasticsearch!')[0].numpy()

    The output should be:

    array([ 6.04993142e-02,  2.46977881e-02,  3.43629159e-02, -3.45645919e-02,
           -2.19440423e-02,  1.74631681e-02,  4.07335758e-02, -7.56947622e-02,
           -4.24392615e-03, -9.12883040e-03, -7.34170526e-02, -8.02445412e-02,
            7.20020905e-02,  3.64579372e-02, -4.03201990e-02,  3.09278686e-02,
           -4.60638112e-04,  2.28298176e-03,  8.05738419e-02,  1.25325136e-02,
            8.55052546e-02, -4.09091152e-02,  7.49715557e-03,  1.39080118e-02,
            4.30730805e-02, -3.43654901e-02,  5.11647621e-03,  2.94517819e-02,
            1.54668670e-02,  6.75039068e-02,  1.02604687e-01, -3.43576036e-02,
            1.10444212e-02, -5.98613322e-02, -2.86747441e-02,  3.58597264e-02,
            1.37920845e-02, -1.31028690e-04, -5.92646468e-03, -7.39952251e-02,
            3.12727243e-02, -5.23758633e-03, -4.90117408e-02, -2.00900845e-02,
            7.94764757e-02,  1.69147346e-02, -7.27028772e-03,  4.55966964e-02,
            2.82147657e-02, -1.79359596e-02,  3.01514324e-02, -4.47459966e-02,
           -6.71745390e-02,  4.77596521e-02,  7.86093343e-03, -1.41343456e-02,
           -3.58230583e-02, -1.85324792e-02,  4.14996557e-02, -2.08834168e-02,
            8.51072446e-02,  4.16630059e-02, -5.32974862e-02, -4.40437198e-02,
           -4.28032987e-02, -8.95944908e-02,  3.19887549e-02,  6.05730340e-02,
           -3.23659391e-03,  7.54942596e-02, -5.46579855e-03,  3.77340917e-03,
            5.46114445e-02, -4.40792646e-03, -3.59019917e-03,  6.38055429e-02,
            1.04503930e-02,  5.62766846e-03, -3.87495980e-02,  4.47553173e-02, ... ])

    3.1.5. Create the index for the dataset

    We can then use the Elasticsearch client to create the index that will store the records from the BBC News dataset. Adjust the settings, such as the number of shards and replicas, to your environment; the dims value must match the 512-dimensional output of the model:

    bbc_news_index = 'bbc_news'
    
    es.indices.create(
        index    = bbc_news_index,
        settings = {
            'number_of_shards'   : 2,
            'number_of_replicas' : 1
        },
        mappings = {
            'properties' : {
                'text'     : { 'type' : 'text' },
                'vector'   : { 'type' : 'dense_vector', 'dims' : 512, 'index' : True },
                'metadata' : {
                    'properties' : {
                        'label'        : { 'type' : 'integer' },
                        'label_text'   : { 'type' : 'text' },
                        'dataset_type' : { 'type' : 'text' }
                    }
                }
            }
        }
    )

    The result should be an acknowledgment, for example:

    ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'bbc_news'})
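
Because the dims value has to agree with the model output, it can help to derive the mapping from a sample embedding instead of hard-coding 512. The sketch below assumes a hypothetical helper named build_mappings, and it assumes an Elasticsearch 8.x cluster in which dense_vector accepts a similarity option; check the documentation for your version before relying on it.

```python
def build_mappings(dims, similarity='cosine'):
    """Return index mappings with a dense_vector field of the given size.

    build_mappings is a hypothetical helper, not part of the original
    notebook. The 'similarity' option is assumed to be supported by the
    target cluster.
    """
    return {
        'properties': {
            'text': {'type': 'text'},
            'vector': {
                'type': 'dense_vector',
                'dims': dims,
                'index': True,
                'similarity': similarity,
            },
            'metadata': {
                'properties': {
                    'label': {'type': 'integer'},
                    'label_text': {'type': 'text'},
                    'dataset_type': {'type': 'text'},
                }
            },
        }
    }

# A stand-in for model('some text')[0].numpy(); only its length matters here.
sample_embedding = [0.0] * 512
mappings = build_mappings(dims=len(sample_embedding))
print(mappings['properties']['vector'])

# Usage with the client from the previous step:
#   es.indices.create(index=bbc_news_index, mappings=mappings)
```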

    3.1.6. Index the BBC News dataset in Elasticsearch

    Now, index the BBC News dataset in Elasticsearch using the client created previously. The dataset is split into train and test sets; we record which split each document came from in the dataset_type field. We store the embedding of each document's text field in the vector field, which will be used to perform the multilingual semantic-similarity search:

    for dataset_type in bbc_news_dataset:
    
        dataset = bbc_news_dataset[dataset_type]
        size    = len(dataset)
    
        for index, item in enumerate(dataset, start = 1):
    
            display.clear_output(wait = True)
            print(f'Indexing BBC News { dataset_type } dataset : { index } / { size }')
    
            document = {
                'text'     : item['text'],
                'vector'   : model(item['text'])[0].numpy(),
                'metadata' : {
                    'label'        : item['label'],
                    'label_text'   : item['label_text'],
                    'dataset_type' : dataset_type
                }
            }
    
            es.index(index = bbc_news_index, document = document)

    A counter will appear that displays the indexing progress. When finished, you can check the total number of records in the index by calling the es.count() method:

    es.count(index = bbc_news_index)

    The count should match the 2,225 records in the dataset (1,225 train and 1,000 test):

    ObjectApiResponse({'count': 2225, '_shards': {'total': 2, 'successful': 2, 'skipped': 0, 'failed': 0}})
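
Indexing one document per request works for a dataset of this size, but each call is a separate round trip. As a sketch of an alternative, the documents can be produced by a generator and sent in batches with the bulk helper of the Python client. The generate_actions function and the embed callable are hypothetical names introduced here; the demo uses stub data so it runs without a cluster or model.

```python
def generate_actions(dataset_dict, embed, index_name):
    """Yield one bulk action per record across all dataset splits.

    embed is any callable that maps a text to a sequence of numbers,
    for example: lambda text: model(text)[0].numpy()
    """
    for dataset_type, dataset in dataset_dict.items():
        for item in dataset:
            yield {
                '_index': index_name,
                '_source': {
                    'text': item['text'],
                    # Plain Python floats serialize to JSON without extra handling.
                    'vector': [float(x) for x in embed(item['text'])],
                    'metadata': {
                        'label': item['label'],
                        'label_text': item['label_text'],
                        'dataset_type': dataset_type,
                    },
                },
            }

# Stub data and a stub embedding function, for illustration only.
toy_dataset = {
    'train': [{'text': 'first article', 'label': 0, 'label_text': 'tech'}],
    'test': [{'text': 'second article', 'label': 1, 'label_text': 'business'}],
}
actions = list(generate_actions(toy_dataset, lambda text: [0.0, 1.0], 'bbc_news'))
print(len(actions))

# Usage against a real cluster, assuming the elasticsearch package is installed:
#   from elasticsearch import helpers
#   helpers.bulk(es, generate_actions(bbc_news_dataset, embed, bbc_news_index))
```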

    3.2. Search

    With the BBC News dataset indexed in our running instance of Elasticsearch, we can search the text field for keyword-based similarity search and the vector field for multilingual semantic-similarity search.

    The code to implement this section can be found in notebook 02_search.ipynb. Let's skip the parts about importing packages, creating the Elasticsearch client, defining bbc_news_index, and downloading the Multilingual Universal Sentence Encoder model, which were shown in the previous steps.

    3.2.1. Create the base search function

    The search() function will be the basis used for searching and presenting the result:

    def search(query):
    
        result = es.search(index = bbc_news_index, query = query, size = 1)
        result = result['hits']['hits']
    
        if len(result) == 0:
    
            print('no results found...')
            return
    
        result = result[0]
    
        print(f"score : { result['_score'] }")
        print(f"label : { result['_source']['metadata']['label_text'] }")
        print(f"text  : { result['_source']['text'] }")

    3.2.2. Create the similarity_search function

    Let's define the function that performs the similarity search. It uses a match query, the standard Elasticsearch full-text query, which scores documents by the terms they share with the query. We will run it against the text field:

    def similarity_search(text):
    
        query = {
            'match' : {
                'text' : text
            }
        }
    
        search(query)

    We can test the function with the same phrase in three languages:

    text_english    = 'european economic growth'
    text_spanish    = 'crecimiento económico europeo'
    text_portuguese = 'crescimento econômico europeu'
    
    similarity_search(text_english)
    similarity_search(text_spanish)
    similarity_search(text_portuguese)

    Searching for the English text returns the following result:

    score : 12.638578
    label : business
    text  : newest eu members underpin growth the european union s newest members will bolster europe s economic growth in 2005  according to a new report.  the eight central european states which joined the eu last year will see 4.6% growth  the united nations economic commission for europe (unece) said. in contrast  the 12 euro zone countries will put in a  lacklustre  performance  generating growth of only 1.8%. the global economy will slow in 2005  the unece forecasts  due to widespread weakness in consumer demand.  it warned that growth could also be threatened by attempts to reduce the united states  huge current account deficit which  in turn  might lead to significant volatility in exchange rates.  unece is forecasting average economic growth of 2.2% across the european union in 2005. however  total output across the euro zone is forecast to fall in 2004 from 1.9% to 1.8%. this is due largely to the faltering german economy  which shrank 0.2% in the last quarter of 2004. on monday  germany s bdb private banks association said the german economy would struggle to meet its 1.4% growth target in 2005.  separately  the bundesbank warned that germany s efforts to reduce its budget deficit below 3% of gdp presented  huge risks  given that headline economic growth was set to fall below 1% this year. publishing its 2005 economic survey  the unece said central european countries such as the czech republic and slovenia would provide the backbone of the continent s growth. smaller nations such as cyprus  ireland and malta would also be among the continent s best performing economies this year  it said. the uk economy  on the other hand  is expected to slow in 2005  with growth falling from 3.2% last year to 2.5%.  consumer demand will remain fragile in many of europe s largest countries and economies will be mostly driven by growth in exports.  
in view of the fragility of factors of domestic growth and the dampening effects of the stronger euro on domestic economic activity and inflation  monetary policy in the euro area is likely to continue to  wait and see   the organisation said in its report. global economic growth is expected to fall from 5% in 2004 to 4.25% despite the continued strength of the chinese and us economies. the unece warned that attempts to bring about a controlled reduction in the us current account deficit could cause difficulties.  the orderly reversal of the deficit is a major challenge for policy makers in both the united states and other economies   it noted.

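    However, for the same text in Portuguese and Spanish, no results are found, because the match query compares terms and none of the translated words occur in the English articles: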

    no results found...
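
The empty results follow from how term-based matching works. The sketch below is a deliberately simplified stand-in, counting shared lowercase tokens; it is not Elasticsearch's analyzer or its scoring function. The document string is the opening of the article returned above.

```python
def tokenize(text):
    """Split on whitespace and lowercase; a crude stand-in for a text analyzer."""
    return set(text.lower().split())

def shared_terms(query, document):
    """Count the distinct terms that appear in both the query and the document."""
    return len(tokenize(query) & tokenize(document))

document = (
    'newest eu members underpin growth the european union s newest members '
    'will bolster europe s economic growth in 2005'
)

queries = {
    'english': 'european economic growth',
    'spanish': 'crecimiento económico europeo',
    'portuguese': 'crescimento econômico europeu',
}

for language, query in queries.items():
    print(language, shared_terms(query, document))
# prints:
# english 3
# spanish 0
# portuguese 0
```

With no terms in common, a term-based query has nothing to score, which is the gap the embedding-based search in the next section addresses.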

    3.2.3. Create the multilingual_semantic_similarity_search function

    Finally, let's define the function that will perform the multilingual semantic-similarity search.

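    In this function, we embed the input text with the same pretrained Multilingual Universal Sentence Encoder model and use a script_score query to compute the cosine similarity between that embedding and each vector indexed in Elasticsearch. Cosine similarity ranges from -1 to 1, and Elasticsearch does not accept negative scores, so the script adds 1.0: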

    def multilingual_semantic_similarity_search(text):
    
        query = {
            'script_score' : {
                'query'  : { 'match_all' : {} },
                'script' : {
                    'source' : "cosineSimilarity(params.vector, 'vector') + 1.0",
                    'params' : { 'vector' : model(text)[0].numpy() }
                }
            }
        }
    
        search(query)
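
To see what the script expression computes, here is a small pure-Python sketch of the same arithmetic on two-dimensional vectors chosen for illustration. It mirrors the formula only; it does not call Elasticsearch.

```python
import math

def script_score(query_vector, doc_vector):
    """Same arithmetic as the script: cosine similarity shifted by 1.0."""
    dot = sum(q * d for q, d in zip(query_vector, doc_vector))
    norms = (
        math.sqrt(sum(q * q for q in query_vector))
        * math.sqrt(sum(d * d for d in doc_vector))
    )
    return dot / norms + 1.0

query_vector = [1.0, 0.0]
doc_vectors = {
    'same_direction': [1.0, 0.0],
    'orthogonal': [0.0, 1.0],
    'opposite': [-1.0, 0.0],
}

for name, doc_vector in doc_vectors.items():
    print(name, script_score(query_vector, doc_vector))
# prints:
# same_direction 2.0
# orthogonal 1.0
# opposite 0.0
```

The shifted score therefore lies between 0 and 2, and a score such as 1.4 corresponds to a cosine similarity of about 0.4.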

    Lastly, we can use this function for the same examples defined previously:

    multilingual_semantic_similarity_search(text_english)
    multilingual_semantic_similarity_search(text_spanish)
    multilingual_semantic_similarity_search(text_portuguese)

    For all three examples, the same text in different languages, we should get the same document, with small differences in the score calculated from cosine similarity:

    score : 1.4061848
    label : business
    text  : newest eu members underpin growth the european union s newest members will bolster europe s economic growth in 2005  according to a new report.  the eight central european states which joined the eu last year will see 4.6% growth  the united nations economic commission for europe (unece) said. in contrast  the 12 euro zone countries will put in a  lacklustre  performance  generating growth of only 1.8%. the global economy will slow in 2005  the unece forecasts  due to widespread weakness in consumer demand.  it warned that growth could also be threatened by attempts to reduce the united states  huge current account deficit which  in turn  might lead to significant volatility in exchange rates.  unece is forecasting average economic growth of 2.2% across the european union in 2005. however  total output across the euro zone is forecast to fall in 2004 from 1.9% to 1.8%. this is due largely to the faltering german economy  which shrank 0.2% in the last quarter of 2004. on monday  germany s bdb private banks association said the german economy would struggle to meet its 1.4% growth target in 2005.  separately  the bundesbank warned that germany s efforts to reduce its budget deficit below 3% of gdp presented  huge risks  given that headline economic growth was set to fall below 1% this year. publishing its 2005 economic survey  the unece said central european countries such as the czech republic and slovenia would provide the backbone of the continent s growth. smaller nations such as cyprus  ireland and malta would also be among the continent s best performing economies this year  it said. the uk economy  on the other hand  is expected to slow in 2005  with growth falling from 3.2% last year to 2.5%.  consumer demand will remain fragile in many of europe s largest countries and economies will be mostly driven by growth in exports.  
in view of the fragility of factors of domestic growth and the dampening effects of the stronger euro on domestic economic activity and inflation  monetary policy in the euro area is likely to continue to  wait and see   the organisation said in its report. global economic growth is expected to fall from 5% in 2004 to 4.25% despite the continued strength of the chinese and us economies. the unece warned that attempts to bring about a controlled reduction in the us current account deficit could cause difficulties.  the orderly reversal of the deficit is a major challenge for policy makers in both the united states and other economies   it noted.
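
The script_score query with match_all compares the query vector against every document. Since the vector field was mapped with index enabled, an approximate nearest-neighbor search may be an option on larger indices. The sketch below only builds the search clause; build_knn_clause is a hypothetical helper, and it assumes a client and cluster recent enough to accept the knn option of the search API.

```python
def build_knn_clause(query_vector, k=1, num_candidates=100, field='vector'):
    """Build the body of a kNN search clause for a dense_vector field.

    build_knn_clause is a hypothetical helper; the default values for
    k and num_candidates are illustrative, not tuned.
    """
    return {
        'field': field,
        'query_vector': [float(x) for x in query_vector],
        'k': k,
        'num_candidates': num_candidates,
    }

# A stand-in for model(text)[0].numpy(); a real query vector has 512 items.
clause = build_knn_clause([0.1, 0.2, 0.3], k=1, num_candidates=50)
print(clause['k'], clause['num_candidates'])

# Usage, assuming the installed client supports the knn argument:
#   es.search(index=bbc_news_index, knn=build_knn_clause(model(text)[0].numpy()), size=1)
```

Scores returned by a kNN search are computed differently from the script above, so they should not be compared directly with the values shown in this section.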

    4. Conclusion

    In conclusion, integrating machine learning with Elasticsearch extends data retrieval beyond keyword matching to search by meaning, which in turn supports analysis and decision-making.

    This article showed how to apply this combination to multilingual semantic-similarity search and compared it with traditional keyword search. The same approach can be applied to data in other formats, such as images, audio, or video, given a suitable embedding model, and can be adapted to your business needs.

    Last updated: August 27, 2024
