
Use OpenVINO to convert speech to text

June 13, 2022
Sean Pryor, Ryan Loney
Related topics: Data Science, Artificial Intelligence
Related products: Red Hat OpenShift AI


    Speech to text is one of the most common use cases for artificial intelligence. It's used widely to make interacting with machines easier; phone tree automation is a familiar example.

    This article walks you through a speech-to-text example using OpenVINO, an open source toolkit for optimizing and deploying AI inference. The example is a variant of the OpenVINO speech-to-text demo notebook, which can be found in OpenVINO's GitHub repository.

    What is QuartzNet?

    QuartzNet is a variant of the Jasper network architecture that performs speech-to-text transcription.

    The inputs to the network are units called mel spectrograms, a representation of audio data that is produced in several processing steps.

    First, the raw audio signal is divided into overlapping sections, and a Fourier transform is applied to each section, converting the signal from the time domain to the frequency domain. Plotting the amplitude at each frequency over time, on a logarithmic scale, forms a spectrogram.

    Finally, the spectrogram's frequency axis is mapped onto the mel scale (commonly m = 2595 · log10(1 + f/700)), a scale that better differentiates between the ranges of frequency that human speech and hearing cover, forming a mel spectrogram.

    For more on mel spectrograms, read Leland Roberts' article on the subject.
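
    To make this concrete, here is a minimal sketch of the same pipeline using librosa's built-in helper. This is illustrative only: the notebook's own conversion code, shown later in this article, uses a hand-rolled variant with QuartzNet-specific parameters, and the file path here is a placeholder.

    import librosa
    import numpy as np

    # Load audio at 16 kHz, the rate QuartzNet expects (path is a placeholder)
    audio, sr = librosa.load("speech.wav", sr=16000)

    # Short-time Fourier transform -> power spectrogram -> mel filter bank
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=512, n_mels=64)

    # Log scale, since loudness perception is roughly logarithmic
    log_mel = np.log(mel + 1e-10)
    print(log_mel.shape)  # (64, number_of_frames)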

    OpenVINO toolkit

    OpenVINO is a framework for optimizing models, as well as an optimized inference server. It allows you to perform several optimizations:

    • Quantization: Reducing floating-point precision to increase processing speed (see the sketch after this list). INT8 can be orders of magnitude faster than FP16 with similar accuracy in some cases.
    • Accuracy-aware quantization: Automated quantization that preserves a user-specified level of accuracy.
    • Pruning and sparsity: Reducing unnecessary complexity of the model. For example, this could involve removing layers that aren't contributing much to the overall result or weights that are extremely small.
    • Operation fusing: Combining several model layers into one. This gives equivalent accuracy but can run significantly faster on Intel hardware thanks to specialized instructions.
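
    To build intuition for the quantization bullet above, here is a conceptual sketch of affine INT8 quantization in plain NumPy. This is not OpenVINO's implementation (OpenVINO handles quantization through its own tooling); it only illustrates the trade-off between precision and speed.

    import numpy as np

    # Conceptual affine quantization of FP32 weights to INT8
    weights = np.random.randn(64, 64).astype(np.float32)

    # Compute a scale that maps the observed value range onto the INT8 range
    scale = np.abs(weights).max() / 127.0
    quantized = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

    # Dequantize to check the approximation error introduced
    dequantized = quantized.astype(np.float32) * scale
    print("max abs error:", np.abs(weights - dequantized).max())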

    On Red Hat OpenShift Data Science, the default deployment runs on Intel hardware, so no additional setup is required.

    The notebook we'll look at in this article covers downloading a QuartzNet model, converting it to OpenVINO Intermediate Representation (IR), serving it via OpenVINO Model Server, sending mel spectrograms of English-language audio for inference, and decoding the results using a simple algorithm. Note that this is an example; some parts, such as the decoding algorithm, would need improvement for a production use case.

    The OpenVINO Model Server (OVMS) is an Intel-optimized model server that lets you serve multiple models, keep track of model versions, and update models without downtime.
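
    For instance, serving multiple models is driven by a JSON configuration file. Here is a minimal sketch of generating one in Python; the second model ("resnet") and both base paths are hypothetical, included only to show the multi-model shape of the file.

    import json

    # Hypothetical multi-model OVMS configuration; "quartznet" matches this
    # article, while "resnet" is a made-up second model for illustration
    config = {
        "model_config_list": [
            {"config": {"name": "quartznet", "base_path": "s3://openvino-quartznet/"}},
            {"config": {"name": "resnet", "base_path": "s3://models/resnet/"}},
        ]
    }

    with open("config.json", "w") as f:
        json.dump(config, f, indent=4)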

    Download the QuartzNet model

    OpenVINO has its own model zoo where you can browse and download pre-compiled, pre-trained models.

    For this demo, we will download QuartzNet in ONNX format. ONNX (Open Neural Network Exchange) is a portable format for exchanging models: it lets you package a model from a range of frameworks into a single file, and makes it easy to reinstantiate the model from that file. We will later convert this format into OpenVINO's own IR for optimization.
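
    Because an ONNX model is a single self-contained file, it is easy to inspect before conversion. A minimal sketch, assuming the onnx package is installed; the file name is an assumption based on the model name used below.

    import onnx

    # Load and validate the single-file model (the path is an assumption)
    model = onnx.load("quartznet-15x5-en.onnx")
    onnx.checker.check_model(model)

    # List the graph's declared inputs
    for inp in model.graph.input:
        print(inp.name)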

    In the notebook, we first start with a few bits of boilerplate, setting up the paths:

    
    from pathlib import Path  # used by the download and conversion checks below

    model_folder = "model"
    download_folder = "output"
    data_folder = "data"

    precision = "FP16"
    model_name = "quartznet-15x5-en"
    
    

    omz_downloader automatically creates a directory structure and downloads the selected model. This step is skipped if the model is already downloaded. The selected model comes from the public directory, which means it must be converted into Intermediate Representation (IR).

    
    # Check if model is already downloaded in download directory
    path_to_model_weights = Path(f'{download_folder}/public/{model_name}/models')
    downloaded_model_file = list(path_to_model_weights.glob('*.pth'))
    
    if not path_to_model_weights.is_dir() or len(downloaded_model_file) == 0:
        download_command = f"omz_downloader --name {model_name} --output_dir {download_folder} --precision {precision}"
        ! $download_command
    
    

    Convert the model to IR

    Next, we need to convert the model from ONNX format into OpenVINO IR format, which consists of three files.

    • XML: The XML file describes the layers of the network, their dimensions and parameters, and the data flow between them. It does not store the actual weights; instead, it references the bin file, which contains the weights and other large values.
    • Bin: This file contains the large constant values, such as layer weights, that make up the state of the model.
    • Mapping: This file contains additional metadata, such as the I/O between layers.

    A more detailed explanation is available in the OpenVINO docs.
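
    Before pushing the IR to a model server, you can sanity-check it locally with OpenVINO Runtime. A minimal sketch, assuming the openvino Python package (2022.1 or newer) is installed:

    from openvino.runtime import Core

    core = Core()

    # read_model takes the XML file and locates the matching .bin automatically
    model = core.read_model("model/public/quartznet-15x5-en/FP16/quartznet-15x5-en.xml")
    compiled = core.compile_model(model, "CPU")

    # Print the model's declared inputs and their shapes
    for inp in compiled.inputs:
        print(inp.any_name, inp.get_partial_shape())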

    omz_converter first converts the pre-trained PyTorch model to the ONNX format, then converts the ONNX model to the OpenVINO IR format. Both stages of conversion are handled by a single call to omz_converter.

    
    # Check if model is already converted in model directory
    path_to_converted_weights = Path(f'{model_folder}/public/{model_name}/{precision}/{model_name}.bin')
    
    if not path_to_converted_weights.is_file():
        convert_command = f"omz_converter --name {model_name} --precisions {precision} --download_dir {download_folder} --output_dir {model_folder}"
        ! $convert_command
    
    

    In the end, you should have the following files:

    • quartznet-15x5-en.bin
    • quartznet-15x5-en.mapping
    • quartznet-15x5-en.xml

    Upload the model and directory structure to S3

    In OVMS, the model server looks for the following directory structure:

    
    tree models/
    models/
    ├── model1
    │   ├── 1
    │   │   ├── ir_model.bin
    │   │   └── ir_model.xml
    │   └── 2
    │       ├── ir_model.bin
    │       └── ir_model.xml
    ├── model2
    │   └── 1
    │       ├── ir_model.bin
    │       ├── ir_model.xml
    │       └── mapping_config.json
    └── model3
        └── 1
            └── model.onnx
    
    

    You can find more about the model repository directory structure in the OpenVINO docs.
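
    If you'd rather stage this layout on local disk first, a minimal sketch (the destination directory names are arbitrary, following the model-name/version convention above):

    import shutil
    from pathlib import Path

    # Source: the converted IR files; destination: <model name>/<version>
    src = Path("model/public/quartznet-15x5-en/FP16")
    dst = Path("models/quartznet/1")
    dst.mkdir(parents=True, exist_ok=True)

    for f in src.glob("quartznet-15x5-en.*"):
        shutil.copy(f, dst / f.name)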

    The next step is uploading the OpenVINO IR files to S3. This demo assumes that you have an S3 bucket set up in advance; setting one up is beyond the scope of this demo. If you don't have access to a real S3 bucket, there are alternatives such as Ceph RadosGW and Google Cloud Storage, which are also supported by OVMS.

    
    import boto3
    access_key = 'S3_ACCESS_KEY' # <- Replace with actual key
    secret_key = 'S3_SECRET_KEY' # <- Replace with actual key
    s3 = boto3.client('s3',
                endpoint_url='ENDPOINT_URL', # <- This is only necessary when using Ceph RadosGW
                aws_access_key_id=access_key,
                aws_secret_access_key=secret_key,)
    s3.upload_file('model/public/quartznet-15x5-en/FP16/quartznet-15x5-en.bin',
                   'openvino-quartznet', '1/quartznet-15x5-en.bin')
    s3.upload_file('model/public/quartznet-15x5-en/FP16/quartznet-15x5-en.mapping',
                   'openvino-quartznet', '1/quartznet-15x5-en.mapping')
    s3.upload_file('model/public/quartznet-15x5-en/FP16/quartznet-15x5-en.xml',
                   'openvino-quartznet', '1/quartznet-15x5-en.xml')
    
    

    This gives us the following structure in S3:

    
    # tree s3://openvino-quartznet/
    └── 1
        ├── quartznet-15x5-en.bin
        ├── quartznet-15x5-en.mapping
        └── quartznet-15x5-en.xml
    
    

    Create an OVMS instance

    Now that we have uploaded the model to S3, we can create an instance of the OpenVINO Model Server to serve the model. Intel has an OVMS Operator that will allow users to easily provision an OVMS instance. Here's an example custom resource for the Operator:

    
    kind: ModelServer
    apiVersion: intel.com/v1alpha1
    metadata:
     name: openvino-quartznet-model-server
     namespace: your-project-namespace
    spec:
     image_name: >-
       registry.connect.redhat.com/intel/openvino-model-server@sha256:f670aa3dc014b8786e554b8a3bb7e2e8475744d588e5e72d554660b74430a8c5
     deployment_parameters:
       replicas: 1
       resources:
         limits:
           cpu: '4'
           memory: '4Gi'
         requests:
           cpu: '4'
           memory: '4Gi'
     service_parameters:
       grpc_port: 8080
       rest_port: 8081
     models_settings:
       single_model_mode: true
       config_configmap_name: ''
       model_config: ''
       model_name: 'quartznet' # This is the name the model is served with
       model_path: 's3://openvino-quartznet/' # The URL where the model repository was uploaded earlier
       nireq: 0
       plugin_config: '{"CPU_THROUGHPUT_STREAMS":1}'
       batch_size: ''
       shape: '(1, 64, 176)' # Needed because the notebook uses a slightly different input shape than the model's default; OVMS handles the reshape automatically
       model_version_policy: '{"latest": { "num_versions":1 }}'
       layout: ''
       target_device: CPU
       is_stateful: false
       idle_sequence_cleanup: false
       low_latency_transformation: true
       max_sequence_number: 0
     server_settings:
       file_system_poll_wait_seconds: 0
       sequence_cleaner_poll_wait_minutes: 0
       log_level: INFO
       grpc_workers: 1
       rest_workers: 0
     models_repository:
       storage_type: S3
       https_proxy: ''
       http_proxy: ''
       models_host_path: ''
       models_volume_claim: ''
       aws_secret_access_key: 'S3_SECRET_KEY' # Replace with actual key
       aws_access_key_id: 'S3_ACCESS_KEY' # Replace with actual key
       aws_region: ''
       s3_compat_api_endpoint: 'ENDPOINT_URL' # This is only necessary when using Ceph RadosGW
       gcp_creds_secret_name: ''
       azure_storage_connection_string: ''
    
    

    Once the server finishes initializing, the model will be available on both gRPC and HTTP endpoints.
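
    Once it's running, you can sanity-check the model over the REST endpoint (rest_port 8081 in the CR above); OVMS exposes a TensorFlow Serving-compatible REST API. A minimal sketch, assuming the requests package and the in-cluster service hostname used by the gRPC client later in this article:

    import requests

    # Hostname is an assumption based on the service created by the Operator
    host = "openvino-quartznet-model-server.default.svc.cluster.local"
    resp = requests.get(f"http://{host}:8081/v1/models/quartznet/metadata")
    print(resp.json())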

    ovmsclient

    For a simple, lightweight client, ovmsclient is an easy way to interact with an OVMS server. It maintains an underlying gRPC connection to OVMS and provides several convenience features. For starters, the following code connects to the server and queries the model's input and output parameters:

    
    import ovmsclient
    import librosa
    import numpy as np
    import scipy

    client = ovmsclient.make_grpc_client("openvino-quartznet-model-server.default.svc.cluster.local:8080")
    model_metadata = client.get_model_metadata(model_name="quartznet")
    print(model_metadata)

    # Output:
    # {'model_version': 1,
    #  'inputs': {'audio_signal': {'shape': [1, 64, 176], 'dtype': 'DT_FLOAT'}},
    #  'outputs': {'output': {'shape': [1, 88, 29], 'dtype': 'DT_FLOAT'}}}
    
    

    We can see that the input shape matches the shape we defined in the CR above.

    Convert audio data to mel spectrograms

    In order to perform inference, raw audio data must be converted to the mel spectrograms we discussed above. The code below performs this conversion:

    
    # First load the audio data, in this case a clip of English audio with the speaker saying "from the edge to the cloud"
    audio, sampling_rate = librosa.load(path=f'{data_folder}/edge_to_cloud.ogg', sr=16000)
    # This first function converts the audio to mel spectrograms. This has specific window sizing and a hardcoded sampling rate. A different algorithm could be implemented if user needs differ.
    def audio_to_mel(audio, sampling_rate):
        assert sampling_rate == 16000, "Only 16 KHz audio supported"
        preemph = 0.97
        preemphased = np.concatenate([audio[:1], audio[1:] - preemph * audio[:-1].astype(np.float32)])
    
        # Calculate window length
        win_length = round(sampling_rate * 0.02)
    
        # Based on previously calculated window length run short-time Fourier transform
        spec = np.abs(librosa.core.spectrum.stft(preemphased, n_fft=512, hop_length=round(sampling_rate * 0.01),
                      win_length=win_length, center=True, window=scipy.signal.windows.hann(win_length), pad_mode='reflect'))
    
        # Create mel filter-bank, produce transformation matrix to project current values onto Mel-frequency bins
        mel_basis = librosa.filters.mel(sampling_rate, 512, n_mels=64, fmin=0.0, fmax=8000.0, htk=False)
        return mel_basis, spec
    
    # This function changes the mel spectrograms by converting them to a logarithmic scale, normalizing them, and adding padding to make processing easier. Note that this padding ensures the input shape is consistent, and matches the (1, 64, 176) we supplied as the input shape when creating the model server instance.
    def mel_to_input(mel_basis, spec, padding=16):
        # Convert to logarithmic scale
        log_melspectrum = np.log(np.dot(mel_basis, np.power(spec, 2)) + 2 ** -24)
    
        # Normalize output
        normalized = (log_melspectrum - log_melspectrum.mean(1)[:, None]) / (log_melspectrum.std(1)[:, None] + 1e-5)
    
        # Calculate padding
        remainder = normalized.shape[1] % padding
        if remainder != 0:
            return np.pad(normalized, ((0, 0), (0, padding - remainder)))[None]
        return normalized[None]
    
    
    mel_basis, spec = audio_to_mel(audio=audio.flatten(), sampling_rate=sampling_rate)
    audio = mel_to_input(mel_basis=mel_basis, spec=spec)
    
    # The inference server requires a dict that has the following formatting. The input key is the same 'audio_signal' that was returned by the metadata call above
    inputs = {'audio_signal': audio}
    
    # If we look at the shape with the included padding, it now matches the shape the model expects
    print(audio.shape)
    # Output: (1, 64, 176)
    
    

    Inference example

    The final step is to actually perform the inference. This involves using ovmsclient to make the inference call, as well as decoding the results. As noted above, the decoding step here is simpler than what a production environment would require and is provided for demo purposes only. In particular, it only emits a letter when the prediction changes, so words containing genuinely doubled letters could be transcribed incorrectly. The words in our example contain no double letters, so it will work fine, but be aware of this limitation.

    At the end, we'll have a prediction for each time step, with each index corresponding to a letter in the alphabet array.

    
    character_probabilities = client.predict(inputs=inputs, model_name="quartznet")

    # The 29 outputs per time step are the probabilities of each character;
    # we take the argmax as the most probable character at each step
    alphabet = " abcdefghijklmnopqrstuvwxyz'~"
    character_probabilities = next(iter(character_probabilities))

    # Remove unnecessary dimension (we are doing inference in batches of 1)
    character_probabilities = np.squeeze(character_probabilities)

    # Run argmax to pick the most probable symbol at each time step
    character_probabilities = np.argmax(character_probabilities, axis=1)

    def ctc_greedy_decode(predictions):
        previous_letter_id = blank_id = len(alphabet) - 1
        transcription = list()
        for letter_index in predictions:
            if previous_letter_id != letter_index != blank_id:
                transcription.append(alphabet[letter_index])
            previous_letter_id = letter_index
        return ''.join(transcription)

    transcription = ctc_greedy_decode(character_probabilities)
    print(transcription)

    # Output: from the edge to the cloud
    
    

    Conclusion

    In this article, you've seen an end-to-end example setup for voice transcription using OpenVINO. This has many applications, from note taking to chatbots to voice search. Using OpenVINO, the model can easily be optimized for any target hardware footprint as well, allowing it to be used anywhere from the edge to the cloud.

    For a deeper dive, check out the complete code for this notebook example.
