
Use OpenVINO to convert speech to text

June 13, 2022
Sean Pryor, Ryan Loney
Related topics:
Data science, Artificial intelligence
Related products:
Red Hat OpenShift AI

    Speech to text is one of the most common use cases for artificial intelligence. It's used widely to make interacting with machines easier; phone tree automation is a common example.

    This article walks you through a speech-to-text example using OpenVINO, an open source toolkit for optimizing and deploying AI inference. This example is a variant of the OpenVINO speech-to-text demo notebook, which can be found in OpenVINO's GitHub repository.

    What is QuartzNet?

    QuartzNet is a variant of the Jasper network architecture that performs speech-to-text transcription.

    The inputs to the network are mel spectrograms, a representation of audio data that involves several steps of processing.

    First, the raw audio signal is divided into overlapping sections, and a Fourier transform is applied to each, converting the signal from amplitudes over time to frequencies. Plotting the amplitude of each frequency band over time, usually on a logarithmic scale, forms a spectrogram.

    Finally, the spectrogram's domain is changed to the mel scale, which is a frequency scale that better differentiates between the ranges of frequency that human speech and hearing cover, forming a mel spectrogram.

    For more on mel spectrograms, read Leland Roberts' article on the subject.
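    As a concrete illustration of the mel scale itself, here is a minimal pure-Python sketch of the HTK-style conversion formula. Note that this is illustrative only: the notebook below calls librosa with htk=False, which applies the slightly different Slaney variant.

```python
import math

def hz_to_mel(f_hz):
    # HTK-style mel scale: roughly linear below 1 kHz, logarithmic above,
    # mirroring how human hearing resolves frequency
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse mapping back to Hertz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Equal steps in mel cover progressively wider ranges in Hz: the scale
# compresses high frequencies, where hearing is less discriminating
low_band = hz_to_mel(1000.0) - hz_to_mel(0.0)
high_band = hz_to_mel(8000.0) - hz_to_mel(7000.0)
```

    A mel filter bank, like the one built later with librosa.filters.mel, spaces its triangular filters evenly on this scale rather than in raw Hertz.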

    OpenVINO toolkit

    OpenVINO is a framework for optimizing models, as well as an optimized inference server. It allows you to perform several optimizations:

    • Quantization: Reducing floating point precision to increase processing speed. INT8 can be orders of magnitude faster than FP16 with similar levels of precision in some cases.
    • Accuracy-aware quantization: Automated quantization that preserves a user-specified level of accuracy.
    • Pruning and sparsity: Reducing unnecessary complexity of the model. For example, this could involve removing layers that aren't contributing much to the overall result or weights that are extremely small.
    • Operation fusing: Combining several model layers into one. This gives equivalent accuracy but can run significantly faster on Intel hardware thanks to specialized instructions.
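    To make the quantization idea concrete, here is a toy pure-Python sketch of symmetric INT8 quantization of a weight vector. This is a deliberate simplification: OpenVINO's actual tooling typically quantizes per-channel and uses calibration data to choose scales.

```python
def quantize_int8(weights):
    # Symmetric quantization: map the float range [-max_abs, max_abs]
    # onto the integer range [-127, 127] with a single scale factor
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize_int8(quantized, scale):
    # Recover approximate float weights; the round-trip error per weight
    # is bounded by half the scale factor
    return [q * scale for q in quantized]

weights = [0.02, -0.5, 0.31, 1.27, -1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize_int8(quantized, scale)
```

    Each INT8 weight occupies a quarter of the memory of an FP32 weight, and integer arithmetic maps onto fast vector instructions; the trade-off is the small rounding error visible when comparing restored to weights.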

    On Red Hat OpenShift Data Science, the default deployment is done on Intel hardware, meaning there is no additional setup required.

    The notebook we'll be looking at in this article covers downloading a QuartzNet model, converting it to OpenVINO Intermediate Representation (IR), serving it via OpenVINO Model Server, sending mel spectrograms of English-language audio for inference, and decoding the results using a simple algorithm. Note that this is an example; some parts, such as the decoding algorithm, could be improved if one were to adapt this for a production use case.

    The OpenVINO Model Server (OVMS) is an Intel-optimized model server that allows a user to serve multiple models, keep track of model versions, and update models without downtime.
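    When OVMS runs in multi-model mode, the served models are listed in a configuration file. As an illustration (the second model name and the bucket paths are placeholders, not part of this article's deployment), a minimal config.json might look like:

```json
{
  "model_config_list": [
    {
      "config": {
        "name": "quartznet",
        "base_path": "s3://openvino-quartznet/"
      }
    },
    {
      "config": {
        "name": "another-model",
        "base_path": "s3://another-bucket/another-model/"
      }
    }
  ]
}
```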

    Download the QuartzNet model

    OpenVINO has its own model zoo where you can browse and download pre-compiled and pre-trained models.

    For this demo, we will download an ONNX-format QuartzNet model. ONNX (Open Neural Network Exchange) is a portable format for exchanging models: it lets you package a model from a wide range of frameworks into a single file and easily reinstantiate the model from that file. We will later convert this format into OpenVINO's own IR for optimization.

    In the notebook, we first start with a few bits of boilerplate, setting up the paths:

    
    model_folder = "model"
    download_folder = "output"
    data_folder = "data"
    
    precision = "FP16"
    model_name = "quartznet-15x5-en"
    
    

    omz_downloader automatically creates a directory structure and downloads the selected model. This step is skipped if the model is already downloaded. The selected model comes from the public directory, which means it must be converted into Intermediate Representation (IR).

    
    from pathlib import Path

    # Check if the model is already downloaded in the download directory
    path_to_model_weights = Path(f'{download_folder}/public/{model_name}/models')
    downloaded_model_file = list(path_to_model_weights.glob('*.pth'))
    
    if not path_to_model_weights.is_dir() or len(downloaded_model_file) == 0:
        download_command = f"omz_downloader --name {model_name} --output_dir {download_folder} --precision {precision}"
        ! $download_command
    
    

    Convert the model to IR

    Next, we need to convert the model from ONNX format into OpenVINO IR format, which consists of three files.

    • XML: The XML file describes the layers of the network, their dimensions, and parameters, as well as the data flow. It does not store the actual weights; instead it references the bin file, which contains the weights and other large values.
    • Bin: This file contains the large constant values like layer weights and other things that detail the state of the model.
    • Mapping: This file contains some additional metadata detailing things like the IO between layers.

    A more detailed explanation is available in the OpenVINO docs.

    omz_converter converts the pre-trained PyTorch model to the ONNX model format, which is further converted to the OpenVINO IR format. Both stages of conversion are handled by calling omz_converter.

    
    # Check if model is already converted in model directory
    path_to_converted_weights = Path(f'{model_folder}/public/{model_name}/{precision}/{model_name}.bin')
    
    if not path_to_converted_weights.is_file():
        convert_command = f"omz_converter --name {model_name} --precisions {precision} --download_dir {download_folder} --output_dir {model_folder}"
        ! $convert_command
    
    

    In the end, you should have the following files:

    • quartznet-15x5-en.bin
    • quartznet-15x5-en.mapping
    • quartznet-15x5-en.xml

    Upload the model and directory structure to S3

    In OVMS, the model server looks for the following directory structure:

    
    tree models/
    models/
    ├── model1
    │   ├── 1
    │   │   ├── ir_model.bin
    │   │   └── ir_model.xml
    │   └── 2
    │       ├── ir_model.bin
    │       └── ir_model.xml
    ├── model2
    │   └── 1
    │       ├── ir_model.bin
    │       ├── ir_model.xml
    │       └── mapping_config.json
    └── model3
        └── 1
            └── model.onnx
    
    

    You can find more about the model repository directory structure in the OpenVINO docs.

    The next step is uploading the OpenVINO IR files to S3. This demo assumes that you have a local S3 set up in advance; setting that up is beyond the scope of this demo. If you don't have access to a real S3 bucket, there are alternatives like Ceph RadosGW or Google Storage, which are also supported by OVMS.

    
    import boto3
    access_key = 'S3_ACCESS_KEY' # <- Replace with actual key
    secret_key = 'S3_SECRET_KEY' # <- Replace with actual key
    s3 = boto3.client('s3',
                endpoint_url='ENDPOINT_URL', # <- This is only necessary when using Ceph RadosGW
                aws_access_key_id=access_key,
                aws_secret_access_key=secret_key,)
    s3.upload_file('model/public/quartznet-15x5-en/FP16/quartznet-15x5-en.bin',
                   'openvino-quartznet', '1/quartznet-15x5-en.bin')
    s3.upload_file('model/public/quartznet-15x5-en/FP16/quartznet-15x5-en.mapping',
                   'openvino-quartznet', '1/quartznet-15x5-en.mapping')
    s3.upload_file('model/public/quartznet-15x5-en/FP16/quartznet-15x5-en.xml',
                   'openvino-quartznet', '1/quartznet-15x5-en.xml')
    
    

    This gives us the following structure in S3:

    
    # tree s3://openvino-quartznet/
    └── 1
        ├── quartznet-15x5-en.bin
        ├── quartznet-15x5-en.mapping
        └── quartznet-15x5-en.xml
    
    

    Create an OVMS instance

    Now that we have uploaded the model to S3, we can create an instance of the OpenVINO Model Server to serve the model. Intel has an OVMS Operator that will allow users to easily provision an OVMS instance. Here's an example custom resource for the Operator:

    
    kind: ModelServer
    apiVersion: intel.com/v1alpha1
    metadata:
     name: openvino-quartznet-model-server
     namespace: your-project-namespace
    spec:
     image_name: >-
       registry.connect.redhat.com/intel/openvino-model-server@sha256:f670aa3dc014b8786e554b8a3bb7e2e8475744d588e5e72d554660b74430a8c5
     deployment_parameters:
       replicas: 1
       resources:
         limits:
           cpu: '4'
           memory: '4Gi'
         requests:
           cpu: '4'
           memory: '4Gi'
     service_parameters:
       grpc_port: 8080
       rest_port: 8081
     models_settings:
       single_model_mode: true
       config_configmap_name: ''
       model_config: ''
       model_name: 'quartznet' # This is the name the model is served with
    model_path: 's3://openvino-quartznet/' # This is the URL path where the model repository was stored earlier
       nireq: 0
       plugin_config: '{"CPU_THROUGHPUT_STREAMS":1}'
       batch_size: ''
       shape: '(1, 64, 176)' # This is needed due to the notebook having a slightly different input shape than the default. OVMS handles this conversion automatically
       model_version_policy: '{"latest": { "num_versions":1 }}'
       layout: ''
       target_device: CPU
       is_stateful: false
       idle_sequence_cleanup: false
       low_latency_transformation: true
       max_sequence_number: 0
     server_settings:
       file_system_poll_wait_seconds: 0
       sequence_cleaner_poll_wait_minutes: 0
       log_level: INFO
       grpc_workers: 1
       rest_workers: 0
     models_repository:
       storage_type: S3
       https_proxy: ''
       http_proxy: ''
       models_host_path: ''
       models_volume_claim: ''
       aws_secret_access_key: 'S3_SECRET_KEY' # Replace with actual key
       aws_access_key_id: 'S3_ACCESS_KEY' # Replace with actual key
       aws_region: ''
       s3_compat_api_endpoint: 'ENDPOINT_URL' # This is only necessary when using Ceph RadosGW
       gcp_creds_secret_name: ''
       azure_storage_connection_string: ''
    
    

    Once the server finishes initializing, the model will be available on both gRPC and HTTP endpoints.

    ovmsclient

    For a simple, lightweight client, ovmsclient is an easy way to interact with an OVMS server. The client maintains an underlying gRPC client to OVMS and provides several convenience features. For starters, the following code allows users to connect to and query the input and output parameters of the model:

    
    import ovmsclient
    import librosa
    import numpy as np
    import scipy
    client = ovmsclient.make_grpc_client("openvino-quartznet-model-server.default.svc.cluster.local:8080")
    model_metadata = client.get_model_metadata(model_name="quartznet")
    print(model_metadata)
    {'model_version': 1,
     'inputs': {'audio_signal': {'shape': [1, 64, 176], 'dtype': 'DT_FLOAT'}},
     'outputs': {'output': {'shape': [1, 88, 29], 'dtype': 'DT_FLOAT'}}}
    
    

    We can see that the input shape matches the (1, 64, 176) shape defined in the CR above. The output shape (1, 88, 29) represents 88 time steps, each holding probabilities for 29 characters.

    Convert audio data to mel

    In order to perform inference, raw audio data must be converted to the mel spectrograms we discussed above. The code below performs this conversion:

    
    # First load the audio data, in this case a clip of English audio with the speaker saying "from the edge to the cloud"
    audio, sampling_rate = librosa.load(path='data/edge_to_cloud.ogg', sr=16000)
    # This first function converts the audio to mel spectrograms. This has specific window sizing and a hardcoded sampling rate. A different algorithm could be implemented if user needs differ.
    def audio_to_mel(audio, sampling_rate):
        assert sampling_rate == 16000, "Only 16 KHz audio supported"
        preemph = 0.97
        preemphased = np.concatenate([audio[:1], audio[1:] - preemph * audio[:-1].astype(np.float32)])
    
        # Calculate window length
        win_length = round(sampling_rate * 0.02)
    
        # Based on previously calculated window length run short-time Fourier transform
        spec = np.abs(librosa.core.spectrum.stft(preemphased, n_fft=512, hop_length=round(sampling_rate * 0.01),
                      win_length=win_length, center=True, window=scipy.signal.windows.hann(win_length), pad_mode='reflect'))
    
        # Create mel filter-bank, produce transformation matrix to project current values onto Mel-frequency bins
        mel_basis = librosa.filters.mel(sampling_rate, 512, n_mels=64, fmin=0.0, fmax=8000.0, htk=False)
        return mel_basis, spec
    
    # This function changes the mel spectrograms by converting them to a logarithmic scale, normalizing them, and adding padding to make processing easier. Note that this padding ensures the input shape is consistent, and matches the (1, 64, 176) we supplied as the input shape when creating the model server instance.
    def mel_to_input(mel_basis, spec, padding=16):
        # Convert to logarithmic scale
        log_melspectrum = np.log(np.dot(mel_basis, np.power(spec, 2)) + 2 ** -24)
    
        # Normalize output
        normalized = (log_melspectrum - log_melspectrum.mean(1)[:, None]) / (log_melspectrum.std(1)[:, None] + 1e-5)
    
        # Calculate padding
        remainder = normalized.shape[1] % padding
        if remainder != 0:
            return np.pad(normalized, ((0, 0), (0, padding - remainder)))[None]
        return normalized[None]
    
    
    mel_basis, spec = audio_to_mel(audio=audio.flatten(), sampling_rate=sampling_rate)
    audio = mel_to_input(mel_basis=mel_basis, spec=spec)
    
    # The inference server requires a dict that has the following formatting. The input key is the same 'audio_signal' that was returned by the metadata call above
    inputs = {'audio_signal': audio}
    
    # If we look at the shape with the included padding, it's the same shape as the model is expecting now
    print(audio.shape)
    (1, 64, 176)
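    The padding arithmetic in mel_to_input can be checked in isolation. Here is a small pure-Python sketch (the helper name padded_length is ours, not from the notebook) of rounding a frame count up to the next multiple of the padding value:

```python
def padded_length(n, padding=16):
    # Round n up to the next multiple of `padding`; counts already on a
    # multiple are left unchanged, matching the remainder check in mel_to_input
    remainder = n % padding
    return n if remainder == 0 else n + (padding - remainder)
```

    For instance, a 165-frame spectrogram would be padded to 176 frames, the time dimension of the (1, 64, 176) input shape above.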
    
    

    Inference example

    The final step is to actually perform the inference. This involves using ovmsclient to make the inference call, as well as decoding the results. As noted above, the decoding step in this example is simpler than would be expected in a production environment and is provided only for demo purposes. In particular, it only emits a letter when the prediction changes, so repeated letters survive only if the model emits a blank between them. In our example, the words contain no double letters, so they decode fine, but please be aware of the example's limitations.

    At the end, we'll have an iterator containing predictions for each time step, with each index corresponding to a letter in the alphabet array.

    
    character_probabilities = client.predict(inputs = inputs, model_name="quartznet")
    alphabet = " abcdefghijklmnopqrstuvwxyz'~" # The 29 characters corresponding to the model's 29 outputs; each output value is the probability of that character
    character_probabilities = next(iter(character_probabilities))
    
    # Remove unnecessary dimension (we are doing inference in batches of 1)
    character_probabilities = np.squeeze(character_probabilities)
    
    # Run argmax to pick the most probable symbol at each time step
    character_probabilities = np.argmax(character_probabilities, axis=1)
    def ctc_greedy_decode(predictions):
        previous_letter_id = blank_id = len(alphabet) - 1
        transcription = list()
        for letter_index in predictions:
            if previous_letter_id != letter_index != blank_id:
                transcription.append(alphabet[letter_index])
            previous_letter_id = letter_index
        return ''.join(transcription)
    
    
    transcription = ctc_greedy_decode(character_probabilities)
    print(transcription)
    from the edge to the cloud
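    To see the decoder's repeated-letter behavior concretely, here is a standalone sketch that mirrors the logic of ctc_greedy_decode, with hand-built index sequences. (A CTC-trained model normally emits a blank between genuine double letters, which is why real transcriptions can still come out correct.)

```python
alphabet = " abcdefghijklmnopqrstuvwxyz'~"  # index 28 ('~') is the CTC blank
blank_id = len(alphabet) - 1

def ctc_greedy_decode(predictions):
    # Collapse consecutive repeats, then drop blanks -- the same logic
    # as the notebook's decoder
    previous_letter_id = blank_id
    transcription = []
    for letter_index in predictions:
        if previous_letter_id != letter_index != blank_id:
            transcription.append(alphabet[letter_index])
        previous_letter_id = letter_index
    return ''.join(transcription)

# 'hello' survives only if a blank separates the two l's;
# without it, the repeat-collapsing step merges them
h, e, l, o = 8, 5, 12, 15
with_blank = [h, e, l, blank_id, l, o]
without_blank = [h, e, l, l, o]
```

    Decoding the first sequence yields "hello", while the second collapses to "helo".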
    
    

    Conclusion

    In this article, you've seen an end-to-end example setup for voice transcription using OpenVINO. This has many applications, from note taking to chatbots to voice search. Using OpenVINO, the model can easily be optimized for any target hardware footprint as well, allowing it to be used anywhere from the edge to the cloud.

    For a deeper dive, check out the complete code for this notebook example.
