
Speech to text is one of the most common use cases for artificial intelligence. It is widely used to make interaction between people and software easier; phone tree automation is one familiar example.

This article will walk you through a speech-to-text example using OpenVINO, an open-source toolkit for optimizing and deploying AI inference. This example is a variant of the OpenVINO speech-to-text demo notebook which can be found in OpenVINO's GitHub repository.

What is QuartzNet?

QuartzNet is a variant of a Jasper network that performs speech-to-text translation.

The inputs to the network are a series of units called mel spectrograms. These are a way of representing audio data that involves several steps of processing.

First, the raw audio signal is divided into overlapping sections, and a Fourier transform is applied to each section, converting it from a signal over time to a set of frequencies. The amplitude of each frequency, typically on a logarithmic scale, is then laid out over time to form a spectrogram.
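
The framing and Fourier transform steps can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the notebook's code: the frame length, hop size, and the 1 kHz test tone are assumptions chosen to keep the example small, and a naive DFT stands in for the optimized FFT a real pipeline would use.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def frames(signal, frame_len, hop):
    """Split a signal into overlapping frames (a trailing partial frame is dropped)."""
    return [signal[s:s + frame_len] for s in range(0, len(signal) - frame_len + 1, hop)]


def magnitude_spectrum(frame):
    """Window one frame and return the magnitudes of its naive DFT (bins 0..n/2)."""
    n = len(frame)
    windowed = [x * w for x, w in zip(frame, hann(n))]
    return [
        abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


# Hypothetical test signal: a 1 kHz sine wave sampled at 8 kHz
sample_rate = 8000
signal = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(256)]

# 64-sample frames with 50% overlap; each row is one time slice of the spectrogram
spectrogram = [magnitude_spectrum(f) for f in frames(signal, frame_len=64, hop=32)]

# The strongest frequency bin in the first frame, converted back to Hz
peak_bin = max(range(len(spectrogram[0])), key=spectrogram[0].__getitem__)
peak_hz = peak_bin * sample_rate / 64
```

Each row of `spectrogram` describes which frequencies are present during one short slice of time, which is the structure the later steps build on.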

Finally, the spectrogram's domain is changed to the mel scale, which is a frequency scale that better differentiates between the ranges of frequency that human speech and hearing cover, forming a mel spectrogram.
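
To make the mel scale more concrete, here is a small sketch that uses one common formula for converting between hertz and mels (the HTK-style variant). Other variants exist, and the librosa call later in this article is made with htk=False, which selects a slightly different formula, so treat the exact numbers as illustrative. The function names are my own.

```python
import math


def hz_to_mel(hz):
    """HTK-style mel formula; other variants differ slightly."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_mels, fmin, fmax):
    """Return n_mels + 2 band edges in Hz, equally spaced on the mel scale."""
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]


# 64 bands between 0 Hz and 8 kHz, matching the values used later in this article
edges = mel_band_edges(n_mels=64, fmin=0.0, fmax=8000.0)
low_band_width = edges[1] - edges[0]     # narrow band at low frequencies
high_band_width = edges[-1] - edges[-2]  # much wider band at high frequencies
```

Because the bands are equally spaced in mels rather than in hertz, the low-frequency bands are narrow and the high-frequency bands are wide, which gives more resolution to the range where speech carries most of its information.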

For more on mel spectrograms, read Leland Roberts' article on the subject.

OpenVINO toolkit

OpenVINO is a framework for optimizing models, as well as an optimized inference server. It allows you to perform several optimizations:

  • Quantization: Reducing numerical precision to increase processing speed. For example, INT8 inference can be substantially faster than FP16 while retaining a similar level of accuracy in some cases.
  • Accuracy-aware quantization: Automated quantization that preserves a user-specified level of accuracy.
  • Pruning and sparsity: Reducing unnecessary complexity of the model. For example, this could involve removing layers that aren't contributing much to the overall result or weights that are extremely small.
  • Operation fusing: Combining several model layers into one. This gives equivalent accuracy but can run significantly faster on Intel hardware given the use of specialized instructions.
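
To give a feel for what quantization does, here is a minimal sketch of symmetric INT8 quantization of a handful of weights. It shows the general idea only; it is not how OpenVINO implements quantization, and the weight values and function names are made up for the example.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.82, -1.27, 0.003, 0.5, -0.41]  # hypothetical layer weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half of one quantization step
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is now stored as a small integer plus one shared scale factor. The price is a rounding error of at most half a quantization step per weight, which is why accuracy usually needs to be checked after quantizing.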

On Red Hat OpenShift Data Science, the default deployment is done on Intel hardware, meaning there is no additional setup required.

The notebook we'll be looking at in this article covers downloading a QuartzNet model, converting it to OpenVINO Intermediate Representation (IR), serving it via OpenVINO Model Server, sending mel spectrograms of English-language audio for inference, and decoding the results using a simple algorithm. Note that this is an example; some parts, such as the decoding algorithm, could be improved if one were to adapt this for a production use case.

The OpenVINO Model Server (OVMS) is an Intel-optimized model server that lets you serve multiple models, keep track of model versions, and update them without downtime.

Download the QuartzNet model

OpenVINO has its own model zoo, where you can browse and download pre-compiled and pre-trained models.

For this demo, we will download an ONNX-format QuartzNet model. ONNX (Open Neural Network Exchange) is a portable format for exchanging models. It lets you package a model from a range of frameworks into a single file and reinstantiate it from that file, which makes models easy to move around. This format will later be converted into OpenVINO's own IR for optimization.

In the notebook, we start with a bit of boilerplate to set up the paths:

model_folder = "model"
download_folder = "output"
data_folder = "data"

precision = "FP16"
model_name = "quartznet-15x5-en"

omz_downloader automatically creates a directory structure and downloads the selected model. This step is skipped if the model is already downloaded. The selected model comes from the public directory, which means it must be converted into Intermediate Representation (IR).

from pathlib import Path

# Check if model is already downloaded in download directory
path_to_model_weights = Path(f'{download_folder}/public/{model_name}/models')
downloaded_model_file = list(path_to_model_weights.glob('*.pth'))

if not path_to_model_weights.is_dir() or len(downloaded_model_file) == 0:
    download_command = f"omz_downloader --name {model_name} --output_dir {download_folder} --precision {precision}"
    ! $download_command

Convert the model to IR

Next, we need to convert the model from ONNX format into OpenVINO IR format, which consists of three files.

  • XML: The XML file describes the layers of the network, their dimensions, and parameters. It also describes the data flow. However, it does not store the actual weights; instead, it references the bin file, which contains the weights and other large values.
  • Bin: This file contains the large constant values like layer weights and other things that detail the state of the model.
  • Mapping: This file contains some additional metadata detailing things like the IO between layers.

A more detailed explanation is available in the OpenVINO docs.
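
To illustrate how the XML file describes layers and data flow without holding any weights, here is a small sketch that parses a hand-written, heavily simplified stand-in for an IR file. The snippet is not real model output, and the element and attribute names reflect my understanding of the IR layout, so check them against the OpenVINO docs before relying on them.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified stand-in for an IR .xml file (an assumption, not real output)
IR_XML = """
<net name="tiny_example" version="11">
  <layers>
    <layer id="0" name="audio_signal" type="Parameter"/>
    <layer id="1" name="conv" type="Convolution"/>
    <layer id="2" name="output" type="Result"/>
  </layers>
  <edges>
    <edge from-layer="0" from-port="0" to-layer="1" to-port="0"/>
    <edge from-layer="1" from-port="1" to-layer="2" to-port="0"/>
  </edges>
</net>
""".strip()

root = ET.fromstring(IR_XML)

# Map layer ids to names, then follow the edges to recover the data flow
layers = {layer.get("id"): layer.get("name") for layer in root.iter("layer")}
layer_types = [layer.get("type") for layer in root.iter("layer")]
data_flow = [
    (layers[edge.get("from-layer")], layers[edge.get("to-layer")])
    for edge in root.iter("edge")
]
```

Here `data_flow` lists which layer feeds which. In a real IR file, the constant layers additionally carry offsets and sizes that point into the bin file.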

omz_converter converts the pre-trained PyTorch model to the ONNX model format, which is further converted to the OpenVINO IR format. Both stages of conversion are handled by calling omz_converter.

# Check if model is already converted in model directory
path_to_converted_weights = Path(f'{model_folder}/public/{model_name}/{precision}/{model_name}.bin')

if not path_to_converted_weights.is_file():
    convert_command = f"omz_converter --name {model_name} --precisions {precision} --download_dir {download_folder} --output_dir {model_folder}"
    ! $convert_command

In the end, you should have the following files:

  • quartznet-15x5-en.bin
  • quartznet-15x5-en.mapping
  • quartznet-15x5-en.xml

Upload the model and directory structure to S3

In OVMS, the model server looks for the following directory structure:

tree models/
├── model1
│   ├── 1
│   │   ├── ir_model.bin
│   │   └── ir_model.xml
│   └── 2
│       ├── ir_model.bin
│       └── ir_model.xml
├── model2
│   └── 1
│       ├── ir_model.bin
│       ├── ir_model.xml
│       └── mapping_config.json
└── model3
    └── 1
        └── model.onnx

You can find more about the model repository directory structure in the OpenVINO docs.
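
The numbered subdirectories are model versions. As a rough sketch of how a "latest" version policy can be interpreted (an illustration of the idea, not OVMS's actual implementation), the function below picks the highest-numbered directories. The function name and the directory listing are hypothetical.

```python
def select_latest_versions(entries, num_versions=1):
    """Pick the highest-numbered version directories; non-numeric names are ignored."""
    versions = sorted(int(name) for name in entries if name.isdigit())
    return versions[-num_versions:] if num_versions > 0 else []


# Directory names as they might appear under one model folder (hypothetical)
entries = ["1", "2", "10", "notes.txt"]
served = select_latest_versions(entries, num_versions=1)
served_two = select_latest_versions(entries, num_versions=2)
```

Note that versions are compared as numbers, not as strings, so version 10 is newer than version 2.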

The next step is uploading the OpenVINO IR files to S3. This demo assumes that you have S3 storage set up in advance; setting that up is beyond the scope of this demo. If you don't have access to a real S3 bucket, there are alternatives such as Ceph RadosGW or Google Cloud Storage, which OVMS also supports.

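import boto3
access_key = 'S3_ACCESS_KEY' # <- Replace with actual key
secret_key = 'S3_SECRET_KEY' # <- Replace with actual key
s3 = boto3.client('s3',
                  aws_access_key_id=access_key,
                  aws_secret_access_key=secret_key,
                  endpoint_url='ENDPOINT_URL') # <- This is only necessary when using Ceph RadosGW
# Upload each IR file under the version directory "1" in the bucket
s3.upload_file('model/public/quartznet-15x5-en/FP16/quartznet-15x5-en.bin',
               'openvino-quartznet', '1/quartznet-15x5-en.bin')
s3.upload_file('model/public/quartznet-15x5-en/FP16/quartznet-15x5-en.mapping',
               'openvino-quartznet', '1/quartznet-15x5-en.mapping')
s3.upload_file('model/public/quartznet-15x5-en/FP16/quartznet-15x5-en.xml',
               'openvino-quartznet', '1/quartznet-15x5-en.xml')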

This gives us the following structure in S3:

# tree s3://openvino-quartznet/
└── 1
    ├── quartznet-15x5-en.bin
    ├── quartznet-15x5-en.mapping
    └── quartznet-15x5-en.xml
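
If you later publish a second version of the model, the same layout applies with a different version prefix. The helper below is a small sketch that computes the local path and S3 key for each IR file; the function name and its defaults are my own, and the local directory is the one produced by the conversion step.

```python
from pathlib import PurePosixPath


def plan_uploads(local_dir, model_name, version=1, suffixes=(".bin", ".mapping", ".xml")):
    """Return (local_path, s3_key) pairs that place each IR file under '<version>/'."""
    plan = []
    for suffix in suffixes:
        filename = f"{model_name}{suffix}"
        local_path = str(PurePosixPath(local_dir) / filename)
        plan.append((local_path, f"{version}/{filename}"))
    return plan


uploads = plan_uploads("model/public/quartznet-15x5-en/FP16", "quartznet-15x5-en")
next_version = plan_uploads("model/public/quartznet-15x5-en/FP16", "quartznet-15x5-en", version=2)
```

Each pair can then be passed to boto3's upload_file together with the bucket name.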

Create an OVMS instance

Now that we have uploaded the model to S3, we can create an instance of the OpenVINO Model Server to serve the model. Intel has an OVMS Operator that will allow users to easily provision an OVMS instance. Here's an example custom resource for the Operator:

kind: ModelServer
apiVersion: intel.com/v1alpha1
metadata:
 name: openvino-quartznet-model-server
 namespace: your-project-namespace
spec:
 image_name: >-
   <OVMS image reference>
 deployment_parameters:
   replicas: 1
   resources:
     limits:
       cpu: '4'
       memory: '4Gi'
     requests:
       cpu: '4'
       memory: '4Gi'
 service_parameters:
   grpc_port: 8080
   rest_port: 8081
 models_settings:
   single_model_mode: true
   config_configmap_name: ''
   model_config: ''
   model_name: 'quartznet' # This is the name the model is served with
   model_path: 's3://openvino-quartznet/' # This is the URL path to where the model repository was stored earlier
   nireq: 0
   plugin_config: '{"CPU_THROUGHPUT_STREAMS":1}'
   batch_size: ''
   shape: '(1, 64, 176)' # This is needed because the notebook uses a slightly different input shape than the default. OVMS handles this conversion automatically
   model_version_policy: '{"latest": { "num_versions":1 }}'
   layout: ''
   target_device: CPU
   is_stateful: false
   idle_sequence_cleanup: false
   low_latency_transformation: true
   max_sequence_number: 0
 server_settings:
   file_system_poll_wait_seconds: 0
   sequence_cleaner_poll_wait_minutes: 0
   log_level: INFO
   grpc_workers: 1
   rest_workers: 0
 models_repository:
   storage_type: S3
   https_proxy: ''
   http_proxy: ''
   models_host_path: ''
   models_volume_claim: ''
   aws_secret_access_key: 'S3_SECRET_KEY' # Replace with actual key
   aws_access_key_id: 'S3_ACCESS_KEY' # Replace with actual key
   aws_region: ''
   s3_compat_api_endpoint: 'ENDPOINT_URL' # This is only necessary when using Ceph RadosGW
   gcp_creds_secret_name: ''
   azure_storage_connection_string: ''

Once the server finishes initializing, the model will be available on both gRPC and HTTP endpoints.
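
For the HTTP side, OVMS exposes a REST API modeled on TensorFlow Serving's. The sketch below only builds the URLs you would call; it assumes the REST port 8081 from the custom resource and the same service hostname used for gRPC later in this article, both of which may differ in your cluster. The helper name is hypothetical.

```python
def rest_urls(host, rest_port, model_name, version=None):
    """Build TensorFlow Serving-style REST URLs for one served model."""
    base = f"http://{host}:{rest_port}/v1/models/{model_name}"
    if version is not None:
        base += f"/versions/{version}"
    return {
        "status": base,                   # GET: model version status
        "metadata": f"{base}/metadata",   # GET: input and output descriptions
        "predict": f"{base}:predict",     # POST: run inference
    }


urls = rest_urls("openvino-quartznet-model-server.default.svc.cluster.local", 8081, "quartznet")
```

The rest of this article uses the gRPC endpoint.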


For a simple, lightweight client, ovmsclient is an easy way to interact with an OVMS server. The client maintains an underlying gRPC client to OVMS and provides several convenience features. For starters, the following code allows users to connect to and query the input and output parameters of the model:

import ovmsclient
import librosa
import numpy as np
import scipy.signal
client = ovmsclient.make_grpc_client("openvino-quartznet-model-server.default.svc.cluster.local:8080")
model_metadata = client.get_model_metadata(model_name="quartznet")
{'model_version': 1,
 'inputs': {'audio_signal': {'shape': [1, 64, 176], 'dtype': 'DT_FLOAT'}},
 'outputs': {'output': {'shape': [1, 88, 29], 'dtype': 'DT_FLOAT'}}}

We can see that the shape is the input shape defined in the CR above.
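
Since the server only accepts inputs of this exact shape, it can be useful to check an array's shape against the metadata before sending it. Here is a minimal sketch of such a check; the helper name is my own, and the metadata dictionary is copied from the output above rather than fetched from a live server.

```python
# Metadata as returned by the call above (copied here so the example is standalone)
model_metadata = {
    "model_version": 1,
    "inputs": {"audio_signal": {"shape": [1, 64, 176], "dtype": "DT_FLOAT"}},
    "outputs": {"output": {"shape": [1, 88, 29], "dtype": "DT_FLOAT"}},
}


def check_input(metadata, name, shape):
    """Raise ValueError if the shape does not match the served model's input."""
    expected = metadata["inputs"][name]["shape"]
    if list(shape) != list(expected):
        raise ValueError(f"{name}: expected shape {expected}, got {list(shape)}")
    return True


padded_ok = check_input(model_metadata, "audio_signal", (1, 64, 176))

# An input that has not been padded to 176 frames is rejected
try:
    check_input(model_metadata, "audio_signal", (1, 64, 171))
    unpadded_ok = True
except ValueError:
    unpadded_ok = False
```

The output shape is also worth a look: 88 time steps, each with 29 scores, one per symbol in the alphabet used for decoding later.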

Convert audio data to mel

In order to perform inference, raw audio data must be converted to the mel spectrograms we discussed above. The code below performs this conversion:

# First load the audio data, in this case a clip of English audio with the speaker saying "from the edge to the cloud"
audio, sampling_rate = librosa.load(path=f'{data_folder}/edge_to_cloud.ogg', sr=16000)
# This first function converts the audio to mel spectrograms. This has specific window sizing and a hardcoded sampling rate. A different algorithm could be implemented if user needs differ.
def audio_to_mel(audio, sampling_rate):
    assert sampling_rate == 16000, "Only 16 kHz audio supported"
    preemph = 0.97
    preemphased = np.concatenate([audio[:1], audio[1:] - preemph * audio[:-1].astype(np.float32)])

    # Calculate window length
    win_length = round(sampling_rate * 0.02)

    # Based on previously calculated window length run short-time Fourier transform
    spec = np.abs(librosa.core.spectrum.stft(preemphased, n_fft=512, hop_length=round(sampling_rate * 0.01),
                  win_length=win_length, center=True, window=scipy.signal.windows.hann(win_length), pad_mode='reflect'))

    # Create mel filter-bank, produce transformation matrix to project current values onto Mel-frequency bins
    mel_basis = librosa.filters.mel(sr=sampling_rate, n_fft=512, n_mels=64, fmin=0.0, fmax=8000.0, htk=False)
    return mel_basis, spec

# This function changes the mel spectrograms by converting them to a logarithmic scale, normalizing them, and adding padding to make processing easier. Note that this padding ensures the input shape is consistent, and matches the (1, 64, 176) we supplied as the input shape when creating the model server instance.
def mel_to_input(mel_basis, spec, padding=16):
    # Convert to logarithmic scale
    log_melspectrum = np.log(np.dot(mel_basis, np.power(spec, 2)) + 2 ** -24)

    # Normalize output
    normalized = (log_melspectrum - log_melspectrum.mean(1)[:, None]) / (log_melspectrum.std(1)[:, None] + 1e-5)

    # Calculate padding
    remainder = normalized.shape[1] % padding
    if remainder != 0:
        return np.pad(normalized, ((0, 0), (0, padding - remainder)))[None]
    return normalized[None]

mel_basis, spec = audio_to_mel(audio=audio.flatten(), sampling_rate=sampling_rate)
audio = mel_to_input(mel_basis=mel_basis, spec=spec)

# The inference server requires a dict that has the following formatting. The input key is the same 'audio_signal' that was returned by the metadata call above
inputs = {'audio_signal': audio}

# With the padding included, the shape now matches what the model expects
print(audio.shape)
(1, 64, 176)
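
The number 176 follows from simple arithmetic on the clip length. The sketch below works through it for a hypothetical clip of 27,200 samples (1.7 seconds at 16 kHz); the actual length of the demo clip is not stated here, so that figure is an assumption. The frame count formula is the one that applies to a centered short-time Fourier transform, as used in the code above.

```python
def stft_frame_count(num_samples, hop_length):
    """Frame count for a centered STFT (the signal is padded by half a window per side)."""
    return 1 + num_samples // hop_length


def padded_length(frames, padding=16):
    """Round the frame count up to the next multiple of `padding`."""
    remainder = frames % padding
    return frames if remainder == 0 else frames + (padding - remainder)


sampling_rate = 16000
hop_length = round(sampling_rate * 0.01)  # 160 samples, i.e. one frame every 10 ms
num_samples = 27200                       # hypothetical 1.7 s clip

frames = stft_frame_count(num_samples, hop_length)  # 171 frames
model_frames = padded_length(frames)                # padded up to 176
```

Any clip that produces between 161 and 176 frames ends up with the same padded length, so clips of noticeably different duration would need a different shape setting on the server.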

Inference example

The final step is to perform the inference. This involves using ovmsclient to make the inference call and then decoding the results. As noted above, the decoding step here is simpler than what a production environment would require and is provided for demo purposes only. In particular, it emits a letter only when the predicted symbol changes, so a doubled letter is recovered only if the model predicts a blank symbol between the two occurrences. In our example, the words contain no double letters, so this works fine, but please be aware of the example's limitations.

At the end, we'll have an array containing one prediction per time step, with each prediction being an index into the alphabet string.

character_probabilities = client.predict(inputs = inputs, model_name="quartznet")
alphabet = " abcdefghijklmnopqrstuvwxyz'~" # These correspond to the 29 different outputs, the value being the probability of each character. We take the maximum prediction as the highest probability for a given letter.
character_probabilities = next(iter(character_probabilities))

# Remove unnecessary dimension (we are doing inference in batches of 1)
character_probabilities = np.squeeze(character_probabilities)

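# Run argmax to pick the most probable symbol at each time step
character_probabilities = np.argmax(character_probabilities, axis=1)
def ctc_greedy_decode(predictions):
    # Start from the blank symbol, which is the last entry in the alphabet
    previous_letter_id = blank_id = len(alphabet) - 1
    transcription = list()
    for letter_index in predictions:
        # Emit a letter only when it differs from the previous symbol and is not blank
        if previous_letter_id != letter_index != blank_id:
            transcription.append(alphabet[letter_index])
        previous_letter_id = letter_index
    return ''.join(transcription)

transcription = ctc_greedy_decode(character_probabilities)
print(transcription)
from the edge to the cloud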
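
To see how greedy decoding treats repeated letters, here is a standalone sketch that runs the same collapse-then-drop-blanks logic on hand-written symbol sequences. The input sequences are made up for illustration and are not model output; the helper names are my own.

```python
ALPHABET = " abcdefghijklmnopqrstuvwxyz'~"  # the last symbol acts as the blank
BLANK_ID = len(ALPHABET) - 1


def greedy_decode(indices):
    """Collapse consecutive repeats of a symbol, then drop blanks."""
    output = []
    previous = BLANK_ID
    for index in indices:
        if index != previous and index != BLANK_ID:
            output.append(ALPHABET[index])
        previous = index
    return "".join(output)


def ids(text):
    """Turn a string of alphabet symbols into a list of indices."""
    return [ALPHABET.index(ch) for ch in text]


# A blank between the two l's keeps them separate
with_blank = greedy_decode(ids("hh~el~lo"))

# Without a blank, the repeated l's collapse into one
without_blank = greedy_decode(ids("hhelllo"))
```

Here `with_blank` is "hello", while `without_blank` is "helo". Whether a doubled letter survives therefore depends on the model predicting a blank between the two occurrences.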


In this article, you've seen an end-to-end example setup for voice transcription using OpenVINO. This has many applications, from note taking to chatbots to voice search. Using OpenVINO, the model can easily be optimized for any target hardware footprint as well, allowing it to be used anywhere from the edge to the cloud.

For a deeper dive, check out the complete code for this notebook example.