Deploy a coding copilot model with OpenShift AI

October 7, 2024
Haoting Zhao
Related topics: Artificial intelligence, Developer productivity
Related products: Red Hat OpenShift, Red Hat OpenShift AI

    Continue.dev is a plug-in for Visual Studio Code designed to enhance developer productivity by connecting to a variety of providers and models. Some projects, however, call for an isolated environment, particularly when a team needs to deploy a fine-tuned base model that aligns more closely with its specific coding practices.

    Red Hat OpenShift AI

    Red Hat OpenShift AI offers a robust solution for such scenarios: a secure, isolated platform that supports model serving in environments with stringent connectivity restrictions. It leverages service mesh and Knative to expose inference APIs, ensuring seamless integration and operation within these specialized setups. Teams can thus maintain high accuracy in code generation while operating within the confines of their own development ecosystem, which makes OpenShift AI a critical tool for harnessing AI in development processes under strict operational requirements.

    KServe

    KServe is a model serving framework built around a Kubernetes Custom Resource Definition called InferenceService, which facilitates machine learning model serving across various frameworks. Designed for production inference scenarios, it provides a high-performance, highly abstracted interface for popular ML frameworks such as TensorFlow, XGBoost, scikit-learn, PyTorch, and ONNX.

    Leveraging Red Hat OpenShift Serverless and Red Hat OpenShift Service Mesh, KServe simplifies the complexities of auto-scaling, networking, health checks, and service configuration (Figure 1). It brings cutting-edge service features to ML deployments, such as GPU auto-scaling, scale-to-zero capabilities, traffic monitoring, and canary releases. KServe presents a straightforward, pluggable, and comprehensive solution for production ML serving, encompassing prediction, preprocessing, postprocessing, and explainability.

    Model Serving Architecture in OpenShift AI

    Figure 1: KServe model serving architecture.
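
    For illustration, here is a minimal sketch of an InferenceService manifest. The name, namespace, runtime, and storage URI are hypothetical placeholders, not values from this article's environment; in practice, the OpenShift AI dashboard typically generates this resource for you when you deploy a model:

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: codegen-model          # hypothetical model name
      namespace: my-ai-project     # hypothetical data science project
    spec:
      predictor:
        model:
          modelFormat:
            name: pytorch          # must match a supportedModelFormat of the runtime
          runtime: tgi-runtime     # hypothetical custom ServingRuntime name
          storageUri: s3://my-bucket/models/codegen/   # hypothetical S3 path to the model files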

    Caikit

    The Caikit toolkit, in turn, offers developer-friendly APIs for model management, with standard gRPC and HTTP interfaces for querying base models and channeling requests to the TGIS inference service. TGIS, an early fork of the Hugging Face Text Generation Inference (TGI) toolkit, enhances LLM performance with features such as continuous batching, dynamic batching, tensor parallelism (model sharding), and support for PyTorch 2 compilation.
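
    As a rough sketch of what querying such an HTTP interface looks like, a text-generation request body might resemble the following. The field names are assumptions based on the caikit-nlp runtime (which commonly serves this task at /api/v1/task/text-generation), and the model ID and prompt are placeholders:

    {
      "model_id": "codegen-model",
      "inputs": "def quicksort(arr):",
      "parameters": { "max_new_tokens": 64 }
    }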

    NVIDIA GPU Operator

    The NVIDIA GPU Operator automates the management of software components required for configuring and utilizing NVIDIA GPUs within OpenShift. This includes driver containers, device plug-ins, and the DCGM metrics exporter, which integrates with OpenShift’s Prometheus monitoring to provide insights into GPU metrics such as memory utilization and streaming multiprocessor (SM) utilization.
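
    For illustration, a minimal sketch of the operator's ClusterPolicy custom resource enabling these components follows; treat the exact fields as assumptions to verify against the GPU Operator documentation for your version:

    apiVersion: nvidia.com/v1
    kind: ClusterPolicy
    metadata:
      name: gpu-cluster-policy
    spec:
      driver:
        enabled: true        # deploy the NVIDIA driver containers
      devicePlugin:
        enabled: true        # expose GPUs to the kubelet as schedulable resources
      dcgmExporter:
        enabled: true        # export GPU metrics (memory, SM utilization) to Prometheus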

    To serve models with TGI effectively, you need a serving runtime tailored to the specific needs of the application. You can achieve this by importing a custom serving runtime into the product (Figure 2) and then customizing it to match the Hugging Face TGI image specifications.

    It is also important to configure the runtime to expose a RESTful inference API, which facilitates seamless interaction and communication with the service.

    Figure 2: Create a new serving runtime definition.

    The core of the serving runtime's YAML spec is as follows:

    spec:
      containers:
        - args:
            - --model-id=/mnt/models/     # serve the model files mounted by the storage initializer
            - --port=3000
          command:
            - text-generation-launcher    # TGI's launcher entry point
          env:
            - name: TRANSFORMERS_CACHE
              value: /tmp/transformers_cache
            - name: MAX_INPUT_LENGTH
              value: "1024"
            - name: MAX_TOTAL_TOKENS
              value: "2048"
          image: ghcr.io/huggingface/text-generation-inference:1.3.3
          livenessProbe:
            exec:
              command:
                - curl
                - localhost:3000/health
            initialDelaySeconds: 5
          name: kserve-container
          ports:
            - containerPort: 3000
              name: h2c
              protocol: TCP
          readinessProbe:
            exec:
              command:
                - curl
                - localhost:3000/health
            initialDelaySeconds: 5
      multiModel: false                   # one model per server instance
      supportedModelFormats:
        - autoSelect: true
          name: pytorch
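
    For context, this spec fragment sits inside a ServingRuntime resource. A minimal sketch of the surrounding wrapper might look like this, where the name and display name are hypothetical placeholders (the name matches the runtime referenced in the earlier InferenceService sketch):

    apiVersion: serving.kserve.io/v1alpha1
    kind: ServingRuntime
    metadata:
      name: tgi-runtime                               # hypothetical runtime name
      annotations:
        openshift.io/display-name: Hugging Face TGI   # name shown in the OpenShift AI dashboard
    spec:
      # ...the containers, multiModel, and supportedModelFormats fields shown above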

    Next, create a model deployment directly in the AI project scope, pointing to the model data imported into S3, as shown in Figure 3. OpenShift AI copies the model files during the init stage and sets up the disconnected serving runtime.

    Figure 3: Deploy the model.
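
    Behind this dialog, the S3 connection is a data connection. As a sketch of the usual OpenShift AI convention, it is stored as an annotated secret like the following; every value here is a placeholder:

    apiVersion: v1
    kind: Secret
    metadata:
      name: aws-connection-models        # hypothetical data connection name
      labels:
        opendatahub.io/dashboard: "true" # makes the secret visible to the dashboard
      annotations:
        opendatahub.io/connection-type: s3
    stringData:
      AWS_ACCESS_KEY_ID: <access-key>
      AWS_SECRET_ACCESS_KEY: <secret-key>
      AWS_S3_ENDPOINT: https://s3.example.com   # hypothetical endpoint
      AWS_S3_BUCKET: my-bucket                  # hypothetical bucket holding the model files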

    After a few minutes, you'll see the pod running, likely after several restarts (most often due to resource constraints). See Figure 4.

    Figure 4: The running model pod.

    Also note that the TLS certificate is self-signed by default; we have to replace it with a properly signed certificate, or the client cannot complete requests to the server. By placing the TLS secret in the ods-operator project, we can point to it with a config in the DataScienceCluster custom resource. See Figure 5.

    Figure 5: TLS cert config.

    Then we can configure Continue's config.json (Figure 6) to point to the model serving entry point newly exposed by Service Mesh.

    Figure 6: VS Code plug-in config.

    Then test some of Continue's features in VS Code (see Figure 7).

    Figure 7: VS Code plug-in menu action.

    Comments are generated as expected, as shown in Figure 8.

    Figure 8: Coding copilot results.

    Conclusion

    In industries where data sensitivity is paramount, such as finance and manufacturing, secure and private model serving calls for a robust, trusted model serving runtime inside a private data center. This approach ensures that sensitive code and data are never exposed to the vulnerabilities of the public internet.

    To ease integration and maintain industry standards, the runtime environment should be compatible with open source model serving protocols. Organizations can then leverage advanced analytics and machine learning while upholding strict data governance and security policies, harnessing the power of their data without compromising confidentiality or compliance.
