How to run AI models in cloud development environments

This learning path explores running AI models, specifically Large Language Models (LLMs), in cloud development environments to enhance developer efficiency and data security. It introduces RamaLama, an open-source tool launched in mid-2024, designed to simplify AI workflows by integrating with container technologies.

Prerequisites:

  • Access to the Developer Sandbox.

In this lesson, you will:

  • Learn how a devfile sets up a small, dedicated container to run an AI model within your cloud development workspace.

What is a devfile?

From your perspective, the experience feels seamless. But behind the scenes, your cloud development environment is assembling several components to make the interaction with your AI assistant work. It configures the large language model (LLM) you’re engaging with via the Continue extension to run in a sidecar container within your cloud workspace pod.

This configuration takes the form of a devfile, an open standard that defines containerized development environments in a YAML-formatted text file. By leveraging a devfile, your system ensures consistency, portability, and streamlined deployment of the AI assistant within your cloud-based environment.
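For orientation, the components shown in the next snippet live under the top-level components key of a devfile. A minimal skeleton, with an illustrative schema version and placeholder names that are not part of this lesson, might look like this:

schemaVersion: 2.2.0              # devfile 2.x schema; exact version is illustrative
metadata:
  name: ai-assistant-workspace    # placeholder workspace name
components:
  - name: tools                   # placeholder developer tooling container
    container:
      image: quay.io/devfile/universal-developer-image:latest   # placeholder image
      memoryLimit: 4Gi
  # ...model-serving components, such as the ramalama container below, also go here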

In the devfile, you provide the configuration for your cloud workspace and for the model, as shown in the code snippet:

components:
  - name: ramalama
    attributes:
      container-overrides:
        resources:
          limits:
            cpu: 4000m
            memory: 12Gi
          requests:
            cpu: 1000m
            memory: 8Gi
    container:
      image: quay.io/ramalama/ramalama:0.7
      args:
        - "ramalama"
        - "--store"
        - "/models"
        - "serve"
        - "--network=none"
        - "ollama://granite-code:latest"
      mountSources: true
      sourceMapping: /.ramalama
      volumeMounts:
        - name: ramalama-models
          path: /models
      endpoints:
        - exposure: public
          name: ramalamaserve
          protocol: http
          targetPort: 8080
  - name: ramalama-models
    volume:
      size: 5Gi

In the devfile, we added a container for serving the IBM Granite Code LLM using RamaLama. We used the official RamaLama container image and specified the CPU and memory requests and limits for the container. RamaLama is set up to use an alternate directory, mounted as a volume, as its model store. This ensures that models are managed efficiently while maintaining isolation and security within the containerized environment.
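Put together, the args in the snippet are equivalent to running the following command inside the sidecar container:

# Serve the Granite Code model from the /models store with networking disabled
ramalama --store /models serve --network=none ollama://granite-code:latest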

The ramalama serve command exposes the AI model through a REST API, making it available on port 8080 for inference requests. To ensure the model operates in a fully air-gapped environment, we explicitly set --network=none, which prevents the model from reaching any external network.
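As a quick check, you can send an inference request to the endpoint from a terminal inside the workspace. The sketch below assumes the RamaLama backend exposes an OpenAI-compatible /v1/chat/completions route (as its llama.cpp-based server does); the host, model name, and prompt are illustrative and may need adjusting for your setup:

# Hypothetical request from inside the workspace pod
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "granite-code",
        "messages": [
          {"role": "user", "content": "Write a function that reverses a string."}
        ]
      }'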

Summary

In this learning path, we learned about RamaLama and how to use it in a cloud development environment with the help of a devfile and Red Hat OpenShift Dev Spaces. We also discussed how you can deploy large language models within your internal infrastructure, ensuring secure development while maintaining full control over your data, free from external dependencies.

If you want to learn more about it, please visit these links:
