Skip to main content
Redhat Developers  Logo
  • Products

    Platforms

    • Red Hat Enterprise Linux
      Red Hat Enterprise Linux Icon
    • Red Hat AI
      Red Hat AI
    • Red Hat OpenShift
      Openshift icon
    • Red Hat Ansible Automation Platform
      Ansible icon
    • View All Red Hat Products

    Featured

    • Red Hat build of OpenJDK
    • Red Hat Developer Hub
    • Red Hat JBoss Enterprise Application Platform
    • Red Hat OpenShift Dev Spaces
    • Red Hat OpenShift Local
    • Red Hat Developer Sandbox

      Try Red Hat products and technologies without setup or configuration fees for 30 days with this shared Openshift and Kubernetes cluster.
    • Try at no cost
  • Technologies

    Featured

    • AI/ML
      AI/ML Icon
    • Linux
      Linux Icon
    • Kubernetes
      Cloud icon
    • Automation
      Automation Icon showing arrows moving in a circle around a gear
    • View All Technologies
    • Programming Languages & Frameworks

      • Java
      • Python
      • JavaScript
    • System Design & Architecture

      • Red Hat architecture and design patterns
      • Microservices
      • Event-Driven Architecture
      • Databases
    • Developer Productivity

      • Developer productivity
      • Developer Tools
      • GitOps
    • Automated Data Processing

      • AI/ML
      • Data Science
      • Apache Kafka on Kubernetes
    • Platform Engineering

      • DevOps
      • DevSecOps
      • Ansible automation for applications and services
    • Secure Development & Architectures

      • Security
      • Secure coding
  • Learn

    Featured

    • Kubernetes & Cloud Native
      Openshift icon
    • Linux
      Rhel icon
    • Automation
      Ansible cloud icon
    • AI/ML
      AI/ML Icon
    • View All Learning Resources

    E-Books

    • GitOps Cookbook
    • Podman in Action
    • Kubernetes Operators
    • The Path to GitOps
    • View All E-books

    Cheat Sheets

    • Linux Commands
    • Bash Commands
    • Git
    • systemd Commands
    • View All Cheat Sheets

    Documentation

    • Product Documentation
    • API Catalog
    • Legacy Documentation
  • Developer Sandbox

    Developer Sandbox

    • Access Red Hat’s products and technologies without setup or configuration, and start developing quicker than ever before with our new, no-cost sandbox environments.
    • Explore Developer Sandbox

    Featured Developer Sandbox activities

    • Get started with your Developer Sandbox
    • OpenShift virtualization and application modernization using the Developer Sandbox
    • Explore all Developer Sandbox activities

    Ready to start developing apps?

    • Try at no cost
  • Blog
  • Events
  • Videos

Superscalar Execution

March 14, 2016
William Cohen
Related products:
Developer Tools

Share:

    In the traditional processor pipeline model under ideal circumstances one new instruction enters the processor's and one instruction completes execution each cycle. Thus, for the best case the processor can have an average execution rate of one clock per instruction. A superscalar processor allows multiple unrelated instructions to start on the same clock cycle on separate hardware units or pipelines. Under ideal conditions a superscalar processors could have an average clocks per instruction (CPI) be less one, meaning your 2GHz processor might be able to execute "billions and billions of instructions per second" as the famed astronomer Carl Sagan might like to say.

    Different processors have different constraints on which instructions can be issued together for superscalar execution. A processor such as the Intel Atom can issue two instructions per cycle. However, the compiler needs to take care in the instructions it selects and the order of those instructions because the Atom processor has restrictions on which instructions can start at the same time. For example the Intel Atom processor can only start one older X87 floating point instruction per cycle, so the compiler should avoid back-to-back X87 floating point instructions. The Intel Haswell-based processors have four units for basic arithmetic/logic operations and three units for load/store operations and the Intel Skylake-based processor can issue up to 6 uops per cycle. Given these detailed differences selecting and ordering the instructions is probably best left to the compiler.

    Like the pipeline, superscalar execution is attempting to have more instructions processed concurrently. Techniques that allow the pipelined processor work well also apply to superscalar processor. The caches need to quickly provide instructions and data to feed to the processor otherwise it is waiting. The processor needs to have good branch prediction and the code should have longer sequences of unrelated instructions.

    For a computationally intensive loop in user-code a loop-unrolling optimization might provide a larger sequence of instructions giving the processor more opportunities to issue multiple instructions in a single clock cycle. An example of manual loop unrolling can be seen in the BLAS (Basic Linear Algebra Subprogram) saxpy function. The saxpy function multiples the n elements in the vector sx by the scalar sa and adds each element to the matching element in the sy vector. With n elements in each vector the naive implementation of the loop would be something like:

    for (i = 0; i < n; i += 1) {
        sy[i] += sa * sx[i];
    }

    Manual loop unrolling for four iterations might convert the above loop into a loop to handle the cases of small vectors and vectors that are not evenly divisible by the unrolling factor followed by the unrolled loop:

    m = n % 4;
    for (i = 0; i < m; ++i) {
        sy[i] += sa * sx[i];
    }
    if (n>=4) {
        for (i = m; i < n; i += 4) {
            sy[i] += sa * sx[i];
            sy[i + 1] += sa * sx[i + 1];
            sy[i + 2] += sa * sx[i + 2];
            sy[i + 3] += sa * sx[i + 3];
        }
    }

    Whether this type of optimization is profitable depends heavily on the size of the vectors being used. If all the vectors are just a few elements, the additional overhead to check the size of the vectors and set things up may eliminate the benefit of the loop unwinding. You can use tools such as Linux perf event tool identify computationally intensive areas of code that might benefit from such optimizations.

    Last updated: February 6, 2024

    Recent Posts

    • A deep dive into Apache Kafka's KRaft protocol

    • Staying ahead of artificial intelligence threats

    • Strengthen privacy and security with encrypted DNS in RHEL

    • How to enable Ansible Lightspeed intelligent assistant

    • Why some agentic AI developers are moving code from Python to Rust

    Red Hat Developers logo LinkedIn YouTube Twitter Facebook

    Products

    • Red Hat Enterprise Linux
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform

    Build

    • Developer Sandbox
    • Developer Tools
    • Interactive Tutorials
    • API Catalog

    Quicklinks

    • Learning Resources
    • E-books
    • Cheat Sheets
    • Blog
    • Events
    • Newsletter

    Communicate

    • About us
    • Contact sales
    • Find a partner
    • Report a website issue
    • Site Status Dashboard
    • Report a security problem

    RED HAT DEVELOPER

    Build here. Go anywhere.

    We serve the builders. The problem solvers who create careers with code.

    Join us if you’re a developer, software engineer, web designer, front-end designer, UX designer, computer scientist, architect, tester, product manager, project manager or team lead.

    Sign me up

    Red Hat legal and privacy links

    • About Red Hat
    • Jobs
    • Events
    • Locations
    • Contact Red Hat
    • Red Hat Blog
    • Inclusion at Red Hat
    • Cool Stuff Store
    • Red Hat Summit
    © 2025 Red Hat

    Red Hat legal and privacy links

    • Privacy statement
    • Terms of use
    • All policies and guidelines
    • Digital accessibility

    Report a website issue