Claude as your performance analysis partner

Performance Analysis with Claude on CPU Profiles and Traces

May 29, 2026
Archana Ravindar
Related topics:
AI inference, Go
Related products:
Red Hat OpenShift Dev Spaces, Red Hat Advanced Developer Suite

    Performance analysis involves identifying and resolving application bottlenecks by measuring data like hardware counters, CPU profiles, and traces. These data files are often large (hundreds of megabytes), with CPU profiles containing extensive instruction cost details. Visually inspecting these large files—including zooming into traces in a browser to find patterns and dependencies—is a laborious and error-prone task.

    I experimented with Claude to see whether it could turn performance analysis from an exhausting task into a rewarding one. The results were very encouraging. In this article, I discuss how I used Claude to carry out performance analysis using CPU profiles and traces. The code under study is the new Go Green Tea garbage collector (GC), which I evaluate for optimization opportunities using the sweet benchmark suite on the POWER10 architecture.

    CPU profile analysis

    Go can generate CPU profiles for a binary (for example, with the runtime/pprof package or the -cpuprofile flag of go test), and go tool pprof analyzes them. These profiles can identify performance bottlenecks, even down to the assembly instruction level. Claude looks for optimization opportunities in the hotspots that pprof reports. The ten routines where most time is spent in the bleve index benchmark are shown below:

    go tool pprof BleveIndexBatch100-1212217505-cpu.prof 
    File: bleve-index-bench
    Build ID: 81f203eb5714d9755a42c833024d5f60afd94e03
    Type: cpu
    Time: 2026-03-10 16:10:56 IST
    Duration: 5.40s, Total samples = 15.18s (280.91%)
    Entering interactive mode (type "help" for commands, "o" for options)
    (pprof) top 10
    Showing nodes accounting for 8830ms, 58.17% of 15180ms total
    Dropped 208 nodes (cum <= 75.90ms)
    Showing top 10 nodes out of 103
          flat  flat%   sum%        cum   cum%
        3170ms 20.88% 20.88%     4260ms 28.06%  runtime.tryDeferToSpanScan
        1100ms  7.25% 28.13%     2970ms 19.57%  runtime.scanObjectSmall
         840ms  5.53% 33.66%      900ms  5.93%  github.com/blevesearch/segment.segmentWords
         710ms  4.68% 38.34%     7180ms 47.30%  runtime.scanSpan
         670ms  4.41% 42.75%     2940ms 19.37%  runtime.scanObjectsSmall
         520ms  3.43% 46.18%      970ms  6.39%  github.com/blevesearch/bleve/index/store/gtreap.itemCompare
         520ms  3.43% 49.60%     3610ms 23.78%  github.com/steveyen/gtreap.(*Treap).union
         440ms  2.90% 52.50%     2070ms 13.64%  github.com/steveyen/gtreap.(*Treap).split
         430ms  2.83% 55.34%      450ms  2.96%  runtime.extractHeapBitsSmall
         430ms  2.83% 58.17%     1690ms 11.13%  runtime.mallocgcSmallScanNoHeader

    The atomics example

    Within the hottest routine runtime.tryDeferToSpanScan, Claude quickly identified a section of code involving atomics that presented significant opportunities for optimization.

    The code performs 2 atomic operations per object:

    1. atomic.Load8(): Check if already marked (1.10s from profile)
    2. atomic.Or8(): Set the mark bit (690ms from profile)

    Total: 1.79s in atomic operations alone

    Proposed optimizations

    Single atomic test-and-set (Recommended): The atomic.Or8() function in the atomic package used by the Go runtime does not return the old value, unlike atomic.Or32(). Because of this limitation, the current implementation must first call Load8 to determine whether the object is already marked. Given that atomic instructions are costly on the PowerPC architecture, Claude suggested an alternative implementation that uses a single atomic operation, Or32.

    // Set mark bit with single atomic operation
    idx, mask := objIndex/8, uint8(1)<<(objIndex%8)
    
    // Use Or32 on aligned 32-bit word to get old value
    byteOffset := uintptr(idx) & 0b11
    wordPtr := (*uint32)(unsafe.Pointer(uintptr(unsafe.Pointer(&q.marks[idx])) &^ 0b11))
    shift := byteOffset * 8
    
    if goarch.BigEndian {
        shift = 32 - shift - 8
    }
    oldWord := atomic.Or32(wordPtr, uint32(mask)<<shift)
    if (oldWord>>shift)&uint32(mask) != 0 {
        return true  // Already marked
    }

    Expected savings: ~1.10s (eliminates the separate Load8)

    Claude also pointed out that this is exactly the same pattern used in spanScanOwnership.or() (lines 95-110)!

    The suggested approach, although seemingly sound, led to a performance regression. This was primarily due to false sharing during garbage collection: multiple threads accessing the heap bookkeeping area caused contention. When threads accessed separate bytes within the same 4-byte word, the atomic update covered the entire word, forcing sequential access.

    Claude corroborated this, noting the following problems with the implementation:

    • False sharing: Using Or32 means four adjacent bytes share the same atomic operation.
    • High contention: Multiple GC threads marking different objects in the same span compete for the same 32-bit word.
    • Cache line bouncing: Each Or32 operation invalidates cache lines across different CPUs.

    The profile confirmed this:

    • atomic.Or32: 0.43s to 1.95s (+1.52s)
    • tryDeferToSpanScan itself improved (4.53s to 2.61s), but the cost of Or32 ate all the gains

    Even though this optimization does not help in the context of garbage collection, it could make a difference in another application where there is little contention for the same word.
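
    Outside the runtime, the same single-operation test-and-set idea can be sketched with the public sync/atomic package, whose OrUint32 function (available since Go 1.23) returns the previous value. The markBitmap type below is a hypothetical bitmap stored directly as 32-bit words, which sidesteps the pointer rounding and endianness handling needed in the runtime snippet; it is an illustration of the technique, not the GC's implementation.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// markBitmap is a hypothetical mark bitmap stored as 32-bit words.
type markBitmap []uint32

// testAndSet sets bit i and reports whether it was already set, using one
// atomic read-modify-write instead of a load followed by an or.
func (m markBitmap) testAndSet(i uint) bool {
	mask := uint32(1) << (i % 32)
	old := atomic.OrUint32(&m[i/32], mask) // returns the word's previous value
	return old&mask != 0
}

func main() {
	marks := make(markBitmap, 4) // room for 128 bits

	fmt.Println(marks.testAndSet(37)) // false: first time this bit is marked
	fmt.Println(marks.testAndSet(37)) // true: already marked
	fmt.Println(marks.testAndSet(38)) // false: neighboring bit in the same word
}
```

    The same caveat applies here as in the GC: every bit in a word shares one atomic location, so this only pays off when few goroutines update the same word at the same time.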

    Arithmetic optimization

    In the same function, tryDeferToSpanScan, Claude also suggested an interesting power-of-2 division optimization for the objIndex calculation.

    // For power-of-2 sizes, use shift instead of magic division
    elemsize := gc.SizeClassToSize[q.class.sizeclass()]
    if isPowerOfTwo(elemsize) {
        objIndex = uint16((p - base) >> log2(elemsize))
    } else {
        objIndex = uint16((uint64(p-base) * uint64(gc.SizeClassToDivMagic[q.class.sizeclass()])) >> 32)
    }

    The performance improvement from this change is limited, as it only benefits programs that handle objects with sizes that are powers of 2. In other cases, an additional condition check is required. Therefore, this suggestion's effectiveness is heavily dependent on specific benchmark characteristics and is not considered generally effective.
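
    The trade-off described above can be seen in a small standalone sketch. The snippet Claude proposed relies on isPowerOfTwo and log2 helpers that are not shown; the version below assumes they can be expressed with a bit test and math/bits, and it uses ordinary division in the fallback path rather than the runtime's magic-number multiply. The objIndex function name and the sample sizes are illustrative.

```go
package main

import (
	"fmt"
	"math/bits"
)

// objIndex returns offset/elemSize, using a shift when elemSize is a
// power of two and ordinary division otherwise. elemSize must be non-zero.
func objIndex(offset, elemSize uint64) uint64 {
	if elemSize&(elemSize-1) == 0 { // exactly one bit set
		return offset >> bits.TrailingZeros64(elemSize)
	}
	// The extra branch above is paid here without any benefit.
	return offset / elemSize
}

func main() {
	fmt.Println(objIndex(4096, 64)) // 64: power-of-two size, shift path
	fmt.Println(objIndex(4096, 48)) // 85: other sizes fall back to division
}
```

    Both paths return the same result; the only difference is cost, which is why the benefit depends on how many objects in a given benchmark have power-of-2 sizes.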

    Low-level code analysis

    In the biogo-igor benchmark, Claude suggested caching the result of q.class.sizeclass() as a potential optimization. When challenged that the compiler should handle such simple peephole optimizations, and subsequently presented with the assembly dump, Claude confirmed the optimization was not being performed. Furthermore, the model was able to pinpoint the likely reason: an atomic call that accessed a portion of the variable, which likely prevented the compiler from applying the optimization.

    This shows that Claude can be used to pore over assembly code and glean insights that are easy to miss, given that the assembly format is difficult to read.

    Evidence from Assembly (igor.dump lines 70772-70831):

    First Access: Computing objIndex (lines 70772-70789)

    # Source: objIndex := uint16((uint64(p-base) * uint64(gc.SizeClassToDivMagic[q.class.sizeclass()])) >> 32)
    
    70773→  0x440e3a  MOVZX 0x7f(DX), SI  #① Load q.class from memory (offset 0x7f in DX)
    70775→  0x440e3e  SHRL $0x1, SI   #② Compute sizeclass() - shift right by 1
    70777→  0x440e41  MOVSX SI, SI    #③ Sign-extend sizeclass result
    ...
    70784→  0x440e4a  CMPQ SI, $0x44  #④ Bounds check: sizeclass < 68?
    70785→  0x440e4e  JAE 0x441012    #⑤ Jump if out of bounds
    70786→  0x440e54  LEAQ gc.SizeClassToDivMagic(SB), R8  # ⑥ Load array base address
    70787→  0x440e5b  MOVL 0(R8)(SI*4), CX   #⑦ Load gc.SizeClassToDivMagic[sizeclass]
    70788→  0x440e5f  IMULQ DI, CX           #⑧ Multiply: (p-base) * divMagic
    70789→  0x440e63  SHRQ $0x20, CX         #⑨ Shift right 32 → objIndex
    

    Operations:

    1. Memory load: q.class from offset 0x7f
    2. Computation: sizeclass() (right shift)
    3. Array access: SizeClassToDivMagic[sizeclass]

    Second Access: In noscan Path (lines 70814-70831)

    # Source: gcw.bytesMarked += uint64(gc.SizeClassToSize[q.class.sizeclass()])
    
    70815→  0x440ead    MOVZX 0x7f(DX), CX         # ① Load q.class from memory AGAIN! (same offset 0x7f)
    70817→  0x440eb1    TESTL $0x1, CL             # ② Check noscan bit
    ...
    70823→  0x440eb8    SHRL $0x1, CL              # ③ Compute sizeclass() AGAIN! (shift right)
    70825→  0x440eba    MOVSX CL, CX               # ④ Sign-extend AGAIN
    ...
    70827→  0x440ec0    CMPQ CX, $0x44             # ⑤ Bounds check AGAIN: sizeclass < 68?
    70828→  0x440ec4    JAE 0x440ffa               # ⑥ Jump if out of bounds AGAIN
    70829→  0x440eca    LEAQ gc.SizeClassToSize(SB), DX  # ⑦ Load array base address
    70830→  0x440ed1    MOVZX 0(DX)(CX*2), CX      # ⑧ Load gc.SizeClassToSize[sizeclass]
    70831→  0x440ed5    ADDQ CX, 0x840(BX)         # ⑨ Add to gcw.bytesMarked
    

    Redundant operations:

    1. Memory load: q.class from offset 0x7f again (line 70815)
    2. Computation: sizeclass() again (line 70823)
    3. Bounds: Bounds check again (line 70827)
    4. Array access: SizeClassToSize[sizeclass] (line 70830)
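
    The caching that Claude proposed can be sketched in isolation as follows. The spanClass type mirrors the encoding visible in the assembly (low bit for noscan, remaining bits for the size class), while the divMagic and sizes tables and the markObject function are hypothetical stand-ins for the runtime's tables and code. Whether a local copy is safe in the real function depends on the value not changing between the two uses, which this sketch simply assumes.

```go
package main

import "fmt"

// spanClass mimics the encoding seen in the assembly: the low bit is the
// noscan flag and the remaining bits hold the size class.
type spanClass uint8

func (sc spanClass) sizeclass() int8 { return int8(sc >> 1) }
func (sc spanClass) noscan() bool    { return sc&1 != 0 }

// Hypothetical tables standing in for gc.SizeClassToDivMagic and
// gc.SizeClassToSize. Each magic value is 2^32 divided by the size.
var (
	divMagic = [4]uint32{0, 1 << 29, 1 << 28, 1 << 27}
	sizes    = [4]uint16{0, 8, 16, 32}
)

// markObject computes the object index and, for noscan objects, the number
// of bytes marked, computing the size class only once.
func markObject(class spanClass, offset uint64) (objIndex uint16, bytesMarked uint64) {
	sc := class.sizeclass() // cached and reused for both table lookups

	objIndex = uint16((offset * uint64(divMagic[sc])) >> 32)
	if class.noscan() {
		bytesMarked = uint64(sizes[sc])
	}
	return objIndex, bytesMarked
}

func main() {
	class := spanClass(2<<1 | 1) // size class 2 (16 bytes), noscan
	idx, n := markObject(class, 160)
	fmt.Println(idx, n) // 10 16
}
```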

    Prioritization of TODO work in the GC for maximum impact

    There are several TODO items embedded within comments throughout the GC code. For developers interested in getting their hands dirty working with the garbage collector, using Claude to sift through the TODOs and prioritize them based on impact and difficulty could be very helpful to ensure they can make a meaningful contribution. Once Claude is familiar with the code base and hotspots, it is in a position to give some good leads.

    For example, when presented with the profiles of all sweet benchmarks on PowerPC, Claude suggested the following list, in priority order based on profile hotness and difficulty:

    1. Filter nil pointers in write barrier fast path
    2. Batch heap frees to reduce lock contention in reclaim function
    3. Lift sweepLocker out of sweepone loop
    4. Improve fractional mark worker scheduling
    5. Atomic max primitive for RISC-V
    6. Better distribution of GC work among workbufs
    7. Unified greying path through wbBuf in write barrier code
    8. Skip shading new pointers if stack already shaded in write barrier
    9. Green Tea GC: Batch spanSPMC freeing

    Trace file analysis

    Trace files are a vital complement to CPU profiles. While CPU profiles indicate where a performance problem is, trace files provide the insight needed to understand why. Analyzing a trace file is essential for determining whether the garbage collector is causing a performance bottleneck. Traces can be recorded with the runtime/trace package or the -trace flag of go test, and viewed with go tool trace.

    Pattern recognition and guidance

    For beginners learning how to read a trace file, Claude can be a good guide. The trace output contains thousands of compressed time intervals of data, as shown in figure 1. You can zoom in to an interval to view the events occurring in it in more detail. Visually analyzing a trace file in the browser can be exhausting unless you know what to look for and where.

    Figure 1: Viewing trace results in a browser.

    This is where Claude comes in handy. When prompted, it can provide a high-level cheat sheet of patterns, as shown below, so you can quickly focus on the problematic areas in the trace.

    Visual patterns to look for:

    Good pattern: Dense processor utilization

    Continuously busy processors are an indication of efficiency.

    P0: ████████████████████████████████
    P1: ████████████████████████████████
    P2: ████████████████████████████████
    P3: ████████████████████████████████

    Bad pattern: Idle processors during GC

    Gaps during GC marking indicate inefficient work distribution.

    P0: ████░░░░░░██████░░░░░░████████
    P1: ████████░░░░░░░░██████████████
    P2: ░░░░░░░░████████████░░░░░░░░░░
    P3: ████████████░░░░░░░░░░░░██████
        ↑ Work imbalance during mark

    Bad pattern: Frequent GC cycles

    GC running every 20ms is a very high frequency (expected for igor).

    Timeline (1 second view):
    GC: █░░░░█░░░░█░░░░█░░░░█░░░░█░░░░
        ↑    ↑    ↑    ↑    ↑    ↑
        20ms 40ms 60ms 80ms 100ms...

    Bad pattern: Mark assist gaps

    Frequent gaps indicate that the mutator is being throttled.

    Application Goroutine:
    ████░█████░█████░█████░█████░████
        ↑     ↑     ↑     ↑     ↑
        Mark assist pauses (goroutine helping GC)

    Good pattern: STW duration

    Short STW phases are good.

    GC Cycle Zoom In:
    STW-Start: █ (< 1ms)
    Concurrent Mark: ██████████ (10-20ms)
    STW-End: █ (< 1ms)

    Bad pattern: Long gcMarkDone

    A long gcMarkDone bar indicates a work imbalance.

    Mark Phase End:
    Workers: ████░░░░ (some finish early)
             ████████ (some still working)
             ████░░░░
    gcMarkDone: ░░░░░░░█ (long wait for stragglers)

    Pattern recognition cheat sheet

    • Dense bars everywhere: Good efficiency
    • Frequent vertical lines: Many context switches (bad)
    • Gaps during GC phases: Work imbalance
    • Regular GC bars < 100ms: Acceptable frequency
    • GC bars every 20ms: Too frequent
    • Long horizontal blocks: Good, continuous work
    • Fragmented blocks: Bad, interrupted work

    Extracting GC-related statistics from traces

    Sometimes even looking for patterns becomes exhausting due to the sheer size of the trace output. In such cases, you can ask Claude to draw insights from the trace files. For example, in the case of biogo-igor, Claude is able to draw the following conclusions based on the trace statistics:

    GC is running frequently:

    • 951 GC cycles in ~20 seconds = 47.55 GCs per second = one GC every 21ms

    • Conclusion: Very high GC pressure

    Mark assists indicate mutator throttling:

    • 224 mark assist events
    • Over 951 GCs = 0.24 assists per GC
    • Conclusion: GC marking can't keep up, mutator helping

    High allocation rate:

    • 179,896 HeapAlloc events in 20s = 8,995 allocations per second
    • Conclusion: Allocation-heavy workload

    Scheduler churn:

    • 130,994 ProcStart/Stop events
    • 15,678 ProcSteal events
    • Conclusion: Significant scheduling overhead

    Mark termination is expensive:

    • gcMarkDone: 181.16ms cumulative
    • Across 951 GCs = ~190μs per GC
    • Conclusion: Work imbalance in marking phase

    It can also give us data points to look for while trying to identify what the problem is. This is useful when we are doing offline analysis and don't have Claude handy.

    • GC frequency > 10/sec? Likely GC issue
    • Mark assists > 5% of GCs? Marking can't keep up
    • gcMarkDone latency high? Work imbalance
    • High ProcSteal? Load imbalance
    • Many HeapAllocs? Allocation pressure

    Parameter tuning

    Based on the bottlenecks identified in trace or CPU profile files, Claude can suggest parameter tuning. Tuning is a favorite among users because it is the least expensive option: it requires no changes to the source code, the Go runtime, or the compiler, and no recompilation. In other words, the gains are "free"!

    Some common recommendations given by Claude in the context of GC are shown below. The parameter you choose to tune depends on the bottleneck you encounter in the application.

    Tune the GOGC parameter

    Lower GOGC to trigger GC more frequently, but with less work per cycle. For example, you could lower it from its default 100 to 50:

    export GOGC=50

    This reduces mark assist by keeping heap growth in check, at the cost of more frequent but shorter GC cycles.

    Set memory limit (requires Go 1.19+)

    You can set a soft memory limit to prevent excessive heap growth. Adjust it based on available memory:

    export GOMEMLIMIT=8GiB

    This helps the GC pace itself better and reduce mark assist events.

    Increase GC workers

    If you have spare CPU cores, ensure GOMAXPROCS matches your available cores (this example uses 36, so adjust it for your actual core count):

    export GOMAXPROCS=36

    Comparative analysis

    Comparing two trace files to identify performance differences across various phases is a common necessity. Manually comparing trace output in two browser windows is laborious, inefficient and error-prone, especially since traces can contain thousands of data intervals. However, a tool like Claude can effectively highlight these differences as shown below, offering significant value in performance evaluation.

    For instance, this capability is crucial when assessing a new feature, such as the Green Tea garbage collector. By generating two traces, one with the baseline GC and one with Green Tea GC, you can compare them to pinpoint areas needing attention.

    This feature is also extremely useful when developing a GC optimization. It enables the detection of regressions resulting from the optimization and helps pinpoint the root cause. Similarly, you can use Claude to compare the CPU profiles of two binaries to analyze the differences.

    Metric            Benchmark: IGOR   Benchmark: KRISHNA   Ratio
    Allocations/sec   9,000             175                  51x
    GC frequency      47/sec            0.1/sec              470x
    Mark assists      224               2                    112x
    GC CPU overhead   37%               <1%                  37x

    Multi-trace comparison

    Extending this idea to more than two benchmarks, we can evaluate the Green Tea GC design at a much higher level. For instance, comparing trace files for all sweet benchmarks, Claude was able to give a high-level summary of where Green Tea GC performed well and where it did not.

    Benchmark       GC overhead   Green Tea benefit   Green Tea feature
    tile38          20-25%        ٭٭٭٭٭ Excellent     Page level tracking eliminates a lot of per-object overhead
    biogo-igor      77%           ٭٭٭٭٭ Excellent     Span level batching reduces per object decisions
    bleve-index     55%           ٭٭٭٭ Very good      Batch span scanning more efficient than individual object scanning
    esbuild         35-40%        ٭٭٭ Good            High parallelism, GC overhead is moderate
    etcd-STM        10-15%        ٭٭ Moderate         Main bottleneck is transaction coordination rather than the GC
    gopher-lua      10-15%        ٭ Low               GC overhead low
    etcd-Put        5% (masked)   ٭ Low               GC overhead low
    biogo-krishna   2%            Minimal             Almost non existent GC overhead
    markdown        10-15%        Minimal             Almost non existent GC overhead

    In summary, the patterns seen in the sweet benchmarks suggest that Green Tea GC excels when a workload shows:

    • High GC cycle frequency (> 10K cycles)
    • Clustered allocations (trees, queues, graphs)
    • High mark assist pressure (> 2 assists/cycle)
    • Significant time in tryDeferToSpanScan
    • High write barrier buffer overhead

    Green Tea provides little benefit when:

    • Low GC overhead (< 10%)
    • Computational bottlenecks dominate
    • Sequential allocation patterns
    • Large static data with few updates

    CPU profiles vs traces: When to use each

    Profiles offer instruction-level detail by associating a cost with every instruction. This deep insight helps pinpoint bottlenecks within specific code regions, such as determining what can be improved within a phase like marking or sweeping. Suggestions derived from profiles are typically more immediately actionable.

    Traces, conversely, offer a higher-level view, focusing on which phases are most problematic, identifying issues in phase transitions, or analyzing how phases are scheduled. Traces can also be very valuable if the application bottlenecks are not CPU bound. If Claude points out that the bottleneck could be in the I/O or network layer, you can focus your efforts on optimizing those layers instead.

    Due to their differing input formats, the data and actionable insights provided by profiles and traces are quite different. However, they complement each other effectively in providing a comprehensive view of performance.

    In some cases, using both profiles and traces helps us converge on the performance issue faster. For example, Claude reports the following for the biogo-igor benchmark and suggests adaptive write barrier buffer resizing as a solution.

    • CPU profile alone: Shows wbBuf time, but doesn't tell us if 13.56s is normal
    • Trace alone: Shows GC frequency, but doesn't show where time goes
    • COMBINED:
      • Trace shows EXTREMELY high GC frequency (938/sec)
      • CPU shows write barrier buffer management is 29.8% of runtime
      • Math reveals that the buffer is too small for this allocation pattern

    Current limitations

    There are limitations to be aware of when using Claude for this kind of analysis.

    Architecture-dependent suggestions require validation

    Suggestions related to low-level hardware details, such as cache lines, memory layout, SIMD, or ISA, must be validated. The inherent complexity of modern processor architectures can introduce unforeseen dependencies, meaning the assumed performance gains may not materialize. An example appears in the previous section, where Claude suggested adaptive write barrier buffer resizing without recognizing that increasing the buffer size beyond an architecture-specific value (the cache line size) can actually hurt overall performance.

    Short-sighted suggestions miss the bigger picture

    Claude may mistakenly identify a "missed optimization", such as a lack of aggressive inlining or loop unrolling, when the compiler intentionally skipped it. This is often done to manage register pressure, demonstrating a trade-off that Claude may have failed to recognize.

    Confusion with minimal overhead components (for example, a garbage collector)

    When a component like the GC already has minimal overhead or footprint in a specific benchmark, Claude can become confused. For instance, in the gopher-lua benchmark, where GC overhead is very low, it incorrectly assumed that the benchmark was not built with Green Tea GC enabled.

    To improve Claude's responsiveness and analytical quality, and to ensure that its suggestions lead to genuine performance gains, we plan to implement an agentic workflow focused on reducing token consumption. Additionally, incorporating specific guardrails through projects.md and skills.md documentation helps minimize false positives and communicates the functioning of the GC to Claude more effectively, making this experiment worthwhile.

    Summary

    Claude excels at understanding the binary CPU profile format, and it can interpret assembly dumps to check whether compiler optimizations were applied. It can suggest simple optimizations, though these require subsequent validation and evaluation.

    Claude demonstrates a strong grasp of the binary trace format. It can independently parse trace files, filter out events related to garbage collection, and generate scripts for later use. In certain scenarios, Claude can correlate data between traces and CPU profiles. A key capability is its ability to compare two trace or CPU profile files to determine which is superior and provide the rationale. This is useful for identifying regressions or understanding performance differences between versions. This comparative analysis can be extended to multiple files to derive higher-level insights, such as the relationship between memory allocation and GC activity.

    Conclusion

    Claude is a valuable exploratory tool that augments Go compiler and runtime performance analysis. It helps narrow the focus to likely areas for performance gains, reducing the manual labor of analyzing large profiles and trace files. Because it can compare two profiles or traces and pinpoint the differences very well, it can be used to quickly evaluate a prototype implementation for regressions and improvements. It can also be used across a range of large benchmarks to identify recurring patterns and understand the bigger picture of where the implementation excels and where it falls short.

    As with any AI tool, Claude's suggestions are limited, and architecture-specific recommendations can be inaccurate because providing the complete context isn't always feasible. In some cases, Claude may show confusion when components, like the garbage collector, are already performing well, requiring careful context-setting prompts. You must be mindful of these limitations when utilizing Claude.
