How to investigate 7 common problems in production

Introduction to the Node.js reference architecture, part 13: Problem determination

Node.js lead for Red Hat and IBM

Applying the experience shared in the Node.js reference architecture can help you minimize problems in production. Problems will still occur, however, and when they do you need to determine their cause. This installment of the ongoing Node.js reference architecture series covers the team’s experience investigating common problems in production.

The seven common problems are:

Memory leaks
Hangs or slow performance
Application failures
Unhandled promise rejections or exceptions
Resource leaks
Network issues
Native crashes

Read the series:

Part 1: Overview of the Node.js reference architecture
Part 2: Logging in Node.js
Part 3: Code consistency in Node.js
Part 4: GraphQL in Node.js
Part 5: Building good containers
Part 6: Choosing web frameworks
Part 7: Code coverage
Part 8: TypeScript
Part 9: Securing Node.js applications
Part 10: Accessibility
Part 11: Typical development workflows
Part 12: npm development
Part 13: Problem determination
Part 14: Testing
Part 15: Transaction handling
Part 16: Load balancing, threading, and scaling
Part 17: CI/CD best practices in Node.js
Part 18: Wrapping up

Preparing for production problems

Investigating problems in production usually requires tools and processes to be in place and approved in advance. Define the tools you’ll rely on, and get approval either to have them installed ahead of time or to install them when needed. Production environments are often tightly controlled, so doing this in advance shortens the time it takes to get information when problems occur.

The team typically used one of the following to capture a set of metrics regularly in production:

Existing application performance management (APM) offering
Custom solution leveraging available platform tools (for example, those provided by a cloud provider or Red Hat OpenShift)

You can read more about the suggested approach for capturing metrics in the metrics section of the reference architecture. When those metrics indicate a problem, it’s time to start the problem determination process and try to match the symptoms to one of the common problems.
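As an illustration of what a custom solution might sample, here is a minimal sketch that uses only Node.js built-ins (`process.memoryUsage()` and `perf_hooks.monitorEventLoopDelay()`). The field names, the 15-second interval, and writing to standard output are assumptions made for this example, not recommendations from the reference architecture.

```javascript
'use strict';
const { monitorEventLoopDelay } = require('node:perf_hooks');

// Histogram of event loop delay, sampled every 20 ms.
const loopDelay = monitorEventLoopDelay({ resolution: 20 });
loopDelay.enable();

// Collect one snapshot of process-level metrics.
// The field names are hypothetical; adapt them to your metrics backend.
function collectMetrics() {
  const mem = process.memoryUsage();
  const snapshot = {
    timestamp: new Date().toISOString(),
    rssBytes: mem.rss,
    heapUsedBytes: mem.heapUsed,
    heapTotalBytes: mem.heapTotal,
    // The histogram reports nanoseconds; convert to milliseconds.
    eventLoopDelayP99Ms: loopDelay.percentile(99) / 1e6,
    uptimeSeconds: process.uptime(),
  };
  loopDelay.reset(); // start a fresh window for the next sample
  return snapshot;
}

// Assumed reporting interval of 15 seconds. unref() keeps this timer
// from holding the process open on its own.
const metricsTimer = setInterval(() => {
  console.log(JSON.stringify(collectMetrics()));
}, 15000);
metricsTimer.unref();
```

Steady growth in `heapUsedBytes` or a rising `eventLoopDelayP99Ms` are the kinds of signals that would prompt a closer look at a memory leak or at slow performance, respectively.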

APM or custom solution

In the team’s discussions, a key factor in choosing between an APM and a custom solution was whether you are operating a service yourself or developing an application that your customers will operate.

If you are operating a service where you own all deployments, an existing application performance management (APM) solution can make sense if budget is not an issue. APMs require less upfront investment and are designed to help investigate problems. The team has had success with Dynatrace, Instana, and New Relic.

If you develop applications that will be operated by customers, adding a dependency on a specific APM for problem determination is not recommended. The cost may be an issue for some customers, while others may already have standardized on a specific APM.

Whichever you choose, you’ll want to be ready to:

Instrument the application in order to capture:
- Logs
- Metrics
- Traces
Generate and extract heap snapshots.
Generate and extract core dumps.
Dynamically change log levels.
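As one way to be ready to generate heap snapshots, the sketch below writes a snapshot with the built-in `v8.writeHeapSnapshot()` when the process receives a signal. The choice of `SIGUSR2`, the output directory, and the file name pattern are assumptions for this example. Node.js also provides the `--heapsnapshot-signal` command-line option, which achieves a similar result without code changes.

```javascript
'use strict';
const v8 = require('node:v8');
const os = require('node:os');
const path = require('node:path');

// Write a heap snapshot into `dir` and return the path of the file.
// v8.writeHeapSnapshot() is synchronous: it blocks the event loop and
// needs extra memory while it runs, so trigger it deliberately.
function captureHeapSnapshot(dir = os.tmpdir()) {
  const file = path.join(
    dir,
    `heap-${process.pid}-${Date.now()}.heapsnapshot`, // hypothetical naming scheme
  );
  return v8.writeHeapSnapshot(file);
}

// SIGUSR2 is an assumed trigger; it is not available on Windows.
if (process.platform !== 'win32') {
  process.on('SIGUSR2', () => {
    const file = captureHeapSnapshot();
    console.log(`Heap snapshot written to ${file}`);
  });
}
```

With this in place, running `kill -USR2 <pid>` against the process produces a `.heapsnapshot` file that can be copied out of the environment and loaded into the Chrome DevTools memory tab.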

The implementation part of the problem determination section of the Node.js reference architecture offers suggestions, based on our experience, on how to do each of these.
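To make the idea of changing log levels dynamically concrete, here is a minimal sketch of a logger whose level can be switched at run time without a restart. The level names, the `LOG_LEVEL` environment variable, and the use of `SIGHUP` as the trigger are all assumptions for illustration. A real application would more likely adjust the level of its logging library, for example by assigning to `logger.level` in pino.

```javascript
'use strict';

// Lower number means more severe. The level names are illustrative.
const LEVELS = { error: 0, warn: 1, info: 2, debug: 3 };
const isLevel = (level) => Object.keys(LEVELS).includes(level);

// Start from a hypothetical LOG_LEVEL environment variable, else "info".
let currentLevel = isLevel(process.env.LOG_LEVEL) ? process.env.LOG_LEVEL : 'info';

// Change the active level while the process is running.
function setLogLevel(level) {
  if (!isLevel(level)) throw new RangeError(`Unknown log level: ${level}`);
  currentLevel = level;
}

// Emit the message only if its level is enabled; return whether it was written.
function log(level, message) {
  if (!isLevel(level) || LEVELS[level] > LEVELS[currentLevel]) return false;
  console.log(JSON.stringify({ level, message, time: new Date().toISOString() }));
  return true;
}

// Assumed trigger: toggle between "info" and "debug" on SIGHUP (not on Windows).
if (process.platform !== 'win32') {
  process.on('SIGHUP', () => {
    setLogLevel(currentLevel === 'debug' ? 'info' : 'debug');
    log('info', `log level is now ${currentLevel}`);
  });
}
```

Sending `kill -HUP <pid>` then turns debug output on while you investigate, and sending it again turns it back off.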

Investigating specific problems

Once you have your chosen approach in place (APM or custom) and your metrics start reporting an issue, the general flow the team discussed and agreed on for problem determination was to:

Match the symptoms reported by traces, metrics, logs, and health checks to one of the common problems.
Follow a set of steps, from easiest to hardest, to confirm or refute the suspected problem.
When confirmed, capture additional information.
Repeat as necessary until you’ve narrowed it down to the right problem.
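As an example of an easy first step for confirming or refuting one suspected problem, the sketch below checks whether a series of heap-usage samples shows sustained growth, which is one common symptom of a memory leak. The heuristic, the function names, and the threshold of 1 MiB per sample are assumptions made for this example. They are not taken from the reference architecture, and a positive result only justifies moving on to a harder step such as comparing heap snapshots.

```javascript
'use strict';

// Least-squares slope of evenly spaced samples (units per sample interval).
function slope(samples) {
  const n = samples.length;
  if (n < 2) return 0;
  const meanX = (n - 1) / 2;
  const meanY = samples.reduce((sum, y) => sum + y, 0) / n;
  let num = 0;
  let den = 0;
  samples.forEach((y, x) => {
    num += (x - meanX) * (y - meanY);
    den += (x - meanX) ** 2;
  });
  return num / den;
}

// Hypothetical heuristic: at least five samples whose fitted trend grows
// by minGrowthPerSample bytes or more per sample interval.
function looksLikeMemoryLeak(heapUsedSamples, minGrowthPerSample = 1024 * 1024) {
  return heapUsedSamples.length >= 5 && slope(heapUsedSamples) >= minGrowthPerSample;
}

// Example: heap usage in bytes, sampled at a fixed interval.
const MiB = 1024 * 1024;
console.log(looksLikeMemoryLeak([10, 20, 30, 40, 50].map((v) => v * MiB)));
console.log(looksLikeMemoryLeak([50, 52, 49, 51, 50].map((v) => v * MiB)));
```

Heap usage naturally rises and falls with garbage collection, so a trend fitted over many samples is more telling than any single reading.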

I won’t repeat the content of the common problems section of the reference architecture’s problem determination section here, but for each problem it includes:

Symptoms - how to identify the problem from the traces, metrics, logs, or health checks.
Approach - the steps and tools used by the team to investigate and capture more information for that problem.
Guidance - advice and things to look out for when applying the suggested approach.

The common problems section covers the following problems based on the team's experience:

Memory leaks
Hangs or slow performance
Application failures
Unhandled promise rejections or exceptions
Resource leaks
Network issues
Native crashes

The information in that section should help you identify which problem your application is encountering based on its symptoms, and then investigate further to find the likely cause.
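For one of these problems, unhandled promise rejections or exceptions, a small amount of preparation in the application makes the symptoms easier to spot in the logs. The sketch below uses the built-in `uncaughtExceptionMonitor` process event to record a structured entry before Node.js applies its default handling (which, in current Node.js releases, terminates the process for both uncaught exceptions and unhandled rejections). The `describeFailure` helper and the record fields are hypothetical names introduced for this example.

```javascript
'use strict';

// Build a structured record for a fatal error.
// `origin` is 'uncaughtException' or 'unhandledRejection'.
function describeFailure(origin, err) {
  const error = err instanceof Error ? err : new Error(String(err));
  return {
    origin,
    name: error.name,
    message: error.message,
    stack: error.stack,
    time: new Date().toISOString(),
  };
}

// The monitor event observes the failure without changing what happens next:
// the process still exits (or aborts, if --abort-on-uncaught-exception is set).
process.on('uncaughtExceptionMonitor', (err, origin) => {
  console.error(JSON.stringify(describeFailure(origin, err)));
});
```

Because the monitor does not swallow the error, it combines well with core dump generation: the log entry tells you what failed, and the dump captures the state of the process when it did.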

Coming next

I hope that this quick overview of the problem determination section of the Node.js reference architecture, along with the team discussions that led to that content, has been helpful and that the information shared in the architecture helps you in your future problem determination efforts.

We cover new topics regularly as part of the Node.js reference architecture series. Next, read about testing in Node.js.

We invite you to visit the Node.js reference architecture repository on GitHub, where you will see the work we have done and future topics. To learn more about what Red Hat is up to on the Node.js front, check out our Node.js page.

Last updated: January 9, 2024
