Leveraging the experience shared in the Node.js reference architecture can help you minimize problems in production. However, problems will still occur, and when they do you will need to do problem determination. This installment of the ongoing Node.js reference architecture series covers the Node.js reference architecture team’s experience investigating common problems when they occur.
The seven common problems covered are:
- Memory leaks
- Hangs or slow performance
- Application failures
- Unhandled promise rejections or exceptions
- Resource leaks
- Network issues
- Native crashes
Read the series:
- Part 1: Overview of the Node.js reference architecture
- Part 2: Logging in Node.js
- Part 3: Code consistency in Node.js
- Part 4: GraphQL in Node.js
- Part 5: Building good containers
- Part 6: Choosing web frameworks
- Part 7: Code coverage
- Part 8: TypeScript
- Part 9: Securing Node.js applications
- Part 10: Accessibility
- Part 11: Typical development workflows
- Part 12: npm development
- Part 13: Problem determination
- Part 14: Testing
- Part 15: Transaction handling
- Part 16: Load balancing, threading, and scaling
- Part 17: CI/CD best practices in Node.js
- Part 18: Wrapping up
Preparing for production problems
Investigating problems in production most often requires tools and processes to be in place and approved in advance. It is important to define the tools you’ll rely on and to get approval either to have them in place already or to install them when needed. Production environments are often tightly controlled, and doing this in advance will speed up your ability to get information when problems do occur.
The team typically used one of the following to capture a set of metrics regularly in production:
- Existing application performance management (APM) offering
- Custom solution leveraging available platform tools (for example, those provided by a cloud provider or Red Hat OpenShift)
You can read more about the suggested approach for capturing metrics in the metrics section. When those metrics indicate a problem, it’s time to kick off the problem determination process and try to match the symptoms to one of the common problems.
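For the custom route, a minimal sketch of sampling a few process-level metrics with Node.js built-ins might look like the following. The helper names (`sampleMetrics`, `startSampling`) and the choice of metrics are assumptions made for illustration, not the set recommended in the reference architecture. A real deployment would forward each sample to whatever metrics backend is already in place.

```javascript
const { monitorEventLoopDelay } = require('node:perf_hooks');

// Histogram of event loop delay, sampled every 20 ms (values are in nanoseconds).
const loopDelay = monitorEventLoopDelay({ resolution: 20 });

// Hypothetical helper: take one snapshot of memory and event loop metrics.
function sampleMetrics() {
  const { rss, heapUsed, heapTotal } = process.memoryUsage();
  const sample = {
    timestamp: new Date().toISOString(),
    rssBytes: rss,
    heapUsedBytes: heapUsed,
    heapTotalBytes: heapTotal,
    // 99th percentile delay since the last reset, converted to milliseconds.
    eventLoopDelayP99Ms: loopDelay.percentile(99) / 1e6,
  };
  loopDelay.reset();
  return sample;
}

// Hypothetical helper: emit a sample every intervalMs and return a stop function.
// The default sink just prints JSON; swap in your own exporter.
function startSampling(intervalMs, sink = (s) => console.log(JSON.stringify(s))) {
  loopDelay.enable();
  const timer = setInterval(() => sink(sampleMetrics()), intervalMs);
  timer.unref(); // do not keep the process alive just for metrics
  return () => {
    clearInterval(timer);
    loopDelay.disable();
  };
}
```

Calling `startSampling(10000)` at startup would print one JSON sample every ten seconds. A steadily growing `heapUsedBytes` or a rising event loop delay are the kinds of signals that can then be matched against the common problems.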
APM or custom solution
In the team's discussion, one of the key factors in choosing between an APM and a custom solution was whether you are operating a service or developing an application that will be operated by your customers.
If you are operating a service where you own all deployments, an existing application performance management (APM) solution can make sense if budget is not an issue. APM solutions require less upfront investment and are designed to help investigate problems. The team has had success using Dynatrace, Instana, and New Relic.
If you develop applications that will be operated by customers, adding a dependency on a specific APM for problem determination is not recommended. The cost may be an issue for some customers, while others may already have standardized on a specific APM.
Whichever you choose, you’ll want to be ready to:
- Instrument the application in order to capture:
- Generate and extract heap snapshots.
- Generate and extract core dumps.
- Dynamically change log levels.
The implementation section of the problem-determination section of the Node.js reference architecture has some good suggestions on how to do these based on our experience.
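As a rough illustration of two of these capabilities, the sketch below writes a heap snapshot on demand and lets the log level change at run time. It relies on `v8.writeHeapSnapshot()` from Node.js core. The level names, the helper names, and the use of `SIGUSR2` as the trigger are assumptions made for this example rather than the team's documented approach.

```javascript
const v8 = require('node:v8');
const os = require('node:os');
const path = require('node:path');

// Hypothetical log levels, ordered from least to most verbose.
const LEVELS = ['error', 'warn', 'info', 'debug'];
let currentLevel = 'info';

// Change the active level at run time, for example from an admin endpoint.
function setLogLevel(level) {
  if (!LEVELS.includes(level)) {
    throw new RangeError(`unknown log level: ${level}`);
  }
  currentLevel = level;
}

// True when a message at `level` should be emitted under the current level.
function shouldLog(level) {
  return LEVELS.indexOf(level) <= LEVELS.indexOf(currentLevel);
}

// Write a heap snapshot into `dir` and return the file name.
// This call is synchronous and can pause the process while it runs.
function captureHeapSnapshot(dir = os.tmpdir()) {
  const file = path.join(dir, `heap-${process.pid}-${Date.now()}.heapsnapshot`);
  return v8.writeHeapSnapshot(file);
}

// Assumed trigger: SIGUSR2 (not available on Windows).
if (process.platform !== 'win32') {
  process.on('SIGUSR2', () => {
    const file = captureHeapSnapshot();
    if (shouldLog('info')) console.log(`heap snapshot written to ${file}`);
  });
}
```

Node.js also has command-line options that cover part of this without code: `--heapsnapshot-signal=<signal>` writes a snapshot when the given signal arrives, and `--abort-on-uncaught-exception` aborts the process on an uncaught exception, which produces a core dump where the platform is configured to write one.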
Investigating specific problems
Once you have your chosen approach in place (APM or custom) and your metrics start reporting an issue, the general flow the team discussed and agreed on for problem determination was to:
- Match the symptoms reported by traces, metrics, logs, and health checks to one of the common problems.
- Follow a set of steps from easiest to hardest to confirm/refute the suspected problem.
- When confirmed, capture additional information.
- Repeat as necessary until you’ve narrowed it down to the right problem.
I won’t repeat the content of the common problems section of the reference architecture's problem determination section here, but each problem in it includes:
- Symptoms - how to identify the problem from the traces, metrics, logs, or health checks.
- Approach - the steps and tools used by the team to investigate and capture more information for that problem.
- Guidance - guidance and things to look out for when applying the approach suggested.
The common problems section covers the following problems based on the team's experience:
- Memory leaks
- Hangs or slow performance
- Application failures
- Unhandled promise rejections or exceptions
- Resource leaks
- Network issues
- Native crashes
For each of those problems, the information in that section should help you identify which problem your application is encountering based on its symptoms, and then investigate further to find out what might be causing it.
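Taking unhandled promise rejections or exceptions as an example, one way to capture additional information is to log a structured record just before the process goes down. The sketch below is an illustration under stated assumptions: `describeFailure` is a hypothetical helper, and the code assumes Node.js 15 or later, where an unhandled rejection is raised as an uncaught exception by default. It uses the `uncaughtExceptionMonitor` event, which observes the failure without changing the default crash behavior.

```javascript
// Hypothetical helper: turn any thrown value or rejection reason
// into a record that is safe to serialize and log.
function describeFailure(origin, reason) {
  const isError = reason instanceof Error;
  return {
    origin, // 'uncaughtException' or 'unhandledRejection'
    message: isError ? reason.message : String(reason),
    stack: isError ? reason.stack : undefined,
    pid: process.pid,
    uptimeSeconds: process.uptime(),
  };
}

// Observe the failure only; Node.js still terminates the process afterwards.
process.on('uncaughtExceptionMonitor', (err, origin) => {
  console.error(JSON.stringify(describeFailure(origin, err)));
});
```

Because the monitor event does not suppress the exit, the process still terminates as it would without the listener, and any restart policy or core dump configuration keeps working as before.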
Coming next
I hope that this quick overview of the problem determination section of the Node.js reference architecture, along with the team discussions that led to that content, has been helpful and that the information shared in the architecture helps you in your future problem determination efforts.
We cover new topics regularly as part of the Node.js reference architecture series. Next, read about testing in Node.js.
We invite you to visit the Node.js reference architecture repository on GitHub, where you will see the work we have done and future topics. To learn more about what Red Hat is up to on the Node.js front, check out our Node.js page.
Last updated: January 9, 2024