Implementing Graceful Shutdown on uncaughtException in Node.js
When an unhandled exception reaches the top of the Node.js call stack, your process has one last chance to clean up before it dies. This page covers the exact drain-and-exit sequence you need to implement inside that handler — part of the broader topic of Node.js uncaughtException vs unhandledRejection within the core JavaScript error handling boundaries guide.
Symptom / Trigger
These two failure modes look very different in production but share the same root cause: an uncaughtException handler that is missing, exits too abruptly, or never exits at all.
Mode A — immediate crash, upstream sees connection resets:
node:internal/process/promises:288
triggerUncaughtException(err, true /* fromPromise */);
^
Error: Cannot read properties of undefined (reading 'userId')
at processOrder (/app/orders.js:42:18)
at /app/worker.js:17:3
[K8s] Container exited with code 1
[ALB] 502 Bad Gateway — upstream connection reset
Mode B — process hangs forever, container orchestrator sends SIGKILL:
[K8s] Sending SIGTERM to container
[K8s] Grace period expired (30s) — sending SIGKILL
[K8s] Container killed
[APM] 847 spans lost — telemetry buffer never flushed
Both modes produce data loss and visible errors in your load balancer or APM dashboard.
Root Cause Explanation
The most common broken patterns are a naked process.exit() call with no drain step, and an async handler that awaits cleanup with no error handling and no deadline. Node does not await the Promise an uncaughtException listener returns, so nothing supervises that cleanup:
// Broken pattern A: exit before connections drain
process.on('uncaughtException', (err) => {
  console.error('Uncaught:', err); // logs, but buffers may not flush
  process.exit(1); // kills live sockets immediately
});

// Broken pattern B: async handler; Node ignores the Promise it returns
process.on('uncaughtException', async (err) => {
  Sentry.captureException(err); // returns an event ID string, not a Promise
  await Sentry.flush(2000); // a rejection here goes unhandled and exit is never reached
  process.exit(1); // still no drain: live sockets are cut
});
Pattern A causes the load balancer to see an abrupt TCP reset on every active keepalive connection. Pattern B looks safer, but Node ignores the Promise that an async uncaughtException listener returns: if the flush rejects, the error goes unhandled and process.exit() is never reached, and even when the flush succeeds the handler still exits without draining a single connection.
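The core issue: server.close() stops the listening socket from accepting new TCP connections, but on Node.js versions before 19 it does not touch sockets that are already connected and idle in a keepalive state (from v19, close() also closes idle connections, and v18.2 added server.closeIdleConnections() for doing so explicitly). Left open, those sockets keep the process alive indefinitely; any socket with a response in progress is torn down mid-response if you call process.exit() too soon.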
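As a rough sketch of that distinction, the helper below stops the listener and then drops idle keepalive sockets only when the runtime offers server.closeIdleConnections(). The name stopAcceptingAndDropIdle is hypothetical, and the feature check is there so the call is harmless on older Node.js versions.

```javascript
// Hypothetical helper: stop listening, then drop idle keepalive sockets
// when the runtime supports http.Server#closeIdleConnections (Node >= 18.2).
function stopAcceptingAndDropIdle(server) {
  // close() stops new connections; sockets that are already open stay open.
  server.close();

  if (typeof server.closeIdleConnections === 'function') {
    // Destroys sockets with no request in flight; busy sockets are left alone.
    server.closeIdleConnections();
    return true;
  }
  // Older runtime: the caller has to track and close sockets manually.
  return false;
}
```

A false return value signals that manual socket tracking, as described in Step 2, is still required.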
Step-by-Step Fix
Step 1: Set exit code and stop accepting new connections
Register the handler early, before any server starts listening. Set the exit code immediately so that even if the hard-timeout fires, the OS sees the right code:
// shutdown.js (import this before anything else in server.js)
import http from 'node:http';
import { WebSocketServer } from 'ws';

let httpServer; // assigned after server.listen()
let wssServer; // assigned after wss instantiation

export function registerServers(server, wss) {
  httpServer = server;
  wssServer = wss;
}

process.on('uncaughtException', (err, origin) => {
  process.exitCode = 1; // set now; process.exit() later inherits this
  process.stderr.write(
    `[FATAL] uncaughtException origin=${origin}\n${err.stack}\n`
  );
  // Stop the listening socket; no new TCP handshakes are accepted
  httpServer?.close();
  wssServer?.close();
  initiateShutdown(err);
});
Calling httpServer.close() returns immediately; it does not wait for connections to drain. That work happens in Step 2.
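Because close() returns immediately, the only way to learn that the server has finished closing is the callback it accepts. A minimal sketch, assuming a hypothetical helper named closeServer, wraps that callback in a Promise:

```javascript
// Hypothetical helper: wrap server.close() so callers can wait for it.
// The callback runs once the server has fully closed, which for an HTTP
// server means every remaining connection has ended. It receives an Error
// if the server was not running.
function closeServer(server) {
  return new Promise((resolve, reject) => {
    server.close((err) => (err ? reject(err) : resolve()));
  });
}
```

The returned Promise can then be combined with the drain and flush steps that follow, rather than assuming the server is closed as soon as close() returns.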
Step 2: Drain active HTTP and WebSocket connections
Track every socket the server opens so that shutdown can close them: idle keepalive sockets right away, and busy ones once their pending writes have been flushed:
// connectionTracker.js
const activeConnections = new Set();

export function trackConnections(server) {
  server.on('connection', (socket) => {
    activeConnections.add(socket);
    socket.once('close', () => activeConnections.delete(socket));
  });
}

export function drainConnections() {
  for (const socket of activeConnections) {
    // destroySoon() flushes data already queued on the socket, then destroys it.
    // On an idle keepalive socket it closes immediately. It does not wait for a
    // response that the request handler has not started writing yet.
    socket.destroySoon();
  }
}
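If in-flight requests must be allowed to finish, the tracker needs to know which sockets are busy. The following is a sketch under stated assumptions, not the tracker above: the names trackRequests and drainIdleFirst are hypothetical, and the approach counts requests per socket using the server's request event and the response's close event.

```javascript
// Sketch: remember which sockets have a request in flight, so shutdown can
// end idle sockets immediately and busy ones after their response completes.
const inFlight = new Map(); // socket -> number of unfinished responses
let draining = false;

function trackRequests(server) {
  server.on('connection', (socket) => {
    inFlight.set(socket, 0);
    socket.once('close', () => inFlight.delete(socket));
  });

  server.on('request', (req, res) => {
    const socket = req.socket;
    inFlight.set(socket, (inFlight.get(socket) ?? 0) + 1);

    // 'close' fires when the response completes or the connection drops.
    res.once('close', () => {
      if (!inFlight.has(socket)) return; // socket already gone
      const left = inFlight.get(socket) - 1;
      inFlight.set(socket, left);
      if (draining && left === 0) socket.end();
    });
  });
}

function drainIdleFirst() {
  draining = true;
  for (const [socket, count] of inFlight) {
    if (count === 0) socket.end(); // idle keepalive socket: close now
  }
}
```

Busy sockets are left alone until their last response closes, which is the behavior a slow in-flight request depends on.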
For WebSocket clients you need a slightly different approach because WebSocket frames may be mid-send:
// wsTracker.js
const activeClients = new Set();

export function trackWsClients(wss) {
  wss.on('connection', (ws) => {
    activeClients.add(ws);
    ws.once('close', () => activeClients.delete(ws));
  });
}

export function drainWsClients() {
  for (const ws of activeClients) {
    // 1001 = "Going Away" close code; clients will reconnect
    ws.close(1001, 'Server shutting down');
  }
}
If you prefer a package instead of manual tracking, http-terminator wraps the same logic and handles edge cases like half-open sockets.
Step 3: Flush telemetry synchronously or with a Promise.race timeout
Node does not wait for asynchronous work started inside an uncaughtException handler, so either use a synchronous flush API or race the flush Promise against a hard wall-clock timeout and return that Promise to a separate async function instead of awaiting it in the handler:
// telemetry.js
import * as Sentry from '@sentry/node';

export function flushTelemetry(err) {
  // Capture the exception synchronously (adds it to Sentry's internal queue)
  Sentry.captureException(err);
  // flush() returns a Promise. Race it against 2 s, but do NOT await here.
  // Instead return the Promise so initiateShutdown can chain it.
  return Promise.race([
    Sentry.flush(2000),
    new Promise((resolve) => setTimeout(resolve, 2000)), // hard ceiling
  ]);
}
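The same race can be factored into a reusable helper. This is a minimal sketch, assuming a hypothetical function named withTimeout; unlike the inline version, it clears its timer once the race settles so the timer cannot keep the event loop alive on its own.

```javascript
// Hypothetical helper: settle with `fallback` if `promise` takes longer than `ms`.
function withTimeout(promise, ms, fallback) {
  let timer;
  const ceiling = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  // Clear the timer whichever side wins the race.
  return Promise.race([promise, ceiling]).finally(() => clearTimeout(timer));
}
```

A call such as withTimeout(Sentry.flush(2000), 2000, false) would behave like the race above. A rejection from the wrapped Promise still propagates, so the caller needs its own catch.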
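For a last-gasp structured log that must reach stderr before the process dies, write it directly with process.stderr.write(). Note that console.error writes to the same stream, and whether the write completes synchronously depends on what stderr is connected to (file, terminal, or pipe) and on the platform; fs.writeSync(2, ...) is a blocking alternative when that guarantee matters:
// Direct write to stderr; safe to call inside uncaughtException
process.stderr.write(
  JSON.stringify({ level: 'fatal', msg: err.message, stack: err.stack }) + '\n'
);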
Step 4: Enforce a hard-timeout fallback with .unref()
Wire the drain and flush steps together and arm a hard timeout that fires only if the drain stalls:
// shutdown.js (continued from Step 1)
import { drainConnections } from './connectionTracker.js';
import { drainWsClients } from './wsTracker.js';
import { flushTelemetry } from './telemetry.js';

let shuttingDown = false; // guard against re-entrant calls

export async function initiateShutdown(err) {
  if (shuttingDown) return; // uncaughtException can fire multiple times
  shuttingDown = true;

  // Hard timeout: fires after 10 s if drain or flush stalls.
  // .unref() means this timer does NOT hold the event loop open:
  // if drain finishes in 2 s and the event loop empties, Node exits cleanly
  // without waiting for the 10 s to elapse.
  const hardTimeout = setTimeout(() => {
    process.stderr.write('[SHUTDOWN] Hard timeout reached: forcing exit\n');
    process.exit(1);
  }, 10_000).unref();

  try {
    drainConnections(); // destroySoon() on all tracked HTTP sockets
    drainWsClients(); // ws.close(1001) on all tracked WS clients
    await flushTelemetry(err); // races internally against 2 s ceiling
  } catch (shutdownErr) {
    process.stderr.write(`[SHUTDOWN] Error during drain: ${shutdownErr.message}\n`);
  } finally {
    clearTimeout(hardTimeout); // cancel the 10 s timer if we got here in time
    process.exit(1);
  }
}
The critical detail: .unref() on the setTimeout handle tells Node not to count this timer as a reason to keep the event loop alive. Without .unref(), if every real connection drained at 3 seconds but telemetry flushed at 4 seconds, Node would still wait until the full 10-second timer elapsed before the process could exit naturally. With .unref(), once the event loop is otherwise empty, Node exits immediately — the 10-second timer only fires if something is genuinely stuck.
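The effect of .unref() can be checked directly with the timer's hasRef() method. The helper name armHardTimeout below is hypothetical; the sketch only shows that an unref'd timer exists without holding the process open.

```javascript
// Sketch: an unref'd timer is armed but does not keep the event loop alive.
function armHardTimeout(ms, onTimeout) {
  const timer = setTimeout(onTimeout, ms);
  timer.unref(); // the process may exit before this fires
  return timer;
}

const hard = armHardTimeout(10_000, () => process.exit(1));
console.log(hard.hasRef()); // false: this timer alone will not hold the process open
clearTimeout(hard);
```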
Kubernetes alignment: set terminationGracePeriodSeconds in your Pod spec to at least 5 seconds more than your hard timeout. If your hard timeout is 10 seconds, use terminationGracePeriodSeconds: 15. On a SIGTERM-initiated shutdown, this gives your drain time to complete before Kubernetes sends SIGKILL as a last resort when the process is still alive after 15 seconds.
Note that SIGTERM and uncaughtException should share the same drain logic. Wire them together:
// Reuse the same shutdown path for graceful restarts (e.g. Kubernetes rolling deploy)
process.on('SIGTERM', () => {
  process.stderr.write('[SHUTDOWN] SIGTERM received\n');
  httpServer?.close();
  wssServer?.close();
  initiateShutdown(new Error('SIGTERM')).catch(() => process.exit(1));
});
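If more than one signal should trigger the drain (SIGINT during local development, for example), the registration can be generalized. This is a sketch with a hypothetical helper name, registerSignalHandlers; the shutdown argument stands in for initiateShutdown.

```javascript
// Sketch: route several termination signals through one shutdown function.
// `shutdown` is a stand-in for initiateShutdown from shutdown.js.
function registerSignalHandlers(shutdown, signals = ['SIGTERM', 'SIGINT']) {
  for (const signal of signals) {
    // once(): after the first delivery the listener is removed, so a repeated
    // signal falls back to Node's default action and terminates the process.
    process.once(signal, () => {
      Promise.resolve(shutdown(signal)).catch(() => process.exit(1));
    });
  }
}
```

Using once() is a design choice: an operator who sends the signal a second time gets an immediate exit instead of waiting on a stuck drain.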
Verification
Start your server with the shutdown module loaded, then inject a synthetic exception after a short delay and verify that in-flight requests complete cleanly:
// test-shutdown.js; run with: node test-shutdown.js
import http from 'node:http';
import { registerServers } from './shutdown.js';
import { trackConnections } from './connectionTracker.js';

const server = http.createServer((req, res) => {
  // Simulate a slow request (2 s) to verify drain waits for it
  setTimeout(() => {
    res.writeHead(200);
    res.end('ok\n');
  }, 2000);
});

trackConnections(server);

server.listen(3000, () => {
  registerServers(server, null);
  console.log('Listening on :3000');
  // Fire synthetic exception after 500 ms
  setTimeout(() => {
    throw new Error('synthetic uncaughtException for shutdown test');
  }, 500);
});
Expected output in the terminal (the shutdown code above logs only the fatal line and stack trace, then exits with code 1; the hard-timeout message appears only if the drain stalls):
Listening on :3000
[FATAL] uncaughtException origin=uncaughtException
Error: synthetic uncaughtException for shutdown test
    at Timeout.<anonymous> (test-shutdown.js:18:11)
Expected load-balancer behavior: a request that was in flight when the exception fired should still receive its 200 OK, provided the drain waits for that response before the socket is destroyed. A request that arrives after server.close() runs cannot connect to this instance, so the load balancer returns a 502. Without a drain step, in-flight requests see a connection reset and also surface as 502, but the cause differs: check your ALB access logs for TCP_RESET_BY_TARGET vs TCP_REFUSED_BY_TARGET.
Edge Cases & Gotchas
- WebSocket clients mid-message. Sending ws.close(1001) while a fragmented WebSocket message is being received causes the client to lose that message. If message delivery guarantees matter, wait for outbound data to flush (the drain event on the underlying socket) before closing, or use a protocol-level handshake to tell clients to stop sending before you close.
- Clustered Node.js processes. In a cluster setup, uncaughtException fires only in the worker that threw. The master (primary) process does not see it. Each worker must run its own drain sequence. The master's worker.on('exit') handler fires after the worker drains and exits; the master should then fork a replacement. Do not let the master call process.exit(), because that kills all other healthy workers simultaneously.
- OOM errors bypassing uncaughtException. When Node.js exhausts the V8 heap (--max-old-space-size), the runtime aborts with a fatal error (typically SIGABRT) without firing uncaughtException; in a worker thread the failure surfaces as ERR_WORKER_OUT_OF_MEMORY instead. Run with --abort-on-uncaught-exception during development to get a core dump for post-mortem analysis instead of a silent exit. In production, set container memory limits above --max-old-space-size so the OS does not OOM-kill the process before V8 reaches its own limit.
- Re-entrant uncaughtException. If your drain code itself throws, Node will fire uncaughtException again. The shuttingDown guard in Step 4 prevents infinite recursion, but be aware that a second uncaught exception during shutdown changes the exit code unless you explicitly set process.exitCode at the very start of the handler (as shown in Step 1).
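For the WebSocket case, a related risk is a client that never completes the closing handshake. One possible mitigation, sketched here with a hypothetical helper closeWithDeadline and an assumed 3-second budget, is to start the handshake with close() and fall back to the ws package's terminate() if the peer stalls.

```javascript
// Sketch: start the closing handshake, then force-close if the peer stalls.
// `ws` is expected to be a WebSocket from the `ws` package; `graceMs` is an
// assumed budget, not a recommended value.
function closeWithDeadline(ws, graceMs = 3000) {
  const timer = setTimeout(() => ws.terminate(), graceMs);
  timer.unref(); // do not hold the process open just for this deadline
  ws.once('close', () => clearTimeout(timer));
  ws.close(1001, 'Server shutting down');
}
```

This bounds how long a single client can delay shutdown; it does not by itself protect a message that is still in transit.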
FAQ
Why can’t I just use async/await inside the uncaughtException handler?
Node.js does not await the return value of uncaughtException listeners. If you mark the callback async, each await suspends only that function; the emit call that invoked the handler has already returned, so Node neither waits for the cleanup nor catches a rejection from it. The cleanup does continue in the background, but with no deadline and no error handling unless you add them. The workaround is to call an async function without await from inside the handler and let that function drive all cleanup, with its own try/catch and hard timeout, as shown in Step 4.
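The ordering is easy to observe with any event emitter. The sketch below uses process itself and an arbitrary event name, demo-event, chosen only for this illustration.

```javascript
// Sketch: emit() returns as soon as an async listener reaches its first await.
const order = [];

process.once('demo-event', async () => {
  order.push('listener started');
  await null; // the listener pauses here
  order.push('listener resumed');
});

process.emit('demo-event'); // runs the listener synchronously up to the await
order.push('emit returned');

setImmediate(() => console.log(order.join(' -> ')));
// listener started -> emit returned -> listener resumed
```

The emitter has moved on before the listener resumes, which is why any error handling and deadline have to live inside the async function itself.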
How does this interact with unhandledRejection in modern Node.js?
Since Node 15, an unhandled Promise rejection is treated as a fatal error by default: if no unhandledRejection listener is registered, Node raises the rejection as an uncaught exception, and your handler receives it with origin set to 'unhandledRejection'. A single uncaughtException handler therefore covers both cases. Be careful when adding a separate unhandledRejection listener for logging: once one is registered, Node no longer escalates the rejection to uncaughtException, so that listener must invoke the drain sequence itself.
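If you do register an unhandledRejection listener, it can forward to the same shutdown path. This is a sketch under stated assumptions: toError and registerRejectionHandler are hypothetical names, and shutdown stands in for initiateShutdown.

```javascript
// Sketch: a rejection reason can be any value, so normalize it to an Error.
function toError(reason) {
  if (reason instanceof Error) return reason;
  return new Error(`Unhandled rejection: ${String(reason)}`);
}

// `shutdown` is a stand-in for initiateShutdown from shutdown.js.
function registerRejectionHandler(shutdown) {
  process.on('unhandledRejection', (reason) => {
    process.exitCode = 1; // same exit code as the uncaughtException path
    shutdown(toError(reason));
  });
}
```

Normalizing the reason keeps err.stack and err.message available to the logging and telemetry code, which otherwise assume an Error instance.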
Should the hard timeout be 5 seconds, 10 seconds, or 30 seconds?
Match it to the sum of your slowest possible in-flight request duration plus your telemetry flush ceiling. A typical API server with a 5-second request timeout and a 2-second Sentry flush needs a hard timeout of roughly 8–10 seconds. Set terminationGracePeriodSeconds in Kubernetes to that value plus 5 seconds of buffer. Values over 30 seconds are rarely justified and increase the blast radius of a hung shutdown.
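The arithmetic can be written down as a small function. The name shutdownBudget and the 1-second marginMs default are assumptions made for this worked example; the other figures come from the answer above.

```javascript
// Worked example of the timeout budget (marginMs is an assumed safety margin).
function shutdownBudget({ requestTimeoutMs, flushCeilingMs, marginMs = 1000, k8sBufferSec = 5 }) {
  const hardTimeoutMs = requestTimeoutMs + flushCeilingMs + marginMs;
  const terminationGracePeriodSeconds = Math.ceil(hardTimeoutMs / 1000) + k8sBufferSec;
  return { hardTimeoutMs, terminationGracePeriodSeconds };
}

console.log(shutdownBudget({ requestTimeoutMs: 5000, flushCeilingMs: 2000 }));
// { hardTimeoutMs: 8000, terminationGracePeriodSeconds: 13 }
```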