We will leverage Node.js worker threads and cluster modules to parallelize CPU-bound tasks, improving application performance by utilizing multiple cores efficiently.

First of all, we are trying to improve Node.js servers that run CPU-intensive JavaScript operations. Parallelization does not help much with I/O-intensive work: the Node.js built-in asynchronous I/O operations are more efficient than Workers or child processes can be.

Now let’s determine what kind of parallelization we need for our server. Does it need to share memory? Does it need to share a port? Or does it do different things in isolation? Those questions will help us pick the right parallelization module for the task.

child_process

The node:child_process module provides the ability to spawn subprocesses in a manner that is similar, but not identical, to popen(3). This capability is primarily provided by the child_process.spawn() function:

import { spawn } from 'node:child_process';
const ls = spawn('ls', ['-lh', '/usr']);

ls.stdout.on('data', (data) => {
  console.log(`stdout: ${data}`);
});

ls.stderr.on('data', (data) => {
  console.error(`stderr: ${data}`);
});

ls.on('close', (code) => {
  console.log(`child process exited with code ${code}`);
});

It allows you to execute external commands or run another Node.js script (with fork()).

Imagine a movie theater that sells drinks, popcorn, and gifts from other vendors.

Tip
So child_process will be useful when we need to call an external process, or run a CPU-intensive script from the current process.
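
For the second case, here is a minimal sketch of forking a Node.js script and talking to it over the built-in IPC channel (the file name heavy-task.js and its message shape are made up for illustration):

// main.js
import { fork } from 'node:child_process';

// fork() starts a new Node.js process running the given script
const child = fork(new URL('./heavy-task.js', import.meta.url));

child.on('message', (result) => {
  console.log(`result from child: ${result}`);
  child.disconnect(); // close the IPC channel so the child can exit
});

child.send(1_000_000); // hypothetical input: number of iterations

// heavy-task.js
process.on('message', (iterations) => {
  let x = 0;
  for (let i = 0; i < iterations; i++) {
    x += Math.random() * i;
  }
  process.send(x);
});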

cluster

Clusters of Node.js processes can be used to run multiple instances of Node.js that can distribute workloads among their application threads. When process isolation is not needed, use the worker_threads module instead, which allows running multiple application threads within a single Node.js instance.

The cluster module allows easy creation of child processes that all share server ports.

The cluster module is built on top of child_process.fork(). Its main purpose is to create network applications that can utilize all available CPU cores on a machine.

import cluster from 'node:cluster';
import http from 'node:http';
import { availableParallelism } from 'node:os';
import process from 'node:process';

const numCPUs = availableParallelism();

if (cluster.isPrimary) {
  console.log(`Primary ${process.pid} is running`);

  // Fork workers.
  for (let i = 0; i < numCPUs; i++) {
    cluster.fork();
  }

  cluster.on('exit', (worker, code, signal) => {
    console.log(`worker ${worker.process.pid} died`);
  });
} else {
  // Workers can share any TCP connection
  // In this case it is an HTTP server
  http.createServer((req, res) => {
    res.writeHead(200);
    res.end('hello world\n');
  }).listen(8000);

  console.log(`Worker ${process.pid} started`);
}

Running Node.js will now share port 8000 between the workers:

$ node server.js
Primary 3596 is running
Worker 4324 started
Worker 4520 started
Worker 6056 started
Worker 5644 started

Imagine a theater that directs guests to different identical rooms to watch the same movie.

You can’t share memory between the primary process and child processes, but you can still communicate through a message channel.

import cluster from 'node:cluster';
import http from 'node:http';
import { availableParallelism } from 'node:os';
import process from 'node:process';

if (cluster.isPrimary) {

  // Keep track of http requests
  let numReqs = 0;
  setInterval(() => {
    console.log(`numReqs = ${numReqs}`);
  }, 1000);

  // Count requests
  function messageHandler(msg) {
    if (msg.cmd && msg.cmd === 'notifyRequest') {
      numReqs += 1;
    }
  }

  // Start workers and listen for messages containing notifyRequest
  const numCPUs = availableParallelism();
  for (let i = 0; i < numCPUs; i++) {
    cluster.fork();
  }

  for (const id in cluster.workers) {
    cluster.workers[id].on('message', messageHandler);
  }

} else {

  // Worker processes have a http server.
  http.Server((req, res) => {
    res.writeHead(200);
    res.end('hello world\n');

    // Notify primary about the request
    process.send({ cmd: 'notifyRequest' });
  }).listen(8000);
}

Tip
This is useful when you have a CPU-intensive API or service and want to utilize all available cores.

worker_threads

Info
The node:worker_threads module enables the use of threads that execute JavaScript in parallel.

To access it:

import worker from 'node:worker_threads';

Note
Workers (threads) are useful for performing CPU-intensive JavaScript operations. They do not help much with I/O-intensive work. The Node.js built-in asynchronous I/O operations are more efficient than Workers can be.

Unlike child_process or cluster, worker_threads can share memory. They do so by transferring ArrayBuffer instances or sharing SharedArrayBuffer instances.
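
As a minimal sketch of that shared memory (the inline eval worker is only for brevity), both threads below touch the same SharedArrayBuffer through Atomics:

import { Worker } from 'node:worker_threads';

const shared = new SharedArrayBuffer(4);
const counter = new Int32Array(shared);

const worker = new Worker(
  `
  const { workerData, parentPort } = require('node:worker_threads');
  const counter = new Int32Array(workerData);
  Atomics.add(counter, 0, 1); // written in the worker, no copy involved
  parentPort.postMessage('done');
  `,
  { eval: true, workerData: shared }
);

worker.on('message', () => {
  console.log(`counter = ${Atomics.load(counter, 0)}`); // counter = 1
});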

Imagine when you buy a ticket, some popcorn, and a drink: the cashier prints the ticket, while another employee grabs the popcorn, and yet another pours the drink into a cup. Then they hand all of those things to you.

Tip
This is useful when a specific task contains CPU-bound (not I/O-bound) logic that needs parallelization.
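
For example, a CPU-bound loop like the one benchmarked later in this post could be moved off the main thread like this (task-worker.js and the iteration count are illustrative):

// main.js
import { Worker } from 'node:worker_threads';

const worker = new Worker(new URL('./task-worker.js', import.meta.url), {
  workerData: 1_000_000, // hypothetical iteration count
});

worker.on('message', (result) => console.log(`result: ${result}`));
worker.on('error', console.error);

// task-worker.js
import { parentPort, workerData } from 'node:worker_threads';

// The heavy loop runs here, so the main thread's event loop stays free
let x = Math.random();
for (let i = 0; i < workerData; i++) {
  x += Math.random() * i;
}
parentPort.postMessage(x);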

poolifier

poolifier is a library that implements worker pools on top of worker_threads and cluster. It allows you to use a fixed or dynamic worker pool without any extra complexity.

// worker.ts
import type { Server } from 'node:http'
import type { AddressInfo } from 'node:net'

import express, { type Express, type Request, type Response } from 'express'
import { ClusterWorker } from 'poolifier'

import type { WorkerData, WorkerResponse } from './types.js'

class ExpressWorker extends ClusterWorker<WorkerData, WorkerResponse> {
  private static server: Server

  public constructor () {
    super(ExpressWorker.startExpress, {
      killHandler: () => {
        ExpressWorker.server.close()
      },
    })
  }

  private static readonly factorial = (n: bigint | number): bigint => {
    if (n === 0 || n === 1) {
      return 1n
    }
    n = BigInt(n)
    let factorial = 1n
    for (let i = 1n; i <= n; i++) {
      factorial *= i
    }
    return factorial
  }

  private static readonly startExpress = (
    workerData?: WorkerData
  ): WorkerResponse => {
    // eslint-disable-next-line @typescript-eslint/no-non-null-assertion
    const { port } = workerData!

    const application: Express = express()

    // Parse only JSON requests body
    application.use(express.json())

    application.all('/api/echo', (req: Request, res: Response) => {
      res.send(req.body).end()
    })

    application.get('/api/factorial/:number', (req: Request, res: Response) => {
      const { number } = req.params
      res
        .send({
          number: ExpressWorker.factorial(Number.parseInt(number)).toString(),
        })
        .end()
    })

    let listenerPort: number | undefined
    ExpressWorker.server = application.listen(port, () => {
      listenerPort = (ExpressWorker.server.address() as AddressInfo).port
      console.info(
        `⚡️[express server]: Express server is started in cluster worker at http://localhost:${listenerPort.toString()}/`
      )
    })
    return {
      port: listenerPort ?? port,
      status: true,
    }
  }
}

export const expressWorker = new ExpressWorker()

// server.ts
import { dirname, extname, join } from 'node:path'
import { fileURLToPath } from 'node:url'
import { availableParallelism, FixedClusterPool } from 'poolifier'

import type { WorkerData, WorkerResponse } from './types.js'

const workerFile = join(
  dirname(fileURLToPath(import.meta.url)),
  `worker${extname(fileURLToPath(import.meta.url))}`
)

const pool = new FixedClusterPool<WorkerData, WorkerResponse>(
  availableParallelism(),
  workerFile,
  {
    enableEvents: false,
    errorHandler: (e: Error) => {
      console.error('Cluster worker error:', e)
    },
    onlineHandler: () => {
      pool
        .execute({ port: 8080 })
        .then(response => {
          if (response.status) {
            console.info(
              `Express is listening in cluster worker on port ${response.port?.toString()}`
            )
          }
          return undefined
        })
        .catch((error: unknown) => {
          console.error('Express failed to start in cluster worker:', error)
        })
    },
  }
)

Don’t believe me

Warning

I lied

Even if it sounds convincing, don’t take any of my words as truth. Instead, do your own research and experiment to find out the result.

I have made some simple performance tests but you can create your own quite easily.

For a simple Hello World! Node HTTP server

$ wrk -t 10 -c 100 -d 60 http://localhost:8000  # this is single process server
Running 1m test @ http://localhost:8000
  10 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.30ms    1.28ms 110.13ms   99.70%
    Req/Sec     7.89k   338.74     8.55k    88.29%
  4721090 requests in 1.00m, 688.86MB read
Requests/sec:  78551.96
Transfer/sec:     11.46MB

$ wrk -t 10 -c 100 -d 60 http://localhost:8001  # this is cluster with 8 child_process
Running 1m test @ http://localhost:8001
  10 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.38ms    5.62ms 126.62ms   92.84%
    Req/Sec    10.05k     1.11k   19.48k    75.45%
  6007714 requests in 1.00m, 0.86GB read
Requests/sec:  99956.59
Transfer/sec:     14.58MB

Looking at the stats, we only gain 1.27x more requests/sec, but 1.8x the latency, with 8 worker processes 🤔 In this test, the server only takes the request and responds with Hello World! text, so the Node.js built-in asynchronous I/O operations handle this kind of work much better than spawning a bunch of cluster workers does. Of course, in the real world we don’t serve such a simple Hello World!, so this is not surprising.

The cluster mode shines when we need to do CPU-intensive tasks, so we need to simulate one to see the difference.

CPU-bound task

Our task function will run through a loop of 1 million iterations.

async function task() {
  let x = Math.random();
  for (let i = 0; i < 1_000_000; i++) {
    x = x + Math.random() * i;
  }
  return x;
}

export default task;

Tip
The node:crypto module often handles CPU-intensive tasks.
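
For example, synchronous key derivation is a classic event-loop blocker (the parameters here are arbitrary):

import { pbkdf2Sync } from 'node:crypto';

// One synchronous call can occupy the CPU for a noticeable amount of time
const key = pbkdf2Sync('password', 'salt', 100_000, 64, 'sha512');
console.log(key.toString('hex'));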

And don’t get me wrong for using async on non-async code. In reality, we often see async functions handle complex logic. Those functions have a hidden cost if we just look at where they are called, thinking we can get away without a performance issue.
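
For reference, the single-process server in this benchmark could look roughly like this (a sketch, not necessarily the exact code I ran); the cluster variant wraps the same handler in the primary/worker pattern shown earlier:

import http from 'node:http';
import task from './task.js';

http.createServer(async (req, res) => {
  // Despite the await, task() is pure CPU work and blocks the event loop
  const x = await task();
  res.writeHead(200);
  res.end(`result: ${x}\n`);
}).listen(8000);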

$ wrk -t 10 -c 100 -d 60 --timeout 1s http://127.0.0.1:8000  # this is single process server
Running 1m test @ http://127.0.0.1:8000
  10 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   515.43ms  146.44ms 985.61ms   82.31%
    Req/Sec    24.89     17.68   130.00     67.89%
  9545 requests in 1.00m, 1.58MB read
  Socket errors: connect 0, read 0, write 0, timeout 141
Requests/sec:    158.81
Transfer/sec:     26.93KB

$ wrk -t 10 -c 100 -d 60 --timeout 1s http://127.0.0.1:8001  # this is cluster with 8 child_process
Running 1m test @ http://127.0.0.1:8001
  10 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   132.52ms   42.04ms 981.30ms   92.05%
    Req/Sec    76.52     18.18   343.00     74.20%
  45640 requests in 1.00m, 7.56MB read
Requests/sec:    759.42
Transfer/sec:    128.79KB

In this test, I added a --timeout 1s option to see how many slow requests happen. On our single-process server, 141 requests timed out. Average latency is much higher and requests/sec is much lower compared to the server with multiple cluster workers.

Again, don’t take any numbers in this blog post as your conclusion. I cooked them to look nice. I hope you will run your own experiments and draw your own conclusions.