Abstract:

This work presents a performance study of CPU-based scaling on ARM Cortex-A53 processors, specifically targeting the NXP i.MX 8M Mini and Raspberry Pi 3, which share the same CPU architecture. The goal is to evaluate the feasibility of implementing a real-time software scaler as part of a broader video pipeline, in scenarios where dedicated hardware acceleration or GPU-based solutions are not available.
The experiments implement multiple interpolation algorithms (nearest neighbour, bilinear, and bicubic) in user space, using applications written in C, and in kernel space, using custom Linux drivers. The kernel-space experiments use both dma_alloc_coherent() and vmalloc() memory allocation strategies to study their performance impact. Benchmarking results are collected for various input/output resolutions and analyzed in terms of frames per second (FPS) and scaling latency.
The results demonstrate that while high-quality interpolation methods like bicubic are computationally expensive and unsuitable for real-time use on a CPU alone, simpler techniques such as nearest neighbour and bilinear interpolation can achieve acceptable frame rates under certain resolution constraints. The study also highlights trade-offs between kernel and user-space approaches, as well as between memory allocation strategies, providing valuable insights into CPU-only scaling performance on embedded ARM SoCs.

Introduction:

This work investigates the performance and feasibility of a CPU-only video scaling pipeline on ARM Cortex-A53-based embedded platforms, specifically the NXP i.MX 8M Mini and Raspberry Pi 3. The experiment performs all operations purely in software, without using OpenGL, hardware shaders, or other GPU-accelerated APIs.
To simulate a video stream, we devised a simple pattern:
  • Initialize a source frame with synthetic pixel data (e.g., RGBA pattern)
  • Scale it to a target resolution using any scaling algorithm (e.g., nearest neighbour, bilinear, bicubic) in C
  • Copy the scaled frame to a simulated output memory
  • Repeat the operation 100 times to emulate a video stream
The core idea is to benchmark how well the CPU can handle repeated read → scale → write operations in real time across resolution changes. Performance metrics such as total time taken and frame processing consistency were measured to evaluate suitability for real-time embedded use. By keeping the implementation minimal, multi-threaded (using OpenMP), and hardware-agnostic, this work aims to understand:
  • The raw scaling capability of ARM CPUs
  • The system’s ability to maintain a target frame rate (e.g., 60 FPS or better)
  • How memory bandwidth and CPU load affect consistent scaling
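
The frame-rate target above translates directly into a per-frame time budget. The following is a minimal sketch of that arithmetic, assuming the benchmark reports one total time for all frames. The helper names fps_from_total(), frame_latency_ms(), and meets_target() are illustrative and do not come from the original code.

```c
#include <assert.h>
#include <stdbool.h>

/* Average frames per second for `frames` iterations completed in
 * `total_seconds`. Hypothetical helper, not part of the original benchmark. */
double fps_from_total(int frames, double total_seconds) {
    assert(frames > 0 && total_seconds > 0.0);
    return frames / total_seconds;
}

/* Average time spent on one frame, in milliseconds. */
double frame_latency_ms(int frames, double total_seconds) {
    assert(frames > 0 && total_seconds > 0.0);
    return 1000.0 * total_seconds / frames;
}

/* True when the measured average meets the target rate. A 60 FPS target
 * corresponds to a budget of roughly 16.67 ms per frame. */
bool meets_target(int frames, double total_seconds, double target_fps) {
    return fps_from_total(frames, total_seconds) >= target_fps;
}
```

For example, 100 frames completed in 2.0 s average 50 FPS (20 ms per frame), which misses a 60 FPS target. An average like this hides frame-to-frame jitter, so judging consistency would additionally need per-frame timestamps.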

Simple User Space C Code

1. Basic Memory Read and Write

The first step in evaluating CPU-based performance for image and video processing was to establish a simple baseline by measuring the time taken for repeated memory read and write operations. This served as a foundational test, helping to understand how efficiently the CPU can move raw frame data in memory, without any image processing overhead. For this purpose, a straightforward C program was written that allocates a single 1920×1080 resolution frame in memory, fills it with random pixel values, and then repeatedly copies this frame data into another allocated buffer.

To simulate video-like behavior, the copy operation was performed a hundred times, mimicking 100 frames being read and written during a typical playback scenario. The copying process used the standard memcpy() function, which provides a fast and reliable way to benchmark raw memory bandwidth. OpenMP was employed to parallelize the loop executing the copy operation across multiple threads (four, in this case) to test how well the system scales with parallel memory tasks. The use of omp_get_wtime() allowed precise timing of the entire loop, giving insight into total operation latency.

This baseline setup revealed key characteristics of the system’s memory performance. Since no processing was done on the image data, the test isolated the cost of memory I/O alone. It helped understand how well the CPU handles frame-sized memory blocks under repeated access and gave an initial idea of what portion of frame processing time might be consumed purely by memory movement, independent of any image manipulation like scaling or filtering. This experiment served as the groundwork before progressing to more complex operations like CPU-based scaling.
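
The baseline described above can be sketched roughly as follows. This is not the original program: it assumes 4 bytes per pixel (RGBA), and it splits each frame copy by rows so that threads write disjoint regions, whereas the original may parallelize the loop differently. The names memcpy_baseline() and now_seconds() are hypothetical. When compiled without OpenMP, the pragma is ignored and timing falls back to clock_gettime().

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() in strict C modes */
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define FRAME_W 1920
#define FRAME_H 1080
#define BYTES_PER_PIXEL 4   /* assumption: RGBA, 8 bits per channel */

/* Wall-clock time in seconds. */
static double now_seconds(void) {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
#endif
}

/* Copies one 1920x1080 frame `frames` times and returns the elapsed
 * seconds, or a negative value if allocation or verification fails. */
double memcpy_baseline(int frames) {
    assert(frames > 0);
    const size_t row_bytes = (size_t)FRAME_W * BYTES_PER_PIXEL;
    const size_t frame_bytes = row_bytes * FRAME_H;
    unsigned char *src = malloc(frame_bytes);
    unsigned char *dst = malloc(frame_bytes);
    if (!src || !dst) { free(src); free(dst); return -1.0; }

    for (size_t i = 0; i < frame_bytes; i++)
        src[i] = (unsigned char)(rand() & 0xFF);   /* random pixel data */

    double start = now_seconds();
    for (int f = 0; f < frames; f++) {
        /* Each thread copies its own set of rows. */
        #pragma omp parallel for num_threads(4)
        for (int row = 0; row < FRAME_H; row++)
            memcpy(dst + row * row_bytes, src + row * row_bytes, row_bytes);
    }
    double elapsed = now_seconds() - start;

    int ok = memcmp(src, dst, frame_bytes) == 0;   /* untimed sanity check */
    free(src);
    free(dst);
    return ok ? elapsed : -1.0;
}
```

Calling memcpy_baseline(100) mirrors the 100-frame run described in the text; dividing 100 by the returned time gives the copy-only frame rate that any scaling algorithm has to stay below.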

2. Scaling Operation – Basic Mechanism – Nearest Neighbour Algorithm

After establishing a memory-only baseline, the next logical step was to introduce basic image processing, specifically resolution scaling. The objective was to scale a source image of any input resolution up to 1920×1080 using only CPU resources, without involving hardware acceleration such as a GPU. The process takes pixel data from a smaller source image and maps it onto a larger output image. For each pixel position in the output, a corresponding pixel from the input image is picked and its values are copied to the output.
The resizing is done by calculating a ratio between the original dimensions and the target dimensions. These ratios are then used to determine where in the input image the pixel data should be taken from. For instance, if an image of 640×480 resolution is being scaled to 1920×1080, then each output pixel is mapped to a pixel from the input based on how the dimensions scale proportionally.
This method of resizing doesn’t involve any averaging or blending; it simply picks pixel data from the nearest location in the source image and places it in the output image. The entire operation is straightforward and focused on achieving resolution transformation without any graphical enhancements or filters. This made it ideal for initial CPU load and performance evaluation, without involving the GPU or other hardware accelerators.

Mathematical Formulation:
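
The formula itself is not reproduced in this text. Consistent with the C implementation that follows, the mapping can be written as below, where W_s × H_s and W_d × H_d denote the source and destination sizes, S and D the source and destination images, and (x_d, y_d) an output pixel. These symbols are chosen here for illustration.

```latex
\[
\begin{aligned}
r_x &= \frac{W_s}{W_d}, & r_y &= \frac{H_s}{H_d}, \\
x_s &= \lfloor x_d \, r_x \rfloor, & y_s &= \lfloor y_d \, r_y \rfloor, \\
D(x_d, y_d) &= S(x_s, y_s).
\end{aligned}
\]
```

For the 640×480 to 1920×1080 case, r_x = 1/3 and r_y = 4/9, so the output pixel (960, 540) reads the source pixel (320, 240).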

Implementation in C

#include <string.h>  /* memcpy */

void scaleResolution(Resolution* src, Resolution* dst) {
  /* Source pixels advanced per output pixel along each axis */
  float x_ratio = (float)src->width / dst->width;
  float y_ratio = (float)src->height / dst->height;

  #pragma omp parallel for collapse(2)
  for (int y = 0; y < dst->height; y++) {
    for (int x = 0; x < dst->width; x++) {
      /* Truncation selects the source pixel covering this output position */
      int srcX = (int)(x * x_ratio);
      int srcY = (int)(y * y_ratio);
      int srcIndex = (srcY * src->width + srcX) * PIXEL_SIZE;
      int dstIndex = (y * dst->width + x) * PIXEL_SIZE;
      /* Copy all channels of the pixel in one go */
      memcpy(&dst->data[dstIndex], &src->data[srcIndex], PIXEL_SIZE);
    }
  }
}
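
The function above relies on a Resolution type and a PIXEL_SIZE constant that are not shown. The following is a minimal sketch of what they might look like, assuming 4-byte RGBA pixels in a tightly packed buffer. The field layout and the helpers resolution_alloc(), resolution_free(), and pixel_offset() are assumptions made for illustration, not the original definitions.

```c
#include <assert.h>
#include <stdlib.h>

#define PIXEL_SIZE 4   /* assumption: RGBA, one byte per channel */

/* Assumed layout: dimensions plus a tightly packed pixel buffer. */
typedef struct {
    int width;
    int height;
    unsigned char *data;   /* width * height * PIXEL_SIZE bytes */
} Resolution;

/* Allocates a zero-filled frame; returns NULL on failure. */
Resolution *resolution_alloc(int width, int height) {
    assert(width > 0 && height > 0);
    Resolution *r = malloc(sizeof *r);
    if (!r) return NULL;
    r->width = width;
    r->height = height;
    r->data = calloc((size_t)width * height, PIXEL_SIZE);
    if (!r->data) { free(r); return NULL; }
    return r;
}

/* Releases a frame created by resolution_alloc(); NULL is accepted. */
void resolution_free(Resolution *r) {
    if (r) { free(r->data); free(r); }
}

/* Byte offset of pixel (x, y), using the same row-major indexing as
 * scaleResolution(). */
size_t pixel_offset(const Resolution *r, int x, int y) {
    return ((size_t)y * r->width + x) * PIXEL_SIZE;
}
```

Under these assumptions a 640×480 frame occupies 1,228,800 bytes and a 1920×1080 frame 8,294,400 bytes, so the int index arithmetic in the scaler stays well within range at these sizes.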

Experimentation Result:

To understand the performance of the nearest neighbor scaling algorithm on embedded platforms, we conducted a series of experiments on two different devices: the Raspberry Pi 3 and the i.MX 8M Mini. Each test involved scaling an image from an input resolution to a specified output resolution 100 times. The time taken for these 100 iterations and the corresponding frames per second (FPS) were recorded.

Performance of Nearest Neighbour Algorithm on Raspberry Pi 3:

Performance of Nearest Neighbour Algorithm on i.MX 8M Mini:

Key Observations:

  • Across all input resolutions, the i.MX 8M Mini consistently delivers much faster processing times and higher frame rates (FPS) than the Raspberry Pi 3
  • Both devices were also tested on a no-scaling path. The i.MX 8M Mini still processes this faster than the Pi 3, which suggests higher memory read/write bandwidth or a faster memory subsystem
The main advantage of this method is speed and simplicity. It needs only one source read and one write per output pixel, with no arithmetic beyond the index calculation. On devices like the Raspberry Pi 3 or the i.MX 8M Mini, this algorithm can scale moderately sized frames (e.g., 1024×768 → 1920×1080) in just a few milliseconds, depending on the hardware and compiler optimization.
The drawback lies in its visual artifacts. It produces neither smooth gradients nor anti-aliasing, making it visually inferior to bilinear or bicubic interpolation. Artifacts are especially pronounced during upscaling, where many output pixels map to a single input pixel.

3. Bilinear Interpolation

When scaling images, one of the most commonly used algorithms beyond the simplest nearest neighbour approach is bilinear interpolation. It strikes a balance between computational complexity and visual quality, making it suitable for a wide range of real-time and offline applications. Bilinear interpolation works by considering the four nearest pixel values in the original image that surround the target location. Instead of simply selecting the closest pixel, as done in nearest neighbour interpolation, bilinear interpolation performs a weighted average of these four pixels. The weight is based on the relative distance of the new pixel’s location from the original pixels.
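
The weighted average described above can be illustrated for a single colour channel. This is a sketch under stated assumptions: the function name bilinear_sample() and its parameter names are hypothetical, and the weights are taken as fractional offsets in the range 0 to 1 measured from the top-left neighbour.

```c
#include <assert.h>

/* Blends four neighbouring samples of one colour channel.
 * tl, tr, bl, br: top-left, top-right, bottom-left, bottom-right values.
 * wx, wy: fractional offsets in [0, 1] from the top-left sample. */
float bilinear_sample(float tl, float tr, float bl, float br,
                      float wx, float wy) {
    assert(wx >= 0.0f && wx <= 1.0f && wy >= 0.0f && wy <= 1.0f);
    float top    = tl * (1.0f - wx) + tr * wx;   /* blend along x, upper row */
    float bottom = bl * (1.0f - wx) + br * wx;   /* blend along x, lower row */
    return top * (1.0f - wy) + bottom * wy;      /* blend the rows along y */
}
```

With neighbours 10, 20, 30, and 40 and both weights at 0.5, the result is 25, the plain average. With both weights at 0 the result is the top-left value itself, which is the point where bilinear and nearest neighbour output coincide.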

Mathematical Formulation:
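
The formula itself is not reproduced in this text. Consistent with the C implementation that follows, it can be written as below, where W_s × H_s and W_d × H_d are the source and destination sizes, S and D the source and destination images, and α and β the horizontal and vertical weights. The symbol names are chosen here for illustration.

```latex
\[
\begin{aligned}
r_x &= \frac{W_s - 1}{W_d}, & r_y &= \frac{H_s - 1}{H_d}, \\
u &= x_d \, r_x, & v &= y_d \, r_y, \\
x_L &= \lfloor u \rfloor, & y_T &= \lfloor v \rfloor, \\
x_H &= \min(x_L + 1,\; W_s - 1), & y_B &= \min(y_T + 1,\; H_s - 1), \\
\alpha &= u - x_L, & \beta &= v - y_T,
\end{aligned}
\]
\[
D(x_d, y_d) = (1 - \beta)\bigl[(1 - \alpha)\, S(x_L, y_T) + \alpha\, S(x_H, y_T)\bigr]
            + \beta\bigl[(1 - \alpha)\, S(x_L, y_B) + \alpha\, S(x_H, y_B)\bigr]
\]
```

The expression is evaluated independently for each colour channel.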

Implementation in C

void scaleResolutionBilinear(Resolution* src, Resolution* dst) {
  /* Map the output grid onto source coordinates 0 .. size - 1 */
  float x_ratio = ((float)src->width - 1) / dst->width;
  float y_ratio = ((float)src->height - 1) / dst->height;

  #pragma omp parallel for collapse(2)
  for (int y = 0; y < dst->height; y++) {
    for (int x = 0; x < dst->width; x++) {
      /* Fractional position of this output pixel in the source image */
      float srcX = x * x_ratio;
      float srcY = y * y_ratio;

      /* Four surrounding source pixels, clamped at the right/bottom edge */
      int xL = (int)srcX;
      int yT = (int)srcY;
      int xH = (xL + 1 < src->width) ? xL + 1 : xL;
      int yB = (yT + 1 < src->height) ? yT + 1 : yT;

      /* Distance from the top-left neighbour, in the range [0, 1) */
      float xWeight = srcX - xL;
      float yWeight = srcY - yT;

      int indexTL = (yT * src->width + xL) * PIXEL_SIZE;
      int indexTR = (yT * src->width + xH) * PIXEL_SIZE;
      int indexBL = (yB * src->width + xL) * PIXEL_SIZE;
      int indexBR = (yB * src->width + xH) * PIXEL_SIZE;
      int dstIndex = (y * dst->width + x) * PIXEL_SIZE;

      /* Interpolate each channel: along x on both rows, then along y */
      for (int c = 0; c < PIXEL_SIZE; c++) {
        float top = src->data[indexTL + c] * (1 - xWeight)
                  + src->data[indexTR + c] * xWeight;
        float bottom = src->data[indexBL + c] * (1 - xWeight)
                     + src->data[indexBR + c] * xWeight;
        dst->data[dstIndex + c] =
            (unsigned char)(top * (1 - yWeight) + bottom * yWeight);
      }
    }
  }
}

Experimentation Result:

The same experiment as for the nearest neighbour algorithm was repeated, with the following results:

Performance of Bilinear Interpolation on Raspberry Pi 3:

Performance of Bilinear Interpolation on i.MX 8M Mini:

Key Observations:

  • On average, the i.MX 8M Mini achieves ~2.9× higher FPS than the Raspberry Pi 3
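  • Interestingly, processing time remains nearly constant across different input resolutions on both platforms, likely because the loop runs once per output pixel, so the cost depends mainly on the output resolution rather than the input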
We’ve seen how nearest neighbour and bilinear interpolation perform on ARM Cortex‑A53 platforms.
But what about more advanced techniques like bicubic interpolation, and how do memory allocation strategies in kernel space influence performance? Read on in the next blog to find out.
