January 25, 2026

Build a High-Performance Asynchronous Video Reader/Writer

python
multi-threading
asynchronous

In this post, I’ll walk you through my experience of building a high-performance, asynchronous video reader and writer using multithreading. We’ll first look at why video I/O in Python often becomes a performance bottleneck, then dive into a practical approach to redesign the pipeline for near-zero latency and much higher throughput.

1. Standard Video I/O in Python

For most computer vision engineers, handling video input and output is a fairly standard task. In practice, the go-to solution is OpenCV, using VideoCapture to read frames and VideoWriter to write them back to a video file.

import cv2

cap = cv2.VideoCapture("input.mp4")

if not cap.isOpened():
    raise SystemExit("Could not open input.mp4")

# Reuse the input's geometry and frame rate for the output file
frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = cap.get(cv2.CAP_PROP_FPS)

fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out = cv2.VideoWriter("output.mp4", fourcc, fps, (frame_width, frame_height))

while True:
    ret, frame = cap.read()  # decode the next frame (blocking)

    if not ret:  # end of stream or read error
        break

    out.write(frame)  # encode and write the frame (blocking)

cap.release()
out.release()

However, profiling shows that this approach introduces noticeable latency in both the read and write operations, as summarized in the table below:

Operation	Average Time (ms)	Theoretical FPS (Max)
`cap.read()`	~4.96 ms	~201 FPS
`out.write(frame)`	~12.74 ms	~78 FPS

Overall performance: In a sequential (serial) processing loop, the minimum time required to handle a single frame is approximately 17.7 ms (4.96 ms to read plus 12.74 ms to write).

This corresponds to roughly 56 FPS. Against a target of 30 FPS (33.3 ms per frame), I/O alone fits comfortably, but it already consumes more than half of the frame budget and leaves only about 15.6 ms for processing. Once you add computationally intensive image processing algorithms, such as AI-based detection or segmentation, the processing time ($T_{process}$) can easily exceed that budget, which in turn pushes the overall FPS of the pipeline below the target.
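The arithmetic behind these numbers can be sketched in a few lines. The helper name `pipeline_fps` and the 20 ms processing time are illustrative assumptions, not measurements from the profiling above.

```python
def pipeline_fps(read_ms: float, write_ms: float, process_ms: float = 0.0) -> float:
    """Throughput of a serial loop: one frame costs read + process + write."""
    return 1000.0 / (read_ms + process_ms + write_ms)


# I/O only, using the averages from the table above
io_only = pipeline_fps(4.96, 12.74)

# Hypothetical 20 ms of per-frame processing (assumed value)
with_processing = pipeline_fps(4.96, 12.74, process_ms=20.0)

print(f"I/O only: {io_only:.1f} FPS")                       # 56.5 FPS
print(f"With 20 ms processing: {with_processing:.1f} FPS")  # 26.5 FPS
```

Under that assumed workload, the serial loop already falls short of the 30 FPS target, even though the I/O on its own looked fast enough.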

2. Faster Video I/O with Intermediate Frame Data

When a video is reused multiple times across different stages of a system, it’s often unnecessary—and inefficient—to repeatedly read from or write to the video file. A more effective approach is to perform video I/O only once, then work with the intermediate data in memory-friendly formats.

In this case, each frame is extracted as an np.ndarray and saved to disk as a .npy file. Because the array is stored uncompressed, downstream processing stages can load it much faster than decoding the video again, and no additional compression artifacts are introduced. This strategy significantly reduces I/O overhead and is especially useful in pipelines that involve multiple processing passes, experimentation, or heavy computer vision workloads.

The following example demonstrates how to extract frames from a video and store them as np.ndarray for later reuse:


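import os

import cv2
import numpy as np

cap = cv2.VideoCapture("input.mp4")

intermediate_folder = "intermediate_frames"  # create intermediate folder
os.makedirs(intermediate_folder, exist_ok=True)

# First pass: decode the video once and save every frame as a .npy file
i = 0
while True:
    ret, frame = cap.read()

    if not ret:
        break

    np.save(os.path.join(intermediate_folder, f"{i}.npy"), frame)  # save as a numpy array for later use
    i += 1

cap.release()

# Later passes: load frames from disk instead of decoding the video again
for j in range(i):
    intermediate_data = np.load(os.path.join(intermediate_folder, f"{j}.npy"))  # load the frame data for later use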

Overall performance: With this approach, the reading time drops from ~4.96 ms to ~1.54 ms and the writing time drops from ~12.74 ms to ~2.47 ms, roughly 3x and 5x faster respectively.

Operation	Average Time (ms)	Theoretical FPS (Max)
`np.load()`	~1.54 ms	~649 FPS
`np.save()`	~2.47 ms	~404 FPS

While storing intermediate frames as raw arrays improves processing speed, it introduces a new challenge: storage overhead. Each frame is saved uncompressed as an individual file, which quickly consumes a significant amount of disk space, especially for high-resolution or long-duration videos. For a Full HD (1920×1080), 30-second video at 30 FPS, this translates to approximately 5.3 GB of data.

Operation	Storage (MB)
`cv2.VideoCapture()` + `cv2.VideoWriter()`	~35 MB
`np.save()` + `np.load()`	~5300 MB
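The ~5300 MB figure can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes uncompressed 8-bit BGR frames (3 bytes per pixel) and ignores the small header that np.save() adds to each file.

```python
WIDTH, HEIGHT, CHANNELS = 1920, 1080, 3   # Full HD, 8-bit BGR (assumed layout)
FPS, DURATION_S = 30, 30                  # 30-second clip at 30 FPS

bytes_per_frame = WIDTH * HEIGHT * CHANNELS   # one uint8 per channel
num_frames = FPS * DURATION_S
total_bytes = bytes_per_frame * num_frames

print(f"{bytes_per_frame / 2**20:.2f} MiB per frame")  # 5.93 MiB per frame
print(f"{num_frames} frames")                          # 900 frames
print(f"{total_bytes / 2**20:.0f} MiB total")          # 5339 MiB total
```

That is roughly 150 times the size of the ~35 MB encoded video, which matches the order of magnitude reported in the table.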

3. High-Performance Asynchronous Video Reader/Writer

The core idea behind this approach is to decouple video I/O from the main processing pipeline. Instead of reading and writing video frames synchronously in the main thread—which often becomes a performance bottleneck—we move these operations into dedicated background threads.

In this design, video reading and writing are handled independently, allowing the main thread to focus entirely on frame processing (e.g. inference, filtering, or transformation). This separation significantly reduces idle time caused by blocking I/O operations and leads to a much smoother and more predictable frame rate.

In this section, we’ll focus on the asynchronous video reader. The asynchronous writer follows the same design principles and can be implemented in a very similar way.

import queue
import threading
from typing import Optional, Tuple

import cv2
import numpy as np


class AsyncReader:
    def __init__(self, video_path: str, queue_size: int = 100):
        self._video_path = video_path

        self.__queue_size = queue_size
        self.__reader_queue: Optional[queue.Queue[Optional[np.ndarray]]] = None
        self.__read_thread: Optional[threading.Thread] = None
        self.__stop = threading.Event()  # set to ask the background thread to exit
        self.__reader: Optional[cv2.VideoCapture] = None

    def init_reader(self):
        if self.__reader is None:
            self.__reader = cv2.VideoCapture(self._video_path)

        self._init_queue()  # always init queue at the end of init_reader for asynchronous reading

    def release_reader(self):
        self._release_queue()  # stop the background thread first; it still reads from the capture

        if self.__reader is not None:
            self.__reader.release()
            self.__reader = None

    def _next(self) -> Tuple[bool, Optional[np.ndarray]]:
        try:
            ret, frame = self.__reader.read()
            if not ret:
                return False, None
            return True, frame
        except Exception:
            return False, None

    def read(self) -> Tuple[bool, Optional[np.ndarray]]:
        frame = self.__get_from_queue()
        return frame is not None, frame

    def _init_queue(self):
        if self.__reader_queue is None:
            self.__reader_queue = queue.Queue(maxsize=self.__queue_size)
        if self.__read_thread is None:
            self.__stop.clear()  # reset the flag before the thread starts
            self.__read_thread = threading.Thread(target=self.__put_to_queue, daemon=True)
            self.__read_thread.start()

    def _release_queue(self):
        self.__stop.set()  # unblocks the producer even if the queue is full
        if self.__read_thread is not None:
            self.__read_thread.join()
            self.__read_thread = None
        self.__reader_queue = None  # drop any frames that are still buffered

    def __put_to_queue(self):
        while not self.__stop.is_set():
            ret, frame = self._next()
            frame = frame if ret else None  # None is the end-of-stream sentinel
            # Put with a timeout so a full queue cannot block shutdown forever
            while not self.__stop.is_set():
                try:
                    self.__reader_queue.put(frame, timeout=0.1)
                    break
                except queue.Full:
                    continue
            if not ret:
                break

    def __get_from_queue(self) -> Optional[np.ndarray]:
        frame = self.__reader_queue.get()  # blocks only while the queue is empty
        return frame

Design Overview

The asynchronous reader works as a producer–consumer pipeline:

A background thread continuously reads frames from the video source.
Each frame is pushed into a thread-safe queue.
The main thread simply pulls frames from the queue whenever it is ready to process them.

By buffering frames ahead of time, we hide the latency of cv2.VideoCapture.read() and avoid blocking the main loop.
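The three steps above can be sketched with nothing but the standard library. This is a stripped-down illustration rather than the reader itself: the strings stand in for decoded frames, and the names `producer`, `consume`, and `SENTINEL` are made up for the example.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream


def producer(items, q: queue.Queue) -> None:
    """Background thread: push every item, then the sentinel."""
    for item in items:
        q.put(item)      # blocks while the queue is full (backpressure)
    q.put(SENTINEL)


def consume(q: queue.Queue) -> list:
    """Main thread: pull items until the sentinel arrives."""
    received = []
    while True:
        item = q.get()   # blocks only while the queue is empty
        if item is SENTINEL:
            break
        received.append(item)
    return received


q = queue.Queue(maxsize=4)
fake_frames = [f"frame-{i}" for i in range(10)]  # stand-ins for decoded frames
t = threading.Thread(target=producer, args=(fake_frames, q), daemon=True)
t.start()
result = consume(q)
t.join()
print(result == fake_frames)  # True: the FIFO queue preserves frame order
```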

Code Walkthrough

class AsyncReader:
    def __init__(self, video_path: str, queue_size: int = 100):

The constructor initializes the reader with:

video_path: path to the input video.
queue_size: maximum number of frames buffered in memory. This acts as backpressure to prevent unbounded memory growth.
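To get a feel for what queue_size means in memory terms, here is a rough estimate. It assumes uncompressed 8-bit, 3-channel frames; the helper name `buffer_bytes` is hypothetical.

```python
def buffer_bytes(queue_size: int, width: int, height: int, channels: int = 3) -> int:
    """Worst-case memory held by a full queue of uint8 frames."""
    return queue_size * width * height * channels


full_hd_default = buffer_bytes(100, 1920, 1080)
print(f"{full_hd_default / 2**20:.0f} MiB")  # 593 MiB
```

With the default of 100 frames, a full queue of Full HD frames can hold close to 600 MiB, so it may be worth lowering queue_size for high-resolution sources or memory-constrained machines.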

self.__reader_queue: Optional[queue.Queue[Optional[np.ndarray]]] = None
self.__read_thread: Optional[threading.Thread] = None
self.__reader: Optional[cv2.VideoCapture] = None

These internal variables store:

A thread-safe queue for frame buffering.
A background thread responsible for reading frames.
An OpenCV VideoCapture object for low-level video access.

Initializing and Releasing Resources

def init_reader(self):
    if self.__reader is None:
        self.__reader = cv2.VideoCapture(self._video_path)
    self._init_queue()

init_reader() opens the video file and starts the asynchronous reading thread.

The queue is always initialized last to ensure the reader is ready before frames start flowing.

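def release_reader(self):
    self._release_queue()
    if self.__reader is not None:
        self.__reader.release()

release_reader() gracefully shuts down the reader: it stops the background thread first and then releases the video capture object, so the thread is never left reading from a capture that has already been released.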

Low-level Frame Reading

def _next(self):
    ret, frame = self.__reader.read()
    if not ret:
        return False, None
    return True, frame

This method performs the actual blocking read() call. It is only used by the background thread, never by the main thread.

Background Producer Thread

def __put_to_queue(self):
    while True:
        ret, frame = self._next()
        frame = frame if ret else None
        self.__reader_queue.put(frame)
        if not ret:
            break

This loop continuously:

Reads the next frame from the video.
Pushes it into the queue.
Inserts a None sentinel when the video ends, signaling consumers to stop.

Main-thread Consumer API

def read(self):
    frame = self.__get_from_queue()
    return frame is not None, frame

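From the caller’s perspective, read() has the same interface as cv2.VideoCapture.read(), but the decoding cost is paid in the background thread. If frames are already buffered, the call returns immediately; if the queue is empty, it blocks until the producer delivers the next frame or the end-of-stream sentinel.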

Why This Works

By moving video decoding into a separate thread and buffering frames in advance:

I/O latency is hidden behind computation
The main loop remains responsive and deterministic
Overall FPS becomes limited by processing time, not video I/O
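The effect can be reproduced with a small simulation. The sketch below uses time.sleep() as a stand-in for both the blocking read and the per-frame processing, with assumed durations of 5 ms and 10 ms; it is a toy model, not a benchmark of real video I/O.

```python
import queue
import threading
import time

N_FRAMES = 20
READ_S = 0.005      # simulated blocking read (assumed value)
PROCESS_S = 0.010   # simulated per-frame processing (assumed value)


def fake_read(i: int) -> int:
    time.sleep(READ_S)   # sleep releases the GIL, like a blocking I/O call
    return i


def fake_process(frame: int) -> None:
    time.sleep(PROCESS_S)


def run_sequential() -> float:
    start = time.perf_counter()
    for i in range(N_FRAMES):
        fake_process(fake_read(i))
    return time.perf_counter() - start


def run_overlapped() -> float:
    q: queue.Queue = queue.Queue(maxsize=8)

    def producer() -> None:
        for i in range(N_FRAMES):
            q.put(fake_read(i))
        q.put(None)  # sentinel: no more frames

    start = time.perf_counter()
    threading.Thread(target=producer, daemon=True).start()
    while (frame := q.get()) is not None:
        fake_process(frame)
    return time.perf_counter() - start


sequential_s = run_sequential()
overlapped_s = run_overlapped()
print(f"sequential: {sequential_s * 1000:.0f} ms, overlapped: {overlapped_s * 1000:.0f} ms")
```

In the sequential run every frame pays for read plus process, while in the overlapped run the total approaches the processing time alone. Real gains depend on how much of the processing releases the GIL, so treat the simulation as an upper bound on what to expect.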

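This pattern is especially effective when combined with heavy workloads such as deep learning inference, tracking, or segmentation. It works in CPython despite the global interpreter lock because OpenCV's Python bindings release the lock while a frame is being decoded or encoded, so the background thread can do its I/O while the main thread keeps running.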
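The post does not list the AsyncWriter, so here is a minimal sketch of what it could look like under the same design. Everything in it is an assumption: the class layout, the method names init_writer/release_writer (chosen to mirror the reader), and the choice to pass in an already opened writer object. `FakeWriter` exists only so the example runs without OpenCV; in practice you would pass a cv2.VideoWriter.

```python
import queue
import threading
from typing import Any, Optional


class AsyncWriter:
    """Hypothetical counterpart to AsyncReader: writes frames on a background thread."""

    def __init__(self, writer: Any, queue_size: int = 100):
        # `writer` is any object with write(frame) and release(),
        # for example an opened cv2.VideoWriter.
        self._writer = writer
        self._queue: queue.Queue = queue.Queue(maxsize=queue_size)
        self._thread: Optional[threading.Thread] = None

    def init_writer(self) -> None:
        if self._thread is None:
            self._thread = threading.Thread(target=self._drain_queue, daemon=True)
            self._thread.start()

    def write(self, frame: Any) -> None:
        # Returns as soon as the frame is queued; blocks only if the queue is full.
        self._queue.put(frame)

    def release_writer(self) -> None:
        if self._thread is not None:
            self._queue.put(None)   # sentinel: flush remaining frames, then stop
            self._thread.join()
            self._thread = None
        self._writer.release()

    def _drain_queue(self) -> None:
        while True:
            frame = self._queue.get()
            if frame is None:
                break
            self._writer.write(frame)


class FakeWriter:
    """Stand-in sink so the sketch runs without OpenCV."""

    def __init__(self):
        self.frames, self.released = [], False

    def write(self, frame):
        self.frames.append(frame)

    def release(self):
        self.released = True


sink = FakeWriter()
writer = AsyncWriter(sink, queue_size=4)
writer.init_writer()
for i in range(10):
    writer.write(f"frame-{i}")
writer.release_writer()
print(len(sink.frames), sink.released)  # 10 True
```

One design point to keep in mind: the queue stores references, not copies. If the main loop modifies a frame in place after calling write(), pass a copy of the frame instead.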

4. Benchmark Results

Performance Benchmark

In this section, we run a simple benchmark to compare the performance of three different approaches to video I/O:

The standard OpenCV pipeline using VideoCapture and VideoWriter
An intermediate-frame approach based on np.save() / np.load()
The proposed asynchronous pipeline using AsyncReader and AsyncWriter

Benchmark Setup

All experiments were conducted using a Full HD (1920×1080) video, 30 seconds in duration, recorded at 30 FPS.

To better reflect real-world workloads, each pipeline included a heavy task placeholder (e.g., AI inference or complex image processing) in the main processing loop.

This setup allows us to measure not only raw I/O performance, but also how well each approach overlaps I/O with computation.
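The post does not show the measurement code, so the following is only one plausible way to obtain such numbers: time the read and write calls as seen by the main loop and exclude the processing step. The function `measure_io_ms` and the fake reader are assumptions made for illustration.

```python
import time
from typing import Callable, Tuple


def measure_io_ms(read: Callable[[], Tuple[bool, object]],
                  process: Callable[[object], object],
                  write: Callable[[object], None]) -> float:
    """Average time per frame the main loop spends inside read() and write()."""
    io_s, frames = 0.0, 0
    while True:
        t0 = time.perf_counter()
        ok, frame = read()
        io_s += time.perf_counter() - t0
        if not ok:
            break

        result = process(frame)   # heavy task placeholder, not counted as I/O

        t0 = time.perf_counter()
        write(result)
        io_s += time.perf_counter() - t0
        frames += 1
    return 1000.0 * io_s / max(frames, 1)


# Stand-ins so the sketch runs without a video file (assumed values)
source = iter(range(5))


def fake_read():
    time.sleep(0.002)             # pretend decoding takes about 2 ms
    frame = next(source, None)
    return frame is not None, frame


written = []
avg_ms = measure_io_ms(fake_read, lambda f: f, written.append)
print(f"{avg_ms:.2f} ms of I/O per frame")
```

Passing the read() and write() methods of a synchronous or asynchronous pipeline into the same harness would let the three approaches be compared on equal terms.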

Results:

The benchmark results are summarized in the table below:

Method	Average Read + Write Time per Frame (ms)	Storage (MB)
`VideoCapture` + `VideoWriter`	~17.7 ms	~35 MB
`np.load` + `np.save`	~4.01 ms	~5300 MB
`AsyncReader` + `AsyncWriter`	~0 ms	~35 MB

Discussion

The standard OpenCV approach is simple and storage-efficient, but its synchronous nature introduces noticeable I/O latency in each processing iteration.

Using np.load() and np.save() dramatically reduces per-frame I/O time, but at the cost of extreme storage overhead, making it impractical for long or high-resolution videos.

The asynchronous pipeline achieves the best balance: I/O latency is almost completely hidden by computation, resulting in near-zero effective read/write time, while maintaining the same storage footprint as the standard video pipeline.

Note: This benchmark assumes that video I/O is combined with a heavy processing task (e.g., AI inference). Under these conditions, AsyncReader and AsyncWriter truly shine, as they are able to overlap I/O with computation. In lightweight pipelines without significant processing, the performance gains may be less pronounced.

5. Conclusion

In this post, we explored several approaches to handling video I/O in Python, starting from the standard OpenCV pipeline, moving through intermediate-frame caching with NumPy, and finally arriving at a high-performance asynchronous design.

While the default VideoCapture and VideoWriter APIs are simple and storage-efficient, their synchronous nature makes them a poor fit for pipelines that include heavy computation. Storing intermediate frames as NumPy arrays significantly reduces I/O latency, but the massive storage overhead quickly becomes a limiting factor.

The asynchronous reader/writer approach strikes the best balance. By decoupling video I/O from the main processing loop and overlapping it with computation, we can effectively hide read/write latency and keep the pipeline running at a stable, predictable frame rate—even under heavy workloads such as AI inference or complex computer vision tasks. At the same time, it preserves the compact storage footprint of standard video files.

In practice, this pattern is most beneficial when:

Video processing involves expensive per-frame computation
Stable FPS and low latency are more important than code simplicity
The system needs to scale to longer videos or higher resolutions

Although the asynchronous design introduces additional complexity, the performance gains often justify the trade-off in real-world, production-grade systems.

If you’re building a computer vision pipeline that struggles with I/O bottlenecks, moving video reading and writing off the main thread is a simple yet powerful optimization—and a pattern well worth adding to your engineering toolbox.