January 25, 2026
Build a High-Performance Asynchronous Video Reader/Writer
- python
- multi-threading
- asynchronous
In this post, I’ll walk you through my experience of building a high-performance, asynchronous video reader and writer using multithreading. We’ll first look at why video I/O in Python often becomes a performance bottleneck, then dive into a practical approach to redesign the pipeline for near-zero latency and much higher throughput.
1. Standard Video I/O in Python
For most computer vision engineers, handling video input and output is a routine task. In practice, the go-to solution is OpenCV, using VideoCapture to read frames and VideoWriter to write them back to a video file.
import cv2

cap = cv2.VideoCapture("input.mp4")
if not cap.isOpened():
    exit()
frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = cap.get(cv2.CAP_PROP_FPS)
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out = cv2.VideoWriter("output.mp4", fourcc, fps, (frame_width, frame_height))
while True:
    ret, frame = cap.read()
    if not ret:
        break
    out.write(frame)
    if cv2.waitKey(1) == ord('q'):
        break
cap.release()
out.release()
cv2.destroyAllWindows()
However, profiling this loop shows noticeable latency in both the read and write operations, as summarized in the table below:
| Operation | Average Time (ms) | Theoretical FPS (Max) |
|---|---|---|
| cap.read() | ~4.96 ms | ~201 FPS |
| out.write(frame) | ~12.74 ms | ~78 FPS |
Overall performance: In a sequential (serial) processing loop, the minimum time required to handle a single frame is approximately 17.7 ms.
This corresponds to roughly 56 FPS. With a target of 30 FPS (33.3 ms per frame), the current system performs very well. However, once you start adding computationally intensive image processing algorithms—such as AI-based detection or segmentation—the processing time ($T_{process}$) will increase, which in turn reduces the overall FPS of the pipeline.
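The arithmetic above can be sketched in a few lines. The helper name `serial_fps` and the 30 ms processing cost are illustrative assumptions, not measurements from this post; the read and write times are the averages from the table.

```python
def serial_fps(read_ms: float, process_ms: float, write_ms: float) -> float:
    """Frames per second when read, process, and write run one after another."""
    return 1000.0 / (read_ms + process_ms + write_ms)


# I/O alone, using the measured averages from the table above.
io_only_fps = serial_fps(4.96, 0.0, 12.74)
# Same I/O plus a hypothetical 30 ms processing step (assumed value).
with_processing_fps = serial_fps(4.96, 30.0, 12.74)

print(f"I/O only: {io_only_fps:.1f} FPS")
print(f"With 30 ms of processing: {with_processing_fps:.1f} FPS")
```

Under that assumed 30 ms step, the serial loop lands at about 21 FPS, which is below the 30 FPS target even though the I/O on its own looked comfortable.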
2. Faster Video I/O with intermediate frame data
When a video is reused multiple times across different stages of a system, it’s often unnecessary—and inefficient—to repeatedly read from or write to the video file. A more effective approach is to perform video I/O only once, then work with the intermediate data in memory-friendly formats.
In this case, each frame is extracted and stored as a np.ndarray. Saving frames in NumPy format allows downstream processing stages to load data much faster than decoding the video file again, and it avoids the extra quality loss that re-encoding with a lossy codec would introduce. This strategy significantly reduces I/O overhead and is especially useful in pipelines that involve multiple processing passes, experimentation, or heavy computer vision workloads.
The following example demonstrates how to extract frames from a video and store them as np.ndarray for later reuse:
import os
import cv2
import numpy as np

cap = cv2.VideoCapture("input.mp4")
intermediate_folder = "intermediate_frames"  # folder for the cached frames
os.makedirs(intermediate_folder, exist_ok=True)
i = 0
while True:
    ret, frame = cap.read()
    if not ret:
        break
    frame_path = os.path.join(intermediate_folder, f"{i}.npy")
    np.save(frame_path, frame)  # save as a numpy array for later use
    intermediate_data = np.load(frame_path)  # load the same frame back (before i is incremented)
    i += 1
    if cv2.waitKey(1) == ord('q'):
        break
cap.release()
Overall performance: With this approach, the per-frame read time drops to ~1.54 ms and the per-frame write time drops to ~2.47 ms, both significant improvements over the OpenCV calls.
| Operation | Average Time (ms) | Theoretical FPS (Max) |
|---|---|---|
| np.load() | ~1.54 ms | ~649 FPS |
| np.save() | ~2.47 ms | ~404 FPS |
While storing intermediate frames as raw arrays improves processing speed, it introduces a new challenge: storage overhead. Each frame is saved uncompressed as an individual file, which quickly consumes a significant amount of disk space, especially for high-resolution or long-duration videos. For a 30-second Full HD (1920×1080) video at 30 FPS, that is 900 uncompressed frames, or approximately 5.3 GB of data.
| Operation | Storage (MB) |
|---|---|
| cv2.VideoCapture() + cv2.VideoWriter() | ~35 MB |
| np.save() + np.load() | ~5300 MB |
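The storage figure can be sanity-checked with a quick calculation. This is a rough sketch that assumes 8-bit frames with 3 color channels (the usual layout of frames returned by OpenCV) and ignores the small header that each .npy file adds; the helper name `raw_video_bytes` is hypothetical.

```python
def raw_video_bytes(width: int, height: int, fps: int, seconds: int, channels: int = 3) -> int:
    """Size of the uncompressed uint8 pixel data for every frame of a clip."""
    frames = fps * seconds
    return width * height * channels * frames


# 30-second Full HD clip at 30 FPS, i.e. 900 frames.
total_bytes = raw_video_bytes(1920, 1080, fps=30, seconds=30)
print(f"{total_bytes / 1024**2:.0f} MiB of raw pixel data")
```

The result is a little over 5,300 MiB, which is in line with the ~5300 MB reported in the table and roughly 150 times the size of the encoded video.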
3. High-Performance Asynchronous Video Reader/Writer
The core idea behind this approach is to decouple video I/O from the main processing pipeline. Instead of reading and writing video frames synchronously in the main thread—which often becomes a performance bottleneck—we move these operations into dedicated background threads.
In this design, video reading and writing are handled independently, allowing the main thread to focus entirely on frame processing (e.g. inference, filtering, or transformation). This separation significantly reduces idle time caused by blocking I/O operations and leads to a much smoother and more predictable frame rate.
In this section, we’ll focus on the asynchronous video reader. The asynchronous writer follows the same design principles and can be implemented in a very similar way.
import queue
import threading
from typing import Tuple, Optional

import cv2
import numpy as np


class AsyncReader:
    def __init__(self, video_path: str, queue_size: int = 100):
        self._video_path = video_path
        self.__queue_size = queue_size
        self.__reader_queue: Optional[queue.Queue[Optional[np.ndarray]]] = None
        self.__read_thread: Optional[threading.Thread] = None
        self.__stop: Optional[bool] = None
        self.__reader: Optional[cv2.VideoCapture] = None

    def init_reader(self):
        if self.__reader is None:
            self.__reader = cv2.VideoCapture(self._video_path)
        self._init_queue()  # always init queue at the end of init_reader for asynchronous reading

    def release_reader(self):
        self._release_queue()  # stop the background thread first so it never reads from a released capture
        if self.__reader is not None:
            self.__reader.release()
            self.__reader = None

    def _next(self) -> Tuple[bool, Optional[np.ndarray]]:
        try:
            ret, frame = self.__reader.read()
            if not ret:
                return False, None
            return True, frame
        except Exception:
            return False, None

    def read(self) -> Tuple[bool, Optional[np.ndarray]]:
        frame = self.__get_from_queue()
        return frame is not None, frame

    def _init_queue(self):
        if self.__reader_queue is None:
            self.__reader_queue = queue.Queue(maxsize=self.__queue_size)
        self.__stop = False  # set before the thread starts so the producer sees a valid flag
        if self.__read_thread is None:
            self.__read_thread = threading.Thread(target=self.__put_to_queue, daemon=True)
            self.__read_thread.start()

    def _release_queue(self):
        self.__stop = True  # ask the producer loop to exit
        if self.__read_thread is not None:
            self.__read_thread.join()
            self.__read_thread = None
        self.__reader_queue = None  # drop the queue together with any frames still buffered

    def __put_to_queue(self):
        while not self.__stop:
            ret, frame = self._next()
            frame = frame if ret else None  # None is the end-of-video sentinel
            while not self.__stop:
                try:
                    # use a timeout so a full queue cannot block shutdown forever
                    self.__reader_queue.put(frame, timeout=0.1)
                    break
                except queue.Full:
                    continue
            if not ret:
                break

    def __get_from_queue(self) -> Optional[np.ndarray]:
        frame = self.__reader_queue.get()
        return frame
Design Overview
The asynchronous reader works as a producer–consumer pipeline:
- A background thread continuously reads frames from the video source.
- Each frame is pushed into a thread-safe queue.
- The main thread simply pulls frames from the queue whenever it is ready to process them.
By buffering frames ahead of time, we hide the latency of cv2.VideoCapture.read() and avoid blocking the main loop.
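The producer-consumer pattern described above can be shown in isolation, without OpenCV. In this minimal sketch, integers stand in for frames and the generator name `prefetch` is hypothetical; it is an illustration of the pattern, not the AsyncReader implementation itself.

```python
import queue
import threading
from typing import Iterable, Iterator, Optional


def prefetch(source: Iterable[int], buffer_size: int = 4) -> Iterator[int]:
    """Yield items from source while a background thread reads ahead into a bounded queue."""
    buffer: queue.Queue = queue.Queue(maxsize=buffer_size)

    def producer() -> None:
        for item in source:
            buffer.put(item)  # blocks while the queue is full (backpressure)
        buffer.put(None)      # sentinel: the source is exhausted

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item: Optional[int] = buffer.get()  # blocks only while the queue is empty
        if item is None:
            break
        yield item


# Integers stand in for video frames here.
frames = list(prefetch(range(10)))
print(frames)
```

Because there is a single producer and a single consumer on a FIFO queue, the items come out in the order they were read. Using None as the sentinel means the source itself must never yield None.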
Code Walkthrough
class AsyncReader:
    def __init__(self, video_path: str, queue_size: int = 100):
The constructor initializes the reader with:
- video_path: path to the input video.
- queue_size: maximum number of frames buffered in memory. This acts as backpressure to prevent unbounded memory growth.
self.__reader_queue: Optional[queue.Queue[Optional[np.ndarray]]] = None
self.__read_thread: Optional[threading.Thread] = None
self.__reader: Optional[cv2.VideoCapture] = None
These internal variables store:
- A thread-safe queue for frame buffering.
- A background thread responsible for reading frames.
- An OpenCV VideoCapture object for low-level video access.
Initializing and Releasing Resources
def init_reader(self):
    if self.__reader is None:
        self.__reader = cv2.VideoCapture(self._video_path)
    self._init_queue()
init_reader() opens the video file and starts the asynchronous reading thread.
The queue is always initialized last to ensure the reader is ready before frames start flowing.
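def release_reader(self):
    self._release_queue()  # stop the background thread before releasing the capture it reads from
    if self.__reader is not None:
        self.__reader.release()
release_reader() shuts down the background thread first and then releases the video capture object, so the thread never reads from a capture that has already been released.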
Low-level Frame Reading
def _next(self):
    ret, frame = self.__reader.read()
    if not ret:
        return False, None
    return True, frame
This method performs the actual blocking read() call. It is only used by the background thread, never by the main thread.
Background Producer Thread
def __put_to_queue(self):
    while True:
        ret, frame = self._next()
        frame = frame if ret else None
        self.__reader_queue.put(frame)
        if not ret:
            break
This loop continuously:
- Reads the next frame from the video.
- Pushes it into the queue.
- Inserts a None sentinel when the video ends, signaling consumers to stop.
Main-thread Consumer API
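def read(self):
    frame = self.__get_from_queue()
    return frame is not None, frame
From the caller’s perspective, read() behaves just like cv2.VideoCapture.read() and returns the same (success, frame) pair, but it does not wait on video decoding. If frames are already buffered, the call returns immediately; it only blocks when the queue is empty and the background thread has not yet produced the next frame.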
Why This Works
By moving video decoding into a separate thread and buffering frames in advance:
- I/O latency is hidden behind computation
- The main loop remains responsive and deterministic
- Overall FPS becomes limited by processing time, not video I/O
This pattern is especially effective when combined with heavy workloads such as deep learning inference, tracking, or segmentation.
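The post does not list the asynchronous writer, so here is a minimal sketch of how the same idea might look on the write side. The class name `AsyncWriterSketch` and its `write_fn` parameter are hypothetical and are not the author's AsyncWriter; the sketch takes any callable that consumes a frame, so it can be run without a video file.

```python
import queue
import threading
from typing import Any, Callable


class AsyncWriterSketch:
    """Hypothetical writer: a background thread drains a queue and calls write_fn per frame."""

    _SENTINEL = object()  # unique marker that tells the background thread to finish

    def __init__(self, write_fn: Callable[[Any], Any], queue_size: int = 100):
        self._write_fn = write_fn
        self._queue: queue.Queue = queue.Queue(maxsize=queue_size)
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def write(self, frame: Any) -> None:
        # Blocks only while the queue is full, not for the duration of the write itself.
        self._queue.put(frame)

    def release(self) -> None:
        # The sentinel is queued behind all pending frames, so they are flushed before join returns.
        self._queue.put(self._SENTINEL)
        self._thread.join()

    def _drain(self) -> None:
        while True:
            item = self._queue.get()
            if item is self._SENTINEL:
                break
            self._write_fn(item)


# Stand-in for a real sink: collect "frames" in a list.
written = []
writer = AsyncWriterSketch(written.append, queue_size=4)
for i in range(10):
    writer.write(i)
writer.release()
print(written)
```

With OpenCV, `write_fn` could be the `write` method of a cv2.VideoWriter such as `out` from the first example, with `out.release()` called only after `writer.release()` has returned so that no buffered frame is lost.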
4. Benchmark Results
Performance Benchmark
In this section, we run a simple benchmark to compare the performance of three different approaches to video I/O:
- The standard OpenCV pipeline using VideoCapture and VideoWriter
- An intermediate-frame approach based on np.save() / np.load()
- The proposed asynchronous pipeline using AsyncReader and AsyncWriter
Benchmark Setup
All experiments were conducted using a Full HD (1920×1080) video, 30 seconds in duration, recorded at 30 FPS.
To better reflect real-world workloads, each pipeline included a heavy task placeholder (e.g., AI inference or complex image processing) in the main processing loop.
This setup allows us to measure not only raw I/O performance, but also how well each approach overlaps I/O with computation.
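To illustrate what overlapping I/O with computation means for such a measurement, the sketch below simulates the setup with time.sleep. The 5 ms read cost, 10 ms processing cost, and all function names are assumptions chosen for illustration, not the actual benchmark code or measurements from this post.

```python
import queue
import threading
import time

READ_S, PROCESS_S, FRAMES = 0.005, 0.010, 20  # simulated costs, not measurements


def fake_read() -> int:
    time.sleep(READ_S)  # stands in for cap.read()
    return 0


def fake_process(frame: int) -> None:
    time.sleep(PROCESS_S)  # stands in for the heavy task placeholder


def run_serial() -> float:
    """Read and process one after another in a single thread."""
    start = time.perf_counter()
    for _ in range(FRAMES):
        fake_process(fake_read())
    return time.perf_counter() - start


def run_overlapped() -> float:
    """Read in a background thread while the main thread processes."""
    buffer: queue.Queue = queue.Queue(maxsize=8)

    def producer() -> None:
        for _ in range(FRAMES):
            buffer.put(fake_read())

    start = time.perf_counter()
    threading.Thread(target=producer, daemon=True).start()
    for _ in range(FRAMES):
        fake_process(buffer.get())
    return time.perf_counter() - start


serial_s = run_serial()
overlapped_s = run_overlapped()
print(f"serial: {serial_s * 1000:.0f} ms, overlapped: {overlapped_s * 1000:.0f} ms")
```

In the serial loop the per-frame cost is the sum of the read and processing times, while in the overlapped loop it approaches the larger of the two, which is why the effective I/O time can shrink toward zero when processing dominates.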
Results:
The benchmark results are summarized in the table below:
| Method | Average Time (ms) | Storage (MB) |
|---|---|---|
| VideoCapture + VideoWriter | ~17.7 ms | ~35 MB |
| np.load + np.save | ~4.01 ms | ~5300 MB |
| AsyncReader + AsyncWriter | ~0 ms | ~35 MB |
Discussion
The standard OpenCV approach is simple and storage-efficient, but its synchronous nature introduces noticeable I/O latency in each processing iteration.
Using np.load() and np.save() dramatically reduces per-frame I/O time, but at the cost of extreme storage
overhead, making it impractical for long or high-resolution videos.
The asynchronous pipeline achieves the best balance: I/O latency is almost completely hidden by computation, resulting in near-zero effective read/write time, while maintaining the same storage footprint as the standard video pipeline.
Note: This benchmark assumes that video I/O is combined with a heavy processing task (e.g., AI inference). Under
these conditions, AsyncReader and AsyncWriter truly shine, as they are able to overlap I/O with computation. In
lightweight pipelines without significant processing, the performance gains may be less pronounced.
5. Conclusion
In this post, we explored several approaches to handling video I/O in Python, starting from the standard OpenCV pipeline, moving through intermediate-frame caching with NumPy, and finally arriving at a high-performance asynchronous design.
While the default VideoCapture and VideoWriter APIs are simple and storage-efficient, their synchronous nature
makes them a poor fit for pipelines that include heavy computation. Storing intermediate frames as NumPy arrays
significantly reduces I/O latency, but the massive storage overhead quickly becomes a limiting factor.
The asynchronous reader/writer approach strikes the best balance. By decoupling video I/O from the main processing loop and overlapping it with computation, we can effectively hide read/write latency and keep the pipeline running at a stable, predictable frame rate—even under heavy workloads such as AI inference or complex computer vision tasks. At the same time, it preserves the compact storage footprint of standard video files.
In practice, this pattern is most beneficial when:
- Video processing involves expensive per-frame computation
- Stable FPS and low latency are more important than code simplicity
- The system needs to scale to longer videos or higher resolutions
Although the asynchronous design introduces additional complexity, the performance gains often justify the trade-off in real-world, production-grade systems.
If you’re building a computer vision pipeline that struggles with I/O bottlenecks, moving video reading and writing off the main thread is a simple yet powerful optimization—and a pattern well worth adding to your engineering toolbox.