January 24, 2026

Build a Resilient Streaming Platform with FFMPEG

ffmpeg
rtmp
streaming

When working with Live Streaming systems (especially when integrating with Computer Vision or AI pipelines), we often treat FFmpeg as a "black box": push data in and hope it streams smoothly to the server.

However, in a production environment, network jitter or FFmpeg process crashes are inevitable. This post will guide you from a naive implementation to a resilient architecture capable of self-healing and handling high loads.

1. The naive approach

Let's start with the simplest approach: You have audio and video data (e.g., from OpenCV or an AI Model), and you write them into FFmpeg pipes sequentially.

In this example, we assume the use of Named Pipes (FIFO) to clearly separate Audio and Video inputs.

audio_pipe_fp = "/tmp/audio_pipe"
video_pipe_fp = "/tmp/video_pipe"

audio_chunk = b'\x00' * 4096  # Simulated PCM data
video_frame = b'\xFF' * 10000 # Simulated Encoded Frame

# create pipes
if not os.path.exists(audio_pipe_fp):
    os.mkfifo(audio_pipe_fp)
if not os.path.exists(video_pipe_fp):
    os.mkfifo(video_pipe_fp)

# create file descriptors
audio_fd = os.open(audio_pipe_fp, os.O_RDWR)
video_fd = os.open(video_pipe_fp, os.O_RDWR)

while True:
    # write audio bytes to pipe
    os.write(audio_fd, audio_chunk)

    # write video bytes to pipe
    os.write(video_fd, video_frame)

    time.sleep(0.04)  # 25 fps

Corresponding FFmpeg command for testing:

ffmpeg -re -f s16le -ar 44100 -ac 2 -i /tmp/audio_pipe \
       -f h264 -i /tmp/video_pipe \
       -c:v copy -c:a aac -f flv rtmp://localhost/live/stream

In the Python snippet above, we initialize dedicated Named Pipes for both audio and video. These pipes act as the bridge, accepting raw data from our script and acting as the input source for FFmpeg to ingest and stream.

Why use Pipes? You might wonder why we use pipes instead of writing to physical files. While pipes are treated as files by the OS, they reside entirely in RAM (In-Memory). This architecture completely bypasses disk I/O bottlenecks, ensuring the high-speed throughput and low latency required for real-time streaming systems.

Key Technical Takeaways:

Blocking I/O & Backpressure: The pipes are opened in blocking mode (e.g., using os.O_RDWR or os.O_RDWR). This creates a rigid flow control: if the pipe's buffer fills up, the os.write command will hang (block execution) until FFmpeg consumes enough data to free up space.
Synchronous Execution: The writing process is strictly serial. We write the audio chunk first, and only after that operation completes do we attempt to write the video frame.
Simulated Workload: The time.sleep(0.04) function serves as a placeholder to simulate the processing overhead required to maintain a steady 25 FPS frame rate.

2. Asynchronous I/O with Multi-threading

As we discussed regarding FFmpeg's ingestion mechanism, the engine typically requires simultaneous data from both the audio and video pipes to properly mux and stream the content.

The synchronous, blocking approach we used earlier creates a dangerous deadlock scenario. If the audio pipe fills up, the script freezes at os.write(audio_fd, audio_chunk). Meanwhile, FFmpeg might be starving for video data to continue processing, but our script cannot provide it because it is stuck waiting on the audio pipe. The result? The entire system hangs indefinitely.

The Solution: Decoupling with Multi-threading To fix this, we need to separate the audio and video write operations into two independent threads. This ensures that a blockage in one pipe never paralyzes the other.

class PipeWriter(threading.Thread):
    def __init__(self, path):
        super().__init__()
        self.path = path
        self.q = Queue(maxsize=100)  # Internal buffer to prevent RAM overflow
        self.running = True
        self.fd = None

    def run(self):
        # Open pipe in a separate thread to avoid blocking the main thread
        self.fd = os.open(self.path, os.O_WRONLY)
        while self.running:
            try:
                data = self.q.get(timeout=0.04)

                while True:
                    _, wlist, _ = select.select([], [self.fd], [], 0.01)
                    if not wlist: 
                        continue
                    os.write(self.fd, data)
                    break

            except Empty:
                continue
            except BrokenPipeError:
                break

        if self.fd: 
            os.close(self.fd)

    def write(self, data):
        while True:
            try:
                self.q.put(data, timeout=0.04)
                break
            except Full:
                continue

    def stop(self):
        self.running = False


# Usage
audio_writer = PipeWriter(audio_pipe_fp)
video_writer = PipeWriter(video_pipe_fp)

audio_writer.start()
video_writer.start()

while True:
    # ... generate data ...
    audio_writer.write(audio_chunk)
    video_writer.write(video_frame)

Enhancing Reliability with select: Beyond just threading, I have also introduced select.select into the workflow. Instead of blindly attempting to write, this allows us to poll the pipe's status first. This mechanism significantly improves system reliability and data integrity by ensuring the pipe is actually ready to accept data (writable) before we commit to the write operation.

3. Optimization: Increasing Pipe Buffer Size with fcntl

The Saturation Bottleneck: Streaming continuously for long durations places immense stress on the pipes, often leading to saturation and "Broken Pipe" errors. We found the audio pipe to be the most common point of failure. Because audio data is processed and queued rapidly, it keeps the standard pipe buffer in a perpetually full state, leaving zero margin for error.

The Fix - Tuning Buffer Capacity: A simple yet highly effective solution is to increase the operating system's buffer size for these pipes.

import fcntl

F_SETPIPE_SZ = 1031
size = 1024 * 1024  # 1 MB
try:
    fcntl.fcntl(fd, F_SETPIPE_SZ, size)
    print(f"Pipe buffer increased to {size} bytes")
except OSError as e:
    print(f"Could not set pipe size: {e}")

The Results: The impact of this optimization was drastic. By expanding the buffer size from the default 64KB to 1MB, we extended the system's continuous uptime from a mere 8–9 hours to a stable 5–7 days.

4. Fail-over: Health Checks and Self-Healing

As previously noted, pipe exhaustion is an inevitability during long-running streaming sessions. To address this, we implement a fail-over mechanism.

It is important to clarify that this approach does not magically eliminate the root cause of broken pipes. Instead, it adheres to the "Design for Failure" principle. By enabling the system to detect failures and automatically provision fresh pipes, we ensure that a local I/O error triggers a graceful recovery routine rather than causing a cascading failure that crashes the entire system.

Failover Mechanism

The fail-over architecture is built upon two core pillars:

Health Monitoring: Continuous validation of the named pipes' connectivity and the lifecycle status of the active FFmpeg process.
Self-Healing (Auto-Recovery): Automated routines to gracefully terminate and respawn (re-create) both the pipes and the FFmpeg instance whenever a critical failure is detected.

The following Python implementation demonstrates this logic:

def health_check(self):
    if not self.process:
        return False

    if self.video_fifo_fp is None:
        return False

    if self.audio_fifo_fp is None:
        return False

    return_code = self.process.poll()
    if return_code is not None:
        return False

    return True

def failover(primary_instance, standby_instance=None):
    primary_instance.terminate()

    if standby_instance is None:  # now, always go into block because standby is not set
        primary_instance = create_instance()
    elif standby_instance.health_check():
        primary_instance = standby_instance
    else:
        primary_instance = create_instance()
        standby_instance.terminate()
    
    return primary_instance, standby_instance

Results:

Total Resilience: The system no longer crashes when a pipe breaks; it handles the failure gracefully.
Minimal Downtime: The recovery process (terminating and re-initializing pipes) incurs a brief downtime of approximately 2–3 seconds.

Architectural Note: While implementing a hot standby instance could reduce downtime significantly (to just 5–10 frames) and eliminate connection warm-up time, we decided against it for this specific use case. Keeping a redundant instance active consumes valuable system resources and processing power—both of which are critical constraints in high-performance streaming. Given that pipe failures occur infrequently (roughly once a week), the current "cold restart" design offers the optimal balance between resource efficiency and availability.

5. Conclusion

In this post, we have explored a resilient architecture for streaming data from Python to FFmpeg. We have demonstrated how to leverage asynchronous I/O and multi-threading to improve system reliability and minimize data loss. We have also demonstrated how to increase the pipe buffer size to prevent saturation and improve system performance. Finally, we have covered the fail-over mechanism that automatically recovers from system failures.