I’ve been working with containers professionally for several years now, using Docker and Kubernetes daily in production environments. Like many developers, I initially treated containers as a “black box” - I knew how to use them, but didn’t really understand what was happening under the hood. It wasn’t until I needed to debug a particularly stubborn container networking issue at work that I realized I needed to understand the underlying technology better.
In this post, I’ll break down container technology by building a simple container runtime in Python. We’ll explore the Linux primitives that make containers possible and implement them step by step. By the end, you’ll understand what’s actually happening under the hood when you run a container.
What Actually IS a Container?
Before we start building, let’s clear up a common misconception: containers are NOT lightweight virtual machines. This comparison, while convenient for explaining containers to newcomers, is technically misleading.
A virtual machine includes an entire operating system with its own kernel. Containers, on the other hand, share the host’s kernel and use Linux features to create isolated environments. Specifically, containers are built on three main Linux primitives:
- Namespaces - Provide isolation (process, network, filesystem, etc.)
- Control Groups (cgroups) - Limit and monitor resource usage (CPU, memory, I/O)
- Filesystem Isolation - Use chroot/pivot_root to change the root filesystem
When you run docker run ubuntu bash, Docker is essentially:
- Creating namespaces to isolate the process
- Setting up cgroups to limit resources
- Using an overlay filesystem to provide the Ubuntu root filesystem
- Executing /bin/bash in this isolated environment
Let’s build this ourselves to see exactly how it works.
Understanding Linux Namespaces
Namespaces are a Linux kernel feature that partitions kernel resources so that different processes see different views of the system. Linux provides several types of namespaces (a quick way to inspect them from Python follows the list):
- PID Namespace - Process isolation. Processes in a namespace only see processes within that namespace.
- Network Namespace - Network isolation. Each namespace has its own network devices, IP addresses, routing tables.
- Mount Namespace - Filesystem isolation. Each namespace can have its own mount points.
- UTS Namespace - Hostname isolation. Each namespace can have its own hostname.
- IPC Namespace - Inter-process communication isolation.
- User Namespace - User and group ID isolation.
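Before we write any isolation code of our own, a quick way to see namespaces in action is to look at /proc. Every process has a /proc/<pid>/ns directory whose entries name the namespaces it belongs to; two processes share a namespace exactly when the corresponding inode numbers match. Here’s a small sketch (the inode numbers printed will differ on your machine):

import os

# Each symlink under /proc/<pid>/ns reads like "pid:[4026531836]" -
# the number is the namespace's inode, which identifies it uniquely.
ns_dir = f"/proc/{os.getpid()}/ns"
for entry in sorted(os.listdir(ns_dir)):
    print(entry, "->", os.readlink(os.path.join(ns_dir, entry)))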
Let’s start by implementing the simplest form of isolation: PID namespaces.
Building Our Container Runtime
Step 1: Basic Process Isolation with PID Namespaces
Let’s create our first container that isolates processes:
#!/usr/bin/env python3
import os
import subprocess
import sys


def run_in_container(command):
    """
    Run a command in an isolated PID namespace.

    A new PID namespace only applies to *children* of the process that
    calls unshare(), so we unshare first and then fork: the child becomes
    PID 1 in the new namespace and won't see host processes.
    """
    print(f"Starting container with command: {command}")
    print(f"Parent process PID: {os.getpid()}")

    # Create a new PID namespace before forking
    # os.unshare() and the os.CLONE_* constants require Python 3.12+
    os.unshare(os.CLONE_NEWPID)

    # Create a child process - it becomes PID 1 in the new namespace
    pid = os.fork()

    if pid == 0:
        # Child process
        try:
            # Give the child its own mount namespace so remounting /proc
            # below doesn't change what the host sees
            os.unshare(os.CLONE_NEWNS)
            subprocess.run(['mount', '--make-rprivate', '/'],
                           stderr=subprocess.DEVNULL)

            # Mount /proc so we can see our isolated process tree
            # Note: This requires root privileges
            subprocess.run(['mount', '-t', 'proc', 'proc', '/proc'])

            print(f"Container process PID: {os.getpid()}")

            # Execute the command
            os.execvp(command[0], command)
        except Exception as e:
            print(f"Error in container: {e}")
            sys.exit(1)
    else:
        # Parent process - wait for child to complete
        os.waitpid(pid, 0)
        print("Container exited")


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python3 simple_container.py <command>")
        sys.exit(1)

    if os.geteuid() != 0:
        print("This script requires root privileges")
        sys.exit(1)

    command = sys.argv[1:]
    run_in_container(command)
Testing PID Isolation
Save this as simple_container.py and run it:
sudo python3 simple_container.py bash
Inside the container, try running:
ps aux # You'll only see processes in this namespace!
echo $$ # This will show PID 1
This is our first step towards a container - we’ve isolated the process tree!
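One practical note: os.unshare() and the os.CLONE_* constants only landed in Python 3.12. If you’re on an older interpreter, you can reach the underlying unshare(2) syscall through ctypes instead - a minimal sketch, using the CLONE_NEWPID value from <linux/sched.h>:

import ctypes
import os

CLONE_NEWPID = 0x20000000  # value from <linux/sched.h>

# CDLL(None) resolves symbols from the already-loaded C library on Linux
libc = ctypes.CDLL(None, use_errno=True)
if libc.unshare(CLONE_NEWPID) != 0:
    err = ctypes.get_errno()
    raise OSError(err, os.strerror(err))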
Step 2: Filesystem Isolation with chroot
Now let’s add filesystem isolation. We’ll create a minimal root filesystem and use chroot to change the root directory:
#!/usr/bin/env python3
import os
import shutil
import socket
import subprocess
import sys
import tempfile


def setup_rootfs(rootfs_path):
    """
    Create a minimal root filesystem.
    In production, this would come from container image layers.
    """
    print(f"Setting up root filesystem at {rootfs_path}")

    # Create basic directory structure
    dirs = ['bin', 'lib', 'lib64', 'usr', 'proc', 'sys', 'dev', 'tmp']
    for d in dirs:
        os.makedirs(os.path.join(rootfs_path, d), exist_ok=True)

    # Copy essential binaries (bash, ls, ps and hostname for the demo)
    binaries = ['/bin/bash', '/bin/ls', '/bin/ps', '/bin/hostname']
    for binary in binaries:
        if os.path.exists(binary):
            dest = os.path.join(rootfs_path, binary.lstrip('/'))
            shutil.copy2(binary, dest)
            # Copy required shared libraries
            copy_dependencies(binary, rootfs_path)


def copy_dependencies(binary, rootfs_path):
    """
    Copy shared library dependencies for a binary.
    Uses ldd to find dependencies, including the dynamic loader
    (whose line has no '=>' but starts with an absolute path).
    """
    try:
        result = subprocess.run(['ldd', binary],
                                capture_output=True,
                                text=True)
        for line in result.stdout.split('\n'):
            line = line.strip()
            if '=>' in line:
                parts = line.split('=>')
                lib_path = parts[1].strip().split()[0] if len(parts) > 1 else ''
            elif line.startswith('/'):
                # e.g. "/lib64/ld-linux-x86-64.so.2 (0x...)"
                lib_path = line.split()[0]
            else:
                continue
            if lib_path and os.path.exists(lib_path):
                dest = os.path.join(rootfs_path, lib_path.lstrip('/'))
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                if not os.path.exists(dest):
                    shutil.copy2(lib_path, dest)
    except Exception as e:
        print(f"Warning: Could not copy dependencies for {binary}: {e}")


def run_container(command, rootfs_path):
    """
    Run a command in an isolated container with its own filesystem.
    """
    print(f"Starting container with command: {command}")

    # A new PID namespace only applies to children, so unshare before
    # forking: the child below becomes PID 1 (requires Python 3.12+)
    os.unshare(os.CLONE_NEWPID)

    pid = os.fork()

    if pid == 0:
        # Child process
        try:
            # CLONE_NEWNS: new mount namespace
            # CLONE_NEWUTS: new hostname namespace
            os.unshare(os.CLONE_NEWNS | os.CLONE_NEWUTS)

            # Keep our mount changes from propagating back to the host
            subprocess.run(['mount', '--make-rprivate', '/'],
                           stderr=subprocess.DEVNULL)

            # Set hostname for this container
            hostname = "container"
            socket.sethostname(hostname)

            # Mount /proc for the container *before* chroot, because the
            # minimal rootfs has no mount binary of its own
            subprocess.run(['mount', '-t', 'proc', 'proc',
                            os.path.join(rootfs_path, 'proc')],
                           stderr=subprocess.DEVNULL)

            # Change root filesystem
            os.chroot(rootfs_path)
            os.chdir('/')

            print(f"Container started with hostname: {hostname}")
            print(f"Root filesystem: {rootfs_path}")

            # Execute the command
            os.execvp(command[0], command)
        except Exception as e:
            print(f"Error in container: {e}")
            sys.exit(1)
    else:
        # Parent process
        try:
            os.waitpid(pid, 0)
        except KeyboardInterrupt:
            print("\nContainer interrupted")
        print("Container exited")


if __name__ == "__main__":
    if os.geteuid() != 0:
        print("This script requires root privileges")
        sys.exit(1)

    if len(sys.argv) < 2:
        print("Usage: sudo python3 container_v2.py <command>")
        sys.exit(1)

    # Create temporary root filesystem
    rootfs_path = tempfile.mkdtemp(prefix='container_rootfs_')

    try:
        setup_rootfs(rootfs_path)
        command = sys.argv[1:]
        run_container(command, rootfs_path)
    finally:
        # Cleanup
        print(f"Cleaning up {rootfs_path}")
        shutil.rmtree(rootfs_path, ignore_errors=True)
Now when you run this, you’ll have a container with:
- Isolated process tree
- Isolated filesystem
- Custom hostname
sudo python3 container_v2.py bash
Try these commands inside:
hostname # Should show "container"
ls / # Should only see our minimal filesystem
ps aux # Only processes in this namespace
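A caveat about chroot: it only changes the root for path lookups, and a sufficiently privileged process can break out of it, which is why real runtimes such as runc use pivot_root(2) instead. Python’s os module has no pivot_root wrapper, but you can reach the raw syscall with ctypes. A rough sketch only - the syscall number 155 is for x86_64, and new_root must already be a mount point (bind-mount it onto itself) inside a private mount namespace:

import ctypes
import os

SYS_pivot_root = 155  # x86_64; other architectures use different numbers
libc = ctypes.CDLL(None, use_errno=True)

def pivot_root(new_root, put_old):
    # Swap the mount-namespace root: new_root becomes "/", the old root
    # is left mounted at put_old so it can be unmounted afterwards
    if libc.syscall(SYS_pivot_root, new_root.encode(), put_old.encode()) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

# Typical sequence inside a new, private mount namespace:
#   os.chdir(new_root)
#   os.makedirs("old_root", exist_ok=True)
#   pivot_root(".", "old_root")
#   os.chdir("/")
#   subprocess.run(["umount", "-l", "/old_root"])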
Step 3: Resource Limits with cgroups
Now let’s add resource limits using cgroups (control groups). This is what prevents a container from consuming all system resources:
#!/usr/bin/env python3
import os
import subprocess
import sys
import uuid


class CgroupManager:
    """
    Manages cgroups v2 for resource limiting.
    Modern Linux systems expose the unified v2 hierarchy at /sys/fs/cgroup.
    """

    def __init__(self, container_id):
        self.container_id = container_id
        self.cgroup_path = f"/sys/fs/cgroup/container_{container_id}"

    def create(self, memory_limit_mb=100, cpu_shares=512):
        """
        Create a cgroup with resource limits.

        Args:
            memory_limit_mb: Memory limit in megabytes
            cpu_shares: CPU shares (1024 = 100% of one CPU)
        """
        try:
            # Create cgroup directory
            os.makedirs(self.cgroup_path, exist_ok=True)

            # Set memory limit
            memory_limit_bytes = memory_limit_mb * 1024 * 1024
            with open(f"{self.cgroup_path}/memory.max", 'w') as f:
                f.write(str(memory_limit_bytes))

            # Set CPU limit
            # cpu.max format: "$MAX $PERIOD" (in microseconds)
            # For example, "50000 100000" means 50% of one CPU
            cpu_quota = int((cpu_shares / 1024) * 100000)
            with open(f"{self.cgroup_path}/cpu.max", 'w') as f:
                f.write(f"{cpu_quota} 100000")

            print("Created cgroup with limits:")
            print(f"  Memory: {memory_limit_mb}MB")
            print(f"  CPU: {cpu_shares}/1024 shares")
        except Exception as e:
            print(f"Warning: Could not set cgroup limits: {e}")
            print("Continuing without resource limits...")

    def add_process(self, pid):
        """Add a process to this cgroup."""
        try:
            with open(f"{self.cgroup_path}/cgroup.procs", 'w') as f:
                f.write(str(pid))
        except Exception as e:
            print(f"Warning: Could not add process to cgroup: {e}")

    def cleanup(self):
        """Remove the cgroup."""
        try:
            os.rmdir(self.cgroup_path)
        except Exception as e:
            print(f"Warning: Could not remove cgroup: {e}")


def run_container_with_limits(command, memory_mb=100, cpu_shares=512):
    """
    Run a container with resource limits.
    """
    container_id = str(uuid.uuid4())[:8]
    cgroup = CgroupManager(container_id)

    print(f"Container ID: {container_id}")

    # Create cgroup with limits
    cgroup.create(memory_limit_mb=memory_mb, cpu_shares=cpu_shares)

    # A new PID namespace only applies to children of the caller, so
    # unshare before forking (requires Python 3.12+)
    os.unshare(os.CLONE_NEWPID)

    pid = os.fork()

    if pid == 0:
        # Child process
        try:
            # Create the remaining namespaces
            os.unshare(os.CLONE_NEWUTS | os.CLONE_NEWNS)

            # Set hostname
            subprocess.run(['hostname', f'container-{container_id}'],
                           stderr=subprocess.DEVNULL)

            print(f"Container process started (PID: {os.getpid()})")

            # Execute command
            os.execvp(command[0], command)
        except Exception as e:
            print(f"Error in container: {e}")
            sys.exit(1)
    else:
        # Parent process
        try:
            # Add container process to cgroup
            cgroup.add_process(pid)

            # Wait for container to exit
            os.waitpid(pid, 0)
        except KeyboardInterrupt:
            print("\nContainer interrupted")
        finally:
            # Cleanup cgroup
            cgroup.cleanup()
            print("Container exited")


if __name__ == "__main__":
    if os.geteuid() != 0:
        print("This script requires root privileges")
        sys.exit(1)

    if len(sys.argv) < 2:
        print("Usage: sudo python3 container_v3.py <command> [memory_mb] [cpu_shares]")
        print("Example: sudo python3 container_v3.py bash 100 512")
        sys.exit(1)

    # Peel optional trailing integer arguments: <command...> [memory_mb] [cpu_shares]
    args = sys.argv[1:]
    memory_mb, cpu_shares = 100, 512
    if len(args) >= 3 and args[-1].isdigit() and args[-2].isdigit():
        memory_mb, cpu_shares = int(args[-2]), int(args[-1])
        args = args[:-2]
    elif len(args) >= 2 and args[-1].isdigit():
        memory_mb = int(args[-1])
        args = args[:-1]

    run_container_with_limits(args, memory_mb, cpu_shares)
To test the memory limit, inside the container try:
# This Python one-liner will try to allocate memory until it hits the limit
python3 -c "a = ['x' * 1024 * 1024 for i in range(200)]"
The process should be killed when it exceeds the memory limit!
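If instead you only see the “Could not set cgroup limits” warning, the likely culprit is controller delegation: on cgroups v2, a controller’s knobs (memory.max, cpu.max) only appear in a child cgroup once the parent has enabled that controller in its cgroup.subtree_control. A hedged sketch of the fix, run as root (adjust the path if your container cgroups live deeper in the hierarchy):

# Enable the cpu and memory controllers for cgroups created directly
# under the v2 root, so their memory.max / cpu.max files exist.
with open("/sys/fs/cgroup/cgroup.subtree_control", "w") as f:
    f.write("+cpu +memory")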
Step 4: Complete Container Runtime
Now let’s put everything together into a more complete container runtime that ties all of these pieces into one tool:
#!/usr/bin/env python3
"""
A minimal container runtime implementation in Python.
Demonstrates how Docker-like containers work under the hood.

Usage:
    sudo python3 container.py run <image_dir> <command>

Example:
    sudo python3 container.py run ./alpine sh
"""
import argparse
import os
import subprocess
import sys
import uuid
from pathlib import Path


class Container:
    """Represents a running container with full isolation."""

    def __init__(self, image_dir, command, memory_mb=512, cpu_shares=1024):
        self.id = str(uuid.uuid4())[:12]
        self.image_dir = Path(image_dir).resolve()
        self.command = command
        self.memory_mb = memory_mb
        self.cpu_shares = cpu_shares
        self.cgroup_path = f"/sys/fs/cgroup/container_{self.id}"

    def setup_cgroup(self):
        """Create and configure cgroup for resource limits."""
        try:
            os.makedirs(self.cgroup_path, exist_ok=True)

            # Memory limit
            with open(f"{self.cgroup_path}/memory.max", 'w') as f:
                f.write(str(self.memory_mb * 1024 * 1024))

            # CPU limit
            cpu_quota = int((self.cpu_shares / 1024) * 100000)
            with open(f"{self.cgroup_path}/cpu.max", 'w') as f:
                f.write(f"{cpu_quota} 100000")

            print(f"[{self.id}] Resource limits: {self.memory_mb}MB RAM, "
                  f"{self.cpu_shares}/1024 CPU")
        except Exception as e:
            print(f"Warning: Could not set cgroups: {e}")

    def join_cgroup(self):
        """Move the calling process into the container's cgroup."""
        try:
            with open(f"{self.cgroup_path}/cgroup.procs", 'w') as f:
                f.write(str(os.getpid()))
        except Exception as e:
            print(f"Warning: Could not join cgroup: {e}")

    def setup_network(self):
        """
        Setup network namespace and virtual network interface.
        In a real implementation, this would create veth pairs,
        bridges, and configure iptables for NAT.
        """
        try:
            # Create new network namespace
            os.unshare(os.CLONE_NEWNET)

            # Bring up loopback interface
            subprocess.run(['ip', 'link', 'set', 'lo', 'up'],
                           stderr=subprocess.DEVNULL)

            print(f"[{self.id}] Network namespace created")
        except Exception as e:
            print(f"Warning: Could not setup network: {e}")

    def setup_filesystem(self):
        """Setup isolated filesystem inside our private mount namespace."""
        try:
            # Ensure image directory exists
            if not self.image_dir.exists():
                raise Exception(f"Image directory not found: {self.image_dir}")

            # Remount everything private to avoid propagation to the host
            subprocess.run(['mount', '--make-rprivate', '/'],
                           stderr=subprocess.DEVNULL)

            # Change to the container's root
            os.chroot(str(self.image_dir))
            os.chdir('/')

            # Mount essential filesystems
            os.makedirs('/proc', exist_ok=True)
            subprocess.run(['mount', '-t', 'proc', 'proc', '/proc'],
                           stderr=subprocess.DEVNULL)

            os.makedirs('/sys', exist_ok=True)
            subprocess.run(['mount', '-t', 'sysfs', 'sys', '/sys'],
                           stderr=subprocess.DEVNULL)

            os.makedirs('/dev', exist_ok=True)
            subprocess.run(['mount', '-t', 'devtmpfs', 'dev', '/dev'],
                           stderr=subprocess.DEVNULL)

            os.makedirs('/tmp', exist_ok=True)
            subprocess.run(['mount', '-t', 'tmpfs', 'tmpfs', '/tmp'],
                           stderr=subprocess.DEVNULL)

            print(f"[{self.id}] Filesystem isolated (root: {self.image_dir})")
        except Exception as e:
            raise Exception(f"Failed to setup filesystem: {e}")

    def run(self):
        """Run the container."""
        print(f"[{self.id}] Starting container...")
        print(f"[{self.id}] Command: {' '.join(self.command)}")

        # Setup cgroup first
        self.setup_cgroup()

        pid = os.fork()

        if pid == 0:
            # Child process - this becomes the container
            try:
                # Join the cgroup *before* forking again so the actual
                # container process inherits the resource limits
                self.join_cgroup()

                # Create new namespaces
                os.unshare(
                    os.CLONE_NEWPID |   # Process isolation
                    os.CLONE_NEWNS |    # Mount isolation
                    os.CLONE_NEWUTS |   # Hostname isolation
                    os.CLONE_NEWIPC     # IPC isolation
                )

                # Fork again to become PID 1 in the new namespace
                container_pid = os.fork()

                if container_pid == 0:
                    # Grandchild - this is PID 1 in the container

                    # Set hostname
                    subprocess.run(['hostname', f'container-{self.id}'],
                                   stderr=subprocess.DEVNULL)

                    # Setup filesystem
                    self.setup_filesystem()

                    # Setup network
                    self.setup_network()

                    # Set environment
                    os.environ['HOSTNAME'] = f'container-{self.id}'
                    os.environ['PATH'] = ('/usr/local/sbin:/usr/local/bin:'
                                          '/usr/sbin:/usr/bin:/sbin:/bin')

                    print(f"[{self.id}] Container ready!")
                    print("=" * 60)

                    # Execute the command
                    os.execvp(self.command[0], self.command)
                else:
                    # Child process - wait for grandchild
                    os.waitpid(container_pid, 0)
                    sys.exit(0)
            except Exception as e:
                print(f"[{self.id}] Error: {e}")
                sys.exit(1)
        else:
            # Parent process
            try:
                # Wait for container to exit
                os.waitpid(pid, 0)
            except KeyboardInterrupt:
                print(f"\n[{self.id}] Interrupted")
            finally:
                self.cleanup()

    def cleanup(self):
        """Cleanup container resources."""
        try:
            os.rmdir(self.cgroup_path)
        except Exception:
            pass
        print(f"[{self.id}] Container stopped")


def main():
    parser = argparse.ArgumentParser(
        description='A minimal container runtime',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  sudo python3 container.py run /path/to/rootfs bash
  sudo python3 container.py run ./alpine sh -c "echo hello from container"
  sudo python3 container.py run ./ubuntu bash --memory 256 --cpu 512
"""
    )
    parser.add_argument('action', choices=['run'], help='Action to perform')
    parser.add_argument('image', help='Path to root filesystem')
    parser.add_argument('command', nargs='+', help='Command to run in container')
    parser.add_argument('--memory', type=int, default=512,
                        help='Memory limit in MB (default: 512)')
    parser.add_argument('--cpu', type=int, default=1024,
                        help='CPU shares (default: 1024 = 1 CPU)')

    args = parser.parse_args()

    # Check root
    if os.geteuid() != 0:
        print("Error: This program requires root privileges")
        print("Run with: sudo python3 container.py ...")
        sys.exit(1)

    if args.action == 'run':
        container = Container(
            args.image,
            args.command,
            memory_mb=args.memory,
            cpu_shares=args.cpu
        )
        container.run()


if __name__ == "__main__":
    main()
Testing Your Container Runtime
To test this, you’ll need a root filesystem. Here’s how to create a minimal one using an existing Docker image:
# Create a directory for our container image
mkdir alpine_rootfs
# Export an Alpine Linux filesystem (requires Docker)
docker export $(docker create alpine:latest) | tar -C alpine_rootfs -xf -
# Or download a minimal rootfs
wget https://dl-cdn.alpinelinux.org/alpine/v3.18/releases/x86_64/alpine-minirootfs-3.18.0-x86_64.tar.gz
mkdir alpine_rootfs
tar -xzf alpine-minirootfs-3.18.0-x86_64.tar.gz -C alpine_rootfs
# Now run your container!
sudo python3 container.py run alpine_rootfs sh
Inside the container, you can verify isolation:
# Check hostname
hostname # Should show container-<id>
# Check processes (only container processes)
ps aux
# Check filesystem (should only see alpine files)
ls /
# Check resource limits (from another shell on the host - the container
# itself doesn't have the cgroup2 filesystem mounted)
cat /sys/fs/cgroup/container_*/memory.max
What We Built vs. What Docker Does
Our container runtime demonstrates the core concepts, but production container runtimes like Docker/containerd do much more:
What we built:
- Process isolation (PID namespaces)
- Filesystem isolation (mount namespaces + chroot)
- Resource limits (cgroups v2)
- Basic network isolation
- Hostname isolation (UTS namespace)
What Docker adds:
- Image Management: Layered filesystems using overlay2/AUFS (see the overlay sketch after this list)
- Image Distribution: Pulling images from registries
- Advanced Networking: Bridge networks, overlay networks, port mapping
- Volume Management: Persistent storage with bind mounts and volumes
- Security Features: seccomp profiles, AppArmor/SELinux, capability dropping
- Container Orchestration APIs: REST API for managing containers
- Logging & Monitoring: stdout/stderr capture, metrics collection
- Health Checks: Container health monitoring
- Restart Policies: Automatic restart on failure
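To make the image-management item concrete: overlay filesystems are how Docker stacks read-only image layers underneath a writable container layer without copying anything. Here’s a hedged sketch of a single overlay mount using made-up paths - lowerdir holds the image layer, upperdir collects the container’s writes, and workdir is scratch space overlayfs requires:

import subprocess

lower = "/var/lib/mini_runtime/image-layer"    # hypothetical read-only layer
upper = "/var/lib/mini_runtime/container-rw"   # hypothetical writable layer
work = "/var/lib/mini_runtime/overlay-work"    # hypothetical scratch dir
merged = "/var/lib/mini_runtime/merged"        # the unified view

# Mount the unified view: reads fall through to lower, writes land in upper
subprocess.run([
    "mount", "-t", "overlay", "overlay",
    "-o", f"lowerdir={lower},upperdir={upper},workdir={work}",
    merged,
], check=True)

Multiple image layers are stacked by separating paths in lowerdir with colons, which is essentially what the overlay2 storage driver does for every layer of an image.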
Understanding the Security Implications
It’s crucial to understand that our simple implementation lacks many security features:
- No User Namespaces: Our containers run as root. Production containers should use user namespaces to map container root to unprivileged users (sketched below).
- No seccomp: We don’t restrict system calls. Docker uses seccomp profiles to block dangerous syscalls.
- No Capability Dropping: Our containers have all Linux capabilities. Docker drops most by default.
- No AppArmor/SELinux: No mandatory access control.
These missing features are why you should never use this implementation in production!
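To give a flavor of the first gap, here’s a minimal sketch of the user-namespace trick that rootless containers rely on: an unprivileged user creates a user namespace and maps their own UID to root inside it, so “root in the container” is just a regular user on the host. It assumes Python 3.12+ for os.unshare, and note that a single-line mapping of your own UID is the only mapping you can write without extra privileges:

import os

uid, gid = os.getuid(), os.getgid()

os.unshare(os.CLONE_NEWUSER)  # no root required for this one

# The kernel requires setgroups to be denied before an unprivileged
# process may write gid_map; then map UID/GID 0 inside the namespace
# to our original (unprivileged) IDs outside it.
with open("/proc/self/setgroups", "w") as f:
    f.write("deny")
with open("/proc/self/uid_map", "w") as f:
    f.write(f"0 {uid} 1")
with open("/proc/self/gid_map", "w") as f:
    f.write(f"0 {gid} 1")

print("euid inside the namespace:", os.geteuid())  # prints 0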
Conclusion
By building this container runtime, we’ve demystified how containers actually work. They’re not magic - they’re clever applications of Linux kernel features that have existed for years:
- Namespaces (2002-2013): Provide isolation
- cgroups (2007): Provide resource limiting
- chroot (1979!): Provides filesystem isolation
Docker’s innovation wasn’t inventing these technologies - it was packaging them into an easy-to-use tool with great developer experience.
Understanding these fundamentals makes you a better DevOps engineer. When things go wrong in production, you’ll know where to look. When you need to optimize container performance, you’ll understand the levers you can pull.
Further Learning
If you enjoyed this deep dive, here are resources to continue learning:
- Linux Namespaces: man namespaces, man unshare
- cgroups: the kernel's cgroup documentation
- OCI Runtime Spec: The standard container runtime specification
- runc Source Code: the OCI runtime that Docker uses under the hood (via containerd)
- LXC/LXD: the Linux Containers project - container tooling that predates Docker
I also highly recommend CodeCrafters’ “Build Your Own Docker” challenge - it’s an interactive way to build a container runtime with guided steps.
Next Steps
In a future post, I might explore:
- Implementing container image layers with overlay filesystems
- Building container networking from scratch (veth pairs, bridges, NAT)
- Creating a simple container orchestrator (mini-Kubernetes)
Let me know in the comments what you’d like to see next!
Announcements
- If you’re interested in more content like this, I post regularly about DevOps, Python, and systems programming. Follow me on Twitter/X for updates.
- I’m available for Python and DevOps consulting. If you need help with containerization, automation, or infrastructure, feel free to reach out via email.
If you share this on X, tag me @muhammad_o7 - I’d love to see your thoughts! You can also connect with me on LinkedIn.
Note: Want to be notified about posts like this? Subscribe to my RSS feed or leave your email here