Building a Simple Container from Scratch

Introduction

Containers have revolutionized how we deploy applications, providing isolation, portability, and efficiency. While tools like Docker make containerization accessible, understanding what happens under the hood is valuable. This blog post documents my journey creating a basic container implementation using Linux’s native features.

What is Containerization?

Containerization is a lightweight virtualization technique that isolates applications without the overhead of full virtual machines. Containers share the host’s kernel but run in isolated environments with their own:

  • Filesystem
  • Process tree
  • Network stack
  • Resource limits

Core Technologies

My implementation uses four key Linux features:

  1. Namespaces: Provide isolation for system resources
  2. Chroot: Creates a new root filesystem view
  3. Cgroups: Limit resource usage
  4. Virtual networking: Isolates network communication
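
Before starting, it can help to confirm that the tools used throughout this post are installed and that the host runs cgroups v2. The following is a minimal sketch, not part of the original script; `missing_tools` and `has_cgroup_v2` are hypothetical helper names.

```shell
#!/bin/bash
# Print the name of every command in the argument list that is not on PATH.
# (missing_tools is a hypothetical helper, not part of any package.)
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Succeed if the unified cgroup v2 hierarchy is mounted at /sys/fs/cgroup.
# cgroup.controllers only exists at the root of a v2 hierarchy.
has_cgroup_v2() {
    [ -f /sys/fs/cgroup/cgroup.controllers ]
}
```

Calling `missing_tools debootstrap unshare chroot ip iptables` prints nothing when every tool is available, and one name per line otherwise.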

Implementation Steps

1. Setting Up the Base Filesystem

The first step was creating a minimal root filesystem for our container:

#!/bin/bash
# debootstrap must be run as root
ROOTFS="./rootfs"
mkdir -p "$ROOTFS"
debootstrap --variant=minbase focal "$ROOTFS"

This creates a minimal Ubuntu Focal installation in the rootfs directory, which becomes our container’s filesystem.
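
Since debootstrap can fail partway through (for example on a network error), a quick sanity check of the result may be worthwhile before moving on. This is a sketch under the assumption that the later steps only need /bin/bash and the proc, sys, dev, and etc directories; `rootfs_looks_usable` is a hypothetical name.

```shell
# Succeed if the directory looks like a usable root filesystem:
# it needs an executable shell plus the directories used in later steps.
rootfs_looks_usable() {
    local rootfs="$1" dir
    [ -x "$rootfs/bin/bash" ] || return 1
    for dir in proc sys dev etc; do
        [ -d "$rootfs/$dir" ] || return 1
    done
}
```

For example, `rootfs_looks_usable ./rootfs || echo "rootfs is incomplete"` reports a broken bootstrap before any namespaces are created.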

2. Process Isolation with Namespaces

Linux namespaces isolate processes from the host system. My implementation uses:

  • PID namespace: Container processes can’t see host processes
  • Mount namespace: Container has its own filesystem mounts
  • UTS namespace: Container has its own hostname
  • IPC namespace: Container has its own IPC resources
  • Network namespace: Container has its own network stack

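The key to proper PID namespace isolation is to mount /proc inside the new namespace. The command below also enters the network namespace container_ns, which is created in step 4 and must exist before this command runs: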

ip netns exec container_ns unshare --mount --uts --ipc --pid --fork bash -c "
    # We need to mount /proc again inside the new PID namespace
    mount -t proc proc $ROOTFS/proc
    
    # Now chroot into the container
    exec chroot $ROOTFS /bin/bash
"

This approach ensures that when you run ps inside the container, you only see container processes, with the bash shell having PID 1.
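
Another way to confirm the isolation is to compare namespace identifiers, which the kernel exposes as symlinks under /proc/<pid>/ns. The helpers below are a small sketch with hypothetical names (`ns_id`, `same_namespace`); reading another process's entries generally requires root.

```shell
# Print the namespace identifier of a process, e.g. "pid:[4026531836]".
# Usage: ns_id <pid|self> <namespace-type>   (types: pid, mnt, uts, ipc, net)
ns_id() {
    readlink "/proc/$1/ns/$2"
}

# Succeed if two processes share the given namespace type.
# Returns 2 if either identifier cannot be read.
same_namespace() {
    local a b
    a=$(ns_id "$1" "$3") || return 2
    b=$(ns_id "$2" "$3") || return 2
    [ "$a" = "$b" ]
}
```

From the host, `same_namespace 1 <container-shell-pid> pid` should fail for a correctly isolated container, where `<container-shell-pid>` is the host-side PID of the container's bash process.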

3. Resource Limits with Cgroups v2

For resource control, I used cgroups v2, which has a unified hierarchy:

# Create a new cgroup
CGROUP_NAME="my_container"
mkdir -p /sys/fs/cgroup/$CGROUP_NAME

# Enable cpu and memory controllers
echo "+cpu +memory" > /sys/fs/cgroup/cgroup.subtree_control

# Set CPU limit (50% of one core)
echo "50000 100000" > /sys/fs/cgroup/$CGROUP_NAME/cpu.max

# Set memory limit (256MB)
echo "268435456" > /sys/fs/cgroup/$CGROUP_NAME/memory.max
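
The snippet above sets the limits but does not show how a process is placed into the cgroup. In cgroups v2 this is done by writing a PID to the cgroup's cgroup.procs file. The sketch below also shows how the two limit values are derived; all three function names are hypothetical, and how the original script attaches the container process is an assumption.

```shell
# cpu.max takes "<quota> <period>" in microseconds: the group may run for
# <quota> out of every <period>. 50% of one core with the default 100 ms
# period is "50000 100000".
cpu_max_for_percent() {
    local percent="$1" period="${2:-100000}"
    echo "$(( percent * period / 100 )) $period"
}

# memory.max takes a byte count; convert from MiB.
mib_to_bytes() {
    echo "$(( $1 * 1024 * 1024 ))"
}

# Move a process into the cgroup (requires root on a real hierarchy).
# Processes it starts afterwards inherit the membership.
join_cgroup() {
    local cgroup_dir="$1" pid="$2"
    echo "$pid" > "$cgroup_dir/cgroup.procs"
}
```

Running `join_cgroup /sys/fs/cgroup/my_container $$` in the launcher script before calling unshare would place the shell, and therefore the container started from it, under the limits.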

4. Network Isolation

For networking, I created a virtual ethernet pair with one end in the container:

# Create network namespace
ip netns add container_ns

# Create veth pair
ip link add veth0 type veth peer name veth1

# Move veth1 to container namespace
ip link set veth1 netns container_ns

# Configure interfaces
ip addr add 10.0.0.1/24 dev veth0
ip link set veth0 up
ip netns exec container_ns ip addr add 10.0.0.2/24 dev veth1
ip netns exec container_ns ip link set veth1 up
ip netns exec container_ns ip link set lo up
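
The commands above give the container an address on 10.0.0.0/24, but reaching anything outside that subnet also requires a default route inside the namespace. The post does not show this step, so the following is an assumption about how it could be done, using the host end of the veth pair (10.0.0.1) as the gateway. The `run` wrapper and the `DRY_RUN` variable are hypothetical names that allow the command to be printed instead of executed.

```shell
# Print the command when DRY_RUN=1, otherwise execute it (needs root).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Send all traffic for other networks through the host side of the veth pair.
# Usage: add_default_route <netns-name> <gateway-ip>
add_default_route() {
    local netns="$1" gateway="$2"
    run ip netns exec "$netns" ip route add default via "$gateway"
}
```

With the names from this post, the call would be `add_default_route container_ns 10.0.0.1`.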

Then I set up NAT for internet access:

# Enable IP forwarding (equivalent to writing 1 to /proc/sys/net/ipv4/ip_forward)
sysctl -w net.ipv4.ip_forward=1

# Set up NAT
iptables -t nat -F
iptables -F FORWARD
iptables -P FORWARD ACCEPT
iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -j MASQUERADE
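
Note that `iptables -t nat -F` removes every NAT rule on the host, including rules installed by other software such as Docker. A less intrusive alternative, sketched here as an assumption rather than what the original script does, is to append a rule only when `iptables -C` (check) reports that it is missing. `ensure_rule` is a hypothetical name.

```shell
# Append an iptables rule only if an identical rule is not already present.
# Usage: ensure_rule <table> <chain> <rule-spec...>
ensure_rule() {
    local table="$1" chain="$2"
    shift 2
    # -C exits 0 when the rule exists and non-zero otherwise.
    if ! iptables -t "$table" -C "$chain" "$@" 2>/dev/null; then
        iptables -t "$table" -A "$chain" "$@"
    fi
}
```

For the rule used here, that would be `ensure_rule nat POSTROUTING -s 10.0.0.0/24 -j MASQUERADE`, which can be run repeatedly without creating duplicates.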

5. Running a Web Server in the Container

To demonstrate the container’s functionality, I added a simple Python web server:

# Add a simple web server to the container
mkdir -p $ROOTFS/var/www/html
echo "<html><body><h1>Hello from Container!</h1></body></html>" > $ROOTFS/var/www/html/index.html

# Add a script to start the web server
cat > $ROOTFS/start_server.sh << 'EOF'
#!/bin/bash
cd /var/www/html
python3 -m http.server 8080
EOF

chmod +x $ROOTFS/start_server.sh

Inside the container, you can start the server with /start_server.sh and access it from the host at http://10.0.0.2:8080.
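
When the server is started from a script rather than by hand, the host may try to connect before it is listening. A small polling helper, sketched below with the hypothetical name `wait_for_http` and assuming curl is installed on the host, can bridge that gap.

```shell
# Poll a URL until it answers successfully or the attempts run out.
# Usage: wait_for_http <url> [attempts] [delay-seconds]
wait_for_http() {
    local url="$1" attempts="${2:-10}" delay="${3:-1}" i=1
    while [ "$i" -le "$attempts" ]; do
        # -s: silent, -f: treat HTTP errors as failure, -o: discard the body
        if curl -sf -o /dev/null --max-time 2 "$url"; then
            return 0
        fi
        i=$(( i + 1 ))
        # Only sleep if another attempt follows.
        [ "$i" -le "$attempts" ] && sleep "$delay"
    done
    return 1
}
```

On the host, `wait_for_http http://10.0.0.2:8080 && curl http://10.0.0.2:8080` fetches the page as soon as the server is up.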

Challenges and Solutions

Challenge 1: Mount Points

Initially, the container couldn’t access /proc and /sys. Solution: Mount these special filesystems inside the container:

mount -t proc proc $ROOTFS/proc
mount -t sysfs sysfs $ROOTFS/sys
mount -t devtmpfs devtmpfs $ROOTFS/dev
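
If these commands run on the host outside a private mount namespace, the mounts stay visible to the host until they are unmounted, and running the script twice stacks a second mount on top of the first. The sketch below guards against the second problem by checking with `mountpoint -q` (from util-linux) first; `mount_once` and `mount_special_filesystems` are hypothetical names.

```shell
# Mount a filesystem unless the target is already a mount point.
# Usage: mount_once <fstype> <source> <target>
mount_once() {
    local fstype="$1" source="$2" target="$3"
    if mountpoint -q "$target"; then
        return 0
    fi
    mount -t "$fstype" "$source" "$target"
}

# Mount the three special filesystems from this post under a rootfs,
# stopping at the first failure.
mount_special_filesystems() {
    local rootfs="$1"
    mount_once proc proc "$rootfs/proc" &&
    mount_once sysfs sysfs "$rootfs/sys" &&
    mount_once devtmpfs devtmpfs "$rootfs/dev"
}
```

Calling `mount_special_filesystems "$ROOTFS"` as root then replaces the three mount lines above.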

Challenge 2: Network Connectivity

Getting internet access from the container was tricky. I tried several approaches:

  1. First attempt: Using specific outgoing interfaces in NAT rules

    iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -o eth0 -j MASQUERADE
    

    This didn’t work reliably because the interface name varies between systems.

  2. Second attempt: Using user namespace with UID/GID mapping

    unshare --user --map-root-user
    

    This caused permission issues with network access.

  3. Final solution: Simplified NAT setup with FORWARD policy set to ACCEPT

    iptables -t nat -F
    iptables -F FORWARD
    iptables -P FORWARD ACCEPT
    iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -j MASQUERADE
    

    This approach worked consistently across different systems.
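
If restricting the NAT rule to one outgoing interface is still desirable, the interface name from the first attempt does not have to be hard-coded: it can be read from the host's default route. This is a sketch of that idea, not something the original script does; `parse_default_iface` and `default_iface` are hypothetical names.

```shell
# Extract the interface name from "ip route show default" output, e.g.
#   default via 192.168.1.1 dev eth0 proto dhcp metric 100
# Prints the word following "dev" on the first default route line.
parse_default_iface() {
    awk '$1 == "default" {
        for (i = 2; i < NF; i++) if ($i == "dev") { print $(i + 1); exit }
    }'
}

# Interface currently used for the host's default route (empty if none).
default_iface() {
    ip route show default | parse_default_iface
}
```

The first attempt could then be written as `iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -o "$(default_iface)" -j MASQUERADE`.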

Challenge 3: DNS Resolution

DNS resolution initially failed in the container. The solution was to:

  1. Configure a proper resolv.conf with Google’s DNS servers

    echo "nameserver 8.8.8.8" > $ROOTFS/etc/resolv.conf
    echo "nameserver 8.8.4.4" >> $ROOTFS/etc/resolv.conf
    
  2. Add direct IP entries for common Ubuntu repositories

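    # Append rather than overwrite, so existing entries in /etc/hosts are kept.
    # This IP is pinned by hand and may go stale.
    echo "91.189.91.81 archive.ubuntu.com" >> "$ROOTFS/etc/hosts"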
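
One common reason DNS fails in a setup like this, though the post does not confirm it was the cause here, is a resolv.conf that points at a loopback resolver such as 127.0.0.53 (systemd-resolved). A loopback address inside the container's network namespace refers to the container itself, so that resolver is unreachable. The hypothetical helper below filters such entries out of an existing file.

```shell
# Print the nameserver lines of a resolv.conf-style file that do not point
# at loopback addresses (127.x.x.x or ::1).
# Usage: non_loopback_nameservers <file>
non_loopback_nameservers() {
    awk '$1 == "nameserver" && $2 !~ /^127\./ && $2 != "::1"' "$1"
}
```

If `non_loopback_nameservers /etc/resolv.conf` prints anything, those lines can be written to $ROOTFS/etc/resolv.conf; if it prints nothing, falling back to public resolvers as shown above is a reasonable default.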
    

Challenge 4: Process Isolation

Initially, the container could see host processes. The solution was to properly mount /proc inside the new PID namespace:

ip netns exec container_ns unshare --mount --uts --ipc --pid --fork bash -c "
    # We need to mount /proc again inside the new PID namespace
    mount -t proc proc $ROOTFS/proc
    
    # Now chroot into the container
    exec chroot $ROOTFS /bin/bash
"

This ensures that the container only sees its own processes.

Testing the Container

To verify isolation, I ran several tests:

  1. Process isolation: ps inside the container showed only container processes

    root@container:/# ps
    PID TTY          TIME CMD
      1 ?        00:00:00 bash
      7 ?        00:00:00 ps
    
  2. Network isolation: The container had its own IP address (10.0.0.2)

    root@container:/# ip addr
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
        inet 127.0.0.1/8 scope host lo
    2: veth1@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
        inet 10.0.0.2/24 scope global veth1
    
  3. Internet connectivity: The container could access the internet

    root@container:/# ping 8.8.8.8
    PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
    64 bytes from 8.8.8.8: icmp_seq=1 ttl=111 time=25.7 ms
    
  4. Filesystem isolation: The container had its own root filesystem

  5. Web server test: Running the Python web server inside the container and accessing it from the host

    # Inside container
    root@container:/# /start_server.sh
    Serving HTTP on 0.0.0.0 port 8080 ...
    
    # From host
    $ curl http://10.0.0.2:8080
    <html><body><h1>Hello from Container!</h1></body></html>
    

Future Improvements

While this implementation covers the basics, several enhancements could be made:

  1. User namespace isolation: Properly map UIDs/GIDs between container and host
  2. Disk I/O limits: Add cgroup controls for disk operations
  3. Security hardening: Implement seccomp or AppArmor profiles
  4. Non-root execution: Run applications as non-root users inside the container
  5. Container image management: Add support for layered filesystem images
  6. Container orchestration: Implement basic container lifecycle management

Conclusion

Building a container from scratch helped me understand how Docker and other container technologies work internally. The implementation demonstrates the core concepts of containerization using Linux’s native features.

The most important lesson: containers aren’t magic - they’re just clever combinations of existing Linux isolation mechanisms! By understanding these mechanisms, we can better utilize, troubleshoot, and secure containerized applications.

Tools like Docker provide a polished experience, but the underlying technology is accessible and can be reproduced with basic Linux commands. Building this container has given me a deeper appreciation for the power of Linux’s isolation features.