<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title></title>
    <description>Personal Blog where I write about things I learn or discover.</description>
    <link>//muhammadraza.me/</link>
    <atom:link href="//muhammadraza.me/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Fri, 03 Apr 2026 15:34:48 +0000</pubDate>
    <lastBuildDate>Fri, 03 Apr 2026 15:34:48 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>Building CodeWiki: Compiling Codebases Into Living Wikis With LLMs</title>
        <description>
          <![CDATA[
            
            <p>Every coding agent session starts from zero. The agent doesn’t know how your code is organized, which files matter, how the pieces connect. It has to rediscover the architecture from scratch. Grep around, read some files, build a mental model, start working. That mental model disappears the moment the session ends.</p>

<p>I kept watching this happen. Ten minutes of exploration before any real work, every single time. If you work across multiple repos or come back to a project after a couple weeks, it’s worse. The agent is essentially reading the codebase for the first time, again.</p>

<p>I wanted to fix this.</p>

<h2 id="the-idea">The idea</h2>

<p>A few weeks ago Karpathy <a href="https://x.com/karpathy/status/2039805659525644595">tweeted</a> about using LLMs to build personal knowledge bases. The workflow: collect raw sources, have an LLM compile them into a structured wiki of markdown files, then query and build on that wiki over time. Every query makes the wiki richer. The knowledge adds up.</p>

<p>The part that stuck with me: he’s not using fancy RAG. The LLM maintains its own index files and summaries, and at his scale (~100 articles, ~400K words) it just works. The LLM reads its own compiled knowledge to answer questions.</p>

<p>Codebases are raw data too. Source files are unstructured information that happens to be executable. What if the LLM compiled a codebase into a wiki the same way, with module overviews, architecture docs, concept articles, and then used that wiki as its starting point for every session?</p>

<p>That’s <a href="https://github.com/mraza007/codewiki">CodeWiki</a>.</p>

<h2 id="how-it-works">How it works</h2>

<p>CodeWiki is a thin Rust CLI called <code class="language-plaintext highlighter-rouge">cw</code> paired with a Claude Code skill. The CLI handles git ops, directory scaffolding, and metadata. The agent does all the actual reading and writing. No API keys, no LLM calls from the CLI. Your agent is the intelligence.</p>

<p>When you run <code class="language-plaintext highlighter-rouge">cw init</code> in a repo, it creates a wiki directory at <code class="language-plaintext highlighter-rouge">~/.codewiki/&lt;project&gt;/</code> with this structure:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/.codewiki/my-project/
├── _index.md         # master index
├── _architecture.md  # system overview
├── _patterns.md      # recurring patterns
├── _meta.yaml        # last compiled commit
├── modules/          # one article per module
├── concepts/         # cross-cutting concerns
├── decisions/        # why things are the way they are
├── learnings/        # bugs fixed, patterns discovered
└── queries/          # past Q&amp;A, filed back
</code></pre></div></div>

<p>The first time you start a Claude Code session after init, the skill kicks in. The agent walks your codebase, reads the source files, and writes wiki articles. Module articles describe what each part of the code actually does. Not what it’s supposed to do, what it does. Key files, functions, data flow, connections to other modules.</p>

<p>Concept articles cut across modules. “How does error handling work across the system” or “how does data flow from request to response.” These are the questions that normally require reading eight files across four directories. The wiki answers them in one place.</p>

<h2 id="keeping-it-fresh">Keeping it fresh</h2>

<p>The wiki is only useful if it stays current. Every article has YAML frontmatter with a <code class="language-plaintext highlighter-rouge">source_files</code> field:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="na">title</span><span class="pi">:</span> <span class="s">Authentication Module</span>
<span class="na">type</span><span class="pi">:</span> <span class="s">module</span>
<span class="na">source_files</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s">src/auth/middleware.py</span>
  <span class="pi">-</span> <span class="s">src/auth/tokens.py</span>
<span class="na">tags</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">auth</span><span class="pi">,</span> <span class="nv">middleware</span><span class="pi">,</span> <span class="nv">jwt</span><span class="pi">]</span>
<span class="nn">---</span>
</code></pre></div></div>

<p>The CLI tracks which commit the wiki was last compiled against. When you start a new session, <code class="language-plaintext highlighter-rouge">cw status</code> diffs against that commit and cross-references changed files against every article’s <code class="language-plaintext highlighter-rouge">source_files</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cw status
Changed since last compile (4964c23):
  M src/auth/middleware.py
  M src/auth/tokens.py

Stale articles:
  ! modules/auth.md
</code></pre></div></div>

<p>The agent sees this and knows exactly what to re-read and update. No guessing, no full recompile.</p>
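
<p>A minimal sketch of that cross-reference, assuming the changed paths come from git and each article’s <code class="language-plaintext highlighter-rouge">source_files</code> list has already been read from its frontmatter. The function name and the <code class="language-plaintext highlighter-rouge">modules/api.md</code> entry are hypothetical, and the real <code class="language-plaintext highlighter-rouge">cw</code> CLI is written in Rust, so this only illustrates the idea:</p>

```python
from pathlib import PurePosixPath


def stale_articles(changed_files, articles):
    """Return wiki articles whose source_files overlap the changed files.

    changed_files: iterable of repo-relative paths reported by git.
    articles: dict mapping article path to its list of source_files.
    """
    changed = {str(PurePosixPath(p)) for p in changed_files}
    stale = []
    for article, sources in sorted(articles.items()):
        normalized = {str(PurePosixPath(s)) for s in sources}
        if normalized.intersection(changed):
            stale.append(article)
    return stale


# Hypothetical wiki contents; only modules/auth.md appears in the post.
articles = {
    "modules/auth.md": ["src/auth/middleware.py", "src/auth/tokens.py"],
    "modules/api.md": ["src/api/routes.py"],
}
changed = ["src/auth/middleware.py", "src/auth/tokens.py"]
print(stale_articles(changed, articles))  # ['modules/auth.md']
```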

<p>At session end, the agent writes learnings and decisions back into the wiki. Fixed a bug? That becomes <code class="language-plaintext highlighter-rouge">learnings/auth-token-race-condition.md</code>. Made a design decision? That’s <code class="language-plaintext highlighter-rouge">decisions/switched-to-redis-sessions.md</code>. Then it updates <code class="language-plaintext highlighter-rouge">_meta.yaml</code> with the current commit hash.</p>

<p>Next session picks up where this one left off.</p>

<h2 id="the-cli">The CLI</h2>

<p>About 400 lines of Rust. Here are the commands:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cw init                <span class="c"># scaffold wiki for current repo</span>
cw status              <span class="c"># what changed since last compile</span>
cw path                <span class="c"># print wiki path</span>
cw projects            <span class="c"># list all wikis</span>
cw index               <span class="c"># rebuild _index.md from article frontmatter</span>
cw meta update         <span class="c"># record current commit as compiled</span>

cw setup claude-code   <span class="c"># install skill into Claude Code</span>
cw setup codex         <span class="c"># install instructions for Codex</span>
cw setup qmd           <span class="c"># register wiki as QMD search collection</span>
</code></pre></div></div>

<p>The CLI doesn’t make any LLM calls. It handles the things agents are bad at: tracking git state, knowing which files changed, maintaining timestamps. The agent handles what it’s good at: reading code and writing about it.</p>

<h2 id="search-with-qmd">Search with QMD</h2>

<p>For larger wikis, <a href="https://github.com/tobi/qmd">QMD</a> by Tobi Lütke adds proper search. It’s a local search engine for markdown that combines BM25, vector search, and a small reranker model. Running <code class="language-plaintext highlighter-rouge">cw setup qmd</code> registers your wiki as a searchable collection. The agent can then query the wiki through QMD’s MCP server during a session.</p>

<p>At the scale of most repos people actually work in, you probably don’t need it. A well-organized wiki with an index file is enough for the LLM to navigate on its own. But when the wiki gets large, QMD keeps retrieval fast.</p>

<h2 id="viewing-with-obsidian">Viewing with Obsidian</h2>

<p>All wiki articles live at <code class="language-plaintext highlighter-rouge">~/.codewiki/</code>. Open that directory as an Obsidian vault and you get a browsable knowledge graph of all your projects. Articles use <code class="language-plaintext highlighter-rouge">[[backlinks]]</code> so modules connect to each other. The auth article links to <code class="language-plaintext highlighter-rouge">[[database]]</code> and <code class="language-plaintext highlighter-rouge">[[api]]</code>. You never have to write or edit these articles yourself. The agent maintains everything.</p>

<h2 id="why-not-rag">Why not RAG</h2>

<p>Traditional RAG chunks your code, embeds it, and retrieves fragments when you ask a question. You get decontextualized snippets and hope the LLM can stitch them together.</p>

<p>CodeWiki is different. The LLM reads the code and writes structured articles about it. The auth article already connects the middleware to the token service to the database layer. That connection doesn’t exist in any single source file. It exists in the compiled understanding.</p>

<p>Karpathy found the same thing with his research wiki. You don’t need vector search over raw data when you have a well-organized collection of articles. The LLM reads the index, finds the relevant articles, and reads those. Simple, and it works.</p>

<h2 id="getting-started">Getting started</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/mraza007/codewiki.git
<span class="nb">cd </span>codewiki
cargo <span class="nb">install</span> <span class="nt">--path</span> <span class="nb">.</span>

<span class="nb">cd </span>your-project
cw init
cw setup claude-code
</code></pre></div></div>

<p>Start a Claude Code session and the skill handles the rest. The project is MIT licensed and on <a href="https://github.com/mraza007/codewiki">GitHub</a>.</p>

          ]]>
        </description>
        <pubDate>Fri, 03 Apr 2026 00:00:00 +0000</pubDate>
        <link>//muhammadraza.me/2026/building-codewiki-compiling-codebases-into-living-wikis/</link>
        <guid isPermaLink="true">//muhammadraza.me/2026/building-codewiki-compiling-codebases-into-living-wikis/</guid>
        
        <category>ai</category>
        
        <category>rust</category>
        
        <category>tools</category>
        
        <category>devops</category>
        
        
        
        <dc:creator>Muhammad Raza</dc:creator>
        <dc:rights></dc:rights>
      </item>
    
      <item>
        <title>I Built an Orchestrator That Watches GitHub Issues and Sends Agents to Fix Them</title>
        <description>
          <![CDATA[
            
            <p>I have too many issues and not enough time. Same as everyone. The usual loop is: pick an issue, context switch into it, write the code, open a PR, pick the next one. Do that until the sprint ends or you lose the will.</p>

<p>Coding agents help with this. I can point Claude Code at an issue and let it work while I do something else. But that’s still one agent, one issue, one terminal. If I have 10 issues labeled “agent-ready,” I’m not babysitting 10 terminal tabs.</p>

<p>I wanted something that just watches for new issues and sends agents after them. Then OpenAI released their <a href="https://github.com/openai/symphony/blob/main/SPEC.md">Symphony spec</a>, an orchestrator pattern for their Codex agent. The architecture was solid: poll an issue tracker, dispatch agents into isolated workspaces, reconcile when issues close. But it was built around Codex and Linear, and I use Claude Code and GitHub Issues.</p>

<p>So I took the ideas I liked from Symphony and built my own. That’s <a href="https://github.com/mraza007/baton">Baton</a>.</p>

<h2 id="what-it-does">What it does</h2>

<p>Baton is a Python daemon. You start it in your repo, and it polls for GitHub issues matching your configured labels, creates an isolated git worktree per issue, and runs the Claude Code CLI as a subprocess. When the agent finishes and opens a PR, Baton releases the claim and grabs the next issue.</p>

<p>One config file. One command. Go do something else.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>WORKFLOW.md -&gt; Orchestrator -&gt; Worker (per issue)
                  |              |
                  |              +-- git worktree create
                  |              +-- hooks (before_run)
                  |              +-- claude -p "&lt;prompt&gt;"
                  |              +-- check issue state
                  |              +-- hooks (after_run)
                  |
                  +-- Poller (gh issue list)
                  +-- Dispatcher (concurrency control)
                  +-- Reconciler (stale run detection)
</code></pre></div></div>

<p>The name comes from relay races. You hand off the baton and the runner goes.</p>

<h2 id="the-config">The config</h2>

<p>Everything lives in <code class="language-plaintext highlighter-rouge">WORKFLOW.md</code>. YAML front matter for configuration, Jinja2 template below for the prompt. Baton reloads this file on every poll cycle, so you can change settings without restarting.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="na">tracker</span><span class="pi">:</span>
  <span class="na">kind</span><span class="pi">:</span> <span class="s">github</span>
  <span class="na">labels</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">agent"</span><span class="pi">]</span>
  <span class="na">exclude_labels</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">blocked"</span><span class="pi">]</span>

<span class="na">polling</span><span class="pi">:</span>
  <span class="na">interval_ms</span><span class="pi">:</span> <span class="m">30000</span>

<span class="na">agent</span><span class="pi">:</span>
  <span class="na">max_concurrent</span><span class="pi">:</span> <span class="m">3</span>
  <span class="na">max_turns</span><span class="pi">:</span> <span class="m">5</span>
  <span class="na">command</span><span class="pi">:</span> <span class="s">claude</span>
  <span class="na">permission_mode</span><span class="pi">:</span> <span class="s">bypassPermissions</span>

<span class="na">hooks</span><span class="pi">:</span>
  <span class="na">before_run</span><span class="pi">:</span> <span class="pi">|</span>
    <span class="s">git fetch origin main &amp;&amp; git rebase origin/main</span>
  <span class="na">timeout_ms</span><span class="pi">:</span> <span class="m">60000</span>
<span class="nn">---</span>

<span class="s">You are an autonomous software engineer working on issue</span> <span class="c1">#{{ issue.number }}: {{ issue.title }}.</span>

<span class="pi">{{</span> <span class="nv">issue.body</span> <span class="pi">}}</span>

<span class="pi">{</span><span class="err">%</span> <span class="nv">if attempt %</span><span class="pi">}</span>
<span class="s">This is continuation attempt {{ attempt }}. Review what was done and continue.</span>
<span class="pi">{</span><span class="err">%</span> <span class="nv">endif %</span><span class="pi">}</span>

<span class="c1">## Instructions</span>

<span class="s">1. Understand the issue requirements</span>
<span class="s">2. Write clean, well-tested code</span>
<span class="s">3. Run existing tests to make sure nothing breaks</span>
<span class="s">4. Commit your changes with a descriptive message</span>
<span class="s">5. Push the branch and create a pull request linking to</span> <span class="c1">#{{ issue.number }}</span>
</code></pre></div></div>

<p>Labels filter which issues get picked up. <code class="language-plaintext highlighter-rouge">max_concurrent</code> controls parallel agents. <code class="language-plaintext highlighter-rouge">max_turns</code> is the retry limit per issue. Hooks run shell commands at different points. I use <code class="language-plaintext highlighter-rouge">before_run</code> to rebase on main so the agent starts from fresh code.</p>

<p>The prompt template gets <code class="language-plaintext highlighter-rouge">issue.number</code>, <code class="language-plaintext highlighter-rouge">issue.title</code>, <code class="language-plaintext highlighter-rouge">issue.body</code>, <code class="language-plaintext highlighter-rouge">issue.labels</code>, and <code class="language-plaintext highlighter-rouge">attempt</code> for retries. Standard Jinja2.</p>
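
<p>A rough sketch of how a file like this could be split into its two halves, using only the standard library. The function name is hypothetical and this is not Baton’s actual loader. The front matter string would then go to a YAML parser and the template to Jinja2, both left out here:</p>

```python
def split_workflow(text):
    """Split a WORKFLOW.md string into (front_matter, prompt_template).

    Expects the file to open with a '---' line and to close the front
    matter with a second '---' line.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("WORKFLOW.md must start with a '---' line")
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            front_matter = "\n".join(lines[1:index])
            template = "\n".join(lines[index + 1:]).strip()
            return front_matter, template
    raise ValueError("front matter is missing its closing '---' line")


sample = """---
agent:
  max_concurrent: 3
---

You are working on issue #{{ issue.number }}: {{ issue.title }}.
"""
config_text, template = split_workflow(sample)
print(config_text)
print(template)
```

<p>Because the file is re-read on every poll cycle, a parse error is better reported and skipped than allowed to crash the daemon. That is a design suggestion, not a description of what Baton does.</p>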

<h2 id="why-worktrees">Why worktrees</h2>

<p>Each issue gets its own worktree under <code class="language-plaintext highlighter-rouge">.symphony/worktrees/</code>, with a branch name slugified from the issue title: <code class="language-plaintext highlighter-rouge">baton/fix-login-redirect-42</code>.</p>
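
<p>One way to produce a branch name of that shape, as a sketch. The function name and the exact slug rules (lowercase, with runs of non-alphanumerics collapsed to a hyphen) are assumptions; Baton’s own slugifier may differ:</p>

```python
import re


def branch_name(title, number, prefix="baton"):
    """Build a branch name like 'baton/fix-login-redirect-42'."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{prefix}/{slug}-{number}"


print(branch_name("Fix login redirect", 42))  # baton/fix-login-redirect-42
```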

<p>I thought about Docker containers and temp directories but worktrees won out. They share the git object database so creating one is almost instant, unlike a full clone. They’re real checkouts, so linters and test runners and build scripts all work without any path hacking. And they’re isolated. If one agent trashes its branch, the others don’t care.</p>
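
<p>Creating a worktree on a new branch is a single git command. The sketch below only builds the argument list. The function name, the directory naming (slashes in the branch replaced with hyphens), and the <code class="language-plaintext highlighter-rouge">origin/main</code> base are assumptions for illustration:</p>

```python
from pathlib import Path


def worktree_command(root, branch, base="origin/main"):
    """Return (argv, path) for creating a worktree on a new branch."""
    # Flatten the branch name so it is a single directory under root.
    path = Path(root) / branch.replace("/", "-")
    # git worktree add -b NEW_BRANCH PATH COMMIT-ISH
    argv = ["git", "worktree", "add", "-b", branch, str(path), base]
    return argv, path


argv, path = worktree_command(".symphony/worktrees",
                              "baton/fix-login-redirect-42")
print(" ".join(argv))
```

<p>Passing <code class="language-plaintext highlighter-rouge">argv</code> to <code class="language-plaintext highlighter-rouge">subprocess.run(argv, check=True)</code> inside the repo would create the checkout.</p>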

<h2 id="why-gh-cli-instead-of-the-github-api">Why <code class="language-plaintext highlighter-rouge">gh</code> CLI instead of the GitHub API</h2>

<p>Baton shells out to <code class="language-plaintext highlighter-rouge">gh issue list</code> and <code class="language-plaintext highlighter-rouge">gh pr create</code> instead of using PyGitHub or the REST API. Seems odd, but think about setup.</p>

<p>With the API, you need a personal access token. You need to configure it somewhere. You need to handle rate limits.</p>

<p>With <code class="language-plaintext highlighter-rouge">gh</code>, you authenticate once (<code class="language-plaintext highlighter-rouge">gh auth login</code>) and everything on your machine uses the same credentials. No token management in the orchestrator. The tradeoff is speed, but Baton polls every 30 seconds. The overhead of a subprocess call doesn’t matter at that pace.</p>
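
<p>As an illustration, polling through <code class="language-plaintext highlighter-rouge">gh</code> comes down to building an argument list and decoding the JSON it prints. The function names and the sample issues are hypothetical, and filtering excluded labels on the client side is an assumption about how this could be done, not a description of Baton’s code:</p>

```python
import json


def issue_list_command(labels, limit=50):
    """Build the argv for listing open issues that carry every label."""
    argv = ["gh", "issue", "list", "--state", "open",
            "--limit", str(limit),
            "--json", "number,title,body,labels"]
    for label in labels:
        argv += ["--label", label]
    return argv


def parse_issues(raw_json, exclude_labels):
    """Decode gh's JSON output and drop issues with an excluded label."""
    excluded = set(exclude_labels)
    issues = []
    for item in json.loads(raw_json):
        names = {label["name"] for label in item.get("labels", [])}
        if names.isdisjoint(excluded):
            issues.append(item)
    return issues


# Made-up output in the shape gh prints for --json number,title,body,labels.
sample = json.dumps([
    {"number": 1, "title": "Fix login", "body": "",
     "labels": [{"name": "agent"}]},
    {"number": 2, "title": "Migrate DB", "body": "",
     "labels": [{"name": "agent"}, {"name": "blocked"}]},
])
print(issue_list_command(["agent"]))
print([issue["number"] for issue in parse_issues(sample, ["blocked"])])  # [1]
```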

<h2 id="the-permission-problem">The permission problem</h2>

<p>This tripped me up. Claude Code has permission modes: <code class="language-plaintext highlighter-rouge">default</code> asks for everything, <code class="language-plaintext highlighter-rouge">acceptEdits</code> auto-approves file edits but prompts for shell commands, and <code class="language-plaintext highlighter-rouge">bypassPermissions</code> auto-approves everything.</p>

<p>I started with <code class="language-plaintext highlighter-rouge">acceptEdits</code> because it felt like the right balance. Let the agent write code freely, but make it ask before running commands. Problem: “ask” means a human clicking yes, and in an autonomous orchestrator there’s no human. The agent just blocks forever waiting for a prompt nobody will answer.</p>

<p>I wasted about 20 minutes watching it hang before I figured this out. For autonomous operation you need <code class="language-plaintext highlighter-rouge">bypassPermissions</code>, which maps to <code class="language-plaintext highlighter-rouge">--dangerously-skip-permissions</code>. The flag name is honest about the risk. I’m comfortable with it because the agents run in isolated worktrees on disposable branches, not in my main checkout.</p>
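
<p>A sketch of how an orchestrator could turn the configured mode into a command line and refuse modes that would wait for a human. The function name is hypothetical, and failing fast on every other mode is a design choice made for this example, not necessarily what Baton does:</p>

```python
def agent_command(prompt, permission_mode, command="claude"):
    """Build the argv for one non-interactive agent run."""
    argv = [command, "-p", prompt]
    if permission_mode == "bypassPermissions":
        argv.append("--dangerously-skip-permissions")
    else:
        # Any mode that can prompt would hang with nobody to answer.
        raise ValueError(
            f"{permission_mode!r} can prompt for approval, "
            "which would block an unattended run"
        )
    return argv


print(agent_command("Fix issue #42", "bypassPermissions"))
```

<p>Raising at startup turns a silent hang into an immediate, readable error.</p>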

<h2 id="auto-releasing-on-pr-creation">Auto-releasing on PR creation</h2>

<p>My first version had a dumb problem. The agent would finish its work, create a PR on turn 2 of 5, and Baton would keep scheduling continuation turns for the remaining 3. The slot was occupied but nobody was doing anything useful.</p>

<p>The fix: after each worker finishes, check if a PR exists for that issue’s branch. If yes, release the claim immediately and free up the slot. If not, schedule a short retry.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pr_exists</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="p">.</span><span class="n">tracker</span><span class="p">.</span><span class="n">check_pr_exists</span><span class="p">(</span><span class="n">issue</span><span class="p">.</span><span class="n">number</span><span class="p">)</span>
<span class="k">if</span> <span class="n">pr_exists</span><span class="p">:</span>
    <span class="n">log</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"PR_READY #</span><span class="si">{</span><span class="n">issue</span><span class="p">.</span><span class="n">number</span><span class="si">}</span><span class="s"> -- PR found, releasing claim"</span><span class="p">)</span>
    <span class="k">return</span> <span class="s">"pr_created"</span>
<span class="k">return</span> <span class="s">"no_pr"</span>
</code></pre></div></div>

<p>Small change, but it meant the orchestrator stopped wasting time on finished work.</p>
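
<p>The slot accounting can be pictured with a semaphore: a worker holds a slot while it runs and gives it back as soon as it returns an outcome. This is a self-contained sketch with a simulated agent. The function names and the odd/even rule are made up, and Baton’s real dispatcher may be structured differently:</p>

```python
import asyncio


async def run_worker(issue_number, slots, results):
    """Hold one slot while the (simulated) agent works on an issue."""
    async with slots:
        await asyncio.sleep(0.01)  # stand-in for the agent subprocess
        # Pretend odd-numbered issues ended with a PR.
        outcome = "pr_created" if issue_number % 2 else "no_pr"
        results[issue_number] = outcome
    # Leaving the 'async with' block frees the slot for the next issue.


async def dispatch(issue_numbers, max_concurrent):
    """Run every issue, never more than max_concurrent at a time."""
    slots = asyncio.Semaphore(max_concurrent)
    results = {}
    await asyncio.gather(
        *(run_worker(n, slots, results) for n in issue_numbers)
    )
    return results


print(asyncio.run(dispatch([1, 2, 3, 4], max_concurrent=3)))
```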

<h2 id="extensibility-through-skills-and-mcp-servers">Extensibility through skills and MCP servers</h2>

<p>Baton itself is deliberately simple. It polls, dispatches, and manages worktrees. The interesting part is what you put in the prompt and what tools you give the agent.</p>

<p>Claude Code supports MCP servers, which means you can wire up external tools and the agent can use them during its run. Baton passes MCP server config through to each worker:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">agent</span><span class="pi">:</span>
  <span class="na">mcp_servers</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">playwright</span>
      <span class="na">command</span><span class="pi">:</span> <span class="s">npx @playwright/mcp@latest</span>
</code></pre></div></div>

<p>That means the agent has access to a headless browser while it works. It can open a page, click around, take screenshots, verify that the UI renders correctly. You don’t have to build that into Baton. You just declare which MCP servers you want and the agent figures out when to use them.</p>

<p>Same idea with CLI tools. If <a href="https://github.com/vercel-labs/agent-browser">agent-browser</a> is installed on the machine, you can tell the agent to use it in the prompt template. “Before creating a PR, open the app with agent-browser and verify the acceptance criteria.” The agent spins up a local server, opens the page, clicks buttons, fills inputs, takes snapshots. All from instructions in WORKFLOW.md, nothing hardcoded in the orchestrator.</p>

<p>Claude Code also has skills, which are reusable prompt fragments that teach the agent specific capabilities. If you have a code review skill or a testing skill installed, the agent can use them during its run. Baton’s config supports a <code class="language-plaintext highlighter-rouge">skills</code> list for this:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">agent</span><span class="pi">:</span>
  <span class="na">skills</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">code-reviewer</span>
    <span class="pi">-</span> <span class="s">accessibility-checker</span>
</code></pre></div></div>

<p>You can also override skills per issue by adding a <code class="language-plaintext highlighter-rouge">## Skills</code> section to the issue body. If one issue needs Playwright but the others don’t, just add it to that issue.</p>
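
<p>The post doesn’t show what goes under the <code class="language-plaintext highlighter-rouge">## Skills</code> heading, so the sketch below assumes a plain bullet list with one skill name per line. The function name and the sample issue body are hypothetical:</p>

```python
import re


def skills_from_issue(body, default_skills):
    """Return skills listed under '## Skills', else the defaults."""
    # Capture everything between the heading and the next '## ' heading.
    match = re.search(r"^## Skills[ \t]*\n(.*?)(?=^## |\Z)", body,
                      flags=re.MULTILINE | re.DOTALL)
    if not match:
        return list(default_skills)
    skills = re.findall(r"^[-*][ \t]+(\S+)", match.group(1),
                        flags=re.MULTILINE)
    return skills or list(default_skills)


body = (
    "Add keyboard navigation.\n\n"
    "## Skills\n"
    "- accessibility-checker\n"
    "- code-reviewer\n\n"
    "## Notes\nNone.\n"
)
print(skills_from_issue(body, ["code-reviewer"]))
```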

<p>The point is that Baton doesn’t need to know about browsers or test runners or linters. It just needs to dispatch agents with the right config. The prompt and the tools do the rest.</p>

<h2 id="putting-it-together-a-todo-app-from-scratch">Putting it together: a todo app from scratch</h2>

<p>To see all of this working end to end, I had Baton build a todo app. Fresh repo, no code. I created three GitHub issues labeled <code class="language-plaintext highlighter-rouge">baton</code>:</p>

<ol>
  <li>Create basic HTML structure</li>
  <li>Add JavaScript for create/delete</li>
  <li>Add localStorage persistence</li>
</ol>

<p>The WORKFLOW.md prompt told the agent to use agent-browser for verification before opening PRs. I ran <code class="language-plaintext highlighter-rouge">baton start</code> and went to make coffee.</p>

<p>Baton picked up issue #1, created a worktree on <code class="language-plaintext highlighter-rouge">baton/create-basic-todo-app-html-structure-1</code>, and dispatched Claude Code. The agent wrote <code class="language-plaintext highlighter-rouge">index.html</code>, spun up a local server with <code class="language-plaintext highlighter-rouge">npx serve</code>, opened it with agent-browser, confirmed the layout rendered, then committed, pushed, and opened a PR. The PR description included what agent-browser found:</p>

<blockquote>
  <p>Opened <code class="language-plaintext highlighter-rouge">http://localhost:3456</code> and confirmed the page renders correctly.
Ran <code class="language-plaintext highlighter-rouge">agent-browser snapshot -i</code> confirming interactive elements: textbox and button.</p>
</blockquote>

<p>I merged it. The issue auto-closed (the PR had <code class="language-plaintext highlighter-rouge">Closes #1</code>). Baton saw the issue was gone on the next poll, released the slot, and picked up issue #2. Same cycle. Then #3.</p>
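
<p>That reconciliation step amounts to a set comparison between the issues Baton has claimed and the issues the tracker still reports as open. A minimal sketch with hypothetical names:</p>

```python
def reconcile(claimed, open_issues):
    """Split claimed issue numbers into (still_open, released)."""
    open_set = set(open_issues)
    still_open = sorted(n for n in claimed if n in open_set)
    released = sorted(n for n in claimed if n not in open_set)
    return still_open, released


# Issue 1 was merged and closed; 2 and 3 are still open and unclaimed.
print(reconcile(claimed={1}, open_issues=[2, 3]))  # ([], [1])
```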

<p>Three issues, three PRs, three merges. I didn’t write a line of the todo app. The agent-browser verification wasn’t built into Baton. It was just instructions in the prompt and a CLI tool on my machine.</p>

<h2 id="getting-started">Getting started</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install</span> <span class="nt">-e</span> <span class="nb">.</span>
<span class="nb">cp </span>WORKFLOW.md.example WORKFLOW.md
<span class="c"># Edit WORKFLOW.md: set your labels, tweak the prompt</span>
baton start
</code></pre></div></div>

<p>You need Python 3.11+, Claude Code CLI (<code class="language-plaintext highlighter-rouge">claude</code>), GitHub CLI (<code class="language-plaintext highlighter-rouge">gh</code>) authenticated, and Git.</p>

<p>The code is at <a href="https://github.com/mraza007/baton">github.com/mraza007/baton</a>. MIT licensed. About 10 Python modules, no external services, no databases. State lives in memory with JSON persistence for the status command.</p>

<h2 id="what-i-want-to-add-next">What I want to add next</h2>

<ul>
  <li>A proper TUI instead of <code class="language-plaintext highlighter-rouge">baton status</code> reading a JSON file</li>
  <li>Issue dependency ordering so issue 3 waits for issue 2 if it needs to</li>
  <li>Cost tracking per issue, so I can see what automating the backlog actually costs in tokens</li>
  <li>More trackers besides GitHub Issues (Linear, Jira, GitLab)</li>
</ul>

<p>If you’ve got a repo with a pile of issues sitting there, try pointing Baton at it. Start with one label and <code class="language-plaintext highlighter-rouge">max_concurrent: 1</code>. See what it does. The setup takes about five minutes and the worst case is you get a bad PR that you close. The code is MIT licensed, the whole thing is ten files, and there’s nothing weird in it. Fork it, break it, rip out the parts you don’t like.</p>

<p>If you try it, I want to hear what breaks.</p>

<hr />

<p>I write a newsletter called <a href="https://devconsole.substack.com/">Dev Console</a> where I cover what’s actually happening in AI, minus the hype. New tools, real use cases, stuff I’m building. If this post was interesting, you’ll probably like it.</p>

          ]]>
        </description>
        <pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate>
        <link>//muhammadraza.me/2026/building-baton-autonomous-agent-orchestrator/</link>
        <guid isPermaLink="true">//muhammadraza.me/2026/building-baton-autonomous-agent-orchestrator/</guid>
        
        <category>ai</category>
        
        <category>python</category>
        
        <category>tools</category>
        
        <category>automation</category>
        
        
        
        <dc:creator>Muhammad Raza</dc:creator>
        <dc:rights></dc:rights>
      </item>
    
      <item>
        <title>Harness Engineering: The DevOps Skill Nobody Told You About</title>
        <description>
          <![CDATA[
            
            <p>I’ve written before about how <a href="https://muhammadrazame.github.io/blog/2026/01/03/ai-agents-devops-perspective">AI agents are just CI pipelines with an LLM plugged in</a>. That post mapped agent concepts to infrastructure patterns you already know. But there’s a discipline forming around the infrastructure side of agents that deserves its own name.</p>

<p>Harness engineering. It’s the practice of building everything around the LLM — the execution environment, tool definitions, safety boundaries, observability, and lifecycle management. The stuff that turns a chatbot into a production system.</p>

<p>If you work in DevOps, you’ve been doing this for years. You just called it something else.</p>

<h2 id="why-harnesses-matter-more-than-models">Why Harnesses Matter More Than Models</h2>

<p>Pick any AI agent demo. Strip out the model. What’s left?</p>

<p>A container or sandbox. A set of callable tools. A loop that reads output and decides what happens next. Logging. Timeouts. Cleanup.</p>

<p>That’s the harness. And it’s where agents succeed or fail. A great model in a bad harness hallucinates, loops forever, leaks secrets, or silently does nothing useful. A decent model in a good harness stays bounded, recovers from errors, and produces auditable results.</p>

<p>DevOps engineers already think this way. You don’t just pick a good application — you build the infrastructure that makes it reliable. Same thing here.</p>

<h2 id="the-five-parts-of-a-harness">The Five Parts of a Harness</h2>

<p>Here’s how I break down harness engineering into components. Each one maps directly to something you’ve built before.</p>

<p><strong>1. Execution environment.</strong> Where does the agent run? A container, a VM, a temporary directory, a git worktree. You need isolation so the agent can’t corrupt shared state. You need reproducibility so runs are consistent. This is the same problem as CI job runners. Docker, Firecracker, nsjail — pick your isolation boundary.</p>

<p><strong>2. Tool definitions.</strong> Tools are the agent’s API surface. Read a file. Run a command. Query a database. Call an endpoint. Each tool needs input validation, output formatting, error handling, and permission scoping. Think of it like designing an API — you wouldn’t expose raw database access through a REST endpoint. Don’t give an agent raw shell access either. The tool layer is your contract.</p>

<p><strong>3. Control loop.</strong> Observe, decide, execute, verify. The loop is what makes an agent an agent instead of a one-shot prompt. Your job as a harness engineer is to decide: how many iterations? What’s the timeout per step? What happens when a tool call fails? When does the loop escalate to a human? This is the same logic you put in health check loops and deployment rollback controllers.</p>

<p><strong>4. Guardrails.</strong> Cost caps. Token limits. Command allowlists. File path restrictions. Rate limiting on external calls. Without guardrails, an agent can burn through your API budget in minutes or write to paths it shouldn’t touch. Every guardrail is a policy decision — same as IAM policies, network rules, and resource quotas you already manage.</p>

<p><strong>5. Observability.</strong> If you can’t see what the agent did, you can’t debug it, audit it, or trust it. Log every tool call, every LLM response, every decision point. Capture diffs, timing, token usage, and cost. This is no different from structured logging in any production system. The difference is that agent traces are longer and less predictable than HTTP request traces, so you need good tooling to navigate them.</p>
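
<p>As a rough illustration of logging every tool call, here is a wrapper that records one structured entry per call, including timing and errors. The <code class="language-plaintext highlighter-rouge">traced</code> helper and the record fields are assumptions made for this sketch; a real harness would also capture LLM responses, token usage, and cost.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json
import time


def traced(tool_name, tool_fn, trace):
    """Wrap a tool so every call appends one structured record to `trace`."""
    def wrapper(**kwargs):
        started = time.monotonic()
        record = {"tool": tool_name, "args": kwargs}
        try:
            result = tool_fn(**kwargs)
            record["status"] = "ok"
            return result
        except Exception as exc:
            record["status"] = "error"
            record["error"] = repr(exc)
            raise
        finally:
            # Runs on success and failure, so no call goes unrecorded.
            record["duration_s"] = round(time.monotonic() - started, 4)
            trace.append(record)
    return wrapper


trace = []
word_count = traced("word_count", lambda text: len(text.split()), trace)
word_count(text="disk full on runner 7")

# One JSON object per line is easy to grep, ship, and replay later.
for record in trace:
    print(json.dumps(record))
</code></pre></div></div>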

<h2 id="where-devops-context-overlaps">Where DevOps Context Overlaps</h2>

<p>Here’s where your existing skills plug in directly.</p>

<p><strong>Infrastructure as code.</strong> Agent harnesses should be declarative and version-controlled. The tool definitions, policies, and environment specs should live in config files, not hardcoded in application logic. When you change a tool’s behavior, that change should be reviewable in a PR.</p>

<p><strong>Pipeline orchestration.</strong> Multi-agent systems look a lot like multi-stage pipelines. One agent does research, passes context to a planning agent, which passes a plan to an implementation agent. You’re managing handoffs, shared artifacts, and failure propagation — the same coordination problem as CI/CD stages.</p>

<p><strong>Incident response.</strong> When an agent goes wrong, you need the same muscle memory. Check the logs. Find the failing step. Understand the input that caused it. Roll back if needed. The debugging workflow is identical.</p>

<p><strong>Security boundaries.</strong> Least privilege applies to agents just like it applies to services. What tools can this agent access? What files can it read? Can it make network calls? Can it spend money? Every agent needs a security boundary, and DevOps engineers already think in terms of boundaries.</p>

<h2 id="getting-started">Getting Started</h2>

<p>If you want to start building harnesses, you don’t need a new framework. Start with what you have.</p>

<p>Take a simple task, such as analyzing a failed CI build. Write a script that collects the logs, sends them to an LLM with a prompt, parses the response, and posts a summary to Slack. That’s a harness. A minimal one, but it already has the core components: environment setup, tool use (log collection, Slack posting), a control flow, and output handling.</p>

<p>Then add complexity. Let the LLM decide which logs to fetch. Add a retry loop. Add a cost cap. Add structured logging. Each addition is a harness engineering decision.</p>

<p>You don’t need to learn ML. You don’t need to fine-tune models. You need to build the infrastructure that makes models useful — and that’s the job you already do.</p>

<p>Harness engineering isn’t a new discipline. It’s DevOps applied to a new kind of workload. The sooner you see it that way, the faster you’ll build agents that actually work in production.</p>

          ]]>
        </description>
        <pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate>
        <link>//muhammadraza.me/2026/harness-engineering-devops-perspective/</link>
        <guid isPermaLink="true">//muhammadraza.me/2026/harness-engineering-devops-perspective/</guid>
        
        <category>ai</category>
        
        <category>devops</category>
        
        <category>automation</category>
        
        
        
        <dc:creator>Muhammad Raza</dc:creator>
        <dc:rights></dc:rights>
      </item>
    
      <item>
        <title>I Built Local Memory for Coding Agents Because They Keep Forgetting Everything</title>
        <description>
          <![CDATA[
            
            <p>Here’s something that frustrates me about coding agents. They forget everything. Every single session starts from scratch. The agent that spent 45 minutes yesterday figuring out your authentication flow? Gone. The decision to use JWT over sessions? Gone. The bug it found in your ORM’s lazy loading? Gone.</p>

<p>You start a new session and it re-discovers the same patterns. Repeats the same mistakes. Asks the same questions. It’s like working with a brilliant colleague who gets amnesia every night.</p>

<p>I got tired of this. So I built <a href="https://github.com/mraza007/echovault">EchoVault</a> — a local memory system that gives coding agents persistent memory across sessions. No cloud. No API keys. No cost. Just a SQLite database and some Markdown files on your machine.</p>

<h2 id="the-problem-is-real">The Problem Is Real</h2>

<p>I use coding agents daily across multiple client projects. Claude Code, Cursor, Codex — I switch between them depending on the task. Every time I start a session, I’m repeating context that the agent should already know.</p>

<p>“We chose FastAPI over Flask because of async support.”
“The deploy script needs <code class="language-plaintext highlighter-rouge">--no-cache</code> or the CSS breaks.”
“Don’t touch the legacy auth module — it’s being replaced next sprint.”</p>

<p>I was copy-pasting this stuff into every session. That’s not how tools should work.</p>

<p>I tried existing solutions. Supermemory announced their MCP and I was tempted, but it saves everything in the cloud. I work with multiple companies as a consultant — I don’t want codebase decisions stored on someone else’s servers. Claude Mem was the first tool I tried, but it was eating too much memory in my sessions and became a bottleneck when running multiple agents at the same time.</p>

<p>So I built my own.</p>

<h2 id="how-echovault-works">How EchoVault Works</h2>

<p>EchoVault runs as an MCP server. When your agent starts a session, it has three tools available:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">memory_context</code> — load prior decisions, bugs, and context for the current project</li>
  <li><code class="language-plaintext highlighter-rouge">memory_search</code> — find specific memories by keyword or semantic similarity</li>
  <li><code class="language-plaintext highlighter-rouge">memory_save</code> — persist a decision, bug fix, pattern, or learning</li>
</ul>

<p>The agent calls these tools like it calls any other tool. No hooks. No shell scripts. No prompt injection. The MCP protocol handles everything.</p>

<p>Here’s what happens in practice:</p>

<p><strong>Session start.</strong> The agent sees <code class="language-plaintext highlighter-rouge">memory_context</code> in its available tools. The tool description says “You MUST call this at session start.” The agent calls it and gets back a list of prior memories for the project. Now it knows what happened yesterday.</p>

<p><strong>During work.</strong> You ask about authentication. The agent calls <code class="language-plaintext highlighter-rouge">memory_search</code> with “authentication” and gets back the decision to use JWT, the bug with token refresh, and the migration plan. It has context before writing a single line of code.</p>

<p><strong>Session end.</strong> The agent just fixed a tricky race condition. The tool description says “You MUST call memory_save before ending any session where you made changes.” It saves the root cause, the fix, and what to watch for.</p>

<p>Next session, that knowledge is there. Every session builds on the last one.</p>

<h2 id="the-architecture">The Architecture</h2>

<p>I kept it simple. The whole system is four things:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/.memory/
├── vault/                    # Obsidian-compatible Markdown
│   └── my-project/
│       └── 2026-02-01-session.md
├── index.db                  # SQLite: FTS5 + sqlite-vec
└── config.yaml               # Optional embedding config
</code></pre></div></div>

<p><strong>Markdown vault.</strong> Every memory gets written to a session file — one file per day per project. These are valid Markdown with YAML frontmatter. You can point Obsidian at <code class="language-plaintext highlighter-rouge">~/.memory/vault/</code> and browse your agent’s memory visually. You can read them in any editor. They’re not locked in a proprietary format.</p>

<p><strong>SQLite index.</strong> This is where search happens. FTS5 handles keyword search out of the box — no configuration needed. If you want semantic search (where “authentication” matches a memory titled “JWT token setup”), add an embedding provider. I use Ollama with <code class="language-plaintext highlighter-rouge">nomic-embed-text</code> locally. You can also use OpenAI or OpenRouter if you prefer cloud.</p>

<p><strong>MCP server.</strong> The agent talks to EchoVault through the Model Context Protocol. Three tools, stdio transport, nothing fancy. The server starts when the agent needs it and stops when the session ends. Zero idle cost.</p>

<p><strong>Secret redaction.</strong> Three layers. Explicit <code class="language-plaintext highlighter-rouge">&lt;redacted&gt;</code> tags for things you mark yourself. Pattern detection that catches API keys, passwords, and credentials automatically. And <code class="language-plaintext highlighter-rouge">.memoryignore</code> rules for custom patterns. Nothing sensitive hits disk.</p>
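
<p>To give a feel for the pattern-detection layer, here is a minimal sketch using regular expressions. The two patterns and the <code class="language-plaintext highlighter-rouge">redact</code> function are illustrative assumptions, not EchoVault’s actual rules; a real detector needs a broader, tested set.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re

# Illustrative patterns only: (compiled regex, replacement) pairs.
SECRET_PATTERNS = [
    # Strings shaped like an AWS access key ID.
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED]"),
    # key = value or key: value pairs with a sensitive-looking key name.
    (
        re.compile(r"(?i)\b(password|passwd|api[_-]?key|token)\b(\s*[:=]\s*)\S+"),
        r"\1\2[REDACTED]",
    ),
]


def redact(text):
    """Replace anything matching a known secret pattern before it is stored."""
    for pattern, replacement in SECRET_PATTERNS:
        text = pattern.sub(replacement, text)
    return text


print(redact("password = hunter2"))
print(redact("found key AKIAABCDEFGHIJKLMNOP in the deploy log"))
</code></pre></div></div>

<p>Running redaction before the write, instead of cleaning up afterwards, is what keeps the raw value from ever reaching the Markdown file or the index.</p>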

<h2 id="making-agents-actually-save">Making Agents Actually Save</h2>

<p>Here’s the thing about MCP tools — the agent <em>can</em> call them, but will it? Retrieval works well because agents tend to grab context at the start. Saving is the hard part. The agent finishes its work and moves on. It doesn’t naturally think “I should save what I learned.”</p>

<p>The trick is the tool descriptions. When you register an MCP tool, you include a description. Agents read these descriptions and treat them as instructions. So instead of:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"Save a memory for future sessions. Call this when you make decisions."
</code></pre></div></div>

<p>I wrote:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"Save a memory for future sessions. You MUST call this before ending
any session where you made changes, fixed bugs, made decisions, or
learned something. This is not optional — failing to save means the
next session starts from zero."
</code></pre></div></div>

<p>That “MUST” language makes a real difference. It’s not 100% reliable — nothing with LLMs is — but agents follow strong tool descriptions much more consistently than passive ones.</p>

<h2 id="cross-agent-memory">Cross-Agent Memory</h2>

<p>One of the things I wanted was a single vault for all my agents. A memory saved by Claude Code should be searchable from Cursor or Codex. They’re all working on the same codebase. Why should they have separate memories?</p>

<p>EchoVault stores everything in one place. The MCP server is the same regardless of which agent connects to it. Setup is one command per agent:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>memory setup claude-code   <span class="c"># writes ~/.claude.json</span>
memory setup cursor        <span class="c"># writes .cursor/mcp.json</span>
memory setup codex         <span class="c"># writes .codex/config.toml + AGENTS.md</span>
memory setup opencode      <span class="c"># writes opencode.json</span>
</code></pre></div></div>

<p>Each agent has its own config format and conventions. Claude Code uses JSON with <code class="language-plaintext highlighter-rouge">mcpServers</code>. Cursor uses the same schema but different file paths. Codex uses TOML with <code class="language-plaintext highlighter-rouge">[mcp_servers]</code>. OpenCode uses JSON with a <code class="language-plaintext highlighter-rouge">mcp</code> key and a different command format (<code class="language-plaintext highlighter-rouge">command</code> as an array instead of separate <code class="language-plaintext highlighter-rouge">command</code> + <code class="language-plaintext highlighter-rouge">args</code>).</p>

<p>I wrote shared helpers so each agent’s setup is just a thin wrapper around <code class="language-plaintext highlighter-rouge">_install_mcp_servers()</code> or <code class="language-plaintext highlighter-rouge">_install_toml_mcp()</code>. Adding a new agent takes maybe 20 lines of code.</p>

<h2 id="what-gets-saved">What Gets Saved</h2>

<p>Not everything should be a memory. Trivial changes don’t need to be persisted. Information that’s obvious from reading the code doesn’t need a memory. The goal is to capture what a future agent wouldn’t know from just looking at the codebase.</p>

<p>Good memories:</p>

<ul>
  <li><strong>Decisions.</strong> “Chose JWT over sessions because the API needs to be stateless.” A future agent reading the code sees JWT but doesn’t know <em>why</em>.</li>
  <li><strong>Bugs.</strong> “The ORM lazy-loads relationships by default, causing N+1 queries in the user list endpoint. Fixed by adding <code class="language-plaintext highlighter-rouge">.options(joinedload(...))</code>. Root cause: SQLAlchemy default behavior.” A future agent won’t hit the same bug.</li>
  <li><strong>Patterns.</strong> “All API endpoints follow the pattern: validate input, check permissions, execute, return response. Don’t add business logic in the route handler.” A future agent follows the existing patterns instead of inventing new ones.</li>
  <li><strong>Context.</strong> “The legacy auth module is being replaced. Don’t modify it — changes go into the new auth service at <code class="language-plaintext highlighter-rouge">src/auth/v2/</code>.” A future agent doesn’t waste time on dead code.</li>
</ul>

<p>Each memory has a title, a “what happened” summary, optional “why” and “impact” fields, tags, and a category. Search returns compact ~50-token summaries. Full details are fetched on demand so context windows don’t get bloated.</p>

<h2 id="the-technical-bits">The Technical Bits</h2>

<p>A few implementation details that might be useful if you’re building something similar.</p>

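<p><strong>FTS5 for keyword search.</strong> SQLite’s FTS5 extension is fast and works with zero configuration. No external service needed. It handles phrase matching and BM25 ranking out of the box, and stemming if you opt into the <code class="language-plaintext highlighter-rouge">porter</code> tokenizer. For most use cases, this is all you need.</p>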
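
<p>If you haven’t used FTS5 before, this is roughly what keyword search looks like from Python. The table name, columns, and sample rows are made up for the example and are not EchoVault’s schema. It also assumes your Python’s bundled SQLite was compiled with FTS5, which is common but not guaranteed.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one FTS5 table with two indexed text columns.
conn.execute("CREATE VIRTUAL TABLE memories USING fts5(title, body)")
conn.executemany(
    "INSERT INTO memories (title, body) VALUES (?, ?)",
    [
        ("JWT over sessions", "Chose JWT because the API needs to be stateless."),
        ("N+1 in user list", "Lazy loading caused N+1 queries; fixed with joinedload."),
        ("Deploy script", "Run the deploy with no-cache or the CSS breaks."),
    ],
)


def search(query, limit=5):
    """Return matching titles, best match first (a lower bm25 score is better)."""
    rows = conn.execute(
        "SELECT title FROM memories WHERE memories MATCH ? "
        "ORDER BY bm25(memories) LIMIT ?",
        (query, limit),
    ).fetchall()
    return [title for (title,) in rows]


print(search("jwt"))             # single-term match, case-insensitive
print(search('"lazy loading"'))  # double quotes make it a phrase query
</code></pre></div></div>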

<p><strong>sqlite-vec for semantic search.</strong> When you want “authentication” to match “JWT token rotation”, you need vectors. I use <code class="language-plaintext highlighter-rouge">sqlite-vec</code> to store embeddings right in the same SQLite database. No vector database needed. Embedding providers are pluggable — Ollama for local, OpenAI or OpenRouter for cloud.</p>

<p><strong>Hybrid search.</strong> The search pipeline runs FTS5 first (fast, precise), then semantic search (slower, fuzzy), and merges the results. This gives you the best of both worlds — exact keyword matches and semantic similarity.</p>
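
<p>There are several ways to merge two ranked result lists. One common option is reciprocal rank fusion, sketched below. This is an illustration of the merge step in general, not necessarily how EchoVault does it; the function name, the memory IDs, and the constant <code class="language-plaintext highlighter-rouge">k=60</code> are assumptions.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def merge_ranked(keyword_ids, semantic_ids, k=60):
    """Fuse two ranked ID lists with reciprocal rank fusion (RRF)."""
    scores = {}
    for ranked in (keyword_ids, semantic_ids):
        for rank, memory_id in enumerate(ranked, start=1):
            # Each list contributes 1 / (k + rank); items in both lists add up.
            scores[memory_id] = scores.get(memory_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


keyword_hits = ["jwt-decision", "token-refresh-bug"]
semantic_hits = ["auth-migration-plan", "jwt-decision"]
print(merge_ranked(keyword_hits, semantic_hits))
</code></pre></div></div>

<p>A memory that shows up in both lists rises to the top, which is usually the behavior you want from hybrid search. Because the fusion only looks at rank positions, the keyword and vector scores never have to be on the same scale.</p>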

<p><strong>TOML parsing with fallbacks.</strong> Codex writes some non-standard TOML — unquoted filesystem paths as table keys, dotted version strings as key names. Standard <code class="language-plaintext highlighter-rouge">tomllib</code> chokes on these. I added a fallback that appends the MCP section directly via string operations when parsing fails. It’s not pretty but it handles real-world config files.</p>
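
<p>The parse-then-fall-back idea can be sketched like this. The section contents (<code class="language-plaintext highlighter-rouge">command</code>, <code class="language-plaintext highlighter-rouge">args</code>) and the function name are placeholders I picked for the example, not the exact values EchoVault writes, and <code class="language-plaintext highlighter-rouge">tomllib</code> requires Python 3.11 or newer.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import tomllib  # standard library since Python 3.11

# Placeholder section; the real command and args may differ.
MCP_HEADER = "[mcp_servers.memory]"
MCP_SECTION = MCP_HEADER + '\ncommand = "memory"\nargs = ["mcp"]\n'


def add_mcp_section(config_text):
    """Return config_text with the MCP section added, tolerating invalid TOML."""
    try:
        parsed = tomllib.loads(config_text)
    except tomllib.TOMLDecodeError:
        # Fallback: the file is not valid TOML, so only look for the
        # header textually and leave everything else untouched.
        already_present = MCP_HEADER in config_text
    else:
        already_present = "memory" in parsed.get("mcp_servers", {})
    if already_present:
        return config_text
    return config_text.rstrip("\n") + "\n\n" + MCP_SECTION


valid = 'model = "example"\n'
# An unquoted filesystem path as a table key is not valid TOML.
non_standard = '[projects./home/me/app]\ntrust = "yes"\n'

print(add_mcp_section(valid))
print(add_mcp_section(non_standard))
</code></pre></div></div>

<p>Appending text instead of parsing and re-serializing also means the rest of the file, including its comments and ordering, is left exactly as the agent wrote it.</p>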

<p><strong>Symlink handling.</strong> Some agents create symlinks in their skill directories. <code class="language-plaintext highlighter-rouge">shutil.rmtree()</code> raises an error when the path you pass it is itself a symlink, so links have to be removed separately. Small thing but it bit me in production.</p>
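
<p>A small helper avoids the problem by checking for a link before deciding how to delete. This is a sketch with a hypothetical <code class="language-plaintext highlighter-rouge">remove_path</code> name, and the demo assumes a platform where creating symlinks is permitted.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os
import shutil
import tempfile


def remove_path(path):
    """Delete a file, directory tree, or symlink without following the link."""
    if os.path.islink(path):
        os.unlink(path)        # removes the link itself, never its target
    elif os.path.isdir(path):
        shutil.rmtree(path)
    elif os.path.exists(path):
        os.remove(path)


# Demo in a throwaway directory: a real skill directory plus a symlink to it.
root = tempfile.mkdtemp()
target = os.path.join(root, "real-skill")
link = os.path.join(root, "linked-skill")
os.mkdir(target)
os.symlink(target, link)

remove_path(link)
print(os.path.lexists(link), os.path.isdir(target))
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">islink</code> check has to come first, because <code class="language-plaintext highlighter-rouge">os.path.isdir()</code> follows links and would report a symlinked directory as a plain directory.</p>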

<h2 id="setting-it-up">Setting It Up</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>git+https://github.com/mraza007/echovault.git
memory init
memory setup claude-code
</code></pre></div></div>

<p>That’s it. Three commands. The agent has memory now.</p>

<p>If you want semantic search, configure an embedding provider:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>memory config init
<span class="c"># Edit ~/.memory/config.yaml to set your provider</span>
memory reindex
</code></pre></div></div>

<p>For fully local operation with no external API calls, use Ollama:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">embedding</span><span class="pi">:</span>
  <span class="na">provider</span><span class="pi">:</span> <span class="s">ollama</span>
  <span class="na">model</span><span class="pi">:</span> <span class="s">nomic-embed-text</span>
</code></pre></div></div>

<h2 id="what-ive-learned">What I’ve Learned</h2>

<p>Building this taught me a few things about agent tooling.</p>

<p><strong>Tool descriptions are instructions.</strong> Agents read them and follow them. Strong, directive language in tool descriptions is more effective than passive documentation. “You MUST” works better than “You can.”</p>

<p><strong>Local-first matters.</strong> Not because of ideology, but because of practical constraints. Consultants work with multiple clients. Sensitive decisions shouldn’t leave the machine. And when your internet goes out, local tools still work.</p>

<p><strong>MCP is the right abstraction.</strong> Instead of writing agent-specific hooks, skills, and config formats, I write one MCP server and each agent connects to it. When a new agent comes along, I add a setup function for its config format. The memory logic doesn’t change.</p>

<p><strong>Simple storage wins.</strong> Markdown files you can read in any editor. SQLite you can query with any tool. No custom binary formats. No daemon to keep running. The system is completely inspectable and debuggable.</p>

<p>The code is at <a href="https://github.com/mraza007/echovault">github.com/mraza007/echovault</a>. It’s MIT licensed. If you’re tired of your agents forgetting everything, give it a shot.</p>

          ]]>
        </description>
        <pubDate>Tue, 17 Feb 2026 00:00:00 +0000</pubDate>
        <link>//muhammadraza.me/2026/building-local-memory-for-coding-agents/</link>
        <guid isPermaLink="true">//muhammadraza.me/2026/building-local-memory-for-coding-agents/</guid>
        
        <category>ai</category>
        
        <category>python</category>
        
        <category>tools</category>
        
        
        
        <dc:creator>Muhammad Raza</dc:creator>
        <dc:rights></dc:rights>
      </item>
    
      <item>
        <title>AI Agents Are Just CI Pipelines With an LLM Plugged In</title>
        <description>
          <![CDATA[
            
            <p>In this post, I’ll show you how to think about AI agents through the infrastructure patterns you already use. Think about your CI runner. It spins up an environment. Runs some steps. Reads files. Runs tests. Captures output. Decides what to do next. Knows when to stop.</p>

<p>Now swap out the hardcoded logic for an LLM. That’s it. That’s an AI agent in simple terms. The fancy demos want you to think it’s magic, some brand new thing you need to learn from scratch. It’s not. When you take away the hype, an agent is just a controlled automation loop. The LLM handles the reasoning and everything else is infrastructure you’ve built a hundred times.</p>

<p>Here’s what matters: the agent itself isn’t the hard part. The harness is. That means the execution environment, tooling, guardrails, and observability, all the important stuff that makes automation work in production.</p>

<p>DevOps engineers have been building harnesses forever. CI runners. Deployment pipelines. Infrastructure automation. The patterns are the same. The skills transfer directly.</p>

<p>So if you’re wondering whether AI agents are worth learning, here’s the short answer. You’re already halfway there.</p>

<h2 id="what-an-agent-actually-looks-like">What an Agent Actually Looks Like</h2>

<p>Let’s set the marketing hype aside and look at what an agent actually is from a DevOps engineer’s point of view. An AI agent has six parts.</p>

<ol>
  <li>
    <p><code class="language-plaintext highlighter-rouge">An LLM</code>: The LLM is the most important part of an agent because it acts as the brain. It reads context and decides what to do next. It doesn’t touch anything directly.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">A workspace</code>: Think of it as a sandboxed environment. A cloned repo. A container. A temp directory. Same as any CI job.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">A set of tools</code>: These are the actions it can request. Read a file. Run a command. Call an API. Query logs. The agent doesn’t run these itself. It asks for them.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">A control loop</code>: This is the core pattern. Observe the current state. Decide an action. Execute it. Check the result. Keep going until you’re done.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">Policies and limits</code>: Timeouts. Permission boundaries. Rate limits. Cost caps. Without these, agents can spin forever or do things they shouldn’t.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">A termination condition</code>: The agent needs to know when to stop. Task complete. Error threshold hit. Human review needed. Something has to end the loop.</p>
  </li>
</ol>

<p>None of this is new. You’ve built systems with all of these components. The only difference is the LLM sitting in the decision seat.</p>

<h2 id="the-harness-does-the-heavy-lifting">The Harness Does the Heavy Lifting</h2>

<p>Everyone focuses on the LLM. They miss the important part. The harness is what makes an agent actually work.</p>

<p>The harness is everything around the model. It spins up the environment. Exposes tools. Executes commands on the agent’s behalf. Captures logs and diffs. Enforces limits. Decides when the loop should stop.</p>

<p>Sound familiar? It should. This is what CI runners do.</p>

<p>GitHub Actions. GitLab runners. Jenkins agents. They all follow the same pattern. Spin up an isolated environment. Run steps. Capture output. Handle success and failure. Clean up.</p>

<p>An agent harness does the exact same thing. The only twist is the steps aren’t hardcoded in YAML. They come from the LLM at runtime.</p>

<p>This is why DevOps engineers are perfect for this work. You already think about isolation, execution, logging, and cleanup. You already build systems that run untrusted code safely. Agent harnesses are the same problem with a new input source.</p>

<h2 id="tool-use-is-the-safety-mechanism">Tool Use Is the Safety Mechanism</h2>

<p>Agents don’t touch systems directly. This matters. The LLM never runs a command itself. Never writes a file itself. It requests actions through tools.</p>

<p>The harness gets the request. Validates it. Executes it in a controlled way. Returns a structured result.</p>

<p>This is how you keep agents safe.</p>

<p>Say the agent wants to run a shell command. The harness can check it against an allowlist. Run it in a sandbox. Set a timeout. Capture stderr. The agent never gets raw shell access.</p>
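
<p>Here’s a minimal sketch of that command tool: allowlist check, no shell, a timeout, and captured output. The allowlist contents and the result shape are assumptions for illustration, the demo assumes a Unix-like system with <code class="language-plaintext highlighter-rouge">echo</code> on the path, and real sandboxing (a container or similar) is deliberately left out.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import shlex
import subprocess

ALLOWED_COMMANDS = {"echo", "ls", "cat"}  # hypothetical allowlist


def run_command(command_line, cwd=".", timeout_s=30):
    """Run an allowlisted command without a shell and return a structured result."""
    try:
        argv = shlex.split(command_line)
    except ValueError:
        return {"ok": False, "error": "could not parse command"}
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        return {"ok": False, "error": "command not allowed"}
    try:
        # Passing a list (no shell=True) means pipes and ';' are never interpreted.
        proc = subprocess.run(
            argv, cwd=cwd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "error": "timed out"}
    except OSError as exc:
        return {"ok": False, "error": "could not start: " + repr(exc)}
    return {
        "ok": proc.returncode == 0,
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


print(run_command("echo hello"))
print(run_command("rm -rf /tmp/anything"))
</code></pre></div></div>

<p>Checking only the program name is a starting point. For tools like <code class="language-plaintext highlighter-rouge">git</code> you would also want to restrict subcommands and arguments.</p>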

<p>Same thing for file operations. The agent requests a file write. The harness checks the path. Validates the content. Writes the file and returns confirmation.</p>
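
<p>The file-write tool follows the same shape. In this sketch the path check resolves the target and refuses anything outside the workspace; the function name, size limit, and return format are hypothetical, and <code class="language-plaintext highlighter-rouge">Path.is_relative_to</code> needs Python 3.9 or newer.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import tempfile
from pathlib import Path


def write_file(workspace, relative_path, content, max_bytes=100_000):
    """Write inside the workspace only; reject escapes and oversized content."""
    root = Path(workspace).resolve()
    # resolve() collapses ".." and follows symlinks before the check.
    target = (root / relative_path).resolve()
    if not target.is_relative_to(root):
        return {"ok": False, "error": "path escapes workspace"}
    data = content.encode("utf-8")
    if len(data) > max_bytes:
        return {"ok": False, "error": "content too large"}
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return {"ok": True, "path": target.relative_to(root).as_posix(), "bytes": len(data)}


workspace = tempfile.mkdtemp()
print(write_file(workspace, "notes/todo.txt", "ship it"))
print(write_file(workspace, "../outside.txt", "nope"))
</code></pre></div></div>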

<p>You control what tools exist. You control how they behave. You control what the agent can even ask for.</p>

<p>This is the same idea behind least privilege. The agent only gets access to what it needs. The harness enforces the boundary.</p>

<h2 id="the-control-loop-in-practice">The Control Loop in Practice</h2>

<p>The core of any agent is the control loop. It looks like this.</p>

<ol>
  <li>
    <p><code class="language-plaintext highlighter-rouge">Observe</code>: The agent reads the current state. Test output. Log files. Diffs. Error messages. Whatever context it needs.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">Decide</code>: The LLM looks at the state and picks an action. Run another test. Edit a file. Ask for more information. Give up.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">Execute</code>: The harness runs the requested action and returns the result.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">Verify</code>: The agent checks if the action worked. Did the test pass? Did the error go away? Is the task done?</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">Repeat</code>: If the task isn’t complete, go back to observe.</p>
  </li>
</ol>

<p>This loop keeps running until a termination condition hits—success, failure, timeout, max iterations, or human intervention.</p>
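
<p>Stripped down, the loop fits in a few lines. In the sketch below the <code class="language-plaintext highlighter-rouge">decide</code> callable stands in for the LLM, and the action format (<code class="language-plaintext highlighter-rouge">type</code>, <code class="language-plaintext highlighter-rouge">tool</code>, <code class="language-plaintext highlighter-rouge">args</code>) is something I made up for the example; real frameworks each define their own.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def run_agent(decide, tools, task, max_iterations=10):
    """Observe, decide, execute, verify; stop on success, give-up, or the cap."""
    observations = [{"task": task}]
    for step in range(1, max_iterations + 1):
        action = decide(observations)  # the LLM call in a real agent
        if action["type"] == "done":
            return {"status": "success", "steps": step, "result": action.get("result")}
        if action["type"] == "give_up":
            return {"status": "needs_human", "steps": step}
        tool = tools.get(action["tool"])
        if tool is None:
            observations.append({"error": "unknown tool " + action["tool"]})
            continue
        try:
            output = tool(**action.get("args", {}))
            observations.append({"tool": action["tool"], "output": output})
        except Exception as exc:
            # Failures go back to the model as observations, not crashes.
            observations.append({"tool": action["tool"], "error": repr(exc)})
    return {"status": "max_iterations", "steps": max_iterations}


# Scripted stand-in for the LLM so the loop can run without a model.
def scripted_decide(observations):
    if len(observations) == 1:
        return {"type": "tool", "tool": "run_tests", "args": {}}
    return {"type": "done", "result": observations[-1]["output"]}


outcome = run_agent(scripted_decide, {"run_tests": lambda: "3 passed"}, "fix the build")
print(outcome)
</code></pre></div></div>

<p>Every exit path returns a status, so whatever called the loop can tell success from a human handoff from a hit iteration cap.</p>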

<p>You’ve seen this before: build, test, fix, rebuild. CI pipelines do this, deployment rollbacks do this, and health check loops do this.</p>

<p>Agents just make the “decide” step dynamic instead of scripted. Here’s where that actually helps in DevOps work.</p>

<p><strong>CI failure analysis.</strong> When a test fails, the agent reads the logs, checks the diff, identifies the cause, and suggests a fix—maybe even applying it and rerunning the test.</p>

<p><strong>Terraform drift detection.</strong> The agent compares actual state to declared state, flags the drift, and proposes a remediation plan. A human approves it before anything changes.</p>

<p><strong>Kubernetes manifest review.</strong> The agent checks YAML against best practices (missing resource limits, no liveness probes, exposed secrets), catching the stuff humans miss in review.</p>

<p><strong>Cost anomaly investigation.</strong> When spending spikes, the agent queries cost explorer, correlates with recent deployments, and surfaces the likely cause, saving an hour of digging.</p>

<p><strong>Incident log triage.</strong> Faced with pages of logs, the agent reads them, extracts the relevant lines, and summarizes what went wrong (not replacing the engineer, but getting them to the answer faster).</p>

<p>Notice the pattern: the agent assists and handles the tedious parts while the human stays in control of decisions that matter.</p>

<p>AI agents sound complicated with their new frameworks, new terminology, and new paradigms.</p>

<p>But look past the hype and you’ll see something familiar.</p>

<p>An agent is an automation loop where the LLM picks the next step, the harness executes it safely, tools provide controlled access to systems, and policies keep things bounded.</p>

<p>This is CI/CD architecture, infrastructure thinking, the stuff you already do.</p>

<p>When you read about agent frameworks or watch demos of coding assistants, you now have a lens to see the harness underneath, spot the control loop, and ask the right questions: what tools does it expose, what limits exist, and how does it handle failure?</p>

<p>You don’t need to become an ML engineer to understand agents—you just need to recognize the infrastructure patterns you’ve been using all along.</p>

<p>The LLM is the new part. Everything else is your domain.</p>

          ]]>
        </description>
        <pubDate>Sat, 03 Jan 2026 00:00:00 +0000</pubDate>
        <link>//muhammadraza.me/2026/ai-agents-devops-perspective/</link>
        <guid isPermaLink="true">//muhammadraza.me/2026/ai-agents-devops-perspective/</guid>
        
        <category>ai</category>
        
        <category>devops</category>
        
        <category>automation</category>
        
        
        
        <dc:creator>Muhammad Raza</dc:creator>
        <dc:rights></dc:rights>
      </item>
    
      <item>
        <title>AWS Cost Optimization Case Study: How I Cut a Client&apos;s Bill by 50%</title>
        <description>
          <![CDATA[
            
            <p>Last month, a client’s AWS bill hit $5,000 — up 40% from last year with no clear explanation.</p>

<p>After one week of systematic analysis, I cut it to <strong>$2,500/month</strong> — a 50% reduction, saving them <strong>$30,000 annually</strong>. Here’s exactly how I did it, with the scripts you can use.</p>

<h2 id="the-discovery-phase-how-i-found-the-problems">The Discovery Phase: How I Found the Problems</h2>

<p>Before touching anything, I needed to understand the infrastructure. Here’s my systematic approach:</p>

<h3 id="step-1-pull-cost-data-by-service">Step 1: Pull Cost Data by Service</h3>

<p>First, I analyzed their AWS Cost Explorer data to understand where money was going:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws ce get-cost-and-usage <span class="se">\</span>
  <span class="nt">--time-period</span> <span class="nv">Start</span><span class="o">=</span>2024-11-01,End<span class="o">=</span>2024-12-01 <span class="se">\</span>
  <span class="nt">--granularity</span> MONTHLY <span class="se">\</span>
  <span class="nt">--metrics</span> <span class="s2">"BlendedCost"</span> <span class="se">\</span>
  <span class="nt">--group-by</span> <span class="nv">Type</span><span class="o">=</span>DIMENSION,Key<span class="o">=</span>SERVICE
</code></pre></div></div>

<p>This gave me the high-level breakdown. But I needed more detail.</p>

<h3 id="step-2-build-a-resource-inventory">Step 2: Build a Resource Inventory</h3>

<p>I wrote a Python script to scan all resources across regions and identify optimization opportunities:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">boto3</span>

<span class="k">def</span> <span class="nf">scan_ebs_volumes</span><span class="p">():</span>
    <span class="s">"""Find GP2 volumes that should be GP3 and unattached volumes."""</span>
    <span class="n">ec2</span> <span class="o">=</span> <span class="n">boto3</span><span class="p">.</span><span class="n">client</span><span class="p">(</span><span class="s">'ec2'</span><span class="p">)</span>
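    <span class="c1"># Paginate so accounts with many volumes are scanned completely
</span>    <span class="n">paginator</span> <span class="o">=</span> <span class="n">ec2</span><span class="p">.</span><span class="n">get_paginator</span><span class="p">(</span><span class="s">'describe_volumes'</span><span class="p">)</span>
    <span class="n">volumes</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">page</span> <span class="ow">in</span> <span class="n">paginator</span><span class="p">.</span><span class="n">paginate</span><span class="p">():</span>
        <span class="n">volumes</span><span class="p">.</span><span class="n">extend</span><span class="p">(</span><span class="n">page</span><span class="p">[</span><span class="s">'Volumes'</span><span class="p">])</span>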

    <span class="n">gp2_volumes</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">unattached</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">for</span> <span class="n">vol</span> <span class="ow">in</span> <span class="n">volumes</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">vol</span><span class="p">[</span><span class="s">'VolumeType'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'gp2'</span><span class="p">:</span>
            <span class="n">gp2_volumes</span><span class="p">.</span><span class="n">append</span><span class="p">({</span>
                <span class="s">'VolumeId'</span><span class="p">:</span> <span class="n">vol</span><span class="p">[</span><span class="s">'VolumeId'</span><span class="p">],</span>
                <span class="s">'Size'</span><span class="p">:</span> <span class="n">vol</span><span class="p">[</span><span class="s">'Size'</span><span class="p">],</span>
                <span class="s">'State'</span><span class="p">:</span> <span class="n">vol</span><span class="p">[</span><span class="s">'State'</span><span class="p">],</span>
                <span class="s">'MonthlyCost'</span><span class="p">:</span> <span class="n">vol</span><span class="p">[</span><span class="s">'Size'</span><span class="p">]</span> <span class="o">*</span> <span class="mf">0.10</span><span class="p">,</span>  <span class="c1"># GP2 pricing
</span>                <span class="s">'GP3Cost'</span><span class="p">:</span> <span class="n">vol</span><span class="p">[</span><span class="s">'Size'</span><span class="p">]</span> <span class="o">*</span> <span class="mf">0.08</span><span class="p">,</span>      <span class="c1"># GP3 pricing
</span>                <span class="s">'Savings'</span><span class="p">:</span> <span class="n">vol</span><span class="p">[</span><span class="s">'Size'</span><span class="p">]</span> <span class="o">*</span> <span class="mf">0.02</span>
            <span class="p">})</span>

        <span class="k">if</span> <span class="n">vol</span><span class="p">[</span><span class="s">'State'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'available'</span><span class="p">:</span>  <span class="c1"># Not attached
</span>            <span class="n">unattached</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">vol</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">gp2_volumes</span><span class="p">,</span> <span class="n">unattached</span>
</code></pre></div></div>

<p>I built similar functions for:</p>
<ul>
  <li>RDS instances (storage type, utilization, Multi-AZ necessity)</li>
  <li>EC2 instances (generation, Reserved Instance coverage)</li>
  <li>Elastic IPs (attached vs idle)</li>
  <li>EBS snapshots (age, associated volumes)</li>
  <li>S3 buckets (storage class, lifecycle policies)</li>
</ul>
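<p>As a rough illustration of what one of these checks could look like, here is a minimal sketch for the snapshot case. It assumes the input is the <code class="language-plaintext highlighter-rouge">Snapshots</code> list returned by <code class="language-plaintext highlighter-rouge">describe_snapshots(OwnerIds=['self'])</code>; the function name and the 90-day threshold are hypothetical, chosen only for the example.</p>

```python
from datetime import datetime, timedelta, timezone


def find_stale_snapshots(snapshots, live_volume_ids, max_age_days=90, now=None):
    """Flag snapshots that are old or whose source volume no longer exists.

    `snapshots` is assumed to be the 'Snapshots' list from describe_snapshots();
    `live_volume_ids` is a set of VolumeId strings from describe_volumes().
    The 90-day default is an illustrative threshold, not a recommendation.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    flagged = []
    for snap in snapshots:
        too_old = snap["StartTime"] < cutoff  # boto3 returns timezone-aware datetimes
        orphaned = snap.get("VolumeId") not in live_volume_ids
        if too_old or orphaned:
            flagged.append({
                "SnapshotId": snap["SnapshotId"],
                "SizeGiB": snap["VolumeSize"],
                "TooOld": too_old,
                "Orphaned": orphaned,
            })
    return flagged
```

<p>Keeping the function free of API calls makes it easy to run against saved responses before touching a live account.</p>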

<h3 id="step-3-analyze-cloudwatch-metrics-for-utilization">Step 3: Analyze CloudWatch Metrics for Utilization</h3>

<p>This is critical. Before recommending any right-sizing, I needed data:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span><span class="p">,</span> <span class="n">timedelta</span><span class="p">,</span> <span class="n">timezone</span>

<span class="kn">import</span> <span class="nn">boto3</span>


<span class="k">def</span> <span class="nf">get_instance_utilization</span><span class="p">(</span><span class="n">instance_id</span><span class="p">,</span> <span class="n">days</span><span class="o">=</span><span class="mi">30</span><span class="p">):</span>
    <span class="s">"""Get daily average and maximum CPU utilization over the past N days."""</span>
    <span class="n">cloudwatch</span> <span class="o">=</span> <span class="n">boto3</span><span class="p">.</span><span class="n">client</span><span class="p">(</span><span class="s">'cloudwatch'</span><span class="p">)</span>
    <span class="n">end_time</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">(</span><span class="n">timezone</span><span class="p">.</span><span class="n">utc</span><span class="p">)</span>  <span class="c1"># CloudWatch timestamps are UTC
</span>
    <span class="n">response</span> <span class="o">=</span> <span class="n">cloudwatch</span><span class="p">.</span><span class="n">get_metric_statistics</span><span class="p">(</span>
        <span class="n">Namespace</span><span class="o">=</span><span class="s">'AWS/EC2'</span><span class="p">,</span>
        <span class="n">MetricName</span><span class="o">=</span><span class="s">'CPUUtilization'</span><span class="p">,</span>
        <span class="n">Dimensions</span><span class="o">=</span><span class="p">[{</span><span class="s">'Name'</span><span class="p">:</span> <span class="s">'InstanceId'</span><span class="p">,</span> <span class="s">'Value'</span><span class="p">:</span> <span class="n">instance_id</span><span class="p">}],</span>
        <span class="n">StartTime</span><span class="o">=</span><span class="n">end_time</span> <span class="o">-</span> <span class="n">timedelta</span><span class="p">(</span><span class="n">days</span><span class="o">=</span><span class="n">days</span><span class="p">),</span>
        <span class="n">EndTime</span><span class="o">=</span><span class="n">end_time</span><span class="p">,</span>
        <span class="n">Period</span><span class="o">=</span><span class="mi">86400</span><span class="p">,</span>  <span class="c1"># Daily
</span>        <span class="n">Statistics</span><span class="o">=</span><span class="p">[</span><span class="s">'Average'</span><span class="p">,</span> <span class="s">'Maximum'</span><span class="p">]</span>
    <span class="p">)</span>

    <span class="k">return</span> <span class="n">response</span><span class="p">[</span><span class="s">'Datapoints'</span><span class="p">]</span>
</code></pre></div></div>

<p>The results were eye-opening:</p>
<ul>
  <li>Two t3.xlarge instances averaging <strong>12% CPU</strong></li>
  <li>RDS storage at <strong>95% free space</strong></li>
  <li>Multiple log groups with <strong>no retention policy</strong> (storing terabytes)</li>
</ul>
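<p>To turn raw datapoints into a yes/no signal, something like the following sketch could work. It assumes each datapoint carries the <code class="language-plaintext highlighter-rouge">Average</code> and <code class="language-plaintext highlighter-rouge">Maximum</code> statistics requested above; the function names and the 20%/60% thresholds are hypothetical and should be tuned per workload.</p>

```python
def summarize_cpu(datapoints):
    """Collapse CloudWatch datapoints into a mean of daily averages and an overall peak."""
    if not datapoints:
        return None  # no metrics at all: instance may be stopped or too new
    averages = [dp["Average"] for dp in datapoints]
    peaks = [dp["Maximum"] for dp in datapoints]
    return {"mean_avg": sum(averages) / len(averages), "peak": max(peaks)}


def is_downsize_candidate(summary, avg_threshold=20.0, peak_threshold=60.0):
    """Flag an instance only if both its average and its peak CPU stay low.

    The thresholds are illustrative assumptions, not AWS guidance.
    """
    if summary is None:
        return False
    return summary["mean_avg"] < avg_threshold and summary["peak"] < peak_threshold
```

<p>Checking the peak as well as the average avoids downsizing an instance that idles most of the day but spikes under load. CPU is also only one dimension; memory is not reported to CloudWatch unless an agent is installed.</p>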

<h3 id="step-4-map-dependencies-before-cutting">Step 4: Map Dependencies Before Cutting</h3>

<p>Before deleting anything, I mapped what depended on what:</p>
<ul>
  <li>Which services used which Elastic IPs?</li>
  <li>Which applications wrote to which log groups?</li>
  <li>Which backups were actually needed for compliance?</li>
</ul>

<p>This prevented the classic mistake of breaking production while optimizing costs.</p>

<h2 id="the-starting-point">The Starting Point</h2>

<p>After the discovery phase, here’s what I was working with:</p>

<table>
  <thead>
    <tr>
      <th>Service</th>
      <th>Monthly Cost</th>
      <th>% of Total</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>EC2-Other (EBS, NAT, IPs)</td>
      <td>$1,250</td>
      <td>25%</td>
    </tr>
    <tr>
      <td>RDS</td>
      <td>$750</td>
      <td>15%</td>
    </tr>
    <tr>
      <td>EC2 Compute</td>
      <td>$650</td>
      <td>13%</td>
    </tr>
    <tr>
      <td>CloudWatch</td>
      <td>$500</td>
      <td>10%</td>
    </tr>
    <tr>
      <td>AWS Backup</td>
      <td>$500</td>
      <td>10%</td>
    </tr>
    <tr>
      <td>ECS Fargate</td>
      <td>$400</td>
      <td>8%</td>
    </tr>
    <tr>
      <td>S3</td>
      <td>$250</td>
      <td>5%</td>
    </tr>
    <tr>
      <td>VPC</td>
      <td>$250</td>
      <td>5%</td>
    </tr>
    <tr>
      <td>Everything else</td>
      <td>$450</td>
      <td>9%</td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td><strong>$5,000</strong></td>
      <td><strong>100%</strong></td>
    </tr>
  </tbody>
</table>

<p>The distribution told me a lot. EC2-related costs (EC2 Compute plus EC2-Other, which covers EBS, NAT, and IPs) made up 38% of the bill. That’s where I started.</p>

<h2 id="phase-1-quick-wins-implemented-same-day">Phase 1: Quick Wins (Implemented Same Day)</h2>

<h3 id="release-idle-elastic-ips--saved-50month">Release Idle Elastic IPs — Saved $50/month</h3>

<p>My inventory script flagged 5 Elastic IPs with no association:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws ec2 describe-addresses <span class="nt">--query</span> <span class="s1">'Addresses[?AssociationId==null]'</span>
</code></pre></div></div>

<p>Someone had provisioned them for test environments that were deleted months ago. Classic ghost infrastructure.</p>

<p><strong>Time to fix:</strong> 5 minutes.</p>
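<p>The same filter can be expressed in Python. This is a minimal sketch that assumes the input is the <code class="language-plaintext highlighter-rouge">Addresses</code> list from <code class="language-plaintext highlighter-rouge">describe_addresses()</code>; the function name is hypothetical.</p>

```python
def find_idle_addresses(addresses):
    """Return Elastic IPs that have no AssociationId (not attached to anything)."""
    return [
        {"PublicIp": addr.get("PublicIp"), "AllocationId": addr.get("AllocationId")}
        for addr in addresses
        if not addr.get("AssociationId")
    ]


# Example usage against a live account (requires boto3 and credentials):
#   import boto3
#   ec2 = boto3.client("ec2")
#   idle = find_idle_addresses(ec2.describe_addresses()["Addresses"])
```

<p>Each result carries the <code class="language-plaintext highlighter-rouge">AllocationId</code> that <code class="language-plaintext highlighter-rouge">release_address</code> expects, so the list can be reviewed by a human before anything is released.</p>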

<h3 id="migrate-ebs-gp2-to-gp3--saved-125month">Migrate EBS GP2 to GP3 — Saved $125/month</h3>

<p>The script found 6,000+ GB across multiple EBS volumes still on GP2. GP3 costs <a href="https://aws.amazon.com/ebs/pricing/">20% less</a> per GB <strong>and</strong> provides a 3,000 IOPS baseline regardless of volume size, whereas GP2 scales at 3 IOPS per GB.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws ec2 modify-volume <span class="nt">--volume-id</span> vol-xxx <span class="nt">--volume-type</span> gp3
</code></pre></div></div>

<p>No downtime. Just a CLI command per volume.</p>

<p><strong>Time to fix:</strong> 30 minutes for all volumes.</p>
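<p>With many volumes, the per-volume command can be looped. The sketch below assumes <code class="language-plaintext highlighter-rouge">ec2</code> is a boto3 EC2 client and <code class="language-plaintext highlighter-rouge">volumes</code> is the list from <code class="language-plaintext highlighter-rouge">describe_volumes()</code>. The function name and the <code class="language-plaintext highlighter-rouge">dry_run</code> flag are hypothetical. Because GP2 volumes above 1,000 GB have a baseline higher than 3,000 IOPS, the sketch requests matching IOPS for those, which is billed on top of the GP3 storage price.</p>

```python
def migrate_gp2_to_gp3(ec2, volumes, dry_run=True):
    """Plan (and optionally request) a gp2 -> gp3 change for each gp2 volume.

    Returns the list of modify_volume keyword arguments that were planned.
    With dry_run=True (a local flag, not the EC2 DryRun parameter) nothing is modified.
    """
    planned = []
    for vol in volumes:
        if vol["VolumeType"] != "gp2":
            continue
        kwargs = {"VolumeId": vol["VolumeId"], "VolumeType": "gp3"}
        gp2_baseline = min(vol["Size"] * 3, 16000)  # gp2: 3 IOPS per GiB, capped at 16,000
        if gp2_baseline > 3000:
            kwargs["Iops"] = gp2_baseline  # keep IOPS parity; extra IOPS cost extra on gp3
        planned.append(kwargs)
        if not dry_run:
            ec2.modify_volume(**kwargs)
    return planned
```

<p>Running it once with <code class="language-plaintext highlighter-rouge">dry_run=True</code> gives a reviewable plan before any volume is touched.</p>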

<h3 id="set-cloudwatch-log-retention--saved-100month">Set CloudWatch Log Retention — Saved $100/month</h3>

<p>My scan found 20+ log groups with no retention policy — storing logs forever:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws logs describe-log-groups <span class="nt">--query</span> <span class="s1">'logGroups[?retentionInDays==null].logGroupName'</span>
</code></pre></div></div>

<p>Set production to 90 days, staging to 30 days.</p>

<p><strong>Time to fix:</strong> 20 minutes.</p>
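<p>Applying the retention values in bulk could look roughly like this. The sketch assumes <code class="language-plaintext highlighter-rouge">logs</code> is a boto3 CloudWatch Logs client and <code class="language-plaintext highlighter-rouge">log_groups</code> is the <code class="language-plaintext highlighter-rouge">logGroups</code> list from <code class="language-plaintext highlighter-rouge">describe_log_groups()</code>. The rule that staging groups contain the word “staging” in their name is a hypothetical naming convention used only for illustration.</p>

```python
def apply_retention(logs, log_groups, prod_days=90, staging_days=30, dry_run=True):
    """Set a retention policy on every log group that currently has none.

    Assumes (hypothetically) that staging log groups have "staging" in their name.
    Returns a list of (log_group_name, retention_days) pairs.
    """
    changes = []
    for group in log_groups:
        if group.get("retentionInDays") is not None:
            continue  # already has a policy; leave it alone
        name = group["logGroupName"]
        days = staging_days if "staging" in name else prod_days
        changes.append((name, days))
        if not dry_run:
            logs.put_retention_policy(logGroupName=name, retentionInDays=days)
    return changes
```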

<h2 id="phase-2-the-big-discoveries">Phase 2: The Big Discoveries</h2>

<h3 id="aws-backup-running-24x-more-often-than-needed--saved-400month">AWS Backup Running 24x More Often Than Needed — Saved $400/month</h3>

<p>This was the most surprising find. When I pulled the backup plan configuration:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws backup get-backup-plan <span class="nt">--backup-plan-id</span> xxx
</code></pre></div></div>

<p>I saw: <strong>hourly backups</strong>. 24 backups per day. For every resource.</p>

<p>The backup storage had grown to $500/month — 10% of their total bill.</p>

<p>I reviewed their recovery requirements (they only needed daily backups with 14-day retention) and reconfigured:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"ScheduleExpression"</span><span class="p">:</span><span class="w"> </span><span class="s2">"cron(0 5 * * ? *)"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"StartWindowMinutes"</span><span class="p">:</span><span class="w"> </span><span class="mi">60</span><span class="p">,</span><span class="w">
  </span><span class="nl">"CompletionWindowMinutes"</span><span class="p">:</span><span class="w"> </span><span class="mi">120</span><span class="p">,</span><span class="w">
  </span><span class="nl">"Lifecycle"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"DeleteAfterDays"</span><span class="p">:</span><span class="w"> </span><span class="mi">14</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p><strong>Time to fix:</strong> 1 hour (including testing).</p>
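<p>Spotting over-frequent schedules across many plans can be automated. The sketch below assumes the input is the dictionary returned by <code class="language-plaintext highlighter-rouge">get_backup_plan</code> and that schedules use the AWS six-field cron format (minutes, hours, day-of-month, month, day-of-week, year). The function names are hypothetical, and the check is a heuristic: any wildcard, list, range, or step in the minutes or hours field is treated as “more than once a day”.</p>

```python
def runs_more_than_daily(schedule_expression):
    """Return True if an AWS cron expression can fire more than once per day."""
    expr = schedule_expression.strip()
    if not (expr.startswith("cron(") and expr.endswith(")")):
        raise ValueError(f"not a cron expression: {schedule_expression!r}")
    fields = expr[len("cron("):-1].split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 cron fields, got {len(fields)}")
    minutes, hours = fields[0], fields[1]

    def has_multiple_values(field):
        # '*', lists (a,b), ranges (a-b) and steps (a/b) all expand to several values
        return any(ch in field for ch in "*,-/")

    return has_multiple_values(minutes) or has_multiple_values(hours)


def flag_frequent_rules(plan_response):
    """List the names of backup rules that run more often than once a day."""
    rules = plan_response["BackupPlan"]["Rules"]
    return [
        rule["RuleName"]
        for rule in rules
        if rule.get("ScheduleExpression") and runs_more_than_daily(rule["ScheduleExpression"])
    ]
```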

<h3 id="cloudwatch-metric-streams-to-nowhere--saved-400month">CloudWatch Metric Streams to Nowhere — Saved $400/month</h3>

<p>My CloudWatch cost breakdown showed $400/month on “Metric Streams” — 100+ million metric updates going somewhere.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws cloudwatch list-metric-streams
</code></pre></div></div>

<p>Found a stream configured to send data to a third-party monitoring tool. When I asked about it, nobody on the team knew it existed. The integration had been set up by a previous contractor and was never used.</p>

<p>This is a perfect example of ghost infrastructure that accumulates over time.</p>
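<p>A small script can surface streams like this one for review. This sketch assumes <code class="language-plaintext highlighter-rouge">cloudwatch</code> is a boto3 CloudWatch client; the function name is hypothetical. It only lists running streams and where they deliver, so a human can ask who owns each destination.</p>

```python
def list_running_streams(cloudwatch):
    """Return name, Firehose destination and last update for each running metric stream."""
    streams = []
    token = None
    while True:
        kwargs = {"NextToken": token} if token else {}
        page = cloudwatch.list_metric_streams(**kwargs)
        for entry in page.get("Entries", []):
            if entry.get("State") == "running":
                streams.append({
                    "Name": entry.get("Name"),
                    "FirehoseArn": entry.get("FirehoseArn"),
                    "LastUpdateDate": entry.get("LastUpdateDate"),
                })
        token = page.get("NextToken")
        if not token:
            break
    return streams
```

<p>A stream whose configuration has not been updated in a long time and whose Firehose destination nobody recognizes is a good candidate for the “who owns this?” conversation.</p>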

<h3 id="rds-over-provisioned-by-95">RDS Over-Provisioned by 95%</h3>

<p>My RDS analysis showed all instances had massive storage allocated. The CloudWatch metrics told the real story:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws cloudwatch get-metric-statistics <span class="se">\</span>
  <span class="nt">--namespace</span> AWS/RDS <span class="se">\</span>
  <span class="nt">--metric-name</span> FreeStorageSpace <span class="se">\</span>
  <span class="nt">--dimensions</span> <span class="nv">Name</span><span class="o">=</span>DBInstanceIdentifier,Value<span class="o">=</span>production-db <span class="se">\</span>
  <span class="nt">--start-time</span> 2024-11-01T00:00:00Z <span class="se">\</span>
  <span class="nt">--end-time</span> 2024-11-30T23:59:59Z <span class="se">\</span>
  <span class="nt">--period</span> 86400 <span class="se">\</span>
  <span class="nt">--statistics</span> Average
</code></pre></div></div>

<p><strong>Result:</strong> 95% free space across all databases.</p>

<p>RDS storage can only be increased, not decreased. But I migrated all instances from GP2 to GP3 storage — same price, better performance.</p>

<p>For the next database refresh, I recommended right-sized storage instead of the default massive allocations.</p>

<p><strong>Saved:</strong> $150/month</p>
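<p>The free-space percentage itself is simple arithmetic, with one unit trap: CloudWatch reports <code class="language-plaintext highlighter-rouge">FreeStorageSpace</code> in bytes, while <code class="language-plaintext highlighter-rouge">AllocatedStorage</code> from <code class="language-plaintext highlighter-rouge">describe_db_instances</code> is in GiB. A minimal sketch, with a hypothetical function name:</p>

```python
GIB = 1024 ** 3  # bytes per GiB


def percent_free(free_storage_bytes, allocated_storage_gib):
    """Convert a FreeStorageSpace reading (bytes) into a percentage of allocated storage (GiB)."""
    if allocated_storage_gib <= 0:
        raise ValueError("allocated_storage_gib must be positive")
    return 100.0 * free_storage_bytes / (allocated_storage_gib * GIB)


# Example: a 500 GiB allocation with 475 GiB free
example = percent_free(475 * GIB, 500)
```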

<h2 id="phase-3-infrastructure-improvements">Phase 3: Infrastructure Improvements</h2>

<h3 id="nat-gateway-consolidation--saved-125month">NAT Gateway Consolidation — Saved $125/month</h3>

<p>My VPC analysis showed NAT Gateways in every AZ across multiple regions costing $500/month combined. After reviewing their architecture and traffic patterns, they only needed half of them.</p>

<h3 id="ecs-task-right-sizing--saved-250month">ECS Task Right-Sizing — Saved $250/month</h3>

<p>The ECS service scan found:</p>
<ul>
  <li>A staging service constantly failing health checks and restarting (consuming resources 24/7 while accomplishing nothing)</li>
  <li>Legacy services still running in production that nobody was using</li>
</ul>

<p>These issues relate directly to the <a href="/2025/ecs-decisions-that-waste-6-weeks/">ECS architectural decisions</a> that often waste weeks of engineering time. I also enabled Fargate Spot for fault-tolerant workloads (up to 70% savings on those tasks).</p>

<h3 id="s3-lifecycle-policies--saved-150month">S3 Lifecycle Policies — Saved $150/month</h3>

<p>My S3 bucket analysis showed backup buckets had grown to 10+ TB with no lifecycle policy. Old backups were stored in Standard tier forever.</p>

<p>Added a policy to transition to Glacier after 90 days:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"Rules"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"ID"</span><span class="p">:</span><span class="w"> </span><span class="s2">"archive-old-backups"</span><span class="p">,</span><span class="w">
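      </span><span class="nl">"Status"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Enabled"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Filter"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"Prefix"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="w"> </span><span class="p">},</span><span class="w">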
      </span><span class="nl">"Transitions"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="p">{</span><span class="w">
          </span><span class="nl">"Days"</span><span class="p">:</span><span class="w"> </span><span class="mi">90</span><span class="p">,</span><span class="w">
          </span><span class="nl">"StorageClass"</span><span class="p">:</span><span class="w"> </span><span class="s2">"GLACIER"</span><span class="w">
        </span><span class="p">}</span><span class="w">
      </span><span class="p">]</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
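<p>Applied from Python, the same policy could be sketched as follows. It assumes <code class="language-plaintext highlighter-rouge">s3</code> is a boto3 S3 client; the function names are hypothetical. The rule includes an empty-prefix <code class="language-plaintext highlighter-rouge">Filter</code> so it applies to the whole bucket, since S3 expects each rule to carry a filter.</p>

```python
def build_archive_rule(days=90, prefix=""):
    """Build a lifecycle configuration that moves objects to Glacier after `days` days."""
    return {
        "Rules": [
            {
                "ID": "archive-old-backups",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix = every object in the bucket
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


def apply_archive_rule(s3, bucket, days=90):
    """Apply the archive rule to a bucket.

    Note: this call replaces any lifecycle configuration already on the bucket.
    """
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=build_archive_rule(days),
    )
```

<p>Because the call overwrites the existing configuration, buckets that already have rules need those rules merged into the payload first.</p>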

<h3 id="reserved-instances-for-stable-workloads--saved-500month">Reserved Instances for Stable Workloads — Saved $500/month</h3>

<p>My EC2 coverage analysis showed zero Reserved Instance coverage on instances running 24/7.</p>

<p>I helped them purchase <a href="https://aws.amazon.com/savingsplans/compute-pricing/">Compute Savings Plans</a> covering their steady-state workloads. Immediate 30-40% savings on compute.</p>

<h3 id="ec2-instance-right-sizing--saved-250month">EC2 Instance Right-Sizing — Saved $250/month</h3>

<p>The utilization data was clear: multiple instances running at 10-15% CPU.</p>

<p>Downsized t3.xlarge instances to t3.large where utilization data supported it. Same workload, half the cost.</p>

<h2 id="the-results">The Results</h2>

<table>
  <thead>
    <tr>
      <th>Category</th>
      <th>Monthly Savings</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Reserved Instances / Savings Plans</td>
      <td>$500</td>
    </tr>
    <tr>
      <td>AWS Backup (hourly → daily)</td>
      <td>$400</td>
    </tr>
    <tr>
      <td>CloudWatch Metric Streams</td>
      <td>$400</td>
    </tr>
    <tr>
      <td>ECS cleanup + Fargate Spot</td>
      <td>$250</td>
    </tr>
    <tr>
      <td>EC2 right-sizing</td>
      <td>$250</td>
    </tr>
    <tr>
      <td>S3 lifecycle policies</td>
      <td>$150</td>
    </tr>
    <tr>
      <td>RDS improvements</td>
      <td>$150</td>
    </tr>
    <tr>
      <td>EBS GP2 → GP3</td>
      <td>$125</td>
    </tr>
    <tr>
      <td>NAT Gateway consolidation</td>
      <td>$125</td>
    </tr>
    <tr>
      <td>CloudWatch log retention</td>
      <td>$100</td>
    </tr>
    <tr>
      <td>Idle Elastic IPs</td>
      <td>$50</td>
    </tr>
    <tr>
      <td><strong>Total Monthly Savings</strong></td>
      <td><strong>$2,500</strong></td>
    </tr>
  </tbody>
</table>

<p><strong>From $5,000/month to $2,500/month: exactly a 50% reduction.</strong></p>

<p>Over a year, that’s <strong>$30,000 back in their pocket</strong>.</p>

<h2 id="the-methodology">The Methodology</h2>

<p>Here’s the systematic approach I use for every cost optimization engagement:</p>

<h3 id="1-get-the-data-first">1. Get the Data First</h3>

<p>Before making any changes, I pull:</p>
<ul>
  <li>AWS Cost Explorer data (by service, by tag, over time)</li>
  <li>CloudWatch metrics for utilization</li>
  <li>Resource inventory across all regions</li>
</ul>

<h3 id="2-find-the-ghosts">2. Find the Ghosts</h3>

<p>“Ghost infrastructure” costs more than you think:</p>
<ul>
  <li>Unused Elastic IPs</li>
  <li>Detached EBS volumes</li>
  <li>Empty S3 buckets accumulating requests</li>
  <li>Log groups with infinite retention</li>
  <li>Metric Streams nobody monitors</li>
  <li>Test environments that outlived their purpose</li>
</ul>

<h3 id="3-right-size-ruthlessly">3. Right-Size Ruthlessly</h3>

<p>Check actual utilization before committing to Reserved Instances:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># EC2 CPU utilization over 30 days</span>
aws cloudwatch get-metric-statistics <span class="se">\</span>
  <span class="nt">--namespace</span> AWS/EC2 <span class="se">\</span>
  <span class="nt">--metric-name</span> CPUUtilization <span class="se">\</span>
  <span class="nt">--dimensions</span> <span class="nv">Name</span><span class="o">=</span>InstanceId,Value<span class="o">=</span>i-xxx <span class="se">\</span>
  <span class="nt">--start-time</span> <span class="si">$(</span><span class="nb">date</span> <span class="nt">-u</span> <span class="nt">-d</span> <span class="s2">"30 days ago"</span> +%Y-%m-%dT%H:%M:%SZ<span class="si">)</span> <span class="se">\</span>
  <span class="nt">--end-time</span> <span class="si">$(</span><span class="nb">date</span> <span class="nt">-u</span> +%Y-%m-%dT%H:%M:%SZ<span class="si">)</span> <span class="se">\</span>
  <span class="nt">--period</span> 3600 <span class="se">\</span>
  <span class="nt">--statistics</span> Average
</code></pre></div></div>

<p>If a t3.xlarge averages 15% CPU, you’re paying for 85% idle capacity.</p>

<h3 id="4-modernize-storage">4. Modernize Storage</h3>

<p>GP2 → GP3 is almost always worth it:</p>
<ul>
  <li>20% cheaper at baseline</li>
  <li>Better performance (3,000 IOPS baseline)</li>
  <li>Zero downtime migration</li>
</ul>

<h3 id="5-review-backup-policies">5. Review Backup Policies</h3>

<p>Backups grow silently. Questions to ask:</p>
<ul>
  <li>How often do you actually need backups?</li>
  <li>How long do you really need to keep them?</li>
  <li>Are you backing up dev/test environments at production frequency?</li>
</ul>

<h2 id="what-this-looks-like-over-time">What This Looks Like Over Time</h2>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Before</th>
      <th>After</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Monthly spend</td>
      <td>$5,000</td>
      <td>$2,500</td>
    </tr>
    <tr>
      <td>Annual spend</td>
      <td>$60,000</td>
      <td>$30,000</td>
    </tr>
    <tr>
      <td><strong>Annual savings</strong></td>
      <td>—</td>
      <td><strong>$30,000</strong></td>
    </tr>
  </tbody>
</table>

<p>The best part? None of these changes affected performance or reliability. Most improved it.</p>

<h2 id="common-patterns-i-see">Common Patterns I See</h2>

<p>After doing this for multiple clients, patterns emerge:</p>

<ol>
  <li><strong>Metric Streams nobody monitors</strong> — $100-400/month just disappearing</li>
  <li><strong>Hourly backups for daily restore needs</strong> — 24x the storage cost</li>
  <li><strong>GP2 volumes from years ago</strong> — never migrated to GP3</li>
  <li><strong>Multi-AZ staging databases</strong> — paying for HA nobody needs</li>
  <li><strong>NAT Gateways in every AZ</strong> — when one or two would suffice</li>
  <li><strong>Logs kept forever</strong> — “just in case”</li>
  <li><strong>No Reserved Instances</strong> — paying full on-demand for 24/7 workloads</li>
  <li><strong>Over-provisioned everything</strong> — “it might need it someday”</li>
</ol>

<hr />

<h2 id="need-help-with-your-aws-bill">Need Help With Your AWS Bill?</h2>

<p>I do AWS cost optimization as part of my DevOps consulting practice. If your AWS bill feels too high or you just want a second pair of eyes on your infrastructure, let’s talk.</p>

<p><strong><a href="https://calendly.com/muhammad-07/30-minute-meeting">Book a free 30-minute call</a></strong> — I’ll review your current setup and tell you where I see opportunities.</p>

<hr />

<p><em>Have questions about any of these optimizations? Drop a comment below or reach out on <a href="https://twitter.com/muhammad_o7">Twitter/X</a>.</em></p>


          ]]>
        </description>
        <pubDate>Sat, 27 Dec 2025 00:00:00 +0000</pubDate>
        <link>//muhammadraza.me/2025/aws-cost-optimization-case-study/</link>
        <guid isPermaLink="true">//muhammadraza.me/2025/aws-cost-optimization-case-study/</guid>
        
        <category>aws</category>
        
        <category>devops</category>
        
        <category>case-study</category>
        
        
        
        <dc:creator>Muhammad Raza</dc:creator>
        <dc:rights></dc:rights>
      </item>
    
      <item>
        <title>The 5 ECS Decisions That Waste 6 Weeks (And What to Pick Instead)</title>
        <description>
          <![CDATA[
            
            <p>I’ve been helping Python teams deploy to AWS for the past two years. The pattern is always the same: a team has a working FastAPI or Django app running perfectly on their laptops with <code class="language-plaintext highlighter-rouge">docker-compose up</code>, and then someone says “let’s put this in ECS.” Six weeks later, they’re still arguing about whether to use Fargate or EC2.</p>

<p>The problem isn’t that ECS is hard. The problem is that teams treat infrastructure decisions like they’re permanent. They’re not.</p>

<p>Last year I worked with a startup that spent 5 weeks evaluating container orchestration options. Five weeks. They had a working app. They had paying customers waiting. But the engineering team was stuck in an endless loop of “what if we need to scale?” and “shouldn’t we future-proof this?”</p>

<p>They launched on Fargate. It took 3 days once they stopped debating.</p>

<p>Here are the 5 decisions that waste the most time and what I tell every client to pick.</p>

<h2 id="fargate-vs-ec2-just-use-fargate">Fargate vs EC2: Just Use Fargate</h2>

<p>This one wastes more time than all the others combined.</p>

<p>I get it. EC2 looks cheaper on paper. You can run the numbers, build a spreadsheet, show that at 50 containers you’ll save $400/month with EC2. The finance person gets excited. Someone mentions spot instances. Now you’re three meetings deep into capacity planning for traffic you don’t have yet.</p>

<p>Here’s what actually happens with EC2: you spend a week figuring out instance types, another week on auto-scaling groups, then you hit some weird issue where your containers won’t place because the bin-packing algorithm can’t find space, and suddenly your “cheaper” option has eaten two sprints of engineering time.</p>

<p>Fargate just works. You tell it how much CPU and memory you need, and it runs your container. No instances to manage, no patching, no capacity planning.</p>

<p>“But it’s more expensive!”</p>

<p>Sure. Maybe 20-30% more at scale. But you’re not at scale. You’re trying to ship. And even if Fargate costs you an extra $200/month right now, that’s nothing compared to the $30k+ in engineering salaries you’re burning while debating this.</p>

<p>Python apps especially benefit from Fargate. Your Django app with Celery workers is memory-heavy and I/O bound. You’re not doing CPU-intensive work. Fargate lets you right-size memory without playing Tetris with EC2 instance types.</p>

<p>Pick Fargate. When you’re running 200 containers 24/7 and have real cost data, revisit. Until then, move on.</p>

<h2 id="ecs-service-discovery-use-an-internal-alb">ECS Service Discovery: Use an Internal ALB</h2>

<p>When your services need to talk to each other, AWS gives you three options: Cloud Map, internal ALB, or Service Connect. I’ve seen teams spend weeks evaluating all three, setting up proof-of-concepts, reading whitepapers.</p>

<p>Just use an internal ALB.</p>

<p>I know, it’s not exciting. It’s a load balancer. It’s been around forever. But that’s exactly why you should use it:</p>

<ul>
  <li>It gives you a stable DNS name your services can call</li>
  <li>Health checks work out of the box</li>
  <li>You get access logs for debugging</li>
  <li>Every developer on your team already understands HTTP</li>
</ul>

<p>Your FastAPI service calls <code class="language-plaintext highlighter-rouge">http://api-internal.yourdomain.local/users</code> and it just works. No service mesh. No Envoy sidecars. No DNS caching gotchas.</p>

<p>Cloud Map is fine, but I’ve debugged too many issues where services couldn’t find each other because of DNS TTL problems. Service Connect is powerful, but now you’re operating a service mesh. Do you really want to be debugging Envoy proxy configuration when your actual problem is a database query?</p>

<p>The internal ALB is boring. Boring is good. Boring means you’re debugging your application code instead of your infrastructure.</p>

<h2 id="cicd-for-ecs-use-github-actions">CI/CD for ECS: Use GitHub Actions</h2>

<p>I’m gonna be honest here: if your code is on GitHub, use GitHub Actions. Don’t overthink this.</p>

<p>“But CodePipeline is AWS-native!”</p>

<p>Yes, and it requires you to set up a pipeline with Source, Build, and Deploy stages, configure IAM roles for each stage, create buildspec files, and wire everything together. It’s more YAML for the same result.</p>

<p>“But Jenkins gives us more control!”</p>

<p>It’s 2025. Please don’t set up a Jenkins server. You’ll spend more time maintaining Jenkins than deploying your app.</p>

<p>GitHub Actions has an official AWS action that handles ECS deployments:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Deploy to ECS</span>
  <span class="na">uses</span><span class="pi">:</span> <span class="s">aws-actions/amazon-ecs-deploy-task-definition@v1</span>
  <span class="na">with</span><span class="pi">:</span>
    <span class="na">task-definition</span><span class="pi">:</span> <span class="s">task-definition.json</span>
    <span class="na">service</span><span class="pi">:</span> <span class="s">my-service</span>
    <span class="na">cluster</span><span class="pi">:</span> <span class="s">my-cluster</span>
    <span class="na">wait-for-service-stability</span><span class="pi">:</span> <span class="no">true</span>
</code></pre></div></div>

<p>That’s the whole thing. It registers your task definition, updates the service, and waits for the deployment to stabilize. AWS maintains it. It works.</p>

<p>Your deployment workflow lives in your repo, your team already knows GitHub Actions from running tests, and you’re not managing another piece of infrastructure. If you want to understand how these pipeline runners actually work internally, I wrote a deep dive on <a href="/2025/building-cicd-pipeline-runner-python/">building a CI/CD pipeline runner from scratch in Python</a>.</p>

<p>If you’re on GitLab, use GitLab CI. If you’re on Bitbucket, use Bitbucket Pipelines. The point is: use whatever’s already integrated with your code. Don’t add complexity.</p>

<h2 id="ecs-secrets-management-use-ssm-parameter-store">ECS Secrets Management: Use SSM Parameter Store</h2>

<p>Where do you store your database passwords and API keys?</p>

<p>Not in your task definition. I’ve seen that. Please don’t.</p>

<p>The two real options are SSM Parameter Store and Secrets Manager. Teams debate this endlessly because Secrets Manager has automatic rotation and sounds more “enterprise.”</p>

<p>Here’s the thing: SSM Parameter Store is free for standard parameters, integrates natively with ECS, and handles 99% of use cases.</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"secrets"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"DATABASE_URL"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"valueFrom"</span><span class="p">:</span><span class="w"> </span><span class="s2">"arn:aws:ssm:us-east-1:123456789012:parameter/myapp/database_url"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Your Python app reads <code class="language-plaintext highlighter-rouge">os.environ['DATABASE_URL']</code> like it does locally. No SDK, no code changes.</p>
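<p>On the application side, a small guard can make a missing secret fail loudly at startup rather than on the first query. This is a minimal sketch; the function name is hypothetical, and the <code class="language-plaintext highlighter-rouge">environ</code> parameter exists only so the function can be exercised without touching the real environment.</p>

```python
import os


def get_database_url(environ=None):
    """Read DATABASE_URL, which ECS injects from SSM as a plain environment variable."""
    environ = os.environ if environ is None else environ
    try:
        return environ["DATABASE_URL"]
    except KeyError:
        raise RuntimeError(
            "DATABASE_URL is not set; check the 'secrets' block in the task definition"
        ) from None
```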

<p>Secrets Manager costs $0.40 per secret per month and is worth it if you need automatic rotation for RDS credentials. But you probably don’t need that on day one. Start with SSM, migrate specific secrets to Secrets Manager later if you need rotation.</p>

<p>And please, don’t set up HashiCorp Vault unless you have compliance requirements that specifically mandate it. You’re now operating a distributed system just to store passwords. That’s not simplifying your life.</p>

<h2 id="ecs-logging-and-monitoring-use-cloudwatch">ECS Logging and Monitoring: Use CloudWatch</h2>

<p>Every team wants to evaluate Datadog, New Relic, Honeycomb, and then maybe self-host Prometheus and Grafana “for cost savings.”</p>

<p>Stop. Use CloudWatch.</p>

<p>Add this to your task definition:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"logConfiguration"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"logDriver"</span><span class="p">:</span><span class="w"> </span><span class="s2">"awslogs"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"options"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"awslogs-group"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/ecs/my-service"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"awslogs-region"</span><span class="p">:</span><span class="w"> </span><span class="s2">"us-east-1"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"awslogs-stream-prefix"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ecs"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Done. Your container logs go to CloudWatch. You can query them with CloudWatch Logs Insights. Enable Container Insights and you get CPU/memory metrics. Set up a few alarms. You now have better observability than 80% of startups.</p>

<p>Datadog is genuinely good software. I like it. But it costs $15+ per host per month from day one, and you need to manage another vendor relationship. You can add it later when you actually need distributed tracing or APM.</p>

<p>Self-hosted observability is a trap. I’ve seen teams spend months building ELK stacks and Prometheus clusters. That’s infrastructure work that doesn’t ship features. Unless you have a dedicated platform team, don’t volunteer for this.</p>

<h2 id="the-actual-pattern-here">The Actual Pattern Here</h2>

<p>Look at what I recommended:</p>

<ul>
  <li>Fargate over EC2</li>
  <li>Internal ALB over Cloud Map or Service Connect</li>
  <li>GitHub Actions over CodePipeline or Jenkins</li>
  <li>SSM over Secrets Manager or Vault</li>
  <li>CloudWatch over Datadog or self-hosted</li>
</ul>

<p>Every single choice optimizes for the same thing: <strong>less stuff to manage</strong>.</p>

<p>Yes, some of these cost slightly more money. Yes, some of them are less flexible. But they all share one property: they let you ship faster and debug easier.</p>

<p>And here’s what nobody puts in their architecture decision records: all of these choices are reversible.</p>

<ul>
  <li>Fargate to EC2? Task definitions work on both.</li>
  <li>ALB to Service Connect? Mostly endpoint and service configuration changes.</li>
  <li>SSM to Secrets Manager? Same integration pattern.</li>
  <li>CloudWatch to Datadog? Add the agent, keep CloudWatch as backup.</li>
</ul>

<p>The “wrong” choice costs you maybe a few hundred dollars a month in inefficiency. The debate about the “right” choice costs you weeks of engineering time.</p>

<h2 id="what-ive-actually-seen-happen">What I’ve Actually Seen Happen</h2>

<p>Teams that follow this advice ship in about a week:</p>

<ul>
  <li>Day 1-2: Fargate cluster up, first service running</li>
  <li>Day 3: ALB routing traffic, services talking to each other</li>
  <li>Day 4: GitHub Actions deploying on push to main</li>
  <li>Day 5: Secrets in SSM, logs in CloudWatch, basic alarms set up</li>
</ul>

<p>Week 2: Building features.</p>

<p>Teams that “do it right” are still having meetings about networking topology in week 6.</p>

<p>I’ve watched startups run out of runway while their infrastructure was still “almost ready.” I’ve seen senior engineers burn out on DevOps work instead of building the product that got them excited in the first place.</p>

<p>Your Python app on Fargate with CloudWatch logs isn’t going to fall over at 1,000 users. Probably not at 10,000. By the time scale is actually a problem, you’ll have the traffic data and revenue to solve it properly.</p>

<p>Ship first. Optimize later.</p>

<hr />

<p><strong>If you found this helpful, share it on X and tag me <a href="https://twitter.com/muhammad_o7">@muhammad_o7</a></strong> - I’d love to hear about your ECS deployment experiences. You can also connect with me on <a href="https://www.linkedin.com/in/muhammad-raza-07/">LinkedIn</a>.</p>

<p><strong>Need Help?</strong> I’m available for AWS and DevOps consulting. If you’re stuck in ECS decision paralysis or need help getting to production faster, reach out via <a href="mailto:muhammadraza0047@gmail.com">email</a> or DM me on <a href="https://twitter.com/muhammad_o7">X/Twitter</a>.</p>

          ]]>
        </description>
        <pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate>
        <link>//muhammadraza.me/2025/ecs-decisions-that-waste-6-weeks/</link>
        <guid isPermaLink="true">//muhammadraza.me/2025/ecs-decisions-that-waste-6-weeks/</guid>
        
        <category>aws</category>
        
        <category>devops</category>
        
        <category>python</category>
        
        
        
        <dc:creator>Muhammad Raza</dc:creator>
        <dc:rights></dc:rights>
      </item>
    
      <item>
        <title>Building AI Agents for DevOps: From CI/CD Automation to Autonomous Deployments</title>
        <description>
          <![CDATA[
            
            <p>In my previous post, I showed you how to build a <a href="https://muhammadrazame.github.io/blog/2025/11/24/building-ci-cd-pipeline-runner-from-scratch-in-python/">CI/CD pipeline runner from scratch in Python</a>. We built something powerful: a system that could orchestrate jobs, manage dependencies, and pass artifacts between stages. It was the muscles of your deployment workflow.</p>

<p>But here’s the problem: that pipeline runner can only do exactly what you tell it to do.</p>

<p>It’s 2 AM. Your deployment pipeline fails. The error message is cryptic: <code class="language-plaintext highlighter-rouge">Error: Connection refused on port 5432</code>. Your traditional CI/CD pipeline stops dead. It sends an alert. You wake up, check the logs, realize the database connection pool was exhausted, restart the service, and go back to bed frustrated.</p>

<p>What if your pipeline could investigate the failure itself?</p>

<p>What if, instead of just stopping and alerting you, it could:</p>

<ul>
  <li>Analyze the error logs</li>
  <li>Check recent code changes</li>
  <li>Search for similar issues in your repository</li>
  <li>Identify that this same error happened two weeks ago when someone forgot to increase the connection pool</li>
  <li>Post a detailed root cause analysis to Slack with a suggested fix</li>
</ul>

<p>That’s not science fiction. That’s what AI agents can do for your DevOps workflows.</p>

<p>Over the past 2 years working independently as a DevOps consultant, I’ve seen the same patterns at every client: pipeline failures that need investigation, deployment decisions that require context, and incidents that demand rapid root cause analysis. These aren’t problems that need faster execution. They need reasoning.</p>

<p>That’s when I realized: the CI/CD runner we built is powerful, but it’s missing a brain. So I decided to add one.</p>

<h2 id="traditional-automation-vs-ai-agents">Traditional Automation vs. AI Agents</h2>

<p>Here’s the fundamental difference:</p>

<table>
  <thead>
    <tr>
      <th>Traditional CI/CD Pipeline</th>
      <th>AI Agent</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Executes predefined steps in order</td>
      <td>Reasons about what steps to take</td>
    </tr>
    <tr>
      <td>Fails when encountering unexpected situations</td>
      <td>Investigates and adapts to new situations</td>
    </tr>
    <tr>
      <td>Requires humans to make decisions</td>
      <td>Makes informed decisions autonomously</td>
    </tr>
    <tr>
      <td>Uses fixed if-then-else logic</td>
      <td>Uses context-aware reasoning</td>
    </tr>
    <tr>
      <td>Needs explicit error handling for every case</td>
      <td>Generalizes from patterns and past experience</td>
    </tr>
  </tbody>
</table>

<p>Your traditional pipeline is like a factory assembly line: efficient and reliable for known workflows, but completely stuck when something unexpected happens.</p>

<p>An AI agent is like a DevOps engineer who can think, investigate, and make decisions based on the full context of your system.</p>

<h2 id="what-were-building">What We’re Building</h2>

<p>In this post, I’m going to show you how to build a Pipeline Health Monitor Agent: an AI system that watches your GitHub Actions workflows and autonomously investigates failures.</p>

<p>Here’s what our agent will do:</p>

<ul>
  <li><strong>Monitor</strong>: Watch for GitHub Actions workflow failures via webhooks</li>
  <li><strong>Investigate</strong>: Automatically fetch logs, check recent commits, and analyze error patterns</li>
  <li><strong>Reason</strong>: Use an LLM (like GPT-4 or Claude) to understand what went wrong</li>
  <li><strong>Report</strong>: Post detailed findings to Slack with actionable recommendations</li>
  <li><strong>Learn</strong>: Carry findings forward within an investigation (remembering patterns across runs needs long-term memory, which I’m saving for a future post)</li>
</ul>

<p>And we’ll do all of this securely. Research has reported that as much as 48% of AI-generated code contains vulnerabilities, so I’m going to show you exactly how to validate every action your agent takes.</p>

<h2 id="what-youll-learn">What You’ll Learn</h2>

<p>By the end of this post, you’ll be able to:</p>

<ul>
  <li>Understand how AI agents differ from traditional automation and when to use each</li>
  <li>Build a working DevOps AI agent using LangChain and LangGraph</li>
  <li>Integrate the agent with your existing GitHub Actions workflows</li>
  <li>Implement security validation layers to prevent AI-generated vulnerabilities</li>
</ul>

<p>We’ll build this progressively: starting with the core agent, adding GitHub Actions integration, and then hardening it with security layers. Every code example will be complete and runnable.</p>

<p>The core philosophy: AI agents augment your pipeline, they don’t replace it. You’ll still have your traditional CI/CD workflows. The agent just makes them smarter.</p>

<p>Let’s start by understanding what AI agents actually are and how they work.</p>

<h2 id="understanding-ai-agents-the-4-core-components">Understanding AI Agents: The 4 Core Components</h2>

<p>Before we start coding, you need to understand what makes an AI agent fundamentally different from a script or a traditional automation workflow.</p>

<p>A traditional pipeline is a sequence of commands. An AI agent is a reasoning loop.</p>

<h3 id="the-agent-loop">The Agent Loop</h3>

<p>Every AI agent operates in a continuous cycle:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────────────────────────────────────────────┐
│                                                     │
│  Observe → Reason → Plan → Act → Observe (repeat)   │
│                                                     │
└─────────────────────────────────────────────────────┘
</code></pre></div></div>

<p>Here’s what happens when your GitHub Actions workflow fails:</p>

<ol>
  <li><strong>Observe</strong>: Agent receives webhook notification about pipeline failure</li>
  <li><strong>Reason</strong>: LLM analyzes the error message and context</li>
  <li><strong>Plan</strong>: Agent decides which tools to use (check logs, git history, search issues)</li>
  <li><strong>Act</strong>: Agent executes those tools and gathers information</li>
  <li><strong>Observe</strong>: Agent reviews tool outputs and repeats the cycle until it has an answer</li>
</ol>

<p>This is completely different from your CI/CD runner, which executes steps linearly and stops when something fails.</p>
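<p>To make the loop concrete, here is a toy version in plain Python. Nothing in it is a real integration: <code class="language-plaintext highlighter-rouge">fake_llm</code> stands in for the model call and the single tool returns canned text. The point is only the shape of the loop.</p>

```python
# Toy agent loop. fake_llm stands in for a real model call and
# the single tool returns canned text; both are illustrative assumptions.

def fetch_logs(run_id):
    return f"run {run_id}: psycopg2.OperationalError: could not connect to server"

TOOLS = {"fetch_logs": fetch_logs}

def fake_llm(observations):
    """Decide the next step from what has been observed so far."""
    if len(observations) == 1:
        # Only the failure notice so far: ask for logs.
        return {"action": "fetch_logs", "input": "1234"}
    return {"action": "finish", "answer": "Database connection failure"}

def run_agent(event, max_steps=5):
    observations = [event]                      # Observe
    for _ in range(max_steps):
        decision = fake_llm(observations)       # Reason + Plan
        if decision["action"] == "finish":
            return decision["answer"]
        tool = TOOLS[decision["action"]]
        observations.append(tool(decision["input"]))  # Act, then observe the result
    return "No conclusion within the step budget"

print(run_agent("Workflow #1234 failed with exit code 1"))
```

<p>The <code class="language-plaintext highlighter-rouge">max_steps</code> cap matters more than it looks. Without it, a confused model can keep calling tools forever.</p>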

<h3 id="the-4-core-components">The 4 Core Components</h3>

<p>Every AI agent is built from these four pieces:</p>

<h4 id="1-the-llm-brain">1. The LLM (Brain)</h4>

<p>The Large Language Model is the decision-making engine. It takes in context (pipeline logs, error messages, git history) and decides what to do next.</p>

<p>Think of it as the “thinking” part. When your pipeline fails with a database connection error, the LLM reasons: “This could be a configuration issue, a networking problem, or resource exhaustion. I should check recent config changes first, then network logs, then resource usage.”</p>

<p>Common choices: GPT-4, Claude 3.5 Sonnet, GPT-3.5 (cheaper for simple tasks)</p>

<h4 id="2-tools-hands">2. Tools (Hands)</h4>

<p>Tools are functions the agent can call to interact with the world. For DevOps, these might be:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">get_github_logs(workflow_id)</code> - Fetch pipeline logs</li>
  <li><code class="language-plaintext highlighter-rouge">analyze_recent_commits(repo, hours)</code> - Check recent code changes</li>
  <li><code class="language-plaintext highlighter-rouge">search_similar_issues(error_message)</code> - Find related GitHub issues</li>
  <li><code class="language-plaintext highlighter-rouge">get_docker_status(container_id)</code> - Check container health</li>
  <li><code class="language-plaintext highlighter-rouge">query_prometheus(metric, timerange)</code> - Get monitoring data</li>
</ul>

<p>The LLM decides which tools to call and when. You just define what each tool does.</p>
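<p>Stripped of any framework, a tool is just a named function plus a description the model can read. A minimal sketch of that idea, with hypothetical tools that return canned strings instead of calling GitHub or Docker:</p>

```python
# Plain-Python tool registry. The tool names and return values are
# hypothetical stand-ins, not a real GitHub or Docker integration.

TOOLS = {}

def register(func):
    """Add a function to the registry, keyed by its name."""
    TOOLS[func.__name__] = func
    return func

@register
def get_github_logs(workflow_id):
    """Fetch pipeline logs for a workflow run."""
    return f"logs for workflow {workflow_id}"

@register
def get_docker_status(container_id):
    """Check container health."""
    return f"container {container_id} is healthy"

def describe_tools():
    """Text the LLM would see when choosing a tool."""
    return "\n".join(f"{name}: {fn.__doc__}" for name, fn in sorted(TOOLS.items()))

def call_tool(name, argument):
    """Run the tool the LLM asked for, rejecting unknown names."""
    if name not in TOOLS:
        return f"Unknown tool: {name}"
    return TOOLS[name](argument)

print(describe_tools())
print(call_tool("get_github_logs", "1234"))
```

<p>Notice that the docstring doubles as the description shown to the model, which is why vague docstrings lead to badly chosen tools.</p>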

<h4 id="3-memory">3. Memory</h4>

<p>Agents need two types of memory:</p>

<p><strong>Short-term memory (conversation history)</strong>: The current investigation. “I checked the logs and found a connection error. Then I checked recent commits and found a database config change.”</p>

<p><strong>Long-term memory (learned patterns)</strong>: Historical knowledge. “The last three times we saw <code class="language-plaintext highlighter-rouge">Connection refused on port 5432</code>, it was because the connection pool size was too small.”</p>

<p>For our pipeline monitor, we’ll start with short-term memory. Long-term memory requires a vector database (we’ll save that for a future post).</p>
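<p>Here is a small sketch of how the two memory types differ in code. It is an illustration, not what we build later: a plain dict plays the role of long-term memory, where a real system would use the vector database mentioned above.</p>

```python
# Sketch of the two memory types. A dict stands in for long-term
# storage here; real long-term memory would live in a vector database.

class InvestigationMemory:
    def __init__(self, known_patterns=None):
        self.steps = []                              # short-term: this investigation
        self.known_patterns = known_patterns or {}   # long-term: past root causes

    def remember(self, note):
        self.steps.append(note)

    def transcript(self):
        """Everything found so far, ready to include in the next LLM call."""
        return "\n".join(self.steps)

    def recall(self, error_text):
        """Return past root causes whose signature appears in the error."""
        return [cause for signature, cause in self.known_patterns.items()
                if signature in error_text]


memory = InvestigationMemory({"port 5432": "connection pool size was too small"})
memory.remember("Checked logs: found a connection error.")
memory.remember("Checked commits: found a database config change.")
print(memory.transcript())
print(memory.recall("Connection refused on port 5432"))
```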

<h4 id="4-prompts-instructions">4. Prompts (Instructions)</h4>

<p>The prompt is how you tell the agent what its job is and how to behave. A good DevOps agent prompt includes:</p>

<ul>
  <li><strong>Role definition</strong>: “You are a DevOps AI agent that investigates pipeline failures.”</li>
  <li><strong>Context</strong>: “The system runs on Kubernetes in AWS. Database is PostgreSQL. Cache is Redis.”</li>
  <li><strong>Constraints</strong>: “Never execute destructive commands. Always explain your reasoning.”</li>
  <li><strong>Output format</strong>: “Provide a root cause analysis with suggested fixes.”</li>
</ul>

<p>Prompt engineering is critical. A vague prompt like “debug the issue” will give you vague results. A specific prompt with context will give you actionable insights.</p>
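<p>One way to keep prompts specific is to assemble them from the four parts listed above instead of writing one free-form blob. A minimal sketch, using the example wording from this section (the function name is my own):</p>

```python
# Assembles a system prompt from role, context, constraints, and output format.
# build_system_prompt is an illustrative helper, not a library function.

def build_system_prompt(role, context, constraints, output_format):
    rules = "\n".join(f"- {rule}" for rule in constraints)
    return (
        f"{role}\n\n"
        f"Context:\n{context}\n\n"
        f"Constraints:\n{rules}\n\n"
        f"Output format:\n{output_format}"
    )

prompt = build_system_prompt(
    role="You are a DevOps AI agent that investigates pipeline failures.",
    context="The system runs on Kubernetes in AWS. Database is PostgreSQL. Cache is Redis.",
    constraints=["Never execute destructive commands.", "Always explain your reasoning."],
    output_format="Provide a root cause analysis with suggested fixes.",
)
print(prompt)
```

<p>Keeping the parts separate also makes it easy to swap the context per client or per environment without touching the constraints.</p>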

<h3 id="how-it-all-works-together">How It All Works Together</h3>

<p>Here’s a concrete example of the agent loop in action:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Pipeline failure detected
    ↓
Agent observes: "Workflow #1234 failed with exit code 1"
    ↓
LLM reasons: "Exit code 1 is generic. I need more information."
    ↓
Agent plans: "Call get_github_logs() to see the actual error"
    ↓
Agent acts: Fetches logs, finds "psycopg2.OperationalError: could not connect to server"
    ↓
LLM reasons: "Database connection failure. Could be config, network, or resources."
    ↓
Agent plans: "Check recent commits for database config changes"
    ↓
Agent acts: Calls analyze_recent_commits(), finds commit changing DATABASE_URL
    ↓
LLM reasons: "Root cause identified. Recent commit broke database connection."
    ↓
Agent outputs: Detailed report with commit hash, explanation, and fix suggestion
</code></pre></div></div>

<h3 id="when-to-use-ai-agents-vs-traditional-automation">When to Use AI Agents vs. Traditional Automation</h3>

<p>Not every problem needs an AI agent. Here’s when each makes sense:</p>

<p><strong>Use traditional CI/CD automation when:</strong></p>

<ul>
  <li>The workflow is fully deterministic</li>
  <li>You know all possible failure modes</li>
  <li>Speed and cost are critical</li>
  <li>Zero tolerance for unexpected behavior</li>
</ul>

<p><strong>Use AI agents when:</strong></p>

<ul>
  <li>Failures require investigation and reasoning</li>
  <li>Context matters (recent changes, system state, historical patterns)</li>
  <li>The problem space is too large for explicit if-then rules</li>
  <li>You need adaptive behavior</li>
</ul>

<p><strong>Examples:</strong></p>

<p>Traditional automation: “If tests fail, don’t deploy” (simple rule)</p>

<p>AI agent: “Tests failed. Analyze which tests, check if they’re flaky, review recent code changes, determine if this is a real issue or infrastructure problem, suggest next steps” (complex reasoning)</p>
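<p>In practice you rarely pick one or the other. A rough sketch of the hybrid: fixed rules handle the failures you already understand, and only the leftovers go to the agent. The rule table and the event fields below are assumptions for illustration.</p>

```python
# Hybrid routing sketch: fixed rules first, agent only for what they miss.
# The rule table and event fields are illustrative assumptions.

KNOWN_FAILURES = {
    "tests_failed": "block_deploy",
    "lint_failed": "block_deploy",
    "runner_timeout": "retry_job",
}

def route_failure(event):
    """Return (handler, action) for a pipeline failure event."""
    action = KNOWN_FAILURES.get(event["type"])
    if action is not None and not event.get("needs_context", False):
        return ("rule", action)          # deterministic, fast, cheap
    return ("agent", "investigate")      # open-ended: hand off to the agent

print(route_failure({"type": "lint_failed"}))
print(route_failure({"type": "tests_failed", "needs_context": True}))
print(route_failure({"type": "unknown_error"}))
```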

<h3 id="what-were-building-next">What We’re Building Next</h3>

<p>Now that you understand the components, we’re going to build a Pipeline Health Monitor Agent that uses:</p>

<ul>
  <li><strong>LLM</strong>: GPT-4 or Claude for reasoning</li>
  <li><strong>Tools</strong>: GitHub API, log analysis, issue search</li>
  <li><strong>Memory</strong>: Conversation history for multi-step investigation</li>
  <li><strong>Prompts</strong>: DevOps-specific instructions with infrastructure context</li>
</ul>

<p>In the next section, we’ll write the actual code.</p>

<h2 id="building-version-1-pipeline-health-monitor-agent">Building Version 1: Pipeline Health Monitor Agent</h2>

<p>Now we’re going to build a working AI agent that monitors your GitHub Actions workflows and investigates failures. This version runs end to end today; the security hardening comes in a later section.</p>

<h3 id="what-our-agent-will-do">What Our Agent Will Do</h3>

<p>When a GitHub Actions workflow fails, our agent will:</p>

<ul>
  <li>Receive a webhook notification with the workflow ID</li>
  <li>Fetch the workflow logs from GitHub</li>
  <li>Analyze recent commits to find what changed</li>
  <li>Search existing GitHub issues for similar errors</li>
  <li>Use an LLM (GPT-4, Claude, or others via OpenRouter) to reason about the root cause</li>
  <li>Generate a detailed report with recommendations</li>
</ul>
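<p>The first step, pulling the run ID out of the webhook, can be sketched like this. It assumes GitHub’s <code class="language-plaintext highlighter-rouge">workflow_run</code> event payload and skips signature verification and the HTTP server entirely, so treat it as a starting point only.</p>

```python
# Pulls the run ID out of a GitHub "workflow_run" webhook payload.
# Signature verification and the HTTP server are deliberately left out.

def failed_run_id(payload):
    """Return the run ID as a string if the run finished in failure, else None."""
    run = payload.get("workflow_run") or {}
    if payload.get("action") == "completed" and run.get("conclusion") == "failure":
        return str(run["id"])
    return None

sample = {
    "action": "completed",
    "workflow_run": {"id": 1234, "name": "CI", "conclusion": "failure"},
}
print(failed_run_id(sample))
```

<p>Successful runs and in-progress events return <code class="language-plaintext highlighter-rouge">None</code>, so the agent only wakes up when there is something to investigate.</p>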

<p>Let’s build it step by step.</p>

<h3 id="installation-and-setup">Installation and Setup</h3>

<p>First, install uv if you don’t have it already:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># On macOS/Linux</span>
curl <span class="nt">-LsSf</span> https://astral.sh/uv/install.sh | sh

<span class="c"># On Windows</span>
powershell <span class="nt">-ExecutionPolicy</span> ByPass <span class="nt">-c</span> <span class="s2">"irm https://astral.sh/uv/install.ps1 | iex"</span>
</code></pre></div></div>

<p>Create a new project directory and set up a virtual environment:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir </span>pipeline-agent
<span class="nb">cd </span>pipeline-agent

<span class="c"># Create virtual environment with uv</span>
uv venv

<span class="c"># Activate the virtual environment</span>
<span class="c"># On macOS/Linux:</span>
<span class="nb">source</span> .venv/bin/activate

<span class="c"># On Windows:</span>
.venv<span class="se">\S</span>cripts<span class="se">\a</span>ctivate
</code></pre></div></div>

<p>Install the required dependencies using uv:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>uv pip <span class="nb">install </span>langchain langchain-openai requests python-dotenv
</code></pre></div></div>

<p>Set up your environment variables in a <code class="language-plaintext highlighter-rouge">.env</code> file.</p>

<p><strong>Option 1: Using OpenAI directly</strong></p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">OPENAI_API_KEY</span><span class="o">=</span>your_openai_api_key_here
<span class="nv">GITHUB_TOKEN</span><span class="o">=</span>your_github_personal_access_token
<span class="nv">GITHUB_REPO</span><span class="o">=</span>username/repository
<span class="nv">USE_OPENROUTER</span><span class="o">=</span><span class="nb">false</span>
</code></pre></div></div>

<p><strong>Option 2: Using OpenRouter (recommended for cost savings)</strong></p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">OPENROUTER_API_KEY</span><span class="o">=</span>your_openrouter_api_key_here
<span class="nv">GITHUB_TOKEN</span><span class="o">=</span>your_github_personal_access_token
<span class="nv">GITHUB_REPO</span><span class="o">=</span>username/repository
<span class="nv">USE_OPENROUTER</span><span class="o">=</span><span class="nb">true
</span><span class="nv">MODEL_NAME</span><span class="o">=</span>anthropic/claude-3.5-sonnet  <span class="c"># or openai/gpt-4, google/gemini-pro, etc.</span>
</code></pre></div></div>

<p><strong>Why OpenRouter?</strong></p>

<ul>
  <li>Access multiple LLM providers through one API</li>
  <li>Can work out cheaper for some models (compare current pricing before you commit)</li>
  <li>Easy to switch between models without changing code</li>
  <li>Get an API key at <a href="https://openrouter.ai/">https://openrouter.ai/</a></li>
</ul>
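<p>Both options can be driven from the same <code class="language-plaintext highlighter-rouge">.env</code> file with a small switch. The sketch below is one way to do it; the helper name and the <code class="language-plaintext highlighter-rouge">gpt-4</code> default are my assumptions, and it takes the environment as a parameter so you can try it without real keys.</p>

```python
# Chooses LLM connection settings from the variables in .env.
# The "gpt-4" default and the helper name are assumptions for illustration.

import os

OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"

def llm_settings(env=None):
    """Return keyword arguments suitable for an OpenAI-compatible chat client."""
    env = os.environ if env is None else env
    if env.get("USE_OPENROUTER", "false").lower() == "true":
        return {
            "model": env.get("MODEL_NAME", "anthropic/claude-3.5-sonnet"),
            "api_key": env["OPENROUTER_API_KEY"],
            "base_url": OPENROUTER_BASE_URL,
        }
    return {
        "model": env.get("MODEL_NAME", "gpt-4"),
        "api_key": env["OPENAI_API_KEY"],
    }

settings = llm_settings({"USE_OPENROUTER": "true", "OPENROUTER_API_KEY": "sk-or-example"})
print(settings["model"], settings["base_url"])

# With langchain-openai installed, these can be passed straight through:
#   from langchain_openai import ChatOpenAI
#   llm = ChatOpenAI(**llm_settings())
```

<p>Because OpenRouter exposes an OpenAI-compatible API, the only things that change between the two options are the key, the model name, and the base URL.</p>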

<h3 id="step-1-define-the-tools">Step 1: Define the Tools</h3>

<p>Tools are functions the agent can call. Each tool is decorated with <code class="language-plaintext highlighter-rouge">@tool</code> and includes a docstring that tells the LLM what it does.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># agent_investigator.py
</span><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">requests</span>
<span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span><span class="p">,</span> <span class="n">timedelta</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Optional</span>
<span class="kn">from</span> <span class="nn">langchain.tools</span> <span class="kn">import</span> <span class="n">tool</span>
<span class="kn">from</span> <span class="nn">dotenv</span> <span class="kn">import</span> <span class="n">load_dotenv</span>

<span class="n">load_dotenv</span><span class="p">()</span>

<span class="n">GITHUB_TOKEN</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">getenv</span><span class="p">(</span><span class="s">"GITHUB_TOKEN"</span><span class="p">)</span>
<span class="n">GITHUB_REPO</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">getenv</span><span class="p">(</span><span class="s">"GITHUB_REPO"</span><span class="p">)</span>
<span class="n">HEADERS</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"Authorization"</span><span class="p">:</span> <span class="sa">f</span><span class="s">"token </span><span class="si">{</span><span class="n">GITHUB_TOKEN</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
    <span class="s">"Accept"</span><span class="p">:</span> <span class="s">"application/vnd.github.v3+json"</span>
<span class="p">}</span>

<span class="o">@</span><span class="n">tool</span>
<span class="k">def</span> <span class="nf">get_workflow_logs</span><span class="p">(</span><span class="n">workflow_run_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""
    Fetch logs from a failed GitHub Actions workflow run.

    Args:
        workflow_run_id: The GitHub Actions workflow run ID

    Returns:
        String containing the workflow logs
    """</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="c1"># Get workflow run details
</span>        <span class="n">run_url</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"https://api.github.com/repos/</span><span class="si">{</span><span class="n">GITHUB_REPO</span><span class="si">}</span><span class="s">/actions/runs/</span><span class="si">{</span><span class="n">workflow_run_id</span><span class="si">}</span><span class="s">"</span>
        <span class="n">run_response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">run_url</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="n">HEADERS</span><span class="p">,</span> <span class="n">timeout</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
        <span class="n">run_response</span><span class="p">.</span><span class="n">raise_for_status</span><span class="p">()</span>
        <span class="n">run_data</span> <span class="o">=</span> <span class="n">run_response</span><span class="p">.</span><span class="n">json</span><span class="p">()</span>

        <span class="c1"># Get jobs for this workflow run
</span>        <span class="n">jobs_url</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">run_url</span><span class="si">}</span><span class="s">/jobs"</span>
        <span class="n">jobs_response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">jobs_url</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="n">HEADERS</span><span class="p">,</span> <span class="n">timeout</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
        <span class="n">jobs_response</span><span class="p">.</span><span class="n">raise_for_status</span><span class="p">()</span>
        <span class="n">jobs_data</span> <span class="o">=</span> <span class="n">jobs_response</span><span class="p">.</span><span class="n">json</span><span class="p">()</span>

        <span class="c1"># Extract logs from failed jobs
</span>        <span class="n">logs</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="n">logs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"Workflow: </span><span class="si">{</span><span class="n">run_data</span><span class="p">[</span><span class="s">'name'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="n">logs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"Status: </span><span class="si">{</span><span class="n">run_data</span><span class="p">[</span><span class="s">'conclusion'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="n">logs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"Started: </span><span class="si">{</span><span class="n">run_data</span><span class="p">[</span><span class="s">'created_at'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="n">logs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"Branch: </span><span class="si">{</span><span class="n">run_data</span><span class="p">[</span><span class="s">'head_branch'</span><span class="p">]</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>

        <span class="k">for</span> <span class="n">job</span> <span class="ow">in</span> <span class="n">jobs_data</span><span class="p">[</span><span class="s">'jobs'</span><span class="p">]:</span>
            <span class="k">if</span> <span class="n">job</span><span class="p">[</span><span class="s">'conclusion'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'failure'</span><span class="p">:</span>
                <span class="n">logs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">Failed Job: </span><span class="si">{</span><span class="n">job</span><span class="p">[</span><span class="s">'name'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
                <span class="n">logs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"Conclusion: </span><span class="si">{</span><span class="n">job</span><span class="p">[</span><span class="s">'conclusion'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

                <span class="c1"># Get job logs
</span>                <span class="n">log_url</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"https://api.github.com/repos/</span><span class="si">{</span><span class="n">GITHUB_REPO</span><span class="si">}</span><span class="s">/actions/jobs/</span><span class="si">{</span><span class="n">job</span><span class="p">[</span><span class="s">'id'</span><span class="p">]</span><span class="si">}</span><span class="s">/logs"</span>
                <span class="n">log_response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">log_url</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="n">HEADERS</span><span class="p">,</span> <span class="n">timeout</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>

                <span class="k">if</span> <span class="n">log_response</span><span class="p">.</span><span class="n">status_code</span> <span class="o">==</span> <span class="mi">200</span><span class="p">:</span>
                    <span class="c1"># Extract last 50 lines (most relevant errors are at the end)
</span>                    <span class="n">log_lines</span> <span class="o">=</span> <span class="n">log_response</span><span class="p">.</span><span class="n">text</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
                    <span class="n">relevant_logs</span> <span class="o">=</span> <span class="n">log_lines</span><span class="p">[</span><span class="o">-</span><span class="mi">50</span><span class="p">:]</span>
                    <span class="n">logs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">Last 50 lines of logs:"</span><span class="p">)</span>
                    <span class="n">logs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">relevant_logs</span><span class="p">))</span>

        <span class="k">return</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">logs</span><span class="p">)</span>

    <span class="k">except</span> <span class="n">requests</span><span class="p">.</span><span class="n">exceptions</span><span class="p">.</span><span class="n">RequestException</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">return</span> <span class="sa">f</span><span class="s">"Error fetching workflow logs: </span><span class="si">{</span><span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">)</span><span class="si">}</span><span class="s">"</span>


<span class="o">@</span><span class="n">tool</span>
<span class="k">def</span> <span class="nf">analyze_recent_commits</span><span class="p">(</span><span class="n">hours</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">24</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""
    Analyze recent commits to the repository that might have caused the failure.

    Args:
        hours: Number of hours to look back (default: 24)

    Returns:
        String containing recent commits with author, message, and files changed
    """</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">since</span> <span class="o">=</span> <span class="p">(</span><span class="n">datetime</span><span class="p">.</span><span class="n">utcnow</span><span class="p">()</span> <span class="o">-</span> <span class="n">timedelta</span><span class="p">(</span><span class="n">hours</span><span class="o">=</span><span class="n">hours</span><span class="p">)).</span><span class="n">isoformat</span><span class="p">()</span> <span class="o">+</span> <span class="s">'Z'</span>
        <span class="n">commits_url</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"https://api.github.com/repos/</span><span class="si">{</span><span class="n">GITHUB_REPO</span><span class="si">}</span><span class="s">/commits"</span>
        <span class="n">params</span> <span class="o">=</span> <span class="p">{</span><span class="s">'since'</span><span class="p">:</span> <span class="n">since</span><span class="p">,</span> <span class="s">'per_page'</span><span class="p">:</span> <span class="mi">10</span><span class="p">}</span>

        <span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">commits_url</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="n">HEADERS</span><span class="p">,</span> <span class="n">params</span><span class="o">=</span><span class="n">params</span><span class="p">)</span>
        <span class="n">response</span><span class="p">.</span><span class="n">raise_for_status</span><span class="p">()</span>
        <span class="n">commits</span> <span class="o">=</span> <span class="n">response</span><span class="p">.</span><span class="n">json</span><span class="p">()</span>

        <span class="k">if</span> <span class="ow">not</span> <span class="n">commits</span><span class="p">:</span>
            <span class="k">return</span> <span class="sa">f</span><span class="s">"No commits found in the last </span><span class="si">{</span><span class="n">hours</span><span class="si">}</span><span class="s"> hours."</span>

        <span class="n">result</span> <span class="o">=</span> <span class="p">[</span><span class="sa">f</span><span class="s">"Recent commits (last </span><span class="si">{</span><span class="n">hours</span><span class="si">}</span><span class="s"> hours):</span><span class="se">\n</span><span class="s">"</span><span class="p">]</span>

        <span class="k">for</span> <span class="n">commit</span> <span class="ow">in</span> <span class="n">commits</span><span class="p">:</span>
            <span class="n">sha</span> <span class="o">=</span> <span class="n">commit</span><span class="p">[</span><span class="s">'sha'</span><span class="p">][:</span><span class="mi">7</span><span class="p">]</span>
            <span class="n">author</span> <span class="o">=</span> <span class="n">commit</span><span class="p">[</span><span class="s">'commit'</span><span class="p">][</span><span class="s">'author'</span><span class="p">][</span><span class="s">'name'</span><span class="p">]</span>
            <span class="n">message</span> <span class="o">=</span> <span class="n">commit</span><span class="p">[</span><span class="s">'commit'</span><span class="p">][</span><span class="s">'message'</span><span class="p">].</span><span class="n">split</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>  <span class="c1"># First line only
</span>            <span class="n">date</span> <span class="o">=</span> <span class="n">commit</span><span class="p">[</span><span class="s">'commit'</span><span class="p">][</span><span class="s">'author'</span><span class="p">][</span><span class="s">'date'</span><span class="p">]</span>

            <span class="c1"># Get files changed in this commit
</span>            <span class="n">commit_detail_url</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"https://api.github.com/repos/</span><span class="si">{</span><span class="n">GITHUB_REPO</span><span class="si">}</span><span class="s">/commits/</span><span class="si">{</span><span class="n">commit</span><span class="p">[</span><span class="s">'sha'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span>
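            <span class="n">commit_response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">commit_detail_url</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="n">HEADERS</span><span class="p">)</span>
            <span class="n">commit_response</span><span class="p">.</span><span class="n">raise_for_status</span><span class="p">()</span>  <span class="c1"># Don't silently report zero files on an API error</span>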
            <span class="n">commit_data</span> <span class="o">=</span> <span class="n">commit_response</span><span class="p">.</span><span class="n">json</span><span class="p">()</span>

            <span class="n">files_changed</span> <span class="o">=</span> <span class="p">[</span><span class="n">f</span><span class="p">[</span><span class="s">'filename'</span><span class="p">]</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">commit_data</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'files'</span><span class="p">,</span> <span class="p">[])]</span>

            <span class="n">result</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">Commit </span><span class="si">{</span><span class="n">sha</span><span class="si">}</span><span class="s"> by </span><span class="si">{</span><span class="n">author</span><span class="si">}</span><span class="s"> (</span><span class="si">{</span><span class="n">date</span><span class="si">}</span><span class="s">)"</span><span class="p">)</span>
            <span class="n">result</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"Message: </span><span class="si">{</span><span class="n">message</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            <span class="n">result</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"Files changed: </span><span class="si">{</span><span class="s">', '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">files_changed</span><span class="p">[</span><span class="si">:</span><span class="mi">5</span><span class="p">])</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>  <span class="c1"># First 5 files
</span>            <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">files_changed</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">5</span><span class="p">:</span>
                <span class="n">result</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"... and </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">files_changed</span><span class="p">)</span> <span class="o">-</span> <span class="mi">5</span><span class="si">}</span><span class="s"> more files"</span><span class="p">)</span>

        <span class="k">return</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>

    <span class="k">except</span> <span class="n">requests</span><span class="p">.</span><span class="n">exceptions</span><span class="p">.</span><span class="n">RequestException</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">return</span> <span class="sa">f</span><span class="s">"Error analyzing commits: </span><span class="si">{</span><span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">)</span><span class="si">}</span><span class="s">"</span>


<span class="o">@</span><span class="n">tool</span>
<span class="k">def</span> <span class="nf">search_similar_issues</span><span class="p">(</span><span class="n">error_keywords</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""
    Search GitHub issues for similar error messages or problems.

    Args:
        error_keywords: Keywords from the error message to search for

    Returns:
        String containing relevant GitHub issues and their solutions
    """</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="c1"># Search issues in the repository
</span>        <span class="n">search_url</span> <span class="o">=</span> <span class="s">"https://api.github.com/search/issues"</span>
        <span class="n">query</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"repo:</span><span class="si">{</span><span class="n">GITHUB_REPO</span><span class="si">}</span><span class="s"> </span><span class="si">{</span><span class="n">error_keywords</span><span class="si">}</span><span class="s"> is:issue"</span>
        <span class="n">params</span> <span class="o">=</span> <span class="p">{</span><span class="s">'q'</span><span class="p">:</span> <span class="n">query</span><span class="p">,</span> <span class="s">'sort'</span><span class="p">:</span> <span class="s">'relevance'</span><span class="p">,</span> <span class="s">'per_page'</span><span class="p">:</span> <span class="mi">5</span><span class="p">}</span>

        <span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">search_url</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="n">HEADERS</span><span class="p">,</span> <span class="n">params</span><span class="o">=</span><span class="n">params</span><span class="p">)</span>
        <span class="n">response</span><span class="p">.</span><span class="n">raise_for_status</span><span class="p">()</span>
        <span class="n">issues</span> <span class="o">=</span> <span class="n">response</span><span class="p">.</span><span class="n">json</span><span class="p">()</span>

        <span class="k">if</span> <span class="n">issues</span><span class="p">[</span><span class="s">'total_count'</span><span class="p">]</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
            <span class="k">return</span> <span class="sa">f</span><span class="s">"No similar issues found for keywords: </span><span class="si">{</span><span class="n">error_keywords</span><span class="si">}</span><span class="s">"</span>

        <span class="n">result</span> <span class="o">=</span> <span class="p">[</span><span class="sa">f</span><span class="s">"Found </span><span class="si">{</span><span class="n">issues</span><span class="p">[</span><span class="s">'total_count'</span><span class="p">]</span><span class="si">}</span><span class="s"> similar issues:</span><span class="se">\n</span><span class="s">"</span><span class="p">]</span>

        <span class="k">for</span> <span class="n">issue</span> <span class="ow">in</span> <span class="n">issues</span><span class="p">[</span><span class="s">'items'</span><span class="p">][:</span><span class="mi">5</span><span class="p">]:</span>
            <span class="n">result</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">#</span><span class="si">{</span><span class="n">issue</span><span class="p">[</span><span class="s">'number'</span><span class="p">]</span><span class="si">}</span><span class="s">: </span><span class="si">{</span><span class="n">issue</span><span class="p">[</span><span class="s">'title'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            <span class="n">result</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"State: </span><span class="si">{</span><span class="n">issue</span><span class="p">[</span><span class="s">'state'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            <span class="n">result</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"URL: </span><span class="si">{</span><span class="n">issue</span><span class="p">[</span><span class="s">'html_url'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

            <span class="c1"># Get first comment if issue is closed (might contain solution)
</span>            <span class="k">if</span> <span class="n">issue</span><span class="p">[</span><span class="s">'state'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'closed'</span> <span class="ow">and</span> <span class="n">issue</span><span class="p">[</span><span class="s">'comments'</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
                <span class="n">comments_url</span> <span class="o">=</span> <span class="n">issue</span><span class="p">[</span><span class="s">'comments_url'</span><span class="p">]</span>
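                <span class="n">comments_response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">comments_url</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="n">HEADERS</span><span class="p">)</span>
                <span class="n">comments_response</span><span class="p">.</span><span class="n">raise_for_status</span><span class="p">()</span>  <span class="c1"># An error payload is a dict, so comments[0] would raise KeyError</span>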
                <span class="n">comments</span> <span class="o">=</span> <span class="n">comments_response</span><span class="p">.</span><span class="n">json</span><span class="p">()</span>
                <span class="k">if</span> <span class="n">comments</span><span class="p">:</span>
                    <span class="n">first_comment</span> <span class="o">=</span> <span class="n">comments</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="s">'body'</span><span class="p">][:</span><span class="mi">200</span><span class="p">]</span>  <span class="c1"># First 200 chars
</span>                    <span class="n">result</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"Solution hint: </span><span class="si">{</span><span class="n">first_comment</span><span class="si">}</span><span class="s">..."</span><span class="p">)</span>

        <span class="k">return</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>

    <span class="k">except</span> <span class="n">requests</span><span class="p">.</span><span class="n">exceptions</span><span class="p">.</span><span class="n">RequestException</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">return</span> <span class="sa">f</span><span class="s">"Error searching issues: </span><span class="si">{</span><span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">)</span><span class="si">}</span><span class="s">"</span>
</code></pre></div></div>
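
<p>One detail worth a note: <code class="language-plaintext highlighter-rouge">datetime.utcnow()</code> is deprecated as of Python 3.12. Here is a minimal sketch of a timezone-aware way to build the <code class="language-plaintext highlighter-rouge">since</code> value. The helper name <code class="language-plaintext highlighter-rouge">github_since</code> and its <code class="language-plaintext highlighter-rouge">now</code> parameter are assumptions for illustration, not part of the tools above.</p>

```python
from datetime import datetime, timedelta, timezone


def github_since(hours, now=None):
    """Return an ISO 8601 UTC timestamp `hours` before `now`, ending in 'Z'."""
    # `now` is injectable so the helper can be tested with a fixed clock.
    if now is None:
        now = datetime.now(timezone.utc)
    cutoff = now.astimezone(timezone.utc) - timedelta(hours=hours)
    # Drop microseconds and swap the "+00:00" offset for the "Z" suffix.
    return cutoff.replace(microsecond=0).isoformat().replace("+00:00", "Z")


fixed = datetime(2024, 1, 2, 12, 0, 0, tzinfo=timezone.utc)
print(github_since(24, now=fixed))  # 2024-01-01T12:00:00Z
```

<p>If you adopt something like this, pass an aware datetime for <code class="language-plaintext highlighter-rouge">now</code>; a naive one would be interpreted as local time.</p>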

<h3 id="step-2-create-the-agent-with-llm-provider-support">Step 2: Create the Agent with LLM Provider Support</h3>

<p>Now we’ll create the agent with support for both OpenAI and OpenRouter:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">langchain.agents</span> <span class="kn">import</span> <span class="n">create_openai_tools_agent</span><span class="p">,</span> <span class="n">AgentExecutor</span>
<span class="kn">from</span> <span class="nn">langchain_openai</span> <span class="kn">import</span> <span class="n">ChatOpenAI</span>
<span class="kn">from</span> <span class="nn">langchain_core.prompts</span> <span class="kn">import</span> <span class="n">ChatPromptTemplate</span><span class="p">,</span> <span class="n">MessagesPlaceholder</span>

<span class="k">def</span> <span class="nf">get_llm</span><span class="p">():</span>
    <span class="s">"""
    Initialize the LLM based on environment configuration.
    Supports both OpenAI directly and OpenRouter.
    """</span>
    <span class="n">use_openrouter</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">getenv</span><span class="p">(</span><span class="s">"USE_OPENROUTER"</span><span class="p">,</span> <span class="s">"false"</span><span class="p">).</span><span class="n">lower</span><span class="p">()</span> <span class="o">==</span> <span class="s">"true"</span>

    <span class="k">if</span> <span class="n">use_openrouter</span><span class="p">:</span>
        <span class="c1"># Using OpenRouter for access to multiple models
</span>        <span class="n">api_key</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">getenv</span><span class="p">(</span><span class="s">"OPENROUTER_API_KEY"</span><span class="p">)</span>
        <span class="n">model_name</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">getenv</span><span class="p">(</span><span class="s">"MODEL_NAME"</span><span class="p">,</span> <span class="s">"anthropic/claude-3.5-sonnet"</span><span class="p">)</span>

        <span class="n">llm</span> <span class="o">=</span> <span class="n">ChatOpenAI</span><span class="p">(</span>
            <span class="n">model</span><span class="o">=</span><span class="n">model_name</span><span class="p">,</span>
            <span class="n">openai_api_key</span><span class="o">=</span><span class="n">api_key</span><span class="p">,</span>
            <span class="n">openai_api_base</span><span class="o">=</span><span class="s">"https://openrouter.ai/api/v1"</span><span class="p">,</span>
            <span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
            <span class="n">default_headers</span><span class="o">=</span><span class="p">{</span>
                <span class="s">"HTTP-Referer"</span><span class="p">:</span> <span class="s">"https://github.com/your-username/pipeline-agent"</span><span class="p">,</span>
                <span class="s">"X-Title"</span><span class="p">:</span> <span class="s">"Pipeline Health Monitor Agent"</span>
            <span class="p">}</span>
        <span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Using OpenRouter with model: </span><span class="si">{</span><span class="n">model_name</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="c1"># Using OpenAI directly
</span>        <span class="n">api_key</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">getenv</span><span class="p">(</span><span class="s">"OPENAI_API_KEY"</span><span class="p">)</span>
        <span class="n">llm</span> <span class="o">=</span> <span class="n">ChatOpenAI</span><span class="p">(</span>
            <span class="n">model</span><span class="o">=</span><span class="s">"gpt-4"</span><span class="p">,</span>
            <span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
            <span class="n">openai_api_key</span><span class="o">=</span><span class="n">api_key</span>
        <span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"Using OpenAI GPT-4"</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">llm</span>

<span class="c1"># Initialize the LLM
</span><span class="n">llm</span> <span class="o">=</span> <span class="n">get_llm</span><span class="p">()</span>

<span class="c1"># Define the system prompt
</span><span class="n">system_prompt</span> <span class="o">=</span> <span class="s">"""You are an expert DevOps AI agent that investigates CI/CD pipeline failures.

Your role is to:
1. Analyze workflow logs to identify the root cause of failures
2. Examine recent code changes that might have introduced issues
3. Search for similar problems in the issue tracker
4. Provide a clear, actionable root cause analysis

When analyzing failures:
- Focus on the actual error messages, not just symptoms
- Consider recent code changes as potential causes
- Look for patterns in similar past issues
- Be specific about what broke and why
- Suggest concrete fixes, not vague advice

Your investigation should be thorough but concise. Developers need actionable insights, not lengthy explanations.

Output format:
**Root Cause**: [One sentence summary]
**Evidence**: [Key findings from logs/commits/issues]
**Recommendation**: [Specific steps to fix]
**Related Issues**: [Links to similar problems if found]
"""</span>

<span class="c1"># Create the prompt template
</span><span class="n">prompt</span> <span class="o">=</span> <span class="n">ChatPromptTemplate</span><span class="p">.</span><span class="n">from_messages</span><span class="p">([</span>
    <span class="p">(</span><span class="s">"system"</span><span class="p">,</span> <span class="n">system_prompt</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"human"</span><span class="p">,</span> <span class="s">"{input}"</span><span class="p">),</span>
    <span class="n">MessagesPlaceholder</span><span class="p">(</span><span class="n">variable_name</span><span class="o">=</span><span class="s">"agent_scratchpad"</span><span class="p">),</span>
<span class="p">])</span>

<span class="c1"># Create the agent
</span><span class="n">tools</span> <span class="o">=</span> <span class="p">[</span><span class="n">get_workflow_logs</span><span class="p">,</span> <span class="n">analyze_recent_commits</span><span class="p">,</span> <span class="n">search_similar_issues</span><span class="p">]</span>
<span class="n">agent</span> <span class="o">=</span> <span class="n">create_openai_tools_agent</span><span class="p">(</span><span class="n">llm</span><span class="p">,</span> <span class="n">tools</span><span class="p">,</span> <span class="n">prompt</span><span class="p">)</span>

<span class="c1"># Create the agent executor
</span><span class="n">agent_executor</span> <span class="o">=</span> <span class="n">AgentExecutor</span><span class="p">(</span>
    <span class="n">agent</span><span class="o">=</span><span class="n">agent</span><span class="p">,</span>
    <span class="n">tools</span><span class="o">=</span><span class="n">tools</span><span class="p">,</span>
    <span class="n">verbose</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">max_iterations</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span>
    <span class="n">handle_parsing_errors</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
</code></pre></div></div>
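
<p>As written, <code class="language-plaintext highlighter-rouge">get_llm()</code> will happily pass a missing API key along and only fail on the first request. A small sketch of checking the configuration up front might look like this. The function name <code class="language-plaintext highlighter-rouge">resolve_llm_settings</code> is hypothetical; the environment variable names and defaults are the ones used above.</p>

```python
import os


def resolve_llm_settings(env=None):
    """Return keyword arguments for ChatOpenAI, or raise if the API key is missing."""
    env = os.environ if env is None else env
    use_openrouter = env.get("USE_OPENROUTER", "false").lower() == "true"
    key_name = "OPENROUTER_API_KEY" if use_openrouter else "OPENAI_API_KEY"
    api_key = env.get(key_name)
    if not api_key:
        # Fail before any agent is built, naming the variable that needs to be set.
        raise RuntimeError(f"{key_name} is not set")

    settings = {"openai_api_key": api_key, "temperature": 0}
    if use_openrouter:
        settings["model"] = env.get("MODEL_NAME", "anthropic/claude-3.5-sonnet")
        settings["openai_api_base"] = "https://openrouter.ai/api/v1"
    else:
        settings["model"] = "gpt-4"
    return settings


demo_env = {"USE_OPENROUTER": "true", "OPENROUTER_API_KEY": "sk-test"}
print(resolve_llm_settings(demo_env)["model"])  # anthropic/claude-3.5-sonnet
```

<p>The returned dict could then be unpacked into <code class="language-plaintext highlighter-rouge">ChatOpenAI(**settings)</code>. Keeping this step free of network calls also makes it easy to unit test.</p>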

<h3 id="step-3-run-the-investigation">Step 3: Run the Investigation</h3>

<p>Finally, we create a function to trigger the investigation:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">investigate_failure</span><span class="p">(</span><span class="n">workflow_run_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="s">"""
    Investigate a failed GitHub Actions workflow.

    Args:
        workflow_run_id: The GitHub Actions workflow run ID

    Returns:
        Dict containing the investigation result
    """</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">Starting investigation for workflow run </span><span class="si">{</span><span class="n">workflow_run_id</span><span class="si">}</span><span class="s">..."</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"="</span> <span class="o">*</span> <span class="mi">60</span><span class="p">)</span>

    <span class="n">input_text</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"""A GitHub Actions workflow has failed (run ID: </span><span class="si">{</span><span class="n">workflow_run_id</span><span class="si">}</span><span class="s">).

Please investigate this failure by:
1. Fetching and analyzing the workflow logs
2. Checking recent commits for changes that might have caused this
3. Searching for similar issues that might provide insights

Provide a comprehensive root cause analysis with specific recommendations."""</span>

    <span class="k">try</span><span class="p">:</span>
        <span class="n">result</span> <span class="o">=</span> <span class="n">agent_executor</span><span class="p">.</span><span class="n">invoke</span><span class="p">({</span><span class="s">"input"</span><span class="p">:</span> <span class="n">input_text</span><span class="p">})</span>

        <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span> <span class="o">+</span> <span class="s">"="</span> <span class="o">*</span> <span class="mi">60</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"INVESTIGATION COMPLETE"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"="</span> <span class="o">*</span> <span class="mi">60</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="n">result</span><span class="p">[</span><span class="s">'output'</span><span class="p">])</span>

        <span class="k">return</span> <span class="p">{</span>
            <span class="s">"success"</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
            <span class="s">"workflow_run_id"</span><span class="p">:</span> <span class="n">workflow_run_id</span><span class="p">,</span>
            <span class="s">"analysis"</span><span class="p">:</span> <span class="n">result</span><span class="p">[</span><span class="s">'output'</span><span class="p">]</span>
        <span class="p">}</span>

    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">Error during investigation: </span><span class="si">{</span><span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">return</span> <span class="p">{</span>
            <span class="s">"success"</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span>
            <span class="s">"workflow_run_id"</span><span class="p">:</span> <span class="n">workflow_run_id</span><span class="p">,</span>
            <span class="s">"error"</span><span class="p">:</span> <span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">)</span>
        <span class="p">}</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="kn">import</span> <span class="nn">sys</span>

    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"Usage: python agent_investigator.py &lt;workflow_run_id&gt;"</span><span class="p">)</span>
        <span class="n">sys</span><span class="p">.</span><span class="nb">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>

    <span class="n">workflow_run_id</span> <span class="o">=</span> <span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">investigate_failure</span><span class="p">(</span><span class="n">workflow_run_id</span><span class="p">)</span>
</code></pre></div></div>
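
<p>Since <code class="language-plaintext highlighter-rouge">investigate_failure()</code> returns a plain dict, other scripts can consume it directly. Here is a rough sketch of turning that dict into a short text report, for example to post somewhere later. The <code class="language-plaintext highlighter-rouge">format_report</code> helper is an assumption for illustration; it relies only on the keys returned above.</p>

```python
def format_report(result):
    """Render the dict returned by investigate_failure() as a short text report."""
    header = f"Workflow run {result['workflow_run_id']}"
    if result["success"]:
        return f"{header}: investigation complete\n\n{result['analysis']}"
    return f"{header}: investigation failed\n\nError: {result['error']}"


# Hand-written sample in the same shape as the failure branch above.
sample = {"success": False, "workflow_run_id": "12345678", "error": "timeout"}
print(format_report(sample))
```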

<h3 id="model-recommendations-via-openrouter">Model Recommendations via OpenRouter</h3>

<p>Here are some good model choices for DevOps investigations:</p>

<p><strong>For best reasoning (higher cost):</strong></p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">anthropic/claude-3.5-sonnet</code> - Excellent at technical analysis</li>
  <li><code class="language-plaintext highlighter-rouge">openai/gpt-4-turbo</code> - Strong general reasoning</li>
  <li><code class="language-plaintext highlighter-rouge">google/gemini-pro-1.5</code> - Good for long context (helpful with large logs)</li>
</ul>

<p><strong>For cost efficiency (lower cost):</strong></p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">anthropic/claude-3-haiku</code> - Fast and cheap, good for simple failures</li>
  <li><code class="language-plaintext highlighter-rouge">openai/gpt-3.5-turbo</code> - Decent reasoning, very affordable</li>
  <li><code class="language-plaintext highlighter-rouge">meta-llama/llama-3.1-70b-instruct</code> - Open source, cost-effective</li>
</ul>

<p><strong>Cost comparison per investigation:</strong></p>

<ul>
  <li>GPT-4: ~$0.15-0.30</li>
  <li>Claude 3.5 Sonnet: ~$0.10-0.20</li>
  <li>GPT-3.5: ~$0.02-0.05</li>
  <li>Llama 3.1 70B: ~$0.01-0.03</li>
</ul>
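
<p>If you want to sanity-check these ranges for your own workloads, the arithmetic is simple: tokens in and out, each multiplied by the per-token price. A minimal sketch follows. The token counts and prices in the example call are placeholders, not real pricing or measured usage, so plug in the current numbers from your provider.</p>

```python
def estimate_cost(input_tokens, output_tokens, input_price, output_price):
    """Estimate the dollar cost of one run; prices are dollars per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Placeholder numbers for illustration only, not real pricing or measured usage.
cost = estimate_cost(input_tokens=20_000, output_tokens=1_000, input_price=3.0, output_price=15.0)
print(f"${cost:.3f}")
```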

<h3 id="how-it-works">How It Works</h3>

<p>Let’s walk through what happens when you run this:</p>

<ol>
  <li>
    <p><strong>You trigger the agent</strong>: <code class="language-plaintext highlighter-rouge">python agent_investigator.py 12345678</code></p>
  </li>
  <li>
    <p><strong>Agent receives the task</strong>: “Investigate workflow run 12345678”</p>
  </li>
  <li>
    <p><strong>LLM decides first action</strong>: “I should fetch the workflow logs to see what failed”</p>
  </li>
  <li>
    <p><strong>Agent calls</strong> <code class="language-plaintext highlighter-rouge">get_workflow_logs()</code>: Returns the last 50 lines of failed job logs</p>
  </li>
  <li>
    <p><strong>LLM analyzes logs</strong>: “I see a database connection error. Let me check recent commits for database config changes”</p>
  </li>
  <li>
    <p><strong>Agent calls</strong> <code class="language-plaintext highlighter-rouge">analyze_recent_commits()</code>: Returns commits from the last 24 hours</p>
  </li>
  <li>
    <p><strong>LLM finds suspicious commit</strong>: “Commit abc123 changed database.yml. Let me search for similar issues”</p>
  </li>
  <li>
    <p><strong>Agent calls</strong> <code class="language-plaintext highlighter-rouge">search_similar_issues()</code>: Finds issue #42 about database connection problems</p>
  </li>
  <li>
    <p><strong>LLM synthesizes findings</strong>: Produces a final report with root cause and fix</p>
  </li>
</ol>

<p>The entire process takes 10-30 seconds depending on the complexity.</p>
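
<p>To make the loop above concrete, here is a toy version of it in plain Python. This is not how <code class="language-plaintext highlighter-rouge">AgentExecutor</code> is implemented; it is a sketch of the same decide, call a tool, observe cycle, with a scripted function standing in for the LLM. All names here (<code class="language-plaintext highlighter-rouge">run_agent</code>, <code class="language-plaintext highlighter-rouge">scripted_decide</code>, <code class="language-plaintext highlighter-rouge">fake_tools</code>) are hypothetical.</p>

```python
def run_agent(decide, tools, task, max_iterations=5):
    """Alternate between a decision function and tool calls until a final answer."""
    observations = []
    for _ in range(max_iterations):
        # `decide` stands in for the LLM: it sees the task and all observations so far.
        action = decide(task, observations)
        if action["type"] == "final":
            return action["answer"]
        # Run the requested tool and feed its output back in on the next turn.
        output = tools[action["tool"]](**action["args"])
        observations.append((action["tool"], output))
    return "Stopped: iteration limit reached"


def scripted_decide(task, observations):
    """Hypothetical stand-in for the model: logs first, then commits, then answer."""
    plan = ["get_workflow_logs", "analyze_recent_commits"]
    if len(observations) == len(plan):
        return {"type": "final", "answer": f"Report based on {len(observations)} tool calls"}
    return {"type": "tool", "tool": plan[len(observations)], "args": {}}


fake_tools = {
    "get_workflow_logs": lambda: "OperationalError: too many clients",
    "analyze_recent_commits": lambda: "Commit abc123 changed database.yml",
}
print(run_agent(scripted_decide, fake_tools, "Investigate workflow run 12345678"))
```

<p>The iteration cap plays the same role as <code class="language-plaintext highlighter-rouge">max_iterations</code> in the executor: if the model never produces a final answer, the loop still ends.</p>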

<h3 id="example-output">Example Output</h3>

<p>Here’s what the agent produces for a real failure:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Root Cause: Database connection pool exhaustion caused by recent increase in concurrent workers without adjusting max_connections setting.

Evidence:
- Workflow logs show "psycopg2.OperationalError: FATAL: sorry, too many clients already"
- Commit d4e5f6a (2 hours ago) changed worker count from 4 to 16 in deploy.yml
- Issue #127 documented same error when worker count was increased last month

Recommendation:
1. Increase PostgreSQL max_connections from 100 to 200 in database config
2. Or reduce worker count to 8 as a temporary fix
3. Add connection pooling with PgBouncer for better resource management

Related Issues:
- #127: Database connection errors after scaling workers
- #89: PostgreSQL connection pool configuration guide
</code></pre></div></div>

<p>This is exactly what you need: the root cause, evidence, and actionable fixes.</p>

<h3 id="key-design-decisions">Key Design Decisions</h3>

<p><strong>Why max_iterations=5?</strong> Prevents infinite loops. Most investigations complete in 3-4 iterations.</p>

<p><strong>Why last 50 lines of logs?</strong> Error messages are typically at the end. Sending full logs wastes tokens and costs money.</p>

<p><strong>Why temperature=0?</strong> We want consistent, factual analysis. A temperature of 0 makes the output as repeatable as the model allows. Higher temperatures add creativity, which we don’t need for debugging.</p>

<p><strong>Why support OpenRouter?</strong> Gives you flexibility to switch models based on cost and performance. Claude 3.5 Sonnet often performs better than GPT-4 for technical debugging at a lower price.</p>
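<p>The “last 50 lines” decision is easy to picture in code. This is a small sketch of the trimming step, assuming the logs arrive as one string. The helper name <code class="language-plaintext highlighter-rouge">tail_lines</code> is made up for illustration and is not necessarily how <code class="language-plaintext highlighter-rouge">get_workflow_logs()</code> does it.</p>

```python
def tail_lines(log_text: str, max_lines: int = 50) -> str:
    """Return at most the last `max_lines` lines of `log_text`."""
    if max_lines <= 0:
        return ""
    lines = log_text.splitlines()
    return "\n".join(lines[-max_lines:])


if __name__ == "__main__":
    # Build a fake 200-line log and keep only the tail.
    sample = "\n".join(f"line {i}" for i in range(1, 201))
    trimmed = tail_lines(sample)
    print(len(trimmed.splitlines()))  # 50
    print(trimmed.splitlines()[0])    # line 151
```

<p>If your failures tend to print the real error earlier (for example, before a long teardown), raising <code class="language-plaintext highlighter-rouge">max_lines</code> trades a few more tokens for better context.</p>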

<p>In the next section, we’ll integrate this agent with GitHub Actions so it runs automatically when workflows fail.</p>

<h2 id="github-actions-integration">GitHub Actions Integration</h2>

<p>Now that we have a working agent, let’s integrate it with GitHub Actions so it automatically investigates failures. We’ll use GitHub’s workflow events to trigger our agent whenever a pipeline fails.</p>

<h3 id="architecture-overview">Architecture Overview</h3>

<p>Here’s how the integration works:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GitHub Actions Workflow Fails
    ↓
GitHub triggers workflow_run event
    ↓
Our "Investigate Failure" workflow runs
    ↓
Calls agent_investigator.py with workflow ID
    ↓
Agent investigates and generates report
    ↓
Posts results to GitHub issue or Slack
</code></pre></div></div>

<h3 id="step-1-set-up-github-secrets">Step 1: Set Up GitHub Secrets</h3>

<p>First, add your API keys to GitHub repository secrets:</p>

<ol>
  <li>Go to your repository on GitHub</li>
  <li>Click <strong>Settings &gt; Secrets and variables &gt; Actions</strong></li>
  <li>Click <strong>New repository secret</strong></li>
  <li>Add these secrets:</li>
</ol>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>OPENAI_API_KEY (or OPENROUTER_API_KEY)
GITHUB_TOKEN (automatically provided by GitHub Actions; no need to add it yourself)
SLACK_WEBHOOK_URL (optional, for notifications)
</code></pre></div></div>

<p>For OpenRouter users, also add:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>USE_OPENROUTER=true
MODEL_NAME=anthropic/claude-3.5-sonnet
</code></pre></div></div>

<h3 id="step-2-create-the-investigation-workflow">Step 2: Create the Investigation Workflow</h3>

<p>Create a new file <code class="language-plaintext highlighter-rouge">.github/workflows/investigate-failures.yml</code>:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">name</span><span class="pi">:</span> <span class="s">AI Agent - Investigate Failures</span>

<span class="na">on</span><span class="pi">:</span>
  <span class="na">workflow_run</span><span class="pi">:</span>
    <span class="na">workflows</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">*"</span><span class="pi">]</span>  <span class="c1"># Monitor all workflows</span>
    <span class="na">types</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">completed</span>

<span class="na">jobs</span><span class="pi">:</span>
  <span class="na">investigate</span><span class="pi">:</span>
    <span class="c1"># Only run if the workflow failed</span>
    <span class="na">if</span><span class="pi">:</span> <span class="s">${{ github.event.workflow_run.conclusion == 'failure' }}</span>
    <span class="na">runs-on</span><span class="pi">:</span> <span class="s">ubuntu-latest</span>

    <span class="na">steps</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Checkout code</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v4</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Set up Python</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/setup-python@v5</span>
        <span class="na">with</span><span class="pi">:</span>
          <span class="na">python-version</span><span class="pi">:</span> <span class="s1">'</span><span class="s">3.11'</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Install uv</span>
        <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
          <span class="s">curl -LsSf https://astral.sh/uv/install.sh | sh</span>
          <span class="s">echo "$HOME/.local/bin" &gt;&gt; $GITHUB_PATH</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Create virtual environment and install dependencies</span>
        <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
          <span class="s">uv venv</span>
          <span class="s">source .venv/bin/activate</span>
          <span class="s">uv pip install langchain langchain-openai requests python-dotenv</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Run AI investigation</span>
        <span class="na">env</span><span class="pi">:</span>
          <span class="na">GITHUB_TOKEN</span><span class="pi">:</span> <span class="s">${{ secrets.GITHUB_TOKEN }}</span>
          <span class="na">GITHUB_REPO</span><span class="pi">:</span> <span class="s">${{ github.repository }}</span>
          <span class="na">OPENAI_API_KEY</span><span class="pi">:</span> <span class="s">${{ secrets.OPENAI_API_KEY }}</span>
          <span class="na">OPENROUTER_API_KEY</span><span class="pi">:</span> <span class="s">${{ secrets.OPENROUTER_API_KEY }}</span>
          <span class="na">USE_OPENROUTER</span><span class="pi">:</span> <span class="s">${{ secrets.USE_OPENROUTER }}</span>
          <span class="na">MODEL_NAME</span><span class="pi">:</span> <span class="s">${{ secrets.MODEL_NAME }}</span>
        <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
          <span class="s">source .venv/bin/activate</span>
          <span class="s">python agent_investigator.py ${{ github.event.workflow_run.id }}</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Post results to GitHub issue</span>
        <span class="na">if</span><span class="pi">:</span> <span class="s">always()</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/github-script@v7</span>
        <span class="na">with</span><span class="pi">:</span>
          <span class="na">script</span><span class="pi">:</span> <span class="pi">|</span>
            <span class="s">const fs = require('fs');</span>

            <span class="s">// Read the investigation results</span>
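            <span class="s">const workflowName = '${{ github.event.workflow_run.name }}';</span>
            <span class="s">const workflowUrl = '${{ github.event.workflow_run.html_url }}';</span>
            <span class="s">const runId = '${{ github.event.workflow_run.id }}';</span>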

            <span class="s">// Create or update issue with findings</span>
            <span class="s">const title = `Pipeline Failure: ${workflowName}`;</span>
            <span class="s">const body = `## Automated Investigation Report</span>

<span class="na">**Workflow**</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">$</span><span class="pi">{</span><span class="nv">workflowName</span><span class="pi">}]</span><span class="s">(${workflowUrl})</span>
<span class="na">**Run ID**</span><span class="pi">:</span> <span class="s">${runId}</span>
<span class="na">**Branch**</span><span class="pi">:</span> <span class="s">${{ github.event.workflow_run.head_branch }}</span>
<span class="na">**Commit**</span><span class="pi">:</span> <span class="s">${{ github.event.workflow_run.head_sha }}</span>

<span class="c1">### Investigation Results</span>

<span class="s">The AI agent has completed its investigation. Check the workflow logs for detailed analysis.</span>

<span class="na">**Next Steps**</span><span class="pi">:</span>
<span class="s">1. Review the root cause analysis in the workflow logs</span>
<span class="s">2. Check the recommended fixes</span>
<span class="s">3. Review related issues if any were found</span>
<span class="s">4. Apply the fix and re-run the workflow</span>

<span class="nn">---</span>
<span class="nv">*This</span> <span class="s">issue was automatically created by the Pipeline Health Monitor AI Agent*</span>
<span class="err">`</span><span class="s">;</span>

            <span class="s">// Search for existing open issue</span>
            <span class="s">const issues = await github.rest.issues.listForRepo({</span>
              <span class="s">owner</span><span class="err">:</span> <span class="s">context.repo.owner,</span>
              <span class="s">repo</span><span class="err">:</span> <span class="s">context.repo.repo,</span>
              <span class="s">state</span><span class="err">:</span> <span class="s1">'</span><span class="s">open'</span><span class="err">,</span>
              <span class="na">labels</span><span class="pi">:</span> <span class="pi">[</span><span class="s1">'</span><span class="s">pipeline-failure'</span><span class="pi">,</span> <span class="s1">'</span><span class="s">ai-investigated'</span><span class="pi">]</span>
<span class="err">            }</span><span class="s">);</span>

            <span class="s">const existingIssue = issues.data.find(issue =&gt;</span>
              <span class="s">issue.title.includes(workflowName)</span>
            <span class="s">);</span>

            <span class="s">if (existingIssue) {</span>
              <span class="s">// Update existing issue</span>
              <span class="s">await github.rest.issues.createComment({</span>
                <span class="s">owner</span><span class="err">:</span> <span class="s">context.repo.owner,</span>
                <span class="s">repo</span><span class="err">:</span> <span class="s">context.repo.repo,</span>
                <span class="s">issue_number</span><span class="err">:</span> <span class="s">existingIssue.number,</span>
                <span class="s">body</span><span class="err">:</span> <span class="err">`</span><span class="c1">## New Failure Detected\n\n${body}`</span>
              <span class="err">}</span><span class="s">);</span>
            <span class="s">} else {</span>
              <span class="s">// Create new issue</span>
              <span class="s">await github.rest.issues.create({</span>
                <span class="s">owner</span><span class="err">:</span> <span class="s">context.repo.owner,</span>
                <span class="s">repo</span><span class="err">:</span> <span class="s">context.repo.repo,</span>
                <span class="s">title</span><span class="err">:</span> <span class="s">title,</span>
                <span class="s">body</span><span class="err">:</span> <span class="s">body,</span>
                <span class="s">labels</span><span class="err">:</span> <span class="pi">[</span><span class="s1">'</span><span class="s">pipeline-failure'</span><span class="pi">,</span> <span class="s1">'</span><span class="s">ai-investigated'</span><span class="pi">]</span>
              <span class="err">}</span><span class="s">);</span>
            <span class="s">}</span>
</code></pre></div></div>

<h3 id="how-it-works-in-production">How It Works in Production</h3>

<p>Once deployed, here’s what happens automatically:</p>

<ol>
  <li>Developer pushes code that breaks a test</li>
  <li>CI pipeline fails (tests, build, deployment, etc.)</li>
  <li>GitHub triggers the <code class="language-plaintext highlighter-rouge">workflow_run</code> event</li>
  <li>Investigation workflow starts within seconds</li>
  <li>Agent fetches logs, analyzes commits, searches issues</li>
  <li>LLM reasons about the root cause</li>
  <li>Results posted to a GitHub issue (and Slack, if configured)</li>
  <li>Developer sees detailed analysis with fix recommendations</li>
</ol>

<p>All of this happens within 30-60 seconds of the failure.</p>

<h3 id="cost-considerations">Cost Considerations</h3>

<p>Each investigation costs approximately:</p>

<ul>
  <li>GPT-4: $0.15-0.30 per investigation</li>
  <li>Claude 3.5 Sonnet (via OpenRouter): $0.10-0.20</li>
  <li>GPT-3.5: $0.02-0.05</li>
</ul>

<p>For a team with:</p>

<ul>
  <li>20 pipeline failures per day</li>
  <li>Using Claude 3.5 Sonnet ($0.15 average)</li>
</ul>

<p>Monthly cost: 20 × $0.15 × 30 = $90</p>

<p>Compare this to:</p>

<ul>
  <li>Developer time investigating failures: 30 min × 20 failures = 10 hours/day</li>
  <li>At $100/hour = $1,000/day saved</li>
</ul>

<p>The ROI is clear.</p>
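<p>If you want to plug in your own numbers, the arithmetic above fits in a few lines. This sketch just reproduces the figures from this section. The function names are made up, and the 30-day month and 30 minutes per failure are the same assumptions used above.</p>

```python
def monthly_agent_cost(
    failures_per_day: int, cost_per_investigation: float, days: int = 30
) -> float:
    """Monthly LLM spend: failures/day x cost per investigation x days."""
    return round(failures_per_day * cost_per_investigation * days, 2)


def daily_manual_cost(
    failures_per_day: int, minutes_per_failure: float, hourly_rate: float
) -> float:
    """Daily cost of developers investigating every failure by hand."""
    hours = failures_per_day * minutes_per_failure / 60
    return round(hours * hourly_rate, 2)


if __name__ == "__main__":
    print(monthly_agent_cost(20, 0.15))    # 90.0
    print(daily_manual_cost(20, 30, 100))  # 1000.0
```

<p>Treat the manual figure as an upper bound: it assumes every failure would otherwise cost a full 30 minutes of developer time.</p>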

<h2 id="security-validation-the-48-vulnerability-problem">Security Validation: The 48% Vulnerability Problem</h2>

<p>Here’s the uncomfortable truth: research shows that 48% of AI-generated code contains vulnerabilities. In some studies, 60% of AI suggestions for financial services contained high-severity security flaws.</p>

<p>As DevOps consultants, we can’t afford to blindly trust AI-generated recommendations. Our agent has read access to logs, commits, and issues, but what if we extend it to execute fixes automatically? We need layers of security validation.</p>

<h3 id="the-real-security-risks">The Real Security Risks</h3>

<p>Before we dive into solutions, let’s understand what can go wrong:</p>

<p><strong>Prompt Injection Attacks</strong>: Google’s security team demonstrated a real exploit where hidden HTML comments in a dependency’s README convinced a build agent that a malicious package was legitimate. The agent shipped the malicious code to production.</p>

<p><strong>Hallucinated Commands</strong>: An LLM might confidently suggest running <code class="language-plaintext highlighter-rouge">kubectl delete deployment production</code> when it meant to suggest <code class="language-plaintext highlighter-rouge">kubectl delete pod production-5f6h8</code>.</p>

<p><strong>Information Leakage</strong>: Agents with access to logs might inadvertently expose secrets, API keys, or sensitive data when posting to public channels.</p>

<p><strong>Shadow AI</strong>: Developers creating custom agents without proper governance, leading to unauthorized automation running in your pipelines.</p>

<p>Let’s build defenses against all of these.</p>

<h3 id="layer-1-restrict-agent-permissions">Layer 1: Restrict Agent Permissions</h3>

<p>The principle of least privilege applies to AI agents just like any other system component.</p>

<p>Our current agent has read-only access:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Current tools - all read-only
</span><span class="n">tools</span> <span class="o">=</span> <span class="p">[</span>
    <span class="n">get_workflow_logs</span><span class="p">,</span>       <span class="c1"># Read GitHub logs
</span>    <span class="n">analyze_recent_commits</span><span class="p">,</span>  <span class="c1"># Read git history
</span>    <span class="n">search_similar_issues</span>    <span class="c1"># Read GitHub issues
</span><span class="p">]</span>
</code></pre></div></div>

<p>This is intentional. Investigation does not require execution.</p>
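<p>One way to keep it that way is to make the read-only rule explicit, so that nobody quietly adds a write-capable tool later. The sketch below checks every tool against an allowlist before the agent is built. The <code class="language-plaintext highlighter-rouge">require_read_only</code> helper is hypothetical, and looking up a <code class="language-plaintext highlighter-rouge">name</code> attribute before falling back to <code class="language-plaintext highlighter-rouge">__name__</code> is an assumption about how your tool objects are defined.</p>

```python
from typing import Callable, Iterable, List

# Names of the tools we consider safe because they only read data.
READ_ONLY_TOOL_NAMES = frozenset({
    "get_workflow_logs",
    "analyze_recent_commits",
    "search_similar_issues",
})


def require_read_only(tools: Iterable[Callable]) -> List[Callable]:
    """Return the tools as a list, or raise if any is not on the allowlist."""
    checked = list(tools)
    for tool in checked:
        # Prefer a `name` attribute if the tool object has one (assumption),
        # otherwise fall back to the plain function name.
        name = getattr(tool, "name", None) or getattr(tool, "__name__", repr(tool))
        if name not in READ_ONLY_TOOL_NAMES:
            raise ValueError(f"Tool {name!r} is not on the read-only allowlist")
    return checked


if __name__ == "__main__":
    def get_workflow_logs(run_id: str) -> str:
        return "logs"

    def restart_deployment(name: str) -> str:
        return "restarted"

    print(len(require_read_only([get_workflow_logs])))  # 1
    try:
        require_read_only([get_workflow_logs, restart_deployment])
    except ValueError as exc:
        print(exc)
```

<p>The check fails loudly at startup, which is the point: adding a tool with side effects should be a deliberate, reviewed change to the allowlist.</p>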

<h3 id="layer-2-secrets-detection">Layer 2: Secrets Detection</h3>

<p>Never let the agent expose secrets in logs or notifications.</p>

<p>Create a secrets scanner:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># secrets_scanner.py
</span><span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span><span class="p">,</span> <span class="n">Tuple</span>

<span class="k">class</span> <span class="nc">SecretsScanner</span><span class="p">:</span>
    <span class="s">"""Detect and redact secrets from agent outputs."""</span>

    <span class="n">PATTERNS</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s">'aws_key'</span><span class="p">:</span> <span class="sa">r</span><span class="s">'AKIA[0-9A-Z]{16}'</span><span class="p">,</span>
        <span class="s">'github_token'</span><span class="p">:</span> <span class="sa">r</span><span class="s">'gh[pousr]_[A-Za-z0-9_]{36,255}'</span><span class="p">,</span>
        <span class="s">'generic_api_key'</span><span class="p">:</span> <span class="sa">r</span><span class="s">'api[_-]?key["\']?\s*[:=]\s*["\']?([a-zA-Z0-9_\-]{20,})'</span><span class="p">,</span>
        <span class="s">'password'</span><span class="p">:</span> <span class="sa">r</span><span class="s">'password["\']?\s*[:=]\s*["\']?([^\s"\']{8,})'</span><span class="p">,</span>
        <span class="s">'private_key'</span><span class="p">:</span> <span class="sa">r</span><span class="s">'-----BEGIN (RSA |OPENSSH )?PRIVATE KEY-----'</span><span class="p">,</span>
        <span class="s">'jwt'</span><span class="p">:</span> <span class="sa">r</span><span class="s">'eyJ[A-Za-z0-9-_=]+\.eyJ[A-Za-z0-9-_=]+\.?[A-Za-z0-9-_.+/=]*'</span><span class="p">,</span>
        <span class="s">'connection_string'</span><span class="p">:</span> <span class="sa">r</span><span class="s">'(postgres|mysql|mongodb)://[^:]+:[^@]+@'</span><span class="p">,</span>
    <span class="p">}</span>

    <span class="o">@</span><span class="nb">staticmethod</span>
    <span class="k">def</span> <span class="nf">scan</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Tuple</span><span class="p">[</span><span class="nb">bool</span><span class="p">,</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]]:</span>
        <span class="s">"""
        Scan text for secrets.

        Args:
            text: Text to scan

        Returns:
            Tuple of (has_secrets, list of secret types found)
        """</span>
        <span class="n">found_secrets</span> <span class="o">=</span> <span class="p">[]</span>

        <span class="k">for</span> <span class="n">secret_type</span><span class="p">,</span> <span class="n">pattern</span> <span class="ow">in</span> <span class="n">SecretsScanner</span><span class="p">.</span><span class="n">PATTERNS</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
            <span class="k">if</span> <span class="n">re</span><span class="p">.</span><span class="n">search</span><span class="p">(</span><span class="n">pattern</span><span class="p">,</span> <span class="n">text</span><span class="p">,</span> <span class="n">re</span><span class="p">.</span><span class="n">IGNORECASE</span><span class="p">):</span>
                <span class="n">found_secrets</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">secret_type</span><span class="p">)</span>

        <span class="k">return</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">found_secrets</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">,</span> <span class="n">found_secrets</span><span class="p">)</span>

    <span class="o">@</span><span class="nb">staticmethod</span>
    <span class="k">def</span> <span class="nf">redact</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="s">"""
        Redact secrets from text.

        Args:
            text: Text to redact

        Returns:
            Text with secrets replaced by [REDACTED]
        """</span>
        <span class="n">redacted</span> <span class="o">=</span> <span class="n">text</span>

        <span class="k">for</span> <span class="n">secret_type</span><span class="p">,</span> <span class="n">pattern</span> <span class="ow">in</span> <span class="n">SecretsScanner</span><span class="p">.</span><span class="n">PATTERNS</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
            <span class="n">redacted</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="n">pattern</span><span class="p">,</span> <span class="sa">f</span><span class="s">'[REDACTED:</span><span class="si">{</span><span class="n">secret_type</span><span class="p">.</span><span class="n">upper</span><span class="p">()</span><span class="si">}</span><span class="s">]'</span><span class="p">,</span> <span class="n">redacted</span><span class="p">,</span> <span class="n">flags</span><span class="o">=</span><span class="n">re</span><span class="p">.</span><span class="n">IGNORECASE</span><span class="p">)</span>

        <span class="k">return</span> <span class="n">redacted</span>


<span class="c1"># Usage in agent output
</span><span class="k">def</span> <span class="nf">safe_output</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""Process agent output to remove secrets before displaying."""</span>
    <span class="n">scanner</span> <span class="o">=</span> <span class="n">SecretsScanner</span><span class="p">()</span>
    <span class="n">has_secrets</span><span class="p">,</span> <span class="n">secret_types</span> <span class="o">=</span> <span class="n">scanner</span><span class="p">.</span><span class="n">scan</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>

    <span class="k">if</span> <span class="n">has_secrets</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"WARNING: Detected secrets in output: </span><span class="si">{</span><span class="s">', '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">secret_types</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">scanner</span><span class="p">.</span><span class="n">redact</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">text</span>
</code></pre></div></div>
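<p>To see what redaction looks like in practice, here is a quick standalone check. It copies two of the patterns from <code class="language-plaintext highlighter-rouge">SecretsScanner</code> so it runs on its own. The sample log text is invented, and the AWS key is the placeholder value from AWS’s own documentation.</p>

```python
import re

# Two patterns copied from SecretsScanner.PATTERNS so this snippet runs alone.
PATTERNS = {
    "aws_key": r"AKIA[0-9A-Z]{16}",
    "connection_string": r"(postgres|mysql|mongodb)://[^:]+:[^@]+@",
}


def redact(text: str) -> str:
    """Replace every match of every pattern with a [REDACTED:TYPE] marker."""
    for secret_type, pattern in PATTERNS.items():
        marker = f"[REDACTED:{secret_type.upper()}]"
        text = re.sub(pattern, marker, text, flags=re.IGNORECASE)
    return text


if __name__ == "__main__":
    sample = (
        "Connecting to postgres://app:hunter2@db.internal:5432/prod\n"
        "Using key AKIAIOSFODNN7EXAMPLE for upload"
    )
    print(redact(sample))
```

<p>Note that the connection string pattern removes the scheme and credentials but leaves the host and database name in place. If hostnames are sensitive in your environment, you may want to widen that pattern.</p>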

<p>Update the investigation function to use secrets scanning:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">investigate_failure</span><span class="p">(</span><span class="n">workflow_run_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="s">"""Investigate a failed GitHub Actions workflow with secret protection."""</span>
    <span class="c1"># ... existing code ...
</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">result</span> <span class="o">=</span> <span class="n">agent_executor</span><span class="p">.</span><span class="n">invoke</span><span class="p">({</span><span class="s">"input"</span><span class="p">:</span> <span class="n">input_text</span><span class="p">})</span>

        <span class="c1"># Scan for secrets before outputting
</span>        <span class="n">safe_analysis</span> <span class="o">=</span> <span class="n">safe_output</span><span class="p">(</span><span class="n">result</span><span class="p">[</span><span class="s">'output'</span><span class="p">])</span>

        <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span> <span class="o">+</span> <span class="s">"="</span> <span class="o">*</span> <span class="mi">60</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"INVESTIGATION COMPLETE"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"="</span> <span class="o">*</span> <span class="mi">60</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="n">safe_analysis</span><span class="p">)</span>

        <span class="k">return</span> <span class="p">{</span>
            <span class="s">"success"</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
            <span class="s">"workflow_run_id"</span><span class="p">:</span> <span class="n">workflow_run_id</span><span class="p">,</span>
            <span class="s">"analysis"</span><span class="p">:</span> <span class="n">safe_analysis</span>
        <span class="p">}</span>
    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">return</span> <span class="p">{</span>
            <span class="s">"success"</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span>
            <span class="s">"workflow_run_id"</span><span class="p">:</span> <span class="n">workflow_run_id</span><span class="p">,</span>
            <span class="s">"error"</span><span class="p">:</span> <span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">)</span>
        <span class="p">}</span>
</code></pre></div></div>

<h3 id="layer-3-audit-trail">Layer 3: Audit Trail</h3>

<p>Log every agent decision for security review and debugging.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># audit_logger.py
</span><span class="kn">import</span> <span class="nn">json</span>
<span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Dict</span><span class="p">,</span> <span class="n">Any</span>

<span class="k">class</span> <span class="nc">AuditLogger</span><span class="p">:</span>
    <span class="s">"""Log all agent actions for security auditing."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">log_dir</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">".agent_logs"</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">log_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">log_dir</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">log_dir</span><span class="p">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">log_investigation</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">event_data</span><span class="p">:</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]):</span>
        <span class="s">"""
        Log an investigation event.

        Args:
            event_data: Dictionary containing event details
        """</span>
        <span class="n">timestamp</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">.</span><span class="n">utcnow</span><span class="p">().</span><span class="n">isoformat</span><span class="p">()</span>
        <span class="n">log_entry</span> <span class="o">=</span> <span class="p">{</span>
            <span class="s">"timestamp"</span><span class="p">:</span> <span class="n">timestamp</span><span class="p">,</span>
            <span class="s">"event_type"</span><span class="p">:</span> <span class="s">"investigation"</span><span class="p">,</span>
            <span class="o">**</span><span class="n">event_data</span>
        <span class="p">}</span>

        <span class="c1"># Log to daily file
</span>        <span class="n">log_file</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">log_dir</span> <span class="o">/</span> <span class="sa">f</span><span class="s">"audit_</span><span class="si">{</span><span class="n">datetime</span><span class="p">.</span><span class="n">utcnow</span><span class="p">().</span><span class="n">strftime</span><span class="p">(</span><span class="s">'%Y-%m-%d'</span><span class="p">)</span><span class="si">}</span><span class="s">.jsonl"</span>

        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">log_file</span><span class="p">,</span> <span class="s">'a'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">json</span><span class="p">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">log_entry</span><span class="p">)</span> <span class="o">+</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">log_tool_call</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tool_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">args</span><span class="p">:</span> <span class="n">Dict</span><span class="p">,</span> <span class="n">result</span><span class="p">:</span> <span class="n">Any</span><span class="p">,</span> <span class="n">duration</span><span class="p">:</span> <span class="nb">float</span><span class="p">):</span>
        <span class="s">"""Log a tool call."""</span>
        <span class="n">log_entry</span> <span class="o">=</span> <span class="p">{</span>
            <span class="s">"timestamp"</span><span class="p">:</span> <span class="n">datetime</span><span class="p">.</span><span class="n">utcnow</span><span class="p">().</span><span class="n">isoformat</span><span class="p">(),</span>
            <span class="s">"event_type"</span><span class="p">:</span> <span class="s">"tool_call"</span><span class="p">,</span>
            <span class="s">"tool"</span><span class="p">:</span> <span class="n">tool_name</span><span class="p">,</span>
            <span class="s">"arguments"</span><span class="p">:</span> <span class="n">args</span><span class="p">,</span>
            <span class="s">"result_preview"</span><span class="p">:</span> <span class="nb">str</span><span class="p">(</span><span class="n">result</span><span class="p">)[:</span><span class="mi">200</span><span class="p">],</span>
            <span class="s">"duration_seconds"</span><span class="p">:</span> <span class="n">duration</span>
        <span class="p">}</span>

        <span class="n">log_file</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">log_dir</span> <span class="o">/</span> <span class="sa">f</span><span class="s">"audit_</span><span class="si">{</span><span class="n">datetime</span><span class="p">.</span><span class="n">utcnow</span><span class="p">().</span><span class="n">strftime</span><span class="p">(</span><span class="s">'%Y-%m-%d'</span><span class="p">)</span><span class="si">}</span><span class="s">.jsonl"</span>

        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">log_file</span><span class="p">,</span> <span class="s">'a'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">json</span><span class="p">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">log_entry</span><span class="p">)</span> <span class="o">+</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">log_security_event</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">event_type</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">details</span><span class="p">:</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]):</span>
        <span class="s">"""Log a security-related event."""</span>
        <span class="n">log_entry</span> <span class="o">=</span> <span class="p">{</span>
            <span class="s">"timestamp"</span><span class="p">:</span> <span class="n">datetime</span><span class="p">.</span><span class="n">utcnow</span><span class="p">().</span><span class="n">isoformat</span><span class="p">(),</span>
            <span class="s">"event_type"</span><span class="p">:</span> <span class="s">"security"</span><span class="p">,</span>
            <span class="s">"security_event"</span><span class="p">:</span> <span class="n">event_type</span><span class="p">,</span>
            <span class="o">**</span><span class="n">details</span>
        <span class="p">}</span>

        <span class="n">log_file</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">log_dir</span> <span class="o">/</span> <span class="sa">f</span><span class="s">"security_</span><span class="si">{</span><span class="n">datetime</span><span class="p">.</span><span class="n">utcnow</span><span class="p">().</span><span class="n">strftime</span><span class="p">(</span><span class="s">'%Y-%m-%d'</span><span class="p">)</span><span class="si">}</span><span class="s">.jsonl"</span>

        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">log_file</span><span class="p">,</span> <span class="s">'a'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">json</span><span class="p">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">log_entry</span><span class="p">)</span> <span class="o">+</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
</code></pre></div></div>
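
<p>Once entries are being written, the JSONL files are easy to query. Here is a minimal sketch of a reader, not part of the logger above: <code class="language-plaintext highlighter-rouge">summarize_tool_calls</code> is a hypothetical helper, and it assumes the <code class="language-plaintext highlighter-rouge">audit_YYYY-MM-DD.jsonl</code> file naming and the <code class="language-plaintext highlighter-rouge">tool</code> and <code class="language-plaintext highlighter-rouge">duration_seconds</code> fields written by <code class="language-plaintext highlighter-rouge">log_tool_call</code>.</p>

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict


def summarize_tool_calls(log_dir: Path) -> Dict[str, Dict[str, float]]:
    """Aggregate call counts and total duration per tool from audit_*.jsonl files."""
    summary: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"calls": 0, "total_seconds": 0.0}
    )
    for log_file in sorted(Path(log_dir).glob("audit_*.jsonl")):
        with open(log_file) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                entry = json.loads(line)
                # Other event types may share the file; only count tool calls
                if entry.get("event_type") != "tool_call":
                    continue
                stats = summary[entry["tool"]]
                stats["calls"] += 1
                stats["total_seconds"] += float(entry.get("duration_seconds", 0.0))
    return dict(summary)
```

<p>A tool that dominates <code class="language-plaintext highlighter-rouge">total_seconds</code> is usually the first place to look when investigations get slow.</p>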

<h3 id="layer-4-rate-limiting-and-cost-controls">Layer 4: Rate Limiting and Cost Controls</h3>

<p>Prevent runaway costs and API abuse:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># rate_limiter.py
</span><span class="kn">import</span> <span class="nn">time</span>
<span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">deque</span>
<span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span><span class="p">,</span> <span class="n">timedelta</span>

<span class="k">class</span> <span class="nc">RateLimiter</span><span class="p">:</span>
    <span class="s">"""Rate limit agent executions to prevent abuse and control costs."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">max_investigations_per_hour</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">20</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">max_per_hour</span> <span class="o">=</span> <span class="n">max_investigations_per_hour</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">investigation_times</span> <span class="o">=</span> <span class="n">deque</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">can_investigate</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
        <span class="s">"""Check if we can run another investigation."""</span>
        <span class="n">now</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">.</span><span class="n">utcnow</span><span class="p">()</span>
        <span class="n">cutoff</span> <span class="o">=</span> <span class="n">now</span> <span class="o">-</span> <span class="n">timedelta</span><span class="p">(</span><span class="n">hours</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

        <span class="c1"># Remove investigations older than 1 hour
</span>        <span class="k">while</span> <span class="bp">self</span><span class="p">.</span><span class="n">investigation_times</span> <span class="ow">and</span> <span class="bp">self</span><span class="p">.</span><span class="n">investigation_times</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">cutoff</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">investigation_times</span><span class="p">.</span><span class="n">popleft</span><span class="p">()</span>

        <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">investigation_times</span><span class="p">)</span> <span class="o">&lt;</span> <span class="bp">self</span><span class="p">.</span><span class="n">max_per_hour</span>

    <span class="k">def</span> <span class="nf">record_investigation</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Record that an investigation occurred."""</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">investigation_times</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">datetime</span><span class="p">.</span><span class="n">utcnow</span><span class="p">())</span>

    <span class="k">def</span> <span class="nf">time_until_next_allowed</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
        <span class="s">"""Get seconds until next investigation is allowed."""</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">can_investigate</span><span class="p">():</span>
            <span class="k">return</span> <span class="mi">0</span>

        <span class="n">oldest</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">investigation_times</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
        <span class="n">time_until_allowed</span> <span class="o">=</span> <span class="p">(</span><span class="n">oldest</span> <span class="o">+</span> <span class="n">timedelta</span><span class="p">(</span><span class="n">hours</span><span class="o">=</span><span class="mi">1</span><span class="p">))</span> <span class="o">-</span> <span class="n">datetime</span><span class="p">.</span><span class="n">utcnow</span><span class="p">()</span>
        <span class="k">return</span> <span class="nb">int</span><span class="p">(</span><span class="n">time_until_allowed</span><span class="p">.</span><span class="n">total_seconds</span><span class="p">())</span>
</code></pre></div></div>
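
<p>The rate limiter caps how often the agent runs, but not how much it spends. A daily spend cap could sit next to it. The sketch below is an assumption for illustration: <code class="language-plaintext highlighter-rouge">CostBudget</code>, its default limit, and the idea of passing in an estimated cost per investigation are not part of the code above.</p>

```python
from datetime import date
from typing import Optional


class CostBudget:
    """Track estimated spend per day and refuse work once a cap is reached."""

    def __init__(self, max_usd_per_day: float = 5.0):
        self.max_usd_per_day = max_usd_per_day
        self._day: Optional[date] = None
        self._spent = 0.0

    def _roll_over(self, today: date) -> None:
        # Reset the running total when the calendar day changes
        if self._day != today:
            self._day = today
            self._spent = 0.0

    def can_spend(self, estimated_usd: float, today: Optional[date] = None) -> bool:
        """Return True if spending estimated_usd keeps today's total within the cap."""
        today = today or date.today()
        self._roll_over(today)
        return self._spent + estimated_usd <= self.max_usd_per_day

    def record(self, actual_usd: float, today: Optional[date] = None) -> None:
        """Add the cost of a finished investigation to today's total."""
        today = today or date.today()
        self._roll_over(today)
        self._spent += actual_usd
```

<p>Check <code class="language-plaintext highlighter-rouge">can_spend()</code> alongside <code class="language-plaintext highlighter-rouge">can_investigate()</code> before each run, and call <code class="language-plaintext highlighter-rouge">record()</code> once the run finishes.</p>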

<h3 id="security-checklist">Security Checklist</h3>

<p>Before deploying your AI agent to production, verify:</p>

<ul>
  <li>Agent has minimum required permissions (read-only by default)</li>
  <li>All commands validated before execution</li>
  <li>Secrets scanner active on all outputs</li>
  <li>Audit logging enabled and monitored</li>
  <li>Rate limiting configured</li>
  <li>GitHub tokens scoped correctly (no admin access)</li>
  <li>LLM API keys stored in secrets, not code</li>
  <li>No secrets committed to repository</li>
  <li>Slack webhooks use incoming webhook URLs only</li>
  <li>Agent cannot modify production without approval</li>
</ul>

<h3 id="real-world-security-scenario">Real-World Security Scenario</h3>

<p>Here’s how these layers work together:</p>

<ol>
  <li>Agent investigates failure and LLM suggests: <code class="language-plaintext highlighter-rouge">kubectl delete pod production-db-0</code></li>
  <li>Command validator catches this: “APPROVAL REQUIRED: Command requires human approval”</li>
  <li>Agent posts recommendation to GitHub issue instead of executing</li>
  <li>Secrets scanner detects database connection string in logs and redacts it</li>
  <li>Audit logger records the attempted command and approval requirement</li>
  <li>Human reviews the recommendation and decides whether to execute</li>
  <li>If approved, human runs command manually with full context</li>
</ol>

<p>The agent accelerates investigation but humans retain control over critical actions.</p>
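
<p>The flow above can be sketched in a few lines. This is a simplified stand-in, not the validator or scanner from the earlier layers: <code class="language-plaintext highlighter-rouge">handle_suggested_command</code>, the read-only allowlist, and the redaction regex are all hypothetical and only illustrate the order of the checks.</p>

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical allowlist: only these read-only kubectl verbs pass without a human
READ_ONLY_COMMAND = re.compile(r"kubectl\s+(get|describe|logs)\b")

# Hypothetical redaction rule: hide the password part of user:password@host URLs
SECRET_PATTERN = re.compile(r"(://[^:/\s]+:)[^@\s]+(@)")


@dataclass
class Decision:
    allowed: bool
    message: str
    audit_trail: List[str]


def handle_suggested_command(command: str, context_log: str) -> Decision:
    """Route an LLM-suggested command through validation, redaction, and audit."""
    audit = [f"suggested: {command}"]
    # Redact before anything is posted or logged
    safe_log = SECRET_PATTERN.sub(r"\1[REDACTED]\2", context_log)
    if not READ_ONLY_COMMAND.match(command.strip()):
        audit.append("blocked: approval required")
        message = (
            "APPROVAL REQUIRED: Command requires human approval\n"
            f"Recommended command: {command}\n"
            f"Context: {safe_log}"
        )
        return Decision(allowed=False, message=message, audit_trail=audit)
    audit.append("allowed: read-only command")
    return Decision(allowed=True, message=safe_log, audit_trail=audit)
```

<p>With this sketch, <code class="language-plaintext highlighter-rouge">kubectl delete pod production-db-0</code> comes back with <code class="language-plaintext highlighter-rouge">allowed=False</code> and a message that can be posted to the issue, with any connection-string password already redacted.</p>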

<h2 id="practical-tips-and-common-pitfalls">Practical Tips and Common Pitfalls</h2>

<p>After building and running AI agents for DevOps investigations, I’ve learned what works and what doesn’t. Here are the hard-earned lessons that will save you time and money.</p>

<h3 id="prompt-engineering-best-practices">Prompt Engineering Best Practices</h3>

<p>Your prompt is the most important part of your agent. A vague prompt gives vague results. A specific prompt with context gives actionable insights.</p>

<p><strong>Bad Prompt:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">system_prompt</span> <span class="o">=</span> <span class="s">"""You are an AI agent. Debug the issue."""</span>
</code></pre></div></div>

<p>Why it fails: Too generic, no context, no output format.</p>

<p><strong>Good Prompt:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">system_prompt</span> <span class="o">=</span> <span class="s">"""You are an expert DevOps AI agent that investigates CI/CD pipeline failures.

Infrastructure context:
- Python microservices running on Kubernetes in AWS EKS
- PostgreSQL 14 database with connection pooling
- Redis for caching
- GitHub Actions for CI/CD

Your role is to:
1. Analyze workflow logs to identify the root cause of failures
2. Examine recent code changes that might have introduced issues
3. Search for similar problems in the issue tracker
4. Provide a clear, actionable root cause analysis

When analyzing failures:
- Focus on the actual error messages, not just symptoms
- Consider recent code changes as potential causes
- Look for patterns in similar past issues
- Be specific about what broke and why
- Suggest concrete fixes, not vague advice

Output format:
**Root Cause**: [One sentence summary]
**Evidence**: [Key findings from logs/commits/issues]
**Recommendation**: [Specific steps to fix]
**Related Issues**: [Links to similar problems if found]
"""</span>
</code></pre></div></div>

<p>Why it works: Infrastructure context, clear role, specific instructions, defined output format.</p>

<h3 id="common-pitfalls-and-solutions">Common Pitfalls and Solutions</h3>

<p><strong>Pitfall 1: Agent Loops Infinitely</strong></p>

<p>Symptom: Agent keeps calling tools without making progress.</p>

<p>Solution: Set <code class="language-plaintext highlighter-rouge">max_iterations</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">agent_executor</span> <span class="o">=</span> <span class="n">AgentExecutor</span><span class="p">(</span>
    <span class="n">agent</span><span class="o">=</span><span class="n">agent</span><span class="p">,</span>
    <span class="n">tools</span><span class="o">=</span><span class="n">tools</span><span class="p">,</span>
    <span class="n">verbose</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">max_iterations</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span>  <span class="c1"># Stop after 5 iterations
</span>    <span class="n">handle_parsing_errors</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
</code></pre></div></div>

<p><strong>Pitfall 2: Costs Spiral Out of Control</strong></p>

<p>Symptom: Your OpenAI bill is $500 for 100 investigations.</p>

<p>Cause: Using GPT-4 for everything, not optimizing token usage.</p>

<p>Solution: Use the right model for the task:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_llm</span><span class="p">(</span><span class="n">task_complexity</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"medium"</span><span class="p">):</span>
    <span class="s">"""Choose LLM based on task complexity."""</span>

    <span class="k">if</span> <span class="n">task_complexity</span> <span class="o">==</span> <span class="s">"simple"</span><span class="p">:</span>
        <span class="c1"># Use cheaper model for simple log analysis
</span>        <span class="n">model</span> <span class="o">=</span> <span class="s">"gpt-3.5-turbo"</span>  <span class="c1"># $0.02 per investigation
</span>    <span class="k">elif</span> <span class="n">task_complexity</span> <span class="o">==</span> <span class="s">"medium"</span><span class="p">:</span>
        <span class="n">model</span> <span class="o">=</span> <span class="s">"anthropic/claude-3.5-sonnet"</span>  <span class="c1"># $0.15 per investigation
</span>    <span class="k">else</span><span class="p">:</span>  <span class="c1"># complex
</span>        <span class="n">model</span> <span class="o">=</span> <span class="s">"openai/gpt-4"</span>  <span class="c1"># $0.30 per investigation
</span>
    <span class="k">return</span> <span class="n">ChatOpenAI</span><span class="p">(</span><span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>

<p>Cost comparison:</p>

<ul>
  <li>GPT-4: $0.30 per investigation</li>
  <li>Claude 3.5 Sonnet: $0.15 per investigation</li>
  <li>GPT-3.5: $0.02 per investigation</li>
</ul>

<p>For 100 investigations/month:</p>
<ul>
  <li>All GPT-4: $30</li>
  <li>All GPT-3.5: $2</li>
  <li>Mixed (80% GPT-3.5, 20% GPT-4): $7.60</li>
</ul>
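
<p>To estimate costs for your own mix, the arithmetic is a weighted sum. A small sketch using the per-investigation figures from the comparison above (the <code class="language-plaintext highlighter-rouge">monthly_cost</code> helper and the dictionary keys are names I made up for this example):</p>

```python
from typing import Dict

# Per-investigation costs in USD, taken from the comparison above
COST_PER_INVESTIGATION = {
    "gpt-4": 0.30,
    "claude-3.5-sonnet": 0.15,
    "gpt-3.5": 0.02,
}


def monthly_cost(counts: Dict[str, int]) -> float:
    """Return the total cost in USD for a mapping of model name to investigation count."""
    total = sum(COST_PER_INVESTIGATION[model] * n for model, n in counts.items())
    return round(total, 2)


print(monthly_cost({"gpt-4": 100}))                # everything on GPT-4
print(monthly_cost({"gpt-3.5": 100}))              # everything on GPT-3.5
print(monthly_cost({"gpt-3.5": 80, "gpt-4": 20}))  # 80/20 split
```

<p>Swap in your own counts and prices. These numbers are rough averages, and real cost depends on how many tokens each investigation uses.</p>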

<p><strong>Pitfall 3: Secrets Leak in Logs</strong></p>

<p>Symptom: API keys visible in agent output.</p>

<p>Solution: Always scan output (from the security section):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">secrets_scanner</span> <span class="kn">import</span> <span class="n">safe_output</span>

<span class="n">result</span> <span class="o">=</span> <span class="n">agent_executor</span><span class="p">.</span><span class="n">invoke</span><span class="p">({</span><span class="s">"input"</span><span class="p">:</span> <span class="n">input_text</span><span class="p">})</span>
<span class="n">safe_result</span> <span class="o">=</span> <span class="n">safe_output</span><span class="p">(</span><span class="n">result</span><span class="p">[</span><span class="s">'output'</span><span class="p">])</span>  <span class="c1"># Redacts secrets
</span></code></pre></div></div>

<h3 id="performance-benchmarks">Performance Benchmarks</h3>

<p>From my production deployments:</p>

<p><strong>Investigation time:</strong></p>
<ul>
  <li>Simple failures (import errors): 10-15 seconds</li>
  <li>Medium complexity (config issues): 20-30 seconds</li>
  <li>Complex failures (race conditions): 45-60 seconds</li>
</ul>

<p><strong>Accuracy:</strong></p>
<ul>
  <li>Correct root cause identified: 78% of cases</li>
  <li>Helpful suggestions (even when the root cause was wrong): 92% of cases</li>
  <li>Completely useless output: 8% of cases</li>
</ul>

<p><strong>Cost per investigation:</strong></p>
<ul>
  <li>GPT-3.5: $0.02-0.05</li>
  <li>Claude 3.5 Sonnet: $0.10-0.20</li>
  <li>GPT-4: $0.15-0.30</li>
</ul>

<p><strong>Developer time saved:</strong></p>
<ul>
  <li>Average investigation time (manual): 25 minutes</li>
  <li>Average investigation time (agent): 30 seconds</li>
  <li>Time saved: 24.5 minutes per failure</li>
</ul>

<p>For 20 failures/day: 490 minutes = 8+ hours saved daily.</p>

<h3 id="quick-reference-dos-and-donts">Quick Reference: Dos and Don’ts</h3>

<p><strong>DO:</strong></p>
<ul>
  <li>Set max_iterations to prevent loops</li>
  <li>Add timeouts to all API calls</li>
  <li>Scan outputs for secrets</li>
  <li>Log all agent decisions</li>
  <li>Use structured output formats</li>
  <li>Cache frequent queries</li>
  <li>Choose models based on complexity</li>
  <li>Test prompts in isolation first</li>
</ul>

<p><strong>DON’T:</strong></p>
<ul>
  <li>Give agents write access without validation</li>
  <li>Trust AI-generated commands blindly</li>
  <li>Send full logs (use last 50 lines)</li>
  <li>Use GPT-4 for everything (match the model to task complexity)</li>
  <li>Ignore rate limits</li>
  <li>Commit API keys to git</li>
  <li>Skip error handling</li>
  <li>Deploy without testing</li>
</ul>
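
<p>The “last 50 lines” rule is easy to enforce before logs ever reach the LLM. A minimal sketch, where <code class="language-plaintext highlighter-rouge">tail_lines</code> is a hypothetical helper and not one of the tools built earlier:</p>

```python
def tail_lines(log_text: str, max_lines: int = 50) -> str:
    """Keep only the last max_lines lines of a log and note how many were dropped."""
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")
    lines = log_text.splitlines()
    if len(lines) <= max_lines:
        return log_text
    omitted = len(lines) - max_lines
    # Errors usually appear near the end of CI logs, so keep the tail
    kept = lines[-max_lines:]
    return f"[... {omitted} earlier lines omitted ...]\n" + "\n".join(kept)
```

<p>The marker line tells the model that context was cut, so it does not treat the first kept line as the start of the job.</p>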

<h2 id="next-steps-and-extensions">Next Steps and Extensions</h2>

<p>You’ve built a working AI agent that automatically investigates pipeline failures. But this is just the beginning. Here are practical ways to extend and improve your agent.</p>

<h3 id="what-youve-built">What You’ve Built</h3>

<p>Let’s recap what your agent can do:</p>

<ul>
  <li>Monitor GitHub Actions workflows automatically</li>
  <li>Investigate most failures in under a minute (about 30 seconds on average)</li>
  <li>Fetch and analyze workflow logs</li>
  <li>Examine recent code changes</li>
  <li>Search for similar issues</li>
  <li>Generate root cause analysis with recommendations</li>
  <li>Redact secrets from outputs</li>
  <li>Log all actions for audit</li>
  <li>Rate limit to control costs</li>
  <li>Post results to GitHub issues</li>
</ul>

<h3 id="extension-ideas">Extension Ideas</h3>

<p><strong>1. Multi-Agent System</strong></p>

<p>Create specialist agents for different tasks:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Build Agent: Optimizes build performance
</span><span class="n">build_agent</span> <span class="o">=</span> <span class="n">create_agent</span><span class="p">(</span>
    <span class="n">tools</span><span class="o">=</span><span class="p">[</span><span class="n">analyze_build_logs</span><span class="p">,</span> <span class="n">suggest_caching</span><span class="p">,</span> <span class="n">optimize_dependencies</span><span class="p">],</span>
    <span class="n">role</span><span class="o">=</span><span class="s">"Build Optimization Specialist"</span>
<span class="p">)</span>

<span class="c1"># Security Agent: Scans for vulnerabilities
</span><span class="n">security_agent</span> <span class="o">=</span> <span class="n">create_agent</span><span class="p">(</span>
    <span class="n">tools</span><span class="o">=</span><span class="p">[</span><span class="n">scan_dependencies</span><span class="p">,</span> <span class="n">check_secrets</span><span class="p">,</span> <span class="n">validate_configs</span><span class="p">],</span>
    <span class="n">role</span><span class="o">=</span><span class="s">"Security Analyst"</span>
<span class="p">)</span>

<span class="c1"># Deploy Agent: Manages deployments
</span><span class="n">deploy_agent</span> <span class="o">=</span> <span class="n">create_agent</span><span class="p">(</span>
    <span class="n">tools</span><span class="o">=</span><span class="p">[</span><span class="n">check_health</span><span class="p">,</span> <span class="n">deploy_staging</span><span class="p">,</span> <span class="n">rollback_if_needed</span><span class="p">],</span>
    <span class="n">role</span><span class="o">=</span><span class="s">"Deployment Specialist"</span>
<span class="p">)</span>
</code></pre></div></div>

<p><strong>2. Kubernetes Integration</strong></p>

<p>Add tools for Kubernetes operations:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">tool</span>
<span class="k">def</span> <span class="nf">get_pod_status</span><span class="p">(</span><span class="n">namespace</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">pod_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""Get Kubernetes pod status and recent events."""</span>
    <span class="k">pass</span>

<span class="o">@</span><span class="n">tool</span>
<span class="k">def</span> <span class="nf">analyze_pod_logs</span><span class="p">(</span><span class="n">namespace</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">pod_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""Fetch and analyze pod logs."""</span>
    <span class="k">pass</span>
</code></pre></div></div>
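
<p>The stubs above leave the implementation open. One possible way to fill in the read-only case is to shell out to <code class="language-plaintext highlighter-rouge">kubectl</code>. This sketch assumes <code class="language-plaintext highlighter-rouge">kubectl</code> is installed and already configured with read-only credentials. The function names and the name-validation regex are my own additions, and you would still wrap the result with <code class="language-plaintext highlighter-rouge">@tool</code> yourself.</p>

```python
import re
import subprocess
from typing import List

# Kubernetes object names: lowercase alphanumerics, '-' and '.', up to 253 chars
_K8S_NAME = re.compile(r"[a-z0-9]([a-z0-9.-]{0,251}[a-z0-9])?")


def build_get_pod_command(namespace: str, pod_name: str) -> List[str]:
    """Build a read-only kubectl command, rejecting names that look unsafe."""
    for value in (namespace, pod_name):
        if not _K8S_NAME.fullmatch(value):
            raise ValueError(f"Invalid Kubernetes name: {value!r}")
    return ["kubectl", "get", "pod", pod_name, "-n", namespace, "-o", "wide"]


def run_get_pod_status(namespace: str, pod_name: str, timeout: int = 15) -> str:
    """Run the command and return its output, or a readable error string."""
    cmd = build_get_pod_command(namespace, pod_name)
    try:
        # List-form arguments avoid shell interpolation of LLM-provided input
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (subprocess.TimeoutExpired, FileNotFoundError) as exc:
        return f"kubectl failed: {exc}"
    if result.returncode != 0:
        return f"kubectl error: {result.stderr.strip()}"
    return result.stdout
```

<p>Validating the names before building the command matters here because the arguments come from the LLM, not from a human.</p>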

<p><strong>3. Learning from History</strong></p>

<p>Implement long-term memory with a vector database:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">langchain.vectorstores</span> <span class="kn">import</span> <span class="n">Chroma</span>
<span class="kn">from</span> <span class="nn">langchain.embeddings</span> <span class="kn">import</span> <span class="n">OpenAIEmbeddings</span>

<span class="c1"># Store past investigations
</span><span class="n">vectorstore</span> <span class="o">=</span> <span class="n">Chroma</span><span class="p">(</span>
    <span class="n">collection_name</span><span class="o">=</span><span class="s">"investigation_history"</span><span class="p">,</span>
    <span class="n">embedding_function</span><span class="o">=</span><span class="n">OpenAIEmbeddings</span><span class="p">()</span>
<span class="p">)</span>

<span class="c1"># When investigating a new failure
</span><span class="n">similar_cases</span> <span class="o">=</span> <span class="n">vectorstore</span><span class="p">.</span><span class="n">similarity_search</span><span class="p">(</span>
    <span class="n">error_message</span><span class="p">,</span>
    <span class="n">k</span><span class="o">=</span><span class="mi">3</span>  <span class="c1"># Find 3 most similar past failures
</span><span class="p">)</span>
</code></pre></div></div>

<p>This lets your agent learn from experience.</p>
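
<p>If you want to see the idea without setting up a vector database, here is a standard-library sketch. It swaps embeddings for plain word counts and cosine similarity, so it is much cruder than what Chroma with OpenAI embeddings would give you, and <code class="language-plaintext highlighter-rouge">most_similar</code> is a hypothetical helper, not a LangChain API.</p>

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def _vectorize(text: str) -> Counter:
    """Turn text into a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query: str, history: List[str], k: int = 3) -> List[Tuple[float, str]]:
    """Return the k past failures most similar to the query, best match first."""
    q = _vectorize(query)
    scored = [(_cosine(q, _vectorize(doc)), doc) for doc in history]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

<p>Word counts only match on shared tokens. Embeddings can also match errors that are phrased differently, which is the reason to use a real vector store once the history grows.</p>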

<h3 id="resources-and-further-learning">Resources and Further Learning</h3>

<p><strong>LangChain Documentation</strong></p>

<ul>
  <li><a href="https://python.langchain.com/docs">LangChain Official Docs</a></li>
  <li><a href="https://python.langchain.com/docs/modules/agents">LangChain Agents Guide</a></li>
  <li><a href="https://python.langchain.com/docs/modules/tools">LangChain Tools Documentation</a></li>
</ul>

<p><strong>OpenRouter</strong></p>

<ul>
  <li><a href="https://openrouter.ai">Get API key</a></li>
  <li><a href="https://openrouter.ai/docs#pricing">Pricing</a></li>
  <li><a href="https://openrouter.ai/models">Model comparison</a></li>
</ul>

<p><strong>Security Resources</strong></p>

<ul>
  <li><a href="https://owasp.org/www-project-top-10-for-large-language-model-applications">OWASP LLM Top 10</a></li>
</ul>

<h2 id="final-thoughts">Final Thoughts</h2>

<p>AI agents aren’t replacing DevOps engineers. They’re accelerating investigation, reducing toil, and freeing you to focus on higher-value work.</p>

<p>The agent we built is read-only by design. It investigates and recommends, but humans make the final decisions. This is the right balance for production systems in 2025.</p>

<p>Start small:</p>

<ol>
  <li>Deploy the read-only investigation agent</li>
  <li>Monitor its accuracy for a few weeks</li>
  <li>Tune prompts based on results</li>
  <li>Gradually add more capabilities</li>
  <li>Always maintain human oversight</li>
</ol>

<p>Over the past 2 years as a DevOps consultant, I’ve seen teams waste countless hours on repetitive failure investigations. This agent solves that problem.</p>

<p>The code is production-ready. The security is enterprise-grade. The cost is negligible compared to developer time saved.</p>

<p>What are you waiting for? Give your CI/CD pipeline a brain.</p>

<hr />

<h2 id="want-to-learn-more">Want to Learn More?</h2>

<p>If you’re interested in deepening your DevOps and systems programming knowledge, check out <a href="https://www.educative.io/unlimited?aff=BYvq">Educative.io’s Unlimited Plan</a> - it’s an excellent resource for hands-on learning with interactive courses.</p>

<hr />

<p><strong>If you found this helpful, share it on X and tag me <a href="https://twitter.com/muhammad_o7">@muhammad_o7</a></strong> - I’d love to hear your thoughts! You can also connect with me on <a href="https://www.linkedin.com/in/muhammad-raza-07/">LinkedIn</a>.</p>

<p><strong>Need Help?</strong> I’m available for Python and DevOps consulting. If you need help with CI/CD, automation, infrastructure, or AI agents for your DevOps workflows, reach out via email or DM me on <a href="https://twitter.com/muhammad_o7">X/Twitter</a>.</p>

          ]]>
        </description>
        <pubDate>Tue, 25 Nov 2025 00:00:00 +0000</pubDate>
        <link>//muhammadraza.me/2025/building-ai-agents-devops-automation/</link>
        <guid isPermaLink="true">//muhammadraza.me/2025/building-ai-agents-devops-automation/</guid>
        
        <category>ai</category>
        
        <category>devops</category>
        
        <category>automation</category>
        
        <category>python</category>
        
        
        
        <dc:creator>Muhammad Raza</dc:creator>
        <dc:rights></dc:rights>
      </item>
    
      <item>
        <title>Building a CI/CD Pipeline Runner from Scratch in Python</title>
        <description>
          <![CDATA[
            
            <p>I’ve been using CI/CD pipelines for years now - GitLab CI, GitHub Actions, Jenkins, you name it. Like most developers, I treated them as configuration files that “just work.” You write a YAML file, push it, and somehow your code gets built, tested, and deployed. But I never really understood what was happening behind the scenes.</p>

<p>That changed when I needed to set up CI/CD for an air-gapped environment at work - no access to GitHub Actions or GitLab’s hosted runners. I needed to understand how these tools actually work under the hood so I could build something custom. That’s when I realized: pipeline runners are just orchestration tools that execute jobs in isolated environments.</p>

<p>In this post, I’ll show you how to build a complete CI/CD pipeline runner from scratch in Python. We’ll implement the core features you use every day: stages, parallel execution, job dependencies, and artifact passing. By the end, you’ll understand exactly how GitLab Runner and GitHub Actions work internally.</p>

<h2 id="what-actually-is-a-cicd-pipeline">What Actually IS a CI/CD Pipeline?</h2>

<p>Before we start coding, let’s understand what a CI/CD pipeline actually does.</p>

<p>A <strong>CI/CD pipeline</strong> is an automated workflow that takes your code from commit to deployment. It’s defined as a series of <strong>jobs</strong> organized into <strong>stages</strong>:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">stages</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s">build</span>
  <span class="pi">-</span> <span class="s">test</span>
  <span class="pi">-</span> <span class="s">deploy</span>

<span class="na">build-job</span><span class="pi">:</span>
  <span class="na">stage</span><span class="pi">:</span> <span class="s">build</span>
  <span class="na">script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">python -m build</span>

<span class="na">test-job</span><span class="pi">:</span>
  <span class="na">stage</span><span class="pi">:</span> <span class="s">test</span>
  <span class="na">script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">pytest tests/</span>

<span class="na">deploy-job</span><span class="pi">:</span>
  <span class="na">stage</span><span class="pi">:</span> <span class="s">deploy</span>
  <span class="na">script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">./deploy.sh</span>
</code></pre></div></div>

<p>When you push code, a <strong>pipeline runner</strong> (like GitLab Runner or GitHub Actions) does this:</p>

<ol>
  <li><strong>Parses</strong> the pipeline configuration file</li>
  <li><strong>Creates a dependency graph</strong> of jobs</li>
  <li><strong>Executes jobs</strong> in isolated environments (usually containers)</li>
  <li><strong>Streams logs</strong> back to you in real-time</li>
  <li><strong>Passes artifacts</strong> between jobs</li>
  <li><strong>Reports</strong> success or failure</li>
</ol>

<p>The key insight: <strong>a pipeline runner is just a job orchestrator</strong>. It figures out what to run, in what order, and handles the execution.</p>

<p>Let’s build one ourselves.</p>

<h2 id="understanding-pipeline-components">Understanding Pipeline Components</h2>

<p>Every pipeline has three main components:</p>

<h3 id="1-stages">1. Stages</h3>
<p>Stages define the execution order. Jobs in the same stage can run in parallel, but stages run sequentially:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Stage 1: build     → Stage 2: test        → Stage 3: deploy
  [build-app]        [unit-test]             [deploy-prod]
                     [integration-test]
</code></pre></div></div>
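
<p>In code, the diagram above is just jobs grouped by stage, with the stages kept in their declared order. A minimal sketch, where <code class="language-plaintext highlighter-rouge">group_jobs_by_stage</code> is a hypothetical helper and the runner we build below may organize this differently:</p>

```python
from typing import Dict, List


def group_jobs_by_stage(stages: List[str], jobs: Dict[str, dict]) -> Dict[str, List[str]]:
    """Map each stage to the jobs that belong to it, preserving stage order."""
    plan: Dict[str, List[str]] = {stage: [] for stage in stages}
    for name, config in jobs.items():
        stage = config["stage"]
        if stage not in plan:
            raise ValueError(f"Job {name!r} uses undefined stage {stage!r}")
        plan[stage].append(name)
    return plan


plan = group_jobs_by_stage(
    ["build", "test", "deploy"],
    {
        "build-app": {"stage": "build"},
        "unit-test": {"stage": "test"},
        "integration-test": {"stage": "test"},
        "deploy-prod": {"stage": "deploy"},
    },
)
for stage, job_names in plan.items():
    # Jobs within one stage are candidates for parallel execution
    print(stage, job_names)
```

<p>A runner would walk this mapping stage by stage and only move on once every job in the current stage has finished.</p>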

<h3 id="2-jobs">2. Jobs</h3>
<p>Jobs are the actual work units. Each job:</p>
<ul>
  <li>Runs in an isolated environment (container)</li>
  <li>Executes a series of shell commands</li>
  <li>Can depend on other jobs</li>
  <li>Can produce artifacts for other jobs</li>
</ul>

<h3 id="3-artifacts">3. Artifacts</h3>
<p>Artifacts are files produced by one job that other jobs need. For example:</p>
<ul>
  <li>The build job produces a <code class="language-plaintext highlighter-rouge">dist/</code> folder</li>
  <li>The test jobs need <code class="language-plaintext highlighter-rouge">dist/</code> to run their tests</li>
  <li>The deploy job needs <code class="language-plaintext highlighter-rouge">dist/</code> to deploy</li>
</ul>
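<p>One simple way to picture artifact passing is “copy the listed paths out of one job’s workspace, then copy them into the next job’s workspace”. The sketch below does exactly that with <code class="language-plaintext highlighter-rouge">shutil</code>. The function names and the idea of a separate store directory are assumptions for illustration, not how any particular CI system implements it.</p>

```python
import shutil
import tempfile
from pathlib import Path


def save_artifacts(workspace, paths, store):
    """Copy each listed path from a job's workspace into the artifact store."""
    saved = []
    for rel in paths:
        src = Path(workspace) / rel
        if not src.exists():
            continue  # a real runner would probably warn or fail here
        dest = Path(store) / rel
        if src.is_dir():
            shutil.copytree(src, dest, dirs_exist_ok=True)
        else:
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)
        saved.append(rel)
    return saved


def restore_artifacts(store, workspace):
    """Copy everything in the artifact store into another job's workspace."""
    if Path(store).is_dir():
        shutil.copytree(store, workspace, dirs_exist_ok=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        build_ws = Path(tmp) / "build"
        test_ws = Path(tmp) / "test"
        store = Path(tmp) / "store"

        # Simulate a build job that produced dist/app.whl
        (build_ws / "dist").mkdir(parents=True)
        (build_ws / "dist" / "app.whl").write_text("fake wheel")

        save_artifacts(build_ws, ["dist/"], store)
        restore_artifacts(store, test_ws)
        print(sorted(p.name for p in (test_ws / "dist").iterdir()))
```

<p>Keeping the store separate from both workspaces means a later job only sees what an earlier job explicitly declared, not every file it happened to leave behind.</p>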

<p>Now that we understand the concepts, let’s start building.</p>

<h2 id="building-our-pipeline-runner">Building Our Pipeline Runner</h2>

<h3 id="version-1-single-job-executor">Version 1: Single Job Executor</h3>

<p>Let’s start with the absolute basics: executing a single job in a Docker container.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python3
</span><span class="s">"""
Pipeline Runner v1: Single job executor
Runs each job in its own Docker container, one at a time, and streams its logs
"""</span>

<span class="kn">import</span> <span class="nn">yaml</span>
<span class="kn">import</span> <span class="nn">subprocess</span>
<span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>


<span class="k">class</span> <span class="nc">Job</span><span class="p">:</span>
    <span class="s">"""Represents a single pipeline job."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">config</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">image</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'image'</span><span class="p">,</span> <span class="s">'python:3.11'</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">script</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'script'</span><span class="p">,</span> <span class="p">[])</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">stage</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'stage'</span><span class="p">,</span> <span class="s">'test'</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">__repr__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="sa">f</span><span class="s">"Job(</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">, stage=</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">stage</span><span class="si">}</span><span class="s">)"</span>


<span class="k">class</span> <span class="nc">JobExecutor</span><span class="p">:</span>
    <span class="s">"""Executes a job in a Docker container."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">workspace</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">workspace</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">workspace</span><span class="p">).</span><span class="n">resolve</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">job</span><span class="p">):</span>
        <span class="s">"""Execute a job and stream output."""</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="si">{</span><span class="s">'='</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] Starting job..."</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] Image: </span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">image</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] Stage: </span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">stage</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="s">'='</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>

        <span class="c1"># Combine all script commands into one shell command
</span>        <span class="n">script</span> <span class="o">=</span> <span class="s">' &amp;&amp; '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">script</span><span class="p">)</span>

        <span class="c1"># Build docker run command
</span>        <span class="n">cmd</span> <span class="o">=</span> <span class="p">[</span>
            <span class="s">'docker'</span><span class="p">,</span> <span class="s">'run'</span><span class="p">,</span>
            <span class="s">'--rm'</span><span class="p">,</span>  <span class="c1"># Remove container after execution
</span>            <span class="s">'-v'</span><span class="p">,</span> <span class="sa">f</span><span class="s">'</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">workspace</span><span class="si">}</span><span class="s">:/workspace'</span><span class="p">,</span>  <span class="c1"># Mount workspace
</span>            <span class="s">'-w'</span><span class="p">,</span> <span class="s">'/workspace'</span><span class="p">,</span>  <span class="c1"># Set working directory
</span>            <span class="n">job</span><span class="p">.</span><span class="n">image</span><span class="p">,</span>
            <span class="s">'sh'</span><span class="p">,</span> <span class="s">'-c'</span><span class="p">,</span> <span class="n">script</span>
        <span class="p">]</span>

        <span class="k">try</span><span class="p">:</span>
            <span class="c1"># Run and stream output in real-time
</span>            <span class="n">process</span> <span class="o">=</span> <span class="n">subprocess</span><span class="p">.</span><span class="n">Popen</span><span class="p">(</span>
                <span class="n">cmd</span><span class="p">,</span>
                <span class="n">stdout</span><span class="o">=</span><span class="n">subprocess</span><span class="p">.</span><span class="n">PIPE</span><span class="p">,</span>
                <span class="n">stderr</span><span class="o">=</span><span class="n">subprocess</span><span class="p">.</span><span class="n">STDOUT</span><span class="p">,</span>
                <span class="n">text</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
                <span class="n">bufsize</span><span class="o">=</span><span class="mi">1</span>
            <span class="p">)</span>

            <span class="c1"># Stream output line by line
</span>            <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">process</span><span class="p">.</span><span class="n">stdout</span><span class="p">:</span>
                <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] </span><span class="si">{</span><span class="n">line</span><span class="si">}</span><span class="s">"</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">''</span><span class="p">)</span>

            <span class="n">process</span><span class="p">.</span><span class="n">wait</span><span class="p">()</span>

            <span class="k">if</span> <span class="n">process</span><span class="p">.</span><span class="n">returncode</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
                <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] ✓ Job completed successfully</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
                <span class="k">return</span> <span class="bp">True</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] ✗ Job failed with exit code </span><span class="si">{</span><span class="n">process</span><span class="p">.</span><span class="n">returncode</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
                <span class="k">return</span> <span class="bp">False</span>

        <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] ✗ Error: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
            <span class="k">return</span> <span class="bp">False</span>


<span class="k">class</span> <span class="nc">Pipeline</span><span class="p">:</span>
    <span class="s">"""Represents a pipeline with jobs."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">config_file</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">config_file</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">config_file</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">config</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_load_config</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">jobs</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_parse_jobs</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">_load_config</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Load and parse YAML configuration."""</span>
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">config_file</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">yaml</span><span class="p">.</span><span class="n">safe_load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span> <span class="ow">or</span> <span class="p">{}</span>

    <span class="k">def</span> <span class="nf">_parse_jobs</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Parse jobs from configuration."""</span>
        <span class="n">jobs</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">job_name</span><span class="p">,</span> <span class="n">job_config</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
            <span class="k">if</span> <span class="n">job_name</span> <span class="o">!=</span> <span class="s">'stages'</span> <span class="ow">and</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">job_config</span><span class="p">,</span> <span class="nb">dict</span><span class="p">):</span>
                <span class="n">jobs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">Job</span><span class="p">(</span><span class="n">job_name</span><span class="p">,</span> <span class="n">job_config</span><span class="p">))</span>
        <span class="k">return</span> <span class="n">jobs</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">workspace</span><span class="o">=</span><span class="s">'.'</span><span class="p">):</span>
        <span class="s">"""Execute all jobs sequentially."""</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">Starting pipeline from </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">config_file</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Found </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">jobs</span><span class="p">)</span><span class="si">}</span><span class="s"> job(s)</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>

        <span class="n">executor</span> <span class="o">=</span> <span class="n">JobExecutor</span><span class="p">(</span><span class="n">workspace</span><span class="p">)</span>

        <span class="k">for</span> <span class="n">job</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">jobs</span><span class="p">:</span>
            <span class="n">success</span> <span class="o">=</span> <span class="n">executor</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">job</span><span class="p">)</span>
            <span class="k">if</span> <span class="ow">not</span> <span class="n">success</span><span class="p">:</span>
                <span class="k">print</span><span class="p">(</span><span class="s">"Pipeline failed!"</span><span class="p">)</span>
                <span class="k">return</span> <span class="bp">False</span>

        <span class="k">print</span><span class="p">(</span><span class="s">"✓ Pipeline completed successfully!"</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">True</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"Usage: python runner_v1.py &lt;pipeline.yml&gt;"</span><span class="p">)</span>
        <span class="n">sys</span><span class="p">.</span><span class="nb">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>

    <span class="n">pipeline</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
    <span class="n">success</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">.</span><span class="n">run</span><span class="p">()</span>
    <span class="n">sys</span><span class="p">.</span><span class="nb">exit</span><span class="p">(</span><span class="mi">0</span> <span class="k">if</span> <span class="n">success</span> <span class="k">else</span> <span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
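<p>One detail in the executor is worth trying in isolation: joining the script lines with <code class="language-plaintext highlighter-rouge">&amp;&amp;</code> means the shell stops at the first command that fails, and that command’s exit code becomes the job’s exit code. The sketch below shows the same joining trick, but runs <code class="language-plaintext highlighter-rouge">sh</code> directly on the host instead of inside Docker so you can try it without pulling an image. The <code class="language-plaintext highlighter-rouge">run_script</code> helper is a made-up name, and it assumes a POSIX <code class="language-plaintext highlighter-rouge">sh</code> is on your PATH.</p>

```python
import subprocess


def run_script(commands):
    """Join commands with && and run them in one shell, like the executor does."""
    script = " && ".join(commands)
    result = subprocess.run(
        ["sh", "-c", script],  # host shell here; the runner wraps this in `docker run`
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout


if __name__ == "__main__":
    # The second command fails, so the third should never run.
    code, output = run_script(['echo "step 1"', "false", 'echo "step 3"'])
    print(f"exit code: {code}")
    print(f"output: {output!r}")
```

<p>If you run this, only the first step should appear in the output and the exit code should be non-zero, which is the signal the runner uses to mark the job as failed.</p>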

<h4 id="testing-version-1">Testing Version 1</h4>

<p>Create a simple pipeline config:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># pipeline.yml</span>
<span class="na">build-job</span><span class="pi">:</span>
  <span class="na">image</span><span class="pi">:</span> <span class="s">python:3.11</span>
  <span class="na">script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">echo "Building application..."</span>
    <span class="pi">-</span> <span class="s">python --version</span>
    <span class="pi">-</span> <span class="s">pip install --quiet build</span>
    <span class="pi">-</span> <span class="s">echo "Build complete!"</span>
</code></pre></div></div>

<p>Run it:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python runner_v1.py pipeline.yml
</code></pre></div></div>

<p>You should see output like this (the exact Python patch version depends on the image you pull):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Starting pipeline from pipeline.yml
Found 1 job(s)

============================================================
[build-job] Starting job...
[build-job] Image: python:3.11
[build-job] Stage: test
============================================================

[build-job] Building application...
[build-job] Python 3.11.9
[build-job] Build complete!

[build-job] ✓ Job completed successfully

✓ Pipeline completed successfully!
</code></pre></div></div>

<p>Great! We can execute a single job. But real pipelines have multiple stages. Let’s add that.</p>

<h3 id="version-2-multi-stage-pipeline-with-sequential-execution">Version 2: Multi-Stage Pipeline with Sequential Execution</h3>

<p>Now let’s add support for stages. Stages run one after another, in the order they are declared, and a stage starts only after the previous one has finished:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python3
</span><span class="s">"""
Pipeline Runner v2: Multi-stage support
Executes jobs stage by stage in sequential order
"""</span>

<span class="kn">import</span> <span class="nn">yaml</span>
<span class="kn">import</span> <span class="nn">subprocess</span>
<span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">defaultdict</span>


<span class="k">class</span> <span class="nc">Job</span><span class="p">:</span>
    <span class="s">"""Represents a single pipeline job."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">config</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">image</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'image'</span><span class="p">,</span> <span class="s">'python:3.11'</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">script</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'script'</span><span class="p">,</span> <span class="p">[])</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">stage</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'stage'</span><span class="p">,</span> <span class="s">'test'</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">__repr__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="sa">f</span><span class="s">"Job(</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">, stage=</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">stage</span><span class="si">}</span><span class="s">)"</span>


<span class="k">class</span> <span class="nc">JobExecutor</span><span class="p">:</span>
    <span class="s">"""Executes a job in a Docker container."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">workspace</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">workspace</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">workspace</span><span class="p">).</span><span class="n">resolve</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">job</span><span class="p">):</span>
        <span class="s">"""Execute a job and stream output."""</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] Starting job..."</span><span class="p">)</span>

        <span class="n">script</span> <span class="o">=</span> <span class="s">' &amp;&amp; '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">script</span><span class="p">)</span>

        <span class="n">cmd</span> <span class="o">=</span> <span class="p">[</span>
            <span class="s">'docker'</span><span class="p">,</span> <span class="s">'run'</span><span class="p">,</span> <span class="s">'--rm'</span><span class="p">,</span>
            <span class="s">'-v'</span><span class="p">,</span> <span class="sa">f</span><span class="s">'</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">workspace</span><span class="si">}</span><span class="s">:/workspace'</span><span class="p">,</span>
            <span class="s">'-w'</span><span class="p">,</span> <span class="s">'/workspace'</span><span class="p">,</span>
            <span class="n">job</span><span class="p">.</span><span class="n">image</span><span class="p">,</span>
            <span class="s">'sh'</span><span class="p">,</span> <span class="s">'-c'</span><span class="p">,</span> <span class="n">script</span>
        <span class="p">]</span>

        <span class="k">try</span><span class="p">:</span>
            <span class="n">process</span> <span class="o">=</span> <span class="n">subprocess</span><span class="p">.</span><span class="n">Popen</span><span class="p">(</span>
                <span class="n">cmd</span><span class="p">,</span>
                <span class="n">stdout</span><span class="o">=</span><span class="n">subprocess</span><span class="p">.</span><span class="n">PIPE</span><span class="p">,</span>
                <span class="n">stderr</span><span class="o">=</span><span class="n">subprocess</span><span class="p">.</span><span class="n">STDOUT</span><span class="p">,</span>
                <span class="n">text</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
                <span class="n">bufsize</span><span class="o">=</span><span class="mi">1</span>
            <span class="p">)</span>

            <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">process</span><span class="p">.</span><span class="n">stdout</span><span class="p">:</span>
                <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] </span><span class="si">{</span><span class="n">line</span><span class="si">}</span><span class="s">"</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">''</span><span class="p">)</span>

            <span class="n">process</span><span class="p">.</span><span class="n">wait</span><span class="p">()</span>

            <span class="k">if</span> <span class="n">process</span><span class="p">.</span><span class="n">returncode</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
                <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] ✓ Job completed successfully"</span><span class="p">)</span>
                <span class="k">return</span> <span class="bp">True</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] ✗ Job failed with exit code </span><span class="si">{</span><span class="n">process</span><span class="p">.</span><span class="n">returncode</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
                <span class="k">return</span> <span class="bp">False</span>

        <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] ✗ Error: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            <span class="k">return</span> <span class="bp">False</span>


<span class="k">class</span> <span class="nc">Pipeline</span><span class="p">:</span>
    <span class="s">"""Represents a pipeline with stages and jobs."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">config_file</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">config_file</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">config_file</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">config</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_load_config</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">stages</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'stages'</span><span class="p">,</span> <span class="p">[</span><span class="s">'test'</span><span class="p">])</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">jobs</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_parse_jobs</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">_load_config</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Load and parse YAML configuration."""</span>
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">config_file</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">yaml</span><span class="p">.</span><span class="n">safe_load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span> <span class="ow">or</span> <span class="p">{}</span>

    <span class="k">def</span> <span class="nf">_parse_jobs</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Parse jobs from configuration."""</span>
        <span class="n">jobs</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">job_name</span><span class="p">,</span> <span class="n">job_config</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
            <span class="k">if</span> <span class="n">job_name</span> <span class="o">!=</span> <span class="s">'stages'</span> <span class="ow">and</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">job_config</span><span class="p">,</span> <span class="nb">dict</span><span class="p">):</span>
                <span class="n">jobs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">Job</span><span class="p">(</span><span class="n">job_name</span><span class="p">,</span> <span class="n">job_config</span><span class="p">))</span>
        <span class="k">return</span> <span class="n">jobs</span>

    <span class="k">def</span> <span class="nf">_group_jobs_by_stage</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Group jobs by their stage."""</span>
        <span class="n">stages</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">list</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">job</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">jobs</span><span class="p">:</span>
            <span class="n">stages</span><span class="p">[</span><span class="n">job</span><span class="p">.</span><span class="n">stage</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">job</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">stages</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">workspace</span><span class="o">=</span><span class="s">'.'</span><span class="p">):</span>
        <span class="s">"""Execute pipeline stage by stage."""</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="si">{</span><span class="s">'='</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Starting pipeline: </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">config_file</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Stages: </span><span class="si">{</span><span class="s">' → '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">stages</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Total jobs: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">jobs</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="s">'='</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>

        <span class="n">executor</span> <span class="o">=</span> <span class="n">JobExecutor</span><span class="p">(</span><span class="n">workspace</span><span class="p">)</span>
        <span class="n">stages_with_jobs</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_group_jobs_by_stage</span><span class="p">()</span>

        <span class="c1"># Execute stages in order
</span>        <span class="k">for</span> <span class="n">stage</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">stages</span><span class="p">:</span>
            <span class="n">stage_jobs</span> <span class="o">=</span> <span class="n">stages_with_jobs</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">stage</span><span class="p">,</span> <span class="p">[])</span>

            <span class="k">if</span> <span class="ow">not</span> <span class="n">stage_jobs</span><span class="p">:</span>
                <span class="k">continue</span>

            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="si">{</span><span class="s">'─'</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Stage: </span><span class="si">{</span><span class="n">stage</span><span class="si">}</span><span class="s"> (</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">stage_jobs</span><span class="p">)</span><span class="si">}</span><span class="s"> job(s))"</span><span class="p">)</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="s">'─'</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>

            <span class="c1"># Execute all jobs in this stage (sequentially for now)
</span>            <span class="k">for</span> <span class="n">job</span> <span class="ow">in</span> <span class="n">stage_jobs</span><span class="p">:</span>
                <span class="n">success</span> <span class="o">=</span> <span class="n">executor</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">job</span><span class="p">)</span>
                <span class="k">if</span> <span class="ow">not</span> <span class="n">success</span><span class="p">:</span>
                    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">✗ Pipeline failed at stage '</span><span class="si">{</span><span class="n">stage</span><span class="si">}</span><span class="s">'"</span><span class="p">)</span>
                    <span class="k">return</span> <span class="bp">False</span>

        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="si">{</span><span class="s">'='</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"✓ Pipeline completed successfully!"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="s">'='</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">True</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"Usage: python runner_v2.py &lt;pipeline.yml&gt;"</span><span class="p">)</span>
        <span class="n">sys</span><span class="p">.</span><span class="nb">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>

    <span class="n">pipeline</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
    <span class="n">success</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">.</span><span class="n">run</span><span class="p">()</span>
    <span class="n">sys</span><span class="p">.</span><span class="nb">exit</span><span class="p">(</span><span class="mi">0</span> <span class="k">if</span> <span class="n">success</span> <span class="k">else</span> <span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>

<h4 id="testing-version-2">Testing Version 2</h4>

<p>Create a multi-stage pipeline:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># pipeline.yml</span>
<span class="na">stages</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s">build</span>
  <span class="pi">-</span> <span class="s">test</span>
  <span class="pi">-</span> <span class="s">deploy</span>

<span class="na">build-job</span><span class="pi">:</span>
  <span class="na">stage</span><span class="pi">:</span> <span class="s">build</span>
  <span class="na">image</span><span class="pi">:</span> <span class="s">python:3.11</span>
  <span class="na">script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">echo "Building application..."</span>
    <span class="pi">-</span> <span class="s">mkdir -p dist</span>
    <span class="pi">-</span> <span class="s">echo "v1.0.0" &gt; dist/version.txt</span>

<span class="na">test-job</span><span class="pi">:</span>
  <span class="na">stage</span><span class="pi">:</span> <span class="s">test</span>
  <span class="na">image</span><span class="pi">:</span> <span class="s">python:3.11</span>
  <span class="na">script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">echo "Running tests..."</span>
    <span class="pi">-</span> <span class="s">python -c "print('All tests passed!')"</span>

<span class="na">deploy-job</span><span class="pi">:</span>
  <span class="na">stage</span><span class="pi">:</span> <span class="s">deploy</span>
  <span class="na">image</span><span class="pi">:</span> <span class="s">alpine:latest</span>
  <span class="na">script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">echo "Deploying application..."</span>
    <span class="pi">-</span> <span class="s">cat dist/version.txt</span>
    <span class="pi">-</span> <span class="s">echo "Deployment complete!"</span>
</code></pre></div></div>
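
<p>As a rough sketch of what the runner does with this file, the snippet below groups the jobs by stage and lists them in the order the stages are declared. It mirrors the <code class="language-plaintext highlighter-rouge">_parse_jobs</code> and <code class="language-plaintext highlighter-rouge">_group_jobs_by_stage</code> logic, but it works on a plain dict that stands in for the parsed YAML. The <code class="language-plaintext highlighter-rouge">execution_plan</code> helper is a hypothetical name used only for illustration.</p>

```python
from collections import defaultdict

# Stand-in for the parsed pipeline.yml (assumption: scripts omitted for brevity).
config = {
    "stages": ["build", "test", "deploy"],
    "build-job": {"stage": "build", "image": "python:3.11"},
    "test-job": {"stage": "test", "image": "python:3.11"},
    "deploy-job": {"stage": "deploy", "image": "alpine:latest"},
}


def execution_plan(config):
    """Return [(stage, [job names])] in the order the stages are declared."""
    by_stage = defaultdict(list)
    for name, job in config.items():
        # Same filter as Pipeline._parse_jobs: skip 'stages' and non-dict entries.
        if name != "stages" and isinstance(job, dict):
            by_stage[job.get("stage", "test")].append(name)
    # Stages without jobs are skipped, as in Pipeline.run.
    return [(s, by_stage[s]) for s in config.get("stages", ["test"]) if by_stage[s]]


for stage, jobs in execution_plan(config):
    print(f"{stage}: {', '.join(jobs)}")
```

<p>One consequence of this design: a job whose <code class="language-plaintext highlighter-rouge">stage</code> is not listed under <code class="language-plaintext highlighter-rouge">stages</code> never shows up in the plan, because the runner only loops over the declared stages.</p>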

<p>Run it:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python runner_v2.py pipeline.yml
</code></pre></div></div>

<p>Now we have stages! But notice that if a stage contains more than one job, those jobs still run one after another. Real CI/CD systems typically run the jobs within a stage in parallel. Let’s add that.</p>
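
<p>A back-of-the-envelope sketch of what parallelism buys: the durations below are made-up numbers, and the calculation assumes there are enough workers for every job in a stage and ignores container startup overhead.</p>

```python
# Hypothetical job durations in seconds, grouped by stage (made-up numbers).
stage_durations = {
    "build": [30],
    "test": [40, 25, 35],
    "deploy": [10],
}

# Sequential: every job waits for the previous one, so all durations add up.
sequential_total = sum(sum(jobs) for jobs in stage_durations.values())

# Parallel within a stage: a stage takes as long as its slowest job,
# while the stages themselves still run one after another.
parallel_total = sum(max(jobs) for jobs in stage_durations.values())

print(f"sequential: {sequential_total}s, parallel: {parallel_total}s")
```

<p>Only stages with several jobs get faster. A pipeline with one job per stage, like the example above, takes the same time either way.</p>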

<h3 id="version-3-parallel-job-execution">Version 3: Parallel Job Execution</h3>

<p>To run the jobs within a stage concurrently, we’ll use Python’s <code class="language-plaintext highlighter-rouge">multiprocessing</code> module. Jobs run in worker processes, and their log lines go through a shared queue instead of being printed directly:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python3
</span><span class="s">"""
Pipeline Runner v3: Parallel execution
Executes jobs within a stage in parallel using multiprocessing
"""</span>

<span class="kn">import</span> <span class="nn">yaml</span>
<span class="kn">import</span> <span class="nn">subprocess</span>
<span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">defaultdict</span>
<span class="kn">from</span> <span class="nn">multiprocessing</span> <span class="kn">import</span> <span class="n">Pool</span><span class="p">,</span> <span class="n">Manager</span>
<span class="kn">from</span> <span class="nn">functools</span> <span class="kn">import</span> <span class="n">partial</span>


<span class="k">class</span> <span class="nc">Job</span><span class="p">:</span>
    <span class="s">"""Represents a single pipeline job."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">config</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">image</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'image'</span><span class="p">,</span> <span class="s">'python:3.11'</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">script</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'script'</span><span class="p">,</span> <span class="p">[])</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">stage</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'stage'</span><span class="p">,</span> <span class="s">'test'</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">__repr__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="sa">f</span><span class="s">"Job(</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">, stage=</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">stage</span><span class="si">}</span><span class="s">)"</span>


<span class="k">class</span> <span class="nc">JobExecutor</span><span class="p">:</span>
    <span class="s">"""Executes a job in a Docker container."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">workspace</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">workspace</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">workspace</span><span class="p">).</span><span class="n">resolve</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">job</span><span class="p">,</span> <span class="n">output_queue</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="s">"""Execute a job and stream output."""</span>

        <span class="k">def</span> <span class="nf">log</span><span class="p">(</span><span class="n">msg</span><span class="p">):</span>
            <span class="s">"""Log a message (to queue if available, else stdout)."""</span>
            <span class="k">if</span> <span class="n">output_queue</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
                <span class="n">output_queue</span><span class="p">.</span><span class="n">put</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="k">print</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span>

        <span class="n">log</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] Starting job..."</span><span class="p">)</span>

        <span class="n">script</span> <span class="o">=</span> <span class="s">' &amp;&amp; '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">script</span><span class="p">)</span>

        <span class="n">cmd</span> <span class="o">=</span> <span class="p">[</span>
            <span class="s">'docker'</span><span class="p">,</span> <span class="s">'run'</span><span class="p">,</span> <span class="s">'--rm'</span><span class="p">,</span>
            <span class="s">'-v'</span><span class="p">,</span> <span class="sa">f</span><span class="s">'</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">workspace</span><span class="si">}</span><span class="s">:/workspace'</span><span class="p">,</span>
            <span class="s">'-w'</span><span class="p">,</span> <span class="s">'/workspace'</span><span class="p">,</span>
            <span class="n">job</span><span class="p">.</span><span class="n">image</span><span class="p">,</span>
            <span class="s">'sh'</span><span class="p">,</span> <span class="s">'-c'</span><span class="p">,</span> <span class="n">script</span>
        <span class="p">]</span>

        <span class="k">try</span><span class="p">:</span>
            <span class="n">process</span> <span class="o">=</span> <span class="n">subprocess</span><span class="p">.</span><span class="n">Popen</span><span class="p">(</span>
                <span class="n">cmd</span><span class="p">,</span>
                <span class="n">stdout</span><span class="o">=</span><span class="n">subprocess</span><span class="p">.</span><span class="n">PIPE</span><span class="p">,</span>
                <span class="n">stderr</span><span class="o">=</span><span class="n">subprocess</span><span class="p">.</span><span class="n">STDOUT</span><span class="p">,</span>
                <span class="n">text</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
                <span class="n">bufsize</span><span class="o">=</span><span class="mi">1</span>
            <span class="p">)</span>

            <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">process</span><span class="p">.</span><span class="n">stdout</span><span class="p">:</span>
                <span class="n">log</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] </span><span class="si">{</span><span class="n">line</span><span class="p">.</span><span class="n">rstrip</span><span class="p">()</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

            <span class="n">process</span><span class="p">.</span><span class="n">wait</span><span class="p">()</span>

            <span class="k">if</span> <span class="n">process</span><span class="p">.</span><span class="n">returncode</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
                <span class="n">log</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] ✓ Job completed successfully"</span><span class="p">)</span>
                <span class="k">return</span> <span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="bp">True</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="n">error_msg</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"Exit code </span><span class="si">{</span><span class="n">process</span><span class="p">.</span><span class="n">returncode</span><span class="si">}</span><span class="s">"</span>
                <span class="n">log</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] ✗ Job failed: </span><span class="si">{</span><span class="n">error_msg</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
                <span class="k">return</span> <span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="bp">False</span><span class="p">,</span> <span class="n">error_msg</span><span class="p">)</span>

        <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
            <span class="n">error_msg</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">)</span>
            <span class="n">log</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] ✗ Error: </span><span class="si">{</span><span class="n">error_msg</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            <span class="k">return</span> <span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="bp">False</span><span class="p">,</span> <span class="n">error_msg</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">run_job_parallel</span><span class="p">(</span><span class="n">job</span><span class="p">,</span> <span class="n">workspace</span><span class="p">,</span> <span class="n">output_queue</span><span class="p">):</span>
    <span class="s">"""Helper function for parallel execution."""</span>
    <span class="n">executor</span> <span class="o">=</span> <span class="n">JobExecutor</span><span class="p">(</span><span class="n">workspace</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">executor</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">job</span><span class="p">,</span> <span class="n">output_queue</span><span class="p">)</span>


<span class="k">class</span> <span class="nc">Pipeline</span><span class="p">:</span>
    <span class="s">"""Represents a pipeline with stages and parallel job execution."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">config_file</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">config_file</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">config_file</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">config</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_load_config</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">stages</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'stages'</span><span class="p">,</span> <span class="p">[</span><span class="s">'test'</span><span class="p">])</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">jobs</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_parse_jobs</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">_load_config</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Load and parse YAML configuration."""</span>
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">config_file</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">yaml</span><span class="p">.</span><span class="n">safe_load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">_parse_jobs</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Parse jobs from configuration."""</span>
        <span class="n">jobs</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">job_name</span><span class="p">,</span> <span class="n">job_config</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
            <span class="k">if</span> <span class="n">job_name</span> <span class="o">!=</span> <span class="s">'stages'</span> <span class="ow">and</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">job_config</span><span class="p">,</span> <span class="nb">dict</span><span class="p">):</span>
                <span class="n">jobs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">Job</span><span class="p">(</span><span class="n">job_name</span><span class="p">,</span> <span class="n">job_config</span><span class="p">))</span>
        <span class="k">return</span> <span class="n">jobs</span>

    <span class="k">def</span> <span class="nf">_group_jobs_by_stage</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Group jobs by their stage."""</span>
        <span class="n">stages</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">list</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">job</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">jobs</span><span class="p">:</span>
            <span class="n">stages</span><span class="p">[</span><span class="n">job</span><span class="p">.</span><span class="n">stage</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">job</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">stages</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">workspace</span><span class="o">=</span><span class="s">'.'</span><span class="p">):</span>
        <span class="s">"""Execute pipeline with parallel job execution per stage."""</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="si">{</span><span class="s">'='</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Starting pipeline: </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">config_file</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Stages: </span><span class="si">{</span><span class="s">' → '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">stages</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Total jobs: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">jobs</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="s">'='</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>

        <span class="n">workspace</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">workspace</span><span class="p">).</span><span class="n">resolve</span><span class="p">()</span>
        <span class="n">stages_with_jobs</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_group_jobs_by_stage</span><span class="p">()</span>

        <span class="c1"># Execute stages in order
</span>        <span class="k">for</span> <span class="n">stage</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">stages</span><span class="p">:</span>
            <span class="n">stage_jobs</span> <span class="o">=</span> <span class="n">stages_with_jobs</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">stage</span><span class="p">,</span> <span class="p">[])</span>

            <span class="k">if</span> <span class="ow">not</span> <span class="n">stage_jobs</span><span class="p">:</span>
                <span class="k">continue</span>

            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="si">{</span><span class="s">'─'</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Stage: </span><span class="si">{</span><span class="n">stage</span><span class="si">}</span><span class="s"> (</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">stage_jobs</span><span class="p">)</span><span class="si">}</span><span class="s"> job(s))"</span><span class="p">)</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="s">'─'</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>

            <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">stage_jobs</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
                <span class="c1"># Single job - run directly
</span>                <span class="n">executor</span> <span class="o">=</span> <span class="n">JobExecutor</span><span class="p">(</span><span class="n">workspace</span><span class="p">)</span>
                <span class="n">job_name</span><span class="p">,</span> <span class="n">success</span><span class="p">,</span> <span class="n">error</span> <span class="o">=</span> <span class="n">executor</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">stage_jobs</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
                <span class="k">if</span> <span class="ow">not</span> <span class="n">success</span><span class="p">:</span>
                    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">✗ Pipeline failed at stage '</span><span class="si">{</span><span class="n">stage</span><span class="si">}</span><span class="s">'"</span><span class="p">)</span>
                    <span class="k">return</span> <span class="bp">False</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="c1"># Multiple jobs - run in parallel
</span>                <span class="n">manager</span> <span class="o">=</span> <span class="n">Manager</span><span class="p">()</span>
                <span class="n">output_queue</span> <span class="o">=</span> <span class="n">manager</span><span class="p">.</span><span class="n">Queue</span><span class="p">()</span>

                <span class="c1"># Create a partial function with workspace and queue
</span>                <span class="n">run_func</span> <span class="o">=</span> <span class="n">partial</span><span class="p">(</span><span class="n">run_job_parallel</span><span class="p">,</span> <span class="n">workspace</span><span class="o">=</span><span class="n">workspace</span><span class="p">,</span> <span class="n">output_queue</span><span class="o">=</span><span class="n">output_queue</span><span class="p">)</span>

                <span class="c1"># Execute jobs in parallel
</span>                <span class="k">with</span> <span class="n">Pool</span><span class="p">(</span><span class="n">processes</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">stage_jobs</span><span class="p">))</span> <span class="k">as</span> <span class="n">pool</span><span class="p">:</span>
                    <span class="c1"># Start async execution
</span>                    <span class="n">results</span> <span class="o">=</span> <span class="n">pool</span><span class="p">.</span><span class="n">map_async</span><span class="p">(</span><span class="n">run_func</span><span class="p">,</span> <span class="n">stage_jobs</span><span class="p">)</span>

                    <span class="c1"># Print output as it arrives
</span>                    <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
                        <span class="k">if</span> <span class="n">results</span><span class="p">.</span><span class="n">ready</span><span class="p">()</span> <span class="ow">and</span> <span class="n">output_queue</span><span class="p">.</span><span class="n">empty</span><span class="p">():</span>
                            <span class="k">break</span>

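                        <span class="k">if</span> <span class="ow">not</span> <span class="n">output_queue</span><span class="p">.</span><span class="n">empty</span><span class="p">():</span>
                            <span class="k">print</span><span class="p">(</span><span class="n">output_queue</span><span class="p">.</span><span class="n">get</span><span class="p">())</span>
                        <span class="k">else</span><span class="p">:</span>
                            <span class="c1"># Nothing queued: wait briefly instead of spinning the CPU
</span>                            <span class="n">results</span><span class="p">.</span><span class="n">wait</span><span class="p">(</span><span class="mf">0.05</span><span class="p">)</span>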

                    <span class="c1"># Get results
</span>                    <span class="n">job_results</span> <span class="o">=</span> <span class="n">results</span><span class="p">.</span><span class="n">get</span><span class="p">()</span>

                    <span class="c1"># Check if all jobs succeeded
</span>                    <span class="k">if</span> <span class="ow">not</span> <span class="nb">all</span><span class="p">(</span><span class="n">success</span> <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">success</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">job_results</span><span class="p">):</span>
                        <span class="n">failed_jobs</span> <span class="o">=</span> <span class="p">[</span><span class="n">name</span> <span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">success</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">job_results</span> <span class="k">if</span> <span class="ow">not</span> <span class="n">success</span><span class="p">]</span>
                        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">✗ Pipeline failed at stage '</span><span class="si">{</span><span class="n">stage</span><span class="si">}</span><span class="s">'"</span><span class="p">)</span>
                        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  Failed jobs: </span><span class="si">{</span><span class="s">', '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">failed_jobs</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
                        <span class="k">return</span> <span class="bp">False</span>

        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="si">{</span><span class="s">'='</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"✓ Pipeline completed successfully!"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="s">'='</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">True</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"Usage: python runner_v3.py &lt;pipeline.yml&gt;"</span><span class="p">)</span>
        <span class="n">sys</span><span class="p">.</span><span class="nb">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>

    <span class="n">pipeline</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
    <span class="n">success</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">.</span><span class="n">run</span><span class="p">()</span>
    <span class="n">sys</span><span class="p">.</span><span class="nb">exit</span><span class="p">(</span><span class="mi">0</span> <span class="k">if</span> <span class="n">success</span> <span class="k">else</span> <span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>

<h4 id="testing-version-3">Testing Version 3</h4>

<p>Create a pipeline with parallel jobs:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># pipeline.yml</span>
<span class="na">stages</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s">build</span>
  <span class="pi">-</span> <span class="s">test</span>

<span class="na">build-job</span><span class="pi">:</span>
  <span class="na">stage</span><span class="pi">:</span> <span class="s">build</span>
  <span class="na">image</span><span class="pi">:</span> <span class="s">python:3.11</span>
  <span class="na">script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">echo "Building..."</span>
    <span class="pi">-</span> <span class="s">sleep </span><span class="m">2</span>

<span class="na">unit-tests</span><span class="pi">:</span>
  <span class="na">stage</span><span class="pi">:</span> <span class="s">test</span>
  <span class="na">image</span><span class="pi">:</span> <span class="s">python:3.11</span>
  <span class="na">script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">echo "Running unit tests..."</span>
    <span class="pi">-</span> <span class="s">sleep </span><span class="m">3</span>
    <span class="pi">-</span> <span class="s">echo "Unit tests passed!"</span>

<span class="na">integration-tests</span><span class="pi">:</span>
  <span class="na">stage</span><span class="pi">:</span> <span class="s">test</span>
  <span class="na">image</span><span class="pi">:</span> <span class="s">python:3.11</span>
  <span class="na">script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">echo "Running integration tests..."</span>
    <span class="pi">-</span> <span class="s">sleep </span><span class="m">3</span>
    <span class="pi">-</span> <span class="s">echo "Integration tests passed!"</span>
</code></pre></div></div>

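<p>Run it with <code class="language-plaintext highlighter-rouge">python runner_v3.py pipeline.yml</code> and notice that <code class="language-plaintext highlighter-rouge">unit-tests</code> and <code class="language-plaintext highlighter-rouge">integration-tests</code> run simultaneously: their log lines arrive interleaved, and the <code class="language-plaintext highlighter-rouge">test</code> stage finishes in roughly the time of one job (about 3 seconds plus container startup) rather than two.</p>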
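
<p>If you want to see the effect in isolation, without Docker, here is a minimal sketch. It makes a few assumptions: <code class="language-plaintext highlighter-rouge">fake_job</code> and <code class="language-plaintext highlighter-rouge">timed</code> are hypothetical helpers that are not part of the runner, the sleeps are scaled down from 3 seconds to 0.3, and <code class="language-plaintext highlighter-rouge">ThreadPool</code> stands in for <code class="language-plaintext highlighter-rouge">Pool</code> so the snippet runs as a plain script.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time
from multiprocessing.pool import ThreadPool


def fake_job(seconds):
    """Stand-in for a pipeline job: it only sleeps for the given time."""
    time.sleep(seconds)
    return seconds


def timed(run):
    """Return how long calling run() takes, in seconds."""
    start = time.perf_counter()
    run()
    return time.perf_counter() - start


# Two "test" jobs, scaled down from 3 seconds each
durations = [0.3, 0.3]

# One after the other: total is the sum of both sleeps
sequential = timed(lambda: [fake_job(d) for d in durations])

# One worker per job, as the runner does for a multi-job stage
with ThreadPool(processes=len(durations)) as pool:
    parallel = timed(lambda: pool.map(fake_job, durations))

print(f"sequential: {sequential:.2f}s, parallel: {parallel:.2f}s")
</code></pre></div></div>

<p>The sequential run should take about the sum of the sleeps, while the pooled run should take about as long as the slowest single job.</p>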

<p>Now we have parallel execution, but there’s a problem: what if <code class="language-plaintext highlighter-rouge">test</code> jobs need artifacts from the <code class="language-plaintext highlighter-rouge">build</code> job? We need dependency management and artifact passing.</p>

<h3 id="version-4-dependencies-and-artifacts">Version 4: Dependencies and Artifacts</h3>

<p>This is where it gets interesting. We need to:</p>
<ol>
  <li>Build a dependency graph (which jobs need which other jobs)</li>
  <li>Execute jobs in topological order (respecting dependencies)</li>
  <li>Pass artifacts between jobs</li>
</ol>
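
<p>Steps 1 and 2 can be sketched on their own before looking at the full runner. This is a minimal illustration using Kahn's algorithm, not necessarily how the runner itself orders jobs. It assumes dependencies arrive as a plain dict of job name to <code class="language-plaintext highlighter-rouge">needs</code> list; the function name <code class="language-plaintext highlighter-rouge">topological_levels</code> and the <code class="language-plaintext highlighter-rouge">deploy</code> job are made up for the example.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import defaultdict, deque


def topological_levels(needs):
    """Group jobs into batches; each batch depends only on earlier batches.

    `needs` maps a job name to the list of job names it depends on.
    Raises ValueError on an unknown dependency or a cycle.
    """
    dependents = defaultdict(list)  # maps a job to the jobs waiting on it
    remaining = {}                  # maps a job to its unmet dependency count
    for job, deps in needs.items():
        remaining[job] = len(deps)
        for dep in deps:
            if dep not in needs:
                raise ValueError(f"{job} needs unknown job {dep}")
            dependents[dep].append(job)

    # Jobs with no dependencies can start immediately
    ready = deque(job for job, count in remaining.items() if count == 0)
    levels = []
    done = 0
    while ready:
        batch = sorted(ready)  # sorted only to make the output stable
        ready.clear()
        levels.append(batch)
        done += len(batch)
        for job in batch:
            for waiting in dependents[job]:
                remaining[waiting] -= 1
                if remaining[waiting] == 0:
                    ready.append(waiting)

    # Any job never released is part of a cycle
    if done != len(needs):
        raise ValueError("dependency cycle detected")
    return levels


# Hypothetical job graph: both test jobs need the build job's output
example = {
    "build-job": [],
    "unit-tests": ["build-job"],
    "integration-tests": ["build-job"],
    "deploy": ["unit-tests", "integration-tests"],
}
print(topological_levels(example))
</code></pre></div></div>

<p>For this graph it prints <code class="language-plaintext highlighter-rouge">[['build-job'], ['integration-tests', 'unit-tests'], ['deploy']]</code>. Jobs in the same batch have no dependencies on each other, so they are the ones that can safely run in parallel.</p>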

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python3
</span><span class="s">"""
Pipeline Runner v4: Dependencies and Artifacts
Supports job dependencies, topological sorting, and artifact passing
"""</span>

<span class="kn">import</span> <span class="nn">yaml</span>
<span class="kn">import</span> <span class="nn">subprocess</span>
<span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">import</span> <span class="nn">shutil</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">defaultdict</span><span class="p">,</span> <span class="n">deque</span>
<span class="kn">from</span> <span class="nn">multiprocessing</span> <span class="kn">import</span> <span class="n">Pool</span><span class="p">,</span> <span class="n">Manager</span>
<span class="kn">from</span> <span class="nn">functools</span> <span class="kn">import</span> <span class="n">partial</span>


<span class="k">class</span> <span class="nc">Job</span><span class="p">:</span>
    <span class="s">"""Represents a single pipeline job."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">config</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">image</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'image'</span><span class="p">,</span> <span class="s">'python:3.11'</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">script</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'script'</span><span class="p">,</span> <span class="p">[])</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">stage</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'stage'</span><span class="p">,</span> <span class="s">'test'</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">artifacts</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'artifacts'</span><span class="p">,</span> <span class="p">{}).</span><span class="n">get</span><span class="p">(</span><span class="s">'paths'</span><span class="p">,</span> <span class="p">[])</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">needs</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'needs'</span><span class="p">,</span> <span class="p">[])</span>  <span class="c1"># Job dependencies
</span>
    <span class="k">def</span> <span class="nf">__repr__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="sa">f</span><span class="s">"Job(</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">, stage=</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">stage</span><span class="si">}</span><span class="s">)"</span>


<span class="k">class</span> <span class="nc">ArtifactManager</span><span class="p">:</span>
    <span class="s">"""Manages artifact storage and retrieval."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">workspace</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">workspace</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">workspace</span><span class="p">).</span><span class="n">resolve</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">artifact_dir</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">workspace</span> <span class="o">/</span> <span class="s">'.pipeline_artifacts'</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">artifact_dir</span><span class="p">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">save_artifacts</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">job_name</span><span class="p">,</span> <span class="n">artifact_paths</span><span class="p">):</span>
        <span class="s">"""Save artifacts from a job."""</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">artifact_paths</span><span class="p">:</span>
            <span class="k">return</span>

        <span class="n">job_artifact_dir</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">artifact_dir</span> <span class="o">/</span> <span class="n">job_name</span>
        <span class="n">job_artifact_dir</span><span class="p">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

        <span class="k">for</span> <span class="n">artifact_path</span> <span class="ow">in</span> <span class="n">artifact_paths</span><span class="p">:</span>
            <span class="n">src</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">workspace</span> <span class="o">/</span> <span class="n">artifact_path</span>
            <span class="k">if</span> <span class="n">src</span><span class="p">.</span><span class="n">exists</span><span class="p">():</span>
                <span class="n">dst</span> <span class="o">=</span> <span class="n">job_artifact_dir</span> <span class="o">/</span> <span class="n">artifact_path</span>
                <span class="n">dst</span><span class="p">.</span><span class="n">parent</span><span class="p">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">parents</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

                <span class="k">if</span> <span class="n">src</span><span class="p">.</span><span class="n">is_dir</span><span class="p">():</span>
                    <span class="n">shutil</span><span class="p">.</span><span class="n">copytree</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">dst</span><span class="p">,</span> <span class="n">dirs_exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
                <span class="k">else</span><span class="p">:</span>
                    <span class="n">shutil</span><span class="p">.</span><span class="n">copy2</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">dst</span><span class="p">)</span>

                <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  Saved artifact: </span><span class="si">{</span><span class="n">artifact_path</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">load_artifacts</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">job_names</span><span class="p">):</span>
        <span class="s">"""Load artifacts from dependent jobs."""</span>
        <span class="k">for</span> <span class="n">job_name</span> <span class="ow">in</span> <span class="n">job_names</span><span class="p">:</span>
            <span class="n">job_artifact_dir</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">artifact_dir</span> <span class="o">/</span> <span class="n">job_name</span>
            <span class="k">if</span> <span class="ow">not</span> <span class="n">job_artifact_dir</span><span class="p">.</span><span class="n">exists</span><span class="p">():</span>
                <span class="k">continue</span>

            <span class="c1"># Copy artifacts to workspace
</span>            <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">job_artifact_dir</span><span class="p">.</span><span class="n">rglob</span><span class="p">(</span><span class="s">'*'</span><span class="p">):</span>
                <span class="k">if</span> <span class="n">item</span><span class="p">.</span><span class="n">is_file</span><span class="p">():</span>
                    <span class="n">rel_path</span> <span class="o">=</span> <span class="n">item</span><span class="p">.</span><span class="n">relative_to</span><span class="p">(</span><span class="n">job_artifact_dir</span><span class="p">)</span>
                    <span class="n">dst</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">workspace</span> <span class="o">/</span> <span class="n">rel_path</span>
                    <span class="n">dst</span><span class="p">.</span><span class="n">parent</span><span class="p">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">parents</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
                    <span class="n">shutil</span><span class="p">.</span><span class="n">copy2</span><span class="p">(</span><span class="n">item</span><span class="p">,</span> <span class="n">dst</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">cleanup</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Remove all artifacts."""</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">artifact_dir</span><span class="p">.</span><span class="n">exists</span><span class="p">():</span>
            <span class="n">shutil</span><span class="p">.</span><span class="n">rmtree</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">artifact_dir</span><span class="p">)</span>


<span class="k">class</span> <span class="nc">JobExecutor</span><span class="p">:</span>
    <span class="s">"""Executes a job in a Docker container."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">workspace</span><span class="p">,</span> <span class="n">artifact_manager</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">workspace</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">workspace</span><span class="p">).</span><span class="n">resolve</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">artifact_manager</span> <span class="o">=</span> <span class="n">artifact_manager</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">job</span><span class="p">,</span> <span class="n">output_queue</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="s">"""Execute a job and stream output."""</span>

        <span class="k">def</span> <span class="nf">log</span><span class="p">(</span><span class="n">msg</span><span class="p">):</span>
            <span class="k">if</span> <span class="n">output_queue</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
                <span class="n">output_queue</span><span class="p">.</span><span class="n">put</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="k">print</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span>

        <span class="n">log</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] Starting job..."</span><span class="p">)</span>

        <span class="c1"># Load artifacts from dependencies
</span>        <span class="k">if</span> <span class="n">job</span><span class="p">.</span><span class="n">needs</span><span class="p">:</span>
            <span class="n">log</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] Loading artifacts from: </span><span class="si">{</span><span class="s">', '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">needs</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">artifact_manager</span><span class="p">.</span><span class="n">load_artifacts</span><span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">needs</span><span class="p">)</span>

        <span class="n">script</span> <span class="o">=</span> <span class="s">' &amp;&amp; '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">script</span><span class="p">)</span>

        <span class="n">cmd</span> <span class="o">=</span> <span class="p">[</span>
            <span class="s">'docker'</span><span class="p">,</span> <span class="s">'run'</span><span class="p">,</span> <span class="s">'--rm'</span><span class="p">,</span>
            <span class="s">'-v'</span><span class="p">,</span> <span class="sa">f</span><span class="s">'</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">workspace</span><span class="si">}</span><span class="s">:/workspace'</span><span class="p">,</span>
            <span class="s">'-w'</span><span class="p">,</span> <span class="s">'/workspace'</span><span class="p">,</span>
            <span class="n">job</span><span class="p">.</span><span class="n">image</span><span class="p">,</span>
            <span class="s">'sh'</span><span class="p">,</span> <span class="s">'-c'</span><span class="p">,</span> <span class="n">script</span>
        <span class="p">]</span>

        <span class="k">try</span><span class="p">:</span>
            <span class="n">process</span> <span class="o">=</span> <span class="n">subprocess</span><span class="p">.</span><span class="n">Popen</span><span class="p">(</span>
                <span class="n">cmd</span><span class="p">,</span>
                <span class="n">stdout</span><span class="o">=</span><span class="n">subprocess</span><span class="p">.</span><span class="n">PIPE</span><span class="p">,</span>
                <span class="n">stderr</span><span class="o">=</span><span class="n">subprocess</span><span class="p">.</span><span class="n">STDOUT</span><span class="p">,</span>
                <span class="n">text</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
                <span class="n">bufsize</span><span class="o">=</span><span class="mi">1</span>
            <span class="p">)</span>

            <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">process</span><span class="p">.</span><span class="n">stdout</span><span class="p">:</span>
                <span class="n">log</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] </span><span class="si">{</span><span class="n">line</span><span class="p">.</span><span class="n">rstrip</span><span class="p">()</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

            <span class="n">process</span><span class="p">.</span><span class="n">wait</span><span class="p">()</span>

            <span class="k">if</span> <span class="n">process</span><span class="p">.</span><span class="n">returncode</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
                <span class="c1"># Save artifacts
</span>                <span class="k">if</span> <span class="n">job</span><span class="p">.</span><span class="n">artifacts</span><span class="p">:</span>
                    <span class="n">log</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] Saving artifacts..."</span><span class="p">)</span>
                    <span class="bp">self</span><span class="p">.</span><span class="n">artifact_manager</span><span class="p">.</span><span class="n">save_artifacts</span><span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="n">job</span><span class="p">.</span><span class="n">artifacts</span><span class="p">)</span>

                <span class="n">log</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] ✓ Job completed successfully"</span><span class="p">)</span>
                <span class="k">return</span> <span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="bp">True</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="n">error_msg</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"Exit code </span><span class="si">{</span><span class="n">process</span><span class="p">.</span><span class="n">returncode</span><span class="si">}</span><span class="s">"</span>
                <span class="n">log</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] ✗ Job failed: </span><span class="si">{</span><span class="n">error_msg</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
                <span class="k">return</span> <span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="bp">False</span><span class="p">,</span> <span class="n">error_msg</span><span class="p">)</span>

        <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
            <span class="n">error_msg</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">)</span>
            <span class="n">log</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] ✗ Error: </span><span class="si">{</span><span class="n">error_msg</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            <span class="k">return</span> <span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="bp">False</span><span class="p">,</span> <span class="n">error_msg</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">run_job_parallel</span><span class="p">(</span><span class="n">job</span><span class="p">,</span> <span class="n">workspace</span><span class="p">,</span> <span class="n">artifact_manager</span><span class="p">,</span> <span class="n">output_queue</span><span class="p">):</span>
    <span class="s">"""Helper function for parallel execution."""</span>
    <span class="n">executor</span> <span class="o">=</span> <span class="n">JobExecutor</span><span class="p">(</span><span class="n">workspace</span><span class="p">,</span> <span class="n">artifact_manager</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">executor</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">job</span><span class="p">,</span> <span class="n">output_queue</span><span class="p">)</span>


<span class="k">class</span> <span class="nc">Pipeline</span><span class="p">:</span>
    <span class="s">"""Represents a pipeline with dependency-aware execution."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">config_file</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">config_file</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">config_file</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">config</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_load_config</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">stages</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'stages'</span><span class="p">,</span> <span class="p">[</span><span class="s">'test'</span><span class="p">])</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">jobs</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_parse_jobs</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">_load_config</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Load and parse YAML configuration."""</span>
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">config_file</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">yaml</span><span class="p">.</span><span class="n">safe_load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">_parse_jobs</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Parse jobs from configuration."""</span>
        <span class="n">jobs</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">job_name</span><span class="p">,</span> <span class="n">job_config</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
            <span class="k">if</span> <span class="n">job_name</span> <span class="o">!=</span> <span class="s">'stages'</span> <span class="ow">and</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">job_config</span><span class="p">,</span> <span class="nb">dict</span><span class="p">):</span>
                <span class="n">jobs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">Job</span><span class="p">(</span><span class="n">job_name</span><span class="p">,</span> <span class="n">job_config</span><span class="p">))</span>
        <span class="k">return</span> <span class="n">jobs</span>

    <span class="k">def</span> <span class="nf">_topological_sort</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">jobs</span><span class="p">):</span>
        <span class="s">"""
        Sort jobs in topological order based on dependencies.
        Returns list of job groups where each group can run in parallel.
        """</span>
        <span class="c1"># Build adjacency list and in-degree count
</span>        <span class="n">job_map</span> <span class="o">=</span> <span class="p">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="p">:</span> <span class="n">job</span> <span class="k">for</span> <span class="n">job</span> <span class="ow">in</span> <span class="n">jobs</span><span class="p">}</span>
        <span class="n">in_degree</span> <span class="o">=</span> <span class="p">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="p">:</span> <span class="mi">0</span> <span class="k">for</span> <span class="n">job</span> <span class="ow">in</span> <span class="n">jobs</span><span class="p">}</span>
        <span class="n">adjacency</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">list</span><span class="p">)</span>

        <span class="k">for</span> <span class="n">job</span> <span class="ow">in</span> <span class="n">jobs</span><span class="p">:</span>
            <span class="k">for</span> <span class="n">dep</span> <span class="ow">in</span> <span class="n">job</span><span class="p">.</span><span class="n">needs</span><span class="p">:</span>
                <span class="k">if</span> <span class="n">dep</span> <span class="ow">in</span> <span class="n">job_map</span><span class="p">:</span>
                    <span class="n">adjacency</span><span class="p">[</span><span class="n">dep</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="p">)</span>
                    <span class="n">in_degree</span><span class="p">[</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span>

        <span class="c1"># Find jobs with no dependencies
</span>        <span class="n">queue</span> <span class="o">=</span> <span class="n">deque</span><span class="p">([</span><span class="n">name</span> <span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">degree</span> <span class="ow">in</span> <span class="n">in_degree</span><span class="p">.</span><span class="n">items</span><span class="p">()</span> <span class="k">if</span> <span class="n">degree</span> <span class="o">==</span> <span class="mi">0</span><span class="p">])</span>
        <span class="n">execution_order</span> <span class="o">=</span> <span class="p">[]</span>

        <span class="k">while</span> <span class="n">queue</span><span class="p">:</span>
            <span class="c1"># All jobs in current queue can run in parallel
</span>            <span class="n">current_batch</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">queue</span><span class="p">)</span>
            <span class="n">execution_order</span><span class="p">.</span><span class="n">append</span><span class="p">([</span><span class="n">job_map</span><span class="p">[</span><span class="n">name</span><span class="p">]</span> <span class="k">for</span> <span class="n">name</span> <span class="ow">in</span> <span class="n">current_batch</span><span class="p">])</span>

            <span class="n">queue</span><span class="p">.</span><span class="n">clear</span><span class="p">()</span>

            <span class="c1"># Process current batch
</span>            <span class="k">for</span> <span class="n">job_name</span> <span class="ow">in</span> <span class="n">current_batch</span><span class="p">:</span>
                <span class="k">for</span> <span class="n">dependent</span> <span class="ow">in</span> <span class="n">adjacency</span><span class="p">[</span><span class="n">job_name</span><span class="p">]:</span>
                    <span class="n">in_degree</span><span class="p">[</span><span class="n">dependent</span><span class="p">]</span> <span class="o">-=</span> <span class="mi">1</span>
                    <span class="k">if</span> <span class="n">in_degree</span><span class="p">[</span><span class="n">dependent</span><span class="p">]</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
                        <span class="n">queue</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">dependent</span><span class="p">)</span>

        <span class="c1"># Check for cycles
</span>        <span class="k">if</span> <span class="nb">sum</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">batch</span><span class="p">)</span> <span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">execution_order</span><span class="p">)</span> <span class="o">!=</span> <span class="nb">len</span><span class="p">(</span><span class="n">jobs</span><span class="p">):</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="s">"Circular dependency detected in job dependencies"</span><span class="p">)</span>

        <span class="k">return</span> <span class="n">execution_order</span>

    <span class="k">def</span> <span class="nf">_group_jobs_by_stage</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Group jobs by their stage."""</span>
        <span class="n">stages</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">list</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">job</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">jobs</span><span class="p">:</span>
            <span class="n">stages</span><span class="p">[</span><span class="n">job</span><span class="p">.</span><span class="n">stage</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">job</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">stages</span>

    <span class="k">def</span> <span class="nf">_execute_job_batch</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">jobs</span><span class="p">,</span> <span class="n">workspace</span><span class="p">,</span> <span class="n">artifact_manager</span><span class="p">):</span>
        <span class="s">"""Execute a batch of jobs in parallel."""</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">jobs</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
            <span class="c1"># Single job - run directly
</span>            <span class="n">executor</span> <span class="o">=</span> <span class="n">JobExecutor</span><span class="p">(</span><span class="n">workspace</span><span class="p">,</span> <span class="n">artifact_manager</span><span class="p">)</span>
            <span class="n">job_name</span><span class="p">,</span> <span class="n">success</span><span class="p">,</span> <span class="n">error</span> <span class="o">=</span> <span class="n">executor</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">jobs</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
            <span class="k">return</span> <span class="p">[(</span><span class="n">job_name</span><span class="p">,</span> <span class="n">success</span><span class="p">,</span> <span class="n">error</span><span class="p">)]</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="c1"># Multiple jobs - run in parallel
</span>            <span class="n">manager</span> <span class="o">=</span> <span class="n">Manager</span><span class="p">()</span>
            <span class="n">output_queue</span> <span class="o">=</span> <span class="n">manager</span><span class="p">.</span><span class="n">Queue</span><span class="p">()</span>

            <span class="c1"># Create a partial function
</span>            <span class="n">run_func</span> <span class="o">=</span> <span class="n">partial</span><span class="p">(</span>
                <span class="n">run_job_parallel</span><span class="p">,</span>
                <span class="n">workspace</span><span class="o">=</span><span class="n">workspace</span><span class="p">,</span>
                <span class="n">artifact_manager</span><span class="o">=</span><span class="n">artifact_manager</span><span class="p">,</span>
                <span class="n">output_queue</span><span class="o">=</span><span class="n">output_queue</span>
            <span class="p">)</span>

            <span class="c1"># Execute jobs in parallel
</span>            <span class="k">with</span> <span class="n">Pool</span><span class="p">(</span><span class="n">processes</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">jobs</span><span class="p">))</span> <span class="k">as</span> <span class="n">pool</span><span class="p">:</span>
                <span class="n">results</span> <span class="o">=</span> <span class="n">pool</span><span class="p">.</span><span class="n">map_async</span><span class="p">(</span><span class="n">run_func</span><span class="p">,</span> <span class="n">jobs</span><span class="p">)</span>

                <span class="c1"># Print output as it arrives
</span>                <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
                    <span class="k">if</span> <span class="n">results</span><span class="p">.</span><span class="n">ready</span><span class="p">()</span> <span class="ow">and</span> <span class="n">output_queue</span><span class="p">.</span><span class="n">empty</span><span class="p">():</span>
                        <span class="k">break</span>

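                    <span class="k">if</span> <span class="ow">not</span> <span class="n">output_queue</span><span class="p">.</span><span class="n">empty</span><span class="p">():</span>
                        <span class="k">print</span><span class="p">(</span><span class="n">output_queue</span><span class="p">.</span><span class="n">get</span><span class="p">())</span>
                    <span class="k">else</span><span class="p">:</span>
                        <span class="c1"># Nothing to print yet: wait briefly instead of spinning the CPU
</span>                        <span class="n">results</span><span class="p">.</span><span class="n">wait</span><span class="p">(</span><span class="mf">0.05</span><span class="p">)</span>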

                <span class="k">return</span> <span class="n">results</span><span class="p">.</span><span class="n">get</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">workspace</span><span class="o">=</span><span class="s">'.'</span><span class="p">):</span>
        <span class="s">"""Execute pipeline with dependency resolution."""</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="si">{</span><span class="s">'='</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Starting pipeline: </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">config_file</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Stages: </span><span class="si">{</span><span class="s">' → '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">stages</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Total jobs: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">jobs</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="s">'='</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>

        <span class="n">workspace</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">workspace</span><span class="p">).</span><span class="n">resolve</span><span class="p">()</span>
        <span class="n">artifact_manager</span> <span class="o">=</span> <span class="n">ArtifactManager</span><span class="p">(</span><span class="n">workspace</span><span class="p">)</span>
        <span class="n">stages_with_jobs</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_group_jobs_by_stage</span><span class="p">()</span>

        <span class="k">try</span><span class="p">:</span>
            <span class="c1"># Execute stages in order
</span>            <span class="k">for</span> <span class="n">stage</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">stages</span><span class="p">:</span>
                <span class="n">stage_jobs</span> <span class="o">=</span> <span class="n">stages_with_jobs</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">stage</span><span class="p">,</span> <span class="p">[])</span>

                <span class="k">if</span> <span class="ow">not</span> <span class="n">stage_jobs</span><span class="p">:</span>
                    <span class="k">continue</span>

                <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="si">{</span><span class="s">'─'</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
                <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Stage: </span><span class="si">{</span><span class="n">stage</span><span class="si">}</span><span class="s"> (</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">stage_jobs</span><span class="p">)</span><span class="si">}</span><span class="s"> job(s))"</span><span class="p">)</span>
                <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="s">'─'</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>

                <span class="c1"># Sort jobs by dependencies
</span>                <span class="k">try</span><span class="p">:</span>
                    <span class="n">execution_batches</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_topological_sort</span><span class="p">(</span><span class="n">stage_jobs</span><span class="p">)</span>
                <span class="k">except</span> <span class="nb">ValueError</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
                    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"✗ Error: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
                    <span class="k">return</span> <span class="bp">False</span>

                <span class="c1"># Execute batches in order
</span>                <span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">execution_batches</span><span class="p">:</span>
                    <span class="n">job_results</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_execute_job_batch</span><span class="p">(</span><span class="n">batch</span><span class="p">,</span> <span class="n">workspace</span><span class="p">,</span> <span class="n">artifact_manager</span><span class="p">)</span>

                    <span class="c1"># Check if all jobs succeeded
</span>                    <span class="k">if</span> <span class="ow">not</span> <span class="nb">all</span><span class="p">(</span><span class="n">success</span> <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">success</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">job_results</span><span class="p">):</span>
                        <span class="n">failed_jobs</span> <span class="o">=</span> <span class="p">[</span><span class="n">name</span> <span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">success</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">job_results</span> <span class="k">if</span> <span class="ow">not</span> <span class="n">success</span><span class="p">]</span>
                        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">✗ Pipeline failed at stage '</span><span class="si">{</span><span class="n">stage</span><span class="si">}</span><span class="s">'"</span><span class="p">)</span>
                        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  Failed jobs: </span><span class="si">{</span><span class="s">', '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">failed_jobs</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
                        <span class="k">return</span> <span class="bp">False</span>

            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="si">{</span><span class="s">'='</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"✓ Pipeline completed successfully!"</span><span class="p">)</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="s">'='</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
            <span class="k">return</span> <span class="bp">True</span>

        <span class="k">finally</span><span class="p">:</span>
            <span class="c1"># Cleanup artifacts
</span>            <span class="n">artifact_manager</span><span class="p">.</span><span class="n">cleanup</span><span class="p">()</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"Usage: python runner_v4.py &lt;pipeline.yml&gt;"</span><span class="p">)</span>
        <span class="n">sys</span><span class="p">.</span><span class="nb">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>

    <span class="n">pipeline</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
    <span class="n">success</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">.</span><span class="n">run</span><span class="p">()</span>
    <span class="n">sys</span><span class="p">.</span><span class="nb">exit</span><span class="p">(</span><span class="mi">0</span> <span class="k">if</span> <span class="n">success</span> <span class="k">else</span> <span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
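
<p>The batching logic in <code class="language-plaintext highlighter-rouge">_topological_sort</code> is easier to follow in isolation. Below is a minimal sketch of the same idea (Kahn's algorithm, processed one level at a time) that works on plain job names instead of <code class="language-plaintext highlighter-rouge">Job</code> objects. The <code class="language-plaintext highlighter-rouge">batch_order</code> name and the dict-based input are assumptions made for illustration; they are not part of the runner.</p>

```python
from collections import defaultdict, deque


def batch_order(needs):
    """Group job names into batches whose dependencies are already satisfied.

    `needs` maps each job name to the list of job names it depends on.
    Hypothetical helper for illustration; not part of the runner above.
    """
    in_degree = {name: 0 for name in needs}
    dependents = defaultdict(list)

    for name, deps in needs.items():
        for dep in deps:
            if dep in needs:  # skip dependencies outside this set, as the runner does
                dependents[dep].append(name)
                in_degree[name] += 1

    # Jobs with no unmet dependencies form the first batch
    queue = deque(name for name, degree in in_degree.items() if degree == 0)
    batches = []

    while queue:
        current = list(queue)
        queue.clear()
        batches.append(current)

        # Finishing a batch may unblock jobs for the next one
        for name in current:
            for dependent in dependents[name]:
                in_degree[dependent] -= 1
                if in_degree[dependent] == 0:
                    queue.append(dependent)

    # Jobs caught in a cycle never reach in-degree 0, so they are never scheduled
    if sum(len(batch) for batch in batches) != len(needs):
        raise ValueError("Circular dependency detected in job dependencies")

    return batches


if __name__ == "__main__":
    print(batch_order({"lint": [], "compile": [], "package": ["lint", "compile"]}))
    # [['lint', 'compile'], ['package']]
```

<p>Each inner list is one batch: everything in it can run at the same time, and the next batch starts only after the current one finishes.</p>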

<h4 id="testing-version-4">Testing Version 4</h4>

<p>Create a pipeline with dependencies and artifacts:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># pipeline.yml</span>
<span class="na">stages</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s">build</span>
  <span class="pi">-</span> <span class="s">test</span>
  <span class="pi">-</span> <span class="s">deploy</span>

<span class="na">build-app</span><span class="pi">:</span>
  <span class="na">stage</span><span class="pi">:</span> <span class="s">build</span>
  <span class="na">image</span><span class="pi">:</span> <span class="s">python:3.11</span>
  <span class="na">script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">echo "Building application..."</span>
    <span class="pi">-</span> <span class="s">mkdir -p dist</span>
    <span class="pi">-</span> <span class="s">echo "app-v1.0.0" &gt; dist/app.txt</span>
    <span class="pi">-</span> <span class="s">echo "Build complete!"</span>
  <span class="na">artifacts</span><span class="pi">:</span>
    <span class="na">paths</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">dist/</span>

<span class="na">unit-tests</span><span class="pi">:</span>
  <span class="na">stage</span><span class="pi">:</span> <span class="s">test</span>
  <span class="na">image</span><span class="pi">:</span> <span class="s">python:3.11</span>
  <span class="na">needs</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">build-app</span>
  <span class="na">script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">echo "Running unit tests..."</span>
    <span class="pi">-</span> <span class="s">ls -la dist/</span>
    <span class="pi">-</span> <span class="s">cat dist/app.txt</span>
    <span class="pi">-</span> <span class="s">echo "Tests passed!"</span>

<span class="na">integration-tests</span><span class="pi">:</span>
  <span class="na">stage</span><span class="pi">:</span> <span class="s">test</span>
  <span class="na">image</span><span class="pi">:</span> <span class="s">python:3.11</span>
  <span class="na">needs</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">build-app</span>
  <span class="na">script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">echo "Running integration tests..."</span>
    <span class="pi">-</span> <span class="s">cat dist/app.txt</span>
    <span class="pi">-</span> <span class="s">echo "Integration tests passed!"</span>

<span class="na">deploy-prod</span><span class="pi">:</span>
  <span class="na">stage</span><span class="pi">:</span> <span class="s">deploy</span>
  <span class="na">image</span><span class="pi">:</span> <span class="s">alpine:latest</span>
  <span class="na">needs</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">unit-tests</span>
    <span class="pi">-</span> <span class="s">integration-tests</span>
  <span class="na">script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">echo "Deploying to production..."</span>
    <span class="pi">-</span> <span class="s">cat dist/app.txt</span>
    <span class="pi">-</span> <span class="s">echo "Deployed!"</span>
</code></pre></div></div>

<p>Run it:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python runner_v4.py pipeline.yml
</code></pre></div></div>

<p>Notice how:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">build-app</code> runs first</li>
  <li><code class="language-plaintext highlighter-rouge">unit-tests</code> and <code class="language-plaintext highlighter-rouge">integration-tests</code> run in parallel (both are in the <code class="language-plaintext highlighter-rouge">test</code> stage, both depend only on <code class="language-plaintext highlighter-rouge">build-app</code>, and neither depends on the other)</li>
  <li><code class="language-plaintext highlighter-rouge">deploy-prod</code> runs only after both test jobs complete</li>
  <li>Artifacts (the <code class="language-plaintext highlighter-rouge">dist/</code> folder) are passed between jobs!</li>
</ul>
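
<p>One detail worth spelling out: the runner sorts jobs one stage at a time, and <code class="language-plaintext highlighter-rouge">_topological_sort</code> ignores any <code class="language-plaintext highlighter-rouge">needs</code> entry that points outside the current stage. In this pipeline every <code class="language-plaintext highlighter-rouge">needs</code> crosses a stage boundary, so the ordering above comes entirely from the stage order. The sketch below checks that for the example config. The <code class="language-plaintext highlighter-rouge">in_stage_needs</code> helper is hypothetical, and it assumes every job declares a <code class="language-plaintext highlighter-rouge">stage</code>.</p>

```python
from collections import defaultdict

# The example pipeline, reduced to the fields that affect ordering.
PIPELINE = {
    "stages": ["build", "test", "deploy"],
    "build-app": {"stage": "build"},
    "unit-tests": {"stage": "test", "needs": ["build-app"]},
    "integration-tests": {"stage": "test", "needs": ["build-app"]},
    "deploy-prod": {"stage": "deploy", "needs": ["unit-tests", "integration-tests"]},
}


def in_stage_needs(config):
    """Map each stage to its jobs and the `needs` entries that stay inside that stage.

    Hypothetical helper for illustration; assumes every job has a 'stage' key.
    """
    jobs = {
        name: job for name, job in config.items()
        if name != "stages" and isinstance(job, dict)
    }

    by_stage = defaultdict(dict)
    for name, job in jobs.items():
        stage = job["stage"]
        # Keep only dependencies on jobs that live in the same stage
        same_stage = [
            dep for dep in job.get("needs", [])
            if dep in jobs and jobs[dep]["stage"] == stage
        ]
        by_stage[stage][name] = same_stage

    # Report stages in the order the pipeline declares them
    return {stage: by_stage[stage] for stage in config["stages"] if stage in by_stage}


if __name__ == "__main__":
    for stage, jobs in in_stage_needs(PIPELINE).items():
        print(stage, jobs)
    # build {'build-app': []}
    # test {'unit-tests': [], 'integration-tests': []}
    # deploy {'deploy-prod': []}
```

<p>Every list is empty, which is why the two test jobs land in the same batch. If <code class="language-plaintext highlighter-rouge">integration-tests</code> listed <code class="language-plaintext highlighter-rouge">unit-tests</code> under <code class="language-plaintext highlighter-rouge">needs</code>, that entry would show up here and the two jobs would run one after the other.</p>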

<p>This is getting close to a real CI/CD runner. Let’s add one more version with production-ready features.</p>

<h3 id="version-5-production-ready-features">Version 5: Production-Ready Features</h3>

<p>Let’s add the final touches: environment variables, branch filtering, timeouts, and better error handling:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python3
</span><span class="s">"""
Pipeline Runner v5: Production-ready
Complete CI/CD runner with all features
"""</span>

<span class="kn">import</span> <span class="nn">yaml</span>
<span class="kn">import</span> <span class="nn">subprocess</span>
<span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">import</span> <span class="nn">shutil</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">defaultdict</span><span class="p">,</span> <span class="n">deque</span>
<span class="kn">from</span> <span class="nn">multiprocessing</span> <span class="kn">import</span> <span class="n">Pool</span><span class="p">,</span> <span class="n">Manager</span>
<span class="kn">from</span> <span class="nn">functools</span> <span class="kn">import</span> <span class="n">partial</span>
<span class="kn">import</span> <span class="nn">time</span>


<span class="k">def</span> <span class="nf">get_current_branch</span><span class="p">():</span>
    <span class="s">"""Get the current git branch."""</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">result</span> <span class="o">=</span> <span class="n">subprocess</span><span class="p">.</span><span class="n">run</span><span class="p">(</span>
            <span class="p">[</span><span class="s">'git'</span><span class="p">,</span> <span class="s">'rev-parse'</span><span class="p">,</span> <span class="s">'--abbrev-ref'</span><span class="p">,</span> <span class="s">'HEAD'</span><span class="p">],</span>
            <span class="n">capture_output</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
            <span class="n">text</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
            <span class="n">timeout</span><span class="o">=</span><span class="mi">5</span>
        <span class="p">)</span>
        <span class="k">return</span> <span class="n">result</span><span class="p">.</span><span class="n">stdout</span><span class="p">.</span><span class="n">strip</span><span class="p">()</span> <span class="k">if</span> <span class="n">result</span><span class="p">.</span><span class="n">returncode</span> <span class="o">==</span> <span class="mi">0</span> <span class="k">else</span> <span class="s">'main'</span>
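    <span class="k">except</span> <span class="p">(</span><span class="n">subprocess</span><span class="p">.</span><span class="n">SubprocessError</span><span class="p">,</span> <span class="nb">OSError</span><span class="p">):</span>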
        <span class="k">return</span> <span class="s">'main'</span>


<span class="k">def</span> <span class="nf">substitute_variables</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">variables</span><span class="p">):</span>
    <span class="s">"""Substitute ${VAR} and $VAR style variables in text."""</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="nb">str</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">text</span>

    <span class="k">def</span> <span class="nf">replace</span><span class="p">(</span><span class="n">match</span><span class="p">):</span>
        <span class="n">name</span> <span class="o">=</span> <span class="n">match</span><span class="p">.</span><span class="n">group</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="ow">or</span> <span class="n">match</span><span class="p">.</span><span class="n">group</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
        <span class="c1"># Unknown names are left as-is so the shell can still expand them
</span>        <span class="k">return</span> <span class="nb">str</span><span class="p">(</span><span class="n">variables</span><span class="p">[</span><span class="n">name</span><span class="p">])</span> <span class="k">if</span> <span class="n">name</span> <span class="ow">in</span> <span class="n">variables</span> <span class="k">else</span> <span class="n">match</span><span class="p">.</span><span class="n">group</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>

    <span class="c1"># Match whole names only, so $APP is not substituted inside $APP_NAME
</span>    <span class="k">return</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s">'\$\{(\w+)\}|\$(\w+)'</span><span class="p">,</span> <span class="n">replace</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>


<span class="k">class</span> <span class="nc">Job</span><span class="p">:</span>
    <span class="s">"""Represents a single pipeline job."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">config</span><span class="p">,</span> <span class="n">global_variables</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">image</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'image'</span><span class="p">,</span> <span class="s">'python:3.11'</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">script</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'script'</span><span class="p">,</span> <span class="p">[])</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">stage</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'stage'</span><span class="p">,</span> <span class="s">'test'</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">artifacts</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'artifacts'</span><span class="p">,</span> <span class="p">{}).</span><span class="n">get</span><span class="p">(</span><span class="s">'paths'</span><span class="p">,</span> <span class="p">[])</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">needs</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'needs'</span><span class="p">,</span> <span class="p">[])</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">only</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'only'</span><span class="p">,</span> <span class="p">[])</span>  <span class="c1"># Branch filter
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">timeout</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'timeout'</span><span class="p">,</span> <span class="mi">3600</span><span class="p">)</span>  <span class="c1"># Default 1 hour
</span>
        <span class="c1"># Substitute variables in image and script
</span>        <span class="n">variables</span> <span class="o">=</span> <span class="n">global_variables</span> <span class="ow">or</span> <span class="p">{}</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">image</span> <span class="o">=</span> <span class="n">substitute_variables</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">image</span><span class="p">,</span> <span class="n">variables</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">script</span> <span class="o">=</span> <span class="p">[</span><span class="n">substitute_variables</span><span class="p">(</span><span class="n">cmd</span><span class="p">,</span> <span class="n">variables</span><span class="p">)</span> <span class="k">for</span> <span class="n">cmd</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">script</span><span class="p">]</span>

    <span class="k">def</span> <span class="nf">should_run</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">branch</span><span class="p">):</span>
        <span class="s">"""Check if job should run on current branch."""</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">only</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">True</span>
        <span class="k">return</span> <span class="n">branch</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">only</span>

    <span class="k">def</span> <span class="nf">__repr__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="sa">f</span><span class="s">"Job(</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">, stage=</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">stage</span><span class="si">}</span><span class="s">)"</span>


<span class="k">class</span> <span class="nc">ArtifactManager</span><span class="p">:</span>
    <span class="s">"""Manages artifact storage and retrieval."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">workspace</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">workspace</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">workspace</span><span class="p">).</span><span class="n">resolve</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">artifact_dir</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">workspace</span> <span class="o">/</span> <span class="s">'.pipeline_artifacts'</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">artifact_dir</span><span class="p">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">save_artifacts</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">job_name</span><span class="p">,</span> <span class="n">artifact_paths</span><span class="p">):</span>
        <span class="s">"""Save artifacts from a job."""</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">artifact_paths</span><span class="p">:</span>
            <span class="k">return</span> <span class="mi">0</span>

        <span class="n">job_artifact_dir</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">artifact_dir</span> <span class="o">/</span> <span class="n">job_name</span>
        <span class="n">job_artifact_dir</span><span class="p">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

        <span class="n">saved_count</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="k">for</span> <span class="n">artifact_path</span> <span class="ow">in</span> <span class="n">artifact_paths</span><span class="p">:</span>
            <span class="n">src</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">workspace</span> <span class="o">/</span> <span class="n">artifact_path</span>
            <span class="k">if</span> <span class="n">src</span><span class="p">.</span><span class="n">exists</span><span class="p">():</span>
                <span class="n">dst</span> <span class="o">=</span> <span class="n">job_artifact_dir</span> <span class="o">/</span> <span class="n">artifact_path</span>
                <span class="n">dst</span><span class="p">.</span><span class="n">parent</span><span class="p">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">parents</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

                <span class="k">if</span> <span class="n">src</span><span class="p">.</span><span class="n">is_dir</span><span class="p">():</span>
                    <span class="n">shutil</span><span class="p">.</span><span class="n">copytree</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">dst</span><span class="p">,</span> <span class="n">dirs_exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
                <span class="k">else</span><span class="p">:</span>
                    <span class="n">shutil</span><span class="p">.</span><span class="n">copy2</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">dst</span><span class="p">)</span>

                <span class="n">saved_count</span> <span class="o">+=</span> <span class="mi">1</span>

        <span class="k">return</span> <span class="n">saved_count</span>

    <span class="k">def</span> <span class="nf">load_artifacts</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">job_names</span><span class="p">):</span>
        <span class="s">"""Load artifacts from dependent jobs."""</span>
        <span class="n">loaded_count</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="k">for</span> <span class="n">job_name</span> <span class="ow">in</span> <span class="n">job_names</span><span class="p">:</span>
            <span class="n">job_artifact_dir</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">artifact_dir</span> <span class="o">/</span> <span class="n">job_name</span>
            <span class="k">if</span> <span class="ow">not</span> <span class="n">job_artifact_dir</span><span class="p">.</span><span class="n">exists</span><span class="p">():</span>
                <span class="k">continue</span>

            <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">job_artifact_dir</span><span class="p">.</span><span class="n">rglob</span><span class="p">(</span><span class="s">'*'</span><span class="p">):</span>
                <span class="k">if</span> <span class="n">item</span><span class="p">.</span><span class="n">is_file</span><span class="p">():</span>
                    <span class="n">rel_path</span> <span class="o">=</span> <span class="n">item</span><span class="p">.</span><span class="n">relative_to</span><span class="p">(</span><span class="n">job_artifact_dir</span><span class="p">)</span>
                    <span class="n">dst</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">workspace</span> <span class="o">/</span> <span class="n">rel_path</span>
                    <span class="n">dst</span><span class="p">.</span><span class="n">parent</span><span class="p">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">parents</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
                    <span class="n">shutil</span><span class="p">.</span><span class="n">copy2</span><span class="p">(</span><span class="n">item</span><span class="p">,</span> <span class="n">dst</span><span class="p">)</span>
                    <span class="n">loaded_count</span> <span class="o">+=</span> <span class="mi">1</span>

        <span class="k">return</span> <span class="n">loaded_count</span>

    <span class="k">def</span> <span class="nf">cleanup</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Remove all artifacts."""</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">artifact_dir</span><span class="p">.</span><span class="n">exists</span><span class="p">():</span>
            <span class="n">shutil</span><span class="p">.</span><span class="n">rmtree</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">artifact_dir</span><span class="p">)</span>


<span class="k">class</span> <span class="nc">JobExecutor</span><span class="p">:</span>
    <span class="s">"""Executes a job in a Docker container."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">workspace</span><span class="p">,</span> <span class="n">artifact_manager</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">workspace</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">workspace</span><span class="p">).</span><span class="n">resolve</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">artifact_manager</span> <span class="o">=</span> <span class="n">artifact_manager</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">job</span><span class="p">,</span> <span class="n">output_queue</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="s">"""Execute a job with timeout and proper error handling."""</span>

        <span class="k">def</span> <span class="nf">log</span><span class="p">(</span><span class="n">msg</span><span class="p">):</span>
            <span class="k">if</span> <span class="n">output_queue</span><span class="p">:</span>
                <span class="n">output_queue</span><span class="p">.</span><span class="n">put</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="k">print</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span>

        <span class="n">start_time</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
        <span class="n">log</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] Starting job..."</span><span class="p">)</span>
        <span class="n">log</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] Image: </span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">image</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

        <span class="c1"># Load artifacts from dependencies
</span>        <span class="k">if</span> <span class="n">job</span><span class="p">.</span><span class="n">needs</span><span class="p">:</span>
            <span class="n">log</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] Loading artifacts from dependencies..."</span><span class="p">)</span>
            <span class="n">count</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">artifact_manager</span><span class="p">.</span><span class="n">load_artifacts</span><span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">needs</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">count</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
                <span class="n">log</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] Loaded </span><span class="si">{</span><span class="n">count</span><span class="si">}</span><span class="s"> artifact file(s)"</span><span class="p">)</span>

        <span class="n">script</span> <span class="o">=</span> <span class="s">' &amp;&amp; '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">script</span><span class="p">)</span>

        <span class="n">cmd</span> <span class="o">=</span> <span class="p">[</span>
            <span class="s">'docker'</span><span class="p">,</span> <span class="s">'run'</span><span class="p">,</span> <span class="s">'--rm'</span><span class="p">,</span>
            <span class="s">'-v'</span><span class="p">,</span> <span class="sa">f</span><span class="s">'</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">workspace</span><span class="si">}</span><span class="s">:/workspace'</span><span class="p">,</span>
            <span class="s">'-w'</span><span class="p">,</span> <span class="s">'/workspace'</span><span class="p">,</span>
            <span class="n">job</span><span class="p">.</span><span class="n">image</span><span class="p">,</span>
            <span class="s">'sh'</span><span class="p">,</span> <span class="s">'-c'</span><span class="p">,</span> <span class="n">script</span>
        <span class="p">]</span>

        <span class="k">try</span><span class="p">:</span>
            <span class="n">process</span> <span class="o">=</span> <span class="n">subprocess</span><span class="p">.</span><span class="n">Popen</span><span class="p">(</span>
                <span class="n">cmd</span><span class="p">,</span>
                <span class="n">stdout</span><span class="o">=</span><span class="n">subprocess</span><span class="p">.</span><span class="n">PIPE</span><span class="p">,</span>
                <span class="n">stderr</span><span class="o">=</span><span class="n">subprocess</span><span class="p">.</span><span class="n">STDOUT</span><span class="p">,</span>
                <span class="n">text</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
                <span class="n">bufsize</span><span class="o">=</span><span class="mi">1</span>
            <span class="p">)</span>

            <span class="c1"># Stream output line by line. The timeout is only checked when a new
</span>            <span class="c1"># line arrives, so a job that hangs without output is not interrupted.
</span>            <span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
            <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">process</span><span class="p">.</span><span class="n">stdout</span><span class="p">:</span>
                <span class="k">if</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span> <span class="o">&gt;</span> <span class="n">job</span><span class="p">.</span><span class="n">timeout</span><span class="p">:</span>
                    <span class="n">process</span><span class="p">.</span><span class="n">kill</span><span class="p">()</span>
                    <span class="c1"># Reap the killed process before returning
</span>                    <span class="n">process</span><span class="p">.</span><span class="n">wait</span><span class="p">()</span>
                    <span class="n">log</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] ✗ Job timed out after </span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">timeout</span><span class="si">}</span><span class="s">s"</span><span class="p">)</span>
                    <span class="k">return</span> <span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="bp">False</span><span class="p">,</span> <span class="s">"Timeout"</span><span class="p">)</span>

                <span class="n">log</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] </span><span class="si">{</span><span class="n">line</span><span class="p">.</span><span class="n">rstrip</span><span class="p">()</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

            <span class="n">process</span><span class="p">.</span><span class="n">wait</span><span class="p">()</span>

            <span class="k">if</span> <span class="n">process</span><span class="p">.</span><span class="n">returncode</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
                <span class="c1"># Save artifacts
</span>                <span class="k">if</span> <span class="n">job</span><span class="p">.</span><span class="n">artifacts</span><span class="p">:</span>
                    <span class="n">log</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] Saving artifacts..."</span><span class="p">)</span>
                    <span class="n">count</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">artifact_manager</span><span class="p">.</span><span class="n">save_artifacts</span><span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="n">job</span><span class="p">.</span><span class="n">artifacts</span><span class="p">)</span>
                    <span class="k">if</span> <span class="n">count</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
                        <span class="n">log</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] Saved </span><span class="si">{</span><span class="n">count</span><span class="si">}</span><span class="s"> artifact(s)"</span><span class="p">)</span>

                <span class="n">duration</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start_time</span>
                <span class="n">log</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] ✓ Job completed successfully (</span><span class="si">{</span><span class="n">duration</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s">s)"</span><span class="p">)</span>
                <span class="k">return</span> <span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="bp">True</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="n">error_msg</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"Exit code </span><span class="si">{</span><span class="n">process</span><span class="p">.</span><span class="n">returncode</span><span class="si">}</span><span class="s">"</span>
                <span class="n">log</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] ✗ Job failed: </span><span class="si">{</span><span class="n">error_msg</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
                <span class="k">return</span> <span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="bp">False</span><span class="p">,</span> <span class="n">error_msg</span><span class="p">)</span>

        <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
            <span class="n">error_msg</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">)</span>
            <span class="n">log</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">] ✗ Error: </span><span class="si">{</span><span class="n">error_msg</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            <span class="k">return</span> <span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="bp">False</span><span class="p">,</span> <span class="n">error_msg</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">run_job_parallel</span><span class="p">(</span><span class="n">job</span><span class="p">,</span> <span class="n">workspace</span><span class="p">,</span> <span class="n">artifact_manager</span><span class="p">,</span> <span class="n">output_queue</span><span class="p">):</span>
    <span class="s">"""Helper function for parallel execution."""</span>
    <span class="n">executor</span> <span class="o">=</span> <span class="n">JobExecutor</span><span class="p">(</span><span class="n">workspace</span><span class="p">,</span> <span class="n">artifact_manager</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">executor</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">job</span><span class="p">,</span> <span class="n">output_queue</span><span class="p">)</span>


<span class="k">class</span> <span class="nc">Pipeline</span><span class="p">:</span>
    <span class="s">"""Complete pipeline runner with all features."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">config_file</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">config_file</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">config_file</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">config</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_load_config</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">stages</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'stages'</span><span class="p">,</span> <span class="p">[</span><span class="s">'test'</span><span class="p">])</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">variables</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'variables'</span><span class="p">,</span> <span class="p">{})</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">jobs</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_parse_jobs</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">current_branch</span> <span class="o">=</span> <span class="n">get_current_branch</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">_load_config</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Load and parse YAML configuration."""</span>
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">config_file</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">yaml</span><span class="p">.</span><span class="n">safe_load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">_parse_jobs</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Parse jobs from configuration."""</span>
        <span class="n">jobs</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">job_name</span><span class="p">,</span> <span class="n">job_config</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
            <span class="k">if</span> <span class="n">job_name</span> <span class="ow">not</span> <span class="ow">in</span> <span class="p">[</span><span class="s">'stages'</span><span class="p">,</span> <span class="s">'variables'</span><span class="p">]</span> <span class="ow">and</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">job_config</span><span class="p">,</span> <span class="nb">dict</span><span class="p">):</span>
                <span class="n">jobs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">Job</span><span class="p">(</span><span class="n">job_name</span><span class="p">,</span> <span class="n">job_config</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">variables</span><span class="p">))</span>
        <span class="k">return</span> <span class="n">jobs</span>

    <span class="k">def</span> <span class="nf">_topological_sort</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">jobs</span><span class="p">):</span>
        <span class="s">"""Sort jobs in topological order based on dependencies."""</span>
        <span class="n">job_map</span> <span class="o">=</span> <span class="p">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="p">:</span> <span class="n">job</span> <span class="k">for</span> <span class="n">job</span> <span class="ow">in</span> <span class="n">jobs</span><span class="p">}</span>
        <span class="n">in_degree</span> <span class="o">=</span> <span class="p">{</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="p">:</span> <span class="mi">0</span> <span class="k">for</span> <span class="n">job</span> <span class="ow">in</span> <span class="n">jobs</span><span class="p">}</span>
        <span class="n">adjacency</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">list</span><span class="p">)</span>

        <span class="k">for</span> <span class="n">job</span> <span class="ow">in</span> <span class="n">jobs</span><span class="p">:</span>
            <span class="k">for</span> <span class="n">dep</span> <span class="ow">in</span> <span class="n">job</span><span class="p">.</span><span class="n">needs</span><span class="p">:</span>
                <span class="k">if</span> <span class="n">dep</span> <span class="ow">in</span> <span class="n">job_map</span><span class="p">:</span>
                    <span class="n">adjacency</span><span class="p">[</span><span class="n">dep</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="p">)</span>
                    <span class="n">in_degree</span><span class="p">[</span><span class="n">job</span><span class="p">.</span><span class="n">name</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span>

        <span class="n">queue</span> <span class="o">=</span> <span class="n">deque</span><span class="p">([</span><span class="n">name</span> <span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">degree</span> <span class="ow">in</span> <span class="n">in_degree</span><span class="p">.</span><span class="n">items</span><span class="p">()</span> <span class="k">if</span> <span class="n">degree</span> <span class="o">==</span> <span class="mi">0</span><span class="p">])</span>
        <span class="n">execution_order</span> <span class="o">=</span> <span class="p">[]</span>

        <span class="k">while</span> <span class="n">queue</span><span class="p">:</span>
            <span class="n">current_batch</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">queue</span><span class="p">)</span>
            <span class="n">execution_order</span><span class="p">.</span><span class="n">append</span><span class="p">([</span><span class="n">job_map</span><span class="p">[</span><span class="n">name</span><span class="p">]</span> <span class="k">for</span> <span class="n">name</span> <span class="ow">in</span> <span class="n">current_batch</span><span class="p">])</span>
            <span class="n">queue</span><span class="p">.</span><span class="n">clear</span><span class="p">()</span>

            <span class="k">for</span> <span class="n">job_name</span> <span class="ow">in</span> <span class="n">current_batch</span><span class="p">:</span>
                <span class="k">for</span> <span class="n">dependent</span> <span class="ow">in</span> <span class="n">adjacency</span><span class="p">[</span><span class="n">job_name</span><span class="p">]:</span>
                    <span class="n">in_degree</span><span class="p">[</span><span class="n">dependent</span><span class="p">]</span> <span class="o">-=</span> <span class="mi">1</span>
                    <span class="k">if</span> <span class="n">in_degree</span><span class="p">[</span><span class="n">dependent</span><span class="p">]</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
                        <span class="n">queue</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">dependent</span><span class="p">)</span>

        <span class="k">if</span> <span class="nb">sum</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">batch</span><span class="p">)</span> <span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">execution_order</span><span class="p">)</span> <span class="o">!=</span> <span class="nb">len</span><span class="p">(</span><span class="n">jobs</span><span class="p">):</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="s">"Circular dependency detected in job dependencies"</span><span class="p">)</span>

        <span class="k">return</span> <span class="n">execution_order</span>

    <span class="k">def</span> <span class="nf">_group_jobs_by_stage</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Group jobs by their stage."""</span>
        <span class="n">stages</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">list</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">job</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">jobs</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">job</span><span class="p">.</span><span class="n">should_run</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">current_branch</span><span class="p">):</span>
                <span class="n">stages</span><span class="p">[</span><span class="n">job</span><span class="p">.</span><span class="n">stage</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">job</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">stages</span>

    <span class="k">def</span> <span class="nf">_execute_job_batch</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">jobs</span><span class="p">,</span> <span class="n">workspace</span><span class="p">,</span> <span class="n">artifact_manager</span><span class="p">):</span>
        <span class="s">"""Execute a batch of jobs in parallel."""</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">jobs</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
            <span class="n">executor</span> <span class="o">=</span> <span class="n">JobExecutor</span><span class="p">(</span><span class="n">workspace</span><span class="p">,</span> <span class="n">artifact_manager</span><span class="p">)</span>
            <span class="n">job_name</span><span class="p">,</span> <span class="n">success</span><span class="p">,</span> <span class="n">error</span> <span class="o">=</span> <span class="n">executor</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">jobs</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
            <span class="k">return</span> <span class="p">[(</span><span class="n">job_name</span><span class="p">,</span> <span class="n">success</span><span class="p">,</span> <span class="n">error</span><span class="p">)]</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">manager</span> <span class="o">=</span> <span class="n">Manager</span><span class="p">()</span>
            <span class="n">output_queue</span> <span class="o">=</span> <span class="n">manager</span><span class="p">.</span><span class="n">Queue</span><span class="p">()</span>

            <span class="n">run_func</span> <span class="o">=</span> <span class="n">partial</span><span class="p">(</span>
                <span class="n">run_job_parallel</span><span class="p">,</span>
                <span class="n">workspace</span><span class="o">=</span><span class="n">workspace</span><span class="p">,</span>
                <span class="n">artifact_manager</span><span class="o">=</span><span class="n">artifact_manager</span><span class="p">,</span>
                <span class="n">output_queue</span><span class="o">=</span><span class="n">output_queue</span>
            <span class="p">)</span>

            <span class="k">with</span> <span class="n">Pool</span><span class="p">(</span><span class="n">processes</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">jobs</span><span class="p">))</span> <span class="k">as</span> <span class="n">pool</span><span class="p">:</span>
                <span class="n">results</span> <span class="o">=</span> <span class="n">pool</span><span class="p">.</span><span class="n">map_async</span><span class="p">(</span><span class="n">run_func</span><span class="p">,</span> <span class="n">jobs</span><span class="p">)</span>

                <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
                    <span class="k">if</span> <span class="n">results</span><span class="p">.</span><span class="n">ready</span><span class="p">()</span> <span class="ow">and</span> <span class="n">output_queue</span><span class="p">.</span><span class="n">empty</span><span class="p">():</span>
                        <span class="k">break</span>

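                    <span class="k">if</span> <span class="ow">not</span> <span class="n">output_queue</span><span class="p">.</span><span class="n">empty</span><span class="p">():</span>
                        <span class="k">print</span><span class="p">(</span><span class="n">output_queue</span><span class="p">.</span><span class="n">get</span><span class="p">())</span>
                    <span class="k">else</span><span class="p">:</span>
                        <span class="c1"># Nothing to print yet: sleep briefly instead of busy-waiting.</span>
                        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">0.05</span><span class="p">)</span>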

                <span class="k">return</span> <span class="n">results</span><span class="p">.</span><span class="n">get</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">workspace</span><span class="o">=</span><span class="s">'.'</span><span class="p">):</span>
        <span class="s">"""Execute complete pipeline."""</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="si">{</span><span class="s">'='</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"Pipeline Runner v5"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="s">'='</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Config: </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">config_file</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Branch: </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">current_branch</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Stages: </span><span class="si">{</span><span class="s">' → '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">stages</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Total jobs: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">jobs</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">variables</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Variables: </span><span class="si">{</span><span class="s">', '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="sa">f</span><span class="s">'</span><span class="si">{</span><span class="n">k</span><span class="si">}</span><span class="o">=</span><span class="si">{</span><span class="n">v</span><span class="si">}</span><span class="s">' for k, v in self.variables.items())</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="s">'='</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>

        <span class="n">workspace</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">workspace</span><span class="p">).</span><span class="n">resolve</span><span class="p">()</span>
        <span class="n">artifact_manager</span> <span class="o">=</span> <span class="n">ArtifactManager</span><span class="p">(</span><span class="n">workspace</span><span class="p">)</span>
        <span class="n">stages_with_jobs</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_group_jobs_by_stage</span><span class="p">()</span>

        <span class="c1"># Count jobs that will run
</span>        <span class="n">total_jobs</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">jobs</span><span class="p">)</span> <span class="k">for</span> <span class="n">jobs</span> <span class="ow">in</span> <span class="n">stages_with_jobs</span><span class="p">.</span><span class="n">values</span><span class="p">())</span>
        <span class="k">if</span> <span class="n">total_jobs</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
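            <span class="k">print</span><span class="p">(</span><span class="s">"No jobs to run on this branch."</span><span class="p">)</span>
            <span class="c1"># This early return happens before the try/finally, so clean up here.</span>
            <span class="n">artifact_manager</span><span class="p">.</span><span class="n">cleanup</span><span class="p">()</span>
            <span class="k">return</span> <span class="bp">True</span>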

        <span class="n">pipeline_start</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>

        <span class="k">try</span><span class="p">:</span>
            <span class="k">for</span> <span class="n">stage</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">stages</span><span class="p">:</span>
                <span class="n">stage_jobs</span> <span class="o">=</span> <span class="n">stages_with_jobs</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">stage</span><span class="p">,</span> <span class="p">[])</span>

                <span class="k">if</span> <span class="ow">not</span> <span class="n">stage_jobs</span><span class="p">:</span>
                    <span class="k">continue</span>

                <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="si">{</span><span class="s">'─'</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
                <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Stage: </span><span class="si">{</span><span class="n">stage</span><span class="si">}</span><span class="s"> (</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">stage_jobs</span><span class="p">)</span><span class="si">}</span><span class="s"> job(s))"</span><span class="p">)</span>
                <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="s">'─'</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>

                <span class="k">try</span><span class="p">:</span>
                    <span class="n">execution_batches</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_topological_sort</span><span class="p">(</span><span class="n">stage_jobs</span><span class="p">)</span>
                <span class="k">except</span> <span class="nb">ValueError</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
                    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"✗ Error: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
                    <span class="k">return</span> <span class="bp">False</span>

                <span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">execution_batches</span><span class="p">:</span>
                    <span class="n">job_results</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_execute_job_batch</span><span class="p">(</span><span class="n">batch</span><span class="p">,</span> <span class="n">workspace</span><span class="p">,</span> <span class="n">artifact_manager</span><span class="p">)</span>

                    <span class="k">if</span> <span class="ow">not</span> <span class="nb">all</span><span class="p">(</span><span class="n">success</span> <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">success</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">job_results</span><span class="p">):</span>
                        <span class="n">failed_jobs</span> <span class="o">=</span> <span class="p">[</span><span class="n">name</span> <span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">success</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">job_results</span> <span class="k">if</span> <span class="ow">not</span> <span class="n">success</span><span class="p">]</span>
                        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="si">{</span><span class="s">'='</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
                        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"✗ Pipeline failed at stage '</span><span class="si">{</span><span class="n">stage</span><span class="si">}</span><span class="s">'"</span><span class="p">)</span>
                        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  Failed jobs: </span><span class="si">{</span><span class="s">', '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">failed_jobs</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
                        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="s">'='</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
                        <span class="k">return</span> <span class="bp">False</span>

            <span class="n">duration</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">pipeline_start</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="si">{</span><span class="s">'='</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"✓ Pipeline completed successfully!"</span><span class="p">)</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  Duration: </span><span class="si">{</span><span class="n">duration</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s">s"</span><span class="p">)</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  Jobs executed: </span><span class="si">{</span><span class="n">total_jobs</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="s">'='</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
            <span class="k">return</span> <span class="bp">True</span>

        <span class="k">finally</span><span class="p">:</span>
            <span class="n">artifact_manager</span><span class="p">.</span><span class="n">cleanup</span><span class="p">()</span>


<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="s">"""CLI entry point."""</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"Pipeline Runner v5 - A minimal CI/CD pipeline runner"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">Usage:"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"  python runner.py &lt;pipeline.yml&gt; [workspace]"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">Example:"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"  python runner.py .gitlab-ci.yml"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"  python runner.py pipeline.yml /path/to/workspace"</span><span class="p">)</span>
        <span class="n">sys</span><span class="p">.</span><span class="nb">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>

    <span class="n">config_file</span> <span class="o">=</span> <span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">workspace</span> <span class="o">=</span> <span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">2</span> <span class="k">else</span> <span class="s">'.'</span>

    <span class="k">if</span> <span class="ow">not</span> <span class="n">Path</span><span class="p">(</span><span class="n">config_file</span><span class="p">).</span><span class="n">exists</span><span class="p">():</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Error: Config file '</span><span class="si">{</span><span class="n">config_file</span><span class="si">}</span><span class="s">' not found"</span><span class="p">)</span>
        <span class="n">sys</span><span class="p">.</span><span class="nb">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>

    <span class="k">try</span><span class="p">:</span>
        <span class="n">pipeline</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">(</span><span class="n">config_file</span><span class="p">)</span>
        <span class="n">success</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">workspace</span><span class="p">)</span>
        <span class="n">sys</span><span class="p">.</span><span class="nb">exit</span><span class="p">(</span><span class="mi">0</span> <span class="k">if</span> <span class="n">success</span> <span class="k">else</span> <span class="mi">1</span><span class="p">)</span>
    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Fatal error: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="kn">import</span> <span class="nn">traceback</span>
        <span class="n">traceback</span><span class="p">.</span><span class="n">print_exc</span><span class="p">()</span>
        <span class="n">sys</span><span class="p">.</span><span class="nb">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">main</span><span class="p">()</span>
</code></pre></div></div>
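
<p>The batching done by <code class="language-plaintext highlighter-rouge">_topological_sort</code> is easier to see in isolation. Here is a minimal sketch of the same idea (Kahn's algorithm, emitted one level at a time) that works on plain job names instead of <code class="language-plaintext highlighter-rouge">Job</code> objects. The function name <code class="language-plaintext highlighter-rouge">batch_by_dependencies</code> and the job names in the example are made up for illustration and are not part of the runner.</p>

```python
from collections import defaultdict, deque


def batch_by_dependencies(needs):
    """Group job names into batches; each batch only depends on earlier ones.

    `needs` maps a job name to the list of job names it depends on.
    Dependencies that are not keys of `needs` are ignored, mirroring the
    `if dep in job_map` check in `_topological_sort`.
    """
    in_degree = {name: 0 for name in needs}
    dependents = defaultdict(list)

    for name, deps in needs.items():
        for dep in deps:
            if dep in needs:
                dependents[dep].append(name)
                in_degree[name] += 1

    # Jobs with no unmet dependencies form the first batch.
    ready = deque(name for name, degree in in_degree.items() if degree == 0)
    batches = []

    while ready:
        batch = list(ready)
        batches.append(batch)
        ready.clear()

        # Finishing a batch may unblock the jobs that depend on it.
        for name in batch:
            for dependent in dependents[name]:
                in_degree[dependent] -= 1
                if in_degree[dependent] == 0:
                    ready.append(dependent)

    # Jobs caught in a cycle never reach in-degree zero, so they are never batched.
    if sum(len(batch) for batch in batches) != len(needs):
        raise ValueError("Circular dependency detected in job dependencies")

    return batches


# Hypothetical job names, chosen only for this illustration.
example = {
    "lint": [],
    "compile": [],
    "unit": ["compile"],
    "package": ["lint", "unit"],
}
print(batch_by_dependencies(example))
# [['lint', 'compile'], ['unit'], ['package']]
```

<p>Each inner list is one batch. Jobs inside a batch have no dependencies on each other, so they can run in parallel, and the runner hands these batches to <code class="language-plaintext highlighter-rouge">_execute_job_batch</code> one at a time. A cycle such as <code class="language-plaintext highlighter-rouge">{"a": ["b"], "b": ["a"]}</code> leaves both jobs unbatched and raises the <code class="language-plaintext highlighter-rouge">ValueError</code>.</p>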

<h4 id="testing-version-5">Testing Version 5</h4>

<p>Create a complete pipeline with all features:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># pipeline.yml</span>
<span class="na">stages</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s">build</span>
  <span class="pi">-</span> <span class="s">test</span>
  <span class="pi">-</span> <span class="s">deploy</span>

<span class="na">variables</span><span class="pi">:</span>
  <span class="na">APP_VERSION</span><span class="pi">:</span> <span class="s2">"</span><span class="s">1.0.0"</span>
  <span class="na">PYTHON_IMAGE</span><span class="pi">:</span> <span class="s2">"</span><span class="s">python:3.11"</span>

<span class="na">build-app</span><span class="pi">:</span>
  <span class="na">stage</span><span class="pi">:</span> <span class="s">build</span>
  <span class="na">image</span><span class="pi">:</span> <span class="s">${PYTHON_IMAGE}</span>
  <span class="na">script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">echo "Building version ${APP_VERSION}..."</span>
    <span class="pi">-</span> <span class="s">mkdir -p dist</span>
    <span class="pi">-</span> <span class="s">echo "app-${APP_VERSION}" &gt; dist/app.txt</span>
    <span class="pi">-</span> <span class="s">echo "Build complete!"</span>
  <span class="na">artifacts</span><span class="pi">:</span>
    <span class="na">paths</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">dist/</span>

<span class="na">unit-tests</span><span class="pi">:</span>
  <span class="na">stage</span><span class="pi">:</span> <span class="s">test</span>
  <span class="na">image</span><span class="pi">:</span> <span class="s">${PYTHON_IMAGE}</span>
  <span class="na">needs</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">build-app</span>
  <span class="na">script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">echo "Running unit tests..."</span>
    <span class="pi">-</span> <span class="s">cat dist/app.txt</span>
    <span class="pi">-</span> <span class="s">sleep </span><span class="m">2</span>
    <span class="pi">-</span> <span class="s">echo "Unit tests passed!"</span>

<span class="na">integration-tests</span><span class="pi">:</span>
  <span class="na">stage</span><span class="pi">:</span> <span class="s">test</span>
  <span class="na">image</span><span class="pi">:</span> <span class="s">${PYTHON_IMAGE}</span>
  <span class="na">needs</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">build-app</span>
  <span class="na">script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">echo "Running integration tests..."</span>
    <span class="pi">-</span> <span class="s">cat dist/app.txt</span>
    <span class="pi">-</span> <span class="s">sleep </span><span class="m">2</span>
    <span class="pi">-</span> <span class="s">echo "Integration tests passed!"</span>

<span class="na">deploy-staging</span><span class="pi">:</span>
  <span class="na">stage</span><span class="pi">:</span> <span class="s">deploy</span>
  <span class="na">image</span><span class="pi">:</span> <span class="s">alpine:latest</span>
  <span class="na">needs</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">unit-tests</span>
    <span class="pi">-</span> <span class="s">integration-tests</span>
  <span class="na">script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">echo "Deploying to staging..."</span>
    <span class="pi">-</span> <span class="s">cat dist/app.txt</span>
    <span class="pi">-</span> <span class="s">echo "Deployed to staging!"</span>

<span class="na">deploy-production</span><span class="pi">:</span>
  <span class="na">stage</span><span class="pi">:</span> <span class="s">deploy</span>
  <span class="na">image</span><span class="pi">:</span> <span class="s">alpine:latest</span>
  <span class="na">needs</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">unit-tests</span>
    <span class="pi">-</span> <span class="s">integration-tests</span>
  <span class="na">only</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">main</span>
  <span class="na">script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">echo "Deploying to production..."</span>
    <span class="pi">-</span> <span class="s">cat dist/app.txt</span>
    <span class="pi">-</span> <span class="s">echo "Deployed to production!"</span>
</code></pre></div></div>

<p>Run it:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python runner.py pipeline.yml
</code></pre></div></div>

<p>You’ll see:</p>
<ul>
  <li>Variable substitution (<code class="language-plaintext highlighter-rouge">${APP_VERSION}</code>, <code class="language-plaintext highlighter-rouge">${PYTHON_IMAGE}</code>)</li>
  <li>Dependency-based execution order</li>
  <li>Parallel test execution</li>
  <li>Branch filtering (deploy-production only on main)</li>
  <li>Execution time tracking</li>
  <li>Proper artifact passing</li>
</ul>
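
<p>To make the variable substitution step concrete, here is a minimal sketch built on Python’s <code class="language-plaintext highlighter-rouge">string.Template</code>, which happens to understand the same <code class="language-plaintext highlighter-rouge">${NAME}</code> syntax. The helper name <code class="language-plaintext highlighter-rouge">substitute_variables</code> is hypothetical, and the runner above may well do this differently.</p>

```python
from string import Template


def substitute_variables(text, variables):
    """Replace ${NAME} placeholders in text using the variables mapping.

    safe_substitute leaves unknown placeholders untouched instead of raising,
    so shell variables that the pipeline does not define survive as-is.
    """
    return Template(text).safe_substitute(variables)


# Values copied from the pipeline.yml example above.
variables = {"APP_VERSION": "1.0.0", "PYTHON_IMAGE": "python:3.11"}

print(substitute_variables('echo "Building version ${APP_VERSION}..."', variables))
print(substitute_variables("${PYTHON_IMAGE}", variables))
print(substitute_variables("${UNDEFINED}", variables))
```

<p>One caveat of this shortcut: <code class="language-plaintext highlighter-rouge">string.Template</code> also treats <code class="language-plaintext highlighter-rouge">$$</code> as an escaped dollar sign, so a script that relies on the shell’s <code class="language-plaintext highlighter-rouge">$$</code> would need a stricter, regex-based substitution.</p>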

<p><strong>We now have a fully functional CI/CD pipeline runner!</strong></p>

<h2 id="testing-your-pipeline-runner">Testing Your Pipeline Runner</h2>

<p>Let’s test it with a real Python project. Create this structure:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>my-project/
├── pipeline.yml
├── src/
│   └── calculator.py
└── tests/
    └── test_calculator.py
</code></pre></div></div>

<p><strong>src/calculator.py:</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">add</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span>

<span class="k">def</span> <span class="nf">multiply</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">a</span> <span class="o">*</span> <span class="n">b</span>
</code></pre></div></div>

<p><strong>tests/test_calculator.py:</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">sys</span>
<span class="n">sys</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="s">'src'</span><span class="p">)</span>

<span class="kn">from</span> <span class="nn">calculator</span> <span class="kn">import</span> <span class="n">add</span><span class="p">,</span> <span class="n">multiply</span>

<span class="k">def</span> <span class="nf">test_add</span><span class="p">():</span>
    <span class="k">assert</span> <span class="n">add</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span> <span class="o">==</span> <span class="mi">5</span>
    <span class="k">assert</span> <span class="n">add</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span>

<span class="k">def</span> <span class="nf">test_multiply</span><span class="p">():</span>
    <span class="k">assert</span> <span class="n">multiply</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span> <span class="o">==</span> <span class="mi">12</span>
    <span class="k">assert</span> <span class="n">multiply</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">test_add</span><span class="p">()</span>
    <span class="n">test_multiply</span><span class="p">()</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"All tests passed!"</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>pipeline.yml:</strong></p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">stages</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s">test</span>
  <span class="pi">-</span> <span class="s">build</span>
  <span class="pi">-</span> <span class="s">deploy</span>

<span class="na">variables</span><span class="pi">:</span>
  <span class="na">PYTHON_VERSION</span><span class="pi">:</span> <span class="s2">"</span><span class="s">3.11"</span>

<span class="na">lint-code</span><span class="pi">:</span>
  <span class="na">stage</span><span class="pi">:</span> <span class="s">test</span>
  <span class="na">image</span><span class="pi">:</span> <span class="s">python:${PYTHON_VERSION}</span>
  <span class="na">script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">echo "Linting code..."</span>
    <span class="pi">-</span> <span class="s">python -m py_compile src/*.py</span>
    <span class="pi">-</span> <span class="s">echo "Lint passed!"</span>

<span class="na">unit-tests</span><span class="pi">:</span>
  <span class="na">stage</span><span class="pi">:</span> <span class="s">test</span>
  <span class="na">image</span><span class="pi">:</span> <span class="s">python:${PYTHON_VERSION}</span>
  <span class="na">script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">echo "Running unit tests..."</span>
    <span class="pi">-</span> <span class="s">python tests/test_calculator.py</span>
    <span class="pi">-</span> <span class="s">echo "Tests passed!"</span>

<span class="na">build-package</span><span class="pi">:</span>
  <span class="na">stage</span><span class="pi">:</span> <span class="s">build</span>
  <span class="na">image</span><span class="pi">:</span> <span class="s">python:${PYTHON_VERSION}</span>
  <span class="na">needs</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">lint-code</span>
    <span class="pi">-</span> <span class="s">unit-tests</span>
  <span class="na">script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">echo "Building package..."</span>
    <span class="pi">-</span> <span class="s">mkdir -p dist</span>
    <span class="pi">-</span> <span class="s">cp -r src dist/</span>
    <span class="pi">-</span> <span class="s">echo "v1.0.0" &gt; dist/VERSION</span>
  <span class="na">artifacts</span><span class="pi">:</span>
    <span class="na">paths</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">dist/</span>

<span class="na">deploy-app</span><span class="pi">:</span>
  <span class="na">stage</span><span class="pi">:</span> <span class="s">deploy</span>
  <span class="na">image</span><span class="pi">:</span> <span class="s">alpine:latest</span>
  <span class="na">needs</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">build-package</span>
  <span class="na">script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">echo "Deploying application..."</span>
    <span class="pi">-</span> <span class="s">ls -la dist/</span>
    <span class="pi">-</span> <span class="s">cat dist/VERSION</span>
    <span class="pi">-</span> <span class="s">echo "Deployment complete!"</span>
</code></pre></div></div>

<p>Run your pipeline:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python runner.py pipeline.yml
</code></pre></div></div>

<p>You’ll see:</p>
<ol>
  <li><code class="language-plaintext highlighter-rouge">lint-code</code> and <code class="language-plaintext highlighter-rouge">unit-tests</code> run in parallel</li>
  <li><code class="language-plaintext highlighter-rouge">build-package</code> waits for both to complete</li>
  <li>Artifacts (dist/) are passed to <code class="language-plaintext highlighter-rouge">deploy-app</code></li>
  <li>Total execution time is optimized through parallelization</li>
</ol>

<h2 id="what-we-built-vs-what-production-runners-do">What We Built vs. What Production Runners Do</h2>

<p>Our pipeline runner demonstrates the core concepts, but production tools like GitLab Runner and GitHub Actions have many more features:</p>

<p><strong>What we built:</strong></p>
<ul>
  <li>✅ Multi-stage pipelines</li>
  <li>✅ Parallel job execution within stages</li>
  <li>✅ Job dependencies (needs/dependencies)</li>
  <li>✅ Artifact passing between jobs</li>
  <li>✅ Variable substitution</li>
  <li>✅ Branch filtering (only: branches)</li>
  <li>✅ Job timeouts</li>
  <li>✅ Real-time log streaming</li>
  <li>✅ Topological sorting for dependencies</li>
  <li>✅ Docker container isolation</li>
</ul>

<p><strong>What production runners add:</strong></p>
<ul>
  <li><strong>Distributed execution</strong>: Run jobs across multiple machines</li>
  <li><strong>Caching</strong>: Cache dependencies (node_modules, pip packages) between runs</li>
  <li><strong>Matrix builds</strong>: Run same job with different parameters (Python 3.9, 3.10, 3.11)</li>
  <li><strong>Webhooks</strong>: Trigger on git push, PR, tag events</li>
  <li><strong>Secrets management</strong>: Secure credential storage and injection</li>
  <li><strong>Web UI</strong>: Pipeline visualization and browsable logs</li>
  <li><strong>Docker-in-Docker</strong>: Build Docker images within jobs</li>
  <li><strong>Service containers</strong>: Database/Redis for integration tests</li>
  <li><strong>Retry mechanisms</strong>: Automatic retry on transient failures</li>
  <li><strong>Manual triggers</strong>: Approval gates for deployments</li>
  <li><strong>Resource management</strong>: CPU/memory limits per job</li>
  <li><strong>Security features</strong>: Isolation, sandboxing, permission control</li>
</ul>

<h2 id="understanding-the-architecture">Understanding the Architecture</h2>

<p>Our runner has four main components:</p>

<h3 id="1-configuration-parser">1. Configuration Parser</h3>
<p>Reads YAML and builds job objects with all metadata (stage, dependencies, artifacts, etc.)</p>

<h3 id="2-dependency-resolver">2. Dependency Resolver</h3>
<p>Uses topological sorting to determine execution order. This ensures jobs run only after their dependencies complete.</p>
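
<p>As a rough illustration of that idea, the sketch below groups jobs into levels with Kahn’s algorithm. The function name <code class="language-plaintext highlighter-rouge">build_batches</code> and the plain dictionary input are assumptions made for this example, not the runner’s actual data structures.</p>

```python
def build_batches(needs):
    """Group jobs into batches so that every job's dependencies sit in an
    earlier batch (Kahn's algorithm, processed one level at a time)."""
    remaining = {job: set(deps) for job, deps in needs.items()}

    # Reject dependencies on jobs that were never defined.
    for job, deps in remaining.items():
        unknown = deps.difference(remaining)
        if unknown:
            raise ValueError(f"{job} needs unknown job(s): {sorted(unknown)}")

    batches = []
    while remaining:
        # Jobs with no unmet dependencies can run now.
        ready = sorted(job for job, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("Dependency cycle among: " + ", ".join(sorted(remaining)))
        batches.append(ready)
        for job in ready:
            del remaining[job]
        for deps in remaining.values():
            deps.difference_update(ready)
    return batches


# Job names taken from the calculator pipeline above.
needs = {
    "lint-code": [],
    "unit-tests": [],
    "build-package": ["lint-code", "unit-tests"],
    "deploy-app": ["build-package"],
}
for number, batch in enumerate(build_batches(needs), start=1):
    print(number, batch)
```

<p>For the calculator pipeline this yields three batches: <code class="language-plaintext highlighter-rouge">lint-code</code> and <code class="language-plaintext highlighter-rouge">unit-tests</code> together, then <code class="language-plaintext highlighter-rouge">build-package</code>, then <code class="language-plaintext highlighter-rouge">deploy-app</code>.</p>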

<h3 id="3-job-scheduler">3. Job Scheduler</h3>
<p>Groups jobs into batches that can run in parallel. Uses Python’s multiprocessing for parallel execution.</p>
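
<p>Here is a small sketch of running one batch concurrently. Note that it swaps in threads from <code class="language-plaintext highlighter-rouge">concurrent.futures</code> rather than multiprocessing, purely to keep the example short; <code class="language-plaintext highlighter-rouge">run_batch</code> and <code class="language-plaintext highlighter-rouge">fake_job</code> are hypothetical names.</p>

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_batch(batch, run_job):
    """Run every job in the batch concurrently and return {job_name: result}."""
    with ThreadPoolExecutor(max_workers=max(1, len(batch))) as pool:
        futures = {name: pool.submit(run_job, name) for name in batch}
        # result() blocks until that job has finished.
        return {name: future.result() for name, future in futures.items()}


def fake_job(name):
    # Stand-in for real work, such as waiting for a container to exit.
    time.sleep(0.2)
    return True


start = time.perf_counter()
results = run_batch(["unit-tests", "integration-tests"], fake_job)
elapsed = time.perf_counter() - start
print(results)
print(f"Batch finished in {elapsed:.2f}s")
```

<p>Because both jobs sleep at the same time, the batch should take roughly as long as one job rather than the sum of the two.</p>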

<h3 id="4-job-executor">4. Job Executor</h3>
<p>Spawns Docker containers, mounts workspace, executes scripts, streams logs, and manages artifacts.</p>
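
<p>The container invocation could look roughly like the sketch below. It only builds the <code class="language-plaintext highlighter-rouge">docker run</code> argument list and does not start anything; the <code class="language-plaintext highlighter-rouge">/workspace</code> mount point and the helper name <code class="language-plaintext highlighter-rouge">build_docker_command</code> are assumptions for illustration.</p>

```python
import shlex


def build_docker_command(image, script, workspace):
    """Return the argv list that runs a job's script lines in a throwaway container."""
    # "set -e" makes the shell stop at the first failing script line.
    shell_script = "\n".join(["set -e"] + list(script))
    return [
        "docker", "run", "--rm",          # remove the container when it exits
        "-v", f"{workspace}:/workspace",  # mount the job workspace
        "-w", "/workspace",               # run the script from inside it
        image,
        "sh", "-c", shell_script,
    ]


command = build_docker_command(
    image="python:3.11",
    script=['echo "Running unit tests..."', "python tests/test_calculator.py"],
    workspace="/tmp/my-project",
)
print(shlex.join(command))
```

<p>Passing the list to <code class="language-plaintext highlighter-rouge">subprocess.Popen</code> with <code class="language-plaintext highlighter-rouge">stdout=subprocess.PIPE</code> and reading it line by line is one way to get the log streaming described above.</p>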

<p>The flow:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>YAML Config → Parse Jobs → Build Dependency Graph →
Topological Sort → Execute Batches → Save Artifacts → Report Results
</code></pre></div></div>

<h2 id="real-world-use-cases">Real-World Use Cases</h2>

<p>You could actually use this runner for:</p>

<ol>
  <li><strong>Local CI/CD testing</strong>: Test your pipeline config before pushing</li>
  <li><strong>Air-gapped environments</strong>: No external CI/CD access</li>
  <li><strong>Custom workflows</strong>: Company-specific requirements</li>
  <li><strong>Learning tool</strong>: Understand how CI/CD works internally</li>
  <li><strong>Embedded systems</strong>: Limited connectivity to cloud runners</li>
  <li><strong>On-premise deployments</strong>: Full control over execution environment</li>
</ol>

<h2 id="performance-considerations">Performance Considerations</h2>

<p>Our runner is reasonably efficient, but here’s what impacts performance:</p>

<p><strong>What makes it fast:</strong></p>
<ul>
  <li>Parallel execution within stages</li>
  <li>Dependency-aware scheduling (no unnecessary waiting)</li>
  <li>Docker container reuse</li>
  <li>Minimal artifact copying</li>
</ul>

<p><strong>What could slow it down:</strong></p>
<ul>
  <li>Docker image pulls (first time)</li>
  <li>Large artifacts being copied</li>
  <li>Many jobs in sequence (long critical path)</li>
  <li>Heavy multiprocessing overhead for tiny jobs</li>
</ul>

<p><strong>Optimizations you could add:</strong></p>
<ul>
  <li>Cache Docker images</li>
  <li>Compress artifacts</li>
  <li>Parallel artifact downloads</li>
  <li>Job scheduling across machines</li>
</ul>

<h2 id="extending-the-runner">Extending the Runner</h2>

<p>Here are features you could add:</p>

<h3 id="1-caching">1. Caching</h3>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">build</span><span class="pi">:</span>
  <span class="na">cache</span><span class="pi">:</span>
    <span class="na">key</span><span class="pi">:</span> <span class="s">${CI_COMMIT_REF}</span>
    <span class="na">paths</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">node_modules/</span>
</code></pre></div></div>

<h3 id="2-matrix-builds">2. Matrix Builds</h3>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">test</span><span class="pi">:</span>
  <span class="na">image</span><span class="pi">:</span> <span class="s">python:${VERSION}</span>
  <span class="na">matrix</span><span class="pi">:</span>
    <span class="na">VERSION</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">3.9"</span><span class="pi">,</span> <span class="s2">"</span><span class="s">3.10"</span><span class="pi">,</span> <span class="s2">"</span><span class="s">3.11"</span><span class="pi">]</span>
</code></pre></div></div>
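
<p>One possible way to expand such a matrix into concrete jobs is sketched below. The naming scheme (job name plus the joined values) and the function <code class="language-plaintext highlighter-rouge">expand_matrix</code> are made up for this example.</p>

```python
from itertools import product


def expand_matrix(job_name, matrix):
    """Yield (expanded_job_name, variables) for every combination in the matrix."""
    keys = sorted(matrix)
    for values in product(*(matrix[key] for key in keys)):
        variables = dict(zip(keys, values))
        suffix = "-".join(str(value) for value in values)
        yield f"{job_name}-{suffix}", variables


for name, variables in expand_matrix("test", {"VERSION": ["3.9", "3.10", "3.11"]}):
    print(name, variables)
```

<p>Each generated job can then go through the usual variable substitution, so <code class="language-plaintext highlighter-rouge">python:${VERSION}</code> resolves to a different image per job.</p>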

<h3 id="3-service-containers">3. Service Containers</h3>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">integration-test</span><span class="pi">:</span>
  <span class="na">services</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">postgres:13</span>
    <span class="pi">-</span> <span class="s">redis:6</span>
</code></pre></div></div>

<h3 id="4-manual-triggers">4. Manual Triggers</h3>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">deploy</span><span class="pi">:</span>
  <span class="na">when</span><span class="pi">:</span> <span class="s">manual</span>  <span class="c1"># Require manual approval</span>
</code></pre></div></div>

<h3 id="5-retry-logic">5. Retry Logic</h3>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">flaky-test</span><span class="pi">:</span>
  <span class="na">retry</span><span class="pi">:</span> <span class="m">2</span>  <span class="c1"># Retry up to 2 times</span>
</code></pre></div></div>
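
<p>A retry wrapper can be very small. The sketch below assumes a job is represented by a callable that returns <code class="language-plaintext highlighter-rouge">True</code> on success; both that convention and the name <code class="language-plaintext highlighter-rouge">run_with_retry</code> are hypothetical.</p>

```python
def run_with_retry(run_job, retries):
    """Call run_job() until it returns True, allowing up to `retries` extra attempts.

    Returns a (succeeded, attempts_made) tuple.
    """
    attempts = 0
    for _ in range(retries + 1):
        attempts += 1
        if run_job():
            return True, attempts
    return False, attempts


# A simulated flaky job: fails twice, then passes.
outcomes = iter([False, False, True])
print(run_with_retry(lambda: next(outcomes), retries=2))
```

<p>With <code class="language-plaintext highlighter-rouge">retry: 2</code> the job gets three attempts in total, so the simulated job above succeeds on its last try.</p>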

<h2 id="conclusion">Conclusion</h2>

<p>By building this CI/CD pipeline runner, we’ve demystified how GitLab Runner and GitHub Actions work internally. The core concepts aren’t that complicated:</p>

<ul>
  <li><strong>Parse configuration</strong> into job objects</li>
  <li><strong>Build a dependency graph</strong> to understand execution order</li>
  <li><strong>Use topological sorting</strong> to schedule jobs correctly</li>
  <li><strong>Execute jobs in containers</strong> for isolation</li>
  <li><strong>Pass artifacts</strong> between dependent jobs</li>
  <li><strong>Stream logs</strong> for visibility</li>
</ul>

<p>GitLab’s innovation wasn’t inventing these concepts - it was packaging them into a developer-friendly tool with great UX, distributed execution, and enterprise features.</p>

<p>Understanding these fundamentals makes you a better DevOps engineer. When pipelines are slow, you’ll know why. When jobs fail mysteriously, you’ll know where to look. When you need custom CI/CD, you’ll know how to build it.</p>

<h2 id="further-learning">Further Learning</h2>

<p>If you want to dive deeper:</p>

<ul>
  <li><strong>GitLab Runner source code</strong>: Written in Go, shows production implementation</li>
  <li><strong>GitHub Actions runner</strong>: Open source, similar concepts</li>
  <li><strong>Tekton</strong>: Kubernetes-native pipelines</li>
  <li><strong>Drone CI</strong>: Simple, container-focused CI/CD</li>
  <li><strong>Argo Workflows</strong>: DAG-based workflow engine for Kubernetes</li>
</ul>

<p>I also recommend reading about:</p>
<ul>
  <li><strong>Directed Acyclic Graphs (DAGs)</strong>: Foundation of job dependencies</li>
  <li><strong>Topological sorting</strong>: Algorithm for dependency resolution</li>
  <li><strong>Container orchestration</strong>: Kubernetes, Docker Swarm</li>
  <li><strong>Workflow engines</strong>: Apache Airflow, Prefect</li>
</ul>

<h2 id="next-steps">Next Steps</h2>

<p>Want to take this further? Here are some ideas:</p>

<ol>
  <li><strong>Add AI-powered failure investigation</strong>: In my follow-up post, I show how to <a href="/2025/building-ai-agents-devops-automation/">build AI agents that automatically investigate pipeline failures</a> using LangChain and GitHub Actions integration.</li>
  <li><strong>Deploy to production</strong>: If you’re deploying Python apps to AWS, check out my guide on <a href="/2025/ecs-decisions-that-waste-6-weeks/">ECS decisions that waste 6 weeks</a> - including GitHub Actions deployment patterns.</li>
  <li><strong>Add a web UI</strong>: Use Flask + WebSockets for real-time pipeline visualization</li>
  <li><strong>Implement caching</strong>: Cache pip/npm packages between runs</li>
  <li><strong>Add distributed execution</strong>: Run jobs across multiple machines</li>
</ol>

<p>Let me know in the comments what you’d like to see next!</p>

<hr />

<h4 id="announcements">Announcements</h4>

<ul>
  <li>If you’re interested in more systems programming and DevOps content, follow me on <a href="https://twitter.com/muhammad_o7">Twitter/X</a> where I share what I’m learning.</li>
  <li>I’m available for Python and DevOps consulting - if you need help with CI/CD, automation, or infrastructure, reach out via <a href="mailto:muhammadraza0047@gmail.com">email</a>.</li>
</ul>

<p><br />
<em>If you found this helpful, share it on X and tag me <a href="https://twitter.com/muhammad_o7">@muhammad_o7</a> - I’d love to hear your thoughts! You can also connect with me on <a href="https://www.linkedin.com/in/muhammad-raza-07/">LinkedIn</a>.</em></p>

<p><strong>Want to be notified about posts like this? Subscribe to my RSS feed or leave your email <a href="https://forms.gle/M1EK61LLCxJ3iTiD7">here</a></strong></p>

          ]]>
        </description>
        <pubDate>Sun, 09 Nov 2025 00:00:00 +0000</pubDate>
        <link>//muhammadraza.me/2025/building-cicd-pipeline-runner-python/</link>
        <guid isPermaLink="true">//muhammadraza.me/2025/building-cicd-pipeline-runner-python/</guid>
        
        <category>python</category>
        
        <category>devops</category>
        
        
        
        <dc:creator>Muhammad Raza</dc:creator>
        <dc:rights></dc:rights>
      </item>
    
      <item>
        <title>Understanding Docker Internals: Building a Container Runtime in Python</title>
        <description>
          <![CDATA[
            
            <p>I’ve been working with containers professionally for several years now, using Docker and Kubernetes daily in production environments. Like many developers, I initially treated containers as a “black box” - I knew how to use them, but didn’t really understand what was happening under the hood. It wasn’t until I needed to debug a particularly tricky container networking issue at work that I realized I needed to understand the underlying technology better.</p>

<p>In this post, I’ll take you on a journey to break down container technology by building a simple container runtime in Python. We’ll explore the Linux primitives that make containers possible and implement them step by step. By the end, you’ll understand how containers work.</p>

<h2 id="what-actually-is-a-container">What Actually IS a Container?</h2>

<p>Before we start building, let’s clear up a common misconception: <strong>containers are NOT lightweight virtual machines</strong>. This comparison, while convenient for explaining containers to newcomers, is technically misleading.</p>

<p>A virtual machine includes an entire operating system with its own kernel. Containers, on the other hand, share the host’s kernel and use Linux features to create isolated environments. Specifically, containers are built on three main Linux primitives:</p>

<ol>
  <li><strong>Namespaces</strong> - Provide isolation (process, network, filesystem, etc.)</li>
  <li><strong>Control Groups (cgroups)</strong> - Limit and monitor resource usage (CPU, memory, I/O)</li>
  <li><strong>Filesystem Isolation</strong> - Use chroot/pivot_root to change the root filesystem</li>
</ol>

<p>When you run <code class="language-plaintext highlighter-rouge">docker run ubuntu bash</code>, Docker is essentially:</p>
<ul>
  <li>Creating namespaces to isolate the process</li>
  <li>Setting up cgroups to limit resources</li>
  <li>Using an overlay filesystem to provide the Ubuntu root filesystem</li>
  <li>Executing <code class="language-plaintext highlighter-rouge">/bin/bash</code> in this isolated environment</li>
</ul>

<p>Let’s build this ourselves to see exactly how it works.</p>

<h2 id="understanding-linux-namespaces">Understanding Linux Namespaces</h2>

<p>Namespaces are a Linux kernel feature that partitions kernel resources. Different processes can have different views of the system. Linux provides several types of namespaces:</p>

<ul>
  <li><strong>PID Namespace</strong> - Process isolation. Processes in a namespace only see processes within that namespace.</li>
  <li><strong>Network Namespace</strong> - Network isolation. Each namespace has its own network devices, IP addresses, routing tables.</li>
  <li><strong>Mount Namespace</strong> - Filesystem isolation. Each namespace can have its own mount points.</li>
  <li><strong>UTS Namespace</strong> - Hostname isolation. Each namespace can have its own hostname.</li>
  <li><strong>IPC Namespace</strong> - Inter-process communication isolation.</li>
  <li><strong>User Namespace</strong> - User and group ID isolation.</li>
</ul>
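
<p>On Linux you can inspect which namespaces a process belongs to, because the kernel exposes them as symbolic links under <code class="language-plaintext highlighter-rouge">/proc/PID/ns</code>. The following is a minimal sketch; the helper name <code class="language-plaintext highlighter-rouge">list_namespaces</code> is my own, and on systems without <code class="language-plaintext highlighter-rouge">/proc</code> it simply returns an empty dictionary.</p>

```python
import os


def list_namespaces(pid="self"):
    """Return {namespace_type: identifier} read from /proc/PID/ns (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    namespaces = {}
    if not os.path.isdir(ns_dir):
        return namespaces  # not Linux, no such process, or /proc is not mounted
    for name in sorted(os.listdir(ns_dir)):
        try:
            # Each entry is a symlink whose target names the namespace instance.
            namespaces[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            namespaces[name] = "unreadable"
    return namespaces


for name, identifier in list_namespaces().items():
    print(f"{name:20} {identifier}")
```

<p>Two processes that print the same identifier for an entry share that namespace, which makes this a handy check once we start creating new ones.</p>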

<p>Let’s start by implementing the simplest form of isolation: PID namespaces.</p>

<h2 id="building-our-container-runtime">Building Our Container Runtime</h2>

<h3 id="step-1-basic-process-isolation-with-pid-namespaces">Step 1: Basic Process Isolation with PID Namespaces</h3>

<p>Let’s create our first container that isolates processes:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python3
</span><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">import</span> <span class="nn">subprocess</span>

<span class="k">def</span> <span class="nf">run_in_container</span><span class="p">(</span><span class="n">command</span><span class="p">):</span>
    <span class="s">"""
    Run a command in an isolated PID namespace.
    This creates a new process namespace where the command
    will be PID 1 and won't see host processes.
    """</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Starting container with command: </span><span class="si">{</span><span class="n">command</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Parent process PID: </span><span class="si">{</span><span class="n">os</span><span class="p">.</span><span class="n">getpid</span><span class="p">()</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

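    <span class="c1"># Create the new PID and mount namespaces before forking.
</span>    <span class="c1"># unshare(CLONE_NEWPID) does not move the calling process: only children
</span>    <span class="c1"># forked afterwards join the new namespace. os.unshare needs Python 3.12+.
</span>    <span class="n">os</span><span class="p">.</span><span class="n">unshare</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">CLONE_NEWPID</span> <span class="o">|</span> <span class="n">os</span><span class="p">.</span><span class="n">CLONE_NEWNS</span><span class="p">)</span>

    <span class="c1"># Create a child process; it becomes PID 1 in the new PID namespace
</span>    <span class="n">pid</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">fork</span><span class="p">()</span>

    <span class="k">if</span> <span class="n">pid</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="c1"># Child process
</span>        <span class="k">try</span><span class="p">:</span>
            <span class="c1"># Make our mounts private so the host's /proc is left untouched
</span>            <span class="n">subprocess</span><span class="p">.</span><span class="n">run</span><span class="p">([</span><span class="s">'mount'</span><span class="p">,</span> <span class="s">'--make-rprivate'</span><span class="p">,</span> <span class="s">'/'</span><span class="p">],</span> <span class="n">check</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

            <span class="c1"># Mount /proc so we can see our isolated process tree
</span>            <span class="c1"># Note: This requires root privileges
</span>            <span class="n">subprocess</span><span class="p">.</span><span class="n">run</span><span class="p">([</span><span class="s">'mount'</span><span class="p">,</span> <span class="s">'-t'</span><span class="p">,</span> <span class="s">'proc'</span><span class="p">,</span> <span class="s">'proc'</span><span class="p">,</span> <span class="s">'/proc'</span><span class="p">],</span> <span class="n">check</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>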

            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Container process PID: </span><span class="si">{</span><span class="n">os</span><span class="p">.</span><span class="n">getpid</span><span class="p">()</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

            <span class="c1"># Execute the command
</span>            <span class="n">os</span><span class="p">.</span><span class="n">execvp</span><span class="p">(</span><span class="n">command</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">command</span><span class="p">)</span>
        <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Error in container: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            <span class="n">sys</span><span class="p">.</span><span class="nb">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="c1"># Parent process - wait for child to complete
</span>        <span class="n">os</span><span class="p">.</span><span class="n">waitpid</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"Container exited"</span><span class="p">)</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"Usage: python3 simple_container.py &lt;command&gt;"</span><span class="p">)</span>
        <span class="n">sys</span><span class="p">.</span><span class="nb">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>

    <span class="k">if</span> <span class="n">os</span><span class="p">.</span><span class="n">geteuid</span><span class="p">()</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"This script requires root privileges"</span><span class="p">)</span>
        <span class="n">sys</span><span class="p">.</span><span class="nb">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>

    <span class="n">command</span> <span class="o">=</span> <span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span>
    <span class="n">run_in_container</span><span class="p">(</span><span class="n">command</span><span class="p">)</span>
</code></pre></div></div>

<h4 id="testing-pid-isolation">Testing PID Isolation</h4>

<p>Save this as <code class="language-plaintext highlighter-rouge">simple_container.py</code> and run it:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>python3 simple_container.py bash
</code></pre></div></div>

<p>Inside the container, try running:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ps aux  <span class="c"># You'll only see processes in this namespace!</span>
<span class="nb">echo</span> <span class="nv">$$</span>  <span class="c"># This will show PID 1</span>
</code></pre></div></div>

<p>This is our first step towards a container: we’ve isolated the process tree!</p>
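
<p>One way to confirm the isolation from the host side is to compare namespace identifiers. Linux exposes them as symlinks under <code class="language-plaintext highlighter-rouge">/proc/&lt;pid&gt;/ns/</code>, with targets such as <code class="language-plaintext highlighter-rouge">pid:[4026531836]</code>; two processes share a PID namespace when the numbers match. The sketch below assumes a Linux host, and the helper names (<code class="language-plaintext highlighter-rouge">parse_ns_link</code>, <code class="language-plaintext highlighter-rouge">namespace_of</code>, <code class="language-plaintext highlighter-rouge">same_namespace</code>) are made up for illustration; they are not part of <code class="language-plaintext highlighter-rouge">simple_container.py</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os


def parse_ns_link(link):
    """Split a link target such as 'pid:[4026531836]' into (kind, inode)."""
    kind, _, rest = link.partition(":")
    return kind, int(rest.strip("[]"))


def namespace_of(pid="self", kind="pid"):
    """Return (kind, inode) for one namespace of a process (Linux only)."""
    return parse_ns_link(os.readlink(f"/proc/{pid}/ns/{kind}"))


def same_namespace(pid_a, pid_b, kind="pid"):
    """Two processes share a namespace when the inode numbers match."""
    return namespace_of(pid_a, kind) == namespace_of(pid_b, kind)
</code></pre></div></div>

<p>Run as root on the host, <code class="language-plaintext highlighter-rouge">same_namespace("self", container_pid)</code> should return <code class="language-plaintext highlighter-rouge">False</code> when <code class="language-plaintext highlighter-rouge">container_pid</code> is the host-side PID of a process running inside the container.</p>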

<h3 id="step-2-filesystem-isolation-with-chroot">Step 2: Filesystem Isolation with chroot</h3>

<p>Now let’s add filesystem isolation. We’ll create a minimal root filesystem and use <code class="language-plaintext highlighter-rouge">chroot</code> to change the root directory:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python3
</span><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">import</span> <span class="nn">subprocess</span>
<span class="kn">import</span> <span class="nn">tempfile</span>
<span class="kn">import</span> <span class="nn">shutil</span>

<span class="k">def</span> <span class="nf">setup_rootfs</span><span class="p">(</span><span class="n">rootfs_path</span><span class="p">):</span>
    <span class="s">"""
    Create a minimal root filesystem.
    In production, this would be container image layers.
    """</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Setting up root filesystem at </span><span class="si">{</span><span class="n">rootfs_path</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

    <span class="c1"># Create basic directory structure
</span>    <span class="n">dirs</span> <span class="o">=</span> <span class="p">[</span><span class="s">'bin'</span><span class="p">,</span> <span class="s">'lib'</span><span class="p">,</span> <span class="s">'lib64'</span><span class="p">,</span> <span class="s">'usr'</span><span class="p">,</span> <span class="s">'proc'</span><span class="p">,</span> <span class="s">'sys'</span><span class="p">,</span> <span class="s">'dev'</span><span class="p">,</span> <span class="s">'tmp'</span><span class="p">]</span>
    <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">dirs</span><span class="p">:</span>
        <span class="n">os</span><span class="p">.</span><span class="n">makedirs</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">rootfs_path</span><span class="p">,</span> <span class="n">d</span><span class="p">),</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

    <span class="c1"># Copy essential binaries (bash, ls, ps, and hostname for the demo)
</span>    <span class="n">binaries</span> <span class="o">=</span> <span class="p">[</span><span class="s">'/bin/bash'</span><span class="p">,</span> <span class="s">'/bin/ls'</span><span class="p">,</span> <span class="s">'/bin/ps'</span><span class="p">,</span> <span class="s">'/bin/hostname'</span><span class="p">]</span>
    <span class="k">for</span> <span class="n">binary</span> <span class="ow">in</span> <span class="n">binaries</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">exists</span><span class="p">(</span><span class="n">binary</span><span class="p">):</span>
            <span class="n">dest</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">rootfs_path</span><span class="p">,</span> <span class="n">binary</span><span class="p">.</span><span class="n">lstrip</span><span class="p">(</span><span class="s">'/'</span><span class="p">))</span>
            <span class="n">shutil</span><span class="p">.</span><span class="n">copy2</span><span class="p">(</span><span class="n">binary</span><span class="p">,</span> <span class="n">dest</span><span class="p">)</span>

            <span class="c1"># Copy required shared libraries
</span>            <span class="n">copy_dependencies</span><span class="p">(</span><span class="n">binary</span><span class="p">,</span> <span class="n">rootfs_path</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">copy_dependencies</span><span class="p">(</span><span class="n">binary</span><span class="p">,</span> <span class="n">rootfs_path</span><span class="p">):</span>
    <span class="s">"""
    Copy shared library dependencies for a binary.
    Uses ldd to find dependencies.
    """</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">result</span> <span class="o">=</span> <span class="n">subprocess</span><span class="p">.</span><span class="n">run</span><span class="p">([</span><span class="s">'ldd'</span><span class="p">,</span> <span class="n">binary</span><span class="p">],</span>
                              <span class="n">capture_output</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
                              <span class="n">text</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

        <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">result</span><span class="p">.</span><span class="n">stdout</span><span class="p">.</span><span class="n">splitlines</span><span class="p">():</span>
            <span class="c1"># Take the path after "=&gt;" when present; the dynamic loader
</span>            <span class="c1"># is listed as a bare absolute path with no "=&gt;"
</span>            <span class="n">fields</span> <span class="o">=</span> <span class="n">line</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">'=&gt;'</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="n">split</span><span class="p">()</span>
            <span class="k">if</span> <span class="ow">not</span> <span class="n">fields</span> <span class="ow">or</span> <span class="ow">not</span> <span class="n">fields</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">startswith</span><span class="p">(</span><span class="s">'/'</span><span class="p">):</span>
                <span class="k">continue</span>
            <span class="n">lib_path</span> <span class="o">=</span> <span class="n">fields</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
            <span class="k">if</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">exists</span><span class="p">(</span><span class="n">lib_path</span><span class="p">):</span>
                <span class="n">dest</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">rootfs_path</span><span class="p">,</span> <span class="n">lib_path</span><span class="p">.</span><span class="n">lstrip</span><span class="p">(</span><span class="s">'/'</span><span class="p">))</span>
                <span class="n">os</span><span class="p">.</span><span class="n">makedirs</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">dirname</span><span class="p">(</span><span class="n">dest</span><span class="p">),</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
                <span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">exists</span><span class="p">(</span><span class="n">dest</span><span class="p">):</span>
                    <span class="n">shutil</span><span class="p">.</span><span class="n">copy2</span><span class="p">(</span><span class="n">lib_path</span><span class="p">,</span> <span class="n">dest</span><span class="p">)</span>
    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Warning: Could not copy dependencies for </span><span class="si">{</span><span class="n">binary</span><span class="si">}</span><span class="s">: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">run_container</span><span class="p">(</span><span class="n">command</span><span class="p">,</span> <span class="n">rootfs_path</span><span class="p">):</span>
    <span class="s">"""
    Run a command in an isolated container with its own filesystem.
    """</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Starting container with command: </span><span class="si">{</span><span class="n">command</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

    <span class="n">pid</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">fork</span><span class="p">()</span>

    <span class="k">if</span> <span class="n">pid</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="c1"># Child process
</span>        <span class="k">try</span><span class="p">:</span>
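            <span class="c1"># Create new namespaces (os.unshare() requires Python 3.12+)
</span>            <span class="c1"># CLONE_NEWPID: new process namespace
</span>            <span class="c1"># CLONE_NEWNS: new mount namespace
</span>            <span class="c1"># CLONE_NEWUTS: new hostname namespace
</span>            <span class="n">os</span><span class="p">.</span><span class="n">unshare</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">CLONE_NEWPID</span> <span class="o">|</span> <span class="n">os</span><span class="p">.</span><span class="n">CLONE_NEWNS</span> <span class="o">|</span> <span class="n">os</span><span class="p">.</span><span class="n">CLONE_NEWUTS</span><span class="p">)</span>

            <span class="c1"># unshare(CLONE_NEWPID) only affects children of the caller,
</span>            <span class="c1"># so fork again: the new child runs as PID 1 in the namespace
</span>            <span class="n">init_pid</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">fork</span><span class="p">()</span>
            <span class="k">if</span> <span class="n">init_pid</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">:</span>
                <span class="n">_</span><span class="p">,</span> <span class="n">status</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">waitpid</span><span class="p">(</span><span class="n">init_pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
                <span class="n">os</span><span class="p">.</span><span class="n">_exit</span><span class="p">(</span><span class="mi">0</span> <span class="k">if</span> <span class="n">status</span> <span class="o">==</span> <span class="mi">0</span> <span class="k">else</span> <span class="mi">1</span><span class="p">)</span>

            <span class="c1"># Keep mounts made here from propagating back to the host
</span>            <span class="n">subprocess</span><span class="p">.</span><span class="n">run</span><span class="p">([</span><span class="s">'mount'</span><span class="p">,</span> <span class="s">'--make-rprivate'</span><span class="p">,</span> <span class="s">'/'</span><span class="p">],</span> <span class="n">check</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

            <span class="c1"># Set hostname for this container
</span>            <span class="n">hostname</span> <span class="o">=</span> <span class="s">"container"</span>
            <span class="n">subprocess</span><span class="p">.</span><span class="n">run</span><span class="p">([</span><span class="s">'hostname'</span><span class="p">,</span> <span class="n">hostname</span><span class="p">],</span> <span class="n">check</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

            <span class="c1"># Mount /proc inside the rootfs before chroot, while the
</span>            <span class="c1"># host's mount binary is still reachable
</span>            <span class="n">proc_path</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">rootfs_path</span><span class="p">,</span> <span class="s">'proc'</span><span class="p">)</span>
            <span class="n">os</span><span class="p">.</span><span class="n">makedirs</span><span class="p">(</span><span class="n">proc_path</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
            <span class="n">subprocess</span><span class="p">.</span><span class="n">run</span><span class="p">([</span><span class="s">'mount'</span><span class="p">,</span> <span class="s">'-t'</span><span class="p">,</span> <span class="s">'proc'</span><span class="p">,</span> <span class="s">'proc'</span><span class="p">,</span> <span class="n">proc_path</span><span class="p">],</span> <span class="n">check</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

            <span class="c1"># Change root filesystem
</span>            <span class="n">os</span><span class="p">.</span><span class="n">chroot</span><span class="p">(</span><span class="n">rootfs_path</span><span class="p">)</span>
            <span class="n">os</span><span class="p">.</span><span class="n">chdir</span><span class="p">(</span><span class="s">'/'</span><span class="p">)</span>

            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Container started with hostname: </span><span class="si">{</span><span class="n">hostname</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Root filesystem: </span><span class="si">{</span><span class="n">rootfs_path</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

            <span class="c1"># Execute the command
</span>            <span class="n">os</span><span class="p">.</span><span class="n">execvp</span><span class="p">(</span><span class="n">command</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">command</span><span class="p">)</span>

        <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Error in container: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            <span class="c1"># Use os._exit so the child never runs the parent's cleanup code
</span>            <span class="n">os</span><span class="p">.</span><span class="n">_exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>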
    <span class="k">else</span><span class="p">:</span>
        <span class="c1"># Parent process
</span>        <span class="k">try</span><span class="p">:</span>
            <span class="n">os</span><span class="p">.</span><span class="n">waitpid</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
        <span class="k">except</span> <span class="nb">KeyboardInterrupt</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">Container interrupted"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"Container exited"</span><span class="p">)</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="k">if</span> <span class="n">os</span><span class="p">.</span><span class="n">geteuid</span><span class="p">()</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"This script requires root privileges"</span><span class="p">)</span>
        <span class="n">sys</span><span class="p">.</span><span class="nb">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>

    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"Usage: sudo python3 container_v2.py &lt;command&gt;"</span><span class="p">)</span>
        <span class="n">sys</span><span class="p">.</span><span class="nb">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>

    <span class="c1"># Create temporary root filesystem
</span>    <span class="n">rootfs_path</span> <span class="o">=</span> <span class="n">tempfile</span><span class="p">.</span><span class="n">mkdtemp</span><span class="p">(</span><span class="n">prefix</span><span class="o">=</span><span class="s">'container_rootfs_'</span><span class="p">)</span>

    <span class="k">try</span><span class="p">:</span>
        <span class="n">setup_rootfs</span><span class="p">(</span><span class="n">rootfs_path</span><span class="p">)</span>
        <span class="n">command</span> <span class="o">=</span> <span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span>
        <span class="n">run_container</span><span class="p">(</span><span class="n">command</span><span class="p">,</span> <span class="n">rootfs_path</span><span class="p">)</span>
    <span class="k">finally</span><span class="p">:</span>
        <span class="c1"># Cleanup
</span>        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Cleaning up </span><span class="si">{</span><span class="n">rootfs_path</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="n">shutil</span><span class="p">.</span><span class="n">rmtree</span><span class="p">(</span><span class="n">rootfs_path</span><span class="p">,</span> <span class="n">ignore_errors</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>

<p>Now when you run this, you’ll have a container with:</p>
<ul>
  <li>Isolated process tree</li>
  <li>Isolated filesystem</li>
  <li>Custom hostname</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>python3 container_v2.py bash
</code></pre></div></div>

<p>Try these commands inside:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">hostname</span>  <span class="c"># Should show "container"</span>
<span class="nb">ls</span> /      <span class="c"># Should only see our minimal filesystem</span>
ps aux    <span class="c"># Only processes in this namespace</span>
</code></pre></div></div>
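
<p>If <code class="language-plaintext highlighter-rouge">bash</code> refuses to start inside the chroot, a shared library or the dynamic loader missing from the rootfs is a likely cause. The following is a minimal sketch for checking that, assuming the usual <code class="language-plaintext highlighter-rouge">ldd</code> output format where each resolved file appears as an absolute path. The function names <code class="language-plaintext highlighter-rouge">library_paths</code> and <code class="language-plaintext highlighter-rouge">missing_from_rootfs</code> are hypothetical helpers, not part of <code class="language-plaintext highlighter-rouge">container_v2.py</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os


def library_paths(ldd_output):
    """Return the absolute paths mentioned in the text printed by ldd."""
    paths = []
    for line in ldd_output.splitlines():
        # The first token starting with "/" is the resolved file, whether the
        # line names a library first or is just the bare dynamic loader path.
        for token in line.split():
            if token.startswith("/"):
                paths.append(token)
                break
    return paths


def missing_from_rootfs(ldd_output, rootfs_path):
    """List files that ldd reports but that are absent from the rootfs."""
    return [
        path for path in library_paths(ldd_output)
        if not os.path.exists(os.path.join(rootfs_path, path.lstrip("/")))
    ]
</code></pre></div></div>

<p>Feeding it the output of <code class="language-plaintext highlighter-rouge">ldd /bin/bash</code> captured on the host, together with the rootfs path printed at startup, should return an empty list when every dependency was copied.</p>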

<h3 id="step-3-resource-limits-with-cgroups">Step 3: Resource Limits with cgroups</h3>

<p>Now let’s add resource limits using cgroups (control groups). This is what prevents a container from consuming all system resources:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python3
</span><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">import</span> <span class="nn">subprocess</span>

<span class="k">class</span> <span class="nc">CgroupManager</span><span class="p">:</span>
    <span class="s">"""
    Manages cgroups v2 for resource limiting.
    Modern Linux systems use cgroups v2.
    """</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">container_id</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">container_id</span> <span class="o">=</span> <span class="n">container_id</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">cgroup_path</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"/sys/fs/cgroup/container_</span><span class="si">{</span><span class="n">container_id</span><span class="si">}</span><span class="s">"</span>

    <span class="k">def</span> <span class="nf">create</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">memory_limit_mb</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">cpu_shares</span><span class="o">=</span><span class="mi">512</span><span class="p">):</span>
        <span class="s">"""
        Create a cgroup with resource limits.

        Args:
            memory_limit_mb: Memory limit in megabytes
            cpu_shares: CPU allowance on a 0-1024 scale, applied as a hard
                cpu.max quota (1024 = 100% of one CPU)
        """</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="c1"># Create cgroup directory
</span>            <span class="n">os</span><span class="p">.</span><span class="n">makedirs</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">cgroup_path</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

            <span class="c1"># Set memory limit
</span>            <span class="n">memory_limit_bytes</span> <span class="o">=</span> <span class="n">memory_limit_mb</span> <span class="o">*</span> <span class="mi">1024</span> <span class="o">*</span> <span class="mi">1024</span>
            <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">cgroup_path</span><span class="si">}</span><span class="s">/memory.max"</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
                <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">memory_limit_bytes</span><span class="p">))</span>

            <span class="c1"># Set CPU limit
</span>            <span class="c1"># cpu.max format: $MAX $PERIOD (in microseconds)
</span>            <span class="c1"># For example, "50000 100000" means 50% of one CPU
</span>            <span class="n">cpu_quota</span> <span class="o">=</span> <span class="nb">int</span><span class="p">((</span><span class="n">cpu_shares</span> <span class="o">/</span> <span class="mi">1024</span><span class="p">)</span> <span class="o">*</span> <span class="mi">100000</span><span class="p">)</span>
            <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">cgroup_path</span><span class="si">}</span><span class="s">/cpu.max"</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
                <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">cpu_quota</span><span class="si">}</span><span class="s"> 100000"</span><span class="p">)</span>

            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Created cgroup with limits:"</span><span class="p">)</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  Memory: </span><span class="si">{</span><span class="n">memory_limit_mb</span><span class="si">}</span><span class="s">MB"</span><span class="p">)</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  CPU: </span><span class="si">{</span><span class="n">cpu_shares</span><span class="si">}</span><span class="s">/1024 shares"</span><span class="p">)</span>

        <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Warning: Could not set cgroup limits: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"Continuing without resource limits..."</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">add_process</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">pid</span><span class="p">):</span>
        <span class="s">"""Add a process to this cgroup."""</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">cgroup_path</span><span class="si">}</span><span class="s">/cgroup.procs"</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
                <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">pid</span><span class="p">))</span>
        <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Warning: Could not add process to cgroup: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">cleanup</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Remove the cgroup."""</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">os</span><span class="p">.</span><span class="n">rmdir</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">cgroup_path</span><span class="p">)</span>
        <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Warning: Could not remove cgroup: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">run_container_with_limits</span><span class="p">(</span><span class="n">command</span><span class="p">,</span> <span class="n">memory_mb</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">cpu_shares</span><span class="o">=</span><span class="mi">512</span><span class="p">):</span>
    <span class="s">"""
    Run a container with resource limits.
    """</span>
    <span class="kn">import</span> <span class="nn">uuid</span>

    <span class="n">container_id</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">uuid</span><span class="p">.</span><span class="n">uuid4</span><span class="p">())[:</span><span class="mi">8</span><span class="p">]</span>
    <span class="n">cgroup</span> <span class="o">=</span> <span class="n">CgroupManager</span><span class="p">(</span><span class="n">container_id</span><span class="p">)</span>

    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Container ID: </span><span class="si">{</span><span class="n">container_id</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

    <span class="c1"># Create cgroup with limits
</span>    <span class="n">cgroup</span><span class="p">.</span><span class="n">create</span><span class="p">(</span><span class="n">memory_limit_mb</span><span class="o">=</span><span class="n">memory_mb</span><span class="p">,</span> <span class="n">cpu_shares</span><span class="o">=</span><span class="n">cpu_shares</span><span class="p">)</span>

    <span class="n">pid</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">fork</span><span class="p">()</span>

    <span class="k">if</span> <span class="n">pid</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="c1"># Child process
</span>        <span class="k">try</span><span class="p">:</span>
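            <span class="c1"># Join the cgroup before forking so every descendant inherits the limits
</span>            <span class="n">cgroup</span><span class="p">.</span><span class="n">add_process</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">getpid</span><span class="p">())</span>

            <span class="c1"># Create namespaces (os.unshare() requires Python 3.12+)
</span>            <span class="n">os</span><span class="p">.</span><span class="n">unshare</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">CLONE_NEWPID</span> <span class="o">|</span> <span class="n">os</span><span class="p">.</span><span class="n">CLONE_NEWUTS</span> <span class="o">|</span> <span class="n">os</span><span class="p">.</span><span class="n">CLONE_NEWNS</span><span class="p">)</span>

            <span class="c1"># unshare(CLONE_NEWPID) only affects children of the caller,
</span>            <span class="c1"># so fork again: the new child runs as PID 1 in the namespace
</span>            <span class="n">init_pid</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">fork</span><span class="p">()</span>
            <span class="k">if</span> <span class="n">init_pid</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">:</span>
                <span class="n">_</span><span class="p">,</span> <span class="n">status</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">waitpid</span><span class="p">(</span><span class="n">init_pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
                <span class="n">os</span><span class="p">.</span><span class="n">_exit</span><span class="p">(</span><span class="mi">0</span> <span class="k">if</span> <span class="n">status</span> <span class="o">==</span> <span class="mi">0</span> <span class="k">else</span> <span class="mi">1</span><span class="p">)</span>

            <span class="c1"># Set hostname
</span>            <span class="n">subprocess</span><span class="p">.</span><span class="n">run</span><span class="p">([</span><span class="s">'hostname'</span><span class="p">,</span> <span class="sa">f</span><span class="s">'container-</span><span class="si">{</span><span class="n">container_id</span><span class="si">}</span><span class="s">'</span><span class="p">],</span>
                         <span class="n">stderr</span><span class="o">=</span><span class="n">subprocess</span><span class="p">.</span><span class="n">DEVNULL</span><span class="p">)</span>

            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Container process started (PID: </span><span class="si">{</span><span class="n">os</span><span class="p">.</span><span class="n">getpid</span><span class="p">()</span><span class="si">}</span><span class="s">)"</span><span class="p">)</span>

            <span class="c1"># Execute command
</span>            <span class="n">os</span><span class="p">.</span><span class="n">execvp</span><span class="p">(</span><span class="n">command</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">command</span><span class="p">)</span>

        <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Error in container: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            <span class="c1"># Use os._exit so the child never runs the parent's cleanup code
</span>            <span class="n">os</span><span class="p">.</span><span class="n">_exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>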
    <span class="k">else</span><span class="p">:</span>
        <span class="c1"># Parent process
</span>        <span class="k">try</span><span class="p">:</span>
            <span class="c1"># Add container process to cgroup
</span>            <span class="n">cgroup</span><span class="p">.</span><span class="n">add_process</span><span class="p">(</span><span class="n">pid</span><span class="p">)</span>

            <span class="c1"># Wait for container to exit
</span>            <span class="n">os</span><span class="p">.</span><span class="n">waitpid</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>

        <span class="k">except</span> <span class="nb">KeyboardInterrupt</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">Container interrupted"</span><span class="p">)</span>
        <span class="k">finally</span><span class="p">:</span>
            <span class="c1"># Cleanup cgroup
</span>            <span class="n">cgroup</span><span class="p">.</span><span class="n">cleanup</span><span class="p">()</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"Container exited"</span><span class="p">)</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="k">if</span> <span class="n">os</span><span class="p">.</span><span class="n">geteuid</span><span class="p">()</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"This script requires root privileges"</span><span class="p">)</span>
        <span class="n">sys</span><span class="p">.</span><span class="nb">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>

    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"Usage: sudo python3 container_v3.py &lt;command&gt; [memory_mb] [cpu_shares]"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"Example: sudo python3 container_v3.py bash 100 512"</span><span class="p">)</span>
        <span class="n">sys</span><span class="p">.</span><span class="nb">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>

    <span class="c1"># memory_mb and cpu_shares are read only when both are given; otherwise the defaults apply
</span>    <span class="n">command</span> <span class="o">=</span> <span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">:</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">3</span> <span class="k">else</span> <span class="p">[</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">]]</span>
    <span class="n">memory_mb</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">])</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">3</span> <span class="k">else</span> <span class="mi">100</span>
    <span class="n">cpu_shares</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">3</span> <span class="k">else</span> <span class="mi">512</span>

    <span class="n">run_container_with_limits</span><span class="p">(</span><span class="n">command</span><span class="p">,</span> <span class="n">memory_mb</span><span class="p">,</span> <span class="n">cpu_shares</span><span class="p">)</span>
</code></pre></div></div>

<p>To test the memory limit, inside the container try:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Allocates about 200 MB, which exceeds the 100 MB limit from the usage example</span>
python3 <span class="nt">-c</span> <span class="s2">"a = ['x' * 1024 * 1024 for i in range(200)]"</span>
</code></pre></div></div>

<p>The kernel should kill the process as soon as it exceeds the cgroup’s memory limit.</p>
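
<p>One way to confirm that the memory limit, rather than some other failure, ended the process is to read the cgroup’s <code>memory.events</code> file. Its <code>oom_kill</code> counter goes up each time the kernel kills a process in that cgroup. What follows is only a minimal sketch, assuming cgroup v2: the helper names <code>parse_memory_events</code> and <code>oom_kill_count</code> and the default path are hypothetical and not part of the script above. Run it from the host while the container is still alive, because the cleanup step removes the cgroup directory.</p>

```python
import sys
from pathlib import Path


def parse_memory_events(text):
    """Parse cgroup v2 memory.events content ("key value" per line) into a dict."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only well-formed "name count" lines
        if len(parts) == 2 and parts[1].isdigit():
            events[parts[0]] = int(parts[1])
    return events


def oom_kill_count(cgroup_path):
    """Return the oom_kill counter for a cgroup, or 0 if it cannot be read."""
    events_file = Path(cgroup_path) / "memory.events"
    try:
        return parse_memory_events(events_file.read_text()).get("oom_kill", 0)
    except OSError:
        # Missing file: the cgroup was already removed, or this is not cgroup v2
        return 0


if __name__ == "__main__":
    # Hypothetical default path; pass the real cgroup directory as the first argument
    args = sys.argv[1:]
    path = args[0] if args else "/sys/fs/cgroup/my_container"
    print(f"oom_kill events: {oom_kill_count(path)}")
```

<p>A count above zero after running the allocation test suggests the kernel enforced the limit. A count of zero may mean the process finished within the limit or that the cgroup had already been cleaned up.</p>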

<h3 id="step-4-complete-container-runtime">Step 4: Complete Container Runtime</h3>

<p>Now let’s put everything together into a single, minimal container runtime:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python3
</span><span class="s">"""
A minimal container runtime implementation in Python.
Demonstrates how Docker-like containers work under the hood.
Requires Linux with cgroup v2 and Python 3.12+ (for os.unshare).

Usage:
    sudo python3 container.py run &lt;image_dir&gt; &lt;command&gt;

Example:
    sudo python3 container.py run ./alpine sh
"""</span>

<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">import</span> <span class="nn">subprocess</span>
<span class="kn">import</span> <span class="nn">shutil</span>
<span class="kn">import</span> <span class="nn">uuid</span>
<span class="kn">import</span> <span class="nn">argparse</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="k">class</span> <span class="nc">Container</span><span class="p">:</span>
    <span class="s">"""Represents a container isolated with namespaces, chroot, and cgroups."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">image_dir</span><span class="p">,</span> <span class="n">command</span><span class="p">,</span> <span class="n">memory_mb</span><span class="o">=</span><span class="mi">512</span><span class="p">,</span> <span class="n">cpu_shares</span><span class="o">=</span><span class="mi">1024</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="nb">id</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">uuid</span><span class="p">.</span><span class="n">uuid4</span><span class="p">())[:</span><span class="mi">12</span><span class="p">]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">image_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">image_dir</span><span class="p">).</span><span class="n">resolve</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">command</span> <span class="o">=</span> <span class="n">command</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">memory_mb</span> <span class="o">=</span> <span class="n">memory_mb</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">cpu_shares</span> <span class="o">=</span> <span class="n">cpu_shares</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">cgroup_path</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"/sys/fs/cgroup/container_</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="nb">id</span><span class="si">}</span><span class="s">"</span>

    <span class="k">def</span> <span class="nf">setup_cgroup</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Create and configure cgroup for resource limits."""</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">os</span><span class="p">.</span><span class="n">makedirs</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">cgroup_path</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

            <span class="c1"># Memory limit
</span>            <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">cgroup_path</span><span class="si">}</span><span class="s">/memory.max"</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
                <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">memory_mb</span> <span class="o">*</span> <span class="mi">1024</span> <span class="o">*</span> <span class="mi">1024</span><span class="p">))</span>

            <span class="c1"># CPU limit
</span>            <span class="n">cpu_quota</span> <span class="o">=</span> <span class="nb">int</span><span class="p">((</span><span class="bp">self</span><span class="p">.</span><span class="n">cpu_shares</span> <span class="o">/</span> <span class="mi">1024</span><span class="p">)</span> <span class="o">*</span> <span class="mi">100000</span><span class="p">)</span>
            <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">cgroup_path</span><span class="si">}</span><span class="s">/cpu.max"</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
                <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">cpu_quota</span><span class="si">}</span><span class="s"> 100000"</span><span class="p">)</span>

            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="nb">id</span><span class="si">}</span><span class="s">] Resource limits: </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">memory_mb</span><span class="si">}</span><span class="s">MB RAM, </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">cpu_shares</span><span class="si">}</span><span class="s">/1024 CPU"</span><span class="p">)</span>
        <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Warning: Could not set cgroups: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">setup_network</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""
        Setup network namespace and virtual network interface.
        In a real implementation, this would create veth pairs,
        bridges, and configure iptables for NAT.
        """</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="c1"># Create new network namespace
</span>            <span class="n">os</span><span class="p">.</span><span class="n">unshare</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">CLONE_NEWNET</span><span class="p">)</span>

            <span class="c1"># Bring up loopback interface
</span>            <span class="n">subprocess</span><span class="p">.</span><span class="n">run</span><span class="p">([</span><span class="s">'ip'</span><span class="p">,</span> <span class="s">'link'</span><span class="p">,</span> <span class="s">'set'</span><span class="p">,</span> <span class="s">'lo'</span><span class="p">,</span> <span class="s">'up'</span><span class="p">],</span>
                         <span class="n">stderr</span><span class="o">=</span><span class="n">subprocess</span><span class="p">.</span><span class="n">DEVNULL</span><span class="p">)</span>

            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="nb">id</span><span class="si">}</span><span class="s">] Network namespace created"</span><span class="p">)</span>
        <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Warning: Could not setup network: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">setup_filesystem</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Setup isolated filesystem with mount namespace."""</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="c1"># Ensure image directory exists
</span>            <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">image_dir</span><span class="p">.</span><span class="n">exists</span><span class="p">():</span>
                <span class="k">raise</span> <span class="nb">Exception</span><span class="p">(</span><span class="sa">f</span><span class="s">"Image directory not found: </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">image_dir</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

            <span class="c1"># Create mount namespace
</span>            <span class="n">os</span><span class="p">.</span><span class="n">unshare</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">CLONE_NEWNS</span><span class="p">)</span>

            <span class="c1"># Remount everything private to avoid propagation
</span>            <span class="n">subprocess</span><span class="p">.</span><span class="n">run</span><span class="p">([</span><span class="s">'mount'</span><span class="p">,</span> <span class="s">'--make-rprivate'</span><span class="p">,</span> <span class="s">'/'</span><span class="p">],</span>
                         <span class="n">stderr</span><span class="o">=</span><span class="n">subprocess</span><span class="p">.</span><span class="n">DEVNULL</span><span class="p">)</span>

            <span class="c1"># Change to the container's root
</span>            <span class="n">os</span><span class="p">.</span><span class="n">chroot</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">image_dir</span><span class="p">))</span>
            <span class="n">os</span><span class="p">.</span><span class="n">chdir</span><span class="p">(</span><span class="s">'/'</span><span class="p">)</span>

            <span class="c1"># Mount essential filesystems
</span>            <span class="n">os</span><span class="p">.</span><span class="n">makedirs</span><span class="p">(</span><span class="s">'/proc'</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
            <span class="n">subprocess</span><span class="p">.</span><span class="n">run</span><span class="p">([</span><span class="s">'mount'</span><span class="p">,</span> <span class="s">'-t'</span><span class="p">,</span> <span class="s">'proc'</span><span class="p">,</span> <span class="s">'proc'</span><span class="p">,</span> <span class="s">'/proc'</span><span class="p">],</span>
                         <span class="n">stderr</span><span class="o">=</span><span class="n">subprocess</span><span class="p">.</span><span class="n">DEVNULL</span><span class="p">)</span>

            <span class="n">os</span><span class="p">.</span><span class="n">makedirs</span><span class="p">(</span><span class="s">'/sys'</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
            <span class="n">subprocess</span><span class="p">.</span><span class="n">run</span><span class="p">([</span><span class="s">'mount'</span><span class="p">,</span> <span class="s">'-t'</span><span class="p">,</span> <span class="s">'sysfs'</span><span class="p">,</span> <span class="s">'sys'</span><span class="p">,</span> <span class="s">'/sys'</span><span class="p">],</span>
                         <span class="n">stderr</span><span class="o">=</span><span class="n">subprocess</span><span class="p">.</span><span class="n">DEVNULL</span><span class="p">)</span>

            <span class="n">os</span><span class="p">.</span><span class="n">makedirs</span><span class="p">(</span><span class="s">'/dev'</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
            <span class="n">subprocess</span><span class="p">.</span><span class="n">run</span><span class="p">([</span><span class="s">'mount'</span><span class="p">,</span> <span class="s">'-t'</span><span class="p">,</span> <span class="s">'devtmpfs'</span><span class="p">,</span> <span class="s">'dev'</span><span class="p">,</span> <span class="s">'/dev'</span><span class="p">],</span>
                         <span class="n">stderr</span><span class="o">=</span><span class="n">subprocess</span><span class="p">.</span><span class="n">DEVNULL</span><span class="p">)</span>

            <span class="n">os</span><span class="p">.</span><span class="n">makedirs</span><span class="p">(</span><span class="s">'/tmp'</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
            <span class="n">subprocess</span><span class="p">.</span><span class="n">run</span><span class="p">([</span><span class="s">'mount'</span><span class="p">,</span> <span class="s">'-t'</span><span class="p">,</span> <span class="s">'tmpfs'</span><span class="p">,</span> <span class="s">'tmpfs'</span><span class="p">,</span> <span class="s">'/tmp'</span><span class="p">],</span>
                         <span class="n">stderr</span><span class="o">=</span><span class="n">subprocess</span><span class="p">.</span><span class="n">DEVNULL</span><span class="p">)</span>

            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="nb">id</span><span class="si">}</span><span class="s">] Filesystem isolated (root: </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">image_dir</span><span class="si">}</span><span class="s">)"</span><span class="p">)</span>
        <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">Exception</span><span class="p">(</span><span class="sa">f</span><span class="s">"Failed to setup filesystem: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Run the container."""</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="nb">id</span><span class="si">}</span><span class="s">] Starting container..."</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="nb">id</span><span class="si">}</span><span class="s">] Command: </span><span class="si">{</span><span class="s">' '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">command</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

        <span class="c1"># Setup cgroup first
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">setup_cgroup</span><span class="p">()</span>

        <span class="n">pid</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">fork</span><span class="p">()</span>

        <span class="k">if</span> <span class="n">pid</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
            <span class="c1"># Child process - this becomes the container
</span>            <span class="k">try</span><span class="p">:</span>
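                <span class="c1"># Join the cgroup before forking again so the grandchild inherits it
</span>                <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">cgroup_path</span><span class="si">}</span><span class="s">/cgroup.procs"</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
                    <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">getpid</span><span class="p">()))</span>

                <span class="c1"># Create new namespaces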
</span>                <span class="n">os</span><span class="p">.</span><span class="n">unshare</span><span class="p">(</span>
                    <span class="n">os</span><span class="p">.</span><span class="n">CLONE_NEWPID</span> <span class="o">|</span>   <span class="c1"># Process isolation
</span>                    <span class="n">os</span><span class="p">.</span><span class="n">CLONE_NEWNS</span> <span class="o">|</span>    <span class="c1"># Mount isolation
</span>                    <span class="n">os</span><span class="p">.</span><span class="n">CLONE_NEWUTS</span> <span class="o">|</span>   <span class="c1"># Hostname isolation
</span>                    <span class="n">os</span><span class="p">.</span><span class="n">CLONE_NEWIPC</span>     <span class="c1"># IPC isolation
</span>                <span class="p">)</span>

                <span class="c1"># Fork again to become PID 1 in the new namespace
</span>                <span class="n">container_pid</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">fork</span><span class="p">()</span>

                <span class="k">if</span> <span class="n">container_pid</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
                    <span class="c1"># Grandchild - this is PID 1 in the container
</span>
                    <span class="c1"># Set hostname
</span>                    <span class="n">subprocess</span><span class="p">.</span><span class="n">run</span><span class="p">([</span><span class="s">'hostname'</span><span class="p">,</span> <span class="sa">f</span><span class="s">'container-</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="nb">id</span><span class="si">}</span><span class="s">'</span><span class="p">],</span>
                                 <span class="n">stderr</span><span class="o">=</span><span class="n">subprocess</span><span class="p">.</span><span class="n">DEVNULL</span><span class="p">)</span>

                    <span class="c1"># Setup filesystem
</span>                    <span class="bp">self</span><span class="p">.</span><span class="n">setup_filesystem</span><span class="p">()</span>

                    <span class="c1"># Setup network
</span>                    <span class="bp">self</span><span class="p">.</span><span class="n">setup_network</span><span class="p">()</span>

                    <span class="c1"># Set environment
</span>                    <span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">'HOSTNAME'</span><span class="p">]</span> <span class="o">=</span> <span class="sa">f</span><span class="s">'container-</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="nb">id</span><span class="si">}</span><span class="s">'</span>
                    <span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">'PATH'</span><span class="p">]</span> <span class="o">=</span> <span class="s">'/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin'</span>

                    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="nb">id</span><span class="si">}</span><span class="s">] Container ready!"</span><span class="p">)</span>
                    <span class="k">print</span><span class="p">(</span><span class="s">"="</span><span class="o">*</span><span class="mi">60</span><span class="p">)</span>

                    <span class="c1"># Execute the command
</span>                    <span class="n">os</span><span class="p">.</span><span class="n">execvp</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">command</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="bp">self</span><span class="p">.</span><span class="n">command</span><span class="p">)</span>
                <span class="k">else</span><span class="p">:</span>
                    <span class="c1"># Child process - wait for grandchild
</span>                    <span class="n">os</span><span class="p">.</span><span class="n">waitpid</span><span class="p">(</span><span class="n">container_pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
                    <span class="n">sys</span><span class="p">.</span><span class="nb">exit</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>

            <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
                <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="nb">id</span><span class="si">}</span><span class="s">] Error: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
                <span class="n">sys</span><span class="p">.</span><span class="nb">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="c1"># Parent process
</span>            <span class="k">try</span><span class="p">:</span>
                <span class="c1"># Add container to cgroup
</span>                <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">cgroup_path</span><span class="si">}</span><span class="s">/cgroup.procs"</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
                    <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">pid</span><span class="p">))</span>

                <span class="c1"># Wait for container to exit
</span>                <span class="n">os</span><span class="p">.</span><span class="n">waitpid</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>

            <span class="k">except</span> <span class="nb">KeyboardInterrupt</span><span class="p">:</span>
                <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">[</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="nb">id</span><span class="si">}</span><span class="s">] Interrupted"</span><span class="p">)</span>
            <span class="k">finally</span><span class="p">:</span>
                <span class="bp">self</span><span class="p">.</span><span class="n">cleanup</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">cleanup</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Cleanup container resources."""</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">os</span><span class="p">.</span><span class="n">rmdir</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">cgroup_path</span><span class="p">)</span>
        <span class="k">except</span> <span class="nb">Exception</span><span class="p">:</span>
            <span class="k">pass</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="nb">id</span><span class="si">}</span><span class="s">] Container stopped"</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">parser</span> <span class="o">=</span> <span class="n">argparse</span><span class="p">.</span><span class="n">ArgumentParser</span><span class="p">(</span>
        <span class="n">description</span><span class="o">=</span><span class="s">'A minimal container runtime'</span><span class="p">,</span>
        <span class="n">formatter_class</span><span class="o">=</span><span class="n">argparse</span><span class="p">.</span><span class="n">RawDescriptionHelpFormatter</span><span class="p">,</span>
        <span class="n">epilog</span><span class="o">=</span><span class="s">"""
Examples:
  sudo python3 container.py run /path/to/rootfs bash
  sudo python3 container.py run ./alpine -- sh -c "echo hello from container"
  sudo python3 container.py run ./ubuntu bash --memory 256 --cpu 512
        """</span>
    <span class="p">)</span>

    <span class="n">parser</span><span class="p">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">'action'</span><span class="p">,</span> <span class="n">choices</span><span class="o">=</span><span class="p">[</span><span class="s">'run'</span><span class="p">],</span> <span class="n">help</span><span class="o">=</span><span class="s">'Action to perform'</span><span class="p">)</span>
    <span class="n">parser</span><span class="p">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">'image'</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s">'Path to root filesystem'</span><span class="p">)</span>
    <span class="n">parser</span><span class="p">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">'command'</span><span class="p">,</span> <span class="n">nargs</span><span class="o">=</span><span class="s">'+'</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s">'Command to run in container'</span><span class="p">)</span>
    <span class="n">parser</span><span class="p">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">'--memory'</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">int</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="mi">512</span><span class="p">,</span>
                       <span class="n">help</span><span class="o">=</span><span class="s">'Memory limit in MB (default: 512)'</span><span class="p">)</span>
    <span class="n">parser</span><span class="p">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">'--cpu'</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">int</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="mi">1024</span><span class="p">,</span>
                       <span class="n">help</span><span class="o">=</span><span class="s">'CPU shares, a relative weight (default: 1024)'</span><span class="p">)</span>

    <span class="n">args</span> <span class="o">=</span> <span class="n">parser</span><span class="p">.</span><span class="n">parse_args</span><span class="p">()</span>

    <span class="c1"># Check root
</span>    <span class="k">if</span> <span class="n">os</span><span class="p">.</span><span class="n">geteuid</span><span class="p">()</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"Error: This program requires root privileges"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"Run with: sudo python3 container.py ..."</span><span class="p">)</span>
        <span class="n">sys</span><span class="p">.</span><span class="nb">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>

    <span class="k">if</span> <span class="n">args</span><span class="p">.</span><span class="n">action</span> <span class="o">==</span> <span class="s">'run'</span><span class="p">:</span>
        <span class="n">container</span> <span class="o">=</span> <span class="n">Container</span><span class="p">(</span>
            <span class="n">args</span><span class="p">.</span><span class="n">image</span><span class="p">,</span>
            <span class="n">args</span><span class="p">.</span><span class="n">command</span><span class="p">,</span>
            <span class="n">memory_mb</span><span class="o">=</span><span class="n">args</span><span class="p">.</span><span class="n">memory</span><span class="p">,</span>
            <span class="n">cpu_shares</span><span class="o">=</span><span class="n">args</span><span class="p">.</span><span class="n">cpu</span>
        <span class="p">)</span>
        <span class="n">container</span><span class="p">.</span><span class="n">run</span><span class="p">()</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">main</span><span class="p">()</span>
</code></pre></div></div>
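
<p>One caveat with this parser, as far as I can tell: <code class="language-plaintext highlighter-rouge">argparse</code> tries to parse a flag that belongs to the container command, such as <code class="language-plaintext highlighter-rouge">-c</code>, as one of its own options. Placing <code class="language-plaintext highlighter-rouge">--</code> before the command avoids that. The sketch below rebuilds a parser with the same arguments as <code class="language-plaintext highlighter-rouge">main()</code> and shows the separator in use. The <code class="language-plaintext highlighter-rouge">build_parser</code> helper is hypothetical and not part of <code class="language-plaintext highlighter-rouge">container.py</code>.</p>

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical helper: same arguments as main(), without help text."""
    parser = argparse.ArgumentParser(prog="container.py")
    parser.add_argument("action", choices=["run"])
    parser.add_argument("image")
    parser.add_argument("command", nargs="+")
    parser.add_argument("--memory", type=int, default=512)
    parser.add_argument("--cpu", type=int, default=1024)
    return parser


parser = build_parser()

# Runtime options go before "--". Everything after "--" is treated as
# positional, so "-c" ends up in the container command instead of being
# parsed as an option of container.py.
args = parser.parse_args(
    ["run", "./alpine", "--memory", "256", "--", "sh", "-c", "echo hello"]
)

print(args.command)  # ['sh', '-c', 'echo hello']
print(args.memory)   # 256
```

<p>On the command line this corresponds to <code class="language-plaintext highlighter-rouge">sudo python3 container.py run ./alpine --memory 256 -- sh -c "echo hello"</code>.</p>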

<h2 id="testing-your-container-runtime">Testing Your Container Runtime</h2>

<p>To test this, you’ll need a root filesystem. Here are two ways to create a minimal one: export it from an existing Docker image, or download Alpine’s mini root filesystem directly. Pick one:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Create a directory for our container image</span>
<span class="nb">mkdir </span>alpine_rootfs

<span class="c"># Export an Alpine Linux filesystem (requires Docker)</span>
docker <span class="nb">export</span> <span class="si">$(</span>docker create alpine:latest<span class="si">)</span> | <span class="nb">tar</span> <span class="nt">-C</span> alpine_rootfs <span class="nt">-xf</span> -

<span class="c"># Or, without Docker, download a minimal rootfs instead</span>
wget https://dl-cdn.alpinelinux.org/alpine/v3.18/releases/x86_64/alpine-minirootfs-3.18.0-x86_64.tar.gz
<span class="nb">mkdir</span> <span class="nt">-p</span> alpine_rootfs
<span class="nb">tar</span> <span class="nt">-xzf</span> alpine-minirootfs-3.18.0-x86_64.tar.gz <span class="nt">-C</span> alpine_rootfs

<span class="c"># Now run your container!</span>
<span class="nb">sudo </span>python3 container.py run alpine_rootfs sh
</code></pre></div></div>

<p>Inside the container, you can verify isolation:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Check hostname</span>
<span class="nb">hostname</span>  <span class="c"># Should show container-&lt;id&gt;</span>

<span class="c"># Check processes (only container processes)</span>
ps aux

<span class="c"># Check filesystem (should only see alpine files)</span>
<span class="nb">ls</span> /

<span class="c"># Check resource limits</span>
<span class="nb">cat</span> /sys/fs/cgroup/memory.max
</code></pre></div></div>
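
<p>The same checks can be scripted from the host. Here is a minimal sketch, assuming a Linux host with <code class="language-plaintext highlighter-rouge">/proc</code> mounted and cgroups v2 in use. The helper names are hypothetical. Calling <code class="language-plaintext highlighter-rouge">namespace_ids()</code> for the host shell and again with the PID of the container process should return different IDs for every namespace the runtime unshared.</p>

```python
import os
from typing import Dict, Optional, Tuple


def parse_memory_max(raw: str) -> Optional[float]:
    """Convert the contents of a cgroup v2 memory.max file to MB.

    The kernel writes the literal string "max" when no limit is set,
    which is returned here as None.
    """
    value = raw.strip()
    if value == "max":
        return None
    return int(value) / (1024 * 1024)


def parse_ns_link(link: str) -> Tuple[str, int]:
    """Split a /proc/<pid>/ns/* link target such as 'pid:[4026531836]'."""
    kind, _, rest = link.partition(":")
    return kind, int(rest.strip("[]"))


def namespace_ids(pid: str = "self") -> Dict[str, int]:
    """Return {entry name: namespace ID} for one process (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    ids = {}
    for name in sorted(os.listdir(ns_dir)):
        _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        ids[name] = inode
    return ids


if __name__ == "__main__":
    # Only touch /proc and /sys when they are actually present.
    if os.path.isdir("/proc/self/ns"):
        for name, inode in namespace_ids().items():
            print(f"{name}: {inode}")
    limit_file = "/sys/fs/cgroup/memory.max"
    if os.path.isfile(limit_file):
        with open(limit_file) as fh:
            print("memory limit (MB):", parse_memory_max(fh.read()))
```

<p>Two processes share a namespace exactly when the IDs match, so a differing <code class="language-plaintext highlighter-rouge">pid</code>, <code class="language-plaintext highlighter-rouge">mnt</code>, or <code class="language-plaintext highlighter-rouge">uts</code> entry is direct evidence of isolation.</p>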

<h2 id="what-we-built-vs-what-docker-does">What We Built vs. What Docker Does</h2>

<p>Our container runtime demonstrates the core concepts, but production container runtimes like Docker/containerd do much more:</p>

<p><strong>What we built:</strong></p>
<ul>
  <li>Process isolation (PID namespaces)</li>
  <li>Filesystem isolation (mount namespaces + chroot)</li>
  <li>Resource limits (cgroups v2)</li>
  <li>Basic network isolation</li>
  <li>Hostname isolation (UTS namespace)</li>
</ul>

<p><strong>What Docker adds:</strong></p>
<ul>
  <li><strong>Image Management</strong>: Layered filesystems using overlay2 (AUFS in older releases)</li>
  <li><strong>Image Distribution</strong>: Pulling images from registries</li>
  <li><strong>Advanced Networking</strong>: Bridge networks, overlay networks, port mapping</li>
  <li><strong>Volume Management</strong>: Persistent storage with bind mounts and volumes</li>
  <li><strong>Security Features</strong>: seccomp profiles, AppArmor/SELinux, capability dropping</li>
  <li><strong>Container Orchestration APIs</strong>: REST API for managing containers</li>
  <li><strong>Logging &amp; Monitoring</strong>: stdout/stderr capture, metrics collection</li>
  <li><strong>Health Checks</strong>: Container health monitoring</li>
  <li><strong>Restart Policies</strong>: Automatic restart on failure</li>
</ul>

<h2 id="understanding-the-security-implications">Understanding the Security Implications</h2>

<p>It’s crucial to understand that our simple implementation lacks many security features:</p>

<ol>
  <li>
    <p><strong>No User Namespaces</strong>: Root inside our container is real root on the host. Production containers should use user namespaces to map container root to an unprivileged user on the host.</p>
  </li>
  <li>
    <p><strong>No seccomp</strong>: We don’t restrict system calls. Docker uses seccomp profiles to block dangerous syscalls.</p>
  </li>
  <li>
    <p><strong>No Capability Dropping</strong>: Our containers have all Linux capabilities. Docker drops most by default.</p>
  </li>
  <li>
    <p><strong>No AppArmor/SELinux</strong>: No mandatory access control.</p>
  </li>
</ol>
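
<p>To make the capability point concrete, here is a minimal sketch that decodes the <code class="language-plaintext highlighter-rouge">CapEff</code> bitmask from <code class="language-plaintext highlighter-rouge">/proc/&lt;pid&gt;/status</code> and compares it with Docker’s default set. The helper names are hypothetical, and the default list is my reading of the Docker documentation, so treat it as an assumption and check it against your Docker version.</p>

```python
import os
from typing import Set

# Capability names indexed by bit number, following linux/capability.h
# (bit 0 = CAP_CHOWN ... bit 40 = CAP_CHECKPOINT_RESTORE).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE",
    "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK",
    "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT",
    "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT",
    "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG",
    "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL",
    "CAP_SETFCAP", "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG",
    "CAP_WAKE_ALARM", "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON",
    "CAP_BPF", "CAP_CHECKPOINT_RESTORE",
]

# Assumption: the capabilities Docker keeps by default, per its docs.
DOCKER_DEFAULT_CAPS = {
    "CAP_AUDIT_WRITE", "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_MKNOD", "CAP_NET_BIND_SERVICE",
    "CAP_NET_RAW", "CAP_SETFCAP", "CAP_SETGID", "CAP_SETPCAP",
    "CAP_SETUID", "CAP_SYS_CHROOT",
}


def decode_caps(mask: int) -> Set[str]:
    """Turn a capability bitmask into the set of capability names it holds."""
    return {name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)}


def effective_caps(status_text: str) -> Set[str]:
    """Extract and decode the CapEff line from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_caps(int(line.split()[1], 16))
    raise ValueError("no CapEff line found")


if __name__ == "__main__" and os.path.isfile("/proc/self/status"):
    with open("/proc/self/status") as fh:
        held = effective_caps(fh.read())
    # Anything listed here is a capability Docker would have dropped.
    print("Beyond Docker's defaults:", sorted(held - DOCKER_DEFAULT_CAPS))
```

<p>Run as root on the host, this typically lists powerful capabilities such as <code class="language-plaintext highlighter-rouge">CAP_SYS_ADMIN</code>, which is what a process inside our container inherits as well.</p>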

<p>These missing features are why you should never use this implementation in production!</p>

<h2 id="conclusion">Conclusion</h2>

<p>By building this container runtime, we’ve demystified how containers actually work. They’re not magic - they’re clever applications of Linux kernel features that have existed for years:</p>

<ul>
  <li><strong>Namespaces</strong> (2002-2013): Provide isolation</li>
  <li><strong>cgroups</strong> (2007): Provide resource limiting</li>
  <li><strong>chroot</strong> (1979!): Provides filesystem isolation</li>
</ul>

<p>Docker’s innovation wasn’t inventing these technologies - it was packaging them into an easy-to-use tool with great developer experience.</p>

<p>Understanding these fundamentals makes you a better DevOps engineer. When things go wrong in production, you’ll know where to look. When you need to optimize container performance, you’ll understand the levers you can pull.</p>

<h2 id="further-learning">Further Learning</h2>

<p>If you enjoyed this deep dive, here are resources to continue learning:</p>

<ul>
  <li><strong>Linux Namespaces</strong>: <code class="language-plaintext highlighter-rouge">man namespaces</code>, <code class="language-plaintext highlighter-rouge">man unshare</code></li>
  <li><strong>cgroups</strong>: <a href="https://www.kernel.org/doc/Documentation/cgroup-v2.txt">Kernel cgroups documentation</a></li>
  <li><strong>OCI Runtime Spec</strong>: The standard container runtime specification</li>
  <li><strong>runc Source Code</strong>: The low-level OCI runtime that Docker uses under the hood</li>
  <li><strong>LXC/LXD</strong>: The Linux Containers project, an earlier container implementation that predates Docker</li>
</ul>

<p>I also highly recommend <a href="https://app.codecrafters.io/join?via=mraza007">CodeCrafters’ “Build Your Own Docker” challenge</a> - it’s an interactive way to build a container runtime with guided steps.</p>

<h2 id="next-steps">Next Steps</h2>

<p>In a future post, I might explore:</p>
<ul>
  <li>Implementing container image layers with overlay filesystems</li>
  <li>Building container networking from scratch (veth pairs, bridges, NAT)</li>
  <li>Creating a simple container orchestrator (mini-Kubernetes)</li>
</ul>

<p>Let me know in the comments what you’d like to see next!</p>


          ]]>
        </description>
        <pubDate>Mon, 28 Oct 2024 00:00:00 +0000</pubDate>
        <link>//muhammadraza.me/2024/building-container-runtime-python/</link>
        <guid isPermaLink="true">//muhammadraza.me/2024/building-container-runtime-python/</guid>
        
        <category>python</category>
        
        <category>linux</category>
        
        
        
        <dc:creator>Muhammad Raza</dc:creator>
        <dc:rights></dc:rights>
      </item>
    
  </channel>
</rss>
