<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title></title>
    <description>Personal Blog where I write about things I learn or discover.</description>
    <link>//muhammadraza.me/</link>
    <atom:link href="//muhammadraza.me/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Wed, 10 Jun 2026 03:44:52 +0000</pubDate>
    <lastBuildDate>Wed, 10 Jun 2026 03:44:52 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>How ECS Actually Works: A Visual Guide for People Who Know Kubernetes</title>
        <description>
          <![CDATA[
            
            <p>Every few months I have the same conversation. A small team, three to eight engineers, is containerizing their app, and someone says “we should use Kubernetes, that’s the industry standard.” Six months later they’re maintaining a small distributed systems platform on the side, and the app they were supposed to ship is still competing for attention with CNI upgrades.</p>

<p>I’ve written before about <a href="/2025/ecs-decisions-that-waste-6-weeks/">the ECS decisions that waste six weeks</a>. This post is the prequel: what ECS actually is, how it maps onto the Kubernetes concepts you already know, and what you stop carrying on your pager when you choose it. There are a few interactive diagrams below. Click around in them; they teach the model faster than prose does.</p>

<p>One thing before we start: this is not a “Kubernetes bad” post. EKS is the right choice for some teams, and I’ll tell you exactly which ones at the end. But I’ve watched too many three-person teams default to EKS because it felt like the serious choice, without anyone explaining what they were signing up to operate.</p>

<h2 id="ecs-is-an-orchestrator-thats-it">ECS is an orchestrator. That’s it.</h2>

<p>Strip away the branding and every container orchestrator does the same job: you declare what should be running, and a control loop makes reality match the declaration. Kubernetes does this. Nomad does this. ECS does this.</p>

<p>ECS just exposes far fewer moving parts to you. Here’s the whole object model. Click each piece:</p>

<style>
/* ---- ecsx shared widget styles (scoped) ---- */
.ecsx{background:#0e131b;border:1px solid #2a3447;border-radius:8px;padding:18px;margin:1.6em 0;
  font-family:"JetBrains Mono",monospace;color:#dbe2ee;font-size:13px;line-height:1.5}
.ecsx *{box-sizing:border-box}
.ecsx-title{font-size:11px;letter-spacing:.18em;color:#7d8aa3;margin-bottom:14px;text-transform:uppercase}
.ecsx button{font-family:inherit;font-size:12px;background:#1a2333;color:#dbe2ee;border:1px solid #36435c;
  border-radius:5px;padding:7px 12px;cursor:pointer;transition:all .15s}
.ecsx button:hover{border-color:#ffb454;color:#ffb454}
.ecsx button:disabled{opacity:.4;cursor:default}
.ecsx-badge{display:inline-block;font-size:10px;padding:2px 8px;border-radius:99px;border:1px solid #4b79c4;
  color:#8db8f8;margin-left:8px;white-space:nowrap}
.ecsx-flex{display:flex;gap:16px;flex-wrap:wrap}
@media(max-width:640px){.ecsx{font-size:12px}}
/* anatomy */
.ecsx-anat-box{border:1.5px solid;border-radius:7px;padding:10px;cursor:pointer;transition:background .15s}
.ecsx-anat-box:hover{background:rgba(255,180,84,.06)}
.ecsx-anat-box.sel{background:rgba(255,180,84,.12)}
.ecsx-anat-label{font-size:11px;letter-spacing:.08em;margin-bottom:8px;font-weight:700}
.ecsx-info{flex:1;min-width:240px;border-left:2px solid #ffb454;padding:4px 0 4px 14px;align-self:center}
.ecsx-info h4{margin:0 0 6px;font-size:14px;color:#ffb454;font-family:inherit}
.ecsx-info p{margin:0;color:#aab4c8;font-size:12.5px}
/* recon */
.ecsx-taskgrid{display:flex;gap:10px;flex-wrap:wrap;min-height:84px;margin:12px 0}
.ecsx-task{width:118px;border:1.5px solid #3e9c5a;border-radius:6px;padding:8px;cursor:pointer;
  transition:opacity .4s, transform .4s}
.ecsx-task .id{font-size:11px;color:#7d8aa3}
.ecsx-task .st{font-size:11px;font-weight:700;margin-top:4px}
.ecsx-task.RUNNING{border-color:#3e9c5a}.ecsx-task.RUNNING .st{color:#79d68a}
.ecsx-task.PROVISIONING{border-color:#b98a3c;animation:ecsxpulse 1s infinite}.ecsx-task.PROVISIONING .st{color:#ffb454}
.ecsx-task.DRAINING{border-color:#5c677c;opacity:.55}.ecsx-task.DRAINING .st{color:#8b96ab}
.ecsx-task.STOPPED{border-color:#c44f5e;opacity:.25;transform:scale(.92)}.ecsx-task.STOPPED .st{color:#ff6b7d}
.ecsx-ver{display:inline-block;font-size:10px;padding:1px 7px;border-radius:99px;margin-top:5px}
.ecsx-ver.v1{background:#16344e;color:#7fc4ff}.ecsx-ver.v2{background:#33234e;color:#c9a6ff}
@keyframes ecsxpulse{50%{background:rgba(255,180,84,.08)}}
.ecsx-log{background:#0a0e15;border:1px solid #232c3d;border-radius:6px;padding:10px 12px;font-size:11.5px;
  height:118px;overflow:hidden;display:flex;flex-direction:column;justify-content:flex-end;color:#94a0b8}
.ecsx-log .t{color:#525e75;margin-right:8px}
.ecsx-log .hl{color:#ffb454}.ecsx-log .ok{color:#79d68a}.ecsx-log .bad{color:#ff6b7d}
.ecsx-svchead{display:flex;gap:18px;flex-wrap:wrap;font-size:12px;color:#aab4c8;margin-bottom:4px}
.ecsx-svchead b{color:#dbe2ee}
/* stack */
.ecsx-cols{display:flex;gap:14px;flex-wrap:wrap;margin-top:12px}
.ecsx-col{flex:1;min-width:230px}
.ecsx-colhead{text-align:center;font-weight:700;font-size:13px;padding:8px;border-bottom:2px solid #36435c;margin-bottom:8px}
.ecsx-cell{border-radius:5px;padding:8px 10px;margin-bottom:6px;font-size:12px;border:1px solid;min-height:54px}
.ecsx-cell .who{font-size:10px;font-weight:700;letter-spacing:.1em;display:block;margin-bottom:2px}
.ecsx-cell.aws{background:rgba(62,156,90,.10);border-color:#2c5e3e}.ecsx-cell.aws .who{color:#79d68a}
.ecsx-cell.you{background:rgba(255,180,84,.10);border-color:#7a5a28}.ecsx-cell.you .who{color:#ffb454}
.ecsx-cell.na{background:rgba(120,130,150,.05);border-color:#2a3447;color:#67738c}.ecsx-cell.na .who{color:#67738c}
.ecsx-score{margin-top:10px;padding:10px 12px;background:#0a0e15;border:1px solid #232c3d;border-radius:6px;
  font-size:12.5px;color:#aab4c8}
.ecsx-score b{color:#ffb454}
.ecsx-toggle{display:inline-flex;border:1px solid #36435c;border-radius:6px;overflow:hidden;margin-left:10px}
.ecsx-toggle button{border:none;border-radius:0;padding:5px 12px;font-size:11px}
.ecsx-toggle button.on{background:#ffb454;color:#1a1206}
</style>

<div class="ecsx" id="ecsx-anatomy">
  <div class="ecsx-title">The entire ECS object model — click anything</div>
  <div class="ecsx-flex">
    <div style="flex:1.4;min-width:280px">
      <div class="ecsx-anat-box" data-k="cluster" style="border-color:#4b79c4">
        <div class="ecsx-anat-label" style="color:#8db8f8">CLUSTER</div>
        <div class="ecsx-anat-box" data-k="service" style="border-color:#3e9c5a">
          <div class="ecsx-anat-label" style="color:#79d68a">SERVICE — web · desired: 3</div>
          <div style="display:flex;gap:8px;flex-wrap:wrap">
            <div class="ecsx-anat-box" data-k="task" style="border-color:#ffb454;flex:1;min-width:110px">
              <div class="ecsx-anat-label" style="color:#ffb454">TASK</div>
              <div class="ecsx-anat-box" data-k="container" style="border-color:#c9a6ff;font-size:11px;color:#c9a6ff">container: app</div>
              <div class="ecsx-anat-box" data-k="container" style="border-color:#c9a6ff;font-size:11px;color:#c9a6ff;margin-top:6px">container: nginx</div>
            </div>
            <div class="ecsx-anat-box" data-k="task" style="border-color:#ffb454;flex:1;min-width:110px">
              <div class="ecsx-anat-label" style="color:#ffb454">TASK</div>
              <div class="ecsx-anat-box" data-k="container" style="border-color:#c9a6ff;font-size:11px;color:#c9a6ff">container: app</div>
              <div class="ecsx-anat-box" data-k="container" style="border-color:#c9a6ff;font-size:11px;color:#c9a6ff;margin-top:6px">container: nginx</div>
            </div>
            <div class="ecsx-anat-box" data-k="task" style="border-color:#ffb454;flex:1;min-width:110px">
              <div class="ecsx-anat-label" style="color:#ffb454">TASK</div>
              <div class="ecsx-anat-box" data-k="container" style="border-color:#c9a6ff;font-size:11px;color:#c9a6ff">container: app</div>
              <div class="ecsx-anat-box" data-k="container" style="border-color:#c9a6ff;font-size:11px;color:#c9a6ff;margin-top:6px">container: nginx</div>
            </div>
          </div>
        </div>
      </div>
      <div class="ecsx-anat-box" data-k="taskdef" style="border-color:#e06c75;margin-top:10px">
        <div class="ecsx-anat-label" style="color:#e06c75">TASK DEFINITION — web:42 <span style="color:#67738c;font-weight:400">(the blueprint the service stamps tasks from)</span></div>
      </div>
    </div>
    <div class="ecsx-info" id="ecsx-anat-info">
      <h4>Click a component</h4>
      <p>Every box on the left has a direct Kubernetes equivalent. Click to see what it is and what it maps to.</p>
    </div>
  </div>
</div>

<script>
(function(){
  var INFO = {
    cluster:  ['Cluster', 'Kubernetes equivalent: cluster',
      'A logical boundary for compute and workloads. Unlike a Kubernetes cluster, there is no control plane living inside it that you can see, version, or break — the scheduler and state store are an AWS regional service. There is nothing to upgrade. Ever.'],
    service:  ['Service', 'Kubernetes equivalent: Deployment + ReplicaSet + Service',
      'Holds the declaration: "keep N copies of this task definition running, registered behind this load balancer target group." It is the reconciliation loop — it replaces dead tasks, performs rolling deployments, and hooks into autoscaling. One ECS object does what a Deployment, ReplicaSet, and Service do together in Kubernetes.'],
    task:     ['Task', 'Kubernetes equivalent: Pod',
      'One running copy of your workload: one or more containers scheduled together on the same host, sharing a network namespace and an IAM role. With the awsvpc network mode every task gets its own ENI and private IP — same mental model as a pod IP.'],
    container:['Container', 'Kubernetes equivalent: container',
      'Exactly what you think it is. Sidecars work the same way as in a pod — an nginx or log-router container scheduled next to your app container inside the same task.'],
    taskdef:  ['Task Definition', 'Kubernetes equivalent: pod spec (+ a bit of Deployment)',
      'A versioned, immutable JSON document: images, CPU/memory, env vars, ports, volumes, IAM role. Every revision gets a number (web:41, web:42). A deployment is literally "point the service at a new revision." No Helm, no templating layer — which is both the good news and the bad news.'],
  };
  var root = document.getElementById('ecsx-anatomy');
  var info = document.getElementById('ecsx-anat-info');
  root.addEventListener('click', function(e){
    var box = e.target.closest('.ecsx-anat-box');
    if(!box) return;
    e.stopPropagation();
    root.querySelectorAll('.ecsx-anat-box').forEach(function(b){b.classList.remove('sel')});
    box.classList.add('sel');
    var d = INFO[box.dataset.k];
    info.innerHTML = '<h4>'+d[0]+'<span class="ecsx-badge">'+d[1]+'</span></h4><p>'+d[2]+'</p>';
  }, true);
})();
</script>

<p>If you know Kubernetes, the translation table is short enough to memorize over coffee:</p>

<table>
  <thead>
    <tr>
      <th>ECS</th>
      <th>Kubernetes</th>
      <th>What it is</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Cluster</td>
      <td>Cluster</td>
      <td>Logical boundary for compute + workloads</td>
    </tr>
    <tr>
      <td>Service</td>
      <td>Deployment + ReplicaSet + Service</td>
      <td>“Keep N running, behind this LB”</td>
    </tr>
    <tr>
      <td>Task</td>
      <td>Pod</td>
      <td>Co-scheduled containers, shared network + identity</td>
    </tr>
    <tr>
      <td>Task definition</td>
      <td>Pod spec</td>
      <td>Versioned blueprint for a task</td>
    </tr>
    <tr>
      <td>Capacity provider</td>
      <td>Node group / Karpenter</td>
      <td>Where compute comes from</td>
    </tr>
    <tr>
      <td>Fargate</td>
      <td>— (closest: virtual kubelet)</td>
      <td>Serverless compute, no nodes at all</td>
    </tr>
    <tr>
      <td>Task IAM role</td>
      <td>ServiceAccount + IRSA</td>
      <td>Per-workload cloud credentials</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">awsvpc</code> mode</td>
      <td>CNI</td>
      <td>Every task gets its own ENI/IP; on Fargate it is the only network mode, not a choice</td>
    </tr>
  </tbody>
</table>

<p>That last column is where the story actually lives. In Kubernetes, “where compute comes from” and “how pods get IPs” and “how workloads get cloud credentials” are all <em>decisions</em> with an ecosystem of competing answers. In ECS they’re defaults. You don’t pick a CNI. You don’t wire up an OIDC provider for IRSA. There’s one way, it’s boring, and it works.</p>
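<p>To make the "versioned blueprint" row concrete, here is a rough sketch of what a task definition for the <code class="language-plaintext highlighter-rouge">web</code> service in the diagram might look like, written as a JavaScript object. The field names follow the ECS <code class="language-plaintext highlighter-rouge">RegisterTaskDefinition</code> API as I understand it; the account ID, role name, image tags, ports, and sizes are all placeholders I made up for illustration.</p>

```javascript
// Illustrative task definition for the "web" service from the diagram.
// Field names follow the ECS RegisterTaskDefinition API; every value
// (account ID, role name, image tags, ports, sizes) is a made-up placeholder.
var taskDefinition = {
  family: 'web',                         // revisions become web:41, web:42, ...
  networkMode: 'awsvpc',                 // each task gets its own ENI and private IP
  requiresCompatibilities: ['FARGATE'],  // where the compute comes from
  cpu: '512',                            // task-level CPU units, passed as a string
  memory: '1024',                        // task-level memory in MiB, passed as a string
  taskRoleArn: 'arn:aws:iam::123456789012:role/web-task-role', // per-workload credentials
  containerDefinitions: [
    {
      name: 'app',
      image: 'example/app:1.0.0',
      essential: true,                   // if this container exits, the task stops
      portMappings: [{ containerPort: 8080, protocol: 'tcp' }],
      environment: [{ name: 'LOG_LEVEL', value: 'info' }]
    },
    {
      name: 'nginx',                     // sidecar, scheduled next to the app container
      image: 'nginx:stable',
      essential: true,
      portMappings: [{ containerPort: 80, protocol: 'tcp' }]
    }
  ]
};

// The document you would register is just this object serialized as JSON.
console.log(JSON.stringify(taskDefinition, null, 2));
```

<p>Registering a changed copy of this document produces the next revision number, and pointing the service at that revision is the deployment.</p>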

<h2 id="the-reconciliation-loop--same-idea-fewer-layers">The reconciliation loop — same idea, fewer layers</h2>

<p>The core idea both systems share: you declare desired state, a control loop enforces it. This is the part I find people understand instantly once they <em>watch</em> it instead of reading about it.</p>

<p>Below is an ECS service with <code class="language-plaintext highlighter-rouge">desired count: 4</code>. Click a task to kill it, then watch the scheduler notice and replace it. Then hit deploy and watch a rolling deployment do exactly what a Kubernetes Deployment rollout does: bring up new tasks, drain old ones, never drop below healthy.</p>

<div class="ecsx" id="ecsx-recon">
  <div class="ecsx-title">Service reconciliation — click a task to kill it</div>
  <div class="ecsx-svchead">
    <span>service: <b>web</b></span>
    <span>desired: <b>4</b></span>
    <span>running: <b id="ecsx-running">4</b></span>
    <span>revision: <b id="ecsx-rev">web:41</b></span>
  </div>
  <div class="ecsx-taskgrid" id="ecsx-tasks"></div>
  <div style="display:flex;gap:10px;margin-bottom:12px;flex-wrap:wrap">
    <button id="ecsx-kill">⚡ kill a task</button>
    <button id="ecsx-deploy">🚀 deploy web:42</button>
    <button id="ecsx-reset">↺ reset</button>
  </div>
  <div class="ecsx-log" id="ecsx-loglines"></div>
</div>

<script>
(function(){
  var grid = document.getElementById('ecsx-tasks');
  var logEl = document.getElementById('ecsx-loglines');
  var runEl = document.getElementById('ecsx-running');
  var revEl = document.getElementById('ecsx-rev');
  var DESIRED = 4, tasks = [], logs = [], t0 = Date.now(), deploying = false, timers = [];

  function now(){ return ((Date.now()-t0)/1000).toFixed(1)+'s'; }
  function log(msg, cls){
    logs.push('<div><span class="t">'+now()+'</span><span class="'+(cls||'')+'">'+msg+'</span></div>');
    logs = logs.slice(-7); logEl.innerHTML = logs.join('');
  }
  function id(){ return Math.random().toString(16).slice(2,8); }
  function later(fn, ms){ timers.push(setTimeout(fn, ms)); }

  function render(){
    grid.innerHTML = tasks.map(function(t){
      return '<div class="ecsx-task '+t.st+'" data-id="'+t.id+'">'+
        '<div class="id">'+t.id+'</div>'+
        '<div class="st">'+t.st+'</div>'+
        '<span class="ecsx-ver '+t.ver+'">web:'+(t.ver==='v1'?41:42)+'</span></div>';
    }).join('');
    runEl.textContent = tasks.filter(function(t){return t.st==='RUNNING'}).length;
  }

  function spawn(ver, cb){
    var t = { id:id(), ver:ver, st:'PROVISIONING' };
    tasks.push(t); render();
    log('scheduler: starting task <span class="hl">'+t.id+'</span> ('+(ver==='v1'?'web:41':'web:42')+')');
    later(function(){
      t.st = 'RUNNING'; render();
      log('task <span class="hl">'+t.id+'</span> RUNNING — registered with target group','ok');
      if(cb) cb(t);
    }, 1900 + Math.random()*700);
  }

  function reconcile(){
    if (deploying) return;
    var alive = tasks.filter(function(t){return t.st==='RUNNING'||t.st==='PROVISIONING'}).length;
    if (alive < DESIRED){
      log('service web: running ('+alive+') below desired ('+DESIRED+')','hl');
      spawn(tasks.some(function(t){return t.ver==='v2'}) ? 'v2' : 'v1');
    }
  }

  function kill(tid){
    var t = tasks.find(function(x){return x.id===tid && x.st==='RUNNING'});
    if(!t) return;
    t.st='STOPPED'; render();
    log('task <span class="bad">'+t.id+'</span> stopped (essential container exited)','bad');
    later(function(){ tasks = tasks.filter(function(x){return x!==t}); render(); reconcile(); }, 900);
  }

  grid.addEventListener('click', function(e){
    var el = e.target.closest('.ecsx-task'); if(el) kill(el.dataset.id);
  });
  document.getElementById('ecsx-kill').onclick = function(){
    var r = tasks.filter(function(t){return t.st==='RUNNING'});
    if(r.length) kill(r[Math.floor(Math.random()*r.length)].id);
  };

  document.getElementById('ecsx-deploy').onclick = function(){
    if (deploying || tasks.some(function(t){return t.ver==='v2'})) return;
    deploying = true;
    revEl.textContent = 'web:42';
    log('deployment started: web:41 → web:42 (rolling, min healthy 100%)','hl');
    (function step(){
      var olds = tasks.filter(function(t){return t.ver==='v1' && t.st==='RUNNING'});
      if (!olds.length){ deploying=false; log('deployment completed: '+tasks.filter(function(t){return t.ver==='v2' && t.st==='RUNNING'}).length+'/'+DESIRED+' tasks on web:42','ok'); return; }
      spawn('v2', function(){
        var old = tasks.find(function(t){return t.ver==='v1' && t.st==='RUNNING'});
        if (old){
          old.st='DRAINING'; render();
          log('task <span class="hl">'+old.id+'</span> draining connections…');
          later(function(){
            tasks = tasks.filter(function(x){return x!==old}); render();
            log('task '+old.id+' deregistered + stopped');
            step();
          }, 1400);
        } else step();
      });
    })();
  };

  function reset(){
    timers.forEach(clearTimeout); timers=[]; tasks=[]; logs=[]; deploying=false; t0=Date.now();
    revEl.textContent='web:41';
    for (var i=0;i<DESIRED;i++) tasks.push({id:id(), ver:'v1', st:'RUNNING'});
    render(); log('service web: steady state — 4/4 running','ok');
  }
  document.getElementById('ecsx-reset').onclick = reset;
  setInterval(reconcile, 1200);
  reset();
})();
</script>

<p>That’s a Deployment rollout and a ReplicaSet self-heal, except nobody installed anything to get it. There’s no controller manager to version. You get all of this the moment you create a service.</p>
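<p>If you prefer code to animation, one pass of that control loop can be sketched as a pure function. This is a simplification for illustration, not how the ECS scheduler is implemented: the task shape (<code class="language-plaintext highlighter-rouge">id</code>, <code class="language-plaintext highlighter-rouge">status</code>) and the returned <code class="language-plaintext highlighter-rouge">start</code>/<code class="language-plaintext highlighter-rouge">stop</code> fields are names I invented.</p>

```javascript
// One pass of a desired-state control loop, reduced to a pure function.
// Hypothetical shapes: a task is {id, status}; the result says how many
// tasks to start and which task ids to stop.
function reconcile(desiredCount, tasks) {
  // Tasks that are running or on their way up count toward the desired total.
  var alive = tasks.filter(function (t) {
    return t.status === 'RUNNING' || t.status === 'PROVISIONING';
  });
  var missing = Math.max(desiredCount - alive.length, 0);
  var surplus = Math.max(alive.length - desiredCount, 0);
  return {
    start: missing,
    // When over the desired count, stop the most recently listed tasks.
    stop: alive.slice(alive.length - surplus).map(function (t) { return t.id; })
  };
}

var observed = [
  { id: 'a1', status: 'RUNNING' },
  { id: 'b2', status: 'RUNNING' },
  { id: 'c3', status: 'STOPPED' },
  { id: 'd4', status: 'PROVISIONING' }
];

console.log(reconcile(4, observed)); // one task short: { start: 1, stop: [] }
console.log(reconcile(2, observed)); // one task over:  { start: 0, stop: [ 'd4' ] }
```

<p>Run that on a timer against the observed state and you have the self-healing behaviour from the widget. Everything else a real scheduler does is placement and health checking layered on top.</p>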

<p>When I help teams ship on ECS, this is where it clicks: you already understand ECS. If you can reason about desired state and reconciliation, the orchestration knowledge transfers completely. What doesn’t transfer is the operational surface area, and that’s the actual argument.</p>

<h2 id="what-you-stop-operating">What you stop operating</h2>

<p>This is the comparison that matters for a small team, and it’s the one nobody draws. The question isn’t which scheduler is smarter. They’re both fine. The question is whose pager each layer lands on.</p>

<p>Toggle ECS between Fargate and EC2 to see the middle ground:</p>

<div class="ecsx" id="ecsx-stack">
  <div class="ecsx-title">Who operates each layer
    <span class="ecsx-toggle"><button id="ecsx-fg" class="on">ECS · Fargate</button><button id="ecsx-ec2">ECS · EC2</button></span>
  </div>
  <div class="ecsx-cols">
    <div class="ecsx-col"><div class="ecsx-colhead" style="color:#8db8f8">EKS</div><div id="ecsx-col-eks"></div></div>
    <div class="ecsx-col"><div class="ecsx-colhead" style="color:#79d68a">ECS <span id="ecsx-mode">· Fargate</span></div><div id="ecsx-col-ecs"></div></div>
  </div>
  <div class="ecsx-score" id="ecsx-score"></div>
</div>

<script>
(function(){
  /* rows: [layer, EKS cell, ECS-Fargate cell, ECS-EC2 cell]; who: aws|you|na */
  var ROWS = [
    ['Control plane (API, scheduler, state store)',
      {who:'aws', txt:'AWS runs it — you pay $0.10/hr per cluster'},
      {who:'aws', txt:'AWS runs it — free'},
      {who:'aws', txt:'AWS runs it — free'}],
    ['Version upgrade treadmill',
      {who:'you', txt:'You initiate + test a cluster upgrade ~every 12–14 months, or pay 6× for extended support'},
      {who:'na',  txt:'Does not exist — there is no version'},
      {who:'na',  txt:'Does not exist — there is no version'}],
    ['Cluster add-ons (CNI, CoreDNS, kube-proxy)',
      {who:'you', txt:'You choose, install, and upgrade them — and they break during cluster upgrades'},
      {who:'na',  txt:'Built in (awsvpc networking). Not configurable, not breakable'},
      {who:'na',  txt:'Built in (awsvpc networking). Not configurable, not breakable'}],
    ['Ingress / load balancing',
      {who:'you', txt:'You install + upgrade the AWS Load Balancer Controller'},
      {who:'aws', txt:'Native ALB target-group integration'},
      {who:'aws', txt:'Native ALB target-group integration'}],
    ['Node OS, AMIs, patching',
      {who:'you', txt:'Yours — managed node groups help, but the reboot schedule is still your problem'},
      {who:'aws', txt:'No nodes. AWS patches the compute under you'},
      {who:'you', txt:'Yours — ASG AMI rotation, drain hooks, the works'}],
    ['Capacity planning + node autoscaling',
      {who:'you', txt:'Karpenter or Cluster Autoscaler — you configure and tune it'},
      {who:'aws', txt:'Per-task. You declare CPU/memory, AWS finds room'},
      {who:'you', txt:'Capacity providers + ASG sizing — bin-packing is back on you'}],
    ['Workload identity (cloud credentials)',
      {who:'you', txt:'RBAC + OIDC provider + IRSA annotations per service account'},
      {who:'aws', txt:'A plain IAM role on the task definition'},
      {who:'aws', txt:'A plain IAM role on the task definition'}],
  ];
  var eksCol = document.getElementById('ecsx-col-eks');
  var ecsCol = document.getElementById('ecsx-col-ecs');
  var score  = document.getElementById('ecsx-score');
  var modeEl = document.getElementById('ecsx-mode');
  var WHO = {aws:'AWS MANAGES', you:'YOU OPERATE', na:'— GONE —'};

  function cell(layer, c){
    return '<div class="ecsx-cell '+c.who+'"><span class="who">'+WHO[c.who]+'</span><b>'+layer+'</b><br>'+c.txt+'</div>';
  }
  function draw(fargate){
    var idx = fargate ? 2 : 3;
    eksCol.innerHTML = ROWS.map(function(r){ return cell(r[0], r[1]); }).join('');
    ecsCol.innerHTML = ROWS.map(function(r){ return cell(r[0], r[idx]); }).join('');
    var ye = ROWS.filter(function(r){return r[1].who==='you'}).length;
    var yc = ROWS.filter(function(r){return r[idx].who==='you'}).length;
    modeEl.textContent = fargate ? '· Fargate' : '· EC2';
    score.innerHTML = 'Layers on <b>your</b> pager — EKS: <b>'+ye+' of '+ROWS.length+'</b> · ECS '+
      (fargate?'on Fargate':'on EC2')+': <b>'+yc+' of '+ROWS.length+'</b>';
    document.getElementById('ecsx-fg').classList.toggle('on', fargate);
    document.getElementById('ecsx-ec2').classList.toggle('on', !fargate);
  }
  document.getElementById('ecsx-fg').onclick = function(){ draw(true); };
  document.getElementById('ecsx-ec2').onclick = function(){ draw(false); };
  draw(true);
})();
</script>

<p>Look at the EKS column. Six of the seven layers are yours. None of them are your product.</p>

<p>The upgrade treadmill deserves special attention because it’s the one that quietly eats small teams. Kubernetes ships about three releases a year, and EKS standard support for each lasts about 14 months. That means a recurring, unskippable project roughly once a year, forever: test the control plane upgrade, upgrade the add-ons in the right order, chase whatever deprecated APIs your manifests use, then roll the nodes. Skip it and AWS moves you to extended support at six times the control plane price. For a platform team of 15, that’s Tuesday. For a team of four, it’s a sprint per year spent running to stand still. And there’s a quieter cost on top: you have to stay the kind of team that can do this safely.</p>

<p>ECS doesn’t have a version. I want to make sure that lands. There is no upgrade, no deprecation cycle, no “v1.29 removes the API your ALB controller depends on.” The control plane changed under you a hundred times last year and you never noticed. I have ECS services from 2021 that have never needed a maintenance commit. Infrastructure that doesn’t generate homework is worth more to a small team than anything on the Kubernetes feature list. It’s the same reason I tell teams to <a href="/2025/ecs-decisions-that-waste-6-weeks/">pick boring options everywhere else in the stack</a>: boring means you debug your app, not your platform.</p>

<p>On raw cost, the EKS control plane is about $73 a month per cluster and ECS’s is free, and that’s the least interesting line in the comparison. Run the numbers on engineering time instead. One sprint of one engineer’s time per year on cluster maintenance is $10-20k. The <a href="/2025/aws-cost-optimization-case-study/">biggest AWS savings I’ve ever found</a> came from deleting complexity, not from rightsizing it.</p>
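<p>For anyone who wants to check the control plane figures, the arithmetic is short. This sketch uses the $0.10 per cluster-hour price and the 6× extended support multiplier quoted earlier in this post; the 730 hours per month is my assumption (8,760 hours in a year divided by 12).</p>

```javascript
// Control plane cost arithmetic using the figures quoted in this post.
var HOURS_PER_MONTH = 730;            // assumption: 8760 hours per year / 12
var EKS_STANDARD_PER_HOUR = 0.10;     // USD per cluster-hour, standard support
var EXTENDED_SUPPORT_MULTIPLIER = 6;  // extended support costs six times as much

function monthlyControlPlaneCost(perHour, clusters) {
  return perHour * HOURS_PER_MONTH * clusters;
}

var standard = monthlyControlPlaneCost(EKS_STANDARD_PER_HOUR, 1);
var extended = monthlyControlPlaneCost(
  EKS_STANDARD_PER_HOUR * EXTENDED_SUPPORT_MULTIPLIER, 1);

console.log('standard support: $' + standard.toFixed(2) + '/month'); // $73.00/month
console.log('extended support: $' + extended.toFixed(2) + '/month'); // $438.00/month
console.log('standard support: $' + (standard * 12).toFixed(0) + '/year'); // $876/year
```

<p>Even a full year of the standard fee is small next to the $10-20k of engineering time, which is why the fee is the least interesting line.</p>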

<h2 id="what-you-give-up">What you give up</h2>

<p>If this were one-sided, EKS wouldn’t exist. Here’s what you actually lose.</p>

<p>The big one is the operator ecosystem. Kubernetes has operators and controllers for almost everything: Postgres, Kafka, cert-manager, external-dns, ArgoCD, all debugged by thousands of teams over a decade. ECS has no CRDs and no operator pattern. The AWS answer is “use the managed service”: RDS instead of a Postgres operator, MSK instead of Strimzi. That works right up until you need something AWS doesn’t sell.</p>

<p>Tooling in general follows the same line. Vendors ship a Helm chart, not a task definition. Kustomize, the CNCF landscape, none of it targets ECS. And your deployment layer is AWS-native, so a future move off AWS means rewriting it. Your containers move unchanged, but the wiring around them doesn’t.</p>

<p>There’s also the hiring thing, and I won’t pretend it isn’t real. Engineers want Kubernetes on their CV. ECS knowledge is real orchestration knowledge and the concepts transfer completely, as the diagrams above show, but nobody’s career was ever advanced by the phrase “task definition.”</p>

<p>And ECS has a control ceiling. Custom schedulers, topology spread, network policy, the more exotic probe and init semantics: Kubernetes gives you knobs ECS simply doesn’t have. Most web products never touch them. If yours genuinely does, you’ll feel the ceiling and you’ll resent it.</p>

<h2 id="so-when-is-eks-the-right-call">So when is EKS the right call?</h2>

<p>EKS earns its keep when at least one of these is true:</p>

<ul>
  <li>Someone owns the platform. You have, or are hiring, people whose actual job is cluster operations, so the pager layers above land on a team that exists.</li>
  <li>You’re running stateful infrastructure on-cluster that AWS doesn’t offer as a managed service, and you need the operator ecosystem for it.</li>
  <li>Multi-cloud or on-prem is a real requirement: contractual, regulatory, or your customers deploy your software into their clusters.</li>
  <li>Your team is already fluent. K8s veterans ship faster on EKS than they would learning anything else. The tax is only a tax if you haven’t already paid it.</li>
</ul>

<p>If none of those describe you, and for most sub-ten-engineer teams shipping a web product none do, then Kubernetes isn’t buying you capability. It’s buying you a second job.</p>

<h2 id="the-takeaway">The takeaway</h2>

<p>ECS is not “Kubernetes for beginners.” It’s the same control loop idea with a deliberately smaller operational surface. Same desired state, same reconciliation, same rolling deploys, minus the version treadmill, the add-on stack, and the node fleet. You’ve seen the whole object model in this post. There is no part two where the hidden complexity lives.</p>

<p>Small teams don’t lose because they picked the wrong orchestrator. They lose because their best engineers spent the year operating infrastructure the product didn’t need. Pick the tool that generates the least homework, ship, and revisit when you have the head count to afford opinions.</p>

<p>If you’re starting an ECS build-out, the companion post on <a href="/2025/ecs-decisions-that-waste-6-weeks/">the 5 ECS decisions that waste 6 weeks</a> covers the concrete choices: Fargate vs EC2, service discovery, CI/CD, secrets, and monitoring.</p>

<hr />

<p><em>If this post saved you a meeting, it did its job. I write about AWS, DevOps, and building things from scratch. Subscribe via <a href="/feed.xml">RSS</a>, or find me on <a href="https://twitter.com/muhammad_o7">Twitter</a>.</em></p>

          ]]>
        </description>
        <pubDate>Tue, 09 Jun 2026 00:00:00 +0000</pubDate>
        <link>//muhammadraza.me/2026/ecs-explained-visually/</link>
        <guid isPermaLink="true">//muhammadraza.me/2026/ecs-explained-visually/</guid>
        
        <category>aws</category>
        
        <category>devops</category>
        
        <category>kubernetes</category>
        
        
        
        <dc:creator>Muhammad Raza</dc:creator>
        <dc:rights></dc:rights>
      </item>
    
      <item>
        <title>GGUF vs MLX: A Decision Guide, Not Another Benchmark</title>
        <description>
          <![CDATA[
            
            <p>Every few weeks someone downloads the GGUF build and the MLX build of the same model, runs both, screenshots the tokens-per-second counter, and posts it as proof that one format wins. The replies split down the middle. Half the thread says MLX is obviously faster, the other half says the test was rigged.</p>

<p>They are both right, which is the problem. The number on the screen is real and it is also not the number you actually wait for. And the format you should pick was never really about that number anyway.</p>

<p>I have gone through this decision enough times now, on my own machine and for clients standing up local inference, that I want to write down the part nobody puts in the comparison tables: GGUF versus MLX is a five-question decision, and only one of those questions is about speed.</p>

<h2 id="what-you-are-actually-choosing-between">What you are actually choosing between</h2>

<p>GGUF is the file format from the llama.cpp project. One file holds the quantized weights, the tokenizer, the chat template, and the metadata, and any runtime that can load it will run the model. That includes llama.cpp itself, Ollama, LM Studio, KoboldCPP, and a handful of others. It runs on basically everything: CPU, NVIDIA, AMD, Apple Metal, even a Raspberry Pi if you are patient. Portability is the whole point of the format.</p>

<p>MLX is not a file format. It is Apple’s array framework, the rough equivalent of PyTorch built specifically for Apple Silicon. An MLX model is a directory of safetensors files plus a config that the runtime reads directly. You convert and quantize a model in one command with <code class="language-plaintext highlighter-rouge">mlx_lm.convert</code>. The catch is in the name: MLX runs on Apple Silicon and nowhere else.</p>

<p>One thing worth clearing up before we go further, because it shows up in half the comparisons and it is out of date: people say GGUF does clever mixed-precision quantization while MLX is stuck on flat uniform 4-bit. The first half is true. The second half is not. Apple walked through per-layer mixed precision in their WWDC25 session on running large language models with MLX, including the trick of keeping the embedding and output layers at 6-bit while the rest of the model sits at 4-bit. MLX can do it. It is just that most of the MLX builds floating around Hugging Face do not bother, so in practice you are often comparing GGUF’s mixed precision against a uniform MLX quant. Worth knowing when you read someone else’s quality benchmark.</p>

<h2 id="the-number-on-the-screen-is-lying-to-you">The number on the screen is lying to you</h2>

<p>Quick detour, because it poisons most of the benchmarks you will find.</p>

<p>The tokens-per-second figure your runtime prints while text is streaming measures decode speed, the rate at which the model emits new tokens. It does not include prefill, the time the model spends reading your prompt before it says anything. For a chatty exchange with a short prompt that does not matter much. For an agent that stuffs tool output, a chunk of a file, and a system prompt into every turn, prefill is most of what you wait for, and the streaming counter never sees it.</p>

<p>There is a benchmark writeup that made the rounds on r/LocalLLaMA where the author’s UI proudly reported nearly twice the tokens per second on MLX as on GGUF, and then the actual wall-clock time had GGUF finishing first on most of the real tasks. Same machine, same model. The counter was not wrong. It was just answering a different question than the one that mattered.</p>

<p>Keep that in your head for the whole rest of this post. When I say one format is “faster” below, I mean wall-clock on a real workload, not the number that scrolls past while tokens stream.</p>
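<p>If you want to measure this yourself, here is a minimal sketch of timing a token stream so that prefill and decode show up separately. It assumes your runtime exposes some iterable that yields tokens as they are produced; <code class="language-plaintext highlighter-rouge">fake_stream</code> is a hypothetical stand-in for that, not any real runtime’s API.</p>

```python
import time
from typing import Iterable, Iterator


def time_stream(tokens: Iterable[str]) -> dict:
    """Consume a token stream and split wall-clock time into prefill and decode."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in tokens:
        if first is None:
            # Everything before the first token is prefill (plus any startup cost).
            first = time.perf_counter()
        count += 1
    end = time.perf_counter()
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_s = end - first
    return {
        "tokens": count,
        "time_to_first_token_s": first - start,
        "decode_s": decode_s,
        # Roughly what a streaming counter reports: decode only.
        "decode_tok_per_s": (count - 1) / decode_s if decode_s > 0 else float("inf"),
        # What you actually wait for.
        "wall_clock_s": end - start,
    }


def fake_stream(prefill_s: float, n_tokens: int, per_token_s: float) -> Iterator[str]:
    """Hypothetical runtime: sleeps for prefill, then emits tokens at a fixed pace."""
    time.sleep(prefill_s)
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"


stats = time_stream(fake_stream(prefill_s=0.2, n_tokens=5, per_token_s=0.01))
print(stats)
```

<p>Compare <code class="language-plaintext highlighter-rouge">wall_clock_s</code> across runtimes, not <code class="language-plaintext highlighter-rouge">decode_tok_per_s</code>.</p>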

<h2 id="five-questions-that-actually-decide-it">Five questions that actually decide it</h2>

<h3 id="1-how-big-is-the-model-relative-to-your-ram">1. How big is the model relative to your RAM?</h3>

<p>This is the question that quietly settles a lot of arguments. Token generation is bounded by memory bandwidth, not compute. To emit one token the GPU has to read the entire model out of memory. On an M4 Pro with roughly 273 GB/s of bandwidth, a 4-bit 27B model weighing about 17 GB caps out near 16 tokens per second no matter what software you run. MLX cannot fetch bytes faster than the hardware allows, and neither can llama.cpp.</p>
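<p>The arithmetic behind that ceiling is one division. A back-of-the-envelope sketch, assuming every generated token requires reading all the weights once and ignoring KV cache traffic and other overhead:</p>

```python
def decode_ceiling_tok_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed when each token reads the full model from memory."""
    return bandwidth_gb_s / model_size_gb


# The M4 Pro example from the text: ~273 GB/s, ~17 GB of 4-bit weights.
m4_pro = decode_ceiling_tok_per_s(273, 17)
print(f"{m4_pro:.1f} tok/s")  # 16.1 tok/s
```

<p>Real numbers land below this bound, but no runtime lands above it.</p>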

<p>So for large models, the ones that fill most of your unified memory, the format barely matters for speed. They both hit the same wall. The interesting differences show up on smaller models, under roughly 8 to 14B, where the model fits comfortably and the bottleneck shifts from bandwidth to framework overhead. That is where MLX’s tighter, Apple-specific kernels pull ahead, often in the 15 to 40 percent range on single-user decode, and wider still on very small models that lean hardest on framework efficiency.</p>

<p>Small model, want it snappy: MLX has something real to offer. Big model that barely fits: pick on the other four questions, because speed is a wash.</p>

<h3 id="2-will-this-ever-need-to-run-somewhere-other-than-a-mac">2. Will this ever need to run somewhere other than a Mac?</h3>

<p>If there is any chance the same artifact has to run on a Linux box, a cloud GPU, or a teammate’s non-Apple machine, you want GGUF. The same file moves between all of them. MLX does not leave Apple Silicon, full stop. If you ship MLX as your only build and then need a CUDA fallback, you are re-quantizing under pressure.</p>

<p>This one overrides almost everything else. Portability is not a performance feature, but it is the feature you miss most when it is gone.</p>

<h3 id="3-what-does-your-workload-actually-look-like">3. What does your workload actually look like?</h3>

<p>Not “what model,” but the shape of the traffic. Specifically the ratio of input to output.</p>

<p>Workloads that feed the model a lot and ask for a little (classification, tool-calling agents with short replies, RAG with a big injected context) lean toward GGUF. llama.cpp has more battle-tested prompt caching and FlashAttention, and MLX’s prefix caching has historically been the less reliable of the two, especially on newer hybrid-attention models. When prefill dominates the wall clock, that maturity wins.</p>

<p>Workloads that take a short prompt and generate a lot (summaries, long-form chat, brainstorming) lean toward MLX. Once the model is past prefill and just streaming tokens, MLX’s decode advantage compounds, and the longer the reply the more it pays off.</p>

<p>There is a crossover point that depends on both context size and reply length. With a small prompt, MLX needs a couple hundred output tokens before its faster decode makes up for slower prefill. With a few thousand tokens of context, it needs several hundred more. If your agent’s replies are 150 tokens and its context keeps growing, you are living on the wrong side of that crossover, and GGUF is the better call.</p>
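<p>You can estimate where that crossover sits for your own setup with a simple two-rate model: wall clock is prompt tokens divided by prefill rate, plus output tokens divided by decode rate. The sketch below is that model and nothing more. The rates are made-up placeholders, not measurements of MLX or llama.cpp, and the linear model ignores fixed startup costs and caching, so plug in numbers you measured yourself.</p>

```python
from dataclasses import dataclass


@dataclass
class RuntimeProfile:
    prefill_tok_per_s: float  # prompt tokens processed per second
    decode_tok_per_s: float   # output tokens generated per second

    def wall_clock_s(self, prompt_tokens: float, output_tokens: float) -> float:
        return (prompt_tokens / self.prefill_tok_per_s
                + output_tokens / self.decode_tok_per_s)


def crossover_output_tokens(fast_decode: RuntimeProfile,
                            fast_prefill: RuntimeProfile,
                            prompt_tokens: float) -> float:
    """Output length at which the faster decoder catches up on wall-clock time."""
    extra_prefill_s = prompt_tokens * (1 / fast_decode.prefill_tok_per_s
                                       - 1 / fast_prefill.prefill_tok_per_s)
    saved_per_token_s = (1 / fast_prefill.decode_tok_per_s
                         - 1 / fast_decode.decode_tok_per_s)
    return extra_prefill_s / saved_per_token_s


# Hypothetical rates, for illustration only.
a = RuntimeProfile(prefill_tok_per_s=400, decode_tok_per_s=50)  # faster decode
b = RuntimeProfile(prefill_tok_per_s=600, decode_tok_per_s=40)  # faster prefill
for prompt in (500, 4000):
    print(prompt, round(crossover_output_tokens(a, b, prompt)))
```

<p>If your typical reply is shorter than the number this prints for your typical context, the faster-prefill runtime wins on wall clock.</p>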

<h3 id="4-do-you-want-to-train-or-just-run">4. Do you want to train, or just run?</h3>

<p>GGUF is an inference format. You download it, you run it, that is the relationship. If you want to fine-tune, you convert back to safetensors, find a GPU, do the work, and convert forward again.</p>

<p>MLX is a full framework. You can fine-tune with LoRA or QLoRA directly on the Mac, merge adapters, and run speculative decoding with a small draft model, all natively. If part of your reason for going local is to actually adapt models and not just serve them, MLX is the only serious option on Apple Silicon, and this question alone can decide the whole thing.</p>

<h3 id="5-how-much-do-you-care-about-ecosystem-and-exact-fit">5. How much do you care about ecosystem and exact fit?</h3>

<p>Two practical edges for GGUF here. First, coverage: every open model gets GGUF builds within hours of release, including the obscure ones. MLX coverage is good for popular models and lags for everything else. Second, granularity. GGUF gives you a long ladder of quant levels, Q4_K_M, Q5_K_M, Q6_K, the I-quants, and so on, so when you have exactly 16 GB to work with you can usually find a quant that fits. MLX builds are mostly published at 4-bit and 8-bit, so you sometimes get a 4-bit that is a hair too small for the quality you want and an 8-bit that will not fit.</p>

<p>The edge on MLX’s side: it tends to get support for new Apple hardware features first, because Apple ships the Metal abstraction in MLX before llama.cpp catches up.</p>

<h2 id="the-flowchart">The flowchart</h2>

<p>Put the five questions in order and most decisions fall out in about ten seconds.</p>

<ul>
  <li><strong>Need to run on anything other than Apple Silicon, now or later?</strong> → <strong>GGUF</strong>. Stop here, portability wins.</li>
  <li><strong>Staying on Apple Silicon. Do you want to fine-tune or train on-device?</strong> → <strong>MLX</strong>.</li>
  <li><strong>Inference only. Is your workload short-output and prefill-heavy</strong> (agents, RAG, classification)? → <strong>GGUF</strong>.</li>
  <li><strong>Long outputs, interactive, single user, latency you can feel?</strong> → <strong>MLX</strong>.</li>
  <li><strong>Need a precise quant to fit tight RAM, or running a just-released or obscure model?</strong> → <strong>GGUF</strong>.</li>
  <li><strong>Still undecided?</strong> → <strong>GGUF</strong>. It is the conservative default. Ship it, and A/B an MLX build later if throughput becomes the constraint.</li>
</ul>
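<p>The same flowchart as a function, in case you want to drop it into a team wiki or a checklist script. The field names are my own shorthand for the questions above; the order of the checks is the part that matters.</p>

```python
from dataclasses import dataclass


@dataclass
class Needs:
    non_apple_hardware: bool = False          # now or later
    on_device_training: bool = False          # fine-tune or train on the Mac
    prefill_heavy_short_output: bool = False  # agents, RAG, classification
    long_output_interactive: bool = False     # single user, long replies
    tight_ram_or_obscure_model: bool = False  # need a precise quant or a niche model


def choose_format(n: Needs) -> str:
    """Walk the questions in order; the first match decides."""
    if n.non_apple_hardware:
        return "GGUF"  # portability wins, stop here
    if n.on_device_training:
        return "MLX"
    if n.prefill_heavy_short_output:
        return "GGUF"
    if n.long_output_interactive:
        return "MLX"
    if n.tight_ram_or_obscure_model:
        return "GGUF"
    return "GGUF"  # conservative default


print(choose_format(Needs(long_output_interactive=True)))  # MLX
```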

<p>The short version: GGUF is what you pick when you are not sure, because it is the one that is hard to regret. MLX is what you pick when you own the hardware, run single-user, and have a specific reason, throughput on long outputs or on-device training, to want it.</p>

<h2 id="once-you-have-picked-pick-a-quant-level">Once you have picked, pick a quant level</h2>

<p>The format is half the decision. The bit width is the other half, and the defaults are good but not always right.</p>

<p>Start at <strong>Q4_K_M</strong> for GGUF or <strong>4-bit</strong> for MLX. Q4_K_M is the community default for a reason. It keeps most tensors at 4-bit, then bumps the quality-sensitive ones to 6-bit: the attention value weights and the feed-forward down-projection, on a portion of the layers. That holds quality better than a flat 4-bit quant at a small size cost. The reported quality loss against FP16 on MMLU is model-dependent but small: well under a point on a big model, creeping up toward a point or so on something under 8B, and a little more again for a uniform 4-bit MLX build. On a 30B-plus model that gap is noise. On something under 8B, especially on coding tasks where attention precision matters, it is visible, and you have two outs: stay on GGUF Q4_K_M, or move to MLX 6-bit, which closes the gap for roughly a 30 percent larger file.</p>

<p>If RAM is genuinely tight, GGUF’s <strong>I-quants</strong> with an importance matrix are the quality-per-byte champions at low bit widths. The cost is slower decode on CPU, so they make more sense when you are squeezing a model onto limited memory than when you are chasing speed.</p>

<p>One rule regardless of format: do not drop below roughly 3-bit without measuring quality on your own task. The aggregate benchmarks stop predicting what you will actually see down there.</p>
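<p>For the “will it fit” part, a rough size estimate is parameters times bits per weight divided by eight. The sketch below picks the largest GGUF quant that fits a weight budget. The bits-per-weight values are approximate figures I am assuming for illustration; real files vary by model and architecture, and you still need headroom on top of the weights for the KV cache and the OS.</p>

```python
# Approximate bits per weight (assumed, rough values; check real file sizes).
APPROX_BPW = {"Q4_K_M": 4.9, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}


def approx_size_gb(params_billion: float, bpw: float) -> float:
    """Rough weight size in GB: billions of params * bits per weight / 8."""
    return params_billion * bpw / 8


def largest_quant_that_fits(params_billion: float, budget_gb: float,
                            table: dict = APPROX_BPW):
    """Return the highest-bpw quant whose estimated size fits, or None."""
    fitting = [(bpw, name) for name, bpw in table.items()
               if budget_gb >= approx_size_gb(params_billion, bpw)]
    return max(fitting)[1] if fitting else None


# A 14B model with 10 GB left over for weights.
print(largest_quant_that_fits(14, 10))  # Q5_K_M
```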

<h2 id="two-traps-that-will-flip-your-results">Two traps that will flip your results</h2>

<p><strong>The bf16 trap on M1 and M2.</strong> A lot of MLX builds ship as bf16, and on the M1 and M2 that data type does not get the accelerated path that fp16 does. During prefill those weights run un-accelerated and the penalty multiplies across every input token, which is part of why some “MLX is slow” reports come from older hardware. The fix is a one-minute reconvert with <code class="language-plaintext highlighter-rouge">--dtype float16</code>. If you are on an M1 or M2 and MLX feels sluggish, check this before you blame the format.</p>

<p><strong>Caching is the real variable.</strong> The biggest swings I have seen between runtimes were not about GGUF versus MLX at all; they were about whether prompt and KV caching actually worked for that model on that runtime. A runtime that reprocesses the full conversation every turn will lose to one that caches the prefix, regardless of format. Test caching with your real context lengths before you commit, and do not trust the streaming counter to tell you about it, because it never measures the part that caching fixes.</p>
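<p>To see why caching swamps everything else, count the prompt tokens a runtime has to process over a conversation. This is a toy model, assuming a working prefix cache only needs to process the tokens added since the previous turn:</p>

```python
def prefill_tokens(new_tokens_per_turn, cache_prefix: bool) -> int:
    """Total prompt tokens processed across a conversation.

    new_tokens_per_turn[i] is how many tokens turn i adds to the context
    (user input, tool output, and the previous reply).
    """
    total = 0
    context = 0
    for new in new_tokens_per_turn:
        context += new
        # Without a cache the whole context is re-read every turn.
        total += new if cache_prefix else context
    return total


# A 2,000-token system prompt, then nine turns adding 500 tokens each.
turns = [2000] + [500] * 9
print(prefill_tokens(turns, cache_prefix=False))  # 42500
print(prefill_tokens(turns, cache_prefix=True))   # 6500
```

<p>That is more than a sixfold difference in prefill work on a ten-turn conversation, and it grows with every turn.</p>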

<h2 id="so-which-one">So which one</h2>

<p>If you want the one-line version: GGUF is the conservative default, and you should reach for it whenever you are uncertain, need portability, or want a specific quant. Reach for MLX when you are locked to Apple Silicon, run single-user interactive workloads with long outputs, or want to fine-tune on the machine you already own.</p>

<p>And if you are choosing this for a team rather than a laptop, treat it as the architecture decision it is. The format you standardize on shapes your model coverage, your fallback options, and your serving setup for as long as the stack lives, and re-quantizing a fleet after the fact is the kind of avoidable week of work I keep getting hired to clean up. Decide it on the five questions, not on a screenshot.</p>

          ]]>
        </description>
        <pubDate>Wed, 03 Jun 2026 00:00:00 +0000</pubDate>
        <link>//muhammadraza.me/2026/gguf-vs-mlx-decision-guide/</link>
        <guid isPermaLink="true">//muhammadraza.me/2026/gguf-vs-mlx-decision-guide/</guid>
        
        <category>ai</category>
        
        <category>llm</category>
        
        <category>devops</category>
        
        <category>mac</category>
        
        
        
        <dc:creator>Muhammad Raza</dc:creator>
        <dc:rights></dc:rights>
      </item>
    
      <item>
        <title>Building CodeWiki: Compiling Codebases Into Living Wikis With LLMs</title>
        <description>
          <![CDATA[
            
            <p>Every coding agent session starts from zero. The agent doesn’t know how your code is organized, which files matter, how the pieces connect. It has to rediscover the architecture from scratch. Grep around, read some files, build a mental model, start working. That mental model disappears the moment the session ends.</p>

<p>I kept watching this happen. Ten minutes of exploration before any real work, every single time. If you work across multiple repos or come back to a project after a couple weeks, it’s worse. The agent is essentially reading the codebase for the first time, again.</p>

<p>I wanted to fix this.</p>

<h2 id="the-idea">The idea</h2>

<p>A few weeks ago Karpathy <a href="https://x.com/karpathy/status/2039805659525644595">tweeted</a> about using LLMs to build personal knowledge bases. The workflow: collect raw sources, have an LLM compile them into a structured wiki of markdown files, then query and build on that wiki over time. Every query makes the wiki richer. The knowledge adds up.</p>

<p>The part that stuck with me: he’s not using fancy RAG. The LLM maintains its own index files and summaries, and at his scale (~100 articles, ~400K words) it just works. The LLM reads its own compiled knowledge to answer questions.</p>

<p>Codebases are raw data too. Source files are unstructured information that happens to be executable. What if the LLM compiled a codebase into a wiki the same way, with module overviews, architecture docs, concept articles, and then used that wiki as its starting point for every session?</p>

<p>That’s <a href="https://github.com/mraza007/codewiki">CodeWiki</a>.</p>

<h2 id="how-it-works">How it works</h2>

<p>CodeWiki is a thin Rust CLI called <code class="language-plaintext highlighter-rouge">cw</code> paired with a Claude Code skill. The CLI handles git ops, directory scaffolding, and metadata. The agent does all the actual reading and writing. No API keys, no LLM calls from the CLI. Your agent is the intelligence.</p>

<p>When you run <code class="language-plaintext highlighter-rouge">cw init</code> in a repo, it creates a wiki directory at <code class="language-plaintext highlighter-rouge">~/.codewiki/&lt;project&gt;/</code> with this structure:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/.codewiki/my-project/
├── _index.md         # master index
├── _architecture.md  # system overview
├── _patterns.md      # recurring patterns
├── _meta.yaml        # last compiled commit
├── modules/          # one article per module
├── concepts/         # cross-cutting concerns
├── decisions/        # why things are the way they are
├── learnings/        # bugs fixed, patterns discovered
└── queries/          # past Q&amp;A, filed back
</code></pre></div></div>

<p>The first time you start a Claude Code session after init, the skill kicks in. The agent walks your codebase, reads the source files, and writes wiki articles. Module articles describe what each part of the code actually does. Not what it’s supposed to do, what it does. Key files, functions, data flow, connections to other modules.</p>

<p>Concept articles cut across modules. “How does error handling work across the system” or “how does data flow from request to response.” These are the questions that normally require reading eight files across four directories. The wiki answers them in one place.</p>

<h2 id="keeping-it-fresh">Keeping it fresh</h2>

<p>The wiki is only useful if it stays current. Every article has YAML frontmatter with a <code class="language-plaintext highlighter-rouge">source_files</code> field:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="na">title</span><span class="pi">:</span> <span class="s">Authentication Module</span>
<span class="na">type</span><span class="pi">:</span> <span class="s">module</span>
<span class="na">source_files</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s">src/auth/middleware.py</span>
  <span class="pi">-</span> <span class="s">src/auth/tokens.py</span>
<span class="na">tags</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">auth</span><span class="pi">,</span> <span class="nv">middleware</span><span class="pi">,</span> <span class="nv">jwt</span><span class="pi">]</span>
<span class="nn">---</span>
</code></pre></div></div>

<p>The CLI tracks which commit the wiki was last compiled against. When you start a new session, <code class="language-plaintext highlighter-rouge">cw status</code> diffs against that commit and cross-references changed files against every article’s <code class="language-plaintext highlighter-rouge">source_files</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cw status
Changed since last compile (4964c23):
  M src/auth/middleware.py
  M src/auth/tokens.py

Stale articles:
  ! modules/auth.md
</code></pre></div></div>

<p>The agent sees this and knows exactly what to re-read and update. No guessing, no full recompile.</p>
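<p>The staleness check is simple enough to sketch in a few lines. This is a Python illustration of the idea, not the Rust code in <code class="language-plaintext highlighter-rouge">cw</code>: the function names are hypothetical, and I am assuming the changed-file list comes from a plain <code class="language-plaintext highlighter-rouge">git diff --name-only</code> against the recorded commit.</p>

```python
import subprocess


def changed_since(commit: str) -> list:
    """Files changed between the last compiled commit and HEAD."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{commit}..HEAD"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]


def stale_articles(changed_files, articles: dict) -> list:
    """articles maps an article path to the source_files in its frontmatter."""
    changed = set(changed_files)
    return sorted(path for path, sources in articles.items()
                  if changed.intersection(sources))


articles = {
    "modules/auth.md": ["src/auth/middleware.py", "src/auth/tokens.py"],
    "modules/db.md": ["src/db/pool.py"],
}
changed = ["src/auth/middleware.py", "src/auth/tokens.py"]
print(stale_articles(changed, articles))  # ['modules/auth.md']
```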

<p>At session end, the agent writes learnings and decisions back into the wiki. Fixed a bug? That becomes <code class="language-plaintext highlighter-rouge">learnings/auth-token-race-condition.md</code>. Made a design decision? That’s <code class="language-plaintext highlighter-rouge">decisions/switched-to-redis-sessions.md</code>. Then it updates <code class="language-plaintext highlighter-rouge">_meta.yaml</code> with the current commit hash.</p>

<p>Next session picks up where this one left off.</p>

<h2 id="the-cli">The CLI</h2>

<p>About 400 lines of Rust. Here are the commands:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cw init                <span class="c"># scaffold wiki for current repo</span>
cw status              <span class="c"># what changed since last compile</span>
cw path                <span class="c"># print wiki path</span>
cw projects            <span class="c"># list all wikis</span>
cw index               <span class="c"># rebuild _index.md from article frontmatter</span>
cw meta update         <span class="c"># record current commit as compiled</span>

cw setup claude-code   <span class="c"># install skill into Claude Code</span>
cw setup codex         <span class="c"># install instructions for Codex</span>
cw setup qmd           <span class="c"># register wiki as QMD search collection</span>
</code></pre></div></div>

<p>The CLI doesn’t make any LLM calls. It handles the things agents are bad at: tracking git state, knowing which files changed, maintaining timestamps. The agent handles what it’s good at: reading code and writing about it.</p>

<h2 id="search-with-qmd">Search with QMD</h2>

<p>For larger wikis, <a href="https://github.com/tobi/qmd">QMD</a> by Tobi Lutke adds proper search. It’s a local search engine for markdown with hybrid BM25 plus vector search plus a small reranker model. Running <code class="language-plaintext highlighter-rouge">cw setup qmd</code> registers your wiki as a searchable collection. The agent can then query the wiki through QMD’s MCP server during a session.</p>

<p>At the scale of most repos people actually work in, you probably don’t need it. A well-organized wiki with an index file is enough for the LLM to navigate on its own. But when the wiki gets large, QMD keeps retrieval fast.</p>

<h2 id="viewing-with-obsidian">Viewing with Obsidian</h2>

<p>All wiki articles live at <code class="language-plaintext highlighter-rouge">~/.codewiki/</code>. Open that directory as an Obsidian vault and you get a browsable knowledge graph of all your projects. Articles use <code class="language-plaintext highlighter-rouge">[[backlinks]]</code> so modules connect to each other. The auth article links to <code class="language-plaintext highlighter-rouge">[[database]]</code> and <code class="language-plaintext highlighter-rouge">[[api]]</code>. You never have to write or edit these articles yourself. The agent maintains everything.</p>

<h2 id="why-not-rag">Why not RAG</h2>

<p>Traditional RAG chunks your code, embeds it, retrieves fragments when you ask a question. You get decontextualized snippets and hope the LLM can stitch them together.</p>

<p>CodeWiki is different. The LLM reads the code and writes structured articles about it. The auth article already connects the middleware to the token service to the database layer. That connection doesn’t exist in any single source file. It exists in the compiled understanding.</p>

<p>Karpathy found the same thing with his research wiki. You don’t need vector search over raw data when you have a well-organized collection of articles. The LLM reads the index, finds the relevant articles, reads those. Simple, and it works.</p>

<h2 id="getting-started">Getting started</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/mraza007/codewiki.git
<span class="nb">cd </span>codewiki
cargo <span class="nb">install</span> <span class="nt">--path</span> <span class="nb">.</span>

<span class="nb">cd </span>your-project
cw init
cw setup claude-code
</code></pre></div></div>

<p>Start a Claude Code session and the skill handles the rest. The project is MIT licensed and on <a href="https://github.com/mraza007/codewiki">GitHub</a>.</p>

          ]]>
        </description>
        <pubDate>Fri, 03 Apr 2026 00:00:00 +0000</pubDate>
        <link>//muhammadraza.me/2026/building-codewiki-compiling-codebases-into-living-wikis/</link>
        <guid isPermaLink="true">//muhammadraza.me/2026/building-codewiki-compiling-codebases-into-living-wikis/</guid>
        
        <category>ai</category>
        
        <category>rust</category>
        
        <category>tools</category>
        
        <category>devops</category>
        
        
        
        <dc:creator>Muhammad Raza</dc:creator>
        <dc:rights></dc:rights>
      </item>
    
      <item>
        <title>I Built an Orchestrator That Watches GitHub Issues and Sends Agents to Fix Them</title>
        <description>
          <![CDATA[
            
            <p>I have too many issues and not enough time. Same as everyone. The usual loop is: pick an issue, context switch into it, write the code, open a PR, pick the next one. Do that until the sprint ends or you lose the will.</p>

<p>Coding agents help with this. I can point Claude Code at an issue and let it work while I do something else. But that’s still one agent, one issue, one terminal. If I have 10 issues labeled “agent-ready,” I’m not babysitting 10 terminal tabs.</p>

<p>I wanted something that just watches for new issues and sends agents after them. Then OpenAI released their <a href="https://github.com/openai/symphony/blob/main/SPEC.md">Symphony spec</a>, an orchestrator pattern for their Codex agent. The architecture was solid: poll an issue tracker, dispatch agents into isolated workspaces, reconcile when issues close. But it was built around Codex and Linear, and I use Claude Code and GitHub Issues.</p>

<p>So I took the ideas I liked from Symphony and built my own. That’s <a href="https://github.com/mraza007/baton">Baton</a>.</p>

<h2 id="what-it-does">What it does</h2>

<p>Baton is a Python daemon. You start it in your repo, it polls GitHub Issues matching your configured labels, creates an isolated git worktree per issue, and runs Claude Code CLI as a subprocess. When the agent finishes and opens a PR, Baton releases the claim and grabs the next issue.</p>

<p>One config file. One command. Go do something else.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>WORKFLOW.md -&gt; Orchestrator -&gt; Worker (per issue)
                  |              |
                  |              +-- git worktree create
                  |              +-- hooks (before_run)
                  |              +-- claude -p "&lt;prompt&gt;"
                  |              +-- check issue state
                  |              +-- hooks (after_run)
                  |
                  +-- Poller (gh issue list)
                  +-- Dispatcher (concurrency control)
                  +-- Reconciler (stale run detection)
</code></pre></div></div>

<p>The name comes from relay races. You hand off the baton and the runner goes.</p>

<h2 id="the-config">The config</h2>

<p>Everything lives in <code class="language-plaintext highlighter-rouge">WORKFLOW.md</code>. YAML front matter for configuration, Jinja2 template below for the prompt. Baton reloads this file on every poll cycle, so you can change settings without restarting.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="na">tracker</span><span class="pi">:</span>
  <span class="na">kind</span><span class="pi">:</span> <span class="s">github</span>
  <span class="na">labels</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">agent"</span><span class="pi">]</span>
  <span class="na">exclude_labels</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">blocked"</span><span class="pi">]</span>

<span class="na">polling</span><span class="pi">:</span>
  <span class="na">interval_ms</span><span class="pi">:</span> <span class="m">30000</span>

<span class="na">agent</span><span class="pi">:</span>
  <span class="na">max_concurrent</span><span class="pi">:</span> <span class="m">3</span>
  <span class="na">max_turns</span><span class="pi">:</span> <span class="m">5</span>
  <span class="na">command</span><span class="pi">:</span> <span class="s">claude</span>
  <span class="na">permission_mode</span><span class="pi">:</span> <span class="s">bypassPermissions</span>

<span class="na">hooks</span><span class="pi">:</span>
  <span class="na">before_run</span><span class="pi">:</span> <span class="pi">|</span>
    <span class="s">git fetch origin main &amp;&amp; git rebase origin/main</span>
  <span class="na">timeout_ms</span><span class="pi">:</span> <span class="m">60000</span>
<span class="nn">---</span>

<span class="s">You are an autonomous software engineer working on issue</span> <span class="c1">#{{ issue.number }}: {{ issue.title }}.</span>

<span class="pi">{{</span> <span class="nv">issue.body</span> <span class="pi">}}</span>

<span class="pi">{</span><span class="err">%</span> <span class="nv">if attempt %</span><span class="pi">}</span>
<span class="s">This is continuation attempt {{ attempt }}. Review what was done and continue.</span>
<span class="pi">{</span><span class="err">%</span> <span class="nv">endif %</span><span class="pi">}</span>

<span class="c1">## Instructions</span>

<span class="s">1. Understand the issue requirements</span>
<span class="s">2. Write clean, well-tested code</span>
<span class="s">3. Run existing tests to make sure nothing breaks</span>
<span class="s">4. Commit your changes with a descriptive message</span>
<span class="s">5. Push the branch and create a pull request linking to</span> <span class="c1">#{{ issue.number }}</span>
</code></pre></div></div>

<p>Labels filter which issues get picked up. <code class="language-plaintext highlighter-rouge">max_concurrent</code> controls parallel agents. <code class="language-plaintext highlighter-rouge">max_turns</code> is the retry limit per issue. Hooks run shell commands at different points. I use <code class="language-plaintext highlighter-rouge">before_run</code> to rebase on main so the agent starts from fresh code.</p>

<p>The prompt template gets <code class="language-plaintext highlighter-rouge">issue.number</code>, <code class="language-plaintext highlighter-rouge">issue.title</code>, <code class="language-plaintext highlighter-rouge">issue.body</code>, <code class="language-plaintext highlighter-rouge">issue.labels</code>, and <code class="language-plaintext highlighter-rouge">attempt</code> for retries. Standard Jinja2.</p>
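<p>For a sense of what <code class="language-plaintext highlighter-rouge">max_concurrent</code> means in practice, here is a minimal sketch of capping parallel workers with an asyncio semaphore. It is one common way to do this, not necessarily how Baton’s dispatcher is written; <code class="language-plaintext highlighter-rouge">run_all</code> and <code class="language-plaintext highlighter-rouge">fake_worker</code> are hypothetical names.</p>

```python
import asyncio


async def run_all(issues, worker, max_concurrent: int = 3):
    """Run worker(issue) for every issue, at most max_concurrent at a time."""
    slots = asyncio.Semaphore(max_concurrent)

    async def guarded(issue):
        async with slots:  # wait here until a slot frees up
            return await worker(issue)

    return await asyncio.gather(*(guarded(i) for i in issues))


async def demo():
    running = 0
    peak = 0

    async def fake_worker(issue_number):
        # Stand-in for "create worktree, run the agent, check for a PR".
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)
        running -= 1
        return issue_number

    results = await run_all(range(10), fake_worker, max_concurrent=3)
    return results, peak


results, peak = asyncio.run(demo())
print(results, peak)
```

<p>Ten issues go in, and never more than three workers are running at once.</p>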

<h2 id="why-worktrees">Why worktrees</h2>

<p>Each issue gets its own worktree under <code class="language-plaintext highlighter-rouge">.symphony/worktrees/</code>, with a branch name slugified from the issue title: <code class="language-plaintext highlighter-rouge">baton/fix-login-redirect-42</code>.</p>

<p>I thought about Docker containers and temp directories but worktrees won out. They share the git object database so creating one is almost instant, unlike a full clone. They’re real checkouts, so linters and test runners and build scripts all work without any path hacking. And they’re isolated. If one agent trashes its branch, the others don’t care.</p>
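<p>A rough sketch of the naming and the worktree command, assuming a simple lowercase-and-hyphenate slug. The helper names are hypothetical and Baton’s actual slug rules may differ; the <code class="language-plaintext highlighter-rouge">git worktree add -b</code> form is standard git.</p>

```python
import re


def branch_name(title: str, number: int, prefix: str = "baton") -> str:
    """Slugify an issue title into a branch name like baton/fix-login-redirect-42."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{prefix}/{slug}-{number}"


def worktree_command(branch: str, root: str = ".symphony/worktrees",
                     base: str = "origin/main") -> list:
    """Build: git worktree add -b BRANCH PATH BASE (returned, not executed)."""
    path = f"{root}/{branch.split('/')[-1]}"
    return ["git", "worktree", "add", "-b", branch, path, base]


branch = branch_name("Fix login redirect", 42)
print(branch)  # baton/fix-login-redirect-42
print(" ".join(worktree_command(branch)))
```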

<h2 id="why-gh-cli-instead-of-the-github-api">Why <code class="language-plaintext highlighter-rouge">gh</code> CLI instead of the GitHub API</h2>

<p>Baton shells out to <code class="language-plaintext highlighter-rouge">gh issue list</code> and <code class="language-plaintext highlighter-rouge">gh pr create</code> instead of using PyGitHub or the REST API. Seems odd, but think about setup.</p>

<p>With the API, you need a personal access token. You need to configure it somewhere. You need to handle rate limits.</p>

<p>With <code class="language-plaintext highlighter-rouge">gh</code>, you authenticate once (<code class="language-plaintext highlighter-rouge">gh auth login</code>) and everything on your machine uses the same credentials. No token management in the orchestrator. The tradeoff is speed, but Baton polls every 30 seconds. The overhead of a subprocess call doesn’t matter at that pace.</p>
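<p>Shelling out looks roughly like this. A sketch, not Baton’s actual poller: the function names are hypothetical, and it assumes include labels are passed to <code class="language-plaintext highlighter-rouge">gh</code> while exclude labels are filtered afterwards from the JSON output.</p>

```python
import json
import subprocess


def gh_issue_list_command(labels, limit: int = 50) -> list:
    """Build a gh command that lists open issues as JSON."""
    cmd = ["gh", "issue", "list", "--state", "open", "--limit", str(limit),
           "--json", "number,title,body,labels"]
    for label in labels:
        cmd += ["--label", label]
    return cmd


def parse_issues(raw_json: str, exclude_labels=()) -> list:
    """Drop any issue that carries one of the excluded labels."""
    excluded = set(exclude_labels)
    issues = []
    for item in json.loads(raw_json):
        names = {label["name"] for label in item.get("labels", [])}
        if names.isdisjoint(excluded):
            issues.append(item)
    return issues


def poll(labels, exclude_labels=()) -> list:
    """Run gh using whatever credentials `gh auth login` already set up."""
    out = subprocess.run(gh_issue_list_command(labels),
                         capture_output=True, text=True, check=True)
    return parse_issues(out.stdout, exclude_labels)


sample = json.dumps([
    {"number": 42, "title": "Fix login redirect", "body": "", "labels": [{"name": "agent"}]},
    {"number": 43, "title": "Rework billing", "body": "",
     "labels": [{"name": "agent"}, {"name": "blocked"}]},
])
print([i["number"] for i in parse_issues(sample, exclude_labels=["blocked"])])  # [42]
```

<p>No token handling anywhere in that code, which is the point.</p>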

<h2 id="the-permission-problem">The permission problem</h2>

<p>This tripped me up. Claude Code has permission modes: <code class="language-plaintext highlighter-rouge">default</code> asks for everything, <code class="language-plaintext highlighter-rouge">acceptEdits</code> auto-approves file edits but prompts for shell commands, and <code class="language-plaintext highlighter-rouge">bypassPermissions</code> auto-approves everything.</p>

<p>I started with <code class="language-plaintext highlighter-rouge">acceptEdits</code> because it felt like the right balance. Let the agent write code freely, but make it ask before running commands. Problem: “ask” means a human clicking yes, and in an autonomous orchestrator there’s no human. The agent just blocks forever waiting for a prompt nobody will answer.</p>

<p>I wasted about 20 minutes watching it hang before I figured this out. For autonomous operation you need <code class="language-plaintext highlighter-rouge">bypassPermissions</code>, which maps to <code class="language-plaintext highlighter-rouge">--dangerously-skip-permissions</code>. The flag name is honest about the risk. I’m comfortable with it because the agents run in isolated worktrees on disposable branches, not in my main checkout.</p>

<h2 id="auto-releasing-on-pr-creation">Auto-releasing on PR creation</h2>

<p>My first version had a dumb problem. The agent would finish its work, create a PR on turn 2 of 5, and Baton would keep scheduling continuation turns for the remaining 3. The slot was occupied but nobody was doing anything useful.</p>

<p>The fix: after each worker finishes, check if a PR exists for that issue’s branch. If yes, release the claim immediately and free up the slot. If not, schedule a short retry.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pr_exists</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="p">.</span><span class="n">tracker</span><span class="p">.</span><span class="n">check_pr_exists</span><span class="p">(</span><span class="n">issue</span><span class="p">.</span><span class="n">number</span><span class="p">)</span>
<span class="k">if</span> <span class="n">pr_exists</span><span class="p">:</span>
    <span class="n">log</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"PR_READY #</span><span class="si">{</span><span class="n">issue</span><span class="p">.</span><span class="n">number</span><span class="si">}</span><span class="s"> -- PR found, releasing claim"</span><span class="p">)</span>
    <span class="k">return</span> <span class="s">"pr_created"</span>
<span class="k">return</span> <span class="s">"no_pr"</span>
</code></pre></div></div>

<p>Small change, but it meant the orchestrator stopped wasting time on finished work.</p>

<h2 id="extensibility-through-skills-and-mcp-servers">Extensibility through skills and MCP servers</h2>

<p>Baton itself is deliberately simple. It polls, dispatches, and manages worktrees. The interesting part is what you put in the prompt and what tools you give the agent.</p>

<p>Claude Code supports MCP servers, which means you can wire up external tools and the agent can use them during its run. Baton passes MCP server config through to each worker:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">agent</span><span class="pi">:</span>
  <span class="na">mcp_servers</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">playwright</span>
      <span class="na">command</span><span class="pi">:</span> <span class="s">npx @playwright/mcp@latest</span>
</code></pre></div></div>

<p>That means the agent has access to a headless browser while it works. It can open a page, click around, take screenshots, verify that the UI renders correctly. You don’t have to build that into Baton. You just declare which MCP servers you want and the agent figures out when to use them.</p>

<p>Same idea with CLI tools. If <a href="https://github.com/vercel-labs/agent-browser">agent-browser</a> is installed on the machine, you can tell the agent to use it in the prompt template. “Before creating a PR, open the app with agent-browser and verify the acceptance criteria.” The agent spins up a local server, opens the page, clicks buttons, fills inputs, takes snapshots. All from instructions in WORKFLOW.md, nothing hardcoded in the orchestrator.</p>

<p>Claude Code also has skills, which are reusable prompt fragments that teach the agent specific capabilities. If you have a code review skill or a testing skill installed, the agent can use them during its run. Baton’s config supports a <code class="language-plaintext highlighter-rouge">skills</code> list for this:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">agent</span><span class="pi">:</span>
  <span class="na">skills</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">code-reviewer</span>
    <span class="pi">-</span> <span class="s">accessibility-checker</span>
</code></pre></div></div>

<p>You can also override skills per issue by adding a <code class="language-plaintext highlighter-rouge">## Skills</code> section to the issue body. If one issue needs Playwright but the others don’t, just add it to that issue.</p>

<p>The point is that Baton doesn’t need to know about browsers or test runners or linters. It just needs to dispatch agents with the right config. The prompt and the tools do the rest.</p>

<h2 id="putting-it-together-a-todo-app-from-scratch">Putting it together: a todo app from scratch</h2>

<p>To see all of this working end to end, I had Baton build a todo app. Fresh repo, no code. I created three GitHub issues labeled <code class="language-plaintext highlighter-rouge">baton</code>:</p>

<ol>
  <li>Create basic HTML structure</li>
  <li>Add JavaScript for create/delete</li>
  <li>Add localStorage persistence</li>
</ol>

<p>The WORKFLOW.md prompt told the agent to use agent-browser for verification before opening PRs. I ran <code class="language-plaintext highlighter-rouge">baton start</code> and went to make coffee.</p>

<p>Baton picked up issue #1, created a worktree on <code class="language-plaintext highlighter-rouge">baton/create-basic-todo-app-html-structure-1</code>, and dispatched Claude Code. The agent wrote <code class="language-plaintext highlighter-rouge">index.html</code>, spun up a local server with <code class="language-plaintext highlighter-rouge">npx serve</code>, opened it with agent-browser, confirmed the layout rendered, then committed, pushed, and opened a PR. The PR description included what agent-browser found:</p>

<blockquote>
  <p>Opened <code class="language-plaintext highlighter-rouge">http://localhost:3456</code> and confirmed the page renders correctly.
Ran <code class="language-plaintext highlighter-rouge">agent-browser snapshot -i</code> confirming interactive elements: textbox and button.</p>
</blockquote>

<p>I merged it. The issue auto-closed (the PR had <code class="language-plaintext highlighter-rouge">Closes #1</code>). Baton saw the issue was gone on the next poll, released the slot, and picked up issue #2. Same cycle. Then #3.</p>

<p>Three issues, three PRs, three merges. I didn’t write a line of the todo app. The agent-browser verification wasn’t built into Baton. It was just instructions in the prompt and a CLI tool on my machine.</p>

<h2 id="getting-started">Getting started</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install</span> <span class="nt">-e</span> <span class="nb">.</span>
<span class="nb">cp </span>WORKFLOW.md.example WORKFLOW.md
<span class="c"># Edit WORKFLOW.md: set your labels, tweak the prompt</span>
baton start
</code></pre></div></div>

<p>You need Python 3.11+, Claude Code CLI (<code class="language-plaintext highlighter-rouge">claude</code>), GitHub CLI (<code class="language-plaintext highlighter-rouge">gh</code>) authenticated, and Git.</p>

<p>The code is at <a href="https://github.com/mraza007/baton">github.com/mraza007/baton</a>. MIT licensed. About 10 Python modules, no external services, no databases. State lives in memory with JSON persistence for the status command.</p>

<h2 id="what-i-want-to-add-next">What I want to add next</h2>

<ul>
  <li>A proper TUI instead of <code class="language-plaintext highlighter-rouge">baton status</code> reading a JSON file</li>
  <li>Issue dependency ordering so issue 3 waits for issue 2 if it needs to</li>
  <li>Cost tracking per issue, so I can see what automating the backlog actually costs in tokens</li>
  <li>More trackers besides GitHub Issues (Linear, Jira, GitLab)</li>
</ul>

<p>If you’ve got a repo with a pile of issues sitting there, try pointing Baton at it. Start with one label and <code class="language-plaintext highlighter-rouge">max_concurrent: 1</code>. See what it does. The setup takes about five minutes and the worst case is you get a bad PR that you close. The code is MIT licensed, the whole thing is ten files, and there’s nothing weird in it. Fork it, break it, rip out the parts you don’t like.</p>

<p>If you try it, I want to hear what breaks.</p>

<hr />

<p>I write a newsletter called <a href="https://devconsole.substack.com/">Dev Console</a> where I cover what’s actually happening in AI, minus the hype. New tools, real use cases, stuff I’m building. If this post was interesting, you’ll probably like it.</p>

          ]]>
        </description>
        <pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate>
        <link>//muhammadraza.me/2026/building-baton-autonomous-agent-orchestrator/</link>
        <guid isPermaLink="true">//muhammadraza.me/2026/building-baton-autonomous-agent-orchestrator/</guid>
        
        <category>ai</category>
        
        <category>python</category>
        
        <category>tools</category>
        
        <category>automation</category>
        
        
        
        <dc:creator>Muhammad Raza</dc:creator>
        <dc:rights></dc:rights>
      </item>
    
      <item>
        <title>Harness Engineering: The DevOps Skill Nobody Told You About</title>
        <description>
          <![CDATA[
            
            <p>I’ve written before about how <a href="https://muhammadrazame.github.io/blog/2026/01/03/ai-agents-devops-perspective">AI agents are just CI pipelines with an LLM plugged in</a>. That post mapped agent concepts to infrastructure patterns you already know. But there’s a discipline forming around the infrastructure side of agents that deserves its own name.</p>

<p>Harness engineering. It’s the practice of building everything around the LLM — the execution environment, tool definitions, safety boundaries, observability, and lifecycle management. The stuff that turns a chatbot into a production system.</p>

<p>If you work in DevOps, you’ve been doing this for years. You just called it something else.</p>

<h2 id="why-harnesses-matter-more-than-models">Why Harnesses Matter More Than Models</h2>

<p>Pick any AI agent demo. Strip out the model. What’s left?</p>

<p>A container or sandbox. A set of callable tools. A loop that reads output and decides what happens next. Logging. Timeouts. Cleanup.</p>

<p>That’s the harness. And it’s where agents succeed or fail. A great model in a bad harness hallucinates, loops forever, leaks secrets, or silently does nothing useful. A decent model in a good harness stays bounded, recovers from errors, and produces auditable results.</p>

<p>DevOps engineers already think this way. You don’t just pick a good application — you build the infrastructure that makes it reliable. Same thing here.</p>

<h2 id="the-five-parts-of-a-harness">The Five Parts of a Harness</h2>

<p>Here’s how I break down harness engineering into components. Each one maps directly to something you’ve built before.</p>

<p><strong>1. Execution environment.</strong> Where does the agent run? A container, a VM, a temporary directory, a git worktree. You need isolation so the agent can’t corrupt shared state. You need reproducibility so runs are consistent. This is the same problem as CI job runners. Docker, Firecracker, nsjail — pick your isolation boundary.</p>

<p><strong>2. Tool definitions.</strong> Tools are the agent’s API surface. Read a file. Run a command. Query a database. Call an endpoint. Each tool needs input validation, output formatting, error handling, and permission scoping. Think of it like designing an API — you wouldn’t expose raw database access through a REST endpoint. Don’t give an agent raw shell access either. The tool layer is your contract.</p>

<p><strong>3. Control loop.</strong> Observe, decide, execute, verify. The loop is what makes an agent an agent instead of a one-shot prompt. Your job as a harness engineer is to decide: how many iterations? What’s the timeout per step? What happens when a tool call fails? When does the loop escalate to a human? This is the same logic you put in health check loops and deployment rollback controllers.</p>

<p><strong>4. Guardrails.</strong> Cost caps. Token limits. Command allowlists. File path restrictions. Rate limiting on external calls. Without guardrails, an agent can burn through your API budget in minutes or write to paths it shouldn’t touch. Every guardrail is a policy decision — same as IAM policies, network rules, and resource quotas you already manage.</p>

<p><strong>5. Observability.</strong> If you can’t see what the agent did, you can’t debug it, audit it, or trust it. Log every tool call, every LLM response, every decision point. Capture diffs, timing, token usage, and cost. This is no different from structured logging in any production system. The difference is that agent traces are longer and less predictable than HTTP request traces, so you need good tooling to navigate them.</p>
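
<p>To make the control loop, guardrails, and observability pieces concrete, here’s a minimal sketch of a bounded loop. Everything in it is an assumption for illustration: <code class="language-plaintext highlighter-rouge">decide</code> stands in for the LLM call, <code class="language-plaintext highlighter-rouge">tools</code> is a plain dict of callables, and the limits are arbitrary. It leaves out per-step timeouts and sandboxing.</p>

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class LoopResult:
    status: str                      # "done", "escalated", or "max_steps"
    steps: int                       # tool calls actually attempted
    log: list[dict] = field(default_factory=list)


def run_loop(decide: Callable[[list[dict]], dict],
             tools: dict[str, Callable[..., Any]],
             max_steps: int = 5, max_failures: int = 2) -> LoopResult:
    """decide(log) returns {"tool": name, "args": {...}} or {"done": True}."""
    log: list[dict] = []
    failures = 0
    for step in range(1, max_steps + 1):
        action = decide(log)                   # observe the log so far, then decide
        if action.get("done"):
            return LoopResult("done", step - 1, log)
        name = action["tool"]
        started = time.monotonic()
        try:
            if name not in tools:              # guardrail: only registered tools run
                raise PermissionError(f"tool not allowed: {name}")
            output, ok = tools[name](**action.get("args", {})), True
        except Exception as exc:               # a failed call becomes data, not a crash
            output, ok = f"{type(exc).__name__}: {exc}", False
            failures += 1
        log.append({"step": step, "tool": name, "ok": ok, "output": str(output),
                    "seconds": round(time.monotonic() - started, 3)})
        if failures >= max_failures:           # too many failures: hand off to a human
            return LoopResult("escalated", step, log)
    return LoopResult("max_steps", max_steps, log)
```

<p>The log doubles as the trace you audit afterwards and as the history the model sees on the next turn, which is why every call gets recorded whether it succeeded or not.</p>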

<h2 id="where-devops-context-overlaps">Where DevOps Context Overlaps</h2>

<p>Here’s where your existing skills plug in directly.</p>

<p><strong>Infrastructure as code.</strong> Agent harnesses should be declarative and version-controlled. The tool definitions, policies, and environment specs should live in config files, not hardcoded in application logic. When you change a tool’s behavior, that change should be reviewable in a PR.</p>

<p><strong>Pipeline orchestration.</strong> Multi-agent systems look a lot like multi-stage pipelines. One agent does research, passes context to a planning agent, which passes a plan to an implementation agent. You’re managing handoffs, shared artifacts, and failure propagation — the same coordination problem as CI/CD stages.</p>

<p><strong>Incident response.</strong> When an agent goes wrong, you need the same muscle memory. Check the logs. Find the failing step. Understand the input that caused it. Roll back if needed. The debugging workflow is identical.</p>

<p><strong>Security boundaries.</strong> Least privilege applies to agents just like it applies to services. What tools can this agent access? What files can it read? Can it make network calls? Can it spend money? Every agent needs a security boundary, and DevOps engineers already think in terms of boundaries.</p>

<h2 id="getting-started">Getting Started</h2>

<p>If you want to start building harnesses, you don’t need a new framework. Start with what you have.</p>

<p>Take a simple task — say, analyzing a failed CI build. Write a script that collects the logs, sends them to an LLM with a prompt, parses the response, and posts a summary to Slack. That’s a harness. A minimal one, but it has all the components: environment setup, tool use (log collection, Slack posting), a control flow, and output handling.</p>

<p>Then add complexity. Let the LLM decide which logs to fetch. Add a retry loop. Add a cost cap. Add structured logging. Each addition is a harness engineering decision.</p>
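
<p>Here’s a rough sketch of that CI example with a retry loop and a cost cap bolted on. The three callables (<code class="language-plaintext highlighter-rouge">collect_logs</code>, <code class="language-plaintext highlighter-rouge">ask_llm</code>, <code class="language-plaintext highlighter-rouge">post</code>) are placeholders you’d wire to your CI system, your LLM client, and Slack. The flat per-call cost estimate is an assumption too; a real harness would charge actual token usage.</p>

```python
from typing import Callable


class BudgetExceeded(RuntimeError):
    """Raised when another LLM call would go over the cost cap."""


class CostCap:
    def __init__(self, max_usd: float) -> None:
        self.max_usd = max_usd
        self.spent_usd = 0.0

    def charge(self, usd: float) -> None:
        if self.spent_usd + usd > self.max_usd:
            raise BudgetExceeded(f"cap of ${self.max_usd:.2f} reached")
        self.spent_usd += usd


def summarize_failed_build(collect_logs: Callable[[], str],
                           ask_llm: Callable[[str], str],
                           post: Callable[[str], None],
                           cap: CostCap, retries: int = 2,
                           est_cost_usd: float = 0.01) -> str:
    logs = collect_logs()
    last_error = None
    for _attempt in range(retries + 1):
        cap.charge(est_cost_usd)      # charged per attempt, so retries count too
        try:
            summary = ask_llm(f"Summarize why this CI build failed:\n{logs}").strip()
        except Exception as exc:      # transient LLM/API failure: try again
            last_error = exc
            continue
        if summary:
            post(summary)
            return summary
        last_error = ValueError("LLM returned an empty summary")
    raise RuntimeError(f"no summary after {retries + 1} attempts") from last_error
```

<p>The cap is checked before every call rather than once up front, so a retry storm hits the budget limit instead of your invoice.</p>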

<p>You don’t need to learn ML. You don’t need to fine-tune models. You need to build the infrastructure that makes models useful — and that’s the job you already do.</p>

<p>Harness engineering isn’t a new discipline. It’s DevOps applied to a new kind of workload. The sooner you see it that way, the faster you’ll build agents that actually work in production.</p>

          ]]>
        </description>
        <pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate>
        <link>//muhammadraza.me/2026/harness-engineering-devops-perspective/</link>
        <guid isPermaLink="true">//muhammadraza.me/2026/harness-engineering-devops-perspective/</guid>
        
        <category>ai</category>
        
        <category>devops</category>
        
        <category>automation</category>
        
        
        
        <dc:creator>Muhammad Raza</dc:creator>
        <dc:rights></dc:rights>
      </item>
    
      <item>
        <title>I Built Local Memory for Coding Agents Because They Keep Forgetting Everything</title>
        <description>
          <![CDATA[
            
            <p>Here’s something that frustrates me about coding agents. They forget everything. Every single session starts from scratch. The agent that spent 45 minutes yesterday figuring out your authentication flow? Gone. The decision to use JWT over sessions? Gone. The bug it found in your ORM’s lazy loading? Gone.</p>

<p>You start a new session and it re-discovers the same patterns. Repeats the same mistakes. Asks the same questions. It’s like working with a brilliant colleague who gets amnesia every night.</p>

<p>I got tired of this. So I built <a href="https://github.com/mraza007/echovault">EchoVault</a> — a local memory system that gives coding agents persistent memory across sessions. No cloud. No API keys. No cost. Just a SQLite database and some Markdown files on your machine.</p>

<h2 id="the-problem-is-real">The Problem Is Real</h2>

<p>I use coding agents daily across multiple client projects. Claude Code, Cursor, Codex — I switch between them depending on the task. Every time I start a session, I’m repeating context that the agent should already know.</p>

<p>“We chose FastAPI over Flask because of async support.”
“The deploy script needs <code class="language-plaintext highlighter-rouge">--no-cache</code> or the CSS breaks.”
“Don’t touch the legacy auth module — it’s being replaced next sprint.”</p>

<p>I was copy-pasting this stuff into every session. That’s not how tools should work.</p>

<p>I tried existing solutions. Supermemory announced their MCP and I was tempted, but it saves everything in the cloud. I work with multiple companies as a consultant — I don’t want codebase decisions stored on someone else’s servers. Claude Mem was the first tool I tried, but it was eating too much memory in my sessions and became a bottleneck when running multiple agents at the same time.</p>

<p>So I built my own.</p>

<h2 id="how-echovault-works">How EchoVault Works</h2>

<p>EchoVault runs as an MCP server. When your agent starts a session, it has three tools available:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">memory_context</code> — load prior decisions, bugs, and context for the current project</li>
  <li><code class="language-plaintext highlighter-rouge">memory_search</code> — find specific memories by keyword or semantic similarity</li>
  <li><code class="language-plaintext highlighter-rouge">memory_save</code> — persist a decision, bug fix, pattern, or learning</li>
</ul>

<p>The agent calls these tools like it calls any other tool. No hooks. No shell scripts. No prompt injection. The MCP protocol handles everything.</p>

<p>Here’s what happens in practice:</p>

<p><strong>Session start.</strong> The agent sees <code class="language-plaintext highlighter-rouge">memory_context</code> in its available tools. The tool description says “You MUST call this at session start.” The agent calls it and gets back a list of prior memories for the project. Now it knows what happened yesterday.</p>

<p><strong>During work.</strong> You ask about authentication. The agent calls <code class="language-plaintext highlighter-rouge">memory_search</code> with “authentication” and gets back the decision to use JWT, the bug with token refresh, and the migration plan. It has context before writing a single line of code.</p>

<p><strong>Session end.</strong> The agent just fixed a tricky race condition. The tool description says “You MUST call memory_save before ending any session where you made changes.” It saves the root cause, the fix, and what to watch for.</p>

<p>Next session, that knowledge is there. Every session builds on the last one.</p>

<h2 id="the-architecture">The Architecture</h2>

<p>I kept it simple. The whole system is four things:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/.memory/
├── vault/                    # Obsidian-compatible Markdown
│   └── my-project/
│       └── 2026-02-01-session.md
├── index.db                  # SQLite: FTS5 + sqlite-vec
└── config.yaml               # Optional embedding config
</code></pre></div></div>

<p><strong>Markdown vault.</strong> Every memory gets written to a session file — one file per day per project. These are valid Markdown with YAML frontmatter. You can point Obsidian at <code class="language-plaintext highlighter-rouge">~/.memory/vault/</code> and browse your agent’s memory visually. You can read them in any editor. They’re not locked in a proprietary format.</p>

<p><strong>SQLite index.</strong> This is where search happens. FTS5 handles keyword search out of the box — no configuration needed. If you want semantic search (where “authentication” matches a memory titled “JWT token setup”), add an embedding provider. I use Ollama with <code class="language-plaintext highlighter-rouge">nomic-embed-text</code> locally. You can also use OpenAI or OpenRouter if you prefer cloud.</p>

<p><strong>MCP server.</strong> The agent talks to EchoVault through the Model Context Protocol. Three tools, stdio transport, nothing fancy. The server starts when the agent needs it and stops when the session ends. Zero idle cost.</p>

<p><strong>Secret redaction.</strong> Three layers. Explicit <code class="language-plaintext highlighter-rouge">&lt;redacted&gt;</code> tags for things you mark yourself. Pattern detection that catches API keys, passwords, and credentials automatically. And <code class="language-plaintext highlighter-rouge">.memoryignore</code> rules for custom patterns. Nothing sensitive hits disk.</p>

<h2 id="making-agents-actually-save">Making Agents Actually Save</h2>

<p>Here’s the thing about MCP tools — the agent <em>can</em> call them, but will it? Retrieval works well because agents tend to grab context at the start. Saving is the hard part. The agent finishes its work and moves on. It doesn’t naturally think “I should save what I learned.”</p>

<p>The trick is the tool descriptions. When you register an MCP tool, you include a description. Agents read these descriptions and treat them as instructions. So instead of:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"Save a memory for future sessions. Call this when you make decisions."
</code></pre></div></div>

<p>I wrote:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"Save a memory for future sessions. You MUST call this before ending
any session where you made changes, fixed bugs, made decisions, or
learned something. This is not optional — failing to save means the
next session starts from zero."
</code></pre></div></div>

<p>That “MUST” language makes a real difference. It’s not 100% reliable — nothing with LLMs is — but agents follow strong tool descriptions much more consistently than passive ones.</p>

<h2 id="cross-agent-memory">Cross-Agent Memory</h2>

<p>One of the things I wanted was a single vault for all my agents. A memory saved by Claude Code should be searchable from Cursor or Codex. They’re all working on the same codebase. Why should they have separate memories?</p>

<p>EchoVault stores everything in one place. The MCP server is the same regardless of which agent connects to it. Setup is one command per agent:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>memory setup claude-code   <span class="c"># writes ~/.claude.json</span>
memory setup cursor        <span class="c"># writes .cursor/mcp.json</span>
memory setup codex         <span class="c"># writes .codex/config.toml + AGENTS.md</span>
memory setup opencode      <span class="c"># writes opencode.json</span>
</code></pre></div></div>

<p>Each agent has its own config format and conventions. Claude Code uses JSON with <code class="language-plaintext highlighter-rouge">mcpServers</code>. Cursor uses the same schema but a different file path. Codex uses TOML with <code class="language-plaintext highlighter-rouge">[mcp_servers]</code>. OpenCode uses JSON with an <code class="language-plaintext highlighter-rouge">mcp</code> key and a different command format (<code class="language-plaintext highlighter-rouge">command</code> as an array instead of separate <code class="language-plaintext highlighter-rouge">command</code> + <code class="language-plaintext highlighter-rouge">args</code>).</p>

<p>I wrote shared helpers so each agent’s setup is just a thin wrapper around <code class="language-plaintext highlighter-rouge">_install_mcp_servers()</code> or <code class="language-plaintext highlighter-rouge">_install_toml_mcp()</code>. Adding a new agent takes maybe 20 lines of code.</p>
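
<p>A shared JSON helper along those lines might look like the sketch below. It’s a guess at the shape, not the real <code class="language-plaintext highlighter-rouge">_install_mcp_servers()</code>: the function names, the <code class="language-plaintext highlighter-rouge">echovault</code> server name, and the <code class="language-plaintext highlighter-rouge">memory mcp</code> launch command are all hypothetical.</p>

```python
import json
from pathlib import Path


def make_entry(command: str, args: list[str], style: str = "split") -> dict:
    """'split' keeps command and args apart; 'array' folds them into one list."""
    if style == "array":
        return {"command": [command, *args]}
    return {"command": command, "args": args}


def install_mcp_server(config_path: Path, name: str, entry: dict,
                       key: str = "mcpServers") -> bool:
    """Merge one server entry into a JSON config. Returns True if the file changed."""
    text = config_path.read_text() if config_path.exists() else ""
    config = json.loads(text) if text.strip() else {}
    servers = config.setdefault(key, {})
    if servers.get(name) == entry:
        return False                       # already installed: leave the file alone
    servers[name] = entry                  # other keys and servers stay untouched
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return True
```

<p>With a helper like this, a per-agent setup function only has to pick the path, the top-level key, and the entry style.</p>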

<h2 id="what-gets-saved">What Gets Saved</h2>

<p>Not everything should be a memory. Trivial changes don’t need to be persisted. Information that’s obvious from reading the code doesn’t need a memory. The goal is to capture what a future agent wouldn’t know from just looking at the codebase.</p>

<p>Good memories:</p>

<ul>
  <li><strong>Decisions.</strong> “Chose JWT over sessions because the API needs to be stateless.” A future agent reading the code sees JWT but doesn’t know <em>why</em>.</li>
  <li><strong>Bugs.</strong> “The ORM lazy-loads relationships by default, causing N+1 queries in the user list endpoint. Fixed by adding <code class="language-plaintext highlighter-rouge">.options(joinedload(...))</code>. Root cause: SQLAlchemy default behavior.” A future agent won’t hit the same bug.</li>
  <li><strong>Patterns.</strong> “All API endpoints follow the pattern: validate input, check permissions, execute, return response. Don’t add business logic in the route handler.” A future agent follows the existing patterns instead of inventing new ones.</li>
  <li><strong>Context.</strong> “The legacy auth module is being replaced. Don’t modify it — changes go into the new auth service at <code class="language-plaintext highlighter-rouge">src/auth/v2/</code>.” A future agent doesn’t waste time on dead code.</li>
</ul>

<p>Each memory has a title, a “what happened” summary, optional “why” and “impact” fields, tags, and a category. Search returns compact ~50-token summaries. Full details are fetched on demand so context windows don’t get bloated.</p>
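
<p>As a sketch, that record and its compact form could be modeled like this. The field names follow the description above, but the class itself is illustrative, and counting words is only a crude stand-in for counting tokens.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    title: str
    what: str                      # the "what happened" summary
    category: str                  # e.g. "decision", "bug", "pattern", "context"
    why: str = ""                  # optional
    impact: str = ""               # optional
    tags: list[str] = field(default_factory=list)

    def summary(self, max_words: int = 35) -> str:
        """Compact one-liner for search results; full details stay out of it."""
        words = f"[{self.category}] {self.title}: {self.what}".split()
        if len(words) <= max_words:
            return " ".join(words)
        return " ".join(words[:max_words]) + " ..."
```

<p>Search hands back only <code class="language-plaintext highlighter-rouge">summary()</code>; the <code class="language-plaintext highlighter-rouge">why</code> and <code class="language-plaintext highlighter-rouge">impact</code> fields are what a follow-up detail fetch would add.</p>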

<h2 id="the-technical-bits">The Technical Bits</h2>

<p>A few implementation details that might be useful if you’re building something similar.</p>

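<p><strong>FTS5 for keyword search.</strong> SQLite’s FTS5 extension is fast and works with zero configuration. No external service needed. It handles phrase matching and BM25 ranking out of the box, and stemming once you opt into the <code class="language-plaintext highlighter-rouge">porter</code> tokenizer. For most use cases, this is all you need.</p>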

<p><strong>sqlite-vec for semantic search.</strong> When you want “authentication” to match “JWT token rotation”, you need vectors. I use <code class="language-plaintext highlighter-rouge">sqlite-vec</code> to store embeddings right in the same SQLite database. No vector database needed. Embedding providers are pluggable — Ollama for local, OpenAI or OpenRouter for cloud.</p>

<p><strong>Hybrid search.</strong> The search pipeline runs FTS5 first (fast, precise), then semantic search (slower, fuzzy), and merges the results. This gives you the best of both worlds — exact keyword matches and semantic similarity.</p>
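
<p>Here’s a minimal sketch of that pipeline. The FTS5 half is real SQLite; the semantic half is faked as a list of row ids, standing in for whatever sqlite-vec would return. The table layout and the keyword-first, de-duplicating merge are assumptions, not necessarily how EchoVault ranks results.</p>

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """In-memory FTS5 table with a few sample memories (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE VIRTUAL TABLE memories USING fts5(title, body)")
    db.executemany(
        "INSERT INTO memories(title, body) VALUES (?, ?)",
        [
            ("JWT token setup", "Chose JWT for API authentication"),
            ("Token refresh bug", "Refresh tokens expired too early"),
            ("Deploy script", "Needs the no-cache flag or the CSS breaks"),
        ],
    )
    return db


def keyword_search(db: sqlite3.Connection, query: str, limit: int = 5) -> list[int]:
    """Exact-term matches, best first (FTS5's built-in rank is BM25)."""
    rows = db.execute(
        "SELECT rowid FROM memories WHERE memories MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [row[0] for row in rows]


def merge_results(keyword_ids: list[int], semantic_ids: list[int],
                  limit: int = 5) -> list[int]:
    """Keyword hits first, then semantic hits that were not already found."""
    merged = list(dict.fromkeys([*keyword_ids, *semantic_ids]))
    return merged[:limit]
```

<p>Putting keyword hits first means an exact match never gets buried under fuzzy neighbours, while the semantic list still surfaces memories that never mention the search term.</p>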

<p><strong>TOML parsing with fallbacks.</strong> Codex writes some non-standard TOML — unquoted filesystem paths as table keys, dotted version strings as key names. Standard <code class="language-plaintext highlighter-rouge">tomllib</code> chokes on these. I added a fallback that appends the MCP section directly via string operations when parsing fails. It’s not pretty but it handles real-world config files.</p>

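<p><strong>Symlink handling.</strong> Some agents create symlinks in their skill directories. <code class="language-plaintext highlighter-rouge">shutil.rmtree()</code> raises <code class="language-plaintext highlighter-rouge">OSError</code> when the path it is given is itself a symlink, so cleanup has to check for links and unlink them instead. Small thing but it bit me in production.</p>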
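
<p>A symlink-safe cleanup is only a few lines. This is a sketch of the general technique, with <code class="language-plaintext highlighter-rouge">remove_path</code> as a made-up helper name rather than anything from EchoVault:</p>

```python
import shutil
from pathlib import Path


def remove_path(path: Path) -> None:
    """Delete a file, directory, or symlink without following links."""
    if path.is_symlink():
        path.unlink()            # remove the link itself, never its target
    elif path.is_dir():
        shutil.rmtree(path)
    elif path.exists():
        path.unlink()
```

<p>The symlink check has to come first, because <code class="language-plaintext highlighter-rouge">is_dir()</code> follows links and would send a linked directory straight into <code class="language-plaintext highlighter-rouge">rmtree()</code>.</p>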

<h2 id="setting-it-up">Setting It Up</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>git+https://github.com/mraza007/echovault.git
memory init
memory setup claude-code
</code></pre></div></div>

<p>That’s it. Three commands. The agent has memory now.</p>

<p>If you want semantic search, configure an embedding provider:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>memory config init
<span class="c"># Edit ~/.memory/config.yaml to set your provider</span>
memory reindex
</code></pre></div></div>

<p>For fully local operation with no external API calls, use Ollama:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">embedding</span><span class="pi">:</span>
  <span class="na">provider</span><span class="pi">:</span> <span class="s">ollama</span>
  <span class="na">model</span><span class="pi">:</span> <span class="s">nomic-embed-text</span>
</code></pre></div></div>

<h2 id="what-ive-learned">What I’ve Learned</h2>

<p>Building this taught me a few things about agent tooling.</p>

<p><strong>Tool descriptions are instructions.</strong> Agents read them and follow them. Strong, directive language in tool descriptions is more effective than passive documentation. “You MUST” works better than “You can.”</p>

<p><strong>Local-first matters.</strong> Not because of ideology, but because of practical constraints. Consultants work with multiple clients. Sensitive decisions shouldn’t leave the machine. And when your internet goes out, local tools still work.</p>

<p><strong>MCP is the right abstraction.</strong> Instead of writing agent-specific hooks, skills, and config formats, I write one MCP server and each agent connects to it. When a new agent comes along, I add a setup function for its config format. The memory logic doesn’t change.</p>

<p><strong>Simple storage wins.</strong> Markdown files you can read in any editor. SQLite you can query with any tool. No custom binary formats. No daemon to keep running. The system is completely inspectable and debuggable.</p>

<p>The code is at <a href="https://github.com/mraza007/echovault">github.com/mraza007/echovault</a>. It’s MIT licensed. If you’re tired of your agents forgetting everything, give it a shot.</p>

          ]]>
        </description>
        <pubDate>Tue, 17 Feb 2026 00:00:00 +0000</pubDate>
        <link>//muhammadraza.me/2026/building-local-memory-for-coding-agents/</link>
        <guid isPermaLink="true">//muhammadraza.me/2026/building-local-memory-for-coding-agents/</guid>
        
        <category>ai</category>
        
        <category>python</category>
        
        <category>tools</category>
        
        
        
        <dc:creator>Muhammad Raza</dc:creator>
        <dc:rights></dc:rights>
      </item>
    
      <item>
        <title>AI Agents Are Just CI Pipelines With an LLM Plugged In</title>
        <description>
          <![CDATA[
            
            <p>In this post, I’ll show you how to think about AI agents through the infrastructure patterns you already use. Think about your CI runner. It spins up an environment. Runs some steps. Reads files. Runs tests. Captures output. Decides what to do next. Knows when to stop.</p>

<p>Now swap out the hardcoded logic for an LLM. That’s it. That’s an AI agent in simple terms. The fancy demos want you to think it’s magic. Some brand new thing you need to learn from scratch. It’s not. When you take away the hype, an agent is just a controlled automation loop. The LLM handles the reasoning and everything else is infrastructure you’ve built a hundred times.</p>

<p>Here’s what matters: the agent itself isn’t the hard part. The harness is. That means the execution environment, tooling, guardrails, and observability, all the important stuff that makes automation work in production.</p>

<p>DevOps engineers have been building harnesses forever. CI runners. Deployment pipelines. Infrastructure automation. The patterns are the same. The skills transfer directly.</p>

<p>So if you’re wondering whether AI agents are worth learning, here’s the short answer. You’re already halfway there.</p>

<h2 id="what-an-agent-actually-looks-like">What an Agent Actually Looks Like</h2>

<p>Let’s set the marketing hype aside and look at what an agent actually is from a DevOps engineer’s point of view. An AI agent has six parts.</p>

<ol>
  <li>
    <p><code class="language-plaintext highlighter-rouge">An LLM</code>: The LLM is the most important part of an agent because it acts as the brain. It reads context and decides what to do next. It doesn’t touch anything directly.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">A workspace</code>: Think of it as a sandboxed environment. A cloned repo. A container. A temp directory. Same as any CI job.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">A set of tools</code>: These are the actions it can request. Read a file. Run a command. Call an API. Query logs. The agent doesn’t run these itself. It asks for them.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">A control loop</code>: This is the core pattern. Observe the current state. Decide an action. Execute it. Check the result. Keep going until you’re done.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">Policies and limits</code>: Timeouts. Permission boundaries. Rate limits. Cost caps. Without these, agents can spin forever or do things they shouldn’t.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">A termination condition</code>: The agent needs to know when to stop. Task complete. Error threshold hit. Human review needed. Something has to end the loop.</p>
  </li>
</ol>

<p>None of this is new. You’ve built systems with all of these components. The only difference is the LLM sitting in the decision seat.</p>

<h2 id="the-harness-does-the-heavy-lifting">The Harness Does the Heavy Lifting</h2>

<p>Everyone focuses on the LLM. They miss the important part. The harness is what makes an agent actually work.</p>

<p>The harness is everything around the model. It spins up the environment. Exposes tools. Executes commands on the agent’s behalf. Captures logs and diffs. Enforces limits. Decides when the loop should stop.</p>

<p>Sound familiar? It should. This is what CI runners do.</p>

<p>GitHub Actions. GitLab runners. Jenkins agents. They all follow the same pattern. Spin up an isolated environment. Run steps. Capture output. Handle success and failure. Clean up.</p>

<p>An agent harness does the exact same thing. The only twist is the steps aren’t hardcoded in YAML. They come from the LLM at runtime.</p>

<p>This is why DevOps engineers are perfect for this work. You already think about isolation, execution, logging, and cleanup. You already build systems that run untrusted code safely. Agent harnesses are the same problem with a new input source.</p>

<h2 id="tool-use-is-the-safety-mechanism">Tool Use Is the Safety Mechanism</h2>

<p>Agents don’t touch systems directly. This matters. The LLM never runs a command itself. Never writes a file itself. It requests actions through tools.</p>

<p>The harness gets the request. Validates it. Executes it in a controlled way. Returns a structured result.</p>

<p>This is how you keep agents safe.</p>

<p>Say the agent wants to run a shell command. The harness can check it against an allowlist. Run it in a sandbox. Set a timeout. Capture stderr. The agent never gets raw shell access.</p>
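
<p>A minimal sketch of what such a command tool could look like. The function name <code class="language-plaintext highlighter-rouge">run_command_tool</code> and the contents of the allowlist are assumptions made for illustration, not part of any real framework. It covers the allowlist check, the timeout, and output capture; actual sandboxing (a container or a locked-down user) is left out.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import shlex
import subprocess
import sys

# Hypothetical allowlist: the only executables the agent may ask for.
ALLOWED_BINARIES = {"git", "ls", "pytest", sys.executable}


def run_command_tool(command, timeout_seconds=10):
    """Validate a requested command, run it, and return a structured result."""
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED_BINARIES:
        return {"ok": False, "error": "command not allowed", "stdout": "", "stderr": ""}
    try:
        # No shell is involved: argv goes straight to the binary, so pipes,
        # redirects, and command chaining are never interpreted.
        done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return {"ok": False, "error": "timeout", "stdout": "", "stderr": ""}
    except FileNotFoundError:
        return {"ok": False, "error": "binary not found", "stdout": "", "stderr": ""}
    return {"ok": done.returncode == 0, "error": None, "stdout": done.stdout, "stderr": done.stderr}


print(run_command_tool("rm -rf /tmp/scratch")["error"])  # prints: command not allowed
</code></pre></div></div>

<p>Checking only the binary name is coarse. A stricter harness would probably validate arguments too, since an allowed tool can still do damage with the wrong flags.</p>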

<p>Same thing for file operations. The agent requests a file write. The harness checks the path. Validates the content. Writes the file and returns confirmation.</p>
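
<p>The path check is the part that is easy to get wrong, so here is a rough sketch of it. The name <code class="language-plaintext highlighter-rouge">write_file_tool</code> and the shape of the returned dictionary are hypothetical; the idea is simply to resolve the requested path and refuse anything that lands outside the workspace.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from pathlib import Path


def write_file_tool(workspace, relative_path, content):
    """Write a file for the agent, refusing paths outside the workspace."""
    root = Path(workspace).resolve()
    target = (root / relative_path).resolve()
    # resolve() collapses ".." and follows symlinks, so a request such as
    # "../../etc/passwd" ends up outside root and is rejected here.
    if root not in target.parents:
        return {"ok": False, "error": "path escapes workspace"}
    if not isinstance(content, str):
        return {"ok": False, "error": "content must be text"}
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return {"ok": True, "path": str(target.relative_to(root)), "chars": len(content)}
</code></pre></div></div>

<p>A call like <code class="language-plaintext highlighter-rouge">write_file_tool(workspace, "../escape.txt", "hi")</code> comes back with <code class="language-plaintext highlighter-rouge">ok</code> set to false and nothing is written.</p>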

<p>You control what tools exist. You control how they behave. You control what the agent can even ask for.</p>

<p>This is the same idea behind least privilege. The agent only gets access to what it needs. The harness enforces the boundary.</p>

<h2 id="the-control-loop-in-practice">The Control Loop in Practice</h2>

<p>The core of any agent is the control loop. It looks like this.</p>

<ol>
  <li>
    <p><code class="language-plaintext highlighter-rouge">Observe</code>: The agent reads the current state. Test output. Log files. Diffs. Error messages. Whatever context it needs.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">Decide</code>: The LLM looks at the state and picks an action. Run another test. Edit a file. Ask for more information. Give up.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">Execute</code>: The harness runs the requested action and returns the result.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">Verify</code>: The agent checks if the action worked. Did the test pass? Did the error go away? Is the task done?</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">Repeat</code>: If the task isn’t complete, go back to observe.</p>
  </li>
</ol>

<p>This loop keeps running until a termination condition hits—success, failure, timeout, max iterations, or human intervention.</p>
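
<p>The five steps above can be sketched in a few lines. This is an illustration under stated assumptions, not a real framework: <code class="language-plaintext highlighter-rouge">run_agent</code>, the action format, and <code class="language-plaintext highlighter-rouge">scripted_decide</code> are all made up for the example, and the scripted function stands in for the LLM call so the loop can run without a model.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def run_agent(decide, tools, task, max_iterations=5):
    """Observe, decide, execute, verify, and repeat under a hard iteration cap."""
    history = []  # everything the model gets to observe on its next turn
    for step in range(max_iterations):
        action = decide(task, history)  # Decide: stands in for the LLM call
        if action["tool"] == "finish":  # Termination: the model reports it is done
            return {"status": "success", "steps": step, "history": history}
        tool = tools.get(action["tool"])
        if tool is None:
            result = {"ok": False, "error": "unknown tool"}
        else:
            result = tool(**action.get("args", {}))  # Execute: the harness runs it
        history.append({"action": action, "result": result})  # feeds the next observe
    return {"status": "max_iterations", "steps": max_iterations, "history": history}


def scripted_decide(task, history):
    """Placeholder policy: run the tests, then finish once they pass."""
    if history and history[-1]["result"].get("ok"):
        return {"tool": "finish"}
    return {"tool": "run_tests", "args": {}}


tools = {"run_tests": lambda: {"ok": True, "output": "3 passed"}}
outcome = run_agent(scripted_decide, tools, "make the tests pass")
print(outcome["status"], outcome["steps"])  # prints: success 1
</code></pre></div></div>

<p>Swapping <code class="language-plaintext highlighter-rouge">scripted_decide</code> for a function that calls a model is the only change needed to turn this into an agent. The iteration cap and the tool lookup stay exactly where they are.</p>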

<p>You’ve seen this before: build, test, fix, rebuild. CI pipelines do this, deployment rollbacks do this, and health check loops do this.</p>

<p>Agents just make the “decide” step dynamic instead of scripted.</p>

<p>Here’s where they actually help in DevOps work.</p>

<p><strong>CI failure analysis.</strong> When a test fails, the agent reads the logs, checks the diff, identifies the cause, and suggests a fix—maybe even applying it and rerunning the test.</p>

<p><strong>Terraform drift detection.</strong> The agent compares actual state to declared state, flags the drift, and proposes a remediation plan. A human approves before anything changes.</p>

<p><strong>Kubernetes manifest review.</strong> The agent checks YAML against best practices (missing resource limits, no liveness probes, exposed secrets), catching the stuff humans miss in review.</p>

<p><strong>Cost anomaly investigation.</strong> When spending spikes, the agent queries Cost Explorer, correlates with recent deployments, and surfaces the likely cause, saving an hour of digging.</p>

<p><strong>Incident log triage.</strong> Faced with pages of logs, the agent reads them, extracts the relevant lines, and summarizes what went wrong. It doesn’t replace the engineer; it gets them to the answer faster.</p>

<p>Notice the pattern: the agent assists and handles the tedious parts while the human stays in control of decisions that matter.</p>

<p>AI agents sound complicated with their new frameworks, new terminology, and new paradigms.</p>

<p>But look past the hype and you’ll see something familiar.</p>

<p>An agent is an automation loop where the LLM picks the next step, the harness executes it safely, tools provide controlled access to systems, and policies keep things bounded.</p>

<p>This is CI/CD architecture, infrastructure thinking, the stuff you already do.</p>

<p>When you read about agent frameworks or watch demos of coding assistants, you now have a lens to see the harness underneath, spot the control loop, and ask the right questions: what tools does it expose, what limits exist, and how does it handle failure?</p>

<p>You don’t need to become an ML engineer to understand agents—you just need to recognize the infrastructure patterns you’ve been using all along.</p>

<p>The LLM is the new part. Everything else is your domain.</p>

          ]]>
        </description>
        <pubDate>Sat, 03 Jan 2026 00:00:00 +0000</pubDate>
        <link>//muhammadraza.me/2026/ai-agents-devops-perspective/</link>
        <guid isPermaLink="true">//muhammadraza.me/2026/ai-agents-devops-perspective/</guid>
        
        <category>ai</category>
        
        <category>devops</category>
        
        <category>automation</category>
        
        
        
        <dc:creator>Muhammad Raza</dc:creator>
        <dc:rights></dc:rights>
      </item>
    
      <item>
        <title>AWS Cost Optimization Case Study: How I Cut a Client&apos;s Bill by 50%</title>
        <description>
          <![CDATA[
            
            <p>Last month, a client’s AWS bill hit $5,000 — up 40% from last year with no clear explanation.</p>

<p>After one week of systematic analysis, I cut it to <strong>$2,500/month</strong> — a 50% reduction, saving them <strong>$30,000 annually</strong>. Here’s exactly how I did it, with the scripts you can use.</p>

<h2 id="the-discovery-phase-how-i-found-the-problems">The Discovery Phase: How I Found the Problems</h2>

<p>Before touching anything, I needed to understand the infrastructure. Here’s my systematic approach:</p>

<h3 id="step-1-pull-cost-data-by-service">Step 1: Pull Cost Data by Service</h3>

<p>First, I analyzed their AWS Cost Explorer data to understand where money was going:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws ce get-cost-and-usage <span class="se">\</span>
  <span class="nt">--time-period</span> <span class="nv">Start</span><span class="o">=</span>2024-11-01,End<span class="o">=</span>2024-12-01 <span class="se">\</span>
  <span class="nt">--granularity</span> MONTHLY <span class="se">\</span>
  <span class="nt">--metrics</span> <span class="s2">"BlendedCost"</span> <span class="se">\</span>
  <span class="nt">--group-by</span> <span class="nv">Type</span><span class="o">=</span>DIMENSION,Key<span class="o">=</span>SERVICE
</code></pre></div></div>

<p>This gave me the high-level breakdown. But I needed more detail.</p>
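
<p>The raw JSON from that command is awkward to read, so a small helper to rank services by cost is worth having. This is a sketch: <code class="language-plaintext highlighter-rouge">rank_services</code> is a hypothetical name, and the sample input is illustrative data shaped like the CLI output, not the client’s real billing data.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def rank_services(response, metric="BlendedCost"):
    """Total cost per service from a get-cost-and-usage response, largest first."""
    totals = {}
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            service = group["Keys"][0]  # one key, because we grouped by SERVICE only
            amount = float(group["Metrics"][metric]["Amount"])  # amounts arrive as strings
            totals[service] = totals.get(service, 0.0) + amount
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [
        (name, round(cost, 2), round(100 * cost / grand_total, 1) if grand_total else 0.0)
        for name, cost in ranked
    ]


# Illustrative input shaped like the CLI output, not real billing data.
sample = {"ResultsByTime": [{"Groups": [
    {"Keys": ["AmazonCloudWatch"], "Metrics": {"BlendedCost": {"Amount": "500.0", "Unit": "USD"}}},
    {"Keys": ["Amazon RDS"], "Metrics": {"BlendedCost": {"Amount": "750.0", "Unit": "USD"}}},
]}]}
print(rank_services(sample))
# prints: [('Amazon RDS', 750.0, 60.0), ('AmazonCloudWatch', 500.0, 40.0)]
</code></pre></div></div>

<p>Each tuple is service name, cost, and share of the total in percent. Saving the CLI output to a file and passing it through <code class="language-plaintext highlighter-rouge">json.load</code> should give the same structure.</p>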

<h3 id="step-2-build-a-resource-inventory">Step 2: Build a Resource Inventory</h3>

<p>I wrote a Python script to scan all resources across regions and identify optimization opportunities:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">boto3</span>

<span class="k">def</span> <span class="nf">scan_ebs_volumes</span><span class="p">():</span>
    <span class="s">"""Find GP2 volumes that should be GP3 and unattached volumes."""</span>
    <span class="n">ec2</span> <span class="o">=</span> <span class="n">boto3</span><span class="p">.</span><span class="n">client</span><span class="p">(</span><span class="s">'ec2'</span><span class="p">)</span>
    <span class="n">volumes</span> <span class="o">=</span> <span class="n">ec2</span><span class="p">.</span><span class="n">describe_volumes</span><span class="p">()[</span><span class="s">'Volumes'</span><span class="p">]</span>

    <span class="n">gp2_volumes</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">unattached</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">for</span> <span class="n">vol</span> <span class="ow">in</span> <span class="n">volumes</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">vol</span><span class="p">[</span><span class="s">'VolumeType'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'gp2'</span><span class="p">:</span>
            <span class="n">gp2_volumes</span><span class="p">.</span><span class="n">append</span><span class="p">({</span>
                <span class="s">'VolumeId'</span><span class="p">:</span> <span class="n">vol</span><span class="p">[</span><span class="s">'VolumeId'</span><span class="p">],</span>
                <span class="s">'Size'</span><span class="p">:</span> <span class="n">vol</span><span class="p">[</span><span class="s">'Size'</span><span class="p">],</span>
                <span class="s">'State'</span><span class="p">:</span> <span class="n">vol</span><span class="p">[</span><span class="s">'State'</span><span class="p">],</span>
                <span class="s">'MonthlyCost'</span><span class="p">:</span> <span class="n">vol</span><span class="p">[</span><span class="s">'Size'</span><span class="p">]</span> <span class="o">*</span> <span class="mf">0.10</span><span class="p">,</span>  <span class="c1"># GP2 pricing
</span>                <span class="s">'GP3Cost'</span><span class="p">:</span> <span class="n">vol</span><span class="p">[</span><span class="s">'Size'</span><span class="p">]</span> <span class="o">*</span> <span class="mf">0.08</span><span class="p">,</span>      <span class="c1"># GP3 pricing
</span>                <span class="s">'Savings'</span><span class="p">:</span> <span class="n">vol</span><span class="p">[</span><span class="s">'Size'</span><span class="p">]</span> <span class="o">*</span> <span class="mf">0.02</span>
            <span class="p">})</span>

        <span class="k">if</span> <span class="n">vol</span><span class="p">[</span><span class="s">'State'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'available'</span><span class="p">:</span>  <span class="c1"># Not attached
</span>            <span class="n">unattached</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">vol</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">gp2_volumes</span><span class="p">,</span> <span class="n">unattached</span>
</code></pre></div></div>

<p>I built similar functions for:</p>
<ul>
  <li>RDS instances (storage type, utilization, Multi-AZ necessity)</li>
  <li>EC2 instances (generation, Reserved Instance coverage)</li>
  <li>Elastic IPs (attached vs idle)</li>
  <li>EBS snapshots (age, associated volumes)</li>
  <li>S3 buckets (storage class, lifecycle policies)</li>
</ul>

<h3 id="step-3-analyze-cloudwatch-metrics-for-utilization">Step 3: Analyze CloudWatch Metrics for Utilization</h3>

<p>This is critical. Before recommending any right-sizing, I needed data:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span><span class="p">,</span> <span class="n">timedelta</span><span class="p">,</span> <span class="n">timezone</span>

<span class="kn">import</span> <span class="nn">boto3</span>


<span class="k">def</span> <span class="nf">get_instance_utilization</span><span class="p">(</span><span class="n">instance_id</span><span class="p">,</span> <span class="n">days</span><span class="o">=</span><span class="mi">30</span><span class="p">):</span>
    <span class="s">"""Get daily average and maximum CPU utilization over the past N days."""</span>
    <span class="n">cloudwatch</span> <span class="o">=</span> <span class="n">boto3</span><span class="p">.</span><span class="n">client</span><span class="p">(</span><span class="s">'cloudwatch'</span><span class="p">)</span>

    <span class="n">now</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">(</span><span class="n">timezone</span><span class="p">.</span><span class="n">utc</span><span class="p">)</span>  <span class="c1"># CloudWatch timestamps are UTC</span>
    <span class="n">response</span> <span class="o">=</span> <span class="n">cloudwatch</span><span class="p">.</span><span class="n">get_metric_statistics</span><span class="p">(</span>
        <span class="n">Namespace</span><span class="o">=</span><span class="s">'AWS/EC2'</span><span class="p">,</span>
        <span class="n">MetricName</span><span class="o">=</span><span class="s">'CPUUtilization'</span><span class="p">,</span>
        <span class="n">Dimensions</span><span class="o">=</span><span class="p">[{</span><span class="s">'Name'</span><span class="p">:</span> <span class="s">'InstanceId'</span><span class="p">,</span> <span class="s">'Value'</span><span class="p">:</span> <span class="n">instance_id</span><span class="p">}],</span>
        <span class="n">StartTime</span><span class="o">=</span><span class="n">now</span> <span class="o">-</span> <span class="n">timedelta</span><span class="p">(</span><span class="n">days</span><span class="o">=</span><span class="n">days</span><span class="p">),</span>
        <span class="n">EndTime</span><span class="o">=</span><span class="n">now</span><span class="p">,</span>
        <span class="n">Period</span><span class="o">=</span><span class="mi">86400</span><span class="p">,</span>  <span class="c1"># Daily
</span>        <span class="n">Statistics</span><span class="o">=</span><span class="p">[</span><span class="s">'Average'</span><span class="p">,</span> <span class="s">'Maximum'</span><span class="p">]</span>
    <span class="p">)</span>

    <span class="k">return</span> <span class="n">response</span><span class="p">[</span><span class="s">'Datapoints'</span><span class="p">]</span>
</code></pre></div></div>

<p>The results were eye-opening:</p>
<ul>
  <li>Two t3.xlarge instances averaging <strong>12% CPU</strong></li>
  <li>RDS storage at <strong>95% free space</strong></li>
  <li>Multiple log groups with <strong>no retention policy</strong> (storing terabytes)</li>
</ul>
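
<p>Numbers like the 12% average come from collapsing the daily datapoints into a single figure per instance. A rough sketch of that step, assuming the datapoint shape CloudWatch returns for the <code class="language-plaintext highlighter-rouge">Average</code> and <code class="language-plaintext highlighter-rouge">Maximum</code> statistics. The function name and the 20% threshold are assumptions for the example, not a rule from this engagement.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def summarize_cpu(datapoints, downsize_below=20.0):
    """Collapse CloudWatch datapoints into one average, one peak, and a flag."""
    if not datapoints:
        return {"average": None, "peak": None, "downsize_candidate": False}
    average = sum(point["Average"] for point in datapoints) / len(datapoints)
    peak = max(point["Maximum"] for point in datapoints)
    return {
        "average": round(average, 1),
        "peak": round(peak, 1),
        # Flag on the mean only; the peak is reported so a human can rule out
        # short bursts before resizing anything.
        "downsize_candidate": downsize_below > average,
    }


# Illustrative datapoints, not real measurements.
print(summarize_cpu([{"Average": 10.0, "Maximum": 55.0}, {"Average": 14.0, "Maximum": 40.0}]))
# prints: {'average': 12.0, 'peak': 55.0, 'downsize_candidate': True}
</code></pre></div></div>

<p>Averaging daily averages is an approximation, and CPU alone says nothing about memory or network pressure, so treat the flag as a prompt to look closer rather than a decision.</p>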

<h3 id="step-4-map-dependencies-before-cutting">Step 4: Map Dependencies Before Cutting</h3>

<p>Before deleting anything, I mapped what depended on what:</p>
<ul>
  <li>Which services used which Elastic IPs?</li>
  <li>Which applications wrote to which log groups?</li>
  <li>Which backups were actually needed for compliance?</li>
</ul>

<p>This prevented the classic mistake of breaking production while optimizing costs.</p>

<h2 id="the-starting-point">The Starting Point</h2>

<p>After the discovery phase, here’s what I was working with:</p>

<table>
  <thead>
    <tr>
      <th>Service</th>
      <th>Monthly Cost</th>
      <th>% of Total</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>EC2-Other (EBS, NAT, IPs)</td>
      <td>$1,250</td>
      <td>25%</td>
    </tr>
    <tr>
      <td>RDS</td>
      <td>$750</td>
      <td>15%</td>
    </tr>
    <tr>
      <td>EC2 Compute</td>
      <td>$650</td>
      <td>13%</td>
    </tr>
    <tr>
      <td>CloudWatch</td>
      <td>$500</td>
      <td>10%</td>
    </tr>
    <tr>
      <td>AWS Backup</td>
      <td>$500</td>
      <td>10%</td>
    </tr>
    <tr>
      <td>ECS Fargate</td>
      <td>$400</td>
      <td>8%</td>
    </tr>
    <tr>
      <td>S3</td>
      <td>$250</td>
      <td>5%</td>
    </tr>
    <tr>
      <td>VPC</td>
      <td>$250</td>
      <td>5%</td>
    </tr>
    <tr>
      <td>Everything else</td>
      <td>$450</td>
      <td>9%</td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td><strong>$5,000</strong></td>
      <td><strong>100%</strong></td>
    </tr>
  </tbody>
</table>

<p>The distribution told me a lot. EC2-related costs (compute + EBS + networking) made up 38% of the bill. That’s where I started.</p>

<h2 id="phase-1-quick-wins-implemented-same-day">Phase 1: Quick Wins (Implemented Same Day)</h2>

<h3 id="release-idle-elastic-ips--saved-50month">Release Idle Elastic IPs — Saved $50/month</h3>

<p>My inventory script flagged 5 Elastic IPs with no association:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws ec2 describe-addresses <span class="nt">--query</span> <span class="s1">'Addresses[?AssociationId==null]'</span>
</code></pre></div></div>

<p>Someone had provisioned them for test environments that were deleted months ago. Classic ghost infrastructure.</p>

<p><strong>Time to fix:</strong> 5 minutes.</p>

<h3 id="migrate-ebs-gp2-to-gp3--saved-125month">Migrate EBS GP2 to GP3 — Saved $125/month</h3>

<p>The script found 6,000+ GB across multiple EBS volumes still on GP2. GP3 costs <a href="https://aws.amazon.com/ebs/pricing/">20% less</a> <strong>and</strong> provides better baseline performance (3,000 IOPS vs GP2’s variable IOPS based on size).</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws ec2 modify-volume <span class="nt">--volume-id</span> vol-xxx <span class="nt">--volume-type</span> gp3
</code></pre></div></div>

<p>No downtime. Just a CLI command per volume.</p>

<p><strong>Time to fix:</strong> 30 minutes for all volumes.</p>
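
<p>With more than a handful of volumes, the same change can be scripted. A minimal sketch using boto3’s <code class="language-plaintext highlighter-rouge">modify_volume</code> call; the helper name <code class="language-plaintext highlighter-rouge">migrate_volumes_to_gp3</code> is hypothetical, and the client is passed in so the function can be exercised without touching a real account.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def migrate_volumes_to_gp3(ec2_client, volume_ids):
    """Request a gp3 conversion for each volume and report the outcome per volume."""
    results = {}
    for volume_id in volume_ids:
        try:
            response = ec2_client.modify_volume(VolumeId=volume_id, VolumeType="gp3")
            results[volume_id] = response["VolumeModification"]["ModificationState"]
        except Exception as error:  # keep going so one bad volume does not stop the batch
            results[volume_id] = "failed: " + str(error)
    return results


# Usage with a real client (requires AWS credentials):
#   import boto3
#   ids = [v["VolumeId"] for v in gp2_volumes]  # from scan_ebs_volumes()
#   print(migrate_volumes_to_gp3(boto3.client("ec2"), ids))
</code></pre></div></div>

<p>The returned state only says the modification was accepted. Checking the volumes again afterwards is still worth doing before calling the migration finished.</p>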

<h3 id="set-cloudwatch-log-retention--saved-100month">Set CloudWatch Log Retention — Saved $100/month</h3>

<p>My scan found 20+ log groups with no retention policy — storing logs forever:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws logs describe-log-groups <span class="nt">--query</span> <span class="s1">'logGroups[?retentionInDays==null].logGroupName'</span>
</code></pre></div></div>

<p>Set production to 90 days, staging to 30 days.</p>

<p><strong>Time to fix:</strong> 20 minutes.</p>
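
<p>Applying retention to 20+ groups by hand gets tedious, so here is a sketch of how it could be scripted. It assumes staging log groups have “staging” somewhere in their name, which is a convention invented for this example. Both helper names are hypothetical; <code class="language-plaintext highlighter-rouge">put_retention_policy</code> is the boto3 CloudWatch Logs call.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def plan_retention(log_group_names, production_days=90, staging_days=30):
    """Map each log group name to a retention period based on its name."""
    plan = {}
    for name in log_group_names:
        # Assumption: staging groups carry "staging" in their name.
        plan[name] = staging_days if "staging" in name.lower() else production_days
    return plan


def apply_retention(logs_client, plan):
    """Apply the plan with one put_retention_policy call per log group."""
    for name, days in plan.items():
        logs_client.put_retention_policy(logGroupName=name, retentionInDays=days)
    return len(plan)


print(plan_retention(["/ecs/api-prod", "/ecs/api-staging"]))
# prints: {'/ecs/api-prod': 90, '/ecs/api-staging': 30}
</code></pre></div></div>

<p>Splitting the plan from the apply step means the plan can be reviewed before anything changes. Calling <code class="language-plaintext highlighter-rouge">apply_retention(boto3.client("logs"), plan)</code> then makes the actual changes.</p>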

<h2 id="phase-2-the-big-discoveries">Phase 2: The Big Discoveries</h2>

<h3 id="aws-backup-running-24x-more-often-than-needed--saved-400month">AWS Backup Running 24x More Often Than Needed — Saved $400/month</h3>

<p>This was the most surprising find. When I pulled the backup plan configuration:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws backup get-backup-plan <span class="nt">--backup-plan-id</span> xxx
</code></pre></div></div>

<p>I saw: <strong>hourly backups</strong>. 24 backups per day. For every resource.</p>

<p>The backup storage had grown to $500/month — 10% of their total bill.</p>

<p>I reviewed their recovery requirements (they only needed daily backups with 14-day retention) and reconfigured:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"ScheduleExpression"</span><span class="p">:</span><span class="w"> </span><span class="s2">"cron(0 5 * * ? *)"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"StartWindowMinutes"</span><span class="p">:</span><span class="w"> </span><span class="mi">60</span><span class="p">,</span><span class="w">
  </span><span class="nl">"CompletionWindowMinutes"</span><span class="p">:</span><span class="w"> </span><span class="mi">120</span><span class="p">,</span><span class="w">
  </span><span class="nl">"Lifecycle"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"DeleteAfterDays"</span><span class="p">:</span><span class="w"> </span><span class="mi">14</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p><strong>Time to fix:</strong> 1 hour (including testing).</p>

<h3 id="cloudwatch-metric-streams-to-nowhere--saved-400month">CloudWatch Metric Streams to Nowhere — Saved $400/month</h3>

<p>My CloudWatch cost breakdown showed $400/month on “Metric Streams” — 100+ million metric updates going somewhere.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws cloudwatch list-metric-streams
</code></pre></div></div>

<p>Found a stream configured to send data to a third-party monitoring tool. When I asked about it, nobody on the team knew it existed. The integration had been set up by a previous contractor and was never used.</p>

<p>This is a perfect example of ghost infrastructure that accumulates over time.</p>

<h3 id="rds-over-provisioned-by-95">RDS Over-Provisioned by 95%</h3>

<p>My RDS analysis showed all instances had massive storage allocated. The CloudWatch metrics told the real story:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws cloudwatch get-metric-statistics <span class="se">\</span>
  <span class="nt">--namespace</span> AWS/RDS <span class="se">\</span>
  <span class="nt">--metric-name</span> FreeStorageSpace <span class="se">\</span>
  <span class="nt">--dimensions</span> <span class="nv">Name</span><span class="o">=</span>DBInstanceIdentifier,Value<span class="o">=</span>production-db <span class="se">\</span>
  <span class="nt">--start-time</span> 2024-11-01T00:00:00Z <span class="se">\</span>
  <span class="nt">--end-time</span> 2024-11-30T23:59:59Z <span class="se">\</span>
  <span class="nt">--period</span> 86400 <span class="se">\</span>
  <span class="nt">--statistics</span> Average
</code></pre></div></div>

<p><strong>Result:</strong> 95% free space across all databases.</p>

<p>RDS storage can only be increased, not decreased. But I migrated all instances from GP2 to GP3 storage — same price, better performance.</p>

<p>For the next database refresh, I recommended right-sized storage instead of the default massive allocations.</p>

<p><strong>Saved:</strong> $150/month</p>

<h2 id="phase-3-infrastructure-improvements">Phase 3: Infrastructure Improvements</h2>

<h3 id="nat-gateway-consolidation--saved-125month">NAT Gateway Consolidation — Saved $125/month</h3>

<p>My VPC analysis showed NAT Gateways in every AZ across multiple regions costing $500/month combined. After reviewing their architecture and traffic patterns, they only needed half of them.</p>

<h3 id="ecs-task-right-sizing--saved-250month">ECS Task Right-Sizing — Saved $250/month</h3>

<p>The ECS service scan found:</p>
<ul>
  <li>A staging service constantly failing health checks and restarting (consuming resources 24/7 while accomplishing nothing)</li>
  <li>Legacy services still running in production that nobody was using</li>
</ul>

<p>These issues relate directly to the <a href="/2025/ecs-decisions-that-waste-6-weeks/">ECS architectural decisions</a> that often waste weeks of engineering time. I also enabled Fargate Spot for fault-tolerant workloads (up to 70% savings on those tasks).</p>

<h3 id="s3-lifecycle-policies--saved-150month">S3 Lifecycle Policies — Saved $150/month</h3>

<p>My S3 bucket analysis showed backup buckets had grown to 10+ TB with no lifecycle policy. Old backups were stored in Standard tier forever.</p>

<p>Added a policy to transition to Glacier after 90 days:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"Rules"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"ID"</span><span class="p">:</span><span class="w"> </span><span class="s2">"archive-old-backups"</span><span class="p">,</span><span class="w">
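      </span><span class="nl">"Status"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Enabled"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Filter"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="nl">"Prefix"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">},</span><span class="w">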
      </span><span class="nl">"Transitions"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="p">{</span><span class="w">
          </span><span class="nl">"Days"</span><span class="p">:</span><span class="w"> </span><span class="mi">90</span><span class="p">,</span><span class="w">
          </span><span class="nl">"StorageClass"</span><span class="p">:</span><span class="w"> </span><span class="s2">"GLACIER"</span><span class="w">
        </span><span class="p">}</span><span class="w">
      </span><span class="p">]</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<h3 id="reserved-instances-for-stable-workloads--saved-500month">Reserved Instances for Stable Workloads — Saved $500/month</h3>

<p>My EC2 coverage analysis showed zero Reserved Instance coverage on instances running 24/7.</p>

<p>I helped them purchase <a href="https://aws.amazon.com/savingsplans/compute-pricing/">Compute Savings Plans</a> covering their steady-state workloads. Immediate 30-40% savings on compute.</p>

<h3 id="ec2-instance-right-sizing--saved-250month">EC2 Instance Right-Sizing — Saved $250/month</h3>

<p>The utilization data was clear: multiple instances running at 10-15% CPU.</p>

<p>Downsized t3.xlarge instances to t3.large where utilization data supported it. Same workload, half the cost.</p>

<h2 id="the-results">The Results</h2>

<table>
  <thead>
    <tr>
      <th>Category</th>
      <th>Monthly Savings</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Reserved Instances / Savings Plans</td>
      <td>$500</td>
    </tr>
    <tr>
      <td>AWS Backup (hourly → daily)</td>
      <td>$400</td>
    </tr>
    <tr>
      <td>CloudWatch Metric Streams</td>
      <td>$400</td>
    </tr>
    <tr>
      <td>ECS cleanup + Fargate Spot</td>
      <td>$250</td>
    </tr>
    <tr>
      <td>EC2 right-sizing</td>
      <td>$250</td>
    </tr>
    <tr>
      <td>S3 lifecycle policies</td>
      <td>$150</td>
    </tr>
    <tr>
      <td>RDS improvements</td>
      <td>$150</td>
    </tr>
    <tr>
      <td>EBS GP2 → GP3</td>
      <td>$125</td>
    </tr>
    <tr>
      <td>NAT Gateway consolidation</td>
      <td>$125</td>
    </tr>
    <tr>
      <td>CloudWatch log retention</td>
      <td>$100</td>
    </tr>
    <tr>
      <td>Idle Elastic IPs</td>
      <td>$50</td>
    </tr>
    <tr>
      <td><strong>Total Monthly Savings</strong></td>
      <td><strong>$2,500</strong></td>
    </tr>
  </tbody>
</table>

<p><strong>From $5,000/month to $2,500/month: exactly a 50% reduction.</strong></p>

<p>Over a year, that’s <strong>$30,000 back in their pocket</strong>.</p>
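
<p>The totals are easy to sanity-check. The figures below are copied from the savings table above; only the variable names are mine.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Monthly savings in USD, one entry per row of the results table.
monthly_savings = {
    "Reserved Instances / Savings Plans": 500,
    "AWS Backup (hourly to daily)": 400,
    "CloudWatch Metric Streams": 400,
    "ECS cleanup + Fargate Spot": 250,
    "EC2 right-sizing": 250,
    "S3 lifecycle policies": 150,
    "RDS improvements": 150,
    "EBS GP2 to GP3": 125,
    "NAT Gateway consolidation": 125,
    "CloudWatch log retention": 100,
    "Idle Elastic IPs": 50,
}

starting_bill = 5000
total = sum(monthly_savings.values())
print(total, total * 12, 100 * total / starting_bill)  # prints: 2500 30000 50.0
</code></pre></div></div>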

<h2 id="the-methodology">The Methodology</h2>

<p>Here’s the systematic approach I use for every cost optimization engagement:</p>

<h3 id="1-get-the-data-first">1. Get the Data First</h3>

<p>Before making any changes, I pull:</p>
<ul>
  <li>AWS Cost Explorer data (by service, by tag, over time)</li>
  <li>CloudWatch metrics for utilization</li>
  <li>Resource inventory across all regions</li>
</ul>

<h3 id="2-find-the-ghosts">2. Find the Ghosts</h3>

<p>“Ghost infrastructure” costs more than you think:</p>
<ul>
  <li>Unused Elastic IPs</li>
  <li>Detached EBS volumes</li>
  <li>Empty S3 buckets accumulating requests</li>
  <li>Log groups with infinite retention</li>
  <li>Metric Streams nobody monitors</li>
  <li>Test environments that outlived their purpose</li>
</ul>

<h3 id="3-right-size-ruthlessly">3. Right-Size Ruthlessly</h3>

<p>Check actual utilization before committing to Reserved Instances:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># EC2 CPU utilization over 30 days</span>
aws cloudwatch get-metric-statistics <span class="se">\</span>
  <span class="nt">--namespace</span> AWS/EC2 <span class="se">\</span>
  <span class="nt">--metric-name</span> CPUUtilization <span class="se">\</span>
  <span class="nt">--dimensions</span> <span class="nv">Name</span><span class="o">=</span>InstanceId,Value<span class="o">=</span>i-xxx <span class="se">\</span>
  <span class="nt">--start-time</span> <span class="si">$(</span><span class="nb">date</span> <span class="nt">-u</span> <span class="nt">-d</span> <span class="s2">"30 days ago"</span> +%Y-%m-%dT%H:%M:%SZ<span class="si">)</span> <span class="se">\</span>
  <span class="nt">--end-time</span> <span class="si">$(</span><span class="nb">date</span> <span class="nt">-u</span> +%Y-%m-%dT%H:%M:%SZ<span class="si">)</span> <span class="se">\</span>
  <span class="nt">--period</span> 3600 <span class="se">\</span>
  <span class="nt">--statistics</span> Average
</code></pre></div></div>

<p>If a t3.xlarge averages 15% CPU, you’re paying for 85% idle capacity.</p>

<h3 id="4-modernize-storage">4. Modernize Storage</h3>

<p>GP2 → GP3 is almost always worth it:</p>
<ul>
  <li>20% cheaper at baseline</li>
  <li>Better performance (3,000 IOPS baseline)</li>
  <li>Zero downtime migration</li>
</ul>

<h3 id="5-review-backup-policies">5. Review Backup Policies</h3>

<p>Backups grow silently. Questions to ask:</p>
<ul>
  <li>How often do you actually need backups?</li>
  <li>How long do you really need to keep them?</li>
  <li>Are you backing up dev/test environments at production frequency?</li>
</ul>

<h2 id="what-this-looks-like-over-time">What This Looks Like Over Time</h2>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Before</th>
      <th>After</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Monthly spend</td>
      <td>$5,000</td>
      <td>$2,500</td>
    </tr>
    <tr>
      <td>Annual spend</td>
      <td>$60,000</td>
      <td>$30,000</td>
    </tr>
    <tr>
      <td><strong>Annual savings</strong></td>
      <td>—</td>
      <td><strong>$30,000</strong></td>
    </tr>
  </tbody>
</table>

<p>The best part? None of these changes affected performance or reliability. Most improved it.</p>

<h2 id="common-patterns-i-see">Common Patterns I See</h2>

<p>After doing this for multiple clients, patterns emerge:</p>

<ol>
  <li><strong>Metric Streams nobody monitors</strong> — $100-400/month just disappearing</li>
  <li><strong>Hourly backups for daily restore needs</strong> — 24x the storage cost</li>
  <li><strong>GP2 volumes from years ago</strong> — never migrated to GP3</li>
  <li><strong>Multi-AZ staging databases</strong> — paying for HA nobody needs</li>
  <li><strong>NAT Gateways in every AZ</strong> — when one or two would suffice</li>
  <li><strong>Logs kept forever</strong> — “just in case”</li>
  <li><strong>No Reserved Instances</strong> — paying full on-demand for 24/7 workloads</li>
  <li><strong>Over-provisioned everything</strong> — “it might need it someday”</li>
</ol>

<hr />

<h2 id="need-help-with-your-aws-bill">Need Help With Your AWS Bill?</h2>

<p>I do AWS cost optimization as part of my DevOps consulting practice. If your AWS bill feels too high or you just want a second pair of eyes on your infrastructure, let’s talk.</p>

<p><strong><a href="https://calendly.com/muhammad-07/30-minute-meeting">Book a free 30-minute call</a></strong> — I’ll review your current setup and tell you where I see opportunities.</p>

<hr />

<p><em>Have questions about any of these optimizations? Drop a comment below or reach out on <a href="https://twitter.com/muhammad_o7">Twitter/X</a>.</em></p>


          ]]>
        </description>
        <pubDate>Sat, 27 Dec 2025 00:00:00 +0000</pubDate>
        <link>//muhammadraza.me/2025/aws-cost-optimization-case-study/</link>
        <guid isPermaLink="true">//muhammadraza.me/2025/aws-cost-optimization-case-study/</guid>
        
        <category>aws</category>
        
        <category>devops</category>
        
        <category>case-study</category>
        
        
        
        <dc:creator>Muhammad Raza</dc:creator>
        <dc:rights></dc:rights>
      </item>
    
      <item>
        <title>The 5 ECS Decisions That Waste 6 Weeks (And What to Pick Instead)</title>
        <description>
          <![CDATA[
            
            <p>I’ve been helping Python teams deploy to AWS for the past 2 years now. The pattern is always the same: a team has a working FastAPI or Django app running perfectly on their laptops with <code class="language-plaintext highlighter-rouge">docker-compose up</code>, and then someone says “let’s put this in ECS.” Six weeks later, they’re still arguing about whether to use Fargate or EC2.</p>

<p>The problem isn’t that ECS is hard. The problem is that teams treat infrastructure decisions like they’re permanent. They’re not.</p>

<p>Last year I worked with a startup that spent 5 weeks evaluating container orchestration options. Five weeks. They had a working app. They had paying customers waiting. But the engineering team was stuck in an endless loop of “what if we need to scale?” and “shouldn’t we future-proof this?”</p>

<p>They launched on Fargate. It took 3 days once they stopped debating.</p>

<p>Here are the 5 decisions that waste the most time and what I tell every client to pick.</p>

<h2 id="fargate-vs-ec2-just-use-fargate">Fargate vs EC2: Just Use Fargate</h2>

<p>This one wastes more time than all the others combined.</p>

<p>I get it. EC2 looks cheaper on paper. You can run the numbers, build a spreadsheet, show that at 50 containers you’ll save $400/month with EC2. The finance person gets excited. Someone mentions spot instances. Now you’re three meetings deep into capacity planning for traffic you don’t have yet.</p>

<p>Here’s what actually happens with EC2: you spend a week figuring out instance types, another week on auto-scaling groups, then you hit some weird issue where your containers won’t place because the bin-packing algorithm can’t find space, and suddenly your “cheaper” option has eaten two sprints of engineering time.</p>

<p>Fargate just works. You tell it how much CPU and memory you need, and it runs your container. No instances to manage, no patching, no capacity planning.</p>

<p>“But it’s more expensive!”</p>

<p>Sure. Maybe 20-30% more at scale. But you’re not at scale. You’re trying to ship. And even if Fargate costs you an extra $200/month right now, that’s nothing compared to the $30k+ in engineering salaries you’re burning while debating this.</p>
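
<p>To make that comparison concrete, here’s a back-of-the-envelope sketch. It only uses the rough figures from the paragraph above ($200/month premium, $30k of engineering time); the function name is mine, and you should plug in your own numbers.</p>

```python
# Back-of-the-envelope sketch using the rough figures quoted above.
# Both inputs are illustrative estimates, not measured data.

def months_to_break_even(debate_cost_usd: float, monthly_premium_usd: float) -> float:
    """Months of Fargate premium you could pay for the cost of the debate."""
    if monthly_premium_usd <= 0:
        raise ValueError("monthly premium must be positive")
    return debate_cost_usd / monthly_premium_usd


if __name__ == "__main__":
    months = months_to_break_even(debate_cost_usd=30_000, monthly_premium_usd=200)
    # 30,000 / 200 = 150 months of premium, i.e. 12.5 years
    print(f"{months:.0f} months, or about {months / 12:.1f} years")
```

<p>With those inputs, the engineering time spent debating would cover the Fargate premium for more than a decade.</p>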

<p>Python apps especially benefit from Fargate. Your Django app with Celery workers is memory-heavy and I/O bound. You’re not doing CPU-intensive work. Fargate lets you right-size memory without playing Tetris with EC2 instance types.</p>

<p>Pick Fargate. When you’re running 200 containers 24/7 and have real cost data, revisit. Until then, move on.</p>

<h2 id="ecs-service-discovery-use-an-internal-alb">ECS Service Discovery: Use an Internal ALB</h2>

<p>When your services need to talk to each other, AWS gives you three options: Cloud Map, internal ALB, or Service Connect. I’ve seen teams spend weeks evaluating all three, setting up proof-of-concepts, reading whitepapers.</p>

<p>Just use an internal ALB.</p>

<p>I know, it’s not exciting. It’s a load balancer. It’s been around forever. But that’s exactly why you should use it:</p>

<ul>
  <li>It gives you a stable DNS name your services can call</li>
  <li>Health checks work out of the box</li>
  <li>You get access logs for debugging</li>
  <li>Every developer on your team already understands HTTP</li>
</ul>

<p>Your FastAPI service calls <code class="language-plaintext highlighter-rouge">http://api-internal.yourdomain.local/users</code> and it just works. No service mesh. No Envoy sidecars. No DNS caching gotchas.</p>
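
<p>On the calling side, this can be as small as the sketch below. It uses only the standard library; the <code class="language-plaintext highlighter-rouge">INTERNAL_API_BASE</code> environment variable and both helper names are assumptions for illustration, and the hostname is the placeholder from above.</p>

```python
import json
import os
import urllib.request

# Hypothetical env var; the default is the example hostname used in this post.
INTERNAL_API_BASE = os.environ.get(
    "INTERNAL_API_BASE", "http://api-internal.yourdomain.local"
)


def build_url(path: str, base: str = INTERNAL_API_BASE) -> str:
    """Join the ALB base URL and a path without doubling slashes."""
    return f"{base.rstrip('/')}/{path.lstrip('/')}"


def get_json(path: str, timeout: float = 3.0):
    """GET a JSON resource through the internal ALB with an explicit timeout."""
    with urllib.request.urlopen(build_url(path), timeout=timeout) as response:
        return json.load(response)
```

<p>Calling <code class="language-plaintext highlighter-rouge">get_json("/users")</code> from another service is then plain HTTP. The explicit timeout matters more than the client library you pick: without one, a slow downstream service can tie up your workers.</p>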

<p>Cloud Map is fine, but I’ve debugged too many issues where services couldn’t find each other because of DNS TTL problems. Service Connect is powerful, but now you’re operating a service mesh. Do you really want to be debugging Envoy proxy configuration when your actual problem is a database query?</p>

<p>The internal ALB is boring. Boring is good. Boring means you’re debugging your application code instead of your infrastructure.</p>

<h2 id="cicd-for-ecs-use-github-actions">CI/CD for ECS: Use GitHub Actions</h2>

<p>I’m gonna be honest here: if your code is on GitHub, use GitHub Actions. Don’t overthink this.</p>

<p>“But CodePipeline is AWS-native!”</p>

<p>Yes, and it requires you to set up a pipeline with Source, Build, and Deploy stages, configure IAM roles for each stage, create buildspec files, and wire everything together. It’s more YAML for the same result.</p>

<p>“But Jenkins gives us more control!”</p>

<p>It’s 2025. Please don’t set up a Jenkins server. You’ll spend more time maintaining Jenkins than deploying your app.</p>

<p>GitHub Actions has an official AWS action that handles ECS deployments:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Deploy to ECS</span>
  <span class="na">uses</span><span class="pi">:</span> <span class="s">aws-actions/amazon-ecs-deploy-task-definition@v1</span>
  <span class="na">with</span><span class="pi">:</span>
    <span class="na">task-definition</span><span class="pi">:</span> <span class="s">task-definition.json</span>
    <span class="na">service</span><span class="pi">:</span> <span class="s">my-service</span>
    <span class="na">cluster</span><span class="pi">:</span> <span class="s">my-cluster</span>
    <span class="na">wait-for-service-stability</span><span class="pi">:</span> <span class="no">true</span>
</code></pre></div></div>

<p>That’s the whole thing. It registers your task definition, updates the service, and waits for the deployment to stabilize. AWS maintains it. It works.</p>

<p>Your deployment workflow lives in your repo, your team already knows GitHub Actions from running tests, and you’re not managing another piece of infrastructure. If you want to understand how these pipeline runners actually work internally, I wrote a deep dive on <a href="/2025/building-cicd-pipeline-runner-python/">building a CI/CD pipeline runner from scratch in Python</a>.</p>

<p>If you’re on GitLab, use GitLab CI. If you’re on Bitbucket, use Bitbucket Pipelines. The point is: use whatever’s already integrated with your code. Don’t add complexity.</p>

<h2 id="ecs-secrets-management-use-ssm-parameter-store">ECS Secrets Management: Use SSM Parameter Store</h2>

<p>Where do you store your database passwords and API keys?</p>

<p>Not in your task definition. I’ve seen that. Please don’t.</p>

<p>The two real options are SSM Parameter Store and Secrets Manager. Teams debate this endlessly because Secrets Manager has automatic rotation and sounds more “enterprise.”</p>

<p>Here’s the thing: SSM Parameter Store is free for standard parameters, integrates natively with ECS, and handles 99% of use cases.</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"secrets"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"DATABASE_URL"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"valueFrom"</span><span class="p">:</span><span class="w"> </span><span class="s2">"arn:aws:ssm:us-east-1:123456789012:parameter/myapp/database_url"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Your Python app reads <code class="language-plaintext highlighter-rouge">os.environ['DATABASE_URL']</code> like it does locally. No SDK, no code changes.</p>
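
<p>One thing I’d add on the application side is a fail-fast check at startup, so a missing secret shows up as a clear error instead of a confusing crash later. This is a minimal sketch; <code class="language-plaintext highlighter-rouge">require_env</code>, <code class="language-plaintext highlighter-rouge">load_settings</code>, and <code class="language-plaintext highlighter-rouge">MissingConfigError</code> are names I made up for illustration.</p>

```python
import os


class MissingConfigError(RuntimeError):
    """Raised at startup when a required environment variable is absent."""


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail fast with a clear error."""
    value = os.environ.get(name)
    if not value:
        raise MissingConfigError(f"required environment variable {name} is not set")
    return value


def load_settings() -> dict:
    # DATABASE_URL is the secret name used in the task definition above.
    return {"database_url": require_env("DATABASE_URL")}
```

<p>The same code runs locally with a <code class="language-plaintext highlighter-rouge">.env</code> file or an exported variable, which is the whole point: the app never needs to know the value came from SSM.</p>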

<p>Secrets Manager costs $0.40 per secret per month and is worth it if you need automatic rotation for RDS credentials. But you probably don’t need that on day one. Start with SSM, migrate specific secrets to Secrets Manager later if you need rotation.</p>

<p>And please, don’t set up HashiCorp Vault unless you have compliance requirements that specifically mandate it. You’re now operating a distributed system just to store passwords. That’s not simplifying your life.</p>

<h2 id="ecs-logging-and-monitoring-use-cloudwatch">ECS Logging and Monitoring: Use CloudWatch</h2>

<p>Every team wants to evaluate Datadog, New Relic, Honeycomb, and then maybe self-host Prometheus and Grafana “for cost savings.”</p>

<p>Stop. Use CloudWatch.</p>

<p>Add this to your task definition:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"logConfiguration"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"logDriver"</span><span class="p">:</span><span class="w"> </span><span class="s2">"awslogs"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"options"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"awslogs-group"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/ecs/my-service"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"awslogs-region"</span><span class="p">:</span><span class="w"> </span><span class="s2">"us-east-1"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"awslogs-stream-prefix"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ecs"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Done. Your container logs go to CloudWatch. You can query them with CloudWatch Logs Insights. Enable Container Insights and you get CPU/memory metrics. Set up a few alarms. You now have better observability than 80% of startups.</p>
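
<p>If you want those logs to be easier to query, one option is to emit one JSON object per line from your Python app, since Logs Insights can pick up fields from JSON log lines. Here’s a minimal sketch with the standard library; the field names and the <code class="language-plaintext highlighter-rouge">JsonFormatter</code> class are my own choices, not anything ECS requires.</p>

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_logging(level: int = logging.INFO) -> None:
    # The awslogs driver captures stdout/stderr, so a stream handler is enough.
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(level)
```

<p>Call <code class="language-plaintext highlighter-rouge">configure_logging()</code> once at startup and keep using <code class="language-plaintext highlighter-rouge">logging.getLogger(__name__)</code> as usual.</p>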

<p>Datadog is genuinely good software. I like it. But it costs $15+ per host per month from day one, and you need to manage another vendor relationship. You can add it later when you actually need distributed tracing or APM.</p>

<p>Self-hosted observability is a trap. I’ve seen teams spend months building ELK stacks and Prometheus clusters. That’s infrastructure work that doesn’t ship features. Unless you have a dedicated platform team, don’t volunteer for this.</p>

<h2 id="the-actual-pattern-here">The Actual Pattern Here</h2>

<p>Look at what I recommended:</p>

<ul>
  <li>Fargate over EC2</li>
  <li>Internal ALB over Cloud Map or Service Connect</li>
  <li>GitHub Actions over CodePipeline or Jenkins</li>
  <li>SSM over Secrets Manager or Vault</li>
  <li>CloudWatch over Datadog or self-hosted</li>
</ul>

<p>Every single choice optimizes for the same thing: <strong>less stuff to manage</strong>.</p>

<p>Yes, some of these cost slightly more money. Yes, some of them are less flexible. But they all share one property: they let you ship faster and debug easier.</p>

<p>And here’s what nobody puts in their architecture decision records: all of these choices are reversible.</p>

<ul>
  <li>Fargate to EC2? Task definitions work on both.</li>
  <li>ALB to Service Connect? Just DNS changes.</li>
  <li>SSM to Secrets Manager? Same integration pattern.</li>
  <li>CloudWatch to Datadog? Add the agent, keep CloudWatch as backup.</li>
</ul>

<p>The “wrong” choice costs you maybe a few hundred dollars a month in inefficiency. The debate about the “right” choice costs you weeks of engineering time.</p>

<h2 id="what-ive-actually-seen-happen">What I’ve Actually Seen Happen</h2>

<p>Teams that follow this advice ship in about a week:</p>

<ul>
  <li>Day 1-2: Fargate cluster up, first service running</li>
  <li>Day 3: ALB routing traffic, services talking to each other</li>
  <li>Day 4: GitHub Actions deploying on push to main</li>
  <li>Day 5: Secrets in SSM, logs in CloudWatch, basic alarms set up</li>
</ul>

<p>Week 2: Building features.</p>

<p>Teams that “do it right” are still having meetings about networking topology in week 6.</p>

<p>I’ve watched startups run out of runway while their infrastructure was still “almost ready.” I’ve seen senior engineers burn out on DevOps work instead of building the product that got them excited in the first place.</p>

<p>Your Python app on Fargate with CloudWatch logs isn’t going to fall over at 1,000 users. Probably not at 10,000. By the time scale is actually a problem, you’ll have the traffic data and revenue to solve it properly.</p>

<p>Ship first. Optimize later.</p>

<hr />

<p><strong>If you found this helpful, share it on X and tag me <a href="https://twitter.com/muhammad_o7">@muhammad_o7</a></strong> - I’d love to hear about your ECS deployment experiences. You can also connect with me on <a href="https://www.linkedin.com/in/muhammad-raza-07/">LinkedIn</a>.</p>

<p><strong>Need Help?</strong> I’m available for AWS and DevOps consulting. If you’re stuck in ECS decision paralysis or need help getting to production faster, reach out via <a href="mailto:muhammadraza0047@gmail.com">email</a> or DM me on <a href="https://twitter.com/muhammad_o7">X/Twitter</a>.</p>

          ]]>
        </description>
        <pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate>
        <link>//muhammadraza.me/2025/ecs-decisions-that-waste-6-weeks/</link>
        <guid isPermaLink="true">//muhammadraza.me/2025/ecs-decisions-that-waste-6-weeks/</guid>
        
        <category>aws</category>
        
        <category>devops</category>
        
        <category>python</category>
        
        
        
        <dc:creator>Muhammad Raza</dc:creator>
        <dc:rights></dc:rights>
      </item>
    
      <item>
        <title>Building AI Agents for DevOps: From CI/CD Automation to Autonomous Deployments</title>
        <description>
          <![CDATA[
            
            <p>In my previous post, I showed you how to build a <a href="https://muhammadrazame.github.io/blog/2025/11/24/building-ci-cd-pipeline-runner-from-scratch-in-python/">CI/CD pipeline runner from scratch in Python</a>. We built something powerful: a system that could orchestrate jobs, manage dependencies, and pass artifacts between stages. It was the muscles of your deployment workflow.</p>

<p>But here’s the problem: that pipeline runner can only do exactly what you tell it to do.</p>

<p>It’s 2 AM. Your deployment pipeline fails. The error message is cryptic: <code class="language-plaintext highlighter-rouge">Error: Connection refused on port 5432</code>. Your traditional CI/CD pipeline stops dead. It sends an alert. You wake up, check the logs, realize the database connection pool was exhausted, restart the service, and go back to bed frustrated.</p>

<p>What if your pipeline could investigate the failure itself?</p>

<p>What if, instead of just stopping and alerting you, it could:</p>

<ul>
  <li>Analyze the error logs</li>
  <li>Check recent code changes</li>
  <li>Search for similar issues in your repository</li>
  <li>Identify that this same error happened two weeks ago when someone forgot to increase the connection pool</li>
  <li>Post a detailed root cause analysis to Slack with a suggested fix</li>
</ul>

<p>That’s not science fiction. That’s what AI agents can do for your DevOps workflows.</p>

<p>Over the past 2 years working independently as a DevOps consultant, I’ve seen the same patterns at every client: pipeline failures that need investigation, deployment decisions that require context, and incidents that demand rapid root cause analysis. These aren’t problems that need faster execution. They need reasoning.</p>

<p>That’s when I realized: the CI/CD runner we built is powerful, but it’s missing a brain. So I decided to add one.</p>

<h2 id="traditional-automation-vs-ai-agents">Traditional Automation vs. AI Agents</h2>

<p>Here’s the fundamental difference:</p>

<table>
  <thead>
    <tr>
      <th>Traditional CI/CD Pipeline</th>
      <th>AI Agent</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Executes predefined steps in order</td>
      <td>Reasons about what steps to take</td>
    </tr>
    <tr>
      <td>Fails when encountering unexpected situations</td>
      <td>Investigates and adapts to new situations</td>
    </tr>
    <tr>
      <td>Requires humans to make decisions</td>
      <td>Makes informed decisions autonomously</td>
    </tr>
    <tr>
      <td>Uses fixed if-then-else logic</td>
      <td>Uses context-aware reasoning</td>
    </tr>
    <tr>
      <td>Needs explicit error handling for every case</td>
      <td>Generalizes from patterns and past experience</td>
    </tr>
  </tbody>
</table>

<p>Your traditional pipeline is like a factory assembly line: efficient and reliable for known workflows, but completely stuck when something unexpected happens.</p>

<p>An AI agent is like a DevOps engineer who can think, investigate, and make decisions based on the full context of your system.</p>

<h2 id="what-were-building">What We’re Building</h2>

<p>In this post, I’m going to show you how to build a Pipeline Health Monitor Agent: an AI system that watches your GitHub Actions workflows and autonomously investigates failures.</p>

<p>Here’s what our agent will do:</p>

<ul>
  <li><strong>Monitor</strong>: Watch for GitHub Actions workflow failures via webhooks</li>
  <li><strong>Investigate</strong>: Automatically fetch logs, check recent commits, and analyze error patterns</li>
  <li><strong>Reason</strong>: Use an LLM (like GPT-4 or Claude) to understand what went wrong</li>
  <li><strong>Report</strong>: Post detailed findings to Slack with actionable recommendations</li>
  <li><strong>Learn</strong>: Remember similar issues and apply learned patterns</li>
</ul>

<p>And we’ll do all of this securely. By one widely quoted estimate, 48% of AI-generated code contains vulnerabilities, so I’m going to show you exactly how to validate every action your agent takes.</p>

<h2 id="what-youll-learn">What You’ll Learn</h2>

<p>By the end of this post, you’ll be able to:</p>

<ul>
  <li>Understand how AI agents differ from traditional automation and when to use each</li>
  <li>Build a working DevOps AI agent using LangChain and LangGraph</li>
  <li>Integrate the agent with your existing GitHub Actions workflows</li>
  <li>Implement security validation layers to prevent AI-generated vulnerabilities</li>
</ul>

<p>We’ll build this progressively: starting with the core agent, adding GitHub Actions integration, and then hardening it with security layers. Every code example will be complete and runnable.</p>

<p>The core philosophy: AI agents augment your pipeline, they don’t replace it. You’ll still have your traditional CI/CD workflows. The agent just makes them smarter.</p>

<p>Let’s start by understanding what AI agents actually are and how they work.</p>

<h2 id="understanding-ai-agents-the-4-core-components">Understanding AI Agents: The 4 Core Components</h2>

<p>Before we start coding, you need to understand what makes an AI agent fundamentally different from a script or a traditional automation workflow.</p>

<p>A traditional pipeline is a sequence of commands. An AI agent is a reasoning loop.</p>

<h3 id="the-agent-loop">The Agent Loop</h3>

<p>Every AI agent operates in a continuous cycle:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────────────────────────────────────────────┐
│                                                     │
│  Observe → Reason → Plan → Act → Observe (repeat)   │
│                                                     │
└─────────────────────────────────────────────────────┘
</code></pre></div></div>

<p>Here’s what happens when your GitHub Actions workflow fails:</p>

<ol>
  <li><strong>Observe</strong>: Agent receives webhook notification about pipeline failure</li>
  <li><strong>Reason</strong>: LLM analyzes the error message and context</li>
  <li><strong>Plan</strong>: Agent decides which tools to use (check logs, git history, search issues)</li>
  <li><strong>Act</strong>: Agent executes those tools and gathers information</li>
  <li><strong>Observe</strong>: Agent reviews tool outputs and repeats the cycle until it has an answer</li>
</ol>
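
<p>Stripped of any framework, the loop above can be sketched in a few lines of plain Python. This is not the LangChain implementation, just an illustration of the control flow: <code class="language-plaintext highlighter-rouge">run_agent</code>, <code class="language-plaintext highlighter-rouge">fake_decide</code>, and the tool name are hypothetical, and the “LLM” is a hard-coded stand-in so the example runs offline.</p>

```python
from typing import Callable, Dict, List, Optional, Tuple

# A decision is either ("call", tool_name) or ("answer", final_text).
Decision = Tuple[str, str]


def run_agent(
    event: str,
    decide: Callable[[List[str]], Decision],
    tools: Dict[str, Callable[[], str]],
    max_steps: int = 5,
) -> Optional[str]:
    """Run observe, reason/plan, act in a loop until the decider returns an answer."""
    observations = [event]                      # Observe: the triggering event
    for _ in range(max_steps):
        kind, value = decide(observations)      # Reason + Plan (an LLM in practice)
        if kind == "answer":
            return value
        observations.append(tools[value]())     # Act, then observe the result
    return None                                 # Step budget exhausted


def fake_decide(observations: List[str]) -> Decision:
    """Hard-coded stand-in for the LLM so the sketch runs offline."""
    if len(observations) == 1:
        return ("call", "get_logs")
    return ("answer", f"Root cause candidate: {observations[-1]}")


if __name__ == "__main__":
    tools = {"get_logs": lambda: "connection refused on port 5432"}
    print(run_agent("workflow failed with exit code 1", fake_decide, tools))
```

<p>Note the <code class="language-plaintext highlighter-rouge">max_steps</code> budget: a reasoning loop needs an explicit stopping condition, because unlike a linear pipeline it has no natural last step.</p>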

<p>This is completely different from your CI/CD runner, which executes steps linearly and stops when something fails.</p>

<h3 id="the-4-core-components">The 4 Core Components</h3>

<p>Every AI agent is built from these four pieces:</p>

<h4 id="1-the-llm-brain">1. The LLM (Brain)</h4>

<p>The Large Language Model is the decision-making engine. It takes in context (pipeline logs, error messages, git history) and decides what to do next.</p>

<p>Think of it as the “thinking” part. When your pipeline fails with a database connection error, the LLM reasons: “This could be a configuration issue, a networking problem, or resource exhaustion. I should check recent config changes first, then network logs, then resource usage.”</p>

<p>Common choices: GPT-4, Claude 3.5 Sonnet, GPT-3.5 (cheaper for simple tasks)</p>

<h4 id="2-tools-hands">2. Tools (Hands)</h4>

<p>Tools are functions the agent can call to interact with the world. For DevOps, these might be:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">get_github_logs(workflow_id)</code> - Fetch pipeline logs</li>
  <li><code class="language-plaintext highlighter-rouge">analyze_recent_commits(repo, hours)</code> - Check recent code changes</li>
  <li><code class="language-plaintext highlighter-rouge">search_similar_issues(error_message)</code> - Find related GitHub issues</li>
  <li><code class="language-plaintext highlighter-rouge">get_docker_status(container_id)</code> - Check container health</li>
  <li><code class="language-plaintext highlighter-rouge">query_prometheus(metric, timerange)</code> - Get monitoring data</li>
</ul>

<p>The LLM decides which tools to call and when. You just define what each tool does.</p>
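
<p>Under the hood, “the LLM decides which tool to call” usually boils down to a registry of named functions plus a dispatcher. Here’s a framework-free sketch of that idea. The <code class="language-plaintext highlighter-rouge">register</code> decorator and helpers are hypothetical names, and the tool body is a stub rather than a real Docker query.</p>

```python
from typing import Callable, Dict

TOOLS: Dict[str, Callable[..., str]] = {}


def register(func: Callable[..., str]) -> Callable[..., str]:
    """Add a function to the registry under its own name."""
    TOOLS[func.__name__] = func
    return func


@register
def get_docker_status(container_id: str) -> str:
    """Check container health (stubbed; a real tool would query Docker)."""
    return f"container {container_id}: status unknown (stub)"


def describe_tools() -> str:
    """Build the name + docstring listing a model would see when choosing a tool."""
    return "\n".join(f"{name}: {fn.__doc__}" for name, fn in sorted(TOOLS.items()))


def call_tool(name: str, **kwargs: str) -> str:
    """Dispatch a tool call chosen by the model, rejecting unknown names."""
    if name not in TOOLS:
        return f"error: unknown tool {name!r}"
    return TOOLS[name](**kwargs)
```

<p>The docstrings matter here: they are the only description of each tool the model gets, so write them the way you’d brief a new teammate.</p>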

<h4 id="3-memory">3. Memory</h4>

<p>Agents need two types of memory:</p>

<p><strong>Short-term memory (conversation history)</strong>: The current investigation. “I checked the logs and found a connection error. Then I checked recent commits and found a database config change.”</p>

<p><strong>Long-term memory (learned patterns)</strong>: Historical knowledge. “The last three times we saw <code class="language-plaintext highlighter-rouge">Connection refused on port 5432</code>, it was because the connection pool size was too small.”</p>

<p>For our pipeline monitor, we’ll start with short-term memory. Long-term memory requires a vector database (we’ll save that for a future post).</p>
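
<p>Short-term memory can be as simple as a bounded list of messages. The sketch below is one way to model it, assuming the common role/content message shape; the <code class="language-plaintext highlighter-rouge">ShortTermMemory</code> class and the cap of 20 messages are illustrative choices, not part of any library.</p>

```python
from collections import deque
from typing import Deque, Dict, List


class ShortTermMemory:
    """Keep the most recent investigation steps, dropping the oldest first."""

    def __init__(self, max_messages: int = 20) -> None:
        self._messages: Deque[Dict[str, str]] = deque(maxlen=max_messages)

    def add(self, role: str, content: str) -> None:
        self._messages.append({"role": role, "content": content})

    def as_messages(self) -> List[Dict[str, str]]:
        """Return history in the role/content shape chat-style LLM APIs commonly use."""
        return list(self._messages)
```

<p>Bounding the history keeps a long investigation from growing the prompt without limit, at the cost of forgetting the earliest steps.</p>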

<h4 id="4-prompts-instructions">4. Prompts (Instructions)</h4>

<p>The prompt is how you tell the agent what its job is and how to behave. A good DevOps agent prompt includes:</p>

<ul>
  <li><strong>Role definition</strong>: “You are a DevOps AI agent that investigates pipeline failures.”</li>
  <li><strong>Context</strong>: “The system runs on Kubernetes in AWS. Database is PostgreSQL. Cache is Redis.”</li>
  <li><strong>Constraints</strong>: “Never execute destructive commands. Always explain your reasoning.”</li>
  <li><strong>Output format</strong>: “Provide a root cause analysis with suggested fixes.”</li>
</ul>

<p>Prompt engineering is critical. A vague prompt like “debug the issue” will give you vague results. A specific prompt with context will give you actionable insights.</p>
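
<p>To show how the four parts fit together, here’s a small sketch that assembles them into one system prompt. The text of each part comes straight from the list above; the <code class="language-plaintext highlighter-rouge">build_system_prompt</code> helper and the section labels are my own convention.</p>

```python
def build_system_prompt(role: str, context: str, constraints: list, output_format: str) -> str:
    """Assemble the four prompt parts into one system prompt string."""
    constraint_lines = "\n".join(f"- {item}" for item in constraints)
    return (
        f"{role}\n\n"
        f"Context:\n{context}\n\n"
        f"Constraints:\n{constraint_lines}\n\n"
        f"Output format:\n{output_format}"
    )


PROMPT = build_system_prompt(
    role="You are a DevOps AI agent that investigates pipeline failures.",
    context="The system runs on Kubernetes in AWS. Database is PostgreSQL. Cache is Redis.",
    constraints=["Never execute destructive commands.", "Always explain your reasoning."],
    output_format="Provide a root cause analysis with suggested fixes.",
)
```

<p>Keeping the parts as separate arguments makes it easy to change the infrastructure context per environment without touching the role or the constraints.</p>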

<h3 id="how-it-all-works-together">How It All Works Together</h3>

<p>Here’s a concrete example of the agent loop in action:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Pipeline failure detected
    ↓
Agent observes: "Workflow #1234 failed with exit code 1"
    ↓
LLM reasons: "Exit code 1 is generic. I need more information."
    ↓
Agent plans: "Call get_github_logs() to see the actual error"
    ↓
Agent acts: Fetches logs, finds "psycopg2.OperationalError: could not connect to server"
    ↓
LLM reasons: "Database connection failure. Could be config, network, or resources."
    ↓
Agent plans: "Check recent commits for database config changes"
    ↓
Agent acts: Calls analyze_recent_commits(), finds commit changing DATABASE_URL
    ↓
LLM reasons: "Root cause identified. Recent commit broke database connection."
    ↓
Agent outputs: Detailed report with commit hash, explanation, and fix suggestion
</code></pre></div></div>

<h3 id="when-to-use-ai-agents-vs-traditional-automation">When to Use AI Agents vs. Traditional Automation</h3>

<p>Not every problem needs an AI agent. Here’s when each makes sense:</p>

<p><strong>Use traditional CI/CD automation when:</strong></p>

<ul>
  <li>The workflow is fully deterministic</li>
  <li>You know all possible failure modes</li>
  <li>Speed and cost are critical</li>
  <li>Zero tolerance for unexpected behavior</li>
</ul>

<p><strong>Use AI agents when:</strong></p>

<ul>
  <li>Failures require investigation and reasoning</li>
  <li>Context matters (recent changes, system state, historical patterns)</li>
  <li>The problem space is too large for explicit if-then rules</li>
  <li>You need adaptive behavior</li>
</ul>

<p><strong>Examples:</strong></p>

<p>Traditional automation: “If tests fail, don’t deploy” (simple rule)</p>

<p>AI agent: “Tests failed. Analyze which tests, check if they’re flaky, review recent code changes, determine if this is a real issue or infrastructure problem, suggest next steps” (complex reasoning)</p>

<h3 id="what-were-building-next">What We’re Building Next</h3>

<p>Now that you understand the components, we’re going to build a Pipeline Health Monitor Agent that uses:</p>

<ul>
  <li><strong>LLM</strong>: GPT-4 or Claude for reasoning</li>
  <li><strong>Tools</strong>: GitHub API, log analysis, issue search</li>
  <li><strong>Memory</strong>: Conversation history for multi-step investigation</li>
  <li><strong>Prompts</strong>: DevOps-specific instructions with infrastructure context</li>
</ul>

<p>In the next section, we’ll write the actual code.</p>

<h2 id="building-version-1-pipeline-health-monitor-agent">Building Version 1: Pipeline Health Monitor Agent</h2>

<p>Now we’re going to build a working AI agent that monitors your GitHub Actions workflows and investigates failures. This is production-ready code that you can deploy today.</p>

<h3 id="what-our-agent-will-do">What Our Agent Will Do</h3>

<p>When a GitHub Actions workflow fails, our agent will:</p>

<ul>
  <li>Receive a webhook notification with the workflow ID</li>
  <li>Fetch the workflow logs from GitHub</li>
  <li>Analyze recent commits to find what changed</li>
  <li>Search existing GitHub issues for similar errors</li>
  <li>Use an LLM (GPT-4, Claude, or others via OpenRouter) to reason about the root cause</li>
  <li>Generate a detailed report with recommendations</li>
</ul>

<p>Let’s build it step by step.</p>

<h3 id="installation-and-setup">Installation and Setup</h3>

<p>First, install uv if you don’t have it already:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># On macOS/Linux</span>
curl <span class="nt">-LsSf</span> https://astral.sh/uv/install.sh | sh

<span class="c"># On Windows</span>
powershell <span class="nt">-c</span> <span class="s2">"irm https://astral.sh/uv/install.ps1 | iex"</span>
</code></pre></div></div>

<p>Create a new project directory and set up a virtual environment:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir </span>pipeline-agent
<span class="nb">cd </span>pipeline-agent

<span class="c"># Create virtual environment with uv</span>
uv venv

<span class="c"># Activate the virtual environment</span>
<span class="c"># On macOS/Linux:</span>
<span class="nb">source</span> .venv/bin/activate

<span class="c"># On Windows:</span>
.venv<span class="se">\S</span>cripts<span class="se">\a</span>ctivate
</code></pre></div></div>

<p>Install the required dependencies using uv:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>uv pip <span class="nb">install </span>langchain langchain-openai requests python-dotenv
</code></pre></div></div>

<p>Set up your environment variables in a <code class="language-plaintext highlighter-rouge">.env</code> file.</p>

<p><strong>Option 1: Using OpenAI directly</strong></p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">OPENAI_API_KEY</span><span class="o">=</span>your_openai_api_key_here
<span class="nv">GITHUB_TOKEN</span><span class="o">=</span>your_github_personal_access_token
<span class="nv">GITHUB_REPO</span><span class="o">=</span>username/repository
<span class="nv">USE_OPENROUTER</span><span class="o">=</span><span class="nb">false</span>
</code></pre></div></div>

<p><strong>Option 2: Using OpenRouter (recommended for cost savings)</strong></p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">OPENROUTER_API_KEY</span><span class="o">=</span>your_openrouter_api_key_here
<span class="nv">GITHUB_TOKEN</span><span class="o">=</span>your_github_personal_access_token
<span class="nv">GITHUB_REPO</span><span class="o">=</span>username/repository
<span class="nv">USE_OPENROUTER</span><span class="o">=</span><span class="nb">true
</span><span class="nv">MODEL_NAME</span><span class="o">=</span>anthropic/claude-3.5-sonnet  <span class="c"># or openai/gpt-4, google/gemini-pro, etc.</span>
</code></pre></div></div>

<p><strong>Why OpenRouter?</strong></p>

<ul>
  <li>Access multiple LLM providers through one API</li>
  <li>Can be cheaper than going direct for some models (compare current pricing before committing)</li>
  <li>Easy to switch between models without changing code</li>
  <li>Get API key at: https://openrouter.ai/</li>
</ul>
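
<p>To illustrate the “switch without changing code” point, here’s a sketch of how the two <code class="language-plaintext highlighter-rouge">.env</code> options above could be resolved into client settings. The <code class="language-plaintext highlighter-rouge">llm_settings</code> helper and the fallback model names are assumptions for illustration; the base URL is OpenRouter’s OpenAI-compatible endpoint.</p>

```python
import os
from typing import Dict, Mapping, Optional

OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"


def llm_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Optional[str]]:
    """Pick API key, base URL and model name from the .env variables above."""
    if env.get("USE_OPENROUTER", "false").strip().lower() == "true":
        return {
            "api_key": env.get("OPENROUTER_API_KEY"),
            "base_url": OPENROUTER_BASE_URL,
            # Fallback model is an assumption; set MODEL_NAME explicitly.
            "model": env.get("MODEL_NAME", "anthropic/claude-3.5-sonnet"),
        }
    return {
        "api_key": env.get("OPENAI_API_KEY"),
        "base_url": None,  # None means "use the client's default endpoint"
        "model": env.get("MODEL_NAME", "gpt-4"),
    }
```

<p>Whatever client you use then only needs these three values, so switching providers is a change to <code class="language-plaintext highlighter-rouge">.env</code> rather than to the agent code.</p>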

<h3 id="step-1-define-the-tools">Step 1: Define the Tools</h3>

<p>Tools are functions the agent can call. Each tool is decorated with <code class="language-plaintext highlighter-rouge">@tool</code> and includes a docstring that tells the LLM what it does.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># agent_investigator.py
</span><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">requests</span>
<span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span><span class="p">,</span> <span class="n">timedelta</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Optional</span>
<span class="kn">from</span> <span class="nn">langchain.tools</span> <span class="kn">import</span> <span class="n">tool</span>
<span class="kn">from</span> <span class="nn">dotenv</span> <span class="kn">import</span> <span class="n">load_dotenv</span>

<span class="n">load_dotenv</span><span class="p">()</span>

<span class="n">GITHUB_TOKEN</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">getenv</span><span class="p">(</span><span class="s">"GITHUB_TOKEN"</span><span class="p">)</span>
<span class="n">GITHUB_REPO</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">getenv</span><span class="p">(</span><span class="s">"GITHUB_REPO"</span><span class="p">)</span>
<span class="n">HEADERS</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"Authorization"</span><span class="p">:</span> <span class="sa">f</span><span class="s">"token </span><span class="si">{</span><span class="n">GITHUB_TOKEN</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
    <span class="s">"Accept"</span><span class="p">:</span> <span class="s">"application/vnd.github.v3+json"</span>
<span class="p">}</span>

<span class="o">@</span><span class="n">tool</span>
<span class="k">def</span> <span class="nf">get_workflow_logs</span><span class="p">(</span><span class="n">workflow_run_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""
    Fetch logs from a failed GitHub Actions workflow run.

    Args:
        workflow_run_id: The GitHub Actions workflow run ID

    Returns:
        String containing the workflow logs
    """</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="c1"># Get workflow run details
</span>        <span class="n">run_url</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"https://api.github.com/repos/</span><span class="si">{</span><span class="n">GITHUB_REPO</span><span class="si">}</span><span class="s">/actions/runs/</span><span class="si">{</span><span class="n">workflow_run_id</span><span class="si">}</span><span class="s">"</span>
        <span class="n">run_response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">run_url</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="n">HEADERS</span><span class="p">)</span>
        <span class="n">run_response</span><span class="p">.</span><span class="n">raise_for_status</span><span class="p">()</span>
        <span class="n">run_data</span> <span class="o">=</span> <span class="n">run_response</span><span class="p">.</span><span class="n">json</span><span class="p">()</span>

        <span class="c1"># Get jobs for this workflow run
</span>        <span class="n">jobs_url</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">run_url</span><span class="si">}</span><span class="s">/jobs"</span>
        <span class="n">jobs_response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">jobs_url</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="n">HEADERS</span><span class="p">)</span>
        <span class="n">jobs_response</span><span class="p">.</span><span class="n">raise_for_status</span><span class="p">()</span>
        <span class="n">jobs_data</span> <span class="o">=</span> <span class="n">jobs_response</span><span class="p">.</span><span class="n">json</span><span class="p">()</span>

        <span class="c1"># Extract logs from failed jobs
</span>        <span class="n">logs</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="n">logs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"Workflow: </span><span class="si">{</span><span class="n">run_data</span><span class="p">[</span><span class="s">'name'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="n">logs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"Status: </span><span class="si">{</span><span class="n">run_data</span><span class="p">[</span><span class="s">'conclusion'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="n">logs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"Started: </span><span class="si">{</span><span class="n">run_data</span><span class="p">[</span><span class="s">'created_at'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="n">logs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"Branch: </span><span class="si">{</span><span class="n">run_data</span><span class="p">[</span><span class="s">'head_branch'</span><span class="p">]</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>

        <span class="k">for</span> <span class="n">job</span> <span class="ow">in</span> <span class="n">jobs_data</span><span class="p">[</span><span class="s">'jobs'</span><span class="p">]:</span>
            <span class="k">if</span> <span class="n">job</span><span class="p">[</span><span class="s">'conclusion'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'failure'</span><span class="p">:</span>
                <span class="n">logs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">Failed Job: </span><span class="si">{</span><span class="n">job</span><span class="p">[</span><span class="s">'name'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
                <span class="n">logs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"Conclusion: </span><span class="si">{</span><span class="n">job</span><span class="p">[</span><span class="s">'conclusion'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

                <span class="c1"># Get job logs
</span>                <span class="n">log_url</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"https://api.github.com/repos/</span><span class="si">{</span><span class="n">GITHUB_REPO</span><span class="si">}</span><span class="s">/actions/jobs/</span><span class="si">{</span><span class="n">job</span><span class="p">[</span><span class="s">'id'</span><span class="p">]</span><span class="si">}</span><span class="s">/logs"</span>
                <span class="n">log_response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">log_url</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="n">HEADERS</span><span class="p">)</span>

                <span class="k">if</span> <span class="n">log_response</span><span class="p">.</span><span class="n">status_code</span> <span class="o">==</span> <span class="mi">200</span><span class="p">:</span>
                    <span class="c1"># Extract last 50 lines (most relevant errors are at the end)
</span>                    <span class="n">log_lines</span> <span class="o">=</span> <span class="n">log_response</span><span class="p">.</span><span class="n">text</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
                    <span class="n">relevant_logs</span> <span class="o">=</span> <span class="n">log_lines</span><span class="p">[</span><span class="o">-</span><span class="mi">50</span><span class="p">:]</span>
                    <span class="n">logs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">Last 50 lines of logs:"</span><span class="p">)</span>
                    <span class="n">logs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">relevant_logs</span><span class="p">))</span>

        <span class="k">return</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">logs</span><span class="p">)</span>

    <span class="k">except</span> <span class="n">requests</span><span class="p">.</span><span class="n">exceptions</span><span class="p">.</span><span class="n">RequestException</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">return</span> <span class="sa">f</span><span class="s">"Error fetching workflow logs: </span><span class="si">{</span><span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">)</span><span class="si">}</span><span class="s">"</span>


<span class="o">@</span><span class="n">tool</span>
<span class="k">def</span> <span class="nf">analyze_recent_commits</span><span class="p">(</span><span class="n">hours</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">24</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""
    Analyze recent commits to the repository that might have caused the failure.

    Args:
        hours: Number of hours to look back (default: 24)

    Returns:
        String containing recent commits with author, message, and files changed
    """</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">since</span> <span class="o">=</span> <span class="p">(</span><span class="n">datetime</span><span class="p">.</span><span class="n">utcnow</span><span class="p">()</span> <span class="o">-</span> <span class="n">timedelta</span><span class="p">(</span><span class="n">hours</span><span class="o">=</span><span class="n">hours</span><span class="p">)).</span><span class="n">isoformat</span><span class="p">()</span> <span class="o">+</span> <span class="s">'Z'</span>
        <span class="n">commits_url</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"https://api.github.com/repos/</span><span class="si">{</span><span class="n">GITHUB_REPO</span><span class="si">}</span><span class="s">/commits"</span>
        <span class="n">params</span> <span class="o">=</span> <span class="p">{</span><span class="s">'since'</span><span class="p">:</span> <span class="n">since</span><span class="p">,</span> <span class="s">'per_page'</span><span class="p">:</span> <span class="mi">10</span><span class="p">}</span>

        <span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">commits_url</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="n">HEADERS</span><span class="p">,</span> <span class="n">params</span><span class="o">=</span><span class="n">params</span><span class="p">)</span>
        <span class="n">response</span><span class="p">.</span><span class="n">raise_for_status</span><span class="p">()</span>
        <span class="n">commits</span> <span class="o">=</span> <span class="n">response</span><span class="p">.</span><span class="n">json</span><span class="p">()</span>

        <span class="k">if</span> <span class="ow">not</span> <span class="n">commits</span><span class="p">:</span>
            <span class="k">return</span> <span class="sa">f</span><span class="s">"No commits found in the last </span><span class="si">{</span><span class="n">hours</span><span class="si">}</span><span class="s"> hours."</span>

        <span class="n">result</span> <span class="o">=</span> <span class="p">[</span><span class="sa">f</span><span class="s">"Recent commits (last </span><span class="si">{</span><span class="n">hours</span><span class="si">}</span><span class="s"> hours):</span><span class="se">\n</span><span class="s">"</span><span class="p">]</span>

        <span class="k">for</span> <span class="n">commit</span> <span class="ow">in</span> <span class="n">commits</span><span class="p">:</span>
            <span class="n">sha</span> <span class="o">=</span> <span class="n">commit</span><span class="p">[</span><span class="s">'sha'</span><span class="p">][:</span><span class="mi">7</span><span class="p">]</span>
            <span class="n">author</span> <span class="o">=</span> <span class="n">commit</span><span class="p">[</span><span class="s">'commit'</span><span class="p">][</span><span class="s">'author'</span><span class="p">][</span><span class="s">'name'</span><span class="p">]</span>
            <span class="n">message</span> <span class="o">=</span> <span class="n">commit</span><span class="p">[</span><span class="s">'commit'</span><span class="p">][</span><span class="s">'message'</span><span class="p">].</span><span class="n">split</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>  <span class="c1"># First line only
</span>            <span class="n">date</span> <span class="o">=</span> <span class="n">commit</span><span class="p">[</span><span class="s">'commit'</span><span class="p">][</span><span class="s">'author'</span><span class="p">][</span><span class="s">'date'</span><span class="p">]</span>

            <span class="c1"># Get files changed in this commit
</span>            <span class="n">commit_detail_url</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"https://api.github.com/repos/</span><span class="si">{</span><span class="n">GITHUB_REPO</span><span class="si">}</span><span class="s">/commits/</span><span class="si">{</span><span class="n">commit</span><span class="p">[</span><span class="s">'sha'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span>
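            <span class="n">commit_response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">commit_detail_url</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="n">HEADERS</span><span class="p">)</span>
            <span class="c1"># Fail loudly rather than silently reporting no changed files
</span>            <span class="n">commit_response</span><span class="p">.</span><span class="n">raise_for_status</span><span class="p">()</span>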
            <span class="n">commit_data</span> <span class="o">=</span> <span class="n">commit_response</span><span class="p">.</span><span class="n">json</span><span class="p">()</span>

            <span class="n">files_changed</span> <span class="o">=</span> <span class="p">[</span><span class="n">f</span><span class="p">[</span><span class="s">'filename'</span><span class="p">]</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">commit_data</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'files'</span><span class="p">,</span> <span class="p">[])]</span>

            <span class="n">result</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">Commit </span><span class="si">{</span><span class="n">sha</span><span class="si">}</span><span class="s"> by </span><span class="si">{</span><span class="n">author</span><span class="si">}</span><span class="s"> (</span><span class="si">{</span><span class="n">date</span><span class="si">}</span><span class="s">)"</span><span class="p">)</span>
            <span class="n">result</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"Message: </span><span class="si">{</span><span class="n">message</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            <span class="n">result</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"Files changed: </span><span class="si">{</span><span class="s">', '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">files_changed</span><span class="p">[</span><span class="si">:</span><span class="mi">5</span><span class="p">])</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>  <span class="c1"># First 5 files
</span>            <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">files_changed</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">5</span><span class="p">:</span>
                <span class="n">result</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"... and </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">files_changed</span><span class="p">)</span> <span class="o">-</span> <span class="mi">5</span><span class="si">}</span><span class="s"> more files"</span><span class="p">)</span>

        <span class="k">return</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>

    <span class="k">except</span> <span class="n">requests</span><span class="p">.</span><span class="n">exceptions</span><span class="p">.</span><span class="n">RequestException</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">return</span> <span class="sa">f</span><span class="s">"Error analyzing commits: </span><span class="si">{</span><span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">)</span><span class="si">}</span><span class="s">"</span>


<span class="o">@</span><span class="n">tool</span>
<span class="k">def</span> <span class="nf">search_similar_issues</span><span class="p">(</span><span class="n">error_keywords</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""
    Search GitHub issues for similar error messages or problems.

    Args:
        error_keywords: Keywords from the error message to search for

    Returns:
        String containing relevant GitHub issues and their solutions
    """</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="c1"># Search issues in the repository
</span>        <span class="n">search_url</span> <span class="o">=</span> <span class="s">"https://api.github.com/search/issues"</span>
        <span class="n">query</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"repo:</span><span class="si">{</span><span class="n">GITHUB_REPO</span><span class="si">}</span><span class="s"> </span><span class="si">{</span><span class="n">error_keywords</span><span class="si">}</span><span class="s"> is:issue"</span>
        <span class="n">params</span> <span class="o">=</span> <span class="p">{</span><span class="s">'q'</span><span class="p">:</span> <span class="n">query</span><span class="p">,</span> <span class="s">'sort'</span><span class="p">:</span> <span class="s">'relevance'</span><span class="p">,</span> <span class="s">'per_page'</span><span class="p">:</span> <span class="mi">5</span><span class="p">}</span>

        <span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">search_url</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="n">HEADERS</span><span class="p">,</span> <span class="n">params</span><span class="o">=</span><span class="n">params</span><span class="p">)</span>
        <span class="n">response</span><span class="p">.</span><span class="n">raise_for_status</span><span class="p">()</span>
        <span class="n">issues</span> <span class="o">=</span> <span class="n">response</span><span class="p">.</span><span class="n">json</span><span class="p">()</span>

        <span class="k">if</span> <span class="n">issues</span><span class="p">[</span><span class="s">'total_count'</span><span class="p">]</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
            <span class="k">return</span> <span class="sa">f</span><span class="s">"No similar issues found for keywords: </span><span class="si">{</span><span class="n">error_keywords</span><span class="si">}</span><span class="s">"</span>

        <span class="n">result</span> <span class="o">=</span> <span class="p">[</span><span class="sa">f</span><span class="s">"Found </span><span class="si">{</span><span class="n">issues</span><span class="p">[</span><span class="s">'total_count'</span><span class="p">]</span><span class="si">}</span><span class="s"> similar issues:</span><span class="se">\n</span><span class="s">"</span><span class="p">]</span>

        <span class="k">for</span> <span class="n">issue</span> <span class="ow">in</span> <span class="n">issues</span><span class="p">[</span><span class="s">'items'</span><span class="p">][:</span><span class="mi">5</span><span class="p">]:</span>
            <span class="n">result</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">#</span><span class="si">{</span><span class="n">issue</span><span class="p">[</span><span class="s">'number'</span><span class="p">]</span><span class="si">}</span><span class="s">: </span><span class="si">{</span><span class="n">issue</span><span class="p">[</span><span class="s">'title'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            <span class="n">result</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"State: </span><span class="si">{</span><span class="n">issue</span><span class="p">[</span><span class="s">'state'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            <span class="n">result</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"URL: </span><span class="si">{</span><span class="n">issue</span><span class="p">[</span><span class="s">'html_url'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

            <span class="c1"># Get first comment if issue is closed (might contain solution)
</span>            <span class="k">if</span> <span class="n">issue</span><span class="p">[</span><span class="s">'state'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'closed'</span> <span class="ow">and</span> <span class="n">issue</span><span class="p">[</span><span class="s">'comments'</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
                <span class="n">comments_url</span> <span class="o">=</span> <span class="n">issue</span><span class="p">[</span><span class="s">'comments_url'</span><span class="p">]</span>
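                <span class="n">comments_response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">comments_url</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="n">HEADERS</span><span class="p">)</span>
                <span class="c1"># Surface HTTP errors here; an error body is a dict, not a list of comments
</span>                <span class="n">comments_response</span><span class="p">.</span><span class="n">raise_for_status</span><span class="p">()</span>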
                <span class="n">comments</span> <span class="o">=</span> <span class="n">comments_response</span><span class="p">.</span><span class="n">json</span><span class="p">()</span>
                <span class="k">if</span> <span class="n">comments</span><span class="p">:</span>
                    <span class="n">first_comment</span> <span class="o">=</span> <span class="n">comments</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="s">'body'</span><span class="p">][:</span><span class="mi">200</span><span class="p">]</span>  <span class="c1"># First 200 chars
</span>                    <span class="n">result</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"Solution hint: </span><span class="si">{</span><span class="n">first_comment</span><span class="si">}</span><span class="s">..."</span><span class="p">)</span>

        <span class="k">return</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>

    <span class="k">except</span> <span class="n">requests</span><span class="p">.</span><span class="n">exceptions</span><span class="p">.</span><span class="n">RequestException</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">return</span> <span class="sa">f</span><span class="s">"Error searching issues: </span><span class="si">{</span><span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">)</span><span class="si">}</span><span class="s">"</span>
</code></pre></div></div>
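
<p>One detail in <code class="language-plaintext highlighter-rouge">analyze_recent_commits</code> is worth flagging: <code class="language-plaintext highlighter-rouge">datetime.utcnow()</code> is deprecated as of Python 3.12. Below is a sketch of a timezone-aware way to build the same <code class="language-plaintext highlighter-rouge">since</code> value. The helper name <code class="language-plaintext highlighter-rouge">since_timestamp</code> and its <code class="language-plaintext highlighter-rouge">now</code> parameter are assumptions added for illustration and testability, not part of the original script.</p>

```python
from datetime import datetime, timedelta, timezone


def since_timestamp(hours, now=None):
    """Return an ISO 8601 UTC timestamp `hours` before `now`, ending in 'Z'.

    `now` should be a timezone-aware datetime; it defaults to the current
    UTC time and exists mainly so the function can be tested deterministically.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    cutoff = now.astimezone(timezone.utc) - timedelta(hours=hours)
    # Drop microseconds and swap the '+00:00' offset for the 'Z' suffix.
    return cutoff.replace(microsecond=0).isoformat().replace("+00:00", "Z")
```

<p>With this in place, the <code class="language-plaintext highlighter-rouge">since</code> line in the tool could become <code class="language-plaintext highlighter-rouge">since = since_timestamp(hours)</code>, provided <code class="language-plaintext highlighter-rouge">timezone</code> is added to the <code class="language-plaintext highlighter-rouge">datetime</code> import.</p>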

<h3 id="step-2-create-the-agent-with-llm-provider-support">Step 2: Create the Agent with LLM Provider Support</h3>

<p>Now we’ll create the agent with support for both OpenAI and OpenRouter:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">langchain.agents</span> <span class="kn">import</span> <span class="n">create_openai_tools_agent</span><span class="p">,</span> <span class="n">AgentExecutor</span>
<span class="kn">from</span> <span class="nn">langchain_openai</span> <span class="kn">import</span> <span class="n">ChatOpenAI</span>
<span class="kn">from</span> <span class="nn">langchain_core.prompts</span> <span class="kn">import</span> <span class="n">ChatPromptTemplate</span><span class="p">,</span> <span class="n">MessagesPlaceholder</span>

<span class="k">def</span> <span class="nf">get_llm</span><span class="p">():</span>
    <span class="s">"""
    Initialize the LLM based on environment configuration.
    Supports both OpenAI directly and OpenRouter.
    """</span>
    <span class="n">use_openrouter</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">getenv</span><span class="p">(</span><span class="s">"USE_OPENROUTER"</span><span class="p">,</span> <span class="s">"false"</span><span class="p">).</span><span class="n">lower</span><span class="p">()</span> <span class="o">==</span> <span class="s">"true"</span>

    <span class="k">if</span> <span class="n">use_openrouter</span><span class="p">:</span>
        <span class="c1"># Using OpenRouter for access to multiple models
</span>        <span class="n">api_key</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">getenv</span><span class="p">(</span><span class="s">"OPENROUTER_API_KEY"</span><span class="p">)</span>
        <span class="n">model_name</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">getenv</span><span class="p">(</span><span class="s">"MODEL_NAME"</span><span class="p">,</span> <span class="s">"anthropic/claude-3.5-sonnet"</span><span class="p">)</span>

        <span class="n">llm</span> <span class="o">=</span> <span class="n">ChatOpenAI</span><span class="p">(</span>
            <span class="n">model</span><span class="o">=</span><span class="n">model_name</span><span class="p">,</span>
            <span class="n">openai_api_key</span><span class="o">=</span><span class="n">api_key</span><span class="p">,</span>
            <span class="n">openai_api_base</span><span class="o">=</span><span class="s">"https://openrouter.ai/api/v1"</span><span class="p">,</span>
            <span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
            <span class="n">default_headers</span><span class="o">=</span><span class="p">{</span>
                <span class="s">"HTTP-Referer"</span><span class="p">:</span> <span class="s">"https://github.com/your-username/pipeline-agent"</span><span class="p">,</span>
                <span class="s">"X-Title"</span><span class="p">:</span> <span class="s">"Pipeline Health Monitor Agent"</span>
            <span class="p">}</span>
        <span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Using OpenRouter with model: </span><span class="si">{</span><span class="n">model_name</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="c1"># Using OpenAI directly
</span>        <span class="n">api_key</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">getenv</span><span class="p">(</span><span class="s">"OPENAI_API_KEY"</span><span class="p">)</span>
        <span class="n">llm</span> <span class="o">=</span> <span class="n">ChatOpenAI</span><span class="p">(</span>
            <span class="n">model</span><span class="o">=</span><span class="s">"gpt-4"</span><span class="p">,</span>
            <span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
            <span class="n">openai_api_key</span><span class="o">=</span><span class="n">api_key</span>
        <span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"Using OpenAI GPT-4"</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">llm</span>

<span class="c1"># Initialize the LLM
</span><span class="n">llm</span> <span class="o">=</span> <span class="n">get_llm</span><span class="p">()</span>

<span class="c1"># Define the system prompt
</span><span class="n">system_prompt</span> <span class="o">=</span> <span class="s">"""You are an expert DevOps AI agent that investigates CI/CD pipeline failures.

Your role is to:
1. Analyze workflow logs to identify the root cause of failures
2. Examine recent code changes that might have introduced issues
3. Search for similar problems in the issue tracker
4. Provide a clear, actionable root cause analysis

When analyzing failures:
- Focus on the actual error messages, not just symptoms
- Consider recent code changes as potential causes
- Look for patterns in similar past issues
- Be specific about what broke and why
- Suggest concrete fixes, not vague advice

Your investigation should be thorough but concise. Developers need actionable insights, not lengthy explanations.

Output format:
**Root Cause**: [One sentence summary]
**Evidence**: [Key findings from logs/commits/issues]
**Recommendation**: [Specific steps to fix]
**Related Issues**: [Links to similar problems if found]
"""</span>

<span class="c1"># Create the prompt template
</span><span class="n">prompt</span> <span class="o">=</span> <span class="n">ChatPromptTemplate</span><span class="p">.</span><span class="n">from_messages</span><span class="p">([</span>
    <span class="p">(</span><span class="s">"system"</span><span class="p">,</span> <span class="n">system_prompt</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"human"</span><span class="p">,</span> <span class="s">"{input}"</span><span class="p">),</span>
    <span class="n">MessagesPlaceholder</span><span class="p">(</span><span class="n">variable_name</span><span class="o">=</span><span class="s">"agent_scratchpad"</span><span class="p">),</span>
<span class="p">])</span>

<span class="c1"># Create the agent
</span><span class="n">tools</span> <span class="o">=</span> <span class="p">[</span><span class="n">get_workflow_logs</span><span class="p">,</span> <span class="n">analyze_recent_commits</span><span class="p">,</span> <span class="n">search_similar_issues</span><span class="p">]</span>
<span class="n">agent</span> <span class="o">=</span> <span class="n">create_openai_tools_agent</span><span class="p">(</span><span class="n">llm</span><span class="p">,</span> <span class="n">tools</span><span class="p">,</span> <span class="n">prompt</span><span class="p">)</span>

<span class="c1"># Create the agent executor
</span><span class="n">agent_executor</span> <span class="o">=</span> <span class="n">AgentExecutor</span><span class="p">(</span>
    <span class="n">agent</span><span class="o">=</span><span class="n">agent</span><span class="p">,</span>
    <span class="n">tools</span><span class="o">=</span><span class="n">tools</span><span class="p">,</span>
    <span class="n">verbose</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">max_iterations</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span>
    <span class="n">handle_parsing_errors</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
</code></pre></div></div>

<h3 id="step-3-run-the-investigation">Step 3: Run the Investigation</h3>

<p>Finally, we create a function to trigger the investigation:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">investigate_failure</span><span class="p">(</span><span class="n">workflow_run_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="s">"""
    Investigate a failed GitHub Actions workflow.

    Args:
        workflow_run_id: The GitHub Actions workflow run ID

    Returns:
        Dict containing the investigation result
    """</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">Starting investigation for workflow run </span><span class="si">{</span><span class="n">workflow_run_id</span><span class="si">}</span><span class="s">..."</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"="</span> <span class="o">*</span> <span class="mi">60</span><span class="p">)</span>

    <span class="n">input_text</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"""A GitHub Actions workflow has failed (run ID: </span><span class="si">{</span><span class="n">workflow_run_id</span><span class="si">}</span><span class="s">).

Please investigate this failure by:
1. Fetching and analyzing the workflow logs
2. Checking recent commits for changes that might have caused this
3. Searching for similar issues that might provide insights

Provide a comprehensive root cause analysis with specific recommendations."""</span>

    <span class="k">try</span><span class="p">:</span>
        <span class="n">result</span> <span class="o">=</span> <span class="n">agent_executor</span><span class="p">.</span><span class="n">invoke</span><span class="p">({</span><span class="s">"input"</span><span class="p">:</span> <span class="n">input_text</span><span class="p">})</span>

        <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span> <span class="o">+</span> <span class="s">"="</span> <span class="o">*</span> <span class="mi">60</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"INVESTIGATION COMPLETE"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"="</span> <span class="o">*</span> <span class="mi">60</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="n">result</span><span class="p">[</span><span class="s">'output'</span><span class="p">])</span>

        <span class="k">return</span> <span class="p">{</span>
            <span class="s">"success"</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
            <span class="s">"workflow_run_id"</span><span class="p">:</span> <span class="n">workflow_run_id</span><span class="p">,</span>
            <span class="s">"analysis"</span><span class="p">:</span> <span class="n">result</span><span class="p">[</span><span class="s">'output'</span><span class="p">]</span>
        <span class="p">}</span>

    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">Error during investigation: </span><span class="si">{</span><span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">return</span> <span class="p">{</span>
            <span class="s">"success"</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span>
            <span class="s">"workflow_run_id"</span><span class="p">:</span> <span class="n">workflow_run_id</span><span class="p">,</span>
            <span class="s">"error"</span><span class="p">:</span> <span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">)</span>
        <span class="p">}</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="kn">import</span> <span class="nn">sys</span>

    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"Usage: python agent_investigator.py &lt;workflow_run_id&gt;"</span><span class="p">)</span>
        <span class="n">sys</span><span class="p">.</span><span class="nb">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>

    <span class="n">workflow_run_id</span> <span class="o">=</span> <span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">investigate_failure</span><span class="p">(</span><span class="n">workflow_run_id</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="model-recommendations-via-openrouter">Model Recommendations via OpenRouter</h3>

<p>Here are some good model choices for DevOps investigations:</p>

<p><strong>For best reasoning (higher cost):</strong></p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">anthropic/claude-3.5-sonnet</code> - Excellent at technical analysis</li>
  <li><code class="language-plaintext highlighter-rouge">openai/gpt-4-turbo</code> - Strong general reasoning</li>
  <li><code class="language-plaintext highlighter-rouge">google/gemini-pro-1.5</code> - Good for long context (helpful with large logs)</li>
</ul>

<p><strong>For cost efficiency (lower cost):</strong></p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">anthropic/claude-3-haiku</code> - Fast and cheap, good for simple failures</li>
  <li><code class="language-plaintext highlighter-rouge">openai/gpt-3.5-turbo</code> - Decent reasoning, very affordable</li>
  <li><code class="language-plaintext highlighter-rouge">meta-llama/llama-3.1-70b-instruct</code> - Open source, cost-effective</li>
</ul>

<p><strong>Cost comparison per investigation:</strong></p>

<ul>
  <li>GPT-4: ~$0.15-0.30</li>
  <li>Claude 3.5 Sonnet: ~$0.10-0.20</li>
  <li>GPT-3.5: ~$0.02-0.05</li>
  <li>Llama 3.1 70B: ~$0.01-0.03</li>
</ul>

<h3 id="how-it-works">How It Works</h3>

<p>Let’s walk through what happens when you run this:</p>

<ol>
  <li>
    <p><strong>You trigger the agent</strong>: <code class="language-plaintext highlighter-rouge">python agent_investigator.py 12345678</code></p>
  </li>
  <li>
    <p><strong>Agent receives the task</strong>: “Investigate workflow run 12345678”</p>
  </li>
  <li>
    <p><strong>LLM decides first action</strong>: “I should fetch the workflow logs to see what failed”</p>
  </li>
  <li>
    <p><strong>Agent calls</strong> <code class="language-plaintext highlighter-rouge">get_workflow_logs()</code>: Returns the last 50 lines of failed job logs</p>
  </li>
  <li>
    <p><strong>LLM analyzes logs</strong>: “I see a database connection error. Let me check recent commits for database config changes”</p>
  </li>
  <li>
    <p><strong>Agent calls</strong> <code class="language-plaintext highlighter-rouge">analyze_recent_commits()</code>: Returns commits from the last 24 hours</p>
  </li>
  <li>
    <p><strong>LLM finds suspicious commit</strong>: “Commit abc123 changed database.yml. Let me search for similar issues”</p>
  </li>
  <li>
    <p><strong>Agent calls</strong> <code class="language-plaintext highlighter-rouge">search_similar_issues()</code>: Finds issue #42 about database connection problems</p>
  </li>
  <li>
    <p><strong>LLM synthesizes findings</strong>: Produces a final report with root cause and fix</p>
  </li>
</ol>

<p>The entire process takes 10-30 seconds depending on the complexity.</p>
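<p>The loop above can be sketched in plain Python. This is not how LangChain's <code class="language-plaintext highlighter-rouge">AgentExecutor</code> is implemented. It is a minimal illustration in which a scripted stand-in (<code class="language-plaintext highlighter-rouge">fake_llm</code>, hypothetical) plays the LLM and the three tools are stubs that return canned strings.</p>

```python
from typing import Callable, Dict, List, Tuple

# Stub tools: the real ones call the GitHub API; these return canned text.
TOOLS: Dict[str, Callable[[str], str]] = {
    "get_workflow_logs": lambda run_id: "OperationalError: could not connect to database",
    "analyze_recent_commits": lambda run_id: "abc123 changed database.yml",
    "search_similar_issues": lambda run_id: "#42 database connection problems",
}


def fake_llm(observations: List[str]) -> Tuple[str, str]:
    """Hypothetical stand-in for the LLM: choose the next action from what has been seen.

    Returns ("call", tool_name) or ("finish", report).
    """
    order = ["get_workflow_logs", "analyze_recent_commits", "search_similar_issues"]
    if len(observations) < len(order):
        return ("call", order[len(observations)])
    return ("finish", "Root cause candidates: " + " | ".join(observations))


def run_agent(run_id: str, max_iterations: int = 5) -> str:
    """Alternate between deciding and acting until the model finishes or the budget runs out."""
    observations: List[str] = []
    for _ in range(max_iterations):
        kind, value = fake_llm(observations)
        if kind == "finish":
            return value
        # Run the chosen tool and feed its output back into the next decision.
        observations.append(TOOLS[value](run_id))
    return "Stopped: iteration limit reached"


if __name__ == "__main__":
    print(run_agent("12345678"))
```

<p>Three tool calls plus the final answer take four passes through the loop, so a cap of five leaves one spare iteration before the agent is cut off.</p>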

<h3 id="example-output">Example Output</h3>

<p>Here’s what the agent produces for a real failure:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Root Cause: Database connection pool exhaustion caused by recent increase in concurrent workers without adjusting max_connections setting.

Evidence:
- Workflow logs show "psycopg2.OperationalError: FATAL: sorry, too many clients already"
- Commit d4e5f6a (2 hours ago) changed worker count from 4 to 16 in deploy.yml
- Issue #127 documented same error when worker count was increased last month

Recommendation:
1. Increase PostgreSQL max_connections from 100 to 200 in database config
2. Or reduce worker count back to 8 as a temporary fix
3. Add connection pooling with PgBouncer for better resource management

Related Issues:
- #127: Database connection errors after scaling workers
- #89: PostgreSQL connection pool configuration guide
</code></pre></div></div>

<p>This is exactly what you need: the root cause, evidence, and actionable fixes.</p>

<h3 id="key-design-decisions">Key Design Decisions</h3>

<p><strong>Why max_iterations=5?</strong> Prevents infinite loops. Most investigations complete in 3-4 iterations.</p>

<p><strong>Why last 50 lines of logs?</strong> Error messages are typically at the end. Sending full logs wastes tokens and costs money.</p>
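<p>A minimal sketch of that truncation, assuming the raw log is already available as a single string. The helper name <code class="language-plaintext highlighter-rouge">tail_lines</code> is illustrative and not part of the agent code.</p>

```python
def tail_lines(log_text: str, n: int = 50) -> str:
    """Return the last n lines of a log, where the error usually sits."""
    if n <= 0:
        return ""
    lines = log_text.splitlines()
    return "\n".join(lines[-n:])


if __name__ == "__main__":
    sample = "\n".join(f"line {i}" for i in range(1, 201))
    print(tail_lines(sample, n=3))
```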

<p><strong>Why temperature=0?</strong> We want repeatable, factual analysis. A temperature of 0 makes the output as close to deterministic as the API allows; higher values add variation, which we don’t need for debugging.</p>

<p><strong>Why support OpenRouter?</strong> Gives you flexibility to switch models based on cost and performance. Claude 3.5 Sonnet often performs better than GPT-4 for technical debugging at a lower price.</p>

<p>In the next section, we’ll integrate this agent with GitHub Actions so it runs automatically when workflows fail.</p>

<h2 id="github-actions-integration">GitHub Actions Integration</h2>

<p>Now that we have a working agent, let’s integrate it with GitHub Actions so it automatically investigates failures. We’ll use GitHub’s workflow events to trigger our agent whenever a pipeline fails.</p>

<h3 id="architecture-overview">Architecture Overview</h3>

<p>Here’s how the integration works:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GitHub Actions Workflow Fails
    ↓
GitHub triggers workflow_run event
    ↓
Our "Investigate Failure" workflow runs
    ↓
Calls agent_investigator.py with workflow ID
    ↓
Agent investigates and generates report
    ↓
Posts results to GitHub issue or Slack
</code></pre></div></div>

<h3 id="step-1-set-up-github-secrets">Step 1: Set Up GitHub Secrets</h3>

<p>First, add your API keys to GitHub repository secrets:</p>

<ol>
  <li>Go to your repository on GitHub</li>
  <li>Click <strong>Settings &gt; Secrets and variables &gt; Actions</strong></li>
  <li>Click <strong>New repository secret</strong></li>
  <li>Add these secrets:</li>
</ol>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>OPENAI_API_KEY (or OPENROUTER_API_KEY)
GITHUB_TOKEN (automatically provided by GitHub Actions; do not add it manually)
SLACK_WEBHOOK_URL (optional, for notifications)
</code></pre></div></div>

<p>For OpenRouter users, also add:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>USE_OPENROUTER=true
MODEL_NAME=anthropic/claude-3.5-sonnet
</code></pre></div></div>

<h3 id="step-2-create-the-investigation-workflow">Step 2: Create the Investigation Workflow</h3>

<p>Create a new file <code class="language-plaintext highlighter-rouge">.github/workflows/investigate-failures.yml</code>:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">name</span><span class="pi">:</span> <span class="s">AI Agent - Investigate Failures</span>

<span class="na">on</span><span class="pi">:</span>
  <span class="na">workflow_run</span><span class="pi">:</span>
    <span class="na">workflows</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">*"</span><span class="pi">]</span>  <span class="c1"># Monitor all workflows</span>
    <span class="na">types</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">completed</span>

<span class="na">jobs</span><span class="pi">:</span>
  <span class="na">investigate</span><span class="pi">:</span>
    <span class="c1"># Only run if the workflow failed</span>
    <span class="na">if</span><span class="pi">:</span> <span class="s">${{ github.event.workflow_run.conclusion == 'failure' }}</span>
    <span class="na">runs-on</span><span class="pi">:</span> <span class="s">ubuntu-latest</span>

    <span class="na">steps</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Checkout code</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v4</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Set up Python</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/setup-python@v5</span>
        <span class="na">with</span><span class="pi">:</span>
          <span class="na">python-version</span><span class="pi">:</span> <span class="s1">'</span><span class="s">3.11'</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Install uv</span>
        <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
          <span class="s">curl -LsSf https://astral.sh/uv/install.sh | sh</span>
          <span class="s">echo "$HOME/.local/bin" &gt;&gt; $GITHUB_PATH</span>
          <span class="s">echo "$HOME/.cargo/bin" &gt;&gt; $GITHUB_PATH</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Create virtual environment and install dependencies</span>
        <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
          <span class="s">uv venv</span>
          <span class="s">source .venv/bin/activate</span>
          <span class="s">uv pip install langchain langchain-openai requests python-dotenv</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Run AI investigation</span>
        <span class="na">env</span><span class="pi">:</span>
          <span class="na">GITHUB_TOKEN</span><span class="pi">:</span> <span class="s">${{ secrets.GITHUB_TOKEN }}</span>
          <span class="na">GITHUB_REPO</span><span class="pi">:</span> <span class="s">${{ github.repository }}</span>
          <span class="na">OPENAI_API_KEY</span><span class="pi">:</span> <span class="s">${{ secrets.OPENAI_API_KEY }}</span>
          <span class="na">OPENROUTER_API_KEY</span><span class="pi">:</span> <span class="s">${{ secrets.OPENROUTER_API_KEY }}</span>
          <span class="na">USE_OPENROUTER</span><span class="pi">:</span> <span class="s">${{ secrets.USE_OPENROUTER }}</span>
          <span class="na">MODEL_NAME</span><span class="pi">:</span> <span class="s">${{ secrets.MODEL_NAME }}</span>
        <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
          <span class="s">source .venv/bin/activate</span>
          <span class="s">python agent_investigator.py ${{ github.event.workflow_run.id }}</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Post results to GitHub issue</span>
        <span class="na">if</span><span class="pi">:</span> <span class="s">always()</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/github-script@v7</span>
        <span class="na">with</span><span class="pi">:</span>
          <span class="na">script</span><span class="pi">:</span> <span class="pi">|</span>
            <span class="s">const fs = require('fs');</span>

            <span class="s">// Read the investigation results</span>
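            <span class="s">const workflowName = '${{ github.event.workflow_run.name }}';</span>
            <span class="s">const workflowUrl = '${{ github.event.workflow_run.html_url }}';</span>
            <span class="s">const runId = '${{ github.event.workflow_run.id }}';</span>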

            <span class="s">// Create or update issue with findings</span>
            <span class="s">const title = `Pipeline Failure: ${workflowName}`;</span>
            <span class="s">const body = `## Automated Investigation Report</span>

            <span class="s">**Workflow**: [${workflowName}](${workflowUrl})</span>
            <span class="s">**Run ID**: ${runId}</span>
            <span class="s">**Branch**: ${{ github.event.workflow_run.head_branch }}</span>
            <span class="s">**Commit**: ${{ github.event.workflow_run.head_sha }}</span>

            <span class="s">### Investigation Results</span>

            <span class="s">The AI agent has completed its investigation. Check the workflow logs for detailed analysis.</span>

            <span class="s">**Next Steps**:</span>
            <span class="s">1. Review the root cause analysis in the workflow logs</span>
            <span class="s">2. Check the recommended fixes</span>
            <span class="s">3. Review related issues if any were found</span>
            <span class="s">4. Apply the fix and re-run the workflow</span>

            <span class="s">---</span>
            <span class="s">*This issue was automatically created by the Pipeline Health Monitor AI Agent*</span>
            <span class="s">`;</span>

            <span class="s">// Search for existing open issue</span>
            <span class="s">const issues = await github.rest.issues.listForRepo({</span>
              <span class="s">owner</span><span class="err">:</span> <span class="s">context.repo.owner,</span>
              <span class="s">repo</span><span class="err">:</span> <span class="s">context.repo.repo,</span>
              <span class="s">state</span><span class="err">:</span> <span class="s1">'</span><span class="s">open'</span><span class="err">,</span>
              <span class="na">labels</span><span class="pi">:</span> <span class="pi">[</span><span class="s1">'</span><span class="s">pipeline-failure'</span><span class="pi">,</span> <span class="s1">'</span><span class="s">ai-investigated'</span><span class="pi">]</span>
<span class="err">            }</span><span class="s">);</span>

            <span class="s">const existingIssue = issues.data.find(issue =&gt;</span>
              <span class="s">issue.title.includes(workflowName)</span>
            <span class="s">);</span>

            <span class="s">if (existingIssue) {</span>
              <span class="s">// Update existing issue</span>
              <span class="s">await github.rest.issues.createComment({</span>
                <span class="s">owner</span><span class="err">:</span> <span class="s">context.repo.owner,</span>
                <span class="s">repo</span><span class="err">:</span> <span class="s">context.repo.repo,</span>
                <span class="s">issue_number</span><span class="err">:</span> <span class="s">existingIssue.number,</span>
                <span class="s">body</span><span class="err">:</span> <span class="err">`</span><span class="c1">## New Failure Detected\n\n${body}`</span>
              <span class="err">}</span><span class="s">);</span>
            <span class="s">} else {</span>
              <span class="s">// Create new issue</span>
              <span class="s">await github.rest.issues.create({</span>
                <span class="s">owner</span><span class="err">:</span> <span class="s">context.repo.owner,</span>
                <span class="s">repo</span><span class="err">:</span> <span class="s">context.repo.repo,</span>
                <span class="s">title</span><span class="err">:</span> <span class="s">title,</span>
                <span class="s">body</span><span class="err">:</span> <span class="s">body,</span>
                <span class="s">labels</span><span class="err">:</span> <span class="pi">[</span><span class="s1">'</span><span class="s">pipeline-failure'</span><span class="pi">,</span> <span class="s1">'</span><span class="s">ai-investigated'</span><span class="pi">]</span>
              <span class="err">}</span><span class="s">);</span>
            <span class="s">}</span>
</code></pre></div></div>
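<p>The <code class="language-plaintext highlighter-rouge">github-script</code> step loads <code class="language-plaintext highlighter-rouge">fs</code>, but the issue body only points readers to the workflow logs. One way to get the analysis itself into the issue, sketched here under assumptions, is to have the Python side write the dict returned by <code class="language-plaintext highlighter-rouge">investigate_failure()</code> to a file that a later step can read. The helper <code class="language-plaintext highlighter-rouge">save_result</code> and the file name <code class="language-plaintext highlighter-rouge">investigation_result.json</code> are hypothetical.</p>

```python
import json
import tempfile
from pathlib import Path


def save_result(result: dict, path: str = "investigation_result.json") -> Path:
    """Write the investigation result as JSON so a later workflow step can read it."""
    out = Path(path)
    out.write_text(json.dumps(result, indent=2), encoding="utf-8")
    return out


if __name__ == "__main__":
    # Same shape as the dict returned by investigate_failure()
    example = {
        "success": True,
        "workflow_run_id": "12345678",
        "analysis": "**Root Cause**: example text",
    }
    # Write into a throwaway directory for the demo.
    with tempfile.TemporaryDirectory() as tmp:
        written = save_result(example, str(Path(tmp) / "investigation_result.json"))
        print(written.read_text(encoding="utf-8"))
```

<p>In the script step, <code class="language-plaintext highlighter-rouge">JSON.parse(fs.readFileSync('investigation_result.json', 'utf8'))</code> would then give you the <code class="language-plaintext highlighter-rouge">analysis</code> field to interpolate into the issue body. Run it through secrets redaction first if the repository is public.</p>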

<h3 id="how-it-works-in-production">How It Works in Production</h3>

<p>Once deployed, here’s what happens automatically:</p>

<ol>
  <li>Developer pushes code that breaks a test</li>
  <li>CI pipeline fails (tests, build, deployment, etc.)</li>
  <li>GitHub triggers the <code class="language-plaintext highlighter-rouge">workflow_run</code> event</li>
  <li>Investigation workflow starts within seconds</li>
  <li>Agent fetches logs, analyzes commits, searches issues</li>
  <li>LLM reasons about the root cause</li>
  <li>Results posted to a GitHub issue (and to Slack, if you configure the webhook)</li>
  <li>Developer sees detailed analysis with fix recommendations</li>
</ol>

<p>All of this happens within 30-60 seconds of the failure.</p>

<h3 id="cost-considerations">Cost Considerations</h3>

<p>Each investigation costs approximately:</p>

<ul>
  <li>GPT-4: $0.15-0.30 per investigation</li>
  <li>Claude 3.5 Sonnet (via OpenRouter): $0.10-0.20</li>
  <li>GPT-3.5: $0.02-0.05</li>
</ul>

<p>For a team with:</p>

<ul>
  <li>20 pipeline failures per day</li>
  <li>Using Claude 3.5 Sonnet ($0.15 average)</li>
</ul>

<p>Monthly cost: 20 × $0.15 × 30 = $90</p>

<p>Compare this to:</p>

<ul>
  <li>Developer time investigating failures: 30 min × 20 failures = 10 hours/day</li>
  <li>At $100/hour = $1,000/day saved</li>
</ul>

<p>The ROI is clear.</p>
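<p>Here is the same arithmetic as a small script, so you can plug in your own numbers. The inputs are the illustrative figures from this section, not measurements, and the function names are made up for this sketch.</p>

```python
def monthly_agent_cost(failures_per_day: int, cost_per_investigation: float, days: int = 30) -> float:
    """LLM spend for one month of automated investigations."""
    return failures_per_day * cost_per_investigation * days


def daily_manual_cost(failures_per_day: int, minutes_per_failure: float, hourly_rate: float) -> float:
    """Cost per day of developers investigating the same failures by hand."""
    hours = failures_per_day * minutes_per_failure / 60
    return hours * hourly_rate


if __name__ == "__main__":
    agent = monthly_agent_cost(failures_per_day=20, cost_per_investigation=0.15)
    manual = daily_manual_cost(failures_per_day=20, minutes_per_failure=30, hourly_rate=100)
    print(f"Agent, per month: ${agent:,.2f}")
    print(f"Manual, per day:  ${manual:,.2f}")
```

<p>The comparison assumes the agent replaces the full 30 minutes of manual investigation, which is the optimistic end. Lower <code class="language-plaintext highlighter-rouge">minutes_per_failure</code> to see how the margin changes if the report only shortens the investigation.</p>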

<h2 id="security-validation-the-48-vulnerability-problem">Security Validation: The 48% Vulnerability Problem</h2>

<p>Here’s the uncomfortable truth: some studies report that 48% of AI-generated code contains vulnerabilities, and in others 60% of AI suggestions for financial services contained high-severity security flaws.</p>

<p>As DevOps consultants, we can’t afford to blindly trust AI-generated recommendations. Our agent has read access to logs, commits, and issues, but what if we extend it to execute fixes automatically? We need layers of security validation.</p>

<h3 id="the-real-security-risks">The Real Security Risks</h3>

<p>Before we dive into solutions, let’s understand what can go wrong:</p>

<p><strong>Prompt Injection Attacks</strong>: Google’s security team demonstrated a real exploit where hidden HTML comments in a dependency’s README convinced a build agent that a malicious package was legitimate. The agent shipped the malicious code to production.</p>

<p><strong>Hallucinated Commands</strong>: An LLM might confidently suggest running <code class="language-plaintext highlighter-rouge">kubectl delete deployment production</code> when it meant to suggest <code class="language-plaintext highlighter-rouge">kubectl delete pod production-5f6h8</code>.</p>

<p><strong>Information Leakage</strong>: Agents with access to logs might inadvertently expose secrets, API keys, or sensitive data when posting to public channels.</p>

<p><strong>Shadow AI</strong>: Developers creating custom agents without proper governance, leading to unauthorized automation running in your pipelines.</p>

<p>Let’s build defenses against all of these.</p>

<h3 id="layer-1-restrict-agent-permissions">Layer 1: Restrict Agent Permissions</h3>

<p>The principle of least privilege applies to AI agents just like any other system component.</p>

<p>Our current agent has read-only tools and nothing else:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Current tools - all read-only
</span><span class="n">tools</span> <span class="o">=</span> <span class="p">[</span>
    <span class="n">get_workflow_logs</span><span class="p">,</span>       <span class="c1"># Read GitHub logs
</span>    <span class="n">analyze_recent_commits</span><span class="p">,</span>  <span class="c1"># Read git history
</span>    <span class="n">search_similar_issues</span>    <span class="c1"># Read GitHub issues
</span><span class="p">]</span>
</code></pre></div></div>

<p>This is intentional. Investigation does not require execution.</p>
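<p>To keep it that way as the tool list grows, you could check every tool against an explicit allowlist before building the agent. This is a sketch under assumptions: <code class="language-plaintext highlighter-rouge">require_read_only</code> and <code class="language-plaintext highlighter-rouge">restart_deployment</code> are hypothetical names, and the stub functions stand in for the real tools. The check reads a <code class="language-plaintext highlighter-rouge">name</code> attribute when the tool object has one and falls back to <code class="language-plaintext highlighter-rouge">__name__</code> for plain functions.</p>

```python
from typing import Callable, Iterable, List

# Names of tools known to be read-only. Anything else is rejected.
READ_ONLY_TOOLS = frozenset(
    {"get_workflow_logs", "analyze_recent_commits", "search_similar_issues"}
)


def require_read_only(tools: Iterable[Callable]) -> List[Callable]:
    """Return the tools unchanged, or raise if any tool is not on the allowlist."""
    checked = list(tools)
    for tool in checked:
        name = getattr(tool, "name", None) or getattr(tool, "__name__", repr(tool))
        if name not in READ_ONLY_TOOLS:
            raise PermissionError(f"Tool {name!r} is not on the read-only allowlist")
    return checked


# Stubs for illustration only.
def get_workflow_logs(run_id: str) -> str:
    return "logs"


def restart_deployment(name: str) -> str:  # hypothetical write action
    return "restarted"


if __name__ == "__main__":
    print(len(require_read_only([get_workflow_logs])))
    try:
        require_read_only([get_workflow_logs, restart_deployment])
    except PermissionError as exc:
        print(exc)
```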

<h3 id="layer-2-secrets-detection">Layer 2: Secrets Detection</h3>

<p>Never let the agent expose secrets in logs or notifications.</p>

<p>Create a secrets scanner:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># secrets_scanner.py
</span><span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span><span class="p">,</span> <span class="n">Tuple</span>

<span class="k">class</span> <span class="nc">SecretsScanner</span><span class="p">:</span>
    <span class="s">"""Detect and redact secrets from agent outputs."""</span>

    <span class="n">PATTERNS</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s">'aws_key'</span><span class="p">:</span> <span class="sa">r</span><span class="s">'AKIA[0-9A-Z]{16}'</span><span class="p">,</span>
        <span class="s">'github_token'</span><span class="p">:</span> <span class="sa">r</span><span class="s">'gh[pousr]_[A-Za-z0-9_]{36,255}'</span><span class="p">,</span>
        <span class="s">'generic_api_key'</span><span class="p">:</span> <span class="sa">r</span><span class="s">'api[_-]?key["\']?\s*[:=]\s*["\']?([a-zA-Z0-9_\-]{20,})'</span><span class="p">,</span>
        <span class="s">'password'</span><span class="p">:</span> <span class="sa">r</span><span class="s">'password["\']?\s*[:=]\s*["\']?([^\s"\']{8,})'</span><span class="p">,</span>
        <span class="s">'private_key'</span><span class="p">:</span> <span class="sa">r</span><span class="s">'-----BEGIN (RSA |OPENSSH )?PRIVATE KEY-----'</span><span class="p">,</span>
        <span class="s">'jwt'</span><span class="p">:</span> <span class="sa">r</span><span class="s">'eyJ[A-Za-z0-9-_=]+\.eyJ[A-Za-z0-9-_=]+\.?[A-Za-z0-9-_.+/=]*'</span><span class="p">,</span>
        <span class="s">'connection_string'</span><span class="p">:</span> <span class="sa">r</span><span class="s">'(postgres|mysql|mongodb)://[^:]+:[^@]+@'</span><span class="p">,</span>
    <span class="p">}</span>

    <span class="o">@</span><span class="nb">staticmethod</span>
    <span class="k">def</span> <span class="nf">scan</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Tuple</span><span class="p">[</span><span class="nb">bool</span><span class="p">,</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]]:</span>
        <span class="s">"""
        Scan text for secrets.

        Args:
            text: Text to scan

        Returns:
            Tuple of (has_secrets, list of secret types found)
        """</span>
        <span class="n">found_secrets</span> <span class="o">=</span> <span class="p">[]</span>

        <span class="k">for</span> <span class="n">secret_type</span><span class="p">,</span> <span class="n">pattern</span> <span class="ow">in</span> <span class="n">SecretsScanner</span><span class="p">.</span><span class="n">PATTERNS</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
            <span class="k">if</span> <span class="n">re</span><span class="p">.</span><span class="n">search</span><span class="p">(</span><span class="n">pattern</span><span class="p">,</span> <span class="n">text</span><span class="p">,</span> <span class="n">re</span><span class="p">.</span><span class="n">IGNORECASE</span><span class="p">):</span>
                <span class="n">found_secrets</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">secret_type</span><span class="p">)</span>

        <span class="k">return</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">found_secrets</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">,</span> <span class="n">found_secrets</span><span class="p">)</span>

    <span class="o">@</span><span class="nb">staticmethod</span>
    <span class="k">def</span> <span class="nf">redact</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="s">"""
        Redact secrets from text.

        Args:
            text: Text to redact

        Returns:
            Text with secrets replaced by [REDACTED]
        """</span>
        <span class="n">redacted</span> <span class="o">=</span> <span class="n">text</span>

        <span class="k">for</span> <span class="n">secret_type</span><span class="p">,</span> <span class="n">pattern</span> <span class="ow">in</span> <span class="n">SecretsScanner</span><span class="p">.</span><span class="n">PATTERNS</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
            <span class="n">redacted</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="n">pattern</span><span class="p">,</span> <span class="sa">f</span><span class="s">'[REDACTED:</span><span class="si">{</span><span class="n">secret_type</span><span class="p">.</span><span class="n">upper</span><span class="p">()</span><span class="si">}</span><span class="s">]'</span><span class="p">,</span> <span class="n">redacted</span><span class="p">,</span> <span class="n">flags</span><span class="o">=</span><span class="n">re</span><span class="p">.</span><span class="n">IGNORECASE</span><span class="p">)</span>

        <span class="k">return</span> <span class="n">redacted</span>


<span class="c1"># Usage in agent output
</span><span class="k">def</span> <span class="nf">safe_output</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""Process agent output to remove secrets before displaying."""</span>
    <span class="n">scanner</span> <span class="o">=</span> <span class="n">SecretsScanner</span><span class="p">()</span>
    <span class="n">has_secrets</span><span class="p">,</span> <span class="n">secret_types</span> <span class="o">=</span> <span class="n">scanner</span><span class="p">.</span><span class="n">scan</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>

    <span class="k">if</span> <span class="n">has_secrets</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"WARNING: Detected secrets in output: </span><span class="si">{</span><span class="s">', '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">secret_types</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">scanner</span><span class="p">.</span><span class="n">redact</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">text</span>
</code></pre></div></div>

<p>Update the investigation function to use secrets scanning:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">investigate_failure</span><span class="p">(</span><span class="n">workflow_run_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="s">"""Investigate a failed GitHub Actions workflow with secret protection."""</span>
    <span class="c1"># ... existing code ...
</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">result</span> <span class="o">=</span> <span class="n">agent_executor</span><span class="p">.</span><span class="n">invoke</span><span class="p">({</span><span class="s">"input"</span><span class="p">:</span> <span class="n">input_text</span><span class="p">})</span>

        <span class="c1"># Scan for secrets before outputting
</span>        <span class="n">safe_analysis</span> <span class="o">=</span> <span class="n">safe_output</span><span class="p">(</span><span class="n">result</span><span class="p">[</span><span class="s">'output'</span><span class="p">])</span>

        <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span> <span class="o">+</span> <span class="s">"="</span> <span class="o">*</span> <span class="mi">60</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"INVESTIGATION COMPLETE"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"="</span> <span class="o">*</span> <span class="mi">60</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="n">safe_analysis</span><span class="p">)</span>

        <span class="k">return</span> <span class="p">{</span>
            <span class="s">"success"</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
            <span class="s">"workflow_run_id"</span><span class="p">:</span> <span class="n">workflow_run_id</span><span class="p">,</span>
            <span class="s">"analysis"</span><span class="p">:</span> <span class="n">safe_analysis</span>
        <span class="p">}</span>
    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">return</span> <span class="p">{</span>
            <span class="s">"success"</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span>
            <span class="s">"workflow_run_id"</span><span class="p">:</span> <span class="n">workflow_run_id</span><span class="p">,</span>
            <span class="s">"error"</span><span class="p">:</span> <span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">)</span>
        <span class="p">}</span>
</code></pre></div></div>

<h3 id="layer-3-audit-trail">Layer 3: Audit Trail</h3>

<p>Log every agent decision for security review and debugging.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># audit_logger.py
</span><span class="kn">import</span> <span class="nn">json</span>
<span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Dict</span><span class="p">,</span> <span class="n">Any</span>

<span class="k">class</span> <span class="nc">AuditLogger</span><span class="p">:</span>
    <span class="s">"""Log all agent actions for security auditing."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">log_dir</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">".agent_logs"</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">log_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">log_dir</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">log_dir</span><span class="p">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">log_investigation</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">event_data</span><span class="p">:</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]):</span>
        <span class="s">"""
        Log an investigation event.

        Args:
            event_data: Dictionary containing event details
        """</span>
        <span class="n">timestamp</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">.</span><span class="n">utcnow</span><span class="p">().</span><span class="n">isoformat</span><span class="p">()</span>
        <span class="n">log_entry</span> <span class="o">=</span> <span class="p">{</span>
            <span class="s">"timestamp"</span><span class="p">:</span> <span class="n">timestamp</span><span class="p">,</span>
            <span class="s">"event_type"</span><span class="p">:</span> <span class="s">"investigation"</span><span class="p">,</span>
            <span class="o">**</span><span class="n">event_data</span>
        <span class="p">}</span>

        <span class="c1"># Log to daily file
</span>        <span class="n">log_file</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">log_dir</span> <span class="o">/</span> <span class="sa">f</span><span class="s">"audit_</span><span class="si">{</span><span class="n">datetime</span><span class="p">.</span><span class="n">utcnow</span><span class="p">().</span><span class="n">strftime</span><span class="p">(</span><span class="s">'%Y-%m-%d'</span><span class="p">)</span><span class="si">}</span><span class="s">.jsonl"</span>

        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">log_file</span><span class="p">,</span> <span class="s">'a'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">json</span><span class="p">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">log_entry</span><span class="p">)</span> <span class="o">+</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">log_tool_call</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tool_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">args</span><span class="p">:</span> <span class="n">Dict</span><span class="p">,</span> <span class="n">result</span><span class="p">:</span> <span class="n">Any</span><span class="p">,</span> <span class="n">duration</span><span class="p">:</span> <span class="nb">float</span><span class="p">):</span>
        <span class="s">"""Log a tool call."""</span>
        <span class="n">log_entry</span> <span class="o">=</span> <span class="p">{</span>
            <span class="s">"timestamp"</span><span class="p">:</span> <span class="n">datetime</span><span class="p">.</span><span class="n">utcnow</span><span class="p">().</span><span class="n">isoformat</span><span class="p">(),</span>
            <span class="s">"event_type"</span><span class="p">:</span> <span class="s">"tool_call"</span><span class="p">,</span>
            <span class="s">"tool"</span><span class="p">:</span> <span class="n">tool_name</span><span class="p">,</span>
            <span class="s">"arguments"</span><span class="p">:</span> <span class="n">args</span><span class="p">,</span>
            <span class="s">"result_preview"</span><span class="p">:</span> <span class="nb">str</span><span class="p">(</span><span class="n">result</span><span class="p">)[:</span><span class="mi">200</span><span class="p">],</span>
            <span class="s">"duration_seconds"</span><span class="p">:</span> <span class="n">duration</span>
        <span class="p">}</span>

        <span class="n">log_file</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">log_dir</span> <span class="o">/</span> <span class="sa">f</span><span class="s">"audit_</span><span class="si">{</span><span class="n">datetime</span><span class="p">.</span><span class="n">utcnow</span><span class="p">().</span><span class="n">strftime</span><span class="p">(</span><span class="s">'%Y-%m-%d'</span><span class="p">)</span><span class="si">}</span><span class="s">.jsonl"</span>

        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">log_file</span><span class="p">,</span> <span class="s">'a'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">json</span><span class="p">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">log_entry</span><span class="p">)</span> <span class="o">+</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">log_security_event</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">event_type</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">details</span><span class="p">:</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]):</span>
        <span class="s">"""Log a security-related event."""</span>
        <span class="n">log_entry</span> <span class="o">=</span> <span class="p">{</span>
            <span class="s">"timestamp"</span><span class="p">:</span> <span class="n">datetime</span><span class="p">.</span><span class="n">utcnow</span><span class="p">().</span><span class="n">isoformat</span><span class="p">(),</span>
            <span class="s">"event_type"</span><span class="p">:</span> <span class="s">"security"</span><span class="p">,</span>
            <span class="s">"security_event"</span><span class="p">:</span> <span class="n">event_type</span><span class="p">,</span>
            <span class="o">**</span><span class="n">details</span>
        <span class="p">}</span>

        <span class="n">log_file</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">log_dir</span> <span class="o">/</span> <span class="sa">f</span><span class="s">"security_</span><span class="si">{</span><span class="n">datetime</span><span class="p">.</span><span class="n">utcnow</span><span class="p">().</span><span class="n">strftime</span><span class="p">(</span><span class="s">'%Y-%m-%d'</span><span class="p">)</span><span class="si">}</span><span class="s">.jsonl"</span>

        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">log_file</span><span class="p">,</span> <span class="s">'a'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">json</span><span class="p">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">log_entry</span><span class="p">)</span> <span class="o">+</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
</code></pre></div></div>
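
<p>An audit trail only helps if someone reads it. As a minimal sketch, the helper below counts events by type in one of the daily JSONL files. The name <code class="language-plaintext highlighter-rouge">summarize_audit_log</code> is my own illustration and not part of the logger above; it assumes each line is a JSON object with an <code class="language-plaintext highlighter-rouge">event_type</code> field, as written by <code class="language-plaintext highlighter-rouge">AuditLogger</code>.</p>

```python
import json
from collections import Counter
from pathlib import Path


def summarize_audit_log(log_file: Path) -> Counter:
    """Count audit events by event_type in a JSONL file (hypothetical helper)."""
    counts: Counter = Counter()
    with open(log_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                counts["unparseable"] += 1  # keep going past corrupt lines
                continue
            if not isinstance(entry, dict):
                counts["unparseable"] += 1  # valid JSON, but not an event object
                continue
            counts[entry.get("event_type", "unknown")] += 1
    return counts
```

<p>A sudden jump in <code class="language-plaintext highlighter-rouge">security</code> events, or any <code class="language-plaintext highlighter-rouge">unparseable</code> lines at all, is usually worth a closer look.</p>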

<h3 id="layer-4-rate-limiting-and-cost-controls">Layer 4: Rate Limiting and Cost Controls</h3>

<p>Prevent runaway costs and API abuse:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># rate_limiter.py
</span><span class="kn">import</span> <span class="nn">time</span>
<span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">deque</span>
<span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span><span class="p">,</span> <span class="n">timedelta</span>

<span class="k">class</span> <span class="nc">RateLimiter</span><span class="p">:</span>
    <span class="s">"""Rate limit agent executions to prevent abuse and control costs."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">max_investigations_per_hour</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">20</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">max_per_hour</span> <span class="o">=</span> <span class="n">max_investigations_per_hour</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">investigation_times</span> <span class="o">=</span> <span class="n">deque</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">can_investigate</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
        <span class="s">"""Check if we can run another investigation."""</span>
        <span class="n">now</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">.</span><span class="n">utcnow</span><span class="p">()</span>
        <span class="n">cutoff</span> <span class="o">=</span> <span class="n">now</span> <span class="o">-</span> <span class="n">timedelta</span><span class="p">(</span><span class="n">hours</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

        <span class="c1"># Remove investigations older than 1 hour
</span>        <span class="k">while</span> <span class="bp">self</span><span class="p">.</span><span class="n">investigation_times</span> <span class="ow">and</span> <span class="bp">self</span><span class="p">.</span><span class="n">investigation_times</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">cutoff</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">investigation_times</span><span class="p">.</span><span class="n">popleft</span><span class="p">()</span>

        <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">investigation_times</span><span class="p">)</span> <span class="o">&lt;</span> <span class="bp">self</span><span class="p">.</span><span class="n">max_per_hour</span>

    <span class="k">def</span> <span class="nf">record_investigation</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Record that an investigation occurred."""</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">investigation_times</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">datetime</span><span class="p">.</span><span class="n">utcnow</span><span class="p">())</span>

    <span class="k">def</span> <span class="nf">time_until_next_allowed</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
        <span class="s">"""Get seconds until next investigation is allowed."""</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">can_investigate</span><span class="p">():</span>
            <span class="k">return</span> <span class="mi">0</span>

        <span class="n">oldest</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">investigation_times</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
        <span class="n">time_until_allowed</span> <span class="o">=</span> <span class="p">(</span><span class="n">oldest</span> <span class="o">+</span> <span class="n">timedelta</span><span class="p">(</span><span class="n">hours</span><span class="o">=</span><span class="mi">1</span><span class="p">))</span> <span class="o">-</span> <span class="n">datetime</span><span class="p">.</span><span class="n">utcnow</span><span class="p">()</span>
        <span class="k">return</span> <span class="nb">int</span><span class="p">(</span><span class="n">time_until_allowed</span><span class="p">.</span><span class="n">total_seconds</span><span class="p">())</span>
</code></pre></div></div>
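
<p>To show where the limiter sits in the flow, here is a hedged sketch of a gate around the investigation call. The <code class="language-plaintext highlighter-rouge">run_if_allowed</code> wrapper and the <code class="language-plaintext highlighter-rouge">Limiter</code> protocol are assumptions of mine; the protocol only describes the three <code class="language-plaintext highlighter-rouge">RateLimiter</code> methods defined above, so the real class can be passed in unchanged.</p>

```python
from typing import Callable, Protocol


class Limiter(Protocol):
    """Structural type matching the RateLimiter methods shown above."""

    def can_investigate(self) -> bool: ...
    def record_investigation(self) -> None: ...
    def time_until_next_allowed(self) -> int: ...


def run_if_allowed(
    limiter: Limiter,
    investigate: Callable[[str], dict],
    workflow_run_id: str,
) -> dict:
    """Run an investigation only when the limiter has capacity (hypothetical wrapper)."""
    if not limiter.can_investigate():
        wait = limiter.time_until_next_allowed()
        return {
            "success": False,
            "workflow_run_id": workflow_run_id,
            "error": f"Rate limit reached; retry in {wait}s",
        }
    # Record before running so a slow or failing run still counts against the budget
    limiter.record_investigation()
    return investigate(workflow_run_id)
```

<p>Recording before the call is a deliberate choice in this sketch: a run that crashes halfway has already spent tokens, so it should still count.</p>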

<h3 id="security-checklist">Security Checklist</h3>

<p>Before deploying your AI agent to production, verify:</p>

<ul>
  <li>Agent has minimum required permissions (read-only by default)</li>
  <li>All commands validated before execution</li>
  <li>Secrets scanner active on all outputs</li>
  <li>Audit logging enabled and monitored</li>
  <li>Rate limiting configured</li>
  <li>GitHub tokens scoped correctly (no admin access)</li>
  <li>LLM API keys stored in secrets, not code</li>
  <li>No secrets committed to repository</li>
  <li>Slack webhooks use incoming webhook URLs only</li>
  <li>Agent cannot modify production without approval</li>
</ul>
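
<p>For the “keys stored in secrets, not code” item, one common pattern is to read credentials from the environment and fail fast when they are missing. This is a small sketch under that assumption; <code class="language-plaintext highlighter-rouge">require_env</code> is a hypothetical helper, and the variable name you pass in depends on how your CI injects secrets.</p>

```python
import os


def require_env(name: str) -> str:
    """Read a required secret from the environment, failing fast if it is missing."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Name the variable in the error, but never echo a value
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

<p>Calling this once at startup, for example <code class="language-plaintext highlighter-rouge">require_env("LLM_API_KEY")</code> with whatever name your setup uses, surfaces a misconfigured secret before the agent starts an investigation.</p>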

<h3 id="real-world-security-scenario">Real-World Security Scenario</h3>

<p>Here’s how these layers work together:</p>

<ol>
  <li>Agent investigates failure and LLM suggests: <code class="language-plaintext highlighter-rouge">kubectl delete pod production-db-0</code></li>
  <li>Command validator catches this: “APPROVAL REQUIRED: Command requires human approval”</li>
  <li>Agent posts recommendation to GitHub issue instead of executing</li>
  <li>Secrets scanner detects database connection string in logs and redacts it</li>
  <li>Audit logger records the attempted command and approval requirement</li>
  <li>Human reviews the recommendation and decides whether to execute</li>
  <li>If approved, human runs command manually with full context</li>
</ol>

<p>The agent accelerates investigation but humans retain control over critical actions.</p>
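
<p>The scenario above can be sketched as a single function that validates, redacts, and records a suggested command without ever executing it. Everything here is illustrative: the patterns, the <code class="language-plaintext highlighter-rouge">Decision</code> dataclass, and <code class="language-plaintext highlighter-rouge">handle_suggestion</code> are assumptions that stand in for the validator, scanner, and logger built earlier, not a copy of them.</p>

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical patterns for illustration; a real validator needs a reviewed list
APPROVAL_PATTERNS = [r"\bkubectl\s+delete\b", r"\brm\s+-rf\b", r"\bdrop\s+table\b"]
# Hypothetical pattern: a password embedded in a connection URL
CONNECTION_STRING = re.compile(r"(\w+://[^:/\s]+:)[^@\s]+(@)")


@dataclass
class Decision:
    command: str
    approval_required: bool
    message: str  # text to post to the GitHub issue
    audit: List[dict] = field(default_factory=list)


def handle_suggestion(command: str, log_excerpt: str) -> Decision:
    """Validate, redact, and record an LLM-suggested command. Never executes it."""
    approval_required = any(
        re.search(pattern, command, re.IGNORECASE) for pattern in APPROVAL_PATTERNS
    )
    # Redact credentials before the excerpt leaves the process
    safe_excerpt = CONNECTION_STRING.sub(r"\1[REDACTED]\2", log_excerpt)
    prefix = "APPROVAL REQUIRED" if approval_required else "SUGGESTED"
    message = f"{prefix}: {command}\n\nLog excerpt:\n{safe_excerpt}"
    audit = [{
        "event_type": "security",
        "security_event": "command_suggested",
        "command": command,
        "approval_required": approval_required,
    }]
    return Decision(command, approval_required, message, audit)
```

<p>The function returns a message and an audit record instead of running anything, which keeps the human in the loop for steps 6 and 7.</p>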

<h2 id="practical-tips-and-common-pitfalls">Practical Tips and Common Pitfalls</h2>

<p>After building and running AI agents for DevOps investigations, I’ve learned what works and what doesn’t. Here are the hard-earned lessons that will save you time and money.</p>

<h3 id="prompt-engineering-best-practices">Prompt Engineering Best Practices</h3>

<p>Your prompt is the most important part of your agent. A vague prompt gives vague results. A specific prompt with context gives actionable insights.</p>

<p><strong>Bad Prompt:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">system_prompt</span> <span class="o">=</span> <span class="s">"""You are an AI agent. Debug the issue."""</span>
</code></pre></div></div>

<p>Why it fails: Too generic, no context, no output format.</p>

<p><strong>Good Prompt:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">system_prompt</span> <span class="o">=</span> <span class="s">"""You are an expert DevOps AI agent that investigates CI/CD pipeline failures.

Infrastructure context:
- Python microservices running on Kubernetes in AWS EKS
- PostgreSQL 14 database with connection pooling
- Redis for caching
- GitHub Actions for CI/CD

Your role is to:
1. Analyze workflow logs to identify the root cause of failures
2. Examine recent code changes that might have introduced issues
3. Search for similar problems in the issue tracker
4. Provide a clear, actionable root cause analysis

When analyzing failures:
- Focus on the actual error messages, not just symptoms
- Consider recent code changes as potential causes
- Look for patterns in similar past issues
- Be specific about what broke and why
- Suggest concrete fixes, not vague advice

Output format:
**Root Cause**: [One sentence summary]
**Evidence**: [Key findings from logs/commits/issues]
**Recommendation**: [Specific steps to fix]
**Related Issues**: [Links to similar problems if found]
"""</span>
</code></pre></div></div>

<p>Why it works: Infrastructure context, clear role, specific instructions, defined output format.</p>
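
<p>A defined output format also makes the result machine-readable. As a rough sketch, assuming the model follows the four bold labels from the prompt exactly, a small parser can split the analysis into fields. The <code class="language-plaintext highlighter-rouge">parse_analysis</code> helper is hypothetical and not part of the agent code.</p>

```python
import re
from typing import Dict

FIELDS = ("Root Cause", "Evidence", "Recommendation", "Related Issues")


def parse_analysis(text: str) -> Dict[str, str]:
    """Split agent output into its labeled sections (hypothetical helper)."""
    labels = "|".join(re.escape(name) for name in FIELDS)
    # Each section runs from its bold label to the next label or the end of the text
    pattern = re.compile(
        rf"\*\*({labels})\*\*:\s*(.*?)(?=\n\*\*(?:{labels})\*\*:|\Z)",
        re.DOTALL,
    )
    return {name: body.strip() for name, body in pattern.findall(text)}
```

<p>If a key is missing from the returned dictionary, the model ignored the format for that run, which is a useful signal to log.</p>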

<h3 id="common-pitfalls-and-solutions">Common Pitfalls and Solutions</h3>

<p><strong>Pitfall 1: Agent Loops Infinitely</strong></p>

<p>Symptom: Agent keeps calling tools without making progress.</p>

<p>Solution: Set <code class="language-plaintext highlighter-rouge">max_iterations</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">agent_executor</span> <span class="o">=</span> <span class="n">AgentExecutor</span><span class="p">(</span>
    <span class="n">agent</span><span class="o">=</span><span class="n">agent</span><span class="p">,</span>
    <span class="n">tools</span><span class="o">=</span><span class="n">tools</span><span class="p">,</span>
    <span class="n">verbose</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">max_iterations</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span>  <span class="c1"># Stop after 5 iterations
</span>    <span class="n">handle_parsing_errors</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
</code></pre></div></div>

<p><strong>Pitfall 2: Costs Spiral Out of Control</strong></p>

<p>Symptom: Your OpenAI bill is $500 for 100 investigations.</p>

<p>Cause: Using GPT-4 for everything, not optimizing token usage.</p>

<p>Solution: Use the right model for the task:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_llm</span><span class="p">(</span><span class="n">task_complexity</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"medium"</span><span class="p">):</span>
    <span class="s">"""Choose LLM based on task complexity."""</span>

    <span class="k">if</span> <span class="n">task_complexity</span> <span class="o">==</span> <span class="s">"simple"</span><span class="p">:</span>
        <span class="c1"># Use cheaper model for simple log analysis
</span>        <span class="n">model</span> <span class="o">=</span> <span class="s">"gpt-3.5-turbo"</span>  <span class="c1"># $0.02 per investigation
</span>    <span class="k">elif</span> <span class="n">task_complexity</span> <span class="o">==</span> <span class="s">"medium"</span><span class="p">:</span>
        <span class="n">model</span> <span class="o">=</span> <span class="s">"anthropic/claude-3.5-sonnet"</span>  <span class="c1"># $0.15 per investigation
</span>    <span class="k">else</span><span class="p">:</span>  <span class="c1"># complex
</span>        <span class="n">model</span> <span class="o">=</span> <span class="s">"openai/gpt-4"</span>  <span class="c1"># $0.30 per investigation
</span>
    <span class="k">return</span> <span class="n">ChatOpenAI</span><span class="p">(</span><span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>

<p>Cost comparison:</p>

<ul>
  <li>GPT-4: $0.30 per investigation</li>
  <li>Claude 3.5 Sonnet: $0.15 per investigation</li>
  <li>GPT-3.5: $0.02 per investigation</li>
</ul>

<p>For 100 investigations/month:</p>
<ul>
  <li>All GPT-4: $30</li>
  <li>All GPT-3.5: $2</li>
  <li>Mixed (80% GPT-3.5, 20% GPT-4): $7.60</li>
</ul>
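
<p>These monthly figures are simple weighted sums, so they are easy to recompute for your own volume and model mix. A minimal sketch, using the per-investigation estimates listed above as assumed inputs (the <code class="language-plaintext highlighter-rouge">monthly_cost</code> function and the dictionary keys are illustrative names):</p>

```python
from typing import Dict

# Per-investigation estimates from the comparison above (USD)
COST_PER_INVESTIGATION = {"gpt-4": 0.30, "claude-3.5-sonnet": 0.15, "gpt-3.5": 0.02}


def monthly_cost(investigations: int, mix: Dict[str, float]) -> float:
    """Estimate monthly spend given each model's share of investigations."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("model shares must sum to 1.0")
    total = sum(
        investigations * share * COST_PER_INVESTIGATION[model]
        for model, share in mix.items()
    )
    return round(total, 2)
```

<p>For example, <code class="language-plaintext highlighter-rouge">monthly_cost(100, {"gpt-3.5": 0.8, "gpt-4": 0.2})</code> gives the mixed scenario, and changing the first argument shows how the bill scales with failure volume.</p>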

<p><strong>Pitfall 3: Secrets Leak in Logs</strong></p>

<p>Symptom: API keys visible in agent output.</p>

<p>Solution: Always scan output (from the security section):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">secrets_scanner</span> <span class="kn">import</span> <span class="n">safe_output</span>

<span class="n">result</span> <span class="o">=</span> <span class="n">agent_executor</span><span class="p">.</span><span class="n">invoke</span><span class="p">({</span><span class="s">"input"</span><span class="p">:</span> <span class="n">input_text</span><span class="p">})</span>
<span class="n">safe_result</span> <span class="o">=</span> <span class="n">safe_output</span><span class="p">(</span><span class="n">result</span><span class="p">[</span><span class="s">'output'</span><span class="p">])</span>  <span class="c1"># Redacts secrets
</span></code></pre></div></div>

<h3 id="performance-benchmarks">Performance Benchmarks</h3>

<p>From my production deployments:</p>

<p><strong>Investigation time:</strong></p>
<ul>
  <li>Simple failures (import errors): 10-15 seconds</li>
  <li>Medium complexity (config issues): 20-30 seconds</li>
  <li>Complex failures (race conditions): 45-60 seconds</li>
</ul>

<p><strong>Accuracy:</strong></p>
<ul>
  <li>Correct root cause identified: 78% of cases</li>
  <li>Useful output, including runs where the root cause was wrong: 92% of cases</li>
  <li>Completely useless output: 8% of cases</li>
</ul>

<p><strong>Cost per investigation:</strong></p>
<ul>
  <li>GPT-3.5: $0.02-0.05</li>
  <li>Claude 3.5 Sonnet: $0.10-0.20</li>
  <li>GPT-4: $0.15-0.30</li>
</ul>

<p><strong>Developer time saved:</strong></p>
<ul>
  <li>Average investigation time (manual): 25 minutes</li>
  <li>Average investigation time (agent): 30 seconds</li>
  <li>Time saved: 24.5 minutes per failure</li>
</ul>

<p>For 20 failures/day: 490 minutes = 8+ hours saved daily.</p>

<h3 id="quick-reference-dos-and-donts">Quick Reference: Dos and Don’ts</h3>

<p><strong>DO:</strong></p>
<ul>
  <li>Set max_iterations to prevent loops</li>
  <li>Add timeouts to all API calls</li>
  <li>Scan outputs for secrets</li>
  <li>Log all agent decisions</li>
  <li>Use structured output formats</li>
  <li>Cache frequent queries</li>
  <li>Choose models based on complexity</li>
  <li>Test prompts in isolation first</li>
</ul>

<p><strong>DON’T:</strong></p>
<ul>
  <li>Give agents write access without validation</li>
  <li>Trust AI-generated commands blindly</li>
  <li>Send full logs (use last 50 lines)</li>
  <li>Use GPT-4 for everything (match the model to the task to control cost)</li>
  <li>Ignore rate limits</li>
  <li>Commit API keys to git</li>
  <li>Skip error handling</li>
  <li>Deploy without testing</li>
</ul>
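
<p>The “last 50 lines” advice is easy to apply before logs reach the model. Here is a small sketch; <code class="language-plaintext highlighter-rouge">tail_lines</code> is a hypothetical helper, and 50 is just the default suggested above, so tune it to your logs.</p>

```python
def tail_lines(log_text: str, max_lines: int = 50) -> str:
    """Return at most the last max_lines lines of a log (hypothetical helper)."""
    if max_lines <= 0:
        return ""
    lines = log_text.splitlines()
    if len(lines) <= max_lines:
        return log_text
    omitted = len(lines) - max_lines
    # Tell the model that earlier output was cut so it does not assume the log is complete
    header = f"[... {omitted} earlier lines omitted ...]"
    return "\n".join([header] + lines[-max_lines:])
```

<p>The header line costs a few tokens but tells the model the excerpt is partial, which matters when the real cause sits earlier in the log.</p>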

<h2 id="next-steps-and-extensions">Next Steps and Extensions</h2>

<p>You’ve built a working AI agent that automatically investigates pipeline failures. But this is just the beginning. Here are practical ways to extend and improve your agent.</p>

<h3 id="what-youve-built">What You’ve Built</h3>

<p>Let’s recap what your agent can do:</p>

<ul>
  <li>Monitor GitHub Actions workflows automatically</li>
  <li>Investigate failures in about 30 seconds on average</li>
  <li>Fetch and analyze workflow logs</li>
  <li>Examine recent code changes</li>
  <li>Search for similar issues</li>
  <li>Generate root cause analysis with recommendations</li>
  <li>Redact secrets from outputs</li>
  <li>Log all actions for audit</li>
  <li>Rate limit to control costs</li>
  <li>Post results to GitHub issues</li>
</ul>

<h3 id="extension-ideas">Extension Ideas</h3>

<p><strong>1. Multi-Agent System</strong></p>

<p>Create specialist agents for different tasks:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Build Agent: Optimizes build performance
</span><span class="n">build_agent</span> <span class="o">=</span> <span class="n">create_agent</span><span class="p">(</span>
    <span class="n">tools</span><span class="o">=</span><span class="p">[</span><span class="n">analyze_build_logs</span><span class="p">,</span> <span class="n">suggest_caching</span><span class="p">,</span> <span class="n">optimize_dependencies</span><span class="p">],</span>
    <span class="n">role</span><span class="o">=</span><span class="s">"Build Optimization Specialist"</span>
<span class="p">)</span>

<span class="c1"># Security Agent: Scans for vulnerabilities
</span><span class="n">security_agent</span> <span class="o">=</span> <span class="n">create_agent</span><span class="p">(</span>
    <span class="n">tools</span><span class="o">=</span><span class="p">[</span><span class="n">scan_dependencies</span><span class="p">,</span> <span class="n">check_secrets</span><span class="p">,</span> <span class="n">validate_configs</span><span class="p">],</span>
    <span class="n">role</span><span class="o">=</span><span class="s">"Security Analyst"</span>
<span class="p">)</span>

<span class="c1"># Deploy Agent: Manages deployments
</span><span class="n">deploy_agent</span> <span class="o">=</span> <span class="n">create_agent</span><span class="p">(</span>
    <span class="n">tools</span><span class="o">=</span><span class="p">[</span><span class="n">check_health</span><span class="p">,</span> <span class="n">deploy_staging</span><span class="p">,</span> <span class="n">rollback_if_needed</span><span class="p">],</span>
    <span class="n">role</span><span class="o">=</span><span class="s">"Deployment Specialist"</span>
<span class="p">)</span>
</code></pre></div></div>
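
<p>Once you have several specialists, something has to decide which one handles a given failure. Here is a minimal routing sketch, assuming a simple keyword match is good enough to start with. The <code>Specialist</code> class, the keyword sets, and <code>route_failure</code> are hypothetical names for illustration, not part of LangChain or the agent built earlier; in practice each entry would wrap an agent like the ones above.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Specialist:
    """Hypothetical stand-in for one specialist agent and its trigger keywords."""
    role: str
    keywords: set = field(default_factory=set)

    def score(self, failure_text: str) -> int:
        # Count how many of this specialist's keywords appear in the failure text.
        words = set(failure_text.lower().split())
        return len(self.keywords.intersection(words))


# Keyword sets are illustrative assumptions; tune them to your own pipelines.
SPECIALISTS = [
    Specialist("Build Optimization Specialist", {"build", "cache", "compile", "dependency"}),
    Specialist("Security Analyst", {"cve", "secret", "vulnerability", "token"}),
    Specialist("Deployment Specialist", {"deploy", "rollback", "health", "staging"}),
]


def route_failure(failure_text: str) -> str:
    """Return the role of the best-matching specialist, or a fallback role."""
    best = max(SPECIALISTS, key=lambda s: s.score(failure_text))
    if best.score(failure_text) == 0:
        # Nothing matched, so hand the failure to a generic investigator.
        return "General Investigator"
    return best.role
```

<p>Keyword routing is crude but cheap and easy to audit. If it starts misrouting failures, you could swap the scoring function for an LLM classification call without changing the callers.</p>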

<p><strong>2. Kubernetes Integration</strong></p>

<p>Add tools for Kubernetes operations:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">tool</span>
<span class="k">def</span> <span class="nf">get_pod_status</span><span class="p">(</span><span class="n">namespace</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">pod_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""Get Kubernetes pod status and recent events."""</span>
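    <span class="k">raise</span> <span class="ne">NotImplementedError</span><span class="p">(</span><span class="s">"Query the Kubernetes API for pod status here"</span><span class="p">)</span>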

<span class="o">@</span><span class="n">tool</span>
<span class="k">def</span> <span class="nf">analyze_pod_logs</span><span class="p">(</span><span class="n">namespace</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">pod_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""Fetch and analyze pod logs."""</span>
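    <span class="k">raise</span> <span class="ne">NotImplementedError</span><span class="p">(</span><span class="s">"Fetch and analyze pod logs here"</span><span class="p">)</span>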
</code></pre></div></div>
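
<p>As a starting point for the body of a tool like <code>get_pod_status</code>, here is a rough sketch of the summarizing step. It assumes you have already fetched the pod as a dictionary shaped like the output of <code>kubectl get pod -o json</code>; how you fetch it (the official Python client, a subprocess call, or something else) is left open. <code>summarize_pod</code> is a hypothetical helper name.</p>

```python
def summarize_pod(pod: dict) -> str:
    """Summarize a pod object shaped like `kubectl get pod -o json` output."""
    name = pod.get("metadata", {}).get("name", "unknown")
    status = pod.get("status", {})
    phase = status.get("phase", "Unknown")
    lines = [f"pod={name} phase={phase}"]

    for cs in status.get("containerStatuses", []):
        # A container stuck in a waiting state carries a reason such as
        # CrashLoopBackOff or ImagePullBackOff.
        waiting = cs.get("state", {}).get("waiting") or {}
        reason = waiting.get("reason", "none")
        lines.append(
            f"container={cs.get('name')} "
            f"restarts={cs.get('restartCount', 0)} "
            f"waiting_reason={reason}"
        )
    return "\n".join(lines)
```

<p>Returning a short, flat text summary rather than the raw JSON keeps the tool output small, which matters when every token the agent reads costs money.</p>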

<p><strong>3. Learning from History</strong></p>

<p>Implement long-term memory with a vector database:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">langchain_community.vectorstores</span> <span class="kn">import</span> <span class="n">Chroma</span>
<span class="kn">from</span> <span class="nn">langchain_openai</span> <span class="kn">import</span> <span class="n">OpenAIEmbeddings</span>

<span class="c1"># Store past investigations
</span><span class="n">vectorstore</span> <span class="o">=</span> <span class="n">Chroma</span><span class="p">(</span>
    <span class="n">collection_name</span><span class="o">=</span><span class="s">"investigation_history"</span><span class="p">,</span>
    <span class="n">embedding_function</span><span class="o">=</span><span class="n">OpenAIEmbeddings</span><span class="p">()</span>
<span class="p">)</span>

<span class="c1"># When investigating a new failure
</span><span class="n">similar_cases</span> <span class="o">=</span> <span class="n">vectorstore</span><span class="p">.</span><span class="n">similarity_search</span><span class="p">(</span>
    <span class="n">error_message</span><span class="p">,</span>
    <span class="n">k</span><span class="o">=</span><span class="mi">3</span>  <span class="c1"># Find 3 most similar past failures
</span><span class="p">)</span>
</code></pre></div></div>

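<p>This lets your agent pull up past investigations that resemble the current failure, so it can reuse earlier root-cause findings instead of starting from scratch each time.</p>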
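
<p>To see the store-then-retrieve idea without a vector database or an embeddings API, here is a small stand-in that uses word counts and cosine similarity in place of real embeddings. It is a sketch for understanding the flow, not a replacement for Chroma; <code>InvestigationHistory</code> and its methods are hypothetical names.</p>

```python
import math
from collections import Counter


def _vectorize(text: str) -> Counter:
    # Bag-of-words counts stand in for a real embedding vector.
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in set(a).intersection(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InvestigationHistory:
    """Hypothetical in-memory store of past failures and their root causes."""

    def __init__(self):
        self._records = []  # list of (error_message, root_cause) tuples

    def add(self, error_message: str, root_cause: str) -> None:
        self._records.append((error_message, root_cause))

    def similar(self, error_message: str, k: int = 3) -> list:
        # Rank stored failures by similarity to the new error, most similar first.
        query = _vectorize(error_message)
        ranked = sorted(
            self._records,
            key=lambda record: _cosine(query, _vectorize(record[0])),
            reverse=True,
        )
        return ranked[:k]
```

<p>The important part is the write path: call <code>add</code> at the end of every investigation, otherwise there is nothing to retrieve later. Word overlap misses paraphrased errors, which is the gap real embeddings close.</p>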

<h3 id="resources-and-further-learning">Resources and Further Learning</h3>

<p><strong>LangChain Documentation</strong></p>

<ul>
  <li><a href="https://python.langchain.com/docs">LangChain Official Docs</a></li>
  <li><a href="https://python.langchain.com/docs/modules/agents">LangChain Agents Guide</a></li>
  <li><a href="https://python.langchain.com/docs/modules/tools">LangChain Tools Documentation</a></li>
</ul>

<p><strong>OpenRouter</strong></p>

<ul>
  <li><a href="https://openrouter.ai">Get API key</a></li>
  <li><a href="https://openrouter.ai/docs#pricing">Pricing</a></li>
  <li><a href="https://openrouter.ai/models">Model comparison</a></li>
</ul>

<p><strong>Security Resources</strong></p>

<ul>
  <li><a href="https://owasp.org/www-project-top-10-for-large-language-model-applications">OWASP LLM Top 10</a></li>
</ul>

<h2 id="final-thoughts">Final Thoughts</h2>

<p>AI agents aren’t replacing DevOps engineers. They’re accelerating investigation, reducing toil, and freeing you to focus on higher-value work.</p>

<p>The agent we built is read-only by design. It investigates and recommends, but humans make the final decisions. This is the right balance for production systems in 2025.</p>

<p>Start small:</p>

<ol>
  <li>Deploy the read-only investigation agent</li>
  <li>Monitor its accuracy for a few weeks</li>
  <li>Tune prompts based on results</li>
  <li>Gradually add more capabilities</li>
  <li>Always maintain human oversight</li>
</ol>

<p>Over the past 2 years as a DevOps consultant, I’ve seen teams waste countless hours on repetitive failure investigations. This agent solves that problem.</p>

<p>The code is production-ready. The security is enterprise-grade. The cost is negligible compared to developer time saved.</p>

<p>What are you waiting for? Give your CI/CD pipeline a brain.</p>

<hr />


<p><strong>If you found this helpful, share it on X and tag me <a href="https://twitter.com/muhammad_o7">@muhammad_o7</a></strong> - I’d love to hear your thoughts! You can also connect with me on <a href="https://www.linkedin.com/in/muhammad-raza-07/">LinkedIn</a>.</p>

<p><strong>Need Help?</strong> I’m available for Python and DevOps consulting. If you need help with CI/CD, automation, infrastructure, or AI agents for your DevOps workflows, reach out via email or DM me on <a href="https://twitter.com/muhammad_o7">X/Twitter</a>.</p>

          ]]>
        </description>
        <pubDate>Tue, 25 Nov 2025 00:00:00 +0000</pubDate>
        <link>//muhammadraza.me/2025/building-ai-agents-devops-automation/</link>
        <guid isPermaLink="true">//muhammadraza.me/2025/building-ai-agents-devops-automation/</guid>
        
        <category>ai</category>
        
        <category>devops</category>
        
        <category>automation</category>
        
        <category>python</category>
        
        
        
        <dc:creator>Muhammad Raza</dc:creator>
        <dc:rights></dc:rights>
      </item>
    
  </channel>
</rss>
