Job Search Agent — Multi-Agent AI System with Eval-Driven CI

Built a multi-agent system that converts job postings into complete application packs via Telegram — OCR ingestion, LLM-powered resume mutation with truthfulness guards, LaTeX compilation, and eval-gated CI preventing regressions.

Multi-agent orchestration · Eval-gated quality control · Zero hallucinated claims · Telegram → full application pack

AI Product · Multi-Agent Systems · Quality Engineering · 0→1

Overview

A production multi-agent system that takes a job posting (URL or screenshot) and produces a complete application pack: tailored resume, cover letter, outreach drafts, and follow-up schedule. Deployed on Railway with a Telegram bot interface, eval-driven CI gating, and 222 passing tests.

Context & Role

Solo builder: owned architecture, implementation, eval framework, and deployment. Worked from a PRD that defines the planner/executor/profile agent architecture. Planned and shipped 21+ Linear issues. Phase 2 is active (Oct 2025 – Apr 2026).

Problem

Job applications are repetitive but high-stakes: each requires a tailored resume, cover letter, and outreach strategy. Manual tailoring takes 30–60 minutes per application. Existing AI tools generate generic output with no quality guarantees — hallucinated skills, broken formatting, and no way to prevent regressions.

Agent Architecture

A three-agent system. The Planner Agent parses job descriptions and identifies requirements, keywords, and match signals. The Executor Agent mutates resume content within editable regions only, enforcing truthfulness guards that prevent fabricated skills or experience. The Profile Agent maintains a canonical user profile that evolves across applications. All agents call their models through OpenRouter, with cost tracked per invocation.
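
One way a truthfulness guard of this kind could work is to compare the skills mentioned in a mutated resume against the canonical profile. The sketch below is illustrative only: the function name `find_forbidden_claims`, the idea of a fixed skill vocabulary, and the matching rule are all assumptions, not the system's actual implementation.

```python
import re


def find_forbidden_claims(
    resume_text: str,
    profile_skills: set[str],
    skill_vocabulary: set[str],
) -> set[str]:
    """Return vocabulary skills that appear in the resume but not in the profile.

    Hypothetical helper: a skill counts as "mentioned" when it appears as a
    standalone token (case-insensitive) in the resume text.
    """
    lowered = resume_text.lower()
    allowed = {skill.lower() for skill in profile_skills}
    forbidden = set()
    for skill in skill_vocabulary:
        # Lookarounds instead of \b so skills such as "C++" still match.
        pattern = rf"(?<!\w){re.escape(skill.lower())}(?!\w)"
        if re.search(pattern, lowered) and skill.lower() not in allowed:
            forbidden.add(skill)
    return forbidden


profile = {"Python", "FastAPI"}
vocabulary = {"Python", "FastAPI", "Kubernetes", "Go"}
text = "Built FastAPI services in Python and deployed on Kubernetes."
print(find_forbidden_claims(text, profile, vocabulary))
```

A check like this only catches skills that are in the vocabulary, which is one reason to pair it with a structural guard such as editable regions.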

Eval-Driven CI Gating

The core differentiator: every change runs through an eval suite before shipping. Metrics tracked: compile rate (LaTeX must produce valid PDF), forbidden claims detection (no hallucinated skills), edit region violations (mutations only in designated sections), cost per application (OpenRouter spend), and latency budgets. Eval gates prevent regressions — any change that increases forbidden claims or breaks output quality cannot ship.
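
The gate described above can be sketched as a comparison between a baseline eval run and a candidate run. This is a minimal illustration under stated assumptions: the `EvalMetrics` fields mirror the metrics named in this section, but the class, the `gate_failures` function, and the default budget values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalMetrics:
    compile_rate: float             # fraction of eval cases that compiled to a PDF
    forbidden_claims: int           # count of hallucinated-skill detections
    edit_region_violations: int     # mutations found outside editable regions
    cost_per_application_usd: float
    latency_s: float


def gate_failures(
    baseline: EvalMetrics,
    candidate: EvalMetrics,
    max_cost_usd: float = 0.50,     # assumed budget, not a figure from the project
    max_latency_s: float = 60.0,    # assumed budget, not a figure from the project
) -> list[str]:
    """Return a list of reasons the candidate may not ship (empty means pass)."""
    failures = []
    if candidate.compile_rate < baseline.compile_rate:
        failures.append("compile rate regressed")
    if candidate.forbidden_claims > baseline.forbidden_claims:
        failures.append("forbidden claims increased")
    if candidate.edit_region_violations > baseline.edit_region_violations:
        failures.append("edit region violations increased")
    if candidate.cost_per_application_usd > max_cost_usd:
        failures.append("cost per application over budget")
    if candidate.latency_s > max_latency_s:
        failures.append("latency over budget")
    return failures


baseline = EvalMetrics(1.0, 0, 0, 0.20, 30.0)
candidate = EvalMetrics(0.9, 2, 0, 0.25, 35.0)
for reason in gate_failures(baseline, candidate):
    print("BLOCKED:", reason)
```

In a CI job, a non-empty result would typically translate into a non-zero exit code so the change cannot merge.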

Ingestion & Output Pipeline

Input: Telegram bot receives job posting URL or screenshot (OCR extraction). Processing: parse JD → match against profile → mutate resume → compile LaTeX (single-page enforcement) → generate outreach drafts (email, LinkedIn DM, referral ask). Output: compiled PDF uploaded to Google Drive, calendar event created for follow-up, escalation tiers for automated follow-up scheduling.
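
As an illustration of the follow-up scheduling step, escalation tiers can be modelled as day offsets from the application date. The tier intervals and labels below are invented for the example; the source does not state the actual values, and `follow_up_schedule` is a hypothetical name.

```python
from datetime import date, timedelta

# Hypothetical tiers: (days after applying, label). Real intervals are not given.
ESCALATION_TIERS = [
    (3, "first follow-up"),
    (7, "second follow-up"),
    (14, "final follow-up"),
]


def follow_up_schedule(
    applied_on: date,
    tiers: list[tuple[int, str]] = ESCALATION_TIERS,
) -> list[tuple[date, str]]:
    """Turn day offsets into concrete dates for calendar events."""
    return [(applied_on + timedelta(days=offset), label) for offset, label in tiers]


for due, label in follow_up_schedule(date(2025, 10, 1)):
    print(due.isoformat(), label)
```

Each resulting date could then back one calendar event, so the schedule stays a pure function that is easy to test without touching the Calendar API.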

Product Decisions

Chose Telegram over web UI for zero-friction input — paste a link, get a pack. LaTeX over DOCX for precise formatting control and ATS compatibility. Editable regions over full-document mutation to enforce truthfulness — the LLM can only modify designated sections, never fabricate new experience. Eval-first development: wrote the eval framework before building the agents, so quality constraints shaped the architecture.
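
The editable-region idea can be sketched with comment markers in the LaTeX source. The marker syntax (`% BEGIN EDITABLE: name` / `% END EDITABLE: name`) and the function names are assumptions made for this example; the project's real markers may differ. The point is that mutations are applied only between markers, and a separate check confirms that everything outside them is byte-for-byte unchanged.

```python
import re

# Assumed marker convention; each region is delimited by a matching name.
REGION = re.compile(
    r"(?P<open>^% BEGIN EDITABLE: (?P<name>\w+)\n)"
    r"(?P<body>.*?)"
    r"(?P<close>^% END EDITABLE: (?P=name)$)",
    re.DOTALL | re.MULTILINE,
)


def apply_mutations(template: str, mutations: dict[str, str]) -> str:
    """Replace the body of each named region; leave all other text untouched."""
    def swap(match: re.Match) -> str:
        name = match.group("name")
        if name not in mutations:
            return match.group(0)
        body = mutations[name].rstrip("\n") + "\n"
        return match.group("open") + body + match.group("close")

    return REGION.sub(swap, template)


def frozen_skeleton(document: str) -> str:
    """Blank out every editable body so only the locked text remains."""
    return REGION.sub(lambda m: m.group("open") + m.group("close"), document)


def violates_edit_regions(original: str, mutated: str) -> bool:
    """True if any text outside the editable regions differs."""
    return frozen_skeleton(original) != frozen_skeleton(mutated)


template = (
    "\\section{Summary}\n"
    "% BEGIN EDITABLE: summary\n"
    "Old summary.\n"
    "% END EDITABLE: summary\n"
    "\\section{Education}\n"
    "BSc.\n"
)
mutated = apply_mutations(template, {"summary": "Backend engineer focused on \\textbf{Python}."})
print(violates_edit_regions(template, mutated))
```

Because the violation check compares skeletons rather than trusting the mutation step, it can also run in the eval suite against raw model output.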

Architecture

Telegram webhook → FastAPI service → Agent orchestrator (Planner → Executor → Profile) → LaTeX compiler → Google Drive upload → Calendar event creation. PostgreSQL for application history and profile state. Railway for deployment with environment-based config.

Metrics & Impact

Production-deployed and generating complete application packs from a single Telegram message. Eval framework catches forbidden claims, compilation failures, and edit violations before they reach users. Quality gates enforced on every change.

Challenges & Trade-offs

LLM output variance required building the eval framework first — without it, resume mutations would drift toward hallucination. LaTeX single-page enforcement needed iterative font/margin tuning per resume variant. Cost control: OpenRouter routing lets the system pick cheaper models for parsing while reserving expensive models for resume mutation.
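
The cost-control approach can be sketched as a routing table keyed by task, plus a ledger that records spend per invocation. Everything concrete here is a placeholder: the model names, the per-token prices, and the `CostLedger` class are assumptions for illustration, and no OpenRouter API call is made.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelChoice:
    model: str                  # placeholder identifier, not a real model slug
    usd_per_1k_tokens: float    # invented price for the example


# Cheaper model for parsing, pricier model for resume mutation.
ROUTES = {
    "parse": ModelChoice("cheap-model", 0.0005),
    "mutate": ModelChoice("premium-model", 0.01),
}


@dataclass
class CostLedger:
    entries: list[tuple[str, str, float]] = field(default_factory=list)

    def record(self, task: str, tokens: int) -> float:
        """Log one invocation and return its estimated cost in USD."""
        choice = ROUTES[task]
        cost = tokens / 1000 * choice.usd_per_1k_tokens
        self.entries.append((task, choice.model, cost))
        return cost

    def total(self) -> float:
        return sum(cost for _, _, cost in self.entries)


ledger = CostLedger()
ledger.record("parse", 2000)
ledger.record("mutate", 3000)
print(f"cost per application: ${ledger.total():.4f}")
```

Keeping the ledger per application makes the cost-per-application metric a simple sum that the eval suite can compare against a budget.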

Lessons

Eval frameworks aren't overhead — they're the product. Without CI-gated quality checks, LLM-powered systems degrade silently. Truthfulness guards must be architectural (editable regions), not just prompt-level. Telegram bots are underrated as production interfaces for personal tools.

Tech Stack

Python · FastAPI · OpenRouter LLMs · LaTeX · PostgreSQL · Telegram Bot API · Railway · Google Drive API · Google Calendar API