Job Search Agent — Multi-Agent AI System with Eval-Driven CI
Built a multi-agent system that converts job postings into complete application packs via Telegram, combining OCR ingestion, LLM-powered resume mutation with truthfulness guards, LaTeX compilation, and eval-gated CI that prevents regressions.
Overview
A production multi-agent system that takes a job posting (URL or screenshot) and produces a complete application pack: tailored resume, cover letter, outreach drafts, and follow-up schedule. Deployed on Railway with a Telegram bot interface, eval-driven CI gating, and 222 passing tests.
Context & Role
Solo builder — architecture, implementation, eval framework, and deployment. 39 commits, 21+ Linear issues tracked, PRD with planner/executor/profile agent architecture. Phase 2 active (Oct 2025 – Apr 2026).
Problem
Job applications are repetitive but high-stakes: each requires a tailored resume, cover letter, and outreach strategy. Manual tailoring takes 30–60 minutes per application. Existing AI tools generate generic output with no quality guarantees — hallucinated skills, broken formatting, and no way to prevent regressions.
Agent Architecture
A three-agent system. The Planner Agent parses job descriptions and identifies requirements, keywords, and match signals. The Executor Agent mutates resume content within editable regions only, enforcing truthfulness guards that prevent fabricated skills or experience. The Profile Agent maintains a canonical user profile that evolves across applications. All agents are orchestrated via OpenRouter, with cost tracked per invocation.
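The editable-region guard can be illustrated with a minimal sketch. The marker syntax (`% BEGIN_EDITABLE name` / `% END_EDITABLE name`) and the function names `locked_text` and `apply_mutation` are assumptions made for illustration; the project's actual delimiters are not described here. The idea is that a mutation is accepted only if everything outside the editable bodies is byte-for-byte unchanged.

```python
import re

# Hypothetical marker syntax: the real project may delimit regions differently.
REGION = re.compile(
    r"(?P<open>^% BEGIN_EDITABLE (?P<name>\w+)\n)"
    r"(?P<body>.*?)"
    r"(?P<close>^% END_EDITABLE (?P=name)\n)",
    re.DOTALL | re.MULTILINE,
)


def locked_text(doc: str) -> str:
    """Return the document with every editable body blanked out."""
    return REGION.sub(lambda m: m.group("open") + m.group("close"), doc)


def apply_mutation(doc: str, region: str, new_body: str) -> str:
    """Replace the body of one named region, then verify nothing else changed."""
    if not new_body.endswith("\n"):
        new_body += "\n"
    found = False

    def swap(m: re.Match) -> str:
        nonlocal found
        if m.group("name") != region:
            return m.group(0)  # leave other regions untouched
        found = True
        return m.group("open") + new_body + m.group("close")

    mutated = REGION.sub(swap, doc)
    if not found:
        raise KeyError(f"no editable region named {region!r}")
    # Guard: text outside editable bodies must be identical before and after.
    # This also rejects bodies that smuggle in their own region markers.
    if locked_text(mutated) != locked_text(doc):
        raise ValueError("mutation altered text outside editable regions")
    return mutated


RESUME = (
    "\\section{Experience}\n"
    "Acme Corp, 2020--2024\n"
    "% BEGIN_EDITABLE summary\n"
    "Backend engineer.\n"
    "% END_EDITABLE summary\n"
    "\\section{Education}\n"
)

tailored = apply_mutation(RESUME, "summary", "Backend engineer focused on Python APIs.")
```

Checking the locked text after the fact, rather than trusting the model to stay in bounds, is what makes the guard structural instead of prompt-level.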
Eval-Driven CI Gating
The core differentiator: every code change runs through an eval suite before merging. Metrics tracked: compile rate (LaTeX must produce valid PDF), forbidden claims detection (no hallucinated skills), edit region violations (mutations only in designated sections), cost per application (OpenRouter spend), and latency budgets. 222 tests enforce these constraints. CI gates prevent regressions — a PR that increases forbidden claims or breaks compilation cannot merge.
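As a rough sketch of how such a gate might compare a pull request against the main-branch baseline: the `EvalReport` fields mirror the metrics listed above, while the class name, the `gate` function, and the default budget values are hypothetical and not taken from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalReport:
    compile_rate: float       # fraction of resumes that compiled to a valid PDF
    forbidden_claims: int     # claims not backed by the user profile
    region_violations: int    # edits found outside editable regions
    cost_per_app_usd: float   # average OpenRouter spend per application
    p95_latency_s: float      # 95th-percentile end-to-end latency


def gate(
    baseline: EvalReport,
    candidate: EvalReport,
    max_cost_usd: float = 0.50,    # hypothetical budget
    max_latency_s: float = 60.0,   # hypothetical budget
) -> list[str]:
    """Return human-readable reasons the candidate must not merge."""
    failures = []
    if candidate.compile_rate < baseline.compile_rate:
        failures.append("compile rate regressed")
    if candidate.forbidden_claims > baseline.forbidden_claims:
        failures.append("forbidden claims increased")
    if candidate.region_violations > baseline.region_violations:
        failures.append("edit region violations increased")
    if candidate.cost_per_app_usd > max_cost_usd:
        failures.append("cost per application over budget")
    if candidate.p95_latency_s > max_latency_s:
        failures.append("latency over budget")
    return failures


baseline = EvalReport(1.0, 0, 0, 0.12, 20.0)
regressed = EvalReport(1.0, 1, 0, 0.12, 20.0)
print(gate(baseline, regressed))
```

In CI, a non-empty list would translate into a non-zero exit code so the merge is blocked.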
Ingestion & Output Pipeline
Input: the Telegram bot receives a job posting as a URL or a screenshot (text extracted via OCR). Processing: parse JD → match against profile → mutate resume → compile LaTeX (single-page enforcement) → generate outreach drafts (email, LinkedIn DM, referral ask). Output: the compiled PDF is uploaded to Google Drive, a calendar event is created for follow-up, and escalation tiers schedule subsequent follow-ups automatically.
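The follow-up escalation tiers could be modeled roughly as below. The tier labels, the day offsets, and the rule that weekend dates roll forward to Monday are all assumptions made for illustration; the actual tiers and timing are not specified here.

```python
from datetime import date, timedelta

# Hypothetical tiers: (label, days after the application was sent).
TIERS = (
    ("gentle nudge", 5),
    ("direct follow-up", 10),
    ("final check-in", 20),
)


def next_weekday(day: date) -> date:
    """Roll Saturday/Sunday forward to the following Monday."""
    while day.weekday() >= 5:  # Monday is 0, so 5 and 6 are the weekend
        day += timedelta(days=1)
    return day


def follow_up_schedule(applied_on: date, tiers=TIERS) -> list[tuple[str, date]]:
    """Return one (label, date) pair per escalation tier."""
    return [
        (label, next_weekday(applied_on + timedelta(days=offset)))
        for label, offset in tiers
    ]


schedule = follow_up_schedule(date(2025, 10, 1))
for label, when in schedule:
    print(f"{when.isoformat()}  {label}")
```

Each pair would then become one calendar event, so the schedule stays a pure function that is easy to cover in the eval suite.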
Product Decisions
Chose Telegram over web UI for zero-friction input — paste a link, get a pack. LaTeX over DOCX for precise formatting control and ATS compatibility. Editable regions over full-document mutation to enforce truthfulness — the LLM can only modify designated sections, never fabricate new experience. Eval-first development: wrote the eval framework before building the agents, so quality constraints shaped the architecture.
Architecture
Telegram webhook → FastAPI service → Agent orchestrator (Planner → Executor → Profile) → LaTeX compiler → Google Drive upload → Calendar event creation. PostgreSQL for application history and profile state. Railway for deployment with environment-based config.
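The first hop of this flow, deciding whether an incoming Telegram update carries a URL or a screenshot, might look like the sketch below. The dictionary keys (`message`, `chat`, `text`, `photo`, `file_id`, `width`, `height`) follow the Telegram Bot API's Update, Message, and PhotoSize objects; `JobInput` and `classify_update` are hypothetical names, and the function is kept framework-free so it could be called from a FastAPI route handler or tested on its own.

```python
import re
from dataclasses import dataclass
from typing import Optional

URL_RE = re.compile(r"https?://\S+")


@dataclass(frozen=True)
class JobInput:
    chat_id: int
    kind: str    # "url" or "screenshot"
    value: str   # the URL, or the Telegram file_id to download for OCR


def classify_update(update: dict) -> Optional[JobInput]:
    """Map a raw Telegram update payload to a job input, or None if unusable."""
    message = update.get("message") or {}
    chat_id = (message.get("chat") or {}).get("id")
    if chat_id is None:
        return None

    # Telegram sends several resolutions per photo; pick the largest for OCR.
    photos = message.get("photo") or []
    if photos:
        largest = max(photos, key=lambda p: p.get("width", 0) * p.get("height", 0))
        return JobInput(chat_id, "screenshot", largest["file_id"])

    match = URL_RE.search(message.get("text") or "")
    if match:
        return JobInput(chat_id, "url", match.group(0))
    return None


url_update = {"message": {"chat": {"id": 42}, "text": "see https://example.com/jobs/123"}}
photo_update = {
    "message": {
        "chat": {"id": 42},
        "photo": [
            {"file_id": "small", "width": 90, "height": 60},
            {"file_id": "large", "width": 1280, "height": 853},
        ],
    }
}
print(classify_update(url_update))
print(classify_update(photo_update))
```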
Metrics & Impact
222 tests passing with CI gates enforced. 39 commits across the project. Production-deployed on Railway. Complete application pack generated from a single Telegram message. Eval framework catches forbidden claims, compilation failures, and edit violations before they reach production.
Challenges & Trade-offs
LLM output variance required building the eval framework first — without it, resume mutations would drift toward hallucination. LaTeX single-page enforcement needed iterative font/margin tuning per resume variant. Cost control: OpenRouter routing lets the system pick cheaper models for parsing while reserving expensive models for resume mutation.
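The cost-control trade-off can be sketched as a small routing table. The model names, capability tiers, prices, and task names below are placeholders, not real OpenRouter identifiers or pricing; the sketch only shows the shape of "cheapest model that is good enough for the task" plus per-invocation cost accounting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str                  # placeholder, not a real model ID
    tier: int                  # hypothetical capability rank, higher is stronger
    usd_per_m_input: float     # placeholder price per million prompt tokens
    usd_per_m_output: float    # placeholder price per million completion tokens


CATALOG = (
    ModelOption("example/small", 1, 0.10, 0.40),
    ModelOption("example/large", 3, 3.00, 15.00),
)

# Hypothetical minimum capability per task.
MIN_TIER = {"parse_jd": 1, "mutate_resume": 3}


def pick_model(task: str, catalog=CATALOG) -> ModelOption:
    """Choose the cheapest model whose tier meets the task's minimum."""
    eligible = [m for m in catalog if m.tier >= MIN_TIER[task]]
    if not eligible:
        raise LookupError(f"no model is strong enough for {task!r}")
    return min(eligible, key=lambda m: m.usd_per_m_input + m.usd_per_m_output)


def invocation_cost(model: ModelOption, prompt_tokens: int, completion_tokens: int) -> float:
    """Cost in USD of a single call, given token counts from the response."""
    return (
        prompt_tokens * model.usd_per_m_input
        + completion_tokens * model.usd_per_m_output
    ) / 1_000_000


parser_model = pick_model("parse_jd")
writer_model = pick_model("mutate_resume")
print(parser_model.name, writer_model.name)
```

Summing `invocation_cost` over every call in one run gives the cost-per-application figure that the eval suite can budget against.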
Lessons
Eval frameworks aren't overhead — they're the product. Without CI-gated quality checks, LLM-powered systems degrade silently. Truthfulness guards must be architectural (editable regions), not just prompt-level. Telegram bots are underrated as production interfaces for personal tools.