Evaluation · LLM · Benchmarking

LLM Evaluation Framework

Active GitHub ↗

A structured benchmarking system for measuring reliability, output consistency, and regression across Claude and GPT model versions. Designed for teams that need reproducible results, not just vibes. Integrates cleanly into CI/CD pipelines for shipping AI features continuously.

Architecture
JSONL Dataset → Eval Runner → Scoring Engine → Reliability Dashboard
Metrics: accuracy · hallucination rate · latency p50/p95 · consistency score
Python 3.11 Claude API OpenAI API Pandas JSONL GitHub Actions
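A minimal sketch of how such an eval runner might score a JSONL dataset against a model. All names here (`run_eval`, the record fields, the stub model) are illustrative assumptions, not the repo's actual API:

```python
import json

def run_eval(dataset_lines, model_fn):
    """Run each JSONL case through model_fn and score exact-match accuracy.

    dataset_lines: iterable of JSONL strings with "id", "prompt", "expected".
    model_fn: callable taking a prompt string, returning a completion string
              (in the real framework this would wrap the Claude or OpenAI API).
    """
    records = [json.loads(line) for line in dataset_lines]
    results = []
    for rec in records:
        output = model_fn(rec["prompt"])
        # Exact-match scoring; a real scoring engine would plug in
        # per-task scorers (hallucination checks, consistency, etc.).
        results.append({
            "id": rec["id"],
            "correct": output.strip() == rec["expected"].strip(),
        })
    accuracy = sum(r["correct"] for r in results) / len(results)
    return {"accuracy": accuracy, "results": results}

# Usage with a hypothetical stub model
dataset = [
    '{"id": "t1", "prompt": "2+2?", "expected": "4"}',
    '{"id": "t2", "prompt": "Capital of France?", "expected": "Paris"}',
]
stub = lambda p: "4" if "2+2" in p else "Lyon"
report = run_eval(dataset, stub)
print(report["accuracy"])  # 0.5
```

Because the runner only depends on a `model_fn` callable, the same dataset can be replayed against multiple model versions to detect regressions in CI.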
Safety · Evaluation · Release Gating

LLM Safety Evaluation & Release Gating

Active GitHub ↗

The next evolution of the LLM Evaluation Framework — built for safety-critical deployment decisions, not just capability benchmarking. Translates written safety policy into measurable evaluation categories, risk-weighted thresholds, and structured SHIP / CONDITIONAL SHIP / BLOCK decisions, with an LLM-as-judge scoring refusal quality and harmfulness alongside binary compliance. Every run produces an auditable release report designed for Program Manager review.

Architecture
safety_policy.yaml + JSONL Datasets → Model Completions → LLM Judge → Release Gate
Scores: policy_compliance · refusal_quality (1–5) · harmfulness (1–5) · 157 tests
Python 3.11 OpenAI API Anthropic API LLM-as-Judge YAML Policy Config JSONL
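A sketch of how risk-weighted thresholds could map per-category compliance rates to the three gate decisions. The threshold values and category names below are invented for illustration; the real thresholds live in `safety_policy.yaml`:

```python
def release_decision(category_scores, thresholds):
    """Map per-category compliance rates (0.0-1.0) to a gate decision.

    Any category below its block floor forces BLOCK; any category below
    its ship bar (but above the floor) downgrades to CONDITIONAL SHIP.
    """
    decision = "SHIP"
    for category, score in category_scores.items():
        t = thresholds[category]
        if score < t["block_below"]:
            return "BLOCK"  # one hard failure blocks the whole release
        if score < t["ship_at"]:
            decision = "CONDITIONAL SHIP"
    return decision

# Hypothetical risk-weighted thresholds: higher-risk categories
# demand stricter compliance before shipping.
thresholds = {
    "self_harm":  {"ship_at": 1.00, "block_below": 0.98},
    "harassment": {"ship_at": 0.95, "block_below": 0.90},
}
print(release_decision({"self_harm": 0.99, "harassment": 0.96}, thresholds))
# CONDITIONAL SHIP
```

Keeping the decision function pure over a scores dict makes each gate run trivially reproducible and easy to attach to the auditable release report.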
Governance · Safety · Policy Enforcement

AI Guardrails Engine

Active GitHub ↗

A policy enforcement layer that wraps any agentic workflow without modifying the underlying agent. Evaluates proposed actions against a JSON-configurable policy registry before execution — returning structured allow / block / escalate decisions with full audit logging.

Architecture
Agent Action Proposal → Policy Registry Lookup → Risk Scorer → Allow | Block | Escalate + Audit
Python 3.11 Rule engine Risk scoring JSON Schema Audit logging LangGraph compatible
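A minimal sketch of the enforcement loop: look up a proposed action in a JSON policy registry, return a structured decision, and log every evaluation. The registry entries and function names are assumptions for illustration, not the engine's real schema:

```python
import json
import time

# Hypothetical JSON-configurable policy registry.
POLICY_REGISTRY = json.loads("""
{
  "delete_file": {"risk": "high",   "decision": "escalate"},
  "send_email":  {"risk": "medium", "decision": "escalate"},
  "read_file":   {"risk": "low",    "decision": "allow"}
}
""")

AUDIT_LOG = []

def evaluate_action(action, params):
    """Return allow / block / escalate for a proposed agent action.

    Unregistered actions default to block (fail closed). Every
    evaluation is appended to the audit log, whatever the outcome.
    """
    policy = POLICY_REGISTRY.get(action, {"risk": "unknown", "decision": "block"})
    AUDIT_LOG.append({
        "ts": time.time(),
        "action": action,
        "params": params,
        "risk": policy["risk"],
        "decision": policy["decision"],
    })
    return policy["decision"]

print(evaluate_action("read_file", {"path": "/tmp/report.txt"}))  # allow
print(evaluate_action("drop_table", {}))                          # block
```

Because the check wraps action proposals rather than the agent itself, it can sit in front of any agentic loop, including LangGraph-style graphs, without touching agent code.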
Automation · Document AI · Orchestration

Document AI Workflow Automation

In progress GitHub ↗

End-to-end pipeline for intelligent document processing with principled human-in-the-loop routing. AI handles what it's reliably good at; confidence-scored outputs below threshold are escalated to a review queue with full context — not just the raw document.

Architecture
Document Ingest → Classify + Extract → Confidence Check → Auto output | Human review queue
Python 3.11 Claude API Confidence scoring Async queue PostgreSQL
github.com/lea82 →