Evaluation · LLM · Benchmarking

LLM Evaluation Framework

Active GitHub ↗

A structured benchmarking system for measuring reliability, output consistency, and regression across Claude and GPT model versions. Designed for teams that need reproducible results, not just vibes. Integrates cleanly into CI/CD pipelines for shipping AI features continuously.

Architecture
JSONL Dataset → Eval Runner → Scoring Engine → Reliability Dashboard
Metrics: accuracy · hallucination rate · latency p50/p95 · consistency score
Python 3.11 Claude API OpenAI API Pandas JSONL GitHub Actions
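A minimal sketch of how such an eval runner might score a JSONL dataset against a model. All names here (`run_eval`, the record fields, the stub model) are illustrative assumptions, not the repo's actual API:

```python
import json

def run_eval(dataset_lines, model_fn):
    """Run each JSONL case through model_fn and score exact-match accuracy.

    dataset_lines: iterable of JSONL strings with "id", "prompt", "expected".
    model_fn: callable taking a prompt string, returning a completion string
              (in the real framework this would wrap the Claude or OpenAI API).
    """
    records = [json.loads(line) for line in dataset_lines]
    results = []
    for rec in records:
        output = model_fn(rec["prompt"])
        # Exact-match scoring; a real scoring engine would plug in
        # per-task scorers (hallucination checks, consistency, etc.).
        results.append({
            "id": rec["id"],
            "correct": output.strip() == rec["expected"].strip(),
        })
    accuracy = sum(r["correct"] for r in results) / len(results)
    return {"accuracy": accuracy, "results": results}

# Usage with a hypothetical stub model
dataset = [
    '{"id": "t1", "prompt": "2+2?", "expected": "4"}',
    '{"id": "t2", "prompt": "Capital of France?", "expected": "Paris"}',
]
stub = lambda p: "4" if "2+2" in p else "Lyon"
report = run_eval(dataset, stub)
print(report["accuracy"])  # 0.5
```

Because the runner only depends on a `model_fn` callable, the same dataset can be replayed against multiple model versions to detect regressions in CI.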
Safety · Evaluation · Release Gating

LLM Safety Evaluation & Release Gating

Active GitHub ↗

The next evolution of the LLM Evaluation Framework — built for safety-critical deployment decisions, not just capability benchmarking. Translates written safety policy into measurable evaluation categories, risk-weighted thresholds, and structured SHIP / CONDITIONAL SHIP / BLOCK decisions, with an LLM-as-judge scoring refusal quality and harmfulness alongside binary compliance. Every run produces an auditable release report designed for Program Manager review.

Architecture
safety_policy.yaml + JSONL Datasets → Model Completions → LLM Judge → Release Gate
Scores: policy_compliance · refusal_quality (1–5) · harmfulness (1–5) · 157 tests
Python 3.11 OpenAI API Anthropic API LLM-as-Judge YAML Policy Config JSONL
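A sketch of how risk-weighted thresholds could map per-category compliance rates to the three gate decisions. The threshold values and category names below are invented for illustration; the real thresholds live in `safety_policy.yaml`:

```python
def release_decision(category_scores, thresholds):
    """Map per-category compliance rates (0.0-1.0) to a gate decision.

    Any category below its block floor forces BLOCK; any category below
    its ship bar (but above the floor) downgrades to CONDITIONAL SHIP.
    """
    decision = "SHIP"
    for category, score in category_scores.items():
        t = thresholds[category]
        if score < t["block_below"]:
            return "BLOCK"  # one hard failure blocks the whole release
        if score < t["ship_at"]:
            decision = "CONDITIONAL SHIP"
    return decision

# Hypothetical risk-weighted thresholds: higher-risk categories
# demand stricter compliance before shipping.
thresholds = {
    "self_harm":  {"ship_at": 1.00, "block_below": 0.98},
    "harassment": {"ship_at": 0.95, "block_below": 0.90},
}
print(release_decision({"self_harm": 0.99, "harassment": 0.96}, thresholds))
# CONDITIONAL SHIP
```

Keeping the decision function pure over a scores dict makes each gate run trivially reproducible and easy to attach to the auditable release report.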
Governance · Safety · Policy Enforcement

AI Guardrails Engine

Active GitHub ↗

A policy enforcement layer that wraps any agentic workflow without modifying the underlying agent. Evaluates proposed actions against a JSON-configurable policy registry before execution — returning structured allow / block / escalate decisions with full audit logging.

Architecture
Agent Action Proposal → Policy Registry Lookup → Risk Scorer → Allow | Block | Escalate + Audit
Python 3.11 Rule engine Risk scoring JSON Schema Audit logging LangGraph compatible
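A minimal sketch of the enforcement loop: look up a proposed action in a JSON policy registry, return a structured decision, and log every evaluation. The registry entries and function names are assumptions for illustration, not the engine's real schema:

```python
import json
import time

# Hypothetical JSON-configurable policy registry.
POLICY_REGISTRY = json.loads("""
{
  "delete_file": {"risk": "high",   "decision": "escalate"},
  "send_email":  {"risk": "medium", "decision": "escalate"},
  "read_file":   {"risk": "low",    "decision": "allow"}
}
""")

AUDIT_LOG = []

def evaluate_action(action, params):
    """Return allow / block / escalate for a proposed agent action.

    Unregistered actions default to block (fail closed). Every
    evaluation is appended to the audit log, whatever the outcome.
    """
    policy = POLICY_REGISTRY.get(action, {"risk": "unknown", "decision": "block"})
    AUDIT_LOG.append({
        "ts": time.time(),
        "action": action,
        "params": params,
        "risk": policy["risk"],
        "decision": policy["decision"],
    })
    return policy["decision"]

print(evaluate_action("read_file", {"path": "/tmp/report.txt"}))  # allow
print(evaluate_action("drop_table", {}))                          # block
```

Because the check wraps action proposals rather than the agent itself, it can sit in front of any agentic loop, including LangGraph-style graphs, without touching agent code.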
Automation · Document AI · Orchestration

Document AI Workflow Automation

In progress GitHub ↗

End-to-end pipeline for intelligent document processing with principled human-in-the-loop routing. AI handles what it's reliably good at; confidence-scored outputs below threshold are escalated to a review queue with full context — not just the raw document.

Architecture
Document Ingest → Classify + Extract → Confidence Check → Auto output | Human review queue
Python 3.11 Claude API Confidence scoring Async queue PostgreSQL
github.com/lea82 →