Available for collaboration

Lea Yanhui Li

AI Systems Architect · Evaluation · Reliability · Safe AI Automation

I design evaluation, reliability, and governance systems that determine when AI can safely automate complex workflows.

My work focuses on the infrastructure around AI systems — evaluation frameworks, ground-truth datasets, confidence calibration, and human-in-the-loop decision pipelines.

I currently work on large-scale AI governance and automation systems at Meta, with prior experience across enterprise platforms at Amazon, Hitachi, and Oracle, and as a co-founder of an early-stage startup.

I hold an MBA from UC Berkeley Haas and a master's degree in computer engineering from National University of Singapore.

AI Evaluation

Benchmark systems and test harnesses for LLM reliability at production scale.

AI Governance

Policy enforcement layers, risk scoring, and guardrails for autonomous agents.

Safe Automation

Human-in-the-loop thresholds and escalation logic for consequential AI decisions.

LLM Reliability

Confidence scoring, output validation, and regression testing across model versions.

Workflow Orchestration

Multi-agent pipeline design for document processing and review routing.

AI Infrastructure

Evaluation datasets, prompt versioning, and model monitoring at platform scale.

All experiments →
EXP-01

LLM Reliability Evaluation

Side-by-side evaluation of Claude Haiku 4.5 and GPT-4o across hallucination, classification, and refusal compliance on 12 structured test items.

Claude 100% · GPT 91.7% — GPT fabricated a fictional Einstein lecture (hal-002)
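The headline numbers reduce to per-model pass rates over the 12 test items. A minimal sketch of that aggregation, with illustrative item IDs and a hypothetical single GPT failure standing in for the actual dataset:

```python
# Hypothetical sketch of EXP-01's pass-rate comparison; item IDs and the
# failing index are illustrative, not the real eval records.
from collections import defaultdict

# Each record: (model, test_id, passed)
results = [
    ("claude-haiku-4.5", f"item-{i:03d}", True) for i in range(12)
] + [
    ("gpt-4o", f"item-{i:03d}", i != 2) for i in range(12)  # one hallucination failure
]

def pass_rates(records):
    """Aggregate per-model pass rates from (model, test_id, passed) tuples."""
    totals, passes = defaultdict(int), defaultdict(int)
    for model, _, passed in records:
        totals[model] += 1
        passes[model] += passed
    return {m: round(100 * passes[m] / totals[m], 1) for m in totals}

pass_rates(results)  # → {'claude-haiku-4.5': 100.0, 'gpt-4o': 91.7}
```

With 12 items, a single failure is the difference between 100% and 91.7%, which is why the side-by-side format reports the failing item ID alongside the rate.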
EXP-02

Guardrail Policy Engine Prototype

Minimal policy enforcement layer for LLM agent actions — rule-based risk scoring with configurable escalation thresholds and structured audit output.

EXP-03

Multi-Agent Document Workflow

OpenClaw-based orchestration experiment: supervisor agent coordinating classifier, extractor, and human-review-router sub-agents across document batches.
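The supervisor pattern in EXP-03 can be sketched in plain Python (this is not the OpenClaw API; the sub-agent logic here is a stand-in for the real classifier, extractor, and router):

```python
# Plain-Python sketch of supervisor/sub-agent coordination; all three
# sub-agents are hypothetical stand-ins for the real components.
def classifier(doc):
    # toy rule in place of a model call
    return {"type": "invoice" if "total" in doc.lower() else "other"}

def extractor(doc, meta):
    return {"fields": {"length": len(doc)}, **meta}

def review_router(record):
    # anything the classifier couldn't type goes to human review
    record["route"] = "auto" if record["type"] != "other" else "human-review"
    return record

def supervise(batch):
    """Supervisor: run each document through the sub-agents in sequence."""
    out = []
    for doc in batch:
        meta = classifier(doc)
        record = extractor(doc, meta)
        out.append(review_router(record))
    return out
```

The design point is that the supervisor owns sequencing and batching while each sub-agent stays single-purpose, so any stage can be swapped for a model-backed agent without touching the others.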

All projects →
Evaluation · LLM

LLM Evaluation Framework

GitHub ↗
Dataset (JSONL) → Eval Runner → Scoring Engine → Reliability Dashboard
Python · Claude API · OpenAI API · Pandas · JSONL
Governance · Safety

AI Guardrails Engine

GitHub ↗

Policy enforcement layer for AI agents. Evaluates proposed actions against a configurable policy registry before execution — returns allow / block / escalate with risk score.

Agent Action → Policy Registry → Risk Scorer → Allow / Block / Escalate
Python · Rule engine · Risk scoring · JSON Schema · Audit logging
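The allow / block / escalate contract can be sketched as a small risk-scoring function. The policy names, weights, and thresholds below are assumptions for illustration, not the engine's real registry:

```python
# Hedged sketch of the guardrails decision step; registry entries and
# threshold values are hypothetical.
POLICY_REGISTRY = {
    "external_email": 0.4,   # risk weight per matched policy
    "payment_over_1k": 0.7,
    "delete_records": 0.9,
}
BLOCK_AT, ESCALATE_AT = 0.8, 0.4

def evaluate(action_tags):
    """Score a proposed agent action and return a structured audit record."""
    score = max((POLICY_REGISTRY.get(t, 0.0) for t in action_tags), default=0.0)
    if score >= BLOCK_AT:
        decision = "block"
    elif score >= ESCALATE_AT:
        decision = "escalate"
    else:
        decision = "allow"
    return {"decision": decision, "risk_score": score, "matched": action_tags}

evaluate(["delete_records"])  # → decision 'block', risk_score 0.9
```

Returning a structured record rather than a bare boolean is what makes the audit-logging stage possible: every decision carries the score and the policies that produced it.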
Automation · Document AI

Document AI Workflow

GitHub ↗

End-to-end pipeline for document classification, confidence scoring, structured extraction, and confidence-based routing to automated processing or human review.

Ingest → Classify + Extract → Confidence Check → Auto / Review Queue
Python · Claude API · Confidence scoring · Review queue
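The confidence-check stage reduces to a single routing decision. A minimal sketch, where the threshold value is an assumption rather than the project's configured default:

```python
# Illustrative confidence-based routing; the 0.85 cutoff is hypothetical.
AUTO_THRESHOLD = 0.85

def route(extraction):
    """Send high-confidence extractions to automated processing,
    everything else to the human review queue."""
    if extraction["confidence"] >= AUTO_THRESHOLD:
        return "auto"
    return "review-queue"

route({"doc_id": "a1", "confidence": 0.93})  # → "auto"
route({"doc_id": "b2", "confidence": 0.61})  # → "review-queue"
```

Keeping the threshold as a named constant makes the automation boundary auditable and tunable, which is the point of confidence-based routing: the cutoff is a policy decision, not a code detail.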
github.com/lea82 →
All writing →
01

Why AI Evaluation Is the Hardest Problem in AI Deployment

On the structural reasons evaluation resists standardization, and what that means for teams shipping AI at scale.

02

Designing Safe Automation Thresholds

A framework for determining where AI should act autonomously, where it should pause, and where humans must decide.