Available for collaboration

Lea Yanhui Li

AI Systems Architect · Evaluation · Reliability · Safe AI Automation

I design evaluation, reliability, and governance systems that determine when AI can safely automate complex workflows.

My work focuses on the infrastructure around AI systems — evaluation frameworks, ground-truth datasets, confidence calibration, and human-in-the-loop decision pipelines.

I currently work on large-scale AI governance and automation systems at Meta, with prior experience across enterprise platforms at Amazon, Hitachi, and Oracle, and as a co-founder of an early-stage startup.

I hold an MBA from UC Berkeley Haas and a master's degree in computer engineering from National University of Singapore.

AI Evaluation

Benchmark systems and test harnesses for LLM reliability at production scale.

AI Governance

Policy enforcement layers, risk scoring, and guardrails for autonomous agents.

Safe Automation

Human-in-the-loop thresholds and escalation logic for consequential AI decisions.

LLM Reliability

Confidence scoring, output validation, and regression testing across model versions.

Workflow Orchestration

Multi-agent pipeline design for document processing and review routing.

AI Infrastructure

Evaluation datasets, prompt versioning, and model monitoring at platform scale.

All experiments →
EXP-01

LLM Reliability Evaluation

Side-by-side evaluation of Claude Haiku 4.5 and GPT-4o across hallucination, classification, and refusal compliance on 12 structured test items.

Claude 100% · GPT 91.7% — GPT fabricated a fictional Einstein lecture (hal-002)
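The headline numbers reduce to per-model pass rates over the 12 test items. A minimal sketch of that aggregation, with illustrative item IDs and a hypothetical single GPT failure standing in for the actual dataset:

```python
# Hypothetical sketch of EXP-01's pass-rate comparison; item IDs and the
# failing index are illustrative, not the real eval records.
from collections import defaultdict

# Each record: (model, test_id, passed)
results = [
    ("claude-haiku-4.5", f"item-{i:03d}", True) for i in range(12)
] + [
    ("gpt-4o", f"item-{i:03d}", i != 2) for i in range(12)  # one hallucination failure
]

def pass_rates(records):
    """Aggregate per-model pass rates from (model, test_id, passed) tuples."""
    totals, passes = defaultdict(int), defaultdict(int)
    for model, _, passed in records:
        totals[model] += 1
        passes[model] += passed
    return {m: round(100 * passes[m] / totals[m], 1) for m in totals}

pass_rates(results)  # → {'claude-haiku-4.5': 100.0, 'gpt-4o': 91.7}
```

With 12 items, a single failure is the difference between 100% and 91.7%, which is why the side-by-side format reports the failing item ID alongside the rate.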
EXP-02

Guardrail Policy Engine Prototype

Minimal policy enforcement layer for LLM agent actions — rule-based risk scoring with configurable escalation thresholds and structured audit output.

EXP-03

Multi-Agent Document Workflow

OpenClaw-based orchestration experiment: supervisor agent coordinating classifier, extractor, and human-review-router sub-agents across document batches.
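The supervisor pattern in EXP-03 can be sketched in plain Python (this is not the OpenClaw API; the sub-agent logic here is a stand-in for the real classifier, extractor, and router):

```python
# Plain-Python sketch of supervisor/sub-agent coordination; all three
# sub-agents are hypothetical stand-ins for the real components.
def classifier(doc):
    # toy rule in place of a model call
    return {"type": "invoice" if "total" in doc.lower() else "other"}

def extractor(doc, meta):
    return {"fields": {"length": len(doc)}, **meta}

def review_router(record):
    # anything the classifier couldn't type goes to human review
    record["route"] = "auto" if record["type"] != "other" else "human-review"
    return record

def supervise(batch):
    """Supervisor: run each document through the sub-agents in sequence."""
    out = []
    for doc in batch:
        meta = classifier(doc)
        record = extractor(doc, meta)
        out.append(review_router(record))
    return out
```

The design point is that the supervisor owns sequencing and batching while each sub-agent stays single-purpose, so any stage can be swapped for a model-backed agent without touching the others.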

All projects →
Evaluation · LLM

LLM Evaluation Framework

GitHub ↗
Dataset (JSONL) → Eval Runner → Scoring Engine → Reliability Dashboard
Python · Claude API · OpenAI API · Pandas · JSONL
Governance · Safety

AI Guardrails Engine

GitHub ↗

Policy enforcement layer for AI agents. Evaluates proposed actions against a configurable policy registry before execution — returns allow / block / escalate with risk score.

Agent Action → Policy Registry → Risk Scorer → Allow / Block / Escalate
Python · Rule engine · Risk scoring · JSON Schema · Audit logging
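The allow / block / escalate contract can be sketched as a small risk-scoring function. The policy names, weights, and thresholds below are assumptions for illustration, not the engine's real registry:

```python
# Hedged sketch of the guardrails decision step; registry entries and
# threshold values are hypothetical.
POLICY_REGISTRY = {
    "external_email": 0.4,   # risk weight per matched policy
    "payment_over_1k": 0.7,
    "delete_records": 0.9,
}
BLOCK_AT, ESCALATE_AT = 0.8, 0.4

def evaluate(action_tags):
    """Score a proposed agent action and return a structured audit record."""
    score = max((POLICY_REGISTRY.get(t, 0.0) for t in action_tags), default=0.0)
    if score >= BLOCK_AT:
        decision = "block"
    elif score >= ESCALATE_AT:
        decision = "escalate"
    else:
        decision = "allow"
    return {"decision": decision, "risk_score": score, "matched": action_tags}

evaluate(["delete_records"])  # → decision 'block', risk_score 0.9
```

Returning a structured record rather than a bare boolean is what makes the audit-logging stage possible: every decision carries the score and the policies that produced it.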
Automation · Document AI

Document AI Workflow

GitHub ↗

End-to-end pipeline for document classification, confidence scoring, structured extraction, and confidence-based routing to automated processing or human review.

Ingest → Classify + Extract → Confidence Check → Auto / Review Queue
Python · Claude API · Confidence scoring · Review queue
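The confidence-check stage reduces to a single routing decision. A minimal sketch, where the threshold value is an assumption rather than the project's configured default:

```python
# Illustrative confidence-based routing; the 0.85 cutoff is hypothetical.
AUTO_THRESHOLD = 0.85

def route(extraction):
    """Send high-confidence extractions to automated processing,
    everything else to the human review queue."""
    if extraction["confidence"] >= AUTO_THRESHOLD:
        return "auto"
    return "review-queue"

route({"doc_id": "a1", "confidence": 0.93})  # → "auto"
route({"doc_id": "b2", "confidence": 0.61})  # → "review-queue"
```

Keeping the threshold as a named constant makes the automation boundary auditable and tunable, which is the point of confidence-based routing: the cutoff is a policy decision, not a code detail.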
github.com/lea82 →
All writing →
01

Why AI Evaluation Is the Hardest Problem in AI Deployment

On the structural reasons evaluation resists standardization, and what that means for teams shipping AI at scale.

02

Designing Safe Automation Thresholds

A framework for determining where AI should act autonomously, where it should pause, and where humans must decide.