Lea Yanhui Li
I design evaluation, reliability, and governance systems that determine when AI can safely automate complex workflows.
My work focuses on the infrastructure around AI systems — evaluation frameworks, ground-truth datasets, confidence calibration, and human-in-the-loop decision pipelines.
I currently work on large-scale AI governance and automation systems at Meta, with prior experience across enterprise platforms at Amazon, Hitachi, and Oracle, and as a co-founder of an early-stage startup.
I hold an MBA from UC Berkeley Haas and a master's degree in computer engineering from the National University of Singapore.
AI Evaluation
Benchmark systems and test harnesses for LLM reliability at production scale.
AI Governance
Policy enforcement layers, risk scoring, and guardrails for autonomous agents.
Safe Automation
Human-in-the-loop thresholds and escalation logic for consequential AI decisions.
LLM Reliability
Confidence scoring, output validation, and regression testing across model versions.
Workflow Orchestration
Multi-agent pipeline design for document processing and review routing.
AI Infrastructure
Evaluation datasets, prompt versioning, and model monitoring at platform scale.
LLM Reliability Evaluation
Side-by-side evaluation of Claude Haiku 4.5 and GPT-4o across hallucination, classification, and refusal compliance on 12 structured test items.
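The harness follows a simple loop: run each test item through both models, score per category, and tally results. A rough sketch of that shape is below; `call_model`, the scoring rubric, and the test-item fields are placeholders, not the actual implementation.

```python
# Minimal eval-harness sketch (illustrative only).
# call_model and the item fields are hypothetical stand-ins.
MODELS = ["claude-haiku-4.5", "gpt-4o"]  # identifiers shown for illustration

def call_model(model: str, prompt: str) -> str:
    """Placeholder for the provider-specific API call."""
    raise NotImplementedError

def score(category: str, expected: str, output: str) -> bool:
    """Per-category rubric; a strict substring match as a stand-in."""
    return expected.strip().lower() in output.strip().lower()

def run(items: list[dict]) -> dict:
    # Each item: {"category": ..., "prompt": ..., "expected": ...}
    results = {m: {"pass": 0, "total": 0} for m in MODELS}
    for item in items:
        for m in MODELS:
            out = call_model(m, item["prompt"])
            results[m]["total"] += 1
            results[m]["pass"] += score(item["category"], item["expected"], out)
    return results
```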
Guardrail Policy Engine Prototype
Minimal policy enforcement layer for LLM agent actions — rule-based risk scoring with configurable escalation thresholds and structured audit output.
Multi-Agent Document Workflow
OpenClaw-based orchestration experiment: supervisor agent coordinating classifier, extractor, and human-review-router sub-agents across document batches.
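The supervisor's control flow, stripped of framework wiring, is roughly the loop below; the OpenClaw agent setup and prompts are omitted, and the sub-agents are passed in as plain callables for illustration.

```python
# Rough supervisor sketch; sub-agent callables stand in for the actual
# OpenClaw agent invocations.
from typing import Callable

def supervise(
    batch: list[dict],
    classify: Callable[[dict], dict],      # classifier sub-agent
    extract: Callable[[dict], dict],       # extractor sub-agent
    review_router: Callable[[dict], str],  # human-review-router sub-agent
) -> list[dict]:
    """Run each document through classify -> extract -> review routing."""
    processed = []
    for doc in batch:
        doc = {**doc, **classify(doc)}
        doc = {**doc, **extract(doc)}
        doc["review_route"] = review_router(doc)
        processed.append(doc)
    return processed
```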
LLM Evaluation Framework
AI Guardrails Engine
Policy enforcement layer for AI agents. Evaluates proposed actions against a configurable policy registry before execution — returns allow / block / escalate with risk score.
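The decision surface reduces to a small check per proposed action. A minimal sketch is below, assuming a simple in-memory registry; the field names, thresholds, and risk values are placeholders rather than the engine's actual schema.

```python
# Illustrative allow / block / escalate decision; registry format and
# risk thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class Policy:
    action_type: str       # e.g. "delete_record", "external_email"
    max_risk: float        # risk above this threshold escalates
    blocked: bool = False  # hard-blocked action types never execute

POLICY_REGISTRY = {
    "delete_record": Policy("delete_record", max_risk=0.3),
    "external_email": Policy("external_email", max_risk=0.5),
}

def evaluate(action_type: str, risk_score: float) -> dict:
    policy = POLICY_REGISTRY.get(action_type)
    if policy is None or policy.blocked:
        decision = "block"      # unknown or hard-blocked actions are rejected
    elif risk_score > policy.max_risk:
        decision = "escalate"   # over threshold: route to human review
    else:
        decision = "allow"
    return {"action": action_type, "risk": risk_score, "decision": decision}
```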
Document AI Workflow
End-to-end pipeline for document classification, confidence scoring, structured extraction, and confidence-based routing to automated processing or human review.
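The routing step at the end of the pipeline is a two-threshold decision. A minimal sketch follows; the threshold values are illustrative and would be tuned per document type in practice.

```python
# Confidence-based routing sketch; thresholds are hypothetical.
AUTO_THRESHOLD = 0.92    # at or above: safe to process automatically
REVIEW_THRESHOLD = 0.60  # at or above: send to human review

def route(doc_id: str, label: str, confidence: float) -> dict:
    """Send a classified document to automation, review, or manual triage."""
    if confidence >= AUTO_THRESHOLD:
        queue = "automated_processing"
    elif confidence >= REVIEW_THRESHOLD:
        queue = "human_review"
    else:
        queue = "manual_triage"  # low confidence: full manual handling
    return {"doc_id": doc_id, "label": label, "queue": queue}
```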
Why AI Evaluation Is the Hardest Problem in AI Deployment
On the structural reasons evaluation resists standardization, and what that means for teams shipping AI at scale.
Designing Safe Automation Thresholds
A framework for determining where AI should act autonomously, where it should pause, and where humans must decide.