Krisis

A clinical evaluation framework for testing LLM safety behavior in medical reasoning tasks.

May 16, 2026

Krisis is a research framework for evaluating how large language models behave in high-stakes clinical reasoning tasks. It is not clinical software and is not intended to diagnose, treat, or triage patients. Instead, it focuses on a research question that becomes more important as models get more fluent: can an LLM recognize when it should not answer with confidence?

The project grew out of Cady AI, an earlier chronic kidney disease detection chatbot that used the UCI Chronic Kidney Disease dataset to predict CKD status, return class probabilities, and explain which lab results pushed risk upward. That project raised a deeper safety problem. Accuracy alone is not enough in clinical contexts; models also need to know when to abstain, defer, or express uncertainty.

Krisis turns that problem into a reusable benchmark system. It provides clinical suites that convert datasets into benchmark-ready patient records, model backends for routing prompts through different providers, a benchmark runner for batched evaluations, metrics for safety behavior, and report outputs for reviewing model performance.

The current implementation includes a Chronic Kidney Disease suite based on the UCI CKD dataset. It supports detection, staging, and synthetic progression stress testing. The progression task is intentionally framed as synthetic because the underlying dataset is cross-sectional rather than longitudinal; the goal is to test reasoning and deferral behavior, not to claim real patient progression outcomes.

The most important part of Krisis is that it evaluates more than whether a model got an answer right. Alongside accuracy, it measures calibration, coverage, abstention, expressed uncertainty, and deferral behavior. That makes it useful for studying whether a model behaves responsibly under ambiguity, incomplete information, or high-risk clinical conditions.
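To make three of these metrics concrete, here is a minimal Python sketch of coverage, accuracy on answered cases, and a binned calibration error. It is an illustration under stated assumptions, not the Krisis implementation: the Prediction class, the function names, and the convention that a label of None means the model abstained or deferred are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Prediction:
    # Hypothetical result shape, not taken from the Krisis codebase.
    label: Optional[str]  # None means the model abstained or deferred
    confidence: float     # model-reported confidence in [0, 1]
    truth: str            # reference label from the dataset


def coverage(preds: List[Prediction]) -> float:
    """Fraction of cases where the model committed to an answer."""
    return sum(p.label is not None for p in preds) / len(preds)


def selective_accuracy(preds: List[Prediction]) -> Optional[float]:
    """Accuracy computed only over the cases the model answered."""
    answered = [p for p in preds if p.label is not None]
    if not answered:
        return None
    return sum(p.label == p.truth for p in answered) / len(answered)


def expected_calibration_error(preds: List[Prediction], n_bins: int = 5) -> Optional[float]:
    """Weighted mean gap between confidence and accuracy across equal-width bins."""
    answered = [p for p in preds if p.label is not None]
    if not answered:
        return None
    bins: List[List[Prediction]] = [[] for _ in range(n_bins)]
    for p in answered:
        # Clamp so that confidence == 1.0 lands in the last bin.
        idx = min(int(p.confidence * n_bins), n_bins - 1)
        bins[idx].append(p)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        acc = sum(p.label == p.truth for p in b) / len(b)
        conf = sum(p.confidence for p in b) / len(b)
        ece += (len(b) / len(answered)) * abs(acc - conf)
    return ece


# Illustrative data only; not real benchmark output.
example = [
    Prediction("ckd", 0.9, "ckd"),
    Prediction("notckd", 0.9, "ckd"),
    Prediction(None, 0.3, "ckd"),
    Prediction("ckd", 0.7, "ckd"),
]
cov = coverage(example)
sel_acc = selective_accuracy(example)
ece = expected_calibration_error(example)
```

Reporting coverage next to accuracy on answered cases matters because either number alone can mislead: a model that abstains on nearly everything can look very accurate on the few cases it answers, and a model that never abstains can look decisive while being wrong on exactly the cases it should have deferred.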

From an engineering perspective, Krisis is a way to package clinical AI evaluation into a structured framework: datasets become patient records, records flow through model backends, outputs become evaluation results, and metrics turn those results into reports that can be inspected, compared, and extended.
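The flow described above can be sketched in a few lines of Python. Everything named here is hypothetical and chosen for illustration: PatientRecord, EvalResult, render_prompt, run_benchmark, summarize, and the stub backend are assumptions, not the actual Krisis API. A real backend would call a model provider; the stub simply defers whenever a lab value is missing.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class PatientRecord:
    # Hypothetical record shape; field names are illustrative.
    record_id: str
    features: Dict[str, Optional[float]]  # None marks a missing lab value
    truth: str


@dataclass
class EvalResult:
    record_id: str
    answer: str
    correct: Optional[bool]  # None when the model deferred


# A backend is modeled as any callable that maps a prompt to a text answer.
Backend = Callable[[str], str]


def render_prompt(record: PatientRecord) -> str:
    """Turn a patient record into a plain-text prompt."""
    lines = [
        f"{name}: {'missing' if value is None else value}"
        for name, value in record.features.items()
    ]
    return "Labs:\n" + "\n".join(lines) + "\nAnswer with ckd, notckd, or defer."


def stub_backend(prompt: str) -> str:
    """Stand-in for a real model call: defer if any lab value is missing."""
    return "defer" if "missing" in prompt else "ckd"


def run_benchmark(records: List[PatientRecord], backend: Backend) -> List[EvalResult]:
    """Send each record through the backend and score the answer."""
    results = []
    for record in records:
        answer = backend(render_prompt(record)).strip().lower()
        correct = None if answer == "defer" else answer == record.truth
        results.append(EvalResult(record.record_id, answer, correct))
    return results


def summarize(results: List[EvalResult]) -> Dict[str, Optional[float]]:
    """Reduce per-record results to a small report."""
    answered = [r for r in results if r.correct is not None]
    accuracy = sum(r.correct for r in answered) / len(answered) if answered else None
    return {
        "total": len(results),
        "deferred": len(results) - len(answered),
        "answered_accuracy": accuracy,
    }


# Illustrative records only; values are made up.
records = [
    PatientRecord("r1", {"sc": 4.2, "hemo": 9.1}, "ckd"),
    PatientRecord("r2", {"sc": None, "hemo": 14.0}, "notckd"),
    PatientRecord("r3", {"sc": 0.9, "hemo": 15.0}, "notckd"),
]
report = summarize(run_benchmark(records, stub_backend))
```

Treating the backend as a plain callable keeps the runner independent of any one provider, so swapping models changes only the function passed to run_benchmark.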