
CURE-Bench @ NeurIPS 2025

Benchmarking AI reasoning for therapeutic decision-making at scale

CURE-Bench Overview

Precision therapeutics require models that can reason over complex relationships between patients, diseases, and drugs. Large language models and large reasoning models, especially when combined with external tool use and multi-agent coordination, have demonstrated the potential to perform structured, multi-step reasoning in clinical settings. However, existing benchmarks (mostly QA benchmarks) do not evaluate these capabilities in the context of real-world therapeutic decision-making.

We present CURE-Bench, a competition and benchmark for evaluating AI models in drug decision-making and treatment planning. CURE-Bench includes clinically grounded tasks such as recommending treatments, assessing drug safety and efficacy, designing treatment plans, and identifying repurposing opportunities for diseases with limited therapeutic options.

Competition Tracks

Track 1: Internal Model Reasoning

In Track 1, models must reason using only their internal parameters — no external tools, APIs, or retrieval. This track is ideal for evaluating standalone models like LLaMA or DeepSeek-R1.

  • Must generate both answer and reasoning trace
  • No access to internet or external structured databases

Track 2: Agentic Tool-Augmented Reasoning

In Track 2, models act as agents that can invoke biomedical tools (e.g., FDA, OpenTargets, PubMed) during reasoning. This track evaluates integration, orchestration, and tool usage effectiveness. We provide ToolUniverse as an easy-to-use toolbox.

  • Must include reasoning trace with tool call log + final answer
  • Encouraged to use multi-agent pipelines

Therapeutic Reasoning Tasks

CURE-Bench features 12 real-world biomedical reasoning tasks spanning drug labeling, safety, and regulatory challenges. These tasks simulate the complexity of pharmaceutical information processing, requiring structured reasoning and trace generation.

Treatment Recommendation

Questions about specialized treatment recommendations that account for patient populations.

Adverse Event

Predicting potential adverse events and side effects based on drug properties and patient factors.

Drug Overview

Package label principal display panel and comprehensive drug description.

Drug Ingredients

Product data elements and active/inactive ingredient analysis.

Drug Warnings and Safety

Boxed warnings, contraindications, adverse reactions, and drug interactions.

Drug Dependence and Abuse

Drug abuse potential, dependence, controlled substance classification, and overdosage.

Dosage and Administration

Indications, usage, administration, dosage forms, strengths, and instructions.

Drug Use in Specific Populations

Usage in pregnancy, pediatric, geriatric populations, and nursing mothers.

Pharmacology

Clinical pharmacology, mechanism of action, pharmacodynamics, and pharmacokinetics.

Clinical Information

Clinical studies and trial data analysis.

Nonclinical Toxicology

Toxicology, carcinogenesis, mutagenesis, fertility impairment, and animal studies.

Patient-Focused Information

Patient medication guides, package inserts, and patient information.

Agentic Dataset Generation Pipeline

Built on top of TxAgent's data generation system, CURE-Bench introduces a modular, agentic system for generating thousands of therapeutic decision-making examples. This three-stage process ensures data diversity, traceability, and grounded biomedical logic. The pipeline leverages advanced AI agents to create realistic clinical scenarios that challenge models across multiple reasoning dimensions.

🧠 QuestionGen

Generates clinically plausible queries, spanning treatment, safety, and reasoning challenges across specialties.

🧬 TraceGen

Produces structured reasoning traces grounded in biomedical knowledge — including multi-step inference with tool use.

🧰 ToolGen

Generates agentic tools that can be invoked during reasoning, including databases, APIs, and retrieval systems for biomedical information.

Evaluation Criteria

Submissions to CURE-Bench are evaluated through a weighted aggregate of multiple metrics. Final rankings are determined by assigning predefined weights to each metric based on their importance, with the overall score computed as a weighted sum. Each submission must include reasoning traces, final answers, token usage, and model size/type information.
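
The weighted-sum scoring described above can be sketched as follows. This is a minimal illustration, not the organizers' scoring code: the function name, metric names, weights, and values are all hypothetical, since the page does not publish the actual weights.

```python
def overall_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Return the weighted sum of metric values.

    Both arguments map a metric name to a number. Every metric that has a
    weight must also be present in `metrics`.
    """
    missing = sorted(set(weights) - set(metrics))
    if missing:
        raise ValueError(f"missing metrics: {missing}")
    return sum(weight * metrics[name] for name, weight in weights.items())


# Hypothetical weights and metric values (assumptions, for illustration only).
weights = {"correctness": 0.5, "robustness": 0.3, "efficiency": 0.2}
submission_metrics = {"correctness": 0.8, "robustness": 0.6, "efficiency": 0.5}

score = overall_score(submission_metrics, weights)
print(f"overall score: {score:.2f}")
```

Because the weights are fixed in advance, two submissions can be compared by this single number even when they trade off correctness against efficiency differently.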

Correctness of Multi-Step Inference

Evaluates reasoning model performance across 13 drug prescription and specialized treatment tasks.

📊 Direct Performance

Accuracy on standard multiple-choice questions requiring multi-step reasoning with a single correct answer.

🔗 Accumulated Performance

Performance on tasks with multiple intermediate steps. The final answer counts as correct only if all steps are correct.
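
The difference between direct and accumulated performance can be sketched in a few lines. The function names and toy data below are assumptions made for illustration; the page does not specify how per-step correctness is recorded.

```python
def direct_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of questions whose predicted option matches the answer key."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("need equally sized, non-empty lists")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def accumulated_accuracy(step_results: list[list[bool]]) -> float:
    """Fraction of tasks in which every intermediate step is correct."""
    if not step_results or any(not steps for steps in step_results):
        raise ValueError("need at least one task, each with at least one step")
    return sum(all(steps) for steps in step_results) / len(step_results)


# Toy data (hypothetical): four single-answer questions ...
direct = direct_accuracy(["A", "C", "B", "D"], ["A", "C", "D", "D"])

# ... and three multi-step tasks, one of which has a failed step.
tasks = [[True, True], [True, False, True], [True, True, True]]
accumulated = accumulated_accuracy(tasks)

print(direct, accumulated)
```

In this toy data the second task gets two of its three steps right but still counts as a failure, which is what makes the accumulated metric stricter than per-step accuracy.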

🧠 Open-Ended Reasoning

Free-form reasoning without answer options. Measured by open-ended accuracy.

🛡️ Reasoning Robustness

Evaluates stability and reliability when faced with input variations.

🔄 Rephrasing Consistency

Consistent answers across semantically equivalent but differently worded versions of the same question.

🎯 Option Order Robustness

Correct answers when multiple-choice options are reordered, isolating reasoning from presentation bias.
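
One way to picture the two robustness checks is the sketch below. It is an illustration under stated assumptions, not the benchmark's evaluation code: the `Model` callable signature, the helper names, the number of orderings, and the placeholder options are all hypothetical.

```python
import random
from typing import Callable, Sequence

# Assumed interface: a model takes a question and its options and returns
# the index of the option it selects.
Model = Callable[[str, Sequence[str]], int]


def shuffle_options(options: Sequence[str], correct_index: int, seed: int):
    """Reorder options reproducibly; return (shuffled, new index of the correct one)."""
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    return [options[i] for i in order], order.index(correct_index)


def option_order_robustness(model: Model, question: str, options: Sequence[str],
                            correct_index: int, n_orderings: int = 5) -> float:
    """Fraction of shuffled orderings on which the model still picks the correct option."""
    hits = 0
    for seed in range(n_orderings):
        shuffled, new_correct = shuffle_options(options, correct_index, seed)
        hits += model(question, shuffled) == new_correct
    return hits / n_orderings


def rephrasing_consistency(answers_per_question: list[list[str]]) -> float:
    """Fraction of questions answered identically across all paraphrases."""
    consistent = sum(len(set(answers)) == 1 for answers in answers_per_question)
    return consistent / len(answers_per_question)


# Placeholder options; "gamma" stands in for the correct answer.
options = ["alpha", "beta", "gamma", "delta"]

def content_model(question: str, opts: Sequence[str]) -> int:
    return list(opts).index("gamma")   # chooses by content, ignores position

def first_option_model(question: str, opts: Sequence[str]) -> int:
    return 0                           # position-biased: always the first option

robust = option_order_robustness(content_model, "toy question", options, 2)
biased = option_order_robustness(first_option_model, "toy question", options, 2)
consistency = rephrasing_consistency([["beta", "beta", "beta"],
                                      ["alpha", "gamma", "alpha"]])
```

A model that selects by content scores 1.0 on the reordering check regardless of the shuffle, while a position-biased model is correct only when the right option happens to land in its preferred slot.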

Reasoning Efficiency

Measures computational cost and resource usage for generating reasoning traces and answers.

💾 Token Usage

Total tokens (input + output) consumed during inference. Submissions must report average tokens per example and model size/type.
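
The reported token figures could be computed along these lines. The class and function names and the per-example counts are hypothetical; the sketch only assumes that input and output tokens are logged separately for each example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenCount:
    """Token counts for one benchmark example."""
    input_tokens: int
    output_tokens: int

    @property
    def total(self) -> int:
        return self.input_tokens + self.output_tokens


def summarize_token_usage(counts: list[TokenCount]) -> dict[str, float]:
    """Return total tokens and the average number of tokens per example."""
    if not counts:
        raise ValueError("no examples to summarize")
    total = sum(c.total for c in counts)
    return {"total_tokens": total, "avg_tokens_per_example": total / len(counts)}


# Hypothetical counts for three examples.
usage = summarize_token_usage(
    [TokenCount(120, 380), TokenCount(95, 410), TokenCount(140, 355)]
)
print(usage)
```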

🔧 Tool Usage (Track 2)

Efficiency based on number of tools, variety of tools used, and correctness of tool usage.

🤖 Agentic Judge

Multiple collaborative agents evaluate reasoning traces to provide robust and objective assessments.

🔍 Factuality Check

Agents extract factual claims from reasoning traces and verify them using retrieval-augmented generation methods.

👩‍⚕️ Clinical Relevance Check

AI agents evaluate clinical reasoning quality, treatment appropriateness, and alignment with medical practice standards.

👥 Human Expert Evaluation

Top-performing teams (Top 5-10) undergo clinical relevance assessment by domain experts. This serves as a validity check to identify and disqualify models that may engage in metric hacking.

Expert Panel Composition

  • 👩‍⚕️ Clinicians: Harvard Medical School
  • 🔬 Clinical Researchers: Across Specialties
  • 💊 Pharmacists: Therapeutic Expertise

In partnership with the Chan Zuckerberg Initiative Rare as One Program

🎯 Problem Resolution

Does the response directly and correctly address the clinical question?

💡 Helpfulness

Is the explanation useful and informative in the clinical context?

🏥 Clinical Consensus

Does the response align with current medical practice and standards?

📋 Factual Accuracy

Are the claims correct, well-supported, and free from irrelevant content?

✅ Completeness

Does the response omit information that would change the disease treatment plan?

🔬 Scientific Validity

Is the reasoning chain coherent, grounded in biomedical evidence, and scientifically sound?

Competition Timeline

Below are the key milestones and dates for CURE-Bench. We recommend subscribing to our GitHub repository for starter kit and submission updates.

  • Jun 20: 🧪 Development Phase Begins
  • Sep 25: 📊 Final Evaluation Phase Opens
  • Oct 15: 🏁 Final Submissions Due
  • Nov 1: 🏆 Winners Announced

All deadlines are 11:59 PM AoE (Anywhere on Earth).
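
Anywhere on Earth corresponds to the UTC-12 offset, so an AoE deadline falls on the following calendar day in UTC. The sketch below converts the final submission deadline; the year 2025 is an assumption inferred from the NeurIPS 2025 title, since the timeline lists only month and day.

```python
from datetime import datetime, timedelta, timezone

# AoE is a fixed offset of UTC-12 with no daylight saving adjustments.
AOE = timezone(timedelta(hours=-12), name="AoE")

# Final submissions are due Oct 15 at 11:59 PM AoE (year assumed to be 2025).
deadline_aoe = datetime(2025, 10, 15, 23, 59, tzinfo=AOE)
deadline_utc = deadline_aoe.astimezone(timezone.utc)

print(deadline_utc.isoformat())  # → 2025-10-16T11:59:00+00:00
```

To display the deadline in local time, call `deadline_aoe.astimezone()` with no argument, which uses the system time zone.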

How to Participate

You can participate individually or as a team. Submissions must follow our track-specific formats. Resources, data, and examples are provided in the Starter Kit.

Required Submission Components

  • Reasoning traces: Multi-step reasoning process for each question
  • Final answers: Complete responses to therapeutic questions
  • Token usage: Average tokens per example and total consumption
  • Model information: Model size and type (or API-based LLM details)
  • Tool usage log (Track 2): Record of external tool invocations
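
Before submitting, a team might check that every required component is present. The sketch below is only an illustration of the list above: the key names and the dictionary layout are hypothetical, and the Starter Kit defines the actual track-specific formats.

```python
# Hypothetical key names mirroring the required components listed above.
BASE_COMPONENTS = {"reasoning_traces", "final_answers", "token_usage", "model_info"}
TRACK_2_COMPONENTS = {"tool_usage_log"}


def missing_components(submission: dict, track: int) -> list[str]:
    """Return the required components that are absent or empty, sorted by name."""
    if track not in (1, 2):
        raise ValueError("track must be 1 or 2")
    required = BASE_COMPONENTS | (TRACK_2_COMPONENTS if track == 2 else set())
    return sorted(name for name in required if not submission.get(name))


# Hypothetical submission; every key and value here is illustrative.
submission = {
    "reasoning_traces": {"q1": ["step 1", "step 2"]},
    "final_answers": {"q1": "B"},
    "token_usage": {"avg_tokens_per_example": 500.0, "total_tokens": 1500},
    "model_info": {"size": "8B", "type": "open-weight LLM"},
}

print(missing_components(submission, track=1))  # → []
print(missing_components(submission, track=2))  # → ['tool_usage_log']
```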

Organizers and Partners

In collaboration with Harvard Medical School, Harvard University, MIT, the Kempner Institute, Brigham and Women's Hospital, the Chan Zuckerberg Initiative, the Milken Institute, and the Biswas Family Foundation, CURE-Bench provides a rigorous, reproducible competition framework for assessing the performance, robustness, and interpretability of reasoning models in high-stakes clinical applications. It will accelerate the development of therapeutic AI and foster collaboration between AI and therapeutics communities.

Harvard Medical School
Harvard University
Kempner Institute
Massachusetts Institute of Technology
Brigham and Women's Hospital
Chan Zuckerberg Initiative
Milken Institute
Biswas Family Foundation
MIT Lincoln Laboratory