Tool Learning -- Datasets and Evaluation

2025-11-17

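Overview: “You get what you measure.” Evaluating Tool Learning is hard because it is a multi-step decision process carried out in an interactive environment. These notes select 6 representative benchmarks and trace how evaluation in this field has evolved from a single “success rate” toward “process correctness”, “environment stability”, and “complex reasoning ability”.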

1. Foundation & Scale

1.1 ToolBench (Qin et al., 2023)

The “ImageNet” moment for Tool Learning.

2. Environment Stability and Reproducibility (Reliability & Reproducibility)

2.1 StableToolBench (Guo et al., 2024)

Solving the “API Rot” problem.

3. Decision Awareness and Discrimination (Decision Making / Necessity)

3.1 WTU-EVAL (Ning et al., 2024)

To use, or not to use? That is the question.

4. Complex Reasoning and Composition (Complexity & Compositionality)

4.1 ToolHop (Ye et al., 2025)

Multi-hop Reasoning over Tools.

5. Personalization and Knowledge Fusion (Personalization & Context)

5.1 FamilyTool (Wang et al., 2025)

Tool Learning meets Knowledge Graphs.

6. Process Supervision and Diagnosis (Process Supervision)

6.1 TRAJECT-Bench (He et al., 2025)

Evaluate the journey, not just the destination.

Summary: How the Benchmarks Evolved

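| Benchmark | Key Focus | Mathematical / Logical Mapping | Applicable Scenarios |
| --- | --- | --- | --- |
| ToolBench | Breadth | $\max \sum{T_i}$ | |
| WTU-EVAL | Necessity | $\pi_{\text{need}}(q)$ | Avoiding model hallucination and tool misuse |
| StableToolBench | Stability | $\text{Var}(\text{Score}) \to 0$ | Comparing algorithm iterations, reproducibility |
| ToolHop | Depth | $T_n(\dots T_1(q))$ | Complex planning agents |
| FamilyTool | Context | $f(q, \text{UserKG}, \text{Tools})$ | Personalized assistants, smart home |
| TRAJECT-Bench | Process | $P(\tau_{\text{correct}} \mid q)$ | Fine-grained attribution analysis |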
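
To make the mapping column of the summary table more concrete, here is a minimal Python sketch of four of the quantities it names. It is a toy illustration under stated assumptions, not the official metric of any of these benchmarks: every function name is hypothetical, tools are modeled as plain Python callables, and a trajectory is modeled as a list of tool names.

```python
from functools import reduce
from statistics import pvariance


def score_variance(scores):
    """Var(Score): spread of one model's score over repeated runs.
    A stable environment should push this toward 0."""
    return pvariance(scores)


def compose(tools):
    """Build T_n(... T_1(q)) from the ordered list [T_1, ..., T_n]."""
    return lambda q: reduce(lambda acc, tool: tool(acc), tools, q)


def necessity_accuracy(predicted, gold):
    """Share of queries where the use-tool / no-tool decision matches gold.
    Stands in for evaluating a policy pi_need(q); labels are assumed booleans."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def trajectory_exact_match(predicted, reference):
    """1.0 only if the tool-call sequence matches the reference step for step.
    Averaged over queries, this is one crude estimate of P(tau_correct | q)."""
    return float(predicted == reference)


if __name__ == "__main__":
    print(score_variance([0.5, 0.5, 0.5]))                  # identical runs
    print(compose([lambda x: x + 1, lambda x: x * 2])(3))   # two-hop chain
    print(necessity_accuracy([True, False, True], [True, False, False]))
    print(trajectory_exact_match(["search", "lookup"], ["lookup", "search"]))
```

The last call shows why process-level evaluation is stricter than outcome-level evaluation: the two trajectories use the same tools, but the order differs, so the exact-match score is 0.0 even if the final answer happened to be right.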

Topics: tool learning, agent