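Prompt Engineering: From Core Principles to Frontier Practice (3): Evaluation Principles

This is part 3 of 4 in the series "Prompt Engineering: From Core Principles to Frontier Practice". The previous part covered promptfoo.

Evaluation Principles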

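1. Adopt eval-driven development
- **Evaluate early and often**: write narrowly scoped tests at every stage of development so that problems surface quickly and can be corrected.
- **Evaluate continuously**: evaluation is an ongoing process, not a one-off task. Re-running the evals after each change is what lets you keep improving model performance.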
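
A minimal sketch of what eval-driven development can look like in code, assuming the model is wrapped in a `generate` callable. `EvalCase`, `run_evals`, and `fake_generate` are hypothetical names introduced here for illustration; they do not come from any library.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str
    prompt_input: str
    check: Callable[[str], bool]  # returns True when the output is acceptable

def run_evals(generate: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case through `generate` and record pass/fail per case."""
    return {case.name: case.check(generate(case.prompt_input)) for case in cases}

def fake_generate(text: str) -> str:
    # Hypothetical stand-in for a real model call
    return text.upper()

cases = [
    EvalCase("keeps_content", "hello", lambda out: "HELLO" in out),
    EvalCase("non_empty", "x", lambda out: len(out) > 0),
    EvalCase("no_trailing_space", "hi ", lambda out: not out.endswith(" ")),
]

results = run_evals(fake_generate, cases)
pass_rate = sum(results.values()) / len(results)
print(results, f"pass rate: {pass_rate:.2f}")
```

Because each case is small and named, a suite like this can be re-run after every prompt change, and a drop in the pass rate points at the specific behavior that regressed.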

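2. Design task-specific evals
- **Reflect the real-world distribution**: make sure the test cases mirror how the model is actually used, so that it is not merely performing well under ideal conditions.
- **Make the goals concrete**: state the specific objective and success criteria of each eval so that it is targeted and actionable.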
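
One way to keep an eval set close to the real-world distribution is to sample it by category in the same proportions as production traffic. The sketch below assumes each logged example carries a `category` field; the category names, the `target_mix` shares, and `sample_eval_set` itself are illustrative assumptions.

```python
import random
from collections import Counter

def sample_eval_set(logged, target_mix, size, seed=0):
    """Draw about `size` examples whose category shares follow `target_mix`."""
    rng = random.Random(seed)  # fixed seed keeps the eval set reproducible
    by_category = {}
    for example in logged:
        by_category.setdefault(example["category"], []).append(example)
    picked = []
    for category, share in target_mix.items():
        picked.extend(rng.sample(by_category[category], round(size * share)))
    return picked

# Hypothetical logged traffic: 20 examples per category
logged = [{"category": c, "text": f"{c}-{i}"}
          for c in ("slang", "colloquial", "poetry") for i in range(20)]
# Assumed production mix; replace with shares measured from real traffic
target_mix = {"slang": 0.6, "colloquial": 0.3, "poetry": 0.1}

eval_set = sample_eval_set(logged, target_mix, size=10)
mix = Counter(example["category"] for example in eval_set)
print(dict(mix))
```

The success criterion (for example, a minimum pass rate per category) would be fixed alongside the mix, before any results are inspected.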

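3. Log everything
- **Log during development**: keep detailed logs while developing so that valuable eval cases can later be mined from them and used to guide improvements.
- **Traceability**: logs let you trace each output back to the input and the steps that produced it, which makes analysis and optimization easier.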
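
As a rough illustration of logging with later mining in mind, the sketch below writes one JSON record per model call and then filters the log for records worth turning into eval cases. The record fields, the `user_flagged` flag, and both helper functions are assumptions made for this example; an in-memory buffer stands in for a log file.

```python
import io
import json
import time

def log_call(stream, prompt, output, meta=None):
    """Append one model call to a JSON Lines stream."""
    record = {"ts": time.time(), "prompt": prompt, "output": output, "meta": meta or {}}
    stream.write(json.dumps(record, ensure_ascii=False) + "\n")

def mine_eval_cases(stream, is_suspicious):
    """Re-read the log and keep records that look like useful eval cases."""
    stream.seek(0)
    records = (json.loads(line) for line in stream if line.strip())
    return [record for record in records if is_suspicious(record)]

log = io.StringIO()  # in-memory stand-in for a log file
log_call(log, "translate: hello", "Hello", {"user_flagged": False})
log_call(log, "translate: yyds", "", {"user_flagged": True})

# Treat user-flagged or empty outputs as candidates for new eval cases
candidates = mine_eval_cases(
    log, lambda r: r["meta"].get("user_flagged") or not r["output"]
)
print([c["prompt"] for c in candidates])
```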

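4. Automate when possible
- **Structured eval suite**: build evals that can be scored automatically, which improves efficiency and objectivity and reduces the errors that manual handling introduces.
- **Combine automated scoring with human judgment**: automated scoring is efficient, but human judgment is still needed to confirm that the results are accurate and reasonable.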
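
A small sketch of combining the two: a rule-based grader decides the clear cases automatically and routes ambiguous ones to a human review queue. The keyword rule and the three verdict labels are assumptions chosen for brevity; a real grader would likely be richer.

```python
def auto_grade(output: str, required_terms: list[str]) -> str:
    """Return 'pass', 'fail', or 'review' based on how many required terms appear."""
    text = output.lower()
    hits = sum(term.lower() in text for term in required_terms)
    if hits == len(required_terms):
        return "pass"
    if hits == 0:
        return "fail"
    return "review"  # partial match: too ambiguous for a rule, ask a human

samples = [
    ("The moon is bright tonight.", ["moon", "bright"]),
    ("It is a sunny day.", ["moon", "bright"]),
    ("The moon rises.", ["moon", "bright"]),
]

verdicts = [auto_grade(out, terms) for out, terms in samples]
human_queue = [out for (out, _), v in zip(samples, verdicts) if v == "review"]
print(verdicts, human_queue)
```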

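5. Maintain agreement
- **Calibrate automated scoring with human feedback**: use human evaluation results to calibrate the automated grader so that the two agree, which makes the evaluation more reliable.
- **Recalibrate regularly**: as the model is updated and use cases change, recalibrate the eval suite periodically to keep it valid.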
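
Agreement between an automated grader and human reviewers can be tracked with a chance-corrected statistic. The sketch below uses Cohen's kappa, which is one common choice and not something the principles above prescribe; the label lists are made-up data.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long lists of categorical labels."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: fraction of items both raters labeled identically
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both raters labeled independently at their own rates
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1 - expected)

human = [1, 1, 0, 0, 1, 0, 1, 1]  # hypothetical human pass/fail labels
auto = [1, 0, 0, 0, 1, 0, 1, 1]   # hypothetical automated grader labels
kappa = cohen_kappa(human, auto)
print(f"kappa = {kappa:.2f}")
```

Recomputing a figure like this on a fresh human-labeled sample after each model or prompt update is one way to notice when the automated grader has drifted.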

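6. Avoid anti-patterns
- **Avoid overly generic metrics**: do not rely only on academic metrics such as perplexity or BLEU; choose metrics that fit the specific application.
- **Avoid biased design**: make sure the eval dataset faithfully reproduces production traffic patterns, so that data bias does not skew the results.
- **Avoid vibe-based evaluation**: do not judge model performance by subjective impression alone; assess it objectively with concrete data and metrics.
- **Do not ignore human feedback**: calibrate automated metrics against human evaluation to improve accuracy.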

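Together, these principles form the best practices for evaluation. They help developers design more effective and reliable evals, which in turn improves the performance and reliability of AI models in production.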

Eval reference material

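3.2 Automated Prompt Engineering

Designing and optimizing prompts by hand is time-consuming, demands deep expertise, and is still likely to end in a suboptimal result. Automated prompt engineering has therefore become an important research direction. Its main methods fall into two categories: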

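Discrete prompts: these methods automatically generate or optimize human-readable prompts made of real text, searching and recombining within the existing vocabulary space. Common techniques include:
- Mining: extract templates from a large corpus that connect inputs to outputs.
- Paraphrasing: rephrase a seed prompt, or substitute synonyms, to produce multiple variants for testing.
- Reinforcement-learning-based methods (e.g., RLPrompt): train a policy network with reinforcement learning so that it generates discrete prompts that earn a higher reward (that is, better performance) on a given task.

Continuous prompts (soft prompts): instead of manipulating human-readable text, these methods optimize a sequence of vectors directly in the model's embedding space. Such "soft prompts" are unreadable to humans but may be more efficient for the model.
- Representative techniques: Prefix-Tuning and P-tuning are typical examples. They add trainable continuous vectors before or within the input sequence, freeze the main parameters of the LLM during training, and update only the prompt vectors.
- Trade-offs: this approach usually performs better and is more parameter-efficient, but it needs direct access to the model's weights and gradients, which makes it incompatible with closed-source models that are reachable only through an API.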
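
For the discrete family, the simplest search loop is to score a handful of paraphrased variants on a small dev set and keep the best one. The sketch below assumes the variants already exist (written by hand or by a paraphrasing model); `best_prompt`, `toy_model`, and the exact-match score are hypothetical stand-ins, and the toy model only reacts to the word "uppercase".

```python
def best_prompt(variants, dev_set, run_model, score):
    """Return (template, mean score) for the variant that scores best on dev_set."""
    def mean_score(template):
        total = 0.0
        for example in dev_set:
            output = run_model(template.format(text=example["input"]))
            total += score(output, example["target"])
        return total / len(dev_set)

    scored = [(mean_score(template), template) for template in variants]
    top_score, top_template = max(scored, key=lambda pair: pair[0])
    return top_template, top_score

def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for an LLM: it only "understands" the word "uppercase"
    instruction, _, text = prompt.partition(": ")
    return text.upper() if "uppercase" in instruction.lower() else text

variants = [
    "Make this louder: {text}",
    "Convert to uppercase: {text}",
    "Rewrite in capital letters: {text}",
]
dev_set = [
    {"input": "hello", "target": "HELLO"},
    {"input": "dspy", "target": "DSPY"},
]

def exact_match(output: str, target: str) -> float:
    return float(output == target)

prompt, accuracy = best_prompt(variants, dev_set, toy_model, exact_match)
print(prompt, accuracy)
```

This needs only the ability to call the model, which is why discrete methods remain usable with API-only models where soft prompts are not.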

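As systems grow more complex, hand-tuning prompts becomes unmaintainable. DSPy (Declarative Self-improving Python) proposes a paradigm shift: treat the prompt as a model parameter and learn it automatically with an optimizer.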

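Example: DSPy. Core idea: programming, not prompting

In DSPy you define a Signature (what the inputs and outputs are) and a Module (the processing logic). The concrete prompt text and few-shot examples are generated automatically by an Optimizer at compile time, using evaluation data.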

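**The DSPy optimizer family:** an optimizer runs the program on a training set, tries different combinations of instructions and examples, and maximizes an evaluation metric.

BootstrapFewShot / BootstrapFewShotWithRandomSearch: in effect "automatic few-shot". A teacher program (by default a copy of your own program, optionally backed by a larger model) generates input-output demonstrations, the ones that pass your metric are kept, and a limited number are placed in the prompt. The random-search variant repeats this over several candidate demo sets and keeps the best-scoring program.
MIPROv2 (Multiprompt Instruction Proposal Optimizer, version 2): goes a step further. Besides optimizing examples, it uses Bayesian optimization to search for the best instruction text.
BootstrapFinetune: "distills" the results of prompt engineering into the weights of a small model, for cases where cost or latency must be pushed to the limit.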

Project: https://gitlab.mayfair-inc.com/liukai/llm-dspy-sample

"""
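DSPy automated prompt engineering example
=========================================

This example shows how to build a Chinese-to-English translator with the
DSPy framework and optimize its prompt automatically with the
BootstrapFewShot optimizer.

Core concepts:
- Signature: declares the input/output contract
- Module: defines the execution logic
- Optimizer: optimizes the prompt and examples automatically

Set the OPENAI_API_KEY environment variable before running.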
"""

import os
import json
import dspy
from dspy.teleprompt import BootstrapFewShot

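# ============================================================
# Signatures, defined at module level so every function can share them.
# The Chinese docstrings and `desc` strings are sent to the LLM verbatim
# and appear unchanged in the logged prompt, so they are left as written.
# ============================================================

# Task: translate Chinese into idiomatic English across colloquial,
# written, and poetic registers.
class ChineseToEnglish(dspy.Signature):
    """将中文地道地翻译成英文，涵盖口语、书面语或诗歌风格。"""
    chinese = dspy.InputField(desc="待翻译的中文文本")  # the Chinese text to translate
    english = dspy.OutputField(desc="地道的英文翻译")  # an idiomatic English translation

# Judge signature: rates the quality of a translation.
class Assess(dspy.Signature):
    """评估翻译质量。"""
    chinese = dspy.InputField(desc="原始中文文本")  # the original Chinese text
    english = dspy.InputField(desc="翻译后的英文文本")  # the translated English text
    score = dspy.OutputField(desc="翻译质量评分，1-5分，5分最高")  # score from 1 to 5, 5 is best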

# ============================================================
# Module definition
# ============================================================

class Translator(dspy.Module):
    """Translator module that reasons with ChainOfThought."""
    def __init__(self):
        super().__init__()
        self.prog = dspy.ChainOfThought(ChineseToEnglish)

    def forward(self, chinese: str):
        return self.prog(chinese=chinese)

def print_section(title: str, char: str = "="):
    """Print a divider line and a title."""
    print(f"\n{char * 60}")
    print(f"  {title}")
    print(f"{char * 60}")

def print_prompt_structure(compiled_module):
    """
    Pretty-print the structure of the compiled prompt.
    """
    print_section("📋 Prompt structure generated by DSPy", "─")

    print("\n[System instruction]")
    print(f"  Task: {ChineseToEnglish.__doc__.strip()}")
    print("  Input field: chinese (the Chinese text to translate)")
    print("  Output fields: reasoning + english (an idiomatic English translation)")

    # Collect the demos from the compiled module's state
    demos = []
    for name, predictor in compiled_module.named_predictors():
        state = predictor.dump_state()
        if 'demos' in state and state['demos']:
            demos.extend(state['demos'])

    if demos:
        print(f"\n[Few-shot examples in the compiled prompt] {len(demos)} total")
        print("  (bootstrapped and labeled demos taken from the training set)")

        for i, demo in enumerate(demos, 1):
            print(f"\n  Example {i}:")
            print(f"    Chinese: {demo.get('chinese', 'N/A')}")
            if demo.get('english'):
                print(f"    English: {demo['english']}")
            if demo.get('reasoning'):
                reasoning = demo['reasoning']
                reasoning_preview = reasoning[:80] + "..." if len(reasoning) > 80 else reasoning
                print(f"    Reasoning: {reasoning_preview}")
            if demo.get('augmented'):
                print("    [bootstrapped example]")
    else:
        print("\n[Few-shot examples] none (zero-shot)")

def save_compiled_program(compiled_module, filepath: str):
    """
    Save the compiled program as a human-readable JSON file.
    """
    output = {
        "task_description": ChineseToEnglish.__doc__.strip(),
        "input_fields": ["chinese"],
        "output_fields": ["reasoning", "english"],
        "selected_demos": [],
        "full_state": {}
    }
    
    # Collect the demos from the compiled module's state
    for name, predictor in compiled_module.named_predictors():
        state = predictor.dump_state()
        output["full_state"][name] = {
            "signature": state.get("signature", {}),
            "demos_count": len(state.get("demos", []))
        }
        if 'demos' in state and state['demos']:
            for demo in state['demos']:
                demo_dict = {
                    "chinese": demo.get('chinese'),
                    "english": demo.get('english'),
                    "reasoning": demo.get('reasoning'),
                    "augmented": demo.get('augmented', False)
                }
                output["selected_demos"].append(demo_dict)
    
    with open(filepath, 'w', encoding='utf-8') as f:
        json.dump(output, f, ensure_ascii=False, indent=2)
    
    print(f"\n💾 Compiled result saved to: {filepath}")

def compare_before_after(test_text: str, unoptimized: Translator, optimized: Translator):
    """
    Compare translations before and after optimization.
    """
    print(f"\n  Input: {test_text}")

    # Before optimization
    result_before = unoptimized(chinese=test_text)
    print(f"  Before: {result_before.english}")

    # After optimization
    result_after = optimized(chinese=test_text)
    print(f"  After: {result_after.english}")

    if hasattr(result_after, 'reasoning') and result_after.reasoning:
        reasoning_preview = result_after.reasoning[:100] + "..." if len(result_after.reasoning) > 100 else result_after.reasoning
        print(f"  Reasoning: {reasoning_preview}")

def main():
    # ============================================================
    # 1. Environment setup
    # ============================================================
    api_key = os.getenv('OPENAI_API_KEY')
    if not api_key:
        raise ValueError(
            "Please set the OPENAI_API_KEY environment variable\n"
            "For example: export OPENAI_API_KEY='your-api-key'"
        )

    lm = dspy.LM('openai/gpt-4o-mini', api_key=api_key)
    dspy.configure(lm=lm)

    print_section("🚀 DSPy automated prompt engineering demo")
    print("\n✅ LLM configured: gpt-4o-mini")

    # ============================================================
    # 2. Prepare the training data
    # ============================================================
    trainset = [
        dspy.Example(
            chinese="绝绝子，这家店真好喝！",
            english="This place is absolutely fire!"
        ).with_inputs('chinese'),

        dspy.Example(
            chinese="欲穷千里目，更上一层楼。",
            english="To see a thousand miles further, one must ascend another story."
        ).with_inputs('chinese'),

        dspy.Example(
            chinese="这个项目 yyds！",
            english="This project is the GOAT!"
        ).with_inputs('chinese'),

        dspy.Example(
            chinese="我太喜欢这个小姐姐了。",
            english="I absolutely adore this young lady."
        ).with_inputs('chinese'),

        dspy.Example(
            chinese="人生如梦，一樽还酹江月。",
            english="Life is but a dream, let me raise a cup to the river and moon."
        ).with_inputs('chinese'),

        dspy.Example(
            chinese="你好吗？我很好，谢谢。",
            english="How are you? I'm doing well, thank you."
        ).with_inputs('chinese'),

        dspy.Example(
            chinese="明月几时有，把酒问青天。",
            english="When does the bright moon appear? I raise my cup to question the sky."
        ).with_inputs('chinese'),

        dspy.Example(
            chinese="这部电影的特效太炸裂了！",
            english="The visual effects in this movie are absolutely mind-blowing!"
        ).with_inputs('chinese'),
    ]

    print(f"📚 Training set: {len(trainset)} examples (colloquial speech, classical poetry, internet slang)")

    # ============================================================
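    # 3. Define the evaluation metric (LLM-as-judge)
    # ============================================================
    def ai_metric(gold, pred, trace=None):
        # Ask the LLM to rate the translation and accept it only at 4 or above.
        judge = dspy.Predict(Assess)
        try:
            result = judge(chinese=gold.chinese, english=pred.english)
            score_str = str(result.score).strip()
            # Keep only the first digit so that replies such as "4/5" parse as 4.
            score = int(''.join(filter(str.isdigit, score_str))[:1])
            return score >= 4
        except (ValueError, AttributeError, IndexError):
            # A missing or unparseable score counts as a failure.
            return False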

    # ============================================================
    # 4. Create the unoptimized translator (for comparison)
    # ============================================================
    unoptimized_translator = Translator()

    # ============================================================
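    # 5. Compile and optimize
    # ============================================================
    print_section("🔄 Compiling...")
    print("\n  DSPy is now:")
    print("  1. Running the program on training examples")
    print("  2. Scoring each translation with the LLM judge")
    print("  3. Keeping demonstrations that pass as few-shot examples")
    print("\n  (Please wait, this takes about 1-2 minutes...)")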

    optimizer = BootstrapFewShot(
        metric=ai_metric,
        max_bootstrapped_demos=2,
        max_labeled_demos=4,
        max_rounds=1,
    )

    compiled_translator = optimizer.compile(
        Translator(),
        trainset=trainset
    )

    print("\n✅ Compilation finished!")

    # ============================================================
    # 6. Show the structure of the optimized prompt
    # ============================================================
    print_prompt_structure(compiled_translator)

    # ============================================================
    # 7. Before/after comparison
    # ============================================================
    print_section("⚡ Before vs. after optimization", "─")

    test_cases = [
        "床前明月光，疑是地上霜。",
        "这件事太离谱了！",
        "不以物喜，不以己悲。",
    ]

    for i, text in enumerate(test_cases, 1):
        print(f"\n[Comparison test {i}]")
        compare_before_after(text, unoptimized_translator, compiled_translator)

    # ============================================================
    # 8. Save the compiled result
    # ============================================================
    save_compiled_program(compiled_translator, "compiled_translator.json")

    # ============================================================
    # 9. Show the raw LLM call history (optional)
    # ============================================================
    print_section("📜 Details of the last LLM call", "─")
    print("\n(Below is the full prompt DSPy sent to the LLM)")
    print("-" * 60)
    dspy.inspect_history(n=1)

    print_section("✨ Demo finished!")

if __name__ == "__main__":
    main()

The final optimized prompt

System message:

Your input fields are:
1. `chinese` (str): 待翻译的中文文本
Your output fields are:
1. `reasoning` (str): 
2. `english` (str): 地道的英文翻译
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## chinese ## ]]
{chinese}

[[ ## reasoning ## ]]
{reasoning}

[[ ## english ## ]]
{english}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        将中文地道地翻译成英文，涵盖口语、书面语或诗歌风格。


User message:

This is an example of the task, though some input or output fields are not supplied.

[[ ## chinese ## ]]
欲穷千里目，更上一层楼。


Assistant message:

[[ ## reasoning ## ]]
Not supplied for this particular example. 

[[ ## english ## ]]
To see a thousand miles further, one must ascend another story.

[[ ## completed ## ]]


User message:

This is an example of the task, though some input or output fields are not supplied.

[[ ## chinese ## ]]
你好吗？我很好，谢谢。


Assistant message:

[[ ## reasoning ## ]]
Not supplied for this particular example. 

[[ ## english ## ]]
How are you? I'm doing well, thank you.

[[ ## completed ## ]]


User message:

[[ ## chinese ## ]]
绝绝子，这家店真好喝！


Assistant message:

[[ ## reasoning ## ]]
The phrase "绝绝子" is a slang expression used to convey strong admiration or excitement, similar to saying "amazing" or "awesome" in English. The speaker is expressing that they find the drinks at this shop remarkably good.

[[ ## english ## ]]
This place is absolutely fantastic; the drinks are amazing!

[[ ## completed ## ]]


User message:

[[ ## chinese ## ]]
我太喜欢这个小姐姐了。


Assistant message:

[[ ## reasoning ## ]]
The phrase expresses a strong affection or admiration for a young woman. "小姐姐" often implies a cute or attractive young lady, and the use of "太喜欢了" emphasizes the intensity of the feeling. A natural and colloquial translation would capture both the enthusiasm and the informal tone.

[[ ## english ## ]]
I really like this girl!

[[ ## completed ## ]]


User message:

[[ ## chinese ## ]]
不以物喜，不以己悲。

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## english ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


Response:

[[ ## reasoning ## ]]
This phrase encapsulates a philosophical viewpoint emphasizing emotional resilience and detachment. It suggests that one should not be overly elated by external possessions or circumstances, nor should one be excessively saddened by personal misfortunes. A natural English translation would reflect this deeper understanding and the balanced perspective on life.

[[ ## english ## ]]
One should not rejoice over material gains, nor be sorrowful over personal losses.

[[ ## completed ## ]]

Comparison tests

[Comparison test 1]
  Input: 床前明月光，疑是地上霜。
  Before: The moonlight before my bed, 
I suspect it’s frost on the ground.
  After: The bright moonlight shines before my bed, as if there's frost on the ground.
  Reasoning: This couplet is a well-known line from the poem "静夜思" (Quiet Night Thoughts) by Li Bai. It beautiful...

[Comparison test 2]
  Input: 这件事太离谱了！
  Before: This is absolutely ridiculous!
  After: This is just too ridiculous!
  Reasoning: The phrase expresses incredulity or disbelief about a situation being wildly outrageous or absurd. "...

[Comparison test 3]
  Input: 不以物喜，不以己悲。
  Before: Do not rejoice over possessions, nor grieve over oneself.
  After: One should not rejoice over material gains, nor be sorrowful over personal losses.
  Reasoning: This phrase encapsulates a philosophical viewpoint emphasizing emotional resilience and detachment. ...

Reference: DSPy Optimizers Documentation

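Effective evaluation and automation can raise prompt quality, but prompts must also be made secure against a growing number of malicious attacks.

The next part covers "Security: Prompt Injection Attacks and Defenses in Practice". Stay tuned.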

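Series contents: "Prompt Engineering: From Core Principles to Frontier Practice"

Introduction: Why Prompt Engineering Is a Core Capability for Technical Teams
promptfoo
Evaluation Principles (this article)
Security: Prompt Injection Attacks and Defenses in Practice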
