"Hands-On Large Language Models" Reading Notes (Part 6)

May 18, 2026 / Jun 8, 2026 · 7 min read · Machine Learning, LLM, ReAct


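Chapter 7: Advanced Text Generation Techniques and Tools

The previous chapter moved from AutoModelForCausalLM, AutoTokenizer, and pipeline up to the slightly higher-level llama-cpp-python. This chapter continues with how to use LLMs; actually training or fine-tuning a model is still a long way off. I already picked up most of this material, including memory and agent tool calling, while learning LangChain, so those parts are not covered in detail here.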

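Methods and techniques covered in this chapter for improving text generation quality without fine-tuning the model:

Model I/O: loading and calling a model, demonstrated with llama-cpp-python
Memory: giving the model context across turns; see my earlier post on short-term memory, "LangChain Core Components: Short-Term Memory"
Agents: combining external tools to produce complex behavior; with create_agent() in LangChain 1.0 and later this becomes very simple
Chains: a modular way of connecting components; this is the LangChain 0.x architecture, which 1.0 and later move away from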

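This chapter moves into LangChain, which I already know reasonably well. LangChain 1.0 was only officially released in October 2025, so the book was clearly written against LangChain 0.x. Because 1.0 brought major changes, I rewrite the book's examples for LangChain 1.x as I go.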

Download the single-file GGUF model for llama-cpp: Phi-3-mini-4k-instruct-q4.gguf

Then install the Python dependencies

uv add langchain langchain-community llama-cpp-python

At the time of writing, the versions of these three packages are 1.3.1, 0.4.1, and 0.3.23 respectively.

from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path="<path-to-your>/Phi-3-mini-4k-instruct-q4.gguf",
    n_gpu_layers=-1,
    max_tokens=500,
    n_ctx=4096,
    seed=42,
    verbose=False,
)

response = llm.invoke("Hi! My name is Maarten. What is 1 + 1?")
print(response)

The book's example printed an empty result, whereas my run did produce output

<|assistant|> Hello Maarten! The answer to 1 + 1 is 2.
```

This response directly answers the user's question while maintaining a polite and friendly tone, suitable for an introductory conversation.

But this output is problematic: the model's special token <|assistant|> should not appear in it.

Internally, LangChain 0.x is built on Chain, while LangChain 1.0 and later are built on a LangGraph state graph (StateGraph).

A prompt template converts a human-readable message format

[
  {"role": "user", "content": "Hey"},
  {"role": "assistant", "content": "Hey yourself!"}
]

into the text with special tokens that the model expects

<|user|>
Hey<|end|>
<|assistant|>
Hey yourself!<|end|>
<|endoftext|>

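Note that different models use different special tokens; you will sometimes see <s>, <SEP>, and so on. In practice you will rarely have to deal with markers such as <|user|> and <|assistant|> directly, because the model serving layer hides this abstraction. For example, when using an Ollama server at http://localhost:11434, you simply pass JSON data.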
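To make the point about the serving layer concrete, here is a minimal sketch that builds the JSON body for Ollama's /api/chat endpoint using only the standard library. The model name phi3 is an assumption (any locally pulled model would do), and build_chat_request() and send() are hypothetical helper names. Nothing is sent unless you call send() with a server running.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"

def build_chat_request(messages, model="phi3"):
    """Wrap role/content messages in a POST request; no special tokens needed."""
    # "phi3" is an assumed model name, replace it with one you have pulled.
    body = json.dumps({"model": model, "messages": messages, "stream": False})
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(request):
    """Send the request and return the reply text; needs a running Ollama server."""
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read())["message"]["content"]

request = build_chat_request([{"role": "user", "content": "Hey"}])
print(request.data.decode("utf-8"))  # the JSON the server would receive
```

The server applies the model's chat template itself, so the client only ever deals with the role/content list.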

Looking at it, LangChain 0.x chains really are cumbersome

from langchain_core.prompts import PromptTemplate

template = """<|user|>
{input_prompt}<|end|>
<|assistant|>"""
prompt = PromptTemplate(
    template=template,
    input_variables=["input_prompt"]
)

basic_chain = prompt | llm  # | chains the two steps; it is the overloaded `__or__()` method

response = basic_chain.invoke({"input_prompt": "Hi! My name is Maarten. What is 1 + 1?"})
print(response)

With multiple chains and multiple prompts

# LangChain 0.x import path; LLMChain is not in langchain_community.chains.
# In 1.x the legacy chains are expected to live in the separate langchain-classic package.
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate

template = """<s><|user|>
Create a title for a story about {summary}. Only return the title.<|end|>
<|assistant|>"""
title_prompt = PromptTemplate(template=template, input_variables=["summary"])
title = LLMChain(llm=llm, prompt=title_prompt, output_key="title")

...
character = LLMChain(llm=llm, prompt=character_prompt, output_key="character")

...
story = LLMChain(llm=llm, prompt=story_prompt, output_key="story")

# Each chain adds its output_key to the dict passed on to the next chain
llm_chain = title | character | story
llm_chain.invoke("how are you?")

What a hassle.

The code from here on is based on the following

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="microsoft/Phi-3-mini-4k-instruct-gguf",
    filename="Phi-3-mini-4k-instruct-q4.gguf",
    n_gpu_layers=-1,
    n_ctx=4096,
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Hi! My name is Maarten. What is 1 + 1?"}
    ],
    max_tokens=500,  # a generation limit belongs on the call, not the constructor
)
print(response['choices'][0]['message']['content'])

Output

Hello Maarten! 1 + 1 equals 2. It's a basic arithmetic operation.

From llm you can get the tokenizer and input_ids, which is handy when debugging

llm.tokenizer().decode(llm.input_ids)

Output

"<|user|> Hi! My name is Maarten. What is 1 + 1?<|assistant|> Hello Maarten! 1 + 1 equals 2. It's a basic arithmetic operation."

This is the text actually exchanged with the llama-cpp model. The llm object exposes quite a bit of useful information: llm.token_bos() is 1, llm.token_eos() is 32000, and llm.token_nl() is 13. Decoding each with llm.tokenizer().decode([1]) and so on gives '', '', and \n respectively, so 1 and 32000 both decode to an empty string.

llm also has a generate() method. llm.metadata['tokenizer.chat_template'] holds the template it uses

{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|user|>' + '
' + message['content'] + '<|end|>' + '
' + '<|assistant|>' + '
'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|end|>' + '
'}}{% endif %}{% endfor %}

bos_token decodes to an empty string. A model's chat_template can also be viewed on the HuggingFace site, for example tokenizer.chat_template in Phi-3-mini-4k-instruct-q4.gguf, or google/gemma-4-E4B-it.
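To make the Jinja template easier to read, here is a plain-Python sketch that mirrors its logic. render_phi3_prompt() is a hypothetical helper written for these notes, not part of llama-cpp-python, and it assumes bos_token renders as an empty string as observed above.

```python
def render_phi3_prompt(messages, bos_token=""):
    """Mirror the chat template: only 'user' and 'assistant' roles are rendered."""
    parts = [bos_token]
    for message in messages:
        if message["role"] == "user":
            # A user turn always ends by opening the assistant turn.
            parts.append("<|user|>\n" + message["content"] + "<|end|>\n<|assistant|>\n")
        elif message["role"] == "assistant":
            parts.append(message["content"] + "<|end|>\n")
        # Any other role (e.g. "system") is skipped, exactly as in the template.
    return "".join(parts)

prompt = render_phi3_prompt([
    {"role": "user", "content": "Hey"},
    {"role": "assistant", "content": "Hey yourself!"},
    {"role": "user", "content": "What is 1 + 1?"},
])
print(prompt)
```

Two things stand out when the template is written this way: a system message would be dropped silently, and the trailing <|assistant|> marker is what prompts the model to answer.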

I was curious how the model would handle input that itself contains <|user|> or <|assistant|>. Changing the input above to

response = llm.create_chat_completion(messages=[
    {"role": "user", "content": "$<|user|>$<|assistant|>$<|other|>$. What is 1 + 1? give a concise answer"}
])

print(llm.input_ids.tolist())
llm.tokenizer().decode(llm.input_ids)

Output

[1, 32010, 395, 32010, 395, 32001, 395, 29966, 29989, 1228, 29989, 29958, 1504, 1724, 338, 29871, 29896, 718, 29871, 29896, 29973, 2367, 263, 3022, 895, 1234, 32007, 32001, 29871, 29896, 718, 29871, 29896, 15743, 29871, 29906, 29889, 0, 0, 0, 0, ...]
'<|user|> $<|user|> $<|assistant|> $<|other|>$. What is 1 + 1? give a concise answer<|assistant|> 1 + 1 equals 2.'

llm.input_ids is right-padded with 0 to a total length of 4096. Judging from the token ids and the decoded string, the final prompt does not distinguish a <|user|> typed by the user from the special token: the typed <|user|> and <|assistant|> were encoded as 32010 and 32001, the same ids as the template's own markers, while the unknown <|other|> was split into ordinary tokens.
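One possible mitigation, which is my own sketch and not from the book, is to neutralize reserved markers in user text before it reaches the chat template. The marker list below only covers the Phi-3 markers seen in these notes, and neutralize_markers() is a hypothetical helper. Whether the tokenizer then treats the altered text as ordinary tokens would still need to be checked against llm.input_ids.

```python
import re

# Markers seen in these notes for Phi-3; other models use different ones (assumption).
RESERVED_MARKERS = ("<|user|>", "<|assistant|>", "<|end|>", "<|endoftext|>")
_MARKER_RE = re.compile("|".join(re.escape(m) for m in RESERVED_MARKERS))

def neutralize_markers(text):
    """Insert a space after '<' so the exact special-token string no longer appears."""
    return _MARKER_RE.sub(lambda m: "< " + m.group(0)[1:], text)

user_text = "$<|user|>$<|assistant|>$<|other|>$. What is 1 + 1?"
print(neutralize_markers(user_text))
```

Unknown markers such as <|other|> are left alone, since the experiment above shows they are already tokenized as plain text.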

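LLM Memory

An LLM has no memory of its own. What looks like memory is entirely information supplied in the conversation: only if you give it the relevant background up front, or replay the conversation history (the exchanges between user and assistant), does it know what was discussed earlier. In other words, everything sent to the model in a given interaction is the complete prompt. As the context keeps growing, the history must be trimmed, or summarized by another LLM; in LangChain 1.0 and later both can be handled through Middleware.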
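A minimal sketch of the trimming idea, independent of LangChain: keep the system message and only the most recent turns. trim_history() and its max_messages parameter are hypothetical names. A real implementation would more likely budget by token count than by message count.

```python
def trim_history(messages, max_messages=6):
    """Keep all system messages plus the most recent `max_messages` other messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    recent = rest[-max_messages:] if max_messages > 0 else []
    return system + recent

# Build a toy history: one system message followed by five user/assistant turns.
history = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(1, 6):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_messages=4)
print([m["content"] for m in trimmed])
```

The trimmed list is what would be sent as the complete prompt on the next call. Anything cut off here is forgotten from the model's point of view.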

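Building an Agent

An LLM combined with external tool calls makes an agent. Its core driver is ReAct (Reasoning and Acting), which closes the loop in three stages: Thought, Action, Observation.

With LangChain 1.0 or later, tool calling is very simple using create_agent(), or init_chat_model() followed by bind_tools(). Below I instead try implementing tool calling by hand with llama_cpp and a system prompt.

First, choose a model that can use tools; here I pick unsloth/gemma-4-E4B-it-GGUF in GGUF format. I did not follow its chat_template (gemma-4 tokenizer.chat_template), which looks very complex.

The full code is as follows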

import re
import inspect

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="unsloth/gemma-4-E4B-it-GGUF",
    filename="gemma-4-E4B-it-Q4_0.gguf",
    n_gpu_layers=-1,
    n_ctx=8192,
    flash_attn=True,
    verbose=False,
)

# --- Tools: stub implementations that return fixed data ---

def get_my_location() -> str:
    print("  [get_location]")
    return 'Chicago'

def get_weather(city: str) -> str:
    print(f"  [get_weather] {city}")
    return f'It\'s sunny in {city}, temperature is 25°C.'

TOOLS = {"get_my_location": get_my_location, "get_weather": get_weather}

def _tool_signature(name: str, fn) -> str:
    params = list(inspect.signature(fn).parameters.keys())
    return f"{name}({', '.join(params)})"

# Not used below; handy for generating the tool list in the prompt
TOOL_SIGNATURES = {name: _tool_signature(name, fn) for name, fn in TOOLS.items()}

SYSTEM_PROMPT = f"""You are a helpful assistant. Answer the user's question using the available tools.

Available tools:
- get_my_location(): get my current location, returns the city name
- get_weather(city): get the weather for a given city, returns weather information

You must follow this loop until you have a final answer:

Thought: <your reasoning about what to do>
Action: <tool_name>(<argument>)
Observation: <tool result>

When you have enough information:
Thought: I now know the final answer.
Final Answer: <your answer>

Important:
- Call only one tool per step.
- Tool names must be exactly one of: {list(TOOLS.keys())}
- For tools with no parameters write: Action: get_my_location()
- For tools with parameters write positional values only, no keyword names: Action: get_weather(Chicago)
- Never fabricate an Observation. Always wait for the real result.
"""

def parse_action(text: str):
    """Extract (tool_name, single_argument) from an 'Action: name(arg)' line."""
    match = re.search(r"Action:\s*(\w+)\(([^)]*)\)", text)
    if match:
        arg = match.group(2).strip().strip("\"'")
        # Tolerate a keyword form such as city='Chicago'
        arg = re.sub(r"^\w+=", "", arg).strip().strip("\"'")
        return match.group(1), arg
    return None, None

def call_tool(name: str, arg: str):
    """Call the tool, passing the argument only if the function takes one."""
    fn = TOOLS[name]
    params = list(inspect.signature(fn).parameters.keys())
    if params:
        return fn(arg)
    return fn()

def react(question: str, max_steps: int = 6):
    print(f"\nQuestion: {question}\n")

    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
    # Accumulated Thought/Action/Observation text, replayed as the assistant turn
    scratchpad = ""

    for step in range(max_steps):
        current_messages = messages.copy()
        if scratchpad:
            current_messages.append({"role": "assistant", "content": scratchpad})

        response = llm.create_chat_completion(
            messages=current_messages,
            stop=["Observation:"],  # stop before the model invents a tool result
            max_tokens=256,
            temperature=0.1,
        )

        chunk = response["choices"][0]["message"]["content"].strip()
        print(f"--- step {step + 1} ---\n{chunk}")

        if "Final Answer:" in chunk:
            final = chunk.split("Final Answer:")[-1].strip()
            print(f"\n=== Final Answer: {final} ===")
            return final

        tool_name, tool_arg = parse_action(chunk)
        if tool_name and tool_name in TOOLS:
            observation = call_tool(tool_name, tool_arg)
        elif tool_name:
            observation = f"Unknown tool '{tool_name}'. Use one of: {list(TOOLS.keys())}"
        else:
            observation = "No valid Action found. Follow the format: Action: tool_name(argument)"

        print(f"Observation: {observation}\n")
        # Feed the real tool result back to the model in the next step
        scratchpad += chunk + f"\nObservation: {observation}\n"

    return "Max steps reached without a final answer."

react("Where am I, and how about the weather there?")

llm.close()

Below is the output of running the code

Question: Where am I, and how about the weather there?

--- step 1 ---
Thought: I need to find the user's current location first, and then use that location to get the weather. I will start by calling `get_my_location()`.
Action: get_my_location()
  [get_location]
Observation: Chicago

--- step 2 ---
Thought: I have the user's location, which is Chicago. Now I need to get the weather for Chicago using the `get_weather` tool.
Action: get_weather(Chicago)
  [get_weather] Chicago
Observation: It's sunny in Chicago, temperature is 25°C.

--- step 3 ---
Thought: I have successfully retrieved the user's location (Chicago) and the weather for that location (sunny, 25°C). I now have all the necessary information to answer the user's request.
Final Answer: You are in Chicago, and the weather there is sunny with a temperature of 25°C.

=== Final Answer: You are in Chicago, and the weather there is sunny with a temperature of 25°C. ===
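The parser in the script above handles at most one argument per Action. As a possible extension, the sketch below returns a list of positional arguments instead. parse_action_args() is a hypothetical name and the convert tool in the usage lines does not exist in the script. The function splits on commas, so an argument that itself contains a comma (such as "Chicago, IL") would be split incorrectly.

```python
import re

def parse_action_args(text):
    """Return (tool_name, [args]) from a line like 'Action: tool(a, "b")', or (None, [])."""
    match = re.search(r"Action:\s*(\w+)\(([^)]*)\)", text)
    if not match:
        return None, []
    raw = match.group(2).strip()
    if not raw:
        return match.group(1), []          # tool called with no arguments
    args = []
    for part in raw.split(","):
        part = re.sub(r"^\s*\w+\s*=", "", part)   # drop an accidental keyword name
        args.append(part.strip().strip("\"'"))    # drop whitespace and quotes
    return match.group(1), args

print(parse_action_args("Action: get_my_location()"))
print(parse_action_args("Thought: ok\nAction: get_weather(city='Chicago')"))
print(parse_action_args('Action: convert(25, "C", "F")'))  # hypothetical tool
```

With this shape, the dispatch step could call fn(*args), so tools with zero, one, or several parameters are handled without a branch.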