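"Hands-On Large Language Models" Reading Notes (8)
Chapter 9: Multimodal LLMs
In Part II of the book (Using Pretrained Models), apart from the chapters "Text Classification" and "Text Clustering and Topic Modeling", the other three ("Prompt Engineering", "Advanced Text Generation Techniques and Tools", "Semantic Search and RAG") mostly consolidated things I already knew. The last chapter of this part, "Multimodal LLMs", should teach me something about how models process images, which is new to me.
Multimodal means a model can handle several types of data, such as text, images, and audio, rather than text alone; a model that handles only images and audio is multimodal too. The Vision Transformer (ViT) has outperformed traditional convolutional neural networks (CNNs) on a number of image recognition tasks. Its core job is to turn unstructured image data into numerical representations that can be used for tasks such as classification.
For text, a Transformer encoder splits the input into tokens and encodes them as numerical representations, so the splitting is one-dimensional. ViT instead cuts an image into a grid of small patches along the horizontal and vertical directions, which loosely resembles the way a CNN slides a convolution kernel over an image to extract features.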

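A text vocabulary is finite, so every token can be mapped to a number. Image patches are far too varied to assign a token to every possible patch. ViT's approach is to flatten each patch and apply a linear embedding to all of them, converting the patches into embedding vectors in the same way text is embedded. These vectors, which carry semantic information, then serve as the standard input to the Transformer.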

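For example, following the paper "An Image is Worth 16x16 Words", the image is cut into patches of 16x16 pixels. Each patch is flattened into a vector and mapped by a linear transformation into a fixed-dimensional embedding space, and the resulting embedding vectors are fed into the Transformer.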
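The patch-and-project step can be sketched in a few lines of NumPy. This is only an illustration under my own assumptions: the 224x224 input size, the 512-wide embedding and the random `projection` matrix are placeholders, and a real ViT also adds position embeddings and a class token, which are left out here.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    rows, cols = h // patch_size, w // patch_size
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    grid = image.reshape(rows, patch_size, cols, patch_size, c).transpose(0, 2, 1, 3, 4)
    # One row per patch, ordered left to right, then top to bottom
    return grid.reshape(rows * cols, patch_size * patch_size * c)


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))         # stand-in for a real RGB image
patches = patchify(image, patch_size=16)  # 14 * 14 patches, 16 * 16 * 3 values each

embed_dim = 512                           # hypothetical embedding width
projection = rng.normal(size=(patches.shape[1], embed_dim)) * 0.02  # learned in a real model
patch_embeddings = patches @ projection   # one embedding vector per patch

print(patches.shape, patch_embeddings.shape)
```

Each row of `patch_embeddings` plays the role that a token embedding plays for text.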
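Once these image embeddings enter the encoder, the visual and text modalities follow exactly the same processing path, and this architectural uniformity lays an important foundation for building multimodal models. So how might video and audio be handled? Video can first be split into frames, with each frame turned into embedding vectors as described above. Audio can be converted into a spectrogram with a method such as the short-time Fourier transform (STFT) and then patched and embedded in a similar way. (That answer was auto-completed by the AI assistant while I was writing in IntelliJ IDEA.)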
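A multimodal embedding model produces embeddings for different modalities in the same vector space, so similarity across modalities, for example between a text and an image, can be compared directly. CLIP (Contrastive Language-Image Pre-training) is a typical multimodal embedding model.
CLIP computes both image embeddings and text embeddings. This cross-modal alignment lets CLIP and similar models support the following core use cases:
- Zero-shot classification: compare an image embedding with the embeddings of class description texts to classify accurately without task-specific training data.
- Semantic clustering: cluster a collection of images together with a set of keywords to reveal latent links between visual content and text concepts.
- Cross-modal retrieval: search text-to-image and image-to-text across large multimodal datasets.
- Generation guidance: drive image generation models (such as Stable Diffusion 3) toward better text-image alignment.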
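To make the zero-shot classification item concrete, here is a minimal sketch in plain Python. The three-dimensional embeddings and the label texts are made up for illustration; with a real model they would come from CLIP's image and text encoders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, label_embeddings):
    """Return the label whose text embedding is closest to the image embedding."""
    scores = {label: cosine(image_embedding, emb) for label, emb in label_embeddings.items()}
    return max(scores, key=scores.get), scores


# Hypothetical text embeddings for three class descriptions
label_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]  # hypothetical image embedding

best_label, scores = zero_shot_classify(image_embedding, label_embeddings)
print(best_label)
```

No classifier is trained here: adding a new class only means embedding one more description.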
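CLIP is trained on images paired with captions. It uses a dual-encoder architecture: the text encoder processes the caption and produces a text embedding, and the image encoder processes the image and produces an image embedding. After joint training, matching image-text pairs end up with closely aligned embeddings in the shared vector space, which means a high similarity score for matching pairs. CLIP training therefore has three steps:
- Run the image and the text through the image encoder and the text encoder to produce their embeddings
- Compute the similarity between the image embedding and the text embedding with cosine similarity
- Update the parameters of both encoders according to the expected similarity, so that matching image-text pairs move closer together in the shared vector space while non-matching pairs move apart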
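The three steps above can be sketched as a symmetric contrastive loss over a batch. This is my own simplified NumPy version, not CLIP's training code: the encoders are replaced by fixed toy embeddings, and the `temperature` value of 0.07 is just an illustrative constant (CLIP learns this scale during training).

```python
import numpy as np


def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def clip_style_loss(image_emb: np.ndarray, text_emb: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric cross-entropy over the image-text cosine similarity matrix."""
    # Step 2: normalise so that the dot product equals cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (batch, batch)

    # Step 3: pair i should score highest at position (i, i) in both directions
    idx = np.arange(len(logits))
    loss_image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float((loss_image_to_text + loss_text_to_image) / 2)


# Step 1 stand-in: toy embeddings instead of real encoder outputs
text_emb = np.eye(4, 8)                      # four captions, orthogonal to each other
aligned_images = text_emb.copy()             # each image matches its own caption
shuffled_images = np.roll(text_emb, 1, axis=0)  # each image matches the wrong caption

aligned_loss = clip_style_loss(aligned_images, text_emb)
shuffled_loss = clip_style_loss(shuffled_images, text_emb)
print(aligned_loss, shuffled_loss)
```

The loss is close to zero when pairs line up and large when they do not; minimizing it is what pulls matching pairs together and pushes the rest apart.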
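It looks as though pretraining always starts out as learning how to classify.
Hands-on with OpenCLIP
Download three images: puppy.png, cat.png, car.png
Then use the model "openai/clip-vit-base-patch32" to encode the images and the text and compute their similarity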
import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPProcessor, CLIPModel, CLIPTokenizer

# Load the model and its preprocessors
model_id = "openai/clip-vit-base-patch32"
clip_tokenizer = CLIPTokenizer.from_pretrained(model_id)
clip_processor = CLIPProcessor.from_pretrained(model_id)  # does for images what the tokenizer does for text
model = CLIPModel.from_pretrained(model_id)  # CLIPModel contains two encoders: a text encoder and an image encoder (ViT)


def image_caption_score(image_file: str, caption: str):
    image = Image.open(image_file).convert("RGB")

    inputs = clip_tokenizer(caption, return_tensors="pt")
    print("caption inputs: ", inputs)
    # caption inputs: {'input_ids': tensor([[49406, 320, 6829, 1629, 530, 518, 2583, 49407]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

    print("caption tokens: ", clip_tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
    # caption tokens: ['<|startoftext|>', 'a</w>', 'puppy</w>', 'playing</w>', 'in</w>', 'the</w>', 'snow</w>', '<|endoftext|>']

    # The whole caption becomes one 512-dimensional vector: this is the text embedding step.
    # Note: older transformers releases return the tensor directly; drop .pooler_output there.
    text_embedding = model.get_text_features(**inputs).pooler_output
    print("caption embedding shape: ", text_embedding.shape)
    # caption embedding shape: torch.Size([1, 512])

    processed_image = clip_processor(text=None, images=image, return_tensors="pt")["pixel_values"]
    print("processed_image shape: ", processed_image.shape)
    # processed_image shape: torch.Size([1, 3, 224, 224])
    # 1: batch, 3: the RGB channels, 224*224: the image was resized/cropped to 224*224

    # Save the sampled image region, a 224*224 RGB image.
    # Why 224*224? The model is "openai/clip-vit-base-patch32" with 32*32 patches,
    # so a 224*224 image holds 7*7 patches.
    save_sample_image(processed_image, image_file)

    # The 49 patches are flattened and encoded into one 512-dimensional vector: the image embedding step
    image_embedding = model.get_image_features(processed_image).pooler_output
    print("image_embedding shape: ", image_embedding.shape)
    # image_embedding shape: torch.Size([1, 512])

    # Compute the similarity between the two embeddings
    score = F.cosine_similarity(image_embedding, text_embedding)
    print(f"{image_file}->{caption}: {score.tolist()[0]:.2f}")


def save_sample_image(processed_image: torch.Tensor, image_file: str):
    img = processed_image.squeeze(0)  # [3, 224, 224], batch dimension removed
    img = img.permute(1, 2, 0).numpy()  # [224, 224, 3], CHW -> HWC (C: channel)

    # Min-max rescale the normalized pixel values to 0..255 so they can be saved as an image
    img_np = (img - img.min()) / (img.max() - img.min()) * 255
    img_np = img_np.astype(np.uint8)

    Image.fromarray(img_np).save("sample-" + image_file)


if __name__ == '__main__':
    image_files = ["puppy.png", "cat.png", "car.png"]
    captions = [
        "a puppy playing in the snow",
        "a pixelated image of a cute cat",
        "A supercar on the road \nwith the sunset in the background"
    ]
    for image_file in image_files:
        for caption in captions:
            image_caption_score(image_file, caption)
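This version keeps the code that re-encodes each image and caption on every call; in real use every image and caption only needs to be encoded once. The comments in the code describe each step in detail. The main idea is to encode both the image and the caption into the same 512-dimensional vector space and then compute their cosine similarity. The results are summarized below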






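Image preprocessing resizes the shorter side to 224 while keeping the aspect ratio and then center-crops to 224*224, so part of some images is cut away (other preprocessors may pad the edges with black instead). For images with an extreme aspect ratio only a small portion may be kept. For example, the first image in this article is 1384*330, and the sampled image looks like this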


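How much of a wide image survives can be estimated with a little arithmetic. The helper below is a hypothetical sketch that assumes the "resize shorter side to 224, then center-crop 224*224" behavior described above; it does not call the real image processor.

```python
def visible_fraction(width: int, height: int, target: int = 224) -> float:
    """Fraction of the image area kept after resizing the short side and center-cropping."""
    scale = target / min(width, height)
    resized_w, resized_h = round(width * scale), round(height * scale)
    # The crop keeps at most target pixels along each side
    kept_w, kept_h = min(target, resized_w), min(target, resized_h)
    return (kept_w * kept_h) / (resized_w * resized_h)


fraction = visible_fraction(1384, 330)
print(f"{fraction:.0%} of the 1384*330 image is kept")
```

For the 1384*330 image this comes to roughly a quarter of the picture, which matches the heavily cropped sample.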
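CLIP can also be loaded more simply through sentence-transformers
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# Same images and captions as in the previous script
image_files = ["puppy.png", "cat.png", "car.png"]
captions = [
    "a puppy playing in the snow",
    "a pixelated image of a cute cat",
    "A supercar on the road \nwith the sunset in the background"
]
images = [Image.open(image_file).convert("RGB") for image_file in image_files]

# Load the CLIP model packaged for sentence-transformers
model = SentenceTransformer("clip-ViT-B-32")

# Encode images and captions into the same vector space
image_embeddings = model.encode(images)
text_embeddings = model.encode(captions)

# Pairwise cosine similarity: rows are images, columns are captions
sim_matrix = util.cos_sim(image_embeddings, text_embeddings)

print(sim_matrix)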
The output shows the same similarity values
tensor([[0.3315, 0.1863, 0.1084],
        [0.1488, 0.3463, 0.0947],
        [0.0762, 0.1260, 0.3098]])
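Reading the matrix row by row (rows are images, columns are captions), the largest value in each row sits on the diagonal. A small sketch that picks the best caption per image, using the numbers printed above copied in as plain Python lists:

```python
image_files = ["puppy.png", "cat.png", "car.png"]
captions = [
    "a puppy playing in the snow",
    "a pixelated image of a cute cat",
    "A supercar on the road \nwith the sunset in the background",
]
# Similarity values copied from the printed tensor
sim_matrix = [
    [0.3315, 0.1863, 0.1084],
    [0.1488, 0.3463, 0.0947],
    [0.0762, 0.1260, 0.3098],
]

best_caption = {}
for image_file, row in zip(image_files, sim_matrix):
    best_index = max(range(len(row)), key=row.__getitem__)  # argmax over captions
    best_caption[image_file] = captions[best_index]

for image_file, caption in best_caption.items():
    print(f"{image_file}: {caption!r}")
```

The absolute scores are modest (around 0.3), so it is the ranking within a row that carries the signal here.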
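BLIP-2: giving a conventional language model visual understanding
BLIP-2 (Bootstrapping Language-Image Pre-training) builds a bridge called the Querying Transformer (Q-Former) that connects a pretrained vision encoder to a pretrained LLM. Instead of rebuilding the whole architecture, only the Q-Former module needs to be trained. The full process is

Preprocessing multimodal inputs
When text and an image are provided together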
from PIL import Image
from transformers import AutoProcessor, Blip2ForConditionalGeneration

model_name = "Salesforce/blip2-opt-2.7b"
blip_processor = AutoProcessor.from_pretrained(model_name)
# "mps" targets Apple Silicon GPUs; use "cuda" or "cpu" on other machines
model = Blip2ForConditionalGeneration.from_pretrained(model_name).to("mps")
# print(model.vision_model, model.language_model)
print(model)

# Preprocess the image into a fixed-size pixel tensor
image = Image.open("car.png").convert("RGB")
image_pixels = blip_processor(images=image, return_tensors="pt").to("mps")["pixel_values"]
print(image_pixels.shape)  # torch.Size([1, 3, 224, 224])

# Tokenize the text and inspect the individual tokens
text = "Her vocalization was remarkably melodic"
token_ids = blip_processor(text=text, return_tensors="pt").to("mps")["input_ids"][0]
tokens = blip_processor.tokenizer.convert_ids_to_tokens(token_ids)
print(tokens)  # ['</s>', 'Her', 'Ġvocal', 'ization', 'Ġwas', 'Ġremarkably', 'Ġmel', 'odic']
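The model contains both a vision_model and a language_model. Here the image and the text were processed separately, but a single blip_processor() call can also handle both at once
inputs = blip_processor(text=text, images=image, return_tensors="pt").to("mps")
The special symbol Ġ in the tokens stands for a space: it shows that the current token was preceded by a space rather than attached directly to the previous token. The space character (code 32) is deliberately shifted by 256 to 288, which is Ġ, so that the positions of spaces are clearly visible when tokens are printed.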
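The arithmetic is easy to check in plain Python. The `tokens_to_text` helper is my own illustration of undoing the marker, not the tokenizer's actual decoding routine.

```python
# Space is code point 32; shifting it by 256 gives 288 (U+0120), which prints as Ġ
space_marker = chr(ord(" ") + 256)
print(space_marker, ord(space_marker), hex(ord(space_marker)))


def tokens_to_text(tokens):
    """Join byte-level BPE tokens and turn the Ġ marker back into spaces."""
    return "".join(tokens).replace(space_marker, " ")


# Tokens from the output above, without the leading special token
tokens = ["Her", "Ġvocal", "ization", "Ġwas", "Ġremarkably", "Ġmel", "odic"]
print(tokens_to_text(tokens))
```

Tokens without the marker, such as "ization" and "odic", are glued onto the token before them.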
Printing the model shows its layer hierarchy
Blip2ForConditionalGeneration(
  (vision_model): Blip2VisionModel(
    (embeddings): Blip2VisionEmbeddings(
      (patch_embedding): Conv2d(3, 1408, kernel_size=(14, 14), stride=(14, 14))
    )
    (encoder): Blip2Encoder(
      (layers): ModuleList(
        (0-38): 39 x Blip2EncoderLayer(
          (self_attn): Blip2Attention(
            (qkv): Linear(in_features=1408, out_features=4224, bias=True)
            (projection): Linear(in_features=1408, out_features=1408, bias=True)
          )
          (layer_norm1): LayerNorm((1408,), eps=1e-06, elementwise_affine=True)
          (mlp): Blip2MLP(
            (activation_fn): GELUActivation()
            (fc1): Linear(in_features=1408, out_features=6144, bias=True)
            (fc2): Linear(in_features=6144, out_features=1408, bias=True)
          )
          (layer_norm2): LayerNorm((1408,), eps=1e-06, elementwise_affine=True)
        )
      )
    )
    (post_layernorm): LayerNorm((1408,), eps=1e-06, elementwise_affine=True)
  )
  (qformer): Blip2QFormerModel(
    (layernorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (encoder): Blip2QFormerEncoder(
      (layer): ModuleList(
        (0): Blip2QFormerLayer(
          (attention): Blip2QFormerAttention(
            (attention): Blip2QFormerMultiHeadAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): Blip2QFormerSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (crossattention): Blip2QFormerAttention(
            (attention): Blip2QFormerMultiHeadAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=1408, out_features=768, bias=True)
              (value): Linear(in_features=1408, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): Blip2QFormerSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate_query): Blip2QFormerIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output_query): Blip2QFormerOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (1): Blip2QFormerLayer(
          (attention): Blip2QFormerAttention(
            (attention): Blip2QFormerMultiHeadAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): Blip2QFormerSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate_query): Blip2QFormerIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output_query): Blip2QFormerOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (2): Blip2QFormerLayer(
          (attention): Blip2QFormerAttention(
            (attention): Blip2QFormerMultiHeadAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): Blip2QFormerSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (crossattention): Blip2QFormerAttention(
            (attention): Blip2QFormerMultiHeadAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=1408, out_features=768, bias=True)
              (value): Linear(in_features=1408, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): Blip2QFormerSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate_query): Blip2QFormerIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output_query): Blip2QFormerOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (3): Blip2QFormerLayer(
          (attention): Blip2QFormerAttention(
            (attention): Blip2QFormerMultiHeadAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): Blip2QFormerSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate_query): Blip2QFormerIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output_query): Blip2QFormerOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (4): Blip2QFormerLayer(
          (attention): Blip2QFormerAttention(
            (attention): Blip2QFormerMultiHeadAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): Blip2QFormerSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (crossattention): Blip2QFormerAttention(
            (attention): Blip2QFormerMultiHeadAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=1408, out_features=768, bias=True)
              (value): Linear(in_features=1408, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): Blip2QFormerSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate_query): Blip2QFormerIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output_query): Blip2QFormerOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (5): Blip2QFormerLayer(
          (attention): Blip2QFormerAttention(
            (attention): Blip2QFormerMultiHeadAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): Blip2QFormerSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate_query): Blip2QFormerIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output_query): Blip2QFormerOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (6): Blip2QFormerLayer(
          (attention): Blip2QFormerAttention(
            (attention): Blip2QFormerMultiHeadAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): Blip2QFormerSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (crossattention): Blip2QFormerAttention(
            (attention): Blip2QFormerMultiHeadAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=1408, out_features=768, bias=True)
              (value): Linear(in_features=1408, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): Blip2QFormerSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate_query): Blip2QFormerIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output_query): Blip2QFormerOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (7): Blip2QFormerLayer(
          (attention): Blip2QFormerAttention(
            (attention): Blip2QFormerMultiHeadAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): Blip2QFormerSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate_query): Blip2QFormerIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output_query): Blip2QFormerOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (8): Blip2QFormerLayer(
          (attention): Blip2QFormerAttention(
            (attention): Blip2QFormerMultiHeadAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): Blip2QFormerSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (crossattention): Blip2QFormerAttention(
            (attention): Blip2QFormerMultiHeadAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=1408, out_features=768, bias=True)
              (value): Linear(in_features=1408, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): Blip2QFormerSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate_query): Blip2QFormerIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output_query): Blip2QFormerOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (9): Blip2QFormerLayer(
          (attention): Blip2QFormerAttention(
            (attention): Blip2QFormerMultiHeadAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): Blip2QFormerSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate_query): Blip2QFormerIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output_query): Blip2QFormerOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (10): Blip2QFormerLayer(
          (attention): Blip2QFormerAttention(
            (attention): Blip2QFormerMultiHeadAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): Blip2QFormerSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (crossattention): Blip2QFormerAttention(
            (attention): Blip2QFormerMultiHeadAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=1408, out_features=768, bias=True)
              (value): Linear(in_features=1408, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): Blip2QFormerSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate_query): Blip2QFormerIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output_query): Blip2QFormerOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (11): Blip2QFormerLayer(
          (attention): Blip2QFormerAttention(
            (attention): Blip2QFormerMultiHeadAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): Blip2QFormerSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate_query): Blip2QFormerIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output_query): Blip2QFormerOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
  )
  (language_projection): Linear(in_features=768, out_features=2560, bias=True)
  (language_model): OPTForCausalLM(
    (model): OPTModel(
      (decoder): OPTDecoder(
        (embed_tokens): Embedding(50304, 2560, padding_idx=1)
        (embed_positions): OPTLearnedPositionalEmbedding(2050, 2560)
        (final_layer_norm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (layers): ModuleList(
          (0-31): 32 x OPTDecoderLayer(
            (self_attn): OPTAttention(
              (k_proj): Linear(in_features=2560, out_features=2560, bias=True)
              (v_proj): Linear(in_features=2560, out_features=2560, bias=True)
              (q_proj): Linear(in_features=2560, out_features=2560, bias=True)
              (out_proj): Linear(in_features=2560, out_features=2560, bias=True)
            )
            (activation_fn): ReLU()
            (self_attn_layer_norm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=2560, out_features=10240, bias=True)
            (fc2): Linear(in_features=10240, out_features=2560, bias=True)
            (final_layer_norm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
          )
        )
      )
    )
    (lm_head): Linear(in_features=2560, out_features=50304, bias=False)
  )
)
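The widths in this printout (1408 for the vision encoder, 768 for the Q-Former, 2560 for the OPT language model) trace how image information reaches the LLM. Below is a rough NumPy sketch of that data flow using a single cross-attention step with random weights. The 32 query tokens and the 257 image tokens (16 * 16 patches plus a class token) are my assumptions based on BLIP-2's published configuration rather than anything visible in the printout, and the real Q-Former adds self-attention, multiple heads and twelve layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Vision encoder output: one 1408-wide vector per image token (assumed 257 tokens)
num_image_tokens = (224 // 14) ** 2 + 1
image_features = rng.normal(size=(num_image_tokens, 1408))

# Learnable queries of the Q-Former (assumed 32 of them, width 768)
queries = rng.normal(size=(32, 768))

# Cross-attention: queries come from the Q-Former, keys/values from the image features
w_query = rng.normal(size=(768, 768)) * 0.02
w_key = rng.normal(size=(1408, 768)) * 0.02    # matches (key): Linear(1408 -> 768)
w_value = rng.normal(size=(1408, 768)) * 0.02  # matches (value): Linear(1408 -> 768)

q = queries @ w_query
k = image_features @ w_key
v = image_features @ w_value

scores = q @ k.T / np.sqrt(q.shape[1])
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)  # softmax over the image tokens
attended = weights @ v                         # each query summarizes the image

# language_projection: map Q-Former outputs to the LLM embedding width
w_projection = rng.normal(size=(768, 2560)) * 0.02
llm_inputs = attended @ w_projection

print(image_features.shape, attended.shape, llm_inputs.shape)
```

However large the image feature map is, the language model only ever receives a fixed, small number of projected query vectors, placed alongside its text token embeddings.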
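Below are two use cases that involve an image: generating a caption for a given image, and chat-style multimodal prompting
Generating an image caption
Feed in the image car.png and let the model output a description of it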
from PIL import Image
from transformers import AutoProcessor, Blip2ForConditionalGeneration

model_name = "Salesforce/blip2-opt-2.7b"
blip_processor = AutoProcessor.from_pretrained(model_name)
# "mps" targets Apple Silicon GPUs; use "cuda" or "cpu" on other machines
model = Blip2ForConditionalGeneration.from_pretrained(model_name).to("mps")

# Preprocess the image only; no text prompt is given
image = Image.open("car.png").convert("RGB")
inputs = blip_processor(images=image, return_tensors="pt").to("mps")

# Generate caption token IDs and decode them back into text
generated_ids = model.generate(**inputs, max_new_tokens=50)
generated_text = blip_processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text[0].strip())
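The output is
an orange supercar driving on the road at sunset
That looks right.
Chat-style multimodal prompting
This is similar to the code above, except that blip_processor() also receives a prompt through the text parameter so that the model answers a question. The full code is not repeated here, and the image is switched to cat.png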
image = Image.open("cat.png").convert("RGB")

# The prompt is passed together with the image through the text parameter
prompt = "Question: Write down what you see in this picture. Answer:"
inputs = blip_processor(images=image, text=prompt, return_tensors="pt").to("mps")

generated_ids = model.generate(**inputs, max_new_tokens=50)
generated_text = blip_processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text[0].strip())
The output is
Question: Write down what you see in this picture. Answer: a cat
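In the output above the decoded text still begins with the prompt. If only the answer is wanted, the prompt can be trimmed off afterwards. `strip_prompt` is a hypothetical helper of my own, shown here on the exact strings from this example.

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Remove a leading copy of the prompt from decoded model output, if present."""
    text = generated_text.strip()
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


prompt = "Question: Write down what you see in this picture. Answer:"
generated = "Question: Write down what you see in this picture. Answer: a cat"

answer = strip_prompt(generated, prompt)
print(answer)
```

Text that does not start with the prompt is returned unchanged apart from surrounding whitespace.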
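Summary
This chapter covered the core mechanism that gives LLMs multimodal capabilities. I learned how ViT encodes an image: it cuts the image into a grid from left to right and top to bottom, flattens the patches, and converts them into embedding vectors. These are then placed in the same vector space as the embeddings of the associated text, so images can be handled much like text for classification, similarity computation, and so on. BLIP-2 introduces the Q-Former module to connect vision with a conventional LLM.
[Copyright notice]
This article is licensed under Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).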