IT 从业人员累的一个原因是要紧跟时代步伐,甚至是被拽着赶,更别说福报 996。从早先的 CGI, ASP, PHP, 到 Java, .Net(Java 开发又有 Spring, Hibernate),再到云时代的 AWS, Azure,程序员一路奔波在掌握各种工具的路上。而如今言必提的 AI 模型更是时髦,n B 参数、量化、微调、ML, LLM, NLP, AGI, RAG, Token, LoRA 等一众词更让人坠入云里雾里。
去年以机器学习为名买的(游戏机)一直未被正名,机器配置为 CPU i9-13900F + 内存 64G + 显卡 RTX 4090,进门之后完全处于游戏状态,花了数百小时对《黑神话》进行了几番测试。
现在要好好用它的 GPU 来体验一下 Meta 开源的 AI 模型。切换到 Ubuntu 20.04 操作系统,用 transformers 的方式试了下两个模型,分别是
- Llama-3.1-8B-Instruct: 显存使用了 16G,它的老版本的模型是 Meta-Llama-3-8B-Instruct(支持中文问话,输出是英文)
- Llama-3.2-11B-Vision-Instruct: 显存峰值到了 22.6G(可以分析图片的内容)
两者都使用 torch_dtype=torch.bfloat16,对于 24G 显存的 4090 还用不着主内存来帮忙。如果用 float32 则需要更多显存,例如对 Llama-3.2-11B-Vision-Instruct 使用 float32 时就要求助于主内存,将看到
Some parameters are on the meta device because they were offloaded to the cpu.
反之,对原始模型降低精度,量化成 8 位或 4 位则更节约显存,这是后话。这里主要记述使用上面 Llama-3.1-8B-Instruct 模型的过程,顺便感受一下它的强大,可别小瞧了这个 8B 的小家伙。经过量化后,1B 级别的模型甚至能在手机上轻松离线运行。
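顺着这个思路,下面给出一个加载时做 4 位量化的简单草图(假设已另外安装 bitsandbytes,参数取值仅作示意,并非本文实际运行的配置):

```python
import torch
import transformers
from transformers import BitsAndBytesConfig

# 4 位量化配置(假设性示例,需要 pip install bitsandbytes)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

pipeline = transformers.pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    model_kwargs={"quantization_config": bnb_config},
    device_map="auto",
)
```

一般来说,4 位量化后的显存占用会远低于 bfloat16,代价是精度略有损失。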
首先进到 Hugging Face 的 meta-llama/Llama-3.1-8B-Instruct 页面,它提供了 Use with transformers, Tool use with transformers, llama3 几种使用方式,本文采用第一种方式。
配置 huggingface-cli
运行代码后会去下载该模型文件,但我们须事先安装好 huggingface-cli 并配置好 token
```
pip install -U "huggingface_hub[cli]"
```
然后就可使用 huggingface-cli 命令了,如果不行请把
```
export PATH="$PATH:$(python3 -m site --user-base)/bin"
```
加到所用 shell 的配置文件中去
执行
```
huggingface-cli login
```
到这里 https://huggingface.co/settings/tokens 配置一个 token 并用来 login, 只需做一次即可。
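如果不想用交互式的 login,也可以在代码里直接用 huggingface_hub 登录,下面是一个小示例(token 值仅为占位示意):

```python
from huggingface_hub import login

# 用 https://huggingface.co/settings/tokens 生成的 token 登录(示意值,请替换成自己的)
login(token="hf_xxxxxxxxxxxxxxxx")
```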
以后用 llama.cpp 或 java-llama.cpp 时可用 huggingface-cli 来下载模型。
申请 Llama-3.1-8B-Instruct 的访问权限
进到页面 Llama-3.1-8B-Instruct,点击 "Expand to review and access" 填入信息申请访问权限,正常情况下几分钟后就会被批准,如果超过十分钟未通过审核就需要好好审核下自己了。
创建 Python 虚拟环境
本机操作系统是 Ubuntu 20.04,Python 版本为 3.12,Nvidia RTX 显卡的 CUDA 版本是 12.2(运行 nvidia-smi 可看到)。从模型 Llama-3.1-8B-Instruct 的示例代码中可知需要的基本依赖是 torch 和 transformers,安装支持 GPU 的 torch 组件
```
python -m venv .venv
source .venv/bin/activate
pip install torch
pip install transformers accelerate
```
在运行后面 test-llama-3-8b.py 脚本时会发现还需要 accelerate 组件,所以上面加上了 accelerate 依赖
```
ImportError: Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate: `pip install 'accelerate>=0.26.0'`
```
验证 pytorch 是否使用 GPU
```
python -c 'import torch; print(torch.cuda.is_available()); print(torch.version.cuda)'
True
12.4
```
看到 True 即说明 GPU 可用,12.4 为当前 torch 所支持的 CUDA 版本。
相当于安装 torch 时用的命令
```
pip install torch --index-url https://download.pytorch.org/whl/cu124
```
即 pip 自动选择了一个与本机环境兼容的 torch 版本。
创建 test-llama-3-8b.py
内容来自 Llama-3.1-8B-Instruct 的 Use with transformers 并稍作修改
```python
import transformers
import torch

model_id = "meta-llama/Llama-3.1-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "Welcome to Llama model!"},
    {"role": "user", "content": "USA 5 biggest cities and population?"},
]

outputs = pipeline(
    messages,
    max_new_tokens=1024,
)
print(outputs[0]["generated_text"][-1]['content'])
```
原代码中的 model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct" 与当前仓库名不一致,改为 meta-llama/Llama-3.1-8B-Instruct;另外修改了提示和问题,调整了 max_new_tokens,并只输出答案的 'content' 部分。
运行 test-llama-3-8b.py
```
python test-llama-3-8b.py
```
然后就陷入了下载模型的等待时间,下载的快慢就看你的网速,或者你的电脑与 Hugging Face 之间是否多了堵东西。
我在本机下载过程不到十分钟,下载完即执行,看这次结果
```
model-00001-of-00004.safetensors: 100%|██████████| 4.98G/4.98G [02:24<00:00, 30.8MB/s]
model-00002-of-00004.safetensors: 100%|██████████| 5.00G/5.00G [02:45<00:00, 30.3MB/s]
model-00003-of-00004.safetensors: 100%|██████████| 4.92G/4.92G [02:45<00:00, 29.7MB/s]
model-00004-of-00004.safetensors: 100%|██████████| 1.17G/1.17G [00:47<00:00, 24.8MB/s]
Downloading shards: 100%|██████████| 4/4 [08:42<00:00, 130.50s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00,  2.45it/s]
generation_config.json: 100%|██████████| 184/184 [00:00<00:00, 1.20MB/s]
tokenizer_config.json: 100%|██████████| 55.4k/55.4k [00:00<00:00, 2.38MB/s]
tokenizer.json: 100%|██████████| 9.09M/9.09M [00:00<00:00, 13.9MB/s]
special_tokens_map.json: 100%|██████████| 296/296 [00:00<00:00, 2.08MB/s]
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Based on the available data as of my cut-off knowledge in 2023, the top 5 biggest cities in the United States by population are:

1. **New York City, NY**: approximately 8,804,190 people
New York City is the most populous city in the United States and is located in the state of New York. It is a global hub for finance, culture, and entertainment.

2. **Los Angeles, CA**: approximately 3,898,747 people
Los Angeles is the second-most populous city in the United States and is located in the state of California. It is a major center for the film and television industry, as well as a hub for technology and innovation.

3. **Chicago, IL**: approximately 2,693,976 people
Chicago is the third-most populous city in the United States and is located in the state of Illinois. It is a major hub for finance, commerce, and culture, and is known for its vibrant music and arts scene.

4. **Houston, TX**: approximately 2,355,386 people
Houston is the fourth-most populous city in the United States and is located in the state of Texas. It is a major center for the energy industry and is home to the Johnson Space Center, where NASA's mission control is located.

5. **Phoenix, AZ**: approximately 1,708,025 people
Phoenix is the fifth-most populous city in the United States and is located in the state of Arizona. It is a major hub for the technology and healthcare industries, and is known for its desert landscapes and warm climate.

Please note that these numbers are estimates based on data from the United States Census Bureau for 2020, and may have changed slightly since then due to population growth and other factors.
```
第二次执行 python test-llama-3-8b.py
```
python test-llama-3-8b.py
Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00,  2.34it/s]
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Based on the available data, here are the top 5 biggest cities in the USA by population:

1. **New York City, NY**: Approximately 8,804,190 people (as of 2020 United States Census)
2. **Los Angeles, CA**: Approximately 3,898,747 people (as of 2020 United States Census)
3. **Chicago, IL**: Approximately 2,670,504 people (as of 2020 United States Census)
4. **Houston, TX**: Approximately 2,355,386 people (as of 2020 United States Census)
5. **Phoenix, AZ**: Approximately 1,708,025 people (as of 2020 United States Census)

Please note that these numbers may have changed slightly since the last census in 2020, but these are the most up-to-date figures available.
```
没有了下载过程, 但每次都有 "Loading checkpoint shards" 过程,生成不同的结果,总的执行时间是 6.7 秒左右
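之所以每次结果不同,是因为默认按采样方式生成。如果想让结果可复现,可以固定随机种子,或在生成时关闭采样,下面是一个小示例(种子值仅为示意):

```python
import transformers

# 固定随机种子,使采样生成的结果可复现(42 仅为示意值)
transformers.set_seed(42)

# 也可以在调用时关闭采样,改用贪心解码(接在上文脚本的 pipeline 调用处):
# outputs = pipeline(messages, max_new_tokens=1024, do_sample=False)
```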
模型文件下载到哪里?
在 ~/.cache/huggingface/hub/*
```
du -sh ~/.cache/huggingface/hub/*
15G     /home/yanbin/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct
20G     /home/yanbin/.cache/huggingface/hub/models--meta-llama--Llama-3.2-11B-Vision-Instruct
15G     /home/yanbin/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct
4.0K    /home/yanbin/.cache/huggingface/hub/version.txt
```
具体模型文件在各自的 snapshots 目录中
```
ls -Llh ~/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/
total 15G
-rw-rw-r-- 1 yanbin yanbin  855 Oct 30 13:10 config.json
-rw-rw-r-- 1 yanbin yanbin  184 Oct 30 13:23 generation_config.json
-rw-rw-r-- 1 yanbin yanbin 4.7G Oct 30 13:17 model-00001-of-00004.safetensors
-rw-rw-r-- 1 yanbin yanbin 4.7G Oct 30 13:20 model-00002-of-00004.safetensors
-rw-rw-r-- 1 yanbin yanbin 4.6G Oct 30 13:22 model-00003-of-00004.safetensors
-rw-rw-r-- 1 yanbin yanbin 1.1G Oct 30 13:23 model-00004-of-00004.safetensors
-rw-rw-r-- 1 yanbin yanbin  24K Oct 30 13:14 model.safetensors.index.json
-rw-rw-r-- 1 yanbin yanbin  296 Oct 30 13:23 special_tokens_map.json
-rw-rw-r-- 1 yanbin yanbin  55K Oct 30 13:23 tokenizer_config.json
-rw-rw-r-- 1 yanbin yanbin 8.7M Oct 30 13:23 tokenizer.json
```
下载的文件是在 https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct/tree/main 能看到的、不包含 original 目录的文件。.safetensors 是 Hugging Face 推出的模型存储格式;original 目录中是 PyTorch 训练出来的原始 .pth 格式文件。如果用 huggingface-cli download 手动下载,就要用 --exclude "original/*" 只下载 .safetensors,或用 --include "original/*" 只下载 original 目录中的原始文件。比如以 llama.cpp 或 java-llama.cpp 方式使用模型时就需要 original 文件,然后转换成 GGUF(GPT-Generated Unified Format)格式。
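如果更习惯在 Python 里下载,也可以用 huggingface_hub 的 snapshot_download 做同样的事,下面是一个简单示例(排除 original 目录,与上述 --exclude "original/*" 的效果类似,仅作示意):

```python
from huggingface_hub import snapshot_download

# 只下载 .safetensors 等文件,跳过 original 目录
local_dir = snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",
    ignore_patterns=["original/*"],
)
print(local_dir)  # 默认缓存到 ~/.cache/huggingface/hub 下对应的 snapshots 目录
```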
想要相应 GGUF 格式的模型也用不着自己手工转换,在 Hugging Face 上能找到别人转换好的,比如 bartowski/Llama-3.2-1B-Instruct-GGUF。
一个模型动不动十几、几十 G,大参数的模型更不得了,看来有硬盘危机了,又不能用机械硬盘,是时候加个 4T 的 SSD 了。
多测试些问题
为避免重复加载模型,我们修改代码,问题可由两种方式输入。
- 非交互方式:启动脚本时直接输入,每次加载模型,Python 执行完立即退出,如 python test-llama-3-8b.py "who are you?"
- 交互方式:启动脚本时不带参数,只需加载一次模型,进入用户交互模式,直到输入 exit 后才退出程序,python test-llama-3-8b.py
并加上模型加载和推理的执行时间测量代码,在设定了环境变量 PY_DEBUG=true 时会打印出相应的执行时间,方便我们对比不同情形下的性能。
改造后的代码如下
```python
import transformers
import torch
import sys
import os
from time import time

DEBUG = os.getenv("PY_DEBUG", "false")

model_id = "meta-llama/Llama-3.1-8B-Instruct"

start = time()
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)
if DEBUG == "true":
    print(f'loading model: {time() - start}')


def infer(question):
    start = time()
    messages = [
        {"role": "system", "content": "Welcome to Llama model!"},
        {"role": "user", "content": question},
    ]
    outputs = pipeline(
        messages,
        max_new_tokens=1024,
    )
    if DEBUG == "true":
        print(f'inference: {time() - start}')
    print(outputs[0]["generated_text"][-1]['content'])


user_input = sys.argv[1] if len(sys.argv) > 1 else None

if user_input:
    infer(user_input)
else:
    while True:
        user_input = input("Your input (type 'exit' to quit): ")
        if user_input.lower() == 'exit':
            print('Bye!')
            break
        infer(user_input)
```
执行
```
python test-llama-3-8b.py
Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00,  2.48it/s]
Your input (type 'exit' to quit):
```
在装载了模型后,测试相同的问题:USA 5 biggest cities and population?
返回结果只需要 2.5 秒,比完全要加载模型来回答问题的 6.7 秒快了许多,节约了加载模型的 4.2 秒的时间。
试试中文, 输入
```
Your input (type 'exit' to quit): 中国国歌的歌名是什么?
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
中国国歌的歌名是《义勇军进行曲》。它于1935年由中国作曲家徐6765(音)写曲,词曲由汪精卫创作。
Your input (type 'exit' to quit):
```
中文也能理解,还能输出中文,而前一版本的模型 meta-llama/Meta-Llama-3-8B-Instruct 只能理解中文(输入),但不会说与写(输出)
来点技术问题
````
Your input (type 'exit' to quit): write code to read s3 object in Python
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
**Reading an S3 Object in Python**
=====================================

You can use the AWS SDK for Python (Boto3) to read an S3 object. Here's a step-by-step example:

**Prerequisites**
---------------

* You have an AWS account and have set up an S3 bucket.
* You have the AWS CLI installed and configured on your machine.
* You have the Boto3 library installed in your Python environment. You can install it using pip: `pip install boto3`

**Code**
------

```python
import boto3
from botocore.exceptions import NoCredentialsError

# Initialize the S3 client
s3 = boto3.client('s3')

def read_s3_object(bucket_name, object_key):
    """
    Read an S3 object.

    Args:
        bucket_name (str): The name of the S3 bucket.
        object_key (str): The key of the S3 object.

    Returns:
        str: The contents of the S3 object as a string.
    """
    try:
        # Get the object from S3
        response = s3.get_object(Bucket=bucket_name, Key=object_key)

        # Get the contents of the object
        contents = response['Body'].read().decode('utf-8')

        return contents
    except NoCredentialsError:
        print("Credentials not available")
        return None
    except Exception as e:
        print(f"Error reading S3 object: {e}")
        return None

# Example usage
bucket_name = 'my-bucket'
object_key = 'path/to/object.txt'

contents = read_s3_object(bucket_name, object_key)
if contents:
    print(contents)
```

**Explanation**
--------------

1. We initialize the S3 client using `boto3.client('s3')`.
2. We define a function `read_s3_object` that takes the bucket name and object key as arguments.
3. We use the `get_object` method to retrieve the S3 object from the specified bucket and key.
4. We get the contents of the object using `response['Body'].read().decode('utf-8')`.
5. We return the contents as a string.
6. We handle exceptions, including `NoCredentialsError` and any other exceptions that may occur.

**Note**
----

Make sure to replace `my-bucket` and `path/to/object.txt` with your actual bucket name and object key. Also, ensure that you have the necessary credentials set up to access your S3 bucket.
Your input (type 'exit' to quit):
````
它能懂多少编程语言?
```
Your input (type 'exit' to quit): what programming languages do you know?
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
I have been trained on a wide range of programming languages, including but not limited to:

1. **Python**: I have knowledge of popular Python libraries such as NumPy, pandas, and scikit-learn, as well as popular frameworks like TensorFlow and PyTorch.
2. **Java**: I'm familiar with Java syntax, object-oriented programming, and popular libraries like Spring and Hibernate.
3. **JavaScript**: I have knowledge of JavaScript syntax, DOM manipulation, and popular libraries like React and Angular.
4. **C++**: I'm familiar with C++ syntax, object-oriented programming, and popular libraries like STL and Boost.
5. **C#**: I have knowledge of C# syntax, object-oriented programming, and popular frameworks like.NET and Unity.
6. **Go**: I'm familiar with Go syntax, concurrency, and popular libraries like Gorilla and Revel.
7. **Ruby**: I have knowledge of Ruby syntax, object-oriented programming, and popular frameworks like Ruby on Rails.
8. **Swift**: I'm familiar with Swift syntax, object-oriented programming, and popular frameworks like iOS and macOS development.
9. **PHP**: I have knowledge of PHP syntax, web development, and popular frameworks like Laravel and CodeIgniter.
10. **SQL**: I'm familiar with SQL syntax, database design, and popular databases like MySQL and PostgreSQL.
11. **TypeScript**: I have knowledge of TypeScript syntax, object-oriented programming, and popular frameworks like Angular and React.
12. **Kotlin**: I'm familiar with Kotlin syntax, object-oriented programming, and popular frameworks like Android development.
13. **R**: I have knowledge of R syntax, statistical computing, and popular libraries like dplyr and ggplot2.
14. **Haskell**: I'm familiar with Haskell syntax, functional programming, and popular libraries like Yesod and Hoogle.
15. **MATLAB**: I have knowledge of MATLAB syntax, numerical computing, and popular libraries like Statistics and Signal Processing Toolbox.

Please note that while I have knowledge of these languages, my proficiency may vary depending on the specific task or problem you're trying to solve.
Your input (type 'exit' to quit):
```
它的信息够新吗?
```
Your input (type 'exit' to quit): what latest versions of spring and springboot
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
As of my knowledge cutoff in 2023, the latest versions of Spring and Spring Boot are:

1. **Spring Framework**: The latest version of the Spring Framework is **6.0.0** (released on September 14, 2022).
2. **Spring Boot**: The latest version of Spring Boot is **3.0.0** (released on September 14, 2022).

Here are some key features and improvements in these latest versions:

**Spring Framework 6.0.0**

* Improved performance and concurrency support
* Enhanced support for reactive programming
* Simplified configuration and dependency management
* Better support for cloud-native applications
* Improved security features

**Spring Boot 3.0.0**

* Based on Spring Framework 6.0.0, providing the same improvements and features
* Improved support for web development, including a new web framework (WebFlux)
* Enhanced support for cloud-native applications and serverless computing
* Improved security features and better support for OAuth 2.0 and OpenID Connect
* Better support for microservices architecture and service discovery

Please note that these versions are subject to change, and new versions may be released after my knowledge cutoff date. I recommend checking the official Spring documentation and release notes for the latest information.
Your input (type 'exit' to quit):
```
可以问问它的知识更新到什么时候
```
Your input (type 'exit' to quit): when is your knowledge cutoff date?
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
My knowledge cutoff date is currently December 2023, but I have access to more recent information via internet search.
```
Dec 12, 2023
观察 CPU, 内存,GPU 的使用状态
CPU,内存状态监控用 top, GPU 的监控命令是 watch -d -n 0.5 nvidia-smi
比如程序一启动、刚进到输入提示符还未提问时
```
python test-llama-3-8b.py
Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00,  2.40it/s]
Your input (type 'exit' to quit):
```
CPU、内存、GPU 的运行状态是
```
free
               total        used        free      shared  buff/cache   available
Mem:        65543452     5375056     1389000       45840    58779396    59393188
```
模型一载入后内存所剩无几,VRAM 快用了 16G, GPU 待机状态
在问 "how to implement a http proxy with node.js and express?" 执行过程中的状态是
此时用 nvidia-smi 观察,GPU 使用率达到 98%,功率 291W,连 python 进程的 CPU 使用也拉满,真的很烧机器。
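除了 nvidia-smi,也可以在 Python 里直接用 torch 查询显存占用,下面是一个小示例(可插在前面推理脚本的任意位置,假设 CUDA 可用):

```python
import torch

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    # 当前已分配显存与运行以来的峰值显存,换算为 GB
    print(f"allocated: {torch.cuda.memory_allocated(0) / 1024**3:.1f} GB")
    print(f"peak:      {torch.cuda.max_memory_allocated(0) / 1024**3:.1f} GB")
```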
测试不用 CUDA
使用环境变量将 CUDA 设备隐藏
```
export CUDA_VISIBLE_DEVICES=""
python -c 'import torch; print(torch.cuda.is_available()); print(torch.version.cuda)'
False
12.4
```
测试非交互的问话,即执行

```
python test-llama-3-8b.py "USA 5 biggest cities and population?"
```
同时用 nvidia-smi 观察,GPU 确实没被使用,推理全跑在 CPU 上,python 进程的 CPU 使用率一直在 200% 以上
```
top - 16:00:32 up 1 day, 17:49,  3 users,  load average: 2.97, 1.61, 0.82
Tasks: 480 total,   2 running, 478 sleeping,   0 stopped,   0 zombie
%Cpu(s):  8.9 us,  0.4 sy,  0.0 ni, 90.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  64007.3 total,   2915.0 free,   5625.7 used,  55466.5 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  57632.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 181604 yanbin    20   0   23.3g  14.5g  14.2g R 282.4  23.3   4:49.41 python
  45045 yanbin    20   0   23.3g   4.4g 750044 S   2.7   7.0  37:32.62 ld-linux-x86-64
```
最后得到答案的总耗时为 8 分 14 秒(即 494 秒),与使用 GPU 的 6.7 秒相比,基本处于不可用状态。
用交互式的方式,也就是先加载完模型,再输入问题 "USA 5 biggest cities and population?" 测试一下。从上一个测试也看出时间基本耗费在推理上,加载模型的时间相比而言可以忽略不计。
测试的结果是加载模型后,从问话到输出结果的时间是 8 分 3 秒(483 秒)。用还是不用 GPU 进行推理,问一个简单问题("USA 5 biggest cities and population?")的时间对比是
- 使用 GPU(CUDA, RTX 4090, 24G, 16384 CUDA cores): 2.5 秒
- 不使用 GPU(完全靠 CPU i9-13900F): 488 秒,即 8 分 8 秒
这个真有必要突出强调:没有一块好的显卡就玩不了模型(模特还差不多),而且这还没对比更复杂的问题。不用 GPU 试下问题 "how to implement a http proxy with node.js and express?",花费了 9 分 50 秒(即 590 秒),这些用了 GPU 的话都是几秒内出结果。再问 "1+1=?" 用了 2 分 40 秒,这智商;而用 GPU 回答 "1+1=?" 只需 0.5 秒。
Chat Bot UI 想法
既然可以做成命令行下的交互式问答,那就很容易实现一个 WebUI 的聊天窗口。比如用 FastAPI 把前面的 transformers pipeline 包装成一个 API:用户输入内容,调用 pipeline 产生输出;由于模型输出的是 Markdown 格式(代码部分用围栏代码块包裹),所以聊天窗口显示内容时要支持 Markdown。
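下面是按这个想法写的一个最小化 FastAPI 接口草图(沿用上文的 pipeline 用法,端点名 /chat 和请求字段 question 都只是示意,并非固定约定):

```python
from fastapi import FastAPI
from pydantic import BaseModel
import transformers
import torch

app = FastAPI()

# 进程启动时只加载一次模型,与上文交互式脚本的思路相同
pipeline = transformers.pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

class ChatRequest(BaseModel):
    question: str

@app.post("/chat")
def chat(req: ChatRequest):
    messages = [{"role": "user", "content": req.question}]
    outputs = pipeline(messages, max_new_tokens=1024)
    # 返回 Markdown 文本,由前端聊天窗口负责渲染
    return {"answer": outputs[0]["generated_text"][-1]["content"]}
```

保存为 app.py 后可用 uvicorn app:app 启动(fastapi、uvicorn 需另外安装)。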
相关的技术:API 方面也可以用 llama.cpp 或 java-llama.cpp 来启动 llama server。
后记 -- MacBook Pro 下性能测试
通过函数 transformers.pipeline(device=[-1|0]) 可以选择是否使用 GPU:device=-1 意味着不使用 GPU、全用 CPU;device=0 则使用 0 号 GPU 设备。用了 device 参数就不能再用 device_map 参数。使用 CUDA 的 GPU 可以通过环境变量 export CUDA_VISIBLE_DEVICES="" 来禁用;而 Mac 下用 Metal Performance Shaders (MPS) 则无法通过该环境变量禁用 GPU,只能用 device=-1 参数。
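下面是在 Mac 上切换 CPU/GPU 的一个简单草图(device 取值按上文约定,"mps" 写法是否可用取决于 transformers 版本,仅作示意):

```python
import transformers
import torch

# device=-1 强制只用 CPU;改为 0(或 "mps")则使用 Apple GPU
pipeline = transformers.pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device=-1,
)
```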
顺便在 MacBook Pro(CPU: Apple M3 Pro, 12 核;内存 36G;内置 18 核 Metal 3 GPU)上测试了一下
- 使用 GPU: pipeline(device=0): 加载模型 24 秒,问话 “USA 5 biggest cities and population?” 平均耗时 18 秒
- 不使用 GPU: pipeline(device=-1): 加载模型大概 0.8 秒,问话 "USA 5 biggest cities and population?" 平均耗时 23 秒(在获得相同输出的情况下,CPU 推理的耗时波动较大)
怎么知道 Mac 是否使用了 GPU 的呢?打开 Activity Monitor / Window / GPU History 可以看到动态。
安装了 PyTorch 后,用如下 Python 脚本可显示 torch 是否支持 MPS
```
python -c "import torch; print(torch.backends.mps.is_available())"
True
```
用 M3 Pro 的 CPU(12 核) 还是 GPU(18 核) 结果相差其实不大,用 CPU 时加载模型很快。Mac 下推理要 20 秒左右,相比于 Ubuntu(i9-13900F + 内存 64G + 显卡 RTX 4090) 的加载模型 4.2 秒,推理 2.5 秒还是要慢很多。