From 01869065e70b2fd5a8d161220889da180391bf8c Mon Sep 17 00:00:00 2001
From: root <403644786@qq.com>
Date: Tue, 23 Jul 2024 15:10:44 +0800
Subject: [PATCH] Made the requested Markdown formatting changes; the term
 "apk" was left unchanged, following the README's original wording
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 README.md                     |  10 +-
 README_zh.md                  |   8 +-
 docs/swift_train_and_infer.md | 176 +++++++++++++++++-----------
 docs/xinference_infer.md      |  20 ++--
 4 files changed, 109 insertions(+), 105 deletions(-)

diff --git a/README.md b/README.md
index 2251681..27e2714 100644
--- a/README.md
+++ b/README.md
@@ -71,12 +71,14 @@ Join our 💬 WeChat
 - [Citation](#citation)
 
 ## MiniCPM-Llama3-V 2.5 Common Module Navigation
+You can click the entries in the table below to quickly access the content you need.
+
 | Functional Categories | | | | | | | ||
 |:--------:|:------:|:--------------:|:--------:|:-------:|:-----------:|:-----------:|:--------:|:-----------:|
-| Inference | [Transformers](https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md) | [Ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5) | [Swift](./docs/swift_train_and_infer.md) | [Llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) | [Xinfrence](./docs/xinference_infer.md) | [Gradio](./web_demo_2.5.py) | [Streamlit](./web_demo_streamlit-2_5.py) |
-| Finetune | [Finetune](./finetune/readme.md) | [Lora](./finetune/readme.md) | [Swift](./docs/swift_train_and_infer.md) | | | | | |
-| Edge Deployment | [Apk](http://minicpm.modelbest.cn/android/modelbest-release-20240528_182155.apk) | [Llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) | | | | | | |
-| Quantize | [Bnb](./quantize/bnb_quantize.py) |
+| Inference | [Transformers](https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md) | [ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5) | [SWIFT](./docs/swift_train_and_infer.md) | [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) | [Xinference](./docs/xinference_infer.md) | [Gradio](./web_demo_2.5.py) | [Streamlit](./web_demo_streamlit-2_5.py) | [vLLM](#vllm) |
+| Finetune | [Finetune](./finetune/readme.md) | [LoRA](./finetune/readme.md) | [SWIFT](./docs/swift_train_and_infer.md) | | | | | |
+| Edge Deployment | [apk](http://minicpm.modelbest.cn/android/modelbest-release-20240528_182155.apk) | [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) | | | | | | |
+| Quantize | [Bnb](./quantize/bnb_quantize.py) |
 
 ## MiniCPM-Llama3-V 2.5
 
diff --git a/README_zh.md b/README_zh.md
index 148f969..ab76d11 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -76,11 +76,13 @@
 - [引用](#引用)
 
 ## MiniCPM-Llama3-V 2.5快速导航
+你可以点击以下表格快速访问MiniCPM-Llama3-V 2.5中你所需要的常用内容。
+
 | 功能分类 | | | | | | | ||
 |:--------:|:------:|:--------------:|:--------:|:-------:|:-----------:|:-----------:|:--------:|:-----------:|
-| 推理 | [Transformers](https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md) | [Ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5) | [Swift](./docs/swift_train_and_infer.md) | [Llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) | [Xinfrence](./docs/xinference_infer.md) | [Gradio](./web_demo_2.5.py) | [Streamlit](./web_demo_streamlit-2_5.py) |
-| 微调 | [Finetune](./finetune/readme.md) | [Lora](./finetune/readme.md) | [Swift](./docs/swift_train_and_infer.md) | | | | | |
-| 安卓部署 | [Apk安装](http://minicpm.modelbest.cn/android/modelbest-release-20240528_182155.apk) | [Llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) | | | | | | |
+| 推理 | [Transformers](https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md) | [ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5) | [SWIFT](./docs/swift_train_and_infer.md) | [Llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) | [Xinference](./docs/xinference_infer.md) | [Gradio](./web_demo_2.5.py) | [Streamlit](./web_demo_streamlit-2_5.py) | [vLLM](#vllm) |
+| 微调 | [Finetune](./finetune/readme.md) | [LoRA](./finetune/readme.md) | [SWIFT](./docs/swift_train_and_infer.md) | | | | | |
+| 安卓部署 | [apk安装](http://minicpm.modelbest.cn/android/modelbest-release-20240528_182155.apk) | [Llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) | | | | | | |
 | 量化 | [Bnb量化](./quantize/bnb_quantize.py) |
 
diff --git a/docs/swift_train_and_infer.md b/docs/swift_train_and_infer.md
index df4903c..1e74607 100644
--- a/docs/swift_train_and_infer.md
+++ b/docs/swift_train_and_infer.md
@@ -1,18 +1,18 @@
-## Swift install
-You can quickly install Swift using bash commands.
+## SWIFT Installation
+You can install SWIFT quickly with the following bash commands.
 ``` bash
-    git clone https://github.com/modelscope/swift.git
-    cd swift
-    pip install -r requirements.txt
-    pip install -e '.[llm]'
+git clone https://github.com/modelscope/swift.git
+cd swift
+pip install -r requirements.txt
+pip install -e '.[llm]'
 ```
 
-## Swift Infer
-Inference using Swift can be carried out in two ways: through a command line interface and via Python code.
+## SWIFT Inference
+Inference with SWIFT can be carried out in two ways: through the command-line interface or via Python code.
 
 ### Quick start
-Here are the steps to launch Swift from the Bash command line:
+Here are the steps to launch SWIFT inference from the bash command line:
 
 1. Run the bash code will download the model of MiniCPM-Llama3-V-2_5 and run the inference
 ``` shell
@@ -21,115 +21,115 @@ CUDA_VISIBLE_DEVICES=0 swift infer --model_type minicpm-v-v2_5-chat
 2. You can also run the code with more arguments below to run the inference:
 ```
-    model_id_or_path # 可以写huggingface的模型id或者本地模型地址
-    infer_backend ['AUTO', 'vllm', 'pt'] # 后段推理,默认auto
-    dtype ['bf16', 'fp16', 'fp32', 'AUTO'] # 计算精度
-    max_length # 最大长度
-    max_new_tokens: int = 2048 #最多生成多少token
-    do_sample: bool = True # 是否采样
-    temperature: float = 0.3 # 生成时的温度系数
-    top_k: int = 20
-    top_p: float = 0.7
-    repetition_penalty: float = 1.
-    num_beams: int = 1
-    stop_words: List[str] = None
-    quant_method ['bnb', 'hqq', 'eetq', 'awq', 'gptq', 'aqlm'] # 模型的量化方式
-    quantization_bit [0, 1, 2, 3, 4, 8] 默认是0,代表不使用量化
+model_id_or_path                       # a Hugging Face model ID or a local model path
+infer_backend ['AUTO', 'vllm', 'pt']   # inference backend, defaults to AUTO
+dtype ['bf16', 'fp16', 'fp32', 'AUTO'] # computation precision
+max_length                             # maximum sequence length
+max_new_tokens: int = 2048             # maximum number of tokens to generate
+do_sample: bool = True                 # whether to use sampling
+temperature: float = 0.3               # sampling temperature
+top_k: int = 20
+top_p: float = 0.7
+repetition_penalty: float = 1.
+num_beams: int = 1
+stop_words: List[str] = None
+quant_method ['bnb', 'hqq', 'eetq', 'awq', 'gptq', 'aqlm']  # quantization method for the model
+quantization_bit [0, 1, 2, 3, 4, 8]    # defaults to 0, i.e. no quantization
 ```
 3. Example:
 ``` shell
-    CUDA_VISIBLE_DEVICES=0,1 swift infer \
-    --model_type minicpm-v-v2_5-chat \
-    --model_id_or_path /root/ld/ld_model_pretrain/MiniCPM-Llama3-V-2_5 \
-    --dtype bf16
+CUDA_VISIBLE_DEVICES=0,1 swift infer \
+--model_type minicpm-v-v2_5-chat \
+--model_id_or_path /root/ld/ld_model_pretrain/MiniCPM-Llama3-V-2_5 \
+--dtype bf16
 ```
 
-### Python code with swift infer
-The following demonstrates using Python code to initiate inference with the MiniCPM-Llama3-V-2_5 model through Swift.
+### Python code for SWIFT inference
+The following demonstrates how to run inference with the MiniCPM-Llama3-V-2_5 model through SWIFT from Python code.
 ```python
-    import os
-    os.environ['CUDA_VISIBLE_DEVICES'] = '0,1' # 设置显卡数
+import os
+os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'  # set the visible GPUs
 
-    from swift.llm import (
-        get_model_tokenizer, get_template, inference, ModelType,
-        get_default_template_type, inference_stream
-    ) # 导入必要模块
+from swift.llm import (
+    get_model_tokenizer, get_template, inference, ModelType,
+    get_default_template_type, inference_stream
+)  # import the required modules
 
-    from swift.utils import seed_everything # 设置随机种子
-    import torch
+from swift.utils import seed_everything  # for fixing the random seed
+import torch
 
-    model_type = ModelType.minicpm_v_v2_5_chat
-    template_type = get_default_template_type(model_type) # 获取模板类型,主要是用于特殊token的构造和图像的处理流程
-    print(f'template_type: {template_type}')
+model_type = ModelType.minicpm_v_v2_5_chat
+template_type = get_default_template_type(model_type)  # get the template type, which mainly handles special-token construction and image preprocessing
+print(f'template_type: {template_type}')
 
-    model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16,
-                                           model_id_or_path='/root/ld/ld_model_pretrain/MiniCPM-Llama3-V-2_5',
-                                           model_kwargs={'device_map': 'auto'}) # 加载模型,并设置模型类型,模型路径,模型参数,设备分配等,计算精度等等
-    model.generation_config.max_new_tokens = 256
-    template = get_template(template_type, tokenizer) # 根据模版类型构造模板
-    seed_everything(42)
+model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16,
+                                       model_id_or_path='/root/ld/ld_model_pretrain/MiniCPM-Llama3-V-2_5',
+                                       model_kwargs={'device_map': 'auto'})  # load the model, specifying model type, local path, precision and device mapping
+model.generation_config.max_new_tokens = 256
+template = get_template(template_type, tokenizer)  # build the template for this template type
+seed_everything(42)
 
-    images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png'] # 图片地址
-    query = '距离各城市多远?'
-    response, history = inference(model, template, query, images=images) # 推理获得结果
-    print(f'query: {query}')
-    print(f'response: {response}')
+images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png']  # image URL
+query = '距离各城市多远?'
+response, history = inference(model, template, query, images=images)  # run inference and get the result
+print(f'query: {query}')
+print(f'response: {response}')
 
-    # 流式
-    query = '距离最远的城市是哪?'
-    gen = inference_stream(model, template, query, history, images=images) # 调用流式输出接口
-    print_idx = 0
-    print(f'query: {query}\nresponse: ', end='')
-    for response, history in gen:
-        delta = response[print_idx:]
-        print(delta, end='', flush=True)
-        print_idx = len(response)
-    print()
-    print(f'history: {history}')
+# streaming output
+query = '距离最远的城市是哪?'
+gen = inference_stream(model, template, query, history, images=images)  # call the streaming inference interface
+print_idx = 0
+print(f'query: {query}\nresponse: ', end='')
+for response, history in gen:
+    delta = response[print_idx:]
+    print(delta, end='', flush=True)
+    print_idx = len(response)
+print()
+print(f'history: {history}')
 ```
 
-## Swift train
-Swift supports training on the local dataset,the training steps are as follows:
+## SWIFT Training
+SWIFT supports training on a local dataset. The training steps are as follows:
 1. Make the train data like this:
 ```jsonl
 {"query": "这张图片描述了什么", "response": "这张图片有一个大熊猫", "images": ["local_image_path"]}
 {"query": "这张图片描述了什么", "response": "这张图片有一个大熊猫", "history": [], "images": ["image_path"]}
 {"query": "竹子好吃么", "response": "看大熊猫的样子挺好吃呢", "history": [["这张图有什么", "这张图片有大熊猫"], ["大熊猫在干嘛", "吃竹子"]], "images": ["image_url"]}
 ```
-2. Lora Tuning:
+2. LoRA Tuning:
 
-    The lora target model are k and v weight in llm you should pay attention to the eval_steps,maybe you should set the eval_steps to a large value, like 200000,beacuase in the eval time , swift will return a memory bug so you should set the eval_steps to a very large value.
+By default, the LoRA target modules are the k and v weights of the LLM. Pay attention to eval_steps: evaluation in SWIFT may currently run into an out-of-memory error, so set eval_steps to a very large value (e.g. 200000) so that evaluation is effectively skipped.
 ```shell
-    # Experimental environment: A100
-    # 32GB GPU memory
-    CUDA_VISIBLE_DEVICES=0 swift sft \
-    --model_type minicpm-v-v2_5-chat \
-    --dataset coco-en-2-mini \
+# Experimental environment: A100
+# 32GB GPU memory
+CUDA_VISIBLE_DEVICES=0 swift sft \
+--model_type minicpm-v-v2_5-chat \
+--dataset coco-en-2-mini
 ```
 3. All parameters finetune:
 
-    When the argument of lora_target_modules is ALL, the model will finetune all the parameters.
+When the lora_target_modules argument is set to ALL, all of the model's parameters are fine-tuned.
 ```shell
 CUDA_VISIBLE_DEVICES=0,1 swift sft \
-    --model_type minicpm-v-v2_5-chat \
-    --dataset coco-en-2-mini \
-    --lora_target_modules ALL \
-    --eval_steps 200000
+--model_type minicpm-v-v2_5-chat \
+--dataset coco-en-2-mini \
+--lora_target_modules ALL \
+--eval_steps 200000
 ```
 
-## Lora Merge and Infer
-The lora weight can be merge to the base model and then load to infer.
+## LoRA Merge and Inference
+The LoRA weights can be merged into the base model, which can then be loaded for inference.
 
-1. Load the lora weight to infer run the follow code:
-```shell
-CUDA_VISIBLE_DEVICES=0 swift infer \
-    --ckpt_dir /your/lora/save/checkpoint
-```
-2. Merge the lora weight to the base model:
-
-    The code will load and merge the lora weight to the base model, save the merge model to the lora save path and load the merge model to infer
+1. To load the LoRA weights for inference, run the following command:
 ```shell
 CUDA_VISIBLE_DEVICES=0 swift infer \
-    --ckpt_dir your/lora/save/checkpoint \
-    --merge_lora true
+--ckpt_dir /your/lora/save/checkpoint
+```
+2. Merge the LoRA weights into the base model:
+
+The following command loads the LoRA weights, merges them into the base model, saves the merged model to the LoRA checkpoint directory, and then loads the merged model for inference:
+```shell
+CUDA_VISIBLE_DEVICES=0 swift infer \
+--ckpt_dir your/lora/save/checkpoint \
+--merge_lora true
+```
\ No newline at end of file
diff --git a/docs/xinference_infer.md b/docs/xinference_infer.md
index b7be1c3..ab06a2b 100644
--- a/docs/xinference_infer.md
+++ b/docs/xinference_infer.md
@@ -10,7 +10,7 @@ pip install "xinference[all]"
 
 ### Quick start
 The initial steps for conducting inference with Xinference involve downloading the model during the first launch.
-1. Start xinference in the terminal:
+1. Start Xinference in the terminal:
 ```shell
 xinference
 ```
@@ -37,9 +37,9 @@ Replica : 1
 
 ### Local MiniCPM-Llama3-V-2_5 Launch
 If you have already downloaded the MiniCPM-Llama3-V-2_5 model locally, you can proceed with Xinference inference following these steps:
-1. Start xinference
+1. Start Xinference:
 ```shell
-    xinference
+xinference
 ```
 2. Start the web ui.
 3. To register a new model, follow these steps: the settings highlighted in red are fixed and cannot be changed, whereas others are customizable according to your needs. Complete the process by clicking the 'Register Model' button.
@@ -50,12 +50,12 @@ If you have already downloaded the MiniCPM-Llama3-V-2_5 model locally, you can p
 4. After completing the model registration, proceed to 'Custom Models' and locate the model you just registered.
 5. Follow the config and launch the model.
 ```plaintext
-    Model engine : Transformers
-    model format : pytorch
-    Model size : 8
-    quantization : none
-    N-GPU : auto
-    Replica : 1
+Model engine : Transformers
+Model format : pytorch
+Model size : 8
+Quantization : none
+N-GPU : auto
+Replica : 1
 ```
 6. After first click the launch button,Xinference will download the model from Huggingface. we should click the chat button.
 ![alt text](../assets/xinferenc_demo_image/xinference_webui_button.png)
 
 ### FAQ
 1. Why can't the sixth step open the WebUI?
-    Maybe your firewall or mac os to prevent the web to open.
\ No newline at end of file
+Your firewall or macOS security settings may be blocking the web page from opening.
\ No newline at end of file
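
Once a MiniCPM-Llama3-V-2_5 model has been launched through Xinference as described in the doc above, it can also be queried programmatically instead of through the WebUI. The snippet below is a minimal sketch and is not part of the patch: it assumes Xinference exposes its OpenAI-compatible endpoint at the default `http://127.0.0.1:9997/v1` and that the launched model's UID is `MiniCPM-Llama3-V-2_5`; check the UID shown in the WebUI for your deployment and adjust both values accordingly.

```python
# Minimal sketch (assumptions): chat with a MiniCPM-Llama3-V-2_5 model served by
# Xinference via its OpenAI-compatible API. Assumes the server listens on
# 127.0.0.1:9997 and the model UID is "MiniCPM-Llama3-V-2_5".
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:9997/v1",  # Xinference's OpenAI-compatible endpoint
    api_key="not-needed",                 # no real key is required for a local server without auth
)

response = client.chat.completions.create(
    model="MiniCPM-Llama3-V-2_5",  # the model UID you launched
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png"
                    },
                },
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```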