Fine-Tuning Mistral-7B on Apple Silicon: A Mac User’s Journey with Axolotl & LoRA
TL;DR: Fine-tuning a large language model like Mistral-7B on an M Series Mac is absolutely possible — but it’s not without challenges. In this article, I’ll share my personal journey fine-tuning Mistral-7B on a Rust programming dataset using an M3 Ultra Mac. I’ll walk through my false starts with the Axolotl fine-tuning toolkit (and the CUDA-centric pitfalls I hit), and how I ultimately succeeded by writing a custom LoRA fine-tuning script with Transformers and PEFT. We’ll cover the errors I encountered (and how I fixed or worked around them), how I merged the LoRA weights into the base model, converted the result to GGUF format with llama.cpp, and got the model running locally in LM Studio. Along the way, I’ll include code snippets, tool links, and tips for Mac users (like disabling Weights & Biases prompts, silencing tokenizer warnings, and organizing your files). Let’s dive in! 🚀

Why Fine-Tune Mistral-7B Locally on a Mac?
Fine-tuning large language models is often assumed to require a beefy NVIDIA GPU with CUDA. I wanted to see if I could fine-tune a model locally on Apple Silicon — taking advantage of the M-series chip’s unified memory and Metal Performance Shaders (MPS) backend for PyTorch. The model I chose was Mistral-7B, a powerful 7-billion-parameter model released by Mistral AI in late 2023, known for strong performance relative to its size. My goal was to fine-tune Mistral-7B on a Rust-specific dataset to create a Rust-fluent assistant.
Why not use a cloud GPU? Cost is one reason, but also the appeal of running everything offline on my Mac. The challenge is that much of the LLM tooling (fine-tuning frameworks, optimizers, etc.) is very CUDA-centric and not built with Apple’s GPUs in mind. Apple’s MPS support in PyTorch is steadily improving, but still has some gaps (e.g. incomplete mixed-precision support). In this article, I’ll share how I navigated these limitations.
Attempt 1: Fine-Tuning with Axolotl (and the Pitfalls on Apple Silicon)
To kick things off, I decided to try Axolotl, an open-source toolkit that aims to streamline fine-tuning of LLMs with minimal code. Axolotl supports many architectures (LLaMA, Mistral, Falcon, etc.) and uses simple YAML configs to orchestrate data preprocessing, LoRA or full fine-tuning, and even model merging. This sounded perfect — I could focus on my dataset and config while Axolotl handled the heavy lifting.
Axolotl setup: I created a Python 3.10 virtual environment on my Mac and ran the Axolotl installation. On Apple Silicon, the recommended approach was to install from source (since some dependencies need special handling). After installing PyTorch (with MPS support) and the datasets library, I ran:
# Install Axolotl without CUDA extras
pip install axolotl

This installed Axolotl and most dependencies, but as expected, bitsandbytes (a library for 8-bit optimizations) failed to install — it has no support for macOS/M1 (it’s CUDA-only). I knew I wouldn’t be able to use 4-bit quantization or 8-bit optimizers on Mac, but that’s okay for a 7B model. I planned to fine-tune in full 16-bit or 32-bit precision on CPU/MPS.
Next, I wrote an Axolotl YAML config for my task, specifying the base model (mistralai/Mistral-7B-v0.1 from Hugging Face), LoRA parameters (rank, alpha, target modules, etc.), and pointing to my Rust dataset (a JSON with instruction-output pairs). Then I ran Axolotl to begin training.
Challenges and Errors with Axolotl on M3
Almost immediately, I hit a series of issues trying to use Axolotl on Apple Silicon. Here’s a rundown of the key challenges I faced, and how I attempted to address them:
- bitsandbytes incompatibility: As noted, Axolotl by default tries to use bitsandbytes for 8-bit model loading or optimizers (especially if you configure QLoRA or 4-bit training). On Mac, bitsandbytes isn’t available, causing installation or runtime errors. Workaround: I disabled any 4-bit quantization options in the config and let Axolotl load the model in full precision. Axolotl’s docs confirm that on Mac, you have to stick to full precision or FP16 — no 4/8-bit support.
- “CUDA-only” cleanup routines: During shutdown, I saw warnings/errors related to CUDA operations (e.g. attempts to call torch.cuda.empty_cache() or use CUDA-specific memory cleanup) even though I was running on CPU/MPS. These weren’t show-stoppers, but they cluttered the logs with warnings. Solution: I mostly ignored these, but it highlighted that some parts of Axolotl assume an NVIDIA GPU environment. (Axolotl’s own docs note that M-series Mac support is partial, since not all dependencies support MPS.)
- Unexpected dataset key errors: My dataset was a simple JSON of instructions and answers (I had keys like “instruction” and “output”). Axolotl, however, expected a certain format or key naming depending on the chosen prompt template. At first, I got a KeyError complaining about missing keys. I realized Axolotl defaults to the OpenAI conversation format (expecting a list of messages with roles). Fix: I updated the config to map my dataset’s keys to what Axolotl expects. For example, I set field_instruction to “instruction” and field_reply to “output” in the config (or I could have reformatted my JSON). After this tweak, Axolotl was able to parse the dataset.
- merge_and_unload error (merging LoRA weights): After training for a while (which itself was slow but working on CPU), Axolotl tries to merge the LoRA adapter into the base model to produce a final model. This step crashed with an AttributeError — “MistralForCausalLM object has no attribute ‘merge_and_unload’”. Essentially, the Mistral model class in transformers did not support the merge_and_unload() method that PEFT models have. This was a known Axolotl bug at the time for Mistral. The result: Axolotl couldn’t merge the weights.
- Missing adapter_model.bin: To make matters worse, because of the merge failure, Axolotl also failed to save the LoRA adapter weights in the expected output folder. I was left with an output directory that had some logs and config, but no adapter_model.bin (the LoRA weight file) or merged model. After hours of training, I essentially had nothing usable to show for it.
After wrestling with these issues and digging through GitHub issues, I decided to change course. Axolotl is a great tool, but its Mac support (at least at that time) was bleeding-edge and these CUDA-centric roadblocks were draining my productivity. It was time for Plan B.
Plan B: Fine-Tuning with a Custom PEFT Script (No Axolotl, No CUDA)
Determined to get this model fine-tuned on my Mac, I rolled up my sleeves and wrote a custom training script using the Transformers library and the PEFT (Parameter-Efficient Fine-Tuning) library for LoRA. Writing my own script gave me full control and transparency, at the cost of re-implementing some of what Axolotl would have handled automatically. The upside: I could explicitly avoid any CUDA-specific code and handle the quirks of Apple Silicon myself.
1. Environment Setup for Mac Development
First, I made sure my environment was ready:
PyTorch with MPS: I installed a recent PyTorch build that supports the Apple MPS backend. For me, pip install torch torchvision torchaudio (PyTorch 2.1) worked out-of-the-box. After installation, I quickly tested torch.backends.mps.is_available() in a Python shell to confirm that the MPS (Metal Performance Shaders) backend was enabled. If MPS hadn’t worked, the fallback would be CPU, but MPS can accelerate tensor operations on the GPU (though it still uses GPU memory).
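For reference, the check is just a couple of lines in a Python shell (a minimal sketch; torch.backends.mps also exposes is_built(), which helps distinguish a PyTorch build without MPS from a runtime problem):

import torch

# Sanity-check that the Metal (MPS) backend is compiled in and usable on this machine
print("MPS built:", torch.backends.mps.is_built())
print("MPS available:", torch.backends.mps.is_available())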
Hugging Face libraries: I installed the Transformers, Datasets, and PEFT libraries:
pip install transformers datasets peft safetensors accelerate

Using safetensors is optional but recommended when dealing with model weights, as it’s a safer alternative to pickle-based .bin files. I also included accelerate just in case I wanted to use it for device placement (though for a single machine and MPS, it wasn’t strictly needed).
Disable W&B and tokenizers warnings: By default, the Hugging Face Trainer will attempt to log to Weights & Biases. To avoid the interactive prompt or unwanted logging, I set an environment variable to disable W&B:
export WANDB_DISABLED=true
export TOKENIZERS_PARALLELISM=false

I also turned off the tokenizers parallelism. These two lines can be added to your ~/.bashrc or ~/.zshrc, or just exported in the terminal before running the script. (Alternatively, in your Python script you can do os.environ[“WANDB_DISABLED”] = “true”.) This ensures a cleaner output with no pauses or huge warnings.
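If you would rather keep this inside the training script itself, the same settings can be applied in Python (a small sketch; set them before the tokenizer and Trainer are created so they take effect):

import os

# Must run before the tokenizer/Trainer are created for these settings to apply
os.environ["WANDB_DISABLED"] = "true"
os.environ["TOKENIZERS_PARALLELISM"] = "false"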
- Organize files and folders: For sanity, I structured my project as follows:
A directory for the base model weights (I used the Hugging Face mistralai/Mistral-7B-v0.1 — which I had downloaded in advance using huggingface_hub or git lfs). For example: ./models/Mistral-7B-v0.1/…
A directory (or file) for the dataset. In my case, a JSON lines file rust_instruct.json containing entries with “instruction” and “output”.
An output directory for fine-tuning results. I created ./outputs/lora_rust/ for the LoRA adapter, and later ./outputs/merged_model/ for the merged full model.
Keeping these separate helps avoid mixing up files and makes it easier to convert or move things later.
2. Writing the LoRA Training Script
Now for the main event: fine-tuning the model with LoRA. I wrote a script train_lora_mistral.py to do the following steps:
a. Load the tokenizer and base model: Using the Hugging Face Transformers API:
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "./models/Mistral-7B-v0.1" # path or name of base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=False,     # we avoid bitsandbytes on Mac
    torch_dtype="auto",     # let PyTorch decide (float32, or float16 if MPS allows)
    device_map={"": "mps"}  # this places the model on the Apple GPU (MPS).
    # If MPS is not available or if issues occur, use {"": "cpu"} to train on CPU.
)

A few notes on this:
- I loaded the model in full precision (load_in_8bit=False) because 8-bit loading (via bitsandbytes) isn’t supported on Mac. Full FP16 or FP32 was fine given 7B isn’t too large. On my 32GB RAM Mac, the 7B model in FP16 (~13 GB) fits in memory.
我以全精度 (load_in_8bit=False) 載入了模型,因為 Mac 不支援 8 位載入(通過 bitsandbytes)。完整的 FP16 或 FP32 很好,因為 7B 不是太大。在我的 32GB RAM Mac 上,FP16 (~13 GB) 的 7B 型號適合記憶體。 - device_map={“”: “mps”} is a way to instruct Transformers to put the whole model on the MPS device. You could also call model.to(“mps”) after loading. If you only have CPU (or if MPS has issues), use “cpu”. Keep in mind training on CPU will be very slow — MPS can be 2–3× faster for matrix ops.
device_map={“”: “mps”} 是指示 Transformers 將整個模型放在 MPS 設備上的一種方法。你也可以在載入後調用 model.to(“mps”)。 如果您只有 CPU(或者 MPS 有問題),請使用 「cpu」。請記住,在 CPU 上訓練會非常慢 — 矩陣運算的 MPS 可能會快 2-3×。 - I used the local path for the model. If you haven’t downloaded the model, you can use the Hugging Face hub name (it will download automatically). Just be mindful of disk space.
我使用了模型的本地路徑。如果您尚未下載模型,則可以使用 Hugging Face 集線器名稱(它將自動下載)。請注意磁碟空間。
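To make that CPU fallback concrete, here is a hedged variant of the loading step (a sketch only; it is the same from_pretrained call as above, just with the device chosen at runtime instead of hard-coded):

import torch
from transformers import AutoModelForCausalLM

model_name = "./models/Mistral-7B-v0.1"  # same local path as before

# Prefer the Apple GPU (MPS) when available, otherwise fall back to CPU
device = "mps" if torch.backends.mps.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map={"": device},
)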
b. Add LoRA adapters to the model: Using PEFT (🤗 PEFT library) to wrap the model with LoRA:
from peft import LoraConfig, get_peft_model, TaskType
# Configure LoRA
lora_config = LoraConfig(
    r=16,               # LoRA rank (trade-off between memory and capacity; common values: 4, 8, 16)
    lora_alpha=32,      # LoRA alpha scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # target linear layers in the Mistral model
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM  # we're fine-tuning a causal language model
)
# Wrap the model with LoRA
model = get_peft_model(model, lora_config)

Some explanation: The target_modules are the names of the model’s weight modules we want to apply LoRA to. For LLaMA/Mistral architectures, the attention projection layers are typically named q_proj, k_proj, v_proj, and sometimes o_proj (output projection). I included those to allow LoRA to train those matrices. The rank and alpha are hyperparameters (I chose a moderately high rank of 16 for potentially better learning of coding knowledge, at the cost of a larger adapter). Note: A rank of 16 on a 7B model is still quite small in terms of new parameters — LoRA adds only (2 * hidden_size * r) parameters per target layer, which is much less than full model tuning.
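A quick way to see how small the adapter really is: PEFT models expose print_trainable_parameters(), which reports trainable vs. total parameter counts right after wrapping (the exact numbers depend on the rank and target modules you chose):

# Print the count of trainable (LoRA) parameters versus all model parameters
model.print_trainable_parameters()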
c. Prepare the dataset and training data loader: I used the Datasets library to load my JSON and then set up the training loop:
from datasets import load_dataset
data = load_dataset("json", data_files="rust_instruct.json")
train_data = data["train"]  # assuming the JSON is just a list of examples

My dataset entries look like:
{"instruction": "Explain the ownership model in Rust.", "output": "Rust's ownership model is based on ..."}
{"instruction":"Rust use statement","input":"","output":"use bytes::Bytes;"}I decided to fine-tune in an instruction-tuning style, where each example is like a prompt-response pair. I concatenated each instruction with perhaps a system prefix and used the output as the label. A quick way is to format each example into a single text with a special separator token, but since Mistral/LLaMA are usually trained in chat format, I could also use the prompt template approach. For simplicity, I created a function to join them:
我決定以指令調整風格進行微調,其中每個示例都類似於一對提示-回應。我可能將每條指令與系統前綴連接起來,並使用輸出作為標籤。一種快速的方法是使用特殊的分隔符標記將每個示例格式化為單個文本,但由於 Mistral/LLaMA 通常以聊天格式進行訓練,因此我也可以使用提示範本方法。為簡單起見,我創建了一個函數來聯接它們:
def format_example(example):
    prompt = f"<s>[INST] {example['instruction']} [/INST]\n"  # using a format akin to LLaMA-2 chat
    response = example["output"]
    full_text = prompt + response
    return tokenizer(full_text, truncation=True)

I then applied this formatting to the dataset and set it up for the Trainer:
train_data = train_data.map(format_example)  # format_example already tokenizes, so no extra tokenizer call is needed

(Depending on memory, you might want to stream or use train_data.map() with caution on large sets. My dataset was small enough.)
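Before kicking off training, it is worth decoding one mapped example to confirm the prompt template survived tokenization (a small sanity check using the input_ids column produced by format_example):

# Decode the first training example back to text to eyeball the [INST] ... [/INST] formatting
print(tokenizer.decode(train_data[0]["input_ids"]))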
d. Configure training parameters: I used the Hugging Face Trainer API for convenience:
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
training_args = TrainingArguments(
    output_dir="./outputs/lora_rust",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    logging_steps=10,
    save_steps=50,
    learning_rate=2e-4,
    fp16=True,         # enable mixed precision (this works on MPS as of PyTorch 2.1 for forward, but watch out for any issues)
    report_to="none"   # disable wandb logging
)
# Use a data collator that can handle language modeling (it will pad sequences to the same length in a batch)
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    data_collator=data_collator
)

I chose a very small batch size of 1 with gradient_accumulation_steps=4 to simulate an effective batch of 4 — this was because of limited memory on the Mac and the fact that the MPS backend currently doesn’t support more complex multi-batch operations as efficiently. The rest of the hyperparameters (epochs, LR) were picked somewhat heuristically for a small dataset. I set fp16=True hoping that mixed precision would be used; on MPS, PyTorch’s automatic mixed precision support was still being worked on, but by PyTorch 2.1 some operations can use half precision on the GPU. In any case, it didn’t crash, and using FP16 where possible helps speed.
e. Run training: Finally:
trainer.train()

This started the fine-tuning process. It was slow — let’s be honest, fine-tuning 7B on a Mac’s CPU/MPS is nowhere near GPU training speeds. But it was progressing! The training loop printed loss updates every 10 steps, and over a couple of hours (for a small dataset) I got through my 3 epochs. I could see the loss decreasing and the model seemingly learning from the Rust examples.
After training, I saved the LoRA adapter:
trainer.model.save_pretrained("./outputs/lora_rust")

The save_pretrained of a PEFT model will save the adapter weights and configuration. Indeed, in ./outputs/lora_rust/ I now saw adapter_model.bin (a few tens of MB, since LoRA weights are small) and adapter_config.json. Victory! I had a fine-tuned LoRA adapter.
(Side note: I could have also used model.save_pretrained on the LoRA-wrapped model. Both achieve the same, since trainer.model is the LoRA-wrapped model. Just ensure you’re saving the PEFT model, not the base model alone.)
3. Merging LoRA Weights with the Base Model
Having a LoRA adapter is great if you plan to use it via code (you can load the base model and apply the adapter on the fly). But I wanted to use this model in a self-contained way (e.g., in LM Studio or other inference tools that might not support PEFT natively). So the next step was merging the LoRA weights into the base model to create a standalone fine-tuned model.
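For completeness, using the adapter on the fly (without merging) looks roughly like the sketch below, with the same folder paths as above:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the LoRA adapter at load time instead of baking it into the weights
base = AutoModelForCausalLM.from_pretrained("./models/Mistral-7B-v0.1", torch_dtype="auto", device_map={"": "mps"})
tok = AutoTokenizer.from_pretrained("./models/Mistral-7B-v0.1")
model_with_adapter = PeftModel.from_pretrained(base, "./outputs/lora_rust")
model_with_adapter.eval()  # fine for generation from Python, but tools that don't understand PEFT need the merged model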
I wrote a small script merge_lora.py to do this:
from transformers import AutoModelForCausalLM
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(
    "./models/Mistral-7B-v0.1",
    torch_dtype="auto",
    device_map={"": "cpu"}  # we can do merging on CPU to avoid any MPS quirk
)
# Load the fine-tuned LoRA adapter into the base model
lora_model = PeftModel.from_pretrained(base_model, "./outputs/lora_rust")
# Merge and unload – incorporate LoRA weights into base model
merged_model = lora_model.merge_and_unload()
# Save the merged model
merged_model.save_pretrained("./outputs/merged_model", safe_serialization=True)

A couple of things to highlight:
- I loaded the base model to CPU for merging. Merging is a one-time operation and is not performance critical, so CPU is fine (just needs enough RAM). This avoids any MPS issues with the merge_and_unload function.
我將基本模型載入到 CPU 進行合併。合併是一次性作,對性能不重要,因此 CPU 很好(只需要足夠的 RAM)。這避免了 merge_and_unload 函數的任何 MPS 問題。 - PeftModel.from_pretrained applied my LoRA to the base. Then calling merge_and_unload() gave me a normal transformers model (merged_model) with the weights updated as if they had been fully fine-tuned. This function essentially adds the low-rank updates (scaled by alpha) to the original weights.
PeftModel.from_pretrained 將我的 LoRA 應用於底座。然後調用 merge_and_unload() 會得到一個普通的變壓器模型 (merged_model),權重更新了,就好像它們已經完全微調一樣。此函數實質上是將低秩更新(按 alpha 縮放)添加到原始權重。 - I saved the merged model with safe_serialization=True which saves it in a .safetensors format (you could omit that or use the default to get a pytorch_model.bin, but I prefer safetensors for safety). The output directory now contained adapter_config.json (not needed anymore) and model.safetensors (the full weights, ~13GB in FP16).
我使用 safe_serialization=True 保存了合併後的模型,將其保存為 .safetensors 格式(您可以省略該格式或使用預設值來獲得 pytorch_model.bin,但為了安全起見,我更喜歡 safetensors)。輸出目錄現在包含 adapter_config.json (不再需要)和 model.safetensors (完整權重,在 FP16 中為 ~13GB)。
This step succeeded — unlike Axolotl’s built-in merge which errored out, doing it manually with the latest PEFT library worked. If your PEFT version is old, note that the merge_and_unload is a method on the PeftModel (specifically a PeftModelForCausalLM). In my case it was available and did the job. (In fact, the Axolotl issue was that they weren’t calling it on the correct object.)
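One detail worth flagging: the GGUF conversion in the next section expects the tokenizer files to sit next to the merged weights. If they are not already in your merged-model folder, one way to put them there (a small sketch; LoRA does not change the tokenizer, so the base model's tokenizer can be reused as-is) is:

from transformers import AutoTokenizer

# Save the (unchanged) base tokenizer into the merged-model folder so conversion tools can find it
tokenizer = AutoTokenizer.from_pretrained("./models/Mistral-7B-v0.1")
tokenizer.save_pretrained("./outputs/merged_model")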
Now I had a merged model that I could use like any other Hugging Face model. For example, I could do:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("./outputs/merged_model", use_fast=True)
mod = AutoModelForCausalLM.from_pretrained("./outputs/merged_model", torch_dtype=torch.float16, device_map={"": "mps"})
res = mod.generate(**tok("How do I implement a binary tree in Rust?", return_tensors="pt").to("mps"))
print(tok.decode(res[0]))

And it would produce an answer using the fine-tuned knowledge. (I did a quick test like this — the answers seemed reasonably Rust-aware!)
4. Converting the Model to GGUF for llama.cpp
While having the Hugging Face format model is nice, running a 13GB model on my Mac for inference isn’t ideal (it can run, but not efficiently). Many local LLM tools (like LM Studio, text-generation-webui, etc.) prefer models in the GGUF format (the current iteration of llama.cpp’s model file format, succeeding the older GGML format). GGUF allows various quantization levels and is optimized for CPU inference via llama.cpp.
I decided to convert my model to GGUF. llama.cpp provides a conversion script for this purpose. Here’s what I did:
# Clone llama.cpp (if not already cloned)
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Install Python requirements for conversion (e.g., sentencepiece, numpy, safetensors etc.)
pip install -r requirements.txt
# Run the conversion script
python ./convert-hf-to-gguf.py ../outputs/merged_model --outfile mistral-rust-7b.gguf --outtype q4_0

A breakdown of the command:
- I pointed convert-hf-to-gguf.py to my merged_model directory, which contains the config.json, tokenizer.model, tokenizer.json, and the model.safetensors. The script needs those.
我將 convert-hf-to-gguf.py 指向我的 merged_model 目錄,其中包含 config.json、tokenizer.model、tokenizer.json 和 model.safetensors。腳本需要這些。 - — outfile mistral-rust-7b.gguf is the name of the output file I wanted.
— outfile mistral-rust-7b.gguf 是我想要的輸出文件的名稱。 - — outtype q4_0 specifies the quantization type. I chose 4-bit (q4_0) which drastically reduces the model size (my output GGUF file became around ~3.5 GB). You can choose other quantization levels or even unquantized (f16 or f32). I found q4_0 to be a good balance for local CPU inference — it might sacrifice a bit of accuracy but for my use-case (Rust explanations) it was fine. If you have more RAM and want better quality, q5_1 or q8_0 are options (with larger file sizes).
— outtype q4_0 指定量化類型。我選擇了 4 位 (q4_0), 這大大減小了模型大小(我輸出的 GGUF 檔變得大約 ~3.5 GB)。您可以選擇其他量化級別,甚至可以選擇未量化的級別(f16 或 f32)。我發現 q4_0 對於本地 CPU 推理來說是一個很好的平衡——它可能會犧牲一點準確性,但對於我的用例(Rust 解釋)來說,它很好。如果您有更多 RAM 並希望獲得更好的品質,則可以選擇 q5_1 或 q8_0(檔大小更大)。
The conversion script ran for a few minutes and successfully produced mistral-rust-7b.gguf. Now the model was in a single file, ready for use in llama.cpp-compatible UIs.
5. Installing the Model in LM Studio (Folder Structure Gotcha)
Finally, I wanted to load the model into LM Studio, which is a nice Mac-friendly UI for chatting with local models. LM Studio supports GGUF models, but it expects them in a specific folder structure so it can recognize and list them.
According to the LM Studio docs, the model files should be placed under ~/.lmstudio/models with a publisher and model name hierarchy. Concretely, I did this:
# Create a folder for my model in LM Studio's directory
mkdir -p ~/.lmstudio/models/local/mistral-rust-7b
# Move the GGUF file into that folder
mv mistral-rust-7b.gguf ~/.lmstudio/models/local/mistral-rust-7b/

Here, I used local as the “publisher” name (you could use anything, maybe your username or org name) and mistral-rust-7b as the model name folder. One important detail: the file name must end with .gguf (which mine does). If you have a quantization suffix (like q4_0), it’s fine to include it (e.g., mistral-rust-7b-q4_0.gguf), just keep the extension. In my case I left it as mistral-rust-7b.gguf.
I then launched LM Studio, and lo and behold, under “My Models” the model appeared as local/mistral-rust-7b. I could select it and start chatting. The Rust knowledge was there, and the model was running entirely on my Mac! 🎉
Tips and Lessons Learned
To wrap up, here are some tips for beginners gleaned from this journey, especially for those fine-tuning LLMs on Mac hardware:
- Environment and dependencies: Use an isolated environment (conda or venv) and install only what you need. On Apple Silicon, make sure to use a PyTorch version that supports MPS. Check with a small test that MPS is available, but be prepared to fall back to CPU if something isn’t supported. Keep an eye on the PyTorch release notes for improvements to MPS (each version has gotten better).
- Axolotl on Mac: Axolotl is a powerful tool and may improve Mac support over time, but currently you might encounter issues due to its expectation of NVIDIA tools. If you still want to use it, comb through the Axolotl docs and GitHub issues for Mac-specific tips. For example, they note that certain features like bitsandbytes, QLoRA, and DeepSpeed are not available on M-series Macs. You may have to disable those. Always double-check that your dataset format matches what Axolotl expects to avoid key errors.
- PEFT (LoRA) approach: Using Hugging Face’s PEFT library directly gives you flexibility. You can tailor the training loop, integrate custom data processing, and debug easier in a straightforward Python script. The trade-off is writing more boilerplate (loading data, writing a training loop or using Trainer). For many, the Trainer API will be sufficient and saves a lot of manual coding.
- Folder layout best practices: It’s easy to get confused with multiple versions of model weights flying around. I recommend organizing as follows:
Base model folder: e.g. models/<model-name> containing the original model files (whether downloaded from HF or converted).
Data folder: e.g. data/<dataset-name> for your training data.
Output (LoRA) folder: to save the adapter (if using LoRA) — e.g. outputs/<exp-name>-lora/.
Output (merged model) folder: e.g. outputs/<exp-name>-merged/.
Converted models folder: e.g. outputs/<exp-name>-gguf/ or directly move to ~/.lmstudio/models/… as appropriate.

This separation avoids overwriting something important. Always double-check which model you’re loading or saving to avoid mixing the base and fine-tuned weights inadvertently.
- Disabling unused features: As shown, disable any external logging (unless you explicitly want it) to keep things simple. Similarly, if using the Hugging Face Trainer, set report_to=”none” (or “tensorboard” if you prefer that) to avoid W&B usage. If you see tokenizer parallelism warnings, just set the env var as mentioned. These small things make the process less noisy.
- Know your tool versions: The ML ecosystem moves fast. By the time you read this, newer versions of Axolotl, PEFT, or PyTorch might have changed behaviors. Check the documentation for the versions you’re using. For instance, the merge_and_unload bug I faced might be fixed in a future Axolotl release. Always refer to official docs or source — many open-source projects have active Discords or forums where you can ask for Mac-specific help.
Conclusion
Fine-tuning a large language model on a Mac is possible — and it’s incredibly rewarding to see a model trained on your own data running locally. Throughout this journey, I encountered the rough edges of tooling that wasn’t originally designed with Apple Silicon in mind. By combining the strengths of different tools — Axolotl’s inspiration and config templates, Transformers/PEFT for a custom training loop, and llama.cpp for efficient inference — I managed to create a Rust-savvy Mistral-7B that runs on my Mac Studio.
Resources & References:
- Axolotl (Fine-tuning toolkit): GitHub repo and docs.
- PEFT (LoRA by Hugging Face): PEFT library documentation — covers how to use LoRA and other efficient tuning methods.
- Hugging Face Transformers: The Transformers documentation for details on the Trainer, model loading, etc.
- llama.cpp and GGUF conversion: See llama.cpp’s README and the conversion script usage.
- LM Studio docs: Guide on importing models in LM Studio (folder structure expectations).
Good luck with your fine-tuning experiments, and happy modeling on your Mac! 🍏🤖
Model on Hugging Face: https://huggingface.co/plawanrath/minstral-7b-rust-fine-tuned







