Fine-Tuning with Kubeflow Trainer v2
This tutorial walks through how to use Kubeflow Trainer v2 to run supervised fine-tuning (SFT) jobs with LlamaFactory on Kubernetes.
TOC
- Overview
- Prerequisites
  - RBAC permissions
  - Build the Trainer Image or Use a Pre-Built Image
- Download the notebook and run the example
- Fine-Tuning on Ascend NPUs with MindSpeed-LLM
- Scheduling with Kueue
  - How it works
  - Create a LocalQueue (optional)
  - Submit a TrainJob with Kueue (optional)
Overview
Kubeflow Trainer v2 separates job templates (TrainingRuntime) from job runs (TrainJob), which lets you:
- Define a reusable TrainingRuntime that captures the container image, training pipeline steps (dataset init → model init → trainer), and LlamaFactory configuration.
- Submit many TrainJob runs referencing the same runtime, overriding only what changes per experiment: base model, dataset URL, hyperparameters, or GPU resources.
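As a rough illustration of that split, the sketch below shows a TrainJob that points at an existing runtime and overrides only per-experiment values. It assumes the trainer.kubeflow.org/v1alpha1 API; the runtime name, job name, environment variable names, and model and dataset values are hypothetical placeholders, and the variables your runtime actually reads depend on how its scripts are written.

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1   # assumed API version
kind: TrainJob
metadata:
  name: sft-experiment-01                   # hypothetical run name
  namespace: mlops-demo-ai-test
spec:
  # Reuse the shared template instead of repeating image and pipeline steps.
  runtimeRef:
    apiGroup: trainer.kubeflow.org
    kind: TrainingRuntime
    name: llamafactory-sft                  # hypothetical runtime name
  trainer:
    numNodes: 1
    # Per-experiment GPU request.
    resourcesPerNode:
      limits:
        nvidia.com/gpu: "1"
    # Hypothetical variables; use whatever names your runtime expects.
    env:
      - name: BASE_MODEL
        value: "example-org/example-base-model"
      - name: DATASET_URL
        value: "example-org/example-dataset"
      - name: LEARNING_RATE
        value: "2e-5"
```

A second experiment would typically be another TrainJob with a different name and different values under trainer, leaving the TrainingRuntime untouched.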
Prerequisites
Before starting, make sure the following are in place:
RBAC permissions
If you encounter RBAC permission errors when creating or managing Kubeflow Trainer v2 resources, stop here and contact your cluster administrator before continuing. Ask the administrator to create a temporary role and bind it to your account or namespace so you have read and write permissions for the trainjobs and trainingruntimes custom resources.
The following example shows how a cluster administrator can grant those permissions to the ServiceAccount used by a workbench named aml-editor:
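A minimal sketch of such a grant is shown here. It assumes the ServiceAccount itself is named aml-editor and that the Trainer v2 resources belong to the trainer.kubeflow.org API group; the Role and RoleBinding names are hypothetical, and the verb list can be narrowed to match your policy.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: trainer-v2-editor                 # hypothetical Role name
  namespace: mlops-demo-ai-test
rules:
  # Read and write access to the two Trainer v2 custom resources.
  - apiGroups: ["trainer.kubeflow.org"]   # assumed API group
    resources: ["trainjobs", "trainingruntimes"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: trainer-v2-editor-binding         # hypothetical RoleBinding name
  namespace: mlops-demo-ai-test
subjects:
  - kind: ServiceAccount
    name: aml-editor                      # assumed ServiceAccount name
    namespace: mlops-demo-ai-test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: trainer-v2-editor
```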
Replace mlops-demo-ai-test with the namespace where the workbench and Trainer v2 resources run.
Build the Trainer Image or Use a Pre-Built Image
Use the pre-built image alaudadockerhub/fine_tune_with_llamafactory:v0.1.11, or build your own image from the Containerfile provided in aml-docs.
Download the notebook and run the example
- Download the notebook to your workbench in Alauda AI (create a new workbench if you don't have one) and open it.
- Follow the instructions in the notebook to create a TrainingRuntime and submit a TrainJob that fine-tunes a model with LlamaFactory. The notebook includes example configurations for using the team-model-cache-pvc shared PVC and Git credentials.
Fine-Tuning on Ascend NPUs with MindSpeed-LLM
For Huawei Ascend NPU clusters, use the MindSpeed-LLM NPU notebook instead of the LlamaFactory GPU notebook.
The MindSpeed-LLM notebook shows how to:
- Use the pre-built alaudadockerhub/alauda-workbench-jupyter-pytorch-cann-py312-ubi9:v0.1.7 image.
- Create a Trainer v2 TrainingRuntime with runtimeClassName: ascend and schedulerName: hami-scheduler.
- Submit a Qwen3 fine-tuning TrainJob that requests Ascend resources such as huawei.com/Ascend910B4.
- Run the MindSpeed-LLM workflow: Hugging Face checkpoint conversion, dataset preprocessing, and SFT training.
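The runtime settings listed above might fit together roughly as follows. This is only a sketch, not the notebook's manifest: it assumes the trainer.kubeflow.org/v1alpha1 API with a JobSet-style template, the runtime name is hypothetical, the replicated job and container name "node" is an assumed convention, and the resource count is illustrative. Refer to the notebook for the exact definition.

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1   # assumed API version
kind: TrainingRuntime
metadata:
  name: mindspeed-llm-ascend                # hypothetical runtime name
  namespace: mlops-demo-ai-test
spec:
  mlPolicy:
    numNodes: 1
  template:
    spec:
      replicatedJobs:
        - name: node                        # assumed trainer job name
          template:
            spec:
              template:
                spec:
                  # Pod-level settings named in the list above.
                  runtimeClassName: ascend
                  schedulerName: hami-scheduler
                  containers:
                    - name: node            # assumed trainer container name
                      image: alaudadockerhub/alauda-workbench-jupyter-pytorch-cann-py312-ubi9:v0.1.7
                      resources:
                        limits:
                          huawei.com/Ascend910B4: "1"   # illustrative count
```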
Use this notebook when your cluster provides Ascend NPUs and your model training image must include torch_npu, mindspeed, and mindspeed_llm.
Scheduling with Kueue
Kueue provides job queuing, quota management, and fair scheduling for Kubernetes workloads. When Kueue is installed in your cluster, TrainJobs are held in a suspended state until Kueue admits them based on available quota.
How it works
- A cluster admin creates a ClusterQueue with resource quotas (CPU, memory, GPU).
- A namespace admin creates a LocalQueue pointing to the ClusterQueue.
- Users label their TrainJob with kueue.x-k8s.io/queue-name to submit it to a LocalQueue.
- Kueue evaluates the resource request, admits the workload when quota is available, and unsuspends the job.
Refer to the Kueue documentation for details on setting up a ClusterQueue and a LocalQueue.
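For orientation, the cluster admin's side of this flow could look like the sketch below. It assumes the kueue.x-k8s.io/v1beta1 API and NVIDIA GPUs exposed as nvidia.com/gpu; the flavor name, queue name, and quota values are hypothetical and should be replaced with values that match your cluster.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor              # hypothetical flavor name
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-cluster-queue          # hypothetical queue name
spec:
  namespaceSelector: {}             # accept LocalQueues from any namespace
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default-flavor
          resources:
            # Illustrative quotas only.
            - name: cpu
              nominalQuota: 32
            - name: memory
              nominalQuota: 128Gi
            - name: nvidia.com/gpu
              nominalQuota: 4
```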
Create a LocalQueue (optional)
Before submitting TrainJobs with Kueue, create a LocalQueue in your namespace that references an existing ClusterQueue:
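A minimal LocalQueue might look like this, assuming the kueue.x-k8s.io/v1beta1 API. The queue names are hypothetical; spec.clusterQueue must match the name of a ClusterQueue your cluster admin has already created.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-local-queue            # hypothetical LocalQueue name
  namespace: mlops-demo-ai-test     # namespace where TrainJobs are submitted
spec:
  clusterQueue: team-cluster-queue  # hypothetical; must already exist
```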
Submit a TrainJob with Kueue (optional)
To integrate with Kueue, add the kueue.x-k8s.io/queue-name label to your TrainJob's metadata.labels. This tells Kueue which LocalQueue the job belongs to:
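As a sketch, the relevant part of such a TrainJob is shown here. It assumes the trainer.kubeflow.org/v1alpha1 API; the job, queue, and runtime names are hypothetical.

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1   # assumed API version
kind: TrainJob
metadata:
  name: sft-with-kueue                      # hypothetical job name
  namespace: mlops-demo-ai-test
  labels:
    # Name of a LocalQueue in the same namespace.
    kueue.x-k8s.io/queue-name: team-local-queue   # hypothetical queue name
spec:
  runtimeRef:
    apiGroup: trainer.kubeflow.org
    kind: TrainingRuntime
    name: llamafactory-sft                  # hypothetical runtime name
```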
The rest of the TrainJob spec remains the same. See the notebook for the full example.
When Kueue is enabled, the cluster may have a PodsReady timeout configured (for example, 5 minutes). If your training image is large and not yet cached on the node, the first attempt may be evicted because the image pull exceeds that timeout. Resubmitting the job usually succeeds because the image is then cached on the node.
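This timeout is set by the cluster administrator in the Kueue controller configuration, not in the TrainJob. A fragment along the following lines would produce the 5-minute behavior described above; it is a sketch assuming the config.kueue.x-k8s.io/v1beta1 Configuration format, and your cluster's actual values may differ.

```yaml
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true
  timeout: 5m    # workloads whose pods are not ready within this window are evicted
```

If large images regularly exceed the window, ask the administrator whether the timeout can be raised or the image pulled onto the nodes in advance.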