Fine-Tuning with Kubeflow Trainer v2

This tutorial walks through how to use Kubeflow Trainer v2 to run supervised fine-tuning (SFT) jobs with LlamaFactory on Kubernetes.

Overview

Kubeflow Trainer v2 separates job templates (TrainingRuntime) from job runs (TrainJob), which lets you:

  • Define a reusable TrainingRuntime that captures the container image, training pipeline steps (dataset init → model init → trainer), and LlamaFactory configuration.
  • Submit many TrainJob runs referencing the same runtime, overriding only what changes per experiment — base model, dataset URL, hyperparameters, or GPU resources.
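
As a rough illustration of that split, a TrainJob could reference a runtime and override only the per-experiment values, along the lines of the sketch below. It assumes the trainer.kubeflow.org/v1alpha1 API version; the runtime name, storage URIs, environment variable, and GPU count are hypothetical placeholders, and the notebook remains the reference for the actual configuration.

```yaml
# Sketch only: assumes the trainer.kubeflow.org/v1alpha1 API version.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  generateName: trainjob-sft-qwen3-
  namespace: mlops-demo-ai-test
spec:
  runtimeRef:
    apiGroup: trainer.kubeflow.org
    kind: TrainingRuntime            # namespaced runtime, not a ClusterTrainingRuntime
    name: llamafactory-sft-runtime   # hypothetical runtime name
  initializer:
    model:
      storageUri: "<base-model-uri>"   # placeholder, changes per experiment
    dataset:
      storageUri: "<dataset-uri>"      # placeholder, changes per experiment
  trainer:
    env:
      - name: LEARNING_RATE            # hypothetical hyperparameter variable
        value: "1e-4"
    resourcesPerNode:
      limits:
        nvidia.com/gpu: 1              # assumed GPU count
```

Because the manifest uses generateName rather than a fixed name, it has to be submitted with kubectl create -f instead of kubectl apply -f.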

Prerequisites

Before starting, make sure the following are in place:

Requirement | Details
Kubeflow Trainer v2 | Installed in your cluster (the trainer.kubeflow.org API group is available)
Kueue | Installed in your cluster for job scheduling and quota management (optional but recommended)
A shared PVC | A PersistentVolumeClaim accessible by all pods (e.g., team-model-cache-pvc) backed by NFS, Ceph, or local storage like topolvm
Git credentials | A Kubernetes Secret named aml-image-builder-secret with keys MODEL_REPO_GIT_USER and MODEL_REPO_GIT_TOKEN for accessing private Git repositories
GPU nodes | Nodes with NVIDIA GPUs; the examples use Tesla-T4 nodes, so adjust nodeSelector to match your cluster
kubectl access | kubectl configured with permission to create TrainingRuntime and TrainJob resources in your target namespace
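
If the Secret and the shared PVC do not exist yet, they could be created with manifests similar to the sketch below. The Secret name, its two keys, and the PVC name come from the requirements above; the namespace is the example namespace used later in this tutorial, and the storage class, access mode, and size are assumptions to adapt to your cluster.

```yaml
# Sketch only: placeholder values must be replaced before applying.
apiVersion: v1
kind: Secret
metadata:
  name: aml-image-builder-secret
  namespace: mlops-demo-ai-test
type: Opaque
stringData:
  MODEL_REPO_GIT_USER: "<git-username>"
  MODEL_REPO_GIT_TOKEN: "<git-access-token>"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: team-model-cache-pvc
  namespace: mlops-demo-ai-test
spec:
  accessModes:
    - ReadWriteMany                    # assumes a storage class that supports RWX (e.g. NFS, CephFS)
  storageClassName: "<your-storage-class>"
  resources:
    requests:
      storage: 100Gi                   # hypothetical size
```

Node-local storage typically supports only ReadWriteOnce, in which case every pod that mounts the claim must run on the same node.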

RBAC permissions

If you encounter RBAC permission errors when creating or managing Kubeflow Trainer v2 resources, stop here and contact your cluster administrator before continuing. Ask the administrator to create a temporary role and bind it to your account or namespace so you have read and write permissions for the trainjobs and trainingruntimes custom resources.

The following example shows how a cluster administrator can grant those permissions to the ServiceAccount used by a workbench named aml-editor:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: aml-editor-trainer-rw
  namespace: mlops-demo-ai-test
rules:
  - apiGroups:
      - trainer.kubeflow.org
    resources:
      - trainjobs
      - trainingruntimes
    verbs:
      - get
      - list
      - watch
      - create
      - update
      - patch
      - delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: aml-editor-trainer-rw
  namespace: mlops-demo-ai-test
subjects:
  - kind: ServiceAccount
    name: aml-editor
    namespace: mlops-demo-ai-test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: aml-editor-trainer-rw

Replace mlops-demo-ai-test with the namespace where the workbench and Trainer v2 resources run.
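
One way to confirm that the binding took effect is to submit a SelfSubjectAccessReview from the workbench, which asks the API server whether the current identity may perform a given action. The sketch below checks the create verb on trainjobs; the same pattern applies to trainingruntimes and the other verbs.

```yaml
# Sketch only: checks a single verb/resource pair for the caller's own identity.
apiVersion: authorization.k8s.io/v1
kind: SelfSubjectAccessReview
spec:
  resourceAttributes:
    group: trainer.kubeflow.org
    resource: trainjobs
    verb: create
    namespace: mlops-demo-ai-test
```

Submitting it with kubectl create -f <file> -o yaml returns the object with status.allowed set to true or false. The review is evaluated for whichever credentials kubectl is using, so run it under the same ServiceAccount the workbench uses.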

Build the Trainer Image or Use a Pre-Built Image

Use our pre-built image, alaudadockerhub/fine_tune_with_llamafactory:v0.1.11, or build your own from the Containerfile provided in aml-docs.

Download the notebook and run the example

  1. Download the notebook to your workbench in Alauda AI (create a workbench first if you don't have one), then open it.
  2. Follow the instructions in the notebook to create a TrainingRuntime and submit a TrainJob that fine-tunes a model with LlamaFactory. The notebook includes example configurations that use the team-model-cache-pvc shared PVC and the Git credentials.
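
For orientation, the general shape of a TrainingRuntime is sketched below. This is a simplified, single-step illustration and not the runtime the notebook creates: it omits the dataset and model initializer steps, assumes the trainer.kubeflow.org/v1alpha1 API version, and uses a hypothetical runtime name, node label, and mount path.

```yaml
# Sketch only: trainer step only; the notebook's runtime also defines initializer steps.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainingRuntime
metadata:
  name: llamafactory-sft-runtime       # hypothetical name
  namespace: mlops-demo-ai-test
spec:
  mlPolicy:
    numNodes: 1
  template:
    spec:
      replicatedJobs:
        - name: node
          template:
            metadata:
              labels:
                trainer.kubeflow.org/trainjob-ancestor-step: trainer
            spec:
              template:
                spec:
                  nodeSelector:
                    nvidia.com/gpu.product: Tesla-T4   # assumed node label; adjust to your cluster
                  containers:
                    - name: node
                      image: alaudadockerhub/fine_tune_with_llamafactory:v0.1.11
                      volumeMounts:
                        - name: model-cache
                          mountPath: /mnt/models        # hypothetical mount path
                  volumes:
                    - name: model-cache
                      persistentVolumeClaim:
                        claimName: team-model-cache-pvc
```

The template section is a JobSet spec, so each replicated job wraps an ordinary Job and pod template; this is where the image, node selector, and shared PVC mount live.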

Fine-Tuning on Ascend NPUs with MindSpeed-LLM

For Huawei Ascend NPU clusters, use the MindSpeed-LLM NPU notebook instead of the LlamaFactory GPU notebook.

The MindSpeed-LLM notebook shows how to:

  • Use the pre-built alaudadockerhub/alauda-workbench-jupyter-pytorch-cann-py312-ubi9:v0.1.7 image.
  • Create a Trainer v2 TrainingRuntime with runtimeClassName: ascend and schedulerName: hami-scheduler.
  • Submit a Qwen3 fine-tuning TrainJob that requests Ascend resources such as huawei.com/Ascend910B4.
  • Run the MindSpeed-LLM workflow: Hugging Face checkpoint conversion, dataset preprocessing, and SFT training.

Use this notebook when your cluster provides Ascend NPUs and your model training image must include torch_npu, mindspeed, and mindspeed_llm.
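
The Ascend-specific settings listed above might appear roughly as in the fragments below. These are sketches and not complete manifests: the first fragment belongs in the pod template of the TrainingRuntime, the second in the TrainJob spec, and the container name and device count are assumptions. The NPU notebook has the full manifests.

```yaml
# Fragment 1 (sketch): pod template spec inside the TrainingRuntime.
spec:
  runtimeClassName: ascend
  schedulerName: hami-scheduler
  containers:
    - name: node                       # assumed container name
      image: alaudadockerhub/alauda-workbench-jupyter-pytorch-cann-py312-ubi9:v0.1.7
---
# Fragment 2 (sketch): Ascend resource request in the TrainJob spec.
spec:
  trainer:
    resourcesPerNode:
      limits:
        huawei.com/Ascend910B4: 1      # assumed device count
```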

Scheduling with Kueue

Kueue provides job queuing, quota management, and fair scheduling for Kubernetes workloads. When a TrainJob is submitted to a Kueue queue, it is held in a suspended state until Kueue admits it based on available quota.

How it works

  1. A cluster admin creates a ClusterQueue with resource quotas (CPU, memory, GPU).
  2. A namespace admin creates a LocalQueue pointing to the ClusterQueue.
  3. Users label their TrainJob with kueue.x-k8s.io/queue-name to submit it to a LocalQueue.
  4. Kueue evaluates the resource request, admits the workload when quota is available, and unsuspends the job.

Refer to the Kueue documentation for more details on setting up a ClusterQueue and a LocalQueue.
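
As an illustration of step 1, a cluster admin might define a ResourceFlavor and a ClusterQueue along these lines. The sketch assumes the kueue.x-k8s.io/v1beta2 API version (older Kueue releases serve v1beta1), and the flavor name and quota values are hypothetical. The queue name cluster-queue is the one a LocalQueue would reference.

```yaml
# Sketch only: quota values are placeholders.
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: default-flavor                 # hypothetical flavor name
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}                # empty selector matches all namespaces
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default-flavor
          resources:
            - name: cpu
              nominalQuota: 16
            - name: memory
              nominalQuota: 64Gi
            - name: nvidia.com/gpu
              nominalQuota: 2
```

Every resource that a job requests needs a quota entry here; otherwise Kueue cannot admit the workload through this queue.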

Create a LocalQueue (optional)

Before submitting TrainJobs with Kueue, create a LocalQueue in your namespace that references an existing ClusterQueue:

apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  name: local-queue
  namespace: mlops-demo-ai-test
spec:
  clusterQueue: cluster-queue

Save the manifest as kf-local-queue.yaml and apply it:

kubectl apply -f kf-local-queue.yaml

Submit a TrainJob with Kueue (optional)

To integrate with Kueue, add the kueue.x-k8s.io/queue-name label to your TrainJob's metadata.labels. This tells Kueue which LocalQueue the job belongs to:

metadata:
  generateName: trainjob-sft-qwen3-
  namespace: mlops-demo-ai-test
  labels:
    kueue.x-k8s.io/queue-name: local-queue

The rest of the TrainJob spec remains the same. See the notebook for the full example.

NOTE

When Kueue is enabled, the cluster may have a PodsReady timeout configured (Kueue's waitForPodsReady setting, e.g., 5 minutes). If your training image is large and not yet cached on the node, the first attempt may be evicted because the image pull takes longer than that timeout. Resubmitting the job usually succeeds once the image is cached on the node where the pods are scheduled.