
Kubeflow AI model training Kubernetes
Managing machine learning workloads on Kubernetes has always been complex, particularly for distributed training and fine-tuning of large language models (LLMs). It often requires an in-depth understanding of Kubernetes: orchestrating multiple nodes and GPUs, and handling large datasets with fault tolerance.
To address these challenges, the Kubeflow community has introduced the Kubeflow Trainer v2 (KF Trainer), which streamlines these processes by abstracting the complexity of Kubernetes for AI practitioners. This article explores how KF Trainer v2 simplifies AI/ML workload management, the evolution of the Kubeflow Trainer, and its impact on AI practitioners.
Kubeflow PyTorch AI model training
Kubeflow Trainer v2 is designed with several key goals in mind. It aims to make AI/ML workloads easier to manage at scale, provide a Pythonic interface for model training, and offer the most scalable PyTorch distributed training on Kubernetes.
Additionally, it includes built-in support for fine-tuning LLMs while abstracting Kubernetes complexity from AI practitioners. This initiative consolidates efforts between the Kubernetes Batch Working Group and the Kubeflow community, facilitating a more streamlined approach to managing distributed PyTorch jobs (Kubeflow Blog, 2023). The development of KF Trainer v2 has been a collaborative effort, with contributions from numerous community members and developers.
Their hard work and valuable feedback have been instrumental in shaping the architecture of the Trainer. Special recognition is given to contributors such as Andrey Velichkevich, Tenzin Y, and many others who played pivotal roles in this project.

Kubeflow TensorFlow AI model training
Kubeflow Trainer v2 represents an evolution from the original Kubeflow Training Operator, building on over seven years of experience running ML workloads on Kubernetes. The journey began in 2017 with the introduction of TFJob to orchestrate TensorFlow training on Kubernetes.
At the time, Kubernetes lacked the advanced batch processing features required for distributed ML training, prompting the community to develop these capabilities from scratch (Kubeflow Blog, 2023). Over the years, the project expanded to support multiple ML frameworks, including PyTorch, MXNet, MPI, and XGBoost, through specialized operators. In 2021, these efforts were consolidated into the unified Training Operator v1.
The Kubernetes community also introduced the Batch Working Group, which developed APIs like JobSet, Kueue, Indexed Jobs, and PodFailurePolicy that improved the management of HPC and AI workloads. Trainer v2 leverages these Kubernetes-native improvements, delivering a more standardized approach to ML training on Kubernetes.
Kubernetes AI model training solutions
One of the main challenges with ML training on Kubernetes is that it often requires AI practitioners to understand Kubernetes concepts and the infrastructure used for training. KF Trainer v2 addresses this by separating infrastructure configuration from the training job definition using three new custom resource definitions (CRDs):
① TrainingRuntime: A namespace-scoped resource containing infrastructure details for a training job, such as the training image to use, failure policy, and gang-scheduling configuration.
② ClusterTrainingRuntime: Similar to TrainingRuntime but cluster-scoped.
③ TrainJob: Specifies the training job configuration, including the training code to run, config for pulling the training dataset and model, and a reference to the training runtime (Kubeflow Blog, 2023).
This separation allows platform administrators to define and manage the infrastructure configuration required for training jobs, while AI practitioners focus on model development using the simplified TrainJob resource or the Python SDK wrapper.
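The split between runtime and job can be pictured with a small in-memory model. This is only a sketch: the field names (`image`, `gang_scheduling`, `runtime_kind`, `runtime_name`, `command`) and the lookup rule in `resolve_runtime` are assumptions made for illustration, not the actual CRD schema or controller behavior.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Runtime:
    """Infrastructure details owned by a platform administrator (illustrative fields)."""
    name: str
    image: str
    gang_scheduling: bool
    namespace: Optional[str] = None  # None models a cluster-scoped runtime

    @property
    def kind(self) -> str:
        return "ClusterTrainingRuntime" if self.namespace is None else "TrainingRuntime"


@dataclass(frozen=True)
class TrainJob:
    """Job configuration owned by an AI practitioner (illustrative fields)."""
    name: str
    namespace: str
    runtime_kind: str
    runtime_name: str
    command: Tuple[str, ...]


def resolve_runtime(job: TrainJob, runtimes: Sequence[Runtime]) -> Runtime:
    """Find the runtime a job refers to.

    Assumption for this sketch: a namespace-scoped runtime is only visible
    to jobs in the same namespace; a cluster-scoped one is visible to all.
    """
    for rt in runtimes:
        if rt.kind != job.runtime_kind or rt.name != job.runtime_name:
            continue
        if rt.namespace is None or rt.namespace == job.namespace:
            return rt
    raise LookupError(f"no {job.runtime_kind} named {job.runtime_name!r}")


# Hypothetical runtimes and job, for demonstration only.
runtimes = [
    Runtime("torch", "example.com/torch:team", True, namespace="team-a"),
    Runtime("torch", "example.com/torch:cluster", True),
]
job = TrainJob("demo", "team-a", "ClusterTrainingRuntime", "torch", ("python", "train.py"))
print(resolve_runtime(job, runtimes).image)
```

The job carries only a reference and its own training command; everything about images and scheduling lives in the runtime, which is the division of responsibility described above.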
Kubeflow SDK machine learning integration
A noteworthy feature of Kubeflow Trainer v2 is its redesigned Python SDK, serving as the primary interface for AI practitioners. This SDK provides a unified interface across multiple ML frameworks and cloud environments, abstracting away the underlying Kubernetes complexity.
It enables users familiar with Python to create, manage, and monitor training jobs without dealing with YAML definitions (Kubeflow Blog, 2023). The SDK supports multiple ML frameworks through pre-configured runtimes and handles all Kubernetes API interactions, so AI practitioners can focus on their training logic while running ML jobs across different ML frameworks, Kubernetes infrastructures, and cloud providers.
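To make the "Python instead of YAML" idea concrete, here is a minimal sketch of the interaction pattern. `InMemoryTrainerClient` is a hypothetical stand-in written for this example, not the Kubeflow SDK: it keeps jobs in a dictionary rather than calling the Kubernetes API, and its method names, parameters, and the runtime name `example-runtime` are all assumptions.

```python
import itertools


class InMemoryTrainerClient:
    """Hypothetical stand-in for an SDK client.

    It records jobs in a dict instead of talking to a cluster, so the
    create / inspect / list flow can be tried without Kubernetes.
    """

    def __init__(self):
        self._jobs = {}
        self._ids = itertools.count(1)

    def train(self, func, num_nodes, runtime):
        """Record a job for `func` and return its generated name."""
        if num_nodes < 1:
            raise ValueError("num_nodes must be at least 1")
        name = f"job-{next(self._ids)}"
        self._jobs[name] = {
            "entrypoint": func.__name__,
            "num_nodes": num_nodes,
            "runtime": runtime,
            "status": "Created",
        }
        return name

    def get_job(self, name):
        """Return a copy of the stored job record."""
        return dict(self._jobs[name])

    def list_jobs(self):
        """Return all known job names in sorted order."""
        return sorted(self._jobs)


def train_fn():
    """Placeholder for the practitioner's training logic."""
    return "trained"


client = InMemoryTrainerClient()
job_name = client.train(train_fn, num_nodes=2, runtime="example-runtime")
print(job_name, client.get_job(job_name)["status"])
```

The practitioner only supplies a function, a node count, and the name of a runtime prepared by an administrator; turning that into cluster resources is the client's job.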

Kubeflow Trainer API for AI model training
In the past, the Kubeflow Training Operator required users to work with different custom resources for each ML framework, each with its own framework-specific configurations. KF Trainer v2 simplifies this by replacing multiple CRDs with a unified TrainJob API that works with multiple ML frameworks (Kubeflow Blog, 2023).
For instance, a PyTorch training job in KF Trainer v2 is a TrainJob that references a training runtime, rather than a PyTorch-specific resource. Infrastructure and Kubernetes-specific details are managed separately by platform administrators, which simplifies job creation and makes training jobs on Kubernetes easier to scale and adapt.
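The contrast between per-framework resources and a single unified resource can be sketched as two small functions. The v1 kind names below follow the Training Operator's per-framework naming (TFJob is mentioned above; the others are included as assumed examples), while the `runtimeRef` field layout and the runtime names are assumptions for illustration only.

```python
# One resource kind per framework, as in the older per-framework model.
V1_KIND_BY_FRAMEWORK = {
    "tensorflow": "TFJob",
    "pytorch": "PyTorchJob",
    "mpi": "MPIJob",
    "xgboost": "XGBoostJob",
}

# Hypothetical runtime names an administrator might have installed.
RUNTIME_BY_FRAMEWORK = {
    "tensorflow": "example-tf-runtime",
    "pytorch": "example-torch-runtime",
}


def v1_resource(framework, name):
    """The resource kind changes with the framework."""
    return {"kind": V1_KIND_BY_FRAMEWORK[framework], "name": name}


def v2_resource(framework, name):
    """The kind is always TrainJob; only the runtime reference changes."""
    return {
        "kind": "TrainJob",
        "name": name,
        "runtimeRef": {"name": RUNTIME_BY_FRAMEWORK[framework]},
    }


for fw in ("tensorflow", "pytorch"):
    print(v1_resource(fw, "demo")["kind"], "->", v2_resource(fw, "demo")["kind"])
```

With the unified shape, tooling that creates or lists jobs needs to know about one resource kind, and framework differences are pushed into the referenced runtime.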
Custom ML Frameworks Pipeline Integration
One of the challenges with Training Operator v1 was supporting additional ML frameworks, particularly closed-source ones. The v2 architecture addresses this by introducing a Pipeline Framework that allows platform administrators to extend plugins and support orchestration for their custom in-house ML frameworks.
The Pipeline Framework works through phases like Startup, PreExecution, Build, and PostExecution, each with extension points for custom plugins (Kubeflow Blog, 2023). This approach enables the addition of support for new frameworks, custom validation logic, or specialized training orchestration without altering the underlying system. This flexibility lets KF Trainer v2 accommodate a wide range of ML frameworks and use cases.
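A phased plugin pipeline of this kind can be sketched in a few lines. The phase names come from the description above; the `Pipeline` class, the plugin signature (a function from job state to job state), and the `MY_FRAMEWORK_NODES` variable are hypothetical and do not reflect the Trainer's actual plugin interfaces.

```python
from collections import defaultdict

# Phase order as listed in the text.
PHASES = ("Startup", "PreExecution", "Build", "PostExecution")


class Pipeline:
    """Runs registered plugins phase by phase over a job-state dict."""

    def __init__(self):
        self._plugins = defaultdict(list)

    def register(self, phase, plugin):
        """Attach a plugin to one of the known phases."""
        if phase not in PHASES:
            raise ValueError(f"unknown phase: {phase}")
        self._plugins[phase].append(plugin)

    def run(self, job):
        """Apply every plugin in phase order and return the final state."""
        state = dict(job)
        for phase in PHASES:
            for plugin in self._plugins[phase]:
                state = plugin(state)
        return state


def validate_nodes(state):
    """Custom validation logic: reject jobs without a positive node count."""
    if state["num_nodes"] < 1:
        raise ValueError("num_nodes must be at least 1")
    return state


def add_framework_env(state):
    """Framework-specific step: inject a made-up environment variable."""
    env = dict(state.get("env", {}))
    env["MY_FRAMEWORK_NODES"] = str(state["num_nodes"])
    return {**state, "env": env}


pipeline = Pipeline()
pipeline.register("PreExecution", validate_nodes)
pipeline.register("Build", add_framework_env)
result = pipeline.run({"name": "demo", "num_nodes": 4})
print(result["env"])
```

Supporting another in-house framework in this model means registering more plugins, leaving `Pipeline.run` untouched.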
Trainer v2 fine-tuning support
A significant improvement in Trainer v2 is its built-in support for fine-tuning large language models. It offers two types of trainers:
① BuiltinTrainer: Includes fine-tuning logic and allows AI practitioners to start fine-tuning quickly with only parameter adjustments.
② CustomTrainer: Allows users to provide their own training function that encapsulates the entire LLM fine-tuning process (Kubeflow Blog, 2023).
In its initial release, the TorchTune LLM Trainer is supported as the BuiltinTrainer option. It provides pre-configured runtimes for models like Llama-3.2-1B-Instruct and Llama-3.2-3B-Instruct. This design leaves room for more frameworks to be added as BuiltinTrainer options in the future.
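The difference between the two trainer types is mainly about who supplies the training logic. The sketch below models that with two illustrative classes; `BuiltinTrainerSketch`, `CustomTrainerSketch`, their fields, and the `epochs` parameter are assumptions made for this example and are not the SDK's actual classes or options.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class BuiltinTrainerSketch:
    """Fine-tuning logic ships with the runtime; the user only tunes parameters."""
    parameters: Dict[str, object] = field(default_factory=dict)


@dataclass
class CustomTrainerSketch:
    """The user supplies a function containing the whole fine-tuning process."""
    func: Callable[[], object]


def describe(trainer):
    """Summarize where the training logic comes from for a given trainer."""
    if isinstance(trainer, BuiltinTrainerSketch):
        return {"logic": "provided by runtime", "parameters": dict(trainer.parameters)}
    if isinstance(trainer, CustomTrainerSketch):
        return {"logic": trainer.func.__name__, "parameters": {}}
    raise TypeError(f"unsupported trainer type: {type(trainer).__name__}")


def my_fine_tune():
    """Placeholder for a user-written fine-tuning routine."""
    return "done"


print(describe(BuiltinTrainerSketch(parameters={"epochs": 1})))
print(describe(CustomTrainerSketch(func=my_fine_tune)))
```

The built-in path trades flexibility for a faster start, while the custom path keeps full control in the practitioner's own function.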

Kubeflow AI model training Kubernetes
Kubeflow Trainer v2 marks a significant advancement in the management of AI/ML workloads on Kubernetes. By abstracting the complexities of Kubernetes, providing a Pythonic interface, and supporting a wide range of ML frameworks, it lets AI practitioners focus on model development and training without being bogged down by infrastructure details.
The collaboration between the Kubernetes and Kubeflow communities has led to a more standardized and efficient approach to ML training, making KF Trainer v2 an invaluable tool for AI practitioners and platform administrators alike. With its extensibility and support for LLM fine-tuning, KF Trainer v2 is poised to play a crucial role in the future of AI model training on Kubernetes.