Kubeflow makes more sense when you understand this
Updated: Jul 28
Previously I explained the concepts and goals of deploying and using a Machine Learning Platform: https://www.romankazinnik.com/post/automate-machine-learning-2-kubeflow-vs-metaflow Let us now see how to get the most benefit from Kubeflow. It is important to understand that Kubeflow is a platform. That means the cost of Kubeflow includes a higher upfront deployment cost and a small ongoing overhead during its use. Specifically, there is a one-time Kubeflow cluster setup cost paid by MLOps, and a small overhead paid by the ML team, which will now need to produce all its ML code as persistent Docker components. Consider Metaflow if, for any reason, working with persistent components is not an option.
My target audience is MLOps/DevOps and Machine Learning Engineers/Data Scientists, and I have written two separate guides for these two groups. Why separate? Because these two groups have distinctly different goals but often use overloaded terms.
MLOps
... Kubeflow is essentially a layer that abstracts your k8s cluster from all the "ML" nuances such as accuracies, training, overfitting, etc. Follow these steps and coordinate these concepts with your ML team and you will run thousands of experiments without failures.
Kubeflow cluster setup:
... most importantly, make sure that the Kubeflow sample pipelines run 'out-of-the-box' without failures, in both GPU and CPU modes, invoked both from the Kubeflow GUI and from a Kubeflow Jupyter notebook server. In that case, you won't need to re-install the Kubeflow cluster for a long time. We all hate having to re-install a cluster because of some forgotten functionality discovered after the cluster has been deployed and is running products.
Unit testing prior to Kubeflow deployment:
... your ML team runs all its experiments as Docker containers and can access all the required remote and cloud storage.
What you get from ML team:
... is an automatically Kubeflow-generated yaml file. The ML team's Kubeflow pipelines are encoded in this yaml file. It makes sense to inspect the yaml file to double-check resources such as memory and GPU. Report any inconsistencies back and let the ML team fix the problems and create a new version of their Kubeflow pipeline; do not edit that yaml file yourself.
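One way to make that inspection repeatable is a small script that walks the compiled pipeline and lists every resources block it finds. The sketch below is an illustration under assumptions: it works on a plain Python dict (what a YAML loader such as PyYAML's yaml.safe_load would return for the file), the sample dict is hand-written and only mimics a container-template layout, and the helper name find_resources is hypothetical.

```python
from typing import Any, Iterator, Tuple


def find_resources(node: Any, path: str = "") -> Iterator[Tuple[str, dict]]:
    """Yield (path, resources) for every 'resources' mapping in a nested structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else str(key)
            if key == "resources" and isinstance(value, dict):
                yield child, value
            else:
                yield from find_resources(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_resources(item, f"{path}[{index}]")


# Hand-written stand-in for a loaded pipeline file (assumed layout, not real output).
sample_pipeline = {
    "spec": {
        "templates": [
            {
                "name": "train",
                "container": {
                    "image": "example/ml:latest",  # hypothetical image name
                    "resources": {
                        "limits": {"memory": "16G", "nvidia.com/gpu": 1},
                        "requests": {"memory": "8G"},
                    },
                },
            },
            # No resources declared here: worth reporting back to the ML team.
            {"name": "prepare", "container": {"image": "example/ml:latest"}},
        ]
    }
}

for location, resources in find_resources(sample_pipeline):
    print(location, resources)
```

Because the walk is generic, it does not depend on one particular file layout; components that print nothing are the ones with no explicit resource declaration.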
Access and Secrets:
Still running ML components as Docker images, the ML containers will access all the resources cross-cloud and cross-project. You will see that some access rights, secrets, and user accounts need to be updated.
CPU and GPU:
... still running ML components as Docker images, the ML team runs the Docker images in both CPU and GPU modes, the latter with docker run --gpus all
Switching between CPU and GPU is later done in Kubeflow as follows: add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-v100')
Benchmark the memory used by the ML components. The result is later declared explicitly in Kubeflow as follows: set_memory_limit('16G').set_memory_request('8G')
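A rough way to get from a benchmark to those two strings is sketched below. It is only a starting point under stated assumptions: tracemalloc sees allocations made through Python's allocator, so memory used by native libraries or GPUs is not counted, and a container-level measurement (for example with docker stats) is the more faithful number. The rounding rule and the 2x headroom are my own illustrative choices, and the function names are hypothetical.

```python
import math
import tracemalloc
from typing import Callable, Tuple


def measure_peak_bytes(component: Callable[[], object]) -> int:
    """Run a component once and return the peak Python-tracked allocation in bytes."""
    tracemalloc.start()
    try:
        component()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def suggest_memory_settings(peak_bytes: int, headroom: float = 2.0) -> Tuple[str, str]:
    """Turn a measured peak into (request, limit) strings such as ('8G', '16G').

    Assumption: request = peak rounded up to a whole gigabyte (10**9 bytes),
    limit = request * headroom. Both rules are illustrative, not a Kubeflow default.
    """
    request_gb = max(1, math.ceil(peak_bytes / 10**9))
    limit_gb = math.ceil(request_gb * headroom)
    return f"{request_gb}G", f"{limit_gb}G"


def toy_component() -> int:
    """Stand-in for a real ML component: allocates a list and sums it."""
    data = list(range(200_000))
    return sum(data)


peak = measure_peak_bytes(toy_component)
print(peak, suggest_memory_settings(peak))
```

A measured peak of 7.5 GB would map to a request of '8G' and a limit of '16G' under these rules, which are the values used in the declaration above.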
The two important issues with IO are access rights and IO bandwidth. Access rights on Google Cloud Platform are managed with secrets and GCP user accounts. An example is a GCP bucket that can be accessed from within the same Project by particular users, or by using GCP key files. IO bandwidth limits the number of Kubeflow/k8s pods that can run IO at the same time, and the fastest solution is to limit the number of IO-heavy pods running simultaneously.
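The throttling idea itself can be shown in a few lines of plain Python. This is not how the cap is configured on a Kubeflow cluster (that setting is not shown here); it is a minimal sketch of the principle, with threads standing in for pods, a sleep standing in for the IO, and a hypothetical IoGate class enforcing an assumed cap of two.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


class IoGate:
    """Lets at most `max_concurrent` callers run their IO section at once."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._active = 0
        self.peak_active = 0  # highest number of simultaneous IO sections seen

    def run(self, task: Callable[[int], int], task_id: int) -> int:
        with self._slots:  # blocks while the cap is reached
            with self._lock:
                self._active += 1
                self.peak_active = max(self.peak_active, self._active)
            try:
                return task(task_id)
            finally:
                with self._lock:
                    self._active -= 1


def fake_io(task_id: int) -> int:
    """Stand-in for an IO-heavy pod: sleeps instead of reading or writing."""
    time.sleep(0.01)
    return task_id


def run_all(gate: IoGate, count: int, workers: int = 8) -> List[int]:
    """Start `count` tasks on `workers` threads; the gate throttles the IO part."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: gate.run(fake_io, i), range(count)))
```

Eight workers are available, but the gate never lets more than two of them into the IO section, which is the behaviour you want from IO-heavy pods sharing one storage backend.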
ML Engineers and Data Scientists
... Kubeflow essentially helps you solve the troubles you would normally face when running your experiments anywhere outside of your laptop. Yes, that means if all your ML experiments always run on your personal laptop, you don't need Kubeflow. And yes, if you have a second computer or a Cloud instance, consider using Kubeflow.
These are the prerequisites to start working with Kubeflow painlessly. By the end of this sequence of steps, you will have your Jupyter notebook running as a Machine Learning flow with all the perks: distributed, persistent, scalable, cluster-based, tracked experimentation.
1. Decompose your code into persistent stateless components: a simple example can be a component that reads input files and outputs ML input data such as tables and time-series. A second component reads ML input data, trains a model, and outputs model inferences.
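The two components described in step 1 could look roughly like this. It is a minimal sketch with made-up file formats and a deliberately trivial "model" (the mean of y); the function names and the x/y columns are assumptions. The point is only that each component keeps no state and talks to the next one through files.

```python
import csv
import json
import tempfile
from pathlib import Path
from typing import List


def prepare_data(raw_path: Path, table_path: Path) -> None:
    """Component 1: read a raw input file and write ML input data as a CSV table."""
    rows = [line.split(",") for line in raw_path.read_text().splitlines() if line]
    with table_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["x", "y"])
        writer.writerows(rows)


def train_and_predict(table_path: Path, predictions_path: Path) -> None:
    """Component 2: read the table, fit a trivial model, write model inferences."""
    with table_path.open(newline="") as handle:
        rows = list(csv.DictReader(handle))
    mean_y = sum(float(r["y"]) for r in rows) / len(rows)  # the "model" is the mean of y
    predictions = [{"x": float(r["x"]), "prediction": mean_y} for r in rows]
    predictions_path.write_text(json.dumps(predictions))


def demo() -> List[dict]:
    """Chain both components through files in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "raw.txt"
        table = Path(tmp) / "table.csv"
        predictions = Path(tmp) / "predictions.json"
        raw.write_text("1,2\n3,4\n")
        prepare_data(raw, table)
        train_and_predict(table, predictions)
        return json.loads(predictions.read_text())


print(demo())
```

Because neither function remembers anything between calls, either one can be rerun, replaced, or moved to another machine as long as its input file is available.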
2. Make your code run as Docker containers with docker run
You can make all the components reside in a single Docker image. You will probably need help from an MLOps engineer to learn how your company manages Docker images.
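If all components live in one image, a single entrypoint can pick the component from the command line. The sketch below assumes two hypothetical component names (prepare, train) and hypothetical flags (--input, --output, --epochs); a real entrypoint would call the component code where this one only reports what it would do.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """One sub-command per component, so one image can serve every pipeline step."""
    parser = argparse.ArgumentParser(prog="ml-components")
    sub = parser.add_subparsers(dest="component", required=True)

    prepare = sub.add_parser("prepare", help="raw files -> ML input table")
    prepare.add_argument("--input", required=True)
    prepare.add_argument("--output", required=True)

    train = sub.add_parser("train", help="ML input table -> model inferences")
    train.add_argument("--input", required=True)
    train.add_argument("--output", required=True)
    train.add_argument("--epochs", type=int, default=1)
    return parser


def main(argv: Optional[List[str]] = None) -> str:
    args = build_parser().parse_args(argv)
    # A real image would dispatch to the component code here; this sketch only reports the plan.
    return f"{args.component}: {args.input} -> {args.output}"


print(main(["prepare", "--input", "raw.txt", "--output", "table.csv"]))
```

With a file like this set as the image's entrypoint, the arguments placed after the image name in docker run select and configure the component, and the same arguments can later be reused unchanged in the Kubeflow component definition.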
3. Create a shell script that runs a sequence of docker components:
This step is important because you are still running everything locally, and it is much easier to identify problems when your process crashes on your laptop. Once the script runs on your laptop, try running the same script from any other computer or Cloud instance. You now have a ninety-nine percent chance of your first Kubeflow pipeline running successfully.
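The same sequence can be sketched as a small runner, written here in Python rather than shell only to keep the examples in one language. The image name, component names, and file names are hypothetical, and real components would also need volume mounts or cloud storage so that one container's output is visible to the next. By default the runner only prints the commands (dry run) instead of executing them.

```python
import shlex
import subprocess
from typing import List, Sequence

IMAGE = "example/ml-components:latest"  # hypothetical image name


def docker_command(component_args: Sequence[str], use_gpu: bool = False) -> List[str]:
    """Build a `docker run` command; `--gpus all` is added only in GPU mode."""
    command = ["docker", "run", "--rm"]
    if use_gpu:
        command += ["--gpus", "all"]
    return command + [IMAGE, *component_args]


def run_pipeline(steps: Sequence[List[str]], dry_run: bool = True) -> List[str]:
    """Run the steps in order and stop at the first failure, like `set -e` in a shell script."""
    executed = []
    for command in steps:
        executed.append(shlex.join(command))
        if not dry_run:
            subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return executed


steps = [
    docker_command(["prepare", "--input", "raw.txt", "--output", "table.csv"]),
    docker_command(["train", "--input", "table.csv", "--output", "pred.json"], use_gpu=True),
]

for line in run_pipeline(steps, dry_run=True):
    print(line)
```

Keeping the command-line arguments of every step in one place pays off in the next step, where each Kubeflow component has to be given exactly the same parameters.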
4. Create the first version of the Kubeflow pipeline that reproduces the same shell script. Validate. Using a Kubeflow Jupyter notebook, create a Kubeflow component for each component in your Docker image, with exactly the same command-line parameters as in the shell script. This pipeline must produce exactly the same outputs as the shell script, and it makes sense to validate that.
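One simple way to do that validation is to compare the two output directories file by file. The sketch below assumes both sets of outputs have been copied to local directories and that the components are deterministic; training with random seeds would need fixed seeds or a tolerance-based comparison instead of hashes. The function names and the demo files are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def digest_tree(root: Path) -> Dict[str, str]:
    """Map each file's path relative to `root` to the SHA-256 of its contents."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_outputs(local_dir: Path, pipeline_dir: Path) -> List[str]:
    """Return relative paths that are missing on one side or differ in content."""
    local, remote = digest_tree(local_dir), digest_tree(pipeline_dir)
    return sorted(
        name for name in local.keys() | remote.keys()
        if local.get(name) != remote.get(name)
    )


def demo() -> List[str]:
    """Two tiny output folders: one file matches, one differs."""
    with tempfile.TemporaryDirectory() as tmp:
        local, pipeline = Path(tmp) / "local", Path(tmp) / "pipeline"
        local.mkdir()
        pipeline.mkdir()
        (local / "table.csv").write_text("x,y\n1,2\n")
        (pipeline / "table.csv").write_text("x,y\n1,2\n")
        (local / "pred.json").write_text("[3.0]")
        (pipeline / "pred.json").write_text("[3.1]")
        return diff_outputs(local, pipeline)


print(demo())
```

An empty list means the Kubeflow pipeline reproduced the local run; any listed file points at the component whose output to investigate first.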
5. Create sophisticated Kubeflow pipelines
... this can include training multiple models for the same input by adding multiple training components to your Kubeflow pipeline. These pipelines can grow fast and be deployed thousands of times a day, so you want to be confident about their building blocks. As we have just seen, Kubeflow components are essentially Docker image components. This leads to the next phase:
6. Create validation components.
... develop Data validation and Model validation components to accompany all the ML experiments. Take advantage of the Kubeflow experiment tracking! Below are two examples of Kubeflow pipelines: