Kubeflow makes more sense when you understand this. Moving Data Science to Kubeflow in four steps
Updated: Oct 21
Previously I explained the concepts and goals of deploying and using a Machine Learning Platform: https://www.romankazinnik.com/post/automate-machine-learning-2-kubeflow-vs-metaflow
Let us now see how to get the most benefit from Kubeflow. It is important to understand that Kubeflow is a Platform:
The effort to deploy the first product in Kubeflow is high; for each following product it is small.
Example: a one-time Kubeflow cluster setup effort by MLOps, and a small overhead for Data Scientists, who need to take one extra step beyond Jupyter notebooks and package their Python code as persistent Docker components.
Remark: it makes sense to consider Metaflow if working with persistent Docker images is a problem for any reason and being limited to AWS is not.
The target audience for this blog is MLOps/DevOps and Machine Learning Engineers/Data Scientists. I provide two separate guides for these two groups, which have distinctly different goals and often use overloaded terms.
MLOps milestones and objectives:
... Kubeflow is essentially a layer that shields the Kubernetes cluster and MLOps from "Machine Learning" technical details such as accuracies, training, overfitting, etc. The goal is to run thousands of experiments with minimal failures and without having to reinstall the Kubeflow cluster. Here is how this can be attained:
1. Kubeflow cluster setup: the Kubeflow installation will show several sample pipelines. Ensure these sample pipelines (1) run 'out-of-the-box' in both GPU and CPU modes, and (2) can be invoked from the Kubeflow GUI endpoint, from a Kubeflow Jupyter notebook, and directly on Kubernetes via their yaml files.
2. Unit testing before migrating to Kubeflow: the Data Science team should be able to run their experiments as Docker containers, access all the cloud storage data, and upload the results.
3. Data Scientists utilize the Kubeflow cluster by ... automatically running Kubeflow-generated yaml files. Inspect the yaml file to double-check resources such as Memory and GPU requests and limits. Report inconsistencies to the Data Scientists who created the pipelines; there is no need to edit the automatically generated yaml file.
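The inspection in step 3 can be partly automated. Below is a minimal sketch, assuming the generated yaml is an Argo Workflow manifest (which is what Kubeflow Pipelines v1 compiles to) that has already been parsed into a Python dict, for example with a YAML parser. The function name, the sample manifest and the image name are hypothetical; the check only reports, it never edits the file.

```python
from typing import Any, Dict, List

GPU_KEY = "nvidia.com/gpu"  # Kubernetes resource name for NVIDIA GPUs


def find_resource_issues(workflow: Dict[str, Any]) -> List[str]:
    """Return warnings to pass on to the pipeline authors (nothing is edited)."""
    issues: List[str] = []
    for template in workflow.get("spec", {}).get("templates", []):
        container = template.get("container")
        if container is None:
            continue  # e.g. a DAG template, which has no container of its own
        name = template.get("name", "<unnamed>")
        resources = container.get("resources") or {}
        requests = resources.get("requests") or {}
        limits = resources.get("limits") or {}
        if "memory" not in requests and "memory" not in limits:
            issues.append(f"{name}: no memory request or limit")
        # Kubernetes requires a GPU limit; a GPU request, if present, must equal it.
        if GPU_KEY in requests and requests[GPU_KEY] != limits.get(GPU_KEY):
            issues.append(f"{name}: GPU request does not match GPU limit")
    return issues


# Hypothetical manifest, shaped like a parsed Kubeflow-generated yaml file.
sample_workflow = {
    "kind": "Workflow",
    "spec": {
        "templates": [
            {"name": "pipeline-dag", "dag": {"tasks": []}},
            {
                "name": "prepare-data",
                "container": {
                    "image": "example/ml:latest",
                    "resources": {
                        "requests": {"memory": "2Gi"},
                        "limits": {"memory": "4Gi"},
                    },
                },
            },
            {
                "name": "train-model",
                "container": {
                    "image": "example/ml:latest",
                    "resources": {"limits": {GPU_KEY: "1"}},
                },
            },
        ]
    },
}

issues = find_resource_issues(sample_workflow)
for issue in issues:
    print(issue)
```

The list of warnings is what goes back to the Data Scientists; the thresholds and rules worth checking will depend on the cluster.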
4. Access and Secrets:
If it runs in Docker it runs in Kubeflow. Run components as Docker images and try to access the cloud resources. Update access rights, Google Cloud Platform secrets, and user accounts if needed.
5. CPU and GPU:
If it runs in Docker it runs in Kubeflow. Run components as Docker images in both CPU and GPU modes, using docker run --gpus all for the GPU runs, and observe the GPU load. More info on CPU and GPU in Kubeflow: inside a pipeline, the GPU node type is selected on a component with add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-v100')
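To make the CPU and GPU checks in step 5 repeatable, the two docker run variants can be generated from one helper. This is only a sketch: the image name and the component arguments are hypothetical, and the helper builds the command without executing it.

```python
import shlex
from typing import List, Sequence


def docker_run_command(
    image: str, args: Sequence[str] = (), use_gpu: bool = False
) -> List[str]:
    """Build a `docker run` command for one component, in CPU or GPU mode."""
    cmd = ["docker", "run", "--rm"]
    if use_gpu:
        cmd += ["--gpus", "all"]  # expose all host GPUs to the container
    cmd.append(image)
    cmd.extend(args)
    return cmd


# Hypothetical image and arguments, for illustration only.
component_args = ["train", "--epochs", "1"]
cpu_cmd = docker_run_command("example/ml:latest", component_args)
gpu_cmd = docker_run_command("example/ml:latest", component_args, use_gpu=True)

print(shlex.join(cpu_cmd))
print(shlex.join(gpu_cmd))
```

Either command can be passed to subprocess.run(cmd, check=True). While the GPU variant runs, nvidia-smi on the host is one way to observe the GPU load.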
6. IO Resources: access rights and bandwidth. Google Cloud Platform secrets and user accounts are much like those in Kubernetes. Example: access to GCP buckets is granted to certain users or through GCP key-files. Cloud IO bandwidth limits the number of Kubeflow/k8s pods that can access IO simultaneously; a simple solution is to limit the number of IO-heavy pods.
Data Scientists and Machine Learning Engineers milestones and objectives:
... Kubeflow helps resolve the troubles that appear when you run experiments anywhere outside of your laptop. Experiments that are expected to run on personal laptops don't need Kubeflow. Otherwise, consider migrating to Kubeflow.
These are the prerequisites for migrating to Kubeflow in order to obtain the following: distributed training, persistent and tracked experiments, product scale, cluster.
1. Decompose monolith code into persistent stateless functions (components): a simple example can be a component that reads input files and outputs ML input data such as tables and time-series. A second component reads ML input data, trains a model, and outputs model inferences.
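As an illustration of step 1, here is a minimal sketch of two stateless components that communicate only through files. Everything in it is an assumption made for the example: the function names, the CSV and JSON file layout, and the toy "model" (a least-squares slope through the origin), which stands in for real training.

```python
import csv
import json
import tempfile
from pathlib import Path


def prepare_data(raw_csv: Path, table_json: Path) -> None:
    """Component 1: read a raw input file and write the ML input table."""
    with raw_csv.open(newline="") as f:
        rows = [{"x": float(r["x"]), "y": float(r["y"])} for r in csv.DictReader(f)]
    table_json.write_text(json.dumps(rows))


def train_and_predict(table_json: Path, predictions_json: Path) -> None:
    """Component 2: read the ML input table, fit a model, write inferences."""
    rows = json.loads(table_json.read_text())
    # Toy model for illustration: slope w of y ~ w * x, fitted by least squares.
    w = sum(r["x"] * r["y"] for r in rows) / sum(r["x"] ** 2 for r in rows)
    predictions = [{"x": r["x"], "y_pred": w * r["x"]} for r in rows]
    predictions_json.write_text(json.dumps({"w": w, "predictions": predictions}))


# The components share no in-memory state: each one can run in its own process.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw.csv"
    table = Path(tmp) / "table.json"
    predictions = Path(tmp) / "predictions.json"
    raw.write_text("x,y\n1,2\n2,4\n3,6\n")

    prepare_data(raw, table)
    train_and_predict(table, predictions)
    result = json.loads(predictions.read_text())

print(result["w"])
```

Because each function takes only input and output paths, either one can later be wrapped in a container without changing its logic.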
2. Dockerize:
pack the components as Docker containers and run them with docker run.
Initially one can make all the components reside in a single Docker image.
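One way to keep all components in a single image, as suggested in step 2, is a single entry point that dispatches on the component name. The sketch below assumes two hypothetical components, prepare and train, with placeholder bodies; the argument names are also made up for the example.

```python
import argparse
from typing import List, Optional


def prepare(args: argparse.Namespace) -> str:
    # Placeholder body: a real component would read args.input and write args.output.
    return f"prepare: {args.input} -> {args.output}"


def train(args: argparse.Namespace) -> str:
    # Placeholder body, same convention as above.
    return f"train: {args.input} -> {args.output}"


def build_parser() -> argparse.ArgumentParser:
    """One sub-command per component, all living in the same image."""
    parser = argparse.ArgumentParser(prog="components")
    subparsers = parser.add_subparsers(dest="component", required=True)
    for name, func in (("prepare", prepare), ("train", train)):
        sub = subparsers.add_parser(name)
        sub.add_argument("--input", required=True)
        sub.add_argument("--output", required=True)
        sub.set_defaults(func=func)
    return parser


def main(argv: Optional[List[str]] = None) -> str:
    args = build_parser().parse_args(argv)
    return args.func(args)


# Explicit argv for the demo; inside the image, main() would read sys.argv.
message = main(["prepare", "--input", "raw.csv", "--output", "table.json"])
print(message)
```

With this file as the image entry point, each component is selected by the first argument after the image name in docker run, and the image can be split into several later without changing the command-line interface.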
3. Create a shell script that runs a sequence of Docker components:
Keep running locally and solve problems there. Run this batch file from multiple computers and from a Cloud instance.
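The same sequence can also be written as a small Python driver instead of a shell script, which makes it easy to print the commands before running them. This is a sketch under assumptions: the image name, the component arguments and the mounted data directory are all hypothetical, and dry_run defaults to True so nothing is executed unless asked.

```python
import shlex
import subprocess
from typing import List, Sequence

IMAGE = "example/ml:latest"  # hypothetical image holding all components

# Hypothetical component invocations, in the order they must run.
STEPS = [
    ["prepare", "--input", "/data/raw.csv", "--output", "/data/table.json"],
    ["train", "--input", "/data/table.json", "--output", "/data/predictions.json"],
]


def step_command(image: str, args: Sequence[str], data_dir: str) -> List[str]:
    """Build the docker run command for one step, sharing data through a mount."""
    return ["docker", "run", "--rm", "-v", f"{data_dir}:/data", image, *args]


def run_steps(
    image: str, steps: Sequence[Sequence[str]], data_dir: str, dry_run: bool = True
) -> List[str]:
    """Run the steps in order and return the command lines that were built."""
    executed = []
    for args in steps:
        cmd = step_command(image, args, data_dir)
        executed.append(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step
    return executed


commands = run_steps(IMAGE, STEPS, data_dir="/tmp/experiment", dry_run=True)
for line in commands:
    print(line)
```

Calling run_steps with dry_run=False executes the containers one after another; the printed lines are what an equivalent shell script would contain.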
4. Migrate the shell script to a Kubeflow pipeline. Validate. Using a Kubeflow Jupyter notebook, create Kubeflow components from the Docker image. One can use the same command-line parameters as in the batch-file. This Kubeflow pipeline should produce the same output as the batch-file. Validate this!
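A possible shape for step 4 is sketched below, assuming the Kubeflow Pipelines SDK v1 (installed with pip install "kfp<2"); newer SDK versions use a different API. The image, the bucket paths and the step names are hypothetical. The arguments point at cloud storage because pods in a pipeline do not share a local filesystem. The kfp import is done lazily so that the step definitions can be reused without the SDK installed.

```python
from typing import Any, Dict, List

IMAGE = "example/ml:latest"  # hypothetical image, the same one used locally

# Hypothetical steps: the same command-line parameters as in the batch file.
STEPS: List[Dict[str, Any]] = [
    {
        "name": "prepare",
        "arguments": ["prepare", "--input", "gs://example-bucket/raw.csv",
                      "--output", "gs://example-bucket/table.json"],
        "gpu": False,
    },
    {
        "name": "train",
        "arguments": ["train", "--input", "gs://example-bucket/table.json",
                      "--output", "gs://example-bucket/predictions.json"],
        "gpu": True,
    },
]


def build_pipeline(image: str = IMAGE, steps: List[Dict[str, Any]] = STEPS):
    """Return a pipeline function that runs the steps one after another."""
    from kfp import dsl  # lazy import; assumes KFP SDK v1

    @dsl.pipeline(name="batch-file-migration",
                  description="Same steps as the local batch file")
    def pipeline():
        previous = None
        for step in steps:
            op = dsl.ContainerOp(
                name=step["name"], image=image, arguments=step["arguments"]
            )
            if step["gpu"]:
                op.set_gpu_limit(1)
                op.add_node_selector_constraint(
                    "cloud.google.com/gke-accelerator", "nvidia-tesla-v100"
                )
            if previous is not None:
                op.after(previous)  # keep the batch-file ordering
            previous = op

    return pipeline


def compile_pipeline(package_path: str = "pipeline.yaml") -> None:
    """Write the pipeline yaml; call this where kfp (v1) is installed."""
    import kfp.compiler

    kfp.compiler.Compiler().compile(build_pipeline(), package_path)
```

Calling compile_pipeline() from a Kubeflow Jupyter notebook should produce the yaml file to upload or run. Comparing the outputs in the bucket with those of the local batch file is the validation this step asks for.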
5. Start creating Kubeflow pipelines
... this can include training multiple models for the same input by adding multiple training components to your Kubeflow pipeline. Kubeflow components are Docker containers that run on the Kubernetes cluster.
6. Create validation components.
... develop Data validation and Model validation components to accompany the experiments. Take advantage of the Kubeflow experiment tracking! Below are two snapshots of possible Kubeflow pipelines: