Machine Learning as a Flow, Continued: Kubeflow vs. Metaflow
Updated: Mar 8
I am working on a comparative analysis of Kubeflow (released 2018, Google, current version 0.7) and Metaflow (released December 2019, Netflix). At a high level, both essentially facilitate an ML Platform (a.k.a. MLOps): integrating ML components and storage, supporting experiment logging and reproducibility, and helping productionize models faster.
Here are a few references to emphasize why we need MLOps:
3-min: https://youtu.be/sdbBcPuvw40 "Spell: Next Generation Machine Learning Platform"
33-min: https://youtu.be/lu5zHvpQeSI "Managing ML in Production with Kubeflow and DevOps - David Aronchick, Microsoft"
10-min Metaflow read: https://towardsdatascience.com/learn-metaflow-in-10-mins-netflixs-python-r-framework-for-data-scientists-2ef124c716e4
If I needed to explain in one word why one would need MLOps, it would be: "SCALE". MLOps does not aim directly to improve a model's accuracy or win a Kaggle competition; it does scale and facilitate ML teams that want to win Kaggle competitions.
Over the next two weeks I am planning to work on comparing the following functionalities in Kubeflow and Metaflow:
Distributed computing and horizontal scaling
Integration with other tools, libraries, APIs
I will try to make my analysis as reproducible as possible. Below is my current work-in-progress analysis:
Distributed compute: TBD
State between components: (K) state starts and ends within a component, with minimal pass-state support
Goal, highly subjective (!): (K) abstract out Kubernetes and add DAG orchestration (M) Python-based machine learning workflows
Cloud: (K) GCP, Azure, potentially anything (M) AWS Batch and S3
Model serving: provided by external tools: (K) AI Hub and TFX on GCP (M) SageMaker on AWS
Horizontal scaling: (K) Kubernetes cluster, TFX, or custom (M) AWS ECS
Dependencies: (K) lock-in with DAG (M) minimal
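To make the "state between components" row concrete, here is a minimal, framework-free Python sketch. It is not the Kubeflow or Metaflow API (all function names are hypothetical); it only illustrates the model both frameworks share: each step receives its inputs explicitly and returns its outputs explicitly, rather than reading shared global state.

```python
# Hypothetical sketch of DAG-style state passing between workflow steps.
# Each step starts with only what was passed to it, which is what makes
# runs reproducible and resumable in both (K) and (M).

def load_data():
    # In a real pipeline this would read from S3 (Metaflow) or a
    # volume / bucket mounted into a container (Kubeflow).
    return [1.0, 2.0, 3.0, 4.0]

def train(data):
    # "Training" here is just computing a mean, to keep the sketch runnable.
    return sum(data) / len(data)

def evaluate(model, data):
    # Report the worst absolute deviation from the fitted mean.
    return max(abs(x - model) for x in data)

def run_pipeline():
    # The runner wires step outputs to step inputs; no step mutates
    # state outside its own scope.
    data = load_data()
    model = train(data)
    return evaluate(model, data)

print(run_pipeline())  # 1.5
```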
Both Kubeflow and Metaflow support the most common ML/DS scenarios by managing code, data, and dependencies for each ML/DS experiment. Following the common scenarios in https://towardsdatascience.com/learn-metaflow-in-10-mins-netflixs-python-r-framework-for-data-scientists-2ef124c716e4, (K) and (M) at a high level deliver:
(K) and (M) Collaboration: debug somebody's error by pulling up their failed run state in your sandbox/laptop as-is.
(M) Resuming a run: a run failed (or was stopped intentionally), you fixed the error in your code, and you wish you could restart the workflow from where it failed/stopped.
(K) and (M) Hybrid runs: run one step of your workflow locally (maybe the data-load step, since the dataset is in your downloads folder) but run another, compute-intensive step (the model training) on the cloud.
(K) and (M) Inspecting run metadata: three data scientists have been tuning hyperparameters to get better accuracy on the same model; now you want to analyze all their training runs and pick the best-performing hyperparameter set.
(K) and (M) Multiple versions of the same package: one is free to choose the version, language, and packages for any step of one's workflow.
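The "resuming a run" scenario can be sketched in a few lines of plain Python. This is not Metaflow's actual `resume` implementation, just the idea behind it under a simplifying assumption: each completed step's output is persisted (to a datastore like S3 in practice, a dict here), so a re-run skips finished steps and restarts from the one that failed.

```python
# Hypothetical sketch of resume-from-failure, not the Metaflow internals.
completed = {}  # step name -> persisted output (a datastore in practice)

def run_step(name, fn, *args):
    if name in completed:        # step already succeeded in a previous run
        return completed[name]
    result = fn(*args)
    completed[name] = result     # persist before moving on
    return result

def run_flow(train_fn):
    data = run_step("load", lambda: list(range(10)))
    return run_step("train", train_fn, data)

# First attempt: the training step raises, but "load" is already persisted.
def broken_train(data):
    raise RuntimeError("out of memory")

try:
    run_flow(broken_train)
except RuntimeError:
    pass

# Fix the code and resume: "load" is skipped, only "train" re-executes.
def fixed_train(data):
    return sum(data)

print(run_flow(fixed_train))  # 45
```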
(K) does not lock you into a particular cloud provider: it abstracts out open-source Kubernetes and Airflow (or other orchestration). However, using TensorFlow Extended (TFX) will lock you into Apache Beam, provided as Dataflow on Google Cloud Platform. (M) currently locks you into AWS SageMaker/S3/Batch.
Running heavy parallel, distributed computational loads in a stable way: an example of unstable cluster management is when some 'memory-thirsty' process causes other processes to abort due to lack of memory. Another example is resource utilization, where GPU components from one pipeline can run in parallel with CPU components from another pipeline. (K) uses Kubernetes to distribute the computational load; one potential solution is to declare memory and GPU requirements explicitly in the pipeline's components.
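To see why declaring requirements explicitly helps, here is a toy Python sketch of request-based admission. It is a hypothetical simplification, not the Kubernetes scheduler: a step is admitted only if its declared request fits the remaining capacity, so a memory-thirsty job is queued or rejected up front instead of starving already-running neighbours.

```python
# Hypothetical toy scheduler illustrating declared resource requirements.
# Not the Kubernetes scheduler; it only shows the admission idea.

class ToyScheduler:
    def __init__(self, memory_gb, gpus):
        self.memory_gb = memory_gb  # remaining cluster capacity
        self.gpus = gpus

    def try_admit(self, step):
        """step: dict with declared 'memory_gb' and 'gpus' requirements."""
        if step["memory_gb"] <= self.memory_gb and step["gpus"] <= self.gpus:
            self.memory_gb -= step["memory_gb"]
            self.gpus -= step["gpus"]
            return True
        return False  # held back up front, not killed mid-run by an OOM

sched = ToyScheduler(memory_gb=16, gpus=1)
gpu_train = {"memory_gb": 8, "gpus": 1}   # GPU component of pipeline A
cpu_etl = {"memory_gb": 4, "gpus": 0}     # CPU component of pipeline B
greedy = {"memory_gb": 32, "gpus": 0}     # memory-thirsty job

print(sched.try_admit(gpu_train))  # True: GPU and CPU steps co-exist
print(sched.try_admit(cpu_etl))    # True
print(sched.try_admit(greedy))     # False: rejected before it can cause aborts
```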
Integration with various tools across cloud providers: (K) runs containers; consequently, any API service calls can be combined into one pipeline as long as each can be isolated into a Docker container.