Roman Kazinnik

Machine Learning as a Flow, Continued: Kubeflow vs. Metaflow

Updated: Mar 8

I am working on a comparative analysis of Kubeflow (released 2018 by Google, current version 0.7) and Metaflow (released December 2019 by Netflix). At a high level, both frameworks facilitate an ML platform (a.k.a. MLOps): they integrate ML components and storage, support experiment logging and reproducibility, and help productionize models faster.


Here are a few references that emphasize why we need MLOps:

  • 3 min: https://youtu.be/sdbBcPuvw40 "Spell: Next Generation Machine Learning Platform"

  • 33 min: https://youtu.be/lu5zHvpQeSI "Managing ML in Production with Kubeflow and DevOps - David Aronchick, Microsoft"

  • 10-min Metaflow read: https://towardsdatascience.com/learn-metaflow-in-10-mins-netflixs-python-r-framework-for-data-scientists-2ef124c716e4

If I had to explain in one word why one needs MLOps, it would be: "SCALE".

How MLOps facilitates better Machine Learning

MLOps flow:

  • does not aim directly to improve model accuracy or to win Kaggle competitions

  • does scale and support the ML teams that want to win Kaggle competitions.

Over the next two weeks I plan to compare the following functionality in Kubeflow and Metaflow:

  • CI/CD

  • ML Features

  • ML Training

  • ML Serving

  • Orchestration

  • Distributed computing and horizontal scaling

  • Cluster support

  • Cloud dependencies

  • Integration with other tools, libraries, APIs

I will try to make my analysis as reproducible as possible. Below is my current work-in-progress analysis:

  • Distributed compute: TBD

  • Cluster: TBD

  • State between components: (K) state starts and ends within a component, with minimal support for passing state between components

  • Goal (highly subjective!): (M) Python-based machine learning workflows; (K) abstract out Kubernetes and add DAG orchestration (see the sketch after this list)

  • Cloud: (M) AWS Batch and S3; (K) GCP, Azure, potentially anything

  • Model serving: provided by external tools: AI Hub and TFX on GCP for (K), SageMaker/AWS for (M)

  • Horizontal scaling: (K) Kubernetes cluster, TFX, or custom; (M) AWS ECS

  • Dependencies: (K) lock-in with the DAG; (M) minimal
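
To make the "(K) abstracts out Kubernetes and adds DAG orchestration" point concrete, here is a minimal sketch against the KFP v1-era Python SDK (roughly what ships alongside Kubeflow 0.7). The pipeline name, images, and commands are placeholders of mine, not from the Kubeflow documentation.

```python
# A minimal sketch of how Kubeflow Pipelines (KFP v1-era SDK) expresses a DAG:
# two containerized steps, where training runs only after preprocessing.
from kfp import dsl, compiler


@dsl.pipeline(name="toy-training-pipeline",
              description="Preprocess data, then train a model.")
def toy_pipeline():
    preprocess = dsl.ContainerOp(
        name="preprocess",
        image="python:3.8",                           # placeholder image
        command=["python", "-c", "print('preprocess')"],
    )
    train = dsl.ContainerOp(
        name="train",
        image="python:3.8",                           # placeholder image
        command=["python", "-c", "print('train')"],
    )
    train.after(preprocess)                           # DAG edge: train depends on preprocess


if __name__ == "__main__":
    # Compile to an Argo workflow package that Kubeflow runs on Kubernetes.
    compiler.Compiler().compile(toy_pipeline, "toy_pipeline.tar.gz")
```

The data scientist only declares steps and edges in Python; Kubernetes scheduling, containers, and the Argo workflow spec stay hidden behind the SDK.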


Both Kubeflow and Metaflow support the most common ML/DS scenarios by managing code, data, and dependencies for each experiment. Following the common scenarios in https://towardsdatascience.com/learn-metaflow-in-10-mins-netflixs-python-r-framework-for-data-scientists-2ef124c716e4, at a high level (K) and (M) deliver:

  • (K) and (M) Collaboration: debug somebody else's error by pulling up their failed run state in your own sandbox/laptop as-is.

  • (M) Resuming a run: a run failed (or was stopped intentionally), you fixed the error in your code, and you want to restart the workflow from where it failed/stopped.

  • (K) and (M) Hybrid runs: run one step of your workflow locally (perhaps the data-load step, since the dataset is in your downloads folder) and run another, compute-intensive step (the model training) in the cloud.

  • (K) and (M) Inspecting run metadata: three data scientists have been tuning hyperparameters to get better accuracy on the same model; now you want to analyze all their training runs and pick the best-performing hyperparameter set.

  • (K) and (M) Multiple versions of the same package: you are free to choose the version, language, and packages for any step of your workflow.
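
As an illustration of the resume and metadata-inspection scenarios on the Metaflow side, here is a minimal sketch; the flow name, step names, and metric are illustrative assumptions, not taken from the post or from Netflix's examples.

```python
# A minimal Metaflow sketch: artifacts assigned to self are versioned per run,
# so a failed run can be resumed and past runs can be inspected later.
from metaflow import FlowSpec, step


class ToyTrainingFlow(FlowSpec):

    @step
    def start(self):
        # Artifacts assigned to self are stored and versioned for this run.
        self.learning_rate = 0.01
        self.next(self.train)

    @step
    def train(self):
        # If this step fails, fix the code and restart from here with:
        #   python toy_training_flow.py resume
        self.accuracy = 0.9  # placeholder metric
        self.next(self.end)

    @step
    def end(self):
        print("accuracy:", self.accuracy)


if __name__ == "__main__":
    ToyTrainingFlow()
```

Assuming this sketch is saved as toy_training_flow.py, `python toy_training_flow.py run` executes the flow, `python toy_training_flow.py resume` restarts from the failed step, and the Metaflow client API (for example, iterating over `Flow('ToyTrainingFlow').runs()`) lets teammates inspect each other's runs and artifacts.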

(K) does not lock you into a particular cloud provider: it abstracts over open-source Kubernetes and Airflow (or other orchestration). However, using TensorFlow Extended (TFX) will lock you into Apache Beam, provided as Dataflow on Google Cloud Platform. (M) currently locks you into AWS (SageMaker/S3/Batch).

Running heavy parallel-distributed computational loads in a stable way: one example of unstable cluster management is a 'memory-thirsty' process causing other processes to abort for lack of memory; another is resource utilization, where GPU components from one pipeline should be able to run in parallel with CPU components from another pipeline. (K) distributes the computational load through Kubernetes, and one potential solution is to declare memory and GPU requirements explicitly in the pipeline's components.
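
A sketch of that potential solution, again against the KFP v1-era SDK; the step names, images, and resource figures are illustrative assumptions of mine, not recommendations.

```python
# Declaring memory and GPU requirements explicitly on Kubeflow pipeline
# components, so the Kubernetes scheduler can place a 'memory-thirsty' step
# and a GPU step without starving other workloads.
from kfp import dsl


@dsl.pipeline(name="resource-aware-pipeline")
def resource_aware_pipeline():
    heavy_cpu_step = dsl.ContainerOp(
        name="feature-engineering",
        image="python:3.8",                           # placeholder image
        command=["python", "-c", "print('features')"],
    )
    heavy_cpu_step.set_memory_request("8G")           # reserve memory up front
    heavy_cpu_step.set_memory_limit("16G")            # hard cap before eviction
    heavy_cpu_step.set_cpu_request("4")

    gpu_step = dsl.ContainerOp(
        name="train-gpu",
        image="tensorflow/tensorflow:2.4.1-gpu",      # placeholder image
        command=["python", "-c", "print('train')"],
    )
    gpu_step.set_gpu_limit(1)                         # request one GPU for this step
    gpu_step.after(heavy_cpu_step)
```

With requests and limits declared per component, Kubernetes can co-schedule GPU steps from one pipeline alongside CPU steps from another instead of letting one greedy step crash its neighbors.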

Integration with various tools across cloud providers: (K) runs containers, so any API or service call can be run as a pipeline step as long as it can be isolated in a Docker container.
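
For instance, a sketch with a made-up endpoint: a plain curl call to an external service becomes just another containerized step, compiled and submitted the same way as the earlier pipeline sketch.

```python
# Wrapping an external API call as a containerized Kubeflow pipeline step.
# The endpoint URL is a placeholder, not a real service.
from kfp import dsl


@dsl.pipeline(name="external-api-pipeline")
def external_api_pipeline():
    # Any tool that fits in a container becomes a step: here, a plain curl call.
    call_api = dsl.ContainerOp(
        name="call-external-api",
        image="curlimages/curl:7.72.0",                     # public curl image
        command=["curl", "-s", "https://example.com/api"],  # placeholder endpoint
    )
```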
