AlgoRun is a pipeline orchestration system designed for data processing, machine learning model deployment and business process automation. It provides versioned, namespaced and portable pipelines that deploy easily to any Kubernetes cluster. It is built to be compatible with modern data processing tools and languages, which simplifies orchestration of your entire data pipeline.
AlgoRun leverages best-of-breed infrastructure components, such as Kubernetes and Apache Kafka, and adds streamlined pipeline management capabilities to empower both data and infrastructure teams. Data teams get to focus on extracting value from data and use the tools they prefer, knowing their ideas can make it to production. Infrastructure teams get simplified management, monitoring and security of data pipeline deployments without sacrificing flexibility.
Examples of the data processing capabilities are:
- ETL Pipelines for moving data between heterogeneous systems
- Data prep and wrangling for machine learning pipelines
- Schema validation and other data integrity assurance
- Data record lineage and tracing
- Data stream aggregation and centralized feature stores
Examples of the machine learning capabilities are:
- Model serving and deployments
- Scaling and resource management
- Model version management
Examples of the business automation capabilities are:
- Decision workflows based on the results of Algos in the pipeline
- Integration with existing legacy systems
- Public API integrations and execution of external actions
The primary components that make up the system are:
- Algos
- Data Connectors
- Pipelines
- Deployments
In the context of the AlgoRun system, we use the term Algorithm very loosely: it can be any discrete piece of processing code, in practically any programming language, as long as it can be packaged into a Docker container. From here on we refer to an Algorithm as an Algo and use the terms interchangeably. An Algo can be anything from a script in a language such as Python or R, to a machine learning model snapshot, to any executable, CLI or console application. Algos can also expose their processing capabilities through an HTTP or gRPC server. Ultimately, AlgoRun launches any code as a microservice and captures the output for later use. For machine learning models, AlgoRun is architected primarily for model serving and inference, although it can also be configured for model training.
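To make the idea of a script-style Algo concrete, here is a minimal sketch of one in Python. It assumes the Algo receives one JSON record per line on standard input and writes one JSON result per line to standard output; that input/output contract, the `text` field and the `process` function are illustrative assumptions, not AlgoRun's documented interface.

```python
import json
import sys


def process(record: dict) -> dict:
    """Hypothetical processing step: count the words in a `text` field."""
    text = record.get("text", "")
    return {"text": text, "word_count": len(text.split())}


def main() -> None:
    # Assumed contract: one JSON record per input line, one JSON result per output line.
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        print(json.dumps(process(json.loads(line))))


if __name__ == "__main__":
    main()
```

Because the script only reads input and writes output, it could be packaged into a container image like any other command-line program.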
A Data Connector is either a data source that ingests data into the pipeline or a data sink that writes data to external data repositories. Because the backbone of AlgoRun's data layer is Apache Kafka, most Data Connectors are in fact containerized Kafka connectors. Thanks to the many open source connectors in the Kafka ecosystem, AlgoRun comes pre-installed with connectors for most popular SQL, NoSQL and general-purpose data repositories.
A pipeline is composed of one or more Algos and Data Connectors. AlgoRun can execute Algos in any sequence, in parallel or in series, to form a completely asynchronous processing pipeline. Each Algo in the pipeline may have multiple inputs and multiple outputs, which can be piped to or from any compatible source or destination. Each pipeline also has a single endpoint, which provides an external interface to the capabilities of the pipeline and is exposed over both HTTP and gRPC. The endpoint can expose more than one path, which allows multiple pipeline inputs for different content types or use cases, such as running two parallel processing pipes in the same pipeline.
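One way to picture a pipeline is as a directed graph in which each component lists the components it consumes output from. The sketch below uses Python's standard `graphlib` module (Python 3.9+) to show which components depend on each other and which could run in parallel. The component names and the dictionary layout are hypothetical and are not AlgoRun's pipeline format. In an asynchronous pipeline the components would not run in lockstep stages; the grouping only illustrates the dependency structure.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key is a component, each value is the set of
# components whose output it consumes.
pipeline = {
    "kafka-source": set(),
    "validate": {"kafka-source"},
    "model-a": {"validate"},
    "model-b": {"validate"},
    "combine": {"model-a", "model-b"},
    "db-sink": {"combine"},
}


def execution_stages(graph):
    """Group components into stages; members of a stage do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs are satisfied
        stages.append(ready)
        sorter.done(*ready)
    return stages


for number, stage in enumerate(execution_stages(pipeline), start=1):
    print(number, stage)
```

Here `model-a` and `model-b` land in the same stage because both depend only on `validate`, which corresponds to two parallel branches that are later merged by `combine`.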
Each pipeline also has a Hook, which provides a way to create named events that relay any result piped to them to any URL.
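The relay behaviour can be sketched as a lookup from event name to destination URL followed by an HTTP POST. The example below only builds the request with the standard library and does not send it. The event name, URL, payload shape and the `build_hook_request` helper are all assumptions made for illustration; the text does not specify how AlgoRun formats or delivers hook calls.

```python
import json
import urllib.request

# Hypothetical mapping of named events to destination URLs.
HOOKS = {"prediction-ready": "https://example.com/hooks/prediction"}


def build_hook_request(event: str, result: dict, hooks: dict = HOOKS) -> urllib.request.Request:
    """Build the HTTP POST that would relay `result` to the URL registered for `event`."""
    url = hooks[event]  # raises KeyError for an unknown event name
    body = json.dumps({"event": event, "result": result}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_hook_request("prediction-ready", {"score": 0.9})
```

Passing `request` to `urllib.request.urlopen` would perform the actual delivery.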
A deployment manages all of the configuration required to deploy a pipeline into Kubernetes. It also provides the logging and monitoring facilities for a deployed pipeline. The deployment is used to expose the external endpoints and to ensure that all components of the pipeline are running. The deployment process is managed by the Pipeline Operator, a Kubernetes Operator component installed in the cluster.
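Kubernetes operators generally work by comparing a desired state with the observed state and acting on the difference. The sketch below illustrates that general reconcile pattern for "ensure all components of the pipeline are running". It is not the Pipeline Operator's actual code; the `reconcile` function, the component names and the replica counts are hypothetical.

```python
def reconcile(desired: dict, observed: dict) -> list:
    """Return the actions needed to move observed replica counts to the desired ones."""
    actions = []
    for name, replicas in sorted(desired.items()):
        running = observed.get(name, 0)  # components not yet running count as 0
        if running < replicas:
            actions.append(("start", name, replicas - running))
        elif running > replicas:
            actions.append(("stop", name, running - replicas))
    # Anything running that the deployment no longer declares is removed.
    for name in sorted(set(observed) - set(desired)):
        actions.append(("remove", name, observed[name]))
    return actions


desired = {"endpoint": 1, "model-a": 2}
observed = {"endpoint": 1, "model-a": 1, "old-algo": 1}
print(reconcile(desired, observed))
```

An operator would run a comparison like this repeatedly, so that a crashed or missing component is started again without manual intervention.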