Pipeline Overview

In AlgoRun parlance, a Pipeline is a directed graph composed of Endpoints, Algos, Data Sources, Data Sinks and Hooks. These components are the fundamental building blocks for creating flexible and massively scalable AI, automation and data processing pipelines. Building a pipeline in AlgoRun involves adding and configuring each component to form a clearly defined version of the complete processing stack. The Pipeline can then be tested locally, shared with others and deployed into a production Kubernetes cluster.

The easiest way to design a pipeline is using the AlgoRun UI visual pipeline designer. If you prefer using your favorite text editor, you can also create the pipeline with a YAML or JSON spec.
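For example, a minimal pipeline spec might look something like the sketch below. The exact schema is defined by AlgoRun, so treat every field name here as an illustrative assumption rather than the documented format; the structure is serialized to JSON, and the same shape could just as easily be written as YAML.

```python
import json

# A minimal, hypothetical pipeline spec. The real schema is defined by AlgoRun;
# every field name below is an illustrative assumption, not the documented format.
pipeline_spec = {
    "owner": "acme",
    "name": "image-classifier",
    "version": "1.0.0",
    "endpoints": [
        {"path": "images", "endpointType": "HTTP(S)", "messageDataType": "FileReference"}
    ],
    "algos": [
        {"owner": "acme", "name": "resnet-classifier", "version": "2.1.0", "index": 1}
    ],
    "hooks": [
        {"name": "results", "webhookUrls": ["https://example.com/callback"]}
    ],
    "pipes": [
        {"from": "endpoint:images", "to": "acme/resnet-classifier:2.1.0[1]"}
    ],
}

# Serialize to JSON; the same structure could be written as YAML instead.
print(json.dumps(pipeline_spec, indent=2))
```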

Design Canvas

The pipeline design canvas in the AlgoRun UI is a powerful way to visually build your processing pipelines. Below is a sample pipeline showing the available components.

Pipeline Components

Below is an overview of these components and how they can be utilized in your pipelines.

Naming Convention

A component in the pipeline can be identified using a naming convention. This naming convention is used throughout AlgoRun and helps to identify the location, activity, logs and alerts for a specific component.

{Owner Username}/{Name}:{Version}[{Index}]

Each component added to the pipeline is assigned a unique Index. This index is simply an integer that ensures multiple copies of the same type of component can be added to a pipeline without creating exact duplicates.
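Because the format is so regular, it is easy to construct or parse these identifiers in code. A minimal sketch (the component values are made up):

```python
def component_id(owner: str, name: str, version: str, index: int) -> str:
    """Build the component identifier used throughout AlgoRun:
    {Owner Username}/{Name}:{Version}[{Index}]"""
    return f"{owner}/{name}:{version}[{index}]"

# Two copies of the same Algo stay distinguishable because of the index.
print(component_id("acme", "resnet-classifier", "2.1.0", 1))  # acme/resnet-classifier:2.1.0[1]
print(component_id("acme", "resnet-classifier", "2.1.0", 2))  # acme/resnet-classifier:2.1.0[2]
```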

Endpoints

An endpoint defines how the pipeline will accept requests and data from outside the Kubernetes cluster. An endpoint can have one or more paths, which are used to form the URL. Each path can be independently piped to any compatible input within the pipeline. An endpoint path can be used to segregate incoming data streams from IoT devices, enable external systems to integrate with the pipeline, and create parallel processing pipes within a single pipeline, among other creative uses.

The endpoint path can be configured with the following options:

  • Name - A short, URL-friendly name for the endpoint path. The name can only contain alphanumeric characters or single hyphens, cannot begin or end with a hyphen, and has a maximum length of 50 characters.
  • Description - A human-friendly description of the endpoint path and how it is used.
  • Endpoint Type - The endpoint type determines which protocol will be accepted for this path. Currently the supported protocols are HTTP(S) and gRPC, with Kafka and RTMP on the roadmap.
  • Message Data Type - The message data type enables you to choose how the incoming data is stored in the pipeline. The options are:

    • Embedded - If Embedded, the incoming data is added directly as the value of the message that will be delivered to Kafka.
    • FileReference - If FileReference, the incoming data is saved to shared storage and a JSON message containing the location information for the file is generated. That JSON file reference message is then delivered to Kafka (see the sketch after this list).
  • Content Type - The accepted content type can be defined for this endpoint path. By defining the content type you gain additional features:

    • Ensure only compatible outputs and inputs can be piped to each other
    • Validation of the data being delivered to ensure it matches the content type
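To make the FileReference option above concrete, below is a sketch of what a file reference message could contain. AlgoRun's actual message schema is not documented here, so every field name is an assumption; the point is that only location metadata travels through Kafka while the payload itself sits on shared storage.

```python
import json

# Hypothetical shape of a FileReference message. AlgoRun's real schema may
# differ; only the file's location metadata is delivered to Kafka, while the
# payload itself is written to shared storage.
file_reference = {
    "filePath": "/data/acme/image-classifier/images/2f9c1e.jpg",  # assumed field name
    "contentType": "image/jpeg",
    "sizeBytes": 482133,
}
print(json.dumps(file_reference))
```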

Let's take a look at how the endpoint is implemented.

A deployed endpoint will have the following resources created:

  • An Ambassador mapping will be generated for the path, which routes ingress traffic from the endpoint path to the appropriate container.
  • A container for the endpoint type is created to handle the appropriate protocol.
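AlgoRun generates these resources for you, but for orientation, a generic Ambassador Mapping that routes a path prefix to a service has roughly the shape sketched below. The names, port and apiVersion are placeholders, not what AlgoRun actually emits.

```python
import yaml  # PyYAML

# Rough shape of an Ambassador Mapping that routes an endpoint path to a
# container. AlgoRun creates the real mapping for you; the names, port and
# apiVersion here are placeholders for illustration only.
mapping = {
    "apiVersion": "getambassador.io/v2",
    "kind": "Mapping",
    "metadata": {"name": "acme-image-classifier-images"},
    "spec": {
        "prefix": "/acme/image-classifier/images/",
        "service": "image-classifier-endpoint:8080",
    },
}
print(yaml.safe_dump(mapping, sort_keys=False))
```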

Data can then be sent to the endpoint path using the following URL conventions for each endpoint type:

  • HTTP(S) - http(s)://{AlgoRun IP}:{HTTP Port}/{Deployment Owner}/{Deployment Name}/{Endpoint Path}
  • gRPC - http(s)://{AlgoRun IP}:{gRPC Port}/
  • Kafka - Broker: {AlgoRun IP}:9092, Topic: {Deployment Owner}.{Deployment Name}.{Endpoint Path}
  • RTMP - Coming Soon
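For example, posting a file to an HTTP(S) endpoint path could look like the sketch below. The IP address, port, file and deployment names are placeholders for your own deployment details.

```python
import requests

# Placeholder values; substitute the details of your own deployment.
ALGORUN_IP = "203.0.113.10"
HTTP_PORT = 31001  # assumed ingress port
OWNER, DEPLOYMENT, PATH = "acme", "image-classifier", "images"

# HTTP(S): http(s)://{AlgoRun IP}:{HTTP Port}/{Deployment Owner}/{Deployment Name}/{Endpoint Path}
url = f"http://{ALGORUN_IP}:{HTTP_PORT}/{OWNER}/{DEPLOYMENT}/{PATH}"
with open("cat.jpg", "rb") as f:
    response = requests.post(url, data=f, headers={"Content-Type": "image/jpeg"})
print(response.status_code, response.text)

# Kafka: the equivalent topic name follows {Deployment Owner}.{Deployment Name}.{Endpoint Path}
kafka_topic = f"{OWNER}.{DEPLOYMENT}.{PATH}"  # e.g. acme.image-classifier.images
```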

Data Sources

A data source enables the pipeline to ingest data from common databases and data repositories. Most of the current Data Sources are based on Kafka Connect, which is a great way to capture data to be loaded into Kafka. Many open source Kafka connectors are included in AlgoRun and we are continually adding new ones. While Kafka Connect is a great framework, a data connector can ultimately be built from any containerized application that can be configured to output records to Kafka, so you can also extend AlgoRun with new data connectors. Tutorials for creating custom data connectors and adding them to AlgoRun are coming soon.
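AlgoRun configures these connectors for you, but if you have not worked with Kafka Connect before, registering a source connector against a Connect worker's REST API looks roughly like the sketch below. The connector class and connection settings describe a generic JDBC source and are placeholders only.

```python
import json
import requests

# Registering a source connector directly with a Kafka Connect worker's REST API.
# AlgoRun normally manages this for you; the connector class and connection
# settings below are placeholders for a generic JDBC source.
connector = {
    "name": "orders-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://db.example.com:5432/shop",
        "mode": "incrementing",
        "incrementing.column.name": "id",
        "topic.prefix": "orders-",
        "tasks.max": "1",
    },
}
resp = requests.post(
    "http://connect.example.com:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
print(resp.status_code, resp.json())
```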

For more details on adding a Data Source to your pipeline, check out this article.

Data Sinks

Similar to the data source, most data sinks are based on Kafka Connect. A data sink can be added to a pipeline to enable writing the results of any component in the pipeline to common databases and repositories.

For more details on adding a Data Sink to your pipeline, check out this article.

Algos

Algos are the fundamental processing component of a pipeline and their implementation can be anything from simple data transformations to massive deep learning models. By adding an Algo to your pipeline, you can pipe a compatible output from an Endpoint or Data Source into the Algo input.

For more details check out:
Algo Overview
Adding an Algo to your pipeline

Hooks

Hooks enable you to add events to your pipeline to which any output can be routed. This capability can be used to create callbacks to external applications and users with the results from any component in the pipeline. When a hook receives an event, the data is posted to any Webhook Urls attached to the hook.
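On the receiving side, a Webhook Url just needs to accept HTTP POSTs. A minimal sketch using Python's standard library (the payload shape depends on which output was routed to the hook):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class HookReceiver(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read whatever the hook posted; the payload shape depends on the
        # component output that was routed to the hook.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        print("Received hook event:", body[:200])
        self.send_response(200)
        self.end_headers()

# Expose this URL (e.g. https://example.com/callback) as a Webhook Url on the hook.
HTTPServer(("0.0.0.0", 8000), HookReceiver).serve_forever()
```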

In order for a client to receive results from the event hooks, there are a couple of options.

Asynchronous Mode

By default, requests made to a pipeline endpoint are asynchronous, so the response from the endpoint is returned to the client application immediately. The response contains a RunID, which identifies the request made by the client. The client can then call the endpoint API again with the RunID to retrieve the results of the hooks. Because the runtime of the pipeline is indeterminate, the client may need to retry until the result is available or the timeout is reached.
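A sketch of this request/poll loop is shown below. The RunID field name, the query parameter and the endpoint URL are assumptions for illustration; check your deployment for the actual names.

```python
import time
import requests

ENDPOINT_URL = "http://203.0.113.10:31001/acme/image-classifier/images"  # placeholder

# Asynchronous mode: the first call returns immediately with a RunID.
# The response field and results route below are assumptions for illustration.
initial = requests.post(ENDPOINT_URL, json={"imageUrl": "https://example.com/cat.jpg"}).json()
run_id = initial["runId"]  # assumed field name

# Poll for the hook results until they are ready or we give up.
deadline = time.time() + 120
while time.time() < deadline:
    result = requests.get(ENDPOINT_URL, params={"runId": run_id})  # assumed query parameter
    if result.status_code == 200 and result.content:
        print(result.json())
        break
    time.sleep(2)  # results not ready yet; retry
```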

Synchronous Mode

When a client calls the endpoint, a parameter can be added to switch the run to synchronous mode. In synchronous mode, the call to the endpoint only returns a response once results have been received from every hook event, or the timeout is reached. This is especially useful for shorter running pipelines that focus on real-time results.
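A sketch of a synchronous call is shown below. The parameter used to switch modes is an assumption; consult your deployment's endpoint documentation for the real name.

```python
import requests

ENDPOINT_URL = "http://203.0.113.10:31001/acme/image-classifier/images"  # placeholder

# Synchronous mode: the call blocks until every hook event has produced a
# result or the timeout is reached. The parameter name is an assumption.
response = requests.post(
    ENDPOINT_URL,
    params={"mode": "sync"},  # assumed switch to synchronous mode
    json={"imageUrl": "https://example.com/cat.jpg"},
    timeout=60,
)
print(response.json())
```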

For more details on using endpoints, check out this article.

Pipes

Pipes are constructs that connect outputs to compatible inputs in the pipeline. The pipe concept is virtual in that it is not an in-memory pipe in the traditional Unix fashion; a pipe in AlgoRun simply defines where data is routed from and where it is delivered. There are a few rules to keep in mind when creating pipes (illustrated in the sketch after this list):

  • An output can be piped to multiple destinations
  • An input can only accept one pipe
  • If an Algo has multiple inputs, the Algo will be run immediately when data arrives on any one of its inputs from a pipe. There are plans to change this to allow aggregating data from multiple inputs before executing the Algo.
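The sketch below applies these rules to a hypothetical list of pipe definitions, treating every destination as an input for simplicity; the pipe field names are assumptions, not AlgoRun's spec format.

```python
from collections import Counter

# Hypothetical pipe definitions: each pipe routes one output to one destination.
pipes = [
    {"from": "endpoint:images", "to": "acme/resnet-classifier:2.1.0[1].input"},
    {"from": "acme/resnet-classifier:2.1.0[1].output", "to": "hook:results"},
    {"from": "acme/resnet-classifier:2.1.0[1].output", "to": "sink:warehouse"},  # same output, second destination: allowed
]

# Rule check: an output may feed many destinations, but an input may only
# accept a single pipe.
input_counts = Counter(p["to"] for p in pipes)
duplicates = [dest for dest, n in input_counts.items() if n > 1]
if duplicates:
    raise ValueError(f"These inputs have more than one incoming pipe: {duplicates}")
print("Pipe rules satisfied")
```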

Pipeline Versions

A pipeline can be versioned, which allows the pipeline configuration to evolve over time. With versioning you can create snapshots of pipeline configurations for testing and deployment.

New versions can be created in a couple of ways:

  • Clone - The clone option creates a snapshot of the currently selected pipeline version or a previous version you’ve selected to clone. The snapshot is used to create a new version with the new version tag.
  • Blank - The blank option creates an empty pipeline so you can start building from scratch. This can be useful when initially creating a pipeline. After deploying a pipeline, it is usually wise to clone the running version to make changes rather than starting from scratch.
