Algos are Containers
An Algo in AlgoRun is simply any containerized processing logic (code or an application built as a docker container) with a defined set of inputs and outputs. The implementation of the Algo can be a compiled executable, a script, or even an HTTP or gRPC server. This flexibility allows practically any existing code to be used or migrated into the AlgoRun platform. Docker containers are the most widely used container format and should be used to package up your Algos.
As you evaluate how to break down your pipeline, think of an Algo as a single processing step. If your script or executable encapsulates many stages, it can be beneficial to break down those stages into individual Algos. This process of breaking down complex pipelines into simpler individual processing blocks pays dividends later. Some of the benefits are:
- Improved reusability
- Loosely coupled logic
- Simplified upgrades of each stage
- Easier scaling
An Algo package may be able to execute more than one command or action. If the Algo is an executable or script, the separate actions should be clearly parameterized with command line arguments that separate the functionality. For server-based Algos, each HTTP method and URL path implicitly separates the functionality. This separation ensures that when designing a complete pipeline later, there is no ambiguity about what the Algo is doing.
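As a minimal sketch, an executable Algo with two distinct actions might expose them as subcommands so each action is unambiguous when wired into a pipeline. The action and flag names here are hypothetical, purely for illustration:

```python
import argparse

def build_parser():
    # Hypothetical Algo exposing two separate actions as subcommands,
    # so each action is clearly parameterized and unambiguous.
    parser = argparse.ArgumentParser(prog="my-algo")
    sub = parser.add_subparsers(dest="action", required=True)

    resize = sub.add_parser("resize", help="resize an input image")
    resize.add_argument("--width", type=int, required=True)

    convert = sub.add_parser("convert", help="convert an input format")
    convert.add_argument("--format", default="png")

    return parser

args = build_parser().parse_args(["resize", "--width", "640"])
print(args.action, args.width)
```

A pipeline can then invoke `my-algo resize --width 640` or `my-algo convert --format jpg` with no ambiguity about which stage of functionality is being used.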
Adding Algos to AlgoRun
Your Own Code
Have a docker image for your Algo processing code? It is easy to import that docker image into AlgoRun by creating a YAML Algofile with the metadata needed to execute the Algo. For a detailed breakdown of the Algofile YAML, take a look at the Algofile Reference documentation.
Here is a walkthrough for adding your Algo container to your local AlgoRun catalog.
Don't have a docker image for your Algo created yet? Take a look at these guides:
Importing from AlgoHub
AlgoHub.com has a catalog of Algos and pre-trained models, packaged up and ready to import into your AlgoRun instance. This makes it simple to get started with state-of-the-art machine learning, deep learning and specialized algorithms. The catalog is continually updated with open source and commercial AI algorithms, tools and integrations that enable quick deployment and implementation in your projects.
Here is a walkthrough of the methods for importing Algos from AlgoHub.com.
A Deployed Algo
To understand how an Algo is utilized within AlgoRun, take a look at what a single Algo deployment looks like.
Let's unpack how this architecture works.
When an Algo is deployed, a Kubernetes Pod is created with two containers. One container is the actual Algo implementation and the second is the AlgoRunner sidecar container. These containers are deployed as a Kubernetes Deployment resource with a single Pod for both containers. The Pod also mounts file system volumes to hold any input and output data. How the Algo is deployed is described in more detail in the Deployment Overview.
Once the Algo pod is deployed and ready, it will begin to process any data that is routed to it through a virtual 'pipe' in the pipeline. The AlgoRunner manages the processing through a series of stages:
- A data record is read from Kafka for all inputs to the Algo. The Kafka data records can be produced in many different ways and we cover that process in more detail in the Pipeline Overview. For now let's just assume we have data sitting in Kafka topics that we would like an Algo to process.
- If a transformer is configured, a data transformation is applied to the record in preparation for the Algo processing.
- The input data is delivered according to the needs of the Algo implementation.
- If the Algo accepts StdIn, HTTP or gRPC input, the data is never written to disk and it is relayed directly to the input.
- If the Algo requires file input, the record(s) are written to the file system and the file path(s) are delivered to the Algo as parameters.
- The Algo output(s) are monitored and the data is produced into the Kafka stream to be piped to any additional pipeline stages.
- If the output is written to StdOut or as an HTTP or gRPC response, the output data is not written to disk and it is produced directly to Kafka.
- If the output is written to a file or folder, there are two potential ways to configure the output. The file can either be embedded and produced to Kafka or the files are mirrored to shared storage and a file reference record is produced into Kafka. For more information on output handling see below.
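The stages above can be sketched in a greatly simplified form. This is a conceptual illustration only: the stub functions standing in for the Algo, the transformer, and the Kafka producer are hypothetical, and the real AlgoRunner handles far more (file-based delivery, shared storage mirroring, error handling):

```python
# Hypothetical stand-ins for the real Algo and Kafka producer.
def algo_process(data):
    # The Algo implementation; here it just upper-cases the record.
    return data.upper()

produced = []
def produce_to_kafka(msg):
    # Stand-in for producing the result into the Kafka output topic.
    produced.append(msg)

def run_record(record, transformer=None):
    # Stage 2: apply the configured transformer, if any.
    if transformer:
        record = transformer(record)
    # Stage 3: deliver directly (StdIn/HTTP/gRPC style - never written to disk).
    result = algo_process(record)
    # Stage 4: produce the Algo's output into the Kafka stream.
    produce_to_kafka(result)
    return result
```

The file-based delivery types follow the same shape, except stage 3 writes the record(s) to the mounted volume and passes file paths instead of relaying the data directly.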
As outlined above, the AlgoRunner is a component of AlgoRun that is deployed as a sidecar for every Algo in a pipeline. It is an important component that provides the Algo execution functionality. By deploying as a sidecar in the same Kubernetes Pod as the Algo, the AlgoRunner runs on the same host and has access to the same file system and network. The AlgoRunner itself is a lightweight Golang static binary. Essentially, AlgoRunner is a proxy for all run requests for the Algo. This provides many benefits:
- AlgoRunner implements the Kafka consumer, streaming and transformation logic. This enables you to utilize any code in a distributed, asynchronous and streaming fashion without making changes to your implementation.
- The Algo itself is more secure in that no additional ports are exposed to access the Algo.
- Data can be transformed immediately before being delivered to the Algo.
- The file naming conventions are standardized, which eliminates misconfigured or misspelled file references.
In order for AlgoRun to be able to deploy and run the dockerized Algos, additional configuration needs to be defined. Below is some of the core metadata that is required to inform the AlgoRunner how to manage and run the Algo implementation.
Every Algo is scoped to the username or organization that owns the Algo, similar to the way a git repository is scoped to its owner on github.com. If adding an Algo to a local AlgoRun instance that is not connected to your AlgoHub.com account, the Algo will be scoped to a reserved name called 'local'.
The Algo name is a URL-friendly shortened name. The name must be all lowercase, consisting of alphanumeric characters or single hyphens. It cannot begin or end with a hyphen and has a maximum of 50 characters.
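One way to express these naming rules is as a small validator. The regex and function below are illustrative only, not part of AlgoRun:

```python
import re

# Matches the stated rules: lowercase alphanumerics separated by
# single hyphens, with no leading or trailing hyphen.
NAME_RE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def is_valid_algo_name(name: str) -> bool:
    # Also enforce the 50-character maximum.
    return len(name) <= 50 and bool(NAME_RE.match(name))
```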
The Algo Title is a human friendly name with a maximum of 256 characters.
Each Algo definition can contain many versions of the Algo implementation. Each version consists of:
- Version Tag - Each version must be tagged, using semantic versioning or any other arbitrary tagging scheme. The version tag does not need to match the version of the code itself but, for clarity, it probably should.
- Docker image repository - The URL for the docker image repository. This can be the AlgoHub.com registry, a public docker registry like hub.docker.com or quay.io, or a private docker registry.
- Docker image tag - The docker image tag to use for this version.
- Entrypoint - The executable entrypoint for this version. While this likely stays the same with version changes, sometimes it is required to adjust the startup commands for future versions. If no entrypoint is defined, the entrypoint or command defined in the dockerfile will be used.
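As an illustration only, a version entry in an Algofile might look like the following. The field names and registry URL here are hypothetical; consult the Algofile Reference documentation for the actual schema:

```yaml
# Hypothetical Algofile fragment - field names are illustrative,
# see the Algofile Reference for the actual schema.
name: image-resize
title: Image Resize
versions:
  - versionTag: 1.0.0
    imageRepository: registry.example.com/myorg/image-resize
    imageTag: 1.0.0
    entrypoint: /app/resize
```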
The default execution timeout, in seconds, should be set for the Algo. The actual timeout used for Algo execution is determined in the following order:
- Pipeline defined timeout - A pipeline can override the timeout for each usage of the Algo.
- Algo defined timeout - The default Algo timeout as defined by the author.
- Global default timeout - A global execution timeout defined in the AlgoRun configuration.
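This fallback order can be sketched as a simple helper. The function and the 300-second global default are assumptions for illustration, not actual AlgoRun values:

```python
def resolve_timeout(pipeline_timeout=None, algo_timeout=None, global_default=300):
    # Precedence as stated: pipeline override > Algo default > global default.
    # The 300s global default is an assumed example value.
    for value in (pipeline_timeout, algo_timeout, global_default):
        if value is not None:
            return value
```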
Parameters are a list of arguments that will be set for the Algo. If the Algo is an executable or script, the parameters are translated into command line arguments and flags. If the Algo is behind a HTTP server, the parameters are sent in the querystring.
The more complete the parameter metadata, the better the pipeline designer can render the UI for configuring an Algo. Each parameter has the following metadata:
- Name - the actual name / key for the command line flag or querystring variable that will be sent.
- Description - a human friendly description of the parameter and how it is used.
- Value - the default value for the parameter. This value will be used when running the Algo if it is not overridden in the pipeline configuration.
- Data Type - the data type for the value of this parameter. By clearly defining the data type, a UI can be constructed in AlgoRun to effectively display possible configuration values and ensure the data is validated.
- Options - if the data type is a Select list of enumerated values, the available options can be defined here.
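To illustrate how a parameter list might translate into command line arguments for an executable Algo, here is a minimal sketch; the helper and its exact flag format are hypothetical, not the actual AlgoRunner behavior:

```python
def to_cli_args(parameters):
    # Hypothetical translation of parameter metadata (name + value)
    # into long-form command line flags for an executable Algo.
    args = []
    for p in parameters:
        args.extend([f"--{p['name']}", str(p["value"])])
    return args
```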
Every Algo must have one or more inputs. An input defines how the Algo will accept data to be processed. Each input must be configured with the following:
- Name - A short, URL-friendly name for the input. The name must be all lowercase, consisting of alphanumeric characters or single hyphens. It cannot begin or end with a hyphen and has a maximum of 50 characters.
- Description - A human friendly description of the input and how it is used.
Input delivery type - The input delivery type defines how incoming data will be passed to the Algo. The following options are available:
- StdIn - Input data will be piped into the Standard Input of the executable or script. The Algo must be designed to read from StdIn for this option to be used.
- HTTP(S) - Input data will be delivered to the HTTP server using the configured verb (POST, PUT, GET), port, header and path.
- gRPC - Input data will be serialized into the defined Protobuf schema.
- Parameter - Input data will be written to the file system and the file path sent as a command line argument.
- Repeated Parameter - Input data will be written as multiple files to the file system and each file will be delivered as command line arguments that repeat the same parameter name.
- Delimited Parameter - Input data will be written as multiple files to the file system and each file will be delivered as a list of files names delimited by the configured delimiter as a single command line argument.
- Environment Variable - Input data will be written to an environment variable.
- Content Types - The accepted content types can be defined. By defining the content type for an input you gain additional features:
- Ensure only compatible outputs and inputs can be piped to each other
- Validation of the data being delivered to ensure it matches the content type
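As an illustration only, an input definition in an Algofile might look like the following. The field names are hypothetical; see the Algofile Reference documentation for the actual schema:

```yaml
# Hypothetical input definition - field names are illustrative.
inputs:
  - name: source-image
    description: The image file to be processed
    inputDeliveryType: Parameter
    parameter: --input-file
    contentTypes:
      - image/png
      - image/jpeg
```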
Every Algo must also have one or more outputs. An output defines how the Algo delivers the results of its processing.
- Name - A short, URL-friendly name for the output. The name must be all lowercase, consisting of alphanumeric characters or single hyphens. It cannot begin or end with a hyphen and has a maximum of 50 characters.
- Description - A human friendly description of the output and how it is used.
Output Delivery Type - The output delivery type defines how the result data is written. The following options are available:
- StdOut - Output results are written directly to StdOut. This can be useful for simple Algos that do not require logging, as StdOut cannot be used for logging with this option. All StdOut lines will be delivered into the Kafka output topic.
- File Parameter - Output results are written to a specific file. This file will be watched by the AlgoRunner and if a write is detected, the file will be captured and delivered into the Kafka output topic.
- Folder Parameter - Output results are written to a folder. All files written to this folder will be watched by the AlgoRunner and if a write is detected, the file will be captured and delivered into the Kafka output topic.
- Http Response - When the Algo is an HTTP server, the HTTP response will be delivered into the Kafka output topic.
- Content Type - The content type for the data generated by this output. By defining the content type for an output you gain additional features:
- Ensure only compatible outputs and inputs can be piped to each other
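Similarly, as an illustration only, an output definition might look like the following. The field names are hypothetical; see the Algofile Reference documentation for the actual schema:

```yaml
# Hypothetical output definition - field names are illustrative.
outputs:
  - name: resized-image
    description: The resized image written by the Algo
    outputDeliveryType: FolderParameter
    parameter: --output-dir
    contentType: image/png
```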
For additional information, check out the Algofile YAML reference documentation. There you will find a sample Algo YAML definition with every option that is available.