Concepts
The different parts of the DeepStructure system, and how they fit together
Application: Toplevel
Application
An Application is All The Things. The code that defines an Application is usually stored in a git repo. A version of an Application will usually correspond to a git SHA or tag.
An Application can be deployed locally, to DeepStructure Cloud, or to other infrastructure.
Deployment: Application Instance
A Deployment is an instance of a particular version of an Application. Not only may Deployments be hosted on different infrastructure, they may also have different configurations via configuration variables, secrets, etc.
Under the covers, a Deployment is persisted by DeepStructure as a set of SQL tables in the configuration DB. The tables can be queried to see the latest state of the Deployment, much as a Kubernetes cluster’s configuration can be queried via its API server.
Redeploying simply means (i) updating the SQL tables in the configuration DB; (ii) reconciling the changes between the previous version and new version; (iii) switching over to the new version when it’s quiescent (as atomically as feasible).
Rolling back a deployment merely means switching over to a previous version. This is a special operation designed to be as fast as possible, almost always faster than updating to a new version. Rolling back requires the previous version’s resources to still be available, i.e. not deleted or garbage collected; a version that has been deleted or garbage collected is, by definition, not available for rollback.
(You can also logically “roll back” to a previous Application version that’s been deleted or garbage collected by simply updating to that previous SHA/tag. This is not “magical” though, and goes through the normal, slower, redeployment process.)
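As a rough mental model (and only that; this is not DeepStructure’s actual implementation), redeployment and rollback can be pictured as operations on a configuration DB that records every deployed version plus an active-version pointer. All names and structures in this TypeScript sketch are hypothetical:

```typescript
// Illustrative sketch only: names and structures are hypothetical, not
// DeepStructure's actual implementation.
interface DeploymentVersion {
  sha: string;                      // git SHA or tag of the Application
  config: Record<string, string>;   // configuration variables, secrets, etc.
  resourcesAvailable: boolean;      // false once deleted or garbage collected
}

class ConfigurationDb {
  private versions = new Map<string, DeploymentVersion>();
  private activeSha: string | null = null;

  // Redeploy: (i) record the new version, (ii) reconcile resources,
  // (iii) switch over once the new version is quiescent.
  async redeploy(next: DeploymentVersion): Promise<void> {
    this.versions.set(next.sha, next);
    await this.reconcile(next);   // slow path: create/update infrastructure
    this.activeSha = next.sha;    // switch-over, as atomic as feasible
  }

  // Rollback: just flip the active-version pointer, provided the previous
  // version's resources still exist.
  rollback(sha: string): void {
    const prev = this.versions.get(sha);
    if (!prev || !prev.resourcesAvailable) {
      throw new Error(`version ${sha} is no longer available for rollback`);
    }
    this.activeSha = sha;         // fast path: no reconciliation needed
  }

  private async reconcile(next: DeploymentVersion): Promise<void> {
    // Placeholder for diffing the previous and new versions and applying changes.
  }
}
```

The point of the sketch is simply that redeploying pays for reconciliation, while rollback only flips the pointer, which is why rollback is almost always faster.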
A typical set of deployment environments might include:
- Local Environment per developer on your dev machine, for the git branch of the Application you're currently working on
- Dev Environment on DeepStructure Cloud, for the `main` branch of your Application. The Dev Environment updates on each push to `main`. It may be configured differently than the Prod Environment, because the goal is to minimize costs and maximize debuggability (see the configuration sketch after this list).
- Staging Environment on DeepStructure Cloud, for the `staging` branch of your Application. The Staging Environment updates on each push to `staging`, which might happen nightly, weekly, or manually after a QA process. The Staging Environment is configured as closely to the Prod Environment as is feasible, so that it faithfully simulates Prod. However, Staging may relax the strictest access controls that are applied to Prod.
- Prod Environment on DeepStructure Cloud, for the `prod` branch. Prod serves production traffic.
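To make the idea of per-environment configuration concrete, here is a purely illustrative sketch; the field names and values are invented for this example and do not reflect DeepStructure’s configuration format:

```typescript
// Hypothetical per-environment configuration; the field names are invented
// for illustration and are not DeepStructure's configuration format.
interface EnvironmentConfig {
  branch: string;
  logLevel: "debug" | "info" | "warn";
  replicas: number;
  strictAccessControls: boolean;
}

const environments: Record<string, EnvironmentConfig> = {
  // Dev favors low cost and debuggability.
  dev:     { branch: "main",    logLevel: "debug", replicas: 1, strictAccessControls: false },
  // Staging mirrors Prod, minus the strictest access controls.
  staging: { branch: "staging", logLevel: "info",  replicas: 3, strictAccessControls: false },
  // Prod serves production traffic.
  prod:    { branch: "prod",    logLevel: "warn",  replicas: 3, strictAccessControls: true },
};
```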
And by analogy to other tools:
| Git analogue | DeepStructure | Vercel | Cloudflare Pages + Workers |
|---|---|---|---|
| Repository | Application | Project [*] | Application |
| Branch | Environment | Environment | Environment |
| SHA1 | Deployment | Deployment | Deployment |
[*] Vercel also uses “app” to casually refer to what they officially call Projects.
Workflow: Independent unit of work
An independent unit of work is something that can run on its own and has a standalone purpose.
For example, within a Linux desktop environment, a program like Firefox is an independent unit of work because you can launch Firefox at any time. The purpose it serves is to browse the web.
However, within a live Firefox process tree, a thread is a dependent unit of work. For example, it doesn’t make any sense to run a Firefox IO thread on its own; it only makes sense within the Firefox process tree. The purpose the IO thread serves is to provide a service to other threads in the Firefox process tree.
Workflow
A Workflow is an independent unit of work in a DeepStructure Application. A Workflow might have a purpose such as, “ingest and index a document store for search”, or “generate a marketing email for a particular user and product”.
A DeepStructure Application consists of one or more Workflows. Workflows may have indirect dependencies on each other. For example, a retrieval-augmented generation (RAG) Workflow might depend on indexes that are built by a document-ingestion Workflow.
Workflow code has an entry point at which it starts executing. This is like the `main()` entry point in a traditional program, or, more analogously, like the handler of an AWS Lambda function.
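To make the entry-point idea concrete, here is a minimal TypeScript sketch. The `RunContext` shape and the helper functions are hypothetical, not DeepStructure’s actual API; the point is only that a Workflow has a single function where execution begins, much like a Lambda handler:

```typescript
// Hypothetical shapes; not DeepStructure's actual API.
interface RunContext {
  workflowId: string;
  runId: string;
  triggerReason: string;
  input: { documentUrls: string[] };
}

// Illustrative helper stubs standing in for real ingestion logic.
async function fetchDocument(url: string): Promise<string> {
  const res = await fetch(url);
  return res.text();
}

async function indexDocument(text: string): Promise<void> {
  // In a real Workflow this would write to a search index.
  console.log(`indexed ${text.length} characters`);
}

// The Workflow's entry point: execution begins here, like main() in a
// traditional program or the handler of an AWS Lambda function.
export async function ingestAndIndex(ctx: RunContext): Promise<void> {
  for (const url of ctx.input.documentUrls) {
    const text = await fetchDocument(url);
    await indexDocument(text);
  }
}
```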
We’ll sometimes call Workflows “Pipelines” too. There’s no strict definition of either a Workflow or a Pipeline, but in our minds a Pipeline is a special case of a Workflow:
- Pipeline: a sequence of synchronous computations, structured as a directed acyclic graph (DAG), that computes an output given some inputs. Pipelines always terminate (unless they’re streaming pipelines). Pipelines are common in Big Data applications. For example, Walmart might have a pipeline that runs nightly to compute the price of a loaf of bread for the next day.
- Workflow: sequence of computations that may be synchronous or asynchronous, and may not terminate. Workflows often include intermediate steps where they “sleep” for some amount of time, or make an asynchronous request for a human or LLM to provide some input data.
It’s also worth noting that Workflows (/ Pipelines) can be static or dynamic; a sketch contrasting the two follows this list:
- Static Workflow: all of the data dependencies and intermediate stages are known ahead of time, just by inspecting the code. You could draw a graph diagram of a static Workflow before you ever run it.
- Dynamic Workflow: the complete graph of computations is not known until runtime. You can draw a diagram for a particular run of a dynamic Workflow, but the diagram might differ from the diagram of another run of the same dynamic Workflow.
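Here is a minimal sketch of the static/dynamic distinction, written as plain TypeScript functions rather than any particular Workflow API (all helper names are illustrative):

```typescript
// Static: the stages and their data dependencies are visible in the code
// itself; you could draw the graph (fetch -> summarize -> store) without
// ever running it.
async function staticWorkflow(url: string): Promise<void> {
  const text = await fetchText(url);
  const summary = summarize(text);
  await store(summary);
}

// Dynamic: how many summarize steps run, and over which documents, is only
// known at runtime, so each Run can produce a different graph.
async function dynamicWorkflow(indexUrl: string): Promise<void> {
  const urls = await discoverDocuments(indexUrl);   // runtime-dependent fan-out
  for (const url of urls) {
    const summary = summarize(await fetchText(url));
    await store(summary);
  }
}

// Illustrative stubs so the sketch is self-contained.
async function fetchText(url: string): Promise<string> {
  return (await fetch(url)).text();
}
async function discoverDocuments(indexUrl: string): Promise<string[]> {
  return (await fetchText(indexUrl)).split("\n").filter(Boolean);
}
function summarize(text: string): string {
  return text.slice(0, 200);
}
async function store(summary: string): Promise<void> {
  console.log(`stored summary of ${summary.length} characters`);
}
```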
Workflows can also be streaming or not. In a streaming Workflow, the complete set of data that the Workflow is going to operate on is not available when the Workflow starts running. Data will be pushed through the Workflow in discrete events.
Run: Workflow Instance
A Run is a particular instantiation of a Workflow, triggered by some external event. Each Workflow Run has a Run Context, which encapsulates data such as a Workflow ID, Run ID, input data, trigger reason, etc. Code can introspect on Runs through DeepStructure APIs.
DeepStructure includes web consoles for viewing the list of Runs: pending, running, paused, completed, and errored. For each Run, the console shows its graph visualization, Run Context, infrastructure resources assigned to the Run, detailed error statuses, etc.
Under the hood, Workflow Runs simply correspond to rows in a Runs table, along with ancillary data. DeepStructure customers can directly query the Runs table to build their own consoles and visualizations.
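For example, a customer-built dashboard might read recent Runs straight from the configuration DB. The sketch below assumes a Postgres-compatible database and a hypothetical `runs` table with `run_id`, `workflow_id`, `status`, and `started_at` columns; the real schema for your Deployment may differ:

```typescript
import { Client } from "pg"; // assuming a Postgres-compatible configuration DB

// Hypothetical schema: table `runs` with columns run_id, workflow_id,
// status, and started_at. The real schema may differ.
async function listRecentErroredRuns(connectionString: string) {
  const client = new Client({ connectionString });
  await client.connect();
  try {
    const result = await client.query(
      `SELECT run_id, workflow_id, status, started_at
         FROM runs
        WHERE status = 'error'
        ORDER BY started_at DESC
        LIMIT 20`
    );
    return result.rows;
  } finally {
    await client.end();
  }
}
```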
Component: Dependent unit of work
Workflows run as a sequence of steps. The individual steps are dependent units of work; they usually don’t make sense to run on their own, only within the context of a Workflow.
Component
Workflows compose together Components via data dependencies in order to achieve some goal. Components themselves may comprise other Components that have been composed together.
Components may be written in any supported programming language, and may be stateful or stateless. A stateless Component is relatively simple to understand: it’s like a pure function. A stateful Component is more complicated to grok and better understood from the perspective of Instances, below.
Depending on the context, we might also call Components by the names “Stages” or “Steps” or “Transforms” or “Transformations”.
Instance: Component Instance
An Instance — or we might say Component Instance to be more precise and avoid ambiguity — is a particular activation of a Component in the context of a Workflow Run. A Component is activated for a reason, at a given stage in Run execution, and with given input data. These constitute the Instance’s Activation Record, along with IDs and other metadata.
Stateless Components are relatively straightforward: if the stateless Component is defined as a function in the implementation language, say for example `function myTransform(...)`, then an Instance of the Component is simply a call to the function. The Instance’s Activation Record literally begins with a stack frame and the other metadata mentioned above. All of the data dependencies and computations in the `myTransform` Component are captured by local variables and subroutine calls within the code of `myTransform`.
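Concretely, a stateless Component can be pictured as an ordinary pure function. The input and output shapes below are illustrative, not a prescribed DeepStructure interface:

```typescript
// A stateless Component as a pure function: the output depends only on the
// input, so an Instance is just a call to it (a stack frame plus the
// activation metadata). The shapes are illustrative only.
interface InputDoc {
  id: string;
  body: string;
}

function myTransform(doc: InputDoc): { id: string; wordCount: number } {
  const wordCount = doc.body.split(/\s+/).filter(Boolean).length;
  return { id: doc.id, wordCount };
}
```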
Stateful Components require a bit more unpacking to understand. They can be stateful in one of two ways (a short sketch illustrating both follows this list):
- Instance state: when the Component is activated as an Instance, the Instance may store global state in shared memory or other storage within the runtime environment of the Instance. This shared global state may influence the behavior of sub-Components or other code. So we could say that this kind of Component is stateful because it consists partly of impure functions. This is important primarily for Components in streaming Workflows, whose Instances may be long-lived.
- External state: a Component’s code might interact with the outside world by sending HTTP requests, reading or writing databases, etc. in ways that are not captured by the declared data dependencies in the Workflow itself.
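Here is a short sketch illustrating both kinds of statefulness; as before, the shapes and names are illustrative rather than an actual DeepStructure API:

```typescript
// Instance state: a module-level cache shared across calls within one
// long-lived Instance (e.g. in a streaming Workflow). The function is
// impure because earlier activations influence later results.
const seenIds = new Set<string>();

function deduplicate(event: { id: string; payload: string }): string | null {
  if (seenIds.has(event.id)) return null;   // depends on prior events
  seenIds.add(event.id);
  return event.payload;
}

// External state: the Component reads or writes the outside world in ways
// that the Workflow's declared data dependencies don't capture.
// The URL is a placeholder for some external service.
async function enrich(payload: string): Promise<string> {
  const res = await fetch("https://example.com/lookup", {
    method: "POST",
    body: payload,
  });
  return res.text();
}
```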