Nodus

(WARNING: EXPERIMENTAL)

Description

Framework for parallel data-flow based applications.

It is influenced by, inspired by, and in many cases similar to:

In More Detail

Guiding Assumptions

This library and associated commandline tools are most appropriate for these types of (overlapping and somewhat redundant) problems & constraints:

| | | | —— | —- | | Dataflow-Oriented | Problems where the easiest way to look at it is a (possibly branching) pipeline of operations on a stream of data. | | Steady-State | Where the processing nodes and overall application have upper bounds on their memory requirements in order to safely and reliably handle very long running streams. i.e., bounded online algorithms and streaming-algorithms | | Functional | Most of the generator & processing nodes are usually assumed to be side-effect-free and pure (at least given all previous inputs). | | Composable | Easy to make nodes out of combinations and networks of other nodes | | Proportionate| Very easy and fast to do a simple pipeline (for example some simple functions that mutate an integer from within the console), but easily scales up to more complex production-ready solutions/projects. It strives to maintain the following inequality: effort ≤ problem‐complexity. |

It is additionally tuned for (but doesn’t assume) problems with the following properties:

| | | | —— | —- | | Parallel | For example, map-reduce type problems, or wherever strictly sequential simply isn’t required. | | Ordered | When there is a natural ordering to the tokens. e.g., a time-series, or bytes from most IO streams; | | State-Accumulation| Where the tokens can accumulate state and where processing nodes can look at previous process results relative to that token as they pass through the graph. As opposed to destructive changes to the token at each process or lack of tokens altogether (like simple functions). | | Multi-Stream | Potentially create secondary streams that are unsynchronized with the original stream; | | Coordinated | Dataflow type problems where there is a need to synchronize/coordinate multiple orthogonal streams; | | Decoupled | A way to cache intermediate results (e.g., half way through the pipeline) so that multiple stream processing applications can be simultaneously writing results and other unrelated processes reading a stream of results- sometimes in sync with when they are being written by the decoupled process. (some good examples to come). | | Daemonized | Where certain stream sources and graphs of processors and sinks should be managed as a single long-lived system process when desired. | | Simulations/Reruns| Persistent caching nodes (nexus decouplers) etc. allow one to easily simulate reruns of past parts of a stream- possibly generating a new version of subsequently persisted results. | | Rewinding | Processing nodes that require (bounded) rewinding that propagates to other nodes. |

Use Little’s law for queue bounds

Example: Single-Stream Sequential

+---+     +------+     +------+
| G | --> | f(x) | --> | g(x) | -->
+---+     +------+     +------+

Components

Data

Thing Otherwise Known As
Stream Flow, Signal Related ordered data representation- bundled into tokens (chunks). A gradually consumed state
Token Packet, Event, Chunk, Element, Sample Coherent instance of stream data being mutated and passing through the pipelines (potentially in parallel) - special values of EOF & Exceptionals

Stream

Defined by: * The eventual property distribution of its constituent tokens (possibly given as a shorthand name/“token-type”), which, because the token’s internal state ends up mirroring the operations that have occurred, means it is also a way of describing the graph of process nodes that act on tokens within the stream (explicitly or implicitly- don’t know yet) (usually all or a subset of a nodus application specification); * An origination node for that specific stream (not necessarily a standalone origination node like for an app) * A unique session identifier (uuid) identifying this run-instance, possibly shared with other streams (or Nexus’ / Decouplers) * Creates the initial instances of tokens, with a monotonic order indexed from the first token of the session. (Hence, streams can be theoretically infinite into the future, but always have a finite well-defined past, at least in the context of a session). * Sometimes a version, which can affect caching, for example, or conflict resolution when a sink is permanently saving state from a running stream, etc. * No cycles. Cycle-like behavior is handled by creating another stream with a delay (for example).

Stream == token type == (the data accumulated in the token are related) & especially (same intervals/timing)

Token

Session

Nodes

A node may be a generator for one or more streams AND/OR a sink for one or more streams, AND/OR an operator on one or more streams (accepting tokens from one or more upstream nodes and emitting them to one (or more?) downstream nodes. Generally though most nodes will deal with only a single stream- and most of those will be processor nodes, sandwiched between a generator at the beginning and a sink at the end.

Aspect Otherwise Known As
Generator Observable, Source, Origin, Enumerator, Producer, Start Has one or more output streams that originate there & are not among its incoming streams (although the incoming streams may have participated in its creation).
Sink Consumer, Iteratee, Fold, Reducer, End folds over one or more input streams & emits nothing except a final result/stats if the done/EOF token is encountered. Usually in charge of any external side-effects.
Processor Observable, Filter, Enumeratee, Operator, Function, Convolution-operation, Transformer Receives a token from a stream and either passes it through or advances its state before outputting the same stream-type.
Junction Whenever a node has input from more than one signal

Most nodes are intra-stream nodes…

Parameterization (applying the specified or calculated parameters) only happens once, when the first token is about to hit the node. Any state change after that needs to be handled manually.

(maybe nodes with no generators applied get default behavior if they are run standalone- integer sequence generator, for example, or default to stdin, argf, etc…)

Intra-Stream Nodes

Operate on data within a single stream

Type Behavior
Pipe Chains intra-stream nodes together to operate on a token sequentially
Branch Sends same token down multiple parallel branches (w/ or w/o conditions, w/ or w/o automatic remerge (wait) (?), w/ or w/o infered token subselection). Some paths may be skipped/short-circuited due to conditions.
Tap Observer, splice, tee ... Semantic sugar to specify a non-synchronizing branch off of a point in a stream from the perspective of the branch (as if it were a generator). Can also be dynamic and/or temporary.
(Cached) Wrap around a node if it has referential transparency (given all previous input). Might not implement for a while
Process (Map, Node, Function, ...) Simple function on latest token data
System A specialized Process that interfaces with an external application (via stdin/stdout/stderr for the most part?)
External Another specialized Process that interfaces with an external application, this time via some other IPC call/response mechanisms
View (Select?) Changes what the next node will consider the "active data" for the token.
Recombine (Wait, Merge, Synchronize, Waypoint) Named synchronization point that also causes a "view" to be the combination of all merged branches. timeout logic, subselection logic, etc.

|

NOTES:

Stream-level Nodes (Inter-stream)

Application / Runtime

Nodus Application Specification

(could have just as easily been named ‘Stream Processing Graph’ or ‘Nodus Application Specification’, …)

Technically a normal node that happens to:

In practical terms, it:

File/spec hierarchy, auto-loading, paths, … (node-language agnosticism?)

Nodus Execution Engine (nodus command)

Application Groups


Scratch

Token: any object instance but usually openstruct with hash-like access as well (so like HashWithIndifferentAccess) but with WRITE-ONCE semantics! (throws error if the same process/node tries to write the same field a second time)

suggested behavior: - lock-field-value-on-write - unlock-all-field-values (for when a node starts its turn | possibly specify that this proc/node is one that can write to it) - exception on rewrite of a field after it’s locked -

Contributing

Copyright © 2014 Joseph Wecker. MIT License. See LICENSE.txt for further details.