I mentioned in some mail to the IETF list that I have a new idea for how to do the datagram (i.e. non-flow) mode in Nimrod. I've been thinking about this as a result of discussions at the Houston IETF. I think it produces a more efficient datagram mode than I had hitherto imagined; in fact, it may be even more efficient than the existing "hop-by-hop" model! The scheme is fundamentally a rethink of the division of state in a datagram network between the routers and the actual packets. Most packets currently contain only a small amount of state associated with the forwarding process ("forwarding state") - the hop count. I propose that slightly increasing the amount of forwarding state in packets can produce a system with useful properties. The scheme also borrows some concepts and ideas from SIP and PIP.

It turns out to tie together a number of disparate concepts, including some incomplete bits of Nimrod (how to specify the "descending" part of the path), information hiding, interfacing to other routing protocols (both as a transitional measure and permanently), and the "EGP/IGP split" (such as it is, since there is no such explicit boundary in Nimrod), as well as datagram mode. There are also useful features for other parts of the internetwork layer; for instance, this mode makes it trivial to restrict the amount of resources used by datagrams in a resource-allocated network. This note is an attempt to pass this scheme on. Thanks to Isidro and Ram for helping make it a little clearer.

On those superficially unrelated topics: one thing I had never really figured out was how to specify the "descending" end of the path. I.e., if you are trying to compute a path from yourself, A.P.X, to F.Q.Y, and your map includes the objects A.P.*, A.*, B, C, D, E, and F, with no detail on the internals of B-F (which are not near you), you may be able to specify the path up to F, but how do you describe the path inside F?

One obvious way to handle this is to say "Oh, you have to go out and get the map of F, etc, before you can finish specifying the path." This is admittedly clunky, but might be OK for flow setup; it would not be acceptable for datagram mode, though. However, having to do without the map in datagram mode is not unreasonable: it turns out to be exactly the same problem as handling information hiding. I.e., in the example above, F might not be willing to give you the map, due to an information hiding policy! So, we have to be able to do this without a map of F, and whatever mechanism we have for handling that can be used in the datagram mode as well.

Obviously, this also relates to the "EGP/IGP split", and to handling areas running other routing architectures. If a packet arrives at the border of an area, destined for a node inside the area, and the area is not willing to provide topology info to the rest of the world, it doesn't really matter what routing architecture is running inside the area, as long as packets that get to the border of that area correctly wind up at the specified destination.

To now jump back to the seemingly somewhat different topic of datagram mode: I've been wary of providing a "hop-by-hop" mode (i.e. one in which each router independently makes a decision, based on the destination locator of the packet, as to the best next hop for the packet), since to do so requires that the algorithms and databases in all routers be sufficiently coordinated to guarantee that this process will not result in traffic loops.
I am aware that there are many schemes (both link-state and distance-vector) which do theoretically guarantee such consistency, but all of them pretty much depend on correct operation of the participating elements. They are thus not as robust (in real-world terms, where humans make coding errors, hardware fails in funny ways, etc) as I'd like. Study of the operation of large systems (electrical, phone, etc) shows failures which result from the *interaction* of several faults, any one of which was in itself harmless.

In general, the weakness is that since the algorithm for choosing the path is fully distributed, you both i) have to trust all the nodes along the path to make the correct decision, and ii) have to provide some mechanism to coordinate those choices to make sure that the results produce a non-looping path. Both of these represent weak points; the second, by its complex nature, is more likely to suffer an unforeseen failure. Moreover, in the event of an error, the error is likely to be in a place not under the control of the affected client.

As a result of all this, I prefer systems which do not have to make this guarantee of "near-global consistency"; they are both simpler and more robust. I would, as an architect, *very* much like to avoid any such design *if at all possible*.

It was for this reason that I was thinking that perhaps the only non-flow mode in Nimrod would be a "source-route in packet" mode. This guarantees non-looping paths, but without having to guarantee near-global consistency in databases, etc, etc. It has several obvious disadvantages, though. First, the source (or some agent thereof) has to compute such source routes, so even if they are cached, there is some inefficiency there. Second, they have to be carried in the packets, making the packets more expensive to construct at the source, as well as bulkier. Third, the processing of such packets in network switches is almost inevitably less efficient.

As a result of this, I have come up with a new idea which retains most of the advantages of "source-routed datagrams" (SRD's, hereinafter), but without most of the defects. It will not supplant SRD's (which have capabilities, such as allowing source policies, that cannot be matched by this scheme), but might serve as a useful adjunct.

To understand this new datagram mode, you need to understand how Nimrod treats virtual links (see section 5.5, Virtual Links and Flow Repair). Source routes (both for flow setup and individual datagrams) can be specified in terms of high-level virtual entities which describe whole collections of basic entities. This has numerous advantages:

- A source route can be computed with a simpler map; i.e. you don't have to have complete info about the network topology of the path before you can specify it. Otherwise, you'd almost always have to augment your map before you could compute a path to a new destination.

- The source route can be specified in fewer bits, which is especially significant for source-routed packets.

- A route specified by high-level abstractions can be locally repaired if a particular physical entity fails; i.e. if the endpoints of a level-K virtual link notice that one of the physical assets making up the link has failed, they can try to find a different set of level K-1 entities which can be used to provide the same service. If so, fine; the repair is local, and nobody outside the level-K area has to know. If not, the process recurses.
  Any time an entity (at whatever level) named in a flow setup, etc, cannot be repaired, the source has to be notified, of course, since the local agents cannot know why that entity was selected.

- A route specified by high-level abstractions can take one of a number of alternate paths locally, e.g. to provide resources requested by a resource reservation; i.e. load-sharing.

Source routes using these virtual entities can be guaranteed to be non-looping overall by a recursive process: the top-level source route does not contain loops (at least, unless the source is stupid, and chooses a path with one :-), and each entity which advertises a virtual link is required to make sure that the implementation of that entity does not create a loop.

I had imagined that source routes might be specified in terms of these high-level virtual entities. Obviously, for flow setup, you want to instantiate a flow in the switches which provide the virtual entity, but this is not done in the SRD case. I was trying to make the processing of these source routes as efficient as possible when I came across some interesting mechanisms. (As a side note, a conventional source route, either strict or loose, also implies forwarding state in the packet, in that the source route always has to have some indication of how far processing of it has gone; this indication is forwarding state too.)

For efficient handling of SRD's, I had imagined that i) the packet would contain a pointer into the source route, and ii) routers would maintain, for each virtual link, a pre-set-up flow which instantiates the virtual link (hereinafter the "VLF", for 'virtual link flow'). When an SRD shows up at the router at the start of a VLF, it is somehow "associated" with the VLF until it gets to the end of it, at which time the source route is consulted, and the packet is routed onto the next VLF. (Obviously, physical links are just done as a one-shot, without the necessity of a flow.)

On thinking about the details, an obvious mechanism suggested itself. If all packets contain a "flow-id" field, *which would otherwise be unused in datagram packets*, the obvious thing is to store the flow-id of the VLF in that field. The packet will then traverse the routers between the start and end of the VLF, being handled just like any normal packet which is part of a flow, i.e. by the high-efficiency flow-forwarding mechanism. When the packet gets to the router which is the termination of the VLF, the flow block will indicate that the packet needs special handling; it will be examined, the flow-id of the next VLF will be stuck into the flow-id field of the packet, and off it goes.

This is slightly unusual, in that you don't visualize the flow-id field in the packet getting bashed about during transit to refer to *different* flows, but it works fine. I did realize that one minor bug is that if we define the global flow-id in the packet to be the source EID plus a source-local flow-id, this idea doesn't work (since you'd need to bash the source EID to that of the router at the start of the VLF, or something), so perhaps the flow-id field can't overload the source EID.

This mechanism uses the idea from the SIP SR (and a tip of the hat to Steve for that :-), which is that the routers in between don't need to grovel around in the packet, but just do the maximally efficient forwarding. In addition, in a world with resource allocation going on, each VLF could have a resource limit associated with it, which would prevent SRD traffic from interfering with allocated flows.
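Just to make that concrete, here is a minimal sketch (in Python, purely illustrative; the names Packet, vlf_table, forward_on_flow, and deliver_locally are invented for this note, not taken from any Nimrod spec) of what the router at the *end* of a VLF might do with an SRD. The routers in the middle never touch anything but the flow-id:

    # Sketch of SRD handling at the router where the current VLF terminates.
    # Intermediate routers forward purely on the flow-id, like any flow packet.
    from dataclasses import dataclass
    from typing import List, Dict

    @dataclass
    class Packet:
        source_route: List[str]  # virtual links, e.g. ["A.P->A", "A->B", "B->B.Q"]
        sr_pointer: int          # index of the virtual link currently being traversed
        flow_id: int             # flow-id field, overloaded to carry the current VLF

    def handle_srd_at_vlf_end(pkt: Packet, vlf_table: Dict[str, int],
                              forward_on_flow, deliver_locally):
        # The flow block for the terminating VLF flagged this packet as needing
        # special handling, so consult the source route carried in the packet.
        pkt.sr_pointer += 1
        if pkt.sr_pointer >= len(pkt.source_route):
            deliver_locally(pkt)          # source route exhausted; last hop reached
            return
        # Look up the pre-set-up flow (VLF) instantiating the next virtual link,
        # stamp its flow-id into the packet, and hand it back to flow forwarding.
        next_vlink = pkt.source_route[pkt.sr_pointer]
        pkt.flow_id = vlf_table[next_vlink]
        forward_on_flow(pkt)

The point being that the only per-packet work outside the normal flow machinery is one table lookup and one field update, and that only at the handful of routers which terminate VLF's.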
So, that makes processing of source-routed packets much more efficient, but it still leaves us with the other two disadvantages: a source route has to be computed, and carried in the packet. However, once this mechanism is available for efficient support of SRD's, it is but a small step to a non-SR datagram mode. There is a way to guarantee a strictly non-looping path, but without a source route in the packet, using a slight variant.

The packet contains, in addition to a mungable flow-id field, the source and destination locators, and a pointer into those locators. The idea (borrowed from PIP :-) is that the pointer starts out at the lowest level of the source locator, moves up that locator, then across to the destination locator, and then down.

In addition to these extra fields in the packet, all routers have to contain a minimal set of "pre-set-up" flows to certain routers which are at critical places in the abstraction hierarchy. (It turns out that these so-called "pre-set-up" flows do not actually have to be set up in advance, but can be created on demand. There is a minimum set of flows which do have to be *able* to be set up for the system to operate, however. It is purely a local decision, though, which, if any, of those flows to set up before there is an actual traffic requirement for them. As an efficiency move, when a datagram requires that a flow actually be set up to handle it, the data packet could be sent along with the flow setup request, avoiding the round-trip delay. So, let's call these flows "datagram mode flows", or "DMF's", realizing that none of them *have* to be created until actually needed.)

While going up the source locator, each "active" router (i.e. one that actually makes a decision about where to send the packet, as opposed to handling it as part of a flow) selects a DMF which will take the packet to the "next higher" level object in the source locator, advances the pointer, and sends the packet off along that flow. When the packet gets to the end of that flow, the process repeats, until it reaches a router which is at the least common intersection of the two locators (i.e. for A.P.Q.R and A.X.Y.Z, this would be when the packet reaches A). The process then inverts, with each active router selecting a DMF which takes the packet to the next lower object in the destination locator. So, A would select a flow to A.X, and once the packet got to A.X, A.X would select a flow to A.X.Y, etc.

This mode has almost none of the disadvantages of SRD, since the source doesn't have to compute a route, and there is no source route in the packet, just the source and destination locators (and the source locator is useful to have anyway when the packet gets to the ultimate destination, to allow a reply to be sent easily). Again, in a world with resource allocation going on, each DMF would have a resource limit associated with it, which would prevent pure datagram traffic from interfering with other resource allocations.
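Here is a minimal sketch of that active-router step, again purely illustrative: locators are written top-down, so A.P.Q.R is ("A", "P", "Q", "R"), and DatagramPacket, dmf_table, and the helper functions are invented names for this note, not Nimrod definitions.

    # Sketch of the "active router" decision for the non-SR datagram mode.
    # The DMF table maps a target object (a locator prefix) to the flow-id of
    # a datagram-mode flow leading to that object.
    from dataclasses import dataclass, field
    from typing import Tuple, Dict

    @dataclass
    class DatagramPacket:
        src_loc: Tuple[str, ...]       # e.g. ("A", "P", "Q", "R")
        dst_loc: Tuple[str, ...]       # e.g. ("A", "X", "Y", "Z")
        flow_id: int = 0               # overloaded with the flow-id of the current DMF
        phase: str = "up"              # "up" the source locator, then "down" the destination
        ptr: int = field(init=False)   # locator pointer: hierarchy depth reached so far

        def __post_init__(self):
            self.ptr = len(self.src_loc)   # start at the bottom of the source locator

    def common_prefix_len(a: Tuple[str, ...], b: Tuple[str, ...]) -> int:
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        return n                       # 1 for A.P.Q.R vs A.X.Y.Z, i.e. the object A

    def active_forward(pkt: DatagramPacket, dmf_table: Dict[tuple, int],
                       forward_on_flow, deliver_locally):
        top = common_prefix_len(pkt.src_loc, pkt.dst_loc)
        if pkt.phase == "up" and pkt.ptr > top:
            pkt.ptr -= 1                          # climb: A.P.Q.R -> A.P.Q -> A.P -> A
            target = pkt.src_loc[:pkt.ptr]
        else:
            if pkt.phase == "up":
                pkt.phase, pkt.ptr = "down", top  # least common object reached; invert
            if pkt.ptr == len(pkt.dst_loc):
                deliver_locally(pkt)              # we are inside the destination object
                return
            pkt.ptr += 1                          # descend: A -> A.X -> A.X.Y -> A.X.Y.Z
            target = pkt.dst_loc[:pkt.ptr]
        # The DMF to 'target' may already exist or be created on demand; either
        # way the packet then rides it, as an ordinary flow, to the next active
        # router, which repeats this step.
        pkt.flow_id = dmf_table[target]
        forward_on_flow(pkt)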
While there is a certain surface similarity between this mechanism and "classic" VC networks, where the "connection identifier" field in packets is updated as the packet flows through the network, it is only a surface similarity. In the VC case, the different values are simply names, with purely local scope (which is why they have to be modified as they cross scope boundaries), for end-to-end connections. This process is fundamentally different: a number of otherwise unrelated, independent DMF's are being strung together to carry a single datagram in an efficient and robust way.

It can easily be seen that the process guarantees that the resulting path is loop-free. Each flow selected must necessarily get the packet closer to its destination (since each flow selection results in the pointer being monotonically advanced through the locators), and the flows themselves are guaranteed not to loop when their paths are selected, prior to being set up. It is true that each "active" router does have to do these two steps correctly, but i) the part of the computation which is truly distributed is reduced to almost nothing, since each router only gets to make a *local* decision about a strictly defined element of the locator, and ii) far fewer routers get to make any routing decision anyway.

The only requirement on the state in the routers to ensure correct operation (as opposed to actual implementation errors of the mechanism, which are obviously a serious problem, but one which must be tackled by other means) is that a router's topology map must not contain incorrect data, i.e. a map which does not correspond to the real topology of the network. Consistency with reality is the only requirement; not a complete description of reality (the map may be missing lots of data), nor any kind of consistency with other routers.

So, I reckon it's thus more robust than any hop-by-hop scheme, although not as robust as, say, source-routed flow setup, which is the most reliable of all, since the original source can detect any failure of any of the selected devices, and select a new path which avoids the failures. (It turns out that flow setup can even provide "denial of service" protection; here, although each individual active router along the path can provide denial-of-service protection on its part of the path, that still does not give complete end-to-end denial-of-service protection, since a compromised active router can deny service.)

It might be possible to remove the "hop count" field, since there are now some fairly strong guarantees that traffic will not loop, but it might be useful to leave it in as an additional safety measure against unforeseen failure modes. Currently, the hop count is forwarding state which is retained in the packet to prevent loops, and that retained state is in some ways made redundant by the slightly more complex state which is carried by this method. Removing the hop count would slightly increase the efficiency of forwarding, obviously.

The forwarding of these packets is, as already noted, quite efficient, and in non-active routers it is maximally efficient (perhaps more so than even standard hop-by-hop). In the non-active routers, the packet is associated with a flow in a way that makes hardware processing possible without any software involvement at all. In active routers, the process of looking up the next DMF would be about as expensive as the current routing table lookup, and the main difference would be that the result of that lookup would have to be stored in the packet; not a great expense.

As a final note, these new forwarding state fields would not be covered by an end-to-end authentication system, any more than the existing hop count field (which is also forwarding state) would be. This avoids problems caused by the fact that the contents of these fields change as the packet traverses the network.
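To put the loop-freedom argument above a little more concretely, here is one way to see it, expressed against the illustrative DatagramPacket sketch from earlier:

    # Map the (phase, ptr) pair onto a single integer which every active-router
    # decision strictly increases; since it is bounded by the combined locator
    # lengths, and each DMF is itself a non-looping path, the number of active
    # hops a datagram can take is bounded.
    def progress(pkt: "DatagramPacket") -> int:
        if pkt.phase == "up":
            return len(pkt.src_loc) - pkt.ptr   # climbing the source locator
        return len(pkt.src_loc) + pkt.ptr       # descending the destination locator

This is the small amount of extra state in the packet doing the work that near-global database consistency has to do in a hop-by-hop scheme.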
The way this solves the "uncompleted source route" problem is obvious: for source-routed packets, you simply take the packet as far as you can with the source route, then go into this new datagram mode. Flow setups were always less of an issue, but obviously it's trivial to use the "locator pointer" feature to construct the balance of the flow path in an iterative fashion without creating a loop. This works for the "information hiding" case, as well as the case where the source simply doesn't feel like getting the relevant topology information.

It is worth noting that inside an area, you can actually use existing (or future) hop-by-hop forwarding mechanisms to handle either datagram or flow traffic. In either case, the area simply refuses to pass out internal topology information, and all source routes (be they datagram or flow) stop at the area boundary; inside the area, hop-by-hop forwarding carries the traffic the rest of the way. This should provide a useful capability both for interoperation during deployment, and also in the future for those who are not comfortable using flows; they can continue to use hop-by-hop forwarding if they wish. I believe that classical hop-by-hop architectures are less robust, and no more efficient, but if there are some circumstances in which hop-by-hop forwarding has advantages, then in this area (as in so many others :-) Nimrod leaves the choice up to the users, allowing them to decide on their own where they prefer to perch on the cost/benefit tradeoff spectrum.

As an additional feature, you could specify a number of different "service classes" (akin to the service classes in Unified) which are supported for pure datagram mode. This would entail the creation of a number of DMF's from each router, to support the different service types offered, and the state information in the routers would be increased by exactly the same factor as in Unified (over the amount of state needed for a single service class). Note that if a particular router chooses not to support a given service class, or a new service class is deployed incrementally, the result is not looping traffic, as it would be in a classic hop-by-hop model, since this method is inherently non-looping.

To discuss the state requirements a bit: the absolute minimum of pre-set-up state is basically zero, since DMF's can be created on demand; the minimal set of DMF's which has to be *available* is as follows. First, each interior router in an area contains a DMF to a border router of the area. Second, each border router of an area contains, at a minimum, a DMF to each lower-level area and object in that area. (As an aside, if we can create trees - sort of like multicast trees, but with many source leaves heading toward a single destination - then we only need one tree per destination across a whole area, as opposed to a separate flow from each leaf to the destination.) Third, at the top level, each border router of an area must contain a DMF to each other area at the top level. This level of state provides strictly hierarchical routing.

There are pretty obvious optimizations, but this note is already too long to contain the complete analysis. For example, if you keep DMF's to more than the minimal set (which is just up to one border router in interior routers, and down to each object one level down for each border router), and keep your table sorted for efficient lookups (probably much the same as the current routing table for hop-by-hop datagrams), you may be able to short-cut. Using the case above (a packet from A.P.Q.R to A.X.Y.Z), if A.P.Q is actually a neighbour of A.X.Y, and maintains a flow directly from A.P.Q to A.X.Y, then when the packet reaches A.P.Q, instead of going the rest of the way up and down, the pointer can be set into the destination locator at A.X.Y and the packet sent there directly.
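A hedged sketch of what that short-cut check might look like, continuing the illustrative DatagramPacket and dmf_table conventions from before (the table contents and flow-id numbers here are made up):

    # Before climbing further, an active router can scan its DMF table for a
    # flow leading directly into the destination locator, and if one exists,
    # jump the pointer straight to that point in the descent.
    def shortcut_forward(pkt, dmf_table, forward_on_flow, fallback):
        for depth in range(len(pkt.dst_loc), 0, -1):   # deepest prefix first
            target = pkt.dst_loc[:depth]
            if target in dmf_table:
                pkt.phase, pkt.ptr = "down", depth     # jump into the destination locator
                pkt.flow_id = dmf_table[target]
                forward_on_flow(pkt)
                return
        fallback(pkt)   # no short-cut; carry on with the normal up-and-over process

    # At A.P.Q, the minimal table only points "up" toward A.P, but one extra
    # entry for the direct A.P.Q -> A.X.Y flow lets the packet skip the trip
    # through A entirely:
    dmf_table_at_APQ = {
        ("A", "P"): 17,        # minimal set: up toward the enclosing object
        ("A", "X", "Y"): 42,   # optional extra DMF to the neighbouring area
    }

Whether any such extra entries are worth keeping is exactly the sort of purely local decision that the traffic monitoring described next could feed.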
I imagine some traffic monitoring and analysis (using *purely* local algorithms, a traditional Nimrod thing :-) might result in a database being created over time which shows which DMF's above and beyond the minimal set are worth keeping around; perhaps a generic subset might be defined which is kept pretty much everywhere. This traffic monitoring would also show which flows from the required minimal set of DMF's it would be useful to set up in advance of actual traffic which needed them. Again, however, all these sets can be changed in a local, incremental way, without disturbing the operation of the system as a whole.

One obvious question concerns the comparison between the amount of state necessary to contain these DMF's and the state requirements of either traditional LS or DV hop-by-hop approaches. It's not too bad, but again, this note is already too long to go into much detail. Part of the problem is that you can compare the theoretical minimum amounts of state needed in each of the variants, but people will start keeping around more state than that in practice: the theoretical minimum may result in severely non-optimal paths, since it enforces strictly hierarchical routing, and to avoid this, extra information (and state) will be added. In any case, the system is hierarchical, so even if the state requirements are a little higher for this new scheme, it can be run with the same total amount of resources as hop-by-hop by simply using more levels of hierarchy.

Direct comparison with the state overhead of classic hop-by-hop systems is thus a little difficult, but I can handwave. There, each router contains a routing table entry for each destination. You can minimize the amount of state in hop-by-hop for outgoing traffic by only keeping routes to the nearest border node for all destinations which are outside the area, and a similar scheme works for incoming traffic. This appears to match reasonably well the state overhead of this new datagram scheme, particularly if the tree approach is used.

The principal difference between classic hop-by-hop DV and this scheme appears to be that extra state (the locator pointer and flow-id) is kept in the packet, to record where in the process the selection of the path has gotten to, and what the current path of the packet is. At a high level, this slight increase in the amount of state carried in packets is what allows the stronger guarantees of non-looping traffic to be made, without the system complexity of guaranteeing near-global consistency (i.e. more complex state) in the routers. This small amount of extra state appears to me to be a reasonable tradeoff for the increased robustness.

	Noel

PS: Martha says that something like this scheme has been independently invented elsewhere, but she hasn't sent me a reference yet, so I can't credit them.

PPS: I need a good name for this new mode. Snappy suggestions welcome!