I mentioned in some mail to the IETF list that I have a new idea for how to do the datagram (i.e. non-flow) mode in Nimrod. I've been thinking about this as a result of discussions at the Houston IETF. I think it produces a more efficient datagram mode than I had hitherto imagined; in fact, it may be even more efficient than the existing "hop-by-hop" model! The scheme is fundamentally a rethink of the division of state in a datagram network between the routers and the actual packets. Most packets currently contain only a small amount of state associated with the forwarding process ("forwarding state") - the hop count. I propose that slightly increasing the amount of forwarding state in packets can produce a system with useful properties. The scheme also borrows some concepts and ideas from SIP and PIP.

It turns out to tie together a number of disparate concepts, including some incomplete bits of Nimrod (how to specify the "descending" part of the path), information hiding, interfacing to other routing protocols (both as a transitional measure and permanently), and the "EGP/IGP split" (such as it is, since there is no such explicit boundary in Nimrod), as well as datagram mode. There are also useful features for other parts of the internetwork layer; for instance, this mode makes it trivial to restrict the amount of resources used by datagrams in a resource-allocated network. This note is an attempt to pass this scheme on. Thanks to Isidro and Ram for helping make it a little clearer.

On those superficially unrelated topics: one thing I had never really figured out was how to specify the "descending" end of the path. I.e., if you are trying to compute a path from yourself, A.P.X, to F.Q.Y, and your map includes the objects A.P.*, A.*, B, C, D, E, and F, with no detail on the internals of B-F (which are not near you), you may be able to specify the path up to F, but how do you describe the path inside F?

One obvious way to handle this is to say "Oh, you have to go out and get the map of F, etc, before you can finish specifying the path." This is admittedly clunky, but might be OK for flow setup; it would not be acceptable for datagram mode, though. However, having to do without the map in datagram mode is not unreasonable: it turns out to be exactly the same problem as handling information hiding. I.e., in the example above, F might not be willing to give you the map, due to an information hiding policy! So, we have to be able to do this without a map of F, and whatever mechanism we have for handling that can be used in the datagram mode as well.

Obviously, this also relates to the "EGP/IGP split", and to handling areas running other routing architectures. If a packet arrives at the border of an area, destined for a node inside the area, and the area is not willing to provide topology info to the rest of the world, it doesn't really matter what routing architecture is running inside the area, as long as packets that get to the border of that area correctly wind up at the specified destination.

To now jump back to the seemingly somewhat different topic of datagram mode: I've been wary of providing a "hop-by-hop" mode (i.e. one in which each router independently makes a decision, based on the destination locator of the packet, as to the best next hop for the packet), since to do so requires that the algorithms and databases in all routers be sufficiently coordinated to guarantee that this process will not result in traffic loops.
I am aware that there are many schemes (both link-state and distance-vector) which do theoretically guarantee such consistency, but all of them pretty much depend on correct operation of the participating elements. They are thus not as robust (in real-world terms, where humans make coding errors, hardware fails in funny ways, etc) as I'd like. Study of the operation of large systems (electrical, phone, etc) shows failures which result from the *interaction* of several faults, any one of which was in itself harmless.

In general, the weakness is that since the algorithm for choosing the path is fully distributed, you both i) have to trust all the nodes along the path to make the correct decision, and ii) have to provide some mechanism to coordinate those choices to make sure that the results produce a non-looping path. Both of these represent weak points; the second, by its complex nature, is more likely to suffer an unforeseen failure. Moreover, in the event of an error, the error is likely to be in a place not under the control of the affected client.

As a result of all this, I prefer systems which do not have to make this guarantee of "near-global consistency"; they are both simpler and more robust. I would, as an architect, *very* much like to avoid any such design *if at all possible*.

It was for this reason that I was thinking that perhaps the only non-flow mode in Nimrod would be a "source-route in packet" mode. This guarantees non-looping paths, but without having to guarantee near-global consistency in databases, etc, etc. It has several obvious disadvantages, though. First, the source (or some agent thereof) has to compute such source routes, so even if they are cached, there is some inefficiency there. Second, they have to be carried in the packets, making the packets more expensive to construct at the source, as well as bulkier. Third, the processing of such packets in network switches is almost inevitably less efficient.

As a result of this, I have come up with a new idea which retains most of the advantages of "source-routed datagrams" (SRD's, hereinafter), but without most of the defects. It will not supplant SRD's (which have capabilities, such as allowing source policies, that cannot be matched by this scheme), but might serve as a useful adjunct.

To understand this new datagram mode, you need to understand how Nimrod treats virtual links (see section 5.5, Virtual Links and Flow Repair). Source routes (both for flow setup and individual datagrams) can be specified in terms of high-level virtual entities which describe whole collections of basic entities. This has numerous advantages:

- A source route can be computed with a simpler map; i.e. you don't have to have complete info about the network topology of the path before you can specify it. Otherwise, you'd almost always have to augment your map before you could compute a path to a new destination.

- The source route can be specified in fewer bits, which is especially significant for source-routed packets.

- A route specified by high-level abstractions can be locally repaired if a particular physical entity fails; i.e. if the endpoints of a level-K virtual link notice that one of the physical assets making up the link has failed, they can try to find a different set of level K-1 entities which can be used to provide the same service. If so, fine; the repair is local, and nobody outside the level-K area has to know. If not, the process recurses.
  Any time an entity (at whatever level) named in a flow setup, etc, cannot be repaired, the source has to be notified, of course, since the local agents cannot know why that entity was selected.

- A route specified by high-level abstractions can take one of a number of alternate paths locally, e.g. to provide resources requested by a resource reservation; i.e. load-sharing.

Source routes using these virtual entities can be guaranteed to be non-looping overall by a recursive process: the top-level source route does not contain loops (at least, unless the source is stupid, and chooses a path with one :-), and each entity which advertises a virtual link is required to make sure that the implementation of that entity does not create a loop.

I had imagined that source routes might be specified in terms of these high-level virtual entities. Obviously, for flow setup, you want to instantiate a flow in the switches which provide the virtual entity, but this is not done in the SRD case. I was trying to make the processing of these source routes as efficient as possible when I came across some interesting mechanisms. (As a side note, a conventional source route, either strict or loose, also implies forwarding state in the packet, in that the source route always has to have some indication of how far processing of it has gone; this indication is forwarding state too.)

For efficient handling of SRD's, I had imagined that i) the packet would contain a pointer into the source route, and ii) routers would maintain, for each virtual link, a pre-set-up flow which instantiates the virtual link (hereinafter the "VLF", for 'virtual link flow'). When an SRD shows up at the router at the start of a VLF, it is somehow "associated" with the VLF until it gets to the end of it, at which time the source route is consulted, and the packet is routed onto the next VLF. (Obviously, physical links are just done as a one-shot, without the necessity of a flow.)

On thinking about the details, an obvious mechanism suggested itself. If all packets contain a "flow-id" field, *which would otherwise be unused in datagram packets*, the obvious thing is to store the flow-id of the VLF in that field. The packet will then traverse the routers between the start and end of the VLF, being handled just like any normal packet which is part of a flow, i.e. by the high-efficiency flow-forwarding mechanism. When the packet gets to the router which is the termination of the VLF, the flow block will indicate that the packet needs special handling; it will be examined, the flow-id of the next VLF will be stuck into the flow-id field of the packet, and off it goes.

This is slightly unusual, in that you don't visualize the flow-id field in the packet getting bashed about during transit to refer to *different* flows, but it works fine. I did realize that one minor bug is that if we define the global flow-id in the packet to be the source EID plus a source-local flow-id, this idea doesn't work (since you'd need to bash the source EID to that of the router at the start of the VLF, or something), so perhaps the flow-id field can't overload the source EID.

This mechanism uses the idea from the SIP SR (and a tip of the hat to Steve for that :-), which is that the routers in between don't need to grovel around in the packet, but just do the maximally efficient forwarding. In addition, in a world with resource allocation going on, each VLF could have a resource limit associated with it, which would prevent SRD traffic from interfering with allocated flows.
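Just to make that concrete, here is a minimal sketch (in Python, purely illustrative; the names Packet, vlf_table, forward_on_flow, and deliver_locally are invented for this note, not taken from any Nimrod spec) of what the router at the *end* of a VLF might do with an SRD. The routers in the middle never touch anything but the flow-id:

    # Sketch of SRD handling at the router where the current VLF terminates.
    # Intermediate routers forward purely on the flow-id, like any flow packet.
    from dataclasses import dataclass
    from typing import List, Dict

    @dataclass
    class Packet:
        source_route: List[str]  # virtual links, e.g. ["A.P->A", "A->B", "B->B.Q"]
        sr_pointer: int          # index of the virtual link currently being traversed
        flow_id: int             # flow-id field, overloaded to carry the current VLF

    def handle_srd_at_vlf_end(pkt: Packet, vlf_table: Dict[str, int],
                              forward_on_flow, deliver_locally):
        # The flow block for the terminating VLF flagged this packet as needing
        # special handling, so consult the source route carried in the packet.
        pkt.sr_pointer += 1
        if pkt.sr_pointer >= len(pkt.source_route):
            deliver_locally(pkt)          # source route exhausted; last hop reached
            return
        # Look up the pre-set-up flow (VLF) instantiating the next virtual link,
        # stamp its flow-id into the packet, and hand it back to flow forwarding.
        next_vlink = pkt.source_route[pkt.sr_pointer]
        pkt.flow_id = vlf_table[next_vlink]
        forward_on_flow(pkt)

The point being that the only per-packet work outside the normal flow machinery is one table lookup and one field update, and that only at the handful of routers which terminate VLF's.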
So, that makes processing of source-routed packets much more efficient, but it still leaves us with the other two disadvantages: a source route has to be computed, and carried in the packet. However, once this mechanism is available for efficient support of SRD's, it is but a small step to a non-SR datagram mode. There is a way to guarantee a strictly non-looping path, but without a source route in the packet, using a slight variant.

The packet contains, in addition to a mungable flow-id field, the source and destination locators, and a pointer into those locators. The idea (borrowed from PIP :-) is that the pointer starts out at the lowest level of the source locator, moves up that locator, then across to the destination locator, and then down.

In addition to these extra fields in the packet, all routers have to contain a minimal set of "pre-set-up" flows to certain routers which are at critical places in the abstraction hierarchy. (It turns out that these so-called "pre-set-up" flows do not actually have to be set up in advance, but can be created on demand. There is a minimum set of flows which do have to be *able* to be set up for the system to operate, however. It is purely a local decision, though, which, if any, of those flows to set up before there is an actual traffic requirement for them. As an efficiency move, when a datagram requires that a flow actually be set up to handle it, the data packet could be sent along with the flow setup request, avoiding the round-trip delay. So, let's call these flows "datagram mode flows", or "DMF's", realizing that none of them *have* to be created until actually needed.)

While going up the source locator, each "active" router (i.e. one that actually makes a decision about where to send the packet, as opposed to handling it as part of a flow) selects a DMF which will take the packet to the "next higher" level object in the source locator, advances the pointer, and sends the packet off along that flow. When the packet gets to the end of that flow, the process repeats, until it reaches a router which is at the least common intersection of the two locators (i.e. for A.P.Q.R and A.X.Y.Z, this would be when the packet reaches A). The process then inverts, with each active router selecting a DMF which takes the packet to the next lower object in the destination locator. So, A would select a flow to A.X, and once the packet got to A.X, A.X would select a flow to A.X.Y, etc.

This mode has almost none of the disadvantages of SRD, since the source doesn't have to compute a route, and there is no source route in the packet, just the source and destination locators (and the source locator is useful to have anyway when the packet gets to the ultimate destination, to allow a reply to be sent easily). Again, in a world with resource allocation going on, each DMF would have a resource limit associated with it, which would prevent pure datagram traffic from interfering with other resource allocations.
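Here is a minimal sketch of that active-router step, again purely illustrative: locators are written top-down, so A.P.Q.R is ("A", "P", "Q", "R"), and DatagramPacket, dmf_table, and the helper functions are invented names for this note, not Nimrod definitions.

    # Sketch of the "active router" decision for the non-SR datagram mode.
    # The DMF table maps a target object (a locator prefix) to the flow-id of
    # a datagram-mode flow leading to that object.
    from dataclasses import dataclass, field
    from typing import Tuple, Dict

    @dataclass
    class DatagramPacket:
        src_loc: Tuple[str, ...]       # e.g. ("A", "P", "Q", "R")
        dst_loc: Tuple[str, ...]       # e.g. ("A", "X", "Y", "Z")
        flow_id: int = 0               # overloaded with the flow-id of the current DMF
        phase: str = "up"              # "up" the source locator, then "down" the destination
        ptr: int = field(init=False)   # locator pointer: hierarchy depth reached so far

        def __post_init__(self):
            self.ptr = len(self.src_loc)   # start at the bottom of the source locator

    def common_prefix_len(a: Tuple[str, ...], b: Tuple[str, ...]) -> int:
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        return n                       # 1 for A.P.Q.R vs A.X.Y.Z, i.e. the object A

    def active_forward(pkt: DatagramPacket, dmf_table: Dict[tuple, int],
                       forward_on_flow, deliver_locally):
        top = common_prefix_len(pkt.src_loc, pkt.dst_loc)
        if pkt.phase == "up" and pkt.ptr > top:
            pkt.ptr -= 1                          # climb: A.P.Q.R -> A.P.Q -> A.P -> A
            target = pkt.src_loc[:pkt.ptr]
        else:
            if pkt.phase == "up":
                pkt.phase, pkt.ptr = "down", top  # least common object reached; invert
            if pkt.ptr == len(pkt.dst_loc):
                deliver_locally(pkt)              # we are inside the destination object
                return
            pkt.ptr += 1                          # descend: A -> A.X -> A.X.Y -> A.X.Y.Z
            target = pkt.dst_loc[:pkt.ptr]
        # The DMF to 'target' may already exist or be created on demand; either
        # way the packet then rides it, as an ordinary flow, to the next active
        # router, which repeats this step.
        pkt.flow_id = dmf_table[target]
        forward_on_flow(pkt)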
While there is a certain surface similarity between this mechanism and "classic" VC networks, where the "connection identifier" field in packets is updated as the packet flows through the network, it is only a surface similarity. In the VC case, the different values are simply names, with purely local scope (which is why they have to be modified as they cross scope boundaries), for end-to-end connections. This process is fundamentally different: a number of otherwise unrelated, independent DMF's are being strung together to carry a single datagram in an efficient and robust way.

It can easily be seen that the process guarantees that the resulting path is loop-free. Each flow selected must necessarily get the packet closer to its destination (since each flow selection results in the pointer being monotonically advanced through the locators), and the flows themselves are guaranteed not to loop when their paths are selected, prior to being set up. It is true that each "active" router does have to do these two steps correctly, but i) the part of the computation which is truly distributed is reduced to almost nothing, since each router only gets to make a *local* decision about a strictly defined element of the locator, and ii) far fewer routers get to make any routing decision anyway.

The only requirement on the state in the routers to ensure correct operation (as opposed to actual implementation errors of the mechanism, which are obviously a serious problem, but one which must be tackled by other means) is that a router's topology map must not contain incorrect data, i.e. a map which does not correspond to the real topology of the network. Consistency with reality is the only requirement; not a complete description of reality (the map may be missing lots of data), nor any kind of consistency with other routers.

So, I reckon it's thus more robust than any hop-by-hop scheme, although not as robust as, say, source-routed flow setup, which is the most reliable of all, since the original source can detect any failure of any of the selected devices, and select a new path which avoids the failures. (It turns out that flow setup can even provide "denial of service" protection; here, although each individual active router along the path can provide denial-of-service protection on its part of the path, that still does not give complete end-to-end denial-of-service protection, since a compromised active router can deny service.)

It might be possible to remove the "hop count" field, since there are now some fairly strong guarantees that traffic will not loop, but it might be useful to leave it in as an additional safety measure against unforeseen failure modes. Currently, the hop count is forwarding state which is retained in the packet to prevent loops, and that retained state is in some ways made redundant by the slightly more complex state which is carried by this method. Removing the hop count would slightly increase the efficiency of forwarding, obviously.

The forwarding of these packets is, as already noted, quite efficient, and in non-active routers it is maximally efficient (perhaps more so than even standard hop-by-hop). In the non-active routers, the packet is associated with a flow in a way that makes hardware processing possible without any software involvement at all. In active routers, the process of looking up the next DMF would be about as expensive as the current routing table lookup, and the main difference would be that the result of that lookup would have to be stored in the packet; not a great expense.

As a final note, these new forwarding state fields would not be covered by an end-to-end authentication system, any more than the existing hop count field (which is also forwarding state) would be. This avoids problems caused by the fact that the contents of these fields change as the packet traverses the network.
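To put the loop-freedom argument above a little more concretely, here is one way to see it, expressed against the illustrative DatagramPacket sketch from earlier:

    # Map the (phase, ptr) pair onto a single integer which every active-router
    # decision strictly increases; since it is bounded by the combined locator
    # lengths, and each DMF is itself a non-looping path, the number of active
    # hops a datagram can take is bounded.
    def progress(pkt: "DatagramPacket") -> int:
        if pkt.phase == "up":
            return len(pkt.src_loc) - pkt.ptr   # climbing the source locator
        return len(pkt.src_loc) + pkt.ptr       # descending the destination locator

This is the small amount of extra state in the packet doing the work that near-global database consistency has to do in a hop-by-hop scheme.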
The way this solves the "uncompleted source route" problem is obvious: for source-routed packets, you simply take the packet as far as you can with the source route, then go into this new datagram mode. Flow setups were always less of an issue, but obviously it's trivial to use the "locator pointer" feature to construct the balance of the flow path in an iterative fashion without creating a loop. This works for the "information hiding" case, as well as the case where the source simply doesn't feel like getting the relevant topology information.

It is worth noting that inside an area, you can actually use existing (or future) hop-by-hop forwarding mechanisms to handle either datagram or flow traffic. In either case, the area simply refuses to pass out internal topology information, and all source routes (be they datagram or flow) stop at the area boundary; inside the area, hop-by-hop forwarding carries the traffic the rest of the way. This should provide a useful capability both for interoperation during deployment, and also in the future for those who are not comfortable using flows; they can continue to use hop-by-hop forwarding if they wish. I believe that classical hop-by-hop architectures are less robust, and no more efficient, but if there are some circumstances in which hop-by-hop forwarding has advantages, then in this area (as in so many others :-) Nimrod leaves the choice up to the users, allowing them to decide on their own where they prefer to perch on the cost/benefit tradeoff spectrum.

As an additional feature, you could specify a number of different "service classes" (akin to the service classes in Unified) which are supported for pure datagram mode. This would entail the creation of a number of DMF's from each router, to support the different service types offered, and the state information in the routers would be increased by exactly the same factor as in Unified (over the amount of state needed for a single service class). Note that if a particular router chooses not to support a given service class, or a new service class is deployed incrementally, the result is not looping traffic, as it would be in a classic hop-by-hop model, since this method is inherently non-looping.

To discuss the state requirements a bit: the absolute minimum of pre-set-up state is basically zero, since DMF's can be created on demand; the minimal set of DMF's which has to be *available* is as follows. First, each interior router in an area contains a DMF to a border router of the area. Second, each border router of an area contains, at a minimum, a DMF to each lower-level area and object in that area. (As an aside, if we can create trees - sort of like multicast trees, but with many source leaves heading toward a single destination - then we only need one tree per destination across a whole area, as opposed to a separate flow from each leaf to the destination.) Third, at the top level, each border router of an area must contain a DMF to each other area at the top level. This level of state provides strictly hierarchical routing.

There are pretty obvious optimizations, but this note is already too long to contain the complete analysis. For example, if you keep DMF's to more than the minimal set (which is just up to one border router in interior routers, and down to each object one level down for each border router), and keep your table sorted for efficient lookups (probably much the same as the current routing table for hop-by-hop datagrams), you may be able to short-cut. Using the case above (a packet from A.P.Q.R to A.X.Y.Z), if A.P.Q is actually a neighbour of A.X.Y, and maintains a flow directly from A.P.Q to A.X.Y, then when the packet reaches A.P.Q, instead of going the rest of the way up and down, the pointer can be set into the destination locator at A.X.Y and the packet sent there directly.
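A hedged sketch of what that short-cut check might look like, continuing the illustrative DatagramPacket and dmf_table conventions from before (the table contents and flow-id numbers here are made up):

    # Before climbing further, an active router can scan its DMF table for a
    # flow leading directly into the destination locator, and if one exists,
    # jump the pointer straight to that point in the descent.
    def shortcut_forward(pkt, dmf_table, forward_on_flow, fallback):
        for depth in range(len(pkt.dst_loc), 0, -1):   # deepest prefix first
            target = pkt.dst_loc[:depth]
            if target in dmf_table:
                pkt.phase, pkt.ptr = "down", depth     # jump into the destination locator
                pkt.flow_id = dmf_table[target]
                forward_on_flow(pkt)
                return
        fallback(pkt)   # no short-cut; carry on with the normal up-and-over process

    # At A.P.Q, the minimal table only points "up" toward A.P, but one extra
    # entry for the direct A.P.Q -> A.X.Y flow lets the packet skip the trip
    # through A entirely:
    dmf_table_at_APQ = {
        ("A", "P"): 17,        # minimal set: up toward the enclosing object
        ("A", "X", "Y"): 42,   # optional extra DMF to the neighbouring area
    }

Whether any such extra entries are worth keeping is exactly the sort of purely local decision that the traffic monitoring described next could feed.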
I imagine some traffic monitoring and analysis (using *purely* local algorithms, a traditional Nimrod thing :-) might result in a database being created over time which shows which DMF's above and beyond the minimal set are worth keeping around; perhaps a generic subset might be defined which is kept pretty much everywhere. This traffic monitoring would also show which flows from the required minimal set of DMF's it would be useful to set up in advance of actual traffic which needed them. Again, however, all these sets can be changed in a local, incremental way, without disturbing the operation of the system as a whole.

One obvious question concerns the comparison between the amount of state necessary to contain these DMF's and the state requirements of either traditional LS or DV hop-by-hop approaches. It's not too bad, but again, this note is already too long to go into much detail. Part of the problem is that you can compare the theoretical minimum amounts of state needed in each of the variants, but people will start keeping around more state than that in practice: the theoretical minimum may result in severely non-optimal paths, since it enforces strictly hierarchical routing, and to avoid this, extra information (and state) will be added. In any case, the system is hierarchical, so even if the state requirements are a little higher for this new scheme, it can be run with the same total amount of resources as hop-by-hop by simply using more levels of hierarchy.

Direct comparison with the state overhead of classic hop-by-hop systems is thus a little difficult, but I can handwave. There, each router contains a routing table entry for each destination. You can minimize the amount of state in hop-by-hop for outgoing traffic by only keeping routes to the nearest border node for all destinations which are outside the area, and a similar scheme works for incoming traffic. This appears to match reasonably well the state overhead of this new datagram scheme, particularly if the tree approach is used.

The principal difference between classic hop-by-hop DV and this scheme appears to be that extra state (the locator pointer and flow-id) is kept in the packet, to record where in the process the selection of the path has gotten to, and what the current path of the packet is. At a high level, this slight increase in the amount of state carried in packets is what allows the stronger guarantees of non-looping traffic to be made, without the system complexity of guaranteeing near-global consistency (i.e. more complex state) in the routers. This small amount of extra state appears to me to be a reasonable tradeoff for the increased robustness.

	Noel

PS: Martha says that something like this scheme has been independently invented elsewhere, but she hasn't sent me a reference yet, so I can't credit them.

PPS: I need a good name for this new mode. Snappy suggestions welcome!