Measurement Analysis and Improvement of rerouting in UNINETT

This thesis analyses the rerouting times in UNINETT, the Norwegian research network. Rerouting happens when the network topology changes and the routers have to calculate new paths to all destinations. The downtime due to rerouting is a major contributor to the overall service unavailability. It is therefore of interest to study the different components of this downtime and to propose changes that speed up the rerouting process. The main goal of the thesis is to improve the service availability in UNINETT. Measurements of periods of packet loss are available in UNINETT. These measurements, as well as statistics from the network nodes, are analysed, and the results are presented in this thesis.
UNINETT is a network with IS-IS as the routing protocol and fault handling by path restoration. Fault handling by path restoration means there is no spare capacity to switch to in case of a link or node failure. All routers in the network have to be updated about the topology change and find new paths around the failure. The delay until the network nodes share a common, stable view is called the convergence time. Packet loss is observed during this period, so it is of interest to make it as short as possible. The packet loss during the convergence time is caused by inconsistency in the routers' routing and forwarding tables. Transient loops between nodes can form in this phase. They impose extra load on the network, add delay, and in the worst case cause loss of packets. These loops are called micro loops, and they increase the downtime during the convergence period. The parametrization of IS-IS is studied, and changes to the parameter values are proposed. Too aggressive tuning of the parameters may introduce instability in the network, which increases the load on the nodes and links. This can lead to even longer convergence times and periods with packet loss.
The recommended values are tested in a small test lab replicating parts of the topology of Northern Norway in UNINETT. The results from the test are compared with a case study of a failure on the Trondheim-Tromsø link in UNINETT. The observations from the case study show a typical convergence time of up to 10 s. The results from the test lab show that it is possible to achieve sub-second convergence time without any compromise on the stability of the network. Due to the small scale of the test lab, the traffic intensity was too low to observe any overload of the nodes or links. Overload may be a problem in a full-scale network like UNINETT, and further testing is recommended before the proposed parameter changes are implemented in any production network. The fault handling in UNINETT is also studied, including the contribution of the different components to the convergence time. The observations and results from the case study and the test lab are used in this study as well. It is observed that the timer delay before the "Shortest Path Tree" (SPT) computation is run in the routers is the major contributor to the convergence time. This is improved by tuning the SPT timers, as observed in the results from the test lab. The failure detection and flooding components are also large contributors to the convergence time. The test lab shows that failure detection is improved by tuning the hello parameters in IS-IS, but a less processor-intensive method called BFD is recommended for further study. Tuning the parameters that trigger the data-link layer timers may also speed up failure detection, but this is not investigated further in the thesis. The flooding component is reduced by enabling the fast flooding command. The phenomenon of micro loops is studied, and the method called oFIB is recommended for implementation in UNINETT.
Micro loops are a small contributor to the convergence time in UNINETT today, but the customers' requirements for service availability are increasing, and a solution will be needed in the near future. In addition to the oFIB method, a fast repair technique like IPFRR may eliminate almost all downtime during rerouting in UNINETT, but this is a subject for further study.
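
The micro loops described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the thesis: the four-node topology, the link costs, and the names `build`, `next_hops`, `trace`, and `simulate` are all hypothetical and chosen only for illustration. Each router computes shortest paths independently, and routers switch to the post-failure forwarding table one at a time, so old and new forwarding entries coexist for a while.

```python
import heapq

# Hypothetical topology (not from the thesis): (node, node, cost)
EDGES = [("A", "B", 1), ("B", "D", 1), ("A", "C", 1), ("C", "D", 5)]


def build(edges):
    """Build a symmetric adjacency map {node: {neighbour: cost}}."""
    links = {}
    for a, b, cost in edges:
        links.setdefault(a, {})[b] = cost
        links.setdefault(b, {})[a] = cost
    return links


def next_hops(links, source):
    """Dijkstra from source; return {destination: first hop}."""
    first_hop = {}
    visited = set()
    queue = [(0, source, source)]
    while queue:
        cost, node, hop = heapq.heappop(queue)
        if node in visited:
            continue
        visited.add(node)
        if node != source:
            first_hop[node] = hop
        for neighbour, weight in links[node].items():
            if neighbour not in visited:
                new_hop = neighbour if node == source else hop
                heapq.heappush(queue, (cost + weight, neighbour, new_hop))
    return first_hop


def trace(fib, source, destination):
    """Follow forwarding entries hop by hop; report (path, looped)."""
    path = [source]
    node = source
    while node != destination:
        node = fib[node][destination]
        if node in path:
            return path + [node], True
        path.append(node)
    return path, False


def simulate(update_order, source="C", destination="D"):
    """Fail link B-D, then let routers adopt new tables one by one."""
    before = build(EDGES)
    after = build([e for e in EDGES if set(e[:2]) != {"B", "D"}])
    fib = {router: next_hops(before, router) for router in before}
    results = []
    for router in update_order:
        fib[router] = next_hops(after, router)  # this router has converged
        path, looped = trace(fib, source, destination)
        results.append((router, path, looped))
    return results


# B is adjacent to the failure and updates first, then A, then C.
near_first = simulate(["B", "A", "C"])
# Reverse order: the router farthest from the failure updates first.
far_first = simulate(["C", "A", "B"])

for router, path, looped in near_first:
    print(router, "updated:", " -> ".join(path), "(loop)" if looped else "")
```

In this toy topology, traffic from C to D loops while only some routers have updated when the order is B, A, C, and reaches D again once C has updated. With the order C, A, B the traced flow never loops, which is the intuition behind ordering forwarding-table updates; the sketch does not implement oFIB itself.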

Publisher

Institutt for telematikk