
Re: Network: Work on the VSS


oles@ovh.net
10.08.10, 23:56
We will start the last phase on the vss-2-6k.

http://travaux.ovh.com/?do=details&id=4461

We are going to change the configuration. This will restart the router, and it will take between 15 and 30 minutes for all services to come back.

oles@ovh.net
05.08.10, 22:10
http://status.ovh.net/?do=details&id=363

Good evening,

For the Roubaix 2 datacentre, we decided to build the network with a goal of 100% availability. For this we used Cisco 6509 switches in a VSS configuration. It is a system based on two chassis running as a single logical router. With two chassis, everything is doubled, and so we should have 100% availability.
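To give an idea of what "two chassis running as one" means in practice, here is a rough sketch of this kind of VSS setup on a Catalyst 6500 (the domain number, priorities and port-channel are illustrative examples, not our actual configuration):

    ! Illustrative VSS sketch - example values only
    switch virtual domain 100
     switch 1 priority 110
     switch 2 priority 100
    !
    ! the two chassis are tied together over a dedicated virtual switch link
    interface Port-channel10
     switch virtual link 1
    ! "switch convert mode virtual" then merges the two chassis into one router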

In the real world, we have had several problems with the VSS which have resulted in cuts in service, and it therefore did not meet the original specification. Basically, we have a chronic problem with BGP: at the slightest change in the routing table, the router's CPU sits at 100% for at least 15 minutes. In itself that is not serious, but at the end of 2009 we put strong protections in place on the internal network, which meant isolating every server from the others. We did this with private VLANs and a proxy ARP. It is a very standard solution: the router responds to ARP in place of all the servers and even routes traffic within the same VLAN. Everything is very secure. However, the router must answer the MAC (ARP) requests of all the servers, and that process runs on the VSS and takes up a lot of CPU.
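To make this concrete, the isolation described above looks roughly like the following on the router (a sketch only; the VLAN numbers and addresses are examples, not our actual configuration):

    ! Illustrative private VLAN + local proxy ARP sketch - example values only
    vlan 200
     private-vlan primary
     private-vlan association 201
    vlan 201
     private-vlan isolated
    !
    interface Vlan200
     ip address 192.0.2.1 255.255.255.0
     private-vlan mapping 201
     ip proxy-arp
     ip local-proxy-arp   ! the router answers ARP even between hosts of the same subnet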

Normally this works without problems. But when the system recalculates the routing tables and BGP takes 100% of the CPU, it also prevents the MAC (ARP) process from running. The result: servers no longer learn the MAC addresses and there is a break in service of 1, 3 or 8 minutes, depending on the size of the BGP table(s) being recalculated.
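On the router itself, this kind of contention would typically be observed with the CPU process view (illustrative only; these are the standard IOS process names, not a capture from our routers):

    router# show processes cpu sorted
    ! during a recalculation, the "BGP Router" / "BGP Scanner" processes sit
    ! near 100% while "ARP Input" barely gets scheduled - which is exactly
    ! when the servers stop getting ARP replies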

We believe the problem can be solved with routers dedicated to BGP, i.e. route reflectors. Normally we would have received the hardware this month, but the order was mishandled between the distributor and the manufacturer ... so we will receive it at best by the end of September ... We decided not to wait for that delivery and are implementing a solution this weekend.
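For those curious, a route reflector is declared with just a few lines of BGP configuration; the AS number and neighbour addresses below are placeholders, not our real values:

    ! Illustrative route-reflector sketch - placeholder AS and addresses
    router bgp 65000
     neighbor 10.0.0.2 remote-as 65000
     neighbor 10.0.0.2 route-reflector-client   ! this iBGP peer receives reflected routes
     neighbor 10.0.0.3 remote-as 65000
     neighbor 10.0.0.3 route-reflector-client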

But we would still have the MAC problem. We have therefore decided to break the VSS configurations and go back to what has always worked well: the router in a single chassis. We have a little under 30 single-chassis routers that pose no problems. It is only with the double-chassis configuration that we have problems. So we will split the chassis.

So since last week, we have been making changes to the VSS to move to a configuration based on a single chassis.

We will carry it out in four steps:
- All the datacentre links connected to chassis 2 will be reconnected to chassis 1: no break in service, since everything keeps running through chassis 1.
- All the links to the Internet connected to chassis 2 will be reconnected to chassis 1: no break in service, since everything keeps running through chassis 1.
- Power off chassis 2: no break in service, since chassis 2 will no longer be used.
- Change the configuration of chassis 1 to the single-chassis version (a sketch of the conversion follows below). As we will have to reboot the router's hardware, this will result in a 15-minute break in service, which we will carry out at 4:00 am at the end of next week, all going well.
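For step 4, on a Catalyst 6500 the change is essentially one mode-conversion command followed by the reload mentioned above (a sketch only; the exact procedure depends on the software version):

    ! Illustrative sketch of the step-4 conversion (syntax depends on the IOS version)
    router# switch convert mode stand-alone
    ! the chassis then reloads as a standalone router - this reload is
    ! the roughly 15-minute break in service mentioned above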

We will first tackle vss-2, the one that causes the most problems.

Normally, we should not have to wait until step 4 for the BGP problems to stop. They may already be resolved from step 2, or from step 3, because everything will then be running through a single chassis. But we are not sure. In any case, by the end of step 4 it will be fixed.

And as the BGP will be fixed, we think it is likely that the MAC problems will be too. If BGP does not work well on a double chassis, then maybe other processes do not work well on a double chassis either? We will see about that too.


We regret the small outages that customers of Roubaix 2 have suffered recently; they are mainly due to the problems described here. The wrong choice of hardware is the cause. We thought the manufacturer would solve the CPU problems, but according to them this behaviour is normal. This hardware is therefore incompatible with our needs. We will change it. We also managed the situation badly: we should not have waited for the manufacturer's help, but acted immediately to find another solution altogether. An error in problem management.

To continue with this transparency, you may have noticed problems in London, Amsterdam and Frankfurt about 14 days ago. We added secure links 14 days ago, London/Amsterdam and Paris/Frankfurt: large, heavy investments decided on to make the backbone completely secure and 100% available even in the event of a problem on the optical fibre. Adding these links to the routers saturated the routers' available RAM and caused the London crash, which in turn caused problems in Amsterdam and Frankfurt for the same reason. A router crash means a BGP recalculation, and therefore 100% CPU on the VSS ... so those crashes resulted in the service cuts in Roubaix 2. We fixed the problem by disabling MPLS, which we do not need but which was taking up 20% of the RAM. Since then it has been stable.
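For reference, turning MPLS off where it is not needed is a short change per link (the interface name below is a placeholder, not one of our links):

    ! Illustrative sketch - interface name is a placeholder
    interface TenGigabitEthernet1/1
     no mpls ip   ! stop MPLS forwarding on this link and free the associated label state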

We thought of changing all the routers during the holidays, but the hardware we wanted to deploy is not available, and what is available does not work. We received the new Cisco Nexus 7000 and BGP does not work on it; it just generates error messages ... New equipment, and still this ... A bad choice of hardware yet again. So a big re-think ... This in turn delays the planned router changes. We are using this time to review the whole manufacturers' market and see what we will deploy in place of what we had planned. A job which will cause unexpected delays on other projects ...

Anyway ...

I do not think we can be more transparent about these recent events.

All the best,
Octave