Arista Layer-3 Leaf-Spine Fabric with VXLAN HER: Lab Part 3

August 2, 2017
Configuring the Layer-3 Ethernet fabric

by Pablo Narváez

Hello there, welcome to the third article in the series. In this post, we will configure the Layer-3 Fabric (Underlay) so we are ready for VXLAN (Overlay).

During the configuration process, you will notice that the Layer-3 Leaf-Spine (L3LS) design has a number of elements that need to be considered during implementation.

The diagram below shows the fabric we are building; the details will be explained in the following sections.

Layer-3 Ethernet Fabric

For ease of implementation, I’m going to split the configuration into three parts:

  • Management Network: Out-of-band management
  • Layer-2 Configuration: Servers and Leaf switches
  • Layer-3 Configuration: L3LS interconnects and Leaf-Spine routing

Like any network design, IP addressing and logical networks need to be allocated and assigned. For this setup, we will use the IP addressing shown below.

IP Address Allocation Table

You don’t really need to collect all this information to configure the lab; I just added the MAC addresses and some descriptions to stay organized and to troubleshoot any problems more easily if necessary. However, if you want to gather this information for your own home lab, you can open the VM settings in virt-manager and check the details for every VM, as described in the previous post.

MANAGEMENT NETWORK

The out-of-band management network provides access and control of the devices outside of the production network. As the name implies, the primary use of the OOB network is access and control of the infrastructure when the production network is unavailable.

To configure this network, use the IP Address Allocation Table as a reference. As an example, the configuration for Leaf01 is shown below.

hostname leaf01
!
username admin role network-admin secret xxxxxx
!
vrf definition mgmt
 rd 0:65010
!
interface Management1
 description oob-mgmt
 vrf forwarding mgmt
 ip address 10.0.0.21/24
!
ip routing
!
no ip routing vrf mgmt
!
logging vrf mgmt host 10.0.0.1
!

Note that a VRF (Virtual Routing and Forwarding instance) is used for management; this isolates the management network and makes it unreachable from outside its subnet (unless you explicitly allow it). From the host machine (base OS), you will be able to ssh/telnet into the switches and servers through this network. Please check the previous article for the details of the OOB network.
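
To quickly confirm that the management VRF is up and the OOB network is reachable, you can run a couple of checks from a switch. The sketch below assumes 10.0.0.1 (the logging target configured above) is a reachable host on the OOB subnet; adjust to your own addressing:

leaf01#show vrf
leaf01#ping vrf mgmt 10.0.0.1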

LAYER-2 CONFIGURATION

As shown in the network diagram above, we have two different Compute Leafs:

  • Dual-homed Server Leaf
  • Single-homed Server Leaf

For the dual-homed server Leaf, our design consists of a pair of Leaf switches presented to the servers as a single switch through the use of MLAG (Multi-Chassis Link Aggregation). One benefit of MLAG in this particular design is that it eliminates the dependence on spanning tree for loop prevention, so all links between the Leaf switches and the servers are active. Servers don’t require any knowledge of MLAG and can simply be configured with LACP or static link aggregation (NIC bonding).

Layer-2 Leaf-Compute Network

Regarding the server gateways, this design uses an anycast default gateway technique known as Virtual ARP (VARP). On an MLAG pair, both switches coordinate the advertisement of an identical MAC and IP address (the VARP address) for the default gateway on each segment. Either switch can receive and service server requests, making an intelligent first-hop routing decision without traffic having to traverse the peer link.

MLAG works well with both the Virtual Router Redundancy Protocol (VRRP) and Virtual ARP (VARP). The main reason I chose VARP over VRRP is its simple configuration.

Beyond that, if you were to deploy a virtual gateway technology in production, VARP would make a lot of sense for the following reasons:

  • Reduces the burden on switch CPUs
  • Switches process all traffic requests independently so there’s no unnecessary traffic traversing the peer-link
  • There is no control protocol or messaging bus as in VRRP, so switches don’t exchange control traffic over the peer link to coordinate the gateway or to fail over control functions from the primary switch to its peer.

For single-homed servers, we don’t need MLAG or VARP; we just configure regular access ports and switched virtual interfaces (SVIs).

MLAG Configuration

Server-level configuration always needs to be reviewed as well, particularly with dual-homed active/active setups; if you need help configuring NIC bonding/LACP on the servers, check the documentation for your server operating system.

As a rule of thumb, the MLAG domain configuration must be identical on both switches, so look carefully at the few differences between them.

MLAG Domain Diagram

The MLAG peer VLAN 4094 is created and added to the mlagpeer trunk group. The MLAG peers must also have IP reachability with each other over the peer link (the SVI for vlan 4094).

hostname leaf01
!
vlan 4094
 name mlag-vlan
 trunk group mlagpeer
!
interface Vlan4094
 description mlag-vlan
 ip address 172.16.254.1/30
!
hostname leaf02
!
vlan 4094
 name mlag-vlan
 trunk group mlagpeer
!
interface Vlan4094
 description mlag-vlan
 ip address 172.16.254.2/30
!

To ensure forwarding between the peers on the peer link, spanning tree must also be disabled on this VLAN. Once the port channel for the peer link is created and configured as a trunk on Ethernet6 and Ethernet7, additional VLANs may be added as needed to transit the peer link. In this example, I’m configuring vlan 11 for server01, adding it to the port channel, and configuring the server-facing ports (Ethernet1 on both switches).

hostname leaf01
!
no spanning-tree vlan 4094
!
interface Port-Channel11
 description service01-portchannel
 switchport trunk allowed vlan 11
 switchport mode trunk
 mlag 11
!
interface Port-Channel54
 description mlag-portchannel
 switchport mode trunk
 switchport trunk group mlagpeer
!
interface Ethernet1
 description server01-ens2
 channel-group 11 mode active
!
interface Ethernet6
 description link_to_leaf02-eth6 (mlag-peerlink1)
 channel-group 54 mode active
!
interface Ethernet7
 description link_to_leaf02-eth7 (mlag-peerlink2)
 channel-group 54 mode active
!

Next, we need to configure the actual MLAG domain, which must be unique for each MLAG pair. As part of the domain configuration, we use interface Vlan4094 for IP reachability and Port-Channel54 as the physical peer link.

hostname leaf01
!
mlag configuration
 domain-id mlagDomain
 local-interface Vlan4094
 peer-address 172.16.254.2
 peer-link Port-Channel54
 reload-delay 500
!
hostname leaf02
!
mlag configuration
 domain-id mlagDomain
 local-interface Vlan4094
 peer-address 172.16.254.1
 peer-link Port-Channel54
 reload-delay 500
!

MLAG Verification

To verify the MLAG operation, you can run the following commands:

  • show mlag
  • show mlag config-sanity

Make sure the peer configuration is consistent and the MLAG status is Active. If you see a different MLAG state or any other error in the output, check the MLAG troubleshooting guide posted here.

For single-homed servers, ports are configured as access ports and assigned a VLAN. A Switched Virtual Interface (SVI) is created for each VLAN, which acts as the default gateway for the host/workload.
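
As a minimal sketch, assuming for illustration that Leaf03 serves a single-homed server on Ethernet1 in vlan 13 with the 192.168.13.0/24 subnet (the interface, VLAN number, names, and addresses are placeholders; use the values from the IP Address Allocation Table):

hostname leaf03
!
vlan 13
 name service03
!
interface Ethernet1
 description single-homed-server
 switchport access vlan 13
!
interface Vlan13
 description service03-gateway
 ip address 192.168.13.1/24
!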

VARP Configuration

Leaf01 and Leaf02 are MLAG peers and are configured to run VARP to provide an active/active redundant first hop gateway for server01 and server04. To provide routing within each rack, the Leaf nodes of an MLAG domain must be configured with an IP interface in every subnet.

hostname leaf01
!
interface Vlan11
 description service01-gateway
 ip address 192.168.11.2/24
 ip virtual-router address 192.168.11.1
!
ip virtual-router mac-address 001c.aaaa.aaaa
!
hostname leaf02
!
interface Vlan11
 description service01-gateway
 ip address 192.168.11.3/24
 ip virtual-router address 192.168.11.1
!
ip virtual-router mac-address 001c.aaaa.aaaa
!

The global common virtual MAC address is unique for each MLAG domain. In this example, the default gateway for vlan 11 uses 192.168.11.1, which resolves to the virtual MAC address 001c.aaaa.aaaa.

NOTE: As stated above, there’s no need to configure MLAG/VARP on Leaf03.

Repeat the same steps to configure the remaining VLANs (vlans 11-12). When this is done, you should be able to reach the default gateway IP addresses from the servers, assuming everything is configured correctly on the server side.

LAYER-3 CONFIGURATION

Leaf-Spine Interconnects

All Leaf switches are directly connected to all Spine switches. In an L3LS topology, all of these interconnections are routed links. These routed interconnects can be designed as point-to-point links or as port channels. For production environments there are pros and cons to each design, and Leaf-Spine interconnects require careful consideration to ensure uplinks are not over-subscribed. Point-to-point routed links are the focus of this guide.

Point-to-Point Routed Interfaces

As you can see, each Leaf has a point-to-point network between itself and each Spine. In real-life environments, you need to strike the right balance between address conservation and leaving room for the unknown. A /31 mask will work just as well as a /30; the decision depends on your specific requirements.
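
For reference, a /31 variant of one Leaf-Spine link might look like the sketch below (illustrative addressing only; this lab sticks with the /30s from the allocation table):

# leaf01 (illustrative /31 addressing)
interface Ethernet4
 description link_to_spine01-eth1
 no switchport
 ip address 172.16.0.0/31
!
# spine01
interface Ethernet1
 description link_to_leaf01-eth4
 no switchport
 ip address 172.16.0.1/31
!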

Check the configuration for Leaf01, then you can configure the remaining switches as described in the IP Address Allocation Table.

# leaf01
...
interface Ethernet1
 description server01-ens2
 switchport access vlan 11
!
interface Ethernet2
 description server02-ens2
!
interface Ethernet3
 description server03-ens2
!
interface Ethernet4
 description link_to_spine01-eth1
 no switchport
 ip address 172.16.0.2/30
!
interface Ethernet5
 description link_to_spine02-eth1
 no switchport
 ip address 172.16.0.14/30
!
interface Ethernet6
 description link_to_leaf02-eth6 (mlag-peerlink1)
 channel-group 54 mode active
!
interface Ethernet7
 description link_to_leaf02-eth7 (mlag-peerlink2)
 channel-group 54 mode active
!
interface Loopback0
 description router-id
 ip address 10.0.1.21/32
!
....
!
ip routing
no ip routing vrf mgmt
!

Border Gateway Protocol (BGP) Design

Leaf and Spine switches are interconnected with Layer-3 point-to-point links, and every Leaf is connected to all Spines with at least one interface. Also, there’s no direct dependency or interconnection between Spine switches. All the Leaf nodes can send traffic evenly towards the Spine through the use of Equal Cost Multi Path (ECMP) which is inherent to the use of routing technologies in the design.

NOTE: We have just two Spine switches in our lab, but you can add additional nodes on demand. It’s not required to have an even number of Spine switches, just make sure to have at least one link from each Leaf to every Spine.

Even though you can use OSPF, IS-IS or BGP as the fabric routing protocol, BGP has become the routing protocol of choice for large data centers. Some of the reasons to choose BGP over its alternatives are:

  • Extensive Multi-Vendor interoperability
  • Native Traffic Engineering (TE) capabilities
  • Minimized information flooding, when compared to link-state protocols
  • Reliance on TCP rather than adjacency formation
  • Reduced complexity and simplified troubleshooting
  • Mature and proven stability at scale

As you may know, we have two options to deploy BGP as the fabric routing protocol: eBGP vs iBGP. There are pros and cons for each of them…

eBGP vs. iBGP

There are a number of reasons to choose eBGP, but one of the most compelling is simplicity, particularly when configuring load sharing (via ECMP), which is one of the main design goals of the L3LS. Using eBGP ensures all routes/paths are utilized with the least amount of complexity and the fewest configuration steps.

I’ve tested both options and my personal choice is eBGP, even in production environments. Although an iBGP implementation is technically feasible, using eBGP allows for a simpler, less complex design that is easier to troubleshoot.

NOTE: When integrating an MLAG Leaf configuration into a Layer-3 Leaf-Spine fabric, iBGP peering is recommended between the MLAG peers. The reason for the peering is specific failure conditions that the design must take into consideration; this is explained in detail further below.

BGP Autonomous System Number (ASN)

BGP supports several designs when assigning Autonomous System Numbers (ASN) in a L3LS topology. For this lab, the Common Spine ASN – Discrete Leaf ASN design will be used.

This design uses a single ASN for all Spine nodes and a discrete ASN for each Leaf node (or MLAG Leaf pair). Some benefits of this design are:

  • Each rack can now be identified by its ASN
  • Traceroute and BGP commands will show the discrete ASNs, making troubleshooting easier
  • Uses inherent BGP loop prevention
  • Unique AS numbers aid troubleshooting and don’t require adjusting the eBGP path-selection algorithm

As an alternative, you can use the Common Spine ASN – Common Leaf ASN design, where one (shared) ASN is assigned to all Spine nodes and another to all Leaf nodes. If you want to try this option, please check the configuration guide posted here.
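
For context, a rough sketch of that shared-Leaf-ASN variant (illustrative only, not used in this lab): with every Leaf sharing one ASN, routes received from the Spine carry the Leaf’s own AS number, so the Leafs must be told to accept them, for example with the allowas-in option, assuming your EOS version supports it.

# illustrative sketch - all Leafs would share ASN 65021 in this variant
router bgp 65021
 neighbor ebgp-to-spine-peers peer-group
 neighbor ebgp-to-spine-peers remote-as 65020
 neighbor ebgp-to-spine-peers allowas-in 1
!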

BGP Configuration

For the Spine configuration, the default BGP distance is altered to give preference to external BGP routes (this might not be necessary for the lab, but keep it in mind when deploying this configuration in production environments). The Leaf neighbors are also defined; a peer group can optionally be used here to simplify the configuration, as sketched after the Spine snippets below.

Note that all Spine switches share a common ASN while each Leaf pair has a different ASN; see the BGP diagram below for details.

BGP ASN Scheme

NOTE: This guide uses private AS numbers from the range 64512 through 65534.

Loopback interfaces will be used as the router-id on each switch, so we are configuring a Loopback0 interface with a /32 mask on every switch.

Follow the table below to configure the loopback interfaces.

Loopback IP Address Allocation Table
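
The Spine snippets further down only show the BGP section, so for completeness here is a minimal Loopback0 sketch for spine01, following the same pattern as the Leaf01 loopback shown earlier (10.0.1.11/32 matches the router-id used in its BGP configuration):

hostname spine01
!
interface Loopback0
 description router-id
 ip address 10.0.1.11/32
!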

To start, let’s look at the Spine switch configuration.

hostname spine01
!
router bgp 65020
 router-id 10.0.1.11
 bgp log-neighbor-changes
 distance bgp 20 200 200
 maximum-paths 2 ecmp 64
 neighbor 172.16.0.2 remote-as 65021
 neighbor 172.16.0.6 remote-as 65021
 neighbor 172.16.0.10 remote-as 65022
 network 10.0.1.11/32
!
hostname spine02
!
router bgp 65020
 router-id 10.0.1.12
 bgp log-neighbor-changes
 distance bgp 20 200 200
 maximum-paths 2 ecmp 64
 neighbor 172.16.0.14 remote-as 65021
 neighbor 172.16.0.18 remote-as 65021
 neighbor 172.16.0.22 remote-as 65022
 network 10.0.1.12/32
!
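
The Spine snippets above list each Leaf neighbor individually; if you prefer, a peer group can also be used on the Spine side. A minimal sketch for spine01 follows (the group name is illustrative, and remote-as is still set per neighbor because each Leaf pair uses its own ASN):

# illustrative sketch - optional peer group on spine01
router bgp 65020
 neighbor ebgp-to-leaf-peers peer-group
 neighbor ebgp-to-leaf-peers maximum-routes 12000
 neighbor 172.16.0.2 peer-group ebgp-to-leaf-peers
 neighbor 172.16.0.2 remote-as 65021
 neighbor 172.16.0.6 peer-group ebgp-to-leaf-peers
 neighbor 172.16.0.6 remote-as 65021
 neighbor 172.16.0.10 peer-group ebgp-to-leaf-peers
 neighbor 172.16.0.10 remote-as 65022
!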

This example uses static BGP peer groups: once a static peer group is created, its name can be used to apply a common configuration to all members of the group.

The Leaf switch configuration is very similar to the Spine’s; a single peer group is used to peer with both Spines with a standard configuration.

hostname leaf01
!
router bgp 65021
 router-id 10.0.1.21
 bgp log-neighbor-changes
 distance bgp 20 200 200
 maximum-paths 2 ecmp 2
 neighbor ebgp-to-spine-peers peer-group
 neighbor ebgp-to-spine-peers remote-as 65020
 neighbor ebgp-to-spine-peers maximum-routes 12000
 neighbor 172.16.0.1 peer-group ebgp-to-spine-peers
 neighbor 172.16.0.13 peer-group ebgp-to-spine-peers
 neighbor 172.16.254.2 remote-as 65021
 neighbor 172.16.254.2 next-hop-self
 neighbor 172.16.254.2 maximum-routes 12000
 network 10.0.1.21/32
 redistribute connected
!
hostname leaf02
!
router bgp 65021
 router-id 10.0.1.22
 bgp log-neighbor-changes
 distance bgp 20 200 200
 maximum-paths 2 ecmp 2
 neighbor ebgp-to-spine-peers peer-group
 neighbor ebgp-to-spine-peers remote-as 65020
 neighbor ebgp-to-spine-peers maximum-routes 12000 
 neighbor 172.16.0.5 peer-group ebgp-to-spine-peers
 neighbor 172.16.0.17 peer-group ebgp-to-spine-peers
 neighbor 172.16.254.1 remote-as 65021
 neighbor 172.16.254.1 next-hop-self
 neighbor 172.16.254.1 maximum-routes 12000
 network 10.0.1.22/32
 redistribute connected
!

NOTE: The “redistribute connected” command redistributes all directly connected interfaces into BGP for connectivity-testing purposes. In production, link addresses are not typically advertised. This is because:

  • Link addresses take up valuable FIB resources. In a large CLOS (Leaf-Spine) environment, the number of such addresses can be quite large
  • Link addresses expose an additional attack vector that intruders can use to break in or to launch DDoS attacks

When we have an MLAG domain as part of a Layer-3 fabric, iBGP peering is recommended between the MLAG peers. The reason for the peering is specific failure conditions that the design must take into consideration, such as the loss of the Leaf-Spine uplinks; routes learned via iBGP come into effect if all uplinks fail.

Let’s say all Leaf01 uplinks fail: with an iBGP peering between Leaf01 and Leaf02, any server traffic forwarded to Leaf01 would follow the remaining route pointing to Leaf02 and then be ECMP-routed toward the Spine.

NOTE: In normal operation, paths learned via eBGP (Leaf-Spine uplinks) will always be preferred over paths learned via iBGP (MLAG peers).

The neighbor next-hop-self command configures the switch to list its address as the next hop in routes that it advertises to the specified BGP-speaking neighbor or neighbors in the specified peer group. This is used in networks where BGP neighbors do not directly access all other neighbors on the same subnet.

Route Advertising

In production environments, you need to ensure that only the proper routes are advertised from the Leaf switches, so a route map (or prefix list) should be applied outbound toward the Spine BGP peers. The route map references a prefix list containing the routes that are intended to be advertised to the Spine.

Although not mandatory, using a route map or a prefix list provides a level of protection in the network. Without one, any network created at the Leaf would automatically be advertised and added to the fabric routing tables.
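
A minimal sketch of the route-map approach on Leaf01 (the prefix-list and route-map names are illustrative; this lab applies a prefix list directly to the peer group instead, as shown below):

# illustrative sketch - advertise only the Leaf01 loopback toward the Spine
ip prefix-list advertise-to-spine seq 10 permit 10.0.1.21/32
!
route-map rm-advertise-to-spine permit 10
 match ip address prefix-list advertise-to-spine
!
router bgp 65021
 neighbor ebgp-to-spine-peers route-map rm-advertise-to-spine out
!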

BGP Verification

You can verify the BGP operation by running the following commands:

  • show ip bgp summary
  • show ip route

The state for all neighbors should be ESTABLISHED.

Since the server VLANs will be encapsulated in VXLAN between VTEPs, we don’t need to advertise them into BGP so I’m going to filter networks 192.168.11.0/24, 192.168.12.0/24, 192.168.13.0/24 out of the Leaf.

# leaf01 and leaf02
ip prefix-list filter-out-to-spine seq 10 deny 192.168.11.0/24
ip prefix-list filter-out-to-spine seq 20 deny 192.168.12.0/24
ip prefix-list filter-out-to-spine seq 30 deny 192.168.13.0/24
ip prefix-list filter-out-to-spine seq 40 permit 0.0.0.0/0 le 32
!
router bgp 65021
   neighbor ebgp-to-spine-peers prefix-list filter-out-to-spine out

# leaf03
ip prefix-list filter-out-to-spine seq 10 deny 192.168.11.0/24
ip prefix-list filter-out-to-spine seq 20 deny 192.168.12.0/24
ip prefix-list filter-out-to-spine seq 30 deny 192.168.13.0/24
ip prefix-list filter-out-to-spine seq 40 permit 0.0.0.0/0 le 32
!
router bgp 65022
   neighbor ebgp-to-spine-peers prefix-list filter-out-to-spine out

Once the prefix list is applied on the Leaf switches, the output of “show ip route” on the Spine should display the loopback interfaces and point-to-point links, but no server networks should appear.

spine01#show ip route
VRF name: default
Codes: C - connected, S - static, K - kernel,
 O - OSPF, IA - OSPF inter area, E1 - OSPF external type 1,
 E2 - OSPF external type 2, N1 - OSPF NSSA external type 1,
 N2 - OSPF NSSA external type2, B I - iBGP, B E - eBGP,
 R - RIP, I L1 - ISIS level 1, I L2 - ISIS level 2,
 O3 - OSPFv3, A B - BGP Aggregate, A O - OSPF Summary,
 NG - Nexthop Group Static Route, V - VXLAN Control Service

Gateway of last resort is not set

 B E  10.0.1.1/32 [20/0] via 172.16.0.25, Ethernet4
                         via 172.16.0.33, Ethernet5
 B E  10.0.1.2/32 [20/0] via 172.16.0.25, Ethernet4
                         via 172.16.0.33, Ethernet5
 C    10.0.1.11/32 is directly connected, Loopback0
 B E  10.0.1.21/32 [20/0] via 172.16.0.2, Ethernet1
                          via 172.16.0.6, Ethernet2
 B E  10.0.1.22/32 [20/0] via 172.16.0.2, Ethernet1
                          via 172.16.0.6, Ethernet2
 B E  10.0.1.23/32 [20/0] via 172.16.0.10, Ethernet3
 B E  10.0.2.1/32 [20/0] via 172.16.0.2, Ethernet1
                         via 172.16.0.6, Ethernet2
 B E  10.0.2.2/32 [20/0] via 172.16.0.10, Ethernet3
 C    172.16.0.0/30 is directly connected, Ethernet1
 C    172.16.0.4/30 is directly connected, Ethernet2
 C    172.16.0.8/30 is directly connected, Ethernet3
 B E  172.16.0.12/30 [20/0] via 172.16.0.2, Ethernet1
                            via 172.16.0.6, Ethernet2
 B E  172.16.0.16/30 [20/0] via 172.16.0.2, Ethernet1
                            via 172.16.0.6, Ethernet2
 B E  172.16.0.20/30 [20/0] via 172.16.0.10, Ethernet3
 C    172.16.0.24/30 is directly connected, Ethernet4
 C    172.16.0.32/30 is directly connected, Ethernet5
 B E  172.16.254.0/30 [20/0] via 172.16.0.2, Ethernet1
                             via 172.16.0.6, Ethernet2

There you go! The Underlay (L3LS fabric) is up and running. In the next post, we will configure and test VXLAN.

You can always check my github repository to download the configuration files.

