ESXi as a BGP Speaker

Have you ever wanted to remove L2 between datacenter switches? I was talking with one of my friends, Shakes, and he had been told by VMware that L2 between TORs was a must for ESXi to work if a host had interfaces across the TORs. To me that defeats the point of spine-leaf, which promised a fully routed infrastructure. ESXi has always been awkward here: it can have only one default gateway per IP stack, and it has no dynamic routing. Sure, you can configure a default gateway override on a vmkernel interface, but look at the host routing table and you will find only one default gateway listed.

Network       Netmask          Gateway       Interface  Source
------------  ---------------  ------------  ---------  ------
default       0.0.0.0          192.168.20.1  vmk0       MANUAL
10.77.77.0    255.255.255.0    0.0.0.0       vmk2       MANUAL
192.168.20.0  255.255.255.0    0.0.0.0       vmk0       MANUAL
192.168.22.0  255.255.255.254  0.0.0.0       vmk1       MANUAL

In this example you can see the single default route, even though I configured two in the host client. This isn't a problem until you want redundancy across multiple switches. To solve that we have forever been using a VLAN, or more recently a VXLAN, to stretch L2 across switches. The same goes for all the host services.
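To make the point about the single default route concrete, here is a minimal Python sketch that parses the table above and lists the default gateways. The function names `parse_routes` and `prefix_length` are my own, and the parser assumes exactly the five-column layout shown here; it is not a general parser for ESXi output.

```python
import ipaddress

# Routing table text copied from the listing above.
ROUTE_TABLE = """\
Network       Netmask          Gateway       Interface  Source
------------  ---------------  ------------  ---------  ------
default       0.0.0.0          192.168.20.1  vmk0       MANUAL
10.77.77.0    255.255.255.0    0.0.0.0       vmk2       MANUAL
192.168.20.0  255.255.255.0    0.0.0.0       vmk0       MANUAL
192.168.22.0  255.255.255.254  0.0.0.0       vmk1       MANUAL
"""


def parse_routes(text):
    """Return one dict per route row, skipping the header and the dashed rule."""
    routes = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5 or fields[0] == "Network" or fields[0].startswith("-"):
            continue
        network, netmask, gateway, interface, source = fields
        routes.append({
            "network": network,
            "netmask": netmask,
            "gateway": gateway,
            "interface": interface,
            "source": source,
        })
    return routes


def prefix_length(netmask):
    """Convert a dotted netmask such as 255.255.255.254 to a prefix length."""
    return ipaddress.ip_network("0.0.0.0/" + netmask).prefixlen


routes = parse_routes(ROUTE_TABLE)
default_gateways = [r["gateway"] for r in routes if r["network"] == "default"]
vmk1 = [r for r in routes if r["interface"] == "vmk1"][0]

print("default gateways:", default_gateways)
print("vmk1 prefix length:", prefix_length(vmk1["netmask"]))
```

Only one gateway comes back, no matter how many were entered in the host client.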

Enter ExaBGP. ExaBGP is a BGP speaker implemented in Python. Luckily the more recent ESXi builds have Python installed. There are just a few things to do to get it running. First install pip on the host:

#unload the firewall so wget works
esxcli network firewall unload
#get the old pip to work with python on the host
wget https://bootstrap.pypa.io/pip/3.5/get-pip.py
#install pip
python get-pip.py

Now install ExaBGP version 4.1.2; anything newer won't install:

python -m pip install "exabgp==4.1.2"

Now the host needs some interfaces and a few configurations. A drawing helps:

vSwitch0 and vSwitch1 each have a single uplink on an access port to a VLAN. vSwitch2 has no uplinks. I configured BGP on my switch like this:

router bgp 65534
 no synchronization
 bgp log-neighbor-changes
 network 10.77.77.0
 network 192.168.20.0
 network 192.168.10.0
 redistribute connected
 neighbor 10.77.77.2 remote-as 65532
 neighbor 192.168.20.5 remote-as 65532
 maximum-paths 8
 no auto-summary

ExaBGP took some trial and error, but I found that a config.ini like this brings up the BGP neighbors with the switch:

neighbor 192.168.20.1 {
        router-id 192.168.20.5;
        local-address 192.168.20.5;
        local-as 65532;
        peer-as 65534;
}
neighbor 10.77.77.1 {
        router-id 192.168.20.5;
        local-address 10.77.77.2;
        local-as 65532;
        peer-as 65534;
}

ExaBGP also needs some named pipes for communication with the process. If you haven't configured these, the output will tell you what to do. Basically:

mkdir /var/run/exabgp
mkfifo /var/run/exabgp/exabgp.in
mkfifo /var/run/exabgp/exabgp.out
chmod 600 /var/run/exabgp/*
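If you would rather create the pipes from Python, for example from a helper script that also starts ExaBGP, the same steps can be sketched roughly as below. The function name `make_cli_pipes` is my own; only the directory and pipe names come from the commands above.

```python
import os


def make_cli_pipes(directory="/var/run/exabgp"):
    """Create the exabgp.in and exabgp.out named pipes with mode 600.

    Mirrors the mkdir / mkfifo / chmod commands above. Pipes that already
    exist are left in place and only have their permissions reset.
    """
    os.makedirs(directory, exist_ok=True)
    paths = []
    for name in ("exabgp.in", "exabgp.out"):
        path = os.path.join(directory, name)
        if not os.path.exists(path):
            os.mkfifo(path)
        os.chmod(path, 0o600)
        paths.append(path)
    return paths
```

Calling `make_cli_pipes()` with no arguments targets /var/run/exabgp; passing another directory is handy for trying it out somewhere harmless first.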

Now load ExaBGP

env exabgp.profile.enable=true \
 exabgp.profile.file=~/profile.log \
 exabgp.log.packets=true \
 exabgp.daemon.user=root \
 exabgp.daemon.daemonize=true \
 exabgp.daemon.pid=/var/run/exabgp.pid \
 exabgp ./config.ini

You should get output similar to the following

 | welcome       | Thank you for using ExaBGP
 | version       | 4.1.2-d006a34a
 | interpreter   | 3.5.7 (default, Jun 14 2019, 02:50:41)  [GCC 4.6.3]
 | os            | VMkernel localhost 6.7.0 #1 SMP Release build-14320388 Aug  5 2019 02:37:06 x86_64
 | installation  |
 | advice        | environment file missing
 | advice        | generate it using "exabgp --fi > /etc/exabgp/exabgp.env"
 | cli control   | named pipes for the cli are:
 | cli control   | to send commands  /var/run/exabgp/exabgp.in
 | cli control   | to read responses /var/run/exabgp/exabgp.out
 | daemon        | ExaBGP can not fork when logs are going to stdout
 | configuration | performing reload of exabgp 4.1.2-d006a34a
 | reactor       | loaded new configuration successfully
 | daemon        | Created PIDfile /var/run/exabgp.pid with value 2100234
 | outgoing-1    | --------------------------------------------------------------------
 | outgoing-1    | the connection can not carry the following family/families
 | outgoing-1    |  - peer is not configured for ipv4/multicast
 | outgoing-1    |  - peer is not configured for bgp-ls/bgp-ls
 | outgoing-1    |  - peer is not configured for ipv6/flow-vpn
 | outgoing-1    |  - peer is not configured for bgp-ls/bgp-ls-vpn
 | outgoing-1    |  - peer is not configured for ipv6/multicast
 | outgoing-1    |  - peer is not configured for l2vpn/evpn
 | outgoing-1    |  - peer is not configured for ipv4/rtc
 | outgoing-1    |  - peer is not configured for ipv4/nlri-mpls
 | outgoing-1    |  - peer is not configured for ipv6/unicast
 | outgoing-1    |  - peer is not configured for ipv4/flow
 | outgoing-1    |  - peer is not configured for ipv6/mpls-vpn
 | outgoing-1    |  - peer is not configured for ipv4/flow-vpn
 | outgoing-1    |  - peer is not configured for l2vpn/vpls
 | outgoing-1    |  - peer is not configured for ipv6/nlri-mpls
 | outgoing-1    |  - peer is not configured for ipv4/mpls-vpn
 | outgoing-1    |  - peer is not configured for ipv6/flow
 | outgoing-1    | therefore no routes of this kind can be announced on the connection
 | outgoing-1    | --------------------------------------------------------------------
 | reactor       | connected to peer-1 with outgoing-1 192.168.20.5-192.168.20.1

What you are really interested in are the "connected to peer" messages:

17:02:35 | 2100933 | reactor       | connected to peer-2 with outgoing-1 10.77.77.2-10.77.77.1
17:02:35 | 2100933 | reactor       | connected to peer-1 with outgoing-2 192.168.20.5-192.168.20.1

Here you can see the connection to both interfaces on the switch. Now to advertise a route to the switch. ExaBGP is programmable and doesn't announce anything by default. Commands are given by writing to the exabgp.in named pipe. We want to announce 192.168.22.1/32. Open a new terminal and run this:

echo "announce route 192.168.22.1/32 next-hop self" > /var/run/exabgp/exabgp.in
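The same announcement can be sent from Python, which is convenient once you start scripting around ExaBGP. This is a minimal sketch: `build_announce` and `send_command` are names I made up, and the only command text used is the one from the echo line above. The prefix is checked with the standard `ipaddress` module first, so a mistyped mask such as 192.168.22.1/23 is rejected before it reaches the pipe.

```python
import ipaddress

# Path of the command pipe created earlier.
PIPE_IN = "/var/run/exabgp/exabgp.in"


def build_announce(prefix, next_hop="self"):
    """Return the command string for announcing a prefix.

    ip_network() raises ValueError if the prefix has host bits set,
    e.g. 192.168.22.1/23, which catches a wrong mask early.
    """
    network = ipaddress.ip_network(prefix)
    return "announce route {} next-hop {}".format(network, next_hop)


def send_command(command, pipe_path=PIPE_IN):
    """Write one command line to the pipe.

    Opening a named pipe for writing blocks until a reader (ExaBGP)
    has it open, so run this only while ExaBGP is up.
    """
    with open(pipe_path, "w") as pipe:
        pipe.write(command + "\n")
```

On the host you would call `send_command(build_announce("192.168.22.1/32"))`.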

In the ExaBGP terminal window you should see:

api | route added to neighbor 10.77.77.1 local-ip 10.77.77.2 local-as 65532 peer-as 65534 router-id 192.168.20.5 family-allowed in-open, neighbor 192.168.20.1 local-ip 192.168.20.5 local-as 65532 peer-as 65534 router-id 192.168.20.5 family-allowed in-open : 192.168.22.1/32 next-hop self

Now the switch has a route to 192.168.22.1

#show ip route 192.168.22.1
Routing entry for 192.168.22.1/32
  Known via "bgp 65534", distance 20, metric 0
  Tag 65532, type external
  Last update from 10.77.77.2 03:36:59 ago
  Routing Descriptor Blocks:
    192.168.20.5, from 192.168.20.5, 03:36:59 ago
      Route metric is 0, traffic share count is 1
      AS Hops 1
      Route tag 65532
  * 10.77.77.2, from 10.77.77.2, 03:36:59 ago
      Route metric is 0, traffic share count is 1
      AS Hops 1
      Route tag 65532

And I can ping the interface:

Pinging 192.168.22.1 with 32 bytes of data:
Reply from 192.168.22.1: bytes=32 time<1ms TTL=62
Reply from 192.168.22.1: bytes=32 time<1ms TTL=62
Reply from 192.168.22.1: bytes=32 time=1ms TTL=62
Reply from 192.168.22.1: bytes=32 time<1ms TTL=62

Observant readers may have noticed that vmk1 is configured with a /31 on the host while I advertised a /32. This isn't a big deal for the most part, but be aware of the mismatch. The vSwitch behind this vmkernel interface doesn't have an uplink, so how is this possible? ESXi will happily spoof the return packet out of a different interface. As long as your switch accepts the spoofed packet (most will), everything works. The return traffic always takes the path of the host's default gateway. You can switch the gateway between the two switch interfaces and traffic never stops. For instance, to switch it to 10.77.77.1:

esxcfg-route -a 0.0.0.0/0 10.77.77.1
Adding static route 0.0.0.0/0 to VMkernel

All the services I tried worked. If you want to get fancy, you can write a Python script that watches stdout from ExaBGP for route withdrawals or neighbor changes and automates the default route configuration on ESXi. Since this was just an experiment, and there is no way VMware would support it, I didn't build that part.
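For anyone who wants to try, here is a rough, untested-on-a-host starting point for the decision logic of such a script. Everything in it is an assumption layered on what is shown in this post: the regular expression only matches the "connected to peer" lines from the log excerpt above, the function names are invented, and the command it builds is the same esxcfg-route call used earlier. Detecting a lost neighbor is deliberately left out, because the post does not show what those log lines look like.

```python
import re

# Matches lines like the ones in the log excerpt above, e.g.
#   reactor | connected to peer-2 with outgoing-1 10.77.77.2-10.77.77.1
# Group 1 is the local address, group 2 is the switch (peer) address.
CONNECTED = re.compile(
    r"connected to peer-\d+ with \S+ "
    r"(\d+\.\d+\.\d+\.\d+)-(\d+\.\d+\.\d+\.\d+)"
)


def connected_peer(line):
    """Return the peer address if the line reports a new connection, else None."""
    match = CONNECTED.search(line)
    return match.group(2) if match else None


def choose_gateway(up_peers, candidates, current):
    """Pick the default gateway to use.

    Keep the current gateway while its peer is up; otherwise fall back to
    the first candidate whose peer is up. Returns None if no peer is up.
    """
    if current in up_peers:
        return current
    for gateway in candidates:
        if gateway in up_peers:
            return gateway
    return None


def route_command(gateway):
    """Build the esxcfg-route call shown earlier, as an argument list."""
    return ["esxcfg-route", "-a", "0.0.0.0/0", gateway]
```

A real script would feed each line of ExaBGP output through `connected_peer`, maintain the set of peers that are up, and pass the result of `route_command` to `subprocess.call` whenever `choose_gateway` returns something different from the current gateway.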
