I’m moving my kubernetes setups to dual-stack IPv4+IPv6, and, naturally, I want cilium to handle the BGP part of that so I can just use the LoadBalancer services’ assigned IPv6 addresses directly. For that, of course, the VM host node needs to know where to route them, and this is where BGP comes into play.

Cilium natively supports BGP using gobgp. Host-side, I tried a few options, but I settled on bird2. Gobgp would be nice, I suppose, but it cannot manage the FIB natively and instead relies on Zebra for that; after trying to make the two talk to each other for a couple of hours, I just gave up.

What’s that FIB thing? Routing is a pretty complex topic, but, in a nutshell, BGP is how two applications exchange routing information. Now, the application on your router might receive a route to e.g. 192.168.10.0/24, but the actual OS won’t have any idea about it. OSes, Linux in particular, use a FIB, a forwarding information base: a table (or several tables) that tells the OS how to route traffic. When you run ip r in your shell, you’re looking at the FIB.
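If you want to poke at the FIB yourself, the commands are nothing exotic (shown here purely for illustration):

# The default IPv4 table, the same thing plain `ip r` shows
ip route show table main

# And the IPv6 side of it
ip -6 route show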
Bird2 had worked perfectly out of the box for me before, but I’d never had IPv6 clusters. I started with a known-good config that worked in my homelab for IPv4:

# Device is not a real routing protocol. It's how bird2 learns about the
# network interfaces available
protocol device {
}

# This is the part that talks to the Kernel FIB and writes the IPv4 routes in there.
# We use a dedicated table to reduce the clutter.
protocol kernel {
  kernel table 1336;
  persist;
  ipv4 {
    export all;
  };
}

# Finally, this is the part that talks to cilium agents. Or, rather, the part
# where cilium agents talk to bird2, because it's passive (i.e. doesn't try to
# establish the connection on its own).
protocol bgp {
  local 10.224.1.1 as 65100;
  neighbor range 10.224.1.0/24 as 65100 internal;

  bfd;
  direct;
  passive;

  ipv4 {
    import all;
  };
}

This setup looks rather straightforward (also, horrendously insecure, don’t use it in production). Seemingly, all I needed for IPv6 was to add a protocol kernel section to export IPv6 routes, and then an ipv6 section in bgp to import them. Of course, if it were that simple, there would be no blog post.
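For reference, the naive addition I had in mind looked roughly like this (a sketch only, and, spoiler, not sufficient on its own):

# A second kernel protocol, exporting IPv6 routes into the same table
protocol kernel {
  kernel table 1336;
  persist;
  ipv6 {
    export all;
  };
}

# ...and an IPv6 channel in the bgp protocol, next to the IPv4 one
protocol bgp {
  local 10.224.1.1 as 65100;
  neighbor range 10.224.1.0/24 as 65100 internal;

  bfd;
  direct;
  passive;

  ipv4 {
    import all;
  };

  ipv6 {
    import all;
  };
}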

When cilium advertises IPv6 routes over an IPv4 connection, it, strangely enough, uses an IPv4 address as the next hop. An IPv4 next hop for an IPv6 route is illegal, though, so bird drops those:

bird[7755]: dynbgp2: Invalid NEXT_HOP attribute - mismatched address family (10.224.129.15 for ipv6)
bird[7755]: dynbgp2: Invalid route fd00:ff00:112::/64 withdrawn

In practice, that means cilium can only push IPv4 routes. I poked around, and in the cilium slack I was advised to try negotiating the BGP session over IPv6:

Cilium bgp control-plane offloads nexthop selection to gobgp, it sets ‘0.0.0.0’ for v4 and ‘::’ for v6, as you see in the routes output. One workaround would be to setup BGP peering over IPv6 addresses and advertise v4, and v6 address families over it. This is one example of it https://github.com/cilium/cilium/blob/main/contrib/containerlab/bgp-cplane-dev-dual/bgpp.yaml

Supposedly, that’d be as easy as changing the local 10.224.1.1 to whatever the local IPv6 address is, and updating the neighbor range. I didn’t have any IPv6 address on the bridge other than the link-local one, and I couldn’t use that one because bird2 doesn’t support the - character in interface names, so I couldn’t specify fe80::1%vmbr-dev as a local address. I didn’t want to use a publicly routable address on the bridge either: that would mean more fiddling with the firewall. fd00:: to the rescue, then! It’s a unique local address prefix (basically, the 10.0.0.0/8 of IPv6).
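Assigning it to the bridge is the easy part; something along these lines (the /64 and the interface name are assumptions, and in my case the address actually comes from the Nix config rather than a manual command):

# Give the bridge a stable unique-local address to peer on
ip addr add fd00::1/64 dev vmbr-dev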

Armed with that, I added local fd00::1 as 65100; to my config, allowed the neighbor range 2a01::/16 for a test, and gave it a spin.
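The bgp block ended up roughly like this (reconstructed after the fact, so treat it as a sketch):

protocol bgp {
  # The bridge's ULA instead of its IPv4 address
  local fd00::1 as 65100;
  # Far too broad, but fine for a quick test
  neighbor range 2a01::/16 as 65100 internal;

  dynamic name "GW6_";
  direct;
  passive;

  ipv4 {
    import all;
  };

  ipv6 {
    import all;
  };
}

The spin, however, didn’t get far: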

bird[7755]: bgp1: Incoming connection from 2a01:4f8:173:fe11:6f81:e7cd:2787:fb71 (port 35581) accepted
bird[7755]: GW6_1: Initializing
bird[7755]: GW6_1: Starting
bird[7755]: GW6_1: State changed to start
bird[7755]: GW6_1: Waiting for 2a01:4f8:173:fe11:6f81:e7cd:2787:fb71 to become my neighbor

Looking at tcpdump, it seemed that cilium would start the TCP session, send an OPEN, and bird2 would just give up on it, without replying.

I tried a few configurations and gave up, deciding I should see if link-local addresses would work better. For that, I had to rename all my interfaces, but, thanks to Nix, that was a single change in one function, and then a reboot (networkd doesn’t clean up after itself, and I wanted to make sure the state was fresh).

Unfortunately, listening on fe80::1%vmbrdev and telling cilium to send traffic there ended up being an even worse experience: there wasn’t even an attempt to set up the TCP session. Indeed, the cilium peering policy looked like this:

apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeeringPolicy
metadata:
  name: host-peering
spec:
  nodeSelector:
    matchLabels:
      kubernetes.io/os: linux
  virtualRouters:
    - exportPodCIDR: true
      localASN: 65100
      neighbors:
        - connectRetryTimeSeconds: 5
          eBGPMultihopTTL: 1
          holdTimeSeconds: 90
          keepAliveTimeSeconds: 30
          peerASN: 65100
          peerAddress: 'fe80::1/128'
          peerPort: 179

Can you spot what’s missing? fe80::1/128 can be on pretty much any interface, it’s link-local! You can’t even ping fe80::1 and expect it to work; you have to be explicit, e.g. with ping fe80::1%eth0. Cilium, however, did not specify the neighbor interface when it created a new peer.

Link-local addressing wouldn’t work. Peering over IPv4 wouldn’t work. Peering over IPv6 wouldn’t work. The latter seemed the most promising, though, because cilium clearly tried to do something; it was bird2 that wasn’t cooperating. It’s a pretty old piece of software, written in somewhat less exciting C, so I looked for external help before installing gdb on my server. Bird support goes through mailing lists (which are still better than discord), and I hit the jackpot: Maria Matejka from the Bird team replied to me almost immediately, clarified the issue, and explained the error message:

The error message actually means “you requested direct connection but i can’t see the right interface to use because there is no interface with this range assigned”.

I had direct in my config because that’s how things were connected in the IPv4 world. In the IPv6 world, I effectively tried to talk from 2a01:4f8:173:fe11:6f81:e7cd:2787:fb71 to fd00::1. Even though the kernel had a route for 2a01:4f8:173:fe11::/64 (so it knew those packets belonged on vmbrdev), bird had no idea about that. The fixes started to come together.

One option was to use multihop 1 instead of direct. Multihop is a way to talk to routers that are not directly connected. Using this option offloads the routing decisions from bird2 back to the kernel’s FIB. It’s not the most elegant solution, though, because the other router is actually directly connected.
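In bird2 terms, that would be swapping direct for multihop in the bgp protocol, roughly like this (untested sketch):

protocol bgp {
  local fd00::1 as 65100;
  neighbor range 2a01::/16 as 65100 internal;

  # Let the kernel FIB figure out how to reach the peer,
  # instead of requiring a directly connected interface
  multihop 1;
  passive;

  ipv4 {
    import all;
  };

  ipv6 {
    import all;
  };
}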

A better option is to either read that route from the kernel FIB, or just hardcode it inside bird2 itself:

protocol static {
  ipv6;
  route 2a01:4f8:173:fe11::/64 via "vmbrdev";
}

With this configuration, bird2 knows that 2a01:4f8:173:fe11:6f81:e7cd:2787:fb71 is a directly connected peer. The session establishes. But wait: the IPv4 routes don’t propagate now!

bird[88898]: GW6_1: Invalid NEXT_HOP attribute - mismatched address family (2a01:4f8:173:fe11:1b2e:5bc5:ff63:2ee0 for ipv4)
bird[88898]: GW6_1: Invalid route 10.194.112.0/24 withdrawn

Gobgp, in its infinite wisdom, now switched all routes to IPv6 next hops. Thankfully, that is easily fixed with an extended next hop configuration (the extended next hop yes; bit in the ipv4 channel of the final config below), even though I don’t quite understand how it works. With it enabled, the learned routes look like this:

$ ip r s table 1336
10.194.112.0/24 via inet6 2a01:4f8:173:fe11:6f81:e7cd:2787:fb71 dev vmbrdev proto bird metric 32

The way that route record reads is: for packets destined to 10.194.112.0/24, send them out the vmbrdev interface towards 2a01:4f8:173:fe11:6f81:e7cd:2787:fb71. That via bit effectively boils down to a MAC address. In a typical IPv4 route, the next hop’s MAC would be discovered with ARP requests, but nothing really prevents the kernel from using IPv6’s NDP to do the same. The address itself doesn’t matter much, as we don’t actually send the packets to it, only to the same physical target.
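You can watch that resolution happen in the kernel’s neighbor table once traffic starts flowing:

# The BGP peer's address should show up here together with its MAC
ip -6 neigh show dev vmbrdev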
In the end, my bird2 config looks like this:

# Extreme verbosity for debugging
log syslog all;
debug protocols all;

protocol device {
}

protocol kernel {
  kernel table 1336;
  persist;
  ipv4 {
    export all;
  };
}

# Export IPv6 routes, but not those that we learned statically
protocol kernel {
  kernel table 1336;
  persist;
  ipv6 {
    export filter {
      if source = RTS_STATIC then reject;
      accept;
    };
  };
}

# Static route definitions for the bridges
protocol static {
  ipv6;
  route 2a01:4f8:173:fe11::/64 via "vmbrdev";
}

protocol bgp {
  # Listen on the bridge's ULA address
  local fd00::1 port 179 as 65100;
  neighbor range 2a01::/16 port 179 as 65100 internal;

  dynamic name "GW6_";
  direct;
  passive;

  ipv4 {
    extended next hop yes;
    import all;
  };

  ipv6 {
    import all;
  };
}

This finally gets the pod CIDRs from the kubernetes cluster into my VM host’s FIB.
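A quick way to check that everything is in place (output omitted):

# The BGP sessions bird2 has established
birdc show protocols

# What actually landed in the dedicated kernel table
ip r s table 1336
ip -6 r s table 1336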

Now, to improve this further, it’d be nice to get rid of the ULA (fd00::1), but that’s currently impossible with cilium. Another optimization is to ditch radvd and do router advertisements with bird2 too, given that it has to know the bridge prefixes now anyway.
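Should I get around to that, bird2’s radv protocol looks like it would boil down to something like this (an untested sketch; the prefix and interface are the ones from this post):

protocol radv {
  interface "vmbrdev" {
    # Announce the bridge prefix, replacing radvd
    prefix 2a01:4f8:173:fe11::/64 {
    };
  };
}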

My heartfelt gratitude goes to Maria from the Bird project, and to Harsimran Pabla and Anders Ingemann from the cilium slack.