The Top 5 Most Common Mistakes in ACI

Written by Austin Peacock
Published on July 23, 2024, 7:15 p.m.

Introduction

One of the advantages of working in a support role is that you get to see hundreds or even thousands of different implementations of the same solution. When I worked in the ACI Business Unit I saw recurring patterns in how the solution was implemented in less than optimal ways. Some mistakes are a simple fix, but some literally cannot be rectified without redeploying the entire fabric.

I wrote down all the mistakes that I could think of and narrowed them down to my top 5, weighting frequency seen and impact felt equally.



Mistake #5 - Naming Objects With Their Type

The first mistake in this list is minor, but it can cause a lot of visual clutter in the long run. In one line this problem can be described as "when you create an EPG, you do not need to include 'EPG' anywhere in the name". This is such a common mistake, and I think it just comes from the fact that most users don't know what the distinguished names are going to look like for each object when they are creating them.

For example, if you have an EPG that you want to call Dell_BS_1, there is no need to call it EPG_Dell_BS_1 or Dell_BS_1_EPG, because when you look at the dn for this object, it will already include "epg-":

 uni/tn-London/ap-EEG_Hosts/epg-Dell_BS_1 
It sounds like a small thing, but I have seen so many fabrics that do this for every type of object: VLAN Pools, Domains, AEPs, APs, BDs, VRFs, etc., and it makes reading the object names so much more difficult. It could take something that looks like this:
 uni/tn-London/ap-EEG_Hosts/epg-Dell_BS_1 
And turn it into this, which is much more difficult to read:
 uni/tn-Tenant-London/ap-AP-EEG_Hosts/epg-EPG-Dell_BS_1 
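To see why the prefixes are redundant, here is a minimal sketch of how an EPG's distinguished name is composed: the class prefixes (tn-, ap-, epg-) are added by ACI itself, so putting the type in the object name states it twice. The tenant, AP, and EPG names below are just the illustrative values from the examples above.

```python
def epg_dn(tenant: str, ap: str, epg: str) -> str:
    """Build the distinguished name ACI generates for an EPG.

    ACI prepends the class prefixes (tn-, ap-, epg-) automatically,
    so embedding the object type in the name duplicates it.
    """
    return f"uni/tn-{tenant}/ap-{ap}/epg-{epg}"

# Clean names: the object type is already encoded by the DN prefixes.
print(epg_dn("London", "EEG_Hosts", "Dell_BS_1"))
# → uni/tn-London/ap-EEG_Hosts/epg-Dell_BS_1

# Redundant names: every segment now states its type twice.
print(epg_dn("Tenant-London", "AP-EEG_Hosts", "EPG-Dell_BS_1"))
# → uni/tn-Tenant-London/ap-AP-EEG_Hosts/epg-EPG-Dell_BS_1
```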


#4 - Fabric Setup Policy Foresight


This is really two separate issues, but they occur at the same time, which is when setting up a new fabric.


1) Not using Fabric ID 1 or the same Fabric ID at different sites

The first issue can cause some serious pain down the road if you do not plan ahead. Many users pick a fabric ID without realizing that in order for two fabrics to be a part of the same multisite domain, they must have the same fabric ID.

For instance, if you create a fabric in Miami and use Fabric ID 1, then set up another fabric in New York with Fabric ID 2, you cannot join these two sites together, because they have different Fabric IDs. Unfortunately, the only way to change this is to completely wipe one of the fabrics and rebuild it from scratch, this time making sure the Fabric IDs match.

2) Using the incorrect ID ranges for your Spines and Leafs

For issue number two, you want to have a plan for how your Spine and Leaf numbering scheme will scale. Too many times I have seen customers use Leaf IDs in the 100s range, and start their Spines in the 200s.

It works OK at first when you have Leaf-101 and Leaf-102 alongside Spine-201 and Spine-202. But what happens when your Leaf IDs grow past 200? The Leaf range runs into the Spines and it can get very confusing. For this reason I would recommend using the 1000s range for the Spines and the 100s for the Leafs. This also allows you to use different ranges like 100-199 for Pod A and 200-299 for Pod B, etc., which can be very helpful.
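One way to make such a plan concrete is to derive node IDs from pod number and role, so the ranges can never collide. This is a sketch of one possible convention (100-wide leaf blocks per pod, spines up in the 1000s), not an ACI requirement; the function name and range limits are my own.

```python
def node_id(pod: int, role: str, index: int) -> int:
    """Derive a node ID from pod and role.

    Illustrative convention only: leafs get a 100-wide block per pod
    (pod 1 -> 101-199, pod 2 -> 201-299, ...), while spines live in
    the 1000s, so leaf and spine ranges can never overlap.
    """
    if not 1 <= index <= 99:
        raise ValueError("index must be 1-99 in this scheme")
    if role == "leaf":
        return pod * 100 + index
    if role == "spine":
        return 1000 + pod * 100 + index
    raise ValueError(f"unknown role: {role}")

print(node_id(1, "leaf", 1))    # → 101  (Pod A, first leaf)
print(node_id(2, "leaf", 1))    # → 201  (Pod B, first leaf)
print(node_id(1, "spine", 1))   # → 1101 (Pod A, first spine)
```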

#3 - Letting Faults Get Out of Hand


Fault systems have been around for a long time, and as anyone who has used them knows, they can either be highly effective troubleshooting and debugging tools, or a completely useless mess. In my time at Cisco I admittedly saw very few customers that were able to keep their faults tidy (I suspect there is a sampling bias, as customers who keep their faults clean likely wouldn't be opening as many cases), but when we did get on a box that had few to no faults, it was almost like magic. Any issue that pops up when configuring or operating the fabric is immediately visible and ready to troubleshoot.

Which of these two fabrics would you rather deploy new configuration on: a dashboard showing thousands of accumulated faults, or one that is nearly clean?

Keep your faults at near 0 and utilize the full power of ACI.
P.S. this is a service that we offer at Good Gateway if you don't want to do it all yourself.
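Keeping faults near zero starts with measuring them. The APIC exposes faults as the faultInst class over its REST API (e.g. GET /api/node/class/faultInst.json), and responses use the standard imdata envelope. Below is a minimal sketch that summarizes such a response by severity; the sample payload is hypothetical and trimmed to just the fields the function reads.

```python
from collections import Counter

def fault_summary(payload: dict) -> Counter:
    """Count faults by severity from an APIC faultInst query response.

    Expects the standard APIC REST envelope:
    {"imdata": [{"faultInst": {"attributes": {...}}}, ...]}.
    """
    severities = (
        item["faultInst"]["attributes"]["severity"]
        for item in payload.get("imdata", [])
    )
    return Counter(severities)

# Hypothetical sample response, trimmed to the fields used here.
sample = {
    "imdata": [
        {"faultInst": {"attributes": {"code": "F0546", "severity": "minor"}}},
        {"faultInst": {"attributes": {"code": "F0532", "severity": "warning"}}},
        {"faultInst": {"attributes": {"code": "F0546", "severity": "minor"}}},
    ]
}
print(fault_summary(sample))
```

Run on a schedule, a summary like this makes it obvious when a fabric's fault count starts drifting away from zero.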

#2 - Hanging on to Legacy Gear


This was a very close second, because I have seen this issue cause more problems than I can count. If I could quickly describe what an ACI deployment that has been set up for success looks like, it would certainly include being fully committed to the ACI migration, which means connecting ACI Leafs directly to the edge devices, and not through legacy equipment.

Often customers want to hang on to their legacy gear, either because they are comfortable with how it functions, or because they think they still need to extract some value from it before it becomes obsolete. What tends to happen, though, is that by mixing in legacy equipment, you get the worst of both worlds, and never fully utilize the power of your ACI fabric.

For example, many customers are migrating away from a pair of Nexus 7000s that were acting as their core routers. They were running HSRP for first-hop redundancy and spanning-tree protocol to prevent loops, which they now want to integrate into ACI. This can be done, and there is documentation for it, but ACI makes both of these features obsolete: you have pervasive gateway SVIs across all leafs, and there is no need for spanning tree in a fully meshed, routed IS-IS fabric.

#1 - Overconfiguring and/or Checkbox Guessing


This is the practice of simply "clicking until it works", which can be tempting in the moment, but over time you accrue technical debt that you will eventually have to pay back, often with interest.

I had to put this at number 1 for two reasons. The first is that having a deep understanding of how traffic flows inside your network is difficult even when everything is configured correctly; once you start to lose that understanding, you will resort more and more to simply guessing, enabling and disabling features in the GUI until your traffic works. This makes it a compounding mistake that only gets worse with time, and when the you-know-what hits the proverbial fan, it's going to be extremely difficult to untangle how traffic should be flowing in your fabric.

The second reason is that you can be exposing yourself to defects in features you aren't even utilizing. One example I can think of is a customer that had both shared-services boxes checked under their external L3Out, but they weren't running any shared-services features anywhere on the fabric. This didn't cause any issues until they hit a shared-services bug that they shouldn't have been exposed to in the first place.
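Stray checkboxes like these are visible in the object model: the shared-services options on an external EPG's subnet end up as "shared-" values in the l3extSubnet scope attribute. Here is a sketch that scans exported subnet attributes for those flags; the input list is a hypothetical, trimmed-down excerpt of what an APIC config export would contain.

```python
def find_shared_flags(subnets: list[dict]) -> list[tuple[str, list[str]]]:
    """Report external-EPG subnets whose scope contains shared-services
    flags (e.g. "shared-rtctrl", "shared-security").

    Input is a list of l3extSubnet attribute dicts, such as those found
    in an APIC configuration export.
    """
    hits = []
    for attrs in subnets:
        shared = [s for s in attrs.get("scope", "").split(",")
                  if s.startswith("shared-")]
        if shared:
            hits.append((attrs.get("ip", "?"), shared))
    return hits

# Hypothetical subnet attributes, trimmed to the fields used here.
subnets = [
    {"ip": "0.0.0.0/0", "scope": "import-security,shared-rtctrl,shared-security"},
    {"ip": "10.1.0.0/16", "scope": "import-security"},
]
for ip, flags in find_shared_flags(subnets):
    print(f"{ip}: {','.join(flags)}")
```

A periodic sweep like this lets you spot options that were "clicked until it worked" and were never cleaned up.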

If your traffic isn't flowing after a new configuration change, take the time to trace it out and understand where it is failing. And if you do need to check a strange box to get it to work, make sure to go back and research the feature and everything that goes along with it.
