We are working through an initial setup of NSX on the latest build (6.4.5.13282012). We have the primary/secondary NSX Managers (one at each of our two sites) and 3 controllers up, and we're at the point where we've configured VXLAN on three clusters. I set up a new VLAN on the physical switches for the VXLAN transport to contain the VTEPs. Everything shows green/happy/connected in the Install and Upgrade area, but only 4 of the 11 hosts respond to ping on their VTEP vmk interface. There is one VTEP per server. The three clusters we've deployed to so far:
DC1-Mgmt/Edge:
|--M1 (Succeeds)
|--M2 (Succeeds)
DC1-Compute:
|--S1 (Fails)
|--S2 (Fails)
|--S3 (Succeeds)
|--S4 (Fails)
|--S5 (Succeeds)
DC2-Compute:
|--S1 (Fails)
|--S2 (Fails)
|--S3 (Fails)
|--S4 (Fails)
For the life of me I cannot figure out why I can't ping some of these interfaces, and in particular why two of the five servers in the DC1-Compute cluster work while the others fail. We've tried pinging from the switches, from other hosts, etc., with the same results.
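For anyone who wants to reproduce the test: from the ESXi shell I've been running pings over the dedicated VXLAN netstack, first small and then MTU-sized. The vmk number and target IP below are placeholders for my environment:

```shell
# Small ping first: tests basic VTEP reachability over the
# vxlan TCP/IP stack (vmk3 and the IP are placeholders).
vmkping ++netstack=vxlan -I vmk3 -s 64 192.168.250.54

# Jumbo ping with don't-fragment set: -s 1572 plus headers is
# roughly 1600 bytes on the wire, so this fails if any hop in
# the path lacks jumbo MTU even when the small ping succeeds.
vmkping ++netstack=vxlan -I vmk3 -d -s 1572 192.168.250.54
```

If the small ping works but the jumbo one doesn't, that points at an MTU problem somewhere on the path rather than basic reachability.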
Both DC1 clusters are on the same switch stack, and every host is in the same configuration group, so their interface configurations on the physical switch side are 100% identical. The DC2 cluster is an identical switch stack running the same software version with the same configuration. A VLAN was created for each site (DC1 and DC2), each dedicated to the VXLAN traffic. We have jumbo frames running everywhere for it, and no other devices have an IP on these networks save for the gateway IP.
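A sketch of the per-host checks I'm planning to run next, assuming the usual esxcli namespaces on an NSX-prepared host; the idea is to rule out a host-side MTU or netstack mismatch that the green status in the UI wouldn't surface:

```shell
# List vmknics on the dedicated vxlan netstack; the VTEP vmk
# should appear here with the jumbo MTU configured.
esxcli network ip interface list --netstack=vxlan

# Check the MTU configured on the distributed switch itself;
# a host where the DVS MTU stayed at 1500 would behave exactly
# like this (green in the UI, jumbo VTEP pings failing).
esxcli network vswitch dvs vmware list
```

Since the physical switch configs are identical, comparing these two outputs between a passing and a failing host seems like the fastest way to spot a per-host difference.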
DC1-Mgmt and DC2-Compute both use an enhanced LACP support / LACPv2 DVS with "route based on originating virtual port" load balancing; I've tried other LB methods. The LAG is selected as the active uplink for the vDN portgroups on each. DC1-Compute uses a standard/traditional LACP DVS. All three VDS have other similarly configured port groups and VLANs running in production that work fine - I can't see anything functionally different about the "vxw-vmknicPg-dvs-etc-etc" portgroup NSX has created for the VXLAN transport. All 5 servers in the DC1-Compute cluster are 100% identical except for their vSphere build (slightly different 6.5 versions), but the build doesn't correspond to which hosts work. The 2 DC1-Mgmt servers and 4 DC2-Compute servers are on the latest 6.7 U1a builds.
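One more host-side check worth doing before the VMware call: NSX-v prepared hosts carry a net-vdl2 utility that dumps the VXLAN configuration as the host itself sees it (VDS name, VTEP vmknics, teaming policy, segment ID). The exact output varies by build, but diffing it between a working and a failing host should make any odd man out obvious:

```shell
# Dump the host's VXLAN (VDL2) configuration. Comparing a
# passing host (e.g. DC1-Compute S3) against a failing one
# (e.g. S1) may show a VTEP bound to the wrong uplink or LAG
# member despite identical physical switch configs.
net-vdl2 -l
```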
We have a call planned with VMware on Monday, but I'm hoping I can do my own poking around and find a resolution before then. Thanks in advance for any ideas!