When Microsoft DHCP Server Lies

The Problem

Recently at work, we had an issue where one of our guest networks wasn’t allowing new hosts onto the network. After some investigation, it appeared that we had simply run out of available IP addresses in the DHCP pool. We added 100 addresses and thought we were done, until we noticed that within seconds those 100 addresses had also been used. Thankfully the IP address range for that subnet had plenty of headroom, so we added another /24.

Unfortunately, within hours the new /24 was also consumed. At this point we began to suspect a Denial of Service attack, but we were wrong. While looking through the DHCP server for patterns or possibly spoofed MAC addresses, we discovered something we weren’t looking for. Having exported the current DHCP lease table to Excel, we noticed that it held only 900 or so records, yet the DHCP server was reporting 2000+ used IP addresses. This led us to distrust the GUI.

Digging Further

After some quick Google-Fu we found some PowerShell commands to examine the DHCP database.

Get-DhcpServerv4Lease -ScopeId 10.1.2.0 -AllLeases | Select-Object IPAddress,LeaseExpiryTime,AddressState

What we noticed when running the PowerShell command was that it showed all 2000+ addresses. The addresses that were missing from the GUI showed a state of Active, but their LeaseExpiryTime was old enough that they should have been in a state of Expired. In reading the documentation to figure this out, we learned a few things. First, when an IP address is not renewed, Microsoft DHCP normally marks it as Expired but keeps it unavailable for a grace period of 4 hours. If the original host requests an IP within that time, it is given the same address back. After 4 hours, a scavenger process that runs hourly marks the IP as Available and puts it back into the pool for all new requests. In our case, though, the leases were still marked as Active, so the scavenger ignored the LeaseExpiryTime and never marked them as Available.
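The mismatch described above (a lease that is still Active even though its LeaseExpiryTime has passed) can be picked out with a small filter. This is only a sketch: Select-StaleLease is a helper name made up for this post, not a built-in cmdlet, and the filter assumes reservations carry no expiry time and should be left alone.

```powershell
# Hypothetical helper (not part of the DhcpServer module): passes through only
# the leases that still report Active after their LeaseExpiryTime has passed.
function Select-StaleLease {
    param(
        [Parameter(ValueFromPipeline)]
        $Lease,

        # Reference time to compare against; defaults to "now".
        [datetime] $Now = (Get-Date)
    )
    process {
        if ($Lease.AddressState -eq 'Active' -and
            $Lease.LeaseExpiryTime -and      # skip entries with no expiry time
            ([datetime]$Lease.LeaseExpiryTime) -lt $Now) {
            $Lease
        }
    }
}

# Example use against the scope from this post (needs the DhcpServer module):
# Get-DhcpServerv4Lease -ScopeId 10.1.2.0 -AllLeases | Select-StaleLease |
#     Select-Object IPAddress, LeaseExpiryTime, AddressState
```

Because the helper only looks at the AddressState and LeaseExpiryTime properties, it can also be tried against rows loaded with Import-Csv before pointing it at a live server.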

To fix this situation we first tried running a PowerShell command to export the current database to CSV. We then used a text editor to trim the file down to the entries whose LeaseExpiryTime was earlier than the current time, since those were the stale leases we wanted gone. Finally, we ran another PowerShell command to import the edited CSV and remove those old address leases.

Get-DhcpServerv4Lease -ScopeId 10.1.2.0 -AllLeases | Select-Object IPAddress,LeaseExpiryTime,AddressState | Export-Csv -NoTypeInformation -Encoding ascii -Path .\leases.csv

Import-Csv -Path .\leases.csv | Remove-DhcpServerv4Lease
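The hand edit in the text editor is the error-prone part of that process. As a rough sketch, the same selection could be done in the pipeline instead, with a preview before anything is deleted. This is not what we actually ran; the stale-leases.csv file name is made up for the example, and the filter assumes that only Active leases with a past LeaseExpiryTime should go.

```powershell
# Sketch only: select the stale leases in PowerShell instead of a text editor.
$now = Get-Date

$stale = Get-DhcpServerv4Lease -ScopeId 10.1.2.0 -AllLeases |
    Where-Object {
        $_.AddressState -eq 'Active' -and
        $_.LeaseExpiryTime -and          # skip entries with no expiry time
        $_.LeaseExpiryTime -lt $now
    }

# Keep a record of what is about to be removed (file name is an assumption).
$stale |
    Select-Object IPAddress, LeaseExpiryTime, AddressState |
    Export-Csv -NoTypeInformation -Encoding ascii -Path .\stale-leases.csv

# Preview only: -WhatIf reports each removal without changing the server.
$stale | ForEach-Object {
    Remove-DhcpServerv4Lease -IPAddress $_.IPAddress -WhatIf
}
```

Dropping -WhatIf from the last command would perform the removal for real, so it is worth reading the preview and the CSV first.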

Success!

Unfortunately this still did not mark the addresses as Expired or Available. Some more Google-Fu turned up indications that this might be related to issues with DHCP failover between this server and the standby. We removed the failover relationship and tried the Remove-DhcpServerv4Lease command again. This time all of the addresses whose LeaseExpiryTime had passed were marked as Expired. Four hours later they were all removed from the database and our IP pool was back to normal.
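For anyone retracing these steps, the failover relationship can be inspected from PowerShell before deciding to remove it. The following is a sketch rather than a record of the exact commands we used, and 'Guest-Failover' is a placeholder relationship name.

```powershell
# Show the failover relationship that covers the scope, including its state.
Get-DhcpServerv4Failover -ScopeId 10.1.2.0 |
    Select-Object Name, PartnerServer, Mode, State, ScopeId

# Removing a relationship is disruptive, so preview it first.
# 'Guest-Failover' is a placeholder; use the Name reported above.
Remove-DhcpServerv4Failover -Name 'Guest-Failover' -WhatIf
```

If the relationship is removed, it has to be recreated afterwards to get the standby server back in sync.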

I would like to say that this all happened within an hour or so, but it was in actuality spread out over 48 hours with 5 engineers chasing our tails.  Hopefully putting this out on the Internet will help someone else in the future.