Your own domain, where your word is law

Having a domain is one of the most complex requirements to meet when learning and practicing advanced SQL architecture setups. Sure, you can build a domain-independent cluster and even build an Availability Group on top of it. You can also use local accounts to run your SQL services, and SQL authentication, but you’ll miss the opportunity to experience many configurations that are common in a production environment. So, despite not being part of a DBA’s regular responsibilities, this post will cover the installation and configuration of a domain controller as well as a DNS server.

But first, a disclaimer: this does not reflect how a production environment looks, nor does it intend to. A production environment has high availability and redundancy requirements to ensure there is no single point of failure, and can consist of several machines acting as domain controllers, and another set of machines acting as DNS servers.

This is the most basic setup I can think of so you can build it in your own home using Windows Server 2019, and the one I have configured for myself every time I’ve built a home lab for learning and testing purposes.

Installing the role

The role (whose full name is “Active Directory Domain Services” or AD DS) requires a DNS to be installed on the network. Since this post is meant to show how to configure it for local tests, the same box will be given both roles to reduce complexity in the setup.
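
If you prefer the command line over the Server Manager wizard, both roles can also be added from an elevated PowerShell session. This is only a minimal sketch of the equivalent step, not what I captured in the screenshots:

```powershell
# Install Active Directory Domain Services and DNS Server on the same box,
# including their management consoles (run from an elevated session).
Install-WindowsFeature -Name AD-Domain-Services, DNS -IncludeManagementTools

# Confirm both roles now report as installed.
Get-WindowsFeature -Name AD-Domain-Services, DNS | Format-Table Name, InstallState
```

Installing the roles this way does not configure them; that still happens in the steps that follow.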

With both roles installed, it is time to configure them.

Building the foundations: setting up your DNS

It wouldn’t make sense in this scenario to have a DNS server that needs a DNS itself to be found, so first let’s assign it a static IP address.

  • (Red) Ensure the selected IP won’t collide with any of your other VMs nor with the host itself (consult your hypervisor for this, the most common being Hyper-V or VMware), and leave the default gateway empty if you don’t want this VM to have access to the outside world.
  • (Blue) Since this machine will act as a DNS server on this brand new lab, there are no existing DNS servers to use, so leave these values blank.
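
The same settings can be applied from PowerShell. This is a sketch under two assumptions of mine that you should adapt: the network adapter is named "Ethernet", and 192.168.1.10 is a free address in your lab's range.

```powershell
# Assumptions: adapter alias "Ethernet" and the address 192.168.1.10 are
# placeholders; check your own adapter name and pick a free IP.
$adapter = 'Ethernet'

# List the adapter first to make sure the alias is right.
Get-NetAdapter -Name $adapter

# Assign the static address. No -DefaultGateway is passed, so the VM
# gets no route to the outside world, as described above.
New-NetIPAddress -InterfaceAlias $adapter -IPAddress '192.168.1.10' -PrefixLength 24

# No DNS server addresses are set: this machine will be the DNS server.
Get-NetIPConfiguration -InterfaceAlias $adapter
```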

With a static IP set, open the DNS manager console to configure the DNS server. This console can be found in the start menu, or in the Server Manager Dashboard, under Tools.

Now it is time to get the DNS configured.

Since this DNS server will serve a small network (just a handful of machines), the simplest configuration will be enough for that purpose.

And only one DNS server will be set, so this server will maintain the DNS forward zone by itself.

Time to name our DNS: I’ve used sqlozanot.com to differentiate it from the one I have already set on a different machine named sqlozano.com.

And this being a brand new DNS for a brand new zone, there is no DNS zone file to load, just select the name for the new one to be created.

I have selected not to allow dynamic updates of the DNS entries: this will let me control the mapping of DNS and IPs. I consider it a good practice to learn the basics on how to configure it yourself, so when you work in a place where you don’t have access to it, you will know how to properly express your request to whoever will be handling these changes for you.
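
For reference, the forward zone choices made in the wizard so far (primary zone, new zone file, no dynamic updates) roughly correspond to the DnsServer cmdlets below. The zone file name and the sql01 host with its IP are hypothetical values I made up for illustration.

```powershell
# Create a file-backed primary forward zone with dynamic updates disabled.
# The zone file name is an assumption, not necessarily what the wizard proposes.
Add-DnsServerPrimaryZone -Name 'sqlozanot.com' -ZoneFile 'sqlozanot.com.dns' -DynamicUpdate None

# With dynamic updates off, every host must be registered by hand.
# "sql01" and 192.168.1.21 are made-up examples.
Add-DnsServerResourceRecordA -ZoneName 'sqlozanot.com' -Name 'sql01' -IPv4Address '192.168.1.21'

# List the A records currently held in the zone.
Get-DnsServerResourceRecord -ZoneName 'sqlozanot.com' -RRType A
```

Adding records manually like this is exactly the kind of request you would hand over to whoever manages DNS in a larger environment.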

Having a single DNS means we have nowhere else to forward the requests this one can’t handle (for example, trying to reach servers not registered in this one, like those on the sqlozano.com domain).

The wizard will now look for root hints (the preloaded addresses of the root name servers, loosely comparable to the entries you configure in your Windows hosts file to hard code addresses), but this being a brand new machine and DNS, it won’t find any.

Now the DNS is ready to be saved and be used.

Just ignore this last message: as seen above, there are no root hints to be configured.

This message startled me the first 4 times I configured my own DNS server. And I bet there will be a fifth.

The DNS is almost ready. Before moving on to the Active Directory setup, a reverse zone will be configured as well.

Time to put the “Active” in “Active Directory”

The Server Manager dashboard will remind you that Active Directory Domain Services hasn’t been configured yet, and will even give you a shortcut to promote the machine to a domain controller, which is exactly the purpose of this post.

For this basic setup, a brand new forest will be created, named sqlozanot.com after the DNS zone sqlozanot.com created in the previous steps. The new domain may be named differently, but that would require the configuration of a DNS zone with the same name, and this setup is aiming to be as simple as possible.

Set a Directory Services Restore Mode (DSRM) password, so the Active Directory can be repaired in case of a failure, and continue with the configuration.

This warning is related to the DNS settings we are configuring. Since this is going to be part of a home lab and there won’t be external connections to it, there is no need to worry about it at the moment.

Just confirm the NETBIOS name and move to the next.

I only had to worry about NETBIOS a couple of times in my life due to Microsoft Distributed Transaction Coordinator issues with SQL Servers, so I’d recommend not changing this unless you know what you are doing.

When selecting the folders to store all of AD’s related files, it would be a good idea to ensure antivirus won’t mess with these folders, and even locate them on a separate drive with additional backups for safety. But again, the simplest setup possible is being used, so I’m keeping the default values here.

Review the settings before applying those changes: those can be saved as a PowerShell script so the setup can be reproduced elsewhere if the machine must be rebuilt.

This is the complete summary of the options selected

Configure this server as the first Active Directory domain controller in a new forest.

The new domain name is "sqlozanot.com". This is also the name of the new forest.

The NetBIOS name of the domain: SQLOZANOT

Forest Functional Level: Windows Server 2016

Domain Functional Level: Windows Server 2016

Additional Options:

  Global catalog: Yes

  DNS Server: Yes

  Create DNS Delegation: No

Database folder: C:\Windows\NTDS

Log file folder: C:\Windows\NTDS

SYSVOL folder: C:\Windows\SYSVOL

The DNS Server service will be configured on this computer.

This computer will be configured to use this DNS server as its preferred DNS server.

The password of the new domain Administrator will be the same as the password of the local Administrator of this computer.

And the PowerShell script to create it (as generated by the Wizard):

Import-Module ADDSDeployment
Install-ADDSForest `
-CreateDnsDelegation:$false `
-DatabasePath "C:\Windows\NTDS" `
-DomainMode "WinThreshold" `
-DomainName "sqlozanot.com" `
-DomainNetbiosName "SQLOZANOT" `
-ForestMode "WinThreshold" `
-InstallDns:$true `
-LogPath "C:\Windows\NTDS" `
-NoRebootOnCompletion:$false `
-SysvolPath "C:\Windows\SYSVOL" `
-Force:$true

If all the prerequisites for the installation of Active Directory Domain Services are met, you are ready to complete the installation.
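
The prerequisite check the wizard runs at this point can also be triggered on its own with Test-ADDSForestInstallation from the same ADDSDeployment module. A sketch reusing the values from the generated script above; the password prompt is my addition, since the check asks for the restore mode password:

```powershell
Import-Module ADDSDeployment

# Prompt for the restore mode password instead of hard-coding it.
$dsrmPassword = Read-Host -AsSecureString -Prompt 'DSRM password'

# Runs only the prerequisite checks; nothing is installed or changed.
Test-ADDSForestInstallation `
    -DomainName 'sqlozanot.com' `
    -DomainNetbiosName 'SQLOZANOT' `
    -DomainMode 'WinThreshold' `
    -ForestMode 'WinThreshold' `
    -InstallDns:$true `
    -SafeModeAdministratorPassword $dsrmPassword
```

Expect it to surface the same kind of warnings the wizard shows.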

I got some warnings, but this is not going to be a production environment, so I’ll just carry on despite them.

Once the installation is complete, you’ll be logged out and the machine will reboot.

On your next login on the machine, you’ll see the Administrator account belongs to the new domain.
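
If you want more proof than the login screen, a few read-only commands can confirm the promotion worked. This is just a quick sanity check I'd suggest, assuming the ActiveDirectory module that comes with the role's management tools is available:

```powershell
# The domain this computer belongs to (should be sqlozanot.com).
(Get-CimInstance -ClassName Win32_ComputerSystem).Domain

# Basic facts about the new domain.
Get-ADDomain | Format-List DNSRoot, NetBIOSName, DomainMode

# The domain controllers found (only this machine, for now).
Get-ADDomainController -Filter * | Format-Table HostName, IPv4Address, IsGlobalCatalog

# The account you are logged in with, in DOMAIN\user form.
whoami
```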

And that’s it, you are the sole ruler of your own domain.

Welcome to outsider.sqlozanot.com, capital of sqlozanot.com. Population: 1

But now that you have both DNS and Active Directory, let’s finish up with the configuration of a Reverse Lookup Zone

What is a Reverse Lookup Zone?

Think of a DNS Forward Lookup Zone as your phone’s contact list. You usually search for the name of your contact since it is easier to remember than a bunch of numbers. However, sometimes you may want to figure out who a certain IP address belongs to. If only the Forward Lookup Zone were available, the only way to find that out would be to query every single DNS entry and compare their IP address to the one you are looking for. A Reverse Lookup Zone maps the information the other way around.
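
To make the difference concrete, here are the two kinds of lookup side by side. The host name is the domain controller built in this post; the IP address 192.168.1.10 is an assumption of mine, so use whatever static IP you gave your server.

```powershell
# Forward lookup: name -> IP address (answered from the Forward Lookup Zone).
Resolve-DnsName -Name 'outsider.sqlozanot.com' -Type A

# Reverse lookup: IP address -> name (needs a PTR record in a Reverse Lookup Zone).
# 192.168.1.10 is a placeholder address.
Resolve-DnsName -Name '192.168.1.10' -Type PTR
```

Until a Reverse Lookup Zone holding a matching PTR record exists, the second query is expected to fail.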

Creating a Reverse Lookup Zone for your DNS

Using the DNS console, a wizard will guide you through the whole process

Since this is the only DNS on this home lab, the new zone will be the primary. And having the Active Directory in the same machine allows this zone to be stored in it.

All servers in this forest (the only one we have) will replicate this zone’s data.

For simplicity I have used IPv4 on my boxes, so the zone will be configured for IPv4 addresses.

All the machines built will have their IPs in the 192.168.1.1 – 192.168.1.255 range, so 192.168.1 is the Network ID used to identify this zone.

Since this zone is integrated with the Active Directory created previously, secure dynamic updates are now allowed (this was not an option during the initial installation of the DNS).

Now the Reverse Lookup Zone is ready to be created.
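
The wizard choices above (primary zone, stored in Active Directory, replicated to the forest, IPv4, secure dynamic updates) map roughly to the following sketch. The PTR record for host address .10 is a hypothetical example; adjust it to your server's real IP.

```powershell
# AD-integrated reverse zone for 192.168.1.x, replicated forest-wide,
# accepting secure dynamic updates only.
Add-DnsServerPrimaryZone -NetworkId '192.168.1.0/24' -ReplicationScope Forest -DynamicUpdate Secure

# Manually register a PTR record: 192.168.1.10 -> outsider.sqlozanot.com.
# The ".10" host address is an assumption for illustration.
Add-DnsServerResourceRecordPtr -ZoneName '1.168.192.in-addr.arpa' -Name '10' -PtrDomainName 'outsider.sqlozanot.com'

# List all zones to see the forward and reverse zones together.
Get-DnsServerZone
```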

How is our DNS looking now?

Since there is no other machine than the DNS server itself, there is not much to see at the moment, so below are some screen captures of the same DNS with some additional machines added.

Notes on this post

The machine used for this post runs Windows Server 2019. In the past I set similar testing environments using Windows 2012 and Windows 2016 and the steps are very similar, so you shouldn’t have any trouble following these instructions with those operating systems.

The node that wasn’t

There are many ways a Windows cluster may get into problems, but this post is going to present a specific one that I recently ran into (as in “inadvertently provoked”). You may have a perfectly healthy cluster and suddenly one of the nodes is gone. You haven’t made any change to the cluster, but somehow the node won’t start its cluster services, looking offline to the rest of the cluster despite not having any communication issues.

Initial state

AGCLU01 with 4 nodes: AG01, AG02, AG03 and AG04

AGCLU02 with 2 nodes: AG05 and AG06

The mess up

For some reason (the box crashed and is unrecoverable, the requirements for the cluster have changed, or it was just a proof of concept), a node of the AGCLU01 cluster (AG04) is no longer available: it is either broken beyond repair, shut down permanently, or decommissioned. Because the node was lost unexpectedly, or was decommissioned before being evicted from the cluster, AGCLU01 is left with only 3 of its 4 nodes online at any given time.

AGCLU01 with 4 nodes: AG01, AG02, AG03, and an offline AG04
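
The same picture can be obtained without the Failover Cluster Manager console. A minimal sketch, assuming the FailoverClusters module is installed on the machine you run it from:

```powershell
Import-Module FailoverClusters

# List every node AGCLU01 knows about, with its current state.
# A lost node such as AG04 would be listed with State "Down".
Get-ClusterNode -Cluster 'AGCLU01' | Format-Table Name, State, Id
```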

Meanwhile, a new AG04 machine is built with the same IP as the old one, since we have some set rules on the IP address assigned to a box based on its name for ease of identification. The requirements for our clusters have changed, and now each of them only needs 3 nodes, so this new AG04 is added to the cluster AGCLU02.

AGCLU02 with 3 nodes: AG05, AG06, and a brand new AG04

Later, we find out AGCLU01 still has a ghost entry for a node named AG04 that no longer exists, so we decide to evict it from its old cluster AGCLU01.

Spring cleaning on AGCLU01

The node will remain “Processing” the eviction order for a while: don’t expect it to complete any time soon (I waited for several minutes until I gave up and just hit refresh).

A clean AGCLU01.sqlozano.com with only the three current nodes: AG01, AG02, and AG03

So we got our AGCLU01 cluster all nice and clean with its 3 nodes? Let’s take a look at AGCLU02 and its 3 nodes.

AGCLU02 with 3 nodes: AG05, AG06, and an unexpectedly “downed” AG04

What’s happened to AG04? The box is up and running, so let’s check the cluster services.

A disabled Cluster Service is never a good sight on a cluster node

The first reaction

The cluster services are disabled, but that is not a big deal. Surely we can fix that by just starting it manually…

…or maybe not
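
For those who prefer the console over services.msc, this is roughly the same attempt in PowerShell, run on AG04. ClusSvc is the service name of the Cluster Service; on the broken node I would expect the start attempt to fail just as it did in the GUI.

```powershell
# Check the current state and startup type of the Cluster Service.
Get-Service -Name ClusSvc | Format-List Name, Status, StartType

# Re-enable the service and try to start it by hand.
Set-Service -Name ClusSvc -StartupType Manual
try {
    Start-Service -Name ClusSvc -ErrorAction Stop
}
catch {
    # On a node whose cluster configuration was wiped, this is the likely outcome.
    Write-Warning "Cluster Service did not start: $($_.Exception.Message)"
}
```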

What can the system log tell us about that?

Filtering by the FailoverClustering source, the following errors can be found on AG04’s System logs at the time of its eviction from AGCLU01

If I had my speaker on, I’d hear the system log screaming at me
Event ID: 4621
Task Category: Cluster Evict/Destroy Cleanup
Message: This node was successfully removed from the cluster

Event ID: 4615
Task Category: Cluster Evict/Destroy Cleanup
Message: Disabling the cluster service during cluster node cleanup, has failed. The error code was '1115'. You may be unable to create or join a cluster with this machine until cleanup has been successfully completed. For manual cleanup, execute the 'Clear-ClusterNode' PowerShell cmdlet on this machine.

Event ID: 4629
Task Category: Cluster Evict/Destroy Cleanup
Message: During node cleanup, the local user account that is managed by the cluster was not deleted. The error code was '2226'. Open Local Users and Groups (lusrmgr.msc) to delete the account.

Event ID: 4627
Task Category: Cluster Evict/Destroy Cleanup
Message: Deletion of clustered tasks during node cleanup failed. The error code was '3'. Use Windows Task Scheduler to delete any remaining clustered tasks.

Event ID: 4622
Task Category: Cluster Evict/Destroy Cleanup
Message: The Cluster service encountered an error during node cleanup. You may be unable to create or join a cluster with this machine until cleanup has been successfully completed. Use the 'Clear-ClusterNode' PowerShell cmdlet on this node.

Followed by the same error message repeated every 15 seconds:

Event ID: 1090
Task Category: Startup/Shutdown
Message: The Cluster service cannot be started. An attempt to read configuration data from the Windows registry failed with error '2'. Please use the Failover Cluster Management snap-in to ensure that this machine is a member of a cluster. If you intend to add this machine to an existing cluster use the Add Node Wizard. Alternatively, if this machine has been configured as a member of a cluster, it will be necessary to restore the missing configuration data that is necessary for the Cluster Service to identify that it is a member of a cluster. Perform a System State Restore of this machine in order to restore the configuration data.
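
Events like the ones above can also be pulled with Get-WinEvent instead of clicking through Event Viewer. A sketch to run on AG04; the one-hour window is an arbitrary choice of mine, so widen it to cover the time of the eviction.

```powershell
# Read FailoverClustering events from the System log for the last hour.
# "Microsoft-Windows-FailoverClustering" is the provider behind the
# FailoverClustering source shown in Event Viewer.
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    StartTime    = (Get-Date).AddHours(-1)
} |
    Sort-Object TimeCreated |
    Format-Table TimeCreated, Id, LevelDisplayName, Message -Wrap
```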

What’s going on in the registry?

Let’s see what a “healthy” registry looks like in a cluster node, compared to our AG04

Left: AG04 | Right: AG05 with the highlighted “Cluster” hive

That’s it, the “Cluster” hive is missing from the registry. It was removed when the node was evicted from AGCLU01. Even though we meant to remove the node from AGCLU01 only, the command was sent through the network to the new AG04 node, and it received the order to remove all information regarding clusters it may retain.
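
The same comparison can be scripted rather than eyeballed in regedit. This sketch assumes PowerShell remoting is enabled on both nodes, and simply checks whether the HKLM\Cluster key exists on each:

```powershell
# Ask each node whether its HKLM\Cluster key is present.
# Assumes PowerShell remoting (WinRM) is enabled on AG04 and AG05.
Invoke-Command -ComputerName 'AG04', 'AG05' -ScriptBlock {
    [pscustomobject]@{
        Node              = $env:COMPUTERNAME
        ClusterKeyPresent = Test-Path -Path 'HKLM:\Cluster'
    }
} | Format-Table Node, ClusterKeyPresent
```

A healthy cluster member should report the key as present; a node in AG04's situation should not.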

Why did the cluster mistake the new AG04 for the old AG04?

In order to figure out why it was happening, I tested the following scenarios:

  • Old DNS name (AG04) with old IP (AG04’s).
  • Old DNS name (AG04) with a new IP.
  • New DNS name (AG07) with old IP (AG04’s), with the old DNS name (AG04) still active and pointing to the old IP (AG04’s).

Only the “old name, old IP” combination caused this particular issue.

Although I couldn’t identify how the cluster managed to check both the DNS name and the IP address, it appears the cluster sends the eviction order across the network, and the order reaches a machine with the same name and the same IP. This is good enough for most cases, but unfortunately nothing verifies that the machine receiving the order to clean its cluster configuration records is a member of the cluster sending out the order.

How do I fix my cluster now?

The first reaction would be adding the server back to the AGCLU02 cluster, but we can’t add a server to a cluster it is already a member of.

AG04 is a member of AGCLU02, and can’t be added twice

Well, maybe it can be added back to the first cluster it belonged to, AGCLU01

Somehow AG04 still thinks it’s a cluster node

No, it cannot. Let’s try cleaning the node’s cluster configuration by running

Clear-ClusterNode

No luck: still getting the same error when trying to add it to AGCLU02

AG04 still shows in the AGCLU02 list of nodes, since Clear-ClusterNode runs on the node and won’t change the cluster records

But what of AGCLU01?

Finally some progress

Now I can add AG04 to the cluster AGCLU01, but not to the cluster it should belong to now, AGCLU02, which retains some configurations and registry entries that identify this node as an existing member. But since I really want to get that AG04 node into AGCLU02, I’ll evict it from that cluster in order to be able to add it back again.

Evicting an offline node… what could go wrong?

Now let’s try and add AG04 back to AGCLU02

Just a few more clicks until we recover our node

And we are back in business

Welcome home, AG04
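
I did all of the above through the Failover Cluster Manager, but the whole recovery can be summarized with the FailoverClusters cmdlets. Treat this as a sketch of the sequence rather than a tested script, and note where each command is meant to run:

```powershell
# 1. On AG04 itself: wipe whatever cluster configuration is left on the node.
Clear-ClusterNode -Force

# 2. From a healthy member of AGCLU02 (e.g. AG05): evict the stale AG04 entry,
#    since Clear-ClusterNode does not touch the cluster's own records.
Remove-ClusterNode -Cluster 'AGCLU02' -Name 'AG04' -Force

# 3. Add the now-clean machine back as a node.
Add-ClusterNode -Cluster 'AGCLU02' -Name 'AG04'

# 4. Confirm all three nodes are listed as Up.
Get-ClusterNode -Cluster 'AGCLU02' | Format-Table Name, State
```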

How to avoid this in the first place?

First of all, always destroy your clusters cleanly: an offline node should be evicted from the cluster only when the machine is unrecoverable.

But if you must evict an offline node, make sure the DNS name of the node to be evicted is no longer in use, and if it still exists, that it is not pointing to a valid IP address assigned to a member node of an existing cluster.
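
A quick pre-eviction check along these lines could have saved me the trouble. It is only a sketch: it resolves the name of the node about to be evicted and warns if anything still answers at that address. A ping is a weak test (firewalls may block it), so treat a silent result as a hint, not proof.

```powershell
# Name of the offline node you are about to evict.
$node = 'AG04'

# Does the name still resolve to an IPv4 address?
$records = Resolve-DnsName -Name $node -Type A -ErrorAction SilentlyContinue |
    Where-Object { $_.IPAddress }

if (-not $records) {
    "No A record found for $node."
}
else {
    foreach ($ip in $records.IPAddress) {
        # Is some machine answering at the address the name points to?
        if (Test-Connection -ComputerName $ip -Count 1 -Quiet) {
            Write-Warning "$node resolves to $ip and something answers there. Check what that machine is before evicting."
        }
        else {
            "$node resolves to $ip, but nothing answered the ping."
        }
    }
}
```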

And if the evicted offline node is brought back online, clean its cluster configuration, if only to keep it free of leftover components and avoid error messages in the system log.

Notes on this test

This test was performed on machines running Windows Server 2019, based on a real-world issue that occurred on machines running Windows Server 2016.