Introduction
This article continues the example from How the VA communicates with Umbrella and local DNS. Read that article first, or refer to it if you only need an overview.
Detailed Example
Once the forwarder has determined whether a query needs to go to the local DNS or to Umbrella, it starts sending the query to the upstream servers. Using an increasing timeout value, it queries each server in random order at the current timeout value until it either gets a response or has queried all the servers. Then it tries the next timeout value, and so on.
The exact steps taken are as follows:
- Randomly order the servers.
- Set the timeout value to 2.0 seconds.
For each server in order, do the following:
1. Check if the server's RTT is in the cache. If so, and if it is equal to or greater than the current timeout, skip this server.
2. If the server's RTT is not in the cache, or is less than the current timeout, query the server.
- If the server responds to the query within the timeout, cache the response time as the RTT for that server, return the response, and then stop.
- Else, cache the current timeout value as the RTT for that server.
3. Set the timeout value to 2.5 seconds, then repeat steps 1 and 2 for each server.
4. Set the timeout value to 4.0 seconds, then repeat steps 1 and 2 for each server.
5. Set the timeout value to 4.5 seconds, then repeat steps 1 and 2 for each server.
If we still don't have a response after working through the highest timeout level for each server, we query one random server one last time at the highest timeout level. If that server fails to respond, we return a SERVFAIL response. Subsequent queries are then handled according to the same timeout levels and whatever is currently listed in the RTT cache.
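The steps above can be sketched as a small simulation. This is a minimal illustration, not the VA's actual code: the function name resolve, the rtt_cache and latency dictionaries, and the choice to leave the cache untouched on the final attempt are all assumptions made for the example. Network calls are replaced by a table of simulated response times, and cache expiry is ignored.

```python
import random

# Timeout levels in seconds, taken from the steps above.
TIMEOUTS = [2.0, 2.5, 4.0, 4.5]


def resolve(servers, rtt_cache, latency, rng=random):
    """Simulate one query; return (answering server or None, seconds waited).

    servers:   list of server names (hypothetical labels)
    rtt_cache: dict of name -> cached RTT in seconds, updated in place
    latency:   dict of name -> simulated response time, or None if the server is down
    """
    order = list(servers)
    rng.shuffle(order)  # randomly order the servers
    elapsed = 0.0
    for timeout in TIMEOUTS:
        for name in order:
            cached = rtt_cache.get(name)
            if cached is not None and cached >= timeout:
                continue  # step 1: cached RTT is at or above this level, skip
            actual = latency[name]
            if actual is not None and actual <= timeout:
                rtt_cache[name] = actual  # cache the response time and stop
                return name, elapsed + actual
            rtt_cache[name] = timeout  # timed out: cache the current timeout
            elapsed += timeout
    # No answer at any level: query one random server a final time.
    # Assumption: the cache is left unchanged here; the article does not say.
    name = rng.choice(order)
    actual = latency[name]
    if actual is not None and actual <= TIMEOUTS[-1]:
        return name, elapsed + actual
    return None, elapsed + TIMEOUTS[-1]  # None stands in for SERVFAIL


# Usage: both servers down, with one cached at 2.5 s and the other at 2.1 s.
servers = ["localdns1", "localdns2"]
cache = {"localdns1": 2.5, "localdns2": 2.1}
down = {"localdns1": None, "localdns2": None}
print(resolve(servers, cache, down))  # (None, 24.0)
```

Passing the cache in as a plain dictionary keeps the sketch easy to step through by hand: after each call you can inspect it to see which servers will be skipped on the next query.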
Exceptionally Detailed Example
Let's say that I have two internal servers configured in my VA, one locally where the average RTT is ~20 msec ("localdns1"), and one on the other side of the planet where the average RTT is ~2.1 seconds ("localdns2").
The first query to hit the VA might look like this:
- Randomly ordering the servers results in localdns2 first, then localdns1
- The timeout is set to 2.0 seconds.
- localdns2 isn't in the RTT cache, so we continue.
- Query is sent to localdns2.
- After 2 seconds, the query hasn't responded, so we set the RTT to 2.0 seconds for that server.
- localdns1 isn't in the RTT cache, so we continue.
- Query is sent to localdns1.
- localdns1 responds 20 msec later. We set the RTT to 0.02 seconds for that server, and then return the response.
Note that the query time here would be slightly above 2.02 seconds in total.
Then, the next query comes in, which would look like this:
- Randomly ordering the servers results in localdns2 first, then localdns1
- The timeout is set to 2.0 seconds.
- localdns2 is in the RTT cache, but its value is equal to 2.0 seconds, so we skip this server.
- localdns1 is in the RTT cache, but its value is lower than 2.0 seconds, so we continue.
- Query is sent to localdns1.
- localdns1 responds 20 msec later. We refresh the RTT to 0.02 seconds for that server, and then return the response.
Note the query time here would be slightly above 0.02 seconds in total. Thus, we've now correctly started to prefer the closer server.
That pattern would repeat for the next hour, although the random ordering of the servers may change each time, until the RTT cache entry for localdns2 expires. At that point, the first scenario would repeat once, refreshing the RTT cache entry for localdns2.
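The one-hour expiry can be pictured as a cache whose entries carry a timestamp. The sketch below is only an illustration under assumptions: the class name RttCache, its methods, and the rule that every write restarts the one-hour clock are hypothetical, since the article does not describe how the VA stores or ages these entries.

```python
import time

RTT_TTL_SECONDS = 3600  # one hour, as described in the text


class RttCache:
    """Hypothetical RTT cache whose entries expire after a fixed lifetime."""

    def __init__(self, ttl=RTT_TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock  # injectable so expiry can be exercised without waiting
        self._entries = {}   # server name -> (rtt_seconds, stored_at)

    def set(self, name, rtt):
        # Assumption: each write restarts the expiry clock for that server.
        self._entries[name] = (rtt, self._clock())

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        rtt, stored_at = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[name]  # expired: treat the server as never measured
            return None
        return rtt
```

Once get returns None for localdns2, the forwarder would treat it as not being in the RTT cache and query it again at the 2.0-second level, which is the repeat of the first scenario described above.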
Now let's say, at some point, localdns1 goes down and completely stops responding to queries. Assuming localdns2's RTT hasn't expired, the next query to come in might look like this:
- Randomly ordering the servers results in localdns2 first, then localdns1
- The timeout is set to 2.0 seconds.
- localdns2 is in the RTT cache, but its value is equal to 2.0 seconds, so we skip this server.
- localdns1 is in the RTT cache, but its value is lower than 2.0 seconds, so we continue.
- Query is sent to localdns1.
- After 2 seconds, the query hasn't responded, so we set the RTT for localdns1 to 2.0 seconds.
- The timeout is increased to 2.5 seconds.
- localdns2 is in the RTT cache, but its value is lower than 2.5 seconds, so we continue.
- Query is sent to localdns2
- localdns2 responds 2.1 seconds later. We set the RTT to 2.1 seconds for that server, and then return the response.
Note that the query time here would have been slightly over 4.1 seconds in total. This may result in the requesting client seeing a DNS timeout when it first attempts the query.
The next query to come in might look like this:
- Randomly ordering the servers results in localdns1 first, then localdns2
- The timeout is set to 2.0 seconds.
- localdns1 is in the RTT cache, but its value is equal to 2.0 seconds, so we skip this server.
- localdns2 is in the RTT cache, but its value is larger than 2.0 seconds, so we skip this server.
- The timeout is increased to 2.5 seconds
- localdns1 is in the RTT cache, but its value is lower than 2.5 seconds, so we continue.
- Query is sent to localdns1.
- After 2.5 seconds, the server hasn't responded, so we set the RTT to 2.5 seconds for that server.
- localdns2 is in the RTT cache, but its value is lower than 2.5 seconds, so we continue.
- Query is sent to localdns2
- localdns2 responds 2.1 seconds later. We refresh the RTT to 2.1 seconds for that server, and then return the response.
Note that the query time here would have been slightly over 4.6 seconds.
The next query to come in might look like this:
- Randomly ordering the servers results in localdns1 first, then localdns2
- The timeout is set to 2.0 seconds.
- localdns1 is in the RTT cache, but its value is higher than 2.0 seconds, so we skip this server.
- localdns2 is in the RTT cache, but its value is higher than 2.0 seconds, so we skip this server.
- The timeout is increased to 2.5 seconds
- localdns1 is in the RTT cache, but its value is equal to 2.5 seconds, so we skip this server.
- localdns2 is in the RTT cache, but its value is lower than 2.5 seconds, so we continue.
- Query is sent to localdns2
- localdns2 responds 2.1 seconds later. We refresh the RTT to 2.1 seconds for that server, and then return the response.
Note that the query time here would be slightly over 2.1 seconds, so we've now successfully switched preference to localdns2. localdns1 wouldn't be queried again until its RTT cache entry of 2.5 seconds expires 1 hour later.
Now, hypothetically, let's say localdns2 stops responding to queries as well. The next query might look like this:
- Randomly ordering the servers results in localdns1 first, then localdns2
- The timeout is set to 2.0 seconds.
- localdns1 is in the RTT cache, but its value is higher than 2.0 seconds, so we skip this server.
- localdns2 is in the RTT cache, but its value is higher than 2.0 seconds, so we skip this server.
- The timeout is increased to 2.5 seconds
- localdns1 is in the RTT cache, but its value is equal to 2.5 seconds, so we skip this server.
- localdns2 is in the RTT cache, but its value is lower than 2.5 seconds, so we continue.
- Query is sent to localdns2
- After 2.5 seconds, the server hasn't responded, so we set the RTT to 2.5 seconds for that server.
- The timeout is increased to 4.0 seconds
- localdns1 is in the RTT cache, but its value is lower than 4.0 seconds, so we continue.
- Query is sent to localdns1.
- After 4.0 seconds, the query hasn't responded, so we set the RTT to 4.0 seconds for that server.
- localdns2 is in the RTT cache, but its value is lower than 4.0 seconds, so we continue.
- Query is sent to localdns2
- After 4.0 seconds, the query hasn't responded, so we set the RTT to 4.0 seconds for that server.
- The timeout is increased to 4.5 seconds
- localdns1 is in the RTT cache, but its value is lower than 4.5 seconds, so we continue.
- Query is sent to localdns1.
- After 4.5 seconds, the server hasn't responded, so we set the RTT to 4.5 seconds for that server.
- localdns2 is in the RTT cache, but its value is lower than 4.5 seconds, so we continue.
- Query is sent to localdns2
- After 4.5 seconds, the server hasn't responded, so we set the RTT to 4.5 seconds for that server.
- We still don't have a response, and we've reached the end of our timeout values. So, we send one final query to localdns1.
- After 4.5 seconds, the server hasn't responded, and a SERVFAIL response is returned.
Note that this took 24 seconds to happen. However, the RTT value for both servers is now 4.5, so both servers are skipped at every timeout level. Every subsequent query therefore only has to wait for the final 4.5-second attempt before it gets a SERVFAIL.
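The 24-second figure is the sum of every timeout waited in the walkthrough above. A short tally makes the arithmetic explicit; the variable names are made up for this example.

```python
# Each entry is (server queried, seconds waited before giving up),
# in the order the timeouts occur in the last walkthrough.
waits = [
    ("localdns2", 2.5),  # 2.5 s level: localdns1 is skipped, localdns2 times out
    ("localdns1", 4.0),  # 4.0 s level
    ("localdns2", 4.0),
    ("localdns1", 4.5),  # 4.5 s level
    ("localdns2", 4.5),
    ("localdns1", 4.5),  # final attempt against one random server
]

total_wait = sum(seconds for _, seconds in waits)
print(total_wait)  # 24.0
```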