We’ve got a number of Stratum 2 NTP servers intended to serve all our pods. Since my original problem was getting NTP traffic to egress across the public internet, I’ll be setting up two boundary NTP servers at Stratum 3 to proxy the traffic. This will also prevent one pod from adversely affecting others when I misconfigure them.

The template for the NTP servers themselves is fairly straightforward. We only want to honor requests from the local VPC while ignoring all public requests. Additionally, we should only allow administrative commands from the local machine, as opposed to the whole VPC.

driftfile /var/lib/ntp/drift

# don't answer queries from strangers
restrict default ignore
restrict default -6 ignore

# Allow all operations from localhost
restrict 127.0.0.1
restrict ::1

# Service to the local network
restrict ${ipv4_net} mask ${ipv4_mask} nomodify notrap nopeer noquery

# Use metadata host as a time source
server 169.254.169.123 iburst                  # AWS metadata time
restrict 169.254.169.123 nomodify notrap nopeer noquery

# Use official Stratum 2 time servers for our AZs
server us-west-2b.stratum-2.invalid iburst
restrict us-west-2b.stratum-2.invalid nomodify notrap nopeer noquery
server us-west-2c.stratum-2.invalid iburst
restrict us-west-2c.stratum-2.invalid nomodify notrap nopeer noquery

# Enable additional logging.
logconfig =clockall =peerall =sysall =syncall

# Listen only on the primary network interface and loopback.
interface listen eth0
interface listen lo

# CVE-2013-5211 fix
disable monitor

Each NTP server takes about 5-10 minutes to come up and be confident about its time. Confidence is established by hearing back from higher-stratum servers. One may monitor the progress via ntpq -pn on the host, or remotely by providing the host as the last argument. The Reach field in the table is an octal bitmask representing the sample periods in which the local ntpd expected to hear back from a server. Each bit represents one sample and is shifted left for each new sample. The service remembers the last 8 samples, meaning a fully reachable server will show as 377.
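The output looks roughly like the following; the addresses and timings here are purely illustrative. The leading * marks the peer currently selected for synchronization and + marks a candidate.

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*169.254.169.123 10.0.0.5         3 u   36   64  377    0.412    0.021   0.035
+198.51.100.12   203.0.113.4      2 u   41   64  377   12.318   -0.154   0.287
 198.51.100.44   203.0.113.9      2 u   55   64  177   13.002    0.301   0.412

A reach of 177 here means one of the last eight polls went unanswered; 377 means all eight were heard.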

Discovery

Next up is how the other nodes in the cluster will receive their updates. There are two paths one could take here. The first would be modifying the DHCP options set so that, on lease renewal, hosts would use the given NTP servers. I’m not entirely convinced EC2 Linux is configured to honor those options, as I saw no plumbing to rewrite /etc/ntp.conf anywhere. Instead I opted to have the NTP hosts register themselves in a pod’s private DNS under well-known names.

On boot the server connects to the Route 53 service and registers itself under a well-known name according to the availability zone it’s running in, such as ntp.az-e.pod. The only catch with this approach is that ntpd only resolves the address of each server at startup. If there is no AAAA or A record, the server entry is discarded. This is a benefit, as we can place all AZ entries in a single file and missing services will simply be ignored. It’s also a curse: when the NTP hosts are cycled and re-register under new addresses, every node will need its own ntpd restarted to track the change.
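A minimal sketch of that boot-time registration, assuming the instance role allows route53:ChangeResourceRecordSets and that a private hosted zone for .pod already exists (the zone ID and naming below are placeholders):

#!/bin/bash
# Hypothetical registration script run at boot; zone ID and record naming are placeholders.
ZONE_ID="Z0000000EXAMPLE"                   # private hosted zone for .pod
AZ=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
NAME="ntp.az-${AZ: -1}.pod"                 # e.g. us-west-2e -> ntp.az-e.pod

aws route53 change-resource-record-sets --hosted-zone-id "$ZONE_ID" \
  --change-batch "{\"Changes\":[{\"Action\":\"UPSERT\",\"ResourceRecordSet\":
    {\"Name\":\"$NAME\",\"Type\":\"A\",\"TTL\":60,
     \"ResourceRecords\":[{\"Value\":\"$IP\"}]}}]}"

A low TTL keeps a stale record from lingering when a host is replaced, though as noted above the clients still need their ntpd cycled to re-resolve the name.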

Stratum 4 Configuration

A Stratum 4 tier with the above auto-configuration would look like the following. I chose to use the AWS metadata service in addition to the pod sources to ensure both stay close to each other. Since many of our hosts perform AWS operations, knowing the drift relative to AWS’s clocks is helpful.

server 169.254.169.123 iburst                  # AWS metadata time
restrict 169.254.169.123 nomodify notrap nopeer noquery
server ntp.az-a.pod iburst
server ntp.az-b.pod iburst
server ntp.az-c.pod iburst
server ntp.az-d.pod iburst
server ntp.az-e.pod iburst
server ntp.az-f.pod iburst
server ntp.az-g.pod iburst

To restart ntpd you can use the following command: sudo service ntpd restart && sleep 10 && ntpq -pn. The sleep helps ensure the initial timing exchange has completed. The iburst option sends a burst of eight queries, spaced roughly two seconds apart, to bring the node on-line quickly, so sleeping only a few seconds is too short to get useful steady-state information. If a server was not heard from, I’ve heard reports it will show up as .INIT. in the refid field; however, my experience shows it sitting at .STEP. for a long while first.
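To script the wait instead of guessing at a fixed sleep, one rough approach (a sketch, not part of the original setup) is to poll the peer table until at least one upstream server has been heard from:

# Block until some peer's reach field (column 7 of ntpq -pn) is non-zero.
until ntpq -pn | awk 'NR > 2 && $7 != 0 { found = 1 } END { exit !found }'; do
  sleep 2
done
ntpq -pn

This only proves a response has arrived; watching reach climb toward 377 over the next few minutes is still the better signal that the node has settled.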