Vincent Bean

DNS Monitoring for Agencies: Prevent Hijacking & Fix Fast

DNS is the foundation everything else sits on. When it breaks, your client's site goes offline, their email stops working, and their SEO starts bleeding. And by the time you find out, the fix can take hours because DNS propagation is slow. With the right monitoring in place, DNS incidents can be caught in minutes.

Why DNS Is a Single Point of Failure for Agencies

Unlike a server crash, which usually triggers an immediate alert, DNS failures are quiet. A misconfigured record propagates silently across the internet. Your uptime monitor might not catch it right away, and by the time you get a "site down" alert, the wrong records may already be cached by resolvers worldwide with a 24-hour TTL.

Agencies face a compounding version of this problem. You're managing dozens or hundreds of domains across multiple registrars, DNS providers, and account credentials. A client's IT admin changes their MX records without telling you. A registrar auto-renew fails because a credit card expired. A junior dev accidentally clobbers NS records during a migration. None of these people meant to cause an outage. None of them sent you a ticket. You find out when the client calls.

The cost isn't just the incident itself. It's the erosion of trust, the SLA breach explanation, and the 11pm Friday firefight that could have been a 10-minute fix on Friday afternoon if you'd caught it early.

DNS Records Every Agency Must Monitor

If you're going to monitor DNS, you need to know which records actually matter and why.

A and AAAA records are your first line of visibility. An unexpected IP change is either a misconfigured migration or something more sinister; either way, you want to know immediately. It's also useful to be notified when a change was intentional, as confirmation that it went through.

NS records are the most critical record type to watch. A changed NS record means someone else now controls all DNS resolution for that domain. This is the canonical first step in a domain hijacking attack. Any unintentional change to NS records should trigger your highest-severity alert.

MX records control where email goes. An accidental change doesn't just break delivery, it can silently redirect sensitive client communications to the wrong server.

SPF, DKIM, and DMARC are the email authentication records. Drift in these TXT records destroys deliverability and opens an attack surface for phishing. The annoying part is that email often keeps partially working: some messages get through while others are silently rejected.

CNAME and TXT records often back third-party integrations and domain verification tokens. Remove the wrong one and a service breaks.

DNS monitoring sits within a broader hierarchy of checks your agency should be running. If you want to see where it fits in the full picture, the 7 levels of website monitoring is a useful framework for thinking about this.

Detecting DNS Record Drift Before It Causes Outages

Record drift detection starts with a known-good baseline. For every client domain, you need a snapshot of all critical records at a point in time when everything is working correctly. This is your source of truth. Without it, you can't distinguish an authorized change from an unauthorized one.

The monitoring approach itself is straightforward: periodic checks against that baseline, comparing current values to expected values.
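The comparison step can be sketched in a few lines. This is a minimal illustration of baseline-vs-current drift detection; the record values are made up, and in practice you'd populate `current` from a live resolver query (e.g. via dnspython or by shelling out to `dig`):

```python
def detect_drift(baseline: dict, current: dict) -> dict:
    """Compare current DNS records against a known-good baseline.

    Both dicts map record type -> list of values, e.g.
    {"A": ["203.0.113.10"], "NS": ["ns1.example.net."]}.
    Returns only the record types whose values changed.
    """
    drift = {}
    for rtype in set(baseline) | set(current):
        expected = sorted(baseline.get(rtype, []))
        actual = sorted(current.get(rtype, []))
        if expected != actual:
            drift[rtype] = {"expected": expected, "actual": actual}
    return drift

# Illustrative data: the A record drifted, the NS records did not.
baseline = {"A": ["203.0.113.10"], "NS": ["ns1.example.net.", "ns2.example.net."]}
current  = {"A": ["198.51.100.7"], "NS": ["ns1.example.net.", "ns2.example.net."]}
print(detect_drift(baseline, current))
# -> {'A': {'expected': ['203.0.113.10'], 'actual': ['198.51.100.7']}}
```

Sorting the value lists before comparing avoids false positives when a provider returns the same record set in a different order.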

Common scenarios I've encountered:

  • A client's IT manager adds a new MX record for a third-party email service without removing the old ones, breaking mail routing.

  • A registrar auto-renew fails because the payment wasn't made, causing NS records to flip to the registrar's parking nameservers.

  • A CDN migration leaves stale CNAME records pointing at a decommissioned edge node.

  • Someone updates the SPF record and accidentally removes an allowed sender.

Not all changes are equal. Any NS record change should fire an immediate alert to your emergency channel. An A record change needs urgent attention. A new TXT record is lower priority but still worth logging. This severity tiering matters when you're monitoring 50+ domains: alert fatigue will cause you to start ignoring notifications, which defeats the purpose entirely.

Vigilant's DNS monitoring does exactly this: it continuously watches all records across your client domains and alerts you the moment anything changes, so you catch drift before it becomes an outage. Smart notifications let you choose exactly which record types you want to be notified about.

Domain Hijacking Prevention Checklist for Agencies

Monitoring detects problems. Prevention stops them from happening. Here's the checklist I'd apply to every client domain under your management.

Registrar hardening:

  • Enable registrar lock (clientTransferProhibited) on every domain. This prevents unauthorized transfers without your explicit approval.

  • Enforce two-factor authentication on all registrar accounts. Store credentials in a team password manager with audit logging - you need to know who accessed what and when.

  • Enable DNSSEC where the registrar and DNS provider both support it. Cryptographically signed records prevent cache poisoning attacks even if an attacker gets between a resolver and your authoritative server.

Domain inventory:
Maintain a single source of truth for your entire domain portfolio. At minimum, track: registrar, expiry date, NS provider, auto-renew status, account owner, and an emergency contact for each domain. A spreadsheet works fine. Just don't keep this information distributed across team members' heads.

Expiry management:
Set calendar reminders at 90, 60, and 30 days before expiry for every domain. Auto-renew is not a safety net: payment methods expire, billing emails go to spam, and credit cards get replaced. I've seen a client's primary domain lapse because their registrar was sending renewal notices to an email address that no longer existed.
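Generating those reminder dates is trivial to automate against your domain inventory. A minimal sketch, assuming you track expiry dates somewhere machine-readable (the date below is made up):

```python
from datetime import date, timedelta

def reminder_dates(expiry: date, offsets=(90, 60, 30)) -> list[date]:
    """Return the dates to send reminders, given a domain's expiry date."""
    return [expiry - timedelta(days=d) for d in offsets]

# Illustrative expiry date for a hypothetical client domain.
for reminder in reminder_dates(date(2026, 6, 1)):
    print(reminder)
# 2026-03-03
# 2026-04-02
# 2026-05-02
```

Feed the output into whatever already interrupts your team reliably: calendar invites, a ticketing system, or the same alert channels as your DNS monitoring.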

Access control:
Audit registrar account access quarterly. Remove former employees. Verify emergency contacts are still valid. Critically: ensure your agency retains admin access even when the client nominally holds the registrar account. You don't want to discover during an incident that only the client's ex-employee had the login.

Ownership documentation:
Document the authoritative chain for every domain: who owns the registrar account, who controls the NS provider, and who manages individual records. This sounds bureaucratic until 2am when something breaks and nobody knows who has the Cloudflare login.

Alert Routing by Client: Scaling DNS Monitoring Across Your Portfolio

A single Slack channel for all DNS alerts across your entire client portfolio is a great way to train your team to ignore notifications. When everything goes to one place, alert fatigue sets in fast.

The fix is routing alerts to the right people for the right domains. Group domains by client or project, then route DNS alerts to the account manager or technical lead responsible for that relationship. A DNS change for Client A should appear in Client A's Slack channel or email thread, not in a firehose that includes Client B through Client Z.

Build escalation paths. If the primary contact hasn't acknowledged a high-severity alert within 15 minutes, escalate to the team lead. After another 15 minutes, escalate to the agency owner. This isn't bureaucracy, it's insurance against the scenario where the right person is in a meeting or on holiday.
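The routing and escalation logic above can be sketched in a few lines. The channel names, role names, and 15-minute window are illustrative assumptions; a real setup would live in your monitoring tool's configuration rather than code:

```python
# Hypothetical per-client routing table: domain -> alert channel.
ROUTES = {
    "client-a.com": "#client-a-alerts",
    "client-b.com": "#client-b-alerts",
}

# Escalation chain, lowest tier first.
ESCALATION = ["account_manager", "team_lead", "agency_owner"]

def route(domain: str) -> str:
    # Unmapped domains fall through to a catch-all channel so nothing
    # is silently dropped.
    return ROUTES.get(domain, "#dns-alerts-unrouted")

def escalation_target(minutes_unacknowledged: int) -> str:
    # Escalate one level for every 15 unacknowledged minutes,
    # capped at the top of the chain.
    level = min(minutes_unacknowledged // 15, len(ESCALATION) - 1)
    return ESCALATION[level]

print(route("client-a.com"))   # #client-a-alerts
print(escalation_target(20))   # team_lead
```

The key design point is the catch-all: a newly onboarded domain that nobody has mapped yet should still alert somewhere visible.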

Some clients also want direct visibility. Vigilant's client pages feature lets you give each client a branded dashboard showing their domain health status in real time, and its notification system routes alerts to Slack, Email, or Discord - so you can implement this tiered approach without building custom tooling.

DNS Incident Runbook: From Alert to Resolution in Minutes

When an alert fires, you don't want to be making decisions from scratch. Here's the runbook I'd give to any agency technical team.

Step 1 - Check. Query multiple resolvers to confirm the alert is real. Check `8.8.8.8`, `1.1.1.1`, and a regional resolver in your client's primary market. A single resolver anomaly might be a monitoring false positive. Consistent anomalies across multiple resolvers are not.

```
dig @8.8.8.8 govigilant.io A
dig @1.1.1.1 govigilant.io A
dig @9.9.9.9 govigilant.io NS
```

Step 2 - Classify. Determine whether this is unauthorized (potential hijack), accidental (misconfiguration), or expected (migration in progress). Check with the team and client before assuming the worst.

Step 3 - Contain. If the change appears unauthorized: log into the registrar immediately, re-enable registrar lock, change the account password, revoke any API keys you don't recognize, and contact the registrar's abuse team. Move fast: time matters when a domain is being hijacked.

Step 4 - Remediate. Restore correct records from your known-good baseline. Lower the TTL on affected records to 60-300 seconds before making changes. This means resolvers will pick up your fix much faster than if you're working against a 3600-second cache. Vigilant keeps a history of every monitored DNS record.

Step 5 - Verify propagation. Don't assume the fix is live globally just because it's correct at the authoritative server. Use `dig +trace` to follow the full resolution chain, and check a propagation tool across multiple geographic regions.

```
dig +trace govigilant.io
dig +trace govigilant.io MX
```

Step 6 - Post-incident. Document what happened, how it was detected, the time to resolution, and what preventive measures you're adding. Write a client-appropriate summary, not a technical autopsy, but enough to show you caught the problem and resolved it professionally.

For a broader view of preventing all types of downtime and not just DNS, this guide on how to prevent website downtime covers the full picture.

TTL Planning and Propagation Strategy for Agencies

TTL values are one of those things that seem like a minor detail until you're in the middle of an incident and realize your fix won't propagate for another 23 hours.

TTL is the number of seconds a resolver is allowed to cache a record before asking for a fresh copy. A record with a 3600-second TTL might be cached for up to an hour by any resolver that already has it. Your change at the authoritative server doesn't matter to those resolvers until their cache expires.

Recommended TTL values by record type:

  • NS records: 86400s (24 hours). These rarely change intentionally, so stability is the priority.

  • A/AAAA records: 300-3600s. Lower end if you expect migrations; higher end for stable production sites.

  • MX records: 3600s. Email routing doesn't change often, but you want reasonably fast propagation if it does.

  • SPF/DKIM/DMARC TXT records: 3600s.

The pre-migration TTL strategy is something I'd treat as mandatory practice: lower TTLs to 300 seconds at least 48 hours before any planned DNS change. This way, if something goes wrong, the blast radius is minutes rather than hours. Once the migration is confirmed and stable, raise TTLs back to production values to reduce query load on your authoritative servers.
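The arithmetic behind that 48-hour rule is worth making explicit. A minimal worked example (all numbers illustrative): resolvers that cached the record under the old TTL keep serving it until that TTL expires, so the lowered TTL only takes full effect after one old-TTL window has passed:

```python
OLD_TTL = 3600  # seconds: stable production TTL
NEW_TTL = 300   # seconds: pre-migration TTL

def ttl_windows(old_ttl: int, new_ttl: int) -> dict:
    return {
        # After lowering the TTL, wait at least one old-TTL window:
        # until then, some resolvers still hold the old cached copy.
        "safe_to_migrate_after_s": old_ttl,
        # Once migrated, a bad change can be rolled back worldwide
        # within one new-TTL window.
        "rollback_blast_radius_s": new_ttl,
    }

print(ttl_windows(OLD_TTL, NEW_TTL))
# {'safe_to_migrate_after_s': 3600, 'rollback_blast_radius_s': 300}
```

With a 3600-second TTL the lowering takes effect everywhere within an hour, so 48 hours is a generous margin; the payoff is that any rollback during the migration propagates in about five minutes instead of an hour.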

What I see agencies get wrong most often: setting TTLs permanently low because "it feels safer." It doesn't; it just generates unnecessary query load and, in some cases, slows down resolution. The right TTL for a stable, rarely-changing record is high. The right TTL for a record you're about to change is low.

Putting It All Together: A DNS Monitoring Stack for Your Agency

The framework here is three pillars: prevention (registrar hardening, access control, DNSSEC), detection (automated record monitoring, drift alerts, propagation checks), and response (incident runbook, escalation paths, post-mortems).

The practical stack that covers all three:

  • A registrar with API access, lock support, and solid 2FA.

  • A DNS hosting provider with audit logs so you can see who changed what and when.

  • A monitoring tool that continuously watches records across your entire portfolio and routes alerts intelligently.

  • A documented incident management process your whole team knows about before anything breaks.

I built Vigilant as an open-source, self-hostable monitoring platform partly because this kind of comprehensive monitoring across a client portfolio shouldn't require stitching together five different tools. It monitors DNS records across your client domains, fires alerts via Slack, Email, or Discord, and gives you branded client-facing status pages and PDF reports so proactive DNS monitoring becomes a visible, tangible value-add in your client relationships rather than invisible infrastructure work.

The ROI argument is simple: one prevented DNS incident per year easily justifies the time to set this up properly. And with sensible defaults, setup time is minimal.

Start this week. Take your top 10 highest-value client domains. Establish baselines for all critical records. Set up monitoring. Create a runbook your team can actually follow. That's the whole thing.

DNS disasters aren't inevitable. They're the predictable result of managing domains reactively. Monitor proactively, and they become rare exceptions rather than regular crises.

Start Monitoring within minutes.

Enter a client's domain and see what Vigilant monitors; setup takes just 2 minutes per site.
Vigilant comes with sensible defaults so onboarding new clients is effortless.
