Performance¶
Ferrous DNS is engineered from the ground up for throughput. Every component in the query path was designed with latency and allocation as first-class constraints — not afterthoughts.
This page explains how the system achieves its numbers, layer by layer.
The Hot Path¶
Every DNS query traverses this sequence. The goal: respond in microseconds when cached, in milliseconds when not.
UDP packet received
│
▼
L1 cache lookup ~1-3µs ← per-thread, zero locks
│ miss
▼
L2 cache lookup ~10-20µs ← shared, per-shard lock
│ miss
▼
In-flight check ~200ns ← is someone already fetching this?
│ nobody fetching
▼
Upstream query ~1-50ms ← DoH / DoT / DoQ / UDP
│
▼
Write to L2 + L1
│
▼
Send UDP response
Nothing in this path allocates memory for cache hits. No global locks. No expensive system calls for timing.
L1/L2 Hierarchical Cache¶
L1 — Per-Thread, Lock-Free¶
Each worker thread has its own private L1 cache. Because it is private to the thread, there is zero synchronization overhead.
- No locks, no contention -- direct memory access
- Holds the hottest ~100-500 entries per thread
- L1 hit overhead: ~1-3µs P99
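The idea can be sketched with Rust's `thread_local!` (a hedged illustration — the types and names here are hypothetical, not Ferrous DNS internals; the point is that thread-private state needs no locks):

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// Illustrative sketch only: because the map is private to the thread,
// lookups are plain borrows -- no atomics, no contention.
thread_local! {
    static L1: RefCell<HashMap<String, Vec<u8>>> = RefCell::new(HashMap::new());
}

fn l1_get(domain: &str) -> Option<Vec<u8>> {
    // Direct access to thread-local state; no synchronization at all.
    L1.with(|cache| cache.borrow().get(domain).cloned())
}

fn l1_put(domain: &str, response: Vec<u8>) {
    L1.with(|cache| cache.borrow_mut().insert(domain.to_string(), response));
}

fn main() {
    l1_put("example.com", vec![1, 2, 3]);
    assert_eq!(l1_get("example.com"), Some(vec![1, 2, 3]));
    assert_eq!(l1_get("other.com"), None); // miss falls through to L2
    println!("ok");
}
```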
L2 — Shared, Sharded Cache¶
L2 is a shared cache split into independent shards (default: 4x CPU core count). Each shard has its own lock, so queries for different domains never block each other.
16-core machine → 64 shards
Query "google.com" → hash → shard #17 → lock shard #17 only
Query "reddit.com" → hash → shard #31 → lock shard #31 only
↑ never blocks each other
Under real-world load with hundreds of distinct active domains, contention is effectively zero.
- Capacity: up to 200,000 entries (configurable)
- L2 hit overhead: ~10-20µs P99
- Optimized hash function for fast domain name lookups
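Shard selection can be sketched as follows (hedged: the shard count, value types, and hash function here are illustrative, not the server's actual code):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

// Illustrative sharded cache: each shard carries its own lock, so queries
// for domains that hash to different shards never contend.
struct ShardedCache {
    shards: Vec<Mutex<HashMap<String, Vec<u8>>>>,
}

impl ShardedCache {
    fn new(num_shards: usize) -> Self {
        ShardedCache {
            shards: (0..num_shards).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    fn shard_for(&self, domain: &str) -> usize {
        let mut h = DefaultHasher::new();
        domain.hash(&mut h);
        h.finish() as usize % self.shards.len()
    }

    fn insert(&self, domain: &str, response: Vec<u8>) {
        // Lock only the shard this domain hashes to; the rest stay free.
        let shard = self.shard_for(domain);
        self.shards[shard].lock().unwrap().insert(domain.to_string(), response);
    }

    fn get(&self, domain: &str) -> Option<Vec<u8>> {
        let shard = self.shard_for(domain);
        self.shards[shard].lock().unwrap().get(domain).cloned()
    }
}

fn main() {
    let cache = ShardedCache::new(64); // e.g. 16 cores x 4
    cache.insert("google.com", vec![0x7F]);
    assert_eq!(cache.get("google.com"), Some(vec![0x7F]));
    assert_eq!(cache.get("reddit.com"), None);
    println!("ok");
}
```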
Why Two Levels?¶
L1 absorbs the hottest queries (top ~0.1% of domains queried thousands of times per minute) without touching shared memory at all. L2 handles the long tail. Together they keep the cache hit rate above 95% for typical networks.
Fast Negative Lookups¶
A significant fraction of DNS queries hit domains that are simply not in the blocklist. Without a quick pre-check, each query would trigger a full blocklist lookup across potentially millions of entries.
Ferrous DNS uses a probabilistic filter that answers one question almost instantly: "Is this domain definitely not in the blocklist?"
Query: "example.com"
│
▼
Quick pre-check:
"Could this be blocked?" → NO → skip all blocklist checks instantly
→ YES → run full blocklist lookup (possible match)
- A "no" answer is guaranteed correct -- no blocked domain is ever missed
- False positive rate is kept very low
- Concurrent-safe with no locking
- Negligible overhead per lookup, regardless of blocklist size
For the ~99% of queries hitting common non-blocked domains, the entire blocklist engine adds negligible overhead per query.
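A minimal sketch of such a pre-check, assuming a Bloom-filter design (the page doesn't name the exact filter type; the size and double-hashing scheme below are assumptions):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustrative Bloom filter: "false" answers are definitive, "true" answers
// only mean the full blocklist must be consulted.
struct NegativeFilter {
    bits: Vec<u64>,
    nbits: usize,
}

impl NegativeFilter {
    fn new(nbits: usize) -> Self {
        NegativeFilter { bits: vec![0; (nbits + 63) / 64], nbits }
    }

    fn positions(&self, domain: &str) -> [usize; 2] {
        let mut pos = [0usize; 2];
        for i in 0..2usize {
            let mut h = DefaultHasher::new();
            (i as u64).hash(&mut h); // seed so the two probes differ
            domain.hash(&mut h);
            pos[i] = h.finish() as usize % self.nbits;
        }
        pos
    }

    fn add(&mut self, domain: &str) {
        for p in self.positions(domain) {
            self.bits[p / 64] |= 1u64 << (p % 64);
        }
    }

    /// `false` means "definitely not in the blocklist" -- skip all checks.
    fn might_be_blocked(&self, domain: &str) -> bool {
        self.positions(domain)
            .iter()
            .all(|&p| self.bits[p / 64] & (1u64 << (p % 64)) != 0)
    }
}

fn main() {
    let mut filter = NegativeFilter::new(1 << 16);
    filter.add("ads.tracker.example");
    // A blocked domain is never missed: no false negatives by construction.
    assert!(filter.might_be_blocked("ads.tracker.example"));
    println!("ok");
}
```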
In-Flight Coalescing (Cache Stampede Prevention)¶
Without coalescing, a cache miss for a popular domain under high load triggers dozens of simultaneous upstream requests. Imagine 50 clients querying api.github.com at the moment the cache entry expires:
Without coalescing:
Client 1 → cache miss → upstream request
Client 2 → cache miss → upstream request ← 50 redundant upstream requests
Client 3 → cache miss → upstream request
...
Client 50 → cache miss → upstream request
With coalescing:
Client 1 → cache miss → becomes "leader" → sends 1 upstream request
Client 2 → cache miss → sees in-flight entry → waits on channel
Client 3 → cache miss → sees in-flight entry → waits on channel
...
Client 50 → upstream responds → all 50 clients receive the answer simultaneously
The first query to see a cache miss becomes the "leader" and starts the upstream request. All subsequent requests for the same domain wait on a notification channel and receive the response the moment it arrives -- at zero additional upstream cost.
If the upstream request fails or is cancelled, all waiting clients are notified immediately and the tracking entry is cleaned up automatically.
Under load with many clients hitting the same popular domain, this eliminates the thundering-herd problem entirely and reduces upstream traffic by orders of magnitude.
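The leader/waiter mechanics can be sketched with std primitives (hedged: the real server is async and likely uses notification channels; this thread-and-condvar version shows the same shape):

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Barrier, Condvar, Mutex};
use std::thread;
use std::time::Duration;

type Slot = Arc<(Mutex<Option<Vec<u8>>>, Condvar)>;

struct Coalescer {
    in_flight: Mutex<HashMap<String, Slot>>,
}

impl Coalescer {
    fn new() -> Self {
        Coalescer { in_flight: Mutex::new(HashMap::new()) }
    }

    fn resolve(&self, domain: &str, upstream: impl FnOnce() -> Vec<u8>) -> Vec<u8> {
        // Check the in-flight table; the first miss becomes the leader.
        let (slot, leader) = {
            let mut map = self.in_flight.lock().unwrap();
            if let Some(s) = map.get(domain) {
                (s.clone(), false) // someone is already fetching this
            } else {
                let s: Slot = Arc::new((Mutex::new(None), Condvar::new()));
                map.insert(domain.to_string(), s.clone());
                (s, true)
            }
        };
        if leader {
            let answer = upstream(); // exactly one upstream request
            *slot.0.lock().unwrap() = Some(answer.clone());
            slot.1.notify_all(); // wake every waiting client at once
            self.in_flight.lock().unwrap().remove(domain);
            answer
        } else {
            // Waiters block until the leader publishes the answer.
            let mut guard = slot.0.lock().unwrap();
            while guard.is_none() {
                guard = slot.1.wait(guard).unwrap();
            }
            guard.clone().unwrap()
        }
    }
}

/// Demo: 50 clients miss simultaneously; returns the upstream call count.
fn demo() -> usize {
    let c = Arc::new(Coalescer::new());
    let calls = Arc::new(AtomicUsize::new(0));
    let barrier = Arc::new(Barrier::new(50));
    let handles: Vec<_> = (0..50)
        .map(|_| {
            let (c, calls, barrier) = (c.clone(), calls.clone(), barrier.clone());
            thread::spawn(move || {
                barrier.wait(); // all clients miss at the same instant
                c.resolve("api.github.com", || {
                    calls.fetch_add(1, Ordering::SeqCst);
                    thread::sleep(Duration::from_millis(200)); // slow upstream
                    vec![42]
                })
            })
        })
        .collect();
    for h in handles {
        assert_eq!(h.join().unwrap(), vec![42]); // every client gets the answer
    }
    calls.load(Ordering::SeqCst)
}

fn main() {
    let n = demo();
    assert!(n < 50); // typically 1: the herd collapses onto one request
    println!("upstream calls: {n}");
}
```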
Optimistic Prefetch (Background Refresh)¶
When a popular cache entry's TTL drops below a configurable threshold (default: 75% consumed), a background task proactively refreshes it before it expires.
Entry TTL: 3600s
0s ─────────────────────────────────────── 3600s
                    ▲                         ▲
  background refresh triggered       would expire here
  at 2700s (75% of TTL consumed)     (never reached — already refreshed)
Clients continue receiving cached responses with zero wait while the background task fetches a fresh answer. The entry is never cold for active domains.
Eligibility criteria (all must be met):
| Criterion | Config key | Default |
|---|---|---|
| Minimum total hits | cache_min_frequency | 10 |
| Minimum hits per minute | cache_min_hit_rate | 2.0 |
| Last accessed within | cache_access_window_secs | 43200 (12h) |
| Fraction of TTL consumed exceeds | cache_refresh_threshold | 0.75 |
This keeps the effective hit rate close to 100% for actively-used domains as their TTLs cycle naturally.
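The criteria above combine into a single predicate. A hedged sketch — the struct fields and the consumed-TTL reading of the threshold are assumptions; only the config keys come from this page:

```rust
// Field and key names mirror the documented config; the code is illustrative.
struct EntryStats {
    total_hits: u64,
    hits_per_minute: f64,
    secs_since_last_access: u64,
    ttl_consumed_fraction: f64, // 0.0 = fresh, 1.0 = expired
}

struct RefreshConfig {
    cache_min_frequency: u64,      // default 10
    cache_min_hit_rate: f64,       // default 2.0
    cache_access_window_secs: u64, // default 43_200 (12h)
    cache_refresh_threshold: f64,  // default 0.75 of TTL consumed
}

fn should_prefetch(s: &EntryStats, c: &RefreshConfig) -> bool {
    s.total_hits >= c.cache_min_frequency
        && s.hits_per_minute >= c.cache_min_hit_rate
        && s.secs_since_last_access <= c.cache_access_window_secs
        && s.ttl_consumed_fraction >= c.cache_refresh_threshold
}

fn main() {
    let cfg = RefreshConfig {
        cache_min_frequency: 10,
        cache_min_hit_rate: 2.0,
        cache_access_window_secs: 43_200,
        cache_refresh_threshold: 0.75,
    };
    let hot = EntryStats {
        total_hits: 500, hits_per_minute: 8.0,
        secs_since_last_access: 30, ttl_consumed_fraction: 0.8,
    };
    let idle = EntryStats {
        total_hits: 3, hits_per_minute: 0.1,
        secs_since_last_access: 90_000, ttl_consumed_fraction: 0.8,
    };
    assert!(should_prefetch(&hot, &cfg));
    assert!(!should_prefetch(&idle, &cfg));
    println!("ok");
}
```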
LFU-K Eviction with Sliding Window¶
When the cache reaches capacity, an eviction policy decides which entries to remove. Simple LRU can discard a domain queried 10,000 times that happened to be quiet for the last two minutes — replaced by one queried twice a minute ago.
Ferrous DNS uses LFU-K: it tracks the K most recent access timestamps per entry and computes a sliding-window frequency score from them.
This gives weight to sustained, frequent access rather than historical patterns. A domain popular 6 hours ago but idle since scores lower than one queried 5 times in the last minute.
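Sketched in Rust (hedged: the actual scoring formula isn't documented here; this illustrative version counts the tracked accesses that fall inside a sliding window and normalizes by the window length):

```rust
use std::collections::VecDeque;

const K: usize = 8; // timestamps kept per entry (illustrative value)

struct AccessHistory {
    timestamps: VecDeque<u64>, // seconds; most recent at the back
}

impl AccessHistory {
    fn new() -> Self {
        AccessHistory { timestamps: VecDeque::with_capacity(K) }
    }

    fn record(&mut self, now: u64) {
        if self.timestamps.len() == K {
            self.timestamps.pop_front(); // keep only the K most recent
        }
        self.timestamps.push_back(now);
    }

    /// Accesses inside the window, per second of window.
    fn score(&self, now: u64, window_secs: u64) -> f64 {
        let recent = self
            .timestamps
            .iter()
            .filter(|&&t| now.saturating_sub(t) <= window_secs)
            .count();
        recent as f64 / window_secs as f64
    }
}

fn main() {
    let now = 100_000u64;
    let mut popular_long_ago = AccessHistory::new();
    for t in 0..8 {
        popular_long_ago.record(t); // heavy traffic hours ago, idle since
    }
    let mut active_now = AccessHistory::new();
    for t in (now - 60)..(now - 55) {
        active_now.record(t); // 5 hits in the last minute
    }
    // Recent sustained access outranks stale popularity.
    assert!(active_now.score(now, 300) > popular_long_ago.score(now, 300));
    println!("ok");
}
```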
Three eviction strategies are available:
| Strategy | When to use |
|---|---|
| `hit_rate` (default) | Mixed workloads — preserves the most actively queried entries |
| `lfu` | Stable workloads with predictable query distribution |
| `lru` | Bursty workloads with strong temporal locality |
Zero-Allocation Hot Path¶
Go-based DNS servers (Blocky, AdGuard Home) suffer from garbage collector pause spikes under load. Rust eliminates GC entirely. On top of that, Ferrous DNS enforces a strict no-allocation policy on the cache hit path:
- Shared domain strings -- domain names are stored once and shared by reference. Copying a reference costs ~1ns with no memory allocation
- Stack-allocated record sets -- most DNS responses contain 1-4 records, which are stored on the stack without heap allocation
- Zero-copy case comparison -- DNS names are compared case-insensitively without creating temporary copies
- Fast hashing -- an optimized hash function for short strings (domain names) provides ~3x faster lookups than the standard approach
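Two of these techniques can be illustrated with plain std types (a sketch of the idea, not the server's actual internals):

```rust
use std::sync::Arc;

// `Arc<str>` makes sharing a domain name a pointer copy plus a counter bump,
// and `eq_ignore_ascii_case` compares in place with no lowercase copies.
fn names_equal_ignore_case(a: &str, b: &str) -> bool {
    a.eq_ignore_ascii_case(b) // walks the bytes directly; zero allocation
}

fn main() {
    let stored: Arc<str> = Arc::from("example.com");
    let shared = Arc::clone(&stored); // ~1ns: copies no string bytes
    assert!(Arc::ptr_eq(&stored, &shared)); // same underlying string
    assert!(names_equal_ignore_case(&stored, "EXAMPLE.com"));
    println!("ok");
}
```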
Low-Overhead Timing¶
Measuring cache hit latency requires a fast timer. Standard system clock calls cost ~20ns on x86_64, which adds measurable overhead to ~1µs cache hit operations.
Ferrous DNS reads the CPU's hardware timestamp counter directly, costing only ~1-5ns -- roughly 4-20x cheaper than a standard clock call. On ARM platforms, it falls back to a fast kernel clock with ~10-15ns overhead.
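A hedged sketch of such a timer — RDTSC on x86_64 with a monotonic-clock fallback elsewhere; production code must additionally calibrate the TSC frequency to convert ticks to nanoseconds:

```rust
fn fast_ticks() -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        // SAFETY: RDTSC reads a counter register and has no side effects.
        unsafe { core::arch::x86_64::_rdtsc() }
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        // Fallback: nanoseconds since first call via the monotonic clock.
        use std::sync::OnceLock;
        use std::time::Instant;
        static START: OnceLock<Instant> = OnceLock::new();
        START.get_or_init(Instant::now).elapsed().as_nanos() as u64
    }
}

fn main() {
    let a = fast_ticks();
    let b = fast_ticks();
    assert!(b >= a); // monotonic on one core with an invariant TSC
    println!("delta: {} ticks", b - a);
}
```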
Async Query Log Pipeline¶
Logging every DNS query to SQLite without blocking the resolver requires a carefully designed pipeline. The DNS handler never waits for disk I/O.
DNS handler (hot path)
│
│ try_send() ← non-blocking; drops entry if channel is full
▼
Async channel (10k–200k capacity)
│
│ batch read every 200ms (up to 2,000 entries per cycle)
▼
Background flush task
│
│ single INSERT transaction per batch
▼
SQLite (WAL mode)
Batching is critical: a single transaction with 2,000 rows is ~100x faster than 2,000 individual transactions. At very high query rates, query_log_sample_rate lets you log 1 in N queries to cap write volume without losing visibility.
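The pipeline's shape can be sketched with std primitives (hedged: the real server uses an async channel and writes to SQLite; the batch size and flush interval below mirror the diagram):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Feed 500 entries through a bounded channel and batch-drain them;
/// returns the number of entries the background task "wrote".
fn run_pipeline() -> usize {
    let (tx, rx) = mpsc::sync_channel::<String>(10_000); // bounded queue

    let writer = thread::spawn(move || {
        let mut total = 0usize;
        loop {
            thread::sleep(Duration::from_millis(200)); // flush interval
            let mut batch = Vec::new();
            let mut disconnected = false;
            while batch.len() < 2_000 {
                match rx.try_recv() {
                    Ok(entry) => batch.push(entry),
                    Err(mpsc::TryRecvError::Empty) => break,
                    Err(mpsc::TryRecvError::Disconnected) => {
                        disconnected = true;
                        break;
                    }
                }
            }
            // Real code: one INSERT transaction covering the whole batch.
            total += batch.len();
            if disconnected {
                break;
            }
        }
        total
    });

    // Hot path: try_send never blocks; a full channel just drops the entry.
    for i in 0..500 {
        let _ = tx.try_send(format!("query {i}"));
    }
    drop(tx); // demo only: lets the writer thread terminate
    writer.join().unwrap()
}

fn main() {
    assert_eq!(run_pipeline(), 500);
    println!("logged 500 entries in batches");
}
```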
Optimized Memory Allocator¶
Ferrous DNS uses a high-performance memory allocator optimized for server workloads:
- 2-3x faster than the default system allocator for small, short-lived allocations
- Per-thread memory pools minimize cross-thread contention
- Reduces long-term memory fragmentation under sustained server load
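The page doesn't name the allocator. In Rust servers this class of allocator is typically wired in with a one-line global-allocator declaration; a hypothetical sketch assuming the `mimalloc` crate as a dependency:

```rust
// Hypothetical wiring; the document does not say which allocator
// Ferrous DNS actually uses.
use mimalloc::MiMalloc;

#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;
```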
Parallel Upstream Strategy¶
On a cache miss, Ferrous DNS queries multiple upstream servers simultaneously and returns the fastest response:
Cache miss for "example.com"
│
├──► DoQ dns.adguard-dns.com responds in 8ms
├──► DoH cloudflare-dns.com responds in 12ms ← discarded
└──► DoH dns.google responds in 6ms ← returned
Client receives the answer in 6ms instead of waiting for the slowest upstream
This eliminates the tail-latency risk of any single upstream being momentarily slow. Upstream health is monitored continuously and unhealthy servers are excluded automatically.
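The "first answer wins" shape can be sketched with threads and a channel (hedged: the real server races async DoH/DoT/DoQ requests; latencies here are simulated, with exaggerated margins to keep the demo deterministic):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Query all upstreams in parallel; return whichever answers first.
fn race() -> &'static str {
    let (tx, rx) = mpsc::channel();
    for (name, latency_ms) in [
        ("dns.adguard-dns.com", 80u64),
        ("cloudflare-dns.com", 120),
        ("dns.google", 10),
    ] {
        let tx = tx.clone();
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(latency_ms)); // simulated RTT
            let _ = tx.send(name); // late responders' sends are simply ignored
        });
    }
    drop(tx);
    rx.recv().unwrap() // the first message on the channel wins
}

fn main() {
    assert_eq!(race(), "dns.google");
    println!("fastest upstream: dns.google");
}
```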
UDP Buffer Tuning¶
The OS UDP receive buffer determines how many packets the kernel queues before the application processes them. With default buffer sizes, large query bursts overflow the kernel queue and are dropped silently.
Ferrous DNS sets enlarged socket buffers (8 MB send and receive) at startup. This directly improves the "queries lost" metric under peak load.
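Note that on Linux the kernel silently clamps buffer requests to the `rmem_max`/`wmem_max` ceilings, so those limits may also need raising for the full 8 MB to take effect (real sysctl names; run as root):

```shell
# Raise the per-socket buffer ceilings so an 8 MB request isn't clamped
sysctl -w net.core.rmem_max=8388608
sysctl -w net.core.wmem_max=8388608
```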
Production Build¶
For maximum performance, always build with native CPU optimizations:
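A typical invocation, assuming a standard cargo setup (the exact command isn't shown on this page):

```shell
# Build optimized for the local CPU; adjust if the project ships its own profile
RUSTFLAGS="-C target-cpu=native" cargo build --release
```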
This enables AVX2/SSE4 vectorized string operations, CPU-specific branch prediction, and proper inlining of intrinsics like RDTSC. The gap between a generic --release build and a target-cpu=native build is measurable in the hashing and comparison code.
Summary¶
| Optimization | Benefit |
|---|---|
| L1 per-thread cache | Lock-free hits, ~1-3µs P99 |
| L2 sharded cache | Near-zero contention, ~10-20µs P99 |
| Fast negative lookups | Non-blocked domains skip blocklist checks instantly |
| In-flight coalescing | N identical cache-miss queries collapse into 1 upstream request |
| Optimistic prefetch | Hot entries never expire; near-100% hit rate |
| Frequency-based eviction | Preserves most-active entries under memory pressure |
| Zero-allocation hot path | No memory allocation on cache hit path |
| Hardware timestamp counter | Hot-path timing at ~1-5ns vs ~20ns syscall |
| Async query log pipeline | Query logging never blocks the resolver |
| Optimized memory allocator | 2-3x faster allocation than system default |
| Parallel upstream strategy | Lowest cache-miss latency, transparent failover |
| UDP buffer tuning | Absorbs large bursts without packet loss |
Benchmark Results¶
Host: Intel Core i9-9900KF @ 3.60GHz | 8 cores / 16 threads / 46 GB RAM | Arch Linux | Kernel 6.18.16-1-lts
Tool: dnsperf 2.14.0 | 60s per server | 10 concurrent clients | 187 domains (A, AAAA, MX, TXT, NS)
Docker config (identical for all servers):
| Setting | Value |
|---|---|
| CPUs | cpuset: 0-15 — 16 threads |
| Network | host mode |
| Upstreams | plain UDP 8.8.8.8 / 1.1.1.1 (parallel) |
| Cache | enabled |
| Blocking / denylists | disabled — isolates raw forwarding performance |
| Rate limiting | disabled |
| Log level | info |
| Query logging (disk I/O) | disabled |
| Server | QPS | Avg Lat | P99 Lat | Completed | Lost |
|---|---|---|---|---|---|
| ⚡ Unbound (C) | 1,018,691 | 0.74ms | 2.05ms | 100.00% | 0.00% |
| ⚡ PowerDNS (C++) | 797,600 | 1.11ms | 3.47ms | 100.00% | 0.00% |
| 🦀 ferrous-dns | 511,413 | 1.89ms | 47.50ms | 100.00% | 0.00% |
| 🔷 Blocky (Go) | 98,574 | 9.49ms | 21.39ms | 99.99% | 0.01% |
| 🛡️ AdGuard Home | 97,808 | 3.90ms | 15.59ms | 99.87% | 0.13% |
| 🕳️ Pi-hole | 558 | 2.55ms | 24.83ms | 73.63% | 26.37% |
ferrous-dns vs competitors: 5.2× faster than AdGuard Home | 5.2× faster than Blocky | 916× faster than Pi-hole
Unbound and PowerDNS Recursor lead as purpose-built pure recursive resolvers (C and C++) — no REST API, no Web UI, no database, no blocking engine. ferrous-dns runs all of these in the same single-process binary.
Pi-hole's loss rate reflects its architectural ceiling: FTL v6 is mostly single-threaded and saturates under concurrent load from other containers sharing the same CPU pool.
Cache hit P99: ~10–20µs | Cache miss P99: ~1–3ms | Hit rate: ~95%
Full benchmark report and methodology: bench/benchmark-results.md