Performance¶
Ferrous DNS is engineered from the ground up for throughput. Every component in the query path was designed with latency and allocation as first-class constraints — not afterthoughts.
This page explains how the system achieves its numbers, layer by layer.
The Hot Path¶
Every DNS query traverses this sequence. The goal: respond in microseconds when cached, in milliseconds when not.
UDP packet received
│
▼
L1 cache lookup ~1-3µs ← per-thread, zero locks
│ miss
▼
L2 cache lookup ~10-20µs ← shared, per-shard lock
│ miss
▼
In-flight check ~200ns ← is someone already fetching this?
│ nobody fetching
▼
Upstream query ~1-50ms ← DoH / DoT / DoQ / UDP
│
▼
Write to L2 + L1
│
▼
Send UDP response
Nothing in this path allocates memory for cache hits. No global locks. No expensive system calls for timing.
L1/L2 Hierarchical Cache¶
L1 — Per-Thread, Lock-Free¶
Each worker thread has its own private L1 cache. Because it is private to the thread, there is zero synchronization overhead.
- No locks, no contention -- direct memory access
- Holds the hottest ~100-500 entries per thread
- L1 hit overhead: ~1-3µs P99
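The idea can be sketched with Rust's `thread_local!` (a hedged illustration — the types and names here are hypothetical, not Ferrous DNS internals; the point is that thread-private state needs no locks):

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// Illustrative sketch only: because the map is private to the thread,
// lookups are plain borrows -- no atomics, no contention.
thread_local! {
    static L1: RefCell<HashMap<String, Vec<u8>>> = RefCell::new(HashMap::new());
}

fn l1_get(domain: &str) -> Option<Vec<u8>> {
    // Direct access to thread-local state; no synchronization at all.
    L1.with(|cache| cache.borrow().get(domain).cloned())
}

fn l1_put(domain: &str, response: Vec<u8>) {
    L1.with(|cache| cache.borrow_mut().insert(domain.to_string(), response));
}

fn main() {
    l1_put("example.com", vec![1, 2, 3]);
    assert_eq!(l1_get("example.com"), Some(vec![1, 2, 3]));
    assert_eq!(l1_get("other.com"), None); // miss falls through to L2
    println!("ok");
}
```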
L2 — Shared, Sharded Cache¶
L2 is a shared cache split into independent shards (default: 4x CPU core count). Each shard has its own lock, so queries for different domains never block each other.
16-core machine → 64 shards
Query "google.com" → hash → shard #17 → lock shard #17 only
Query "reddit.com" → hash → shard #31 → lock shard #31 only
↑ never blocks each other
Under real-world load with hundreds of distinct active domains, contention is effectively zero.
- Capacity: up to 200,000 entries (configurable)
- L2 hit overhead: ~10-20µs P99
- Optimized hash function for fast domain name lookups
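Shard selection can be sketched as follows (hedged: the shard count, value types, and hash function here are illustrative, not the server's actual code):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

// Illustrative sharded cache: each shard carries its own lock, so queries
// for domains that hash to different shards never contend.
struct ShardedCache {
    shards: Vec<Mutex<HashMap<String, Vec<u8>>>>,
}

impl ShardedCache {
    fn new(num_shards: usize) -> Self {
        ShardedCache {
            shards: (0..num_shards).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    fn shard_for(&self, domain: &str) -> usize {
        let mut h = DefaultHasher::new();
        domain.hash(&mut h);
        h.finish() as usize % self.shards.len()
    }

    fn insert(&self, domain: &str, response: Vec<u8>) {
        // Lock only the shard this domain hashes to; the rest stay free.
        let shard = self.shard_for(domain);
        self.shards[shard].lock().unwrap().insert(domain.to_string(), response);
    }

    fn get(&self, domain: &str) -> Option<Vec<u8>> {
        let shard = self.shard_for(domain);
        self.shards[shard].lock().unwrap().get(domain).cloned()
    }
}

fn main() {
    let cache = ShardedCache::new(64); // e.g. 16 cores x 4
    cache.insert("google.com", vec![0x7F]);
    assert_eq!(cache.get("google.com"), Some(vec![0x7F]));
    assert_eq!(cache.get("reddit.com"), None);
    println!("ok");
}
```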
Why Two Levels?¶
L1 absorbs the hottest queries (top ~0.1% of domains queried thousands of times per minute) without touching shared memory at all. L2 handles the long tail. Together they keep the cache hit rate above 95% for typical networks.
Fast Negative Lookups¶
A significant fraction of DNS queries hit domains that are simply not in the blocklist. Without a quick pre-check, each query would trigger a full blocklist lookup across potentially millions of entries.
Ferrous DNS uses a probabilistic filter that answers one question almost instantly: "Is this domain definitely not in the blocklist?"
Query: "example.com"
│
▼
Quick pre-check:
"Could this be blocked?" → NO → skip all blocklist checks instantly
→ YES → run full blocklist lookup (possible match)
- A "no" answer is guaranteed correct -- no blocked domain is ever missed
- False positive rate is kept very low
- Concurrent-safe with no locking
- Negligible overhead per lookup, regardless of blocklist size
For the ~99% of queries hitting common non-blocked domains, the entire blocklist engine adds negligible overhead per query.
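A minimal sketch of such a pre-check, assuming a Bloom-filter design (the page doesn't name the exact filter type; the size and double-hashing scheme below are assumptions):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustrative Bloom filter: "false" answers are definitive, "true" answers
// only mean the full blocklist must be consulted.
struct NegativeFilter {
    bits: Vec<u64>,
    nbits: usize,
}

impl NegativeFilter {
    fn new(nbits: usize) -> Self {
        NegativeFilter { bits: vec![0; (nbits + 63) / 64], nbits }
    }

    fn positions(&self, domain: &str) -> [usize; 2] {
        let mut pos = [0usize; 2];
        for i in 0..2usize {
            let mut h = DefaultHasher::new();
            (i as u64).hash(&mut h); // seed so the two probes differ
            domain.hash(&mut h);
            pos[i] = h.finish() as usize % self.nbits;
        }
        pos
    }

    fn add(&mut self, domain: &str) {
        for p in self.positions(domain) {
            self.bits[p / 64] |= 1u64 << (p % 64);
        }
    }

    /// `false` means "definitely not in the blocklist" -- skip all checks.
    fn might_be_blocked(&self, domain: &str) -> bool {
        self.positions(domain)
            .iter()
            .all(|&p| self.bits[p / 64] & (1u64 << (p % 64)) != 0)
    }
}

fn main() {
    let mut filter = NegativeFilter::new(1 << 16);
    filter.add("ads.tracker.example");
    // A blocked domain is never missed: no false negatives by construction.
    assert!(filter.might_be_blocked("ads.tracker.example"));
    println!("ok");
}
```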
In-Flight Coalescing (Cache Stampede Prevention)¶
Without coalescing, a cache miss for a popular domain under high load triggers dozens of simultaneous upstream requests. Imagine 50 clients querying api.github.com at the moment the cache entry expires:
Without coalescing:
Client 1 → cache miss → upstream request
Client 2 → cache miss → upstream request ← 50 redundant upstream requests
Client 3 → cache miss → upstream request
...
Client 50 → cache miss → upstream request
With coalescing:
Client 1 → cache miss → becomes "leader" → sends 1 upstream request
Client 2 → cache miss → sees in-flight entry → waits on channel
Client 3 → cache miss → sees in-flight entry → waits on channel
...
Client 50 → upstream responds → all 50 clients receive the answer simultaneously
The first query to see a cache miss becomes the "leader" and starts the upstream request. All subsequent requests for the same domain wait on a notification channel and receive the response the moment it arrives -- at zero additional upstream cost.
If the upstream request fails or is cancelled, all waiting clients are notified immediately and the tracking entry is cleaned up automatically.
Under load with many clients hitting the same popular domain, this eliminates the thundering-herd problem entirely and reduces upstream traffic by orders of magnitude.
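The leader/waiter mechanics can be sketched with std primitives (hedged: the real server is async and likely uses notification channels; this thread-and-condvar version shows the same shape):

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Barrier, Condvar, Mutex};
use std::thread;
use std::time::Duration;

type Slot = Arc<(Mutex<Option<Vec<u8>>>, Condvar)>;

struct Coalescer {
    in_flight: Mutex<HashMap<String, Slot>>,
}

impl Coalescer {
    fn new() -> Self {
        Coalescer { in_flight: Mutex::new(HashMap::new()) }
    }

    fn resolve(&self, domain: &str, upstream: impl FnOnce() -> Vec<u8>) -> Vec<u8> {
        // Check the in-flight table; the first miss becomes the leader.
        let (slot, leader) = {
            let mut map = self.in_flight.lock().unwrap();
            if let Some(s) = map.get(domain) {
                (s.clone(), false) // someone is already fetching this
            } else {
                let s: Slot = Arc::new((Mutex::new(None), Condvar::new()));
                map.insert(domain.to_string(), s.clone());
                (s, true)
            }
        };
        if leader {
            let answer = upstream(); // exactly one upstream request
            *slot.0.lock().unwrap() = Some(answer.clone());
            slot.1.notify_all(); // wake every waiting client at once
            self.in_flight.lock().unwrap().remove(domain);
            answer
        } else {
            // Waiters block until the leader publishes the answer.
            let mut guard = slot.0.lock().unwrap();
            while guard.is_none() {
                guard = slot.1.wait(guard).unwrap();
            }
            guard.clone().unwrap()
        }
    }
}

/// Demo: 50 clients miss simultaneously; returns the upstream call count.
fn demo() -> usize {
    let c = Arc::new(Coalescer::new());
    let calls = Arc::new(AtomicUsize::new(0));
    let barrier = Arc::new(Barrier::new(50));
    let handles: Vec<_> = (0..50)
        .map(|_| {
            let (c, calls, barrier) = (c.clone(), calls.clone(), barrier.clone());
            thread::spawn(move || {
                barrier.wait(); // all clients miss at the same instant
                c.resolve("api.github.com", || {
                    calls.fetch_add(1, Ordering::SeqCst);
                    thread::sleep(Duration::from_millis(200)); // slow upstream
                    vec![42]
                })
            })
        })
        .collect();
    for h in handles {
        assert_eq!(h.join().unwrap(), vec![42]); // every client gets the answer
    }
    calls.load(Ordering::SeqCst)
}

fn main() {
    let n = demo();
    assert!(n < 50); // typically 1: the herd collapses onto one request
    println!("upstream calls: {n}");
}
```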
Optimistic Prefetch (Background Refresh)¶
When a popular cache entry's TTL drops below a configurable threshold (default: 75% consumed), a background task proactively refreshes it before it expires.
Entry TTL: 3600s
0s ─────────────────────────────────────── 3600s
                    ▲                         ▲
  background refresh triggered       would expire here
  at 2700s (75% of TTL consumed)     (never reached — already refreshed)
Clients continue receiving cached responses with zero wait while the background task fetches a fresh answer. The entry is never cold for active domains.
Eligibility criteria (all must be met):
| Criterion | Config key | Default |
|---|---|---|
| Minimum total hits | cache_min_frequency | 10 |
| Minimum hits per minute | cache_min_hit_rate | 2.0 |
| Last accessed within | cache_access_window_secs | 43200 (12h) |
| Fraction of TTL consumed exceeds | cache_refresh_threshold | 0.75 |
This keeps the effective hit rate close to 100% for actively-used domains as their TTLs cycle naturally.
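The criteria above combine into a single predicate. A hedged sketch — the struct fields and the consumed-TTL reading of the threshold are assumptions; only the config keys come from this page:

```rust
// Field and key names mirror the documented config; the code is illustrative.
struct EntryStats {
    total_hits: u64,
    hits_per_minute: f64,
    secs_since_last_access: u64,
    ttl_consumed_fraction: f64, // 0.0 = fresh, 1.0 = expired
}

struct RefreshConfig {
    cache_min_frequency: u64,      // default 10
    cache_min_hit_rate: f64,       // default 2.0
    cache_access_window_secs: u64, // default 43_200 (12h)
    cache_refresh_threshold: f64,  // default 0.75 of TTL consumed
}

fn should_prefetch(s: &EntryStats, c: &RefreshConfig) -> bool {
    s.total_hits >= c.cache_min_frequency
        && s.hits_per_minute >= c.cache_min_hit_rate
        && s.secs_since_last_access <= c.cache_access_window_secs
        && s.ttl_consumed_fraction >= c.cache_refresh_threshold
}

fn main() {
    let cfg = RefreshConfig {
        cache_min_frequency: 10,
        cache_min_hit_rate: 2.0,
        cache_access_window_secs: 43_200,
        cache_refresh_threshold: 0.75,
    };
    let hot = EntryStats {
        total_hits: 500, hits_per_minute: 8.0,
        secs_since_last_access: 30, ttl_consumed_fraction: 0.8,
    };
    let idle = EntryStats {
        total_hits: 3, hits_per_minute: 0.1,
        secs_since_last_access: 90_000, ttl_consumed_fraction: 0.8,
    };
    assert!(should_prefetch(&hot, &cfg));
    assert!(!should_prefetch(&idle, &cfg));
    println!("ok");
}
```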
LFU-K Eviction with Sliding Window¶
When the cache reaches capacity, an eviction policy decides which entries to remove. Simple LRU can discard a domain queried 10,000 times that happened to be quiet for the last two minutes — replaced by one queried twice a minute ago.
Ferrous DNS uses LFU-K: it tracks the K most recent access timestamps per entry and computes a sliding-window frequency score from them.
This gives weight to sustained, frequent access rather than historical patterns. A domain popular 6 hours ago but idle since scores lower than one queried 5 times in the last minute.
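Sketched in Rust (hedged: the actual scoring formula isn't documented here; this illustrative version counts the tracked accesses that fall inside a sliding window and normalizes by the window length):

```rust
use std::collections::VecDeque;

const K: usize = 8; // timestamps kept per entry (illustrative value)

struct AccessHistory {
    timestamps: VecDeque<u64>, // seconds; most recent at the back
}

impl AccessHistory {
    fn new() -> Self {
        AccessHistory { timestamps: VecDeque::with_capacity(K) }
    }

    fn record(&mut self, now: u64) {
        if self.timestamps.len() == K {
            self.timestamps.pop_front(); // keep only the K most recent
        }
        self.timestamps.push_back(now);
    }

    /// Accesses inside the window, per second of window.
    fn score(&self, now: u64, window_secs: u64) -> f64 {
        let recent = self
            .timestamps
            .iter()
            .filter(|&&t| now.saturating_sub(t) <= window_secs)
            .count();
        recent as f64 / window_secs as f64
    }
}

fn main() {
    let now = 100_000u64;
    let mut popular_long_ago = AccessHistory::new();
    for t in 0..8 {
        popular_long_ago.record(t); // heavy traffic hours ago, idle since
    }
    let mut active_now = AccessHistory::new();
    for t in (now - 60)..(now - 55) {
        active_now.record(t); // 5 hits in the last minute
    }
    // Recent sustained access outranks stale popularity.
    assert!(active_now.score(now, 300) > popular_long_ago.score(now, 300));
    println!("ok");
}
```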
Three eviction strategies are available:
| Strategy | When to use |
|---|---|
| `hit_rate` (default) | Mixed workloads — preserves the most actively queried entries |
| `lfu` | Stable workloads with predictable query distribution |
| `lru` | Bursty workloads with strong temporal locality |
Zero-Allocation Hot Path¶
Go-based DNS servers (Blocky, AdGuard Home) suffer from garbage collector pause spikes under load. Rust eliminates GC entirely. On top of that, Ferrous DNS enforces a strict no-allocation policy on the cache hit path:
- Shared domain strings -- domain names are stored once and shared by reference. Copying a reference costs ~1ns with no memory allocation
- Stack-allocated record sets -- most DNS responses contain 1-4 records, which are stored on the stack without heap allocation
- Zero-copy case comparison -- DNS names are compared case-insensitively without creating temporary copies
- Fast hashing -- an optimized hash function for short strings (domain names) provides ~3x faster lookups than the standard approach
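Two of these techniques can be illustrated with plain std types (a sketch of the idea, not the server's actual internals):

```rust
use std::sync::Arc;

// `Arc<str>` makes sharing a domain name a pointer copy plus a counter bump,
// and `eq_ignore_ascii_case` compares in place with no lowercase copies.
fn names_equal_ignore_case(a: &str, b: &str) -> bool {
    a.eq_ignore_ascii_case(b) // walks the bytes directly; zero allocation
}

fn main() {
    let stored: Arc<str> = Arc::from("example.com");
    let shared = Arc::clone(&stored); // ~1ns: copies no string bytes
    assert!(Arc::ptr_eq(&stored, &shared)); // same underlying string
    assert!(names_equal_ignore_case(&stored, "EXAMPLE.com"));
    println!("ok");
}
```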
Low-Overhead Timing¶
Measuring cache hit latency requires a fast timer. Standard system clock calls cost ~20ns on x86_64, which adds measurable overhead to ~1µs cache hit operations.
Ferrous DNS reads the CPU's hardware timestamp counter directly, costing only ~1-5ns -- roughly 4-20x cheaper than a standard clock call. On ARM platforms, it falls back to a fast kernel clock with ~10-15ns overhead.
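A hedged sketch of such a timer — RDTSC on x86_64 with a monotonic-clock fallback elsewhere; production code must additionally calibrate the TSC frequency to convert ticks to nanoseconds:

```rust
fn fast_ticks() -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        // SAFETY: RDTSC reads a counter register and has no side effects.
        unsafe { core::arch::x86_64::_rdtsc() }
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        // Fallback: nanoseconds since first call via the monotonic clock.
        use std::sync::OnceLock;
        use std::time::Instant;
        static START: OnceLock<Instant> = OnceLock::new();
        START.get_or_init(Instant::now).elapsed().as_nanos() as u64
    }
}

fn main() {
    let a = fast_ticks();
    let b = fast_ticks();
    assert!(b >= a); // monotonic on one core with an invariant TSC
    println!("delta: {} ticks", b - a);
}
```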
Async Query Log Pipeline¶
Logging every DNS query to SQLite without blocking the resolver requires a carefully designed pipeline. The DNS handler never waits for disk I/O.
DNS handler (hot path)
│
│ try_send() ← non-blocking; drops entry if channel is full
▼
Async channel (10k–200k capacity)
│
│ batch read every 200ms (up to 2,000 entries per cycle)
▼
Background flush task
│
│ single INSERT transaction per batch
▼
SQLite (WAL mode)
Batching is critical: a single transaction with 2,000 rows is ~100x faster than 2,000 individual transactions. At very high query rates, query_log_sample_rate lets you log 1 in N queries to cap write volume without losing visibility.
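The pipeline's shape can be sketched with std primitives (hedged: the real server uses an async channel and writes to SQLite; the batch size and flush interval below mirror the diagram):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Feed 500 entries through a bounded channel and batch-drain them;
/// returns the number of entries the background task "wrote".
fn run_pipeline() -> usize {
    let (tx, rx) = mpsc::sync_channel::<String>(10_000); // bounded queue

    let writer = thread::spawn(move || {
        let mut total = 0usize;
        loop {
            thread::sleep(Duration::from_millis(200)); // flush interval
            let mut batch = Vec::new();
            let mut disconnected = false;
            while batch.len() < 2_000 {
                match rx.try_recv() {
                    Ok(entry) => batch.push(entry),
                    Err(mpsc::TryRecvError::Empty) => break,
                    Err(mpsc::TryRecvError::Disconnected) => {
                        disconnected = true;
                        break;
                    }
                }
            }
            // Real code: one INSERT transaction covering the whole batch.
            total += batch.len();
            if disconnected {
                break;
            }
        }
        total
    });

    // Hot path: try_send never blocks; a full channel just drops the entry.
    for i in 0..500 {
        let _ = tx.try_send(format!("query {i}"));
    }
    drop(tx); // demo only: lets the writer thread terminate
    writer.join().unwrap()
}

fn main() {
    assert_eq!(run_pipeline(), 500);
    println!("logged 500 entries in batches");
}
```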
Optimized Memory Allocator¶
Ferrous DNS uses a high-performance memory allocator optimized for server workloads:
- 2-3x faster than the default system allocator for small, short-lived allocations
- Per-thread memory pools minimize cross-thread contention
- Reduces long-term memory fragmentation under sustained server load
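The page doesn't name the allocator. In Rust servers this class of allocator is typically wired in with a one-line global-allocator declaration; a hypothetical sketch assuming the `mimalloc` crate as a dependency:

```rust
// Hypothetical wiring; the document does not say which allocator
// Ferrous DNS actually uses.
use mimalloc::MiMalloc;

#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;
```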
Parallel Upstream Strategy¶
On a cache miss, Ferrous DNS queries multiple upstream servers simultaneously and returns the fastest response:
Cache miss for "example.com"
│
├──► DoQ dns.adguard-dns.com responds in 8ms
├──► DoH cloudflare-dns.com responds in 12ms ← discarded
└──► DoH dns.google responds in 6ms ← returned
Client receives the answer in 6ms instead of waiting for the slowest upstream
This eliminates the tail-latency risk of any single upstream being momentarily slow. Upstream health is monitored continuously and unhealthy servers are excluded automatically.
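The "first answer wins" shape can be sketched with threads and a channel (hedged: the real server races async DoH/DoT/DoQ requests; latencies here are simulated, with exaggerated margins to keep the demo deterministic):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Query all upstreams in parallel; return whichever answers first.
fn race() -> &'static str {
    let (tx, rx) = mpsc::channel();
    for (name, latency_ms) in [
        ("dns.adguard-dns.com", 80u64),
        ("cloudflare-dns.com", 120),
        ("dns.google", 10),
    ] {
        let tx = tx.clone();
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(latency_ms)); // simulated RTT
            let _ = tx.send(name); // late responders' sends are simply ignored
        });
    }
    drop(tx);
    rx.recv().unwrap() // the first message on the channel wins
}

fn main() {
    assert_eq!(race(), "dns.google");
    println!("fastest upstream: dns.google");
}
```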
UDP Buffer Tuning¶
The OS UDP receive buffer determines how many packets the kernel queues before the application processes them. With default buffer sizes, large query bursts overflow the kernel queue and are dropped silently.
Ferrous DNS sets enlarged socket buffers (8 MB send and receive) at startup. This directly improves the "queries lost" metric under peak load.
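Note that on Linux the kernel silently clamps buffer requests to the `rmem_max`/`wmem_max` ceilings, so those limits may also need raising for the full 8 MB to take effect (real sysctl names; run as root):

```shell
# Raise the per-socket buffer ceilings so an 8 MB request isn't clamped
sysctl -w net.core.rmem_max=8388608
sysctl -w net.core.wmem_max=8388608
```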
Production Build¶
For maximum performance, always build with native CPU optimizations:
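A typical invocation, assuming a standard cargo setup (the exact command isn't shown on this page):

```shell
# Build optimized for the local CPU; adjust if the project ships its own profile
RUSTFLAGS="-C target-cpu=native" cargo build --release
```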
This enables AVX2/SSE4 vectorized string operations, CPU-specific branch prediction, and proper inlining of intrinsics like RDTSC. The gap between a generic --release build and a target-cpu=native build is measurable in the hashing and comparison code.
Summary¶
| Optimization | Benefit |
|---|---|
| L1 per-thread cache | Lock-free hits, ~1-3µs P99 |
| L2 sharded cache | Near-zero contention, ~10-20µs P99 |
| Fast negative lookups | Non-blocked domains skip blocklist checks instantly |
| In-flight coalescing | N identical cache-miss queries collapse into 1 upstream request |
| Optimistic prefetch | Hot entries never expire; near-100% hit rate |
| Frequency-based eviction | Preserves most-active entries under memory pressure |
| Zero-allocation hot path | No memory allocation on cache hit path |
| Hardware timestamp counter | Hot-path timing at ~1-5ns vs ~20ns syscall |
| Async query log pipeline | Query logging never blocks the resolver |
| Optimized memory allocator | 2-3x faster allocation than system default |
| Parallel upstream strategy | Lowest cache-miss latency, transparent failover |
| UDP buffer tuning | Absorbs large bursts without packet loss |
Benchmark Results¶
Host: Intel Core i9-9900KF @ 3.60GHz | 8 cores / 16 threads / 46 GB RAM | Arch Linux | Kernel 6.18.16-1-lts
Tool: dnsperf 2.14.0 | 60s per server | 10 concurrent clients | 187 domains (A, AAAA, MX, TXT, NS)
Docker config (identical for all servers):
| Setting | Value |
|---|---|
| CPUs | cpuset: 0-15 — 16 threads |
| Network | host mode |
| Upstreams | plain UDP 8.8.8.8 / 1.1.1.1 (parallel) |
| Cache | enabled |
| Blocking / denylists | disabled — isolates raw forwarding performance |
| Rate limiting | disabled |
| Log level | info |
| Query logging (disk I/O) | disabled |
| Server | QPS | Avg Lat | P99 Lat | Completed | Lost |
|---|---|---|---|---|---|
| ⚡ Unbound (C) | 1,018,691 | 0.74ms | 2.05ms | 100.00% | 0.00% |
| ⚡ PowerDNS (C++) | 797,600 | 1.11ms | 3.47ms | 100.00% | 0.00% |
| 🦀 ferrous-dns | 511,413 | 1.89ms | 47.50ms | 100.00% | 0.00% |
| 🔷 Blocky (Go) | 98,574 | 9.49ms | 21.39ms | 99.99% | 0.01% |
| 🛡️ AdGuard Home | 97,808 | 3.90ms | 15.59ms | 99.87% | 0.13% |
| 🕳️ Pi-hole | 558 | 2.55ms | 24.83ms | 73.63% | 26.37% |
ferrous-dns vs competitors: 5.2× faster than AdGuard Home | 5.2× faster than Blocky | 916× faster than Pi-hole
Unbound and PowerDNS Recursor lead as purpose-built pure recursive resolvers (C and C++) — no REST API, no Web UI, no database, no blocking engine. ferrous-dns runs all of these in the same single-process binary.
Pi-hole's loss rate reflects its architectural ceiling: FTL v6 is mostly single-threaded and saturates under concurrent load from other containers sharing the same CPU pool.
Cache hit P99: ~10–20µs | Cache miss P99: ~1–3ms | Hit rate: ~95%
Full benchmark report and methodology: bench/benchmark-results.md