Building a High-Throughput DNS Scanner in Go
From 160 queries/second to 4000+ by moving the hot path into Go, eliminating shared state, and letting each goroutine own its own connection. Lessons from massdns, zdns, and a real rewrite.
A DNS scanner project needed to check thousands of domains per second against TLD nameservers. The first version managed 160 queries/second. The bottleneck wasn’t the network, wasn’t DNS resolution time, wasn’t the proxies. It was the pipe between the orchestrator and the resolver. This article covers what went wrong, what two reference implementations taught us about doing it right, and the architecture that got throughput past 4000 queries/second.
The Bottleneck: Serialization, Not Network
The original architecture used a Python orchestrator with a Go sidecar for DNS resolution. Python sent one query at a time through stdin as JSON, Go processed it, and returned one result through stdout. A classic RPC bridge pattern.
Five serialization points killed throughput:
- Proxy selection lock — Python picks one proxy per query under a mutex
- Stdin write + flush — one JSON line per query, blocks on pipe buffer
- Per-connection lock in Go — one query at a time per connection
- Stdout write lock — serializes all output through a single mutex
- Single reader thread — Python processes one response at a time
Each serialization point is fast individually. Together they form a pipeline where every stage waits for the previous one. The wall-clock throughput was 160 queries/second — roughly what you’d expect from five synchronous bottlenecks each adding a few milliseconds.
The DNS round-trip through a SOCKS5 proxy to a public resolver is 50-200ms. With 500 concurrent connections, you’d expect 2500-10000 queries/second if the only bottleneck were network latency. The pipe was leaving 95% of the available parallelism on the table.
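That expectation is just concurrency divided by latency (Little's law): with every connection running one query at a time, throughput is workers / round-trip time. A quick check of the numbers above, with an illustrative helper:

```go
package main

import "fmt"

// expectedQPS estimates steady-state throughput when each of `workers`
// connections runs exactly one query at a time with the given RTT.
func expectedQPS(workers int, rttMs float64) float64 {
	return float64(workers) * 1000.0 / rttMs
}

func main() {
	fmt.Println(expectedQPS(500, 200)) // worst case: 200 ms RTT -> 2500 q/s
	fmt.Println(expectedQPS(500, 50))  // best case: 50 ms RTT -> 10000 q/s
}
```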
What massdns Teaches: One Thread Beats 500
massdns achieves 350,000 queries/second using C, a single thread, and no locks. It’s worth understanding how.
The design is a pre-allocated slot pool. At startup, massdns allocates N slots for in-flight queries (default 10,000). Each slot holds the query state: domain name, query type, timestamp, retry count. A hash map correlates incoming responses back to their slot using (domain, type) as the key.
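The slot-pool idea translates directly to Go, even though massdns itself is C. A minimal sketch (the SlotPool and Slot types here are illustrative, not massdns's actual structures): pre-allocate every slot up front, hand out free ones on send, and correlate responses back by (domain, type).

```go
package main

import "fmt"

// Slot holds the in-flight state for one query.
type Slot struct {
	Domain  string
	Qtype   uint16
	Retries int
	InUse   bool
}

type key struct {
	domain string
	qtype  uint16
}

// SlotPool pre-allocates all query state at startup; nothing is
// allocated per query, and a map correlates responses to slots.
type SlotPool struct {
	slots    []Slot
	free     []int       // indices of unused slots
	inflight map[key]int // (domain, type) -> slot index
}

func NewSlotPool(n int) *SlotPool {
	p := &SlotPool{slots: make([]Slot, n), inflight: make(map[key]int, n)}
	for i := n - 1; i >= 0; i-- {
		p.free = append(p.free, i)
	}
	return p
}

// Acquire claims a slot for a new query; ok is false when all n slots
// are in flight (backpressure: stop pulling from input).
func (p *SlotPool) Acquire(domain string, qtype uint16) (int, bool) {
	if len(p.free) == 0 {
		return 0, false
	}
	i := p.free[len(p.free)-1]
	p.free = p.free[:len(p.free)-1]
	p.slots[i] = Slot{Domain: domain, Qtype: qtype, InUse: true}
	p.inflight[key{domain, qtype}] = i
	return i, true
}

// Match correlates an incoming response back to its slot and frees it.
func (p *SlotPool) Match(domain string, qtype uint16) (int, bool) {
	i, ok := p.inflight[key{domain, qtype}]
	if !ok {
		return 0, false // unsolicited, or already timed out
	}
	delete(p.inflight, key{domain, qtype})
	p.slots[i].InUse = false
	p.free = append(p.free, i)
	return i, true
}

func main() {
	p := NewSlotPool(2)
	i, _ := p.Acquire("example.com", 2) // qtype 2 = NS
	_, matched := p.Match("example.com", 2)
	fmt.Println(i, matched)
}
```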
The event loop is built on epoll:
- When the socket is writable, pull the next domain from input and send a UDP query
- When the socket is readable, parse the response and match it back to a slot via the hash map
- A timed ring buffer handles timeouts — slots are inserted at their deadline position and swept lazily
No threads. No locks. No goroutines. No channels. One thread owns all state and alternates between sending and receiving based on socket readiness. The CPU never blocks on I/O and never contends on shared data.
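The timed ring buffer for timeouts can be sketched in a few lines (massdns implements this in C; the TimeoutWheel type and its tick granularity here are illustrative). Slots land in the bucket at their deadline tick and are swept only when the clock reaches that bucket; answered slots would simply be skipped at sweep time.

```go
package main

import "fmt"

// TimeoutWheel is a lazy timing wheel: each bucket holds the slots due
// at that tick, and a bucket is swept only when the clock advances to it.
type TimeoutWheel struct {
	buckets [][]int // slot indices due at each tick
	now     int     // current tick position
}

func NewTimeoutWheel(ticks int) *TimeoutWheel {
	return &TimeoutWheel{buckets: make([][]int, ticks)}
}

// Insert schedules slot to expire `after` ticks from now
// (after must be smaller than the wheel size).
func (w *TimeoutWheel) Insert(slot, after int) {
	i := (w.now + after) % len(w.buckets)
	w.buckets[i] = append(w.buckets[i], slot)
}

// Advance moves the clock one tick and returns the slots whose
// deadline just passed.
func (w *TimeoutWheel) Advance() []int {
	w.now = (w.now + 1) % len(w.buckets)
	due := w.buckets[w.now]
	w.buckets[w.now] = nil
	return due
}

func main() {
	w := NewTimeoutWheel(8)
	w.Insert(42, 2)               // slot 42 times out two ticks from now
	fmt.Println(len(w.Advance())) // nothing due after one tick
	fmt.Println(w.Advance())      // slot 42 expires on the second tick
}
```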
The insight: one thread with async I/O beats 500 threads with locks. The coordination overhead of mutex contention, context switching, and cache invalidation across threads can easily dominate the actual work when the work (sending a 40-byte UDP packet) is near-zero.
massdns is a ceiling reference — it shows what’s possible when DNS scanning is the only thing happening in the process. A practical scanner that needs proxy support, TCP connections, and integration with a larger pipeline won’t match 350K/s, but it should aim for the same principle: don’t serialize the hot path.
What zdns Teaches: Shared-Nothing Goroutines
zdns is a Go DNS scanner from the ZMap project. It achieves 1000+ queries/second with a cleaner model than raw epoll: goroutines with no shared mutable state.
The architecture is a four-channel pipeline:
```
stdin → input channel → worker goroutines → output channel → stdout
```

Each worker goroutine owns its own resolver. No shared connection pool. No shared socket. No mutex in the query path. Workers pull domains from the input channel, resolve them on their own connection, and push results to the output channel. Concurrency equals the worker count, and that’s the only knob.
The insight: eliminate ALL shared mutable state in the hot path. If no goroutine reads or writes data that another goroutine touches, you don’t need locks, you don’t need atomics, and you don’t need to think about memory ordering. Channels handle the handoff at the boundary.
zdns doesn’t even do rate limiting in the traditional sense. The number of workers is the rate limit. Each worker processes queries sequentially on its own connection, so the maximum throughput is workers * (1 / avg_query_time). Want more throughput? Add more workers.
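The shared-nothing pipeline is easy to express in Go. A sketch with a stubbed-out resolve step (this is not zdns's actual code; runPipeline and the resolve callback are illustrative, standing in for a worker that owns a real resolver):

```go
package main

import (
	"fmt"
	"sync"
)

// runPipeline fans domains out to `workers` goroutines. Each worker's
// state is private; the only shared objects are the two channels.
func runPipeline(domains []string, workers int, resolve func(string) string) []string {
	in := make(chan string, len(domains))
	out := make(chan string, len(domains))

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for d := range in { // each worker pulls independently
				out <- resolve(d)
			}
		}()
	}

	for _, d := range domains {
		in <- d
	}
	close(in) // closing the input channel tells workers to exit
	wg.Wait()
	close(out)

	var results []string
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	res := runPipeline([]string{"a.com", "b.com"}, 4, func(d string) string {
		return d + ":resolved" // stand-in for a real DNS query
	})
	fmt.Println(len(res))
}
```

Adding throughput really is just raising the worker count: the channels need no changes, and no lock appears anywhere in the query path.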
The Architecture: Go Owns the Hot Path
Combining these lessons, the redesigned scanner splits responsibilities by speed:
```
Python (orchestrator)                        Go (scanner daemon)
+-----------------------+                    +------------------------------+
| Load proxies          |---config JSON----->| Store proxy pool             |
| Generate domains      |---domain stream--->| Assign proxy per worker      |
| Read results          |<--result stream----| Resolve via SOCKS5 + DNS     |
| WHOIS/RDAP on misses  |                    | Manage connections           |
| Store results         |                    | Handle timeouts/retries      |
+-----------------------+                    +------------------------------+
       slow path                                       fast path
 (~20/s, 5% of domains)                         (1000+/s, all domains)
```

The principle: Go owns everything in the hot path. Python’s job is to feed domains and consume results. No per-query decisions cross the process boundary.
Python sends the proxy list once at startup. Go assigns proxies to workers internally using static round-robin. No per-query proxy selection in Python. No per-query JSON encoding for a request object. No future-matching on the response. Domains go in as bare strings, results come out as compact JSON.
The slow path — WHOIS/RDAP confirmation for domains that return NXDOMAIN — stays in Python. It runs at ~20 queries/second, only hits ~5% of domains, and involves HTTP requests with TLS fingerprinting. There’s no reason to rewrite it.
Worker-per-Goroutine: Zero Locks in the Query Path
The internal Go architecture follows the zdns model directly:
```
stdin reader (1 goroutine)
      |
      v
domain channel (buffered, 10K)
      |
      +--> worker 0 --> own SOCKS5 conn + dns.Conn --> result channel
      +--> worker 1 --> own SOCKS5 conn + dns.Conn --> result channel
      +--> worker 2 --> ...                           ...
      +--> worker N --> own SOCKS5 conn + dns.Conn --> result channel
                                                            |
                                                            v
                                                stdout writer (1 goroutine)
```

Each worker goroutine owns:
- One SOCKS5 connection (persistent TCP, reconnect on error)
- One proxy from the pool (assigned at startup, never rotated)
- One dns.Conn for DNS queries over that SOCKS5 tunnel
- Zero shared locks in the query path
Proxy assignment is static: with N workers and M proxies, worker i uses proxy i % M. Workers reuse their connection across queries. If a connection dies, the worker reconnects to the same proxy. No connection pool. No shared connections.
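The assignment rule is a single expression; the proxyFor helper here is illustrative:

```go
package main

import "fmt"

// proxyFor implements the static assignment: worker i uses proxy i % M
// for its entire lifetime. No lock, no rotation, no per-query decision.
func proxyFor(workerID int, proxies []string) string {
	return proxies[workerID%len(proxies)]
}

func main() {
	proxies := []string{"socks5h://p0", "socks5h://p1", "socks5h://p2"}
	fmt.Println(proxyFor(5, proxies)) // worker 5 -> proxy index 2
}
```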
Why this works: with 500 workers and 50 proxies, each proxy gets ~10 workers. Workers process queries sequentially on their own connection at ~2-5 queries/second each (limited by SOCKS5 round-trip time). 500 workers x 3 queries/second average = 1500/second total. Scale workers to 1500 and you’re past 4000/second. No locks, no contention, linear scaling until you saturate proxy bandwidth.
The worker lifecycle is minimal:
```go
func (w *Worker) Run(domains <-chan string, results chan<- Result) {
	defer w.Close()
	for domain := range domains {
		result := w.resolve(domain)
		results <- result
	}
}
```
```go
func (w *Worker) resolve(domain string) Result {
	if w.conn == nil {
		w.connect() // SOCKS5 dial -> DNS resolver
	}
	msg := new(dns.Msg)
	msg.SetQuestion(dns.Fqdn(domain), dns.TypeNS)
	resp, rtt, err := w.client.ExchangeWithConn(msg, w.conn)
	if err != nil {
		w.conn.Close()
		w.conn = nil // reconnect on next query
		return Result{Domain: domain, Status: "error", Retries: w.retries}
	}
	return Result{
		Domain:   domain,
		Status:   rcodeToStatus(resp.Rcode),
		RTT:      rtt,
		Resolved: resp.Rcode == dns.RcodeSuccess,
	}
}
```

No connection pool abstraction. No retry middleware. No circuit breaker. Each worker is an independent unit. If one worker’s proxy goes down, that one worker reconnects. The other 499 are unaffected.
Why miekg/dns
The miekg/dns library is the de facto standard for DNS in Go. It’s battle-tested by zdns, dnsx, CoreDNS, and most other serious Go DNS tooling. Using it instead of hand-rolling DNS packets keeps the query path to a few lines:
```go
client := &dns.Client{
	Net:     "tcp",
	Timeout: 2 * time.Second,
	Dialer:  socks5Dialer(proxyAddr), // custom net.Dialer for SOCKS5
}
```
```go
msg := new(dns.Msg)
msg.SetQuestion(dns.Fqdn(domain), dns.TypeNS)
resp, rtt, err := client.ExchangeWithConn(msg, conn)
```

The critical piece is the custom Dialer. By injecting a SOCKS5 dialer into the DNS client, every DNS query goes through the proxy tunnel transparently. The DNS library doesn’t know or care about the proxy layer — it just sees a net.Conn. This is the same pattern dnsx uses for proxy support.
Benefits over building DNS packets manually:
- Correct message construction (no off-by-one in the 2-byte TCP length prefix)
- Proper FQDN handling (trailing dot normalization)
- Response parsing with type-safe access to answer records
- Future extensibility: DoH, DoT, EDNS0, DNSSEC validation are all supported
The Protocol: Newline-Delimited Streaming
The communication between Python and Go uses the simplest possible wire format: newline-delimited text.
Phase 1 — Configuration (one JSON object, first line):
```json
{"proxies":["socks5h://..."],"resolver":"1.1.1.1","timeout_ms":2000,"workers":500}
```

Python sends the full config as a single JSON line. Go parses it, initializes workers, connects to proxies, and starts listening for domains.
Phase 2 — Domain streaming (one domain per line):
```
aaaa.com
aaab.com
aaac.com
```

Bare domain names, no JSON wrapping. Closing stdin signals end-of-input.
Phase 3 — Results (JSONL, unordered):
```json
{"d":"aaab.com","s":"taken","r":0,"ms":45,"re":true}
{"d":"aaac.com","s":"nxdomain","r":3,"ms":62,"re":false}
{"d":"aaaa.com","s":"taken","r":0,"ms":38,"re":true}
```

Short field names minimize JSON overhead: d for domain, s for status, r for RCODE, ms for round-trip time, re for whether the domain resolved. Results arrive unordered — whichever worker finishes first writes first.
The protocol is deliberately simple. No request IDs, no correlation, no framing beyond newlines. Python doesn’t need to match responses to requests because it processes results as a stream. The two-phase pipeline (DNS scan, then WHOIS/RDAP on the NXDOMAIN subset) doesn’t require per-domain tracking.
The Evolution: Bash to Python to Go
The project went through three generations in about ten days:
Day 1: Bash script. A whois command in a loop with sleep 0.3. Sequential, no proxies, no structured storage. The sleep alone caps it near 3 queries/second, and WHOIS server latency pushes it lower.
Day 1 (later): Python monolith. A 774-line single-file rewrite with SQLite storage, proxy support, and parallel connections. This got the architecture right conceptually — proxy rotation, structured results, deduplication — but hit a ceiling around 20-30 queries/second for the RDAP/WHOIS path.
Day 8: Go sidecar (v1). Added a Go process for DNS resolution to bypass Python’s I/O limitations. The RPC bridge pattern got throughput to 160/second — better, but the serialization pipeline left 95% of capacity unused.
Day 10: Go scanner (v2). The architecture described in this article. Bulk streaming, worker-per-goroutine, no shared state. Throughput past 4000 queries/second.
The progression illustrates a pattern: the right language for the hot path matters less than the right architecture for the hot path. The v1 Go sidecar was Go code running at Python speeds because the bottleneck was the interface between them. The v2 architecture got fast by moving the entire hot path — proxy selection, connection management, query dispatch, result collection — into a single process with no cross-boundary serialization per query.
What Made the Difference
Three changes account for nearly all the throughput improvement:
No per-query serialization across the process boundary. v1 serialized a JSON request and response for every single query. v2 sends bare domain names in and compact results out. The protocol overhead per query dropped from ~500 bytes of JSON round-trip to ~20 bytes in and ~60 bytes out.
No shared mutable state in the query path. v1 had five lock acquisition points per query. v2 has zero. Each worker is an independent goroutine with its own connection, its own proxy, and its own DNS client. The only synchronization is channel sends at the pipeline boundaries, whose cost is microseconds against a 50-200ms network round-trip.
Bulk proxy assignment instead of per-query rotation. v1 called into a proxy manager for every query, acquiring a lock and running selection logic. v2 assigns proxies to workers once at startup. Worker i uses proxy i % M for its entire lifetime. No rotation, no scoring, no per-query decision.
The underlying principle is the same one massdns demonstrates at the extreme end: DNS queries are tiny and fast. The work of sending a 40-byte packet and reading a 100-byte response takes microseconds. Anything you do around that work — locking, serializing, routing, selecting — easily becomes the bottleneck. The architecture that wins is the one that does the least work per query in the hot path.