notes

random observations on infra, networking, and systems design.

BGP Hijacking Mitigation in Small Networks

Running a small autonomous network requires careful attention to BGP security practices. Most operators assume their ISP filters bogus announcements, but RPKI adoption remains spotty across the globe. I recently audited our edge routers and found that we were only validating routes for about 60% of upstream prefixes—not ideal.

The key insight is that RPKI ROAs (Route Origin Authorizations) are only useful if someone on the path actually validates them. We ended up implementing a local policy where any announcement lacking a valid ROA gets tagged and logged, but not dropped. This let us observe the state of the routing ecosystem without breaking anything. After three weeks of clean observation data, we were confident enough to tighten the policy and reject RPKI-invalid routes from our peers.
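The tag-and-log phase can be sketched as a route filter. This is a hypothetical BIRD 2 example, since the post doesn't name our routing daemon; the protocol name, filter name, community value, and validator address are all illustrative.

```
# Hypothetical BIRD 2 sketch: tag and log RPKI-invalid announcements
# instead of rejecting them (observation phase).
roa4 table r4;

protocol rpki validator1 {
    roa4 { table r4; };
    remote 10.100.50.10 port 3323;   # RTR endpoint of a local validator
    retry keep 90;
    refresh keep 900;
    expire keep 172800;
}

filter upstream_in {
    if roa_check(r4, net, bgp_path.last) = ROA_INVALID then {
        # observation phase: tag with a community and log, do not reject
        bgp_community.add((65000, 666));
        print "RPKI invalid: ", net, " origin AS", bgp_path.last;
    }
    accept;
}
```

Flipping to enforcement later is a one-line change: replace the tag-and-log branch with `reject;`.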

One thing that surprised me: RPKI validators need to be kept reasonably fresh. We run two independent relying-party instances (Routinator on one, FORT on the other) and cross-check their outputs daily. A divergence usually indicates a misconfiguration somewhere, and that signal has been invaluable.

rpki validators run on a dedicated VLAN
primary:   10.100.50.10 (Routinator)
secondary: 10.100.50.11 (FORT)
update interval: 3600s
cache freshness check: daily via cron
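The daily cross-check can be sketched as a set diff over the two validators' VRP exports. This is a minimal sketch, assuming each validator dumps its validated ROA payloads as CSV with `ASN`, `IP Prefix`, and `Max Length` columns; the column names and file layout are assumptions, not our exact export format.

```python
# Sketch of the daily validator cross-check: load each instance's VRP
# export and report entries that only one of them produced.
import csv

def load_vrps(path):
    """Load a VRP CSV export into a set of (asn, prefix, max_length).
    Column names are assumed; adjust to your validator's output."""
    with open(path, newline="") as f:
        return {(row["ASN"], row["IP Prefix"], row["Max Length"])
                for row in csv.DictReader(f)}

def diff_vrps(primary, secondary):
    """Return the VRPs present in only one validator's output."""
    return {
        "only_primary": primary - secondary,
        "only_secondary": secondary - primary,
    }
```

We alert whenever either diff set is non-empty; in practice a brief divergence right after a TAL or cache update is normal, so the cron job tolerates one failing run before paging.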

For teams just starting out, I'd recommend spending a day reading the RPKI RFCs and understanding your upstream's validation posture before deploying anything. The operational cost of a single misconfigured announcement can spike quickly.

Wire Protocol Tuning: MTU and Fragmentation Avoidance

Path MTU discovery is one of those "should just work" systems that often doesn't. We spent three weeks debugging why certain remote sites experienced intermittent packet loss on otherwise healthy links. The culprit: a misconfigured edge switch that was silently dropping the ICMP Fragmentation Needed messages (Type 3, Code 4) that PMTUD depends on for feedback.

The fix was straightforward once we found it: let ICMP through on the edge switch, then verify the effective MTU end to end by crafting test packets with the Don't Fragment bit set rather than assuming every segment carries full 1500-byte Ethernet frames. We wrote a small script that probes the path and reports back the effective MTU for each hop.
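The core of that probe script is a binary search for the largest payload that survives with DF set. Here's a sketch of just that logic; the `probe` callable is injected (on Linux it could shell out to `ping -M do -s <size> <host>`), and the bounds are illustrative.

```python
# Sketch of the MTU probe: binary-search the largest payload size that
# gets through with Don't Fragment set. Assumes `probe(size)` returns
# True on success and that success is monotone (if a size fails,
# every larger size fails too).
def find_path_mtu(probe, lo=1200, hi=9000):
    """Return the largest size in [lo, hi] for which probe succeeds,
    or None if even the floor fails."""
    if not probe(lo):
        return None
    while lo < hi:
        mid = (lo + hi + 1) // 2  # bias up so the loop terminates
        if probe(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo
```

Remember the result is a payload size, not the MTU itself: a 1472-byte ICMP payload plus 28 bytes of IP and ICMP headers corresponds to a 1500-byte path MTU.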

What I learned: never assume that MTU defaults work end-to-end, especially if you're running multiple administrative domains (VPN overlays, legacy hardware, containers, etc.). Even a single misconfigured interface can cause packet loss that looks random from the application's perspective.

# Path MTU discovery test (Linux ping; -M do sets Don't Fragment)
ping -M do -s 1472 [target-ip]
# 1472-byte payload + 28 bytes of IP/ICMP headers = 1500-byte packet
# Guaranteed minimum MTUs: 1280 (IPv6), 576 (IPv4); standard Ethernet is 1500
# Our internal paths run 9000 (jumbo frames) but edge is capped at 1500

The lesson: instrumenting your network to understand its real constraints pays dividends. We now run MTU probes as part of our baseline health checks on every production path.

Self-Hosted DNS: Fallback Strategies

Running authoritative DNS in-house means you're responsible for availability. We learned this the hard way when a single router flap cascaded into DNS outages because our resolver's upstream didn't have a secondary. Now we maintain three independent resolvers across different ASes, all synchronized via a private transfer channel.

Key principle: your resolver must never be a single point of failure. We use GeoDNS to steer clients to the nearest healthy resolver, and each resolver can fall back to a trusted public resolver (Google, Quad9) for non-authoritative zones. This way, even if our entire infrastructure goes down, services don't immediately fail.
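The fallback behavior can be sketched as resolver config. The post doesn't name our resolver software, so this is a hypothetical Unbound example; the zone name and choice of public resolvers are illustrative.

```
# Hypothetical Unbound sketch: answer our own zones locally, forward
# everything else to public resolvers as a fallback path.
server:
    # internal zone served from local data (or a local auth server)
    local-zone: "internal.example." static

forward-zone:
    name: "."
    forward-addr: 9.9.9.9    # Quad9
    forward-addr: 8.8.8.8    # Google
```

The key property is that the forward path is only for zones we're not authoritative for, so a total loss of our own infrastructure degrades resolution rather than breaking it outright.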

We also discovered that response time is as important as availability. An extra 100 ms of lookup latency can trigger timeouts in impatient client code. Our resolvers now live close to our edge, and we cache aggressively (within TTL bounds) to serve most queries in under 5 ms.
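The steering decision behind the health checks boils down to picking the fastest resolver that stays under a latency budget. A minimal sketch, assuming we track a recent median latency per resolver (the budget value and resolver names are illustrative):

```python
# Sketch of resolver steering: choose the lowest-latency resolver whose
# recent probes stayed within the timeout budget.
def pick_resolver(latencies_ms, budget_ms=100):
    """latencies_ms maps resolver name -> recent median latency in ms,
    or None if probes failed. Returns the fastest healthy resolver,
    or None if nothing qualifies (time to fall back to public DNS)."""
    healthy = {name: ms for name, ms in latencies_ms.items()
               if ms is not None and ms <= budget_ms}
    if not healthy:
        return None
    return min(healthy, key=healthy.get)
```

In production the real signal feeding this is GeoDNS plus per-resolver probe history, but the decision rule is the same: availability gets you into the candidate set, latency picks the winner.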

Last updated 2026-04