Post-Quantum TLS Is Already On, and It Broke Your Router

Hybrid post-quantum TLS (X25519MLKEM768) is negotiating by default right now, and the operator problem is a 1,400-byte ClientHello that fragments and breaks SNI routing.

A standard TLS 1.3 ClientHello fits in one packet. It has for the entire life of the protocol. As of this year, on a growing share of your connections, it does not, and that one change is quietly breaking SNI routing on edges that have nothing to do with cryptography.

Here is the part most post-quantum coverage misses, including the piece this site ran earlier. The migration is usually framed as a deadline: NIST finalized the standards, CNSA 2.0 and FIPS set the clocks, and you have until roughly 2030 to pull out RSA and ECC. True, and already behind the facts. The hybrid post-quantum key exchange is not sitting in a backlog waiting for your migration plan. It is negotiating on your TLS connections today, by default, whether or not anyone on your team opened a ticket.

The default flipped while you were reading the deadline memos

The mechanism is X25519MLKEM768, the hybrid group that pairs classical X25519 ECDHE with ML-KEM-768, the standardized successor to CRYSTALS-Kyber. The clients did not wait for you.

Chrome turned on its predecessor by default in version 124, back in April 2024. AWS added ML-KEM hybrid TLS to KMS, ACM, and Secrets Manager non-FIPS endpoints on April 7, 2025, and scheduled the older Kyber group for removal in 2026. Akamai began rolling it to client-facing connections on September 1, 2025, and to origin connections through late 2025. They all implement the same thing: the IETF draft draft-ietf-tls-ecdhe-mlkem.

Add that up. A meaningful fraction of your inbound and outbound TLS 1.3 handshakes are already post-quantum. And the operator-visible consequence is not cryptographic at all. It is a packet-size problem, which is a much dumber and more annoying class of problem to debug.

32 bytes became 1,216, and that is the whole story

A classical X25519 key share is 32 bytes. The X25519MLKEM768 public key the client now stuffs into the ClientHello is 1,216 bytes, per Akamai's measurements. That single substitution drags the ClientHello from the old 300 to 500 byte norm up past 1,400 bytes, which lands right at or above the typical 1,460 byte TCP MSS.

So the ClientHello fragments. It splits across two TCP segments. For TLS 1.3 this is new behavior, and the entire ecosystem of middleboxes that grew up assuming a single-packet ClientHello never had to handle it before, because it never happened.

The cryptography works fine. The math is cheap. That is exactly what makes this dangerous: the failure does not show up where anyone is looking.

Why this lands on the routing team, not the crypto team

The failure mode is silent, intermittent, and it lands on the people who route traffic, not the people who picked the cipher suite.

When a ClientHello fragments across two segments, any middlebox that reads the SNI out of the first packet to make a routing decision can fail. The name it needs is in the second segment. It never reassembles, so it never sees the hostname.

Traefik's TCP router with a HostSNI rule is the documented, reproducible case. There is a community forum thread walking through it: with a post-quantum client, the SNI is not in the first segment, the HostSNI match fails, and the connection does not route. Swap the client back to classical X25519 and the same request works immediately. Nothing in the certificate changed. Nothing in the policy. Nothing you would think to inspect.

The affected set is broad. TCP-mode reverse proxies and L4 load balancers doing SNI inspection. Older firewalls and DPI appliances that assume one-packet ClientHellos. And per Akamai, "faulty server implementations" that simply reject a ClientHello larger than they expected, rather than reading the second segment.

Think about who gets paged. Someone reports an intermittent connection failure to one backend. The cert is valid. The route config is unchanged. The TLS layer reports nothing wrong because, from its point of view, nothing is. The trigger is a browser update on a client you do not control, flipping a default before your edge was ready for a fragmented handshake.

The round trip you pay instead

There is a second cost, and it is a tax either way.

If a server does not support the hybrid group, a well-behaved client falls back, but that fallback costs an extra round trip through HelloRetryRequest. So some clients send a smaller initial ClientHello to dodge fragmentation, and they pay the round trip instead. You are choosing between a fragmented first packet and a second round trip. You are not avoiding the cost, just relocating it.

This is the trap, stated plainly. AWS measured a 0.05% drop in max transaction rate with connection reuse on. That number is so small it tells teams there is nothing to test. The breakage hides one layer down, at the packet and middlebox boundary, precisely where the negligible crypto cost convinced everyone not to look.

The library footguns that downgrade you silently

Even on the server and client config side, the defaults will bite you.

OpenSSL's SSL_CTX_set1_curves has an unintuitive split between which groups you advertise and which you actually include a key share for. BoringSSL behaves differently: it auto-includes every group you name. Misread either one and you send a key share you did not intend, or you omit one you did. Neither throws an error. You get a quiet downgrade, which is the worst possible outcome because it looks like success.

On the client side, AWS SDK for Java v2 needs version 2.30.22 or greater to negotiate ML-KEM. Older clients fall back to classical gracefully once Kyber is removed in 2026, per the April 7, 2025 AWS announcement. Graceful, yes. But "graceful classical fallback" means you silently lose the post-quantum protection you thought you had, and nobody gets an alert.

The counterpoint, and why it does not save you

The obvious objection: if most clients fall back gracefully, why care?

Because the failures that do not fall back gracefully are the routing ones, and those are invisible from the crypto layer. A graceful cipher fallback does nothing for you when the SNI never reached the router in the first place. The connection does not negotiate a weaker handshake. It does not connect at all, or it routes to the wrong place. Cipher negotiation and TCP-segment reassembly are different machines, and the one that is failing is the one nobody instrumented.

And to be clear, none of this is a reason to disable hybrid PQC. Harvest-now-decrypt-later is real: the confidentiality benefit accrues the day you turn it on, against traffic an adversary is recording today to decrypt later. The deadlines are real and they are separate from this. The point is narrower and more immediate. The on-the-wire reality arrived before your migration project did, and your edge is where it shows up.

What to check this week

Do these in order. Each one maps to a specific failure above.

Reproduce a real post-quantum ClientHello against your own ingress. Run openssl s_client -groups X25519MLKEM768 -connect your-edge:443 -servername host, then watch it in tcpdump and confirm the ClientHello fragments across two segments and that the SNI still routes. If routing fails only with the PQ group and works with classical X25519, you have found the bug before your users filed the ticket.
Audit every TCP-mode SNI inspector. Traefik TCP routers, HAProxy in TCP mode, any L4 SNI-routing load balancer, any DPI firewall. The single question for each: does it reassemble a multi-segment ClientHello before reading SNI, or does it read only the first packet? Reading only the first packet is your trigger condition. For Traefik specifically, track the fragmentation issue in the forum thread and pin to a version that reassembles.
Pin your AWS client versions deliberately. Upgrade AWS SDK for Java v2 to 2.30.22 or greater before Kyber is removed in 2026, or consciously accept the silent classical fallback. Pick one. Do not discover the answer in a log six months from now.
Get the library flag right once, then move on. On OpenSSL stacks, confirm whether SSL_CTX_set1_curves is advertising X25519MLKEM768 or actually including a key share for it. On BoringSSL, remember it includes every group you name. A misconfiguration here is a quiet downgrade, not an error you will see.
Treat FIPS and HSM paths as a separate track. AWS enabled hybrid on non-FIPS endpoints first. If you sit on FIPS endpoints or HSM-backed key paths, do not assume the hybrid group is available, and do not assume your validated module covers ML-KEM yet. Confirm before you build a dependency on it.

The crypto layer will keep reporting that everything is fine. Test the packet path, because that is where this actually breaks.