No School Like The Old School

I really need to learn to leave DNS alone :)

DNS TXT Record Parsing Bug in LibSPF2
A relatively common bug parsing TXT records delivered over DNS, dating at least back to 2002 in Sendmail 8.2.0 and almost certainly much earlier, has been found in LibSPF2, a library frequently used to retrieve SPF (Sender Policy Framework) records and apply policy according to those records.  This implementation flaw allows for relatively flexible memory corruption, and should thus be treated as a path to anonymous remote code execution.  Of particular note is that the remote code execution would occur on servers specifically designed to receive E-Mail from the Internet, and that these systems may in fact be high volume mail exchangers.  This creates privacy implications.  It is also the case that a corrupted email server is a useful “jumping off” point for attackers to corrupt desktop machines, since attachments can be corrupted with malware while the containing message stays intact.  So there are internal security implications as well, above and beyond corruption of the mail server on the DMZ.
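
For readers who have not met this bug class before, here is a minimal sketch of the defensive version of TXT parsing. It is not LibSPF2's code, and the exact flaw is not spelled out in this post. The sketch only assumes the familiar wire format, where TXT RDATA is a run of character-strings, each prefixed by a one-byte length. The function name `parse_txt_rdata` is made up for illustration. The point is that the length byte comes from the network, so it has to be checked against the bytes actually received before anything is copied.

```python
from typing import List


def parse_txt_rdata(rdata: bytes) -> List[bytes]:
    """Split TXT RDATA into its length-prefixed character-strings.

    Hypothetical helper: raises ValueError instead of reading past the
    end of the record when a length byte claims more data than exists.
    """
    strings: List[bytes] = []
    offset = 0
    while offset < len(rdata):
        length = rdata[offset]          # attacker-controlled length byte
        offset += 1
        end = offset + length
        if end > len(rdata):            # the check that this bug class omits
            raise ValueError("character-string overruns the TXT record")
        strings.append(rdata[offset:end])
        offset = end
    return strings


# A well-formed record with two character-strings.
print(parse_txt_rdata(b"\x03abc\x02de"))
```

In C the missing check turns into a copy that runs past the end of a buffer, which is where the memory corruption comes from.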

Apparently LibSPF2 is actually used to secure quite a bit of mail traffic – there’s a lot of SPAM out there.  Fix is out, see http://www.libspf2.org/index.html or your friendly neighborhood distro.  Thanks to Shevek, CERT (VU#183657), Ken Simpson of MailChannels, Andre Engel, Scott Kitterman, and Hannah Schroeter for their help with this.

Edit:  Special thanks, incidentally, to Coverity, who upon hearing that there was one bug, ran their static analyzer on LibSPF2 and found six more.  Cool!

[cite required]

Someone asked for a cite on the Consumer Reports claims in my Black Hat 2008 slides.  I went and tracked this down, and I actually picked this up from the Meandering Wildly blog.  Looks like I misread this a bit — a previous dataset had come from Consumer Reports, but the data in my Black Hat deck actually came from Venafi, a security firm that specializes in systems management.  Some collateral with more of their SSL data is here.  Their methodology for collecting the data, according to Meandering Wildly:

It’s a phone poll, so it’s subject to standard errors of self-reporting, and their margin of error (2.5%) is given for a 0.1 confidence interval, which is a little slack for my tastes, but they have a large (N>1000), US-Census-representative sample, which maybe gives us intellectual permission enough to keep playing.

Of course, I also spoke about the one case we have hard data on — when the New Zealand bank’s cert went bad, and 99% of people didn’t care.  Information on that case can be found here.  I do wonder how these numbers might be changing in light of IE8 and FF3’s dramatically improved invalid SSL certificate experiences.

In general, anything I claim, I’m only too happy to back up, so if you have any questions regarding any of the details from a talk I’ve given, don’t hesitate to ask.

This Was Not What I Had In Mind

So, after I completely whiffed on the initial disclosure of the DNS flaw, I wrote the following:

So there’s been some skepticism about the DNS flaw. I want to be clear: It was richly deserved. A “put up or shut up” mentality is critical to the survival of our industry. It’s just too easy to make stuff up, if you can just wave away detractors with “I can’t prove it…it’d be UNSAFE.”

The danger from that statement is very tempting and very real. Our credibility as an industry — ultimately, our ability to get bugs fixed — depends on that statement being called out as the bullsh*t that it is.

We, as an industry, have gone back and forth on full disclosure (i.e. tell everyone) vs. responsible disclosure (i.e. tell everyone after the vendor has had a chance to fix things). Partial disclosure has always been looked down upon, rightfully so, because it’s so amazingly easy to abuse. But if our goal is to protect customers, and one particular bug will affect almost all of them, and a phased disclosure of information will protect the greatest number of customers possible — then perhaps there’s a place for this mode.

It’s certainly not a path you can safely decide to take by yourself, however. That’s what I did, when I refused to tell anyone else in the security industry what the bug was. It’s not something I’d ever do again. It’s not just that you can’t vouch for your own bugs. It’s that, without peer review, you don’t know what bugs people are going to think you’re recapitulating, and you don’t even really understand the severity of your issue.

It’s your blood, sweat, and tears in there. Hard to be impartial.

Tie that in with — it’s not just your credibility you’re betting with, when you go out with partial disclosure, but the credibility of every security researcher working to fix problems — and it’s really not something you can do alone.

That’s what I thought was clear. But what’s happened since my Black Hat talk is something really problematic. Partial Disclosures are occurring, and rather than the press saying “Well, we’ll wait until there’s more information to report on this”, they’re saying, “Wow! This must be like the Kaminsky bug! The Internet’s going to die!”

Awww hell no.

Look. Not every bug is Internet killing. I mean, it’s 2008, if you can’t profit on it nobody’s doing it. And killing the Internet is the first thing you’re told not to do in “Being An Internet Parasite 101”. There’s no profit, so it’s not going to happen — well, barring nation-state level extortion, anyway.

So, what’s been going on with partial disclosures?

First, there’s RSnake and Jeremiah Grossman with Clickjacking. Oh, it’s definitely a bug, unless you think any web site should be able to snap photos from that camera in your laptop. It’s not Internet ending, as RSnake has visibly and repeatedly had to explain to people. But it’s something that needs to get fixed. Why is it a partial disclosure? Because the vendor asked — nicely — for a little more time to complete the responsible disclosure process.

I think everyone can agree that it would have been better to have stayed fully dark until a vendor patch was ready, if waiting for a vendor patch was the path to take.  Partial disclosure is not good here, but positive working relationships with vendors are.  At least we have pre-emptive excision from “Internet Killing” class.

Not so with the newest case.  Now, there’s Robert E. Lee and Jack Louis with their TCP Denial of Service attacks. Now, it’s a bit silly to assume Jack Louis doesn’t know the history of TCP attacks, as it’s silly to assume I don’t know the history of DNS attacks. (You’d be amazed how many people thought I’d just reinvented the birthday attack.) Jack’s written more crazy TCP code than you have, for all values of you including me and possibly Fyodor. Do their attacks work, mostly as they’re saying? Almost certainly. There’s dozens of weird corner cases in TCP where resources and timers are allocated. It’s entirely feasible that at least some of them have nasty effects on the system above and beyond three way handshake flood.

We’ll have to see. But while Robert and Jack appear to have just wanted to talk about them at T2, something has gone wrong.  People are reporting this as an Internet Killing class bug, because the last time someone wanted a couple more weeks, it appeared to actually be one.

Look.  It’s a DoS, from non-spoofable address space.  We’ve operationally been surviving DoS’s for years and years; if we have a more efficient DoS, that correlates to a smaller botnet. Well, we’ve got freakin’ huge botnets out there, and we’re doing OK. Things go down for a little while, and then lots of IP space gets blocked.

Again, this is Jack Louis we’re talking about, so I’d bet any detractor in the world he has some ridiculously cool resource exhaustion attacks to talk about.  He probably knows more about TCP than you do.  (If you don’t know what Time-Wait Assassination is, he definitely does.)  But the “meta-message” that “I’m not telling you everything, because the Internet will come to an end if I do” almost certainly comes from his desire to finish up some of the cooler attacks, colliding with my reticence to talk this summer because I was trying to get people to patch.

That’s not Jack or Robert or RSnake’s fault. I do think it’s my responsibility to clean up, if not for all the researchers of the world, then for the press.

To all the people in the press: Guys, you’re all awesome, but someone’s going to lie to you. They are going to lie to you, because it will make them money. (See above, it’s 2008.) Consider: A partial disclosure only makes sense, operationally, as a call to arms. It is a disclosure that there is a fault, for which action must be taken. If the action to be taken is to spend a ton of money on a new and fabulous gizmo, then some portion of your readership will go ahead and do just that. And indeed, the longer the period of time between partial disclosure and “full” disclosure, the more gizmos will be sold, and the more money will be made.

To the nth degree, this strategy incentivizes fly-by-night shops to make bugs up, and not even disclose them to the vendor, because that would stop the gravy train as the vendor called foul.

So, press, I must apologize to you. I have put you in the situation where either you acquire complete validation of a flaw from a second source — and thus risk getting scooped! — or you get played. Since I’ve done this to you, I’m going to do something about this. I am going to create something of a community council, that will pre-vet (under legal and strict NDA) any bug that someone claims is so very important that it cannot be disclosed to the point of independent reproduction. Members of this council will have to have publicly presented work in the subject area that is under consideration. I’ve spoken to a decent number of people, and everyone is somewhere between very pissed and legitimately afraid of a flood of unjustified partial disclosures.

Faced with an unending stream of “is the Internet dead yet?” Slashdot posts, everyone I’ve spoken to appears fully on board with providing an honest judgement regarding the legitimacy of findings.

Now, I expect we will reject, out of hand, almost all claims. But we will do so, with the full technical argument brought by the finder, rather than presumptions based on old flaws. Attacking the strawmen implied by partial disclosure is a losing scenario for literally everyone involved.

It stops here. Reporters: There will be a more formal process coming, but please mail or call me if anything more of this nature shows up. Hackers: Would you volunteer to enter an NDA with someone, to help publicly assert whether this sort of behavior is legitimate? Mail me too, especially if you wouldn’t trust me vouching for a given bug.

I think as a community, we’re going to need to determine the limited number of scenarios where the benefits from partial disclosure outweigh the risks. Again, partial disclosure only makes sense as a call to arms. What are users supposed to do, if not install a patch or implement some reasonable operational procedures? A partial disclosure is an ask — what are you asking for?

Vendor intransigence is not an excuse. Patches cannot be reliably synthesized in the time between partial disclosure and when someone figures the attack out. If things are so broken that there must be a release, then the release must be compelling enough that the intransigent vendor cannot deny the validity of the findings.  Such a release is not going to be partial!

(Of course, the existence of a council might actually help with occasional intransigence. Hackers — if you’re sitting on a bug you think is Internet threatening, and you’re not getting any progress with the vendor, mail me too.)

There are of course other factors. Effect on infrastructure, which patches very slowly, is one. The severity of the bug, the difficulty of mitigating attacks using existing operational procedures, and oh yes, the actual novelty of the attack all matter as well (there’s a reason it’s called ‘news’). Put simply, no bug that does not actually threaten the stability of the network, in a way we aren’t already mitigating, should ever follow partial disclosure.

I will say, it’s totally OK to fix old and severe bugs that nobody deigned to repair. But you actually do need to get them fixed :)

So, what options do Robert and Jack have with the sockstress bugs? Without full details on the attack, the only detail everybody can judge is that there are no patches for this attack — but the same operational mitigations that we use for botnets, will almost certainly cover this issue as well. Both are dings against this bug getting slotted into the Internet-Threatening Partial Disclosure bucket. Things are complicated, however, by the fact that this has already gone off the rails and into the press and vendor realms.

What exactly they should do, really does depend, and isn’t really my place to say.  After all, it’s their bug, they can do what they want.  But now that this has gotten into vendor hands, I think they need to weigh in, to Robert and Jack, what their perspective on this issue is.  It’s entirely possible that this issue has been wildly overblown — it’s a legitimate class of bug, yes, but nothing to panic over and something that can be discussed without “Killing the Internet”.  In this case, there’s nothing wrong with Robert and Jack releasing one of their many bugs, just so they can stop talking in theory.

If it’s, in fact, a more severe issue — severe enough to pull the talk, at least as per the vendor — I do think the community needs to have its say on the nature of the fault class.  It’d just be too tempting for a fly-by-night operation to threaten vendors into ‘playing along’.  A talk can be pulled to be nice to a vendor, or a talk can be pulled because it’s a massive scale threat, but only the community can judge which is which.

Yes, this is all enormously messy. The next time anyone asks why partial disclosure is not more common — this is why.

All Roads Lead To Rome

So Sarah Palin’s webmail was hacked recently, ostensibly through a “forgot my password” attack. Venturebeat’s Dean Takahashi remembered that I’d recently been warning about these systems at Black Hat, and solicited my opinion. Here’s what I had to say.

My observation then was that the unifying theme of the bugs of 2008 has been a complete failure to authenticate.

I have to admit, I’m a little surprised to see the theme infecting the election. But, there it is. Webmail providers have a particularly tricky problem with “Forgot My Password” links: They can’t presume you have some mail address to send a password or a reset link to, because they *are* your mail address. With nothing else they can go on, they end up trying personal entropy — secrets like when you were born, where you went to school, etc.

In an increasingly less private society, “secrets” like your birthday are easier and easier to acquire from just normal people — let alone massively visible Vice Presidential nominees like Sarah Palin. So personal entropy is now struggling even more as a mechanism to authenticate.

People have suggested — why not use the telephone system? Everyone has SMS (text messaging). From one perspective, this is completely true. From another, in this increasingly less private society, a decent number of people are specifically averse to having to permanently identify themselves to websites. (Skip a few chapters, and you can watch SMS spam explode as every website collects those numbers ‘in case you forget your password’.) And so we end up at OpenID and its ilk, which attempt to solve the problem of password forgetting by having all sites (effectively) share the same password, or at least authentication technology (since you might use a key fob to log into your OpenID provider). This has some downsides, but isn’t necessarily bad.

One quirky thing, given the election, is how electronic voting and the latest Forgot My Password hack play into one another. People want to vote, but they want their vote to be secret, but they want to be able to detect fraud, which normally requires validating the voter to their vote. People also want to log into their websites, but they want their real identity to be obscured, but they want to still be able to get in if they forget their password, which normally requires validating the real identity to the account. We can say this is ridiculous all day, but there are many people who won’t vote if their ballot isn’t perceived as secret, and there are many people who won’t use the web if their personal identity isn’t perceived as secret.

Notice how the big new feature in all the new browsers is secret (read: porn) browsing. Funky times we live in, eh?

Thinking about it some more, it’s actually impressive, bordering on spooky, how the Sarah Palin hack plays into all sorts of issues surrounding IT. It’s not just the woeful state of authentication, or the quiet but deep desire for [a|A]nonymous connectivity to the Net that enabled the hack in the first place.

No, what’s interesting me now is how everyone’s so very surprised that Palin would use a personal email account for official purposes. Not that I’m defending these actions — the political side of me is a staunch supporter of transparency, as you can’t manage what you can’t measure and if you can’t measure your government you’re pretty much hosed — but from a purely technical standpoint, McCain didn’t invent the Blackberry, but Palin sure didn’t invent using Yahoo at work.

In fact, it’s part of a larger trend, one worthy of analysis.

IT departments are always in a bind. They’re responsible for anything that goes wrong on the network, but every restriction, every alteration they make in people’s day to day business, carries with it a risk that users will abandon the corporate network entirely, going “off-grid” in search of a more open and more useful operating environment. You might scoff, and think people would get fired for this stuff, but you know what people really get fired for? Missing their numbers. In the age of Slammer, I remember an IT department that found out about an entire division that had gone near-off-grid, with their own PC’s and own Internet connectivity. (The division didn’t patch, and flooded the rest of Corpnet with the one remaining internal link.)

But it’s not the age of Slammer, anymore. It’s never been easier to get away with going off-grid. Widespread availability of WiMax and 3G networks means there’s an alternate, unmonitored high speed network available at every desk. And what’s available out there? The Cloud.

The Cloud is fascinating. Based on the very real perception that it’s easier to write and maintain software for one tightly controlled server farm rather than millions of servers or even thousands of appliances, The Cloud offers some of the best new functionality we’ve seen in years, at the cost of the wholesale export of internal company data to the Internet.

Some companies embrace this. Others don’t, but like all productive technologies (anyone remember the early days of Linux?), the tech comes in quietly, and holds up well after being discovered simply by showing profitability.

Now, is it safe? On the one hand, you’re exporting data outside the perimeter. The whole point was to avoid doing that. On the other, take a look at what’s out there. 37Signals’ BaseCamp is becoming the way to manage clients and projects with a shared environment that tracks conversations, revisions, and schedules. All of these are elements that, by their very design, cross the perimeter. Salesforce.Com practically is the way entire sales fleets manage their customer base. And then there’s what Crystal showed me people do with Google docs:

  1. Put a spreadsheet on Google docs.
  2. Tell everyone who’s supposed to contribute to the spreadsheet, to contribute to the version on Google Docs themselves.
  3. Profit.

There are certainly ways to play this game in the traditional IT way. But, you know, distributed locking is one of the grand problems of computer science, even without introducing federation of trust across company lines. Centralized locking? Why, just head over to Google Docs…

And don’t think it’ll stop at the “few” instances where somebody outside the company needs to participate in a shared document. One must recognize that any large corporation is a collection of perimeters: Team, Department, Building, Division, and sometimes, Shared Nameholder (Verizon and Verizon Business are not the same company). Borders are fuzzy, and it’s the every day worker’s responsibility to navigate these borders as quickly and efficiently as possible.

Is the Cloud more efficient? It’s where the most intensive software development efforts are going right now. It may very well be. But is it secure? Is it safe? Are the (not insignificant!) efforts of Google, and Yahoo, and 37Signals, and Salesforce.Com enough? That’s the sixty four thousand dollar question…and right there, in the middle of us asking…

…in walks Sarah Palin, exposing gov.sarah@yahoo.com to the world.

Just like everyone else would.

Happy Birthday to the Renderman Paradox

In security, we have something we call the Birthday Paradox, which is well illustrated by the following question: How many people do you need in a room, before it’s more likely than not that there are two people with the same birthday? Most people think the number’s pretty big — 50, 60, 100…nope. It’s 23. Most classrooms have two kids with the same birthday. This happens because the more kids you add, the more birthday “slots” are taken that each new kid might potentially now share. The 2nd kid had only 1 birthday to match, while the 25th kid had 24 birthdays he could match.
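
If you want to check the arithmetic yourself, here is a small sketch. It assumes 365 equally likely birthdays and ignores leap years, and the function names are just for illustration. It multiplies out the chance that everyone's birthday is distinct, then finds the first group size where a shared birthday becomes more likely than not.

```python
def shared_birthday_probability(people: int, days: int = 365) -> float:
    """Chance that at least two of `people` share a birthday."""
    p_all_distinct = 1.0
    for already_taken in range(people):
        # Each new person must avoid every birthday already taken.
        p_all_distinct *= (days - already_taken) / days
    return 1.0 - p_all_distinct


def smallest_group(threshold: float = 0.5, days: int = 365) -> int:
    """Smallest group size whose match probability exceeds `threshold`."""
    people = 1
    while shared_birthday_probability(people, days) <= threshold:
        people += 1
    return people


print(smallest_group())                            # 23
print(round(shared_birthday_probability(23), 3))   # 0.507
```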

Well, I have my own paradox. I call it the Renderman Paradox. This holds that, for any conference, there’s someone you’ll never get to see, because somehow you’re always speaking at the same time they are. I call it that, because, well, Renderman (of the Church of WiFi) and I are always up against one another. Finally, this year at Defcon, we weren’t. Renderman was in my room, yes, but after me. And then…

Well, there were a couple thousand people in line. It took a while to get ‘em all in. Actually, it took about a half hour to get them all in. So, the Defcon people bumped Renderman, and Sze Siong, to an overflow room.

I gave my talk to the hordes…but these guys had space for about 50. Ouch. Bonus: Renderman’s talk was “10 things that are pissing me off”. Oops :)

And so, I thought it might be nice to at least publish Renderman and Sze’s talks here. First up, Renderman (here’s his site, with video!):

…and secondly, Sze Siong:

In other news, this Scribd site is pretty cool :) Now, all I need to do is buy Render that beer…

Towards The Next DNS Fix

Ultimately, I can’t at all complain about armchair engineering.  The whole point of Source Port Randomization as an interim fix was to get things to the level that we could all have the big messy discussion about what to do now, without being illuminated by the actively burning state of the DNS infrastructure.

Now.  When it comes to fixing DNS, we have to operate under the same constraint as when we suggest fixes to web browsers.  Just as you’re not allowed to break the web, you’re not allowed to break DNS.  There are indeed many things we could do to make the web a safer place, “if only a bunch of people would re-code their web sites”.  That is, unfortunately, a naive approach that doesn’t actually lead to things getting any safer.  If nobody will deploy the fix, it’s just as if the fix didn’t happen.

We needed this DNS fix to happen.

As I’ve said a couple of times, Dan Bernstein was right.  Source Port Randomization (SPR) is not perfect — I’m pretty embarrassed that we didn’t recognize how common interactions would be with firewalls — but it’s a remarkably flexible and thorough improvement to the status quo.  When I said in my talk that there’s fifteen ways around the TTL, I wasn’t kidding.  From magic query types that are uncached by a recursive server, to nonexistent query types that are ignored by an authoritative server, there may not be a TTL to override.  Or perhaps the attacker actually provides records for 1.google.com, 2.google.com, 3.google.com, and so on.  In other words, the attacker might not even try to overwrite the NS for a domain — he may just want to get a domain in.  How would this be useful?  Consider the web security model, and Mike Perry’s research on cookies.  1.google.com will collect the cookie for Google just fine.

Or perhaps, as in the case of Google Analytics and Facebook and most large, CDN hosted sites, the actual TTL to override needs to be small, for reliability and scaling purposes.

In all of these situations, Source Port Randomization — a solution forged in 1999, long before we recognized all these problematic variant attacks — poses a significant barrier to attack.  It’s not a panacea, but it was never said to be one.  The hope, and it’s not unreasonable, is that it’s a lot easier for secondary defenses to detect and correct for a flood of billions of packets, than a couple of thousand.  SPR’s purpose was to provide a safer environment for an active discussion that would hopefully yield better fixes.  And that’s what it’s doing!
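
To put rough numbers on "billions" versus "thousands", here is a back-of-the-envelope sketch. It assumes a 16-bit transaction ID and, as an upper bound, a resolver that can pick from all 2^16 source ports. Real resolvers use fewer ports than that, so treat the randomized figure as a ceiling rather than a measurement. The function name is made up for illustration.

```python
TXID_BITS = 16  # DNS transaction IDs are 16 bits wide


def search_space(source_ports: int) -> int:
    """Number of (TXID, source port) pairs a blind attacker must cover."""
    return (2 ** TXID_BITS) * source_ports


fixed_port = search_space(1)            # resolver always uses one port
randomized = search_space(2 ** 16)      # assumed upper bound with SPR

print(fixed_port)                 # 65536
print(randomized)                 # 4294967296
print(randomized // fixed_port)   # 65536 times more guessing
```

The attack itself does not change. SPR only multiplies the amount of traffic it takes, and that extra traffic is what the secondary defenses have a chance of noticing.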

So, let’s finally start talking about the better fixes that are emerging.  Specifically, the problem is — how do we stop the blind attacker who’s willing to send us four billion packets in order to pollute a name?  Four major strategies are, at least from what I’ve seen, making real strides towards a better fix.

1) DNSSEC. Say what you will about the perceived technical and political impossibility of this actually happening, but wow there’s been progress these last few weeks:  Besides lots of excited chatter that the roots are finally going to get signed, .GOV seems to be throwing some pretty serious resources at making DNSSEC happen. I’m neutral thus far on all the post-SPR solutions, and I’m really, aggressively neutral on DNSSEC.  The reality is there’s no harder task in all of IT than building a PKI, and the inescapable reality is that DNSSEC is a new identity infrastructure on the order of X.509.  It does solve the problems though, at least for the authoritative servers that opt into it, and the side benefits of having the system fixed in this particular way are rather compelling.

2) Layered Point Fixes. This is the approach Nominum is taking:  Basically, they’re bundling every point fix they can, and actively getting themselves into the position with their customers that as new bypasses are discovered, they can react quickly.  For example, when Nominum receives a packet with an incorrect TXID, they switch to TCP for that particular query.  This constrains an attacker in two ways:  First, they must force as many lookups as there are fake responses.  In other words, instead of being able to send 99.8% fake responses for each forced request, the attacker must send 50% requests, 50% responses.  Second, the attacker is constrained to the query rate that Nominum will actually send queries to a particular domain.

That alone, is not enough.  A slightly less efficient attack does not a fix make.  And so they port randomize.  But that too, is not enough — at least not for the long term.  And so they’re systematically building filters that attempt to detect as many weird variants as possible and attempt to address them on an attack-by-attack basis.

It’s certainly my preference to have a comprehensive fix.  But, pragmatically, I can’t deny that Nominum’s approach is yielding an increasingly harder target.

3) Attack Mode.  I’ll admit, this one appeals to me — that’s a change, I used to be a pretty staunch opponent, as I expect many people to still be.  But bear with me for a second.  Probably the most consistent signal of a blind cache poisoning attack is a spike in the number of responses received per second with an incorrect TXID (and, if you’re monitoring the network, incorrect destination port).  Even with a fully non-responsive upstream name server, this signal still survives, as the attacker needs to guess transaction IDs and ports and is going to guess wrong for a very long time.  This appears to hold true for all variants, known and even suspected.  Now, the concept of the SPR interim defense is that the brute force will either go too slow to be relevant for an attacker, or fast enough that the raw traffic levels will be noticed by even trivial network monitoring.

We can do better monitoring of DNS traffic with an IDS rather than just a traffic monitor, but you know who’s in a really good position to notice this attack?  The name server itself.  There’s no reason, inside the name server, that we can’t adapt to the attack — and change our posture to compensate.

Imagine for a moment that we monitored the absolute number of packets received with at least the wrong TXID.  (Depending on how we manage sockets, we might not see all the packets with the wrong source port.  We may not need to, or if we do, we can do so fairly trivially with libpcap filtering for source port 53.)  Assuming we were indeed receiving too many packets with the wrong transaction ID, we could deem ourselves…under attack.  What now?

I’ll tell you what we probably shouldn’t do:  Rate limit, either for all IP addresses, or for those that are specifically being spoofed.  (Remember, DNS servers enforce source address on incoming packets so they can correctly calculate bailiwicks — whether a particular server is allowed to speak for a given name in the first place.)  The problem with rate limiting is that, while it works very well to slow an attacker down, it also provides an attacker with a very consistent way to implement targeted denial of service attacks against DNS infrastructure.  Just flood bad replies, and the real reply will consistently get dropped.

A lot of security people are willing to tolerate DoS, in lieu of data corruption.  On one level, yes, it’s true, I’d rather have no service than corrupted service.  On the other, no service is in and of itself bad for business.  A trivial DoS that takes out Google for an ISP is more than just a problem — it’s a deployment blocker.

Again.  If nobody deploys your fix, it’s like you didn’t even write it.

That being said, DNS is a cruel mistress.  Due to the chained nature of DNS, reliable DoS attacks actually enable data corruption, by allowing an attacker to break the chain.  This has already been shown to cause headaches when an IPS blocks traffic to an authoritative server (mentioned earlier, and described in depth in my 2005 Black Ops talk).  But there are also implications to DNS clients, who will themselves now end up with nothing in their cache because a rate limited server couldn’t collect the data in the first place.

So, we shouldn’t drop traffic.  What can we do?  Perhaps, switch to TCP during the attack?  We know Nominum does this, at least on a per-query basis, when it detects an attack for that particular query.  So there’s some precedent.  But the resistance and nervousness around anything that allows you to force large numbers of servers to switch to TCP, for any reason, is significant.  It’s also impossible to ignore that a decent portion of recursive name servers cannot get 53/tcp out of their network, and that there are even a good number of authoritative name servers that refuse to host their DNS records over TCP.

There’s much less fear around debouncing — at least, well scoped debouncing.  This is just the technical way of saying, if you’re not sure about something, look it up twice.  You do need to make sure you get the same answer back both times — or else an attacker just forces you to debounce, and hopes he gets his contrary answer in both times.  And there remain interesting questions about what to do when the answers legitimately differ, because they come from a CDN that shuffles responses on a per-response basis for load balancing.  What now?  I’d like to avoid TCP, and triple and quadruple querying is only a little more likely to generate multiple queries with the same reply.  One option is to make use of this trick thought up by this neat new nameserver Paul Vixie showed me — I can’t find it right now, but I’ll put a link up once I do.  The idea he had was to wait around a few hundred milliseconds, seeing if a real server would show up with another reply.  If so, there’s an attack.  Now, when he did this, he was doing it all the time, so it was killing performance on DNS for all users of the protocol (again, deployment blocker).  But we’d only be doing this in attack mode.
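
Here is a minimal sketch of the "look it up twice" half of that idea. It is not taken from any real name server. The `resolve` callable is a hypothetical stand-in for whatever actually sends the query, and comparing answers as unordered sets is my own assumption, chosen so that simple record reordering does not count as a mismatch.

```python
from typing import Callable, FrozenSet, Optional

# Hypothetical resolver interface: name in, set of record values out.
Resolver = Callable[[str], FrozenSet[str]]


def debounced_lookup(resolve: Resolver, name: str) -> Optional[FrozenSet[str]]:
    """Ask twice and accept the answer only if both lookups agree."""
    first = resolve(name)
    second = resolve(name)
    if first == second:
        return first
    return None  # answers disagree: caller must fall back to something else


def stable(name: str) -> FrozenSet[str]:
    """Toy resolver that always returns the same address."""
    return frozenset({"192.0.2.10"})


print(debounced_lookup(stable, "example.com"))
```

A `None` result is exactly the case where the answers differ, whether because of an attacker or a CDN that rotates its replies, and the sketch deliberately leaves that decision to the caller.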

Yes, I think Akamai would accept slightly slower DNS resolution during an active attack against their particular names, on the particular name server that’s being attacked.

There is one funny variant we’d need to handle, if we were to depend on the real name server exposing the fake reply.  What if the real name server is non-responsive, for whatever reason?  I think the answer here is to handle situations where no answer comes back, by then and only then refusing to accept any packets from that IP address for ten seconds.  In other words, if a query fails, and nobody replies successfully, blackhole that server for ten seconds.  Legitimate servers have an easy way around this DoS — actually respond to that first query — so I think it’s the one DoS I can accept.

One matter that hadn’t really come up was scope.  There are three scopes we can defend against:  Per-query, per-NS, and global.  In other words, we can apply attack mode logic, whatever it may be, to one specific query, all queries to a name server that we see under attack, or all queries in the world.  My suspicion is that unless we actively detect attacks against just an absurd number of name servers (in other words, if the absolute number of incorrect TXIDs is not accounted for by any particular NS, thus meaning an attacker who doesn’t care which names he poisons as long as he gets someone), then per-NS scope is good.

I don’t like per-query, due to variants that it’s just not going to cover.  There’s some controversy here too, though: “query-fate-sharing” scares people a little.

So, in summary, all this ends up collapsing to some variant of:

Monitor the absolute rate of packets received with the wrong TXID, and possibly Port.  (BIND already does this — check the stats code.)
If the rate of packets exceeds some threshold — possibly dynamically set by the number of outstanding queries per second — start tracking which IP’s are “sending” packets with the wrong TXID/Port.
If there are too many NS’s to track, go into global attack mode.  Otherwise, go into per-NS attack mode for those NS’s, for ten seconds.  Hold this attack mode open as long as the spawning incorrect TXID/Port behavior continues, plus twenty seconds.  (This prevents twiddling attack mode on and off really fast, which defeats the purpose.)
During attack mode, debounce within the scope of that attack mode.  If two answers are received that disagree, issue a single query, and make sure one and only one reply comes back.  If no replies come back, suppress queries to that address for some small number of seconds.
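The steps above can be sketched as a small state tracker.  This is only an illustration under stated assumptions: the class name, the threshold, the one second window and the cap on tracked name servers are all invented here (the text says the real constants still need to be figured out), and for brevity it tracks per-NS counts from the start rather than first watching the global rate.

```python
from collections import defaultdict, deque

class AttackModeTracker:
    """Hypothetical sketch: per-NS attack mode driven by wrong-TXID/port replies."""

    def __init__(self, threshold=100, window=1.0, hold=20.0, max_tracked=50):
        self.threshold = threshold      # bad replies per window that trigger attack mode
        self.window = window            # sliding window length, in seconds
        self.hold = hold                # keep attack mode open this long after the last trigger
        self.max_tracked = max_tracked  # more attacked NSes than this means global mode
        self.mismatches = defaultdict(deque)  # ns_ip -> timestamps of recent bad replies
        self.last_over = {}                   # ns_ip -> last time the threshold was hit

    def record_mismatch(self, ns_ip, now):
        """Note one reply 'from' ns_ip that carried the wrong TXID or port."""
        recent = self.mismatches[ns_ip]
        recent.append(now)
        while recent and now - recent[0] > self.window:
            recent.popleft()
        if len(recent) >= self.threshold:
            self.last_over[ns_ip] = now

    def global_attack(self, now):
        """True when too many name servers are under attack to track individually."""
        active = [ip for ip, t in self.last_over.items() if now - t <= self.hold]
        return len(active) > self.max_tracked

    def in_attack_mode(self, ns_ip, now):
        """True if queries to ns_ip should be debounced right now."""
        if self.global_attack(now):
            return True
        t = self.last_over.get(ns_ip)
        return t is not None and now - t <= self.hold
```

Because last_over is refreshed every time the threshold is hit again, attack mode stays open for as long as the bad replies keep coming plus the hold time, which is what keeps an attacker from flipping it on and off quickly.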

The actual thresholds and constants would need to be figured out, but that’s roughly something I’m liking right now.  Sure, it looks complicated, but amusingly it’s still the simplest of the solutions listed thus far!

4) Case Sensitive DNS Responses (or ‘0x20’). This is David Dagon’s concept, and it’s interesting.  The concept is that DNS ignores case (www.foo.com is wWw.FOO.coM) but preserves case (if you ask for wWw.fOO.coM, you’ll get back wWw.fOO.coM).  So if we want more bits of randomness (if we want to get past 4 billion packets into more-packets-than-have-ever-been-sent-in-history), maybe we can use this trait.  As mentioned earlier, the problem with 0x20 is that an attacker can select names that don’t have enough case sensitive characters to add entropy.  Specifically, you can have numbers in a DNS name!  And so, when an attacker forces lookups for:

1.a11111111
1.a11111112
1.a11111113
1.a11111114
1.a11111115

0x20 can only provide one additional bit of entropy, and it’s not clear that one a is even required (it’s there to deal with the complaint ‘well, we’ll just detect completely numeric domains’).  And since all the above names have to be queried against the root servers, whoever corrupts those names gets to include whatever extra records he wants, because they’re all in bailiwick.  This is the exact problem that DNSSEC has: securing www.foo.com doesn’t just require securing foo.com, you also have to secure com and the roots themselves.  (XQID thought they got around this.  So close, but no.  I’ll post why later; this post is about fixes.)

Bottom line, 0x20 can’t secure the roots when there aren’t enough characters to add sufficient entropy.

That being said, almost all real world names do have enough characters in them to add lots of entropy.  In fact, of all the non-DNSSEC solutions, 0x20 is the one that can not only work for the common case, but survive without source port randomization.  (The attack mode above just doesn’t work well enough when the attacker has a 1/65K chance of winning.)  It does need some coverage in those synthetic cases where there’s not enough entropy, or even in the real world cases of very short domains (ibm.com, for example).
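To put numbers on "enough characters", here is a small Python sketch of the 0x20 idea.  The function names are made up for illustration and this is not code from any resolver; it simply counts one bit per ASCII letter in the query name, randomizes the case, and requires the reply to echo the exact same casing.

```python
import secrets

def _is_ascii_letter(ch: str) -> bool:
    return ch.isascii() and ch.isalpha()

def case_entropy_bits(name: str) -> int:
    """Each ASCII letter can be sent upper or lower case: one extra bit per letter."""
    return sum(1 for ch in name if _is_ascii_letter(ch))

def randomize_case(name: str) -> str:
    """Pick a random case for every letter; digits, dots and hyphens pass through."""
    return "".join(
        (ch.upper() if secrets.randbits(1) else ch.lower()) if _is_ascii_letter(ch) else ch
        for ch in name
    )

def reply_case_matches(sent_name: str, echoed_name: str) -> bool:
    """A reply is acceptable only if it preserves the exact casing that was sent."""
    return sent_name == echoed_name

print(case_entropy_bits("www.google-analytics.com"))  # 21
print(case_entropy_bits("ibm.com"))                   # 6
print(case_entropy_bits("1.a11111111"))               # 1
```

Twenty-one extra bits on top of the 16-bit TXID is a lot; one bit, as in the numeric names above, is close to nothing, which is why those cases would need something like debouncing as a backstop.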

Well, we have an entire debouncing framework described for Attack Mode.  Could we debounce when we don’t get enough entropy from the name?  Or perhaps we do so only when we detect 0x20 under attack, or when it is deployed on a network that, from either the authoritative or recursive side, canonicalizes away the case variation?

I’m not sure what the exact fix looks like.  But what’s clear to me now is what I was pretty sure of back in March:  The real fix, the comprehensive fix, is not going to be trivial.  It may be DNSSEC, it may not be, but it’s not going to be a one-character call-it-a-day point fix.  Say what you will about Source Port Randomization: conceptually, it’s several orders of magnitude cleaner than everything that’s yielding fruit now.  Dan Bernstein’s solution is good.  Doing better (by crypto, by filtering, by defending ourselves, or by another entropy source) will be hard.

Not impossible, but not the sort of thing 16 engineers in a room could pragmatically hope to accomplish.

Please Do Not Destroy The DNS In Order To Save It

So someone put together a “one character” patch to fix the “dns flaw”, and it hit Slashdot.

Would that one character could really save the day here.

There’s a lot wrong here, the key fact being there are just so many ways around TTL, which itself was never designed to be a security technology in the first place. Gabriel’s trick addresses one particular scenario. It’s not at all enough. Consider:

First of all, you don’t actually know that a nameserver is ever going to provide you a record, or that that record is going to be cached. We’re seeing bugs in both conditions. For example, PowerDNS wasn’t providing responses on strange query types. CNN doesn’t reply at all to nonexistent names. So there may not be a TTL to bypass.

Secondly, the more major the site, the smaller the TTL. One of the issues described in my slides was the fact that nothing prevents an attacker from replying multiple times to a single outbound query. Presume you can get 500 replies in before the real server does. Given that, you have about a 1 in 131 chance of hijacking the record. With Google Analytics’ TTL at 300, that’s about 5 hours on average — and you don’t have to send 4 billion packets, you’re still sending just a couple tens of thousands.

If Google Analytics gets taken, the web pretty much gets taken.  Welcome to the power of <script src="http://www.google-analytics.com"> putting foreign code into DOM’s around the world.

And it’s not like 300 is unusually low. Facebook’s at 30 seconds. That translates to about 30 minutes of security for Facebook — or their pizza’s free :)
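The arithmetic behind those figures can be checked in a few lines of Python.  This is a back-of-the-envelope sketch, not a model of any real resolver: it assumes a 16-bit TXID with no source port randomization, 500 spoofed replies landed per attempt, and one attempt per TTL expiry.  The "fraction" parameter is an assumption added here; the round numbers in the text line up with about half of the 131 attempts, while treating every expiry as an independent trial would give roughly twice that.  Either way the order of magnitude holds.

```python
TXID_SPACE = 2 ** 16  # 16-bit transaction ID, no extra entropy from the source port

def attempts_needed(replies_per_attempt: int = 500) -> float:
    """Reciprocal of the per-attempt success chance (the '1 in 131' above)."""
    return TXID_SPACE / replies_per_attempt

def hours_to_poison(ttl_seconds: int, replies_per_attempt: int = 500,
                    fraction: float = 0.5) -> float:
    """Rough time to win when one attempt is possible per TTL expiry."""
    attempts = attempts_needed(replies_per_attempt) * fraction
    return attempts * ttl_seconds / 3600

print(round(attempts_needed()))         # 131
print(round(hours_to_poison(300), 1))   # 5.5  (hours, TTL of 300 seconds)
print(round(hours_to_poison(30) * 60))  # 33   (minutes, TTL of 30 seconds)
```

At roughly 65 attempts of 500 packets each, that is on the order of 33,000 spoofed packets, which matches the "couple tens of thousands" figure.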

But there are records that do have long TTL’s, and that’s where things get really dicey. The records with the longest TTL’s in the world are all name server records. Google’s NS records have TTL’s at 345K seconds. Microsoft’s NS records have TTL’s at 143K seconds. Whether that’s a good idea or a bad idea, it’s reality. We allow in-bailiwick overwrite of cached NS records precisely because these very long TTL’d records sometimes need to be overwritten anyway. When Gabriel writes:

What’s the downside to my patch ? I guess we are now holding an
authoritative server to the promise not to change the NS record for
the duration of the TTL, which is kinda what the TTL is for in the
first place :)

What he’s saying is that Google and Microsoft should accept situations where their website is down for up to 95 hours (still too long). Now, granted, almost nobody’s going to actually hold onto a cached record for that long. But a single point of failure causing up to a week of residual outage out in the field is a very bad thing. A one character patch that caused such failures would be a serious problem indeed.

Now, all this being said, there’s lots of interesting thinking going on out there, and one of the things we all fully expected was a healthy discussion of all the possible options on the table. Maybe there’s a little more press than expected on one of those options, but I do think it’s good that we can now all see just how careful we need to be fixing this bug. There are a couple of approaches that are in fact converging on a safe and effective fix to the DNS, and I’ll be writing about them soon. In the meantime…nobody should presume any easy fix will actually solve the problem.

The Emergence Of A Theme

I’m not sure what it is, but there continues to be some sort of “competition” for “who can find the biggest bug” — as if attackers had to choose, and more importantly, as if any bug was so big that it could not be made even better by combined use with its “competition”.  Before my DNS talk, my old friend FX from Recurity Labs was comparing DNS issues to the Debian Non-Random Number Generator issue that caused all sorts of SSL certificates to offer no security value, and the SNMPv3 flaws that allowed infrastructure devices to be remotely administered by people who happened not to know the password.

Of course, after the talk, it became clear that the DNS hack and the Debian NRNG combined rather destructively — DNS allowed you to finally play MITM with all the SSL private keys you could trivially compute, and as Ben Laurie found, this included the keys for Sun’s OpenID authentication provider.  And, since the DNS hack turns Java back into a universal UDP and TCP gateway, we end up being able to log into SNMPv3 devices that would otherwise be protected behind firewalls.

So there’s no sense making a competition out of it.  There’s just an ever growing toolchest, growing from a single emerging theme:

Weaknesses in authentication and encryption, some of which have been known to at least some degree for quite some time and many of which are sourced in the core design of the system, continue to pose a threat to the Internet infrastructure at large, both by corrupting routing, and making those corrupted routes problematic.

Back in July, the genuinely brilliant Halvar Flake posted the following regarding the entire DNS issue:

“I fail to understand the seriousness with which this bug is handled though. Anybody who uses the Internet has to assume that his gateway is owned.”

And thus, why 75% of my Black Hat talk was on the real-world effectiveness of Man-In-The-Middle attacks: Most people aren’t as smart as Halvar.  I’m certainly not :)  Almost nobody assumes that their gateway is owned — and even those that do, and try to engineer around it, deploy ineffective protections that are only “secure unless there’s an attacker”.

I say this is a theme, because it is the unifying element between some of the year’s most high profile flaws.  There are two subclasses — some involve weak authentication migrating traffic from one location to another, while others involve weak authentication allowing an attacker to read or modify traffic migrated to him — but you’d have to have some pretty serious blinders to not see the unifying theme of weak authentication leads to pwnage.

Consider:

Luciano Bello’s Debian NRNG: This involves a core design requiring the generation of random numbers.  The random number generator required a random seed, but alas, the seed was made insufficiently random.  It’s an implementation flaw, but barely, and the effect was catastrophic failure against members of the X.509 PKI authentication system that had used the Debian NRNG, and thus by extension SSL’s encryption logic and Sun’s OpenID authentication gateway.

Wes Hardaker’s SNMPv3 Bug: Here, we have an authentication protocol that allows an attacker to declare how many bytes he wants to have to correctly provide.  Now, the attacker can claim “just 1 please”, and he gets into any router suffering this bug within seconds.  That, by extension, allows control over all traffic traversing that router.

Mike Zusman’s Insecure SSL-VPN’s: SSL is supposed to protect us, but there’s no sense creating a secure session to someone if you don’t actually know who they are.  Don’t worry though, by design anything that isn’t a web browser is terrifyingly likely to skip authentication entirely and just create an encrypted link to whoever’s responding.  One would think that SSL-VPN’s, whose sole purpose is to prevent attackers from accessing network traffic, would be immune.  But with 42% of certificates on the Internet being self-signed, and a lot of them being for SSL-VPN’s, one would be wrong.  By extension this auth failure exposes all traffic routed over these SSL-VPN’s.

Mike Perry’s Insecure Cookies: This gets interesting.  Here we have two different authentication protocols in place — one, from server to client, based on X.509.  The other, from client to server, based on a plaintext password (delivered, at least, over an encrypted session authenticated by the server-to-client cert).  But to prevent the user from needing to repeatedly type in their plaintext password, a password-equivalent token (or cookie) is handed to the user’s browser, which will be attached to every request within the securely encrypted channel.  Unfortunately, it’ll also be attached to every request which does not traverse the securely encrypted channel, because the cookies aren’t marked for secure-only.  Once the cookie leaks, of course, it’ll authenticate a bad guy who creates an encrypted session to that server.  So by extension bad guys get to play in any number of interesting sites.

My DNS flaw: Here we have a protocol that directly controls routing decisions, ultimately designed to authenticate its messages via a random number between 0 and 65535.  Guess the number, and change routing.  This was supposed to be OK, because you could only guess a certain number of times per day.  There was even an RFC entirely based around this time limit.  It turns out there’s a good dozen ways around that limit, allowing anonymous and even almost 100% packet spoofed compromise of routing decisions.  This, by extension, allowed exploitation of all traffic that was weakly authenticating.
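Of the flaws above, the SNMPv3 one is the easiest to show in a few lines.  The Python below is an illustrative sketch only, not the code from any SNMP implementation: the function names, the key and the message are invented, and HMAC-SHA1 truncated to 12 bytes stands in for the HMAC-96 style authenticator that SNMPv3 uses.  The flawed check compares only as many bytes as the sender chose to supply, so a one byte authenticator can be guessed in at most 256 tries.

```python
import hashlib
import hmac

def flawed_verify(key: bytes, message: bytes, claimed_mac: bytes) -> bool:
    """Bug sketch: the sender's own MAC length decides how many bytes get compared."""
    expected = hmac.new(key, message, hashlib.sha1).digest()
    return hmac.compare_digest(expected[:len(claimed_mac)], claimed_mac)

def strict_verify(key: bytes, message: bytes, claimed_mac: bytes,
                  mac_len: int = 12) -> bool:
    """Fix sketch: the receiver fixes the MAC length and rejects anything else."""
    expected = hmac.new(key, message, hashlib.sha1).digest()[:mac_len]
    return len(claimed_mac) == mac_len and hmac.compare_digest(expected, claimed_mac)

def brute_force_one_byte(key: bytes, message: bytes):
    """Attacker's view of the flawed check: try every possible one byte MAC."""
    for guess in range(256):
        if flawed_verify(key, message, bytes([guess])):
            return guess
    return None
```

In this sketch the key is passed to brute_force_one_byte only so the check can run locally; a real attacker would simply send up to 256 packets and watch which one is accepted.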

It’s the same story, again and again.  And now, everyone’s talking about BGP.  So let’s do the same sort of analysis on BGP:

Kapela and Pilosov’s BGP flaw: In BGP, only the nearest neighbor is authenticated.  The concept is that all “members of the club” authenticate all other members, while the actual data they provide and distribute is trusted.  If it’s not actually trusted, anyone can hijack traffic from anyone else’s routes.

Pilosov’s done some cool work here.  It’s not the sort of devastating surprise some people seem to want it to be.  Indeed, that’s what makes it so interesting.  BGP was actually supposed to be broken, in this precise manner. Literally, in every day use, any BGP administrator has always had the ability to hijack anyone else’s traffic.  Pilosov has a new, even beautiful MITM attack, but as mine was not the first DNS attack, his is not the first BGP MITM.  Tales of using BGP to force traffic through a compromised router (possibly compromised through SNMPv3) are legion, and Javascript and the browser DOM blur things pretty fiercely in terms of the relevance of being able to pass through to the legitimate endpoint anyway.

That’s not to take away from the work.  It’s an interesting trick.  But we need to level set here:

First, if you’re not part of the BGP club, you’re just not running this attack.  Pakistan took out YouTube with BGP, but some random kid with the ability to spoof IP packets couldn’t.  In other words, we’re just not going to see a Metasploit module anyone can run to complete these sorts of attacks.  Now, there are some entertaining combinatorics that could be played: DNS to enable Java’s SNMPv3 access to internal routers at an ISP, and then from that internal router running the sort of BGP tricks Pilosov’s talking about.  This goes back to the utter folly of trying to rank these bugs independently from one another.  But these sorts of combinatorics are at a fundamentally different level than the fire-and-forget antics that DNS allowed, and on a fundamental level, the number of potential attackers (and the number of involved defenders) on BGP is a lot lower.

Second, we have far better logging — and thus accountability — in the BGP realm than we do perhaps for any other protocol on the Internet.  Consider the archives at APNIC — yes, that’s route history going back to 1999 — and Renesys has even more.  That sort of forensic data is unimaginable for anything else, least of all DNS.  BGP may have its fair share of bad actors — consider spammers who advertise temporary ranges in unused space for mail delivery purposes, thus getting around blackholes — but any of the really nasty stuff leaves a paper trail unmatched by any other attack.

Third, BGP is something of a sledgehammer.  Yes, you’re grabbing traffic — but your control over exactly what traffic you grab is fairly limited.  Contrast that with DNS, which allows astonishingly fine grained targeting over exactly what you grab — indeed, you don’t even need to know in advance what traffic you want.  The victim network will simply offer you interesting names, and you get to choose on the fly which ones you’ll take.  These names may even be internal names, offering the impossible-with-BGP attack of hijacking traffic between two hosts on the exact same network segment.

Finally, BGP suffers some limitations in visibility.  Simply grabbing traffic is nice, but bidirectional flows are better than unidirectional flows, and when you pull something off via DNS, you’re pretty much guaranteed to grab all the traffic from that TCP session even if you stop any further poisoning attempts.  Contrast that with BGP, which operates at Layer 3 and thus may cause the IP packets to reroute at any point when the TCP socket is still active.

So, does that mean it’s always better to attack DNS than BGP?  Oh, you competitive people would like things to be so simple, wouldn’t you :) Pilosov and I talked for about a half hour at Defcon, and I’ve got nothing but respect for his work.  Let’s look at the other side of things for a moment.  First, BGP controls how you route to your name server: if not your recursive server, which may be inside your organization and thus immune to exterior routing protocol attack, then the authoritative servers your recursive servers depend on.  Something like this actually happened recently: witness the curious case of the Unauthorized L Roots, and note the astonishingly familiar potential attacks being described.  Yes, that’s precisely the scenario of BGP used to hijack root DNS servers, with such hijacking actually being noticed.

More importantly, much of my talk, in which I discuss the impacts of MITM attacks, applies to Kapela and Pilosov’s work as well.  It’s 2008, we still don’t have secure email, and that’s just as much of a problem in the face of BGP attacks as it is in the face of DNS attacks.

So, in summary, it’s an interesting side discussion regarding the similarities, differences, and overlaps between DNS and BGP attacks.   BGP has far fewer potential attackers, fewer necessary defenders, is a much less agile attack, and is way easier to monitor forensically (and indeed, with companies like Renesys, is being monitored forensically).  But so what?  It can work, and when it does, it can do much of the same damage we were afraid of via DNS.

We have now had three attacks, in one year, that underscore the fundamentally untrustworthy nature of routing.  DNS, BGP, and SNMPv3 all underscore the fact that the network should only be trusted as a best-effort data transmission system — that if you want to make sure everything’s OK, you can’t just assume — you need to cryptographically authenticate, you need to cryptographically encrypt, and you need to do these things to a level of security beyond “secure unless there’s an attacker.”

A lot of us — myself included, when I first started really looking at SSL — thought we were already distrusting the network.  We weren’t.  That’s what Mike Perry’s telling us, that’s what Mike Zusman’s telling us, and that’s what I’m telling you.

There are some real discussions to be had.  It’s 2008.  Where’s secure email?  Why is almost every autoupdater not from Microsoft thoroughly broken?  What is going on with non-browser network clients that can’t handle traffic from an untrusted server?  How are we going to migrate the web, and indeed all commercial network activity, to authenticated and encrypted protocols that respect the fundamentally untrustworthy nature of the network?

DNS vs. BGP vs. SNMPv3 is inside baseball.  The reality is as follows:

Weaknesses in authentication and encryption, some of which have been known to at least some degree for quite some time and many of which are sourced in the core design of the system, continue to pose a threat to the Internet infrastructure at large, both by corrupting routing, and making those corrupted routes problematic.

The question is what to do about it.

(That all being said, I’ll be writing shortly with an update on defenses against DNS.  There be news.)

My (Not So) Little Pwnie

:)

Experimental Mail Server Analyzer Online

I’ve modified the test scripts slightly, to allow arbitrary triggering agents (such as a mail server) to report back the quality of their DNS queries.  You may very well be surprised what NS’s your mail servers are configured to use.  More often than you’d think, people just don’t know.
