Limitations of Many Eyes -- in the context of voting systems (fwd)

From: Brian Behlendorf <brian_at_behlendorf_dot_com>
Date: Mon May 19 2008 - 16:55:28 CDT

I'm forwarding the below email with David's permission. The email was
sent to a list of folks interested in the intersection of security and
open source. It's self explanatory, and many of you have probably already
heard about the study. I think it's important, as advocates of open
source voting systems, that we understand this study carefully; David's
message sets it in an appropriate context.

In my own opinion, instead of spending a lot of time and energy refuting
the study (which could still have its flaws, and will be overgeneralized
and spun by foes of open source voting), advocates of the security
advantages of open source voting should instead acknowlege it as a reality
of all software, and argue that open source still gives us the best shot
of detecting exploits, rather than providing any sort of assurance that
exploits could not exist because they would be quickly found. It should
also reinforce the importance of differentiating between "published
source" and true open source, as "published source" is still unlikely to
get anywhere near the amount of scrutiny per line of code. It could also
point out the importance of funding review efforts; open sourcing the tech
doesn't cause code reviewers to appear, it only makes them possible.

Most importantly, it should point out why publishing source code in any
manner does not obviate the need for an auditable (and actually audited)
voting process.


---------- Forwarded message ----------
Date: Sun, 18 May 2008 12:47:57 -0700 (PDT)
From: David Wagner <>
Subject: Re: [ROS] Limitations of Many Eyes -- in the context of voting systems

> The original reason behind the research that led to the static analysis
> tools MALPAS, SPADE and SPARK was to detect trojan code in UK Government
> security software.
> If we could implement open source software using programming languages
> that support sound static analysis of data flow and path expressions, it
> would be far easier to check for trojans.

I used to think so, too. But Ping Yee and I recently did an
experiment whose results have raised serious doubt about this in
my mind.

In Ping's PhD dissertation, Ping came up with an alternative architecture
for a voting machine that vastly reduces its trusted computing base (down
to < 500 lines of simple Python code). We hypothesized that this would
make it much easier for reviewers to find unintentional and intentional
flaws in the software. We decided to test this hypothesis by inviting
several security experts to review our code.

As part of the experiment, we injected three artificial bugs into our
code, to see if the reviewers would find them. These bugs were an
attempt to simulate a malicious backdoor that could influence election
results. We chose these bugs so they would be unlikely to be detected
by ordinary pre-election testing. We crafted the bugs to offer a range
of difficulty in finding them: one that we expected to be easy to find,
one that we expected to be of medium difficulty to spot, and one that
we expected to be difficult for a code reviewer to notice.

We then ran a code review, with a total of 4 expert reviewers. They were
broken down into 3 teams of 1-2 reviewers, each team working separately
for one day to review the buggy code. At the start of the day we
told them "there exists at least one bug in this code!" and told them
that we had inserted an artificial bug (we didn't tell them that we
had inserted more than one until they found their first bug). We even
revealed that the bug was in a narrow subrange of 100 lines of code, to
give them an advantage and narrow the focus of their review. In fact,
all 3 articially inserted bugs were in that range. We then sat back to
see whether they found the bugs.

One team found both the "easy" bug and the "medium" bug, without much
difficulty, but never found the "difficult" bug. Another team found
the "easy" bug after about four hours, but never found the "medium" or
"difficult" bugs. A third team found the "easy" bug after about two
hours but never found the "medium" or "difficult" bugs. After a total
of about 33 reviewer-hours, no one ever found the "difficult" bug.

Of course, had we actually been malicious, we would have inserted just
the "difficult-to-find" bug, and we would have gotten away with it.
(The only reason we inserted that we also inserted the "easy" and
"medium" bugs was because we were worried that they wouldn't find the
"difficult" bug!) I should also point out that we came up with these
malicious backdoors after only 2-3 hours of brainstorming -- we only
came up with the idea of inserting artificial bugs the night before,
so we were in a hurry.

In short, in 2-3 hours we were able to come up with a malicious backdoor
that 4 reviewers could spending a total of 33 reviewer-hours could not
find, even with the information that there was a bug and with the side
information that it could be found within a particular subrange of 100
lines of code. This suggests that malicious backdoors can be very, very
difficult to detect, even in situations carefully crafted to be the best
possible for the defenders.

Let's compare to current commercial practice. Typical code reviewing
practice today for commercial code involves 1-2 reviewers reviewing code
at a rate of 100-200 lines/hour. In comparison, in our experiment the
reviewers reviewed code at a rate of about 14 lines/hour -- they had a
lot more time, and we had more eyeballs, and the malicious backdoor was
still not detected.

Or, we can put this in context by looking at the cost. I estimate that
it cost something like $10,000 to hire 4 reviewers with these qualifications
to perform this kind of code review. But in this experiment we reviewed
only 100 lines of code; real-world systems are much larger than that.
Imagine, for example, a industrial system with 10,000 lines of code;
that's pretty tiny by commercial standards, yet the experiment we ran
suggests that one could spend $1 million and still not detect a
malicious backdoor. (And this isn't even taking into account the fact
that in larger systems there are probably more opportunities for even
more subtle kinds of backdoors.)

For me, the results of this experiment were eye-opening, because
they violated my intuition and my expectations so dramatically.
The conventional wisdom says that there is no reliable way to check
for malicious backdoors in a large, complex software system. I had
been hoping that the situation might be different if one followed sound
engineering principles, kept the system simple, and minimized the size
of the codebase. Alas, it didn't work out that way. This experiment
suggests to me that there is no feasible, reliable way to check for
malicious backdoors in software of any size -- not even for small,
simple systems that have been thoughtfully designed, architected, and
implemented with the malicious code problem in mind.

More details about this experiment can be found in Chapter 8 of
Ping's thesis ( Even more details
can be found in his techreport:

What about programming languages designed to support sound static
analysis and formal program verification? Would they provide a reliable
way to detect deliberately inserted backdoors and malicious logic? I
don't know, but I'm not convinced yet.

Our experiment wasn't designed to test that question. But, Ping
put a lot of effort into manual verification of the code. He put
together a 70-page assurance document, which was made available to the
reviewers. This document contained his best attempt at a requirements
analysis, a careful explanation of the threat model and the correctness
claims he was making about his software, as well as rigorous analysis of
each of those claims. He also translated those claims into invariants,
preconditions, and postconditions on the code and performed a line-by-line
analysis of the code, showing the pre- and post-conditions needed at
each line to prove these claims. In other words, this analysis was
along the lines of what an automated program verification tool might
perform, except that Ping did it by hand. The assurance document is
available here:
Ping did use a programming language subset that was explicitly designed
to support human code review and chosen with simplicity in mind (details
in Chapter 3 of the assurance document), though sadly we didn't have
automated static analysis tools.

I don't believe that "sound static analysis of data flow and path
expressions" would have made it any easier to find any of the three
malicious backdoors we artificially inserted.

But let me be more charitable, and imagine the best possible program
verification and analysis tool I can plausibly envision (even though
I think it's beyond the state of the art to build such a thing today,
this is the most I can think to ask for). I think such a hypothetical
tool would have fared as follows:
   * A good program verification tool could have detected the "easy"
     bug. It falls squarely into the category of the kind of bugs that
     formal verification is intended to find.
   * A strict static analysis tool might have helped to detect the
     "medium" bug, though this is debateable -- I could argue it
     either way.
   * Hard to say, but I think formal verification would not have
     made a huge difference in the difficulty of detecting the "difficult"
     bug (I suspect developing a spec that would be sufficient to detect
     the bug is as hard as finding the bug -- the hard part is arguably
     recognizing the need for this particular special-case requirement,
     and once you realize that, I suspect you'll have a decent chance of
     finding the bug either way).

One challenge here is that such tools require specifications. If
we are assuming the possibility of malicious logic in the program, we
also have to assume the possibility of malicious logic in the
specification. Since formal verification tools can, at best, only
verify the consistency of the program and the specification, a matching
bug in both the program and the spec can evade detection. And checking
detailed, program-level specs for malicious logic does not seem likely
to be appreciably easier than checking programs for malicious logic.

But that's not even the worst of the it. The worst of it is that
we have to assume that a malicious insider will know that formal
verification is being used and will craft the bug to minimize the
chances that the formal verification tool will detect the bug. (After
all, the malicious insider can run the formal verification tools himself
and check whether his bug will be detected, and he can keep trying until
he finds a bug that is unlikely to be detected by that tool.) In
other words, we have to assume that the attacker will adapt to the
defenses you put in place. If there is any chink in the armor, we
have to assume the attacker will find it. That makes this problem
challenging, even if we assume major advances in program verification.

So for me, I've gradually come to the hypothesis that detecting
malicious backdoors is really hard, even in the best possible case.
I don't know how to do it with any confidence or assurance, for
any non-trivial system, no matter how well-designed or implemented.
Sounds like a difficult open problem to me.

-- David
OVC-discuss mailing list
By sending email to the OVC-discuss list, you thereby agree to release the content of your posts to the Public Domain--with the exception of copyrighted material quoted according to fair use, including publicly archiving at
= The content of this message, with the exception of any external
= quotations under fair use, are released to the Public Domain
Received on Sat May 31 16:17:02 2008

This archive was generated by hypermail 2.1.8 : Sat May 31 2008 - 16:17:04 CDT