Why Would Researchers Oppose Open Access?

Last week I started sort of a relaxed flame war with other members of the steering committee for an ACM conference on the subject of open access to the proceedings. “Open access” would mean that anyone could download the proceedings. The current situation is slightly different:

  • Often, individual papers are available on authors’ home pages. ACM permits authors to do this, but not to upload papers to sites like arXiv.
  • To get the actual proceedings for the conference you have to subscribe to the ACM Digital Library which costs $99/year, on top of the ACM membership dues.

Going into this argument I assumed that no researcher would oppose open access, because the benefits are clear and there are no downsides (I’m often naive like this). In contrast, I ran into what felt like hardened opposition.

The purpose of this post isn’t to argue for open access — others have done this perfectly well, and anyway as I said it’s a no-brainer from the researchers’ point of view — but rather to try to understand why researchers might be against open access. Here’s my analysis of some of the arguments I ran into. I may be unfairly caricaturing them but hey, it’s my blog.

Argument 1: “Open access is against ACM policy.”

This forms an argument against the ACM (or at least against its publication policy), not against open access.

Argument 2: “Open access costs money (for network links, server machines, etc.) and we don’t know where it will come from.”

This would be a good argument if the arXiv didn’t exist. But it does. In fact it’s been here for nearly 20 years and isn’t likely to go away soon. Furthermore, since 1998 the ACM has collaborated with arXiv to form the Computing Research Repository. I don’t fully understand the history between the CoRR and the ACM Digital Library but I think it’s safe to say that the ACM’s support of the CoRR is half-hearted at best.

Of course I’m not saying that arXiv is free (its annual budget is around $400K) but rather that, since it already exists, the incremental cost of a new arXiv’ed paper is low. Also, the more people who use it, the more likely it is that creative ways will be found to keep it in existence. Storage and network costs are always dropping; perhaps in the future the arXiv can live “in the cloud” at considerably lower cost. Say, this sounds suspiciously like a computer science research problem (yes I know about LOCKSS).

Argument 3: “Revenues from the Digital Library fund other things, including keeping conference registration fees low.”

Aha! Finally we almost have a good argument. But not really. For one thing, I had always heard that conferences usually make money, not lose it. But more importantly, it’s not right for the ACM to decide once and for all that all of its conferences are content-generators for the monetized DL. Rather, research communities should individually decide whether their portion of the revenues is enough to offset the disadvantages of closed access. In summary: if the argument is about money (isn’t it always?) we need to see the figures and make informed decisions. I looked for this information online and didn’t find it (if it is available, please send me a link).

Argument 4: “Authors can put papers on their web pages, this is effectively open access.”

This is weak. First, not all authors do this, so the record is incomplete. For example, my sense is that many European academics are less well-versed in the fine art of excessive self-promotion than are their American counterparts. Second, proceedings are fragmented all over the web; not really ideal.  Third, as people retire, move to industry, etc., their home pages go away and the papers stop being accessible.

Summary

So far, my impression is that when researchers are opposed to open access, it isn’t for good reasons.  Another impression I have is that these attitudes are generational: senior researchers are more likely to oppose open access than are junior ones who “grew up” with the web. It seems very likely that retirements over the next 10 years will cause a phase change.  (Of course, it is also possible that as people become more senior, their attitudes change.)

Associative and Communicative

I found this charming error while reading over a draft of a paper that my group plans to submit soon:

It is easy to prove from its definition that $*$ is communicative and associative.

I’m not exactly sure why I find this hilarious. Perhaps it’s because one might guess that, due to its role, the separating conjunction is a very cold and distant little operator, sort of a gruff traffic cop of the logic world. On the other hand, in this sentence we’re claiming that it not only associates freely, but also communicates well.

Grandview Peak

Grandview Peak, at 9410′, is the highest point in Salt Lake City. Even so, it’s a long way from anywhere and no trail goes to its summit. Over the course of four trips to Grandview I’ve yet to see another person within two miles of the top (not counting whoever I’m hiking with, of course).

One of the reasons I enjoy Grandview is that the route has great variety. You get peaceful hiking near an alpine stream, typical low-Wasatch walking through scrub oak, a nice climb in open pine forest, a long ridge-run with plenty of minor obstacles, and finally a serious two-mile brush thrash on exit.

According to Google Earth, my route was right at 10 miles and involved 4400′ of gain/loss. It took about 6.5 hours and 1.5 MPH felt plenty fast given the difficult terrain.  I’d been hoping for pleasant temperatures; valley highs were around 90 and the average adiabatic lapse rate predicts that 5000 feet higher it should be 17 degrees cooler.  Somehow this prediction was total crap and it was both hot and humid; I guess surface heating probably dwarfs adiabatic effects unless the air is moving around a lot, and transpiration defeats Utah’s natural low humidity. Anyway, three liters of water was not enough. My previous times on Grandview were a lot more pleasant, and had been in spring or fall.  Here’s a description of a similar route I took a few years ago.

Epic Win

Today I visited my favorite taco stand in SLC, the one facing State Street in the Sears parking lot close to 800 South. Four excellent carne asada tacos for $3 is hard to beat. After lunch I went to the new Epic Brewery just a few hundred feet away. I didn’t know much about them, but had heard they’re making good beer. Epic’s shtick turns out to be interesting: they brew strong beer (in contrast, most Utah microbrew is 4.0% ABV) and sell it only in 22 oz (~650 ml) bottles. The retail store is minimalist: a fridge full of bottles, a rack of t-shirts,  and a cash register. While reviewing papers tonight I opened an “825 State Stout.” It is good: a little sweet, not overly alcoholic or hoppy, with plenty of toasted malt flavor. Overall above average among stouts I’ve tasted — nice, since Utah stouts tend to be underwhelming.

Ten Stupid Questions

Teachers sometimes tell students that there are no stupid questions. This is a huge lie; many questions are so stupid they make my teeth ache. I don’t have a great definition of “stupid question” but it’s something like:

A question that only wastes time. Neither the asking nor the answering benefits anyone.

Here are a few of my own stupid questions:

  1. What’s the right way for a male professor to explain male/female connector terminology to a female student? (No, I haven’t been asked this, but one of my colleagues has.) The hazards to avoid are mortal embarrassment or a sexual harassment case.
  2. Why do so many blog authors populate their entries with images not at all related to the topic of the blog post?
  3. What’s the beverage with the highest alcohol content that a person could survive on indefinitely? I mean, if food was plentiful, but no other liquid was available? Clearly some sort of 0.5% near-beer could sustain life indefinitely, and clearly bourbon would kill you even faster than unassisted dehydration. But what about 3.2% beer or low-percentage wine?
  4. Why do alcohol-related web sites in the US ask for age verification before permitting entry? Also, why is this age verification so stupid that a competent six year-old could get past it?
  5. Why do computers suck so much? I’m constantly on the verge of apologizing for being a computer science professor. (Not to students — they asked for it — but to regular people who are forced to interact with crappy software systems.)
  6. Why am I not hungry when I wake up, even if I was ravenous before going to bed?
  7. How can Amazon.com’s search engine suck so badly after 15 years in the business? I just used their book search form to look for author “King, Stephen” and title “It” and the obvious match was 10th in the list (a few years ago I did this search and it was 20th). I routinely use Google, instead of Amazon, to search for books in Amazon.
  8. Why do drivers making left turns often get only partially into the left turn lane? Why do they also routinely swing right before turning left, even though they’re driving slowly enough that it’s obviously unnecessary to increase the radius of the turn?
  9. Why is it a tiny bit embarrassing to run into a coffee shop employee who I know from a previous coffee shop they worked at?
  10. Why are there students who make the effort to come to class and then read the newspaper the whole time? (They even do this when it’s 100% clear there will be no pop quiz or similar.)

I shouldn’t wonder about these things but I do.

A Guide to Undefined Behavior in C and C++, Part 1

Also see Part 2 and Part 3.

Programming languages typically make a distinction between normal program actions and erroneous actions. For Turing-complete languages we cannot reliably decide offline whether a program has the potential to execute an error; we have to just run it and see.

In a safe programming language, errors are trapped as they happen. Java, for example, is largely safe via its exception system. In an unsafe programming language, errors are not trapped. Rather, after executing an erroneous operation the program keeps going, but in a silently faulty way that may have observable consequences later on. Luca Cardelli’s article on type systems has a nice clear introduction to these issues. C and C++ are unsafe in a strong sense: executing an erroneous operation causes the entire program to be meaningless, as opposed to just the erroneous operation having an unpredictable result. In these languages erroneous operations are said to have undefined behavior.

The C FAQ defines “undefined behavior” like this:

Anything at all can happen; the Standard imposes no requirements. The program may fail to compile, or it may execute incorrectly (either crashing or silently generating incorrect results), or it may fortuitously do exactly what the programmer intended.

This is a good summary. Pretty much every C and C++ programmer understands that accessing a null pointer and dividing by zero are erroneous actions. On the other hand, the full implications of undefined behavior and its interactions with aggressive compilers are not well-appreciated. This post explores these topics.

A Model for Undefined Behavior

For now, we can ignore the existence of compilers. There is only the “C implementation” which — if the implementation conforms to the C standard — acts the same as the “C abstract machine” when executing a conforming program. The C abstract machine is a simple interpreter for C that is described in the C standard. We can use it to determine the meaning of any C program.

The execution of a program consists of simple steps such as adding two numbers or jumping to a label. If every step in the execution of a program has defined behavior, then the entire execution is well-defined. Note that even well-defined executions may not have a unique result due to unspecified and implementation-defined behavior; we’ll ignore both of these here.

If any step in a program’s execution has undefined behavior, then the entire execution is without meaning. This is important: it’s not that evaluating (1<<32) has an unpredictable result, but rather that the entire execution of a program that evaluates this expression is meaningless. Also, it’s not that the execution is meaningful up to the point where undefined behavior happens: the bad effects can actually precede the undefined operation.

As a quick example let’s take this program:

#include <limits.h>
#include <stdio.h>

int main (void)
{
  printf ("%d\n", (INT_MAX+1) < 0);
  return 0;
}

The program is asking the C implementation to answer a simple question: if we add one to the largest representable integer, is the result negative? This is perfectly legal behavior for a C implementation:

$ cc test.c -o test
$ ./test
1

So is this:

$ cc test.c -o test
$ ./test
0

And this:

$ cc test.c -o test
$ ./test
42

And this:

$ cc test.c -o test
$ ./test
Formatting root partition, chomp chomp

One might say: Some of these compilers are behaving improperly because the C standard says a relational operator must return 0 or 1. But since the program has no meaning at all, the implementation can do whatever it likes. Undefined behavior trumps all other behaviors of the C abstract machine.

Will a real compiler emit code to chomp your disk? Of course not, but keep in mind that practically speaking, undefined behavior often does lead to Bad Things because many security vulnerabilities start out as memory or integer operations that have undefined behavior. For example, accessing an out of bounds array element is a key part of the canonical stack smashing attack. In summary: the compiler does not need to emit code to format your disk. Rather, following the OOB array access your computer will begin executing exploit code, and that code is what will format your disk.

No Traveling

It is very common for people to say — or at least think — something like this:

The x86 ADD instruction is used to implement C’s signed add operation, and it has two’s complement behavior when the result overflows. I’m developing for an x86 platform, so I should be able to expect two’s complement semantics when 32-bit signed integers overflow.

THIS IS WRONG. You are saying something like this:

Somebody once told me that in basketball you can’t hold the ball and run. I got a basketball and tried it and it worked just fine. He obviously didn’t understand basketball.

(This explanation is due to Roger Miller via Steve Summit.)

Of course it is physically possible to pick up a basketball and run with it. It is also possible you will get away with it during a game.  However, it is against the rules; good players won’t do it and bad players won’t get away with it for long. Evaluating (INT_MAX+1) in C or C++ is exactly the same: it may work sometimes, but don’t expect to keep getting away with it. The situation is actually a bit subtle so let’s look in more detail.

First, are there C implementations that guarantee two’s complement behavior when a signed integer overflows? Of course there are. Many compilers will have this behavior when optimizations are turned off, for example, and GCC has an option (-fwrapv) for enforcing this behavior at all optimization levels. Other compilers will have this behavior at all optimization levels by default.

There are also, it should go without saying, compilers that do not have two’s complement behavior for signed overflows. Moreover, there are compilers (like GCC) where integer overflow behaved a certain way for many years and then at some point the optimizer got just a little bit smarter and integer overflows suddenly and silently stopped working as expected. This is perfectly OK as far as the standard goes. While it may be unfriendly to developers, it would be considered a win by the compiler team because it will increase benchmark scores.

In summary: There’s nothing inherently bad about running with a ball in your hands and also there’s nothing inherently bad about shifting a 32-bit number by 33 bit positions. But one is against the rules of basketball and the other is against the rules of C and C++. In both cases, the people designing the game have created arbitrary rules and we either have to play by them or else find a game we like better.
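Playing by the rules here usually means testing for overflow before doing the arithmetic, not after. Here is a minimal sketch of that idea; the helper name checked_add_int is my own invention for illustration, not something from a standard library:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: stores a + b in *result and returns true only when
   the sum is representable as an int. The range checks run before the
   addition, so no overflowing operation is ever executed. */
bool checked_add_int(int a, int b, int *result) {
  assert(result != NULL);  /* documented precondition on the out-pointer */
  if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
    return false;          /* the addition would overflow: refuse to do it */
  }
  *result = a + b;
  return true;
}
```

The subtractions in the test cannot overflow themselves: INT_MAX - b is only evaluated when b is positive, and INT_MIN - b only when b is negative.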

Why Is Undefined Behavior Good?

The good thing — the only good thing! — about undefined behavior in C/C++ is that it simplifies the compiler’s job, making it possible to generate very efficient code in certain situations. Usually these situations involve tight loops. For example, high-performance array code doesn’t need to perform bounds checks, avoiding the need for tricky optimization passes to hoist these checks outside of loops. Similarly, when compiling a loop that increments a signed integer, the C compiler does not need to worry about the case where the variable overflows and becomes negative: this facilitates several loop optimizations. I’ve heard that certain tight loops speed up by 30%-50% when the compiler is permitted to take advantage of the undefined nature of signed overflow. Similarly, there have been C compilers that optionally give undefined semantics to unsigned overflow to speed up other loops.

Why Is Undefined Behavior Bad?

When programmers cannot be trusted to reliably avoid undefined behavior, we end up with programs that silently misbehave. This has turned out to be a really bad problem for codes like web servers and web browsers that deal with hostile data because these programs end up being compromised and running code that arrived over the wire. In many cases, we don’t actually need the performance gained by exploitation of undefined behavior, but due to legacy code and legacy toolchains, we’re stuck with the nasty consequences.

A less serious problem, more of an annoyance, is where behavior is undefined in cases where all it does is make the compiler writer’s job a bit easier, and no performance is gained. For example a C implementation has undefined behavior when:

An unmatched ‘ or ” character is encountered on a logical source line during tokenization.

With all due respect to the C standard committee, this is just lazy. Would it really impose an undue burden on C implementors to require that they emit a compile-time error message when quote marks are unmatched? Even a 30 year-old (at the time C99 was standardized) systems programming language can do better than this. One suspects that the C standard body simply got used to throwing behaviors into the “undefined” bucket and got a little carried away. Actually, since the C99 standard lists 191 different kinds of undefined behavior, it’s fair to say they got a lot carried away.

Understanding the Compiler’s View of Undefined Behavior

The key insight behind designing a programming language with undefined behavior is that the compiler is only obligated to consider cases where the behavior is defined. We’ll now explore the implications of this.

If we imagine a C program being executed by the C abstract machine, undefined behavior is very easy to understand: each operation performed by the program is either defined or undefined, and usually it’s pretty clear which is which. Undefined behavior becomes difficult to deal with when we start being concerned with all possible executions of a program. Application developers, who need code to be correct in every situation, care about this, and so do compiler developers, who need to emit machine code that is correct over all possible executions.

Talking about all possible executions of a program is a little tricky, so let’s make a few simplifying assumptions. First, we’ll discuss a single C/C++ function instead of an entire program. Second, we’ll assume that the function terminates for every input. Third, we’ll assume the function’s execution is deterministic; for example, it’s not cooperating with other threads via shared memory. Finally, we’ll pretend that we have infinite computing resources, making it possible to exhaustively test the function. Exhaustive testing means that all possible inputs are tried, whether they come from arguments, global variables, file I/O, or whatever.

The exhaustive testing algorithm is simple:

  1. Compute next input, terminating if we’ve tried them all
  2. Using this input, run the function in the C abstract machine, keeping track of whether any operation with undefined behavior was executed
  3. Go to step 1

Enumerating all inputs is not too difficult. Starting with the smallest input (measured in bits) that the function accepts, try all possible bit patterns of that size. Then move to the next size. This process may or may not terminate but it doesn’t matter since we have infinite computing resources.
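We don't have infinite computing resources, but the loop above can be sketched for an input space small enough to enumerate. The code below is only an illustration under loud assumptions: it pretends the function under test is a division on a hypothetical 8-bit integer type, and a hand-written predicate (div8_is_undefined) stands in for the abstract machine's bookkeeping about whether an undefined operation was executed. All names here are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Result of exhaustive testing: how a function's inputs split between
   defined and undefined executions. */
enum ub_class { TYPE_1 = 1, TYPE_2 = 2, TYPE_3 = 3 };

/* Stand-in for "run it in the abstract machine and watch for undefined
   behavior": returns nonzero when the given input would execute an
   undefined operation. */
typedef int (*ub_predicate)(int8_t a, int8_t b);

/* Assumed model: division on an 8-bit signed type, undefined for a zero
   divisor and for the one quotient that does not fit in 8 bits. */
int div8_is_undefined(int8_t a, int8_t b) {
  return (b == 0) || (a == INT8_MIN && b == -1);
}

/* Steps 1 through 3 of the algorithm: try every input pair, record
   whether it is defined, then classify once the inputs are exhausted. */
enum ub_class classify(ub_predicate is_undefined, long *undefined_count) {
  assert(is_undefined != NULL && undefined_count != NULL);
  long defined = 0, undefined = 0;
  for (int a = INT8_MIN; a <= INT8_MAX; a++) {
    for (int b = INT8_MIN; b <= INT8_MAX; b++) {
      if (is_undefined((int8_t)a, (int8_t)b)) {
        undefined++;
      } else {
        defined++;
      }
    }
  }
  *undefined_count = undefined;
  if (undefined == 0) return TYPE_1;  /* defined for all inputs */
  if (defined == 0) return TYPE_3;    /* undefined for all inputs */
  return TYPE_2;                      /* a mix of both */
}
```

Under this model, 257 of the 65536 input pairs are undefined (256 with a zero divisor, plus the single pair with the minimum value divided by -1).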

For programs that contain unspecified and implementation-defined behaviors, each input may result in several or many possible executions. This doesn’t fundamentally complicate the situation.

OK, what has our thought experiment accomplished? We now know, for our function, which of these categories it falls into:

  • Type 1: Behavior is defined for all inputs
  • Type 2: Behavior is defined for some inputs and undefined for others
  • Type 3: Behavior is undefined for all inputs

Type 1 Functions

These have no restrictions on their inputs: they behave well for all possible inputs (of course, “behaving well” may include returning an error code). Generally, API-level functions and functions that deal with unsanitized data should be Type 1. For example, here’s a utility function for performing integer division without executing undefined behaviors:

int32_t safe_div_int32_t (int32_t a, int32_t b) {
  if ((b == 0) || ((a == INT32_MIN) && (b == -1))) {
    report_integer_math_error();
    return 0;
  } else {
    return a / b;
  }
}

Since Type 1 functions never execute operations with undefined behavior, the compiler is obligated to generate code that does something sensible regardless of the function’s inputs. We don’t need to consider these functions any further.

Type 3 Functions

These functions admit no well-defined executions. They are, strictly speaking, completely meaningless: the compiler is not obligated to generate even a return instruction. Do Type 3 functions really exist? Yes, and in fact they are common. For example, a function that, regardless of input, uses a variable without initializing it is easy to write unintentionally. Compilers are getting smarter and smarter about recognizing and exploiting this kind of code. Here’s a great example from the Google Native Client project:

When returning from trusted to untrusted code, we must sanitize the return address before taking it. This ensures that untrusted code cannot use the syscall interface to vector execution to an arbitrary address. This role is entrusted to the function NaClSandboxAddr, in sel_ldr.h. Unfortunately, since r572, this function has been a no-op on x86.

-- What happened?

During a routine refactoring, code that once read

aligned_tramp_ret = tramp_ret & ~(nap->align_boundary - 1);

was changed to read

return addr & ~(uintptr_t)((1 << nap->align_boundary) - 1);

Besides the variable renames (which were intentional and correct), a shift was introduced, treating nap->align_boundary as the log2 of bundle size.

We didn't notice this because NaCl on x86 uses a 32-byte bundle size.  On x86 with gcc, (1 << 32) == 1. (I believe the standard leaves this behavior undefined, but I'm rusty.) Thus, the entire sandboxing sequence became a no-op.

This change had four listed reviewers and was explicitly LGTM'd by two. Nobody appears to have noticed the change.

-- Impact

There is a potential for untrusted code on 32-bit x86 to unalign its instruction stream by constructing a return address and making a syscall. This could subvert the validator. A similar vulnerability may affect x86-64.

ARM is not affected for historical reasons: the ARM implementation masks the untrusted return address using a different method.

What happened? A simple refactoring put the function containing this code into Type 3. The person who sent this message believes that x86-gcc evaluates (1<<32) to 1, but there’s no reason to expect this behavior to be reliable (in fact it is not on a few versions of x86-gcc that I tried). This construct is definitely undefined and of course the compiler can do anything it wants. As is typical for a C compiler, it chose to simply not emit the instructions corresponding to the undefined operation. (A C compiler’s #1 goal is to emit efficient code.) Once the Google programmers gave the compiler the license to kill, it went ahead and killed. One might ask: Wouldn’t it be great if the compiler provided a warning or something when it detected a Type 3 function? Sure! But that is not the compiler’s priority.

The Native Client example is a good one because it illustrates how competent programmers can be suckered in by an optimizing compiler’s underhanded way of exploiting undefined behavior. A compiler that is very smart at recognizing and silently destroying Type 3 functions becomes effectively evil, from the developer’s point of view.
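One defensive pattern for code like this is to validate the shift amount before shifting. The sketch below is not Native Client's actual fix (I don't know what that looked like). It is a hypothetical helper, bundle_mask, that assumes a 32-bit address and a bundle size given as a log2 value:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: computes the mask that clears the low
   log2_bundle_size bits of a 32-bit address. Shifting a 32-bit value by 32
   or more is undefined, so that case is rejected before any shift runs. */
bool bundle_mask(unsigned log2_bundle_size, uint32_t *mask) {
  assert(mask != NULL);  /* documented precondition on the out-pointer */
  if (log2_bundle_size >= 32) {
    return false;        /* would be an undefined shift: report, don't shift */
  }
  *mask = (uint32_t)~(((uint32_t)1 << log2_bundle_size) - 1u);
  return true;
}
```

With a helper like this, passing the bundle size itself (32) where its log2 (5) was expected produces an explicit failure instead of a silently deleted sandboxing sequence.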

Type 2 Functions

These have behavior that is defined for some inputs and undefined for others. This is the most interesting case for our purposes. Signed integer divide makes a good example:

int32_t unsafe_div_int32_t (int32_t a, int32_t b) {
  return a / b;
}

This function has a precondition; it should only be called with arguments that satisfy this predicate:

(b != 0) && (!((a == INT32_MIN) && (b == -1)))

Of course it’s no coincidence that this predicate looks a lot like the test in the Type 1 version of this function. If you, the caller, violate this precondition, your program’s meaning will be destroyed. Is it OK to write functions like this, that have non-trivial preconditions? In general, for internal utility functions this is perfectly OK as long as the precondition is clearly documented.

Now let’s look at the compiler’s job when translating this function into object code. The compiler performs a case analysis:

  • Case 1: (b != 0) && (!((a == INT32_MIN) && (b == -1)))
    Behavior of / operator is defined → Compiler is obligated to emit code computing a / b
  • Case 2: (b == 0) || ((a == INT32_MIN) && (b == -1))
    Behavior of / operator is undefined → Compiler has no particular obligations

Now the compiler writers ask themselves the question: What is the most efficient implementation of these two cases? Since Case 2 incurs no obligations, the simplest thing is to simply not consider it. The compiler can emit code only for Case 1.

A Java compiler, in contrast, has obligations in Case 2 and must deal with it (though in this particular case, it is likely that there won’t be runtime overhead since processors can usually provide trapping behavior for integer divide by zero).

Let’s look at another Type 2 function:

int stupid (int a) {
  return (a+1) > a;
}

The precondition for avoiding undefined behavior is:

(a != INT_MAX)

Here the case analysis done by an optimizing C or C++ compiler is:

  • Case 1: a != INT_MAX
    Behavior of + is defined → Compiler is obligated to return 1
  • Case 2: a == INT_MAX
    Behavior of + is undefined → Compiler has no particular obligations

Again, Case 2 is degenerate and disappears from the compiler’s reasoning. Case 1 is all that matters. Thus, a good x86-64 compiler will emit:

stupid:
  movl $1, %eax
  ret

If we use the -fwrapv flag to tell GCC that integer overflow has two’s complement behavior, we get a different case analysis:

  • Case 1: a != INT_MAX
    Behavior is defined → Compiler is obligated to return 1
  • Case 2: a == INT_MAX
    Behavior is defined → Compiler is obligated to return 0

Here the cases cannot be collapsed and the compiler is obligated to actually perform the addition and check its result:

stupid:
  leal 1(%rdi), %eax
  cmpl %edi, %eax
  setg %al
  movzbl %al, %eax
  ret

Similarly, an ahead-of-time Java compiler also has to perform the addition because Java mandates two’s complement behavior when a signed integer overflows (I’m using GCJ for x86-64):

_ZN13HelloWorldApp6stupidEJbii:
  leal 1(%rsi), %eax
  cmpl %eax, %esi
  setl %al
  ret

This case-collapsing view of undefined behavior provides a powerful way to explain how compilers really work. Remember, their main goal is to give you fast code that obeys the letter of the law, so they will attempt to forget about undefined behavior as fast as possible, without telling you that this happened.

A Fun Case Analysis

About a year ago, the Linux kernel started using a special GCC flag to tell the compiler to avoid optimizing away useless null-pointer checks. The code that caused developers to add this flag looks like this (I’ve cleaned up the example just a bit):

static void __devexit agnx_pci_remove (struct pci_dev *pdev)
{
  struct ieee80211_hw *dev = pci_get_drvdata(pdev);
  struct agnx_priv *priv = dev->priv; 

  if (!dev) return;
  ... do stuff using dev ...
}

The idiom here is to get a pointer to a device struct, test it for null, and then use it. But there’s a problem! In this function, the pointer is dereferenced before the null check. This leads an optimizing compiler (for example, gcc at -O2 or higher) to perform the following case analysis:

  • Case 1: dev == NULL
    “dev->priv” has undefined behavior → Compiler has no particular obligations
  • Case 2: dev != NULL
    Null pointer check won’t fail → Null pointer check is dead code and may be deleted

As we can now easily see, neither case necessitates a null pointer check. The check is removed, potentially creating an exploitable security vulnerability.

Of course the problem is the use-before-check of pci_get_drvdata()’s return value, and this has to be fixed by moving the use after the check. But until all such code can be inspected (manually or by a tool), it was deemed safer to just tell the compiler to be a bit conservative. The loss of efficiency due to a predictable branch like this is totally negligible. Similar code has been found in other parts of the kernel.
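The shape of the fix is simple: test first, dereference second. Here is a self-contained sketch of the corrected ordering. The struct and function names are stand-ins I made up so the example compiles outside the kernel; they are not the real kernel types.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's device structures. */
struct fake_hw { void *priv; };
struct fake_pci_dev { struct fake_hw *drvdata; };

/* Returns the device's private data, or NULL when no driver data is
   attached. The null test comes before the first dereference of dev, so
   the compiler has no license to treat the test as dead code. */
void *get_priv_checked(struct fake_pci_dev *pdev) {
  assert(pdev != NULL);              /* precondition: caller passes a device */
  struct fake_hw *dev = pdev->drvdata;
  if (!dev) return NULL;             /* check first ... */
  return dev->priv;                  /* ... then use */
}
```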

Living with Undefined Behavior

In the long run, unsafe programming languages will not be used by mainstream developers, but rather reserved for situations where high performance and a low resource footprint are critical. In the meantime, dealing with undefined behavior is not totally straightforward and a patchwork approach seems to be best:

  • Enable and heed compiler warnings, preferably using multiple compilers
  • Use static analyzers (like Clang’s, Coverity, etc.) to get even more warnings
  • Use compiler-supported dynamic checks; for example, gcc’s -ftrapv flag generates code to trap signed integer overflows
  • Use tools like Valgrind to get additional dynamic checks
  • When functions are “type 2” as categorized above, document their preconditions and postconditions
  • Use assertions to verify that functions’ preconditions and postconditions actually hold
  • Particularly in C++, use high-quality data structure libraries
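As a small illustration of the two items about preconditions, here is the Type 2 division function from earlier with its precondition turned into assertions. The name div_int32_asserted is my own. Keep in mind that assert() compiles to nothing when NDEBUG is defined, so this catches violations in debug builds only.

```c
#include <assert.h>
#include <stdint.h>

/* Precondition: b != 0, and not (a == INT32_MIN and b == -1).
   In a debug build a violation aborts the program at the assertion
   instead of going on to execute an undefined division. */
int32_t div_int32_asserted(int32_t a, int32_t b) {
  assert(b != 0);
  assert(!(a == INT32_MIN && b == -1));
  return a / b;
}
```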

Basically: be very careful, use good tools, and hope for the best.

Is There Anything Knol Could Have Done to Attract Plagiarists More Effectively?

It is known that Google Knol has some plagiarism problems, but I wanted to share a quick anecdote. In early 2010 I noticed this Knol, which plagiarizes an article originally written by Nigel Jones. I’m sure that Nigel’s article is the original because it appeared in print nine years ago. I was annoyed to see that the knockoff has a “Top Viewed Knol Award” and also its author, Vivek Bhadra, has a “Top Viewed Author Award,” so I left a comment on the article suggesting that it may have borrowed content without attribution, and then forgot about it.

A few months later, in mid-April 2010, I happened to revisit the plagiarized Knol and saw that Bhadra had deleted my comment and also banned me from commenting on all of his articles. Smelling a rat, I ran web searches on phrases from more of his articles and found that most of them are plagiarized. But what could I do if not comment on the articles? Aha — there’s a “report abusive content” button, but it turns out none of the categories of abuse includes plagiarized content. There’s a separate link for reporting copyright infringement and it contains these instructions:

To file a notice of infringement with us, you must provide a written communication (by fax or regular mail — not by email, except by prior agreement) that sets forth the items specified below.

No email, nice! Also, only the content’s owner is permitted to send this letter — third parties who notice plagiarism are not welcome.

So anyway, I used the “other” checkbox on the “abusive content” menu to report Bhadra’s 10 highest-ranked plagiarized pages, including links to the original content in the comment field. This was about three months ago and nothing’s happened — the pages are still up and he’s still listed as a top-viewed author.

Internet plagiarism is hardly novel or shocking. The surprising thing is the picture that emerges when we summarize Knol’s design point:

  • Knol makes it trivial to monetize Wikipedia-style content by providing good interoperation with Google’s advertising
  • Knol lets content providers ban commenters they don’t like
  • Knol offers no good way for third parties to report plagiarism, and fails to act (within three months, at least) on reports made through their abusive content system
  • Knol sets a needlessly high bar for content owners to report a DMCA violation by asking them to use physical mail

One might ask: Is there anything Knol could have done differently to attract plagiarists more effectively?