What I Want From a Bibliography System

As a professor I spend a fair amount of time wrangling with references. Because it’s free and reasonably simple, I use BibTeX: an add-on tool for LaTeX that automates the construction of a bibliography by pulling references out of a separate text file, assigning them numbers (or other identifiers), and formatting the entries appropriately.

BibTeX entries are widely available on the web. In principle this is great, but in practice the quality of downloadable BibTeX entries is poor: they contain so many typos, omissions, misclassifications, and other problems that I no longer download them, but rather create a new entry from scratch, perhaps using the existing entry as a rough specification. Even BibTeX entries from the ACM’s Digital Library (a for-pay service provided by the premier professional society for computer science) are not directly usable: basically every one of them requires editing. Over time I’ve come to blame BibTeX more and more for these problems, rather than the people creating the entries. Here are the design points I’d like to see in a bibliography system.

Style neutrality: A bibliography entry should be a collection of factual information about a document. The “bibliography style” — a collection of formatting guidelines specific to a particular journal or similar — should be a separate concern. BibTeX does support bibliography styles but the implementation is incomplete, forcing entries to be tweaked when the style is changed.
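To make the separation concrete, here is a minimal sketch in Python rather than BibTeX. It assumes an entry is just a dictionary of facts and a style is just a function over that dictionary. The entry contents, the field names, and both style functions are made up for illustration and do not correspond to any real BibTeX style.

```python
# Hypothetical entry: facts about the document only, no formatting decisions.
ENTRY = {
    "authors": ["Alice Example", "Bob Sample"],
    "title": "An Example Paper",
    "venue": "Example Journal",
    "year": 2011,
}


def style_numeric(entry, number):
    """Made-up style: numeric label, full author names, quoted title."""
    authors = ", ".join(entry["authors"])
    return f'[{number}] {authors}. "{entry["title"]}." {entry["venue"]}, {entry["year"]}.'


def style_author_year(entry):
    """Made-up style: surname first, given names reduced to initials."""

    def abbreviate(name):
        *given, surname = name.split()
        initials = " ".join(part[0] + "." for part in given)
        return f"{surname}, {initials}"

    authors = "; ".join(abbreviate(name) for name in entry["authors"])
    return f'{authors} ({entry["year"]}). {entry["title"]}. {entry["venue"]}.'


if __name__ == "__main__":
    # The same untouched entry is rendered two different ways.
    print(style_numeric(ENTRY, 1))
    print(style_author_year(ENTRY))
```

The point of the sketch is only that switching styles means switching functions, while the entry itself stays byte-for-byte the same.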

Minimal redundancy: Any large BibTeX file contains a substantial amount of redundancy, because BibTeX poorly supports the kind of abstraction that would permit, for example, the common parts of two papers appearing at the same conference to be factored out. Duplication is bad because it consumes time and also makes it basically impossible to avoid inconsistencies.

Standalone entries: I should be able to download a bibliography entry from the net and use it right away. This goal is met by BibTeX but only at the expense of massive redundancy. Meeting both goals at the same time may be tricky. One solution would be to follow an entry’s dependency chain when exporting it, in order to rip a minimal, self-sufficient collection of information out of the database. This seems awkward. A better answer is probably to rely on conventions: if there is a globally recognized name for each conference and journal, a self-sufficient entry can simply refer to it.
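The dependency-chain option can be sketched roughly as follows. This is a toy model in Python, not a description of how any existing tool works: the database layout, the `parent` field name, and the `export_standalone` helper are all assumptions made for the example.

```python
# Hypothetical database: two papers share one conference record through
# a made-up "parent" field, so the conference details are stored once.
DATABASE = {
    "exconf2011": {
        "booktitle": "Proceedings of the Example Conference",
        "year": "2011",
    },
    "paper-a": {
        "parent": "exconf2011",
        "title": "First Example Paper",
        "author": "Alice Example",
    },
    "paper-b": {
        "parent": "exconf2011",
        "title": "Second Example Paper",
        "author": "Bob Sample",
    },
}


def export_standalone(key, database):
    """Follow the parent chain from `key` and return one self-sufficient entry."""
    chain = []
    seen = set()
    while key is not None:
        if key in seen:
            raise ValueError(f"dependency cycle at {key!r}")
        seen.add(key)
        record = database[key]
        chain.append(record)
        key = record.get("parent")

    merged = {}
    # Apply the most distant ancestor first so that closer records override it.
    for record in reversed(chain):
        merged.update(record)
    merged.pop("parent", None)  # the exported entry no longer depends on anything
    return merged


if __name__ == "__main__":
    print(export_standalone("paper-a", DATABASE))
```

The database keeps the conference details in one place, while the exported entry carries its own copy and can be dropped into someone else's file.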

Plain-text entries: Putting the bibliography in XML or some random binary format isn’t acceptable; the text-only database format is one of the good things about BibTeX.

Network friendliness: It should be easy to grab bibliography entries from well-known sources on the network, such as arXiv. Assuming that the problem of these entries sucking can be solved, I should not even need a local bibliography file at all.

User friendliness: BibTeX suffers from weak error checking (solved somewhat by add-on tools like bibclean) and formatting difficulties. For example, BibTeX’s propensity for taking capitalized letters from the bib entry and making them lowercase in the output causes numerous bibliographies I read to contain proper nouns that are not capitalized. Messing with BibTeX styles is quite painful.
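The usual workaround for the capitalization problem is to wrap the affected words in braces in the title field, which BibTeX styles leave alone when changing case. A small helper along these lines can do the wrapping mechanically. This is a sketch under simple assumptions: words are separated by single spaces, any capital letter after the very first letter of the title is treated as intentional, and the function name `protect_capitals` is made up.

```python
def protect_capitals(title):
    """Wrap words with at-risk capital letters in braces.

    Assumes a style that lowercases titles keeps the first letter of the
    title, so for the first word only later capitals count as at risk.
    """
    protected = []
    for index, word in enumerate(title.split(" ")):
        candidate = word[1:] if index == 0 else word
        at_risk = any(ch.isupper() for ch in candidate)
        already_braced = word.startswith("{")
        if at_risk and not already_braced:
            protected.append("{" + word + "}")
        else:
            protected.append(word)
    return " ".join(protected)


if __name__ == "__main__":
    print(protect_capitals("LaTeX on Linux for beginners"))
```

A helper like this over-protects titles that were entered in title case, since every capitalized word would end up braced, so it only helps when the entry already uses sentence case with capitals reserved for proper nouns and acronyms.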

BibTeX is a decent tool; it just hasn’t aged or scaled well. CrossTeX and BibLaTeX sound nice and solve some of the problems I’ve mentioned here. However, since I have about 37,000 lines of BibTeX sitting around, I’m not interested in migrating to a new system until a clear winner emerges.

May 9, 2011

regehr

Academia, Computer Science

9 responses to “What I Want From a Bibliography System”

Robby says:

May 9, 2011 at 9:53 pm

Scribble’s bibliography support helps with some of that.
Suresh says:

May 9, 2011 at 9:59 pm

Have you used Mendeley? It automates the process of getting references from just a PDF file fairly effectively. It’s admittedly only as good as its web sources, but it does deal with ‘network-friendliness’. The entries are stored as bibtex so they’re backwards compatible.

Also, the crossref functionality used in DBLP bibtex entries addresses the problem of multiple papers from the same conference, and there are various post-processing tools to clean the bib file. Note though that the bibtex file is intended not as a per-document file (as it’s used) but as a database of references: in fact there are tools to extract only the references pertaining to a specific .aux file from a given bibtex file.
regehr says:

May 9, 2011 at 10:15 pm

Suresh, I hate the DBLP entries: they’re crammed with crap I don’t want, like publisher and editors, and leave out things I’d prefer to have.

The crossref hack struck me as too much of a band-aid, but I didn’t look at it that closely.

Mendeley looks interesting, I hadn’t seen it…
regehr says:

May 9, 2011 at 10:17 pm

Robby, do you guys write your papers in scribble?
Suresh says:

May 10, 2011 at 12:05 am

It seems to me that you want a bibliography database to be complete (i.e., like DBLP) and use styles to eliminate items you don’t want to display (most bibtex styles are competent at adding or dropping fields). In a different context, someone might want the fields that you don’t want.

Mendeley is nice – worth checking out. I haven’t fully integrated it into my workflow yet, but the tagging system and the auto import of PDFs/bibrefs is greatly improved.
Ragib Hasan says:

May 10, 2011 at 12:33 am

I think BibTeX with CrossRef works quite well in rooting out redundancies. For my BibTeX library, I created a common.bib file containing info about most conferences. For each paper, a crossref links to an entry in the common.bib file, so there is no redundancy for the conference information common to two papers.

I agree with your point about non-usability of bibtex files from ACM, IEEE, or Google Scholar. In the case of GScholar, it manages to mess up virtually all the entries, through either misclassification or malformed attributes.
regehr says:

May 10, 2011 at 12:29 pm

Suresh, the bibliography styles I use are not particularly good at eliminating extraneous fields.

I agree that it’s fine for the original bib entry to contain anything that is plausibly useful to anyone.
Lex Spoon says:

May 10, 2011 at 4:07 pm

I’m enjoying reading the different systems people are mentioning.

For your desires list, I would actually remove redundancy. I know this is programmer heresy, but I have found in practice that it works better to make a local .bib file for each paper rather than to pull entries from master .bib files. Then it’s possible to tweak the local .bib files to be exactly what is appropriate for the paper in question. It’s easy, simple, and sufficient.
Paul Agapow says:

May 12, 2011 at 7:13 am

Mendeley is … fine. It often extracts the metadata correctly, it’s reasonably frictionless to use, and it can import bibliographies in a variety of styles. Conversely, its output styles are sometimes patchy or malformed, some of the assumptions it makes are troublesome (e.g. authors are people and not organisations), and the amount of online quota they give you rapidly runs out. But it works well enough that it’s what I’m using at the moment. (Which is to say that it’s adequate and better than the eye-gouging, head-bangingly fiddly EndNote. BibTeX is out of the question due to the need to work with MSWord. Zotero is reasonably feature-equivalent with Mendeley but requires you to use Firefox and thus loses out.)

A utility that has proved invaluable over the years is c2bib, which lets you copy some text giving a bibitem, makes a few guesses about the format, asks for some manual intervention, and then saves the result in a BibTeX file. Recommended for when you’re left manually entering a mass of bibitems.