Background And Method

Inspired by Michael Mitzenmacher's flamebait post on SODA/STOC/FOCS, we decided to roll up our sleeves, and resolve the biggest outstanding issue in Theoretical Computer Science, namely the great "STOC/FOCS vs. SODA" debate ("P vs. NP" is a tainted second, sullied by all that money being offered to solve it). We have some interesting preliminary observations, and there are many interesting problems left open by our work ;)

The hypothesis:

There is a significant difference in citation patterns between STOC/FOCS and SODA

The plan:

First, we obtained the list of titles of conference papers appearing in STOC, FOCS and SODA in the last 10 years (1997-2006). We deliberately excluded 2007 because FOCS hasn't happened yet. We got this list from DBLP (Note: DBLP does not make any distinction between actual papers and tutorials/invited articles; we decided to keep all titles because there weren't that many tutorials/invited papers in any case).

For each title, we extracted the citation count from Google Scholar, using a process that we henceforth refer to as "The Extractor". Life is too short to describe what "The Extractor" is. Suffices to say that its output, although not perfect, has been verified to be somewhat close to the true distribution (see below).

The results can be found at this link. The tables and graphs are quite self-explanatory.

OBSERVATIONS: The main conclusion is that the hypothesis is valid: a systematic discrepancy between citation counts of SODA vs. STOC/FOCS does appear to exist. However, the discrepancy varies significantly over time, with years 1999-2001 experiencing the highest variations. It is interesting to note that 1999 was the the year when SODA introduced four parallel sessions as well as the short paper option. Although most of the stats for STOC and FOCS are quite similar, there appears to be a discrepancy at the end of the tail. Specifically, the 5 highest citation counts per year for STOC (years 1997-2001) are all higher than the highest citation count for FOCS (year 2001). (Note: the highest cited STOC article in 2001 was Christos Papadimitriou's tutorial paper on algorithms and game theory)

Update (7/28): As a result of a Google Scholar rebuild, the top 5 cited papers include one paper from FOCS (FOCS 2000)

Another interesting observation comes from separating the SODA cites into long and short paper groups (for the period 1999-2005). Plotting citations for short vs long papers separately indicates that the presence of short papers caused a net downward influence on SODA citation counts, but as fewer and fewer shorts were accepted, this influence decreased.

APPENDIX: methods and techniques.

The data gathering process had the following systematic sources of error:
  1. Erasures: no citation count. This could mean several things: the number was 0; the title was mispelled; the DBLP title was different from the Google Scholar version, etc. We noticed that the inclusion of "Extended Abstract" can throw the search off, and it often needs to be removed from the title for an accurate search.
  2. Wrong citation count. Typically, in this case, the number given was an overestimate of the actual count. Reasons as above. Also note that self-citations can inflate the apparent citation count of a paper. We made no attempt to remove self-citations.
  3. Deletions: the paper titles were trolled from DBLP using automatic scripts, and occasionally parse errors in HTML caused titles to drop. There are several inconsistencies in how DBLP renders the list of papers for a conference in different years, and our scripts tried to take this into account, but there are some known issues that need to be fixed.
To make sure that the output makes sense, we performed a few "checks and balances". In particular:
Median (Mike's/Ours)Total (Mike's/Ours)
FOCS 38/38 3551/3315
STOC 21/21 3393/2975
SODA 14/13 2578/2520

Acknowledgements

Adam Buchsbaum for suggesting the idea of, and generating data for the short-long variants of SODA. David Johnson, Graham Cormode, and Sudipto Guha for bug fixes, useful suggestions and ideas for further plots (which we can look into after the data is cleaned up)
Created by Piotr Indyk and Suresh Venkatasubramanian. Original idea by Michael Mitzenmacher. Last modified: Mon Jul 30 21:15:44 MDT 2007