Background And Method
Inspired by Michael Mitzenmacher's
flamebait post
on SODA/STOC/FOCS, we decided to roll up our sleeves, and resolve
the biggest outstanding issue in Theoretical Computer Science, namely
the great "STOC/FOCS vs. SODA" debate ("P vs. NP" is a tainted second,
sullied by all that money being offered to solve it). We have some
interesting preliminary observations, and there are many interesting
problems left open by our work ;)
The hypothesis:
There is a significant difference
in citation patterns between STOC/FOCS and SODA
The plan:
First, we obtained the list of titles of conference papers appearing in
STOC, FOCS and SODA in the last 10 years (1997-2006). We deliberately
excluded 2007 because FOCS hasn't happened yet. We got this list from DBLP
(Note: DBLP does not make any distinction between actual papers and
tutorials/invited articles; we decided to keep all titles because there
weren't that many tutorials/invited papers in any case).
For each title, we extracted the citation count from Google Scholar, using
a process that we henceforth refer to as "The Extractor". Life is too
short to describe what "The Extractor" is. Suffices to say that its
output, although not perfect, has been verified to be somewhat close to
the true distribution (see below).
The results can be found at this link. The tables and graphs are quite self-explanatory.
OBSERVATIONS:
The main conclusion is that the hypothesis is valid: a systematic discrepancy between citation counts of SODA vs. STOC/FOCS does appear to exist. However, the discrepancy varies significantly over time, with years 1999-2001 experiencing the highest variations. It is interesting to note that 1999 was the the year when SODA introduced four parallel sessions as
well as the short paper option.
Although most of the stats for STOC and FOCS are quite similar, there appears to be a discrepancy at the end of the tail. Specifically, the 5 highest citation counts per year for STOC (years 1997-2001) are all higher than the highest citation count for FOCS (year 2001). (Note: the highest cited STOC article in 2001 was Christos Papadimitriou's tutorial paper on algorithms and game theory)
Update (7/28): As a result of a Google Scholar rebuild,
the top 5 cited papers include one paper from FOCS (FOCS 2000)
Another interesting observation comes from separating the SODA cites into long and short paper groups (for the period 1999-2005). Plotting citations for short vs long papers separately indicates that the presence of short papers caused a net downward influence on SODA citation counts, but as fewer and fewer shorts were accepted, this influence decreased.
APPENDIX: methods and techniques.
The data gathering process had the following systematic sources of error:
- Erasures: no citation count. This could mean several things: the number was 0; the title was mispelled; the DBLP title was different from the Google Scholar version, etc. We noticed that the inclusion of "Extended Abstract" can throw the search off, and it often needs to be removed from the title for an accurate search.
- Wrong citation count. Typically, in this case, the number given was an overestimate of the actual count. Reasons as above. Also note that self-citations can inflate the apparent citation count of a paper. We made no attempt to remove self-citations.
- Deletions: the paper titles were trolled from DBLP using automatic scripts,
and occasionally parse errors in HTML caused titles to drop. There are several inconsistencies in how DBLP renders the list of papers for a conference in different years, and our scripts tried to take this into account, but there are some known issues that need to be fixed.
To make sure that the output makes sense, we performed a few "checks and
balances". In particular:
- we sampled 10 random titles from each of FOCS, STOC and SODA for each of the 10 years, and for each title we checked the citation count by hand. Results: there were 7 mistakes in FOCS, 9 in STOC, and 11 in SODA, indicating a current error rate in the 9-10% range.
- for each of FOCS, STOC, SODA, we verified (by hand) the values of the top 10 citation numbers, as reported by The Extractor
- we compared our stats for the year 2000 with the stats obtained by Michael. The results are pretty close:
|
| Median (Mike's/Ours) | Total (Mike's/Ours) |
| FOCS | 38/38 | 3551/3315 |
| STOC | 21/21 | 3393/2975 |
| SODA | 14/13 | 2578/2520 |
Acknowledgements
Adam Buchsbaum for suggesting the idea of, and generating data for the short-long variants of SODA. David Johnson, Graham Cormode, and Sudipto Guha for bug fixes, useful suggestions and ideas for further plots (which we can look into after the data is cleaned up)
Created by Piotr Indyk and Suresh Venkatasubramanian. Original idea by
Michael Mitzenmacher.
Last modified: Mon Jul 30 21:15:44 MDT 2007