Today I spent all day with Sandra Anderson’s citation graph lineage queries. Though I can compute “all-pairs reachability” for the first 10000 papers in the dataset… I can only currently compute “least-common ancestor” for the first 500 papers. There are some severe algorithmic scalability challenges here that we are excited to tackle.
Part of the problem is that there are 2 million papers and it seems that about 2%—5% of all paper pairs have a common ancestor. This implies that the result set is pretty big — 200 billion papers. This does not scare us, but it is a good real use case!
Due to data restrictions, we are working with anonymized paper IDs for this project and only Jevin has the secret mapping. To ensure that our computation is sensible, we sent him about 2300 pairs of papers and their least common ancestors, and he de-blinded some of the titles. The results are pretty fascinating:
- some of the least common ancestors are back in the 1700’s
- some of the least common ancestors are 54 citations deep from one of their papers (update: it appears this is probably bad data, because one such chain is actually anachronistic)
- The common ancestors for papers in different fields are often old and seem really fundamental, e.g. (What is Capital?, I. Fisher, 1896) and (On the Mathematical Foundations of Theoretical Statistics, R. A. Fisher, 1922).
Tomorrow I will work more on the scaling issues!
Comments !