2012-12-13

arXiv modeling

Camp Hogg (which includes Muandet these days) had lunch with David Blei (Princeton), who is a computer scientist and machine-learning expert. He told us about projects he is doing to index and provide recommendations for arXiv papers, based (presumably) on his experience with author–topic modeling. Blei is a kindred spirit, because he favors methods that have an underlying graphical model or probabilistic generative model. We agreed that this is beneficial, because it moves the decision making from "what algorithm should we use?" to more scientific questions like "what is causing our noise?" and "what aspects of the problem depend on what other aspects?" These scientific questions lay the assumptions and domain-knowledge input bare.

We talked about the value of having arXiv indexing, how automated paper recommendations might be used, what things could cause users to love or hate it, and what kinds of external information might be useful. We mentioned Twitter. Blei noted that any time that you have a set of user bibliographies—that is, the list of papers they care about or use—those bibliographies can help inform a model of what the papers are about. For example, a paper might be in the statistics literature, and have only statistics words in it, but in fact be highly read by physicists. That is an indicator that the paper's subject matter spills into physics, in some very real sense. One of Blei's interests is finding influential interdisciplinary papers by methods like these. And the nice thing is that external forums like Twitter, Facebook (gasp), and user histories at the arXiv effectively provide such bibliographies.

Late in the day we met up with Micha Gorelick (bitly) to discuss our plans for the dotastronomy hack day in New York City this weekend (organized by Gus Muench, Harvard). We are wondering if we could hack from idea to submittable paper in one day.

1 comment:

  1. Dutch Railroader, 18 December 2012, 13:33

    There are a number of good and bad ways this can be done. Amazon and other sites, for example, record what you look at and find matches based on that set; the "also viewed" feature is part of this.

    I have experimented with great success with a trivial but effective friends-of-friends approach for finding papers that are very close to a given paper but may have been overlooked. One takes a paper's reference list, finds all the references in those references and all the citations to those references, crosses the items on the original reference list off this new list, and sorts by hits. This identifies papers not cited by your work that nonetheless have close kinship with it, based on common citations and references.
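The friends-of-friends procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the commenter's actual code: the function name `related_papers` and the two lookup tables `references` and `citations` (plain dictionaries keyed by paper identifier) are hypothetical stand-ins for whatever citation database is really available, and a candidate is counted at most once per item on the original reference list.

```python
from collections import Counter


def related_papers(paper, references, citations):
    """Rank papers that share references/citations with `paper`.

    references: dict mapping paper id -> list of ids that paper cites
    citations:  dict mapping paper id -> list of ids that cite that paper
    Both tables are hypothetical inputs assumed for this sketch.
    """
    own_refs = set(references.get(paper, []))
    hits = Counter()
    for ref in own_refs:
        # "References in the references, and citations to the references."
        neighbours = set(references.get(ref, [])) | set(citations.get(ref, []))
        for candidate in neighbours:
            hits[candidate] += 1
    # Cross out everything already on the original reference list,
    # and the starting paper itself.
    for known in own_refs | {paper}:
        hits.pop(known, None)
    # Sort by number of hits (descending), then by id for a stable order.
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))
```

Papers at the top of the returned list are those reached through the largest number of entries on the original reference list, which is one plausible reading of "sorts by hits".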

    One might extend this in various ways to identify arXiv works. A trivial procedure would simply count the hits on a target reference list from each new paper's references. This would not bring up neat "far away" papers that you might enjoy, but it's a very direct way of targeting new papers that touch what you're working on.
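One possible reading of this simpler procedure is to score each new paper by how many of its references appear on a target reference list. The sketch below assumes that reading; the function name `rank_new_papers` and the `new_papers` dictionary (new paper id mapped to its reference list) are hypothetical, and how the reference lists would be extracted from arXiv submissions is left open.

```python
def rank_new_papers(target_refs, new_papers):
    """Rank new papers by overlap with a target reference list.

    target_refs: iterable of paper ids you already cite or care about
    new_papers:  dict mapping new paper id -> list of ids it cites
    Both inputs are hypothetical structures assumed for this sketch.
    """
    target = set(target_refs)
    scored = []
    for paper_id, refs in new_papers.items():
        n_hits = len(target & set(refs))  # shared references
        if n_hits > 0:
            scored.append((paper_id, n_hits))
    # Most hits first; ties broken by id for a stable order.
    return sorted(scored, key=lambda item: (-item[1], item[0]))
```

New papers with no overlap are dropped entirely, which matches the remark that this approach will not surface "far away" papers.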
