Dr Kilian Walsh

One of the great pleasures of my job is being involved in the bestowal of PhDs! Today, Kilian Walsh (NYU) defended his PhD, which he did with Jeremy Tinker (NYU). The thesis was about the connections between galaxies and their halos. As my loyal reader knows, I find it amazing that this description of the world works at all, but it works incredibly well.

One of the puzzling results from Walsh's work is that although halos themselves have detailed properties that depend on how, where, and when they assembled their mass, the properties of the galaxies that they contain don't seem to depend on any halo property except the mass itself! So although the halos have (say) spin parameters that depend on assembly time, the galaxies don't seem to have properties that depend on the halo spin parameter! Or if they do, the dependence is pretty subtle. This subject is called halo assembly bias and galaxy assembly bias; there is plenty of the former and none of the latter. Odd.

Of course the tools used for this are blunt tools, because we don't get to see the halos! But Walsh's work has been about sharpening those tools. (I could joke that he sharpens them from extremely blunt to very blunt!) For example, he figured out how to use the void probability function in combination with clustering to put stronger constraints on halo occupation models.
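For the curious, the void probability function is simple to state: it is the chance that a randomly placed sphere of radius R contains no galaxies at all. Here is a minimal Monte Carlo sketch in a periodic box; this is my own toy illustration of the statistic, not Walsh's code, and every name in it is made up:

```python
import numpy as np

def void_probability_function(points, radius, n_spheres=10000, box_size=1.0, seed=None):
    """Estimate P0(R): the probability that a randomly placed sphere of
    radius R contains zero galaxies, in a periodic box."""
    rng = np.random.default_rng(seed)
    centers = rng.uniform(0.0, box_size, size=(n_spheres, 3))
    empty = 0
    for c in centers:
        d = np.abs(points - c)
        d = np.minimum(d, box_size - d)  # nearest-image convention
        if not np.any((d ** 2).sum(axis=1) < radius ** 2):
            empty += 1
    return empty / n_spheres
```

For an unclustered (Poisson) sample the answer is just exp(-n V); clustered galaxy samples leave more empty spheres than that, which is why the VPF adds information beyond the two-point function.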

Congratulations Dr. Walsh!


out sick

I was out sick today. It was bad, because I was supposed to give the kick-off talk at Novel Ideas for Dark Matter in Princeton.


what's permitted for target selection?

Because I have been working with Rix (MPIA) to help the new project SDSS-V make plans to choose spectroscopic targets, and also because of work I have been doing with Bedell (Flatiron) on planning radial-velocity follow-up observations, I find myself saying certain things over and over again about how we are permitted to choose targets if we want it to be easy (and, even more importantly, possible) to use the data in a statistical project that, say, determines the population of stars or planets, or, say, measures the structural properties of the Milky Way disk. Whenever I am saying things over and over again, and I don't have a paper to point to, that suggests we write one. So today I started conceiving a paper about selection functions in general, and what you gain and lose by making them more complicated in various ways. And what is not allowed, ever!


#hackAAS at #aas233

Today was the AAS Hack Together Day, sponsored by the National Science Foundation and by the Moore Foundation, both of which have been very supportive of the research I have done, and both of which are thinking outside the box about how we raise the next generation of scientists! We had a huge group and lots happened. If you want to get a sense of the range and scope of the projects, look at these telegraphic wrap-up slides, which (as always) only paint a partial picture!

We were very fortunate to have Huppenkothen (UW) in the room, and in (literally) five minutes before we started, she put together these slides about hack days. I love that! I think Huppenkothen is the world ambassador and chief philosopher of hacking.

I worked on two hacks. Well really one. The one I didn't really work on was to launch a Mastodon instance. Mastodon is the open-source alternative to Twitter(tm) and has nice features like content warnings (on which you can filter) and community-governable rules and restrictions. I thought it might be fun to try to compete with the big players in social! Although I didn't work on it at all, Dino Bektešević (UW) took over the project and (with a lot of hacking) got it up and running on an AWS instance. It took some hacking because (like many open-source projects) the documentation and tutorials were out of date and filled with version (and other) inconsistencies. But Bektešević (and I by extension) learned a lot!

The hack I actually did (a very tiny, tiny bit of) work on was to write a stellar-binaries-themed science white paper for the Decadal Survey. Katie Breivik (CITA) and Adrian Price-Whelan (Princeton) are leading it. Get in touch with us if you want to help! The point is: Binary stars are a critical part of every science theme for the next decade.


#AAS233, day 3

I arrived today at #AAS233. I'm here mainly for the Hack Together Day (which is tomorrow), but I did go to some exoplanet talks. One nice example was Molly Kosiarek (UCSC), who talked about a small planet in some K2 data. She fit Gaussian Processes to the K2 light curve and used that to determine kernel parameters for a quasi-periodic stochastic process. She then used those kernel parameters to fit the radial-velocity data to improve her constraints on the planet mass. She writes more in this paper. Her procedure involves quite a few assumptions, but it is cool because it is a kernel-learning problem, and she was explicitly invoking an interesting kind of generalizability (learning on light curve, applying to spectroscopy).
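For context, the quasi-periodic kernel used in much of the stellar-activity literature multiplies a squared-exponential decay by a periodic term. A sketch of the idea, assuming that standard form; this is illustrative only, not Kosiarek's actual pipeline, and all the numbers are invented:

```python
import numpy as np

def quasi_periodic_kernel(t1, t2, amp, ell, gamma, period):
    """Standard quasi-periodic covariance for stellar activity:
    a squared-exponential decay times a periodic term."""
    tau = t1[:, None] - t2[None, :]
    return amp ** 2 * np.exp(-0.5 * (tau / ell) ** 2
                             - gamma * np.sin(np.pi * tau / period) ** 2)

# Hyper-parameters learned from the light curve (not shown here) get
# reused to build the RV covariance; the planet fit is conditioned on it.
t_rv = np.linspace(0.0, 30.0, 40)
K = quasi_periodic_kernel(t_rv, t_rv, amp=3.0, ell=20.0, gamma=2.0, period=12.5)
K[np.diag_indices_from(K)] += 0.5 ** 2  # per-point RV noise variance
```

The generalizability move is exactly the reuse in the middle: the kernel hyper-parameters are learned on photometry, then frozen and applied to the spectroscopic RVs.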

Late in the day I had a conversation with Jonathan Bird (Nashville) about the challenges of getting projects done. And another with Chris Lintott (Oxford) about scientific communication on the web and in the journals.


reproducing old results

I spent a bit of research time making near-term plans with Storey-Fisher (NYU), who is developing new estimators of clustering statistics. Because clustering is two-point (at least), computational complexity is an issue; she is working on getting things fast. She has had some success; it looks like we are fast enough now. The near-term goals are to reproduce some high-impact results from some Zehavi papers on SDSS data. Then we will have a baseline to beat with our new estimators.
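To make the complexity point concrete, here is the naive form of the standard Landy–Szalay estimator, which counts all pairs and is therefore O(N²) in both time and memory; this brute-force version is exactly the baseline cost that needs beating. A toy sketch with made-up variable names, not Storey-Fisher's code:

```python
import numpy as np

def pair_counts(a, b, r_edges):
    """Brute-force pair counts in separation bins: O(N^2), the cost
    that clever estimators try to beat."""
    d = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1))
    return np.histogram(d.ravel(), bins=r_edges)[0].astype(float)

def landy_szalay(data, randoms, r_edges):
    """xi(r) = (DD - 2 DR + RR) / RR, with pair counts normalized by
    the number of (ordered) pairs. Assumes r_edges[0] > 0, so the
    zero-distance self-pairs in the auto-counts fall outside all bins."""
    nd, nr = len(data), len(randoms)
    dd = pair_counts(data, data, r_edges) / (nd * (nd - 1))
    dr = pair_counts(data, randoms, r_edges) / (nd * nr)
    rr = pair_counts(randoms, randoms, r_edges) / (nr * (nr - 1))
    return (dd - 2.0 * dr + rr) / rr
```

On an unclustered sample this returns values consistent with zero; on the SDSS galaxies of the Zehavi papers it would show the familiar roughly-power-law correlation function.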


expected future-discounted discovery rate

My tiny bit of research today was on observation scheduling: I read a new paper by Bellm et al about scheduling wide-field imaging observations for ZTF and LSST. It does a good job of talking about the issues but it doesn't meet my (particular, constrained) needs, in part because Bellm et al are (sensibly) scheduling full nights of observations (that is, not going just-in-time with the scheduling), and they have separate optimizations for volume searched and slew overheads. However, it is highly relevant to what I have been doing. It also had lots of great references that I didn't know about! They also make a strong case for optimizing full nights rather than going just-in-time. I agree that this is better, provided that your conditions aren't changing under you. If they are changing under you, you can't really plan ahead. Interesting set of issues, and something that differentiates imaging-survey scheduling from spectroscopic follow-up scheduling.

I also did some work comparing expected information gain to expected discovery rate. One issue with information gain is that if it isn't the gain from this exposure alone (and it isn't, because we have to look ahead), then it is hard to write down, because it depends strongly on future decisions (for example, on whether we decide to stop observing the source entirely!). So I am leaning towards making my first contribution on this subject be about discovery rate.

Expected future-discounted discovery rate, that is.
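In case that phrase is opaque: the objective I have in mind is a discounted sum of per-epoch discovery probabilities. A toy sketch, with everything in it (the probabilities, the discount factor, the target names) invented purely for illustration:

```python
import numpy as np

def discounted_discovery_value(p_per_epoch, gamma=0.99):
    """Expected future-discounted number of discoveries for one target:
    sum over future epochs t of gamma^t * p_t, where p_t is the chance
    that epoch t yields a discovery."""
    p = np.asarray(p_per_epoch, dtype=float)
    return float(np.sum(gamma ** np.arange(len(p)) * p))

# A just-in-time scheduler would, at each moment, observe whichever
# target currently has the largest expected future-discounted value:
values = {"target_a": discounted_discovery_value([0.1, 0.1, 0.1]),
          "target_b": discounted_discovery_value([0.05, 0.3])}
best = max(values, key=values.get)
```

The discount factor is doing the work of "expected future": discoveries far in the future count for less, which is what makes the objective finite and makes just-in-time decisions comparable across targets with different numbers of remaining epochs.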


target selection

On the weekend, Rix (MPIA) and I got in a call to discuss the target selection for SDSS-V, which is a future survey to measure multi-epoch spectroscopy for (potentially) millions of stars. The issue is that we have many stellar targeting categories, and the view Rix and I share is that targeting should be based only on the measured properties of stars in a small set of public, versioned photometric and astrometric catalogs.

This might not sound like a hard constraint, but it is: It means you can't use all the things we know about the stars to select them. That seems crazy to many of our colleagues: Aren't you wasting telescope time if you observe things that you could have known, from existing observations, were not in the desired category? That is, if you require that selection be done from a certain set of public information sources, you are guaranteeing an efficiency hit.

But that is compensated (way more than compensated) by the point that the target selection will be understandable, repeatable, and simulate-able. That is, the more automatic the target selection is, from simple inputs, the easier it is to do population analyses and statistical analyses, and to simulate the survey (or what the survey would have done in a different galaxy). See, for example, cosmology: The incredibly precise measurements in cosmology have been made possible by employing simple, inefficient, but easy-to-understand-and-model selection functions. And, indeed: When the selection functions get crazy (as they did in SDSS-III quasar target selection, with which I was involved), the data become very hard to use (the clustering of those quasars on large scales can never be known extremely precisely).
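To make this concrete, a reproducible selection function is just a pure, deterministic function of catalog measurements. A toy example with hypothetical Gaia-like column names and cuts; nothing here is the actual SDSS-V selection:

```python
import numpy as np

def select_targets(catalog):
    """A pure function of measured catalog quantities (hypothetical
    Gaia-like columns; NOT the real SDSS-V selection). It uses only
    measurements, never uncertainties, so rerunning it on the same
    catalog version -- or on a mock catalog -- reproduces the sample."""
    return ((catalog["g_mag"] < 14.0) &
            (catalog["bp_rp"] > 0.8) &
            (catalog["parallax"] > 2.0))  # parallax in mas
```

Because the function is deterministic and depends only on public, versioned inputs, you can apply it to a simulated catalog and get the simulated survey, which is the whole point.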

Side note: This problem has been disastrous for radial-velocity surveys for planets, because in most cases, the observation planning has been done by people in a room, talking. That's extremely hard to model in a data analysis.

Rix and I also discussed a couple of subtleties. One is that not only should the selection be based on public surveys, it really should be based only on the measurements from those surveys, and not the uncertainties or error estimates. This is in part because the uncertainties are rarely known correctly, and in part because the uncertainties are a property of the survey, not the Universe! But this is a subtlety. Another subtlety is that we might not just want target lists, we might want priorities. Can we easily model a survey built on target priorities rather than target selection? I think so, but I haven't faced that yet in my statistical work.


refereeing, selecting, and planning

I don't think I have done a good job of writing the rules for this blog, because I don't get to count refereeing. Really, refereeing papers is a big job, and it really is research, since it sometimes involves a lot of literature work or calculation. I spent a large part of today on refereeing projects. Not research? Hmmm.

Also not counting as research: I worked on the Gaia Sprint participant selection. This is a hard problem because everyone who applied would be a good participant! As part of this, I worked on demographic statistics of the applicant pool and the possibly selected participants. I hope to be sending out emails next week (apologies to those who are waiting for us to respond!).

Late in the day I had a nice conversation with Stephen Feeney (Flatiron) about his upcoming seminar at Toronto. How do different aspects of data analysis relate? And how do the different scientific targets of that data analysis relate? And how to tell the audience what they want to know about the science, the methods, and the speaker. I am a big believer that a talk you give should communicate things about yourself and not just the Universe. Papers are about the Universe, talks are about you. That's why we invited you!


the limits of wobble

The day was pretty-much lost to non-research in the form of project management tasks and refereeing and hiring and related. But I did get in a good conversation with Bedell (Flatiron), Luger (Flatiron), and Foreman-Mackey (Flatiron) about the hyper-parameter optimization in our new wobble code. It requires some hand-holding, and if Bedell is going to “run on everything” as she intends to this month, it needs to be very robust and hands-free. We discussed for a bit and decided that she should just set the hyper-parameters to values we know are pretty reasonable right now and run on everything; we should only reconsider this question after we have a bunch of cases in hand to look at and understand. All this relates to the point that although we know wobble works incredibly well on the data we have run it on, we don't currently know its limits in terms of signal-to-noise, number of epochs, phase coverage in the barycentric year, and stellar temperature.


finished a paper!

It was a great day at Flatiron today! Megan Bedell (Flatiron) finished her paper on wobble. This paper is both about a method for building a data-driven model for high-resolution spectral observations of stars (for the purposes of making extremely precise radial-velocity measurements), and about an open-source code that implements the model. One of the things we did today before submission is discuss the distinction between a software paper and a methods paper, and then we audited the text to make sure that we are making good software/method distinctions.

Another thing that came up in our finishing-up work was the idea of an approximation: As I like to say, once you have specified your assumptions or approximations with sufficient precision, there is only one method to implement. That is, there isn't an optimal method! There is only the method, conditioned on assumptions. But now the question is: What is the epistemological status of the assumptions? I think the assumptions are just choices we make in order to specify the method! That is, when we treat the noise as Gaussian, it is not a claim that the noise is truly Gaussian! It is a claim that we can treat it as Gaussian and still get good and useful results. Once again, my pragmatism. We audited a bit for this kind of language too.
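A tiny worked example of that slogan: assume a linear model with independent Gaussian noise of known width, and the method is fully determined. You minimize chi-squared, which has a closed-form solution. This is a generic illustration of the "assumptions determine the method" point, not wobble itself:

```python
import numpy as np

def best_fit(A, y, sigma):
    """Weighted least squares: the unique method implied by assuming a
    linear model y = A x plus independent Gaussian noise of known sigma.
    Minimizing chi^2 leads to the normal equations solved here."""
    w = 1.0 / sigma ** 2        # inverse-variance weights
    AtW = A.T * w               # weight each data point's column
    return np.linalg.solve(AtW @ A, AtW @ y)
```

And the pragmatism shows up here too: treating the noise as Gaussian with these sigmas is not a claim about the true noise; it is a choice that makes the method this one, and the test of the choice is whether the results are good and useful.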

We submitted the paper to the AAS Journals and to arXiv. Look for it on Thursday night (US time) or Friday morning!


long-form writing

I'm spending some time over the break thinking about possible long-form writing projects. I have an Atlas of Galaxies to finish, and I have ideas about possible books on introductory mechanics, and ideas about something on the practice and deep beliefs of scientists. And statistics and data analysis, of course! I kicked those around and wrote a little in a possible mechanics preface.


scientific priorities

I spent a piece of the morning exhaustively going through short-term priorities with Bedell (Flatiron). We discussed strategy given her stage. She has enough projects to last a decade! I guess we all do, but it is still amazing when we list them. We decided to focus on things that make direct use of the technologies we have built and not particularly build new technology for a bit. We also decided to submit the wobble paper right after the break.

After this, we segued into a conversation about the (badly named) Rossiter-McLaughlin effect with Luger (Flatiron) and Beale (Flatiron). The effect is the apparent change in a star's radial velocity as a planet transits its surface: the star is rotating, so there is a spatial gradient in surface RV across its disk. We discussed what is involved in modeling this more accurately than is currently done. There were some philosophical issues coming up around flux conservation, limb darkening, and continuum normalization. All hard issues!
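For intuition, the zeroth-order version of the effect is easy to write down: the apparent RV is the flux-weighted mean of the surface velocity field, and the transiting planet removes a patch of it. A toy sketch assuming rigid rotation and ignoring limb darkening and the finite planet size, which are exactly the hard parts we were discussing:

```python
def rm_anomaly(x_planet, f_blocked, v_eq):
    """Toy Rossiter-McLaughlin anomaly. A rigidly rotating star has
    projected surface velocity v_eq * x (x in stellar radii, measured
    from the projected rotation axis). A planet blocking flux fraction
    f at position x removes that patch from the flux-weighted mean
    velocity, shifting the apparent RV by -f * v_local / (1 - f)."""
    v_local = v_eq * x_planet
    return -f_blocked * v_local / (1.0 - f_blocked)
```

The sign pattern is the classic signature: blocking the approaching (blueshifted) limb produces an apparent redshift, and vice versa. Limb darkening and continuum normalization enter precisely through how the flux weighting is done, which is why they came up as the hard issues.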

At the end of the day I got in a short quality conversation (over wine) with Alex Barnett (Flatiron) so I could pre-flash him the correlation-function and power-spectrum problems that Storey-Fisher (NYU) and I will bring him in January. He agreed that we are going to effectively unify Fourier-space and real-space approaches when we make them all more efficient and more accurate. So excited about a winter of clustering!


almost nothing

My only research today was a short conversation with Bedell (Flatiron) about finishing up our paper on wobble.


office hours

It is exam week here, so I spent my whole day holding marathon office hours. That was fun! But not research.