Showing posts with label reproducibility. Show all posts

2023-05-30

Dr Irina Espejo

Today it was my honor to serve on the PhD defense committee of Irina Espejo (NYU), who is one of the first (ever in the world, actually!) PhDs in Data Science. Her PhD research involved making real, practical, scalable, reproducible tools for the (late-in-pipeline) analysis of high-energy physics data from the Large Hadron Collider. She built tools to speed up likelihood-free inference, and she built a tool to find exclusion regions (upper limits) in complex parameter spaces. She used the latter to put constraints on a (real, not toy) proposed modification to the standard model.

On the first project, the tools that she built (and built on) make the LHC more sensitive to new physics, because they find better test statistics for distinguishing models. They make some searches far better, which makes me wonder whether particle physics is using our money efficiently??

2023-02-13

reproducibility; reionization

Today featured a blackboard talk by Sultan Hassan (NYU), about a semi-analytic model to explain the various bits of data we have about the reionization of the universe at redshifts around 7. The model is baroque, but there are no other options when it comes to problems that are this deep in gastrophysics.

After lunch I spent an hour on a panel organized by NYU Libraries about reproducibility in the natural sciences. That was fun; so many ideas! One interesting idea is that it is transparency, more than reproducibility, that is important. Another was a technical suggestion: If you want your students to be good at making reproducible code, they shouldn't bring you plots of their results, they should bring you code that you can run to make those plots! Haha, genius.

2019-09-09

enumerating all possible statistical tests

Today I had my first weekly meeting (of the new academic year) with Kate Storey-Fisher (NYU). We went through priorities and then spoke about the problem of performing some kind of comprehensive or complete search of the large-scale structure data for anomalies. One option (popular these days) is to train a machine-learning method to recognize what's ordinary and then ask it to classify non-ordinary structures as anomalies. This is a great idea! But it has the problem that, at the end of the day, you don't know how many hypotheses you have tested. If you find a few-sigma anomaly, that isn't surprising if you have looked in many thousands of possible “places”. It is surprising if you have only looked in a few. So I am looking for comprehensive approaches where we can pre-register an enumerated list of tests we are going to do, but have that list of tests be exceedingly long (like machine-generated). This is turning out to be a hard problem.
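The look-elsewhere arithmetic behind this point can be sketched in a few lines. This is just illustrative, and it assumes the tests are independent (which a real anomaly search won't quite satisfy), but it shows how fast a few-sigma result loses its surprise value:

```python
import math

def local_p(z):
    """Two-sided tail probability of a z-sigma Gaussian fluctuation."""
    return math.erfc(z / math.sqrt(2.0))

def global_p(z, n_tests):
    """Probability of at least one >= z-sigma fluctuation somewhere
    among n_tests independent tests (a Sidak-style correction)."""
    return 1.0 - (1.0 - local_p(z)) ** n_tests

# A 3-sigma anomaly is rare in one pre-registered test...
print(global_p(3.0, 1))       # ~0.0027
# ...but essentially guaranteed if you looked in ten thousand "places".
print(global_p(3.0, 10_000))  # ~1.0
```

This is also why pre-registering the (long, enumerated) list of tests matters: the correction is only computable if you know n_tests.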

2019-01-18

Dr Lukas Heinrich

It was an honor and a privilege to serve on the PhD defense committee of Lukas Heinrich (NYU), who has had a huge impact on how particle physicists do data analysis. For one, he has designed and built a system that permits re-use of intermediate data results from the ATLAS experiment in new data analyses, measurements, and searches for new physics. For another, he has figured out how to preserve data analyses and workflows in a reproducible framework using containers. For yet another, he has been central in convincing the ATLAS experiment and CERN more generally to adopt standards for the registration and preservation of data analysis components. And if that's not all, he has structured this so that data analyses can be expressed as modular graphs and modified and re-executed.

I'm not worthy! But in addition to all this, Heinrich is a great example of the idea (that I like to say) that principled data analysis lies at the intersection of theory and hardware: His work on ruling out supersymmetric models using ATLAS data requires a mixture of theoretical and engineering skills and knowledge that he has nailed.

The day was a pleasure, and that isn't just the champagne talking. Congratulations Dr. Heinrich!

2019-01-11

what's permitted for target selection?

Because I have been working with Rix (MPIA) to help the new project SDSS-V make plans to choose spectroscopic targets, and also because of work I have been doing with Bedell (Flatiron) on thinking about planning radial-velocity follow-up observations, I find myself saying certain things over and over again about how we are permitted to choose targets if we want it to be easy (and even more importantly, possible) to use the data in a statistical project that, say, determines the population of stars or planets, or, say, measures the structural properties of the Milky Way disk. Whenever I am saying things over and over again, and I don't have a paper to point to, that suggests we write one. So I started conceiving today a paper about selection functions in general, and what you gain and lose by making them more complicated in various ways. And what is not allowed, ever!
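One of the things I say over and over is that a complicated selection function is fine, provided it is recorded; an unrecorded one is what's never allowed. A minimal sketch of why, in which everything (the selection function, the magnitudes, the numbers) is made up for illustration: if each star enters the sample with a known probability, population totals can be recovered by inverse-probability (Horvitz–Thompson) weighting.

```python
import random

random.seed(42)

def select_prob(mag):
    """Hypothetical selection function: bright stars are always targeted,
    fainter ones with linearly decreasing probability."""
    return 1.0 if mag < 14.0 else 1.0 - 0.2 * (mag - 14.0)

# A made-up "true" population of 100,000 stars with magnitudes 12..18.
truth = [random.uniform(12.0, 18.0) for _ in range(100_000)]

# Apply the selection: each star is observed with probability select_prob(mag).
sample = [m for m in truth if random.random() < select_prob(m)]

# The raw count is biased low, but weighting each observed star by
# 1 / select_prob undoes the selection (Horvitz-Thompson estimator).
estimate = sum(1.0 / select_prob(m) for m in sample)
print(len(sample), round(estimate))
```

The weights exist only because select_prob was recorded; a selection made by eye, or one that depends on the measured quantity itself in an undocumented way, leaves no such correction available.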

2019-01-08

reproducing old results

I spent a bit of research time making near-term plans with Storey-Fisher (NYU), who is developing new estimators of clustering statistics. Because clustering is two-point (at least), computational complexity is an issue; she is working on getting things fast. She has had some success; it looks like we are fast enough now. The near-term goals are to reproduce some high-impact results from some Zehavi papers on SDSS data. Then we will have a baseline to beat with our new estimators.
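To see why the two-point nature drives the cost: the naive pair count touches all N(N−1)/2 pairs, which is what Storey-Fisher's work is about avoiding. A brute-force toy sketch (not her estimator; real codes use trees or grids to skip most pairs):

```python
import math

def pair_count(points, r_min, r_max):
    """Naive O(N^2) count of point pairs with separation in [r_min, r_max).
    The nested loop below is exactly the cost problem: every pair is visited."""
    count = 0
    for i in range(len(points)):
        xi, yi, zi = points[i]
        for j in range(i + 1, len(points)):
            xj, yj, zj = points[j]
            r = math.sqrt((xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2)
            if r_min <= r < r_max:
                count += 1
    return count

# Toy example: four points on a line, one unit apart.
pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
print(pair_count(pts, 0.5, 1.5))  # 3 adjacent pairs at separation 1.0
```

Binned pair counts like this are the raw ingredient of two-point estimators (data–data, data–random, random–random), which is why getting them fast matters before any comparison to the Zehavi results.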