
2025-07-08

robust dimensionality reductions

Dimensionality reduction (the most basic being PCA) is very sensitive to outliers: A single bad pixel can dominate most objectives and thus create a spurious dimension. One of the best and most classic solutions to this is the robust PCA method, which is presented in a (very long) paper with impressive math and beautiful results. Yesterday Hans-Walter Rix (MPIA) and I coded it up and applied it to ESA Gaia RVS spectra, with extensive (and impressive) help from Claude. It looks very promising, especially in capturing oddities in hot stars. Today I worked out that there should be something similar that takes into account data weights (inverses of squared uncertainties), and I wrote down the algorithm (on paper). We'll see.
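For reference, here is a minimal numpy sketch of the standard principal-component-pursuit form of robust PCA (split the data matrix into a low-rank part plus a sparse outlier part), solved with a textbook augmented-Lagrangian loop. This is not the code we actually ran on the RVS spectra, it ignores the data weights, and the parameter choices are common defaults rather than tuned values.

```python
# Hedged sketch: principal component pursuit (robust PCA), vanilla ADMM-style loop.
# Decomposes M into a low-rank part L and a sparse outlier part S.
import numpy as np

def shrink(X, tau):
    """Soft-threshold the entries of X at level tau."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    """Soft-threshold the singular values of X at level tau."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(shrink(s, tau)) @ Vt

def robust_pca(M, lam=None, mu=None, tol=1e-7, max_iter=500):
    m, n = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))            # standard PCP choice
    if mu is None:
        mu = 0.25 * m * n / np.sum(np.abs(M))     # common heuristic
    L, S, Y = np.zeros_like(M), np.zeros_like(M), np.zeros_like(M)
    norm_M = np.linalg.norm(M)
    for _ in range(max_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)   # low-rank update
        S = shrink(M - L + Y / mu, lam / mu)          # sparse (outlier) update
        resid = M - L - S
        Y = Y + mu * resid                            # dual update
        if np.linalg.norm(resid) < tol * norm_M:
            break
    return L, S

# Toy test: a rank-2 matrix plus a handful of huge "bad pixel" outliers.
rng = np.random.default_rng(17)
truth = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 100))
data = truth.copy()
data[rng.integers(0, 300, 50), rng.integers(0, 100, 50)] += 100.0
L, S = robust_pca(data)
print(np.linalg.matrix_rank(L, tol=1e-6), np.abs(S).max())
```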

2025-07-07

stellar twins vs synthetic stellar twins

In the Milky Way meeting at MPIA today, a bit of a discussion broke out about using stellar twins, inspired by work by Yuan-Sen Ting (OSU). The idea is: If you have two stars with very similar overall metallicity, and very similar temperature and surface gravity, then it should be possible to measure accurate element abundance anomalies between the two stars, even in the absence of an extremely accurate spectral synthesis code.

My view, which does not contradict this point, is that an even better way to use this stellar-twin idea is to synthesize a twin for every star, using stars that are similar either in parameters or in spectral space. After all, an interpolation to your target star should represent it more accurately than even the most similar individual comparison star. That idea, fundamentally, is the main idea behind The Cannon.
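Here is a hedged sketch of what I mean, far simpler than (but in the spirit of) The Cannon: fit a local linear model to the k most similar library stars in label space and interpolate a synthetic twin at the target's labels. The function name, the choice of k, and the toy data are all illustrative, not from any real pipeline.

```python
# Hedged sketch: synthesize a "twin" spectrum by local linear interpolation
# over the k nearest library stars in label space.
import numpy as np

def synthesize_twin(target_labels, library_labels, library_spectra, k=16):
    """Predict a spectrum at target_labels from the k most similar library stars."""
    d2 = np.sum((library_labels - target_labels) ** 2, axis=1)        # label-space distances
    idx = np.argsort(d2)[:k]
    A = library_labels[idx] - target_labels                            # centered label offsets
    A = np.hstack([np.ones((k, 1)), A])                                # constant + linear terms
    coeffs, *_ = np.linalg.lstsq(A, library_spectra[idx], rcond=None)  # per-pixel linear fit
    return coeffs[0]                                                   # prediction at the target

# Toy usage: 1000 fake library "spectra" that depend smoothly on 3 labels.
rng = np.random.default_rng(8)
labels = rng.normal(size=(1000, 3))
spectra = np.sin(labels @ rng.normal(size=(3, 200))) + 0.01 * rng.normal(size=(1000, 200))
twin = synthesize_twin(labels[0], labels[1:], spectra[1:])
print(np.sqrt(np.mean((twin - spectra[0]) ** 2)))   # twin vs held-out star
```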

2024-03-15

IAIFI Symposium, day two

Today was day two of a meeting on generative AI in physics, hosted by MIT. My favorite talks today were by Song Han (MIT) and Thea Aarestad (ETH), both of whom are working on making ML systems run ultra-fast on extremely limited hardware. Themes were: Work at low precision. Even 4-bit number representations! Radical. And bandwidth is way more expensive than compute: Never move data, latents, or weights to new hardware; work as locally as you can. They both showed amazing performance on terrible, tiny hardware. In addition, Han makes really cute 3d-printed devices! A conversation at the end that didn't quite happen is about how Aarestad's work might benefit from equivariant methods: Her application area is triggers in the CMS device at the LHC; her symmetry group is the Lorentz group (plus permutations and so on). The day started with me on a panel in which my co-panelists said absolutely unhinged things about the future of physics and artificial intelligence. I learned that many people think we are only years away from having independently operating, fully functional artificial physicists that are more capable than we are.

2024-03-14

IAIFI Symposium, day one

Today was the first day of a two-day symposium on the impact of Generative AI in physics. It is hosted by IAIFI and A3D3, two interdisciplinary and inter-institutional entities working on things related to machine learning. I really enjoyed the content today. One example was Anna Scaife (Manchester) telling us that all the different methods they have used for uncertainty quantification in astronomy-meets-ML contexts give different and inconsistent answers. It is very hard to know your uncertainty when you are doing ML. Another example was Simon Batzner (DeepMind) explaining that equivariant methods were absolutely required for the materials-design projects at DeepMind, and that introducing the equivariance absolutely did not bork optimization (as many believe it will). Those materials-design projects have been ridiculously successful. He said the amusing thing “Machine learning is IID, science is OOD”. I couldn't agree more. In a panel at the end of the day I learned that learned ML controllers now beat hand-built controllers in some robotics applications. That's interesting and surprising.

2024-03-10

APOGEE spectra as a training set

I spent a lot of the day building a training set for a machine-learning problem set. I am building the training set out of the SDSS-V APOGEE spectra, which are like one-dimensional images, suitable for training CNNs and other kinds of deep-learning models. I wanted relatively raw data, so I spent a lot of time going deep in the SDSS-V data model and data directories, which are beautiful. I learned a lot, and I created a public data set. I chose stars in a temperature and log-gravity range in which I think the APOGEE pipelines work well and the learning problem should work. I didn't clean the data, because I am hoping that contemporary deep learning methods should be able to find and deal with outliers and data issues. If you want to look at my training set (or do my problem set), start here.

2024-01-09

Galactic cartography

Neige Frankel (CITA) and I discussed measurements of the age and metallicity gradients in the Milky Way today. In my machine-learning world, I am working on biases that come in when you use the outputs of regressions (label transfer) to perform population inferences (like mean age as a function of actions or radius). We are gearing up to do a fake but end-to-end simulation of how the Milky Way gets observed, to see if the observed Galaxy looks anything like (what we know in this fake world to be) the truth.

2024-01-08

auto-encoder for calibration data

Connor Hainje (NYU) is looking at whether we could build a hierarchical or generative model of SDSS-V BOSS spectrograph calibration data, such that we could reduce the survey's per-visit calibration overheads. He started by building an auto-encoder, which is a simple, self-supervised generative model. It works really well! We discussed how to judge performance (held-out data) and how performance should depend on the size of the latent space (I predict that it won't want a large latent space). We also decided that we should announce an SDSS-V project and send out a call for collaboration.

[Note added later: Contardo (SISSA) points out that an auto-encoder is not a generative model. That's right, but there are multiple definitions of generative model, only one of which is that you can sample from it. Another is that it is a parameterized model that can predict the data. Another is that it is a likelihood function for the parameters. But she's right: We are going to turn parts of the auto-encoder into a generative model in the sense of a likelihood function.]
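For concreteness, here is a minimal, self-contained auto-encoder sketch in PyTorch. The architecture, the latent size, and the fake calibration-frame data are placeholders of my own choosing, not the actual SDSS-V BOSS setup; the held-out split reflects how we agreed to judge performance.

```python
# Hedged sketch: a tiny auto-encoder on fake "calibration frames";
# not the real SDSS-V BOSS data or model.
import torch
import torch.nn as nn

n_pix, n_latent = 512, 8          # latent size is the knob we expect to matter

class AutoEncoder(nn.Module):
    def __init__(self, n_pix, n_latent):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_pix, 128), nn.ReLU(),
                                     nn.Linear(128, n_latent))
        self.decoder = nn.Sequential(nn.Linear(n_latent, 128), nn.ReLU(),
                                     nn.Linear(128, n_pix))

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Fake calibration frames: a few smooth modes plus noise.
torch.manual_seed(0)
basis = torch.randn(4, n_pix)
frames = torch.randn(2000, 4) @ basis + 0.05 * torch.randn(2000, n_pix)
train, heldout = frames[:1600], frames[1600:]   # judge performance on held-out data

model = AutoEncoder(n_pix, n_latent)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(train), train)
    loss.backward()
    opt.step()

with torch.no_grad():
    print("held-out MSE:", nn.functional.mse_loss(model(heldout), heldout).item())
```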

2024-01-02

informal scientific communication

I have been sending out my draft manuscript on machine learning in the natural sciences to various people I know who have opinions on this. I've been getting great feedback, and it reminds me that a lot of important scientific communication happens on informal channels. One thing that interests me: Is there a way to make such conversations more public, viewable, and researchable?

2023-12-29

partial differential equations

I am trying to write a proposal to fund the research I do on machine-learning theory. The proposal is to work on ocean dynamics. It's a great application for the things we have done! But it's hard to write a credible proposal in an area that's new to you. Interdisciplinarity and agility are not rewarded in the funding system at present! At least I am learning a ton as I write this.

2023-12-28

philosophy

I've been working on two philosophical projects this month. The first has been an interaction with Jim Peebles (Princeton) around a paper he has been writing, setting down his philosophy of physics. I am pretty aligned with his position, which I expect to hit the arXiv soon. I'm not a co-author of that. But one of the interesting things about science is how much of our work is in anonymous (or quasi-anonymous) support of others.

The second philosophical project is a paper about machine learning and science: I am trying to set down my thoughts about how ML can and can't help the sciences. This is fundamentally a philosophy-of-science question, not a science question.

2023-12-02

try bigger writing

I have been buried in job season and other people's projects. That's good! Hiring and advising are the main things we do in this job. But I decided today that I need to actually start a longer writing project that is my own baby. So I started to turn the set of talks I have been giving about machine learning and astrophysics into a paper. Maybe for the new ICML Position Paper call?

2023-11-27

Terra Hunting Fall Science Meeting, day 1

Today was the first day of the Terra Hunting annual science meeting. One highlight of the day was a presentation by Yan Liang (Princeton), who is modeling stellar spectral variability (the tiny variability) that affects extremely precise radial-velocity measurements. Her method involves a neural network, which is trained to distinguish RV variations from spectral-shape variations through a self-supervised approach (with data augmentation). Then it separates true stellar RV variations from spurious RV variations induced by spectral variability by requiring (essentially) that the RV variations be uncorrelated with the (latent) description of the stellar spectral shape. This connects to various themes I am interested in, including wobble by Bedell, a spectral variability project by Zhao, and causal structure in machine learning.

2023-11-14

conjectures about pre-training

On Monday of this week, Shirley Ho (Flatiron) gave a talk at NYU in which she mentioned the unreasonable effectiveness of pre-training a neural network: If, before you train your network on your real (expensive, small) training data, you train it on a lot of (cheap, approximate) pre-training data, you get better overall performance. Why? Ho discussed this in the context of PDE emulation: She pre-trains with cheap PDEs and then trains on expensive PDEs, and she gets way better performance than she does if she just trains on the expensive stuff.

Why does this work? One interesting observation is that even pre-training on cat videos helps with the final training! Ho's belief is that the pre-training gets the network to understand time continuity and other kinds of smoothness. My conjecture is that the pre-training teaches the network about (approximate) diffeomorphism invariance (coordinate freedom). The cool thing is that these conjectures could be tested with interventions!
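Here is a toy illustration of the recipe in PyTorch: pre-train on a lot of cheap, slightly-wrong data, then fine-tune on a little expensive, correct data. The tasks (noisy sine regression at slightly different frequencies) are stand-ins I made up so the script runs end to end; the interesting experiments are the interventions, like deleting or corrupting the pre-training stage and comparing the final test error.

```python
# Hedged sketch: pre-train on cheap approximate data, fine-tune on scarce correct data.
import torch
import torch.nn as nn

def make_data(n, freq, noise):
    x = torch.linspace(-3, 3, n).unsqueeze(1)
    return x, torch.sin(freq * x) + noise * torch.randn(n, 1)

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

# Stage 1: pre-train on abundant, cheap, slightly-wrong data (frequency is off).
x_pre, y_pre = make_data(5000, freq=2.2, noise=0.2)
for step in range(2000):
    opt.zero_grad()
    nn.functional.mse_loss(net(x_pre), y_pre).backward()
    opt.step()

# Stage 2: fine-tune on scarce, expensive, correct data.
x_fine, y_fine = make_data(64, freq=2.0, noise=0.02)
for step in range(500):
    opt.zero_grad()
    nn.functional.mse_loss(net(x_fine), y_fine).backward()
    opt.step()

# Evaluate against the truth; re-run with stage 1 removed to test the conjectures.
x_test = torch.linspace(-3, 3, 200).unsqueeze(1)
with torch.no_grad():
    print(nn.functional.mse_loss(net(x_test), torch.sin(2.0 * x_test)).item())
```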

2023-11-10

data augmentation

A highlight of my day was a colloquium by Renée Hložek (Toronto) about cosmology and event detection with LSST/Rubin. Importantly (from my perspective), she has run a set of challenges for classifying transients, based on simulations of the output of the very, very loud LSST event-detection systems. The results are a bit depressing, I think (sorry Renée!), because (as she emphasized) all the successful methods (and none were exceedingly successful) made heavy use of data augmentation: They noisified things, artificially redshifted things, dropped data points from things, and so on. That's a good idea, but it shows that present-day machine-learning methods can't easily (or ever?) be told what to expect as an event redshifts, gets fainter, or happens on a different night. I'd love to fix those problems. You can almost think of all of these things as group operations. They are groups acting in a latent space though, not in the data space. Hard problems! But worthwhile.
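Just to make the idea concrete, here is a hedged sketch of that kind of augmentation applied to a toy light curve: noisify at the reported uncertainties, drop epochs, and dim the source. These are generic transformations of my own, not the specific augmentations used in the challenge entries.

```python
# Hedged sketch: generic light-curve augmentations (noisify, drop epochs, dim).
import numpy as np

rng = np.random.default_rng(42)

def augment(times, fluxes, flux_errs):
    """Return one randomly augmented copy of a light curve."""
    f = fluxes + rng.normal(0.0, flux_errs)          # noisify at the reported errors
    keep = rng.random(len(times)) > 0.3              # drop ~30% of the epochs
    t, f, e = times[keep], f[keep], flux_errs[keep]
    dim = rng.uniform(0.3, 1.0)                      # pretend the source is fainter
    return t, dim * f, dim * e

times = np.linspace(0.0, 60.0, 120)
fluxes = np.exp(-0.5 * ((times - 25.0) / 6.0) ** 2)  # a toy transient
flux_errs = 0.02 * np.ones_like(times)
t_aug, f_aug, e_aug = augment(times, fluxes, flux_errs)
print(len(t_aug), float(f_aug.max()))
```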

2023-11-08

linear regression

Valentina Tardugno (NYU) and I are looking at the NASA TESS housekeeping data: What parts of it are relevant to understanding the light curves? The weird thing is: We are asking this by asking: What housekeeping data can be reliably predicted using the light curves? Why this way? Because the light curves are higher in signal-to-noise (in general) than most channels of the housekeeping data. Today we went through all the relevant linear algebra for big linear models (which is where we are starting, of course!).
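As a concrete (but hedged) version of that linear algebra, here is a ridge-regularized linear model in numpy that predicts every housekeeping channel from the light curves at once; the shapes and the regularization strength are placeholders, not the real TESS dimensions.

```python
# Hedged sketch: ridge regression predicting housekeeping channels from light curves.
import numpy as np

rng = np.random.default_rng(3)
n_times, n_stars, n_channels = 2000, 500, 12
X = rng.normal(size=(n_times, n_stars))                        # design matrix: light curves
truth = rng.normal(size=(n_stars, n_channels)) / np.sqrt(n_stars)
Y = X @ truth + 0.1 * rng.normal(size=(n_times, n_channels))   # fake housekeeping channels

lam = 1.0                                                      # ridge strength (placeholder)
# Solve (X^T X + lam I) W = X^T Y for all channels in one linear solve.
W = np.linalg.solve(X.T @ X + lam * np.eye(n_stars), X.T @ Y)

pred = X @ W
unexplained = np.var(Y - pred, axis=0) / np.var(Y, axis=0)     # per-channel score
print("fraction of unexplained variance per channel:", np.round(unexplained, 3))
```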

2023-11-06

predicting spectra from spectra

Saakshi More (NYUAD) came into my office during office hours today to ask about possible data-science projects in physics. I pitched to her predicting ESA Gaia RVS spectra from Gaia XP spectra, and vice versa. Has anyone done that? In one direction, you have to predict high-resolution detail from low-resolution input; in the other direction, you have to predict a wide wavelength range from narrow input. It seems perfect for something like a linear auto-encoder (at least for a small patch of the color–magnitude diagram; non-linear for a large patch). Later in the day I talked to Gaby Contardo and she said: If you want to go simple, how about nearest neighbor? Good idea!
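Contardo's go-simple suggestion, sketched with fake data: predict one set of spectra from the other by nearest neighbor in the input space. Real XP and RVS spectra would of course need careful normalization and uncertainty handling; everything below is a made-up stand-in.

```python
# Hedged sketch: nearest-neighbor prediction of one spectrum type from another.
import numpy as np

def nearest_neighbor_predict(train_in, train_out, test_in):
    """For each test input spectrum, return the output spectrum of its nearest training star."""
    d2 = (np.sum(test_in ** 2, axis=1)[:, None]
          + np.sum(train_in ** 2, axis=1)[None, :]
          - 2.0 * test_in @ train_in.T)              # squared Euclidean distances
    return train_out[np.argmin(d2, axis=1)]

rng = np.random.default_rng(5)
labels = rng.normal(size=(3000, 2))                          # hidden stellar parameters
lowres = np.tanh(labels @ rng.normal(size=(2, 110)))         # fake low-resolution spectra
highres = np.tanh(labels @ rng.normal(size=(2, 2400)))       # fake high-resolution spectra
pred = nearest_neighbor_predict(lowres[:2500], highres[:2500], lowres[2500:])
print(np.sqrt(np.mean((pred - highres[2500:]) ** 2)))        # held-out prediction error
```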

2023-11-01

M dwarfs

I had a great phone call with Madyson Barber (UNC) and Andrew Mann (UNC) today about M dwarf stellar spectroscopy. I love the problem of understanding the spectra of M dwarfs because this is a subject where there is no ground truth: No physical models of M dwarf photospheres work very well! Why not? Probably because they depend on lots of molecular transitions and band heads, the properties of which are not known (and very sensitive to conditions).

I love problems where there is no ground truth! After all, science as a whole has no ground truth! So the M-dwarf spectroscopy problem is a microcosm of all of science. I went off the deep end on this call, and we were all left knowing less than we knew when we started the call. By this post, I apologize to Barber and Mann.

2023-10-29

area of a triangle?

On Friday and the weekend, I came up with (what I think is) a novel formula for the area A of a triangle! That's weird. I was looking for a formula in the Deep Sets (or map-reduce) format. Here it is. It's ridiculous and useless, but it involves only sums over functions of the individual corners of the triangle. It was hard to find! But it's exact (I believe).
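For context, the Deep Sets (map-reduce) format I mean writes a permutation-invariant function of a point set as an outer function applied to a sum of per-point functions. This is the generic form, not the specific area formula:

```latex
% General Deep Sets / map-reduce form for a permutation-invariant function
% of the points x_1, ..., x_n (for the triangle, the three corners);
% \phi maps each point to a feature vector and \rho maps the sum to the output.
f(x_1, \ldots, x_n) = \rho\!\left( \sum_{i=1}^{n} \phi(x_i) \right)
```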

2023-10-20

Florida, day two

Today was day two of my visit to University of Florida. I had many interesting discussions. One highlight was with Dhruv Zimmerman, who wants to infer big labels (non-parametric functions of time) from small features (a few bands of photometry). That's my kind of problem! We discussed different approaches, and we discussed possible featurizations (or dimensionality reductions) of the labels. I also pitched an information-theoretic analysis. If there's one thing I've learned in the last few years, it is that you shouldn't be afraid to solve problems where there are fewer data than parameters! You just have to structure the problem with eyes wide open.

After many more (equally interesting) discussions, the day ended with Sarah Ballard's group out at a lovely beer garden. We discussed the question: Should students be involved in, and privy to, all the bad things with which we faculty interact as academics, or should we protect students from the bad things? You can imagine my position, since I am all about transparency. But the positions were interesting. Ballard pointed out that in an advisor–student relationship, the student might not feel that they can refuse when the advisor wants to unload their feelings! That power asymmetry is very real. But Ballard's students (Chance, Guerrero, Lam, Seagear) said that they want to understand the bad things too; they aren't in graduate school just to write papers (that comment is for you, Quadry!).

2023-10-19

Florida, day one

I spent today with Sarah Ballard's group, plus others, at the University of Florida. I gave a talk, to a large, lively, and delightful audience. At the end of this talk I was very impressed by the following thing: Ballard had everyone in the room discuss with their neighbors (turn and talk) for about 3 minutes, after the seminar but before the question period began! This is a technique I use in class sometimes; it increases participation. After those 3 minutes, audience members had myriad questions, as one might imagine.

I spoke with many people in the Department about their projects. One highlight was Jason Dittman, who showed me gorgeous evidence that a particular warm exoplanet on an eccentric orbit has an atmosphere that undergoes some kind of phase change at some critical insolation, as it moves away from its host star on its orbit. Crazy!

Late in the day I discussed n-point functions and other cosmological statistics with Zach Slepian and Jiamin Hou. We discussed the plausibility of getting tractable likelihoods for any n-point functions. We also discussed the oddity that n-point functions involve sums over n-star configurations among N stars (N choose n), but there are mathematical results that show that any permutation-invariant function of any point cloud can be expressed with only a sum over stars (N). That sounds like a research problem!