Hogg's Research: anomaly detection

2025-07-24

how significant is your anomaly?

So imagine that you have a unique data set Y, and in that data set Y you measure a bunch of parameters θ by a bunch of different methods. Then you find, in your favorite analysis, your estimate of one particular parameter is way out of line: All of physics must be wrong! How do you figure out the significance of your result?

If you only ever have data Y, you can't answer this question very satisfactorily: You searched Y for an anomaly, and now you want to test the significance. That's why so many a posteriori anomaly results end up going away: That search probably tested way more hypotheses than you think it did, so any significances should be reduced accordingly.
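To put a rough number on "reduced accordingly", here is a minimal sketch of a trials-factor correction. It assumes the search amounted to some number of independent tests with Gaussian statistics, which is an idealization: in a real a posteriori search the effective number of trials is usually unknown and the tests are correlated. The function names and the 4-sigma example are mine, for illustration only.

```python
import math

def local_p_value(n_sigma):
    """Two-sided Gaussian tail probability for an n_sigma deviation."""
    return math.erfc(n_sigma / math.sqrt(2.0))

def global_p_value(p_local, n_trials):
    """Chance that at least one of n_trials independent null tests
    returns a p-value as small as p_local: 1 - (1 - p_local)**n_trials,
    written with log1p/expm1 so tiny p-values don't round away."""
    return -math.expm1(n_trials * math.log1p(-p_local))

p_local = local_p_value(4.0)  # hypothetical "4-sigma" anomaly
for n_trials in (1, 10, 100, 1000):
    print(n_trials, global_p_value(p_local, n_trials))
```

The point of the loop is just that the same local deviation looks steadily less impressive as the number of hypotheses you implicitly tested goes up.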

The best approach is to use only part of your data (somehow) to search, then use any anomaly you find to propose a hypothesis test, and then run that test on the held-out or new data. That often isn't possible, or it is already too late. But if you can do this, then there is usually a likelihood ratio that is decisive about the significance of the anomaly!
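As a toy version of that search-then-test procedure: split the data in half, find the most deviant parameter in the search half, and then compute a likelihood ratio for that one pre-specified hypothesis in the held-out half. Everything here is an assumption for illustration (pure-null Gaussian data with known unit variance, a 50/50 split, the sizes, the function name); it is not the analysis in the paper mentioned below.

```python
import numpy as np

def heldout_log_likelihood_ratio(y_search, y_test, sigma=1.0):
    """Log likelihood ratio (alternative over null) on held-out data.

    Null: the mean is zero. Alternative: the mean equals the estimate
    from the search half. Assumes independent Gaussian noise with
    known standard deviation sigma.
    """
    mu1 = np.mean(y_search)
    y_test = np.asarray(y_test, dtype=float)
    return (mu1 * y_test.sum() - 0.5 * y_test.size * mu1**2) / sigma**2

rng = np.random.default_rng(17)
n_obs, n_params = 200, 50                # hypothetical sizes
Y = rng.normal(size=(n_obs, n_params))   # toy data with no real anomaly
Y_search, Y_test = Y[: n_obs // 2], Y[n_obs // 2 :]

# Search step: which parameter looks most out of line in the search half?
z_search = Y_search.mean(axis=0) * np.sqrt(Y_search.shape[0])
k = int(np.argmax(np.abs(z_search)))

# Test step: one pre-registered hypothesis, evaluated on untouched data.
log_lr = heldout_log_likelihood_ratio(Y_search[:, k], Y_test[:, k])
print(f"most anomalous parameter: {k}, search-half z = {z_search[k]:.2f}")
print(f"held-out log likelihood ratio: {log_lr:.2f}")
```

Because the alternative is fixed before the held-out data are touched, the likelihood ratio needs no trials-factor correction; the price is that only half the data went into the test.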

I discussed all these issues today with Kate Storey-Fisher (Stanford) and Abby Williams (Chicago), as we are trying to finish a paper on the anomalous amplitude of the kinematic dipole in quasar samples.

2025-07-23

finding emission lines (and other oddities) in hot stars

I showed my robust spectral decomposition (dimensionality reduction) and residuals to the MPIA Binaries group today. There was much useful feedback (including that my H-gamma was actually H-delta; embarrassing!). One comment was that the model isn't truly a causal separation between star and lines, so there will be some mean lines in the star model; lines aren't entirely outliers. That's true! The group suggested that I iterate to remove stars with lines from the training set.

After the meeting, I implemented some of that, but problems like this have a pathology: If you carefully remove stars with high residuals at some wavelength, then the training data will be deficient, or low, at that wavelength. And then the model will go lower, and then more stars will have excess at that wavelength and: Disaster. So when I implemented it, I required a 2-sigma deviation, and I removed both high and low outliers. I don't know if this will work, but I am testing now.
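Here is a toy sketch of why the symmetric cut matters. It is not my actual spectral model: the "model" is just the mean flux of the training set at a single wavelength, and the noise level, the fraction of emission-line stars, and the line strength are all made-up numbers. It compares removing only high outliers against removing both high and low outliers at 2 sigma.

```python
import numpy as np

def iterate_clipping(flux, n_sigma=2.0, n_iter=20, symmetric=True):
    """Alternately refit a stand-in model (the training-set mean) and
    reselect the training set by residual. Returns (keep mask, model)."""
    keep = np.ones(flux.size, dtype=bool)
    for _ in range(n_iter):
        model = flux[keep].mean()
        scatter = flux[keep].std()
        resid = (flux - model) / scatter
        if symmetric:
            new_keep = np.abs(resid) < n_sigma   # drop high and low
        else:
            new_keep = resid < n_sigma           # drop high only
        if np.array_equal(new_keep, keep):
            break
        keep = new_keep
    return keep, flux[keep].mean()

rng = np.random.default_rng(42)
n_stars = 5000                                   # hypothetical sample
flux = 1.0 + 0.02 * rng.normal(size=n_stars)     # continuum near 1
has_line = rng.random(n_stars) < 0.05            # assumed 5% emitters
flux[has_line] += 0.2                            # assumed line strength

keep_one, model_one_sided = iterate_clipping(flux, symmetric=False)
keep_sym, model_symmetric = iterate_clipping(flux, symmetric=True)
print("one-sided model:", model_one_sided)
print("symmetric model:", model_symmetric)
```

In this toy the one-sided cut should settle on a model that sits below the true continuum, because every iteration trims only the high tail, while the symmetric cut stays centered. Whether the real decomposition drifts mildly like this or runs away is exactly what I am testing.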