Finding Cosmic Needles in a Galactic Haystack
Color-color diagrams for all sources with multi-wavelength photometry in the Fornax core region, shown as gray dots. Spectroscopically confirmed samples are shown for GCs (blue), stars (golden) and galaxies (purple). These are used as labeled samples for the SVC model.
Globular clusters (GCs), ancient and dense balls of stars orbiting galaxies, are some of the universe's most abundant fossil records. They encode when and how galaxies assembled themselves billions of years ago. While GCs in the Local Group can be resolved into individual stars, at the roughly 19 Mpc distance of the Fornax cluster they look almost identical to foreground stars in our own Milky Way and to distant background galaxies. Traditional approaches using simple color cuts produce contamination rates of 30–70%, which makes any inference of stellar population properties ambiguous. That's not science; that's a coin flip with extra steps.
In our recent paper, Ordenes-Briceño et al. (2025), we decided that the humble Support Vector Machine (SVM), an algorithm old enough to have a driver's license, could solve this better than brute-force color selections ever did. Using data from the Next Generation Fornax Survey (NGFS) spanning near-UV to near-infrared (u'g'i'JKs), our team built an SVM classifier trained on ~5,000 spectroscopically confirmed sources: ~1,200 globular clusters, ~2,100 foreground stars, and ~1,600 background galaxies. We fed it 15 features: 10 color combinations plus morphological parameters like FWHM and the spread model. The key insight: not all features are created equal. Through permutation importance analysis and correlation clustering, we pruned the set down to just 7 features: five colors spanning the full UV-to-NIR baseline plus two structural parameters. This leaner model hits 96.6% accuracy with ~10% misclassification, while avoiding the overfitting trap that inflated the 15-feature model's scores.
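To make the permutation importance idea concrete, here is a minimal sketch on toy data: shuffle one feature column at a time and measure how much the accuracy drops. Everything in it is an assumption for illustration. The data are synthetic, the feature names are hypothetical, and a simple nearest-centroid rule stands in for the SVM so that the example needs only the Python standard library; it is not the pipeline used in the paper.

```python
import random
from statistics import mean


def make_toy_sample(n_per_class, rng):
    """Synthetic stand-in data: feature 0 separates the classes, feature 1 is pure noise."""
    rows, labels = [], []
    for label, center in (("GC", 0.0), ("star", 3.0), ("galaxy", 6.0)):
        for _ in range(n_per_class):
            rows.append([rng.gauss(center, 0.5), rng.gauss(0.0, 1.0)])
            labels.append(label)
    return rows, labels


def fit_centroids(rows, labels):
    """Nearest-centroid classifier (stand-in for the SVM): one mean vector per class."""
    centroids = {}
    for cls in sorted(set(labels)):
        members = [r for r, l in zip(rows, labels) if l == cls]
        centroids[cls] = [mean(col) for col in zip(*members)]
    return centroids


def predict(centroids, row):
    """Return the class whose centroid is closest to the row."""
    def dist2(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda cls: dist2(centroids[cls]))


def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, r) == l for r, l in zip(rows, labels))
    return hits / len(rows)


def permutation_importance(centroids, rows, labels, n_repeats, rng):
    """Mean drop in accuracy when one feature column is shuffled across sources."""
    baseline = accuracy(centroids, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(centroids, shuffled, labels))
        importances.append(mean(drops))
    return importances


rng = random.Random(42)
train_rows, train_labels = make_toy_sample(200, rng)
test_rows, test_labels = make_toy_sample(100, rng)
model = fit_centroids(train_rows, train_labels)
importances = permutation_importance(model, test_rows, test_labels, n_repeats=5, rng=rng)

# Hypothetical feature names, for display only.
for name, imp in zip(["informative_color", "noise_feature"], importances):
    print(f"{name}: {imp:.3f}")
```

A feature whose shuffling barely changes the score is a candidate for pruning. Because the procedure only needs a fitted model and a scoring function, the same loop works unchanged with an SVM in place of the stand-in classifier.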
The color pair (u'−g') vs. (g'−Ks) emerged as the MVP: connecting near-UV to near-IR gives maximum leverage for separating the three populations. Drop the u'-band? Performance degrades. Drop the NIR? You're back to selecting nearly 12,000 "globular clusters", i.e. three times too many.
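One quick way to see why dropping bands hurts is to count the colors that survive. The sketch below assumes that a "color" is any pairwise magnitude difference between two of the five NGFS bands, which is consistent with the 10 color combinations in the feature set. The band subsets are illustrative only and are not the exact configurations tested in the paper.

```python
from itertools import combinations

# The five NGFS bands, ordered blue to red.
NGFS_BANDS = ["u'", "g'", "i'", "J", "Ks"]
NIR_BANDS = {"J", "Ks"}


def color_pairs(bands):
    """All bluer-minus-redder colors available from an ordered band list."""
    return [f"{blue}-{red}" for blue, red in combinations(bands, 2)]


all_colors = color_pairs(NGFS_BANDS)
optical_only = color_pairs([b for b in NGFS_BANDS if b not in NIR_BANDS])
no_u = color_pairs([b for b in NGFS_BANDS if b != "u'"])

print(len(all_colors), all_colors)      # full UV-to-NIR baseline
print(len(optical_only), optical_only)  # NIR dropped
print(len(no_u), no_u)                  # u' dropped
```

Without the NIR bands only three colors remain and g'-Ks is gone; without u' six remain and u'-g' is gone. Either way, the highlighted color pair can no longer be formed.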
This has direct implications for the coming data tsunami. The Vera C. Rubin Observatory's LSST will detect millions of unresolved sources, and Euclid is already flying and has delivered its first batch of early release data. We therefore tested LSST-like filter combinations and found that, without NIR support from Euclid or Roman, optical-only classification remains suboptimal. The message to survey designers is clear: your UV and NIR bands aren't luxury add-ons; they're load-bearing walls for astrophysics research.
The final catalog delivers 2,926 globular cluster candidates in the Fornax cluster core, a clean, probability-vetted sample ready for serious science on dark matter halos, intracluster light, and galaxy assembly history. We are actively exploiting these data for some stunning new insights, so stay tuned.
Sometimes the best tool for the job isn't the newest deep learning architecture; it's a well-tuned classic with the right data behind it.