Researchers have introduced a new approach to object counting that is designed to work across a wide range of visual settings, from crowded microscope slides to remote-sensing imagery. The paper, titled Count Anything, proposes both a new benchmark and a model aimed at solving one of computer vision's persistent problems: counting objects reliably outside narrowly defined datasets.
The work argues that existing counting systems are too specialized. Many models are built for a single type of scene, such as people in crowds, vehicles on roads, cells under a microscope or crops in farmland. According to the authors, that specialization makes it difficult for systems to generalize when object types, sizes, densities and visual styles change. Their goal is to create a more flexible counting framework that can handle different domains with a single text-guided interface.
In the formulation described by the authors, a model receives an image and a natural-language prompt describing what should be counted. Instead of producing only a number, it returns a set of instance-level points that correspond to the target objects. The count is derived from those points, which gives the system a more interpretable output than traditional counting methods that rely on density maps.
The paper says this setup combines category-specific counting with spatial localization. That means users can ask for a particular object class while also receiving point-based evidence showing where the model believes those objects are located.
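The point-based output format can be illustrated with a minimal sketch. It is not the authors' code: the `CountResult` class, the `count_from_points` helper and the sample coordinates are all hypothetical, chosen only to show how a count follows directly from a list of predicted points.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates; an assumed convention


@dataclass
class CountResult:
    """Hypothetical container for a text-guided counting result."""
    prompt: str
    points: List[Point]

    @property
    def count(self) -> int:
        # The count is simply the number of predicted instance points.
        return len(self.points)


def count_from_points(prompt: str, points: List[Point]) -> CountResult:
    """Wrap predicted points so the count and its spatial evidence stay together."""
    return CountResult(prompt=prompt, points=list(points))


# Made-up points standing in for a model's predictions.
result = count_from_points(
    "red blood cell",
    [(12.5, 40.0), (88.0, 17.25), (45.0, 60.0)],
)
print(result.count)  # 3
```

Because every counted object has its own coordinate, a user can overlay the points on the image and check the count by eye, which a density map alone does not allow.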
To support this approach, the researchers built CLOC, short for Cross-domain Large-scale Object Counting. The benchmark brings together public data sources into one dataset spanning six visual domains: general scenes, remote sensing, histopathology, cellular microscopy, agriculture and microbiology. In total, CLOC includes about 220,000 images, 619 categories and 15 million object instances.
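Those totals give a rough sense of how dense the benchmark is. The short calculation below uses only the rounded figures quoted above, so the results are approximations rather than numbers reported in the paper.

```python
# Rounded totals as quoted in the article; the exact values may differ.
images = 220_000
categories = 619
instances = 15_000_000

avg_per_image = instances / images         # roughly 68 objects per image
avg_per_category = instances / categories  # roughly 24,000 objects per category

print(f"{avg_per_image:.1f} objects per image on average")
print(f"{avg_per_category:,.0f} objects per category on average")
```

An average of about 68 objects per image hides wide variation: a general scene might contain a handful of targets, while a microscopy slide can contain hundreds.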
On top of the dataset, the paper introduces Count Anything, a model built around discrete object points rather than density maps. The system uses two complementary components. One is a region-level sparse counter, aimed at large and widely spaced targets. The other is a pixel-level dense counter, which is meant for small, crowded or poorly defined objects.
The authors say the sparse counter can provide anchors at the object level, while the dense counter predicts points in areas where objects are packed tightly together or have weak boundaries. To train the model across heterogeneous annotations, the paper uses point-centric supervision. A parameter-free fusion method then combines the outputs of both counters.
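The article does not spell out the fusion rule, so the sketch below shows only one way a fusion step without learned parameters could work. It is an assumption, not the paper's method: it supposes the sparse counter returns bounding boxes, the dense counter returns points, and any dense point that falls inside a sparse box is treated as a duplicate. The function names and the sample data are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]              # (x, y)
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max); assumed format


def box_center(box: Box) -> Point:
    """Return the center of a box, used here as the object-level anchor point."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def inside(point: Point, box: Box) -> bool:
    """Check whether a point lies within a box, edges included."""
    x, y = point
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1


def fuse_points(sparse_boxes: List[Box], dense_points: List[Point]) -> List[Point]:
    """Hypothetical rule: keep every sparse anchor, then add dense points
    that no sparse box already covers. Nothing here is learned or tuned."""
    anchors = [box_center(b) for b in sparse_boxes]
    extras = [
        p for p in dense_points
        if not any(inside(p, b) for b in sparse_boxes)
    ]
    return anchors + extras


# Made-up example: one large object found by the sparse counter,
# plus three dense points, one of which duplicates that object.
sparse_boxes = [(0.0, 0.0, 10.0, 10.0)]
dense_points = [(5.0, 5.0), (20.0, 20.0), (22.0, 21.0)]
fused = fuse_points(sparse_boxes, dense_points)
print(len(fused))  # 3
```

In this toy case the large object is counted once through its anchor, and the two points outside the box survive as separate small objects.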
The researchers describe the design as a way to make one model adaptable across many counting scenarios, rather than requiring a separate system for each domain. They also say the approach is intended for open-world counting, where the model must deal with unfamiliar categories or data distributions.
According to the paper, experiments show that Count Anything achieves strong accuracy and generalization across multiple domains, and that it outperforms existing open-world counting methods. The authors also say code for the project has been made available online.
The paper arrives as part of a broader push toward generalist vision models that can handle more than one narrow task. In object counting, that shift could matter for applications ranging from environmental monitoring to biomedical imaging, where a system may need to adapt to very different kinds of scenes without being retrained for each one.
As presented in the paper, Count Anything is an effort to unify those settings under one text-guided framework, while preserving the spatial interpretability that many counting applications require.