Uncovering Geometric Primitives in Object Representations
Vision models achieve remarkable accuracy in categorizing objects, yet it remains unclear whether these successes are driven by superficial texture matching or by a deeper structural understanding of the world. Determining whether a model's internal representation of a "table" is grounded in the abstract concept of a "rectangle" is essential for developing AI that perceives the world through human-like geometric reasoning. This research introduces approximate probes to quantify geometric transfer between pure mathematical ideals and complex real-world entities. These probes are then deployed across the layers of a pre-trained vision transformer to evaluate whether the model recognizes the underlying geometry of real-world photographs without ever being trained on them.
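The transfer setup above can be sketched in a few lines: fit a linear probe on features of pure geometric primitives, then evaluate it unchanged on features of real-world photographs. The arrays below are synthetic stand-ins for ViT features; the dimensionality and the class-separating direction are illustrative assumptions, not the thesis's actual setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
DIM = 64  # hypothetical dimensionality of one ViT layer's features

# Simulated "primitive" features: two geometric classes (e.g. rectangle
# vs. circle) separated along one direction in feature space.
direction = rng.normal(size=DIM)
direction /= np.linalg.norm(direction)

def make_features(n, label, noise=1.0):
    base = rng.normal(scale=noise, size=(n, DIM))
    return base + (2.0 if label else -2.0) * direction

X_prim = np.vstack([make_features(200, 0), make_features(200, 1)])
y_prim = np.array([0] * 200 + [1] * 200)

# "Real photo" features: same class direction, but with extra nuisance
# variation standing in for texture and background clutter.
X_real = np.vstack([make_features(100, 0, noise=2.0),
                    make_features(100, 1, noise=2.0)])
y_real = np.array([0] * 100 + [1] * 100)

# Train on primitives only, test on photos: accuracy well above chance
# suggests the geometric direction survives the domain shift.
probe = LogisticRegression(max_iter=1000).fit(X_prim, y_prim)
transfer_acc = probe.score(X_real, y_real)
print(f"transfer accuracy: {transfer_acc:.2f}")
```

In the actual study, `X_prim` and `X_real` would be activations extracted from a frozen vision transformer at each layer, and the probe would be refit per layer to trace where geometric abstraction emerges.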
References
Tenney, I., Das, D., & Pavlick, E. (2019). BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th annual meeting of the association for computational linguistics (pp. 4593-4601).
Arnold, S., & Gröbner, R. (2026). Locating and Editing Figure-Ground Organization in Vision Transformers. arXiv preprint arXiv:2603.06407.
Mechanistic Interpretability of Syntactic Reduction
Natural language frequently employs abbreviated structures, such as verb contractions (e.g., isn't) and the omission of subordinating conjunctions (that-deletion). This thesis aims to dissect the shared functional mechanisms behind morphological contraction and that-deletion within language models, using the logit lens and activation patching.
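The logit-lens half of this toolkit has simple mechanics: decode an intermediate hidden state by projecting it through the model's unembedding matrix, as if it were the final layer. The sketch below uses random weights as stand-ins; in a real model the unembedding comes from the trained weights (and the final layer norm is usually applied first).

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, VOCAB = 32, 100  # toy dimensions, not a real model's

W_U = rng.normal(size=(D_MODEL, VOCAB))        # stand-in unembedding matrix
hidden_states = rng.normal(size=(5, D_MODEL))  # one vector per layer

def logit_lens(h, W_U):
    """Project a hidden state to vocabulary logits, return a distribution."""
    logits = h @ W_U
    probs = np.exp(logits - logits.max())  # stable softmax
    return probs / probs.sum()

# Which token does each layer "currently predict"? For contraction, one
# would compare the ranks of the contracted and full forms layer by layer.
per_layer_top = [int(np.argmax(logit_lens(h, W_U))) for h in hidden_states]
print(per_layer_top)
```

Activation patching complements this view: instead of reading states out, it copies an activation from a clean run (e.g., the uncontracted sentence) into a corrupted run and measures how much of the original prediction is restored.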
References
https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens
Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and editing factual associations in gpt. Advances in neural information processing systems, 35, 17359-17372.
Arnold, S., & Gröbner, R. (2025). Steering Prepositional Phrases in Language Models: A Case of with-headed Adjectival and Adverbial Complements in Gemma-2. In Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP (pp. 69-78).
Conceptualising a Reusable and "Sense"-inspired Concept Bank for Vision Models
While vision models achieve state-of-the-art predictive performance in object categorization, it remains an open question whether this efficacy stems from superficial statistical correlations or from a robust comprehension of the physical world. To systematically evaluate the intermediate representations within a model's latent representation space, we require a rigorously formalized and reusable conceptual vocabulary. Current interpretability methodologies predominantly rely on ad-hoc concepts. In contrast, this thesis seeks to derive a generalizable and reusable concept bank grounded in semantic hierarchies (e.g., WordNet, SenseTags) and to evaluate it using linear probes.
Concept Bank: Extract broad, recurring concepts and map them to basic visual elements, e.g. PERSON, ANIMAL, ARTIFACT, PLANT, TEXTURE, MATERIAL etc. Link these abstract concepts to concrete image categories from datasets like ImageNet and MSCOCO.
Concept Probes: Train Support Vector Machines (SVMs) on the vision model’s internal layers to test if its latent space naturally separates these concepts with a linear boundary.
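The probing step can be sketched as follows: train a linear SVM on layer activations labeled with a concept from the bank and check held-out accuracy. The arrays, dimensionality, and concept labels below are illustrative placeholders; real usage would substitute activations extracted from a ViT layer over ImageNet/MSCOCO images mapped to the bank's concepts.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
DIM = 128  # hypothetical width of the probed layer

# Two synthetic concept clusters with overlap in activation space.
X = np.vstack([rng.normal(loc=-0.5, size=(300, DIM)),
               rng.normal(loc=0.5, size=(300, DIM))])
y = np.array([0] * 300 + [1] * 300)  # 0 = ANIMAL, 1 = ARTIFACT (toy labels)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# A linear decision boundary is the whole point: high held-out accuracy
# is evidence the layer encodes the concept along a linear direction.
probe = LinearSVC(C=1.0).fit(X_tr, y_tr)
acc = probe.score(X_te, y_te)
print(f"probe accuracy: {acc:.2f}")  # chance level here is 0.5
```

Refitting one probe per (concept, layer) pair then yields a map of where in the network each entry of the concept bank becomes linearly decodable.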
References
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … & Houlsby, N. (2020). An image is worth 16×16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Bau, D., Zhou, B., Khosla, A., Oliva, A., & Torralba, A. (2017). Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6541-6549).
Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., & Viegas, F. (2018). Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). In International conference on machine learning (pp. 2668-2677). PMLR.
Schneider, N., & Smith, N. A. (2015). A corpus and model integrating multiword expressions and supersenses. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1537-1547).
Baryshnikov, A., & Ryabinin, M. (2023). Hypernymy understanding evaluation of text-to-image models via wordnet hierarchy. arXiv preprint arXiv:2310.09247.
Stöckl, A. (2023). Evaluating a synthetic image dataset generated with stable diffusion. In International Congress on Information and Communication Technology (pp. 805-818). Singapore: Springer Nature Singapore.