
This study explores the capabilities of multimodal foundation models (MFMs), specifically CLIP and GPT-4V, in predicting human perceptions of urban environments from geospatial data. By integrating multimodal inputs such as street view imagery and textual descriptions, the research aims to overcome the limitations of unimodal approaches and dataset-specific models. Using a dataset from Songdo, South Korea, the study assesses six key perception variables: safety, liveliness, boredom, wealth, depression, and beauty. The results show that MFMs hold significant potential for zero-shot urban perception prediction, but also highlight the need for improved multimodal integration and further methodological advances. This work paves the way for more robust GeoAI applications in urban planning and design.
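The zero-shot setup described above can be illustrated with a small sketch. In CLIP-style zero-shot prediction, an image embedding is compared against text-prompt embeddings (one per perception variable) via cosine similarity, and a softmax turns the similarities into scores. The sketch below uses random placeholder vectors in place of real CLIP embeddings, and the prompt list and temperature value are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

# The six perception variables assessed in the study.
PERCEPTIONS = ["safety", "liveliness", "boredom", "wealth", "depression", "beauty"]

def zero_shot_scores(image_emb, text_embs, temperature=0.01):
    """Softmax over cosine similarities between one image embedding
    and one text-prompt embedding per perception variable."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (img @ txt.T) / temperature
    logits -= logits.max()                     # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return dict(zip(PERCEPTIONS, probs))

# Placeholder vectors; a real pipeline would obtain these from a
# pretrained CLIP image encoder and text encoder.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)               # stand-in: street view image embedding
text_embs = rng.normal(size=(6, 512))          # stand-ins: prompt embeddings
scores = zero_shot_scores(image_emb, text_embs)
```

The key point is that no perception-labeled training data is needed: the text prompts alone define the prediction targets, which is what makes the approach "zero-shot."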
Autor / Author: | Han, Soyoung; Kim, Soyoung |
Institution / Institution: | Incheon National University, Incheon/South Korea; Incheon National University, Incheon/South Korea |
Seitenzahl / Pages: | 8 |
Sprache / Language: | English |
Veröffentlichung / Publication: | JoDLA – Journal of Digital Landscape Architecture, 10-2025 |
Tagung / Conference: | Digital Landscape Architecture 2025 – Collaboration |
Veranstaltungsort, -datum / Venue, Date: | Dessau Campus of Anhalt University, Germany 04-06-25 - 07-06-25 |
Schlüsselwörter (de): | |
Keywords (en): | GeoAI, multimodal foundation models (MFMs), urban perception prediction |
Paper review type: | Full Paper Review |
DOI: | doi:10.14627/537754061 |