Recent audio-to-image models have shown impressive performance in generating images of specific objects conditioned on their corresponding sounds. However, these models fail to reconstruct real-world landscapes conditioned on environmental soundscapes. To address this gap, we present Geo-contextual Soundscape-to-Landscape (GeoS2L) generation, a novel and practically significant task that aims to synthesize geographically realistic landscape images from environmental soundscapes.
To support this task, we construct two large-scale geo-contextual multimodal datasets, SoundingSVI and SonicUrban, which pair diverse environmental soundscapes with real-world landscape images. We propose SounDiT, a diffusion transformer (DiT)-based model that incorporates environmental soundscapes and geo-contextual scene conditioning to synthesize geographically coherent landscape images. Furthermore, we propose the Place Similarity Score (PSS), a practically informed geo-contextual evaluation framework that measures consistency between input soundscapes and generated landscape images. Extensive experiments demonstrate that SounDiT outperforms existing baselines on the GeoS2L task, while PSS effectively captures generation consistency at the element, scene, and human-perception levels.
To inject soundscape and scene context effectively and efficiently, each SounDiT block introduces (1) a Mixture-of-Experts (MoE) Soundscape Conditioning module, (2) a Scene Low-Rank Mixer (SLRCM) module, and (3) a Scene AdaLN (S-AdaLN) module, which together improve scene consistency. By hierarchically fusing visual, scene, and soundscape cues, SounDiT generates geo-contextually coherent landscapes.
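To make the block structure concrete, below is a minimal PyTorch sketch of how such a block could be wired. All module and parameter names (e.g., `MoESoundscapeConditioning`, `SceneLowRankMixer`, `SceneAdaLN`), the tensor shapes, the number of experts, and the low-rank width are illustrative assumptions based on the description above, not the released SounDiT implementation.

```python
# Illustrative sketch only: names, shapes, and wiring are assumptions
# inferred from the prose description of a SounDiT block.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoESoundscapeConditioning(nn.Module):
    """Routes soundscape tokens through a soft mixture of expert MLPs."""
    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, sound: torch.Tensor) -> torch.Tensor:
        # sound: (B, S, D) soundscape tokens
        weights = self.gate(sound).softmax(dim=-1)                          # (B, S, E)
        expert_out = torch.stack([e(sound) for e in self.experts], dim=-1)  # (B, S, D, E)
        return torch.einsum("bsde,bse->bsd", expert_out, weights)

class SceneLowRankMixer(nn.Module):
    """Mixes the scene embedding into visual tokens through a low-rank bottleneck."""
    def __init__(self, dim: int, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, rank)
        self.up = nn.Linear(rank, dim)

    def forward(self, x: torch.Tensor, scene: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) visual tokens; scene: (B, D) scene embedding
        return x + self.up(F.gelu(self.down(x + scene.unsqueeze(1))))

class SceneAdaLN(nn.Module):
    """Adaptive LayerNorm whose scale/shift are regressed from the scene embedding."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(dim, 2 * dim)

    def forward(self, x: torch.Tensor, scene: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(scene).unsqueeze(1).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale) + shift

class SounDiTBlock(nn.Module):
    """One block: scene-modulated norm, soundscape cross-attention, scene mixing, MLP."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.moe = MoESoundscapeConditioning(dim)
        self.mixer = SceneLowRankMixer(dim)
        self.adaln = SceneAdaLN(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, scene, sound):
        sound = self.moe(sound)  # expert-routed soundscape conditioning
        x = x + self.cross_attn(self.adaln(x, scene), sound, sound)[0]
        x = self.mixer(x, scene)  # low-rank scene injection
        return x + self.mlp(self.adaln(x, scene))

# Toy usage with hypothetical shapes: 64 visual tokens, 32 soundscape tokens.
x = torch.randn(2, 64, 256)
scene = torch.randn(2, 256)
sound = torch.randn(2, 32, 256)
out = SounDiTBlock()(x, scene, sound)  # (2, 64, 256)
```

The sketch follows the hierarchical fusion described above: scene context modulates normalization and is mixed in at low rank, while soundscape tokens enter through expert-gated cross-attention.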
A car horn might suggest a congested urban street rather than a static vehicle. Prior audio-to-image (A2I) studies have primarily relied on large-scale, general-domain audio-visual datasets. These datasets focus on sound-source localization and general audio-visual correspondence learning, and lack the geographical context required for soundscape-to-landscape generation. For instance, geographers may be more interested in generating the wetland implied by a frog call than an image of the frog itself. To address this gap, we introduce two new multimodal datasets, SoundingSVI and SonicUrban, designed to support GeoS2L generation with large-scale, geographically diverse, and context-rich data.
Since soundscapes and landscapes co-occur in the same space, they are expected to share similar environmental characteristics and place settings. Following this hypothesis, we introduce the Place Similarity Score (PSS), a practically informed geo-contextual evaluation that quantifies the underlying place setting reflected in generated images at the element, scene, and human-perception levels. These three levels of environmental measurement have been widely employed in geographic and urban-planning practice.
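As a concrete illustration, the sketch below aggregates precomputed similarities at the three levels into a single PSS value. The equal weighting, the [0, 1] score range, and the function and field names are assumptions made for illustration; each level's similarity would come from an external model (e.g., semantic segmentation for elements, a scene classifier for scene categories, a perception predictor for human ratings), and the paper's exact formulation may differ.

```python
# Illustrative PSS-style aggregation; the weighting scheme and names are assumptions.
from dataclasses import dataclass

@dataclass
class LevelScores:
    element: float     # e.g., overlap of land-cover elements implied by the sound vs. the image
    scene: float       # e.g., agreement between predicted scene categories
    perception: float  # e.g., similarity of predicted human-perception ratings

def place_similarity_score(scores: LevelScores,
                           weights=(1/3, 1/3, 1/3)) -> float:
    """Aggregate element-, scene-, and perception-level similarities into one PSS value."""
    levels = (scores.element, scores.scene, scores.perception)
    assert all(0.0 <= s <= 1.0 for s in levels), "each level similarity is assumed in [0, 1]"
    return sum(w * s for w, s in zip(weights, levels))

# Example: an image that matches the soundscape well at the scene level
# but only moderately at the element and perception levels.
pss = place_similarity_score(LevelScores(element=0.62, scene=0.88, perception=0.71))
print(f"PSS = {pss:.3f}")
```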