SounDiT: Geo-Contextual
Soundscape-to-Landscape Generation

1The University of Texas at Austin, 2University of Tennessee, Knoxville, 3University of South Carolina, 4Arizona State University, 5University of Canterbury, 6Texas A&M University, 7University of Wisconsin-Madison

*Equal contribution

Corresponding Author

Abstract

Recent audio-to-image models have shown impressive performance in generating images of specific objects conditioned on their corresponding sounds. However, these models fail to reconstruct real-world landscapes conditioned on environmental soundscapes. To address this gap, we present Geo-contextual Soundscape-to-Landscape (GeoS2L) generation, a novel and practically significant task that aims to synthesize geographically realistic landscape images from environmental soundscapes.


To support this task, we construct two large-scale geo-contextual multi-modal datasets, SoundingSVI and SonicUrban, which pair diverse environmental soundscapes with real-world landscape images. We propose SounDiT, a diffusion transformer (DiT)-based model that incorporates environmental soundscapes and geo-contextual scene conditioning to synthesize geographically coherent landscape images. Furthermore, we propose the Place Similarity Score (PSS), a practically-informed geo-contextual evaluation framework that measures consistency between input soundscapes and generated landscape images. Extensive experiments demonstrate that SounDiT outperforms existing baselines on the GeoS2L task, while the PSS effectively captures generation consistency at the element, scene, and human-perception levels.


Geo-contextual Soundscape-to-Landscape

[Gallery: six side-by-side pairs of ground-truth and generated landscape images]

Method

To inject soundscape and scene context effectively and efficiently, each SounDiT block introduces (1) a Mixture-of-Experts (MoE) Soundscape Conditioning module, (2) a Scene Low-Rank Mixer (SLRCM) module, and (3) a Scene AdaLN (S-AdaLN) module that improves scene consistency. By hierarchically fusing visual, scene, and soundscape cues, SounDiT generates geo-contextually coherent landscapes.
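For intuition, here is a minimal PyTorch sketch of how these three conditioning paths could be wired together inside one block. All module names, tensor shapes, expert counts, and the fusion order below are illustrative assumptions, not the released SounDiT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneAdaLN(nn.Module):
    """Adaptive LayerNorm whose scale/shift are regressed from a scene embedding."""
    def __init__(self, dim: int, scene_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(scene_dim, 2 * dim)

    def forward(self, x, scene):  # x: (B, N, D), scene: (B, S)
        scale, shift = self.to_scale_shift(scene).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

class SceneLowRankMixer(nn.Module):
    """Low-rank residual update mixing scene context into image tokens."""
    def __init__(self, dim: int, scene_dim: int, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(dim + scene_dim, rank)
        self.up = nn.Linear(rank, dim)

    def forward(self, x, scene):
        s = scene.unsqueeze(1).expand(-1, x.size(1), -1)
        return x + self.up(F.relu(self.down(torch.cat([x, s], dim=-1))))

class MoESoundscapeConditioning(nn.Module):
    """Softly routes soundscape tokens through expert projections, then
    injects them into image tokens via cross-attention."""
    def __init__(self, dim: int, audio_dim: int, num_experts: int = 4, num_heads: int = 8):
        super().__init__()
        self.router = nn.Linear(audio_dim, num_experts)
        self.experts = nn.ModuleList(nn.Linear(audio_dim, dim) for _ in range(num_experts))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, audio):  # audio: (B, Ta, A)
        gates = F.softmax(self.router(audio), dim=-1)                       # (B, Ta, E)
        expert_out = torch.stack([e(audio) for e in self.experts], dim=-2)  # (B, Ta, E, D)
        cond = (gates.unsqueeze(-1) * expert_out).sum(dim=-2)               # (B, Ta, D)
        attended, _ = self.attn(query=x, key=cond, value=cond)
        return x + attended

class SounDiTBlock(nn.Module):
    """One block hierarchically fusing visual, scene, and soundscape cues."""
    def __init__(self, dim: int, audio_dim: int, scene_dim: int):
        super().__init__()
        self.s_adaln = SceneAdaLN(dim, scene_dim)
        self.mixer = SceneLowRankMixer(dim, scene_dim)
        self.sound = MoESoundscapeConditioning(dim, audio_dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, audio, scene):
        x = self.sound(self.s_adaln(x, scene), audio)
        x = self.mixer(x, scene)
        return x + self.mlp(x)
```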

[Figure: Pipeline]

Dataset

Prior audio-to-image (A2I) studies have relied primarily on large-scale, general-domain audio-visual datasets. These datasets focus on sound-source localization and general audio-visual correspondence learning, and lack the geographical context that soundscape-to-landscape generation requires: a car horn might suggest a congested urban street rather than a static vehicle, and geographers may be more interested in generating the wetland implied by a frog call than an image of the frog itself. To address this gap, we introduce two new multimodal datasets, SoundingSVI and SonicUrban, designed to support GeoS2L generation with large-scale, geographically diverse, and context-rich data.
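To make the pairing concrete, a minimal PyTorch loader for soundscape-image pairs might look like the sketch below. The CSV manifest layout and its column names (`audio_path`, `image_path`) are hypothetical, not the released dataset schema.

```python
import csv
import torch
import torchaudio
from torch.utils.data import Dataset
from torchvision.io import read_image

class SoundscapeLandscapePairs(Dataset):
    """Yields (log-mel soundscape, landscape image) pairs from a CSV
    manifest with hypothetical columns: audio_path, image_path."""
    def __init__(self, manifest_csv: str, sample_rate: int = 16000, n_mels: int = 128):
        with open(manifest_csv) as f:
            self.rows = list(csv.DictReader(f))
        self.sample_rate = sample_rate
        self.mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_mels=n_mels)

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, i):
        row = self.rows[i]
        wave, sr = torchaudio.load(row["audio_path"])
        if sr != self.sample_rate:  # resample to a common rate
            wave = torchaudio.functional.resample(wave, sr, self.sample_rate)
        spec = torch.log1p(self.mel(wave.mean(dim=0)))  # mono log-mel spectrogram
        image = read_image(row["image_path"]).float() / 255.0
        return spec, image
```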

[Figure: Pipeline]

Geo-Contextual Evaluation Framework: Place Similarity Score (PSS)

Since soundscapes and landscapes co-occur in the same space, they are expected to share similar environmental characteristics and place settings. Following this hypothesis, we introduce the Place Similarity Score (PSS), a practically-informed geo-contextual evaluation framework that quantifies the underlying place setting reflected in images at the element, scene, and human-perception levels. These three levels of measurement of environmental characteristics are widely employed in geographic and urban-planning practice.
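As a conceptual illustration, the sketch below computes one possible three-level similarity between a ground-truth and a generated image: element-level overlap of class histograms, plus scene-level and perception-level cosine similarities. The three callables (`segment`, `classify_scene`, `perceive`) and the equal-weight average are hypothetical stand-ins, not the paper's exact PSS definition.

```python
import torch
import torch.nn.functional as F

def histogram_similarity(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-8):
    """Overlap between two normalized class histograms, e.g. per-pixel
    element frequencies produced by a semantic segmentation model."""
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    return torch.minimum(p, q).sum()

def place_similarity_score(gt, gen, segment, classify_scene, perceive):
    """gt/gen: images; the three callables stand in for an element
    segmenter, a scene classifier, and a human-perception encoder."""
    element = histogram_similarity(segment(gt), segment(gen))
    scene = F.cosine_similarity(classify_scene(gt), classify_scene(gen), dim=-1)
    perception = F.cosine_similarity(perceive(gt), perceive(gen), dim=-1)
    # Equal weighting across the three levels is an illustrative choice.
    return (element + scene + perception) / 3.0
```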

[Figure: Pipeline]

Acknowledgements