AWARE — Ablation Studies

📊 Supplementary Figures

Fig. S0. Weather condition distribution across all evaluation datasets, showing the severe class imbalance that AWACS addresses.

Fig. S1. Domain gap heatmap across model–dataset combinations.

Fig. S2. Three-stage comparison: Baseline vs. S1 vs. S2 performance.

Fig. S3. Stage 1 mIoU distributions per evaluation dataset.

Fig. S4. Stage 1 mIoU distributions per model architecture.

Fig. S5. Stage 2 mIoU distributions per evaluation dataset.

Fig. S6. Stage 2 mIoU distributions per model architecture.

Fig. S7. SWIFT metric correlation heatmap.

Fig. S8. SWIFT balance and diversity profiles per dataset.

Fig. S9. Dataset size vs. weather balance scatter.

Fig. S10. Quality–performance relationship in Stage 2.

Fig. S11. Model capacity vs. augmentation gain.

Fig. S12. Model capacity threshold analysis.

Fig. S13. Radar chart: per-domain performance (S1).

Fig. S14. Radar chart: per-domain performance (S2).

Fig. S15. S1 vs. S2 bar comparison across strategies.

Fig. S16. S1 vs. S2 gain scatter plot.

Fig. S17. Extended training convergence curves.

Fig. S18. Diminishing returns analysis.

Fig. S19. Generative quality metric distributions (violin).

Fig. S20. Visual comparison of all generative augmentation methods.

Fig. S21. Segmentation prediction examples across conditions.

📁 Downloadable Data (CSV)

📄 Strategy Leaderboard (S1) — 26 strategies, overall mIoU + gain
📄 Strategy Leaderboard (S2) — 26 strategies, with real adverse data
📄 Per-Domain Breakdown (S1) — 7 weather domains per strategy
📄 Per-Domain Breakdown (S2) — 7 weather domains per strategy
📄 Per-Model Breakdown (S1) — 5 architectures per strategy
📄 Per-Model Breakdown (S2) — 5 architectures per strategy
📄 Generative Quality Metrics (PRISM) — CQS, FID, LPIPS, SSIM, mIoU per method
📄 Per-Dataset Breakdown (S1) — All evaluation datasets per strategy

🔧 Method Descriptions (6 Families, 21 Strategies)

Family 1: 2D Rendering

Parametric weather synthesis through classic image processing — no neural networks.

Automold — Road-specific augmentation (rain, fog, sun flare, shadow)
Albumentations — Efficient weather-specific augmentation library
Augmenters — Comprehensive augmentation pipeline framework (imgaug)
Weather Effect Generator — Physics-inspired fog, rain, snow particle effects

Family 2: CNN/GAN

CNN and GAN architectures for unpaired image-to-image translation.

CycleGAN — Unpaired image-to-image translation via adversarial learning with cycle consistency loss
StarGAN v2 — Multi-domain image translation with diverse style synthesis
CUT — Contrastive Unpaired Translation using patchwise contrastive learning
SUSTechGAN — Foggy scene synthesis specialized for driving scenarios

Family 3: Style Transfer

Neural style transfer models for domain-specific appearance manipulation.

LANIT — Language-guided multi-domain translation
TSIT — Two-Stream Image-to-image Translation framework, used here for style translation
Attribute Hallucination — Attribute-based hallucination for weather effects

Family 4: Diffusion

Diffusion-based image-to-image models, including ControlNet-conditioned approaches.

CycleDiffusion — Extends cycle consistency to diffusion models for flexible content preservation
Img2Img — Stable Diffusion image-to-image pipeline with weather prompts
InstructPix2Pix — Instruction-following image editing model
ControlNet-Seg — Segmentation-conditioned generation via ControlNet
UniControl — Unified multi-condition controllable generation

Family 5: Multimodal Diffusion

VLM and multimodal diffusion models for text-guided weather synthesis.

Step1X / Step1X v1.2 — Progressive diffusion for weather editing
Flux Kontext — Next-generation flow-matching with in-context transfer
VisualCloze — Visual in-context learning for image transformation
Qwen Image Edit — Large-scale multimodal editing model

Family 6: Standard Augmentation

Standard augmentation pipelines not specific to weather.

RandAugment — Random augmentation policy sampled from a set of transforms
AutoAugment — Learned augmentation policy via reinforcement learning
CutMix — Pastes a rectangular patch cut from one training sample into another
MixUp — Convex combination of training examples
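
The two sample-mixing strategies above can be illustrated with a minimal NumPy sketch. It shows the generic classification-style formulation only; how the benchmark applies them to segmentation masks is not stated here, and the function names, the `box` layout, and the fixed `lam` argument are assumptions made for illustration.

```python
import numpy as np

def mixup(x_a, x_b, y_a, y_b, lam):
    """MixUp: convex combination of two samples and of their (soft) labels."""
    x = lam * x_a + (1.0 - lam) * x_b
    y = lam * y_a + (1.0 - lam) * y_b
    return x, y

def cutmix(x_a, x_b, box):
    """CutMix: paste the rectangle box = (top, left, height, width) of x_b into a copy of x_a."""
    top, left, h, w = box
    out = x_a.copy()
    out[top:top + h, left:left + w] = x_b[top:top + h, left:left + w]
    # Fraction of the output that still comes from x_a; typically used to weight labels.
    lam = 1.0 - (h * w) / (x_a.shape[0] * x_a.shape[1])
    return out, lam
```

In practice `lam` and the box are usually sampled at random per batch; they are passed in explicitly here so the behaviour is easy to inspect.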

📐 Technical Appendix

Evaluation Metrics

mIoU — Mean Intersection over Union across all semantic classes
FID — Fréchet Inception Distance (lower = more realistic)
LPIPS — Learned Perceptual Image Patch Similarity (lower = more similar)
SSIM — Structural Similarity Index (higher = more similar)
CQS — Composite Quality Score combining FID, LPIPS, SSIM, mIoU, Pixel Accuracy
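
As a concrete reference for the first metric, the following sketch computes mIoU from a confusion matrix. It is a generic formulation, not the benchmark's evaluation code; the `ignore_index=255` default follows the common Cityscapes convention and is an assumption, as is the choice to average only over classes that appear in the prediction or the ground truth.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Rows are ground-truth classes, columns are predicted classes."""
    valid = target != ignore_index  # assumed ignore label, not taken from the source
    idx = num_classes * target[valid].astype(np.int64) + pred[valid].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def mean_iou(conf):
    """Mean of per-class IoU = TP / (TP + FP + FN), over classes with a non-empty union."""
    tp = np.diag(conf).astype(np.float64)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    present = union > 0
    return float((tp[present] / union[present]).mean())
```

For dataset-level scores, confusion matrices are normally summed over all images before calling `mean_iou`, rather than averaging per-image values.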

Training Pipeline

Stage 1 (S1): Cityscapes (fine) training only → evaluated on 4 diverse test sets
Stage 2 (S2): Multi-dataset training (Cityscapes + ACDC + BDD10k + IDD-AW + Mapillary Vistas + OUTSIDE15k) → same test sets
Architectures: PSPNet (R50), SegFormer (MiT-B3), SegNeXt (MSCAN-B), Mask2Former (Swin-B), HRNet (HR48)
Training: 40k iterations, AdamW optimizer, poly learning rate schedule
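
The poly learning rate schedule named above decays the rate polynomially toward a floor over the training run. A minimal sketch follows, assuming the 40k-iteration budget from the list; the base learning rate, `power`, and `min_lr` are not given in this appendix, so the defaults below are placeholders.

```python
def poly_lr(base_lr, step, max_steps=40_000, power=1.0, min_lr=0.0):
    """Polynomial decay: lr = (base_lr - min_lr) * (1 - step / max_steps) ** power + min_lr."""
    progress = min(step, max_steps) / max_steps  # clamp so the rate never goes below min_lr
    return (base_lr - min_lr) * (1.0 - progress) ** power + min_lr
```

With `power=1.0` the schedule is a straight line from `base_lr` to `min_lr`; larger powers front-load the decay.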

PRISM — Pipeline for Robust Image Similarity Metrics

Standardized quality assessment framework computing FID, LPIPS, SSIM, PSNR, pixel accuracy, mIoU, and frequency-weighted IoU for each generative method against original images.

SWIFT — Structured Weather Identification and Feature Taxonomy

Condition-aware dataset splitting strategy using CLIP-based weather classification. Two-stage process: (1) indoor/outdoor filtering, (2) 7-class weather classification with fog counter-prompts.

Shannon Entropy — Dataset Balance Metric

Measures how close a weather domain distribution is to uniform on a 0–1 scale. Used to quantify class imbalance across evaluation datasets.

Normalized Shannon Entropy:

H_norm = H / H_max = −(Σ_{i=1}^{K} p_i ln p_i) / ln K

K = 7 — Number of weather categories (clear_day, foggy, snowy, night, rainy, dawn_dusk, cloudy)
p_i — Proportion of images in category i
H_norm = 1 → Perfectly uniform distribution
H_norm → 0 → Highly skewed/imbalanced distribution

Companion metric — Imbalance Ratio: IR = N_max / N_min (ratio of largest to smallest category count).

Pixel-level variant: The same formula applied to segmentation class distributions across pixels, where max entropy uses the count of non-zero classes rather than all possible classes.
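
The balance metrics above can be sketched in a few lines of NumPy. The function names and the `nonzero_only` flag (which switches to the pixel-level convention of using the number of non-zero classes for the maximum entropy) are illustrative choices, and returning infinity for the imbalance ratio when a category is empty is an assumption.

```python
import numpy as np

def normalized_entropy(counts, nonzero_only=False):
    """H_norm = -(sum_i p_i ln p_i) / ln K, with 0 * ln 0 treated as 0."""
    counts = np.asarray(counts, dtype=np.float64)
    # K is all categories (image level) or only the non-empty ones (pixel-level variant).
    k = int(np.count_nonzero(counts)) if nonzero_only else counts.size
    if k <= 1 or counts.sum() == 0:
        return 0.0
    p = counts[counts > 0] / counts.sum()
    h = -(p * np.log(p)).sum()
    return float(h / np.log(k))

def imbalance_ratio(counts):
    """IR = N_max / N_min; infinite when some category has no samples (assumed convention)."""
    counts = np.asarray(counts, dtype=np.float64)
    n_min = counts.min()
    return float("inf") if n_min == 0 else float(counts.max() / n_min)
```

For example, seven equally sized weather categories give H_norm = 1 and IR = 1, while a dataset made up entirely of clear_day images gives H_norm = 0.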

Quality thresholds (pixel-level H_norm):

H_norm > 0.8 — Very balanced class distribution (high diversity)
0.6 < H_norm ≤ 0.8 — Reasonably balanced
0.3 < H_norm ≤ 0.6 — Imbalanced
H_norm ≤ 0.3 — Highly imbalanced

Layout Diversity — Spatial Pyramid Matching

Measures structural diversity of segmentation layouts across a dataset using Spatial Pyramid Matching (SPM) with Histogram Intersection similarity.

Step 1 — Spatial Pyramid Histograms:

Each segmentation mask is divided into a grid of 2^l × 2^l cells at level l. Per-cell class histograms are L1-normalized and weighted by level.

Level 0 (1×1 grid) — weight = 0.0625
Level 1 (2×2 grid) — weight = 0.125
Level 2 (4×4 grid) — weight = 0.25
Level 3 (8×8 grid) — weight = 0.5

Weight_l = 0.5^{(L_max − l + 1)} where L_max = 3
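
Step 1 can be sketched as follows. This is one plausible implementation under stated assumptions: the mask is an integer label map, labels outside `[0, num_classes)` (such as an ignore label) are skipped, and cell boundaries are obtained by integer division of the image size. The function names are hypothetical.

```python
import numpy as np

L_MAX = 3  # deepest pyramid level, as in the weight formula above

def level_weight(level, l_max=L_MAX):
    """Weight_l = 0.5 ** (L_max - l + 1)."""
    return 0.5 ** (l_max - level + 1)

def spatial_histogram(mask, level, num_classes):
    """Weighted, L1-normalized class histograms for the 2^l x 2^l grid at one level."""
    cells = 2 ** level
    h, w = mask.shape
    rows = np.arange(cells + 1) * h // cells  # cell boundaries along the height
    cols = np.arange(cells + 1) * w // cells  # cell boundaries along the width
    hists = []
    for r in range(cells):
        for c in range(cells):
            cell = mask[rows[r]:rows[r + 1], cols[c]:cols[c + 1]]
            valid = cell[cell < num_classes]  # drop out-of-range labels (assumption)
            hist = np.bincount(valid, minlength=num_classes).astype(np.float64)
            total = hist.sum()
            if total > 0:
                hist /= total  # L1 normalization per cell
            hists.append(hist)
    return level_weight(level) * np.concatenate(hists)
```

Because every non-empty cell histogram sums to 1, a level with 4^l cells contributes at most 4^l · Weight_l to the total mass, so finer levels dominate the descriptor.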

Step 2 — Descriptor & Similarity:

Descriptor_k = ⊕_{l ∈ levels} w_l · SpatialHistogram(M_k, l), where ⊕ denotes concatenation across levels

Similarity(i, j) = Σ_d min(Descriptor_i[d], Descriptor_j[d])

Step 3 — Diversity Score:

Diversity = 1 − mean(Similarity_off-diagonal)

The similarity matrix is divided by the mean self-similarity so that values fall in the [0, 1] range
Diversity → 1 means highly diverse layouts
Diversity → 0 means very similar/repetitive layouts
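
Steps 2 and 3 can be sketched for a stack of precomputed descriptors (one row per segmentation mask). The function name is hypothetical, and the normalization is my reading of "normalized by mean self-similarity": the whole matrix is divided by the mean of its diagonal.

```python
import numpy as np

def layout_diversity(descriptors):
    """Diversity = 1 - mean off-diagonal histogram-intersection similarity."""
    d = np.asarray(descriptors, dtype=np.float64)
    n = d.shape[0]
    if n < 2:
        raise ValueError("need at least two descriptors")
    # Histogram intersection for every pair: sum over dimensions of the element-wise minimum.
    sim = np.minimum(d[:, None, :], d[None, :, :]).sum(axis=-1)
    # Normalize by the mean self-similarity (mean of the diagonal).
    self_sim = np.diag(sim).mean()
    if self_sim > 0:
        sim = sim / self_sim
    off_diag = sim[~np.eye(n, dtype=bool)]
    return float(1.0 - off_diag.mean())
```

The pairwise minimum builds an N × N × D intermediate array, which is manageable for the 100 samples per dataset used here but would need chunking for much larger samples.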

Benchmark parameters:

num_samples = 100 — Images sampled per dataset
min_domain_samples = 10 — Minimum samples for per-domain analysis
Datasets: ACDC, BDD10k, Cityscapes, Mapillary, OUTSIDE15k, IDD

PROVE — Progressive Real-data Organization for Validation of Effects

Systematic downstream evaluation framework. Tests each augmentation strategy across all model × dataset combinations, computing per-domain and aggregate performance metrics.
