🌟 A collection of papers, datasets, benchmarks, code, and pre-trained weights for Remote Sensing Foundation Models (RSFMs).
🔥🔥🔥 Last Updated on 2025.08.07 🔥🔥🔥
- 2025.08.04: Our recent work, SkySense++, a follow-up to SkySense, has been accepted by Nature Machine Intelligence. The code and pretrained weights are released in this repository.
- Models
- Datasets & Benchmarks
- Others
## Models

### Remote Sensing Vision Foundation Models

Abbreviation | Title | Publication | Paper | Code & Weights |
---|---|---|---|---|
GeoKR | Geographical Knowledge-Driven Representation Learning for Remote Sensing Images | TGRS2021 | GeoKR | link |
- | Self-Supervised Learning of Remote Sensing Scene Representations Using Contrastive Multiview Coding | CVPRW2021 | Paper | link |
GASSL | Geography-Aware Self-Supervised Learning | ICCV2021 | GASSL | link |
SeCo | Seasonal Contrast: Unsupervised Pre-Training From Uncurated Remote Sensing Data | ICCV2021 | SeCo | link |
DINO-MM | Self-supervised Vision Transformers for Joint SAR-optical Representation Learning | IGARSS2022 | DINO-MM | link |
SatMAE | SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery | NeurIPS2022 | SatMAE | link |
RS-BYOL | Self-Supervised Learning for Invariant Representations From Multi-Spectral and SAR Images | JSTARS2022 | RS-BYOL | null |
GeCo | Geographical Supervision Correction for Remote Sensing Representation Learning | TGRS2022 | GeCo | null |
RingMo | RingMo: A remote sensing foundation model with masked image modeling | TGRS2022 | RingMo | link |
RVSA | Advancing plain vision transformer toward remote sensing foundation model | TGRS2022 | RVSA | link |
RSP | An Empirical Study of Remote Sensing Pretraining | TGRS2022 | RSP | link |
MATTER | Self-Supervised Material and Texture Representation Learning for Remote Sensing Tasks | CVPR2022 | MATTER | null |
CSPT | Consecutive Pre-Training: A Knowledge Transfer Learning Strategy with Relevant Unlabeled Data for Remote Sensing Domain | RS2022 | CSPT | link |
- | Self-supervised Vision Transformers for Land-cover Segmentation and Classification | CVPRW2022 | Paper | link |
BFM | A billion-scale foundation model for remote sensing images | Arxiv2023 | BFM | null |
TOV | TOV: The original vision model for optical remote sensing image understanding via self-supervised learning | JSTARS2023 | TOV | link |
CMID | CMID: A Unified Self-Supervised Learning Framework for Remote Sensing Image Understanding | TGRS2023 | CMID | link |
RingMo-Sense | RingMo-Sense: Remote Sensing Foundation Model for Spatiotemporal Prediction via Spatiotemporal Evolution Disentangling | TGRS2023 | RingMo-Sense | null |
IaI-SimCLR | Multi-Modal Multi-Objective Contrastive Learning for Sentinel-1/2 Imagery | CVPRW2023 | IaI-SimCLR | null |
CACo | Change-Aware Sampling and Contrastive Learning for Satellite Images | CVPR2023 | CACo | link |
SatLas | SatlasPretrain: A Large-Scale Dataset for Remote Sensing Image Understanding | ICCV2023 | SatLas | link |
GFM | Towards Geospatial Foundation Models via Continual Pretraining | ICCV2023 | GFM | link |
Scale-MAE | Scale-MAE: A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning | ICCV2023 | Scale-MAE | link |
DINO-MC | DINO-MC: Self-supervised Contrastive Learning for Remote Sensing Imagery with Multi-sized Local Crops | Arxiv2023 | DINO-MC | link |
CROMA | CROMA: Remote Sensing Representations with Contrastive Radar-Optical Masked Autoencoders | NeurIPS2023 | CROMA | link |
Cross-Scale MAE | Cross-Scale MAE: A Tale of Multiscale Exploitation in Remote Sensing | NeurIPS2023 | Cross-Scale MAE | link |
DeCUR | DeCUR: decoupling common & unique representations for multimodal self-supervision | ECCV2024 | DeCUR | link |
Presto | Lightweight, Pre-trained Transformers for Remote Sensing Timeseries | Arxiv2023 | Presto | link |
CtxMIM | CtxMIM: Context-Enhanced Masked Image Modeling for Remote Sensing Image Understanding | Arxiv2023 | CtxMIM | null |
FG-MAE | Feature Guided Masked Autoencoder for Self-supervised Learning in Remote Sensing | Arxiv2023 | FG-MAE | link |
Prithvi | Foundation Models for Generalist Geospatial Artificial Intelligence | Arxiv2023 | Prithvi | link |
RingMo-lite | RingMo-lite: A Remote Sensing Multi-task Lightweight Network with CNN-Transformer Hybrid Framework | Arxiv2023 | RingMo-lite | null |
- | A Self-Supervised Cross-Modal Remote Sensing Foundation Model with Multi-Domain Representation and Cross-Domain Fusion | IGARSS2023 | Paper | null |
EarthPT | EarthPT: a foundation model for Earth Observation | NeurIPS2023 CCAI workshop | EarthPT | link |
USat | USat: A Unified Self-Supervised Encoder for Multi-Sensor Satellite Imagery | Arxiv2023 | USat | link |
AIEarth | Analytical Insight of Earth: A Cloud-Platform of Intelligent Computing for Geospatial Big Data | Arxiv2023 | AIEarth | link |
- | Self-Supervised Learning for SAR ATR with a Knowledge-Guided Predictive Architecture | Arxiv2023 | Paper | link |
Clay | Clay Foundation Model | - | null | link |
Hydro | Hydro--A Foundation Model for Water in Satellite Imagery | - | null | link |
U-BARN | Self-Supervised Spatio-Temporal Representation Learning of Satellite Image Time Series | JSTARS2024 | Paper | link |
GeRSP | Generic Knowledge Boosted Pre-training For Remote Sensing Images | Arxiv2024 | GeRSP | GeRSP |
SwiMDiff | SwiMDiff: Scene-wide Matching Contrastive Learning with Diffusion Constraint for Remote Sensing Image | Arxiv2024 | SwiMDiff | null |
OFA-Net | One for All: Toward Unified Foundation Models for Earth Vision | IGARSS2024 | OFA-Net | null |
SMLFR | Generative ConvNet Foundation Model With Sparse Modeling and Low-Frequency Reconstruction for Remote Sensing Image Interpretation | TGRS2024 | SMLFR | link |
SpectralGPT | SpectralGPT: Spectral Foundation Model | TPAMI2024 | SpectralGPT | link |
S2MAE | S2MAE: A Spatial-Spectral Pretraining Foundation Model for Spectral Remote Sensing Data | CVPR2024 | S2MAE | null |
SatMAE++ | Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery | CVPR2024 | SatMAE++ | link |
msGFM | Bridging Remote Sensors with Multisensor Geospatial Foundation Models | CVPR2024 | msGFM | link |
SkySense | SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery | CVPR2024 | SkySense | link |
MTP | MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining | IEEE JSTARS2024 | MTP | link |
DOFA | Neural Plasticity-Inspired Foundation Model for Observing the Earth Crossing Modalities | Arxiv2024 | DOFA | link |
MMEarth | MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning | ECCV2024 | MMEarth | link |
LeMeViT | LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation | IJCAI2024 | LeMeViT | link |
SoftCon | Multi-Label Guided Soft Contrastive Learning for Efficient Earth Observation Pretraining | TGRS2024 | SoftCon | link |
RS-DFM | RS-DFM: A Remote Sensing Distributed Foundation Model for Diverse Downstream Tasks | Arxiv2024 | RS-DFM | null |
A2-MAE | A2-MAE: A spatial-temporal-spectral unified remote sensing pre-training method based on anchor-aware masked autoencoder | Arxiv2024 | A2-MAE | null |
OmniSat | OmniSat: Self-Supervised Modality Fusion for Earth Observation | ECCV2024 | OmniSat | link |
MM-VSF | Towards a Knowledge guided Multimodal Foundation Model for Spatio-Temporal Remote Sensing Applications | Arxiv2024 | MM-VSF | null |
MA3E | Masked Angle-Aware Autoencoder for Remote Sensing Images | ECCV2024 | MA3E | link |
SpectralEarth | SpectralEarth: Training Hyperspectral Foundation Models at Scale | Arxiv2024 | SpectralEarth | null |
SenPa-MAE | SenPa-MAE: Sensor Parameter Aware Masked Autoencoder for Multi-Satellite Self-Supervised Pretraining | Arxiv2024 | SenPa-MAE | link |
RingMo-Aerial | RingMo-Aerial: An Aerial Remote Sensing Foundation Model With A Affine Transformation Contrastive Learning | Arxiv2024 | RingMo-Aerial | null |
SAR-JEPA | Predicting Gradient is Better: Exploring Self-Supervised Learning for SAR ATR with a Joint-Embedding Predictive Architecture | ISPRS JPRS2024 | SAR-JEPA | link |
PIS | Pretrain a Remote Sensing Foundation Model by Promoting Intra-instance Similarity | TGRS2024 | PIS | link |
OReole-FM | OReole-FM: successes and challenges toward billion-parameter foundation models for high-resolution satellite imagery | SIGSPATIAL2024 | OReole-FM | null |
PIEViT | Pattern Integration and Enhancement Vision Transformer for Self-supervised Learning in Remote Sensing | Arxiv2024 | PIEViT | null |
SatVision-TOA | SatVision-TOA: A Geospatial Foundation Model for Coarse-Resolution All-Sky Remote Sensing Imagery | Arxiv2024 | SatVision-TOA | link |
Prithvi-EO-2.0 | Prithvi-EO-2.0: A Versatile Multi-Temporal Foundation Model for Earth Observation Applications | Arxiv2024 | Prithvi-EO-2.0 | link |
WildSAT | WildSAT: Learning Satellite Image Representations from Wildlife Observations | Arxiv2024 | WildSAT | link |
SeaMo | SeaMo: A Multi-Seasonal and Multimodal Remote Sensing Foundation Model | Information Fusion2025 | SeaMo | null |
HyperSIGMA | HyperSIGMA: Hyperspectral Intelligence Comprehension Foundation Model | IEEE TPAMI2025 | HyperSIGMA | link |
FoMo | FoMo: Multi-Modal, Multi-Scale and Multi-Task Remote Sensing Foundation Models for Forest Monitoring | AAAI2025 | FoMo | link |
SatMamba | SatMamba: Development of Foundation Models for Remote Sensing Imagery Using State Space Models | Arxiv2025 | SatMamba | link |
Galileo | Galileo: Learning Global and Local Features in Pretrained Remote Sensing Models | ICML2025 | Galileo | link |
SatDiFuser | Can Generative Geospatial Diffusion Models Excel as Discriminative Geospatial Foundation Models? | Arxiv2025 | SatDiFuser | null |
RoMA | RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing | Arxiv2025 | RoMA | link |
Panopticon | Panopticon: Advancing Any-Sensor Foundation Models for Earth Observation | CVPR2025 | Panopticon | link |
HyperFree | HyperFree: A Channel-adaptive and Tuning-free Foundation Model for Hyperspectral Remote Sensing Imagery | CVPR2025 | HyperFree | link |
AnySat | AnySat: An Earth Observation Model for Any Resolutions, Scales, and Modalities | CVPR2025 | AnySat | link |
HyperSL | HyperSL: A Spectral Foundation Model for Hyperspectral Image Interpretation | IEEE TGRS2025 | HyperSL | link |
DynamicVis | DynamicVis: An Efficient and General Visual Foundation Model for Remote Sensing Image Understanding | Arxiv2025 | DynamicVis | link |
FlexiMo | FlexiMo: A Flexible Remote Sensing Foundation Model | Arxiv2025 | FlexiMo | null |
TiMo | TiMo: Spatiotemporal Foundation Model for Satellite Image Time Series | Arxiv2025 | TiMo | link |
RingMoE | RingMoE: Mixture-of-Modality-Experts Multi-Modal Foundation Models for Universal Remote Sensing Image Interpretation | Arxiv2025 | RingMoE | null |
- | A Complex-valued SAR Foundation Model Based on Physically Inspired Representation Learning | Arxiv2025 | Paper | null |
TerraFM | TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation | Arxiv2025 | TerraFM | link |
TESSERA | TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis | Arxiv2025 | TESSERA | null |
MoSAiC | MoSAiC: Multi-Modal Multi-Label Supervision-Aware Contrastive Learning for Remote Sensing | Arxiv2025 | MoSAiC | null |
CGEarthEye | CGEarthEye: A High-Resolution Remote Sensing Vision Foundation Model Based on the Jilin-1 Satellite Constellation | Arxiv2025 | CGEarthEye | null |
MAPEX | MAPEX: Modality-Aware Pruning of Experts for Remote Sensing Foundation Models | Arxiv2025 | MAPEX | link |
FedSense | Towards Privacy-preserved Pre-training of Remote Sensing Foundation Models with Federated Mutual-guidance Learning | ICCV2025 | FedSense | null |
RS-vHeat | RS-vHeat: Heat Conduction Guided Efficient Remote Sensing Foundation Model | ICCV2025 | RS-vHeat | null |
Copernicus-FM | Towards a Unified Copernicus Foundation Model for Earth Vision | ICCV2025 | Copernicus-FM | link |
SelectiveMAE | Scaling Efficient Masked Autoencoder Learning on Large Remote Sensing Dataset | ICCV2025 | SelectiveMAE | link |
SMARTIES | SMARTIES: Spectrum-Aware Multi-Sensor Auto-Encoder for Remote Sensing Images | ICCV2025 | SMARTIES | link |
TerraMind | TerraMind: Large-Scale Generative Multimodality for Earth Observation | ICCV2025 | TerraMind | link |
SkySense V2 | SkySense V2: A Unified Foundation Model for Multi-modal Remote Sensing | ICCV2025 | SkySense V2 | null |
AlphaEarth | AlphaEarth Foundations: An embedding field model for accurate and efficient global mapping from sparse label data | Arxiv2025 | AlphaEarth | null |
SkySense++ | A semantic-enhanced multi-modal remote sensing foundation model for Earth observation | Nature Machine Intelligence 2025 | SkySense++ | link |
### Remote Sensing Vision-Language Models

Abbreviation | Title | Publication | Paper | Code & Weights |
---|---|---|---|---|
RSGPT | RSGPT: A Remote Sensing Vision Language Model and Benchmark | Arxiv2023 | RSGPT | link |
RemoteCLIP | RemoteCLIP: A Vision Language Foundation Model for Remote Sensing | IEEE TGRS2024 | RemoteCLIP | link |
GeoRSCLIP | RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model | IEEE TGRS2024 | GeoRSCLIP | link |
GRAFT | Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment | ICLR2024 | GRAFT | null |
- | Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs | Arxiv2023 | Paper | link |
- | Remote Sensing ChatGPT: Solving Remote Sensing Tasks with ChatGPT and Visual Models | Arxiv2024 | Paper | link |
EarthGPT | EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain | Arxiv2024 | EarthGPT | null |
SkyCLIP | SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing | AAAI2024 | SkyCLIP | link |
GeoChat | GeoChat: Grounded Large Vision-Language Model for Remote Sensing | CVPR2024 | GeoChat | link |
LHRS-Bot | LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model | ECCV2024 | LHRS-Bot | link |
RS-LLaVA | RS-LLaVA: Large Vision Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery | RS2024 | RS-LLaVA | link |
SkySenseGPT | SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding | Arxiv2024 | SkySenseGPT | link |
EarthMarker | EarthMarker: Visual Prompt Learning for Region-level and Point-level Remote Sensing Imagery Comprehension | IEEE TGRS2024 | EarthMarker | link |
GeoText | Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching | ECCV2024 | GeoText | link |
Aquila | Aquila: A Hierarchically Aligned Visual-Language Model for Enhanced Remote Sensing Image Comprehension | Arxiv2024 | Aquila | null |
LHRS-Bot-Nova | LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation | Arxiv2024 | LHRS-Bot-Nova | link |
RSCLIP | Pushing the Limits of Vision-Language Models in Remote Sensing without Human Annotations | Arxiv2024 | RSCLIP | null |
GeoGround | GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding | Arxiv2024 | GeoGround | link |
RingMoGPT | RingMoGPT: A Unified Remote Sensing Foundation Model for Vision, Language, and grounded tasks | TGRS2024 | RingMoGPT | null |
RSUniVLM | RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts | Arxiv2024 | RSUniVLM | link |
UniRS | UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models | Arxiv2024 | UniRS | null |
REO-VLM | REO-VLM: Transforming VLM to Meet Regression Challenges in Earth Observation | Arxiv2024 | REO-VLM | null |
SkyEyeGPT | SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model | ISPRS JPRS2025 | SkyEyeGPT | link |
VHM | VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis | AAAI2025 | VHM | link |
TEOChat | TEOChat: Large Language and Vision Assistant for Temporal Earth Observation Data | ICLR2025 | TEOChat | link |
EarthDial | EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues | CVPR2025 | EarthDial | link |
SkySense-O | SkySense-O: Towards Open-World Remote Sensing Interpretation with Vision-Centric Visual-Language Modeling | CVPR2025 | SkySense-O | link |
XLRS-Bench | XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery? | CVPR2025 | XLRS-Bench | link |
GeoPix | GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing | IEEE GRSM2025 | GeoPix | link |
GeoPixel | GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing | ICML2025 | GeoPixel | link |
- | Quality-Driven Curation of Remote Sensing Vision-Language Data via Learned Scoring Models | Arxiv2025 | Paper | null |
DOFA-CLIP | DOFA-CLIP: Multimodal Vision–Language Foundation Models for Earth Observation | Arxiv2025 | DOFA-CLIP | link |
Falcon | Falcon: A Remote Sensing Vision-Language Foundation Model | Arxiv2025 | Falcon | link |
LRS-VQA | When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning | ICCV2025 | LRS-VQA | link |
UrbanLLaVA | UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding | ICCV2025 | UrbanLLaVA | link |
OmniGeo | OmniGeo: Towards a Multimodal Large Language Models for Geospatial Artificial Intelligence | Arxiv2025 | OmniGeo | null |
EagleVision | EagleVision: Object-level Attribute Multimodal LLM for Remote Sensing | Arxiv2025 | EagleVision | link |
SegEarth-R1 | SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model | Arxiv2025 | SegEarth-R1 | link |
RemoteSAM | RemoteSAM: Towards Segment Anything for Earth Observation | ACMMM2025 | RemoteSAM | link |
DynamicVL | DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding | Arxiv2025 | DynamicVL | null |
LISAt | LISAt: Language- Instructed Segmentation Assistant for Satellite Imagery | Arxiv2025 | LISAt | link |
EarthMind | EarthMind: Towards Multi-Granular and Multi-Sensor Earth Observation with Large Multimodal Models | Arxiv2025 | EarthMind | link |
- | Remote Sensing Large Vision-Language Model: Semantic-augmented Multi-level Alignment and Semantic-aware Expert Modeling | Arxiv2025 | Paper | null |
RingMo-Agent | RingMo-Agent: A Unified Remote Sensing Foundation Model for Multi-Platform and Multi-Modal Reasoning | Arxiv2025 | RingMo-Agent | null |
### Remote Sensing Generative Foundation Models

Abbreviation | Title | Publication | Paper | Code & Weights |
---|---|---|---|---|
Seg2Sat | Seg2Sat - Segmentation to aerial view using pretrained diffuser models | Github | null | link |
- | Generate Your Own Scotland: Satellite Image Generation Conditioned on Maps | NeurIPSW2023 | Paper | link |
GeoRSSD | RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model | Arxiv2023 | Paper | link |
DiffusionSat | DiffusionSat: A Generative Foundation Model for Satellite Imagery | ICLR2024 | DiffusionSat | link |
MetaEarth | MetaEarth: A Generative Foundation Model for Global-Scale Remote Sensing Image Generation | Arxiv2024 | Paper | link |
CRS-Diff | CRS-Diff: Controllable Generative Remote Sensing Foundation Model | Arxiv2024 | Paper | link |
HSIGene | HSIGene: A Foundation Model For Hyperspectral Image Generation | Arxiv2024 | Paper | link |
Text2Earth | Text2Earth: Unlocking Text-driven Remote Sensing Image Generation with a Global-Scale Dataset and a Foundation Model | Arxiv2025 | Paper | link |
### Remote Sensing Vision-Location Models

Abbreviation | Title | Publication | Paper | Code & Weights |
---|---|---|---|---|
CSP | CSP: Self-Supervised Contrastive Spatial Pre-Training for Geospatial-Visual Representations | ICML2023 | CSP | link |
GeoCLIP | GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization | NeurIPS2023 | GeoCLIP | link |
SatCLIP | SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery | Arxiv2023 | SatCLIP | link |
RANGE | RANGE: Retrieval Augmented Neural Fields for Multi-Resolution Geo-Embeddings | CVPR2025 | RANGE | null |
GAIR | GAIR: Improving Multimodal Geo-Foundation Model with Geo-Aligned Implicit Representations | Arxiv2025 | GAIR | null |
### Remote Sensing Vision-Audio Models

Abbreviation | Title | Publication | Paper | Code & Weights |
---|---|---|---|---|
- | Self-supervised audiovisual representation learning for remote sensing data | JAG2022 | Paper | link |
### Task-Specific Foundation Models

Abbreviation | Title | Publication | Paper | Code & Weights | Task |
---|---|---|---|---|---|
SS-MAE | SS-MAE: Spatial-Spectral Masked Auto-Encoder for Multi-Source Remote Sensing Image Classification | TGRS2023 | Paper | link | Image Classification
- | A Decoupling Paradigm With Prompt Learning for Remote Sensing Image Change Captioning | TGRS2023 | Paper | link | Remote Sensing Image Change Captioning |
TTP | Time Travelling Pixels: Bitemporal Features Integration with Foundation Model for Remote Sensing Image Change Detection | Arxiv2023 | Paper | link | Change Detection |
CSMAE | Exploring Masked Autoencoders for Sensor-Agnostic Image Retrieval in Remote Sensing | Arxiv2024 | Paper | link | Image Retrieval |
RSPrompter | RSPrompter: Learning to Prompt for Remote Sensing Instance Segmentation based on Visual Foundation Model | TGRS2024 | Paper | link | Instance Segmentation |
BAN | A New Learning Paradigm for Foundation Model-based Remote Sensing Change Detection | TGRS2024 | Paper | link | Change Detection |
- | Change Detection Between Optical Remote Sensing Imagery and Map Data via Segment Anything Model (SAM) | Arxiv2024 | Paper | null | Change Detection (Optical & OSM data) |
AnyChange | Segment Any Change | Arxiv2024 | Paper | null | Zero-shot Change Detection |
RS-CapRet | Large Language Models for Captioning and Retrieving Remote Sensing Images | Arxiv2024 | Paper | null | Image Caption & Text-image Retrieval |
- | Task Specific Pretraining with Noisy Labels for Remote sensing Image Segmentation | Arxiv2024 | Paper | null | Image Segmentation (Noisy labels) |
RSBuilding | RSBuilding: Towards General Remote Sensing Image Building Extraction and Change Detection with Foundation Model | Arxiv2024 | Paper | link | Building Extraction and Change Detection |
SAM-Road | Segment Anything Model for Road Network Graph Extraction | Arxiv2024 | Paper | link | Road Extraction |
CrossEarth | CrossEarth: Geospatial Vision Foundation Model for Domain Generalizable Remote Sensing Semantic Segmentation | Arxiv2024 | Paper | link | Domain Generalizable Remote Sensing Semantic Segmentation |
GeoGround | GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding | Arxiv2024 | Paper | link | Remote Sensing Visual Grounding |
SARATR-X | SARATR-X: Toward Building a Foundation Model for SAR Target Recognition | IEEE TIP2025 | SARATR-X | link | SAR Target Recognition |
### Remote Sensing Agents

Abbreviation | Title | Publication | Paper | Code & Weights |
---|---|---|---|---|
GeoLLM-QA | Evaluating Tool-Augmented Agents in Remote Sensing Platforms | ICLR 2024 ML4RS Workshop | Paper | null |
RS-Agent | RS-Agent: Automating Remote Sensing Tasks through Intelligent Agents | Arxiv2024 | Paper | null |
Change-Agent | Change-Agent: Toward Interactive Comprehensive Remote Sensing Change Interpretation and Analysis | TGRS2024 | Paper | link |
GeoLLM-Engine | GeoLLM-Engine: A Realistic Environment for Building Geospatial Copilots. | CVPRW2024 | Paper | null |
PEACE | PEACE: Empowering Geologic Map Holistic Understanding with MLLMs | CVPR2025 | Paper | link |
- | Towards LLM Agents for Earth Observation: The UnivEARTH Dataset | Arxiv2025 | Paper | null |
Geo-OLM | Geo-OLM: Enabling Sustainable Earth Observation Studies with Cost-Efficient Open Language Models & State-Driven Workflows | COMPASS'2025 | Paper | link |
ThinkGeo | ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks | Arxiv2025 | Paper | link |
## Datasets & Benchmarks

### Benchmarks

Abbreviation | Title | Publication | Paper | Link | Downstream Tasks |
---|---|---|---|---|---|
- | Revisiting pre-trained remote sensing model benchmarks: resizing and normalization matters | Arxiv2023 | Paper | link | Classification |
GEO-Bench | GEO-Bench: Toward Foundation Models for Earth Monitoring | Arxiv2023 | Paper | link | Classification & Segmentation |
FoMo-Bench | FoMo-Bench: a multi-modal, multi-scale and multi-task Forest Monitoring Benchmark for remote sensing foundation models | Arxiv2023 | FoMo-Bench | Coming soon | Classification & Segmentation & Detection for forest monitoring
PhilEO | PhilEO Bench: Evaluating Geo-Spatial Foundation Models | Arxiv2024 | Paper | link | Segmentation & Regression estimation |
SkySense | SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery | CVPR2024 | SkySense | Targeted open-source | Classification & Segmentation & Detection & Change detection & Multi-Modal Segmentation: Time-insensitive LandCover Mapping & Multi-Modal Segmentation: Time-sensitive Crop Mapping & Multi-Modal Scene Classification |
VLEO-Bench | Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data | Arxiv2024 | VLEO-Bench | link | Location Recognition & Captioning & Scene Classification & Counting & Detection & Change detection
VRSBench | VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding | NeurIPS2024 | VRSBench | link | Image Captioning & Object Referring & Visual Question Answering |
UrBench | UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios | Arxiv2024 | UrBench | link | Object Referring & Visual Question Answering & Counting & Scene Classification & Location Recognition & Geolocalization |
PANGAEA | PANGAEA: A Global and Inclusive Benchmark for Geospatial Foundation Models | Arxiv2024 | PANGAEA | link | Segmentation & Change detection & Regression |
COREval | COREval: A Comprehensive and Objective Benchmark for Evaluating the Remote Sensing Capabilities of Large Vision-Language Models | Arxiv2024 | COREval | null | Perception & Reasoning |
GEOBench-VLM | GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks | Arxiv2024 | GEOBench-VLM | link | Scene Understanding & Counting & Object Classification & Event Detection & Spatial Relations |
Copernicus-Bench | Towards a Unified Copernicus Foundation Model for Earth Vision | Arxiv2025 | Copernicus-Bench | link | Segmentation & Classification & Change detection & Regression |
### Datasets

Abbreviation | Title | Publication | Paper | Attribute | Link |
---|---|---|---|---|---|
fMoW | Functional Map of the World | CVPR2018 | fMoW | Vision | link |
SEN12MS | SEN12MS -- A Curated Dataset of Georeferenced Multi-Spectral Sentinel-1/2 Imagery for Deep Learning and Data Fusion | - | SEN12MS | Vision | link |
BEN-MM | BigEarthNet-MM: A Large Scale Multi-Modal Multi-Label Benchmark Archive for Remote Sensing Image Classification and Retrieval | GRSM2021 | BEN-MM | Vision | link |
MillionAID | On Creating Benchmark Dataset for Aerial Image Interpretation: Reviews, Guidances, and Million-AID | JSTARS2021 | MillionAID | Vision | link |
SeCo | Seasonal Contrast: Unsupervised Pre-Training From Uncurated Remote Sensing Data | ICCV2021 | SeCo | Vision | link |
fMoW-S2 | SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery | NeurIPS2022 | fMoW-S2 | Vision | link |
TOV-RS-Balanced | TOV: The original vision model for optical remote sensing image understanding via self-supervised learning | JSTARS2023 | TOV | Vision | link |
SSL4EO-S12 | SSL4EO-S12: A Large-Scale Multi-Modal, Multi-Temporal Dataset for Self-Supervised Learning in Earth Observation | GRSM2023 | SSL4EO-S12 | Vision | link |
SSL4EO-L | SSL4EO-L: Datasets and Foundation Models for Landsat Imagery | Arxiv2023 | SSL4EO-L | Vision | link |
SatlasPretrain | SatlasPretrain: A Large-Scale Dataset for Remote Sensing Image Understanding | ICCV2023 | SatlasPretrain | Vision (Supervised) | link |
CACo | Change-Aware Sampling and Contrastive Learning for Satellite Images | CVPR2023 | CACo | Vision | Coming soon
SAMRS | SAMRS: Scaling-up Remote Sensing Segmentation Dataset with Segment Anything Model | NeurIPS2023 | SAMRS | Vision | link |
RSVG | RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data | TGRS2023 | RSVG | Vision-Language | link |
RS5M | RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model | Arxiv2023 | RS5M | Vision-Language | link |
GEO-Bench | GEO-Bench: Toward Foundation Models for Earth Monitoring | Arxiv2023 | GEO-Bench | Vision (Evaluation) | link |
RSICap & RSIEval | RSGPT: A Remote Sensing Vision Language Model and Benchmark | Arxiv2023 | RSGPT | Vision-Language | Coming soon
Clay | Clay Foundation Model | - | null | Vision | link |
SATIN | SATIN: A Multi-Task Metadataset for Classifying Satellite Imagery using Vision-Language Models | ICCVW2023 | SATIN | Vision-Language | link |
SkyScript | SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing | AAAI2024 | SkyScript | Vision-Language | link |
ChatEarthNet | ChatEarthNet: A Global-Scale, High-Quality Image-Text Dataset for Remote Sensing | Arxiv2024 | ChatEarthNet | Vision-Language | link |
LuoJiaHOG | LuoJiaHOG: A Hierarchy Oriented Geo-aware Image Caption Dataset for Remote Sensing Image-Text Retrieval | Arxiv2024 | LuoJiaHOG | Vision-Language | null |
MMEarth | MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning | Arxiv2024 | MMEarth | Vision | link |
SeeFar | SeeFar: Satellite Agnostic Multi-Resolution Dataset for Geospatial Foundation Models | Arxiv2024 | SeeFar | Vision | link |
FIT-RS | SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding | Arxiv2024 | Paper | Vision-Language | link |
RS-GPT4V | RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding | Arxiv2024 | Paper | Vision-Language | link |
RS-4M | Scaling Efficient Masked Autoencoder Learning on Large Remote Sensing Dataset | Arxiv2024 | RS-4M | Vision | link |
Major TOM | Major TOM: Expandable Datasets for Earth Observation | Arxiv2024 | Major TOM | Vision | link |
VRSBench | VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding | Arxiv2024 | VRSBench | Vision-Language | link |
MMM-RS | MMM-RS: A Multi-modal, Multi-GSD, Multi-scene Remote Sensing Dataset and Benchmark for Text-to-Image Generation | Arxiv2024 | MMM-RS | Vision-Language | link |
DDFAV | DDFAV: Remote Sensing Large Vision Language Models Dataset and Evaluation Benchmark | Arxiv2024 | DDFAV | Vision-Language | link |
M3LEO | A Multi-Modal, Multi-Label Earth Observation Dataset Integrating Interferometric SAR and Multispectral Data | NeurIPS2024 | M3LEO | Vision | link |
Copernicus-Pretrain | Towards a Unified Copernicus Foundation Model for Earth Vision | Arxiv2025 | Copernicus-Pretrain | Vision | link |
## Others

### Related Projects

(TODO: This section recommends further relevant and impactful projects, in the hope of promoting the development of the RS community. 😄 🚀)
Title | Link | Brief Introduction |
---|---|---|
RSFMs (Remote Sensing Foundation Models) Playground | link | An open-source playground to streamline the evaluation and fine-tuning of RSFMs on various datasets. |
PANGAEA | link | A Global and Inclusive Benchmark for Geospatial Foundation Models. |
GeoFM | link | Evaluation of Foundation Models for Earth Observation. |
EOUncertaintyGeneralization | link | On the Generalization of Representation Uncertainty in Earth Observation. |
Title | Publication | Paper | Attribute |
---|---|---|---|
Self-Supervised Remote Sensing Feature Learning: Learning Paradigms, Challenges, and Future Works | TGRS2023 | Paper | Vision & Vision-Language |
The Potential of Visual ChatGPT For Remote Sensing | Arxiv2023 | Paper | Vision-Language |
Large Remote Sensing Models: Progress and Prospects (遥感大模型:进展与前瞻) | Geomatics and Information Science of Wuhan University 2023 | Paper | Vision & Vision-Language |
GeoAI Samples: Models, Quality, and Services (地理人工智能样本:模型、质量与服务) | Geomatics and Information Science of Wuhan University 2023 | Paper | - |
Brain-Inspired Remote Sensing Foundation Models and Open Problems: A Comprehensive Survey | JSTARS2023 | Paper | Vision & Vision-Language |
Revisiting pre-trained remote sensing model benchmarks: resizing and normalization matters | Arxiv2023 | Paper | Vision |
An Agenda for Multimodal Foundation Models for Earth Observation | IGARSS2023 | Paper | Vision |
Transfer learning in environmental remote sensing | RSE2024 | Paper | Transfer learning |
A Review and Future Outlook of the Development of Remote Sensing Foundation Models (遥感基础模型发展综述与未来设想) | National Remote Sensing Bulletin 2023 | Paper | - |
On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications | Arxiv2023 | Paper | Vision-Language |
Vision-Language Models in Remote Sensing: Current Progress and Future Trends | IEEE GRSM2024 | Paper | Vision-Language |
On the Foundations of Earth and Climate Foundation Models | Arxiv2024 | Paper | Vision & Vision-Language |
Towards Vision-Language Geo-Foundation Model: A Survey | Arxiv2024 | Paper | Vision-Language |
AI Foundation Models in Remote Sensing: A Survey | Arxiv2024 | Paper | Vision |
Foundation model for generalist remote sensing intelligence: Potentials and prospects | Science Bulletin2024 | Paper | - |
Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques | Arxiv2024 | Paper | Vision-Language |
Foundation Models for Remote Sensing and Earth Observation: A Survey | Arxiv2024 | Paper | Vision & Vision-Language |
Multimodal Remote Sensing Foundation Models: Current Status and Future Prospects (多模态遥感基础大模型:研究现状与未来展望) | Acta Geodaetica et Cartographica Sinica 2024 | Paper | Vision & Vision-Language & Generative & Vision-Location |
When Geoscience Meets Foundation Models: Toward a general geoscience artificial intelligence system | IEEE GRSM2024 | Paper | Vision & Vision-Language |
Towards the next generation of Geospatial Artificial Intelligence | JAG2025 | Paper | - |
Vision Foundation Models in Remote Sensing: A survey | IEEE GRSM2025 | Paper | Vision |
Unleashing the potential of remote sensing foundation models via bridging data and computility islands | The Innovation2025 | Paper | - |
A Survey on Remote Sensing Foundation Models: From Vision to Multimodality | Arxiv2025 | Paper | - |
If you find this repository useful, please consider giving it a star ⭐ and citing:
@inproceedings{guo2024skysense,
title={SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery},
author={Guo, Xin and Lao, Jiangwei and Dang, Bo and Zhang, Yingying and Yu, Lei and Ru, Lixiang and Zhong, Liheng and Huang, Ziyuan and Wu, Kang and Hu, Dingxiang and others},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={27672--27683},
year={2024}
}
@article{li2025unleashing,
title={Unleashing the potential of remote sensing foundation models via bridging data and computility islands},
author={Li, Yansheng and Tan, Jieyi and Dang, Bo and Ye, Mang and Bartalev, Sergey A and Shinkarenko, Stanislav and Wang, Linlin and Zhang, Yingying and Ru, Lixiang and Guo, Xin and others},
journal={The Innovation},
year={2025},
publisher={Elsevier}
}
@article{wu2025semantic,
title={A semantic-enhanced multi-modal remote sensing foundation model for Earth observation},
author={Wu, Kang and Zhang, Yingying and Ru, Lixiang and Dang, Bo and Lao, Jiangwei and Yu, Lei and Luo, Junwei and Zhu, Zifan and Sun, Yue and Zhang, Jiahao and Zhu, Qi and Wang, Jian and Yang, Ming and Chen, Jingdong and Zhang, Yongjun and Li, Yansheng},
journal={Nature Machine Intelligence},
year={2025},
doi={10.1038/s42256-025-01078-8},
url={https://doi.org/10.1038/s42256-025-01078-8}
}
@inproceedings{zhu2025skysense,
title={SkySense-O: Towards open-world remote sensing interpretation with vision-centric visual-language modeling},
author={Zhu, Qi and Lao, Jiangwei and Ji, Deyi and Luo, Junwei and Wu, Kang and Zhang, Yingying and Ru, Lixiang and Wang, Jian and Chen, Jingdong and Yang, Ming and others},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={14733--14744},
year={2025}
}
@article{luo2024skysensegpt,
title={SkySenseGPT: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding},
author={Luo, Junwei and Pang, Zhen and Zhang, Yongjun and Wang, Tingzhu and Wang, Linlin and Dang, Bo and Lao, Jiangwei and Wang, Jian and Chen, Jingdong and Tan, Yihua and others},
journal={arXiv preprint arXiv:2406.10100},
year={2024}
}