Environmental Metadata
======================
.. currentmodule:: biosample_enricher.environmental_metadata
Get environmental metadata for geographic coordinates.
What It Does
------------
**Get environmental metadata values** for slots like ``annual_precpt``, ``annual_temp``, ``elev``, ``depth``, etc.
Give it GPS coordinates and get back values in documented units, along with provenance metadata. The output is compatible with the NMDC submission-schema and can be used in other applications.
Quick Example
-------------
.. code-block:: python

   from biosample_enricher.environmental_metadata import get_environmental_metadata

   # Get climate data for San Francisco
   result = get_environmental_metadata(
       lat=37.7749,
       lon=-122.4194,
       slots=["annual_precpt", "annual_temp"]
   )

   # Use the values in your NMDC submission
   print(result["values"])
   # {'annual_precpt': 519.3, 'annual_temp': 14.1}

   # Check which data sources were used
   print(result["metadata"]["climate_normals"]["providers_used"])
   # ['meteostat', 'nasa_power']

What Values Can You Get?
-------------------------
Currently Supported (✅ Ready to Use)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**Climate Data** (30-year averages, no datetime needed):
.. list-table::
:header-rows: 1
:widths: 20 50 15 15
* - Slot Name
- Description
- Units
- Type
* - ``annual_precpt``
- Annual precipitation (30-year average)
- millimeters/year
- float
* - ``annual_temp``
- Annual temperature (30-year average)
- degrees Celsius
- float
**Providers**: meteostat, nasa_power (both are queried and their results are averaged by default)
**Elevation Data** (no datetime needed):
.. list-table::
:header-rows: 1
:widths: 20 50 15 15
* - Slot Name
- Description
- Units
- Type
* - ``elev``
- Elevation above sea level
- meters
- float
**Providers**: USGS, Google Maps, Open Topo Data, OSM (tries multiple sources)
Partially Implemented (⚠️ Use with Caution)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**Weather Data** (requires datetime - collection date/time):
.. list-table::
:header-rows: 1
:widths: 20 50 15 15
* - Slot Name
- Description
- Units
- Type
* - ``temp``
- Temperature at collection time
- degrees Celsius
- float
* - ``air_temp``
- Air temperature (alias for temp)
- degrees Celsius
- float
* - ``humidity``
- Relative humidity
- g/m³
- string
* - ``wind_speed``
- Wind speed
- m/s
- string
* - ``wind_direction``
- Wind direction
- degrees
- string
* - ``solar_irradiance``
- Solar radiation
- W/m²
- string
.. warning::

   Weather slots require the ``datetime_obj`` parameter. Data availability depends on location and date,
   so not every slot returns a value for every location and time.

**Marine Data** (no datetime needed):
.. list-table::
:header-rows: 1
:widths: 20 50 15 15
* - Slot Name
- Description
- Units
- Type
* - ``depth``
- Water depth (negative for underwater)
- meters
- string
.. warning::
Marine providers (GEBCO, ESA CCI, NOAA) are marked as unreliable in Issue #181.
Data quality varies significantly by location.
**Soil Data** (no datetime needed):
.. list-table::
:header-rows: 1
:widths: 20 50 15 15
* - Slot Name
- Description
- Units
- Type
* - ``ph``
- Soil pH
- pH units
- float
* - ``soil_type``
- USDA soil texture class
- text
- string
.. warning::
SoilGrids provider has intermittent failures (Issue #184). Success rate varies by location.
Not Yet Implemented (❌ Future Work)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
These submission-schema slots are **not yet supported**:
- ``cur_vegetation`` - Current vegetation type (Issue #194)
- ``flooding`` - Flooding history (Issue #192)
- ``fire`` - Fire history
- ``extreme_event`` - Extreme weather events
- Many others...
See the submission-schema documentation for the complete list of slots.
.. seealso::
Issue #193: Add submission-schema extraction helpers for more slots
Parameters
----------
.. function:: get_environmental_metadata(lat, lon, slots, datetime_obj=None, providers=None, strategy="mean")
Get NMDC submission-schema values for specified slots.
:param float lat: Latitude in decimal degrees (required)
Valid range: -90 to 90
:param float lon: Longitude in decimal degrees (required)
Valid range: -180 to 180
:param list[str] slots: Slot names to retrieve (required)
Must be from supported slots listed above.
Cannot be empty.
**Examples**::
# Climate only
slots=["annual_precpt", "annual_temp"]
# Mix climate and elevation
slots=["annual_precpt", "elev"]
# Weather (requires datetime_obj)
slots=["temp", "humidity", "wind_speed"]
:param datetime datetime_obj: Collection date/time (optional)
**Required for weather slots** (temp, air_temp, humidity, etc.)
Not used for climate, elevation, marine, or soil slots
**Example**::
from datetime import datetime
datetime_obj=datetime(2023, 7, 15, 14, 30)
:param list[str] providers: Specific providers to use (optional)
If None (default), queries all available providers.
**Valid providers by slot category:**
- Climate slots: ``["meteostat", "nasa_power"]``
- Elevation slots: ``["usgs", "google", "open_topo_data", "osm"]``
**Examples**::
# Use only meteostat for climate
providers=["meteostat"]
# Use only USGS for elevation
providers=["usgs"]
:param str strategy: How to combine values from multiple providers (optional)
Default is ``"mean"``.
**Valid values** (from ``CONSENSUS_STRATEGIES``):
- ``"mean"``: Average across all successful providers (default, most reliable)
- ``"median"``: Middle value when sorted (robust to outliers)
- ``"first"``: Use first successful provider in priority order (fastest)
- ``"best_quality"``: Use provider with best quality metric (closest station, highest resolution)
See :ref:`consensus-strategies` for detailed descriptions and usage guidance.
**Example**::
# Use median to handle outliers
result = get_environmental_metadata(
lat=46.8523, lon=-121.7603,
slots=["elev"],
strategy="median"
)
:returns: Dictionary with two keys:
- ``"values"``: Dict mapping slot names to submission-ready values
- Values are in the correct units for submission-schema
- Slots that failed to retrieve data are **omitted** (not None)
- Types match submission-schema requirements (mostly float, some string)
- ``"metadata"``: Dict with provider information for transparency
- Shows which data sources contributed to each value
- Includes provider-specific results for comparison
- Lists any providers that failed with error messages
:rtype: dict[str, Any]
:raises ValueError: If latitude is outside -90 to 90
:raises ValueError: If longitude is outside -180 to 180
:raises ValueError: If slots list is empty
:raises ValueError: If slots contains unsupported slot names
(Error message will list all supported slots)
:raises ValueError: If providers contains invalid provider names for requested slots
(Error message will list valid providers)
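
The input checks listed under ``:raises:`` can be pictured with a minimal sketch. The helper name ``validate_request`` and the hard-coded slot set are illustrative assumptions; the library's own validation code and error messages may differ, and the provider check is left out here.

.. code-block:: python

   # Hypothetical helper: mirrors the documented ValueError conditions only.
   SUPPORTED_SLOTS = {
       "annual_precpt", "annual_temp", "elev", "temp", "air_temp", "humidity",
       "wind_speed", "wind_direction", "solar_irradiance", "depth", "ph", "soil_type",
   }


   def validate_request(lat, lon, slots):
       """Raise ValueError for the input problems described above."""
       if not -90 <= lat <= 90:
           raise ValueError(f"Latitude {lat} is outside -90 to 90")
       if not -180 <= lon <= 180:
           raise ValueError(f"Longitude {lon} is outside -180 to 180")
       if not slots:
           raise ValueError("slots cannot be empty")
       unsupported = sorted(set(slots) - SUPPORTED_SLOTS)
       if unsupported:
           raise ValueError(
               f"Unsupported slot(s): {unsupported}. "
               f"Supported slots: {sorted(SUPPORTED_SLOTS)}"
           )

Running a check like this before a long series of calls surfaces bad coordinates or misspelled slot names early, without contacting any provider.
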
Return Value Structure
----------------------
The function returns a dictionary with this structure:
.. code-block:: python

   {
       "values": {
           "annual_precpt": 519.3,  # float: mm/year
           "annual_temp": 14.1,     # float: °C
           "elev": 52.4             # float: meters
           # Missing/failed slots are omitted
       },
       "metadata": {
           "climate_normals": {  # Only present if climate slots requested
               "providers_used": ["meteostat", "nasa_power"],
               "consensus_strategy": "mean",  # How values were combined
               "provider_results": {
                   "meteostat": {
                       "annual_precpt": 453.1,
                       "annual_temp": 14.2,
                       "period": "1991-2020",
                       "station_distance_km": 3.2
                   },
                   "nasa_power": {
                       "annual_precpt": 585.5,
                       "annual_temp": 14.0,
                       "period": "2001-2020"
                   }
               },
               "failed_providers": {}  # Dict of {provider: error_message}
           }
           # "weather", "elevation", "marine", "soil" metadata added as implemented
       }
   }

.. note::
**Missing slots are omitted**, not set to None. Always check with ``if "slot_name" in result["values"]``
before accessing values.
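
Because failed slots are dropped from ``values``, it can help to work out which requested slots are missing before reading them. The sketch below uses a hand-written ``result`` dictionary shaped like the structure above; ``missing_slots`` is a hypothetical helper, not part of the library.

.. code-block:: python

   def missing_slots(result, requested):
       """Return the requested slot names absent from result["values"]."""
       values = result.get("values", {})
       return [slot for slot in requested if slot not in values]


   # Hand-written stand-in for a real return value (the depth lookup failed).
   result = {"values": {"annual_precpt": 519.3, "elev": 52.4}, "metadata": {}}

   print(missing_slots(result, ["annual_precpt", "elev", "depth"]))
   # ['depth']

   # dict.get() avoids a KeyError when a slot is absent.
   depth = result["values"].get("depth")  # None here
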
Examples
--------
Basic Climate Data
~~~~~~~~~~~~~~~~~~
.. code-block:: python

   from biosample_enricher.environmental_metadata import get_environmental_metadata

   # Get 30-year climate averages for a location
   result = get_environmental_metadata(
       lat=42.3601,  # Boston
       lon=-71.0589,
       slots=["annual_precpt", "annual_temp"]
   )

   # Use the values
   precip = result["values"]["annual_precpt"]  # 1090.2 mm/year
   temp = result["values"]["annual_temp"]      # 10.8 °C

   # Check data quality by comparing providers
   providers = result["metadata"]["climate_normals"]["provider_results"]
   for name, data in providers.items():
       print(f"{name}: {data['annual_precpt']:.1f} mm/year")
   # meteostat: 1089.3 mm/year
   # nasa_power: 1091.1 mm/year

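
Provider agreement can also be reduced to a single number. The sketch below computes the relative spread between provider values for one slot, using the San Francisco ``provider_results`` from the Return Value Structure section. ``relative_spread`` and the 10% threshold are illustrative assumptions, not library features.

.. code-block:: python

   def relative_spread(provider_results, slot):
       """Return (max - min) / mean for one slot across providers, or None."""
       values = [r[slot] for r in provider_results.values() if slot in r]
       if len(values) < 2:
           return None  # nothing to compare
       mean = sum(values) / len(values)
       if mean == 0:
           return None  # relative spread is undefined around zero
       return (max(values) - min(values)) / abs(mean)


   provider_results = {
       "meteostat": {"annual_precpt": 453.1, "annual_temp": 14.2},
       "nasa_power": {"annual_precpt": 585.5, "annual_temp": 14.0},
   }

   spread = relative_spread(provider_results, "annual_precpt")
   if spread is not None and spread > 0.10:  # arbitrary example threshold
       print(f"Providers disagree by {spread:.0%} of the mean")
   # Providers disagree by 25% of the mean

A large spread does not say which provider is right, only that the combined value deserves a closer look.
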
Mixing Multiple Slot Types
~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: python
# Get climate + elevation in one call
result = get_environmental_metadata(
lat=40.7128, # New York City
lon=-74.0060,
slots=["annual_precpt", "annual_temp", "elev"]
)
values = result["values"]
print(f"Elevation: {values['elev']} m")
print(f"Annual rain: {values['annual_precpt']} mm/year")
print(f"Annual temp: {values['annual_temp']} °C")
Using Specific Providers
~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: python
# Use only meteostat for climate (not NASA POWER)
result = get_environmental_metadata(
lat=51.5074, # London
lon=-0.1278,
slots=["annual_precpt", "annual_temp"],
providers=["meteostat"] # Only use this provider
)
metadata = result["metadata"]["climate_normals"]
print(metadata["providers_used"]) # ['meteostat']
Weather Data (Requires Datetime)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: python

   from datetime import datetime

   from biosample_enricher.environmental_metadata import get_environmental_metadata

   # Get weather at sample collection time
   result = get_environmental_metadata(
       lat=34.0522,  # Los Angeles
       lon=-118.2437,
       slots=["temp", "humidity", "wind_speed"],
       datetime_obj=datetime(2023, 7, 15, 14, 30)  # Required!
   )

   # Check what data was available
   if "temp" in result["values"]:
       print(f"Temperature: {result['values']['temp']} °C")
   else:
       print("Temperature data not available for this location/time")

Error Handling
~~~~~~~~~~~~~~
.. code-block:: python

   from biosample_enricher.environmental_metadata import get_environmental_metadata

   # Handle invalid inputs gracefully
   try:
       result = get_environmental_metadata(
           lat=37.7749,
           lon=-122.4194,
           slots=["annual_precpt", "invalid_slot_name"]
       )
   except ValueError as e:
       print(f"Error: {e}")
       # Error: Unsupported slot(s): ['invalid_slot_name'].
       # Supported slots: ['air_temp', 'annual_precpt', 'annual_temp', ...]

   # Check for missing data
   result = get_environmental_metadata(
       lat=37.7749,
       lon=-122.4194,
       slots=["annual_precpt", "depth"]  # depth may not be available on land
   )

   if "annual_precpt" in result["values"]:
       print(f"Got precipitation: {result['values']['annual_precpt']}")

   if "depth" not in result["values"]:
       print("Depth data not available (probably on land)")

.. _quick-reference:
Quick Reference
---------------
All Constants at a Glance
~~~~~~~~~~~~~~~~~~~~~~~~~
These constants are available for programmatic access:
.. code-block:: python
from biosample_enricher.environmental_metadata import (
# Slot categories
ALL_SUPPORTED_SLOTS, # All slots combined
CLIMATE_SLOTS, # {'annual_precpt', 'annual_temp'}
WEATHER_SLOTS, # {'temp', 'air_temp', 'humidity', 'wind_speed', 'wind_direction', 'solar_irradiance'}
ELEVATION_SLOTS, # {'elev'}
MARINE_SLOTS, # {'depth'}
SOIL_SLOTS, # {'ph', 'soil_type'}
# Provider names
CLIMATE_PROVIDERS, # {'meteostat', 'nasa_power'}
ELEVATION_PROVIDERS, # {'usgs', 'google', 'open_topo_data', 'osm'}
# Consensus strategies
CONSENSUS_STRATEGIES, # {'mean', 'median', 'first', 'best_quality'}
)
Slots by Category
~~~~~~~~~~~~~~~~~
.. list-table::
:header-rows: 1
:widths: 20 40 20 20
* - Category
- Slots
- Providers
- Datetime Required?
* - Climate
- ``annual_precpt``, ``annual_temp``
- meteostat, nasa_power
- No
* - Weather
- ``temp``, ``air_temp``, ``humidity``, ``wind_speed``, ``wind_direction``, ``solar_irradiance``
- meteostat, open_meteo
- **Yes**
* - Elevation
- ``elev``
- usgs, google, open_topo_data, osm
- No
* - Marine
- ``depth``
- gebco, noaa
- No
* - Soil
- ``ph``, ``soil_type``
- soilgrids, usda_nrcs
- No
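
The datetime column above can be checked before making a call. This sketch hard-codes the weather slot names from the table so that it runs without the package; in real code the exported ``WEATHER_SLOTS`` constant could presumably be imported instead. ``requires_datetime`` is a hypothetical helper.

.. code-block:: python

   # Copied from the table above so the example is self-contained.
   WEATHER_SLOTS = {
       "temp", "air_temp", "humidity",
       "wind_speed", "wind_direction", "solar_irradiance",
   }


   def requires_datetime(slots):
       """Return True if any requested slot is a weather slot."""
       return any(slot in WEATHER_SLOTS for slot in slots)


   slots = ["annual_precpt", "temp"]
   datetime_obj = None  # e.g. the collection date was not recorded

   if requires_datetime(slots) and datetime_obj is None:
       # Drop weather slots rather than request data that needs a timestamp.
       slots = [s for s in slots if s not in WEATHER_SLOTS]

   print(slots)
   # ['annual_precpt']
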
.. _consensus-strategies:
Consensus Strategies
~~~~~~~~~~~~~~~~~~~~
When multiple providers return data, values are combined using a consensus strategy:
.. list-table::
:header-rows: 1
:widths: 15 60 25
* - Strategy
- Description
- When to Use
* - ``mean``
- Arithmetic average across all providers (default)
- General use, most reliable
* - ``median``
- Middle value when sorted
- When outliers are possible
* - ``first``
- Use first successful provider
- When speed matters
* - ``best_quality``
- Use provider with best quality metric
- Advanced use with quality scores
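
As a rough illustration of the first three strategies, the following sketch combines per-provider values with the standard library. It is not the library's implementation: ``combine`` is a hypothetical function, provider priority is assumed to be the dictionary's insertion order, and ``best_quality`` is left out because its quality metric is provider-specific.

.. code-block:: python

   from statistics import mean, median


   def combine(values_by_provider, strategy="mean"):
       """Combine one slot's values from several providers into a single value."""
       values = list(values_by_provider.values())
       if not values:
           raise ValueError("no provider returned a value")
       if strategy == "mean":
           return mean(values)
       if strategy == "median":
           return median(values)
       if strategy == "first":
           return values[0]  # assumes insertion order reflects priority
       raise ValueError(f"unsupported strategy: {strategy!r}")


   precip = {"meteostat": 453.1, "nasa_power": 585.5}
   print(round(combine(precip, "mean"), 1))  # 519.3
   print(combine(precip, "first"))           # 453.1

With only two providers, ``mean`` and ``median`` give the same result; the two diverge once a third provider reports an outlier.
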
Slot Status (Reliability)
~~~~~~~~~~~~~~~~~~~~~~~~~
.. list-table::
:header-rows: 1
:widths: 15 35 50
* - Status
- Slots
- Notes
* - **Ready**
- ``annual_precpt``, ``annual_temp``, ``elev``
- Production-ready, reliable data
* - **Caution**
- ``temp``, ``ph``, ``depth``
- May have gaps or provider issues
* - **Experimental**
- ``air_temp``, ``humidity``, ``wind_speed``, ``wind_direction``, ``solar_irradiance``, ``soil_type``
- Limited testing, may change
Copy-Paste Slot Lists
~~~~~~~~~~~~~~~~~~~~~
For convenience, here are the slot names ready to copy:
**All slots (comma-separated)**::
annual_precpt, annual_temp, elev, temp, air_temp, humidity, wind_speed, wind_direction, solar_irradiance, depth, ph, soil_type
**Production-ready slots only**::
annual_precpt, annual_temp, elev
**Climate + Elevation (most common)**::
annual_precpt, annual_temp, elev
Checking Available Slots and Providers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: python
from biosample_enricher.environmental_metadata import (
ALL_SUPPORTED_SLOTS,
CLIMATE_SLOTS,
WEATHER_SLOTS,
ELEVATION_SLOTS,
MARINE_SLOTS,
SOIL_SLOTS,
CLIMATE_PROVIDERS,
ELEVATION_PROVIDERS,
CONSENSUS_STRATEGIES,
)
# See all supported slots
print("All slots:", sorted(ALL_SUPPORTED_SLOTS))
# See slots by category
print("Climate:", sorted(CLIMATE_SLOTS))
print("Weather:", sorted(WEATHER_SLOTS))
print("Elevation:", sorted(ELEVATION_SLOTS))
print("Marine:", sorted(MARINE_SLOTS))
print("Soil:", sorted(SOIL_SLOTS))
# See available providers
print("Climate providers:", sorted(CLIMATE_PROVIDERS))
print("Elevation providers:", sorted(ELEVATION_PROVIDERS))
# See consensus strategies
print("Strategies:", sorted(CONSENSUS_STRATEGIES))
Limitations and Known Issues
-----------------------------
Current Limitations
~~~~~~~~~~~~~~~~~~~
1. **Limited slot coverage**: Only 12 of the 200+ submission-schema slots are supported
2. **No bulk operations**: Must call once per biosample (no batch processing yet)
3. **Weather data gaps**: Historical weather not available for all locations/times
4. **Provider reliability**: Some providers have intermittent failures (see issues below)
5. **No caching control**: Cannot disable or clear HTTP cache from this function
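
Until batch processing exists, a plain loop is one way to handle several biosamples. In the sketch below the lookup function is passed in as ``fetch`` so that the example runs without network access; in practice ``get_environmental_metadata`` would be passed instead. The ``enrich_many`` helper, the sample dictionaries, and ``fake_fetch`` are all illustrative assumptions.

.. code-block:: python

   def enrich_many(samples, slots, fetch):
       """Call fetch once per sample; collect values and per-sample errors."""
       enriched, errors = {}, {}
       for sample in samples:
           try:
               result = fetch(lat=sample["lat"], lon=sample["lon"], slots=slots)
           except ValueError as exc:  # documented error type for invalid input
               errors[sample["id"]] = str(exc)
               continue
           enriched[sample["id"]] = result["values"]
       return enriched, errors


   def fake_fetch(lat, lon, slots):
       """Stand-in for get_environmental_metadata, used only for this demo."""
       if not -90 <= lat <= 90:
           raise ValueError("latitude out of range")
       return {"values": {slot: 0.0 for slot in slots}, "metadata": {}}


   samples = [
       {"id": "s1", "lat": 42.3601, "lon": -71.0589},
       {"id": "s2", "lat": 123.0, "lon": 0.0},  # invalid latitude
   ]
   enriched, errors = enrich_many(samples, ["elev"], fake_fetch)
   print(sorted(enriched), sorted(errors))
   # ['s1'] ['s2']

Collecting errors per sample keeps one bad record from aborting the whole run.
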
Known Issues
~~~~~~~~~~~~
.. warning::
**Provider Reliability Issues** - Please review these before relying on data:
- Issue #181: Marine providers (GEBCO, ESA CCI, NOAA) incomplete/unreliable
- Issue #182: MODIS vegetation provider uses mock data only
- Issue #183: USGS elevation provider unreliable (marked flaky)
- Issue #184: SoilGrids provider intermittent failures (marked flaky)
Climate and elevation data are generally reliable. Marine and soil data quality varies.
Future Development
~~~~~~~~~~~~~~~~~~
These features are **planned but not yet implemented**:
- More submission-schema slots (Issue #193)
- Vegetation data from land cover (Issue #194)
- Flooding history (Issue #192)
- Batch processing for multiple biosamples
- Quality scores and confidence intervals
- Custom provider selection strategies beyond the built-in consensus strategies
See Also
--------
**For Advanced Users:**
- :doc:`api/services` - Low-level service APIs for more control
- :doc:`api/providers` - Individual provider documentation
- :doc:`provider_reliability` - Provider stability and quality metrics
**For Understanding the Code:**
- :doc:`api/api_index` - Complete API reference
- :doc:`architecture` - System design and data flow
**External Resources:**
- NMDC Submission Schema - Official schema documentation
- GOLD Biosample Fields - Related biosample metadata standard