Environmental Metadata
======================

.. currentmodule:: biosample_enricher.environmental_metadata

Get environmental metadata for geographic coordinates.

What It Does
------------

**Get environmental metadata values** for slots like ``annual_precpt``, ``annual_temp``, ``elev``, ``depth``, etc.

Give it GPS coordinates → Get back values with units and provenance.

Compatible with NMDC submission-schema and other applications.

Quick Example
-------------

.. code-block:: python

   from biosample_enricher.environmental_metadata import get_environmental_metadata

   # Get climate data for San Francisco
   result = get_environmental_metadata(
       lat=37.7749,
       lon=-122.4194,
       slots=["annual_precpt", "annual_temp"]
   )

   # Use the values in your NMDC submission
   print(result["values"])
   # {'annual_precpt': 519.3, 'annual_temp': 14.1}

   # Check which data sources were used
   print(result["metadata"]["climate_normals"]["providers_used"])
   # ['meteostat', 'nasa_power']

What Values Can You Get?
-------------------------

Currently Supported (✅ Ready to Use)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Climate Data** (30-year averages, no datetime needed):

.. list-table::
   :header-rows: 1
   :widths: 20 50 15 15

   * - Slot Name
     - Description
     - Units
     - Type
   * - ``annual_precpt``
     - Annual precipitation (30-year average)
     - millimeters/year
     - float
   * - ``annual_temp``
     - Annual temperature (30-year average)
     - degrees Celsius
     - float

**Providers**: meteostat, nasa_power (automatically queries both and averages results)

**Elevation Data** (no datetime needed):

.. list-table::
   :header-rows: 1
   :widths: 20 50 15 15

   * - Slot Name
     - Description
     - Units
     - Type
   * - ``elev``
     - Elevation above sea level
     - meters
     - float

**Providers**: USGS, Google Maps, Open Topo Data, OSM (tries multiple sources)

Partially Implemented (⚠️ Use with Caution)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Weather Data** (requires datetime - collection date/time):

.. list-table::
   :header-rows: 1
   :widths: 20 50 15 15

   * - Slot Name
     - Description
     - Units
     - Type
   * - ``temp``
     - Temperature at collection time
     - degrees Celsius
     - float
   * - ``air_temp``
     - Air temperature (alias for temp)
     - degrees Celsius
     - float
   * - ``humidity``
     - Relative humidity
     - g/m³
     - string
   * - ``wind_speed``
     - Wind speed
     - m/s
     - string
   * - ``wind_direction``
     - Wind direction
     - degrees
     - string
   * - ``solar_irradiance``
     - Solar radiation
     - W/m²
     - string

.. warning::
   Weather slots require ``datetime_obj`` parameter. Data availability depends on location and date.
   Not all slots may return values for all locations/times.

**Marine Data** (no datetime needed):

.. list-table::
   :header-rows: 1
   :widths: 20 50 15 15

   * - Slot Name
     - Description
     - Units
     - Type
   * - ``depth``
     - Water depth (negative for underwater)
     - meters
     - string

.. warning::
   Marine providers (GEBCO, ESA CCI, NOAA) are marked as unreliable in Issue #181.
   Data quality varies significantly by location.

**Soil Data** (no datetime needed):

.. list-table::
   :header-rows: 1
   :widths: 20 50 15 15

   * - Slot Name
     - Description
     - Units
     - Type
   * - ``ph``
     - Soil pH
     - pH units
     - float
   * - ``soil_type``
     - USDA soil texture class
     - text
     - string

.. warning::
   SoilGrids provider has intermittent failures (Issue #184).
   Success rate varies by location.

Not Yet Implemented (❌ Future Work)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These submission-schema slots are **not yet supported**:

- ``cur_vegetation`` - Current vegetation type (Issue #194)
- ``flooding`` - Flooding history (Issue #192)
- ``fire`` - Fire history
- ``extreme_event`` - Extreme weather events
- Many others...

See the `submission-schema documentation `_ for the complete list of slots.

.. seealso::
   Issue #193: Add submission-schema extraction helpers for more slots

Parameters
----------

.. function:: get_environmental_metadata(lat, lon, slots, datetime_obj=None, providers=None, strategy="mean")

   Get NMDC submission-schema values for specified slots.


   :param float lat: Latitude in decimal degrees (required)

       Valid range: -90 to 90

   :param float lon: Longitude in decimal degrees (required)

       Valid range: -180 to 180

   :param list[str] slots: Slot names to retrieve (required)

       Must be from supported slots listed above. Cannot be empty.

       **Examples**::

           # Climate only
           slots=["annual_precpt", "annual_temp"]

           # Mix climate and elevation
           slots=["annual_precpt", "elev"]

           # Weather (requires datetime_obj)
           slots=["temp", "humidity", "wind_speed"]

   :param datetime datetime_obj: Collection date/time (optional)

       **Required for weather slots** (temp, air_temp, humidity, etc.)

       Not used for climate, elevation, marine, or soil slots

       **Example**::

           from datetime import datetime
           datetime_obj=datetime(2023, 7, 15, 14, 30)

   :param list[str] providers: Specific providers to use (optional)

       If None (default), queries all available providers.

       **Valid providers by slot category:**

       - Climate slots: ``["meteostat", "nasa_power"]``
       - Elevation slots: ``["usgs", "google", "open_topo_data", "osm"]``

       **Examples**::

           # Use only meteostat for climate
           providers=["meteostat"]

           # Use only USGS for elevation
           providers=["usgs"]

   :param str strategy: How to combine values from multiple providers (optional)

       Default is ``"mean"``.

       **Valid values** (from ``CONSENSUS_STRATEGIES``):

       - ``"mean"``: Average across all successful providers (default, most reliable)
       - ``"median"``: Middle value when sorted (robust to outliers)
       - ``"first"``: Use first successful provider in priority order (fastest)
       - ``"best_quality"``: Use provider with best quality metric (closest station, highest resolution)

       See :ref:`consensus-strategies` for detailed descriptions and usage guidance.


       **Example**::

           # Use median to handle outliers
           result = get_environmental_metadata(
               lat=46.8523,
               lon=-121.7603,
               slots=["elev"],
               strategy="median"
           )

   :returns: Dictionary with two keys:

       - ``"values"``: Dict mapping slot names to submission-ready values

         - Values are in the correct units for submission-schema
         - Slots that failed to retrieve data are **omitted** (not None)
         - Types match submission-schema requirements (mostly float, some string)

       - ``"metadata"``: Dict with provider information for transparency

         - Shows which data sources contributed to each value
         - Includes provider-specific results for comparison
         - Lists any providers that failed with error messages

   :rtype: dict[str, Any]

   :raises ValueError: If latitude is outside -90 to 90
   :raises ValueError: If longitude is outside -180 to 180
   :raises ValueError: If slots list is empty
   :raises ValueError: If slots contains unsupported slot names
       (Error message will list all supported slots)
   :raises ValueError: If providers contains invalid provider names for requested slots
       (Error message will list valid providers)

Return Value Structure
----------------------

The function returns a dictionary with this structure:

.. code-block:: python

   {
       "values": {
           "annual_precpt": 519.3,  # float: mm/year
           "annual_temp": 14.1,     # float: °C
           "elev": 52.4             # float: meters
           # Missing/failed slots are omitted
       },
       "metadata": {
           "climate_normals": {  # Only present if climate slots requested
               "providers_used": ["meteostat", "nasa_power"],
               "consensus_strategy": "consensus",  # How values were combined
               "provider_results": {
                   "meteostat": {
                       "annual_precpt": 453.1,
                       "annual_temp": 14.2,
                       "period": "1991-2020",
                       "station_distance_km": 3.2
                   },
                   "nasa_power": {
                       "annual_precpt": 585.5,
                       "annual_temp": 14.0,
                       "period": "2001-2020"
                   }
               },
               "failed_providers": {}  # Dict of {provider: error_message}
           }
           # "weather", "elevation", "marine", "soil" metadata added as implemented
       }
   }

.. note::
   **Missing slots are omitted**, not set to None.
   Always check with ``if "slot_name" in result["values"]`` before accessing values.

Examples
--------

Basic Climate Data
~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from biosample_enricher.environmental_metadata import get_environmental_metadata

   # Get 30-year climate averages for a location
   result = get_environmental_metadata(
       lat=42.3601,  # Boston
       lon=-71.0589,
       slots=["annual_precpt", "annual_temp"]
   )

   # Use the values
   precip = result["values"]["annual_precpt"]  # 1090.2 mm/year
   temp = result["values"]["annual_temp"]      # 10.8 °C

   # Check data quality by comparing providers
   providers = result["metadata"]["climate_normals"]["provider_results"]
   for name, data in providers.items():
       print(f"{name}: {data['annual_precpt']:.1f} mm/year")
   # meteostat: 1089.3 mm/year
   # nasa_power: 1091.1 mm/year

Mixing Multiple Slot Types
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Get climate + elevation in one call
   result = get_environmental_metadata(
       lat=40.7128,  # New York City
       lon=-74.0060,
       slots=["annual_precpt", "annual_temp", "elev"]
   )

   values = result["values"]
   print(f"Elevation: {values['elev']} m")
   print(f"Annual rain: {values['annual_precpt']} mm/year")
   print(f"Annual temp: {values['annual_temp']} °C")

Using Specific Providers
~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Use only meteostat for climate (not NASA POWER)
   result = get_environmental_metadata(
       lat=51.5074,  # London
       lon=-0.1278,
       slots=["annual_precpt", "annual_temp"],
       providers=["meteostat"]  # Only use this provider
   )

   metadata = result["metadata"]["climate_normals"]
   print(metadata["providers_used"])  # ['meteostat']

Weather Data (Requires Datetime)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from datetime import datetime

   # Get weather at sample collection time
   result = get_environmental_metadata(
       lat=34.0522,  # Los Angeles
       lon=-118.2437,
       slots=["temp", "humidity", "wind_speed"],
       datetime_obj=datetime(2023, 7, 15, 14, 30)  # Required!
   )

   # Check what data was available
   if "temp" in result["values"]:
       print(f"Temperature: {result['values']['temp']} °C")
   else:
       print("Temperature data not available for this location/time")

Error Handling
~~~~~~~~~~~~~~

.. code-block:: python

   # Handle invalid inputs gracefully
   try:
       result = get_environmental_metadata(
           lat=37.7749,
           lon=-122.4194,
           slots=["annual_precpt", "invalid_slot_name"]
       )
   except ValueError as e:
       print(f"Error: {e}")
       # Error: Unsupported slot(s): ['invalid_slot_name'].
       # Supported slots: ['air_temp', 'annual_precpt', 'annual_temp', ...]

   # Check for missing data
   result = get_environmental_metadata(
       lat=37.7749,
       lon=-122.4194,
       slots=["annual_precpt", "depth"]  # depth may not be available on land
   )

   if "annual_precpt" in result["values"]:
       print(f"Got precipitation: {result['values']['annual_precpt']}")

   if "depth" not in result["values"]:
       print("Depth data not available (probably on land)")

.. _quick-reference:

Quick Reference
---------------

All Constants at a Glance
~~~~~~~~~~~~~~~~~~~~~~~~~

These constants are available for programmatic access:

.. code-block:: python

   from biosample_enricher.environmental_metadata import (
       # Slot categories
       ALL_SUPPORTED_SLOTS,   # All slots combined
       CLIMATE_SLOTS,         # {'annual_precpt', 'annual_temp'}
       WEATHER_SLOTS,         # {'temp', 'air_temp', 'humidity', 'wind_speed', 'wind_direction', 'solar_irradiance'}
       ELEVATION_SLOTS,       # {'elev'}
       MARINE_SLOTS,          # {'depth'}
       SOIL_SLOTS,            # {'ph', 'soil_type'}

       # Provider names
       CLIMATE_PROVIDERS,     # {'meteostat', 'nasa_power'}
       ELEVATION_PROVIDERS,   # {'usgs', 'google', 'open_topo_data', 'osm'}

       # Consensus strategies
       CONSENSUS_STRATEGIES,  # {'mean', 'median', 'first', 'best_quality'}
   )

Slots by Category
~~~~~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1
   :widths: 20 40 20 20

   * - Category
     - Slots
     - Providers
     - Datetime Required?
   * - Climate
     - ``annual_precpt``, ``annual_temp``
     - meteostat, nasa_power
     - No
   * - Weather
     - ``temp``, ``air_temp``, ``humidity``, ``wind_speed``, ``wind_direction``, ``solar_irradiance``
     - meteostat, open_meteo
     - **Yes**
   * - Elevation
     - ``elev``
     - usgs, google, open_topo_data, osm
     - No
   * - Marine
     - ``depth``
     - gebco, noaa
     - No
   * - Soil
     - ``ph``, ``soil_type``
     - soilgrids, usda_nrcs
     - No

.. _consensus-strategies:

Consensus Strategies
~~~~~~~~~~~~~~~~~~~~

When multiple providers return data, values are combined using a consensus strategy:

.. list-table::
   :header-rows: 1
   :widths: 15 60 25

   * - Strategy
     - Description
     - When to Use
   * - ``mean``
     - Arithmetic average across all providers (default)
     - General use, most reliable
   * - ``median``
     - Middle value when sorted
     - When outliers are possible
   * - ``first``
     - Use first successful provider
     - When speed matters
   * - ``best_quality``
     - Use provider with best quality metric
     - Advanced use with quality scores

Slot Status (Reliability)
~~~~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1
   :widths: 15 35 50

   * - Status
     - Slots
     - Notes
   * - **Ready**
     - ``annual_precpt``, ``annual_temp``, ``elev``
     - Production-ready, reliable data
   * - **Caution**
     - ``temp``, ``ph``, ``depth``
     - May have gaps or provider issues
   * - **Experimental**
     - ``air_temp``, ``humidity``, ``wind_speed``, ``wind_direction``, ``solar_irradiance``, ``soil_type``
     - Limited testing, may change

Copy-Paste Slot Lists
~~~~~~~~~~~~~~~~~~~~~

For convenience, here are the slot names ready to copy:

**All slots (comma-separated)**::

    annual_precpt, annual_temp, elev, temp, air_temp, humidity, wind_speed, wind_direction, solar_irradiance, depth, ph, soil_type

**Production-ready slots only**::

    annual_precpt, annual_temp, elev

**Climate + Elevation (most common)**::

    annual_precpt, annual_temp, elev

Checking Available Slots and Providers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from biosample_enricher.environmental_metadata import (
       ALL_SUPPORTED_SLOTS,
       CLIMATE_SLOTS,
       WEATHER_SLOTS,
       ELEVATION_SLOTS,
       MARINE_SLOTS,
       SOIL_SLOTS,
       CLIMATE_PROVIDERS,
       ELEVATION_PROVIDERS,
       CONSENSUS_STRATEGIES,
   )

   # See all supported slots
   print("All slots:", sorted(ALL_SUPPORTED_SLOTS))

   # See slots by category
   print("Climate:", sorted(CLIMATE_SLOTS))
   print("Weather:", sorted(WEATHER_SLOTS))
   print("Elevation:", sorted(ELEVATION_SLOTS))
   print("Marine:", sorted(MARINE_SLOTS))
   print("Soil:", sorted(SOIL_SLOTS))

   # See available providers
   print("Climate providers:", sorted(CLIMATE_PROVIDERS))
   print("Elevation providers:", sorted(ELEVATION_PROVIDERS))

   # See consensus strategies
   print("Strategies:", sorted(CONSENSUS_STRATEGIES))

Limitations and Known Issues
-----------------------------

Current Limitations
~~~~~~~~~~~~~~~~~~~

1. **Limited slot coverage**: Only the 12 slots listed above, out of 200+ submission-schema slots, are supported
2. **No bulk operations**: Must call once per biosample (no batch processing yet)
3. **Weather data gaps**: Historical weather not available for all locations/times
4. **Provider reliability**: Some providers have intermittent failures (see issues below)
5. **No caching control**: Cannot disable or clear HTTP cache from this function

Known Issues
~~~~~~~~~~~~

.. warning::
   **Provider Reliability Issues** - Please review these before relying on data:

   - Issue #181: Marine providers (GEBCO, ESA CCI, NOAA) incomplete/unreliable
   - Issue #182: MODIS vegetation provider uses mock data only
   - Issue #183: USGS elevation provider unreliable (marked flaky)
   - Issue #184: SoilGrids provider intermittent failures (marked flaky)

   Climate and elevation data are generally reliable. Marine and soil data quality varies.
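
Because there is no batch API yet and some providers fail intermittently, callers currently have to loop over biosamples themselves. The following is a minimal sketch of one way to do that, not part of the library: ``enrich_many``, the ``id``/``lat``/``lon`` sample keys, and ``fake_fetch`` are hypothetical names introduced for illustration. It also assumes that every metadata category carries a ``failed_providers`` dict, which this page only documents for ``climate_normals``.

.. code-block:: python

   from typing import Any, Callable, Iterable


   def enrich_many(
       samples: Iterable[dict[str, Any]],
       slots: list[str],
       fetch: Callable[..., dict[str, Any]],
   ) -> dict[str, dict[str, Any]]:
       """Call ``fetch`` once per biosample and summarize gaps (hypothetical helper)."""
       report: dict[str, dict[str, Any]] = {}
       for sample in samples:
           result = fetch(lat=sample["lat"], lon=sample["lon"], slots=slots)
           values = result["values"]
           # Failed slots are omitted from "values", so test membership.
           missing = [slot for slot in slots if slot not in values]
           # Assumption: each metadata category may expose "failed_providers".
           failed: dict[str, str] = {}
           for category in result.get("metadata", {}).values():
               failed.update(category.get("failed_providers", {}))
           report[sample["id"]] = {
               "values": values,
               "missing": missing,
               "failed_providers": failed,
           }
       return report


   def fake_fetch(lat: float, lon: float, slots: list[str]) -> dict[str, Any]:
       """Offline stand-in with the documented return shape (illustration only)."""
       ok = lat > 0
       return {
           "values": {"annual_temp": 14.1} if ok else {},
           "metadata": {
               "climate_normals": {
                   "failed_providers": {} if ok else {"meteostat": "timeout"},
               }
           },
       }


   samples = [
       {"id": "s1", "lat": 37.7749, "lon": -122.4194},
       {"id": "s2", "lat": -33.9, "lon": 18.4},
   ]
   report = enrich_many(samples, ["annual_temp", "elev"], fake_fetch)
   print(report["s1"]["missing"])           # ['elev']
   print(report["s2"]["failed_providers"])  # {'meteostat': 'timeout'}

In real use, ``get_environmental_metadata`` would be passed as ``fetch`` in place of the stand-in. Taking the fetch function as a parameter keeps the loop testable without network access.
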

Future Development
~~~~~~~~~~~~~~~~~~

These features are **planned but not yet implemented**:

- More submission-schema slots (Issue #193)
- Vegetation data from land cover (Issue #194)
- Flooding history (Issue #192)
- Batch processing for multiple biosamples
- Quality scores and confidence intervals
- Custom provider selection strategies beyond "consensus"

See Also
--------

**For Advanced Users:**

- :doc:`api/services` - Low-level service APIs for more control
- :doc:`api/providers` - Individual provider documentation
- :doc:`provider_reliability` - Provider stability and quality metrics

**For Understanding the Code:**

- :doc:`api/api_index` - Complete API reference
- :doc:`architecture` - System design and data flow

**External Resources:**

- `NMDC Submission Schema `_ - Official schema documentation
- `GOLD Biosample Fields `_ - Related biosample metadata standard