Providers

Comprehensive documentation about all data providers available in biosample-enricher.

Table of Contents

  1. Provider Overview

  2. Comparison by Domain

  3. Detailed Provider Profiles


Provider Overview

| Provider | Domain | Coverage | API Key | Cost | Stability |
| --- | --- | --- | --- | --- | --- |
| Google Elevation API | Elevation | Global | Required | paid | HIGH |
| Open Topo Data | Elevation | Global | No | free | HIGH |
| OSM Elevation (open-elevation.com) | Elevation | Global | No | free | HIGH |
| USGS 3DEP Elevation | Elevation | Global (best in USA) | No | free | LOW |
| Google Geocoding (Forward) | Geocoding | Global | Required | paid | HIGH |
| Google Reverse Geocoding | Geocoding | Global | Required | paid | HIGH |
| OSM Nominatim (Forward) | Geocoding | Global | No | free | HIGH |
| OSM Nominatim (Reverse) | Geocoding | Global | No | free | HIGH |
| ESA WorldCover | Land | Global | No | free | HIGH |
| MODIS Vegetation Indices | Land | Global | Required | free | LOW |
| USGS NLCD | Land | USA only | No | free | HIGH |
| ESA Ocean Colour CCI | Marine | Global oceans | No | free | MODERATE |
| GEBCO Bathymetry | Marine | Global oceans | No | free | MODERATE |
| NOAA OISST | Marine | Global oceans | No | free | MODERATE |
| ISRIC SoilGrids | Soil | Global | No | free | HIGH |
| USDA NRCS Web Soil Survey | Soil | USA only | No | free | HIGH |
| Meteostat | Weather | Global (120,000+ stations) | No | free | HIGH |
| NASA POWER | Weather | Global | No | free | HIGH |
| Open-Meteo | Weather | Global | No | free | HIGH |

Comparison by Domain

Elevation

| Provider | Resolution | Coverage | Data Quality | Best For |
| --- | --- | --- | --- | --- |
| Google Elevation API | 30m (varies by region) | Global | ground_truth | Production systems with API budget |
| Open Topo Data | 250m-1km (dataset dependent) | Global | satellite | Development and testing |
| OSM Elevation (open-elevation.com) | 90m (SRTM-based) | Global | satellite | Free alternative with decent resolution |
| USGS 3DEP Elevation | 10-30m (varies by region) | Global (best in USA) | ground_truth | US locations when available |

Geocoding

| Provider | Coverage | API Key | Cost | Best For |
| --- | --- | --- | --- | --- |
| Google Geocoding (Forward) | Global | Required | paid | Production with budget |
| Google Reverse Geocoding | Global | Required | paid | Production with budget |
| OSM Nominatim (Forward) | Global | No | free | Free geocoding |
| OSM Nominatim (Reverse) | Global | No | free | Free reverse geocoding |

Land

| Provider | Coverage | Resolution | Data Type | Best For |
| --- | --- | --- | --- | --- |
| ESA WorldCover | Global | 10m | Sentinel-1 & Sentinel-2 | Global land cover classification |
| MODIS Vegetation Indices | Global | 250-500m | MODIS satellite | Future implementation |
| USGS NLCD | USA only | 30m | Landsat satellite classification | USA land cover classification |

Marine

| Provider | Coverage | Resolution | Data Type | Best For |
| --- | --- | --- | --- | --- |
| ESA Ocean Colour CCI | Global oceans | 1km | Satellite ocean color | Marine biogeochemistry when available |
| GEBCO Bathymetry | Global oceans | 15 arc-seconds (~450m) | Compiled bathymetric surveys | Ocean depth estimates when working |
| NOAA OISST | Global oceans | 0.25 degrees (~25km) | Optimally Interpolated SST | Sea surface temperature when available |

Soil

| Provider | Coverage | Resolution | Depths | Best For |
| --- | --- | --- | --- | --- |
| ISRIC SoilGrids | Global | 250m | Multiple | Global soil property estimates |
| USDA NRCS Web Soil Survey | USA only | Polygon-based (variable) | Multiple | USA locations requiring high accuracy |

Weather

| Provider | Resolution | Coverage | Data Quality | Best For |
| --- | --- | --- | --- | --- |
| Meteostat | Station-based (point measurements) | Global (120,000+ stations) | ground_truth | Urban/suburban locations with dense station coverage |
| NASA POWER | 0.5° x 0.625° (~50-60km grid) | Global | satellite_reanalysis | Remote locations far from weather stations |
| Open-Meteo | 11km (ERA5-Land) | Global | satellite_reanalysis | Day-specific weather (collection date) |

Detailed Provider Profiles

Below are comprehensive profiles for each provider, including technical specifications, reliability information, and use case recommendations.


Google Elevation API

Google Earth elevation database

Quick Facts

  • API Type: REST

  • Endpoint: https://maps.googleapis.com/maps/api/elevation/json

  • Authentication: api_key_required

  • API Key: GOOGLE_MAIN_API_KEY

  • Coverage: Global

  • Resolution: 30m (varies by region)

Reliability

  • Stability: HIGH

  • Data Quality: ground_truth

  • Uptime: Excellent (major provider)

Cost

  • Pricing Model: paid

  • Free Tier: No free tier

  • Quotas: Based on billing account

Strengths & Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| ✓ Comprehensive global coverage | ✗ Requires paid API key (no free tier) |
| ✓ Accurate rooftop-level elevation data | ✗ Cost accumulates with high-volume use |
| ✓ Robust error handling with detailed status codes | ✗ Quota exhaustion possible (OVER_QUERY_LIMIT) |
| ✓ Well-documented API | ✗ No fallback mechanisms if quota exceeded |
| ✓ High reliability and uptime | |

Use Cases

Best For:

  • Production systems with API budget

  • High-accuracy requirements

  • Urban/suburban locations

Not Suitable For:

  • High-volume batch processing without budget

  • Development/testing without API key

Complements:

  • Open Topo Data (free fallback)

  • USGS 3DEP (US-specific validation)

NMDC Integration

  • Schema Slots: elev

  • Role: primary_if_key_available

  • Excellent For: urban, suburban, developed_areas

API Documentation: https://maps.googleapis.com/maps/api/elevation/json
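The status-code handling mentioned in the strengths and weaknesses above can be illustrated with a small response parser. This is a minimal sketch, not the package's implementation: `parse_elevation_response` and `ElevationAPIError` are hypothetical names, and the sample payload is illustrative data shaped like the JSON the Elevation API returns.

```python
from typing import Any, Optional


class ElevationAPIError(RuntimeError):
    """Hypothetical error type for a non-OK API status."""


def parse_elevation_response(payload: dict[str, Any]) -> Optional[float]:
    """Return the first elevation in metres, or None if no result is present."""
    status = payload.get("status")
    if status == "OK":
        results = payload.get("results") or []
        return float(results[0]["elevation"]) if results else None
    if status == "OVER_QUERY_LIMIT":
        # Quota exhausted: the caller should switch to a fallback provider.
        raise ElevationAPIError("quota exhausted (OVER_QUERY_LIMIT)")
    raise ElevationAPIError(f"elevation request failed with status {status!r}")


# Illustrative payload (values are made up for the example).
sample = {
    "results": [
        {
            "elevation": 1608.6,
            "location": {"lat": 39.74, "lng": -104.98},
            "resolution": 4.77,
        }
    ],
    "status": "OK",
}
print(parse_elevation_response(sample))
```

Raising a dedicated exception on `OVER_QUERY_LIMIT` gives calling code one place to catch quota failures and fall back to a free provider such as Open Topo Data.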


Open Topo Data

ASTER GDEM, SRTM, ETOPO1

Quick Facts

  • API Type: REST

  • Endpoint: https://api.opentopodata.org/v1/aster30m

  • Authentication: none

  • Coverage: Global

  • Resolution: 250m-1km (dataset dependent)

Reliability

  • Stability: HIGH

  • Data Quality: satellite

  • Uptime: Good (community-maintained)

Cost

  • Pricing Model: free

  • Free Tier: 1000 requests/day

  • Quotas: 100/min, 1000/day

Strengths & Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| ✓ Free access, no API key | ✗ Rate limited (100/min, 1000/day) |
| ✓ Global coverage | ✗ Coarser resolution than Google (250m-1km) |
| ✓ Multiple dataset options (ASTER, SRTM, ETOPO1) | ✗ Community-maintained (not enterprise SLA) |
| ✓ Stable service | |
| ✓ Good documentation | |

Use Cases

Best For:

  • Development and testing

  • Batch processing within rate limits

  • Budget-constrained projects

Not Suitable For:

  • Very high-volume applications (>1000/day)

  • Sub-100m precision requirements

Complements:

  • Google Elevation (for higher accuracy)

NMDC Integration

  • Schema Slots: elev

  • Role: primary_free_option

  • Excellent For: global

  • Poor For: requires_sub_100m_accuracy

API Documentation: https://api.opentopodata.org/v1/aster30m
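Because the daily quota is small, batch jobs benefit from packing several coordinates into each request. The sketch below builds batched request URLs using the `locations=lat,lon|lat,lon` query parameter. The per-request batch size of 100 is an assumption to verify against the service's current limits, and both function names are hypothetical.

```python
from typing import Iterator, Sequence

BASE_URL = "https://api.opentopodata.org/v1/aster30m"  # endpoint from the profile above
MAX_LOCATIONS_PER_REQUEST = 100  # assumption: verify against the service's limits
DAILY_REQUEST_QUOTA = 1000       # from the Quotas line above


def build_batch_urls(
    points: Sequence[tuple[float, float]],
    batch_size: int = MAX_LOCATIONS_PER_REQUEST,
) -> Iterator[str]:
    """Yield one request URL per batch of (lat, lon) points."""
    for start in range(0, len(points), batch_size):
        chunk = points[start:start + batch_size]
        locations = "|".join(f"{lat},{lon}" for lat, lon in chunk)
        yield f"{BASE_URL}?locations={locations}"


def fits_daily_quota(n_points: int, batch_size: int = MAX_LOCATIONS_PER_REQUEST) -> bool:
    """Check whether n_points can be resolved within one day's request quota."""
    n_requests = -(-n_points // batch_size)  # ceiling division
    return n_requests <= DAILY_REQUEST_QUOTA


urls = list(build_batch_urls([(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)], batch_size=2))
print(urls)
```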


OSM Elevation (open-elevation.com)

SRTM, ASTER GDEM

Quick Facts

  • API Type: REST

  • Endpoint: https://api.open-elevation.com/api/v1/lookup

  • Authentication: none

  • Coverage: Global

  • Resolution: 90m (SRTM-based)

Reliability

  • Stability: HIGH

  • Data Quality: satellite

  • Uptime: Good

Cost

  • Pricing Model: free

  • Free Tier: Unlimited (fair use)

  • Quotas: None documented

Strengths & Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| ✓ Free, no API key | ✗ Less mature than other services |
| ✓ Global coverage | ✗ Limited documentation |
| ✓ 90m resolution (better than Open Topo Data) | ✗ Unknown reliability guarantees |
| ✓ No documented rate limits | |

Use Cases

Best For:

  • Free alternative with decent resolution

  • Development/testing

Not Suitable For:

  • Production systems requiring SLA

Complements:

  • Other free elevation providers

NMDC Integration

  • Schema Slots: elev

  • Role: fallback_free_option

  • Excellent For: global

API Documentation: https://api.open-elevation.com/api/v1/lookup


USGS 3DEP Elevation

3D Elevation Program

Quick Facts

  • API Type: ArcGIS_REST

  • Endpoint: https://elevation.nationalmap.gov/arcgis/rest/services/3DEPElevation/ImageServer/getSamples

  • Authentication: none

  • Coverage: Global (best in USA)

  • Resolution: 10-30m (varies by region)

Reliability

  • Stability: LOW

  • Data Quality: ground_truth

  • Uptime: Unreliable - multiple migrations

  • Known Issues:

    • Service has migrated multiple times (EPQS → 3DEP)

    • Endpoint URLs change without notice

    • No-data sentinel values (-1000000, -9999) complicate parsing

    • Intermittent availability

Cost

  • Pricing Model: free

  • Free Tier: Unlimited

  • Quotas: None documented

Strengths & Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| ✓ Free access, no API key required | ✗ ⚠️ KNOWN MIGRATION ISSUES - service frequently changes |
| ✓ High resolution data in USA (10m) | ✗ Unreliable availability |
| ✓ Proper vertical datum (NAVD88) | ✗ Complex no-data handling required |
| ✓ Government-maintained dataset | ✗ Endpoint may change without warning |
| | ✗ Limited documentation on current API |

Use Cases

Best For:

  • US locations when available

  • Development/testing (free)

Not Suitable For:

  • Production systems requiring high reliability

  • International locations (lower priority/quality)

  • Time-critical applications

Complements:

  • Should be used WITH fallback providers

NMDC Integration

  • Schema Slots: elev

  • Role: fallback_with_caution

  • Excellent For: usa_conus

  • Poor For: international, oceans

API Documentation: https://elevation.nationalmap.gov/arcgis/rest/services/3DEPElevation/ImageServer/getSamples
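The no-data sentinels listed under Known Issues need to be filtered before a value is stored as an elevation. A minimal sketch of that normalization follows; `clean_elevation` is a hypothetical helper, and the sentinel list contains only the two values named above.

```python
import math
from typing import Any, Optional

# Sentinel values named in the Known Issues list above.
NO_DATA_SENTINELS = (-1000000.0, -9999.0)


def clean_elevation(raw: Any) -> Optional[float]:
    """Convert a raw sample value to metres, or None if it signals no data."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None  # e.g. None, empty string, or a textual marker
    if math.isnan(value):
        return None
    if any(math.isclose(value, sentinel) for sentinel in NO_DATA_SENTINELS):
        return None
    return value


print(clean_elevation("1234.5"))
print(clean_elevation("-1000000"))
```

Returning `None` rather than raising lets a multi-provider pipeline treat a sentinel the same way as a missing response and move on to the next fallback.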


Google Geocoding (Forward)

Google Maps database

Quick Facts

  • API Type: REST

  • Endpoint: https://maps.googleapis.com/maps/api/geocode/json

  • Authentication: api_key_required

  • API Key: GOOGLE_MAIN_API_KEY

  • Coverage: Global

  • Resolution: Address-level precision

Reliability

  • Stability: HIGH

  • Data Quality: high

  • Uptime: Excellent

Cost

  • Pricing Model: paid

  • Free Tier: No

  • Quotas: Based on billing

Strengths & Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| ✓ High accuracy | ✗ Requires paid API key |
| ✓ Global coverage | ✗ Cost per request |
| ✓ Excellent address parsing | |
| ✓ Robust error handling | |

Use Cases

Best For:

  • Production with budget

  • High accuracy needs

Not Suitable For:

  • High-volume without budget

Complements:

  • OSM Nominatim (free fallback)

NMDC Integration

  • Schema Slots: lat_lon

  • Role: primary_if_key_available

  • Excellent For: global

API Documentation: https://maps.googleapis.com/maps/api/geocode/json


Google Reverse Geocoding

Google Maps database

Quick Facts

  • API Type: REST

  • Endpoint: https://maps.googleapis.com/maps/api/geocode/json

  • Authentication: api_key_required

  • API Key: GOOGLE_MAIN_API_KEY

  • Coverage: Global

  • Resolution: Address-level precision

Reliability

  • Stability: HIGH

  • Data Quality: high

  • Uptime: Excellent

Cost

  • Pricing Model: paid

  • Free Tier: No

  • Quotas: Based on billing

Strengths & Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| ✓ High accuracy | ✗ Requires paid API key |
| ✓ Detailed address components | |
| ✓ Global coverage | |

Use Cases

Best For:

  • Production with budget

Not Suitable For:

  • High-volume without budget

Complements:

  • OSM Nominatim

NMDC Integration

  • Schema Slots: geo_loc_name

  • Role: primary_if_key_available

  • Excellent For: global

API Documentation: https://maps.googleapis.com/maps/api/geocode/json


OSM Nominatim (Forward)

OpenStreetMap database

Quick Facts

  • API Type: REST

  • Endpoint: https://nominatim.openstreetmap.org/search

  • Authentication: none

  • Coverage: Global

  • Resolution: Address-level precision

Reliability

  • Stability: HIGH

  • Data Quality: community_maintained

  • Uptime: Good

  • Known Issues:

    • Rate limited to 1 request/second

    • Requires User-Agent header

Cost

  • Pricing Model: free

  • Free Tier: Unlimited (fair use)

  • Quotas: 1 request/second

Strengths & Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| ✓ Free access | ✗ Rate limited (1/second) |
| ✓ Global coverage | ✗ Variable accuracy |
| ✓ Community-maintained data | ✗ Requires User-Agent |

Use Cases

Best For:

  • Free geocoding

  • Development/testing

Not Suitable For:

  • High-volume batch (>1/second)

Complements:

  • Google Geocoding

NMDC Integration

  • Schema Slots: lat_lon

  • Role: primary_free_option

  • Excellent For: global

API Documentation: https://nominatim.openstreetmap.org/search
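The two constraints listed under Known Issues (at most one request per second, and a mandatory User-Agent header) can be handled with a small throttle and a request builder. This is a sketch using only the standard library. `MinIntervalThrottle` and `build_search_request` are hypothetical names, and the User-Agent string is a placeholder to replace with one that identifies your application.

```python
import time
import urllib.parse
import urllib.request
from typing import Callable, Optional


class MinIntervalThrottle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(
        self,
        min_interval_s: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_interval = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Sleep if needed and return the delay that was applied."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


def build_search_request(query: str, user_agent: str) -> urllib.request.Request:
    """Build (but do not send) a forward-geocoding request."""
    params = urllib.parse.urlencode({"q": query, "format": "jsonv2", "limit": 1})
    url = f"https://nominatim.openstreetmap.org/search?{params}"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_search_request("Berkeley, CA", "example-app/0.1 (contact@example.org)")
print(request.full_url)
```

Calling `throttle.wait()` immediately before each `urllib.request.urlopen(request)` keeps a batch job within the one-request-per-second limit. The injectable `clock` and `sleep` parameters exist so the throttle can be tested without real delays.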


OSM Nominatim (Reverse)

OpenStreetMap database

Quick Facts

  • API Type: REST

  • Endpoint: https://nominatim.openstreetmap.org/reverse

  • Authentication: none

  • Coverage: Global

  • Resolution: Address-level precision

Reliability

  • Stability: HIGH

  • Data Quality: community_maintained

  • Uptime: Good

  • Known Issues:

    • Rate limited to 1 request/second

Cost

  • Pricing Model: free

  • Free Tier: Unlimited (fair use)

  • Quotas: 1 request/second

Strengths & Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| ✓ Free access | ✗ Rate limited |
| ✓ Global coverage | ✗ Variable accuracy |

Use Cases

Best For:

  • Free reverse geocoding

Not Suitable For:

  • High-volume batch

Complements:

  • Google Reverse Geocoding

NMDC Integration

  • Schema Slots: geo_loc_name

  • Role: primary_free_option

  • Excellent For: global

API Documentation: https://nominatim.openstreetmap.org/reverse


ESA WorldCover

Sentinel-1 & Sentinel-2

Quick Facts

  • API Type: WMS

  • Endpoint: https://services.terrascope.be/wms/v2

  • Authentication: none

  • Coverage: Global

  • Resolution: 10m

  • Temporal: 2020, 2021

Reliability

  • Stability: HIGH

  • Data Quality: satellite_classified

  • Uptime: Good (ESA service)

Cost

  • Pricing Model: free

  • Free Tier: Unlimited

  • Quotas: None documented

Strengths & Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| ✓ Global coverage | ✗ Limited temporal coverage (only 2020, 2021) |
| ✓ High resolution (10m - best available) | ✗ Simplified classification scheme |
| ✓ Recent data (2020, 2021) | |
| ✓ Free access | |
| ✓ Sentinel satellite quality | |

Use Cases

Best For:

  • Global land cover classification

  • International locations

  • Recent land cover needed

Not Suitable For:

  • Historical land cover (pre-2020)

  • Detailed US classification (use NLCD)

Complements:

  • NLCD (for USA detail)

NMDC Integration

  • Schema Slots: cur_land_use

  • Role: primary_global

  • Excellent For: global, international

API Documentation: https://services.terrascope.be/wms/v2


MODIS Vegetation Indices

MODIS satellite

Quick Facts

  • API Type: APPEEARS

  • Endpoint: https://appeears.earthdatacloud.nasa.gov/api/

  • Authentication: earthdata_login

  • Coverage: Global

  • Resolution: 250-500m

  • Temporal: 2000-present

Reliability

  • Stability: LOW

  • Data Quality: satellite

  • Uptime: Unknown

  • Known Issues:

    • ⚠️ MOCK IMPLEMENTATION ONLY

    • Not fully implemented

    • Requires NASA Earthdata authentication

Cost

  • Pricing Model: free

  • Free Tier: Unlimited (with NASA account)

  • Quotas: Unknown

Strengths & Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| ✓ Global NDVI/EVI data | ✗ ⚠️ NOT IMPLEMENTED - mock only |
| ✓ Long temporal coverage | ✗ Requires authentication setup |
| ✓ Free with NASA account | ✗ Complex API |

Use Cases

Best For:

  • Future implementation

Not Suitable For:

  • Current use (not implemented)

Complements:

  • N/A

NMDC Integration

  • Schema Slots: ndvi, evi

  • Role: not_implemented

API Documentation: https://appeears.earthdatacloud.nasa.gov/api/


USGS NLCD

Landsat satellite classification

Quick Facts

  • API Type: WMS

  • Endpoint: https://www.mrlc.gov/geoserver/mrlc_display/NLCD_*/wms

  • Authentication: none

  • Coverage: USA only

  • Resolution: 30m

  • Temporal: Multiple years (2001, 2004, 2006, 2008, 2011, 2013, 2016, 2019)

Reliability

  • Stability: HIGH

  • Data Quality: satellite_classified

  • Uptime: Good (USGS service)

Cost

  • Pricing Model: free

  • Free Tier: Unlimited

Strengths & Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| ✓ High resolution (30m) | ✗ USA only coverage |
| ✓ USA-specific classification scheme | ✗ Limited to available years |
| ✓ Multiple time periods | |
| ✓ Free access | |

Use Cases

Best For:

  • USA land cover classification

  • Temporal land use change studies

Not Suitable For:

  • International locations

Complements:

  • ESA WorldCover (global)

NMDC Integration

  • Schema Slots: cur_land_use

  • Role: primary_for_usa

  • Excellent For: usa

  • Poor For: international

API Documentation: https://www.mrlc.gov/geoserver/mrlc_display/NLCD_*/wms
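Since NLCD is only published for the years listed under Temporal, a sample's collection year has to be mapped to one of those releases. One possible rule, shown here purely as an illustration, is to pick the closest available year and break ties toward the earlier release. `nearest_nlcd_year` is a hypothetical helper, and the tie-breaking rule is an assumption rather than documented package behaviour.

```python
# Release years from the Temporal line above.
NLCD_YEARS = (2001, 2004, 2006, 2008, 2011, 2013, 2016, 2019)


def nearest_nlcd_year(collection_year: int) -> int:
    """Return the NLCD release year closest to the collection year.

    Ties are broken toward the earlier release (an assumption of this sketch).
    """
    return min(NLCD_YEARS, key=lambda year: (abs(year - collection_year), year))


for year in (1995, 2005, 2010, 2025):
    print(year, "->", nearest_nlcd_year(year))
```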


ESA Ocean Colour CCI

Satellite ocean color

Quick Facts

  • API Type: ERDDAP

  • Endpoint: https://www.oceancolour.org/erddap/

  • Authentication: none

  • Coverage: Global oceans

  • Resolution: 1km

  • Temporal: 1997-present

Reliability

  • Stability: MODERATE

  • Data Quality: satellite

  • Uptime: Fair

  • Known Issues:

    • ⚠️ Implementation incomplete

    • Complex ERDDAP API

Cost

  • Pricing Model: free

  • Free Tier: Unlimited

  • Quotas: None documented

Strengths & Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| ✓ Global ocean color data | ✗ ⚠️ Incomplete implementation |
| ✓ Long temporal coverage | ✗ Complex ERDDAP queries |
| ✓ Free access | ✗ Limited documentation |

Use Cases

Best For:

  • Marine biogeochemistry when available

Not Suitable For:

  • Production use (incomplete)

Complements:

  • Other marine providers

NMDC Integration

  • Schema Slots: chlorophyll

  • Role: experimental

  • Excellent For: oceans

API Documentation: https://www.oceancolour.org/erddap/


GEBCO Bathymetry

Compiled bathymetric surveys

Quick Facts

  • API Type: WCS

  • Endpoint: https://www.gebco.net/data_and_products/gebco_web_services/web_map_service/

  • Authentication: none

  • Coverage: Global oceans

  • Resolution: 15 arc-seconds (~450m)

Reliability

  • Stability: MODERATE

  • Data Quality: survey_compilation

  • Uptime: Fair

  • Known Issues:

    • ⚠️ WCS implementation incomplete

    • Service reliability issues

Cost

  • Pricing Model: free

  • Free Tier: Unlimited

Strengths & Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| ✓ Global ocean coverage | ✗ ⚠️ Incomplete WCS implementation |
| ✓ High resolution (15 arc-seconds) | ✗ Service stability concerns |
| ✓ Free access | ✗ Limited error handling |

Use Cases

Best For:

  • Ocean depth estimates when working

Not Suitable For:

  • Production systems requiring reliability

Complements:

  • Other bathymetry providers needed

NMDC Integration

  • Schema Slots: depth

  • Role: experimental

  • Excellent For: oceans

API Documentation: https://www.gebco.net/data_and_products/gebco_web_services/web_map_service/


NOAA OISST

Optimally Interpolated SST

Quick Facts

  • API Type: ERDDAP

  • Endpoint: https://coastwatch.pfeg.noaa.gov/erddap/

  • Authentication: none

  • Coverage: Global oceans

  • Resolution: 0.25 degrees (~25km)

  • Temporal: 1981-present

Reliability

  • Stability: MODERATE

  • Data Quality: satellite_interpolated

  • Uptime: Fair

  • Known Issues:

    • ⚠️ Implementation incomplete

Cost

  • Pricing Model: free

  • Free Tier: Unlimited

  • Quotas: None documented

Strengths & Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| ✓ Global sea surface temperature | ✗ ⚠️ Incomplete implementation |
| ✓ Long temporal coverage | ✗ Coarse resolution (0.25°) |
| ✓ Free access | |

Use Cases

Best For:

  • Sea surface temperature when available

Not Suitable For:

  • Production use (incomplete)

Complements:

  • Other SST providers

NMDC Integration

  • Schema Slots: temp, sst

  • Role: experimental

  • Excellent For: oceans

API Documentation: https://coastwatch.pfeg.noaa.gov/erddap/


ISRIC SoilGrids

Machine learning predictions from soil profiles

Quick Facts

  • API Type: WCS_REST

  • Endpoint: https://rest.isric.org/soilgrids/v2.0

  • Authentication: none

  • Coverage: Global

  • Resolution: 250m

Reliability

  • Stability: HIGH

  • Data Quality: modeled

  • Uptime: Good (ISRIC institutional service)

Cost

  • Pricing Model: free

  • Free Tier: Unlimited

  • Quotas: None documented

Strengths & Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| ✓ Global coverage at 250m resolution | ✗ Modeled data (not direct measurements) |
| ✓ Multiple soil properties (pH, texture, carbon, etc.) | ✗ Accuracy varies by region |
| ✓ Depth-specific layers | ✗ May not reflect recent land use changes |
| ✓ No API key required | |
| ✓ Well-documented API | |
| ✓ Regular updates | |

Use Cases

Best For:

  • Global soil property estimates

  • Locations without local soil surveys

  • Comparative studies across regions

Not Suitable For:

  • High-precision agriculture requiring ground truth

  • Recent land disturbance areas

Complements:

  • USDA NRCS (for US validation)

NMDC Integration

  • Schema Slots: ph, soil_text, org_matter, oc

  • Role: primary_global

  • Excellent For: global

  • Poor For: recently_disturbed

API Documentation: https://rest.isric.org/soilgrids/v2.0


USDA NRCS Web Soil Survey

Ground surveys and lab measurements

Quick Facts

  • API Type: SDA_REST

  • Endpoint: https://sdmdataaccess.nrcs.usda.gov/Tabular/SDMTabularService/post.rest

  • Authentication: none

  • Coverage: USA only

  • Resolution: Polygon-based (variable)

Reliability

  • Stability: HIGH

  • Data Quality: ground_truth

  • Uptime: Good (USDA service)

Cost

  • Pricing Model: free

  • Free Tier: Unlimited

  • Quotas: None documented

Strengths & Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| ✓ High-quality ground truth data | ✗ USA only coverage |
| ✓ Lab-measured soil properties | ✗ Variable spatial resolution |
| ✓ Detailed soil classification | ✗ Complex API (SQL-based queries) |
| ✓ Free access | |

Use Cases

Best For:

  • USA locations requiring high accuracy

  • Agricultural research

  • Ground truth validation

Not Suitable For:

  • International locations

Complements:

  • SoilGrids (global coverage)

NMDC Integration

  • Schema Slots: ph, soil_text, org_matter

  • Role: primary_for_usa

  • Excellent For: usa

  • Poor For: international

API Documentation: https://sdmdataaccess.nrcs.usda.gov/Tabular/SDMTabularService/post.rest


Meteostat

WMO weather stations

Quick Facts

  • API Type: Python_Library_CDN

  • Endpoint: https://bulk.meteostat.net/v2/

  • Authentication: none

  • Coverage: Global (120,000+ stations)

  • Resolution: Station-based (point measurements)

  • Temporal: 1973-present (daily), 1991-2020 (normals)

  • Freshness: 7-day lag

Reliability

  • Stability: HIGH

  • Data Quality: ground_truth

  • Uptime: Excellent (stable library)

  • Known Issues:

    • Climate normals only available for WMO standard periods (1961-1990, 1971-2000, 1981-2010, 1991-2020)

    • Station coverage sparse in remote regions

    • Requires at least 10 of 12 months of data to compute normals

Cost

  • Pricing Model: free

  • Free Tier: Unlimited

Strengths & Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| ✓ 30-year WMO standard period (1991-2020) | ✗ Sparse coverage in remote regions (deserts, mountains, oceans) |
| ✓ Station-based ground truth measurements | ✗ Distance uncertainty (may use station 50-100km away) |
| ✓ No API key required | ✗ Only provides specific WMO standard periods |
| ✓ Extensive station network (120,000+) | ✗ Station availability varies by region |
| ✓ Distance tracking for quality assessment | ✗ Data gaps in some stations |
| ✓ Pre-computed normals for fast retrieval | |
| ✓ High reliability | |

Use Cases

Best For:

  • Urban/suburban locations with dense station coverage

  • When WMO-standard 30-year normals required

  • Scientific research requiring ground-based observations

Not Suitable For:

  • Remote desert/mountain/ocean locations

  • Custom time periods (not WMO standard)

Complements:

  • NASA POWER (for remote area coverage)

NMDC Integration

  • Schema Slots: annual_precpt, annual_temp, temp, air_temp

  • Role: primary_for_stations

  • Excellent For: urban, suburban, europe, north_america, australia

  • Poor For: deserts, mountains, oceans, remote_regions

API Documentation: https://bulk.meteostat.net/v2/
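The distance tracking mentioned among the strengths amounts to measuring how far the chosen station is from the sample site. A great-circle (haversine) distance is one common way to do that. The sketch below is an illustration, not the package's code: both function names are hypothetical, and the 50 km default cut-off is an assumption taken from the lower end of the 50-100km range quoted above.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def station_is_representative(distance_km: float, max_km: float = 50.0) -> bool:
    """Flag whether a station is close enough to stand in for the sample site."""
    return distance_km <= max_km


distance = haversine_km(0.0, 0.0, 0.0, 1.0)  # one degree of longitude at the equator
print(round(distance, 1), station_is_representative(distance))
```

Storing the distance alongside the retrieved values lets downstream users decide for themselves whether to trust a station reading or prefer a gridded source such as NASA POWER.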


NASA POWER

MERRA-2 satellite reanalysis

Quick Facts

  • API Type: REST

  • Endpoint: https://power.larc.nasa.gov/api/temporal/climatology/point

  • Authentication: none

  • Coverage: Global

  • Resolution: 0.5° x 0.625° (~50-60km grid)

  • Temporal: 2001-2020 (climatologies)

  • Freshness: Static climatologies

Reliability

  • Stability: HIGH

  • Data Quality: satellite_reanalysis

  • Uptime: Excellent (NASA operational service)

  • Known Issues:

    • Only provides 2001-2020 period (not WMO standard 1991-2020)

    • Coarser spatial resolution than station data

Cost

  • Pricing Model: free

  • Free Tier: Unlimited

  • Quotas: None documented

Strengths & Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| ✓ True global coverage (satellite-based) | ✗ Shorter period (20 years: 2001-2020 vs standard 30 years) |
| ✓ No API key required | ✗ Coarser resolution (0.5° × 0.625° vs station point data) |
| ✓ Works anywhere on Earth (deserts, mountains, oceans) | ✗ Satellite bias in complex terrain |
| ✓ Consistent methodology globally | ✗ Not WMO standard period |
| ✓ High stability (NASA/GMAO service) | ✗ Model-based (not direct measurements) |
| ✓ Fast response (pre-computed) | |

Use Cases

Best For:

  • Remote locations far from weather stations

  • Ocean/marine samples

  • Global-scale studies requiring consistent methodology

  • Validation/comparison against station data

Not Suitable For:

  • Urban areas with local station (prefer Meteostat)

  • Studies requiring WMO standard 1991-2020 period

Complements:

  • Meteostat (for station-rich areas)

NMDC Integration

  • Schema Slots: annual_precpt, annual_temp

  • Role: fallback_for_remote_areas

  • Excellent For: oceans, deserts, mountains, antarctica, remote_regions

API Documentation: https://power.larc.nasa.gov/api/temporal/climatology/point


Open-Meteo

ERA5/ERA5-Land reanalysis

Quick Facts

  • API Type: REST

  • Endpoint: https://archive-api.open-meteo.com/v1/era5

  • Authentication: none

  • Coverage: Global

  • Resolution: 11km (ERA5-Land)

  • Temporal: 1959-present (daily)

  • Freshness: 5-day lag

Reliability

  • Stability: HIGH

  • Data Quality: satellite_reanalysis

  • Uptime: Good

  • Known Issues:

    • Does not provide pre-computed climate normals

    • Would require 30-360 API calls to compute normals

Cost

  • Pricing Model: free

  • Free Tier: 10,000 requests/day

  • Quotas: 10,000/day

Strengths & Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| ✓ ERA5 reanalysis (high quality) | ✗ Not used for climate normals (too many API calls) |
| ✓ 11km resolution (better than NASA POWER) | ✗ Rate limited (10,000/day) |
| ✓ No API key required | ✗ Hourly aggregation required for daily values |
| ✓ Long temporal coverage (1959-present) | |
| ✓ Good for daily weather | |

Use Cases

Best For:

  • Day-specific weather (collection date)

  • Historical daily data

Not Suitable For:

  • Climate normals (use Meteostat/NASA POWER)

Complements:

  • Meteostat (for daily weather)

NMDC Integration

  • Schema Slots: temp, air_temp, humidity, wind_speed, wind_direction

  • Role: fallback_daily_weather

  • Excellent For: global

API Documentation: https://archive-api.open-meteo.com/v1/era5
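The hourly-to-daily aggregation noted among the weaknesses can be sketched as follows. The example assumes the hourly block has the shape Open-Meteo returns, with parallel `time` and variable arrays and ISO timestamps. `daily_means` is a hypothetical helper, and the sample values are made up.

```python
from collections import defaultdict
from statistics import fmean


def daily_means(hourly: dict, variable: str) -> dict[str, float]:
    """Average hourly values per calendar day, skipping missing (None) values."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for stamp, value in zip(hourly["time"], hourly[variable]):
        if value is not None:
            buckets[stamp[:10]].append(value)  # "YYYY-MM-DD" prefix of the timestamp
    return {day: fmean(values) for day, values in buckets.items()}


# Made-up sample shaped like an "hourly" response block.
sample_hourly = {
    "time": ["2021-06-01T00:00", "2021-06-01T12:00", "2021-06-02T00:00"],
    "temperature_2m": [10.0, 20.0, None],
}
print(daily_means(sample_hourly, "temperature_2m"))
```

Days with no valid hourly values are omitted from the result rather than reported as zero, so a gap in the source data stays visible to the caller.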



This documentation was automatically generated from config/provider_metadata.yaml by scripts/generate_provider_docs.py

Documentation Generation

For information about how this documentation is generated and maintained, see:

Documentation Generation Guide

This document describes the automated documentation generation tools available in biosample-enricher.

Overview

The biosample-enricher project uses a YAML-based metadata system to maintain comprehensive documentation about all data providers. This metadata is used to generate:

  1. Python Docstrings - Auto-generated class docstrings visible in IDEs and help()

  2. API Index - Alphabetical index of all public functions and methods

  3. Provider Documentation - Comprehensive markdown documentation with comparison tables

Documentation Tools

1. Provider Docstring Generator

Purpose: Updates Python class docstrings from YAML metadata

Source: scripts/generate_provider_docstrings.py

Usage:

# Preview changes without modifying files
uv run generate-provider-docstrings --dry-run

# Update specific provider
uv run generate-provider-docstrings --provider weather.meteostat

# Update all providers
uv run generate-provider-docstrings

Output: Updates docstrings in Python files under biosample_enricher/

When to Run:

  • After modifying config/provider_metadata.yaml

  • When adding a new provider

  • To ensure IDE documentation is up-to-date

2. API Index Generator

Purpose: Creates alphabetical index of all public functions and methods

Source: scripts/generate_api_index.py

Usage:

# Generate index (default output: docs/API_INDEX.md)
uv run generate-api-index

# Custom output location
uv run generate-api-index --output path/to/output.md

Output: docs/API_INDEX.md - Alphabetical listing of 398 functions and methods

When to Run:

  • After adding new public functions or methods

  • Before releases to update API documentation

  • When restructuring code
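As a rough illustration of what an alphabetical index of public functions and methods involves, the sketch below walks a module with the standard `inspect` module. It does not reproduce scripts/generate_api_index.py. `public_callables` is a hypothetical name, and the standard-library `json` module stands in for a package module.

```python
import inspect
import json  # stand-in for a module from the package being indexed
from types import ModuleType


def public_callables(module: ModuleType) -> list[str]:
    """List public functions and methods defined in a module, sorted alphabetically."""
    names: list[str] = []
    for name, obj in inspect.getmembers(module):
        if name.startswith("_"):
            continue
        # Skip anything imported from elsewhere.
        if getattr(obj, "__module__", None) != module.__name__:
            continue
        if inspect.isfunction(obj):
            names.append(f"{module.__name__}.{name}")
        elif inspect.isclass(obj):
            for method_name, _ in inspect.getmembers(obj, inspect.isfunction):
                if not method_name.startswith("_"):
                    names.append(f"{module.__name__}.{name}.{method_name}")
    return sorted(names, key=str.lower)


index = public_callables(json)
print("\n".join(index))
```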

3. Provider Documentation Generator

Purpose: Creates comprehensive markdown documentation with comparison tables

Source: scripts/generate_provider_docs.py

Usage:

# Generate documentation (default output: docs/PROVIDERS.md)
uv run generate-provider-docs

# Custom output location
uv run generate-provider-docs --output path/to/output.md

Output: docs/PROVIDERS.md - Provider profiles with:

  • Overview comparison table

  • Domain-specific comparison tables

  • Detailed provider profiles with strengths/weaknesses

  • Use case recommendations

  • NMDC integration details

When to Run:

  • After modifying config/provider_metadata.yaml

  • When adding a new provider

  • Before releases to update user documentation
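To make the idea of an overview comparison table generated from metadata concrete, here is a minimal sketch that renders table rows from already-parsed provider entries. It is not the code in scripts/generate_provider_docs.py. `render_overview_table` is a hypothetical name, the field names follow the Meteostat example entry in this guide, and deriving the domain from the key prefix (`weather.meteostat` gives `Weather`) is an assumption.

```python
def render_overview_table(providers: dict[str, dict]) -> str:
    """Render a pipe-delimited overview table from parsed provider metadata."""
    header = ["Provider", "Domain", "Coverage", "Cost", "Stability"]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for key in sorted(providers):
        entry = providers[key]
        domain = key.split(".", 1)[0].capitalize()  # assumption: key is "<domain>.<name>"
        row = [
            entry["name"],
            domain,
            entry["technical"]["coverage"],
            entry["cost"]["pricing_model"],
            entry["reliability"]["stability"].upper(),
        ]
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


# Subset of the Meteostat example entry, as it would look after YAML parsing.
providers = {
    "weather.meteostat": {
        "name": "Meteostat",
        "technical": {"coverage": "Global (120,000+ stations)"},
        "reliability": {"stability": "high"},
        "cost": {"pricing_model": "free"},
    }
}
print(render_overview_table(providers))
```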

Source of Truth: provider_metadata.yaml

All documentation is generated from config/provider_metadata.yaml, which contains:

Systematic Comparison Criteria

Each provider entry includes:

  1. Technical Characteristics

    • API type (REST, Python Library, etc.)

    • Endpoint URL

    • Authentication requirements

    • Coverage (global, regional, etc.)

    • Resolution (spatial/temporal)

    • Data freshness

  2. Reliability

    • Stability level (HIGH, MODERATE, LOW)

    • Data quality (ground_truth, satellite, model, etc.)

    • Uptime history

    • Known issues

  3. Cost

    • Pricing model (free, paid, freemium)

    • Free tier details

    • Quota limits

  4. Strengths (bulleted list)

    • What this provider does well

    • Advantages over alternatives

  5. Weaknesses (bulleted list)

    • Limitations

    • When not to use this provider

  6. Use Cases

    • Best for: Ideal scenarios

    • Not suitable for: When to avoid

    • Complements: Providers that work well together

  7. NMDC Integration

    • Schema slots mapped

    • Multi-provider role (primary, fallback, etc.)

    • Geographic preferences (excellent/poor regions)

Workflow

Adding a New Provider
  1. Update YAML metadata: Add complete entry to config/provider_metadata.yaml

  2. Generate docstrings: Run uv run generate-provider-docstrings

  3. Update documentation: Run uv run generate-provider-docs

  4. Regenerate index: Run uv run generate-api-index (if new public methods added)

  5. Commit all changes: YAML, Python files, and generated docs

Updating Provider Information
  1. Edit YAML: Modify config/provider_metadata.yaml

  2. Regenerate docstrings: Run uv run generate-provider-docstrings

  3. Update documentation: Run uv run generate-provider-docs

  4. Commit changes

Before Release

Run all three generators to ensure documentation is current:

make update-docs  # Only works if you have added an update-docs target to the Makefile
# Or run the generators manually:
uv run generate-provider-docstrings
uv run generate-api-index
uv run generate-provider-docs

Benefits

1. Single Source of Truth
  • All provider information in one YAML file

  • No duplicate documentation

  • Easy to maintain consistency

2. IDE Integration
  • Class docstrings visible in autocomplete

  • help() function shows comprehensive info

  • Developer-friendly

3. User Documentation
  • Comparison tables for quick reference

  • Detailed profiles for deep dives

  • Use case recommendations

4. Automated Updates
  • Scripts ensure documentation stays current

  • No manual markdown editing required

  • Reduced maintenance burden

File Structure

biosample-enricher/
├── config/
│   └── provider_metadata.yaml          # Source of truth
├── scripts/
│   ├── generate_provider_docstrings.py # Python docstring generator
│   ├── generate_api_index.py           # API index generator
│   └── generate_provider_docs.py       # Markdown docs generator
├── docs/
│   ├── API_INDEX.md                    # Generated: Function/method index
│   ├── PROVIDERS.md                    # Generated: Provider docs
│   └── DOCUMENTATION_GENERATION.md     # This file
└── biosample_enricher/
    └── */providers/*.py                # Updated: Class docstrings

Maintenance Notes

  • Never edit generated files directly - They will be overwritten

  • Always update YAML first, then regenerate

  • Test docstrings with help(ProviderClass) after generation

  • Review diffs before committing to catch errors

  • Keep YAML consistent - Follow existing patterns

Example Provider Entry

weather.meteostat:
  name: "Meteostat"
  class: "MeteostatProvider"
  module: "biosample_enricher.weather.providers.meteostat"

  technical:
    api_type: "Python_Library_CDN"
    api_endpoint: "https://bulk.meteostat.net/v2/"
    authentication: "none"
    coverage: "Global (120,000+ stations)"
    resolution: "Station-based (point measurements)"
    temporal_coverage: "1973-present (daily), 1991-2020 (normals)"
    data_freshness: "7-day lag"

  reliability:
    stability: "high"
    data_quality: "ground_truth"
    uptime_history: "Excellent (stable library)"
    known_issues:
      - "Climate normals only available for WMO standard periods"
      - "Station coverage sparse in remote regions"

  cost:
    pricing_model: "free"
    free_tier: "Unlimited"

  strengths:
    - "30-year WMO standard period (1991-2020)"
    - "Station-based ground truth measurements"
    - "No API key required"

  weaknesses:
    - "Sparse coverage in remote regions"
    - "Distance uncertainty (may use station 50-100km away)"

  use_cases:
    best_for:
      - "Urban/suburban locations with dense station coverage"
      - "When WMO-standard 30-year normals required"
    not_suitable_for:
      - "Remote desert/mountain/ocean locations"
      - "Custom time periods (not WMO standard)"
    complements:
      - "NASA POWER (for remote area coverage)"

  nmdc_integration:
    schema_slots: ["annual_precpt", "annual_temp", "temp", "air_temp"]
    multi_provider_role: "primary_for_stations"
    geographic_preferences:
      excellent: ["urban", "suburban", "europe", "north_america"]
      poor: ["deserts", "mountains", "oceans", "remote_regions"]
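Because every generator reads the same YAML, a quick structural check on each entry can catch omissions before regeneration. The sketch below operates on an entry after it has been parsed into a dictionary. `missing_fields` is a hypothetical helper, and the choice of which fields count as required is an assumption based on the example entry above, not a schema shipped with the project.

```python
# Assumed required fields, inferred from the example entry above.
REQUIRED_TOP_LEVEL = ("name", "class", "module", "strengths", "weaknesses")
REQUIRED_NESTED = {
    "technical": ("api_type", "api_endpoint", "authentication", "coverage", "resolution"),
    "reliability": ("stability", "data_quality"),
    "cost": ("pricing_model",),
    "use_cases": ("best_for", "not_suitable_for", "complements"),
    "nmdc_integration": ("schema_slots", "multi_provider_role"),
}


def missing_fields(entry: dict) -> list[str]:
    """Return dotted names of required fields absent from a provider entry."""
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in entry]
    for section, keys in REQUIRED_NESTED.items():
        block = entry.get(section) or {}
        missing.extend(f"{section}.{key}" for key in keys if key not in block)
    return missing


# Deliberately incomplete entry to show the report format.
partial_entry = {
    "name": "Meteostat",
    "class": "MeteostatProvider",
    "module": "biosample_enricher.weather.providers.meteostat",
    "technical": {"api_type": "Python_Library_CDN", "coverage": "Global (120,000+ stations)"},
}
print(missing_fields(partial_entry))
```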

Future Enhancements

Potential improvements:

  1. HTML generation - Convert markdown to styled HTML

  2. JSON schema validation - Validate YAML structure

  3. Coverage reports - Track documentation completeness

  4. Cross-references - Link related providers automatically

  5. Version tracking - Document when provider info was last updated

  6. Performance benchmarks - Add timing data to profiles


This documentation system was created to ensure comprehensive, consistent, and maintainable provider documentation across the biosample-enricher project.