Provider Reliability
Comprehensive analysis of data provider reliability across all service domains.
Biosample Enricher: Provider Reliability Analysis
Date: 2025-10-27
Analysis Scope: All data fetching providers across 8 service domains
Executive Summary
This analysis identifies 15 unique providers across 8 service domains, with significant reliability variations:
High Reliability: 7 providers (keyless public APIs, global coverage, stable services)
Moderate Reliability: 5 providers (API key-dependent, known migration issues, regional limitations)
Known Issues: 3 providers (incomplete implementations, fallback mechanisms needed)
Critical Gaps Identified:
USGS elevation service has known migration issues and unreliability
GEBCO bathymetry provider has incomplete WCS implementation
MODIS vegetation provider is mock-only (not fully implemented)
Marine providers lack comprehensive error handling
Provider Summary Table
| Service Domain | Provider | API | API Key | Coverage | Status | Reliability |
|---|---|---|---|---|---|---|
| Elevation | Google Elevation | Google Maps | ✓ Required | Global | ✓ Active | Moderate |
| Elevation | USGS 3DEP | USGS ArcGIS | ✗ None | US Only | ⚠️ Unstable | Low |
| Elevation | Open Topo Data | REST API | ✗ None | Global (250m-1km) | ✓ Active | High |
| Elevation | OSM Elevation | open-elevation.com | ✗ None | Global (90m) | ✓ Active | High |
| Soil | ISRIC SoilGrids | WCS/REST | ✗ None | Global (250m) | ✓ Active | High |
| Soil | USDA NRCS | SDA REST | ✗ None | US Only | ✓ Active | High |
| Weather | MeteoStat | Library+CDN | ✗ None | Global (120k+ stations) | ✓ Active | High |
| Weather | Open-Meteo | ERA5 API | ✗ None | Global (11km) | ✓ Active | High |
| Marine | GEBCO | WCS Service | ✗ None | Global (15 arc-sec) | ⚠️ Incomplete | Low |
| Marine | ESA Ocean Colour CCI | ERDDAP | ✗ None | Global Ocean (1km) | ⚠️ Incomplete | Moderate |
| Marine | NOAA OISST | ERDDAP | ✗ None | Global Ocean (0.25°) | ⚠️ Incomplete | Moderate |
| Land Cover | NLCD | WMS | ✗ None | US Only (30m) | ✓ Active | High |
| Land Cover | ESA WorldCover | WMS | ✗ None | Global (10m) | ✓ Active | High |
| Vegetation | MODIS | APPEEARS | ✗ None | Global (250-500m) | ⚠️ Mock Only | Low |
| Geocoding | Google Forward | Google Maps | ✓ Required | Global | ✓ Active | Moderate |
| Geocoding | OSM Nominatim | Nominatim | ✗ None | Global | ✓ Active | High |
| Geocoding | Google Reverse | Google Maps | ✓ Required | Global | ✓ Active | Moderate |
| Geocoding | OSM Nominatim Reverse | Nominatim | ✗ None | Global | ✓ Active | High |
| OSM Features | Overpass API | Overpass | ✗ None | Global | ✓ Active | Moderate |
Detailed Domain Analysis
1. ELEVATION PROVIDERS
Google Elevation
API: Google Maps Elevation API v1
Coverage: Global (30m resolution)
API Key: REQUIRED (GOOGLE_MAIN_API_KEY)
Timeout: 20 seconds default
Rate Limit: 50 QPS with API key
Reliability Status: MODERATE
Strengths:
Comprehensive global coverage
Accurate rooftop-level elevation data
Proper error handling with status codes (OK, ZERO_RESULTS, REQUEST_DENIED, OVER_QUERY_LIMIT)
Uses vertical datum: EGM96 (geoid)
Weaknesses:
Requires paid API key
Potential quota exhaustion (OVER_QUERY_LIMIT)
Missing fallback mechanisms
Known Issues:
None documented
Test Status: Marked with @pytest.mark.network in test_elevation.py (see Known Flaky Tests)
USGS 3DEP (Elevation Point Query Service)
API: USGS ArcGIS REST Service
Coverage: United States (10-30m resolution, varies by region)
API Key: None required
Timeout: 20 seconds default
Endpoint: https://elevation.nationalmap.gov/arcgis/rest/services/3DEPElevation/ImageServer/getSamples
Reliability Status: LOW ⚠️
Strengths:
Free access, no API key required
High-resolution coverage of the United States
Uses proper vertical datum: NAVD88
Weaknesses:
KNOWN MIGRATION ISSUES: Code comments explicitly state “USGS elevation services have experienced multiple migrations and can be unreliable”
Service has migrated from deprecated EPQS endpoint to 3DEP ArcGIS
Endpoint may change or experience outages
No-data sentinel values: -1000000, -9999 (complex handling required)
Service availability may vary
Known Issues:
Endpoint migration from EPQS to 3DEP (code comment: “Service availability may vary”)
Service unreliability documented in provider code
Complex no-data value handling
Recommendation: Use as secondary fallback only. Monitor service availability closely. Consider deprecating if USGS performs additional migrations.
Test Status: Marked with @pytest.mark.flaky (reruns=2, reruns_delay=10s) in test suite
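The no-data handling described above can be sketched as a small normalisation step. This is a minimal illustration, not the provider's actual code: the function name `clean_usgs_elevation` is hypothetical, and only the two sentinel values (-1000000 and -9999) come from this report.

```python
import math
from typing import Optional

# Sentinel values this report lists for the USGS 3DEP service.
USGS_NODATA_SENTINELS = (-1000000.0, -9999.0)


def clean_usgs_elevation(raw: object) -> Optional[float]:
    """Return the elevation in meters, or None when the value means "no data".

    Hypothetical helper: accepts numbers or numeric strings so that either
    representation of a sentinel is caught.
    """
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None  # missing or non-numeric payload
    if math.isnan(value) or math.isinf(value):
        return None
    # Compare with a small tolerance in case the sentinel arrives as a float.
    if any(math.isclose(value, s, abs_tol=1e-6) for s in USGS_NODATA_SENTINELS):
        return None
    return value
```

Returning `None` for every flavour of "no data" lets a caller move on to the next provider without inspecting provider-specific sentinels.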
Open Topo Data
API: Public REST API with multiple datasets
Coverage: Global (datasets: SRTM 30m/90m, ASTER 30m, EUDEM 25m, NED 10m)
API Key: None required
Timeout: 20 seconds default
Endpoint: https://api.opentopodata.org/v1/{dataset}
Reliability Status: HIGH ✓
Strengths:
Multiple dataset options for different regions
SmartOpenTopoDataProvider auto-selects optimal dataset by location
Free access, no rate limits published
Proper error handling (OK status checking)
Different vertical datums by dataset (EGM96, EVRS2000, NAVD88)
Weaknesses:
External service dependency
Dataset availability varies by region
No published rate limits or SLAs
Regional Optimization:
Europe (35-65°N, -15-40°E) → EU-DEM 25m
Polar regions (>60° or <-60°) → ASTER 30m
Global default → SRTM 30m
Test Status: Not marked with network or flaky markers
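The regional rules above could be expressed roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, the dataset identifiers (`eudem25m`, `aster30m`, `srtm30m`) are assumed to match the public Open Topo Data instance, and because the Europe box (up to 65°N) overlaps the polar rule (above 60°), the sketch assumes rules are checked in the order listed.

```python
def select_opentopodata_dataset(lat: float, lon: float) -> str:
    """Pick a dataset by location, checking rules in the order listed above."""
    # Europe: 35-65°N, 15°W-40°E -> EU-DEM 25m
    if 35.0 <= lat <= 65.0 and -15.0 <= lon <= 40.0:
        return "eudem25m"
    # Polar regions beyond 60° in either hemisphere -> ASTER 30m
    if lat > 60.0 or lat < -60.0:
        return "aster30m"
    # Everything else -> SRTM 30m
    return "srtm30m"


def build_opentopodata_url(lat: float, lon: float) -> str:
    """Fill the {dataset} placeholder of the endpoint shown above."""
    dataset = select_opentopodata_dataset(lat, lon)
    return f"https://api.opentopodata.org/v1/{dataset}?locations={lat},{lon}"
```

With first-match ordering, a point at 62°N in Scandinavia resolves to EU-DEM rather than ASTER; swapping the two checks would reverse that choice.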
OSM Elevation (open-elevation.com)
API: OpenElevation-style API (POST JSON)
Coverage: Global (SRTM 90m data)
API Key: None required
Timeout: 20 seconds default
Endpoint: https://api.open-elevation.com/api/v1/lookup
Reliability Status: HIGH ✓
Strengths:
Simple JSON POST interface
Free public access
Global coverage
Uses EGM96 vertical datum
Weaknesses:
External service dependency
No documented rate limits
Depends on open-elevation.com uptime
Test Status: Marked with @pytest.mark.network in test_elevation.py (see Known Flaky Tests)
2. SOIL PROVIDERS
ISRIC SoilGrids
API: Web Coverage Service (WCS) 2.0.1 and REST API
Coverage: Global (250m resolution)
API Key: None required
Timeout: 30 seconds default
Endpoints:
WCS: https://maps.isric.org/mapserv?map=/map/{service}.map
REST: https://rest.isric.org/soilgrids/v2.0
Reliability Status: HIGH ✓
Features:
WRB soil classification (30 classes: Acrisols→Vertisols)
Soil properties: pH, organic carbon, bulk density, sand/silt/clay %, nitrogen
Texture classification using USDA triangle
WCS 2.0.1 with fallback to WCS 1.0.0
REST API with fallback to WCS if REST fails
Strengths:
Comprehensive global coverage
Multiple acquisition methods (REST + WCS fallback)
Good quality score calculation (completeness-based)
Proper no-data value handling
Confidence scoring for WRB classification
Weaknesses:
Dual API dependency increases complexity
Grid-based resolution may miss local variation
250m resolution may be too coarse for some applications
Quality Assessment:
Base resolution: ~125m to pixel center (250m grid)
Data completeness score: 8 possible fields (WRB, pH, SOC, BDOD, sand, silt, clay, nitrogen)
Quality score: 0.5-1.0 based on distance and completeness
Test Status: Marked with @pytest.mark.flaky and @pytest.mark.network in test_soil_enrichment.py (see Known Flaky Tests)
USDA NRCS Soil Data Access
API: SDA REST (Tabular/post.rest)
Coverage: Continental USA + territories
API Key: None required
Timeout: 30 seconds default
Endpoint: https://sdmdataaccess.sc.egov.usda.gov/Tabular/post.rest
Reliability Status: HIGH ✓
Features:
USDA Soil Taxonomy classification (hierarchical)
Soil components with coverage percentages
Detailed taxonomy: order → suborder → great group → subgroup
Quality boost for major components and detailed taxonomy
Strengths:
Very high quality USDA-authoritative data
US-specific depth of detail
Component-based approach (multiple soils per location)
Good quality scoring (base 0.8 + bonuses up to 1.0)
Weaknesses:
US-only coverage (continental + territories)
Complex multi-query workflow (mukey → components)
No depth-specific data (full profile only)
Quality Assessment:
Base quality score: 0.8 (USDA data is authoritative)
Major component bonus: +0.1
Detailed taxonomy bonus: +0.1
Full coverage bonus: +0.05
Max score: 1.0 (the base plus all bonuses sums to 1.05, so the total has to be capped)
Test Status: Marked with @pytest.mark.flaky and @pytest.mark.network in test_soil_enrichment.py (see Known Flaky Tests)
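The bonus arithmetic in the quality assessment above can be illustrated with a short sketch. The function and parameter names are hypothetical, and the cap at 1.0 is an assumption based on the "bonuses up to 1.0" wording; the base and bonus values are the ones listed in this section.

```python
def usda_quality_score(
    is_major_component: bool,
    has_detailed_taxonomy: bool,
    has_full_coverage: bool,
) -> float:
    """Combine the base score and bonuses, capped at 1.0 (assumed cap)."""
    score = 0.8  # base: USDA data is treated as authoritative
    if is_major_component:
        score += 0.1
    if has_detailed_taxonomy:
        score += 0.1
    if has_full_coverage:
        score += 0.05
    # Round away floating-point noise, then cap so 0.8 + 0.25 cannot exceed 1.0.
    return min(1.0, round(score, 2))
```

All three bonuses together would give 1.05 uncapped, which is why the cap matters if downstream code expects scores in the 0 to 1 range.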
3. WEATHER PROVIDERS
MeteoStat
Source: Meteostat Library + CDN
Coverage: Global (120,000+ weather stations)
Data Period: 1973-present (7-day lag)
API Key: None required
Temporal Resolution: Daily observations
Spatial Resolution: Station-based (distance tracked)
Reliability Status: HIGH ✓
Features:
Temperature (tmin, tmax, tavg)
Wind (speed, direction)
Precipitation
Atmospheric pressure
Station distance tracking (max 100km)
Strengths:
Long station-based historical record (1973+)
Station-based ground truth data
No API key required
Global coverage with ~120,000 stations
Quality penalty for distant stations
Weaknesses:
7-day lag in data
Station availability varies by region
Distance-based quality penalty (max 100km limit)
Quality Assessment:
DAY_SPECIFIC_COMPLETE: full day coverage
DAY_SPECIFIC_PARTIAL: partial day coverage
Distance factor: 1.0 (at station) to 0.5 (100km away)
Test Status: Not marked with network or flaky markers
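The distance factor above fixes only the two end points (1.0 at the station, 0.5 at 100km). A minimal sketch, assuming the penalty is linear in between and that stations beyond the 100km limit are rejected; the function name is hypothetical.

```python
from typing import Optional

MAX_STATION_DISTANCE_KM = 100.0  # limit stated in this section


def station_distance_factor(distance_km: float) -> Optional[float]:
    """Return a factor from 1.0 (at the station) down to 0.5 (at 100km).

    Linear interpolation between the two documented end points is an
    assumption; stations beyond the limit yield None.
    """
    if distance_km < 0:
        raise ValueError("distance_km must be non-negative")
    if distance_km > MAX_STATION_DISTANCE_KM:
        return None
    return 1.0 - 0.5 * (distance_km / MAX_STATION_DISTANCE_KM)
```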
Open-Meteo
Source: ERA5 Reanalysis (Copernicus)
Coverage: Global (11km grid resolution)
Data Period: 1959-present
API Key: None required
Temporal Resolution: Hourly (aggregated to daily)
Spatial Resolution: 11km
Reliability Status: HIGH ✓
Features:
Temperature (min, max, avg)
Precipitation
Wind (speed, direction)
Humidity
Pressure
Solar radiation
Hourly data aggregated to daily with coverage tracking
Strengths:
Longest continuous record (1959+)
Global grid coverage (no gaps)
Very recent data
Hourly resolution allows precise aggregation
Multiple parameters (7 standard)
Weaknesses:
11km resolution may miss local variation
Reanalysis product (model + observations)
Requires aggregation from hourly
Quality Assessment:
DAY_SPECIFIC_COMPLETE: hourly coverage ≥80% of the 24-hour day
DAY_SPECIFIC_PARTIAL: <80% coverage
Aggregation method: hourly_aggregation
Test Status: Not marked with network or flaky markers
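The coverage rule above can be sketched as a simple count of non-missing hourly values. The function name is hypothetical, and treating `None` as a missing hour is an assumption about how gaps are represented; the 80% threshold and the two labels come from this section.

```python
from typing import Optional, Sequence


def classify_daily_coverage(
    hourly_values: Sequence[Optional[float]],
    expected_hours: int = 24,
    threshold: float = 0.8,
) -> str:
    """Label a day by the share of hourly values that are present."""
    present = sum(1 for value in hourly_values if value is not None)
    coverage = present / expected_hours
    if coverage >= threshold:
        return "DAY_SPECIFIC_COMPLETE"
    return "DAY_SPECIFIC_PARTIAL"
```

Under this reading, a day with 20 of 24 hours present (about 83%) counts as complete, while 19 hours (about 79%) is partial.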
4. MARINE PROVIDERS
GEBCO (General Bathymetric Chart of the Oceans)
API: WCS (Web Coverage Service)
Coverage: Global bathymetry (15 arc-second ≈ 450m)
API Key: None required
Data Type: Static bathymetric grid
Reliability Status: LOW ⚠️
Strengths:
High-resolution global bathymetry
Authoritative data source
Static dataset (no temporal issues)
Weaknesses:
INCOMPLETE IMPLEMENTATION: Provider has fallback depth estimation (placeholder)
WCS implementation not functional (marked as “simplified approach”)
Code comment: “In production, you would implement proper WCS requests”
Uses very rough estimation based on latitude/longitude
No actual GEBCO data access in current implementation
Implementation Status:
# Mock implementation with placeholder estimation:
# - Coastal: -10m to -200m (very inaccurate)
# - Open ocean: -1000m to -5000m (very inaccurate)
Recommendation: DO NOT USE in production. This provider needs:
Proper WCS client implementation
Actual GEBCO grid data access or
Third-party bathymetry API integration
Test Status: Not marked with network or flaky markers
ESA Ocean Colour CCI
API: ERDDAP griddap (NOAA NEFSC)
Coverage: Global ocean (1km resolution, but daylight-dependent)
Data Period: 1997-09-04 to present
API Key: None required
Parameter: Chlorophyll-a concentration
Reliability Status: MODERATE ⚠️
Strengths:
High-quality satellite L3 product
Global ocean coverage
1km resolution
Long time series (1997+)
Weaknesses:
INCOMPLETE IMPLEMENTATION: Uses fallback estimation if ERDDAP fails
Chlorophyll estimates are rough approximations
Cloud/weather dependent (gaps in data)
Limited to marine/ocean areas
ERDDAP service dependency
Data Quality Issues:
No real ERDDAP integration (simplified example)
Fallback chlorophyll estimation by latitude/region (very inaccurate)
Expected range check: 0.001-100.0 mg/m³
Fallback Logic:
Tropical (<10°): 0.15 mg/m³ base
Subtropical (10-30°): 0.08 mg/m³ base
Temperate (30-60°): 0.5 mg/m³ base
Polar (>60°): 1.2 mg/m³ base
Recommendation: Needs proper ERDDAP integration or fallback to alternative chlorophyll sources
Test Status: Not marked with network or flaky markers
NOAA OISST (Optimum Interpolation Sea Surface Temperature)
API: ERDDAP griddap (NOAA NCEI)
Coverage: Global ocean (0.25° grid)
Data Period: 1981-09-01 to present
API Key: None required
Temporal Resolution: Daily
Data Type: L4 interpolated product
Reliability Status: MODERATE ⚠️
Strengths:
Long time series (1981+)
Global ocean coverage
L4 product (interpolated/gap-filled)
Daily resolution
Well-documented data format
Weaknesses:
INCOMPLETE IMPLEMENTATION: Uses placeholder/mock data retrieval
No real ERDDAP integration
Requires longitude conversion (−180/180 to 0/360)
ERDDAP service dependency
Data Validation:
SST range check: -5°C to +50°C
Returns None for out-of-range values
Recommendation: Needs proper ERDDAP integration
Test Status: Not marked with network or flaky markers
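The longitude conversion and range check mentioned above are both small enough to sketch directly. The function names are hypothetical; the -5°C to +50°C bounds are the ones given in this section.

```python
from typing import Optional


def to_longitude_360(lon: float) -> float:
    """Convert a longitude from the -180/180 convention to 0/360."""
    return lon % 360.0


def validate_sst(sst_celsius: float) -> Optional[float]:
    """Return the SST if it lies within -5°C to +50°C, otherwise None."""
    if -5.0 <= sst_celsius <= 50.0:
        return sst_celsius
    return None  # also reached for NaN, since NaN fails both comparisons
```

Python's `%` operator returns a non-negative result for a positive divisor, so western longitudes such as -170 map to 190 without an explicit branch.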
5. LAND COVER PROVIDERS
NLCD (National Land Cover Database)
API: WMS (Web Map Service)
Coverage: Continental USA (30m resolution)
API Key: None required
Available Years: 2001, 2006, 2011, 2016, 2019, 2021
Reliability Status: HIGH ✓
Features:
19 land cover classes (water, developed, forest, grassland, wetland, etc.)
Multi-year archive with temporal comparison
GetFeatureInfo queries for point data
Automatic year selection based on target date
Strengths:
High-quality USGS-authoritative data
US-specific authority
Multi-year temporal coverage
30m resolution
Proper class mappings
Weaknesses:
US-only coverage
Quality confidence decreases with temporal distance (0.85 base, -0.1 per year)
Temporal Logic:
Selects closest year ≤ target date
Adds next year for comparison
Limits to 2 years maximum
Quality Assessment:
Base confidence: 0.85
Temporal adjustment: max(0.5, 0.85 - years_diff × 0.1)
Test Status: Not marked with network or flaky markers
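The year selection and confidence rules above might look like the following sketch. Function names are hypothetical, and the behaviour for target dates before 2001 (fall back to the earliest release) is an assumption, since this section does not cover that case.

```python
from datetime import date
from typing import List

# Release years listed in this section.
NLCD_YEARS = (2001, 2006, 2011, 2016, 2019, 2021)


def select_nlcd_years(target: date) -> List[int]:
    """Closest release at or before the target year, plus the next one if any."""
    earlier = [year for year in NLCD_YEARS if year <= target.year]
    if not earlier:
        return [NLCD_YEARS[0]]  # assumption: use the earliest release
    base = earlier[-1]
    later = [year for year in NLCD_YEARS if year > base]
    # At most two years: the base release and the following one for comparison.
    return [base] + later[:1]


def nlcd_confidence(data_year: int, target_year: int) -> float:
    """max(0.5, 0.85 - years_diff * 0.1), as stated in the quality assessment."""
    years_diff = abs(target_year - data_year)
    return max(0.5, 0.85 - years_diff * 0.1)
```

For a sample collected in 2018 this selects the 2016 and 2019 releases, with the 2016 layer scored at roughly 0.65.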
ESA WorldCover
API: WMS (Terrascope service)
Coverage: Global (10m resolution)
Data Version: 2021 (represents 2020-2021)
API Key: None required
Endpoint: https://services.terrascope.be/wms/v2
Reliability Status: HIGH ✓
Features:
11 land cover classes
Global coverage (10m resolution)
Tree cover, shrubland, grassland, cropland, built-up, bare land, snow/ice, water, wetland, mangroves, moss/lichen
GetFeatureInfo queries
Strengths:
Highest resolution (10m) of available providers
Global coverage
Recent data (2020-2021)
High-quality ESA product
High base confidence (0.85)
Weaknesses:
Static dataset (no annual updates)
Only one epoch available (2021)
Service dependency on Terrascope
Test Status: Not marked with network or flaky markers
6. VEGETATION PROVIDERS
MODIS Vegetation Indices
API: NASA APPEEARS API
Coverage: Global (250m-500m resolution)
Data Period: 2000-present
API Key: None required (NASA Earth Data login required)
Products:
MOD13Q1: Terra 250m 16-day NDVI/EVI
MCD15A3H: Combined 500m 4-day LAI/FPAR
Reliability Status: LOW ⚠️
Features:
NDVI (Normalized Difference Vegetation Index)
EVI (Enhanced Vegetation Index)
LAI (Leaf Area Index)
FPAR (Fraction of Absorbed Photosynthetically Active Radiation)
Weaknesses:
MOCK IMPLEMENTATION ONLY: Uses generated mock data
Code comment: “In production, this would be replaced with actual MODIS data access”
No real APPEEARS API integration
Generates realistic but fake data using seeded random
Mock Data Generation:
# Uses seeded randomness based on: latitude × 1000 + longitude × 1000 + day_of_year
# Generates seasonal and latitude-based vegetation patterns
# NOT actual MODIS observations
Recommendation: DO NOT USE in production. Requires:
APPEEARS API authentication setup
Task submission and processing workflow
Result download and parsing
Actual MODIS data retrieval
Test Status: Not marked with network or flaky markers
7. GEOCODING PROVIDERS
Google Forward Geocoding
API: Google Maps Geocoding API v1
Coverage: Global
API Key: REQUIRED (GOOGLE_MAIN_API_KEY)
Timeout: 30 seconds default
Rate Limit: 50 QPS
Reliability Status: MODERATE
Features:
Place name → coordinates
Address component parsing
Bounding boxes (viewport)
Location type determination
Relevance and confidence scoring
Partial match detection
Strengths:
Comprehensive geocoding
High accuracy for known places
Rich response metadata
Weaknesses:
Requires paid API key
Potential quota exhaustion (OVER_QUERY_LIMIT status)
Relevance/confidence heuristics required
Error Handling:
REQUEST_DENIED: API key invalid
OVER_QUERY_LIMIT: Quota exceeded
INVALID_REQUEST: Bad request
ZERO_RESULTS: No matches found
Test Status: Not marked with network or flaky markers
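One way to act on the status codes listed above is a small lookup from status to handling decision. This is an illustrative sketch: the `Action` enum and `action_for_status` function are hypothetical names, and the choice of action per status is an assumption rather than the provider's documented behaviour.

```python
from enum import Enum


class Action(Enum):
    USE_RESULT = "use_result"      # parse and return the response
    EMPTY_RESULT = "empty_result"  # valid request, nothing found
    RETRY_LATER = "retry_later"    # back off or switch to a fallback provider
    FAIL = "fail"                  # configuration or request error, do not retry


# Only the statuses named in this report (plus OK) are mapped.
STATUS_ACTIONS = {
    "OK": Action.USE_RESULT,
    "ZERO_RESULTS": Action.EMPTY_RESULT,
    "OVER_QUERY_LIMIT": Action.RETRY_LATER,
    "REQUEST_DENIED": Action.FAIL,
    "INVALID_REQUEST": Action.FAIL,
}


def action_for_status(payload: dict) -> Action:
    """Decide how to handle a decoded JSON response based on its status field."""
    return STATUS_ACTIONS.get(payload.get("status", ""), Action.FAIL)
```

Separating quota errors from key or request errors matters for fallback: only the former are worth retrying or routing to OSM Nominatim.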
OSM Nominatim Forward Geocoding
API: OpenStreetMap Nominatim Search API
Coverage: Global
API Key: None required
Timeout: 30 seconds default
Rate Limit: 1 request/second (enforced in code)
Endpoint: https://nominatim.openstreetmap.org/search
Reliability Status: HIGH ✓
Features:
Place name → coordinates
Address component parsing
Country filtering
Bounding boxes
Importance scoring
OSM identifiers
Strengths:
Free, no API key
Global OSM data
Rate limiting enforced in code
Deduplication
Extra tags support (Wikipedia, Wikidata)
Weaknesses:
Rate limit (1 req/sec) slows bulk operations
Nominatim ToS require proper user-agent
External service dependency
Rate Limiting:
_min_request_interval = 1.0 # seconds
# Enforces 1-second minimum between requests
Test Status: Not marked with network or flaky markers
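The 1-second minimum interval could be enforced with a limiter along these lines. This is a sketch, not the provider's code: the class name `MinIntervalLimiter` is hypothetical, and the injectable `clock` and `sleep` parameters are added here only so the behaviour can be tested without real waiting.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Block until at least min_interval seconds have passed since the last call."""

    def __init__(
        self,
        min_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request: Optional[float] = None

    def wait(self) -> float:
        """Sleep if needed and return the number of seconds slept."""
        now = self._clock()
        slept = 0.0
        if self._last_request is not None:
            remaining = self._min_interval - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
                now = self._clock()
        self._last_request = now  # record when this request was released
        return slept
```

A caller would invoke `limiter.wait()` immediately before each HTTP request. Using `time.monotonic` rather than wall-clock time keeps the interval correct across system clock adjustments.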
Google Reverse Geocoding
API: Google Maps Geocoding API v1 (reverse mode)
Coverage: Global
API Key: REQUIRED (GOOGLE_MAIN_API_KEY)
Timeout: 20 seconds default
Rate Limit: 50 QPS
Reliability Status: MODERATE
Features:
Coordinates → address
Multiple results ranked by distance
Address component hierarchy
Bounding boxes
Place type determination
Confidence scoring
Strengths:
Multiple results per query
Rich component information
High accuracy for addresses
Weaknesses:
Requires paid API key
Potential quota exhaustion
Confidence Scoring:
First result: 1.0 - 0.1 = 0.9
Second result: 1.0 - 0.2 = 0.8
Etc. (decreases by 0.1 per additional result)
Test Status: Not marked with network or flaky markers
OSM Nominatim Reverse Geocoding
API: OpenStreetMap Nominatim Reverse API
Coverage: Global
API Key: None required
Timeout: 20 seconds default
Rate Limit: 1 request/second (enforced)
Endpoint: https://nominatim.openstreetmap.org/reverse
Reliability Status: HIGH ✓
Features:
Coordinates → address
Address component hierarchy
Place rank and importance
OSM identifiers
Wikipedia/Wikidata links
Multiple result levels
Strengths:
Free, no API key
Global coverage
Rich metadata (place rank, importance)
External identifiers for linking
Weaknesses:
Rate limited (1 req/sec)
External service dependency
Requires proper user-agent
Rate Limiting:
min_request_interval = 1.0 # seconds
# Enforced for public Nominatim instance
Test Status: Not marked with network or flaky markers
8. OSM FEATURES PROVIDER
Overpass API
API: OpenStreetMap Overpass QL
Coverage: Global
API Key: None required
Timeout: 180 seconds default (configurable)
Rate Limit: 1 request/second (enforced)
Endpoint: https://overpass-api.de/api/interpreter
Reliability Status: MODERATE
Features:
Geographic features within radius
Named features (with name tags)
Unnamed feature counts by category
Feature categorization (natural, waterway, highway, amenity, etc.)
Geometry type detection (point, linestring, polygon, multipolygon)
Distance calculation from sample point
Strengths:
Global coverage
No API key required
Comprehensive feature extraction
Named/unnamed feature separation
Geometry type detection
Weaknesses:
Service can be slow/unstable during high load
Overpass QL complexity for comprehensive queries
Rate limiting (1 req/sec) for reliability
Timeout configurable but server limits apply
Query Strategy:
[out:json][timeout:180];
(
node(around:1000,lat,lon);
way(around:1000,lat,lon);
relation(around:1000,lat,lon);
);
out body geom qt;
Feature Categorization:
Natural (landuse, natural)
Waterway (rivers, streams)
Highway (roads, paths)
Railway, Aeroway
Amenity (services, facilities)
Leisure, Building
Boundary, Place
Tourism, Shop, Craft, Office
Distance Calculations:
Point to point: Haversine formula
Point to linestring: Min distance to segments
Point to polygon: Ray casting for containment, edge distance if outside
Test Status: Not marked with network or flaky markers
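The query strategy and the point-to-point distance rule above can be sketched together. Function names are hypothetical; the query text mirrors the template shown in this section with the radius, coordinates, and timeout filled in, and the mean Earth radius of 6,371 km is a conventional value assumed here.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius (assumed constant)


def build_overpass_query(
    lat: float, lon: float, radius_m: int = 1000, timeout_s: int = 180
) -> str:
    """Fill the Overpass QL template with a search radius around a point."""
    around = f"around:{radius_m},{lat},{lon}"
    return (
        f"[out:json][timeout:{timeout_s}];\n"
        "(\n"
        f"  node({around});\n"
        f"  way({around});\n"
        f"  relation({around});\n"
        ");\n"
        "out body geom qt;"
    )


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))
```

Reducing `radius_m` is the simplest lever when queries time out in dense urban areas, since the number of returned elements grows roughly with the search area.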
Reliability Matrices
By API Key Requirement
| Category | Count | Providers |
|---|---|---|
| No API Key (Free) | 16 | USGS, Open Topo, OSM Elevation, SoilGrids, USDA NRCS, MeteoStat, Open-Meteo, GEBCO, ESA CCI, NOAA OISST, NLCD, ESA WorldCover, MODIS, OSM Nominatim (both directions), Overpass |
| API Key Required | 3 | Google Elevation, Google Geocoding (both directions) |
By Coverage
| Category | Count | Providers |
|---|---|---|
| Global | 10 | Elevation (3), Soil (1), Weather (2), Marine (2), Geocoding (2) |
| US/North America | 2 | NLCD, USDA NRCS |
| Ocean/Marine | 3 | GEBCO, ESA CCI, NOAA OISST |
| Regional Variations | 2 | Open Topo Data (smart selection), SoilGrids (250m global) |
By Temporal Data
| Category | Providers |
|---|---|
| Real-time/Recent | MeteoStat, Open-Meteo, NOAA OISST |
| Historical | MeteoStat (1973+), Open-Meteo (1959+), NOAA OISST (1981+) |
| Static | GEBCO, NLCD (multi-year), ESA WorldCover (2020-2021) |
| No temporal component | Elevation (current), Soil (current), Geocoding (no date), OSM Features |
By Implementation Status
| Status | Count | Providers |
|---|---|---|
| Fully Implemented | 12 | All elevation, soil, weather, most geocoding, land cover, OSM features |
| Incomplete (Fallback/Mock) | 3 | GEBCO, ESA CCI, NOAA OISST |
| Mock Only | 1 | MODIS |
| Known Issues | 1 | USGS (migration history) |
Critical Reliability Gaps
Gap 1: USGS Elevation Service Unreliability
Issue: USGS elevation service has known migration history and documented unreliability.
Evidence:
Code comments: “USGS elevation services have experienced multiple migrations and can be unreliable”
Endpoint migrated from EPQS to 3DEP ArcGIS
Complex no-data value handling (-1000000, -9999)
Test marked with @pytest.mark.flaky(reruns=2, reruns_delay=10)
Recommendation:
Use Open Topo Data as primary (stable, global, multiple datasets)
Use OSM Elevation as first fallback (stable, 90m global coverage)
Use USGS as last fallback only with extensive error handling
Monitor USGS service status continuously
Plan deprecation if USGS performs additional migrations
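The recommended ordering (Open Topo Data, then OSM Elevation, then USGS) amounts to a first-success fallback chain. A minimal sketch, assuming each provider can be wrapped as a callable that returns an elevation or `None`; the function name and the provider labels are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

ElevationFn = Callable[[float, float], Optional[float]]


def first_available_elevation(
    providers: Sequence[Tuple[str, ElevationFn]], lat: float, lon: float
) -> Tuple[Optional[str], Optional[float]]:
    """Try providers in priority order and return the first usable value."""
    for name, fetch in providers:
        try:
            value = fetch(lat, lon)
        except Exception:
            # A failing provider must not stop the chain; real code would log here.
            continue
        if value is not None:
            return name, value
    return None, None  # every provider failed or had no data
```

Returning the provider name alongside the value keeps provenance visible, which matters when the fallbacks differ in resolution and vertical datum.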
Gap 2: Marine Provider Implementations
Issue: Three marine providers have incomplete implementations or mock data.
Evidence:
| Provider | Status | Issue |
|---|---|---|
| GEBCO | Low | Placeholder WCS implementation, uses rough estimation |
| ESA CCI | Moderate | Simplified ERDDAP integration, fallback estimates |
| NOAA OISST | Moderate | Incomplete ERDDAP queries, mock data retrieval |
Recommendation:
Implement proper WCS clients for GEBCO
Integrate actual ERDDAP griddap endpoints
Add NCEI data source integration
Implement cloud/weather gap handling
Add alternative sources (etopo, gebco.net direct access)
Current Status: Marine providers suitable for development/testing only, not production use.
Gap 3: MODIS Vegetation Implementation
Issue: MODIS provider is entirely mock/demo implementation.
Evidence:
Code comment: “This is a simplified implementation. In production, you would: Submit APPEEARS task request, Wait for processing, Download and parse results”
Uses seeded randomness to generate realistic-looking but fake data
No real APPEEARS API integration
_get_mock_vegetation_data() explicitly notes: “For now, return mock data with realistic values”
Recommendation:
Implement APPEEARS REST API integration
Add NASA Earth Data authentication
Implement task submission and polling workflow
Add result download and parsing
Consider alternative vegetation sources (NDVI-only APIs)
Current Status: DO NOT USE in production. This provider generates synthetic data.
Gap 4: Limited Soil Depth Support
Issue: Soil providers have limited or no depth-specific data.
Evidence:
SoilGrids supports depth intervals but implementation only uses “0-5cm” default
USDA NRCS returns full profile without depth stratification
No depth-dependent reliability metrics
Recommendation:
Implement full depth interval support for SoilGrids (0-5, 5-15, 15-30, 30-60, 60-100, 100-200cm)
Add soil profile depth inference from USDA components
Add quality metrics for depth-specific data
Document depth limitations in results
Provider Reliability Recommendations
Tier 1: High Reliability (Use as Primary)
| Service | Provider | Why |
|---|---|---|
| Elevation | Open Topo Data | Stable, multiple datasets, global |
| Elevation | OSM Elevation | Stable, 90m global, keyless |
| Soil | SoilGrids | Global, WCS+REST fallback, good completeness |
| Soil | USDA NRCS | US-authoritative, comprehensive, stable |
| Weather | MeteoStat | 120k+ stations, 1973+, station-based truth |
| Weather | Open-Meteo | Global, 1959+, hourly reanalysis |
| Land Cover | NLCD | US-authoritative, multi-year, stable |
| Land Cover | ESA WorldCover | Global, 10m resolution, recent |
| Geocoding | OSM Nominatim | Global, keyless, rich metadata |
Tier 2: Moderate Reliability (Use with Fallback)
| Service | Provider | Limitation | Fallback |
|---|---|---|---|
| Elevation | Google Maps | Paid key | Open Topo Data |
| Geocoding | Google Maps | Paid key, quota limits | OSM Nominatim |
| Marine | ESA CCI | Incomplete, ERDDAP issues | Alternative sources |
| Marine | NOAA OISST | Incomplete ERDDAP | Alternative sources |
| OSM Features | Overpass API | Slow/unstable under load | Reduce radius |
Tier 3: Low Reliability (Development/Testing Only)
| Service | Provider | Issue | Recommendation |
|---|---|---|---|
| Elevation | USGS 3DEP | Known migrations, unreliable | Use only as last fallback |
| Marine | GEBCO | Placeholder WCS | Needs implementation |
| Vegetation | MODIS | Mock data only | Not for production |
Timeout Configuration Review
| Provider | Timeout | Assessment |
|---|---|---|
| Elevation (all) | 20s | Appropriate |
| Soil (all) | 30s | Appropriate (WCS can be slow) |
| Weather (all) | 30s | Appropriate |
| Marine (all) | 30s | Appropriate |
| Land Cover (all) | 30s | Appropriate (WMS can be slow) |
| Vegetation (MODIS) | 60s | Appropriate (APPEEARS can take time) |
| Geocoding (all) | 20-30s | Appropriate |
| OSM Features | 180s | Configurable, can timeout on large queries |
Recommendation: Add exponential backoff retry logic for timeout errors (not currently implemented).
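A dependency-free version of the recommended retry logic might look like this. It is a sketch rather than a proposal for the final implementation: the function name and defaults are hypothetical, jitter is omitted to keep the delays predictable, and `sleep` is injectable only so the example can be tested without waiting.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 10.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on the given exceptions with doubling delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Delays of base_delay, 2x, 4x, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise AssertionError("unreachable")  # loop always returns or raises
```

Restricting `retry_on` to timeout-like errors avoids retrying failures that will not recover, such as an invalid API key.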
API Key Management
Required Keys
Only Google APIs require authentication:
GOOGLE_MAIN_API_KEY: Used for:
Google Elevation API
Google Maps Geocoding (forward)
Google Maps Geocoding (reverse)
Missing Implementations
NASA APPEEARS (MODIS) requires authentication but:
No key validation in current code
Mock implementation bypasses authentication entirely
Production implementation needs NASA Earth Data setup
Rate Limiting Analysis
| Provider | Explicit Limit | Enforcement |
|---|---|---|
| Google APIs | 50 QPS | Implicit (quota-based) |
| OSM Nominatim | 1 req/sec | Code-enforced sleep |
| Overpass API | 1 req/sec | Code-enforced sleep |
| Open Topo Data | Not published | None |
| SoilGrids | Not published | None |
| USDA NRCS | Not published | None |
| MeteoStat | Not published | None |
| Open-Meteo | Not published | None |
| Nominatim reverse | 1 req/sec | Code-enforced sleep |
Observation: Rate limiting is explicitly enforced in code for OSM services but not for others. Consider adding circuit breakers for all external APIs.
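The circuit breaker suggested above can be sketched in a few lines. This is an illustrative minimal version, not an existing class in the codebase: the name `CircuitBreaker`, the default threshold, and the reset window are all assumptions, and the `clock` parameter exists only to make the behaviour testable.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stop calling a failing service until a cool-down period has passed."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        """True when closed, or when the cool-down has elapsed (trial request)."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self._reset_after_s

    def record_success(self) -> None:
        """Close the circuit and clear the failure count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Count a failure and open (or re-open) the circuit at the threshold."""
        self._failures += 1
        if self._failures >= self._failure_threshold:
            self._opened_at = self._clock()
```

A provider wrapper would check `allow_request()` before each call and skip straight to the fallback provider while the circuit is open, which avoids paying the full timeout on every sample during an outage.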
Known Flaky Tests
From test suite analysis:
test_elevation.py:
- @pytest.mark.flaky(reruns=2, reruns_delay=10) - USGS provider
- @pytest.mark.network - Google Elevation, USGS, OSM Elevation
test_soil_enrichment.py:
- @pytest.mark.flaky - SoilGrids/USDA provider
- @pytest.mark.network - SoilGrids, USDA NRCS
test_http_cache.py:
- @pytest.mark.network - HTTP cache tests (multiple providers)
test_logging.py:
- @pytest.mark.network - Logging tests
Interpretation:
USGS elevation is marked flaky (2 retries with 10-second delay) = documented unreliability
SoilGrids is marked flaky = occasional connectivity/service issues
Most network tests are marked with @pytest.mark.network = skipped in CI
Testing Gaps
Untested Areas
GEBCO provider: No actual WCS testing (placeholder implementation)
ESA CCI provider: No actual ERDDAP testing (fallback estimates)
NOAA OISST provider: No actual data retrieval testing
MODIS provider: Mock data only, no real APPEEARS testing
Rate limiting effectiveness: No tests for concurrent requests
Recommendations
Add integration tests for all providers (marked @pytest.mark.network)
Test rate limiting under load
Test fallback mechanisms
Test timeout handling
Test error recovery and retries
Add contract tests for API schemas
Summary Recommendations
Immediate Actions
USGS Elevation: Add enhanced error handling, document unreliability, implement retry with exponential backoff
Marine Providers: Either implement actual WCS/ERDDAP clients or remove from production
MODIS: Either fully implement APPEEARS integration or mark as demo-only
Comprehensive Testing: Add network integration tests for all providers
Short Term (1-2 weeks)
Implement proper ERDDAP client for marine providers
Add circuit breaker pattern for all external APIs
Implement exponential backoff retry logic
Add service health checks
Document known limitations in user-facing docs
Medium Term (1-2 months)
Evaluate alternative providers for unreliable services (USGS, marine)
Implement proper WCS client for bathymetry
Complete MODIS APPEEARS integration
Add machine learning-based fallback provider selection
Implement provider-specific caching strategies
Appendix: Provider Quick Reference
By Domain
Elevation:
Primary: Open Topo Data
Fallback 1: OSM Elevation
Fallback 2: Google Elevation (requires key)
Fallback 3: USGS (unreliable)
Soil:
Primary: SoilGrids (global)
Primary: USDA NRCS (US only)
Weather:
Primary: Open-Meteo (gridded reanalysis)
Fallback: MeteoStat (station observations)
Marine:
ESA CCI: Chlorophyll-a (incomplete)
NOAA OISST: Sea Surface Temperature (incomplete)
GEBCO: Bathymetry (placeholder only)
Land Cover:
Primary: ESA WorldCover (global)
Primary: NLCD (US only)
Vegetation:
MODIS: Mock data only (not production-ready)
Geocoding:
Primary: OSM Nominatim (forward and reverse, keyless)
Fallback: Google Maps (requires key)
OSM Features:
Primary: Overpass API (global, can be slow or unstable under load)
Provider Roadmap
See the full roadmap for provider improvements and future development:
Provider Reliability Roadmap
Objective: Fix critical reliability gaps and stabilize all data fetching providers
Priority 1: Critical Issues (Must Fix)
1.1 USGS Elevation Service Unreliability
Current Status: Marked as @pytest.mark.flaky in tests, documented migration issues
Action Items:
# File: biosample_enricher/elevation/providers/usgs.py
import logging

from tenacity import retry, stop_after_attempt, wait_exponential

logger = logging.getLogger(__name__)

# The functions below are methods of the USGS provider class (class statement omitted).
# ServiceUnavailableError is assumed to be defined elsewhere in the package.
# _fetch_once is a hypothetical name for the existing single-request logic.

# 1. Add comprehensive retry logic with exponential backoff
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=10),
    reraise=True,
)
def _fetch_once(self, lat, lon, **kwargs):
    # Existing fetch implementation (one request to the 3DEP endpoint)
    ...

# 2. Add service health check
def _check_service_health(self) -> bool:
    try:
        # Test query at a known location inside 3DEP coverage (Denver, CO).
        # Calls _fetch_once directly so the check cannot recurse into fetch().
        test_result = self._fetch_once(39.7392, -104.9903, timeout_s=5)
        return test_result.ok
    except Exception:
        return False

# 3. Add fallback provider recommendation
def fetch(self, lat, lon, **kwargs):
    if not self._check_service_health():
        logger.warning(
            "USGS 3DEP service unhealthy. "
            "Recommend using Open Topo Data as fallback."
        )
        raise ServiceUnavailableError(
            "USGS 3DEP service unavailable. "
            "Use fallback provider (Open Topo Data)"
        )
    return self._fetch_once(lat, lon, **kwargs)
Timeline: 1 week
Owner: Primary elevation provider team
Success Criteria:
Health check passes in local testing
Retry logic reduces flakiness
Test passes consistently without @pytest.mark.flaky
1.2 Marine Providers: GEBCO WCS Implementation
Current Status: Placeholder depth estimation, no actual WCS queries
Action Items:
# File: biosample_enricher/marine/providers/gebco.py
# Replace fallback estimation with real WCS client
import logging

import numpy as np
from owslib.wcs import WebCoverageService
from rasterio.io import MemoryFile

logger = logging.getLogger(__name__)


class GEBCOProvider(MarineProviderBase):  # base class from the existing package
    def __init__(self, timeout: int = 30):
        super().__init__(timeout)
        # NOTE: this URL points at GEBCO's web map service page and is not a
        # confirmed WCS 2.0.1 endpoint; verify the real endpoint before use.
        self.wcs_url = "https://www.gebco.net/data_and_products/gebco_web_services/web_map_service"
        self._wcs_client = None

    def _get_wcs_client(self) -> WebCoverageService:
        # Create the client lazily: construction issues a GetCapabilities request
        if self._wcs_client is None:
            self._wcs_client = WebCoverageService(self.wcs_url, version="2.0.1")
        return self._wcs_client

    def _fetch_bathymetry_data(self, latitude, longitude) -> float | None:
        """Fetch actual GEBCO bathymetry via WCS."""
        try:
            wcs = self._get_wcs_client()
            # Query GEBCO_2023 coverage
            coverage = "GEBCO_2023"
            # Small area around point. The axis labels "Lat" and "Long" are an
            # assumption; take the real ones from the server's DescribeCoverage.
            subsets = [
                ("Lat", latitude - 0.01, latitude + 0.01),
                ("Long", longitude - 0.01, longitude + 0.01),
            ]
            # GetCoverage request for GeoTIFF (WCS 2.0.1 expects a list of identifiers)
            response = wcs.getCoverage(
                identifier=[coverage],
                subsets=subsets,
                format="image/tiff",
            )
            # Parse GeoTIFF and extract value at point
            with MemoryFile(response.read()) as mem:
                with mem.open() as src:
                    # Get value at coordinates
                    row, col = src.index(longitude, latitude)
                    value = src.read(1)[int(row), int(col)]
                    # Handle no-data values
                    if (src.nodata is not None and value == src.nodata) or np.isnan(value):
                        return None
                    return float(value)
        except Exception as e:
            logger.error(f"GEBCO WCS fetch failed: {e}")
            return None
Dependencies:
uv add owslib rasterio
Timeline: 1.5 weeks Owner: Marine data team
Success Criteria:
Real WCS queries return actual bathymetry data
Handle no-data values properly
Values in expected range (-11000 m to +9000 m; GEBCO grids include land elevations, up to roughly 8,849 m at the summit of Mount Everest)
Test passes with actual GEBCO data
1.3 Marine Providers: ERDDAP Integration
Current Status: Simplified ERDDAP queries, no actual data retrieval
Action Items:
# File: biosample_enricher/marine/providers/esa_cci.py
import io
import logging
import numpy as np
import requests
import xarray as xr
logger = logging.getLogger(__name__)
class ESACCIProvider(MarineProviderBase):
def __init__(self, timeout: int = 30):
super().__init__(timeout)
self.erddap_url = "https://coastwatch.pfeg.noaa.gov/erddap"
self.dataset_id = "noaa_esrl_ocean_color_v2" # Placeholder ID; confirm the ESA CCI dataset ID on the target ERDDAP server
def _fetch_chlorophyll_data(self, latitude, longitude, target_date) -> float | None:
"""Fetch actual chlorophyll-a data from ERDDAP."""
try:
date_str = target_date.strftime("%Y-%m-%d")
# Build proper ERDDAP griddap query
url = (
f"{self.erddap_url}/griddap/{self.dataset_id}.nc?"
f"chlor_a[({date_str}T00:00:00Z):1:({date_str}T23:59:59Z)]"
f"[({latitude}):1:({latitude})]"
f"[({longitude}):1:({longitude})]"
)
response = requests.get(url, timeout=self.timeout)
response.raise_for_status()
# Parse NetCDF response
with xr.open_dataset(io.BytesIO(response.content)) as ds:
# Extract chlorophyll value
if 'chlor_a' in ds.data_vars:
chl_value = ds['chlor_a'].values.flatten()[0]
# Validate range and no-data values
if np.isnan(chl_value) or chl_value < 0:
return None
if not 0.001 <= chl_value <= 100:
logger.warning(f"Value outside expected range: {chl_value}")
return None
return float(chl_value)
return None
except requests.exceptions.Timeout:
logger.error(f"ERDDAP timeout after {self.timeout}s")
return None
except Exception as e:
logger.error(f"ERDDAP fetch failed: {e}")
return None
Dependencies:
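uv add xarray netCDF4 scipy
scipy is included because xarray reads in-memory (io.BytesIO) netCDF-3 responses through its scipy engine.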
Timeline: 1.5 weeks Owner: Marine data team
Success Criteria:
Real ERDDAP griddap queries return data
Proper NetCDF parsing
Values in expected range (0.001-100 mg/m³)
Handles missing data gracefully
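The griddap query string used above follows a fixed pattern: one `[(start):stride:(stop)]` constraint per grid dimension, appended to the variable name. The following is a small sketch of a URL builder, assuming a dataset with dimensions in the order (time, latitude, longitude). The function name `build_griddap_url` and the dataset ID `example_dataset` are hypothetical placeholders.

```python
from datetime import date


def build_griddap_url(
    base_url: str,
    dataset_id: str,
    variable: str,
    target_date: date,
    latitude: float,
    longitude: float,
) -> str:
    """Build a single-point griddap request for a (time, latitude, longitude) grid."""
    day = target_date.isoformat()
    # Each dimension constraint has the form [(start):stride:(stop)].
    time_part = f"[({day}T00:00:00Z):1:({day}T23:59:59Z)]"
    lat_part = f"[({latitude}):1:({latitude})]"
    lon_part = f"[({longitude}):1:({longitude})]"
    return f"{base_url}/griddap/{dataset_id}.nc?{variable}{time_part}{lat_part}{lon_part}"


url = build_griddap_url(
    "https://coastwatch.pfeg.noaa.gov/erddap",
    "example_dataset",  # placeholder dataset ID, not a real ERDDAP dataset
    "chlor_a",
    date(2023, 6, 15),
    40.0,
    -70.0,
)
```

Isolating URL construction in a pure function lets it be unit tested without network access, separately from the NetCDF parsing.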
1.4 NOAA OISST Integration
Current Status: Placeholder queries, no real data retrieval
Action Items:
# File: biosample_enricher/marine/providers/noaa_oisst.py
import io
import numpy as np
import xarray as xr
class NOAAOISSTProvider(MarineProviderBase):
def _fetch_sst_data(self, latitude, longitude, target_date) -> float | None:
"""Fetch actual SST from NOAA OISST ERDDAP."""
try:
# Convert to 0-360 longitude
lon_360 = longitude if longitude >= 0 else longitude + 360
date_str = target_date.strftime("%Y-%m-%d")
# Build proper ERDDAP griddap query for OISST
# Dataset: https://coastwatch.pfeg.noaa.gov/erddap/info/ncdcOisst2Agg/index.html
url = (
f"{self.base_url}/ncdcOisst2Agg.nc?"
f"sst[({date_str}):1:({date_str})]"
f"[(0.0):1:(0.0)]" # Surface level
f"[({latitude}):1:({latitude})]"
f"[({lon_360}):1:({lon_360})]"
)
response = request("GET", url, timeout=self.timeout)
response.raise_for_status()
# Parse NetCDF
with xr.open_dataset(io.BytesIO(response.content)) as ds:
if 'sst' in ds.data_vars:
sst_value = ds['sst'].values.flatten()[0]
# Check for no-data and range
if np.isnan(sst_value):
return None
if not -5.0 <= sst_value <= 50.0:
logger.warning(f"SST outside range: {sst_value}°C")
return None
return float(sst_value)
return None
except Exception as e:
logger.error(f"OISST fetch failed: {e}")
return None
Timeline: 1.5 weeks Owner: Marine data team
Success Criteria:
Real ERDDAP queries return SST data
Values in expected range (-5 to 50°C)
Proper no-data handling
1.5 MODIS Vegetation: Full AppEEARS Integration
Current Status: Mock data generation, no real AppEEARS API calls
Action Items:
# File: biosample_enricher/land/providers/modis_vegetation.py
import json
import os
import time
import requests
from datetime import date, datetime, timedelta
class MODISVegetationProvider(VegetationProviderBase):
def __init__(self, username: str = None, password: str = None, timeout: int = 60):
self.appeears_base = "https://appeears.earthdatacloud.nasa.gov/api/v1"
self.timeout = timeout
self.username = username or os.getenv("NASA_USERNAME")
self.password = password or os.getenv("NASA_PASSWORD")
self._session = get_session()
self._token = None
if not self.username or not self.password:
raise ValueError(
"NASA Earth Data credentials required. "
"Set NASA_USERNAME and NASA_PASSWORD environment variables."
)
def _authenticate(self) -> str:
"""Get APPEEARS API token."""
if self._token:
return self._token
try:
response = self._session.post(
f"{self.appeears_base}/login",
json={"username": self.username, "password": self.password},
timeout=5
)
response.raise_for_status()
self._token = response.json()['token']
return self._token
except Exception as e:
raise ValueError(f"APPEEARS authentication failed: {e}")
def _query_modis_product(
self,
latitude: float,
longitude: float,
target_date: date,
time_window_days: int,
product_name: str,
product_info: dict,
) -> VegetationObservation | None:
"""Query actual MODIS data via APPEEARS."""
try:
token = self._authenticate()
headers = {"Authorization": f"Bearer {token}"}
# Build date range
start_date = target_date - timedelta(days=time_window_days // 2)
end_date = target_date + timedelta(days=time_window_days // 2)
# APPEEARS task request
task = {
"task_type": "point",
"params": {
"coordinates": [{"latitude": latitude, "longitude": longitude}],
"products": [
{
"product": product_name,
"layer": product_info["layers"][0]
}
],
"dates": [
{
"startDate": start_date.isoformat(),
"endDate": end_date.isoformat()
}
]
}
}
# Submit task
response = self._session.post(
f"{self.appeears_base}/task",
json=task,
headers=headers,
timeout=self.timeout
)
response.raise_for_status()
task_id = response.json()['task_id']
# Poll for completion
max_wait = 300 # 5 minutes
start_time = datetime.now()
while (datetime.now() - start_time).seconds < max_wait:
status_resp = self._session.get(
f"{self.appeears_base}/task/{task_id}",
headers=headers,
timeout=self.timeout
)
status_resp.raise_for_status()
status = status_resp.json()['status']
if status == 'completed':
# Get results
results_resp = self._session.get(
f"{self.appeears_base}/task/{task_id}/result",
headers=headers,
timeout=self.timeout
)
results_resp.raise_for_status()
results = results_resp.json()['data']
# Parse results
return self._parse_appeears_results(
results, latitude, longitude, target_date, product_info
)
elif status in ['failed', 'cancelled']:
raise ValueError(f"APPEEARS task {status}")
time.sleep(10) # Wait 10 seconds before next poll
raise TimeoutError(f"APPEEARS task timeout after {max_wait}s")
except Exception as e:
logger.error(f"MODIS APPEEARS query failed: {e}")
return None
def _parse_appeears_results(
self, results, latitude, longitude, target_date, product_info
) -> VegetationObservation | None:
"""Parse APPEEARS result into observation."""
# Implementation depends on APPEEARS response format
# Typically returns array of values with dates
pass
Requirements:
NASA Earth Data account setup
AppEEARS API credentials
Environment variables: NASA_USERNAME, NASA_PASSWORD
Timeline: 2 weeks Owner: Land data team
Success Criteria:
Real AppEEARS task submission and polling
Proper authentication and token handling
Results parsing and validation
NDVI/EVI/LAI/FPAR extraction
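The submit-then-poll step can be tested without network access if the polling loop is separated from the HTTP calls. The following is a minimal sketch, assuming the status strings used in the listing above (`completed`, `failed`, `cancelled`); the actual status values should be checked against the AppEEARS API documentation. `poll_until_done` is a hypothetical helper, with the clock and sleep function injectable so tests run instantly.

```python
import time
from typing import Callable


def poll_until_done(
    get_status: Callable[[], str],
    max_wait_s: float = 300.0,
    interval_s: float = 10.0,
    done_states: frozenset[str] = frozenset({"completed"}),
    failed_states: frozenset[str] = frozenset({"failed", "cancelled"}),
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call get_status until it reports a done state, a failed state, or the deadline passes."""
    deadline = clock() + max_wait_s
    while True:
        status = get_status()
        if status in done_states:
            return status
        if status in failed_states:
            raise RuntimeError(f"Task ended with status {status!r}")
        if clock() >= deadline:
            raise TimeoutError(f"Task not finished after {max_wait_s}s")
        sleep(interval_s)


# Simulated status sequence standing in for repeated GET /task/{task_id} calls.
statuses = iter(["pending", "processing", "completed"])
final = poll_until_done(lambda: next(statuses), sleep=lambda _s: None)
```

A monotonic clock is used for the deadline so that system clock adjustments cannot shorten or extend the wait.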
Priority 2: Important Enhancements (Should Fix)
2.1 Implement Circuit Breaker Pattern
File: biosample_enricher/providers/circuit_breaker.py
import logging
from datetime import datetime, timedelta
from enum import Enum
from typing import Any, Callable

logger = logging.getLogger(__name__)


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Provider failed, blocking calls
    HALF_OPEN = "half_open"  # Testing if provider recovered


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: int = 60,
        expected_exception: type = Exception,
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.expected_exception = expected_exception
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED

    def call(self, func: Callable, *args, **kwargs) -> Any:
        """Call function with circuit breaker protection."""
        if self.state == CircuitState.OPEN:
            if self._should_attempt_reset():
                self.state = CircuitState.HALF_OPEN
            else:
                raise CircuitOpenError(
                    f"Circuit open. Retry after {self._time_until_retry()}s"
                )
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except self.expected_exception:
            self._on_failure()
            raise

    def _on_success(self):
        self.failure_count = 0
        if self.state == CircuitState.HALF_OPEN:
            self.state = CircuitState.CLOSED
            logger.info("Circuit closed - service recovered")

    def _on_failure(self):
        # The count is only reset on success, so a failed HALF_OPEN trial
        # call is already at the threshold and reopens the circuit at once.
        self.failure_count += 1
        self.last_failure_time = datetime.now()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
            logger.warning(
                f"Circuit opened - service failed {self.failure_count} times"
            )

    def _should_attempt_reset(self) -> bool:
        return (
            datetime.now() - self.last_failure_time
            > timedelta(seconds=self.recovery_timeout)
        )

    def _time_until_retry(self) -> int:
        # total_seconds() avoids the day-wraparound of timedelta.seconds
        elapsed = (datetime.now() - self.last_failure_time).total_seconds()
        return max(0, int(self.recovery_timeout - elapsed))
Integration:
# In elevation provider
from biosample_enricher.providers.circuit_breaker import CircuitBreaker
class GoogleElevationProvider:
def __init__(self, api_key):
self.circuit_breaker = CircuitBreaker(
failure_threshold=5,
recovery_timeout=60
)
def fetch(self, lat, lon, **kwargs):
return self.circuit_breaker.call(
self._fetch_impl, lat, lon, **kwargs
)
Timeline: 1 week Owner: Infrastructure team
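The state transitions are easier to verify with an injectable clock than with real sleeps. The following is a compact, self-contained sketch of the same pattern; `TinyBreaker` is a hypothetical stand-in written for illustration, not the `CircuitBreaker` class above, and it tracks the open state with a timestamp instead of an enum.

```python
import time
from typing import Any, Callable


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class TinyBreaker:
    def __init__(
        self,
        failure_threshold: int = 2,
        recovery_timeout: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: float | None = None  # None means the circuit is closed

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open")
            # Recovery timeout elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or reopen after a failed trial
            raise
        self.failures = 0
        self.opened_at = None
        return result


def failing() -> float:
    raise ConnectionError("provider down")


def healthy() -> float:
    return 42.0


now = [0.0]  # fake clock, advanced manually
breaker = TinyBreaker(failure_threshold=2, recovery_timeout=60.0, clock=lambda: now[0])

events: list[object] = []
for _ in range(3):
    try:
        breaker.call(failing)
    except CircuitOpenError:
        events.append("blocked")
    except ConnectionError:
        events.append("failed")

now[0] = 61.0  # move past the recovery timeout
events.append(breaker.call(healthy))
```

The third call is rejected without reaching the provider, and the call after the timeout closes the circuit again.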
2.2 Implement Exponential Backoff Retry Logic
File: biosample_enricher/providers/retry_logic.py
import logging

import requests
from tenacity import (
    after_log,
    before_log,
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
    wait_random,
)

logger = logging.getLogger(__name__)

RETRY_CONFIG = {
    "elevation": {
        "stop": stop_after_attempt(3),
        "wait": wait_exponential(multiplier=1, min=1, max=10),
        "retry": retry_if_exception_type(
            (TimeoutError, ConnectionError, requests.RequestException)
        ),
    },
    "soil": {
        "stop": stop_after_attempt(3),
        "wait": wait_exponential(multiplier=1, min=2, max=30),
        "retry": retry_if_exception_type(
            (TimeoutError, ConnectionError, requests.RequestException)
        ),
    },
    "marine": {
        "stop": stop_after_attempt(5),
        # Exponential backoff plus up to 5 s of random jitter
        "wait": wait_exponential(multiplier=2, min=2, max=60) + wait_random(0, 5),
        # Retries on any exception, including non-transient ones
        "retry": retry_if_exception_type(Exception),
    },
}


def create_retry_decorator(service: str):
    """Return a tenacity retry decorator for a service domain (defaults to elevation)."""
    config = RETRY_CONFIG.get(service, RETRY_CONFIG["elevation"])
    return retry(
        before=before_log(logger, logging.DEBUG),
        after=after_log(logger, logging.DEBUG),
        reraise=True,
        **config,
    )
Timeline: 1 week Owner: Infrastructure team
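To reason about worst-case latency for each configuration, it helps to write out the delay schedule. The following is a plain-Python sketch of capped exponential backoff; it is an illustration of the idea, not tenacity's internal formula, whose exact schedule may differ between versions. `backoff_delays` is a hypothetical helper.

```python
def backoff_delays(
    attempts: int, multiplier: float = 1.0, min_s: float = 1.0, max_s: float = 10.0
) -> list[float]:
    """Delay before each retry: multiplier * 2**n, clamped to [min_s, max_s]."""
    return [max(min_s, min(multiplier * 2**n, max_s)) for n in range(attempts)]


# Schedules loosely mirroring the "elevation" and "marine" settings (without jitter).
elevation_delays = backoff_delays(5, multiplier=1.0, min_s=1.0, max_s=10.0)
marine_delays = backoff_delays(5, multiplier=2.0, min_s=2.0, max_s=60.0)
marine_worst_case_s = sum(marine_delays)
```

Summing a schedule gives the total time a caller may wait before the final failure, which should stay below any request-level timeout.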
2.3 Add Provider Health Checks
File: biosample_enricher/providers/health_check.py
from dataclasses import dataclass
from typing import Dict, Optional
from datetime import datetime, timedelta
@dataclass
class HealthStatus:
provider: str
healthy: bool
last_check: datetime
error: Optional[str] = None
response_time_ms: Optional[float] = None
def age_seconds(self) -> float:
return (datetime.now() - self.last_check).total_seconds()
class ProviderHealthChecker:
def __init__(self, cache_ttl_seconds: int = 300):
self.cache_ttl = cache_ttl_seconds
self.health_cache: Dict[str, HealthStatus] = {}
def check_provider(self, provider_name: str) -> HealthStatus:
"""Check provider health with caching."""
# Check cache
if provider_name in self.health_cache:
cached = self.health_cache[provider_name]
if cached.age_seconds() < self.cache_ttl:
return cached
# Perform health check
status = self._perform_health_check(provider_name)
self.health_cache[provider_name] = status
return status
def _perform_health_check(self, provider_name: str) -> HealthStatus:
"""Implement health check for each provider."""
health_checks = {
"google_elevation": self._check_google_elevation,
"usgs_3dep": self._check_usgs_elevation,
"open_topo_data": self._check_open_topo_data,
"osm_elevation": self._check_osm_elevation,
# ... etc
}
check_func = health_checks.get(provider_name)
if not check_func:
return HealthStatus(
provider=provider_name,
healthy=False,
last_check=datetime.now(),
error="Unknown provider"
)
try:
return check_func()
except Exception as e:
return HealthStatus(
provider=provider_name,
healthy=False,
last_check=datetime.now(),
error=str(e)
)
def _check_google_elevation(self) -> HealthStatus:
"""Health check for Google Elevation API."""
import time
start = time.time()
try:
provider = GoogleElevationProvider()
result = provider.fetch(
lat=0.0, # Equator
lon=0.0, # Prime meridian
timeout_s=5
)
response_time = (time.time() - start) * 1000
return HealthStatus(
provider="google_elevation",
healthy=result.ok,
last_check=datetime.now(),
response_time_ms=response_time
)
except Exception as e:
return HealthStatus(
provider="google_elevation",
healthy=False,
last_check=datetime.now(),
error=str(e)
)
# ... implement checks for other providers
Timeline: 1.5 weeks Owner: Infrastructure team
Priority 3: Testing and Documentation
3.1 Add Integration Tests for All Providers
File: tests/test_providers_integration.py
import pytest
from datetime import date
class TestElevationProviders:
"""Integration tests for elevation providers."""
@pytest.mark.network
def test_google_elevation_sanity(self):
"""Test Google Elevation API with known values."""
provider = GoogleElevationProvider()
# Mt. Everest: 27.9881°N, 86.9250°E
result = provider.fetch(27.9881, 86.9250)
assert result.ok
assert 8800 < result.elevation < 8850 # Expected range
assert result.vertical_datum == "EGM96"
@pytest.mark.network
def test_usgs_elevation_sanity(self):
"""Test USGS elevation with known values."""
provider = USGSElevationProvider()
# Denver, Colorado: inside 3DEP (US) coverage, where the NAVD88 datum applies
result = provider.fetch(39.7392, -104.9903)
assert result.ok
assert 1500 < result.elevation < 1700
assert result.vertical_datum == "NAVD88"
@pytest.mark.network
def test_elevation_fallback_chain(self):
"""Test elevation fallback mechanism."""
# This test would validate that if one provider fails,
# the next is tried automatically
pass
class TestSoilProviders:
"""Integration tests for soil providers."""
@pytest.mark.network
@pytest.mark.slow
def test_soilgrids_completeness(self):
"""Test SoilGrids for data completeness."""
provider = SoilGridsProvider()
# Test at a known location (e.g., Iowa cornbelt)
result = provider.get_soil_data(42.0, -93.0)
assert result.observations
obs = result.observations[0]
# Should have multiple fields
assert obs.classification_wrb is not None or obs.classification_usda is not None
assert obs.ph_h2o is not None or obs.organic_carbon is not None
@pytest.mark.network
def test_usda_nrcs_us_only(self):
"""Test USDA NRCS limits to US."""
provider = USDANRCSProvider()
# US location should work
result_us = provider.get_soil_data(40.0, -75.0) # New Jersey
assert len(result_us.observations) > 0 or result_us.quality_score > 0
# Non-US location should fail gracefully
result_non_us = provider.get_soil_data(0.0, 0.0) # Null Island
assert len(result_non_us.observations) == 0
class TestMarineProviders:
"""Integration tests for marine providers."""
@pytest.mark.network
@pytest.mark.slow
def test_gebco_bathymetry(self):
"""Test GEBCO bathymetry data."""
provider = GEBCOProvider()
# Deep ocean location
result = provider.get_marine_data(
latitude=0.0, longitude=-30.0, target_date=date.today()
)
assert result.bathymetry is not None
assert result.bathymetry.value < -1000 # Ocean depths
@pytest.mark.network
@pytest.mark.slow
def test_esa_cci_chlorophyll(self):
"""Test ESA CCI chlorophyll data."""
provider = ESACCIProvider()
# Productive ocean region (Gulf Stream)
result = provider.get_marine_data(
latitude=40.0, longitude=-70.0,
target_date=date(2023, 6, 15)
)
if result.chlorophyll_a is not None:
assert 0.001 <= result.chlorophyll_a.value <= 100
Timeline: 2 weeks Owner: QA team
3.2 Add Provider Performance Benchmarks
File: tests/test_providers_performance.py
import pytest
import time
from datetime import date
@pytest.mark.benchmark
class TestProviderPerformance:
"""Benchmark provider response times."""
@pytest.mark.network
def test_elevation_response_times(self, benchmark):
"""Benchmark elevation provider response times."""
provider = OpenTopoDataProvider()
def fetch():
return provider.fetch(40.0, -75.0)
result = benchmark(fetch)
assert result.ok
# P95 should be < 2 seconds
# P99 should be < 5 seconds
@pytest.mark.network
def test_geocoding_response_times(self, benchmark):
"""Benchmark geocoding provider response times."""
provider = OSMForwardGeocodingProvider()
def search():
return provider.search("New York City")
result = benchmark(search)
assert result.ok
# P95 should be < 1 second
Timeline: 1 week Owner: QA team
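The P95 and P99 targets in the comments above are not asserted by the benchmark itself. One way to check them is to collect latency samples and compute percentiles with the standard library. This is a sketch under that assumption; `percentile` is a hypothetical helper and the sample values are synthetic, not measured.

```python
import statistics


def percentile(samples: list[float], pct: int) -> float:
    """Return the pct-th percentile (1-99) using inclusive interpolation."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[pct - 1]


# Synthetic latencies in milliseconds: 0, 1, ..., 100.
latencies_ms = [float(ms) for ms in range(101)]
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

# Budgets taken from the comments in the elevation benchmark.
within_budget = p95_ms < 2000 and p99_ms < 5000
```

Percentiles are less sensitive to a single slow request than the maximum, which makes them a steadier pass/fail criterion for network tests.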
Success Metrics
Immediate (Week 1-2)
All high-priority fixes deployed
USGS provider no longer marked @pytest.mark.flaky
Marine providers updated to real implementations
100% integration test pass rate
Short Term (Month 1)
Circuit breaker deployed to all providers
Health check system operational
Retry logic reduces transient failures by 80%
All providers have timeout handling
Medium Term (Month 2)
Provider reliability dashboard active
SLA tracking for each provider
Automated failover mechanisms
Documentation of known limitations updated
Rollout Plan
Phase 1 (Week 1)
Deploy circuit breaker pattern
Fix USGS retry logic
Deploy health checks
Phase 2 (Week 2-3)
Implement ERDDAP clients (marine)
Fix GEBCO WCS integration
Complete MODIS AppEEARS integration
Phase 3 (Week 4)
Add comprehensive integration tests
Performance benchmarking
Documentation update
Phase 4 (Ongoing)
Monitor reliability metrics
Adjust configurations based on data
Add provider-specific optimizations