Data and AI governance in Amazon SageMaker — SageMaker Catalog for discovery, lineage, quality, and access control.


Overview

Amazon SageMaker Catalog (built on Amazon DataZone) provides:

Capability | Purpose
Discovery | Find data and AI assets at scale
Governance | Control access with fine-grained permissions
Collaboration | Share assets across teams and projects
Quality | Monitor data quality and lineage
Compliance | Audit trail and responsible AI

SageMaker Catalog

Unified catalog for all data and AI assets:

What’s in the Catalog?

Asset Type | Examples
Structured data | Tables, databases, data products
Unstructured data | Documents, images, files
AI models | Trained models, prompts, agents
BI dashboards | QuickSight/Quick Sight reports
Applications | GenAI apps, notebooks

Governance Features

Data and AI Catalog

Feature | Details
Central catalog | Discover all assets in one place
Metadata management | Technical and business metadata
Asset registration | Automatic and manual registration
Search | Find assets by name, tag, or owner
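
As a rough sketch of programmatic search, the function below wraps the `search_listings` operation of the boto3 `datazone` client, the service SageMaker Catalog is built on. The helper name `find_listings`, the keys of the returned dictionaries, and the domain ID in the usage note are illustrative assumptions, and only asset listings are handled.

```python
from typing import Any


def find_listings(client: Any, domain_id: str, text: str, limit: int = 10) -> list[dict]:
    """Return name and owning project for catalog listings matching `text`.

    `client` is expected to behave like boto3.client("datazone").
    """
    response = client.search_listings(
        domainIdentifier=domain_id,
        searchText=text,
        maxResults=limit,
    )
    results = []
    for item in response.get("items", []):
        # Each result is a union type; asset listings sit under "assetListing".
        listing = item.get("assetListing")
        if listing is None:
            continue  # skip other listing types, e.g. data products
        results.append(
            {
                "name": listing.get("name"),
                "owner_project": listing.get("owningProjectId"),
            }
        )
    return results
```

With AWS credentials configured, a call might look like `find_listings(boto3.client("datazone"), "dzd_example123", "orders")`, where the domain ID is a placeholder.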

Business Glossary

Feature | Details
Shared definitions | Standardize business terminology
Customizable metadata | Create metadata forms
Classification terms | Tag sensitive data consistently
Governance workflows | Enforce tagging policies
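
To make the idea of a tagging policy concrete, here is a toy in-memory check. It is not the catalog's own enforcement mechanism: the glossary contents, the `Asset` class, and the rule "every asset carries at least one classification term" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical glossary: classification term -> shared definition.
GLOSSARY = {
    "PII": "Data that can identify a person",
    "Confidential": "Internal data not cleared for external sharing",
    "Public": "Data cleared for unrestricted use",
}


@dataclass
class Asset:
    name: str
    terms: set[str] = field(default_factory=set)


def policy_violations(asset: Asset, required_any: set[str]) -> list[str]:
    """List the reasons an asset fails the policy (empty list = compliant)."""
    problems = []
    # Terms must come from the shared glossary so tags stay consistent.
    unknown = sorted(asset.terms - set(GLOSSARY))
    if unknown:
        problems.append(f"unknown terms: {', '.join(unknown)}")
    # At least one of the required classification terms must be present.
    if not asset.terms & required_any:
        problems.append("missing a classification term")
    return problems
```

Running `policy_violations(Asset("orders", {"Secret"}), set(GLOSSARY))` reports both an unknown term and a missing classification, while an asset tagged `PII` passes.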

Data Lineage

Track data flow across systems:

Feature | Details
OpenLineage compatible | Industry-standard lineage format
Origin tracking | Where data comes from
Transformation history | How data changes
Consumption patterns | Who uses the data
Impact analysis | Understand downstream effects
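
The sketch below shows what an OpenLineage-style run event can look like and how impact analysis reduces to a graph walk over such events. It is a minimal illustration under stated assumptions: the producer URI, namespace, dataset names, and the spec version in `schemaURL` are placeholders, and the traversal runs locally rather than against the catalog.

```python
import uuid
from collections import deque
from datetime import datetime, timezone


def run_event(job: str, inputs: list[str], outputs: list[str],
              namespace: str = "example") -> dict:
    """Build a minimal OpenLineage-style COMPLETE run event as a dict."""
    def dataset(name: str) -> dict:
        return {"namespace": namespace, "name": name}

    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/lineage-sketch",  # hypothetical URI
        # Spec version below is an assumption; use the one your tooling emits.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job},
        "inputs": [dataset(n) for n in inputs],
        "outputs": [dataset(n) for n in outputs],
    }


def downstream(events: list[dict], start: str) -> set[str]:
    """Datasets reachable from `start` by following input -> output edges."""
    edges: dict[str, set[str]] = {}
    for event in events:
        for src in event["inputs"]:
            for dst in event["outputs"]:
                edges.setdefault(src["name"], set()).add(dst["name"])

    seen: set[str] = set()
    queue = deque([start])
    while queue:  # breadth-first walk over the lineage graph
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

Given one job that turns `raw_orders` into `orders` and another that turns `orders` into `daily_sales`, `downstream(events, "raw_orders")` returns both derived datasets, which is the set a schema change to `raw_orders` could affect.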

Data Quality Monitoring

Feature | Details
Quality metrics | View metrics from AWS and third-party tools
Consumer trust | Show quality scores in catalog
API integration | Integrate external quality signals
Unified portal | Single view of data health
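
One simple way to think about a quality score is the share of quality rules that passed in the latest run. The function below is a sketch of that arithmetic only; the rule names are made up, and the scoring formula the catalog actually displays may differ.

```python
def quality_score(rule_results: dict[str, bool]) -> float:
    """Percentage of rules that passed, rounded to one decimal place."""
    if not rule_results:
        raise ValueError("no rule results supplied")
    passed = sum(rule_results.values())  # True counts as 1, False as 0
    return round(100 * passed / len(rule_results), 1)


# Hypothetical results from an external quality tool.
results = {
    "order_id_not_null": True,
    "order_id_unique": True,
    "total_non_negative": False,
    "row_count_above_threshold": True,
}
score = quality_score(results)  # 3 of 4 rules passed
```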

Data Discovery

Feature | Details
Business context | Enrich technical metadata with descriptions
Auto-enrichment | AI-generated metadata
Quick understanding | Help users find and trust data

Automated Metadata Recommendations

Feature | Details
LLM-powered | AI generates business-friendly names
Descriptions | Auto-generate asset descriptions
Consistency | Improve clarity across assets

Semantic Search

Feature | Details
Natural language | Search using plain English
Intent understanding | Goes beyond keywords
Context-aware | Understands relationships
Relevant results | Returns what you mean, not just what you type

Data Products

Package related assets into reusable products:

Feature | Details
Bundled assets | Group related tables, models, dashboards
Shared metadata | Common business descriptions
Unified access | Single subscription request
Consumption tracking | Monitor product usage
Reduced overhead | Fewer individual permissions

Access Control

Permission Model

Level | Control
Domain | Top-level organizational boundary
Project | Team-scoped access and collaboration
Asset | Individual table/model permissions
Column | Fine-grained column-level access

Integration with Lake Formation

  • Fine-grained permissions at no extra cost
  • Row and column level security
  • Tag-based access control
  • Cross-account sharing
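
For a sense of what column-level security looks like underneath, the helper below builds the arguments for Lake Formation's `grant_permissions` call with a `TableWithColumns` resource. It is a sketch: the helper name, role ARN, database, table, and column names are hypothetical, and the catalog normally manages such grants itself when a subscription is approved.

```python
from typing import Any


def column_grant_request(principal_arn: str, database: str, table: str,
                         columns: list[str]) -> dict[str, Any]:
    """Build keyword arguments for a Lake Formation grant that limits
    SELECT to the listed columns of one table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical principal and table; nothing is sent to AWS here.
request = column_grant_request(
    "arn:aws:iam::111122223333:role/Analyst",
    "sales",
    "orders",
    ["order_id", "total"],
)
```

To apply the grant, the dictionary could be passed as `boto3.client("lakeformation").grant_permissions(**request)`.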

Responsible AI

AI Governance Features

Feature | Purpose
Data classification | Label sensitive data
Toxicity detection | Identify harmful content
Guardrails | Apply responsible AI policies
Model cards | Document model behavior and limitations
Bias detection | Identify and mitigate bias

ML Lineage

Track the full ML lifecycle:

  • Training data sources
  • Model versions
  • Experiment parameters
  • Deployment history

SageMaker Model Dashboard (Brief)

SageMaker Model Dashboard is a centralized repository and single interface for tracking model governance and performance.

Coverage | Details
Model inventory | Consolidates models in your account, including outputs from SageMaker training jobs
Imported models | Supports models trained outside SageMaker and then hosted on SageMaker
Single stakeholder view | Gives IT admins, model risk managers, and business leaders one place to review model status
Cross-service signals | Aggregates data from multiple AWS services to indicate model health and performance
Deployment insights | Shows endpoint details and deployed model visibility
Batch insights | Includes batch transform job details for offline inference workloads
Monitoring insights | Surfaces monitoring job information for drift, quality, and ongoing model behavior checks

Think of it as a control panel for model inventory, deployment visibility, and risk/performance oversight.


Collaboration

Projects

Feature | Details
Team spaces | Isolated collaboration environments
Asset sharing | Publish and subscribe workflows
Centralized or decentralized | Flexible governance models
Self-service | Teams can request access independently

Publishing & Subscribing

Workflow | Description
Publish | Make assets available to others
Subscribe | Request access to shared assets
Approval | Data owners approve/deny requests
Audit | Track all sharing activities
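
The subscribe, approve, and audit steps amount to a small state machine. The sketch below models it locally to show the flow; the class and status names are illustrative assumptions rather than the catalog's API, and a real request is created and decided through the service.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "PENDING"
    ACCEPTED = "ACCEPTED"
    REJECTED = "REJECTED"


@dataclass
class SubscriptionRequest:
    asset: str
    requester: str
    status: Status = Status.PENDING
    audit: list[str] = field(default_factory=list)

    def decide(self, owner: str, approve: bool) -> None:
        """Let a data owner approve or deny a pending request, once."""
        if self.status is not Status.PENDING:
            raise ValueError("request already decided")
        self.status = Status.ACCEPTED if approve else Status.REJECTED
        # Every decision leaves an audit entry.
        self.audit.append(
            f"{owner} {self.status.value.lower()} {self.requester} on {self.asset}"
        )


# Hypothetical project and owner names.
req = SubscriptionRequest(asset="orders", requester="analytics-project")
req.decide(owner="data-owner", approve=True)
```

Rejecting a second decision keeps the audit trail unambiguous: each request has exactly one outcome and one matching log entry.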

Pricing

Component | Pricing
Catalog | Free usage tier available
Metadata storage | Per GB stored
API requests | Per request (with free tier)
Lineage | Included in standard pricing
Lake Formation | No extra cost for permissions

TL;DR

  • SageMaker Catalog = Central catalog for data + AI assets (built on DataZone)
  • Discovery = Semantic search, LLM-powered metadata, business glossary
  • Lineage = OpenLineage-compatible tracking of data flow
  • Quality = Unified view of data health metrics
  • Access control = Fine-grained to column level via Lake Formation
  • Collaboration = Projects for teams, publish/subscribe for sharing
  • Responsible AI = Data classification, guardrails, bias detection, model cards
  • Model Dashboard = Centralized model repository with endpoint, batch transform, and monitoring visibility

Resources

SageMaker Catalog
Data and AI governance capabilities.

Amazon DataZone
Underlying data governance platform.

AWS Lake Formation
Fine-grained access control.