Data preparation, processing, and storage tools in Amazon SageMaker AI.


Overview

SageMaker provides comprehensive data tools across the ML lifecycle:

Tool             Purpose
Unified Studio   Single IDE for all data/analytics/AI
Notebooks        Serverless coding environment
Data Agent       AI-powered data analysis
Lakehouse        Unified data storage
Data Processing  ETL with Athena, EMR, Glue

SageMaker Unified Studio

Single environment for analytics and AI development:

Feature              Details
Integrated IDE       Web-based development environment
All data access      S3 data lakes, Redshift, federated sources
Tools included       Model development, GenAI apps, SQL analytics, data processing
Built-in governance  Access controls via SageMaker Catalog
Amazon Q Developer   AI assistant for coding and queries

Capabilities

  • Discover data and put it to work
  • Train and deploy AI models at scale
  • Build custom generative AI applications
  • Share analytics and AI artifacts securely

SageMaker Notebooks

Serverless, high-performance programming environment:

Feature         Details
Browser-based   Interactive notebook interface
Serverless      No infrastructure management
Multi-language  SQL, Python, natural language
Scalable        Uses Athena for Apache Spark
Built-in AI     Data Agent for acceleration

Use Cases

  • Exploratory data analysis
  • SQL queries on large datasets
  • ML model prototyping
  • Data visualization

Pricing

  • Pay for instance type + storage
  • Charged based on duration of use

SageMaker Data Agent

AI agent inside notebooks that accelerates data work:

Feature              Details
Catalog integration  Understands your data catalog
Context-aware        Knows notebook context
Code generation      Generates SQL and Python code from prompts
ML assistance        Helps build ML pipelines

Pricing

  • $0.04 per credit (pay-as-you-go)
  • Simple prompts: < 1 credit
  • Complex tasks (full pipelines): 4-8 credits ($0.16-$0.32)
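
The per-credit arithmetic can be checked with a short sketch. It assumes only the $0.04 per-credit rate listed above; the function name `data_agent_cost` is hypothetical and not part of any AWS SDK.

```python
from decimal import Decimal

# Pay-as-you-go rate in USD, taken from the pricing list above.
PRICE_PER_CREDIT = Decimal("0.04")


def data_agent_cost(credits):
    """Return the cost in USD for a given number of Data Agent credits."""
    cost = Decimal(str(credits)) * PRICE_PER_CREDIT
    # Round to whole cents for display.
    return cost.quantize(Decimal("0.01"))


# Print the cost at one credit and at both ends of the 4-8 credit range.
for credits in (1, 4, 8):
    print(f"{credits} credits -> ${data_agent_cost(credits)}")
```

At this rate, 4 credits cost $0.16 and 8 credits cost $0.32.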

Example

Prompt: "Generate a data transformation pipeline for customer churn"
→ Data Agent creates a complete pipeline (4-8 credits)

Lakehouse Architecture

Unified data across S3 and Redshift:

Component    Details
Storage      S3 data lakes + Redshift data warehouses
Format       Apache Iceberg tables
Access       Any Iceberg-compatible tool/engine
Permissions  Fine-grained via Lake Formation
Zero-ETL     Near real-time from operational DBs
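
As an illustration of fine-grained permissions, the sketch below builds a column-level SELECT grant for the Lake Formation GrantPermissions API and, optionally, submits it with boto3. The role ARN, database, table, and column names are placeholders, and the helper names `build_select_grant` and `grant_select` are hypothetical. Whether your lakehouse tables are governed this way depends on your setup.

```python
def build_select_grant(principal_arn, database, table, columns):
    """Build keyword arguments for Lake Formation's GrantPermissions call."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            # TableWithColumns restricts the grant to the listed columns.
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def grant_select(request):
    """Submit the grant. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("lakeformation")
    return client.grant_permissions(**request)


# Placeholder identifiers, for illustration only.
request = build_select_grant(
    "arn:aws:iam::111122223333:role/AnalystRole",
    "sales_db",
    "orders",
    ["order_id", "order_date"],
)
print(request)
```

Keeping the request builder separate from the API call makes the grant easy to review or log before anything is changed in the account.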

Benefits

  • Single copy of data for analytics
  • Flexibility with open formats
  • Enterprise security built-in
  • Federated query across sources

Pricing

  • Metadata: AWS Glue Data Catalog pricing
  • Storage: S3 or Redshift Managed Storage
  • Compute: Based on query/processing engine
  • Fine-grained permissions: No extra cost

Data Processing

Use familiar AWS tools for data preparation:

Service        Purpose
Amazon Athena  Serverless SQL queries
Amazon EMR     Big data with Spark, Hive, Presto
AWS Glue       ETL and data integration
Data Wrangler  Visual data preparation

All accessible from SageMaker Unified Studio.
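
Outside the Studio UI, the same services can be called programmatically. The following is a minimal sketch of running a serverless SQL query with Athena through boto3. The database, table, and S3 output location are placeholders, and the helper names `build_athena_request` and `run_query` are hypothetical.

```python
import time


def build_athena_request(sql, database, output_location, workgroup="primary"):
    """Build keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
        "WorkGroup": workgroup,
    }


def run_query(request, poll_seconds=1.0):
    """Start the query and poll until it finishes. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    athena = boto3.client("athena")
    query_id = athena.start_query_execution(**request)["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)


# Placeholder database, table, and bucket names, for illustration only.
request = build_athena_request(
    "SELECT customer_id, COUNT(*) AS order_count "
    "FROM orders GROUP BY customer_id LIMIT 10",
    database="sales_db",
    output_location="s3://example-bucket/athena-results/",
)
print(request["QueryString"])
```

Calling `run_query(request)` submits the query and returns its ID and final state. Results are written to the configured S3 output location.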


Data Wrangler

Visual, low-code data preparation:

Feature                Details
Visual interface       Point-and-click transformations
300+ transformations   Built-in data cleaning functions
Data quality           Automatic quality analysis
ML integration         Export to training pipelines
SQL + Python           Custom transformations

TL;DR

  • Unified Studio = Single IDE for data, analytics, and AI
  • Notebooks = Serverless, scalable with SQL + Python + natural language
  • Data Agent = AI assistant for data analysis ($0.04/credit)
  • Lakehouse = Unified S3 + Redshift with Apache Iceberg
  • Processing = Athena, EMR, Glue integration in one place
  • Data Wrangler = Visual, low-code data preparation
