Data preparation, processing, and storage tools in Amazon SageMaker AI.
Overview
SageMaker provides comprehensive data tools across the ML lifecycle:
| Tool | Purpose |
|---|---|
| Unified Studio | Single IDE for data, analytics, and AI |
| Notebooks | Serverless coding environment |
| Data Agent | AI-powered data analysis |
| Lakehouse | Unified data storage |
| Data Processing | ETL with Athena, EMR, Glue |
SageMaker Unified Studio
Single environment for analytics and AI development:
| Feature | Details |
|---|---|
| Integrated IDE | Web-based development environment |
| All data access | S3 data lakes, Redshift, federated sources |
| Tools included | Model development, GenAI apps, SQL analytics, data processing |
| Built-in governance | Access controls via SageMaker Catalog |
| Amazon Q Developer | AI assistant for coding and queries |
Capabilities
- Discover data and put it to work
- Train and deploy AI models at scale
- Build custom generative AI applications
- Share analytics and AI artifacts securely
SageMaker Notebooks
Serverless, high-performance programming environment:
| Feature | Details |
|---|---|
| Browser-based | Interactive notebook interface |
| Serverless | No infrastructure management |
| Multi-language | SQL, Python, natural language |
| Scalable | Uses Athena for Apache Spark |
| Built-in AI | Data Agent for acceleration |
Use Cases
- Exploratory data analysis
- SQL queries on large datasets
- ML model prototyping
- Data visualization
Pricing
- Pay for the selected instance type plus storage
- Charges accrue for the duration of use
SageMaker Data Agent
AI agent inside notebooks that accelerates data work:
| Feature | Details |
|---|---|
| Catalog integration | Understands your data catalog |
| Context-aware | Knows notebook context |
| Code generation | Generates SQL/Python grounded in catalog and notebook context |
| ML assistance | Helps build ML pipelines |
Pricing
- $0.04 per credit (pay-as-you-go)
- Simple prompts: < 1 credit
- Complex tasks (full pipelines): 4-8 credits ($0.16-$0.32 at $0.04 per credit)
Example
"Generate a data transformation pipeline for customer churn"
→ Data Agent generates a complete pipeline (4-8 credits)
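The credit arithmetic can be sketched in a few lines. This is a minimal illustration, assuming the $0.04 per-credit rate listed under Pricing; the helper names `credits_to_usd` and `cost_range` are hypothetical, and the credit counts per task are estimates rather than guarantees.

```python
from decimal import Decimal

# Pay-as-you-go rate from the pricing list; verify against current AWS pricing.
PRICE_PER_CREDIT = Decimal("0.04")


def credits_to_usd(credits):
    """Convert a number of Data Agent credits to US dollars."""
    return Decimal(str(credits)) * PRICE_PER_CREDIT


def cost_range(min_credits, max_credits):
    """Return the (low, high) dollar cost for a task's estimated credit range."""
    return credits_to_usd(min_credits), credits_to_usd(max_credits)


# A full pipeline is estimated at 4-8 credits.
low, high = cost_range(4, 8)
print(f"Full pipeline: ${low:.2f} to ${high:.2f}")  # Full pipeline: $0.16 to $0.32
```

`Decimal` is used instead of `float` so that small currency amounts multiply without binary rounding noise.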
Lakehouse Architecture
Unified data across S3 and Redshift:
| Component | Details |
|---|---|
| Storage | S3 data lakes + Redshift data warehouses |
| Format | Apache Iceberg tables |
| Access | Any Iceberg-compatible tool/engine |
| Permissions | Fine-grained via Lake Formation |
| Zero-ETL | Near real-time replication from operational databases |
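To make "fine-grained via Lake Formation" concrete, here is a hedged sketch of a column-level grant. It builds the request for the Lake Formation `GrantPermissions` API and only calls AWS when `grant` is invoked. The role ARN, database, table, and column names are hypothetical placeholders, and the helper names are illustrative, not part of any SDK.

```python
def build_column_grant(principal_arn, database, table, columns):
    """Build keyword arguments for Lake Formation GrantPermissions,
    limiting SELECT to the listed columns of one table."""
    if not columns:
        raise ValueError("at least one column is required")
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def grant(request):
    """Send the grant to AWS. Requires boto3 and valid credentials."""
    import boto3  # imported lazily so the builder works without AWS access

    return boto3.client("lakeformation").grant_permissions(**request)


# Hypothetical example: analysts may read two non-sensitive columns only.
request = build_column_grant(
    "arn:aws:iam::123456789012:role/AnalystRole",  # placeholder ARN
    "sales",                                        # placeholder database
    "customers",                                    # placeholder table
    ["customer_id", "signup_date"],
)
```

Separating the request builder from the API call keeps the permission logic reviewable and testable before anything is applied to a real catalog.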
Benefits
- Single copy of data for analytics
- Flexibility with open formats
- Enterprise security built-in
- Federated query across sources
Pricing
- Metadata: AWS Glue Data Catalog pricing
- Storage: S3 or Redshift Managed Storage
- Compute: Based on query/processing engine
- Fine-grained permissions: No extra cost
Data Processing
Use familiar AWS tools for data preparation:
| Service | Purpose |
|---|---|
| Amazon Athena | Serverless SQL queries |
| Amazon EMR | Big data with Spark, Hive, Presto |
| AWS Glue | ETL and data integration |
| Data Wrangler | Visual data preparation |
All accessible from SageMaker Unified Studio.
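Outside the Studio UI, the same engines are reachable through the AWS SDK. The sketch below shows one way to submit a serverless SQL query to Athena with boto3 and wait for it to finish. The database name, S3 output location, and SQL text are hypothetical placeholders; `primary` is assumed as the workgroup, and the helper names are illustrative.

```python
import time


def build_athena_request(sql, database, output_s3, workgroup="primary"):
    """Build keyword arguments for Athena's StartQueryExecution API."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
        "WorkGroup": workgroup,
    }


def run_query(request, poll_seconds=1.0):
    """Submit the query and poll until it reaches a terminal state.
    Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder works without AWS access

    athena = boto3.client("athena")
    query_id = athena.start_query_execution(**request)["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)


# Hypothetical query against a placeholder table.
request = build_athena_request(
    "SELECT region, COUNT(*) AS orders FROM orders GROUP BY region",
    database="sales",
    output_s3="s3://example-bucket/athena-results/",
)
```

Calling `run_query(request)` returns the query ID and final state; results can then be read from the S3 output location or fetched with the Athena `get_query_results` call.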
Data Wrangler
Visual, low-code data preparation:
| Feature | Details |
|---|---|
| Visual interface | Point-and-click transformations |
| 300+ transformations | Built-in data cleaning functions |
| Data quality | Automatic quality analysis |
| ML integration | Export to training pipelines |
| SQL + Python | Custom transformations |
TL;DR
- Unified Studio = Single IDE for data, analytics, and AI
- Notebooks = Serverless, scalable with SQL + Python + natural language
- Data Agent = AI assistant for data analysis ($0.04/credit)
- Lakehouse = Unified S3 + Redshift with Apache Iceberg
- Processing = Athena, EMR, and Glue integrated in one place
- Data Wrangler = Visual, low-code data preparation
Resources
SageMaker Unified Studio
Integrated development environment for all workloads.
SageMaker Lakehouse
Unified data architecture.
SageMaker Data Processing
Data processing capabilities.