Add DataZone catalog import/export spec (requirements, design, tasks)#60

Merged
vasu2856 merged 39 commits into aws:main from vasu2856:feature/catalog_import_export
Apr 1, 2026

Conversation

@vasu2856
Contributor

Description of changes:

Specs to add support for Catalog resources: assets, asset types, form types, glossary terms, and glossaries. The spec also highlights the special handling of the schedule asset.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@vasu2856 vasu2856 requested review from Shnekit and abaror75 February 25, 2026 20:35
@abaror75
Contributor

SageMakerUnifiedStudioScheduleAssetType: that is perhaps the least interesting part when you talk about Catalog. Hide this as an implementation detail.

@abaror75
Contributor

Let's create the user manual first. I think requirement 1 is not that critical. It's more important to allow the user to define a set of data products, or a set of assets, and have the tool find all the connected entities (BG, FormTypes) and export them first, so the import will work with all dependent objects no matter what. Check Quick import to see how they did it.

Start with the simplest user experience. Also, I would like to see an example of how we get the timestamp to put into the CLI when I use a GitHub pipeline. How will we figure this out? Should we add tags to Git when doing the import? Let's figure out the entire user experience here.

- Add comprehensive catalog import/export guide with step-by-step instructions
- Add quick reference guide for common catalog operations
- Refine design.md to clarify manifest schema with separate subsections for assets, glossaries, data products, and metadata forms
- Remove schedule asset special handling from design (deferred to future implementation)
- Update architecture diagrams to reflect new manifest configuration structure
- Clarify CatalogExporter API routing for data products and metadata forms
- Improve docstring documentation for export_catalog function with parameter descriptions
names: # optional, defaults to all data products
- "Sales Analytics Product"
- "Customer Insights Product"
updatedAfter: "2025-01-01T00:00:00Z" # optional ISO 8601 filter
Contributor

Hmmm, wondering if we can read this dynamically from the previously created bundle date. I think this is what customers are going to do manually anyway

Contributor Author

Yup, I was exploring ways to do it. There are scripts you can run as part of a GitHub workflow, but since we don't manage the state, I don't think we can do it on the CLI directly.

Contributor

We can. I think I am going to add bundle versions deployed to targets S3 as part of rollback support :)

@abaror75
Contributor

  1. Focus on Assets and Data Products. Explain dependent entities as carried along.
  2. Discuss ownership of assets for the project, and that only owned assets can be exported and imported.
    The manifest focuses too much on EntityTypes. I think this can be one filter; why replicate? Why don't I have an asset filter by name or by physical type (Glue table, model)?

@abaror75
Contributor

  1. The manifest should not contain the updatedAfter date. That should be an option; otherwise the manifest needs to keep changing.

@abaror75
Contributor

Assets (reference AssetTypes)
Glossaries (no dependencies)
GlossaryTerms (reference Glossaries)
Data Products (may reference Assets)

How can you do Assets before Glossaries?
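The ordering concern above can be sketched as a topological sort over resource-type dependencies. This is only an illustration, not the spec's implementation: the edges from Assets to GlossaryTerms and from AssetTypes to FormTypes are assumptions, and the names `DEPENDENCIES`, `creation_order`, and `deletion_order` are hypothetical.

```python
from graphlib import TopologicalSorter

# Each key lists the resource types it references.
# Assumed edges (not stated in the thread): Assets -> GlossaryTerms,
# AssetTypes -> FormTypes. The remaining edges come from the quoted list.
DEPENDENCIES = {
    "Glossaries": set(),
    "GlossaryTerms": {"Glossaries"},
    "FormTypes": set(),
    "AssetTypes": {"FormTypes"},
    "Assets": {"AssetTypes", "GlossaryTerms"},
    "DataProducts": {"Assets"},
}


def creation_order(dependencies):
    """Return resource types so every type follows the types it references."""
    return list(TopologicalSorter(dependencies).static_order())


def deletion_order(dependencies):
    """Deletion runs in reverse so dependents are removed first."""
    return list(reversed(creation_order(dependencies)))


order = creation_order(DEPENDENCIES)
print(order)
```

With these edges, Glossaries and GlossaryTerms always precede Assets, and Data Products come last on creation and first on deletion.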

- formTypes
- assetTypes
- assets
updatedAfter: "2025-01-01T00:00:00Z" # optional ISO 8601 filter

What's the use case of updatedAfter? When is it updated?

Contributor Author

It's used to limit the scope of the export to what has changed since the last deployment. Omitting it will pull everything in the source project.
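A minimal sketch of the behaviour described in this reply, assuming each exported resource carries an ISO 8601 modification timestamp. The field name `updatedAt` and the helper names are hypothetical and not taken from the spec.

```python
from datetime import datetime, timezone


def parse_iso8601(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    # Treat naive timestamps as UTC so comparisons never mix naive and aware values.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def filter_updated_after(resources, updated_after=None):
    """Keep resources modified strictly after the cutoff; no cutoff keeps everything."""
    if updated_after is None:
        return list(resources)
    cutoff = parse_iso8601(updated_after)
    return [r for r in resources if parse_iso8601(r["updatedAt"]) > cutoff]


resources = [
    {"name": "orders", "updatedAt": "2024-12-15T08:00:00Z"},
    {"name": "customers", "updatedAt": "2025-02-01T12:30:00Z"},
]
changed = filter_updated_after(resources, "2025-01-01T00:00:00Z")
```

Calling `filter_updated_after(resources)` without a cutoff returns every resource, which mirrors the "omitting it will pull everything" behaviour.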

@abaror75
Contributor

Can I do a dry run to make sure everything that needs to be in the target is there before I make changes? I don't want a partial import failure.

@abaror75
Contributor

I think we need two filters: a name filter and an EntityType filter, default:

@abaror75
Contributor

We already discussed having the deploy add a Git tag to resources. Maybe along those lines, we add a Git tag during the bundle (generate a bundle ID and store it), then use that for the deploy. Or the user provides these timestamps on the command line, but again, I wonder how this works. The doc does not describe it.

@abaror75
Contributor

Publish Listing feature

vasu2856 and others added 7 commits March 3, 2026 15:31
- Simplify manifest schema to use single `enabled` boolean instead of granular resource type filters
- Add `publish` flag to automatically publish assets and data products during deployment
- Update CatalogExporter to export ALL project-owned resources with optional `--updated-after` filtering
- Enhance IdentifierMapper to use externalIdentifier with normalization as primary lookup, falling back to name-based matching
- Add support for asset and data product publishing in CatalogImporter
- Clarify dependency ordering and include delete operations in import summary
- Update architecture documentation to reflect simplified configuration approach
- Streamline design diagrams to show complete resource flow and identifier mapping strategy
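The identifier mapping strategy in this commit (normalized `externalIdentifier` first, name-based matching as fallback) might look roughly like the sketch below. It assumes ARN-shaped identifiers and blanks the region and account fields; the actual normalization rules, the resource dictionary shape, and the function names here are assumptions for illustration only.

```python
def normalize_external_identifier(identifier):
    """Blank the region and account fields of an ARN-shaped identifier.

    ARNs follow arn:partition:service:region:account-id:resource, so fields
    3 and 4 are the environment-specific parts. Other strings pass through.
    """
    parts = identifier.split(":", 5)
    if len(parts) == 6 and parts[0] == "arn":
        parts[3] = ""  # region
        parts[4] = ""  # account ID
        return ":".join(parts)
    return identifier


def build_identifier_map(source_resources, target_resources):
    """Map source IDs to target IDs: externalIdentifier first, then name."""
    by_external = {
        normalize_external_identifier(t["externalIdentifier"]): t["id"]
        for t in target_resources
        if t.get("externalIdentifier")
    }
    by_name = {t["name"]: t["id"] for t in target_resources}

    mapping = {}
    for source in source_resources:
        external = source.get("externalIdentifier")
        key = normalize_external_identifier(external) if external else None
        target_id = by_external.get(key) if key else None
        if target_id is None:
            target_id = by_name.get(source["name"])  # name-based fallback
        if target_id is not None:
            mapping[source["id"]] = target_id
    return mapping


source = [
    {"id": "s1", "name": "orders",
     "externalIdentifier": "arn:aws:glue:us-east-1:111111111111:table/sales/orders"},
    {"id": "s2", "name": "Revenue glossary"},
]
target = [
    {"id": "t1", "name": "orders_renamed",
     "externalIdentifier": "arn:aws:glue:eu-west-1:222222222222:table/sales/orders"},
    {"id": "t2", "name": "Revenue glossary"},
]
mapping = build_identifier_map(source, target)
```

Source resources that match nothing in the target are left out of the mapping, which a caller could treat as "create new".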
… support

- Expand Multi-Environment section to explain independent project/domain targeting per stage
- Update architecture diagram to show optional separate domains for dev/test/prod stages
- Add new "Multi-Domain and Multi-Project Architecture" section with use cases
- Include configuration example showing domain_id per stage
- Document use cases: organizational boundaries, compliance, multi-tenant, cross-account
- Update DataZone Helper documentation to reflect multi-domain support
- Enhance sequence and flow diagrams to show domain resolution per stage
- Clarify that each deployment stage can target independent projects in independent domains
- Add multi-domain configuration section with YAML examples
- Improve validation phase documentation to include multi-domain verification
… filtering

- Add .config.kiro file to establish spec metadata and workflow type
- Clarify that manifest contains NO filter options — only enabled, publish, and assets.access
- Document that --updated-after is a CLI-only flag on bundle command, not a manifest field
- Update design diagrams to show CLI flag as separate input to CatalogExporter
- Emphasize uniform filtering across ALL resource types via CLI timestamp
- Refine CatalogExporter docstring to clarify filter source and scope
- Update internal helper documentation to note filters come from CLI only
- Improve quick reference and guide documentation for clarity on filtering behavior
… graph and simplify tasks

- Update dependency graph to include Data Products as final resource type
- Revise creation order to place Data Products after Assets
- Revise deletion order to place Data Products first (reverse dependency)
- Add clarification that Data Products reference Assets
- Consolidate and simplify task descriptions for catalog export/import implementation
- Add new examples directory with README for catalog import/export workflows
- Update quick reference guide with streamlined information
- Reflect simplified manifest schema with only enabled, publish, and assets.access fields
…CI/CD integration

- Add CatalogExporter helper to query and serialize DataZone resources (Glossaries, GlossaryTerms, FormTypes, AssetTypes, Assets, Data Products)
- Add CatalogImporter helper to import and optionally publish exported catalog resources with identifier mapping
- Extend application manifest schema with catalog configuration (enabled, skipPublish, assets.access)
- Add --updated-after CLI flag to filter exported resources by modification timestamp
- Integrate catalog export into bundle command and import into deploy command
- Add catalog-import-export GitHub Actions workflow for automated deployment
- Add comprehensive integration tests for export, import, and round-trip scenarios
- Add unit tests for catalog helpers and manifest configuration
- Add example manifest and seed data script for catalog import/export demonstration
- Update documentation with catalog import/export guide and quick reference
- Preserve source publish state (listingStatus) during export and conditionally republish on import
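The conditional republish described in the last item could be sketched as follows. This assumes `skipPublish` disables publishing entirely and that a published resource carries `listingStatus` equal to `"ACTIVE"` (the value a later commit in this PR settles on); the function name and resource shape are hypothetical.

```python
PUBLISHED_STATUS = "ACTIVE"  # a later commit in this PR corrects "LISTED" to "ACTIVE"


def should_publish(resource, skip_publish=False):
    """Republish only resources that were published in the source project."""
    if skip_publish:
        return False
    return resource.get("listingStatus") == PUBLISHED_STATUS


exported = [
    {"name": "orders", "listingStatus": "ACTIVE"},
    {"name": "drafts"},  # never published in the source project
]
to_publish = [r["name"] for r in exported if should_publish(r)]
```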
@vasu2856 vasu2856 requested review from DaviesHuang and Shnekit March 17, 2026 22:09
…ith API filtering and edge case handling

- Update design documentation to clarify Search API ownership filtering and SearchTypes API client-side filtering requirements
- Document get_asset API enrichment for full asset details including formsOutput
- Correct listingStatus value from "LISTED" to "ACTIVE" for published state detection
- Add comprehensive testing guide covering export, import, and round-trip scenarios
- Expand integration tests with edge case coverage including disabled catalog and skip-publish manifests
- Add sample test fixtures (connections, workflows, code) for integration test scenarios
- Enhance unit tests for catalog export/import properties and DataZone property handling
- Update CLI, bundle, and deploy commands to support refined catalog operations
- Improve catalog export and import helper implementations with better error handling and filtering logic
- Update example documentation and seed data scripts with latest catalog patterns
…orm normalization specs

- Add `_resolve_target_data_source()` helper to match data sources by type and database name with fallback priority
- Add `_normalize_forms_input_for_api()` helper to remap form identifiers and data source references for target domain
- Add Requirement 5.15 for DataSourceReferenceForm remapping during import with database name extraction from GlueTableForm
- Add Property 18 validation for data source remapping with matching priority and fallback behavior
- Add edge case handling for missing data sources and JSON parse failures in error scenarios table
- Update task requirements list to include Requirement 5.15
- Update multilingual README translations (fr, he, it, ja, pt, zh) to reflect new functionality
- Update catalog import/export guides with data source remapping documentation
- Implement form normalization in `catalog_import.py` and deploy command integration
- Add "Back to Main README" navigation link to French README
- Add "Back to Main README" navigation link to Hebrew README
- Add "Back to Main README" navigation link to Italian README
- Add "Back to Main README" navigation link to Japanese README
- Add "Back to Main README" navigation link to Chinese README
- Improves navigation between main and translated documentation pages
…utility

- Rename _check_import_permissions to _ensure_import_permissions to reflect new behavior
- Add _POLICY_DETAIL_KEY mapping for policy type to detail key conversion
- Implement automatic policy grant creation via add_policy_grant when grants are missing
- Update permission check logic to attempt adding missing grants before failing
- Change return value from missing grants list to failed grants list
- Add comprehensive logging for grant checking and addition attempts
- Create cleanup_catalog_resources.py integration test utility to remove project-owned resources
- Update error messaging to clarify that grants are added automatically when possible
- Improve docstrings to document the new auto-grant behavior
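The auto-grant flow in this commit (check grants, try to add the missing ones, return only those that still fail) can be approximated as below. This is a sketch under assumptions: `add_grant` is a hypothetical callable that would wrap the real grant call, and the policy type strings are illustrative.

```python
import logging

logger = logging.getLogger(__name__)


def ensure_import_permissions(required, existing, add_grant):
    """Try to add each missing grant; return the grants that could not be added."""
    failed = []
    for policy_type in required:
        if policy_type in existing:
            logger.info("Grant %s already present", policy_type)
            continue
        try:
            add_grant(policy_type)
            logger.info("Added missing grant %s", policy_type)
        except Exception as error:  # broad on purpose: any failure marks the grant as failed
            logger.warning("Could not add grant %s: %s", policy_type, error)
            failed.append(policy_type)
    return failed


added = []


def fake_add_grant(policy_type):
    """Stand-in for the real grant call; rejects one policy type."""
    if policy_type == "CREATE_ASSET_TYPE":
        raise PermissionError("denied")
    added.append(policy_type)


failed = ensure_import_permissions(
    ["CREATE_GLOSSARY", "CREATE_FORM_TYPE", "CREATE_ASSET_TYPE"],
    {"CREATE_GLOSSARY"},
    fake_add_grant,
)
```

Returning the failed grants rather than the missing ones lets the caller fail only when an automatic fix was attempted and did not work.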
Comment thread tests/integration/catalog-import-export/test_catalog_edge_cases.py Outdated
Contributor

@Shnekit Shnekit left a comment

We can sync to review the comments. There are some very broad ones I posted which I need clarification on

dependabot Bot and others added 24 commits March 24, 2026 16:46
Bumps [ujson](https://github.com/ultrajson/ultrajson) from 5.11.0 to 5.12.0.
- [Release notes](https://github.com/ultrajson/ultrajson/releases)
- [Commits](ultrajson/ultrajson@5.11.0...5.12.0)

---
updated-dependencies:
- dependency-name: ujson
  dependency-version: 5.12.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [pypdf](https://github.com/py-pdf/pypdf) from 6.8.0 to 6.9.1.
- [Release notes](https://github.com/py-pdf/pypdf/releases)
- [Changelog](https://github.com/py-pdf/pypdf/blob/main/CHANGELOG.md)
- [Commits](py-pdf/pypdf@6.8.0...6.9.1)

---
updated-dependencies:
- dependency-name: pypdf
  dependency-version: 6.9.1
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
- Update package name in pyproject.toml, setup.cfg, setup.py
- Update console_scripts entry point to aws-smus-cicd-cli
- Update all CLI command references across docs, examples, workflows, git-templates and source files
- Add 'aws-smus-cicd-cli version' subcommand
- Add --version/-v flags via app.callback()
- Use importlib.metadata as single source of truth for version
- Remove hardcoded version from setup.py and setup.cfg
- pyproject.toml remains sole version source (patched by publish-pypi.yml on release)
- Update README and docs to use 'pip install aws-smus-cicd-cli' instead of git clone
- Keep git clone in quickstart.md only where examples are needed
- Sync lang docs (fr, he, it, ja, pt, zh) with main README changes
- Revert mlflow==3.8.0rc0 to mlflow==3.1.4 in ml/training requirements
  (3.8.0rc0 requires Python >=3.10, sklearn training container uses 3.9)
- Add .github/dependabot.yml to ignore mlflow updates in ml/training and
  data-notebooks directories to prevent future breakage
- Add Portuguese (pt) to translate-readme-chunked.py LANGUAGES dict
  (was missing, causing pt/README.md to never be updated)
- Preserve back link in translated READMEs so CI validate-docs check passes
- Update package name in pyproject.toml, setup.cfg, setup.py
- Update console_scripts entry point to aws-smus-cicd-cli
- Update all CLI command references across docs, examples, workflows, git-templates and source files
- Add 'aws-smus-cicd-cli version' subcommand
- Add --version/-v flags via app.callback()
- Use importlib.metadata as single source of truth for version
- Remove hardcoded version from setup.py and setup.cfg
- pyproject.toml remains sole version source (patched by publish-pypi.yml on release)
- Update README and docs to use 'pip install aws-smus-cicd-cli' instead of git clone
- Keep git clone in quickstart.md only where examples are needed
- Sync lang docs (fr, he, it, ja, pt, zh) with main README changes
…lify filtering

- Remove --updated-after CLI flag from bundle command and workflow documentation
- Simplify CatalogExporter to export all catalog resources without timestamp filtering
- Update design documentation to reflect removal of optional filtering capability
- Remove seed_catalog_data.py helper script from examples
- Remove quick-reference documentation file
- Consolidate catalog import/export documentation into single guide
- Add catalog_test_helpers.py for shared test utilities
- Update all integration and unit tests to work without --updated-after filtering
- Simplify manifest schema to contain only enabled, skipPublish, and assets.access configuration
- Update CLI and bundle command implementations to remove filtering logic
- Filtering is now handled entirely during import phase via identifier mapping
- Reformat assert statements to place condition on separate line
- Consolidate multi-line function calls to single line where appropriate
- Add blank lines between logical sections in test helpers
- Improve code readability and maintain consistent formatting across test and helper files
@vasu2856 vasu2856 requested a review from Shnekit April 1, 2026 17:30

## Key Assumption: Physical Resources Must Have the Same Name

> ⚠️ **IMPORTANT**: Catalog import/export assumes that the underlying physical resources (e.g., Glue Tables, Glue Databases, S3 buckets) referenced by your catalog assets **have the same name in both source and target environments**. The import process matches assets across environments using normalized `externalIdentifier` values or resource names. If the physical resource names differ between environments (e.g., `my-table-dev` vs `my-table-prod`), the matching will fail and resources will be created as new rather than updated. **Additionally, any resources that exist in the target project but are not present in the bundle will be deleted during import.** This means mismatched names can cause the original target resource to be deleted and a duplicate to be created. Ensure your infrastructure provisioning uses consistent resource names across stages, or that your naming conventions allow the normalization logic (which strips AWS account IDs and region strings) to produce matching identifiers.
Contributor

If the logic that deletes resources not in the bundle is going into the Dry-Run PR, let's make sure we update the docs and comments as well, like here.
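To make the hazard in the quoted warning concrete, here is a small sketch of how an import plan might be computed from matching keys. The function name `plan_import` and the use of plain resource names as keys are assumptions; a dry run could print such a plan before any change is applied.

```python
def plan_import(bundle_keys, target_keys):
    """Classify matching keys into create, update, and delete sets."""
    bundle, target = set(bundle_keys), set(target_keys)
    return {
        "create": sorted(bundle - target),  # in the bundle, missing from the target
        "update": sorted(bundle & target),  # matched in both
        "delete": sorted(target - bundle),  # in the target, absent from the bundle
    }


# Mismatched physical names: the dev and prod tables never match.
plan = plan_import(
    bundle_keys=["my-table-dev", "customers"],
    target_keys=["my-table-prod", "customers"],
)
print(plan)
```

Here `my-table-prod` lands in the delete set while `my-table-dev` is created as a duplicate, which is the outcome the warning describes.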

Contributor

@Shnekit Shnekit left a comment

Resolved all the blocking comments. Looks good now. Let's keep track of everything that is supposed to be in dry-run PR related to this.
Otherwise, LGTM

@vasu2856 vasu2856 merged commit 8bb581f into aws:main Apr 1, 2026
3 checks passed

5 participants