This PR prepares the repository for a significant size reduction through Git history cleanup. The main cleanup will be performed separately as it requires force-pushing and coordinated migration.
The aiter repository currently contains 547 MB of data, with 420 MB in Git history alone. Analysis shows that the majority of this space is occupied by files that were previously committed but have since been removed:
- 288 MB: PyTorch tensor test data (
*.ptfiles) - 47 MB: Large CSV benchmark results
- 50+ MB: Compilation artifacts (
*.attfiles) and temporary outputs - 27 MB: Debug logs and Jupyter notebook outputs
These files remain in Git history even though they're no longer in the current codebase.
This PR introduces safeguards to prevent large files from being committed in the future:
- Updated
.gitignore: Explicitly excludes large test data, build artifacts, and temporary files - Pre-commit hook: Automatically rejects commits containing files larger than 5MB
- Test data script: Provides a framework for external test data management
- Documentation: Explains the cleanup plan and migration process
After this PR is merged, a separate history cleanup will be performed using git-filter-repo:
- Expected size reduction: 547 MB → ~230 MB (58% reduction)
- Git history reduction: 420 MB → ~105 MB (75% reduction)
- Benefits:
- Faster clones (60% faster)
- Reduced CI/CD checkout time
- Lower storage and bandwidth costs
Added exclusions for:
- Large model files (
*.pt,*.pth,*.bin,*.ckpt) - Test data directories (
op_tests/test_jenga_vsa/,op_tests/dump_data/) - Build artifacts (
OUT_FOLDER/,*.att) - Large CSV results
- Temporary files and debug logs
The .gitignore additions help prevent large files from being committed. While a pre-commit hook for size enforcement would be beneficial, it is not included in this PR to avoid conflicts with the existing formatting hook in .githooks/pre-commit.
Recommendation: Consider adding size checks to the existing pre-commit hook in a future update, or rely on GitHub's large file warnings during push.
To enable:
- Recommended (matches existing workflow):
bash ./.githooks/install - Alternative (global config-based):
git config core.hooksPath .githooks
scripts/download_test_data.sh provides a template for:
- Downloading test data from external storage
- Documenting what data was moved
- Helping developers who need the full test suite
Note: Storage location needs to be configured based on your infrastructure (S3, GCS, etc.)
This document explains the cleanup plan and provides guidance for contributors.
No immediate impact. This PR only adds protective measures. The actual history cleanup will happen later with advance notice.
After the history cleanup:
- Initial clone will be faster (60% reduction in download time)
- Disk space saved: 315 MB per clone
- Large files will be prevented by the pre-commit hook
When ready to execute the history cleanup:
- Notification: Send advance notice to all contributors (1-2 weeks)
- Maintenance window: Schedule 2-4 hour window for cleanup
- Backup: Create full repository backup
- Execute cleanup: Run
git-filter-repoto remove large files from history - Force push: Update remote repository (requires Admin permissions)
- Team migration: All contributors re-clone the repository
After the cleanup is complete:
# 1. Save your local changes to patch files
git diff > ~/aiter-changes.patch
git diff --staged > ~/aiter-staged.patch
# 2. Move old repository aside (don't delete yet)
cd ..
mv aiter aiter-old
# 3. Clone the cleaned repository
git clone https://github.com/rocm/aiter.git
cd aiter
# 4. Apply your saved changes
if [ -s ~/aiter-changes.patch ]; then
git apply ~/aiter-changes.patch
fi
if [ -s ~/aiter-staged.patch ]; then
git apply ~/aiter-staged.patch
git add -A
fi
# 5. Enable pre-commit hook (if using existing method)
bash ./.githooks/install
# 6. After verifying everything works, delete old repository
# rm -rf ../aiter-oldCleanup has been tested on a repository copy with the following results:
- Original size: 547 MB
- After cleanup: 232 MB
- Space saved: 315 MB (57% reduction)
- Execution time: 9 seconds
- Commits preserved: All 1,331 commits intact
- Functionality: No code changes, history remains traceable
- Git LFS: Adds complexity and hosting costs. Most large files are no longer needed.
- BFG Repo-Cleaner: Simpler but less flexible than
git-filter-repo - Manual deletion: Only affects future commits, doesn't reduce repository size
| Risk | Impact | Mitigation |
|---|---|---|
| Force-push coordination | High | Advance notice, clear migration docs |
| Data loss | High | Full backup, testing on copy first |
| PR conflicts | Medium | Schedule during low-activity period |
| CI/CD disruption | Medium | Update CI configs beforehand |
Please comment on this PR if you have:
- Questions about the cleanup process
- Concerns about data preservation
- Suggestions for the migration plan
- Need access to any of the removed test data
- Test results: Detailed in the "Testing" section above
- Files to be removed: Listed in the "Problem Statement" section (*.pt files, CSV benchmarks, *.att artifacts, debug logs)
- Tool documentation: git-filter-repo