All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
0.2.6 - 2026-04-21
- Bump
selectors0.36 → 0.37 - Bump
pyo30.28.2 → 0.28.3 - Bump
rustls-webpkito 0.103.12 - Bump
@napi-rs/cli3.6.1 → 3.6.2 - Bump
@emnapi/runtime(Node.js and WASM packages) - Bump
@biomejs/biome(Node.js and WASM packages) - Bump
pytest(Python package) - Bump CI actions:
lewagon/wait-on-check-action1.6.1 → 1.7.0,actions/upload-pages-artifact4 → 5,softprops/action-gh-release2 → 3,pnpm/action-setup5 → 6,dependabot/fetch-metadata2 → 3,actions/deploy-pages4 → 5,codecov/codecov-action5 → 6
0.2.5 - 2026-03-17
- Replace
markup5ever_rcdomwith a customDocBuilderSink(html5ever::TreeSink) that builds the internalDocumentdirectly, eliminating intermediate RcDom allocation and Rc/RefCell overhead
- Bump
html5ever0.38 → 0.39,markup5ever0.38 → 0.39 - Remove
markup5ever_rcdom(unmaintained, incompatible with markup5ever 0.39)
DocBuilderSink: depth tracking correctly updated afterreparent_childrenandinsert_before(foster parenting edge case)DocBuilderSink: text coalescing inappend_before_siblingnow checks the sibling node itself rather than its last child, matching html5ever RcDom semanticsDocBuilderSink:is_mathml_annotation_xml_integration_pointnow respectsElementFlagsinstead of unconditionally returningfalseDocBuilderSink:create_pireturnsPhantomhandle instead of creating an empty text node- Remove redundant
strip = truefrom[tool.maturin]inpyproject.toml(conflicts withmaturin developin maturin >= 1.12) - Add
biome.jsonto exclude.d.tsfiles from biome linting (biome 2.4.7 internal panic on declaration files)
0.2.4 - 2026-02-20
- Bump workspace dependencies (clap, syn, serde, wasm-bindgen, pyo3)
0.2.3 - 2026-02-16
- Bump workspace Rust dependencies (rust-minor group)
- Bump @biomejs/biome in Node.js and WASM packages
0.2.2 - 2026-02-03
- Switch CLI from
native-tlstorustlsfor cross-compilation support (fixes linux-aarch64 build)
0.2.1 - 2026-02-03
-
Phase 18: Streaming & Advanced Performance
- Full lol_html streaming integration with constant O(1) memory usage
- StreamingParser with event-driven architecture for large documents
- Memory-mapped file support via
mmapfeature - Streaming text extraction and element handlers
-
Phase 19: Developer Experience
- Enhanced error tracking and diagnostics
- Improved error messages with source context
- Better panic handling and recovery
-
Phase 20: Benchmark Infrastructure
- Comprehensive benchmark suite with criterion
- Cross-platform performance comparison tools
- Memory profiling infrastructure
-
Documentation
- mdBook-based documentation site
- Getting Started guide
- User Guide with examples
- API reference documentation
- Updated PyO3 to 0.28.0 with
skip_from_py_objectcompatibility fix - Dependabot auto-merge workflow for patch updates
- CI wait-on-check circular dependency issue
- Dependabot workflow auto-approve removal
- pyo3: 0.27.2 → 0.28.0
- bytes: 1.11.0 → 1.11.1
- clap: 4.5.54 → 4.5.56
- actions/checkout: 4 → 6
- actions/cache: 4 → 5
- actions/upload-pages-artifact: 3 → 4
- @biomejs/biome: updated in Node.js and WASM packages
0.2.0 - 2026-01-20
-
Phase 14: Performance Optimization
- SIMD-accelerated class selector matching with automatic platform detection (SSE4.2/AVX2/NEON/SIMD128)
- TagId enum for efficient HTML5 tag interning (113 tags with O(1) lookup)
- DocumentIndex for optimized ID and class-based lookups (O(1) ID, O(k) class)
- Rayon-powered batch parsing via
parse_batch()for parallel document processing - Selector fast-paths for common patterns (tag only, class only, ID only)
- Query optimization with compiled selector caching and pre-compiled selector support
-
Phase 15: Core Utilities Extraction
- New
scrape-core/src/utils.rsmodule with shared HTML escaping utilities - New
scrape-core/src/serialize.rsmodule with centralized HTML serialization escape_text()andescape_attr()with zero-copy Cow optimizationserialize_node()andcollect_text()functions for consistent HTML handling- Zero-copy optimization reduces allocations by 50-70% in typical HTML serialization
- New
-
Phase 16: Trait Abstractions and Iterator Extensions
- HtmlSerializer trait for unified HTML serialization API on Tag type
- ElementFilter iterator extensions (.elements()) for element-only iteration
- 6 new iterator types: ElementChildrenIter, ElementDescendantsIter, ElementAncestorsIter, etc.
- Simplified binding navigation code by 45% per method across all platforms (Python/Node.js/WASM)
- Comprehensive benchmark suite validating zero-overhead abstractions
-
Phase 17: Advanced Type Safety with Typestate Patterns
- DocumentState sealed trait with three lifecycle states: Building, Queryable, Sealed
- Compile-time enforced document lifecycle via PhantomData (zero runtime overhead)
- Type-safe state transitions: Building → Queryable → Sealed
- NodeType sealed trait preventing external implementations with private module pattern
- Marker types (ElementMarker, TextMarker, CommentMarker) for enhanced type safety
- Full backward compatibility via Document type alias (Document = DocumentImpl)
-
Breaking Changes (Internal API)
- Document is now DocumentImpl with generic state parameter
- Parser APIs now use DocumentImpl during construction
- Public Document type alias maintains backward compatibility
- New types exported: DocumentImpl, DocumentState, Building, Queryable, Sealed, NodeType, markers
-
Performance improvements across all binding libraries (Python/Node.js/WASM)
-
Eliminated 308 lines of duplicated code across bindings
-
More efficient DOM traversal with cached selector state
-
Reduced memory allocations in text extraction and HTML serialization
- Resolved 5 FIXME/TODO markers related to code duplication and unfinished implementations
- Fixed inconsistent text escaping patterns between core and bindings
- Improved consistency in HTML serialization across all platforms
- Zero unsafe code in Phase 17 implementation
- All sealed traits prevent unintended implementations
- Type-safe state transitions prevent invalid operations at compile time
- Query Performance: SIMD-accelerated class matching (2-10x improvement on large documents)
- Memory: 50-70% reduction in serialization allocations via Cow
- Parallel: Batch parsing scales near-linearly with thread count (Rayon)
- Zero-cost abstractions: Phase 16-17 traits generate identical code to manual implementations
- 1,121 comprehensive tests passing across all platforms
- 506 Rust unit tests + 116 doctests
- 180 Python tests, 201 Node.js tests, 118 WASM tests
- 100% coverage on all new modules (utils.rs, serialize.rs, state.rs, node_type.rs)
- Compile-time type safety verified via sealed traits and typestate patterns
0.1.6 - 2026-01-16
- Use
--ignore-scriptsfor npm main package publish (platform packages already published in loop)
0.1.5 - 2026-01-16
- Fixed
contents: writepermission for npm publish job (napi prepublish needs to create releases)
0.1.4 - 2026-01-16
- Fixed npm auth for platform packages publishing (explicit .npmrc setup)
0.1.3 - 2026-01-16
- Added
GITHUB_TOKENfor napi prepublish in npm release workflow
0.1.2 - 2026-01-16
- Fixed doctests using private
dommodule path - Fixed pnpm lockfile compatibility in CI (upgraded to pnpm 10)
- Fixed maturin cross-compilation with
--find-interpreterflag - Disabled
linux-arm64-muslNode.js build temporarily (requires proper cross-compiler)
0.1.1 - 2026-01-16
- CLI tool (
scrape-cli): Command-line HTML extraction tool- CSS selector-based extraction with
-s/--selector - Multiple output formats: text, JSON, CSV, HTML (
-f/--format) - Batch processing with parallel execution via Rayon
- Named extractions support (
-n/--named) - Attribute extraction (
-a/--attrs) - HTML fragment output (
--html) - File and stdin input support
- CSS selector-based extraction with
- CLI binary builds for 5 platforms in GitHub releases (linux-x64, linux-arm64, darwin-x64, darwin-arm64, windows-x64)
- Renamed Python package from
scrape-rstofast-scrapeon PyPI - Renamed Node.js package from
scrape-rsto@fast-scrape/nodeon npm - Renamed WASM package from
@scrape-rs/wasmto@fast-scrape/wasmon npm
- HTML comment escaping in CLI output to prevent injection via filenames
0.1.0 - 2026-01-16
- Core HTML parsing engine based on html5ever for spec-compliant parsing
- Arena-based DOM tree for memory efficiency and cache-friendly traversal
- CSS selector support via
selectorscrate with full CSS3 selector syntax - SIMD-accelerated byte scanning with memchr (SSE4.2/AVX2 on x86_64, NEON on ARM64)
- Python bindings via PyO3 (PyPI:
scrape-rs) - Node.js bindings via napi-rs (npm:
scrape-rs) - WASM bindings for browsers (npm:
@scrape-rs/wasm) - Cross-platform API consistency across all bindings
- BeautifulSoup-compatible API with
SoupandTagtypes find()andfind_all()methods for element search- Attribute access and text extraction utilities
- Parallel processing support via Rayon (native platforms only)
- Comprehensive test suite with shared test cases across bindings
- Fuzzing targets for parser and selector components
- Benchmarks comparing against BeautifulSoup and Cheerio
- 10-50x faster than BeautifulSoup for parsing and queries
- Zero-copy DOM navigation
- WASM bundle under 500KB gzipped