
Architecture Documentation

Overview

RustCrawler is designed following Rust best practices with a focus on modularity, testability, and maintainability.

Design Principles

1. Separation of Concerns

  • Client Module: HTTP client configuration isolated from business logic
  • Models Module: Data structures and validation separate from implementation
  • Crawlers Module: Each crawler type has its own module
  • Utils Module: Reusable utility functions for I/O and display

2. Trait-Based Design

The Crawler trait defines a common interface for all crawlers:

pub trait Crawler {
    fn analyze(&self, client: &HttpClient, url: &str) -> Result<CrawlerResults, Box<dyn std::error::Error>>;
    fn name(&self) -> &str;
}

This allows:

  • Easy addition of new crawler types
  • Polymorphic handling of crawlers
  • Consistent API across all crawlers
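The polymorphism point can be sketched with trait objects. This is a minimal, self-contained illustration: the `Crawler` trait here carries only `name()` (the HTTP-related parts are omitted), and the two structs are simplified stand-ins for the project's crawler types.

```rust
// Simplified Crawler trait: name() only, no HTTP client.
trait Crawler {
    fn name(&self) -> &str;
}

struct SeoCrawler;
struct PerformanceCrawler;

impl Crawler for SeoCrawler {
    fn name(&self) -> &str { "SEO Crawler" }
}

impl Crawler for PerformanceCrawler {
    fn name(&self) -> &str { "Performance Crawler" }
}

fn main() {
    // Trait objects let the caller handle any crawler uniformly;
    // adding a new crawler type requires no change to this loop.
    let crawlers: Vec<Box<dyn Crawler>> =
        vec![Box::new(SeoCrawler), Box::new(PerformanceCrawler)];
    for crawler in &crawlers {
        println!("Running {}", crawler.name());
    }
}
```

Because dispatch goes through the trait, `main` (and any selection menu) never needs to know the concrete type behind each `Box<dyn Crawler>`.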

3. Error Handling

  • Uses Result<T, E> for all fallible operations
  • Custom error types where appropriate
  • Clear error messages for user-facing operations
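The `Result`-plus-`?` style used by `Crawler::analyze` can be sketched with a small helper. `extract_title` is a hypothetical function for illustration, not part of the project's API; it shows how `Box<dyn Error>` absorbs string error messages via `ok_or`.

```rust
use std::error::Error;

// Hypothetical helper: pull the <title> text out of an HTML string,
// returning a descriptive error if either tag is missing.
fn extract_title(html: &str) -> Result<String, Box<dyn Error>> {
    let start = html.find("<title>").ok_or("missing <title> tag")? + "<title>".len();
    let end = html[start..].find("</title>").ok_or("missing </title> tag")? + start;
    Ok(html[start..end].trim().to_string())
}

fn main() {
    match extract_title("<title> Example </title>") {
        Ok(title) => println!("title: {title}"),
        Err(e) => eprintln!("error: {e}"),
    }
}
```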

4. Testing Strategy

  • Unit tests in each module
  • Test coverage for:
    • Model validation
    • URL parsing
    • Crawler creation
    • Data structure operations
  • 13 tests currently passing

Module Breakdown

client.rs

Purpose: HTTP client configuration and management

Key Components:

  • HttpClient: Wrapper around reqwest::blocking::Client
  • 30-second timeout configuration
  • Reusable across multiple requests

Tests: Client creation and configuration

models.rs

Purpose: Core data structures and validation

Key Components:

  • UrlInfo: Server response metadata
  • CrawlerResults: Analysis results container
  • CrawlerSelection: User's crawler choices
  • validate_url(): URL validation function

Tests: URL validation, selection logic, results manipulation
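A `validate_url`-style check might look like the following. This is a sketch only; the real function's rules may differ. It accepts http/https URLs with a non-empty host and rejects everything else.

```rust
// Sketch of URL validation: scheme must be http:// or https://,
// and a host part must follow the scheme.
fn validate_url(url: &str) -> Result<(), String> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))
        .ok_or_else(|| format!("URL must start with http:// or https://: {url}"))?;
    if rest.is_empty() || rest.starts_with('/') {
        return Err(format!("URL has no host: {url}"));
    }
    Ok(())
}

fn main() {
    println!("{:?}", validate_url("https://example.com"));
    println!("{:?}", validate_url("ftp://example.com"));
}
```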

utils.rs

Purpose: I/O and display utilities

Key Components:

  • get_url_input(): User input for URL
  • get_yes_no_input(): Boolean prompts
  • display_results(): Formatted output of crawler results

Tests: Input logic verification
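One way to make prompt logic testable without stdin is to split the pure parsing step out of `get_yes_no_input`. The helper name below is illustrative, not the project's actual API.

```rust
// Pure half of a yes/no prompt: map trimmed, case-insensitive
// input to Some(true)/Some(false), or None for unrecognized input
// (the caller would re-prompt on None).
fn parse_yes_no(input: &str) -> Option<bool> {
    match input.trim().to_lowercase().as_str() {
        "y" | "yes" => Some(true),
        "n" | "no" => Some(false),
        _ => None,
    }
}

fn main() {
    println!("{:?}", parse_yes_no(" Y "));
    println!("{:?}", parse_yes_no("maybe"));
}
```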

crawlers/mod.rs

Purpose: Crawler trait and shared functionality

Key Components:

  • Crawler trait definition
  • fetch_page_content(): Common HTTP GET operation
  • fetch_page_with_timing(): GET with performance metrics
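The timing half of `fetch_page_with_timing` can be sketched with `std::time::Instant`. For illustration the fetch is a closure here; the real function performs an HTTP GET through the shared `HttpClient`.

```rust
use std::time::{Duration, Instant};

// Run any operation and return its result together with the
// elapsed wall-clock time.
fn with_timing<T>(fetch: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let result = fetch();
    (result, start.elapsed())
}

fn main() {
    let (value, elapsed) = with_timing(|| 21 * 2);
    println!("got {value} in {elapsed:?}");
}
```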

crawlers/seo.rs

Purpose: SEO analysis implementation

Features:

  • Title tag validation (length, presence)
  • Meta description checking
  • H1 heading verification
  • Canonical URL detection
  • Robots meta tag analysis
  • Internal link validation (up to 10 links)

Tests: Crawler creation, link extraction
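A title check in this style might look like the sketch below. The 10–60 character thresholds are illustrative (common SEO guidance), not necessarily the values the crawler uses.

```rust
// Collect SEO issues for a page title: missing, empty,
// too short, or too long (thresholds are illustrative).
fn check_title(title: Option<&str>) -> Vec<String> {
    let mut issues = Vec::new();
    match title {
        None => issues.push("missing <title> tag".to_string()),
        Some(t) => {
            let len = t.trim().chars().count();
            if len == 0 {
                issues.push("empty <title> tag".to_string());
            } else if len < 10 {
                issues.push(format!("title too short ({len} chars)"));
            } else if len > 60 {
                issues.push(format!("title too long ({len} chars)"));
            }
        }
    }
    issues
}

fn main() {
    println!("{:?}", check_title(Some("Hi")));
    println!("{:?}", check_title(None));
}
```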

crawlers/performance.rs

Purpose: Performance analysis implementation

Features:

  • Response time measurement
  • Compression detection (brotli, gzip, deflate)
  • Page size analysis
  • Script and stylesheet counting

Tests: Crawler creation
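Compression detection reduces to inspecting the `Content-Encoding` response header. A sketch of that check (note the brotli header token is `br`, and the header may list several codings):

```rust
// Return the first recognized coding found in a Content-Encoding
// header value, e.g. Some("br") for "gzip, br"; None if absent
// or unrecognized. The real crawler reads this header from the
// HTTP response.
fn detect_compression(content_encoding: Option<&str>) -> Option<&'static str> {
    let value = content_encoding?.to_lowercase();
    for algo in ["br", "gzip", "deflate"] {
        if value.split(',').any(|v| v.trim() == algo) {
            return Some(algo);
        }
    }
    None
}

fn main() {
    println!("{:?}", detect_compression(Some("gzip, br")));
    println!("{:?}", detect_compression(None));
}
```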

crawlers/a11y.rs

Purpose: Accessibility analysis implementation

Features:

  • HTML lang attribute checking
  • Image alt attribute validation
  • ARIA attribute detection
  • Semantic HTML5 tag verification
  • Form label association
  • Skip navigation link detection

Tests: Crawler creation, alt attribute checking
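An alt-attribute check in the current string-matching style (a proper HTML parser is listed under Future Improvements) might look like this sketch:

```rust
// Count <img ...> tags that lack an alt attribute, using plain
// string matching rather than an HTML parser.
fn imgs_missing_alt(html: &str) -> usize {
    html.match_indices("<img")
        .filter(|(start, _)| {
            // Inspect only this tag's text, up to its closing '>'.
            match html[*start..].find('>') {
                Some(end) => !html[*start..*start + end].contains("alt="),
                None => true, // malformed tag counts as missing
            }
        })
        .count()
}

fn main() {
    let html = r#"<img src="a.png" alt="logo"><img src="b.png">"#;
    println!("{}", imgs_missing_alt(html));
}
```

String matching like this misses edge cases (attributes split across quotes, `alt` appearing in other attribute values), which is exactly why the Future Improvements section proposes the `scraper` crate.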

main.rs

Purpose: Application entry point

Responsibilities:

  • User interaction flow
  • Orchestration of crawlers
  • Display coordination

Design: Thin layer that uses library modules

Extensibility

Adding a New Crawler

  1. Create a new file in src/crawlers/
  2. Implement the Crawler trait
  3. Add module declaration in src/crawlers/mod.rs
  4. Export in src/lib.rs
  5. Update main.rs to include in selection menu

Example:

pub struct SecurityCrawler;

impl Crawler for SecurityCrawler {
    fn name(&self) -> &str {
        "Security Crawler"
    }

    fn analyze(&self, client: &HttpClient, url: &str)
        -> Result<CrawlerResults, Box<dyn std::error::Error>> {
        // Fetch the page, run the security checks, and build the
        // CrawlerResults here.
        todo!()
    }
}

Best Practices Implemented

  1. Rust Naming Conventions: snake_case for functions, PascalCase for types
  2. Documentation: Inline documentation for all public APIs
  3. Module Organization: Clear hierarchy and logical grouping
  4. Error Handling: Proper Result types, no unwrap in production paths
  5. Testing: Unit tests for testable logic
  6. Type Safety: Strong typing, minimal use of String where specific types work
  7. Ownership: Proper use of references vs. owned values
  8. Trait Usage: Polymorphism through traits rather than inheritance

Performance Considerations

  1. HTTP Client Reuse: Single client instance for all requests
  2. Timeout Configuration: 30-second timeout to prevent hangs
  3. Limited Link Checking: Only checks first 10 internal links to avoid excessive requests
  4. Blocking I/O: Uses blocking client for simplicity (async could be added later)

Future Improvements

  1. Async/Await: Convert to async for better concurrency
  2. Parallel Crawling: Run multiple crawlers concurrently
  3. HTML Parsing: Use proper HTML parser (e.g., scraper crate) instead of string matching
  4. Configuration: External config file for timeouts, limits, etc.
  5. Reporting: JSON/HTML output options
  6. Integration Tests: End-to-end tests with mock servers
  7. Error Recovery: Retry logic for transient failures