
Architecture Documentation

Overview

RustCrawler is designed following Rust best practices with a focus on modularity, testability, and maintainability.

Design Principles

1. Separation of Concerns

  • Client Module: HTTP client configuration isolated from business logic
  • Models Module: Data structures and validation separate from implementation
  • Crawlers Module: Each crawler type has its own module
  • Utils Module: Reusable utility functions for I/O and display

2. Trait-Based Design

The Crawler trait defines a common interface for all crawlers:

pub trait Crawler {
    fn analyze(&self, client: &HttpClient, url: &str) -> Result<CrawlerResults, Box<dyn std::error::Error>>;
    fn name(&self) -> &str;
}

This allows:

  • Easy addition of new crawler types
  • Polymorphic handling of crawlers
  • Consistent API across all crawlers
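The polymorphism point can be sketched with trait objects. This is a minimal, self-contained illustration: the `Crawler` trait here carries only `name()` (the HTTP-related parts are omitted), and the two structs are simplified stand-ins for the project's crawler types.

```rust
// Simplified Crawler trait: name() only, no HTTP client.
trait Crawler {
    fn name(&self) -> &str;
}

struct SeoCrawler;
struct PerformanceCrawler;

impl Crawler for SeoCrawler {
    fn name(&self) -> &str { "SEO Crawler" }
}

impl Crawler for PerformanceCrawler {
    fn name(&self) -> &str { "Performance Crawler" }
}

fn main() {
    // Trait objects let the caller handle any crawler uniformly;
    // adding a new crawler type requires no change to this loop.
    let crawlers: Vec<Box<dyn Crawler>> =
        vec![Box::new(SeoCrawler), Box::new(PerformanceCrawler)];
    for crawler in &crawlers {
        println!("Running {}", crawler.name());
    }
}
```

Because dispatch goes through the trait, `main` (and any selection menu) never needs to know the concrete type behind each `Box<dyn Crawler>`.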

3. Error Handling

  • Uses Result<T, E> for all fallible operations
  • Custom error types where appropriate
  • Clear error messages for user-facing operations
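The `Result`-plus-`?` style used by `Crawler::analyze` can be sketched with a small helper. `extract_title` is a hypothetical function for illustration, not part of the project's API; it shows how `Box<dyn Error>` absorbs string error messages via `ok_or`.

```rust
use std::error::Error;

// Hypothetical helper: pull the <title> text out of an HTML string,
// returning a descriptive error if either tag is missing.
fn extract_title(html: &str) -> Result<String, Box<dyn Error>> {
    let start = html.find("<title>").ok_or("missing <title> tag")? + "<title>".len();
    let end = html[start..].find("</title>").ok_or("missing </title> tag")? + start;
    Ok(html[start..end].trim().to_string())
}

fn main() {
    match extract_title("<title> Example </title>") {
        Ok(title) => println!("title: {title}"),
        Err(e) => eprintln!("error: {e}"),
    }
}
```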

4. Testing Strategy

  • Unit tests in each module
  • Test coverage for:
    • Model validation
    • URL parsing
    • Crawler creation
    • Data structure operations
  • 13 tests currently passing

Module Breakdown

client.rs

Purpose: HTTP client configuration and management

Key Components:

  • HttpClient: Wrapper around reqwest::blocking::Client
  • 30-second timeout configuration
  • Reusable across multiple requests

Tests: Client creation and configuration

models.rs

Purpose: Core data structures and validation

Key Components:

  • UrlInfo: Server response metadata
  • CrawlerResults: Analysis results container
  • CrawlerSelection: User's crawler choices
  • validate_url(): URL validation function

Tests: URL validation, selection logic, results manipulation
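A `validate_url`-style check might look like the following. This is a sketch only; the real function's rules may differ. It accepts http/https URLs with a non-empty host and rejects everything else.

```rust
// Sketch of URL validation: scheme must be http:// or https://,
// and a host part must follow the scheme.
fn validate_url(url: &str) -> Result<(), String> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))
        .ok_or_else(|| format!("URL must start with http:// or https://: {url}"))?;
    if rest.is_empty() || rest.starts_with('/') {
        return Err(format!("URL has no host: {url}"));
    }
    Ok(())
}

fn main() {
    println!("{:?}", validate_url("https://example.com"));
    println!("{:?}", validate_url("ftp://example.com"));
}
```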

utils.rs

Purpose: I/O and display utilities

Key Components:

  • get_url_input(): User input for URL
  • get_yes_no_input(): Boolean prompts
  • display_results(): Formatted output of crawler results

Tests: Input logic verification
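One way to make prompt logic testable without stdin is to split the pure parsing step out of `get_yes_no_input`. The helper name below is illustrative, not the project's actual API.

```rust
// Pure half of a yes/no prompt: map trimmed, case-insensitive
// input to Some(true)/Some(false), or None for unrecognized input
// (the caller would re-prompt on None).
fn parse_yes_no(input: &str) -> Option<bool> {
    match input.trim().to_lowercase().as_str() {
        "y" | "yes" => Some(true),
        "n" | "no" => Some(false),
        _ => None,
    }
}

fn main() {
    println!("{:?}", parse_yes_no(" Y "));
    println!("{:?}", parse_yes_no("maybe"));
}
```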

crawlers/mod.rs

Purpose: Crawler trait and shared functionality

Key Components:

  • Crawler trait definition
  • fetch_page_content(): Common HTTP GET operation
  • fetch_page_with_timing(): GET with performance metrics
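The timing half of `fetch_page_with_timing` can be sketched with `std::time::Instant`. For illustration the fetch is a closure here; the real function performs an HTTP GET through the shared `HttpClient`.

```rust
use std::time::{Duration, Instant};

// Run any operation and return its result together with the
// elapsed wall-clock time.
fn with_timing<T>(fetch: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let result = fetch();
    (result, start.elapsed())
}

fn main() {
    let (value, elapsed) = with_timing(|| 21 * 2);
    println!("got {value} in {elapsed:?}");
}
```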

crawlers/seo.rs

Purpose: SEO analysis implementation

Features:

  • Title tag validation (length, presence)
  • Meta description checking
  • H1 heading verification
  • Canonical URL detection
  • Robots meta tag analysis
  • Internal link validation (up to 10 links)

Tests: Crawler creation, link extraction
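A title check in this style might look like the sketch below. The 10–60 character thresholds are illustrative (common SEO guidance), not necessarily the values the crawler uses.

```rust
// Collect SEO issues for a page title: missing, empty,
// too short, or too long (thresholds are illustrative).
fn check_title(title: Option<&str>) -> Vec<String> {
    let mut issues = Vec::new();
    match title {
        None => issues.push("missing <title> tag".to_string()),
        Some(t) => {
            let len = t.trim().chars().count();
            if len == 0 {
                issues.push("empty <title> tag".to_string());
            } else if len < 10 {
                issues.push(format!("title too short ({len} chars)"));
            } else if len > 60 {
                issues.push(format!("title too long ({len} chars)"));
            }
        }
    }
    issues
}

fn main() {
    println!("{:?}", check_title(Some("Hi")));
    println!("{:?}", check_title(None));
}
```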

crawlers/performance.rs

Purpose: Performance analysis implementation

Features:

  • Response time measurement
  • Compression detection (brotli, gzip, deflate)
  • Page size analysis
  • Script and stylesheet counting

Tests: Crawler creation
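Compression detection reduces to inspecting the `Content-Encoding` response header. A sketch of that check (note the brotli header token is `br`, and the header may list several codings):

```rust
// Return the first recognized coding found in a Content-Encoding
// header value, e.g. Some("br") for "gzip, br"; None if absent
// or unrecognized. The real crawler reads this header from the
// HTTP response.
fn detect_compression(content_encoding: Option<&str>) -> Option<&'static str> {
    let value = content_encoding?.to_lowercase();
    for algo in ["br", "gzip", "deflate"] {
        if value.split(',').any(|v| v.trim() == algo) {
            return Some(algo);
        }
    }
    None
}

fn main() {
    println!("{:?}", detect_compression(Some("gzip, br")));
    println!("{:?}", detect_compression(None));
}
```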

crawlers/a11y.rs

Purpose: Accessibility analysis implementation

Features:

  • HTML lang attribute checking
  • Image alt attribute validation
  • ARIA attribute detection
  • Semantic HTML5 tag verification
  • Form label association
  • Skip navigation link detection

Tests: Crawler creation, alt attribute checking
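An alt-attribute check in the current string-matching style (a proper HTML parser is listed under Future Improvements) might look like this sketch:

```rust
// Count <img ...> tags that lack an alt attribute, using plain
// string matching rather than an HTML parser.
fn imgs_missing_alt(html: &str) -> usize {
    html.match_indices("<img")
        .filter(|(start, _)| {
            // Inspect only this tag's text, up to its closing '>'.
            match html[*start..].find('>') {
                Some(end) => !html[*start..*start + end].contains("alt="),
                None => true, // malformed tag counts as missing
            }
        })
        .count()
}

fn main() {
    let html = r#"<img src="a.png" alt="logo"><img src="b.png">"#;
    println!("{}", imgs_missing_alt(html));
}
```

String matching like this misses edge cases (attributes split across quotes, `alt` appearing in other attribute values), which is exactly why the Future Improvements section proposes the `scraper` crate.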

main.rs

Purpose: Application entry point

Responsibilities:

  • User interaction flow
  • Orchestration of crawlers
  • Display coordination

Design: Thin layer that uses library modules

Extensibility

Adding a New Crawler

  1. Create a new file in src/crawlers/
  2. Implement the Crawler trait
  3. Add module declaration in src/crawlers/mod.rs
  4. Export in src/lib.rs
  5. Update main.rs to include in selection menu

Example:

pub struct SecurityCrawler;

impl Crawler for SecurityCrawler {
    fn name(&self) -> &str {
        "Security Crawler"
    }

    fn analyze(&self, client: &HttpClient, url: &str)
        -> Result<CrawlerResults, Box<dyn std::error::Error>> {
        // Fetch the page, run the security checks, and build the
        // CrawlerResults here.
        todo!()
    }
}

Best Practices Implemented

  1. Rust Naming Conventions: snake_case for functions, PascalCase for types
  2. Documentation: Inline documentation for all public APIs
  3. Module Organization: Clear hierarchy and logical grouping
  4. Error Handling: Proper Result types, no unwrap in production paths
  5. Testing: Unit tests for testable logic
  6. Type Safety: Strong typing, minimal use of String where specific types work
  7. Ownership: Proper use of references vs. owned values
  8. Trait Usage: Polymorphism through traits rather than inheritance

Performance Considerations

  1. HTTP Client Reuse: Single client instance for all requests
  2. Timeout Configuration: 30-second timeout to prevent hangs
  3. Limited Link Checking: Only checks first 10 internal links to avoid excessive requests
  4. Blocking I/O: Uses blocking client for simplicity (async could be added later)

Future Improvements

  1. Async/Await: Convert to async for better concurrency
  2. Parallel Crawling: Run multiple crawlers concurrently
  3. HTML Parsing: Use proper HTML parser (e.g., scraper crate) instead of string matching
  4. Configuration: External config file for timeouts, limits, etc.
  5. Reporting: JSON/HTML output options
  6. Integration Tests: End-to-end tests with mock servers
  7. Error Recovery: Retry logic for transient failures