This directory contains a complete local test environment for testing the web scraper against a controlled website with a known structure.
A test website with the following characteristics has been generated:
- 400+ HTML pages in a hierarchical structure
- Maximum depth of 5 levels
- Navigation links between pages at different levels
- Proper `robots.txt` and `sitemap.xml` files
- Random metadata on pages for testing extraction
Directory contents:

- `example-site/` - Contains all the generated HTML files and resources:
  - `index.html` - Homepage
  - `page*.html` - Top-level pages
  - `section*/` - Section directories with their own pages
  - `robots.txt` - Crawler directives, with some pages intentionally disallowed
  - `sitemap.xml` - XML sitemap listing all publicly available pages
- `nginx/` - Nginx configuration:
  - `nginx.conf` - Server configuration with directory listing enabled
- `docker-compose.yml` - Docker Compose configuration for running Nginx
- `generate_test_site.py` - Script that generated the test site
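The generated `robots.txt` uses the standard directive format. As a point of reference only, a file of that shape looks like the fragment below; the disallowed path is a hypothetical example, not one of the actual generated rules:

```text
User-agent: *
Disallow: /section1/deep/
Sitemap: http://localhost:8080/sitemap.xml
```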
To run the test server:

- Make sure Docker and Docker Compose are installed and running
- Start the Nginx server: `docker-compose up -d`
- The test site will then be available at http://localhost:8080
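The Compose file is already provided in this directory; a minimal configuration consistent with the description above might look like the following sketch (the service name, image tag, and mount paths are assumptions, not the file's actual contents):

```yaml
services:
  web:
    image: nginx:alpine
    ports:
      - "8080:80"   # exposes the site at http://localhost:8080
    volumes:
      - ./example-site:/usr/share/nginx/html:ro       # generated site content
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro   # server configuration
```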
You can test your scraper against this environment with:
`python main.py http://localhost:8080 --depth 3`
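`main.py` is the scraper under test and its internals are not shown here. For orientation, a depth-limited crawl of the kind the `--depth` flag controls can be sketched with the standard library alone; the function and class names below are illustrative, not the actual `main.py` API:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url):
    """Return absolute URLs for every <a href> found in the page."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, link) for link in parser.links]


def crawl(start_url, max_depth):
    """Breadth-first crawl that never follows links past max_depth."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html = urlopen(url).read().decode("utf-8", errors="replace")
        if depth == max_depth:
            continue  # at the depth limit: fetch, but do not enqueue children
        for link in extract_links(html, url):
            # stay on the test host and avoid revisiting pages
            if urlparse(link).netloc == urlparse(start_url).netloc and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen
```

Breadth-first order guarantees each page is first reached at its shallowest depth, which is why the depth check here is a simple per-item counter.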
Additional test commands:
- Test with sitemap parsing: `python main.py http://localhost:8080 --use-sitemap`
- Test with robots.txt consideration: `python main.py http://localhost:8080 --respect-robots-txt`
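To sanity-check robots.txt behavior independently of the scraper, the standard library's `urllib.robotparser` can evaluate rules directly. The disallowed path below is hypothetical; in a live run you would instead call `set_url("http://localhost:8080/robots.txt")` followed by `read()`:

```python
from urllib.robotparser import RobotFileParser

# Parse rules from literal lines rather than fetching the served file.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /section1/deep/",   # hypothetical disallowed subtree
])

print(rp.can_fetch("*", "http://localhost:8080/index.html"))            # True
print(rp.can_fetch("*", "http://localhost:8080/section1/deep/a.html"))  # False
```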
Notes on the test site:

- The site contains a mix of pages that link to subpages
- Some deeper pages (depth >= 3) are disallowed in robots.txt
- Pages have consistent navigation but varying depth
- The sitemap includes all non-disallowed pages with metadata
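Since the sitemap follows the standard sitemaps.org XML format, its entries can be read with `xml.etree.ElementTree`. The sample document below is illustrative, not the generated file:

```python
import xml.etree.ElementTree as ET

# Illustrative sitemap in the standard sitemaps.org format.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://localhost:8080/index.html</loc>
    <lastmod>2024-01-01</lastmod>
  </url>
  <url>
    <loc>http://localhost:8080/page1.html</loc>
  </url>
</urlset>"""

# The namespace must be mapped explicitly when querying with ElementTree.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall("sm:url/sm:loc", NS)]


print(sitemap_urls(SITEMAP))
```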
If you need to regenerate the test site with different characteristics, modify the configuration variables at the top of `generate_test_site.py` and run:

`./venv/bin/python generate_test_site.py`
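The actual configuration variables in `generate_test_site.py` are not reproduced here. As a rough sketch of how a hierarchical generator of this shape works, the following uses hypothetical knobs and a much smaller tree than the real 400+ page, depth-5 site:

```python
import os

# Hypothetical knobs standing in for the real configuration variables.
PAGES_PER_LEVEL = 3
MAX_DEPTH = 2
OUTPUT_DIR = "example-site-demo"

PAGE_TEMPLATE = """<html><head><title>{title}</title></head>
<body><h1>{title}</h1>
{links}
</body></html>"""


def generate_level(path, depth):
    """Recursively write pages, linking each level to the one below it."""
    os.makedirs(path, exist_ok=True)
    for i in range(PAGES_PER_LEVEL):
        links = ""
        if depth < MAX_DEPTH:
            sub = f"section{i}"
            links = f'<a href="{sub}/page0.html">deeper</a>'
            generate_level(os.path.join(path, sub), depth + 1)
        with open(os.path.join(path, f"page{i}.html"), "w") as f:
            f.write(PAGE_TEMPLATE.format(title=f"Page {i} (depth {depth})",
                                         links=links))


generate_level(OUTPUT_DIR, 0)
```

With these toy settings the sketch writes 3 + 9 + 27 = 39 pages; the real script additionally emits `robots.txt`, `sitemap.xml`, and per-page metadata.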