| 1 | +# CLAUDE.md |
| 2 | + |
| 3 | +## Project Overview |
| 4 | + |
| 5 | +`scholarly` is a Python module for retrieving author and publication data from Google Scholar programmatically. It parses HTML responses from Google Scholar and returns structured data. |
| 6 | + |
| 7 | +- **Language:** Python 3.8+ |
| 8 | +- **License:** Unlicense (public domain) |
| 9 | +- **Current version:** 1.7.11 |
| 10 | +- **PyPI:** `pip3 install scholarly` |
| 11 | + |
| 12 | +## Repository Structure |
| 13 | + |
| 14 | +``` |
| 15 | +scholarly/ # Main package |
| 16 | + _scholarly.py # Core API: search, fill methods (_Scholarly class) |
| 17 | + _navigator.py # HTTP session management, proxy routing |
| 18 | + _proxy_generator.py # Proxy service integrations (ScraperAPI, Bright Data, FreeProxy) |
| 19 | + author_parser.py # HTML parsing for author profiles |
| 20 | + publication_parser.py # HTML parsing for publications |
| 21 | + data_types.py # TypedDict definitions (Author, Publication, etc.) |
| 22 | +test_module.py # Full test suite (unittest-based) |
| 23 | +docs/ # Sphinx documentation |
| 24 | +scripts/ # Helper scripts |
| 25 | +.github/workflows/ # CI/CD pipelines |
| 26 | +``` |
| 27 | + |
| 28 | +## Setup |
| 29 | + |
| 30 | +```bash |
| 31 | +pip3 install -e . # Editable install for development |
| 32 | +pip3 install -r requirements.txt # Runtime dependencies |
| 33 | +pip3 install -r requirements-dev.txt # Dev dependencies (sphinx, coverage) |
| 34 | +``` |
| 35 | + |
| 36 | +## Testing |
| 37 | + |
| 38 | +```bash |
| 39 | +python3 -m unittest -v test_module.py |
| 40 | +``` |
| 41 | + |
| 42 | +- Uses Python's built-in `unittest` framework (not pytest) |
| 43 | +- Test classes: `TestScholarly`, `TestLuminati`, `TestScraperAPI`, `TestTorInternal`, `TestScholarlyWithProxy` |
| 44 | +- **6 of 17 test cases require premium proxy services** (ScraperAPI or Bright Data credentials). These are skipped when credentials are unavailable. |
| 45 | +- Coverage: `coverage run test_module.py && coverage report` |
| 46 | + |
| 47 | +## Linting |
| 48 | + |
| 49 | +Uses **flake8** only. No black, isort, mypy, or pre-commit hooks. |
| 50 | + |
| 51 | +```bash |
| 52 | +flake8 |
| 53 | +``` |
| 54 | + |
| 55 | +Configuration (`.flake8`): |
| 56 | +- Max line length: **127** |
| 57 | +- Max complexity: 10 |
| 58 | +- Selected rules: E9, E111, F63, F7, F82, F401 |
| 59 | +- Ignored: E261, E265 |
| 60 | +- Excluded: `scholarly/__init__.py`, `docs/conf.py` |
| 61 | + |
| 62 | +## CI/CD (GitHub Actions) |
| 63 | + |
| 64 | +- **pythonpackage.yml** — Main CI: runs on Ubuntu, macOS, Windows with Python 3.8. Triggers on push/PR to `main`/`develop`, plus scheduled runs. |
| 65 | +- **lint.yaml** — Flake8 linting (called by main workflow). |
| 66 | +- **proxytests.yml** — Proxy-dependent tests, runs on push to `main` only (uses GitHub secrets). |
| 67 | +- **codeql-analysis.yml** — Security scanning on push/PR to `main`/`develop`. |
| 68 | +- **publish-to-pypi.yml** — Publishes to PyPI on tagged commits. |
| 69 | + |
| 70 | +## Contributing Conventions |
| 71 | + |
| 72 | +- **Base branch for PRs:** `develop` (not `main`) |
| 73 | +- **Open an issue** before submitting a PR |
| 74 | +- **Commit message style:** imperative mood, concise |
| 75 | + - Bug fixes: `Fix <description>` or `Handle <condition>` |
| 76 | + - Features: `Add <description>` |
| 77 | + - Tests: `Add a unit test to <description>` or `Test that <description>` |
| 78 | + - Version bumps: `Bump version to X.Y.Z` |
| 79 | +- **Tests:** add tests for new features; ensure existing tests pass |
| 80 | +- **Docs:** verify documentation consistency; build with `cd docs && make html` |
| 81 | + |
| 82 | +## Key Architecture Notes |
| 83 | + |
| 84 | +- `_Scholarly` is the main singleton API class (instantiated as `scholarly` in `__init__.py`) |
| 85 | +- Google Scholar responses are parsed via `BeautifulSoup` in the parser modules |
| 86 | +- Anti-bot circumvention relies on proxy rotation (`_proxy_generator.py`) and user-agent spoofing |
| 87 | +- `_navigator.py` manages the HTTP session and handles retries, redirects, and CAPTCHA detection |
| 88 | +- Data types are `TypedDict` subclasses defined in `data_types.py` |
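The note that data types are `TypedDict` subclasses can be illustrated with a small sketch. The field names below (`title`, `year`, `num_citations`, `name`, `affiliation`, `publications`) and the `total_citations` helper are assumptions chosen for illustration; the actual definitions live in `data_types.py` and may differ.

```python
from typing import List, TypedDict


class Publication(TypedDict, total=False):
    # Illustrative fields only; not copied from data_types.py.
    title: str
    year: int
    num_citations: int


class Author(TypedDict, total=False):
    # total=False makes every key optional, which suits records
    # that are filled in incrementally.
    name: str
    affiliation: str
    publications: List[Publication]


def total_citations(author: Author) -> int:
    """Sum citation counts over whatever publications are present."""
    return sum(
        pub.get("num_citations", 0)
        for pub in author.get("publications", [])
    )
```

At run time a `TypedDict` is an ordinary `dict`, so parsed results can be serialized or inspected like any other dictionary while still giving type checkers something to verify.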
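The Testing section says proxy-dependent tests are skipped when credentials are unavailable. One way such a skip could be written with `unittest` is sketched below. The environment variable name `SCRAPER_API_KEY` and both test classes are hypothetical; the mechanism actually used in `test_module.py` is not shown in this diff.

```python
import os
import unittest

# Hypothetical environment variable name, used only for this sketch.
API_KEY_VAR = "SCRAPER_API_KEY"


class TestWithoutProxy(unittest.TestCase):
    def test_always_runs(self):
        # Stands in for a test that needs no paid service.
        self.assertEqual(1 + 1, 2)


class TestWithProxy(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Skip the whole class when the credential is missing.
        if not os.getenv(API_KEY_VAR):
            raise unittest.SkipTest(f"{API_KEY_VAR} is not set")

    def test_needs_credentials(self):
        # Stands in for a test that would call a premium proxy service.
        self.assertTrue(os.getenv(API_KEY_VAR))


def run_suite() -> unittest.TestResult:
    """Run both test classes and return the collected result."""
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    suite.addTests(loader.loadTestsFromTestCase(TestWithoutProxy))
    suite.addTests(loader.loadTestsFromTestCase(TestWithProxy))
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Checking the variable in `setUpClass` defers the decision to run time; a `@unittest.skipUnless(...)` class decorator would be a shorter alternative that evaluates the condition once at import time.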