Fixes #668
The current cache implementation in cachedownloader.py uses only the
Last-Modified header for cache validation. When a server omits that header
(but provides ETag or Cache-Control), _lastmod_from_response() returns
0.0 and _cache_hit() always returns True, so the cache never invalidates.
Replace the hand-rolled mtime-based cache logic with the
CacheControl library (v0.14, maintained
under the PSF). CacheControl wraps a requests.Session and implements full HTTP
caching semantics — ETag/If-None-Match, Last-Modified/If-Modified-Since,
and Cache-Control headers — transparently. On-disk persistence uses its
FileCache backend (backed by filelock for concurrency safety).
| Aspect | Before | After |
|---|---|---|
| ETag support | None | Full (If-None-Match / 304) |
Cache-Control header |
Ignored | Respected (max-age, no-cache, no-store) |
Last-Modified |
Only mechanism (buggy when absent) | Supported via If-Modified-Since / 304 |
| No caching headers at all | Cache never invalidates (bug) | Not cached — see note below |
| Existing cache files on upgrade | Used | Ignored (incompatible format) — see open question below |
| Cache file format | Raw content, SHA-256 filename | CacheControl msgpack-serialized HTTP response, SHA-224 filename in 5-level dir tree |
--no-cache flag |
Unchanged | Unchanged — plain session, no caching |
| Retry on error / bad data | Unchanged | Unchanged — custom retry loop preserved |
| Concurrency safety | _atomic_write (tempfile + shutil.copy) |
filelock (via FileCache) |
When a server provides no caching headers at all (no ETag, no Last-Modified,
no Cache-Control, no Expires), the current code caches the response forever
due to the _lastmod_from_response returning 0.0 bug. After this change, such
responses will not be cached, following correct HTTP semantics. In practice,
nearly all real-world schema endpoints provide at least one caching header.
CacheControl has built-in heuristic support for overriding this behavior if
needed. The ExpiresAfter heuristic (bundled, not custom) can apply a TTL to all
responses. For a conditional variant that only applies to responses lacking
server-provided caching headers, the documented BaseHeuristic extensibility
point allows a small subclass (~10 lines). Neither is a hack — heuristics are a
first-class CacheControl feature designed for exactly this purpose, and
RFC 7234 explicitly permits caching
systems to use heuristics for responses that lack caching headers.
CacheControl uses the same cache directories (check_jsonschema/schemas/,
check_jsonschema/refs/) but a different file layout (5-level SHA-224 tree vs.
flat SHA-256 files). Old cache files will sit alongside the new CacheControl
tree, ignored but taking up space.
Options:
- Do nothing (current default) — old files are harmless. Users can manually delete the cache directory if they want to reclaim space.
- Delete on first run — on
_build_session(), detect and remove files matching the old naming pattern (64-char hex + optional extension, no subdirectories). Risk: could delete user files if they happen to match. - One-time migration warning — log a message suggesting users clear their cache directory. Non-invasive but adds noise.
Proceeding with option 1 (do nothing) unless revisited.
File: pyproject.toml (line 39-46)
Add CacheControl[filecache] to runtime dependencies:
dependencies = [
'tomli>=2.0;python_version<"3.11"',
"ruamel.yaml>=0.18.10,<0.20.0",
"jsonschema>=4.18.0,<5.0",
"regress>=2024.11.1",
"requests<3.0",
"click>=8,<9",
"CacheControl[filecache]>=0.14,<0.15",
]The [filecache] extra pulls in filelock. CacheControl also depends on
msgpack (for response serialization) and requests (already a dependency).
| Symbol | Lines | Reason |
|---|---|---|
_LASTMOD_FMT |
16 | CacheControl handles Last-Modified natively |
_lastmod_from_response() |
45-54 | Same |
_cache_hit() |
88-97 | CacheControl handles freshness and conditional requests |
_atomic_write() |
77-85 | FileCache uses filelock for concurrency |
from __future__ import annotations (line 1) stays. hashlib stays (used by
url_to_cache_filename). platform stays (used by _base_cache_dir).
calendar, shutil, tempfile, time — verify each is truly unused after
the rewrite.
import cachecontrol
from cachecontrol.caches.file_cache import FileCacheThese go in the third-party import group alongside import requests, following
the project's PEP 8 / isort convention (no lazy imports in production code).
| Symbol | Lines | Reason |
|---|---|---|
_base_cache_dir() |
19-35 | Still needed for platform-specific cache path |
_resolve_cache_dir() |
38-42 | Still needed |
url_to_cache_filename() |
100-116 | Useful utility; not used by CacheControl but no reason to remove |
FailedDownloadError |
119-120 | Still raised by retry logic |
- Add
session: requests.Sessionas the first parameter. - Call
session.get(file_url)instead ofrequests.get(file_url, stream=True). - Remove
stream=True— CacheControl needs the full response to cache it, and schemas are small. - Keep the retry loop (2 retries) and
response_okcallback — CacheControl does not provide retry logic.
Retry / CacheControl interaction: When a server returns a 200 with corrupt
body and caching headers (e.g., ETag), CacheControl caches the response before
our response_ok callback runs. On a naive retry, session.get() would serve
the cached corrupt response and every retry would fail identically. To avoid
this, retry attempts (_attempt > 0) pass Cache-Control: no-cache in the
request headers. This is standard HTTP (RFC 7234 §5.2.1.4) — it tells
CacheControl to revalidate with the server rather than serve from its local
cache. CacheControl implements this natively; no custom logic is needed.
def _get_request(
session: requests.Session,
file_url: str,
*,
response_ok: t.Callable[[requests.Response], bool],
) -> requests.Response:
num_retries = 2
r: requests.Response | None = None
for _attempt in range(num_retries + 1):
try:
# On retries, bypass CacheControl's local cache to avoid
# re-serving a cached bad response.
headers = {"Cache-Control": "no-cache"} if _attempt > 0 else {}
r = session.get(file_url, headers=headers)
except requests.RequestException as e:
if _attempt == num_retries:
raise FailedDownloadError("encountered error during download") from e
continue
if r.ok and response_ok(r):
return r
assert r is not None
raise FailedDownloadError(
f"got response with status={r.status_code}, retries exhausted"
)__init__: store config only. Do not build the session eagerly —test_default_cache_dircreatesCacheDownloaderwith fake/relative cache paths purely to assert_cache_dirresolution and never makes HTTP requests. Eageros.makedirs+FileCache()in__init__would create real directories using those fake paths.- Build the session lazily via a
_sessionproperty on first access (i.e., firstopen()call). This also avoids creating sessions that are never used (e.g., if the schema turns out to be local after all). - Remove
_download()entirely — CacheControl manages on-disk cache files. - Simplify
open()— always yieldsio.BytesIO(response.content). No branching needed: when cache is enabled,self._sessionis a CacheControl-wrapped session (handles caching transparently); when disabled, it is a plainSession. bind()— unchanged API.
CacheControl imports go at the top of the module alongside requests,
following the project convention (no lazy/deferred imports in production code):
import cachecontrol
from cachecontrol.caches.file_cache import FileCacheclass CacheDownloader:
def __init__(self, cache_dir: str, *, disable_cache: bool = False) -> None:
self._cache_dir = _resolve_cache_dir(cache_dir)
self._disable_cache = disable_cache
self._cached_session: requests.Session | None = None
@property
def _session(self) -> requests.Session:
if self._cached_session is None:
self._cached_session = self._build_session()
return self._cached_session
def _build_session(self) -> requests.Session:
session = requests.Session()
if self._cache_dir and not self._disable_cache:
os.makedirs(self._cache_dir, exist_ok=True)
session = cachecontrol.CacheControl(
session, cache=FileCache(self._cache_dir)
)
return session
@contextlib.contextmanager
def open(
self,
file_url: str,
filename: str,
validate_response: t.Callable[[requests.Response], bool],
) -> t.Iterator[t.IO[bytes]]:
response = _get_request(
self._session, file_url, response_ok=validate_response
)
yield io.BytesIO(response.content)
def bind(
self,
file_url: str,
filename: str | None = None,
validation_callback: t.Callable[[bytes], t.Any] | None = None,
) -> BoundCacheDownloader:
return BoundCacheDownloader(
file_url, self, filename=filename,
validation_callback=validation_callback,
)The filename parameter in open() and bind() is kept for API compatibility
but ignored — CacheControl manages its own filenames (SHA-224 hash of normalized
URL in a 5-level directory tree).
BoundCacheDownloader.__init__ still computes self._filename = filename or url_to_cache_filename(file_url) (line 193). This is now dead code — the value
is passed through to CacheDownloader.open() which ignores it. Kept for now
since test_default_filename_from_uri asserts on cd._filename. Can be cleaned
up in a follow-up.
Skip validation on cache hits. CacheControl sets response.from_cache = True on
every response served from cache (either a direct hit or after a 304
revalidation). This preserves the fix for
#453
(validation callback must not run on cache hits).
def _validate_response(self, response: requests.Response) -> bool:
if not self._validation_callback:
return True
# CacheControl sets from_cache=True on cache hits; skip re-validation
if getattr(response, "from_cache", False):
return True
try:
self._validation_callback(response.content)
return True
except ValueError:
return FalseNo changes to readers.py or resolver.py. Both use CacheDownloader via
bind() and open(), whose signatures are unchanged.
readers.py:80—CacheDownloader("schemas", disable_cache=...).bind(url, validation_callback=...)resolver.py:53—CacheDownloader("refs", disable_cache=...)
Cache subdirectory names ("schemas", "refs") are preserved.
Update cli/main_command.py:96: The --schemafile help text says schemas
will be "downloaded and cached locally based on mtime." Replace with generic
wording (e.g., "downloaded and cached locally") since caching is no longer
mtime-based.
CacheControl's FileCache write mechanism relies on a CallbackFileWrapper
that monitors when a response stream is fully consumed. The responses mock
library creates BytesIO bodies that never report .closed = True after
exhaustion, so the cache-write callback never fires. CacheControl never
populates its FileCache when responses come from responses mocks.
This means tests cannot verify CacheControl's caching behavior (cache hits,
ETag revalidation, max-age expiry) using responses. Rather than switching test
infrastructure (e.g., adding a real HTTP server), the test strategy is:
- Trust CacheControl for caching correctness — it is a well-maintained PSF project with its own test suite.
- Test our integration: retry logic, validation callbacks,
--no-cacheflag, session construction, error handling. - Do not assert on cache file existence,
response.from_cache, orlen(responses.calls)staying constant across requests.
The responses library (v0.26.0) patches HTTPAdapter.send at the class level.
CacheControl's CacheControlAdapter calls super().send(), which hits the
responses mock. HTTP request/response flow works normally — only the cache-write
side effect is broken. Tests that use responses for mocking HTTP (retry tests,
validation tests, error tests) work fine.
Update patch_cache_dir (line 67): Works as-is — it monkeypatches
_base_cache_dir, which is still called by _resolve_cache_dir, which is still
called by CacheDownloader.__init__. CacheControl's FileCache will use the
patched temp directory.
Remove the following fixtures — they either inject raw cache files (no longer
valid format) or predict cache file paths using url_to_cache_filename
(SHA-256), which CacheControl does not use:
| Fixture | Line | Action |
|---|---|---|
url2cachepath |
77 | Remove — only consumed by the path-prediction fixtures below |
inject_cached_download |
100 | Remove — writes raw content; CacheControl uses msgpack format |
inject_cached_ref |
126 | Remove — same |
get_download_cache_loc |
92 | Remove — path prediction no longer valid |
get_ref_cache_loc |
118 | Remove — same |
Keep downloads_cache_dir / refs_cache_dir (lines 87, 113) and
keep cache_dir (line 63) — still useful for asserting cache directory
existence when needed.
Update the import block (lines 10-16): Remove _cache_hit and
_lastmod_from_response — both symbols are deleted from cachedownloader.py.
Tests to remove:
| Test | Lines | Reason |
|---|---|---|
test_cache_hit_by_mtime |
95-113 | mtime-based caching removed entirely |
test_cachedownloader_handles_bad_lastmod_header |
257-310 | _lastmod_from_response removed |
test_lastmod_from_header_uses_gmtime |
352-383 | Same |
test_cachedownloader_cached_file |
116-129 | Tests monkeypatched _download method, which is removed |
test_cachedownloader_validation_is_not_invoked_on_hit |
313-345 | Depends on inject_cached_download (removed) and response.from_cache (not testable with responses mocks) |
Tests to simplify (remove get_download_cache_loc dependency and file-
existence assertions — these tests are about retry behavior, not caching):
test_cachedownloader_on_success (line 132):
- Remove
get_download_cache_locfixture usage andf.exists()assertions. - Keep: verify
cd.open()returns correct content (b"{}").
test_cachedownloader_succeeds_after_few_errors (line 160):
- Remove
get_download_cache_locfixture usage andf.exists()assertions. - Keep: verify
cd.open()returns correct content after retries.
test_cachedownloader_fails_after_many_errors (line 194):
- Remove
get_download_cache_locfixture usage andf.exists()assertion. - Keep: verify
FailedDownloadErroris raised.
test_cachedownloader_retries_on_bad_data (line 225):
- Remove
get_download_cache_locfixture usage andf.exists()assertions. - Keep: verify
cd.open()returns correct content after retrying past bad data.
test_cachedownloader_using_alternate_target_dir (line 149):
- Remove
url2cachepath/ path-based assertions. - Keep: verify
cd.open()returns correct content.
Tests to keep as-is:
| Test | Lines | Notes |
|---|---|---|
test_default_filename_from_uri |
37-39 | url_to_cache_filename unchanged |
test_default_cache_dir |
42-92 | _base_cache_dir unchanged. Works as-is thanks to lazy session building — CacheDownloader("downloads") only stores _cache_dir, no os.makedirs or FileCache until open() is called. |
New tests to add (integration-focused):
| Test | Purpose |
|---|---|
test_disable_cache_uses_plain_session |
When disable_cache=True, verify _build_session returns a plain requests.Session (not wrapped by CacheControl). |
test_enable_cache_uses_cachecontrol_session |
When disable_cache=False and cache dir is valid, verify _build_session returns a CacheControl-wrapped session (check for CacheControlAdapter on the session). |
test_cache_dir_none_uses_plain_session |
When _resolve_cache_dir returns None (e.g., Windows with no env vars), verify _build_session returns a plain session. |
Aspirational cache-behavior tests (require a real HTTP server or alternative
to responses; not included in this PR but tracked for follow-up):
| Test | Purpose |
|---|---|
test_etag_cache_revalidation |
Server provides ETag, no Last-Modified. Second request sends If-None-Match, gets 304. |
test_cache_control_max_age |
Server provides Cache-Control: max-age=3600. Second request served from cache. |
test_last_modified_revalidation |
Server provides Last-Modified. Second request sends If-Modified-Since. |
test_no_caching_headers_not_cached |
Server provides no caching headers. Response is not cached. |
test_remote_ref_resolution_cache_control (line 68):
- Remove
get_ref_cache_locfixture dependency and file-path assertions. - When
disable_cache=False: verify the command succeeds (exit code 0). Cache behavior is trusted to CacheControl. - When
disable_cache=True: verify the command succeeds (exit code 0).
test_remote_ref_resolution_loads_from_cache (line 105):
- Remove
inject_cached_refandget_ref_cache_locfixture dependencies. - This test's premise (inject good cache data, serve bad HTTP, verify cache is used) cannot be replicated without direct FileCache manipulation. Remove this test. Cache-hit behavior is trusted to CacheControl.
test_remote_ref_resolution_callout_count_is_scale_free_in_instancefiles
(line 251):
- Should work as-is. Within-run deduplication is handled by
lru_cacheon_get_validatorand the in-memoryResourceCache— not by the file cache. CacheControl's file cache only affects cross-run persistence.
test_can_load_remote_yaml_schema_ref_from_cache (line 140):
- Remove
inject_cached_reffixture dependency. This test's premise (inject good cache, serve bad HTTP) cannot be replicated. Remove this test. Cache-hit behavior is trusted to CacheControl.
CacheControl's FileCache uses filelock for concurrency safety. When connection
errors occur during tests (e.g., test_cachedownloader_fails_after_many_errors),
filelock may leave temporary files open, triggering a ResourceWarning that pytest
converts to PytestUnraisableExceptionWarning.
This is not a bug in our code — it's a side effect of how filelock handles error
conditions. Add a filter to pyproject.toml to ignore this specific warning:
[tool.pytest.ini_options]
filterwarnings = [
# ... existing filters ...
# filelock (used by CacheControl) may leave temp files open during connection errors
'ignore:Exception ignored in.*FileIO.*:pytest.PytestUnraisableExceptionWarning',
]TODO: Investigate whether this warning can be avoided rather than suppressed.
Investigation results (2026-03-27):
- The warning originates from CacheControl's
CallbackFileWrapperclass infilewrapper.py, not filelock. CallbackFileWrapper.__init__creates aNamedTemporaryFile(line 37), which is only closed in_close()when the response is fully read (line 105-106).- When a
requests.ConnectionErroroccurs, the temp file is never closed because_close()is never called. - No existing upstream issues found in CacheControl or filelock for this specific problem.
- The
CallbackFileWrapperclass lacks a__del__method or context manager protocol that would ensure cleanup on exceptions. - Recommendation: File an upstream issue with CacheControl suggesting that
CallbackFileWrapperimplement__del__to closeself.__bufif not already closed, or use a weak reference callback to ensure cleanup.
- Add
CacheControl[filecache]dependency topyproject.toml - Rewrite
cachedownloader.py - Update
conftest.pyfixtures - Rewrite/update unit tests in
test_cachedownloader.py - Update acceptance tests
- Add pytest warning filter for filelock ResourceWarning
- Run full test suite, fix any
responseslibrary compatibility issues - Commit, push, open PR