diff --git a/examples/README.md b/examples/README.md index cb92870e..2f80dc5f 100644 --- a/examples/README.md +++ b/examples/README.md @@ -18,6 +18,7 @@ service keys. | [Build a multimodal wine recommender with OCR](./wine-recommender) | Combining preference-based retrieval with OCR-driven label detection in one UI | `encode`, `score`, `extract` | Docker Compose app plus local SIE endpoint; API key optional for unauthenticated SIE | Runnable demo | | [Build a multi-modal product classifier with embeddings](./taxonomy-classification) | Evaluating text, image, NLI, and reranking approaches for hierarchical product taxonomy classification | `extract`, `encode`, `score` | SIE endpoint, Shopify dataset prep via `uv run` scripts, standalone `uv` project | Runnable evaluation example | | [Swap an OCR model with one identifier change](./document-ocr) | Driving recognition (VLM-OCR), structured extraction (Donut), and zero-shot NER (GLiNER) through the same `extract` call by swapping the model ID | `extract` | Docker Compose plus Node UI, no API key required, hosted version on [Hugging Face Spaces](https://huggingface.co/spaces/superlinked/document-ocr) | Runnable demo | +| [Vision-first document RAG](./vision-doc-rag) | Retrieving and answering questions over a multi-tenant page corpus by looking at page images, with OCR kept out of the score path | `encode`, `extract`, `score` (optional) | SIE endpoint with a GPU recommended for ColQwen2.5 + Florence-2-DocVQA | Runnable demo | For docs publishing, lead with the quickest runnable demos, then use the benchmark and evaluation examples for deeper technical users. 
diff --git a/examples/vision-doc-rag/.gitignore b/examples/vision-doc-rag/.gitignore new file mode 100644 index 00000000..9a052846 --- /dev/null +++ b/examples/vision-doc-rag/.gitignore @@ -0,0 +1,9 @@ +.venv/ +__pycache__/ +data/pages.json +data/pdfs_manifest.json +data/pages_manifest.json +data/pdfs/ +data/pages/ +data/multivectors.npz +data/metadata.json diff --git a/examples/vision-doc-rag/README.md b/examples/vision-doc-rag/README.md new file mode 100644 index 00000000..3f3c586f --- /dev/null +++ b/examples/vision-doc-rag/README.md @@ -0,0 +1,261 @@ +# Vision-first document RAG + +Retrieve by image, answer by image. ColQwen2.5 reads each PDF page as a +picture and ranks pages via late interaction; Florence-2-DocVQA reads the +winning page and produces the textual answer. OCR never enters the score path, +so schematics, pinout diagrams, architecture slides, charts, and other layout +cues still drive ranking. Everything runs on one SIE endpoint. + +Each page also carries a `client` tag, so the same corpus serves multiple +tenants from one index. Queries scoped to `embedded-lab` cannot retrieve +`ops-eng` or `aerospace` pages. + +## Corpus + +The demo fetches a small public PDF batch on demand and renders selected pages +to PNGs. The page selections are deliberately capped so local ingest stays +fast while still indexing visually rich pages. 
+ +| Tenant | Sources | Visual signal | +|---|---|---| +| `embedded-lab` | Raspberry Pi Pico datasheet, Arduino UNO R3 datasheet, Arduino UNO R3 schematic | Pinout diagrams, board diagrams, circuit schematics | +| `ops-eng` | PostgreSQL manual, CNCF Kubernetes / cloud-native architecture material | Architecture diagrams, operational flows, dense technical tables | +| `aerospace` | NASA NTRS nozzle and booster reports | Engineering drawings, cross-sections, charts, mission technical figures | + +Generated files are ignored: + +```text +data/pdfs/ # downloaded PDFs +data/pdfs_manifest.json # source manifest from fetch_pdfs.py +data/pages/ # rendered PNG pages +data/pages_manifest.json # page-level metadata from render_pages.py +data/metadata.json # index metadata from ingest.py +data/multivectors.npz # page multivectors from ingest.py +``` + +## SIE features used + +- `encode` - `vidore/colqwen2.5-v0.2` on page images at ingest and on query + text at search time. Output is a `[tokens, 128]` multivector. Late + interaction (`sie_sdk.scoring.maxsim`) is the first-stage ranking signal. +- `extract` - `mynkchaudhry/Florence-2-FT-DocVQA`. Called with + `instruction=` set to the user's question to get a textual answer for the top page, and + without `instruction` to OCR the same page for a display snippet. The OCR + snippet is UX-only; it never enters ranking. +- `score` (optional) - `Qwen/Qwen3-VL-Reranker-2B` second-stage rerank over + `(query text, page image)`. Off by default while we wait for an upstream + adapter fix; flip `search.visual_rerank: true` in `config.yaml` to enable it + on a cluster that's ready. + +## Run it + +You need Python 3.12 and a reachable SIE cluster. + +```bash +# 1. SIE locally, or point SIE_CLUSTER_URL / SIE_API_KEY at a managed cluster. +docker run -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cpu-default + +# 2. Fetch public PDFs and render selected pages to PNG. 
+cd examples/vision-doc-rag +pip install -r python/requirements.txt +python data/fetch_pdfs.py +python data/render_pages.py + +# 3. Encode every rendered page with ColQwen2.5 and save the multivectors. +python python/ingest.py + +# 4a. CLI demo. +python python/search.py + +# 4b. Or start the UI. +uvicorn --app-dir python server:app --port 8888 +open http://localhost:8888 +``` + +`render_pages.py` uses `pdf2image` when Poppler is available. If Poppler is +not installed, it falls back to PyMuPDF, which is installed from +`python/requirements.txt`. + +First run on a cold cluster pays a one-time model load. ColQwen2.5 and +Florence-2 are both several GB, so expect roughly a minute on CPU and a few +seconds on GPU before the warm path kicks in. + +### Managed cluster + +```bash +export SIE_CLUSTER_URL="https://your-cluster-host:8080" +export SIE_API_KEY="SL-..." +``` + +The defaults in `config.yaml` point at `http://localhost:8080`. Set +`cluster.gpu` to a profile name like `l4-spot` if the cluster needs an +explicit GPU class. + +## Try these queries + +Queries are grouped by what they exercise. Each row names the expected target +page so you can spot regressions at a glance. + +### Visual signal — the ranking comes from the page image, not OCR + +| Tenant | Query | Expected target | Why it's interesting | +|---|---|---|---| +| `embedded-lab` | Raspberry Pi Pico pinout GP21 | Pi Pico datasheet pinout (pp 4-5) | Abbreviated visual label still drives retrieval. | +| `embedded-lab` | where is the ATmega16U2 on the schematic? | Arduino UNO R3 schematic (pp 1-2) | Circuit schematic retrieval, not prose. | +| `ops-eng` | cloud native architecture diagram | CNCF AI whitepaper or Kubernetes slides | Visual architecture page instead of OCR text. | +| `aerospace` | solid rocket motor nozzle design figure | Solid rocket motor nozzles report | Engineering drawing in a figure-heavy report. 
| + +### Table / value lookups — the DocVQA answer is the point + +| Tenant | Query | Expected target | Expected answer | +|---|---|---|---| +| `embedded-lab` | What is the operating voltage range of the Raspberry Pi Pico? | Pi Pico datasheet electrical characteristics (pp 6-8) | A voltage range, e.g. 1.8-5.5 V | +| `embedded-lab` | Which Arduino UNO pin is the built-in LED on? | UNO R3 datasheet pinout (pp 5-11) | D13 / PB5 | +| `ops-eng` | PostgreSQL default listening port | PG 18 manual config section (pp 19-24) | 5432 | +| `ops-eng` | What is the default value of max_connections in PostgreSQL? | PG 18 manual parameter table (pp 19-24) | 100 | +| `aerospace` | What is the throat diameter shown in the nozzle drawing? | Nozzle design figure | A labeled dimension off the drawing | + +### Disambiguation — two PDFs in one tenant, the right one must win + +| Tenant | Query | Should pick | Should beat | +|---|---|---|---| +| `aerospace` | solid propellant rocket nozzle cross-section | `solid-rocket-motor-nozzles.pdf` | `liquid-rocket-engine-nozzles.pdf` | +| `aerospace` | regeneratively cooled nozzle | `liquid-rocket-engine-nozzles.pdf` (regen cooling is liquid-specific) | `solid-rocket-motor-nozzles.pdf` | +| `embedded-lab` | USB-to-serial interface chip on the schematic | `arduino-uno-r3-schematic.pdf` (ATmega16U2) | `raspberry-pi-pico-datasheet.pdf` | +| `embedded-lab` | RP2040 GPIO function table | `raspberry-pi-pico-datasheet.pdf` | `arduino-uno-r3-datasheet.pdf` | + +### Tenant-leak negatives — the matching content lives in a different tenant + +| Scoped to | Query | Pass condition | +|---|---|---| +| `ops-eng` | Raspberry Pi Pico pinout GP21 | No embedded-lab pages return. | +| `ops-eng` | regeneratively cooled nozzle | No aerospace pages return. | +| `aerospace` | cloud native architecture diagram | No ops-eng pages return. | +| `embedded-lab` | PostgreSQL connection pool | No ops-eng pages return. 
| + +## API + +### `GET /api/search` + +| Parameter | Required | Description | +|---|---|---| +| `q` | yes | Search query | +| `client` | no | Tenant filter, for example `embedded-lab`. Omitted means search all tenants. | + +```bash +curl "http://localhost:8888/api/search?q=Raspberry+Pi+Pico+pinout+GP21&client=embedded-lab" +``` + +```json +{ + "query": "Raspberry Pi Pico pinout GP21", + "client": "embedded-lab", + "answer": "GP21 can be used for ...", + "results": [ + { + "page_id": "embedded-lab__raspberry-pi-pico-datasheet__p005", + "client": "embedded-lab", + "title": "Raspberry Pi Pico Datasheet", + "publisher": "Raspberry Pi Ltd", + "source_pdf": "raspberry-pi-pico-datasheet.pdf", + "page_number": 5, + "citation": "raspberry-pi-pico-datasheet.pdf · p.5", + "page_image": "/pages/embedded-lab/raspberry-pi-pico-datasheet_p005.png", + "scores": { "maxsim": 14.44, "rerank": null } + } + ] +} +``` + +### `GET /api/clients`, `GET /api/stats` + +Tenant list and runtime config: active models, rerank on/off, and page count. + +## How it works + +```text + ingest.py (once per corpus) + fetch_pdfs.py -> data/pdfs/{tenant}/*.pdf + -> render_pages.py -> data/pages/{tenant}/*.png + -> data/pages_manifest.json + -> SIE.encode(ColQwen2.5, images, multivector) + -> data/multivectors.npz + data/metadata.json + + search.py / server.py (per query) + q -> SIE.encode(ColQwen2.5, text, is_query=True) + -> filter metadata by tenant + -> sie_sdk.scoring.maxsim -> top_k_candidates + -> optional SIE.score(Qwen3-VL-Reranker, q, images) + -> SIE.extract(Florence-2-DocVQA, instruction=q, images=[top_page]) + -> SIE.extract(Florence-2-DocVQA, images=[top_page]) for display OCR +``` + +OCR is never on the score path. The visual reranker, when enabled, ranks over +the same modality as retrieval, so layout cues survive both stages. + +The corpus is small enough that MaxSim runs in Python. 
For thousands of pages, +hand the multivectors to LanceDB, Vespa, or another multivector store; the SIE +calls stay the same. + +## Customize + +`data/fetch_pdfs.py` owns the curated source list. Add a source with: + +```python +{ + "client": "my-tenant", + "slug": "my-manual", + "title": "My Manual", + "publisher": "Example Publisher", + "license": "CC BY 4.0", + "url": "https://example.com/my-manual.pdf", + "pages": [1, 2, 7, 8], +} +``` + +Then rerun: + +```bash +python data/fetch_pdfs.py +python data/render_pages.py +python python/ingest.py +``` + +`config.yaml` is the model and rendering tuning surface: + +```yaml +models: + retriever: "vidore/colqwen2.5-v0.2" + docvqa: "mynkchaudhry/Florence-2-FT-DocVQA" + reranker: "Qwen/Qwen3-VL-Reranker-2B" +render: + backend: "auto" + dpi: 160 +search: + top_k_candidates: 5 + top_k_results: 3 + visual_rerank: false + answer: true + ocr_snippet: true +``` + +## Project layout + +```text +examples/vision-doc-rag/ +├── config.yaml +├── data/ +│ ├── fetch_pdfs.py # curated public PDF source list + downloader +│ ├── render_pages.py # PDFs -> PNG pages + pages_manifest.json +│ ├── pdfs/ # generated +│ ├── pages/ # generated PNGs +│ ├── metadata.json # generated by ingest +│ └── multivectors.npz # generated by ingest +├── python/ +│ ├── ingest.py +│ ├── search.py +│ ├── server.py +│ └── requirements.txt +└── static/ + └── index.html +``` diff --git a/examples/vision-doc-rag/config.yaml b/examples/vision-doc-rag/config.yaml new file mode 100644 index 00000000..587548f2 --- /dev/null +++ b/examples/vision-doc-rag/config.yaml @@ -0,0 +1,40 @@ +# SIE server (defaults to local Docker: docker run -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cpu-default). +# Override with SIE_CLUSTER_URL / SIE_API_KEY env vars when targeting a managed cluster. +cluster: + url: "http://localhost:8080" + api_key: "" + gpu: "" # only set for managed multi-GPU clusters (e.g. "l4-spot"); ignored locally + provision_timeout_s: 600 + +# Models. 
The retrieval signal is vision end-to-end: ColQwen2.5 reads each page +# as an image and we late-interact (MaxSim) against the same model's text-side +# embedding of the query. No OCR is involved in ranking, so charts, screenshots, +# tables, and any other layout cues that wouldn't survive an OCR round-trip +# still contribute to the score. +# +# DocVQA produces a textual answer for the top page. The model takes the page +# image + the user's question (passed via `instruction`) and returns the answer +# as an entity in the response — no separate LLM call needed. +models: + retriever: "vidore/colqwen2.5-v0.2" + docvqa: "mynkchaudhry/Florence-2-FT-DocVQA" + # Optional second-stage cross-encoder rerank. Visual model so we don't have to + # collapse the page through OCR before reranking. Disabled by default while + # we wait for the cluster-side adapter fix to land: + # https://github.com/superlinked/sie-internal/issues/1026 + # Enable with search.visual_rerank: true once that ships. + reranker: "Qwen/Qwen3-VL-Reranker-2B" + +# Page rendering. `auto` tries pdf2image/Poppler first and falls back to +# PyMuPDF when Poppler is not installed. +render: + backend: "auto" # auto | pdf2image | pymupdf + dpi: 160 + +# Retrieval +search: + top_k_candidates: 5 # how many pages survive MaxSim + top_k_results: 3 # how many pages return after optional rerank + visual_rerank: false # see models.reranker note above + answer: true # run DocVQA on the top page for a textual answer + ocr_snippet: true # OCR the top page for a display-only snippet in the UI diff --git a/examples/vision-doc-rag/data/fetch_pdfs.py b/examples/vision-doc-rag/data/fetch_pdfs.py new file mode 100644 index 00000000..21ade844 --- /dev/null +++ b/examples/vision-doc-rag/data/fetch_pdfs.py @@ -0,0 +1,158 @@ +"""Download the public PDF corpus for the visual document RAG demo. + +The corpus is intentionally small and curated. 
Each source has a tenant, a +stable slug, source metadata, and a limited page selection so the demo can be +indexed quickly while still containing diagrams, schematics, screenshots, and +technical figures that reward visual retrieval. +""" + +from __future__ import annotations + +import json +import shutil +import sys +import tempfile +from pathlib import Path +from urllib.error import HTTPError, URLError +from urllib.request import Request, urlopen + + +SOURCES = [ + { + "client": "embedded-lab", + "slug": "raspberry-pi-pico-datasheet", + "title": "Raspberry Pi Pico Datasheet", + "publisher": "Raspberry Pi Ltd", + "license": "CC BY-ND 4.0", + "url": "https://datasheets.raspberrypi.com/pico/pico-datasheet.pdf", + "pages": [4, 5, 6, 7, 8, 9], + }, + { + "client": "embedded-lab", + "slug": "arduino-uno-r3-datasheet", + "title": "Arduino UNO R3 Datasheet", + "publisher": "Arduino", + "license": "Arduino documentation / open hardware terms", + "url": "https://docs.arduino.cc/resources/datasheets/A000066-datasheet.pdf", + "pages": [5, 6, 7, 8, 9, 10, 11], + }, + { + "client": "embedded-lab", + "slug": "arduino-uno-r3-schematic", + "title": "Arduino UNO R3 Schematic", + "publisher": "Arduino", + "license": "CC BY-SA 4.0 hardware reference design", + "url": "https://docs.arduino.cc/resources/schematics/A000066-schematics.pdf", + "pages": [1, 2], + }, + { + "client": "ops-eng", + "slug": "postgresql-18-manual", + "title": "PostgreSQL 18 Documentation", + "publisher": "PostgreSQL Global Development Group", + "license": "PostgreSQL License", + "url": "https://www.postgresql.org/files/documentation/pdf/18/postgresql-18-A4.pdf", + "pages": [19, 20, 21, 22, 23, 24], + }, + { + "client": "ops-eng", + "slug": "kubernetes-infrastructure-abstraction", + "title": "Kubernetes as Infrastructure Abstraction", + "publisher": "Cloud Native Computing Foundation", + "license": "CNCF public presentation material", + "url": 
"https://www.cncf.io/wp-content/uploads/2020/08/2019-09-Kubernetes-as-Infrastructure-Abstraction.pdf", + "pages": [6, 7, 8, 9, 10, 11], + }, + { + "client": "ops-eng", + "slug": "cloud-native-ai-whitepaper", + "title": "Cloud Native Artificial Intelligence Whitepaper", + "publisher": "Cloud Native Computing Foundation", + "license": "CNCF documentation / report terms", + "url": "https://www.cncf.io/wp-content/uploads/2024/03/cloud_native_ai24_031424a-2.pdf", + "pages": [11, 12, 13, 14, 15, 16], + }, + { + "client": "aerospace", + "slug": "solid-rocket-motor-nozzles", + "title": "Solid Rocket Motor Nozzles", + "publisher": "NASA Technical Reports Server", + "license": "NASA STI public release", + "url": "https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19760013126.pdf", + "pages": [1, 2, 3, 4, 5, 6], + }, + { + "client": "aerospace", + "slug": "liquid-rocket-engine-nozzles", + "title": "Liquid Rocket Engine Nozzles", + "publisher": "NASA Technical Reports Server", + "license": "NASA STI public release", + "url": "https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19770009165.pdf", + "pages": [1, 2, 3, 4, 5, 6], + }, + { + "client": "aerospace", + "slug": "sls-booster-state-machine", + "title": "State Machine Modeling of the Space Launch System Solid Rocket Boosters", + "publisher": "NASA Technical Reports Server", + "license": "NASA STI public release", + "url": "https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/20160000328.pdf", + "pages": [1, 2, 3, 4, 5, 6], + }, +] + + +def _download(url: str, out: Path) -> bool: + """Download url to out atomically. 
Return True when a new file was written.""" + if out.exists() and out.stat().st_size > 0: + return False + + out.parent.mkdir(parents=True, exist_ok=True) + request = Request( + url, + headers={ + "User-Agent": "sie-vision-doc-rag-demo/1.0", + "Accept": "application/pdf,*/*", + }, + ) + with tempfile.NamedTemporaryFile(delete=False, dir=out.parent, suffix=".tmp") as tmp: + tmp_path = Path(tmp.name) + try: + with urlopen(request, timeout=60) as response: + shutil.copyfileobj(response, tmp) + except (HTTPError, URLError, TimeoutError): + tmp_path.unlink(missing_ok=True) + raise + + tmp_path.replace(out) + return True + + +def main() -> None: + here = Path(__file__).resolve().parent + pdf_root = here / "pdfs" + manifest = [] + + for source in SOURCES: + pdf_path = pdf_root / source["client"] / f"{source['slug']}.pdf" + try: + downloaded = _download(source["url"], pdf_path) + except Exception as exc: + print(f"Failed to download {source['url']}: {type(exc).__name__}: {exc}", file=sys.stderr) + raise + + row = dict(source) + row["pdf_path"] = str(pdf_path.relative_to(here)) + row["source_pdf"] = pdf_path.name + manifest.append(row) + + status = "downloaded" if downloaded else "cached" + print(f" {status:10s} {source['client']:12s} {source['slug']} -> {row['pdf_path']}") + + out = here / "pdfs_manifest.json" + out.write_text(json.dumps({"sources": manifest}, indent=2) + "\n") + print(f"\nWrote {len(manifest)} PDF sources to {out}") + + +if __name__ == "__main__": + main() diff --git a/examples/vision-doc-rag/data/render_pages.py b/examples/vision-doc-rag/data/render_pages.py new file mode 100644 index 00000000..9c123305 --- /dev/null +++ b/examples/vision-doc-rag/data/render_pages.py @@ -0,0 +1,147 @@ +"""Rasterize the curated PDF corpus to page PNGs. + +The script tries pdf2image first because it produces excellent page images +when Poppler is installed. 
If Poppler or pdf2image is unavailable, it falls +back to PyMuPDF so the demo still works with only Python package dependencies. +""" + +from __future__ import annotations + +import json +import sys +from pathlib import Path + +import yaml + + +def _selected_pages(source: dict, total_pages: int) -> list[int]: + pages = source.get("pages") + if pages: + selected = [int(p) for p in pages if 1 <= int(p) <= total_pages] + else: + start = int(source.get("start_page", 1)) + max_pages = int(source.get("max_pages", 6)) + selected = list(range(start, min(total_pages, start + max_pages - 1) + 1)) + + if not selected: + raise ValueError(f"No valid pages selected for {source['slug']} ({total_pages} pages)") + return selected + + +def _pdf_page_count_with_pymupdf(pdf_path: Path) -> int: + import fitz + + with fitz.open(pdf_path) as doc: + return doc.page_count + + +def _render_with_pdf2image(pdf_path: Path, page_number: int, out_path: Path, dpi: int) -> None: + from pdf2image import convert_from_path + + images = convert_from_path( + str(pdf_path), + dpi=dpi, + first_page=page_number, + last_page=page_number, + fmt="png", + single_file=True, + ) + if not images: + raise RuntimeError(f"pdf2image returned no image for {pdf_path} page {page_number}") + images[0].save(out_path) + + +def _render_with_pymupdf(pdf_path: Path, page_number: int, out_path: Path, dpi: int) -> None: + import fitz + + zoom = dpi / 72 + matrix = fitz.Matrix(zoom, zoom) + with fitz.open(pdf_path) as doc: + page = doc.load_page(page_number - 1) + pixmap = page.get_pixmap(matrix=matrix, alpha=False) + pixmap.save(out_path) + + +def _render_page(pdf_path: Path, page_number: int, out_path: Path, dpi: int, backend: str) -> str: + out_path.parent.mkdir(parents=True, exist_ok=True) + if backend in {"auto", "pdf2image"}: + try: + _render_with_pdf2image(pdf_path, page_number, out_path, dpi) + return "pdf2image" + except Exception as exc: + if backend == "pdf2image": + raise + print( + f" pdf2image unavailable for 
{pdf_path.name} p.{page_number} " + f"({type(exc).__name__}); falling back to PyMuPDF", + file=sys.stderr, + ) + + _render_with_pymupdf(pdf_path, page_number, out_path, dpi) + return "pymupdf" + + +def main() -> None: + here = Path(__file__).resolve().parent + root = here.parent + manifest_path = here / "pdfs_manifest.json" + if not manifest_path.exists(): + print("pdfs_manifest.json not found; run `python data/fetch_pdfs.py` first", file=sys.stderr) + sys.exit(1) + + config = yaml.safe_load((root / "config.yaml").read_text()) + render_config = config.get("render", {}) + dpi = int(render_config.get("dpi", 160)) + backend = render_config.get("backend", "auto") + active_backend = backend + out_dir = here / "pages" + + pdf_manifest = json.loads(manifest_path.read_text()) + page_manifest: list[dict] = [] + backend_counts: dict[str, int] = {} + + for source in pdf_manifest["sources"]: + pdf_path = here / source["pdf_path"] + if not pdf_path.exists(): + raise FileNotFoundError(f"Missing PDF: {pdf_path}. 
Run data/fetch_pdfs.py.") + + total_pages = _pdf_page_count_with_pymupdf(pdf_path) + for page_number in _selected_pages(source, total_pages): + page_id = f"{source['client']}__{source['slug']}__p{page_number:03d}" + image_path = out_dir / source["client"] / f"{source['slug']}_p{page_number:03d}.png" + used_backend = _render_page(pdf_path, page_number, image_path, dpi, active_backend) + if backend == "auto" and used_backend == "pymupdf": + active_backend = "pymupdf" + backend_counts[used_backend] = backend_counts.get(used_backend, 0) + 1 + + rel_image_path = image_path.relative_to(here) + page_manifest.append( + { + "page_id": page_id, + "client": source["client"], + "title": source["title"], + "publisher": source["publisher"], + "license": source["license"], + "source_url": source["url"], + "source_pdf": source["source_pdf"], + "source_pdf_path": source["pdf_path"], + "page_number": page_number, + "image_path": str(rel_image_path), + } + ) + print( + f" {source['client']:12s} {source['slug']:38s} " + f"p.{page_number:<4d} -> data/{rel_image_path}" + ) + + out = here / "pages_manifest.json" + out.write_text(json.dumps(page_manifest, indent=2) + "\n") + + print(f"\nRendered {len(page_manifest)} pages to {out_dir}") + print(f"Wrote page manifest to {out}") + for name, count in sorted(backend_counts.items()): + print(f" {name}: {count} pages") + + +if __name__ == "__main__": + main() diff --git a/examples/vision-doc-rag/python/ingest.py b/examples/vision-doc-rag/python/ingest.py new file mode 100644 index 00000000..8b0f8e11 --- /dev/null +++ b/examples/vision-doc-rag/python/ingest.py @@ -0,0 +1,124 @@ +"""Build the per-tenant visual index. + +For every rendered PDF page PNG we ask SIE to encode the image with +vidore/colqwen2.5-v0.2, which returns a [tokens, 128] multivector. 
Each page's +multivector goes into a single .npz on disk, alongside a metadata.json that +keeps the client name, source PDF, page number, and source URL for routing, +filtering, and citation at query time. + +There is no vector database here. MaxSim at the scale of one team's wiki +(hundreds to thousands of pages) is cheap and avoids the indexing step. +For larger corpora swap the .npz for a multivector store (LanceDB, Vespa, +Turbopuffer); the encode call is the same. +""" + +from __future__ import annotations + +import json +import os +import time +from pathlib import Path + +import numpy as np +import yaml + +from sie_sdk import SIEClient +from sie_sdk.types import Item + + +def load_config(): + return yaml.safe_load((Path(__file__).resolve().parent.parent / "config.yaml").read_text()) + + +def load_pages(): + pages_path = Path(__file__).resolve().parent.parent / "data" / "pages_manifest.json" + if not pages_path.exists(): + raise FileNotFoundError( + "data/pages_manifest.json not found. Run `python data/fetch_pdfs.py` " + "and `python data/render_pages.py` first." + ) + return json.loads(pages_path.read_text()) + + +def encode_pages(client: SIEClient, model: str, pages: list[dict], gpu: str, timeout: float): + data_dir = Path(__file__).resolve().parent.parent / "data" + multivectors: list[np.ndarray] = [] + metadata: list[dict] = [] + + for i, page in enumerate(pages, 1): + image_path = data_dir / page["image_path"] + if not image_path.exists(): + raise FileNotFoundError(f"Missing page image: {image_path}. 
Run data/render_pages.py.") + + start = time.time() + result = client.encode( + model, + Item(id=page["page_id"], images=[str(image_path)]), + output_types=["multivector"], + gpu=gpu, + wait_for_capacity=True, + provision_timeout_s=timeout, + ) + elapsed = time.time() - start + mv = result["multivector"].astype(np.float32) + multivectors.append(mv) + metadata.append( + { + "page_id": page["page_id"], + "client": page["client"], + "title": page["title"], + "publisher": page["publisher"], + "license": page["license"], + "source_url": page["source_url"], + "source_pdf": page["source_pdf"], + "source_pdf_path": page["source_pdf_path"], + "page_number": page["page_number"], + "image_path": page["image_path"], + "num_tokens": int(mv.shape[0]), + } + ) + citation = f"{page['source_pdf']} · p.{page['page_number']}" + print(f" [{i}/{len(pages)}] {page['client']:12s} {citation:44s} {mv.shape} in {elapsed:.1f}s") + + return multivectors, metadata + + +def main(): + config = load_config() + pages = load_pages() + print(f"Loaded {len(pages)} pages") + + cluster_url = os.environ.get("SIE_CLUSTER_URL", config["cluster"]["url"]) + api_key = os.environ.get("SIE_API_KEY", config["cluster"]["api_key"]) + gpu = config["cluster"]["gpu"] + timeout = config["cluster"]["provision_timeout_s"] + model = config["models"]["retriever"] + + print(f"\n--- Encoding pages with {model} ---") + with SIEClient(cluster_url, api_key=api_key) as client: + multivectors, metadata = encode_pages(client, model, pages, gpu, timeout) + + data_dir = Path(__file__).resolve().parent.parent / "data" + # np.savez stores variable-length multivectors as one entry per array; we + # key them by page_id so the search side can reload without an extra index. 
+ np.savez( + data_dir / "multivectors.npz", + **{m["page_id"]: mv for m, mv in zip(metadata, multivectors)}, + ) + (data_dir / "metadata.json").write_text(json.dumps(metadata, indent=2)) + + total_tokens = sum(m["num_tokens"] for m in metadata) + by_client: dict[str, int] = {} + for m in metadata: + by_client[m["client"]] = by_client.get(m["client"], 0) + 1 + + print(f"\n Saved {len(metadata)} multivectors to data/multivectors.npz") + print(f" Saved metadata to data/metadata.json") + print(f" Total visual tokens: {total_tokens}") + print(" Pages per tenant:") + for client_name in sorted(by_client): + print(f" {client_name}: {by_client[client_name]}") + + +if __name__ == "__main__": + main() diff --git a/examples/vision-doc-rag/python/requirements.txt b/examples/vision-doc-rag/python/requirements.txt new file mode 100644 index 00000000..1ea77ae9 --- /dev/null +++ b/examples/vision-doc-rag/python/requirements.txt @@ -0,0 +1,8 @@ +sie-sdk==0.3.4 +fastapi>=0.115.0 +uvicorn>=0.30.0 +numpy>=1.26.0 +pyyaml>=6.0 +Pillow>=10.3.0 +pdf2image>=1.17.0 +PyMuPDF>=1.24.0 diff --git a/examples/vision-doc-rag/python/search.py b/examples/vision-doc-rag/python/search.py new file mode 100644 index 00000000..74f86cd1 --- /dev/null +++ b/examples/vision-doc-rag/python/search.py @@ -0,0 +1,261 @@ +"""Visual document search + question answering, vision end-to-end. + +Pipeline per query: + 1. encode(ColQwen2.5, text) — query multivector + 2. sie_sdk.scoring.maxsim — late interaction against page images + 3. score(Qwen3-VL-Reranker, query, images) — optional, off by default + 4. extract(Florence-2-FT-DocVQA, instruction=query, images=[top page]) + — textual answer + citation + 5. extract(Florence-2-FT-DocVQA, images=[top page]) + — OCR snippet for the UI (display only, + NOT in the ranking path) + +The ranking is decided by a vision model looking at the page image, so charts, +screenshots, tables, and any other visual signal that OCR would erase still +contributes. 
OCR runs only on the chosen page, only to provide on-screen text +the user can read or copy. + +Multi-tenant isolation is a Python filter on metadata before MaxSim, so a +query scoped to one client never sees another client's pages. +""" + +from __future__ import annotations + +import json +import os +import time +from pathlib import Path + +import numpy as np +import yaml + +from sie_sdk import SIEClient +from sie_sdk.scoring import maxsim +from sie_sdk.types import Item + + +def load_config(): + return yaml.safe_load((Path(__file__).resolve().parent.parent / "config.yaml").read_text()) + + +def load_index(): + data_dir = Path(__file__).resolve().parent.parent / "data" + if not (data_dir / "multivectors.npz").exists(): + raise FileNotFoundError("data/multivectors.npz missing. Run `python python/ingest.py` first.") + npz = np.load(data_dir / "multivectors.npz") + metadata = json.loads((data_dir / "metadata.json").read_text()) + required = {"page_id", "client", "source_pdf", "page_number", "image_path", "publisher", "source_url"} + if metadata: + missing = required - set(metadata[0]) + if missing: + raise ValueError( + "data/metadata.json was generated by an older corpus shape. " + "Run `python data/fetch_pdfs.py`, `python data/render_pages.py`, " + "then `python python/ingest.py`." + ) + multivectors = {m["page_id"]: npz[m["page_id"]] for m in metadata} + return multivectors, metadata + + +def _ocr_snippet(entities: list[dict], max_chars: int = 400) -> str: + """Concatenate OCR text regions into a single readable snippet.""" + pieces = [] + for e in entities or []: + text = (e.get("text") or "").replace("", "").strip() + if text: + pieces.append(text) + joined = " · ".join(pieces) + if len(joined) > max_chars: + return joined[: max_chars - 1] + "…" + return joined + + +def _docvqa_answer(entities: list[dict]) -> str: + """Pick the answer string out of a Florence-2 DocVQA response. 
+
+    Florence-2 returns the answer as an entity (often the single one when the
+    `` task token is dispatched). We take the first non-empty text.
+    """
+    for e in entities or []:
+        text = (e.get("text") or "").replace("", "").strip()
+        if text:
+            return text
+    return ""
+
+
+def search(
+    client: SIEClient,
+    config: dict,
+    multivectors: dict[str, np.ndarray],
+    metadata: list[dict],
+    query: str,
+    client_filter: str | None = None,
+) -> dict:
+    gpu = config["cluster"]["gpu"]
+    timeout = config["cluster"]["provision_timeout_s"]
+    top_k_candidates = config["search"]["top_k_candidates"]
+    top_k_results = config["search"]["top_k_results"]
+    do_visual_rerank = config["search"].get("visual_rerank", False)
+    do_answer = config["search"].get("answer", True)
+    do_ocr_snippet = config["search"].get("ocr_snippet", True)
+
+    corpus = [m for m in metadata if not client_filter or m["client"] == client_filter]
+    if not corpus:
+        return {"results": [], "answer": None, "timings": {}}
+
+    timings: dict[str, float] = {}
+    pages_root = Path(__file__).resolve().parent.parent / "data"
+
+    # 1. Encode query (text side of ColQwen2.5).
+    t0 = time.time()
+    q_result = client.encode(
+        config["models"]["retriever"],
+        Item(text=query),
+        output_types=["multivector"],
+        is_query=True,
+        gpu=gpu,
+        wait_for_capacity=True,
+        provision_timeout_s=timeout,
+    )
+    timings["encode_query_s"] = round(time.time() - t0, 3)
+    query_mv = q_result["multivector"].astype(np.float32)
+
+    # 2. MaxSim against in-memory multivectors.
+    doc_mvs = [multivectors[m["page_id"]] for m in corpus]
+    t0 = time.time()
+    maxsim_scores = maxsim(query_mv, doc_mvs)
+    timings["maxsim_s"] = round(time.time() - t0, 3)
+
+    order = np.argsort(maxsim_scores)[::-1][:top_k_candidates]
+    candidates: list[dict] = []
+    for idx in order:
+        c = dict(corpus[idx])
+        c["_maxsim_score"] = float(maxsim_scores[idx])
+        c["_rerank_score"] = None
+        candidates.append(c)
+
+    # 3. Optional visual rerank. Image-in cross-encoder so OCR never enters the
+    # ranking path. Disabled by default — see config.yaml for the cluster
+    # bug we're waiting on.
+    if do_visual_rerank and candidates:
+        try:
+            t0 = time.time()
+            rerank_items = [
+                Item(id=c["page_id"], images=[str(pages_root / c["image_path"])])
+                for c in candidates
+            ]
+            rerank = client.score(
+                config["models"]["reranker"],
+                Item(text=query),
+                rerank_items,
+                gpu=gpu,
+                wait_for_capacity=True,
+                provision_timeout_s=timeout,
+            )
+            timings["visual_rerank_s"] = round(time.time() - t0, 3)
+            rerank_by_id = {s["item_id"]: s for s in rerank["scores"]}
+            for c in candidates:
+                s = rerank_by_id.get(c["page_id"])
+                c["_rerank_score"] = float(s["score"]) if s else 0.0
+            candidates.sort(key=lambda c: c["_rerank_score"] or 0.0, reverse=True)
+        except Exception as exc:
+            # Cluster adapter bug fallback: keep MaxSim ordering, surface the
+            # failure to the caller. See sie-internal#1026.
+            timings["visual_rerank_error"] = type(exc).__name__
+
+    results = candidates[:top_k_results]
+
+    # 4. DocVQA answer from the top page image. instruction= goes in as the
+    # plain question; the adapter prepends Florence-2's `` task
+    # token. See superlinked.com/docs/extract/vision.
+    answer = None
+    if do_answer and results:
+        top = results[0]
+        try:
+            t0 = time.time()
+            qa = client.extract(
+                config["models"]["docvqa"],
+                Item(images=[str(pages_root / top["image_path"])]),
+                instruction=query,
+                gpu=gpu,
+                wait_for_capacity=True,
+                provision_timeout_s=timeout,
+            )
+            timings["docvqa_s"] = round(time.time() - t0, 3)
+            answer = _docvqa_answer(qa["entities"])
+        except Exception as exc:
+            timings["docvqa_error"] = type(exc).__name__
+
+    # 5. OCR snippet for display — only on the top result so users see the
+    # text on the page they're being shown. Never used as a ranking signal.
+
+    if do_ocr_snippet and results:
+        top = results[0]
+        try:
+            t0 = time.time()
+            ocr = client.extract(
+                config["models"]["docvqa"],  # same model, no `instruction` ⇒ OCR mode
+                Item(images=[str(pages_root / top["image_path"])]),
+                gpu=gpu,
+                wait_for_capacity=True,
+                provision_timeout_s=timeout,
+            )
+            timings["ocr_snippet_s"] = round(time.time() - t0, 3)
+            top["ocr_snippet"] = _ocr_snippet(ocr["entities"])
+        except Exception as exc:
+            timings["ocr_snippet_error"] = type(exc).__name__
+
+    return {"results": results, "answer": answer, "timings": timings}
+
+
+def print_run(out: dict, query: str, client_filter: str | None):
+    scope = client_filter or "all clients"
+    print(f'\n Query: "{query}" ({scope})')
+    print(f" Timings: {out['timings']}")
+    if out["answer"]:
+        print(f"\n Answer: {out['answer']}")
+    if not out["results"]:
+        print(" No results.")
+        return
+    for i, r in enumerate(out["results"], 1):
+        rerank = r.get("_rerank_score")
+        rerank_str = f"rerank={rerank:.4f}" if rerank is not None else "rerank=—"
+        print(f"\n {i}. [{r['client']}] {r['title']}")
+        print(f" {r['source_pdf']} · p.{r['page_number']} · {r['publisher']}")
+        print(f" maxsim={r['_maxsim_score']:.3f} {rerank_str}")
+        if r.get("ocr_snippet"):
+            print(f" OCR snippet: {r['ocr_snippet'][:200]}")
+        print(f" url: {r['source_url']}")
+
+
+def main():
+    config = load_config()
+    multivectors, metadata = load_index()
+    print(f"Loaded index: {len(metadata)} pages")
+
+    cluster_url = os.environ.get("SIE_CLUSTER_URL", config["cluster"]["url"])
+    api_key = os.environ.get("SIE_API_KEY", config["cluster"]["api_key"])
+
+    demo = [
+        # Visual signal — ranking is driven by the page image.
+        ("Raspberry Pi Pico pinout GP21", "embedded-lab"),
+        ("cloud native architecture diagram", "ops-eng"),
+        ("solid rocket motor nozzle design figure", "aerospace"),
+        # No tenant filter: shows the query routes across tenants.
+        ("ATmega16U2 power tree diagram", None),
+        # Table / value lookup — DocVQA must return a specific value, not the title.
+        ("What is the operating voltage range of the Raspberry Pi Pico?", "embedded-lab"),
+        ("PostgreSQL default listening port", "ops-eng"),
+        # Disambiguation — two PDFs in one tenant; the right one must win.
+        ("solid propellant rocket nozzle cross-section", "aerospace"),
+        # Tenant-leak negative — the matching content lives in aerospace; scoping
+        # to ops-eng must return no aerospace pages.
+        ("regeneratively cooled nozzle", "ops-eng"),
+    ]
+    with SIEClient(cluster_url, api_key=api_key) as client:
+        for query, tenant in demo:
+            out = search(client, config, multivectors, metadata, query, tenant)
+            print_run(out, query, tenant)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/examples/vision-doc-rag/python/server.py b/examples/vision-doc-rag/python/server.py
new file mode 100644
index 00000000..990857fa
--- /dev/null
+++ b/examples/vision-doc-rag/python/server.py
@@ -0,0 +1,99 @@
+"""FastAPI backend for the multi-tenant visual-document search + QA demo."""
+
+from __future__ import annotations
+
+import os
+from contextlib import asynccontextmanager
+from pathlib import Path
+
+import yaml
+from fastapi import FastAPI, Query
+from fastapi.responses import FileResponse
+from fastapi.staticfiles import StaticFiles
+
+from sie_sdk import SIEClient
+
+from search import load_index, search
+
+config = None
+multivectors = None
+metadata = None
+client = None
+clients_index: list[str] = []
+
+
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    global config, multivectors, metadata, client, clients_index
+    root = Path(__file__).resolve().parent.parent
+    config = yaml.safe_load((root / "config.yaml").read_text())
+    multivectors, metadata = load_index()
+    cluster_url = os.environ.get("SIE_CLUSTER_URL", config["cluster"]["url"])
+    api_key = os.environ.get("SIE_API_KEY", config["cluster"]["api_key"])
+    client = SIEClient(cluster_url, api_key=api_key)
+    clients_index = sorted({m["client"] for m in metadata})
+    yield
+    client.close()
+
+
+app = FastAPI(title="SIE Vision-First Document RAG", lifespan=lifespan)
+
+root = Path(__file__).resolve().parent.parent
+static_dir = root / "static"
+app.mount("/static", StaticFiles(directory=str(static_dir)), name="static")
+app.mount("/pages", StaticFiles(directory=str(root / "data" / "pages")), name="pages")
+
+
+@app.get("/")
+def index():
+    return FileResponse(str(static_dir / "index.html"))
+
+
+@app.get("/api/clients")
+def api_clients():
+    return clients_index
+
+
+@app.get("/api/stats")
+def api_stats():
+    return {
+        "total_pages": len(metadata),
+        "clients": clients_index,
+        "models": config["models"],
+        "visual_rerank": config["search"].get("visual_rerank", False),
+        "answer": config["search"].get("answer", True),
+    }
+
+
+@app.get("/api/search")
+def api_search(
+    q: str = Query(..., min_length=1),
+    client_name: str | None = Query(None, alias="client"),
+):
+    out = search(client, config, multivectors, metadata, q, client_name)
+    return {
+        "query": q,
+        "client": client_name,
+        "answer": out["answer"],
+        "timings": out["timings"],
+        "results": [
+            {
+                "page_id": r["page_id"],
+                "client": r["client"],
+                "title": r["title"],
+                "publisher": r["publisher"],
+                "license": r["license"],
+                "source_url": r["source_url"],
+                "source_pdf": r["source_pdf"],
+                "page_number": r["page_number"],
+                "citation": f"{r['source_pdf']} · p.{r['page_number']}",
+                "page_image": f"/{r['image_path']}",
+                "ocr_snippet": r.get("ocr_snippet", ""),
+                "scores": {
+                    "maxsim": round(r["_maxsim_score"], 4),
+                    "rerank": round(r["_rerank_score"], 4) if r.get("_rerank_score") is not None else None,
+                },
+            }
+            for r in out["results"]
+        ],
+    }
diff --git a/examples/vision-doc-rag/static/index.html b/examples/vision-doc-rag/static/index.html
new file mode 100644
index 00000000..2b3eb7c3
--- /dev/null
+++ b/examples/vision-doc-rag/static/index.html
@@ -0,0 +1,199 @@
+
+
+
+
+
+
Vision-First Document RAG · SIE + + + +
+

Multi-Tenant Visual Doc Search + QA

+

ColQwen2.5 ranks pages by looking at the images. Florence-2-DocVQA reads the top page and answers the question. All on one SIE endpoint.

+
+
+
+ + + +
+
+
+
+
+ + +