1 change: 1 addition & 0 deletions examples/README.md
@@ -18,6 +18,7 @@ service keys.
| [Build a multimodal wine recommender with OCR](./wine-recommender) | Combining preference-based retrieval with OCR-driven label detection in one UI | `encode`, `score`, `extract` | Docker Compose app plus local SIE endpoint; API key optional for unauthenticated SIE | Runnable demo |
| [Build a multi-modal product classifier with embeddings](./taxonomy-classification) | Evaluating text, image, NLI, and reranking approaches for hierarchical product taxonomy classification | `extract`, `encode`, `score` | SIE endpoint, Shopify dataset prep via `uv run` scripts, standalone `uv` project | Runnable evaluation example |
| [Swap an OCR model with one identifier change](./document-ocr) | Driving recognition (VLM-OCR), structured extraction (Donut), and zero-shot NER (GLiNER) through the same `extract` call by swapping the model ID | `extract` | Docker Compose plus Node UI, no API key required, hosted version on [Hugging Face Spaces](https://huggingface.co/spaces/superlinked/document-ocr) | Runnable demo |
| [Vision-first document RAG](./vision-doc-rag) | Retrieving and answering questions over a multi-tenant page corpus by looking at page images, with OCR kept out of the score path | `encode`, `extract`, `score` (optional) | SIE endpoint with a GPU recommended for ColQwen2.5 + Florence-2-DocVQA | Runnable demo |

For docs publishing, lead with the quickest runnable demos, then use the
benchmark and evaluation examples for deeper technical users.
9 changes: 9 additions & 0 deletions examples/vision-doc-rag/.gitignore
@@ -0,0 +1,9 @@
.venv/
__pycache__/
data/pages.json
data/pdfs_manifest.json
data/pages_manifest.json
data/pdfs/
data/pages/
data/multivectors.npz
data/metadata.json
229 changes: 229 additions & 0 deletions examples/vision-doc-rag/README.md
@@ -0,0 +1,229 @@
# Vision-first document RAG

Retrieve by image, answer by image. ColQwen2.5 reads each PDF page as a
picture and ranks pages via late interaction; Florence-2-DocVQA reads the
winning page and produces the textual answer. OCR never enters the score path,
so schematics, pinout diagrams, architecture slides, charts, and other layout
cues still drive ranking. Everything runs on one SIE endpoint.

Each page also carries a `client` tag, so the same corpus serves multiple
tenants from one index. Queries scoped to `embedded-lab` cannot retrieve
`ops-eng` or `aerospace` pages.
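
A minimal sketch of how this scoping can work, assuming page metadata is a list of dicts with a `client` key. The field name matches the API response; the list layout and the `tenant_indices` helper are hypothetical. Filtering runs before scoring, so out-of-tenant pages never become candidates.

```python
def tenant_indices(metadata, client=None):
    """Return indices of pages visible to `client`; None means all tenants."""
    if client is None:
        return list(range(len(metadata)))
    return [i for i, page in enumerate(metadata) if page["client"] == client]


# Hypothetical metadata rows; the real file is written by ingest.py.
metadata = [
    {"page_id": "a", "client": "embedded-lab"},
    {"page_id": "b", "client": "ops-eng"},
    {"page_id": "c", "client": "aerospace"},
    {"page_id": "d", "client": "embedded-lab"},
]

visible = tenant_indices(metadata, "embedded-lab")
```

Only the multivectors at the returned indices would then be passed to MaxSim.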

## Corpus

The demo fetches a small public PDF batch on demand and renders selected pages
to PNGs. The page selections are deliberately capped so local ingest stays
fast while still indexing visually rich pages.

| Tenant | Sources | Visual signal |
|---|---|---|
| `embedded-lab` | Raspberry Pi Pico datasheet, Arduino UNO R3 datasheet, Arduino UNO R3 schematic | Pinout diagrams, board diagrams, circuit schematics |
| `ops-eng` | PostgreSQL manual, CNCF Kubernetes / cloud-native architecture material | Architecture diagrams, operational flows, dense technical tables |
| `aerospace` | NASA NTRS nozzle and booster reports | Engineering drawings, cross-sections, charts, mission technical figures |

Generated files are ignored:

```text
data/pdfs/ # downloaded PDFs
data/pdfs_manifest.json # source manifest from fetch_pdfs.py
data/pages/ # rendered PNG pages
data/pages_manifest.json # page-level metadata from render_pages.py
data/metadata.json # index metadata from ingest.py
data/multivectors.npz # page multivectors from ingest.py
```

## SIE features used

- `encode` - `vidore/colqwen2.5-v0.2` on page images at ingest and on query
text at search time. Output is a `[tokens, 128]` multivector. Late
interaction (`sie_sdk.scoring.maxsim`) is the first-stage ranking signal.
- `extract` - `mynkchaudhry/Florence-2-FT-DocVQA`. Called with
`instruction=<your question>` to get a textual answer for the top page, and
without `instruction` to OCR the same page for a display snippet. The OCR
snippet is UX-only; it never enters ranking.
- `score` (optional) - `Qwen/Qwen3-VL-Reranker-2B` second-stage rerank over
`(query text, page image)` pairs. Off by default pending an upstream adapter
fix; set `search.visual_rerank: true` in `config.yaml` to enable it on a
cluster that already has the fix.
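
For intuition, late interaction can be sketched in a few lines of NumPy. This is an illustration of the MaxSim idea, not the `sie_sdk.scoring.maxsim` implementation; the `maxsim_score` name and the toy 2-dimensional vectors are assumptions made for the example.

```python
import numpy as np


def maxsim_score(query_mv: np.ndarray, page_mv: np.ndarray) -> float:
    """Late-interaction score: for each query token take the best-matching
    page token (dot product), then sum over query tokens."""
    sim = query_mv @ page_mv.T  # shape [query_tokens, page_tokens]
    return float(sim.max(axis=1).sum())


# Toy 2-dimensional vectors instead of the 128-dimensional ColQwen2.5 output.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
page_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
page_b = np.array([[0.5, 0.5]])

score_a = maxsim_score(query, page_a)
score_b = maxsim_score(query, page_b)
```

`page_a` contains an exact match for every query token, so it scores higher than `page_b`, which only partially matches each one.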

## Run it

You need Python 3.12 and a reachable SIE cluster.

```bash
# 1. SIE locally, or point SIE_CLUSTER_URL / SIE_API_KEY at a managed cluster.
docker run -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cpu-default

# 2. Fetch public PDFs and render selected pages to PNG.
cd examples/vision-doc-rag
pip install -r python/requirements.txt
python data/fetch_pdfs.py
python data/render_pages.py

# 3. Encode every rendered page with ColQwen2.5 and save the multivectors.
python python/ingest.py

# 4a. CLI demo.
python python/search.py

# 4b. Or start the UI.
uvicorn --app-dir python server:app --port 8888
open http://localhost:8888  # macOS; on Linux use xdg-open or paste the URL into a browser
```

`render_pages.py` uses `pdf2image` when Poppler is available. If Poppler is
not installed, it falls back to PyMuPDF, which is installed from
`python/requirements.txt`.
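
The fallback can be pictured roughly as below. This is a sketch under stated assumptions, not the code in `render_pages.py`: the `pick_backend` helper is hypothetical, and it assumes Poppler is detected by finding its `pdftoppm` binary on `PATH`.

```python
import importlib.util
import shutil


def pick_backend(configured: str = "auto") -> str:
    """Resolve the render backend name.

    Assumption: Poppler availability is detected by looking for its
    `pdftoppm` binary on PATH; render_pages.py may detect it differently.
    """
    if configured in ("pdf2image", "pymupdf"):
        return configured  # explicit choice wins
    if configured != "auto":
        raise ValueError(f"unknown render backend: {configured!r}")
    has_poppler = shutil.which("pdftoppm") is not None
    has_pdf2image = importlib.util.find_spec("pdf2image") is not None
    return "pdf2image" if has_poppler and has_pdf2image else "pymupdf"
```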

First run on a cold cluster pays a one-time model load. ColQwen2.5 and
Florence-2 are both several GB, so expect roughly a minute on CPU and a few
seconds on GPU before the warm path kicks in.

### Managed cluster

```bash
export SIE_CLUSTER_URL="https://your-cluster-host:8080"
export SIE_API_KEY="SL-..."
```

The defaults in `config.yaml` point at `http://localhost:8080`. Set
`cluster.gpu` to a profile name like `l4-spot` if the cluster needs an
explicit GPU class.
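
One way the environment override could be layered over `config.yaml` is sketched here. The `resolve_cluster` helper is hypothetical and the precedence (environment first, then config) is an assumption; the scripts in this example may resolve settings differently.

```python
import os


def resolve_cluster(config: dict, env=os.environ) -> dict:
    """Overlay SIE_CLUSTER_URL / SIE_API_KEY on the `cluster` config block."""
    cluster = dict(config.get("cluster", {}))
    # Assumption: a non-empty environment variable beats the config file.
    cluster["url"] = env.get("SIE_CLUSTER_URL") or cluster.get("url", "http://localhost:8080")
    cluster["api_key"] = env.get("SIE_API_KEY") or cluster.get("api_key", "")
    return cluster


config = {"cluster": {"url": "http://localhost:8080", "api_key": "", "gpu": ""}}
local = resolve_cluster(config, env={})
managed = resolve_cluster(
    config,
    env={"SIE_CLUSTER_URL": "https://your-cluster-host:8080", "SIE_API_KEY": "SL-example"},
)
```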

## Try these queries

| Tenant | Query | Why it's interesting |
|---|---|---|
| `embedded-lab` | Raspberry Pi Pico pinout GP21 | Should land on a pinout/table page even when the visual label is abbreviated. |
| `embedded-lab` | where is the ATmega16U2 on the schematic? | Circuit schematic retrieval, not prose retrieval. |
| `ops-eng` | cloud native architecture diagram | Finds a visual architecture page or slide instead of relying on OCR text only. |
| `aerospace` | solid rocket motor nozzle design figure | Targets an engineering drawing or figure-heavy report page. |
| `ops-eng` | Raspberry Pi Pico pinout GP21 | Tenant filter: the query cannot leak embedded-lab pages when scoped to ops-eng. |

## API

### `GET /api/search`

| Parameter | Required | Description |
|---|---|---|
| `q` | yes | Search query |
| `client` | no | Tenant filter, for example `embedded-lab`. Omitted means search all tenants. |

```bash
curl "http://localhost:8888/api/search?q=Raspberry+Pi+Pico+pinout+GP21&client=embedded-lab"
```

```json
{
  "query": "Raspberry Pi Pico pinout GP21",
  "client": "embedded-lab",
  "answer": "GP21 can be used for ...",
  "results": [
    {
      "page_id": "embedded-lab__raspberry-pi-pico-datasheet__p005",
      "client": "embedded-lab",
      "title": "Raspberry Pi Pico Datasheet",
      "publisher": "Raspberry Pi Ltd",
      "source_pdf": "raspberry-pi-pico-datasheet.pdf",
      "page_number": 5,
      "citation": "raspberry-pi-pico-datasheet.pdf · p.5",
      "page_image": "/pages/embedded-lab/raspberry-pi-pico-datasheet_p005.png",
      "scores": { "maxsim": 14.44, "rerank": null }
    }
  ]
}
```
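
The same request can be made from Python with only the standard library. This is a minimal sketch: the `build_search_url` and `search` helpers are hypothetical, and the fields read from the response are the ones shown in the sample above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(query, client=None, base="http://localhost:8888"):
    """Build the /api/search URL; `client` is omitted to search all tenants."""
    params = {"q": query}
    if client:
        params["client"] = client
    return f"{base}/api/search?{urlencode(params)}"


def search(query, client=None, base="http://localhost:8888"):
    """Call the demo server and return (answer, list of citations)."""
    with urlopen(build_search_url(query, client, base)) as resp:
        body = json.load(resp)
    return body.get("answer"), [hit["citation"] for hit in body["results"]]
```

With the UI server running, `search("Raspberry Pi Pico pinout GP21", "embedded-lab")` would return the answer string and the citations for the ranked pages.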

### `GET /api/clients`, `GET /api/stats`

`/api/clients` returns the tenant list. `/api/stats` returns the runtime
config: active models, rerank on/off, and page count.

## How it works

```text
ingest.py (once per corpus)
  fetch_pdfs.py -> data/pdfs/{tenant}/*.pdf
    -> render_pages.py -> data/pages/{tenant}/*.png
                       -> data/pages_manifest.json
    -> SIE.encode(ColQwen2.5, images, multivector)
    -> data/multivectors.npz + data/metadata.json

search.py / server.py (per query)
  q -> SIE.encode(ColQwen2.5, text, is_query=True)
    -> filter metadata by tenant
    -> sie_sdk.scoring.maxsim -> top_k_candidates
    -> optional SIE.score(Qwen3-VL-Reranker, q, images)
    -> SIE.extract(Florence-2-DocVQA, instruction=q, images=[top_page])
    -> SIE.extract(Florence-2-DocVQA, images=[top_page]) for display OCR

OCR is never on the score path. The visual reranker, when enabled, ranks over
the same modality as retrieval, so layout cues survive both stages.
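
The two stages can be sketched as a small selection function. This is an illustration only: `select_pages` is a hypothetical helper, the `rerank` callable is a stand-in for the SIE `score` call, and the default cut-offs mirror `top_k_candidates` and `top_k_results` from `config.yaml`.

```python
def select_pages(maxsim_scores, top_k_candidates=5, top_k_results=3, rerank=None):
    """Pick final pages from first-stage scores.

    maxsim_scores: dict of page_id -> MaxSim score (already tenant-filtered).
    rerank: optional callable taking the candidate page_ids and returning a
            dict of page_id -> rerank score (hypothetical stand-in for the
            SIE `score` call).
    """
    # Stage 1: keep the best pages by MaxSim.
    candidates = sorted(maxsim_scores, key=maxsim_scores.get, reverse=True)[:top_k_candidates]
    # Stage 2 (optional): reorder the survivors by the reranker's score.
    if rerank is not None:
        rerank_scores = rerank(candidates)
        candidates = sorted(candidates, key=rerank_scores.get, reverse=True)
    return candidates[:top_k_results]


scores = {"p1": 14.4, "p2": 9.1, "p3": 12.0, "p4": 3.3}
without_rerank = select_pages(scores, top_k_candidates=3, top_k_results=2)
with_rerank = select_pages(
    scores,
    top_k_candidates=3,
    top_k_results=2,
    rerank=lambda ids: {"p1": 0.1, "p2": 0.5, "p3": 0.9},
)
```

The reranker can only reorder pages that survived MaxSim, so `p4` is never returned in either case.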

The corpus is small enough that MaxSim runs in Python. For thousands of pages,
hand the multivectors to LanceDB, Vespa, or another multivector store; the SIE
calls stay the same.

## Customize

`data/fetch_pdfs.py` owns the curated source list. Add a source with:

```python
{
    "client": "my-tenant",
    "slug": "my-manual",
    "title": "My Manual",
    "publisher": "Example Publisher",
    "license": "CC BY 4.0",
    "url": "https://example.com/my-manual.pdf",
    "pages": [1, 2, 7, 8],
}
```

Then rerun:

```bash
python data/fetch_pdfs.py
python data/render_pages.py
python python/ingest.py
```
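
Each source entry expands to one record per selected page. The sketch below shows one way that expansion could look; the `page_records` helper is hypothetical, and the `page_id` and `citation` patterns are inferred from the sample API response rather than taken from `render_pages.py`, so treat them as assumptions.

```python
def page_records(source):
    """Expand one source entry into per-page records."""
    records = []
    for page in source["pages"]:
        records.append({
            # Pattern inferred from the sample response: client__slug__pNNN.
            "page_id": f"{source['client']}__{source['slug']}__p{page:03d}",
            "client": source["client"],
            "title": source["title"],
            "source_pdf": f"{source['slug']}.pdf",
            "page_number": page,
            "citation": f"{source['slug']}.pdf · p.{page}",
        })
    return records


records = page_records({
    "client": "my-tenant",
    "slug": "my-manual",
    "title": "My Manual",
    "pages": [1, 2, 7, 8],
})
```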

`config.yaml` is the model and rendering tuning surface:

```yaml
models:
  retriever: "vidore/colqwen2.5-v0.2"
  docvqa: "mynkchaudhry/Florence-2-FT-DocVQA"
  reranker: "Qwen/Qwen3-VL-Reranker-2B"
render:
  backend: "auto"
  dpi: 160
search:
  top_k_candidates: 5
  top_k_results: 3
  visual_rerank: false
  answer: true
  ocr_snippet: true
```

## Project layout

```text
examples/vision-doc-rag/
├── config.yaml
├── data/
│   ├── fetch_pdfs.py       # curated public PDF source list + downloader
│   ├── render_pages.py     # PDFs -> PNG pages + pages_manifest.json
│   ├── pdfs/               # generated
│   ├── pages/              # generated PNGs
│   ├── metadata.json       # generated by ingest
│   └── multivectors.npz    # generated by ingest
├── python/
│   ├── ingest.py
│   ├── search.py
│   ├── server.py
│   └── requirements.txt
└── static/
    └── index.html
40 changes: 40 additions & 0 deletions examples/vision-doc-rag/config.yaml
@@ -0,0 +1,40 @@
# SIE server (defaults to local Docker: docker run -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cpu-default).
# Override with SIE_CLUSTER_URL / SIE_API_KEY env vars when targeting a managed cluster.
cluster:
  url: "http://localhost:8080"
  api_key: ""
  gpu: "" # only set for managed multi-GPU clusters (e.g. "l4-spot"); ignored locally
  provision_timeout_s: 600

# Models. The retrieval signal is vision end-to-end: ColQwen2.5 reads each page
# as an image and we late-interact (MaxSim) against the same model's text-side
# embedding of the query. No OCR is involved in ranking, so charts, screenshots,
# tables, and any other layout cue that wouldn't survive an OCR round-trip
# still contributes to the score.
#
# DocVQA produces a textual answer for the top page. The model takes the page
# image + the user's question (passed via `instruction`) and returns the answer
# as an entity in the response — no separate LLM call needed.
models:
  retriever: "vidore/colqwen2.5-v0.2"
  docvqa: "mynkchaudhry/Florence-2-FT-DocVQA"
  # Optional second-stage cross-encoder rerank. Visual model so we don't have to
  # collapse the page through OCR before reranking. Disabled by default while
  # we wait for the cluster-side adapter fix to land:
  # https://github.com/superlinked/sie-internal/issues/1026
  # Re-enable with search.visual_rerank: true once that ships.
  reranker: "Qwen/Qwen3-VL-Reranker-2B"

# Page rendering. `auto` tries pdf2image/Poppler first and falls back to
# PyMuPDF when Poppler is not installed.
render:
  backend: "auto" # auto | pdf2image | pymupdf
  dpi: 160

# Retrieval
search:
  top_k_candidates: 5 # how many pages survive MaxSim
  top_k_results: 3 # how many pages return after optional rerank
  visual_rerank: false # see models.reranker note above
  answer: true # run DocVQA on the top page for a textual answer
  ocr_snippet: true # OCR the top page for a display-only snippet in the UI