A modular multi-stage pipeline for extracting scientific tables, detecting bibliography references, resolving DOI information, and generating structured machine-readable outputs from scientific PDF documents.
This project was developed as part of a Master's thesis focusing on:
- scientific table extraction,
- OCR benchmarking,
- bibliography-aware table processing,
- DOI enrichment,
- structured scientific knowledge extraction.
The pipeline combines multiple OCR and document-processing technologies into a unified workflow capable of processing complete scientific publications automatically.
- Scientific PDF processing
- Automated table detection and cropping
- OCR-based table extraction
- Reference-table detection
- Bibliography extraction
- DOI resolution using Crossref
- Structured CSV generation
- Interactive visualization UI
- Docker-based microservice architecture
- GPU-accelerated OCR processing
PDF Upload
↓
MinerU Table Detection
↓
Table Cropping
↓
PaddleOCR-VL Extraction
↓
Reference Table Detection
↓
GROBID Bibliography Extraction
↓
Reference Matching
↓
Crossref DOI Resolution
↓
Resolved CSV Generation
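The Crossref DOI resolution step near the end of this flow could be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: it assumes the public Crossref REST API `works` endpoint with its `query.bibliographic` parameter, and the helper names (`build_crossref_query`, `pick_best_doi`, `resolve_doi`) and the `min_score` threshold are hypothetical.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

CROSSREF_WORKS_URL = "https://api.crossref.org/works"


def build_crossref_query(reference_text: str, rows: int = 1) -> str:
    """Build a Crossref search URL for a free-text bibliography entry."""
    params = {"query.bibliographic": reference_text, "rows": rows}
    return f"{CROSSREF_WORKS_URL}?{urlencode(params)}"


def pick_best_doi(response: dict, min_score: float = 0.0) -> Optional[str]:
    """Return the DOI of the highest-scoring candidate, or None.

    Assumption: candidates scoring below `min_score` are treated as unresolved.
    """
    items = response.get("message", {}).get("items", [])
    if not items:
        return None
    best = max(items, key=lambda item: item.get("score", 0.0))
    if best.get("score", 0.0) < min_score:
        return None
    return best.get("DOI")


def resolve_doi(reference_text: str, timeout: float = 10.0) -> Optional[str]:
    """Query Crossref over HTTP and return the best-matching DOI, if any."""
    url = build_crossref_query(reference_text)
    with urlopen(url, timeout=timeout) as resp:
        return pick_best_doi(json.load(resp))
```

Keeping URL construction and response parsing separate from the network call makes the matching logic testable without contacting Crossref.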
The system consists of multiple independent services communicating through REST APIs.
React UI
↓
FastAPI Backend
├── MinerU Service
├── PaddleOCR-VL Service
├── GROBID
└── Kreuzberg OCR Fallback
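The role of the Kreuzberg fallback in this layout could be sketched as below. The condition that triggers the fallback is an assumption made for illustration (an exception or an empty result from the primary engine); the function name `extract_with_fallback` and the callable-based interface are hypothetical. In the real system both engines would be reached through their REST services.

```python
from typing import Callable, Tuple

# An OCR engine is modeled as a callable: image bytes in, extracted text out.
OcrFn = Callable[[bytes], str]


def extract_with_fallback(
    image: bytes, primary: OcrFn, fallback: OcrFn
) -> Tuple[str, str]:
    """Return (extracted_text, engine_used).

    Assumption: the fallback runs when the primary engine raises
    or returns only whitespace.
    """
    try:
        text = primary(image)
        if text.strip():
            return text, "primary"
    except Exception:
        # Any primary failure is treated as a reason to fall back.
        pass
    return fallback(image), "fallback"
```

Returning the engine label alongside the text keeps the intermediate output transparent, so later evaluation can separate primary from fallback results.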
project-root/
│
├── backend/
├── ui_input/
├── mineru_service/
├── paddle_service/
├── grobid/
├── kreuzberg_service/
│
├── dataset/
├── evaluation/
│
├── docker-compose.yml
└── README.md
Each component contains its own dedicated README file with detailed implementation and setup documentation.
| Component | Description |
|---|---|
| backend | Main orchestration and API service |
| ui_input | React frontend for pipeline interaction |
| mineru_service | Table detection and cropping |
| paddle_service | OCR and table extraction |
| grobid | Bibliography extraction |
| kreuzberg_service | OCR fallback extraction |
| dataset | Evaluation dataset and benchmark results |
| evaluation | Evaluation plots and benchmark visualizations |
- React
- TypeScript
- Vite
- CSS Modules
- FastAPI
- Python
- SQLAlchemy
- MinerU
- PaddleOCR-VL
- GROBID
- Kreuzberg OCR
- DeepSeek OCR 2
- Chandra OCR
- Docker
- Docker Compose
- GPU acceleration (CUDA)
The project includes a complete evaluation framework for benchmarking:
- table extraction quality,
- OCR robustness,
- bibliography extraction,
- DOI matching,
- runtime efficiency.
Evaluation metrics include:
- RMS-based table similarity,
- precision,
- recall,
- F1-score,
- runtime analysis.
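As an illustration of the precision, recall, and F1 metrics listed above, the following sketch scores a set of predicted DOIs against a ground-truth set. It assumes exact matching on already-normalized DOI strings; the matching criterion actually used in the thesis evaluation is not stated here, and the function name is hypothetical.

```python
from typing import Set, Tuple


def precision_recall_f1(
    predicted: Set[str], expected: Set[str]
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for set-valued predictions.

    Assumption: an item counts as correct only on exact string match.
    """
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Example: 2 of 3 predictions are correct, and 2 of 4 expected DOIs are found.
p, r, f1 = precision_recall_f1(
    {"10.1000/a", "10.1000/b", "10.1000/c"},
    {"10.1000/a", "10.1000/b", "10.1000/d", "10.1000/e"},
)
```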
The repository includes references to the evaluation dataset used during the thesis work.
The dataset contains:
- scientific PDFs,
- ground truth tables,
- OCR outputs,
- bibliography extraction results,
- evaluation metrics,
- benchmark outputs.
The complete dataset exceeds 700 MB and is provided separately.
See dataset/README.md for detailed information.
Generated benchmark plots and visualizations are available in evaluation/plots/.
These plots include:
- OCR model comparisons,
- RMS extraction scores,
- runtime benchmarks,
- bibliography extraction metrics.
See evaluation/README.md for detailed information.
The project contains an interactive web UI for:
- uploading PDFs,
- visualizing cropped tables,
- viewing OCR extraction results,
- inspecting bibliography matches,
- downloading resolved CSV outputs.
This project is intended for:
- scientific document processing,
- OCR benchmarking,
- research on table extraction,
- bibliography-aware NLP pipelines,
- structured scientific knowledge extraction.
This repository contains experimental research software developed for scientific evaluation and benchmarking purposes.
The implementation prioritizes:
- reproducibility,
- modularity,
- transparency of intermediate outputs,
- evaluation support,
- extensibility for future research.