Reliable Follow-Up Action and Date Extraction from Clinical Notes: A Hybrid Neural-Symbolic Approach
Michal Laufera, Yehudit Apersteinb,*, Alexander Apartsinc
aBar-Ilan University, Ramat Gan, Israel
bAfeka College of Engineering, Tel Aviv, Israel
cHolon Institute of Technology, Holon, Israel
*Corresponding author: apersteiny@afeka.ac.il
Submitted to Journal of Biomedical Informatics (Elsevier).
| Read | Download |
|---|---|
| Manuscript (HTML) | Manuscript (.docx) |
| JBI submission notes | Supplementary materials (.zip) |
| Cover letter | Audit report |
A hybrid neural-symbolic system that extracts structured (action, date) pairs from outpatient clinical notes. A shared BioBERT encoder feeds a BIO action/time tagging head and a biaffine action-time linker; a deterministic dateparser-based normalizer converts the linked time phrase into an absolute ISO date anchored on the visit date. By design, the neural model is never asked to perform calendar arithmetic.
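The linking step (Head B) scores every candidate (action, time) pair and includes a learned NONE option so an action can remain undated. A minimal pure-Python sketch of the biaffine scoring rule; the matrix `U`, the `NONE` bias, and the toy embeddings below are illustrative stand-ins, not the trained parameters:

```python
# Sketch of biaffine action-time linking with a NONE option.
# score(action a, time t) = a^T U t; a learned NONE bias competes with
# every real time span, so an action can link to no date at all.

def biaffine_scores(action_vec, time_vecs, U, none_bias):
    """Return one score per candidate time span plus a trailing NONE score."""
    scores = []
    for t in time_vecs:
        # a^T U t, computed with plain loops for clarity
        s = sum(action_vec[i] * sum(U[i][j] * t[j] for j in range(len(t)))
                for i in range(len(action_vec)))
        scores.append(s)
    scores.append(none_bias)          # the "link to nothing" option
    return scores

def link(action_vec, time_vecs, U, none_bias):
    """Index of the best-scoring time span, or None if NONE wins."""
    scores = biaffine_scores(action_vec, time_vecs, U, none_bias)
    best = max(range(len(scores)), key=scores.__getitem__)
    return None if best == len(time_vecs) else best

# Toy 2-d example: identity U reduces the score to a plain dot product,
# so the action embedding links to the nearest time-span embedding.
U = [[1.0, 0.0], [0.0, 1.0]]
action = [0.9, 0.1]
times = [[1.0, 0.0], [0.0, 1.0]]
print(link(action, times, U, none_bias=0.5))   # -> 0
```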
```
Input note + visit_date
          |
          v
BioBERT encoder (sliding windows, 512/128)
          |
          v
Head A: BIO action/time spans --> Head B: biaffine action-time linker (with NONE option)
          |
          v
Deterministic date normalizer
          |
          v
{action, period_text, period_date}
```
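The sliding-window encoding can be sketched as below. This assumes 512-token windows with a 128-token overlap between consecutive windows; the exact stride convention in the released notebook may differ.

```python
def sliding_windows(token_ids, window=512, overlap=128):
    """Split a long token sequence into overlapping fixed-size windows
    so every token is encoded with surrounding context."""
    if len(token_ids) <= window:
        return [token_ids]
    step = window - overlap
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break                      # last window already covers the tail
    return windows

chunks = sliding_windows(list(range(1000)), window=512, overlap=128)
print([len(c) for c in chunks])        # -> [512, 512, 232]
```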
On a 2,000-note synthetic outpatient corpus (198-note held-out split, 196 gold actions):
| Model | Action F1 [95% CI] | Time F1 [95% CI] | Action-Date F1 [95% CI] | Date MAE (days) |
|---|---|---|---|---|
| BioBERT hybrid (proposed) | 0.995 [0.987, 1.000] | 0.997 [0.992, 1.000] | 0.980 [0.964, 0.992] | 0.53 |
| ChatGPT zero-shot (gpt-4o-mini) | 0.980 [0.964, 0.992] | 0.831 [0.790, 0.868] | 0.827 [0.783, 0.864] | 5.07 |
| LLaMA-3 8B fine-tuned (LoRA) | 1.000 [1.000, 1.000] | 0.816 [0.772, 0.854] | 0.806 [0.762, 0.847] | 10.88 |
The hybrid pipeline's confidence intervals for time F1, action-date F1, and date MAE do not overlap with those of either generative baseline (significant at p < 0.05). The largest gap is in calendar arithmetic (0.53 vs. 5-11 days MAE), supporting the design hypothesis that semantic extraction and date arithmetic should be separated. See Section 4 of the manuscript for the full results table and Section 5 for discussion.
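The deterministic normalizer's role can be illustrated with a stdlib-only sketch. The released pipeline uses `dateparser` anchored on the visit date; the regex patterns and the 30-day month below are simplifications for illustration only:

```python
import re
from datetime import date, timedelta

# Illustrative deterministic normalizer: resolve a relative time phrase
# against the visit date, so the neural model never does calendar math.
UNIT_DAYS = {"day": 1, "week": 7, "month": 30}   # month approximated here

def normalize(period_text, visit_date):
    """Map a phrase like 'in 2 weeks' to an absolute date, or None."""
    m = re.search(r"(\d+)\s*(day|week|month)s?", period_text.lower())
    if not m:
        return None                    # unparsed phrases stay unresolved
    count, unit = int(m.group(1)), m.group(2)
    return visit_date + timedelta(days=count * UNIT_DAYS[unit])

visit = date(2026, 1, 5)
print(normalize("follow up in 2 weeks", visit))   # -> 2026-01-19
```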
```
MedFollow/
|-- Paper/
|   |-- index.html                          rendered manuscript (KaTeX math, GitHub Pages source)
|   |-- MedFollow_JBI_submission.docx       camera-ready Word version (JBI single-column)
|   |-- MedFollow_supplementary.zip         supplementary materials bundle
|   |-- cover_letter.md                     cover letter to the JBI Editor
|   |-- jbi_submission_notes.md             JBI Guide-for-Authors compliance notes
|   |-- anticipated_reviewer_concerns.md    internal prep doc (10 likely concerns + responses)
|   |-- audit_report.md                     automated DOCX audit (24 PASS / 0 ISSUES)
|   |-- references.bib                      BibTeX references
|   |-- figures/                            6 manuscript figures (1 SVG, 5 PNG)
|   |-- scripts/                            every script that produces a figure or metric
|   `-- templates/cnf-word-template.docx    Elsevier generic single-column Word template
|-- Code/
|   |-- llm_project_..._submit.ipynb        training, baseline inference, evaluation notebook
|   `-- requirements.txt
|-- Data/
|   |-- synthetic_clinical_notes_2000.csv   the released synthetic corpus
|   `-- external/mtsamples/                 MTSamples (Apache-2.0) for realism-check work
|-- Results/
|   |-- biobert_metrics.json                per-system aggregated metrics
|   |-- chatgpt_metrics.json
|   |-- llama_metrics.json
|   `-- results_with_ci.json                consolidated point estimates + 95% CIs
|-- Visuals/                                earlier figures (kept for provenance)
|-- models/
|   `-- MODELS.md                           external-checkpoint manifest
|-- Slides/                                 course presentations (first / interim / final)
|-- index.html                              redirect to Paper/
`-- README.md                               this file
```
Without the trained model checkpoints (which live on Google Drive, see models/MODELS.md), you can already:
- Reproduce every confidence interval in the paper from the released metric files:

  ```
  python Paper/scripts/compute_confidence_intervals.py
  ```

  Wilson score intervals on proportion metrics; 10,000-replicate instance-level bootstrap on F1 metrics with seed 42.

- Regenerate every figure from the released CSV and metric files:

  ```
  python Paper/scripts/make_round2_figures.py       # Figure 2 (composition)
  python Paper/scripts/make_vocab_distributions.py  # Figure 3 (vocab)
  python Paper/scripts/make_round4_figures.py       # Figure 4 (stress factors)
  ```

  Figures 5 and 6 are produced by the same `compute_confidence_intervals.py` script.

- Audit the manuscript DOCX against JBI requirements:

  ```
  python Paper/scripts/audit_docx.py
  ```
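The interval machinery named above (Wilson score on proportions, seeded instance-level bootstrap on F1) can be sketched generically with the stdlib; this is an illustration with made-up inputs, not the released `compute_confidence_intervals.py`:

```python
import math
import random

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

def bootstrap_f1_ci(instances, reps=10_000, seed=42):
    """Instance-level bootstrap: resample (tp, fp, fn) triples with
    replacement, recompute F1 per replicate, take 2.5/97.5 percentiles."""
    rng = random.Random(seed)
    def f1(rows):
        tp = sum(r[0] for r in rows)
        fp = sum(r[1] for r in rows)
        fn = sum(r[2] for r in rows)
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    stats = sorted(f1([rng.choice(instances) for _ in instances])
                   for _ in range(reps))
    return stats[int(0.025 * reps)], stats[int(0.975 * reps)]

lo, hi = wilson_interval(190, 200)     # toy counts, not the paper's
print(round(lo, 3), round(hi, 3))
```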
The code is currently a single notebook (`Code/llm_project_..._submit.ipynb`) covering data generation, BioBERT training, baseline inference, and evaluation. Restoring it to a runnable state requires:

- the fine-tuned BioBERT checkpoint (`biobert_finetuned_2k.pth`, ~440 MB) and its tokenizer directory,
- the LLaMA-3 LoRA adapter (`Llama3_Clinical_Action_Extraction_LoRA/`, ~150 MB),
- an OpenAI API key for the ChatGPT baseline (`gpt-4o-mini`).

See `models/MODELS.md` for download instructions.
Refactoring the notebook into discrete entry-point scripts (`generate_data.py`, `train_biobert.py`, `run_baselines.py`, `evaluate.py`, `make_figures.py`) and persisting raw per-note predictions are the highest-priority reproducibility improvements; both are in progress.
For the planned external-validation appendix, we have downloaded the MTSamples corpus (4,999 transcribed clinical notes, Apache-2.0) and ranked it for follow-up-instruction richness. The top-100 candidate notes for manual annotation are at `Data/external/mtsamples/mtsamples_top100_followup.csv`; a first-pass annotation of the top 40 (mapping verbatim follow-up text to the paper's 28-action closed set) is at `Data/external/mtsamples/mtsamples_top40_gold.json`, with a coverage analysis at `Data/external/mtsamples/mtsamples_top40_coverage.md`. Initial finding: closed-set coverage on the top 40 real notes is 31% (28/91 follow-up items map to the closed set), confirming the limitation declared in Section 5.1 of the manuscript: the synthetic ontology underrepresents medication changes (25%), generic follow-up appointments (21%), recurring schedules (e.g., PT regimens), and several specialist referrals.
A larger-scale Tier-B evaluation on MIMIC-IV-Note discharge summaries is planned subject to PhysioNet credentialing; see Paper/real_ehr_sources.md for the three-tier roadmap.
The released corpus is fully synthetic and contains no protected health information; no IRB approval is required for the present experiments. MTSamples is a publicly redistributed (Apache-2.0) collection of transcribed clinician sample notes; it is not real EHR documentation. Any extension to identifiable clinical data will require institutional governance, a data-use agreement, and IRB approval.
- Synthetic dataset (`Data/synthetic_clinical_notes_2000.csv`): CC BY 4.0.
- Code: MIT (see `LICENSE`).
- Manuscript text and figures: CC BY 4.0.
```bibtex
@article{medfollow_2026,
  author  = {Michal Laufer and Yehudit Aperstein and Alexander Apartsin},
  title   = {Reliable Follow-Up Action and Date Extraction from Clinical Notes: A Hybrid Neural-Symbolic Approach},
  journal = {Journal of Biomedical Informatics},
  year    = {2026},
  note    = {Submitted}
}
```