Deep-learning pipeline for identifying sequence features associated with delayed co-transcriptional splicing. Fine-tunes the Borzoi sequence-to-function model (via gReLU) on timecourse RNA-seq data, then computes per-nucleotide attributions over introns, discovers de novo motifs with TF-MoDISco, and measures motif enrichment between regulatory subsets.
This pipeline reproduces the figures in:
Dearborn J.S., Frankiw L., Limoge D.W., Burns C.H., Vlach L., Turpin P., Kirch T., Miller Z.D., Dowell W., Languon S., Garcia‑Flores Y., Cockrell R.C., Baltimore D., Majumdar D. Programmed Delayed Splicing: A Mechanism for Timed Inflammatory Gene Expression. eLife (2026). https://doi.org/10.7554/eLife.109726.1
The pipeline runs in five sequential steps, all driven by a single YAML config file:
00_fine_tune.py → 01_get_attributions.py → 02_run_modisco.py → 03_run_enrichment.py → 04_map_motifs.py
       ↓                     ↓                     ↓                     ↓                    ↓
checkpoint.ckpt   attributions.pkl         modisco.h5 +        sea.tsv +              logos +
                  input_seqs.pkl           forward.meme        fimo results           gene-map PNGs
| Step | Script | What it does |
|---|---|---|
| 0 | 00_fine_tune.py | Fine-tunes Borzoi on timecourse RNA-seq BigWig data |
| 1 | 01_get_attributions.py | Computes input×gradient attributions over each intron |
| 2 | 02_run_modisco.py | Masks attributions to intron ± flank, runs TF-MoDISco, exports motifs to MEME |
| 3 | 03_run_enrichment.py | Compares motif enrichment between subsets using SEA and FIMO |
| 4 | 04_map_motifs.py | Maps MoDISco seqlets back to genomic coordinates, generates logo + gene-map figures |
- MEME Suite ≥ 5.5 (provides the `sea` command used in Step 3)
  Install: https://meme-suite.org/meme/doc/install.html
  Verify: `sea --version`
- conda or mamba (for environment setup)
The pipeline fetches DNA sequences via genomepy, which will automatically download the mm10 mouse genome (~800 MB) on first run:
genomepy install mm10
A GTF annotation file is required for Steps 3–4. The manuscript figures use GENCODE vM23 (GRCm38):
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M23/gencode.vM23.annotation.gtf.gz
gunzip gencode.vM23.annotation.gtf.gz
Then set `gtf_file` in your config to the path of the uncompressed .gtf file.
conda env create -f environment.yml
conda activate elife-delay-splice
Note on gReLU / Borzoi weights: `grelu` is installed from a custom fork. On first run, it will automatically download the pretrained Borzoi mouse weights (~2 GB). An internet connection is required for this step; subsequent runs use the cached weights.
git clone https://github.com/jsdearbo/programmed_delayed_splicing.git
cd programmed_delayed_splicing
conda env create -f environment.yml
conda activate elife-delay-splice
No additional installation steps are required: the utils/ package is imported directly from the repository root by each script.
The BigWig files consumed by Step 0 are derived from raw RNA-seq FASTQs through a two-step process. Scripts for both steps are in the preprocessing/ directory. Raw FASTQs for the manuscript are deposited at GEO (accession to be added upon publication).
bash preprocessing/star_loop_v2.sh
The script prompts interactively for species (mouse/human), the path to your STAR index directory, and the directory containing FASTQ files. It handles both paired-end (R1/R2) and single-end reads, and outputs coordinate-sorted BAMs.
Dependencies: STAR ≥ 2.7
bash preprocessing/bam_to_bigwig.sh /path/to/bam_dir [/path/to/bw_dir]
Or with options:
NORMALIZE=CPM BIN_SIZE=1 THREADS=8 STRANDED=none \
  bash preprocessing/bam_to_bigwig.sh /path/to/bam_dir /path/to/bw_dir
Key options (set as environment variables):
| Variable | Default | Description |
|---|---|---|
| NORMALIZE | None | Normalization method: None, CPM, RPKM, BPM |
| BIN_SIZE | 1 | Coverage bin size in bp |
| THREADS | 16 | Parallel threads |
| SKIP_DUPLICATES | 0 | Set to 1 to skip duplicate reads |
| STRANDED | auto | Strand mode: none, forward, reverse |
| EXTEND | auto | Fragment extension in bp; auto estimates from read length for single-end |
Dependencies: deepTools (bamCoverage), samtools
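For scripted runs, the same options can be passed from Python. The following is a minimal sketch, assuming the script reads the variables in the table above from its environment; `build_bigwig_call` is a hypothetical helper for illustration and is not part of the repository.

```python
import os

# Allowed values are taken from the options table above.
NORMALIZE_CHOICES = {"None", "CPM", "RPKM", "BPM"}


def build_bigwig_call(bam_dir, bw_dir=None, **options):
    """Return (argv, env) for preprocessing/bam_to_bigwig.sh.

    Keyword options (e.g. NORMALIZE="CPM", THREADS=8) are exported as
    environment variables; anything left unset falls back to the script's
    own defaults.
    """
    normalize = str(options.get("NORMALIZE", "None"))
    if normalize not in NORMALIZE_CHOICES:
        raise ValueError(f"NORMALIZE must be one of {sorted(NORMALIZE_CHOICES)}")
    env = dict(os.environ)
    env.update({name: str(value) for name, value in options.items()})
    argv = ["bash", "preprocessing/bam_to_bigwig.sh", str(bam_dir)]
    if bw_dir is not None:
        argv.append(str(bw_dir))
    return argv, env
```

The returned pair can be handed to `subprocess.run(argv, env=env, check=True)` from the repository root.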
A CSV with one row per intron. Required columns:
| Column | Description |
|---|---|
| chrom | Chromosome (e.g. chr1) |
| start | Intron start (0-based) |
| end | Intron end |
| strand | + or - |
| unique_ID | Unique intron identifier |
Optional columns:
| Column | Description |
|---|---|
| expression | Subset label (e.g. high_retention / low_retention), used by Steps 2–3 to split data |
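A quick sanity check of the coordinate file before Step 1 can catch malformed rows early. The sketch below uses only the standard library; `validate_intron_table` is a hypothetical helper, the example rows are made up, and the uniqueness and `start < end` checks are assumptions based on the column descriptions rather than checks the pipeline is known to perform.

```python
import csv
import io

REQUIRED_COLUMNS = ("chrom", "start", "end", "strand", "unique_ID")


def validate_intron_table(handle):
    """Read an intron CSV and return its rows, raising ValueError on problems."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    rows, seen = [], set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        start, end = int(row["start"]), int(row["end"])
        if start < 0 or end <= start:
            raise ValueError(f"line {line_no}: need 0 <= start < end")
        if row["strand"] not in ("+", "-"):
            raise ValueError(f"line {line_no}: strand must be + or -")
        if row["unique_ID"] in seen:
            raise ValueError(f"line {line_no}: duplicate unique_ID")
        seen.add(row["unique_ID"])
        rows.append(row)
    return rows


# Made-up rows, including the optional `expression` column.
example = io.StringIO(
    "chrom,start,end,strand,unique_ID,expression\n"
    "chr1,1000,1500,+,intron_a,high_retention\n"
    "chr2,200,900,-,intron_b,low_retention\n"
)
rows = validate_intron_table(example)
print(len(rows), rows[0]["unique_ID"])
```

For a real file, pass `open("/path/to/introns.csv", newline="")` instead of the in-memory example.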
Unstranded RNA-seq BigWig files, one for each timecourse condition.
Set fine_tuning.bw_dir and fine_tuning.bw_files in your config.
The exact YAML configuration files and processed input data used to generate
the figures are located in the manuscript_runs/ directory.
# Reproduce the motif discovery analysis
python scripts/01_get_attributions.py --config manuscript_runs/fig_X.yaml
python scripts/02_run_modisco.py --config manuscript_runs/fig_X.yaml
python scripts/03_run_enrichment.py --config manuscript_runs/fig_X.yaml
python scripts/04_map_motifs.py --config manuscript_runs/fig_X.yaml
The fine-tuned model checkpoint used in the manuscript is available on Zenodo:
DOI: https://doi.org/10.5281/zenodo.19296926
File: borzoi_mm10_delayed_splicing_finetuned.ckpt
Download it into a weights/ directory before running the pipeline:
mkdir -p weights
wget -O weights/borzoi_mm10_delayed_splicing_finetuned.ckpt \
  "https://zenodo.org/records/19296926/files/borzoi_mm10_delayed_splicing_finetuned.ckpt"
cp config/example_config.yaml config/my_run.yaml
# Edit paths and settings in my_run.yaml
The key fields to set:
experiment_dir: "/path/to/outputs/my_run"
coord_file_path: "/path/to/introns.csv"
gtf_file: "/path/to/gencode.vM23.annotation.gtf"
species: "mouse"
model_selection: "fine_tuned"
checkpoint_path: "weights/borzoi_mm10_delayed_splicing_finetuned.ckpt"
See config/example_config.yaml for all options with inline documentation.
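Before launching a long run, it can help to check that the key fields are filled in. This is a minimal sketch: `check_config` is a hypothetical helper, the set of required keys is inferred from the field list above, and the rule that `fine_tuned` needs a `checkpoint_path` is an assumption rather than documented behaviour.

```python
REQUIRED_KEYS = ("experiment_dir", "coord_file_path", "gtf_file",
                 "species", "model_selection")


def check_config(cfg):
    """Return a list of human-readable problems found in a config mapping."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if not cfg.get(key)]
    if cfg.get("model_selection") == "fine_tuned" and not cfg.get("checkpoint_path"):
        problems.append("model_selection is fine_tuned but checkpoint_path is not set")
    if str(cfg.get("gtf_file", "")).endswith(".gz"):
        problems.append("gtf_file should point to the uncompressed .gtf file")
    return problems


# In practice the mapping would be parsed from config/my_run.yaml with a YAML
# library; a literal dict keeps this sketch free of third-party imports.
cfg = {
    "experiment_dir": "/path/to/outputs/my_run",
    "coord_file_path": "/path/to/introns.csv",
    "gtf_file": "/path/to/gencode.vM23.annotation.gtf",
    "species": "mouse",
    "model_selection": "fine_tuned",
    "checkpoint_path": "weights/borzoi_mm10_delayed_splicing_finetuned.ckpt",
}
print(check_config(cfg))
```

An empty list means none of these checks failed; it does not confirm that the paths exist.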
# Step 0: fine-tune Borzoi (~hours to days depending on accessible compute)
python scripts/00_fine_tune.py --config config/my_run.yaml
# Step 1: compute attributions (~minutes to hours depending on dataset size and GPU)
python scripts/01_get_attributions.py --config config/my_run.yaml
# Step 2: motif discovery
python scripts/02_run_modisco.py --config config/my_run.yaml
# Step 3: enrichment analysis
python scripts/03_run_enrichment.py --config config/my_run.yaml
# Step 4: generate plots
python scripts/04_map_motifs.py --config config/my_run.yaml
Steps must be run in order: each step reads outputs from the previous one.
Step 0 only needs to be run once; the checkpoint can be reused across multiple
attribution runs by setting checkpoint_path in each config.
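The ordering constraint can be enforced with a small driver. The sketch below is illustrative only: `build_commands` and `run_pipeline` are hypothetical names, and the script paths and `--config` flag are taken from the commands above.

```python
import subprocess
import sys

STEP_SCRIPTS = [
    "scripts/00_fine_tune.py",
    "scripts/01_get_attributions.py",
    "scripts/02_run_modisco.py",
    "scripts/03_run_enrichment.py",
    "scripts/04_map_motifs.py",
]


def build_commands(config_path, first_step=0, last_step=4):
    """Return one argv list per selected step, in pipeline order."""
    if not 0 <= first_step <= last_step <= 4:
        raise ValueError("steps must satisfy 0 <= first_step <= last_step <= 4")
    return [[sys.executable, script, "--config", config_path]
            for script in STEP_SCRIPTS[first_step:last_step + 1]]


def run_pipeline(config_path, first_step=0, last_step=4):
    """Run the selected steps in order, stopping at the first failure."""
    for argv in build_commands(config_path, first_step, last_step):
        subprocess.run(argv, check=True)  # raises CalledProcessError on failure
```

Calling `run_pipeline("config/my_run.yaml", first_step=1)` from the repository root skips fine-tuning, which suits runs that reuse an existing checkpoint.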
experiment_dir/
├── attributions.pkl            # (N, 4, L) numpy array of attributions
├── input_seqs.pkl              # list of N input sequences (strings)
├── element_names_list.pkl      # list of N element names
├── attribution_mapping.csv     # coord_index → attribution_index mapping
│
├── all_seqs_modisco/
│   └── masked_50bp_flank/
│       ├── modisco_report.h5   # raw TF-MoDISco output
│       ├── forward.meme        # de novo motifs (forward strand)
│       └── combined.meme       # forward + reverse-complement motifs
│
├── fasta_files/
│   └── all_seqs.fa
│
├── enrichment/
│   └── all_seqs_modisco/
│       └── primary_seqs_vs_tnf_controls_seqs/
│           ├── sea/
│           │   ├── sea.tsv
│           │   ├── sea_enrichment.csv
│           │   └── sea_enrichment_scatter.png
│           └── fimo/
│               ├── fimo_enrichment.csv
│               └── fimo_enrichment_scatter.png
│
└── all_seqs_modisco/
    └── masked_50bp_flank/
        └── plots/
            ├── indexing_df.csv
            ├── elements_df.csv
            └── modiscolite/
                └── pos_pattern_0/
                    ├── pos_pattern_0_hits.csv
                    └── <intron_name>.png
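To inspect the Step 1 outputs outside the pipeline, something like the following may work. It assumes the three `.pkl` files are ordinary Python pickles (suggested by the extension, not confirmed here) and that numpy is installed so the attribution array can be unpickled; `load_step1_outputs` is a hypothetical helper. Only unpickle files you produced or trust.

```python
import pickle
from pathlib import Path


def load_step1_outputs(experiment_dir):
    """Load the three Step 1 pickles and check that their lengths agree."""
    root = Path(experiment_dir)
    loaded = {}
    for name in ("attributions", "input_seqs", "element_names_list"):
        with open(root / f"{name}.pkl", "rb") as handle:
            loaded[name] = pickle.load(handle)
    # len() of an (N, 4, L) array is N, so all three should report the same N.
    n = len(loaded["input_seqs"])
    if len(loaded["attributions"]) != n or len(loaded["element_names_list"]) != n:
        raise ValueError("Step 1 outputs disagree on the number of sequences N")
    return loaded
```

The length check mirrors the layout described above, where each of the N sequences has one attribution matrix and one element name.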
All pipeline options are documented in config/example_config.yaml.
Step 0 key options (under fine_tuning:)
- `bw_files` / `bw_dir`: BigWig RNA-seq inputs
- `trans_func`: label transform (`log1p` recommended)
- `dataset_cache_dir`: where to cache pre-built datasets
- `train_params.loss`: `mse` (used in manuscript)
- `train_params.max_epochs` / `lr`: training schedule
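As a rough illustration of what a log1p label transform does to coverage values, the sketch below applies log(1 + x) element-wise and inverts it. It assumes that `trans_func: log1p` means exactly this element-wise transform; in the pipeline the transform is applied internally during dataset construction, and the helper names here are hypothetical.

```python
import math


def log1p_transform(coverage):
    """Map raw coverage values to log(1 + x), which compresses large values."""
    return [math.log1p(value) for value in coverage]


def inverse_log1p(transformed):
    """Undo log1p_transform so values can be read as coverage again."""
    return [math.expm1(value) for value in transformed]


raw = [0.0, 9.0, 999.0]
print(log1p_transform(raw))
```

Zero coverage stays at zero, while a 100-fold difference in raw coverage shrinks to a difference of about 4.6 on the transformed scale.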
Step 1 key options
- `centering_mode`: `intron_only` (centre input on intron)
- `attr_respect_to`: `intron_only` (aggregate predictions over intron)
- `attribution_method`: `inputxgradient` (default)
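Input×gradient multiplies the one-hot input by the gradient of the model output with respect to that input, giving a 4 × L attribution matrix per sequence. The toy sketch below substitutes a made-up linear model for Borzoi so the gradient is simply the weight matrix; in the pipeline the gradient comes from backpropagation through the fine-tuned model via gReLU. All names and numbers are illustrative.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a 4 x L matrix (rows A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for base in seq] for b in BASES]


def input_x_gradient(x, grad):
    """Element-wise product of the input and the gradient of the output."""
    return [[xi * gi for xi, gi in zip(x_row, g_row)]
            for x_row, g_row in zip(x, grad)]


# Toy linear model: score = sum(weights * x), so d(score)/dx = weights.
weights = [
    [0.5, -1.0, 0.0],   # A
    [0.1, 2.0, 0.3],    # C
    [-0.2, 0.0, 1.5],   # G
    [0.0, 0.4, -0.7],   # T
]
x = one_hot("ACG")
attr = input_x_gradient(x, weights)

# Only the observed base at each position keeps a non-zero attribution.
per_position = [sum(column) for column in zip(*attr)]
print(per_position)
```

Because the input is one-hot, each position contributes the gradient of the base that is actually present, which is why the per-position sums pick out one weight per column.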
Step 2 key options
- `mask_mode`: `intron_only` (mask signal outside intron ± flank)
- `flank`: list of flank sizes in bp
- `modisco_len`: MoDISco seqlet search window (bp)
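Masking to intron ± flank can be pictured as zeroing every attribution column outside the kept span. The sketch below assumes window-relative, half-open coordinates and clips the span to the window; these conventions and both helper names are assumptions for illustration, not the pipeline's actual implementation.

```python
def intron_mask(seq_len, intron_start, intron_end, flank):
    """Return a 0/1 list that keeps [intron_start - flank, intron_end + flank)."""
    keep_from = max(0, intron_start - flank)
    keep_to = min(seq_len, intron_end + flank)
    return [1.0 if keep_from <= i < keep_to else 0.0 for i in range(seq_len)]


def apply_mask(attr, mask):
    """Zero every column of a 4 x L attribution matrix where mask is 0."""
    return [[value * m for value, m in zip(row, mask)] for row in attr]


mask = intron_mask(seq_len=10, intron_start=4, intron_end=6, flank=1)
attr = [[float(i + 1) for i in range(10)] for _ in range(4)]
masked = apply_mask(attr, mask)
print(mask)
```

With a 50 bp flank the same idea corresponds to the `masked_50bp_flank/` output directory: only the intron and 50 bp on either side contribute seqlets.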
Step 3 key options
- `enrichment.comparisons`: list of `{primary, control}` pairs
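Each comparison asks whether a motif occurs in more primary sequences than control sequences. As an illustration of that kind of test, the sketch below computes a one-sided Fisher exact p-value from hit counts using only the standard library. It is not a reimplementation of SEA or of the pipeline's FIMO-based comparison, and `fisher_enrichment_p` is a hypothetical name.

```python
from math import comb


def fisher_enrichment_p(primary_hits, primary_total, control_hits, control_total):
    """One-sided Fisher exact p-value that hits are enriched in the primary set.

    Sums the hypergeometric probability of seeing at least `primary_hits`
    motif-containing sequences in the primary set, given the table margins.
    """
    total = primary_total + control_total
    hits = primary_hits + control_hits
    denom = comb(total, primary_total)
    upper = min(hits, primary_total)
    return sum(
        comb(hits, k) * comb(total - hits, primary_total - k)
        for k in range(primary_hits, upper + 1)
    ) / denom


# Motif found in 3 of 3 primary sequences and 0 of 3 controls.
print(fisher_enrichment_p(3, 3, 0, 3))
```

Small sets give coarse p-values, so subsets with only a handful of introns cannot reach strong significance regardless of how clean the split is.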
Step 4 key options
- `motif_mapping.modisco_window`: must match `modisco_len` from Step 2
- `motif_mapping.bigwig_dir`: optional read-density tracks overlay
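Mapping a seqlet back to the genome amounts to adding its window-relative offsets to the window's genomic start, with the direction flipped on the minus strand. The sketch below assumes 0-based half-open intervals and that minus-strand windows were reverse-complemented before attribution; both are assumptions, and `seqlet_to_genomic` is a hypothetical helper rather than the function used by `04_map_motifs.py`.

```python
def seqlet_to_genomic(window_start, window_len, seqlet_start, seqlet_end, strand):
    """Convert a seqlet's window-relative span to 0-based genomic coordinates.

    Assumes half-open intervals and that minus-strand windows were
    reverse-complemented, so offsets count from the window's genomic end.
    """
    if not 0 <= seqlet_start < seqlet_end <= window_len:
        raise ValueError("seqlet must lie inside the window")
    if strand == "+":
        return window_start + seqlet_start, window_start + seqlet_end
    if strand == "-":
        window_end = window_start + window_len
        return window_end - seqlet_end, window_end - seqlet_start
    raise ValueError("strand must be + or -")


print(seqlet_to_genomic(1000, 100, 10, 20, "+"))
print(seqlet_to_genomic(1000, 100, 10, 20, "-"))
```

Under these assumptions, a window length that differs from the one MoDISco searched would shift every mapped position, which is one plausible reason the two settings need to agree.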
| Package | Role |
|---|---|
| gReLU | Borzoi model loading, fine-tuning, attribution calculation |
| tfmodisco-lite | De novo motif discovery |
| tangermeme | FIMO motif scanning |
| MEME Suite | SEA enrichment analysis |
| logomaker | Sequence logo plots |
| pyBigWig | Read-density track overlays |
| genomepy | Reference genome download and sequence fetching |
| wandb | Fine-tuning experiment tracking |
If you use this pipeline, please cite:
- Borzoi: Linder J, et al. Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. Nature Genetics (2025). https://doi.org/10.1038/s41588-024-02053-6
- gReLU: Nair S, et al. gReLU: a Python library to train, interpret, and apply deep learning models to genomics. (2024).
- TF-MoDISco / modisco-lite: Shrikumar A, et al. (2020); Trofimova D & Shrikumar A (2023).
- MEME Suite: Bailey TL, et al. The MEME Suite. Nucleic Acids Research (2015).
- This analysis: Dearborn J.S., Frankiw L., Limoge D.W., Burns C.H., Vlach L., Turpin P., Kirch T., Miller Z.D., Dowell W., Languon S., Garcia‑Flores Y., Cockrell R.C., Baltimore D., Majumdar D. Programmed Delayed Splicing: A Mechanism for Timed Inflammatory Gene Expression. eLife (manuscript under review, 2026).