
Commit 8c76927

Add base readme and simulate script
1 parent abe3b94 commit 8c76927

2 files changed

Lines changed: 125 additions & 0 deletions

README.md

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
# OpenAlphaDiffract

OpenAlphaDiffract is an open-source implementation of the AlphaDiffract research project. It provides a reproducible pipeline to:
- Create a diffraction dataset from the Materials Project (CIF inputs)
- Simulate powder diffraction patterns from those structures
- Train and evaluate models on the generated dataset

For ease of use, a Hugging Face (HF) endpoint is planned [TODO].

## Inference Quickstart
[TODO]: (probably hosting on HF)
[TODO]: Minimal local install with the trainer container

## Dataset Pipeline Overview

1. Acquire CIFs (Downloader Container)
   - Uses the Materials Project API to fetch crystal structures as CIF files
   - Configurable via `configs/download.yaml`
   - Filters structures by checking conventional cell consistency across multiple angle tolerances. This filters out ~4.4% of Materials Project (MP) structures as of 10/22/2025.

2. GSAS-II XRD Simulation (Simulator Container)
   - Generates synthetic powder diffraction patterns from CIFs
   - Configurable via `configs/simulator.yaml` (e.g., instrument file, noise ranges, job parallelism)
   - Creates `.npy` files containing the simulated pattern and metadata, ready to be consumed by the training system

3. TODO: Training

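One way to check how far each stage has progressed is to count the files it has produced. The sketch below is a minimal example rather than part of the repository: `count_files` is a hypothetical helper, the `.cif` extension is an assumption, and the directories are the default output paths named in the setup steps of this README.

```shell
# Hypothetical helper: count regular files under directory $1 whose names
# match glob $2. Prints 0 if the directory does not exist yet.
count_files() {
  if [ -d "$1" ]; then
    # tr strips the padding some wc implementations add around the count
    find "$1" -type f -name "$2" | wc -l | tr -d ' '
  else
    echo 0
  fi
}

# Assumed layout: CIFs in ./data/raw_cif, simulated patterns in ./data/dataset
echo "CIF inputs:         $(count_files ./data/raw_cif '*.cif')"
echo "Simulated patterns: $(count_files ./data/dataset '*.npy')"
```

Run it from the repository root. Before any stage has run, both counts should be 0.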
## Training from Scratch Quickstart

Prerequisites:
- Docker and Docker Compose
- A Materials Project API key

> [!WARNING]
> Building the dataset and training require significant disk space and compute:
> - Expect to use 1 TB or more of disk space in total to replicate the paper's 100-variation dataset
> - We recommend running the simulation with ~100 processes in parallel. For reference [XYZ] this should take [XYZ hours].
> - Training took [XYZ hours] on [XYZ hardware]


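Since the full dataset needs on the order of 1 TB, it can help to check free space before starting. The following is a rough sketch, not part of the repository: `has_free_space` is a hypothetical helper, and the threshold approximates 1 TB as 1,000,000,000 blocks of 1024 bytes.

```shell
# Hypothetical helper: succeed if the filesystem holding path $1 has at
# least $2 KiB available. "df -P" prints one line per filesystem in a
# stable layout, and "-k" reports sizes in 1024-byte blocks; column 4 is
# the available space.
has_free_space() {
  avail_kb="$(df -Pk "$1" | awk 'NR == 2 { print $4 }')"
  [ "$avail_kb" -ge "$2" ]
}

# Roughly 1 TB expressed in 1024-byte blocks (an approximation).
required_kb=1000000000

if has_free_space . "$required_kb"; then
  echo "Enough free space for the full dataset."
else
  echo "Warning: less than ~1 TB available here." >&2
fi
```

Run it from the directory where `./data` will live, since that filesystem is the one that fills up.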
Setup:
1. Copy the environment file and set your API key:
   - `cp .env.example .env`
   - Edit `.env` and set `MP_API_KEY`
   - Optionally set `UID` and `GID` so the containers write files as your user.

2. Download CIFs:
   - `scripts/download.sh`
   - CIFs will be written to `./data/raw_cif`

3. Simulate diffraction patterns:
   - `scripts/simulate.sh`
   - Patterns will be written to `./data/dataset`
   - Errors (if any) go to `./data/error_logs`

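For the optional `UID` and `GID` entries in step 1, the values can be taken from `id` instead of typed by hand. This is a minimal sketch: `append_ids` is a hypothetical helper, and the demo writes to a scratch file with a placeholder key rather than touching a real `.env`.

```shell
# Hypothetical helper: append the current user's numeric user and group IDs
# to the env file given as $1. The values come from id(1).
append_ids() {
  printf 'UID=%s\nGID=%s\n' "$(id -u)" "$(id -g)" >> "$1"
}

# Demo against a scratch file; inside the repository you would pass ".env".
demo_env="$(mktemp)"
printf 'MP_API_KEY=your-key-here\n' > "$demo_env"   # placeholder, not a real key
append_ids "$demo_env"
cat "$demo_env"
```

The helper only appends, so running it twice on the same file leaves duplicate entries; check the file before reusing it.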
Notes:
- You can pass extra CLI args to the simulator via `scripts/simulate.sh`, e.g. `scripts/simulate.sh --sims_per_file 1 --parallel_jobs 4`
- The default container commands and mounts are defined in `compose.yaml`

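To see how extra arguments change the container command, the dry-run sketch below prints the command instead of running it. It mirrors the branching in `scripts/simulate.sh` but is not that script: `print_simulator_cmd` is a hypothetical name, and the `-f "$COMPOSE_FILE"` option is left out for brevity.

```shell
# Dry-run sketch: print the simulator command instead of executing it.
print_simulator_cmd() {
  if [ "$#" -gt 0 ]; then
    # Extra flags given: call the Python module explicitly and append them.
    set -- python -m simulator.diffraction_generator \
      --config /app/configs/simulator.yaml "$@"
  fi
  # With no extra flags, "$@" is empty and compose uses its default command.
  echo docker compose run --rm simulator "$@"
}

print_simulator_cmd
print_simulator_cmd --sims_per_file 1 --parallel_jobs 4
```

The first call prints the bare `run --rm simulator` form. The second prints the explicit module invocation with both flags appended after `--config`.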
## Project Structure

```
OpenAlphaDiffract/
├── configs/     - Pipeline configuration files
├── docker/      - Container definitions
├── scripts/     - User-facing scripts
├── src/         - Source code for pipeline components
│   ├── downloader/
│   └── simulator/
```

## Citation

We hope this code is helpful to your work! If you use our code or extend our work, please consider citing our paper:

scripts/simulate.sh

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
#!/usr/bin/env bash
set -euo pipefail

# Helper to autobuild and run the simulator container, similar to download.sh

# Resolve repository root from this script's location (scripts)
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"

COMPOSE_FILE="$ROOT_DIR/compose.yaml"
ENV_FILE="$ROOT_DIR/.env"
CFG_FILE="$ROOT_DIR/configs/simulator.yaml"
DATA_DIR="$ROOT_DIR/data"

# Environment handling:
# Do NOT source .env here because it may contain UID/GID assignments that conflict with Bash's readonly UID variable.
# Docker Compose will read .env automatically. This script relies on compose to handle UID/GID and other vars.

# Sanity checks
if [[ ! -f "$CFG_FILE" ]]; then
  echo "Error: $CFG_FILE not found. Create and configure your simulator YAML." >&2
  exit 1
fi

# Ensure host-side data directories exist (bind mounts)
mkdir -p "$DATA_DIR/raw_cif" \
  "$DATA_DIR/dataset" \
  "$DATA_DIR/error_logs" \
  "$DATA_DIR/workers"

# Build simulator image
echo "Building simulator image via docker compose..."
docker compose -f "$COMPOSE_FILE" build simulator

# Run simulator
# If extra args are provided, invoke the python module explicitly so args are passed through.
# Otherwise, use the default command from compose.yaml.
echo "Running simulator container..."
if [[ $# -gt 0 ]]; then
  docker compose -f "$COMPOSE_FILE" run --rm simulator python -m simulator.diffraction_generator --config /app/configs/simulator.yaml "$@"
else
  docker compose -f "$COMPOSE_FILE" run --rm simulator
fi

# Notes:
# - The simulator reads CIFs from ./data/raw_cif and writes outputs under ./data (see configs/simulator.yaml).
# - You can override parameters by passing extra CLI flags, for example:
#     scripts/simulate.sh --sims_per_file 1 --parallel_jobs 4
#   These are forwarded to the underlying Python module.
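
The script's warning about sourcing `.env` rests on `UID` being a readonly variable in Bash. The short check below is illustrative only and assumes `bash` is on the `PATH`. It asks Bash to print the declaration of `UID`, whose attribute flags include `r` for readonly.

```shell
# Ask Bash for the declaration of its built-in UID variable. The attribute
# flags after "declare -" include "r", which marks the variable readonly.
uid_decl="$(bash -c 'declare -p UID')"
echo "$uid_decl"

case "$uid_decl" in
  "declare -"*r*" UID="*) echo "UID is readonly in Bash" ;;
  *)                      echo "UID is not marked readonly here" ;;
esac
```

A line such as `UID=1000` in a sourced `.env` would try to assign to this variable, which is why the script leaves `.env` handling to Docker Compose.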
