Add base readme and simulate script

Linked-Liszt · Linked-Liszt · commit 8c76927e8d4c · 2025-10-22T18:46:15.000-05:00
diff --git a/README.md b/README.md
@@ -0,0 +1,77 @@
+# OpenAlphaDiffract
+
+OpenAlphaDiffract is an open-source implementation of the AlphaDiffract research project. It provides a reproducible pipeline to:
+- Create a diffraction dataset from the Materials Project (CIF inputs)
+- Simulate powder diffraction patterns from those structures
+- Train and evaluate models on the generated dataset
+
+For ease of use, a Hugging Face (HF) endpoint exists [TODO].
+
+## Inference Quickstart
+[TODO]: (probably hosting on HF)
+[TODO]: Minimal local install with the trainer container
+
+## Dataset Pipeline Overview
+
+1. Acquire CIFs (Downloader Container)
+    - Uses the Materials Project API to fetch crystal structures as CIF files
+    - Configurable via `configs/download.yaml`
+    - Filters structures by checking that the conventional cell is consistent across multiple angle tolerances. This filters out ~4.4% of Materials Project (MP) structures as of 2025-10-22.
+
+2. GSAS-II XRD Simulation (Simulator Container)
+    - Generates synthetic powder diffraction patterns from CIFs
+    - Configurable via `configs/simulator.yaml` (e.g., instrument file, noise ranges, job parallelism)
+    - Writes `.npy` files containing the simulated patterns and metadata, ready to be consumed by the training system
+
+3. TODO: Training
+
+## Training from Scratch Quickstart
+
+Prerequisites:
+- Docker and Docker Compose
+- A Materials Project API key
+
+> [!WARNING]
+> Building the dataset and training require significant disk space and compute:
+> - Expect to use 1 TB or more of disk space in total to replicate the paper's 100-variation dataset
+> - We recommend running the simulation with ~100 parallel processes. For reference [XYZ] this should take [XYZ hours].
+> - Training took [XYZ hours] on [XYZ hardware]
+
+
+Setup:
+1. Copy the environment file and set your API key:
+   - `cp .env.example .env`
+   - Edit `.env` and set `MP_API_KEY`
+   - Optionally set `UID` and `GID` so the containers write files as your user. 
+
+2. Download CIFs:
+   - `scripts/download.sh`
+   - CIFs will be written to `./data/raw_cif`
+
+3. Simulate diffraction patterns:
+   - `scripts/simulate.sh`
+   - Patterns will be written to `./data/dataset`
+   - Errors (if any) go to `./data/error_logs`
+
+Notes:
+- You can pass extra CLI args through to the simulator, e.g. `scripts/simulate.sh --sims_per_file 1 --parallel_jobs 4`
+- The default container commands and mounts are defined in `compose.yaml`
+
+## Project Structure
+
+```
+OpenAlphaDiffract/
+├── configs/ - Pipeline configuration files
+├── docker/ - Container definitions
+├── scripts/ - User-facing scripts
+└── src/ - Source code for pipeline components
+    ├── downloader/
+    └── simulator/
+
+```
+
+
+
+## Citation
+
+We hope this code is helpful to your work! If you use our code or extend our work, please consider citing our paper:
diff --git a/scripts/simulate.sh b/scripts/simulate.sh
@@ -0,0 +1,48 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# Helper to autobuild and run the simulator container, similar to download.sh
+
+# Resolve repository root from this script's location (scripts)
+ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+
+COMPOSE_FILE="$ROOT_DIR/compose.yaml"
+ENV_FILE="$ROOT_DIR/.env"
+CFG_FILE="$ROOT_DIR/configs/simulator.yaml"
+DATA_DIR="$ROOT_DIR/data"
+
+# Environment handling:
+# Do NOT source .env here because it may contain UID/GID assignments that conflict with Bash's readonly UID variable.
+# Docker Compose reads .env from the project directory (next to compose.yaml) automatically, so this script leaves UID/GID and other vars to compose.
+
+# Sanity checks
+if [[ ! -f "$CFG_FILE" ]]; then
+  echo "Error: $CFG_FILE not found. Create and configure your simulator YAML." >&2
+  exit 1
+fi
+
+# Ensure host-side data directories exist (bind mounts)
+mkdir -p "$DATA_DIR/raw_cif" \
+         "$DATA_DIR/dataset" \
+         "$DATA_DIR/error_logs" \
+         "$DATA_DIR/workers"
+
+# Build simulator image
+echo "Building simulator image via docker compose..."
+docker compose -f "$COMPOSE_FILE" build simulator
+
+# Run simulator
+# If extra args are provided, invoke the python module explicitly so args are passed through.
+# Otherwise, use the default command from compose.yaml.
+echo "Running simulator container..."
+if [[ $# -gt 0 ]]; then
+  docker compose -f "$COMPOSE_FILE" run --rm simulator python -m simulator.diffraction_generator --config /app/configs/simulator.yaml "$@"
+else
+  docker compose -f "$COMPOSE_FILE" run --rm simulator
+fi
+
+# Notes:
+# - The simulator reads CIFs from ./data/raw_cif and writes outputs under ./data (see configs/simulator.yaml).
+# - You can override parameters by passing extra CLI flags, for example:
+#     scripts/simulate.sh --sims_per_file 1 --parallel_jobs 4
+#   These are forwarded to the underlying Python module.
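
A note on the environment handling in `scripts/simulate.sh`: the script deliberately avoids sourcing `.env` because Bash treats `UID` as readonly. If a host-side script ever needs a single value from that file (for example, to confirm `MP_API_KEY` is set before launching a container), one option is to parse that key out rather than source the whole file. The sketch below is a hypothetical helper and not part of this commit; the name `read_env_var` and the plain `KEY=value` line format (no quoting, no `export` prefix) are assumptions.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper (not in the repository): print the value of a key from
# an env file without sourcing it, so readonly Bash variables such as UID are
# never assigned. Assumes plain KEY=value lines; if the key appears more than
# once, the last occurrence wins. Prints an empty line when the key is absent.
read_env_var() {
  local key file line value
  key="$1"
  file="$2"
  value=""
  # The "|| [ -n ... ]" clause keeps a final line that lacks a trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      # Only lines that begin with "KEY=" match; comments and other keys are skipped.
      "$key="*) value="${line#"$key="}" ;;
    esac
  done < "$file"
  printf '%s\n' "$value"
}
```

With this in place, something like `read_env_var MP_API_KEY "$ROOT_DIR/.env"` would return the key's value while leaving the shell's own `UID` untouched. Values that rely on quoting or variable interpolation would still need Docker Compose's own `.env` parsing.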