MolJSON

This repo contains the MolJSON schema, a JSON schema for structured output, along with related scripts. MolJSON was designed to help large language models (LLMs) interpret and emit molecular structures more accurately.

Installation

This repo is pip-installable via pyproject.toml.

git clone https://github.com/oxpig/MolJSON.git
# or: git clone git@github.com:oxpig/MolJSON.git
cd MolJSON
pip install -e .

Quick Start

from rdkit import Chem
from moljson import GetSchema, MolToJSON, MolFromJSON, CheckRoundTrip

# 1) Get MolJSON schema
schema = GetSchema()

# 2) RDKit -> MolJSON
mol = Chem.MolFromSmiles("c1c[nH]cc1")
moljson = MolToJSON(mol)  # default atom IDs: C1, C2, N1, ...

# 3) MolJSON -> RDKit
mol2 = MolFromJSON(moljson)

# 4) Round trip check
ok, in_smiles, out_smiles, rt_json = CheckRoundTrip(mol)
print(ok, in_smiles, out_smiles)
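
The comment in step 2 mentions default atom IDs of the form C1, C2, N1. As a rough illustration of that naming pattern, the sketch below labels a list of element symbols with a running count per element. This is an assumption inferred from the comment, not the package's actual implementation, and label_atoms is a hypothetical helper that needs neither RDKit nor moljson.

```python
from collections import Counter


def label_atoms(symbols):
    """Assign IDs like C1, C2, N1 using a running count per element.

    Hypothetical helper: it mirrors the ID pattern shown in the Quick Start
    comment, not necessarily how moljson assigns IDs internally.
    """
    counts = Counter()
    labels = []
    for symbol in symbols:
        counts[symbol] += 1
        labels.append(f"{symbol}{counts[symbol]}")
    return labels


# Heavy atoms of pyrrole in the order they appear in "c1c[nH]cc1"
pyrrole_ids = label_atoms(["C", "C", "N", "C", "C"])
print(pyrrole_ids)
```

To see the IDs the package really produces, inspect the output of MolToJSON(mol) directly.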

Example Notebooks

examples/walkthrough.ipynb gives a simple walkthrough of the MolJSON functions. It covers:

Loading and printing the schema
RDKit -> MolJSON conversion
MolJSON -> RDKit conversion
Round-trip checks

For a minimal OpenAI API example, see examples/openai_moljson_example.ipynb.

For a minimal Anthropic API example, see examples/anthropic_moljson_example.ipynb.

OpenAI API Example

import json
from openai import OpenAI
from moljson import GetSchema

client = OpenAI()  # uses OPENAI_API_KEY
schema = GetSchema()

response = client.responses.create(
    model="gpt-5-nano",
    reasoning={"effort": "low"},
    input="Convert the molecule from SMILES to MolJSON: CCO",
    text={
        "verbosity": "low",
        "format": {
            "type": "json_schema",
            "name": "MolJSON",
            "strict": True,
            "schema": schema,
        },
    },
    store=False,
)

moljson = json.loads(response.output_text)
print(moljson)
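
The example above calls json.loads directly on the response text. Even with a strict schema, the text may be incomplete, for example if generation is cut off before the object is closed. A minimal sketch of a more defensive parse follows; parse_moljson_text is a hypothetical helper, not part of the moljson package, and the strings stand in for response.output_text.

```python
import json


def parse_moljson_text(text):
    """Parse model output text into a dict, raising ValueError on bad output.

    Hypothetical helper, not part of the moljson package.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError(f"Expected a JSON object, got {type(data).__name__}")
    return data


# Stand-in strings for response.output_text; "example_key" is a placeholder
good = parse_moljson_text('{"example_key": []}')
try:
    parse_moljson_text('{"example_key": [')  # simulates truncated output
    truncated_ok = True
except ValueError:
    truncated_ok = False
print(good, truncated_ok)
```

This only checks that the text is a JSON object. It does not validate the object against the MolJSON schema or check that it describes a chemically valid molecule.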

Format Notes

By default, MolToJSON omits the charges and aromatic_n_h fields when the molecule does not require them. To keep these fields, call MolToJSON(mol, include_empty_fields=True).
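
To make the default behaviour concrete, here is a small sketch of what omitting unused fields could look like on a plain dictionary. It is not the package's implementation: drop_empty_fields is a hypothetical helper, the record layout is invented for illustration, and it assumes an unused field is represented as an empty list.

```python
def drop_empty_fields(record, optional=("charges", "aromatic_n_h")):
    """Return a copy of record without optional keys whose value is empty.

    Assumption: an unused optional field is stored as an empty list.
    """
    return {
        key: value
        for key, value in record.items()
        if not (key in optional and not value)
    }


# Hypothetical record; "other" stands in for the remaining MolJSON fields
full = {"other": ["kept"], "charges": [], "aromatic_n_h": [{"hcount": 1}]}
compact = drop_empty_fields(full)
print(compact)
```

Here the empty charges entry is dropped while the populated aromatic_n_h entry is kept, and the input dictionary is left unchanged.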

To ensure compatibility with both the OpenAI and Anthropic APIs, the MolJSON schema provided in this repo has been slightly modified from the one used in the paper. The original schema can be found in schemas/paper_moljson.schema.json. The Anthropic API currently does not support minimum/maximum constraints on integers, so the fields inside charges and aromatic_n_h now use an enumeration of the allowed integers instead. This is functionally equivalent and should not impact performance. Additionally, the original schema allowed aromatic_n_h.hcount values up to 2; this has been corrected so that only hcount = 1 is allowed.
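
The range-to-enumeration change can be sketched in a few lines. The helper below rewrites an integer schema that uses minimum and maximum into one that lists every allowed value. range_to_enum is a hypothetical name and the bounds are illustrative only; the real limits are defined in the schema files.

```python
def range_to_enum(int_schema):
    """Rewrite an integer schema with minimum/maximum as an explicit enum."""
    low, high = int_schema["minimum"], int_schema["maximum"]
    converted = {
        key: value
        for key, value in int_schema.items()
        if key not in ("minimum", "maximum")
    }
    # Both bounds are inclusive, so the enum must include `high` itself
    converted["enum"] = list(range(low, high + 1))
    return converted


# Illustrative bounds only; not taken from the MolJSON schema
ranged = {"type": "integer", "minimum": -2, "maximum": 2}
enumerated = range_to_enum(ranged)
print(enumerated)
```

Both forms accept exactly the same set of integers, which is why the two schemas are interchangeable for validation purposes.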

Citation

If you use MolJSON in your work, please cite:

@article{runcie2026moljson,
  title = {Molecular Representations for Large Language Models},
  author = {Runcie, Nicholas T. and Imrie, Fergus and Deane, Charlotte M.},
  year = {2026},
  journal = {arXiv preprint arXiv:2605.01822},
  doi = {10.48550/arXiv.2605.01822},
  url = {http://arxiv.org/abs/2605.01822},
}
