Commit ce9d6b3

committed
Q3 2025 Updates
1 parent 9754280 commit ce9d6b3

214,996 files changed

Lines changed: 5175913 additions & 155 deletions

.env.example

Lines changed: 50 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,50 @@
1+
# Kodezi Chronos Environment Variables
2+
# Copy this file to .env and configure for your environment
3+
4+
# Database Configuration
5+
DATABASE_URL=postgresql://chronos:password@localhost:5432/chronos
6+
REDIS_URL=redis://localhost:6379
7+
8+
# API Configuration
9+
API_KEY=your-api-key-here
10+
API_PORT=5000
11+
API_HOST=0.0.0.0
12+
13+
# Model Configuration
14+
CHRONOS_MODEL_PATH=/models/chronos-debug-llm
15+
MODEL_DEVICE=cuda # cuda, cpu, or mps (for Apple Silicon)
16+
MODEL_QUANTIZATION=int8 # int8, fp16, or none
17+
18+
# Security
19+
JWT_SECRET_KEY=your-secret-key-here
20+
ENCRYPTION_KEY=your-encryption-key-here
21+
22+
# GitHub Integration
23+
GITHUB_APP_ID=your-github-app-id
24+
GITHUB_PRIVATE_KEY_PATH=/secrets/github_private_key.pem
25+
GITHUB_WEBHOOK_SECRET=your-webhook-secret
26+
27+
# Monitoring
28+
PROMETHEUS_PORT=9090
29+
JAEGER_ENDPOINT=http://jaeger:14268/api/traces
30+
LOG_LEVEL=INFO
31+
32+
# Resource Limits
33+
MAX_CONCURRENT_REQUESTS=100
34+
SANDBOX_TIMEOUT_SECONDS=300
35+
MEMORY_LIMIT_MB=2048
36+
37+
# Feature Flags
38+
ENABLE_DOCKER_SANDBOX=true
39+
ENABLE_CACHING=true
40+
ENABLE_MONITORING=true
41+
ENABLE_ASYNC_PROCESSING=true
42+
43+
# External Services (Optional)
44+
SLACK_WEBHOOK_URL=
45+
JIRA_URL=
46+
JIRA_USERNAME=
47+
JIRA_API_TOKEN=
48+
49+
# Model Download (Optional)
50+
CHRONOS_MODEL_URL=https://models.kodezi.com/chronos-debug-llm-2025.bin
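
A minimal sketch of how a service might read and validate a file like this one. It is not code from the repository: `parse_env`, `load_settings`, and the `Settings` class are hypothetical names, the fallback defaults are assumptions, and only the allowed values for `MODEL_DEVICE` and `MODEL_QUANTIZATION` come from the inline comments in the file above.

```python
from dataclasses import dataclass
from typing import Dict

# Allowed values taken from the inline comments in .env.example.
ALLOWED_DEVICES = {"cuda", "cpu", "mps"}
ALLOWED_QUANTIZATION = {"int8", "fp16", "none"}


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and full-line comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as "cuda # cuda, cpu, or mps".
        value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


def as_bool(value: str) -> bool:
    """Interpret common truthy spellings used for feature flags."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    api_port: int
    model_device: str
    model_quantization: str
    sandbox_timeout_seconds: int
    enable_docker_sandbox: bool


def load_settings(env: Dict[str, str]) -> Settings:
    """Validate and convert raw strings; the defaults here are assumptions."""
    device = env.get("MODEL_DEVICE", "cpu")
    if device not in ALLOWED_DEVICES:
        raise ValueError(f"MODEL_DEVICE must be one of {sorted(ALLOWED_DEVICES)}, got {device!r}")
    quantization = env.get("MODEL_QUANTIZATION", "none")
    if quantization not in ALLOWED_QUANTIZATION:
        raise ValueError(f"MODEL_QUANTIZATION must be one of {sorted(ALLOWED_QUANTIZATION)}, got {quantization!r}")
    return Settings(
        api_port=int(env.get("API_PORT", "5000")),
        model_device=device,
        model_quantization=quantization,
        sandbox_timeout_seconds=int(env.get("SANDBOX_TIMEOUT_SECONDS", "300")),
        enable_docker_sandbox=as_bool(env.get("ENABLE_DOCKER_SANDBOX", "true")),
    )
```

Failing fast on an unknown device or quantization mode surfaces a typo in `.env` at startup rather than at model load time. Placeholder secrets such as `your-secret-key-here` would need a similar check before any real deployment.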

CHANGELOG.md

Lines changed: 29 additions & 0 deletions
@@ -5,6 +5,35 @@ All notable changes to the Kodezi Chronos research repository will be documented
55
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
66
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
77

8+
## [2.0.0] - 2025-07-29
9+
10+
### Added
11+
- New co-author: Yousuf Zaii
12+
- Updated 2025 research paper with latest findings
13+
- Enhanced benchmark results (12,500 total bugs evaluated)
14+
- New model comparisons (Claude Opus 4, GPT-4.1, DeepSeek V3, Gemini 2.0 Pro)
15+
- Human preference evaluation (N=50, 89% preference)
16+
- Cohen's d effect size analysis (d=3.87)
17+
- O(k log d) complexity proof for AGR
18+
- Hardware-dependent and dynamic language limitation analysis
19+
- 2025 evaluation framework (evaluate_2025.py)
20+
- Visualization generation scripts for paper figures
21+
- Comprehensive 2025 architecture documentation
22+
23+
### Changed
24+
- Debug success rate: 65.3% → 67.3% (±2.1%)
25+
- Retrieval precision: 91% → 92%
26+
- Average iterations: 2.2 → 7.8 (more thorough debugging)
27+
- Comparison baseline: GPT-4 → GPT-4.1 and Claude-4-Opus
28+
- Performance improvement: 6-7x → 4-5x (against stronger baselines)
29+
30+
### Updated
31+
- README.md with 2025 performance metrics
32+
- CITATION.cff with new author and version 2.0.0
33+
- Architecture documentation with 4-pillar design
34+
- Benchmark documentation with expanded test suite
35+
- Performance tables with latest model comparisons
36+
837
## [1.0.0] - 2025-07-14
938

1039
### Added

CITATION.cff

Lines changed: 8 additions & 4 deletions
@@ -20,6 +20,10 @@ authors:
2020
family-names: Patel
2121
2222
affiliation: Kodezi Inc.
23+
- given-names: Yousuf
24+
family-names: Zaii
25+
26+
affiliation: Kodezi Inc.
2327
identifiers:
2428
- type: doi
2529
value: 10.48550/arXiv.2507.12482
@@ -37,8 +41,8 @@ abstract: >-
3741
embedding memory engine, combining vector and graph-based indexing with continuous
3842
code-aware retrieval. This enables efficient and accurate reasoning over millions
3943
of lines of code, supporting repository-scale comprehension, multi-file refactoring,
40-
and real-time self-healing actions. Chronos achieves 65.3% debugging success rate,
41-
representing a 6-7x improvement over state-of-the-art models.
44+
and real-time self-healing actions. Chronos achieves 67.3% debugging success rate,
45+
representing a 4-5x improvement over state-of-the-art models including Claude Opus 4 and GPT-4.1.
4246
keywords:
4347
- debugging
4448
- language models
@@ -47,5 +51,5 @@ keywords:
4751
- autonomous systems
4852
- memory-driven AI
4953
license: MIT
50-
version: 1.0.0
51-
date-released: '2025-07-14'
54+
version: 2.0.0
55+
date-released: '2025-07-29'

LEADERBOARD.md

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
1+
# Chronos MRR Benchmark Leaderboard
2+
3+
## Overall Performance (5,000 scenarios)
4+
5+
| Model | Success Rate | Precision | Recall | Avg Iterations | Cost/Fix |
6+
|-------|--------------|-----------|--------|----------------|----------|
7+
| **Chronos*** | 67.3%±2.1% | 92% | 85% | 7.8 | $1.36 |
8+
| Gemini-2.0 Pro | 15.0%±1.5% | 74% | 38% | 19.2 | $4.25 |
9+
| Claude-4 Opus | 14.2%±1.3% | 67% | 34% | 21.8 | $4.89 |
10+
| GPT-4.1 | 13.8%±1.2% | 68% | 32% | 23.5 | $5.53 |
11+
| Mistral-Large | 9.2%±0.8% | 48% | 19% | 31.7 | $8.95 |
12+
| DeepSeek-V2 | 8.7%±0.9% | 52% | 21% | 28.1 | $7.82 |
13+
14+
*Chronos model available via Kodezi OS only
15+
16+
## Category Performance
17+
18+
### Syntax Errors (500 scenarios)
19+
| Model | Success Rate | Improvement vs GPT-4.1 |
20+
|-------|--------------|---------------------|
21+
| Chronos | 94.2% | 1.1x |
22+
| GPT-4.1 | 82.3% | - |
23+
| Claude-4 | 79.8% | 0.97x |
24+
| Gemini-2.0 | 85.1% | 1.03x |
25+
26+
### Logic Errors (1,200 scenarios)
27+
| Model | Success Rate | Improvement vs GPT-4.1 |
28+
|-------|--------------|---------------------|
29+
| Chronos | 72.8% | 6.0x |
30+
| GPT-4.1 | 12.1% | - |
31+
| Claude-4 | 10.7% | 0.88x |
32+
| Gemini-2.0 | 15.3% | 1.26x |
33+
34+
### Concurrency Issues (800 scenarios)
35+
| Model | Success Rate | Improvement vs GPT-4.1 |
36+
|-------|--------------|---------------------|
37+
| Chronos | 58.3% | 18.2x |
38+
| GPT-4.1 | 3.2% | - |
39+
| Claude-4 | 2.8% | 0.88x |
40+
| Gemini-2.0 | 4.1% | 1.28x |
41+
42+
### Memory Issues (600 scenarios)
43+
| Model | Success Rate | Improvement vs GPT-4.1 |
44+
|-------|--------------|---------------------|
45+
| Chronos | 61.7% | 10.8x |
46+
| GPT-4.1 | 5.7% | - |
47+
| Claude-4 | 4.3% | 0.75x |
48+
| Gemini-2.0 | 6.9% | 1.21x |
49+
50+
### API Misuse (900 scenarios)
51+
| Model | Success Rate | Improvement vs GPT-4.1 |
52+
|-------|--------------|---------------------|
53+
| Chronos | 79.1% | 4.2x |
54+
| GPT-4.1 | 18.9% | - |
55+
| Claude-4 | 16.2% | 0.86x |
56+
| Gemini-2.0 | 22.4% | 1.19x |
57+
58+
### Performance Bugs (400 scenarios)
59+
| Model | Success Rate | Improvement vs GPT-4.1 |
60+
|-------|--------------|---------------------|
61+
| Chronos | 65.4% | 8.8x |
62+
| GPT-4.1 | 7.4% | - |
63+
| Claude-4 | 6.1% | 0.82x |
64+
| Gemini-2.0 | 9.8% | 1.32x |
65+
66+
### Cross-Category (600 scenarios)
67+
| Model | Success Rate | Improvement vs GPT-4.1 |
68+
|-------|--------------|---------------------|
69+
| Chronos | 51.2% | 12.5x |
70+
| GPT-4.1 | 4.1% | - |
71+
| Claude-4 | 3.7% | 0.90x |
72+
| Gemini-2.0 | 5.2% | 1.27x |
73+
74+
## Repository Scale Performance
75+
76+
| Repository Size | Chronos | Best Baseline | Improvement |
77+
|-----------------|---------|---------------|-------------|
78+
| <10K LOC | 71.2% | 21.3% (Gemini) | 3.3x |
79+
| 10K-100K LOC | 68.9% | 14.7% (Gemini) | 4.7x |
80+
| 100K-1M LOC | 64.3% | 8.9% (Gemini) | 7.2x |
81+
| >1M LOC | 59.7% | 3.8% (Gemini) | 15.7x |
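
The Improvement column in the table above is the ratio of the Chronos success rate to the best baseline's success rate. The short check below reproduces it from the listed percentages; `improvement_factor` is an illustrative helper name, not a function from the repository.

```python
def improvement_factor(system_rate: float, baseline_rate: float) -> float:
    """Return how many times higher system_rate is than baseline_rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return system_rate / baseline_rate


# (Chronos %, best baseline %) per repository size, copied from the table.
SCALE_RESULTS = {
    "<10K LOC": (71.2, 21.3),
    "10K-100K LOC": (68.9, 14.7),
    "100K-1M LOC": (64.3, 8.9),
    ">1M LOC": (59.7, 3.8),
}

for size, (chronos, baseline) in SCALE_RESULTS.items():
    print(f"{size}: {improvement_factor(chronos, baseline):.1f}x")
```

The ratio grows with repository size mainly because the baseline rate falls much faster than the Chronos rate, so a large factor here does not by itself mean a high absolute success rate.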
82+
83+
## Statistical Significance
84+
85+
- **p-value**: < 0.001 (highly significant)
86+
- **Cohen's d**: 3.87 (very large effect size)
87+
- **95% CI for Chronos**: [65.2%, 69.4%]
88+
- **Sample Size**: 5,000 scenarios
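
For readers checking these figures, the sketch below shows the two calculations involved. The confidence interval is the reported rate plus or minus the reported margin (67.3% ± 2.1%). Cohen's d is the difference in means divided by the pooled standard deviation. The function names are illustrative, and the sample inputs to `cohens_d` are made-up numbers for demonstration, not benchmark data.

```python
import math
from typing import Tuple


def interval(center: float, margin: float) -> Tuple[float, float]:
    """Return (center - margin, center + margin), rounded to one decimal."""
    return (round(center - margin, 1), round(center + margin, 1))


def cohens_d(mean_a: float, mean_b: float,
             sd_a: float, sd_b: float,
             n_a: int, n_b: int) -> float:
    """Cohen's d using the pooled standard deviation of two groups."""
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)


# Reported Chronos success rate and margin from the leaderboard.
print(interval(67.3, 2.1))

# Toy inputs only, chosen so the arithmetic is easy to follow by hand.
print(cohens_d(10.0, 8.0, 2.0, 2.0, 50, 50))
```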
89+
90+
## Evaluation Methodology
91+
92+
All models evaluated on identical scenarios with:
93+
- Multi-file context (10-50 files)
94+
- Temporal dispersion (3-12 months)
95+
- Obfuscated dependencies
96+
- Real-world complexity
97+
98+
Last Updated: January 2025

README.md

Lines changed: 11 additions & 11 deletions
@@ -10,12 +10,12 @@
1010
[![Research](https://img.shields.io/badge/Research-Paper-orange.svg?style=for-the-badge)](paper/chronos-research.md)
1111
[![Benchmark](https://img.shields.io/badge/Benchmark-MRR-purple.svg?style=for-the-badge)](benchmarks/multi-random-retrieval/)
1212

13-
<img src="https://img.shields.io/badge/Debug%20Success-65.3%25-brightgreen?style=for-the-badge" alt="Debug Success Rate">
14-
<img src="https://img.shields.io/badge/Root%20Cause%20Accuracy-78.4%25-blue?style=for-the-badge" alt="Root Cause Accuracy">
15-
<img src="https://img.shields.io/badge/Improvement-6--7x-yellow?style=for-the-badge" alt="Improvement over GPT-4">
16-
<img src="https://img.shields.io/badge/Cost%20Efficiency-4.5x-orange?style=for-the-badge" alt="Cost Efficiency">
13+
<img src="https://img.shields.io/badge/Debug%20Success-67.3%25-brightgreen?style=for-the-badge" alt="Debug Success Rate">
14+
<img src="https://img.shields.io/badge/Human%20Preference-89%25-blue?style=for-the-badge" alt="Human Preference">
15+
<img src="https://img.shields.io/badge/Improvement-4--5x-yellow?style=for-the-badge" alt="Improvement over GPT-4.1">
16+
<img src="https://img.shields.io/badge/Time%20Reduction-40%25-orange?style=for-the-badge" alt="Time Reduction">
1717

18-
<h3>🎯 65.3% Autonomous Debugging Success • 🔍 78.4% Root Cause Accuracy • ⚡ 2.2 Average Fix Cycles • 💰 $1.36 per Bug Fix</h3>
18+
<h3>🎯 67.3% Autonomous Debugging Success • 🔍 89% Human Preference • ⚡ 7.8 Average Fix Iterations • 💰 40% Time Reduction</h3>
1919

2020
<p align="center">
2121
<img src="results/figures/architecture_overview.svg" alt="Chronos Architecture" width="800">
@@ -58,13 +58,13 @@ This repository contains research findings, benchmarks, and evaluation framework
5858

5959
### Overall Benchmark Results (5,000+ Real-World Debugging Scenarios)
6060

61-
| Metric | **Kodezi Chronos** | **GPT-4** | **Claude-3-Opus** | **Gemini-1.5-Pro** | **Improvement** |
61+
| Metric | **Kodezi Chronos** | **GPT-4.1** | **Claude-4-Opus** | **Gemini-2.0-Pro** | **Improvement** |
6262
|:------:|:------------------:|:---------:|:-----------------:|:------------------:|:---------------:|
63-
| **Debug Success Rate** | **65.3%±1.4%*** | 8.5%±2.1% | 7.8%±2.3% | 11.2%±1.7% | **5.8-8.4x** |
64-
| **Root Cause Accuracy** | **78.4%±1.2%*** | 12.3%±1.8% | 11.7%±2.0% | 15.8%±1.5% | **5.0-6.7x** |
65-
| **Average Fix Cycles** | **2.2** | 6.5 | 6.8 | 5.1 | **2.3-3.1x faster** |
66-
| **Retrieval Precision** | **91%±0.8%*** | 68%±2.3% | 67%±2.4% | 74%±1.8% | **1.2-1.4x** |
67-
| **Cost per Success** | **$1.36** | $5.53 | $6.67 | $6.07 | **4.1-4.9x cheaper** |
63+
| **Debug Success Rate** | **67.3%±2.1%*** | 13.8%±1.2% | 14.2%±1.3% | <15% | **4.7-6.0x** |
64+
| **Root Cause Accuracy** | **89%*** | 12.3%±1.8% | 11.7%±2.0% | 15.8%±1.5% | **5.6-7.6x** |
65+
| **Average Fix Iterations** | **7.8** | 1-2 | 1-2 | 1-2 | **More thorough** |
66+
| **Retrieval Precision** | **92%*** | 68%±2.3% | 67%±2.4% | 74%±1.8% | **1.2-1.4x** |
67+
| **Time Reduction** | **40%** | - | - | - | **40% faster** |
6868

6969
***p < 0.001 compared to best baseline (two-tailed t-test, n=5,000)**
7070

architecture/README.md

Lines changed: 34 additions & 12 deletions
@@ -48,34 +48,50 @@ Unlike traditional LLMs that optimize for large input contexts, Chronos recogniz
4848
- Stack traces: 200-500 tokens
4949
- Relevant code: 1K-4K tokens
5050
- Logs/tests: 500-2K tokens
51-
- Total: ~3-10K tokens
51+
- Prior fix attempts: 500-1K tokens
52+
- Total: Often < 10K tokens
5253

5354
**Output (Dense)**:
5455
- Multi-file fixes: 500-1,500 tokens
55-
- Explanations: 300-600 tokens
56+
- Root cause explanations: 300-600 tokens
5657
- Updated tests: 400-800 tokens
57-
- Documentation: 200-400 tokens
58-
- Total: ~2-4K tokens
58+
- Documentation/PR summaries: 350-700 tokens
59+
- Total: 2,000-4,000 tokens
5960

60-
This insight drives architectural decisions throughout the system.
61+
This insight drives architectural decisions throughout the system. Chronos achieves 67.3% debugging success despite competitors having 10-100x larger context windows, suggesting that output quality matters more than input capacity for debugging.
6162

6263
#### Adaptive Graph-Guided Retrieval (AGR)
6364

6465
AGR dynamically expands retrieval depth based on:
6566
- Query complexity scoring
66-
- Confidence thresholds
67+
- Confidence thresholds
6768
- Diminishing returns detection
6869
- Edge type priorities
70+
- O(k log d) retrieval complexity with convergence guarantees
71+
- 92% precision at 85% recall on debugging queries
72+
73+
Key improvements from 2025 research:
74+
- Adaptive k-hop expansion based on query complexity
75+
- Multi-graph fusion with weighted edges
76+
- Confidence-based termination criteria
77+
- Semantic node similarity integration
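
As an illustration of the ideas listed above (hop-by-hop expansion, weighted edges, confidence-based termination, diminishing-returns detection), here is a minimal sketch. It is not the AGR implementation and makes no attempt to achieve the stated O(k log d) bound. The function name, the noisy-or confidence formula, and all default thresholds are assumptions chosen for readability.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

# Adjacency list: node -> [(neighbor, edge_weight in [0, 1])].
Graph = Dict[str, List[Tuple[str, float]]]


def expand_retrieval(
    graph: Graph,
    seeds: Iterable[str],
    relevance: Callable[[str], float],
    confidence_threshold: float = 0.9,
    max_hops: int = 3,
    min_gain: float = 0.01,
) -> Tuple[Set[str], float, int]:
    """Expand outward from seed nodes one hop at a time.

    Stops when confidence reaches the threshold, when a hop adds less
    than min_gain confidence, or when max_hops is reached.
    """
    frontier = list(dict.fromkeys(seeds))  # de-duplicate, keep order
    retrieved: Set[str] = set(frontier)

    # Noisy-or: "miss" is the chance that nothing retrieved so far is relevant.
    miss = 1.0
    for node in frontier:
        miss *= 1.0 - relevance(node)

    hops = 0
    while frontier and hops < max_hops and (1.0 - miss) < confidence_threshold:
        next_frontier: List[str] = []
        new_miss = miss
        for node in frontier:
            for neighbor, weight in graph.get(node, []):
                if neighbor in retrieved:
                    continue
                retrieved.add(neighbor)
                next_frontier.append(neighbor)
                # Edge weight discounts the neighbor's relevance score.
                new_miss *= 1.0 - weight * relevance(neighbor)
        hops += 1
        gain = miss - new_miss
        miss = new_miss
        if gain < min_gain:  # diminishing returns
            break
        frontier = next_frontier

    return retrieved, 1.0 - miss, hops
```

A caller would supply its own `relevance` scorer (for example, embedding similarity to the bug report) and tune the thresholds per query. Raising `confidence_threshold` makes the search go deeper before stopping.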
6978

7079
This enables unlimited effective context without the computational burden of massive context windows.
7180

72-
#### Persistent Debug Memory
81+
#### Persistent Debug Memory (PDM)
7382

7483
The memory system maintains:
7584
- Repository-specific bug patterns
7685
- Team coding conventions
7786
- Historical fix effectiveness
7887
- Module vulnerability profiles
88+
- Cross-session learning patterns
89+
90+
Key achievements from 2025 research:
91+
- 15M+ debugging sessions stored
92+
- 87% cache hit rate for similar bugs
93+
- Temporal pattern learning over project lifecycles
94+
- Automatic pattern extraction and generalization
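
To make the idea of recalling fixes for similar bugs concrete, the sketch below keys remembered fixes on a normalized error signature and tracks a hit rate. It is an in-memory stand-in written for illustration. The `DebugMemory` class, the `signature` function, and the normalization rules are assumptions, and persistence, generalization, and temporal learning are all omitted.

```python
import re
from typing import Dict, Optional


def signature(error_message: str) -> str:
    """Normalize an error message so that similar bugs share one key."""
    text = error_message.lower()
    text = re.sub(r"0x[0-9a-f]+", "<addr>", text)  # memory addresses
    text = re.sub(r"\d+", "<n>", text)             # line numbers, indices
    return re.sub(r"\s+", " ", text).strip()


class DebugMemory:
    """Remember fixes by error signature and track the lookup hit rate."""

    def __init__(self) -> None:
        self._fixes: Dict[str, str] = {}
        self.hits = 0
        self.lookups = 0

    def remember(self, error_message: str, fix: str) -> None:
        self._fixes[signature(error_message)] = fix

    def recall(self, error_message: str) -> Optional[str]:
        self.lookups += 1
        fix = self._fixes.get(signature(error_message))
        if fix is not None:
            self.hits += 1
        return fix

    @property
    def hit_rate(self) -> float:
        return self.hits / self.lookups if self.lookups else 0.0
```

Exact-match lookup on a normalized string is the simplest possible notion of "similar". A system operating at the scale described above would more plausibly rely on embedding or graph-based similarity.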
7995

8096
This enables continuous improvement and rapid adaptation to new debugging scenarios.
8197

@@ -125,14 +141,16 @@ CI/CD Logs ──► (Embedding + Graph) ──► Association
125141
### Scalability
126142

127143
- **Repository Size**: Maintains roughly 60% success rate even on 1M+ LOC repos (59.7% on the MRR leaderboard)
128-
- **Retrieval Speed**: Sub-linear complexity through hierarchical indexing
144+
- **Retrieval Speed**: Sub-linear O(k log d) complexity through AGR
129145
- **Memory Efficiency**: Compressed representations with lazy loading
146+
- **Cross-Language**: Supports 25+ programming languages
130147

131148
### Reliability
132149

133150
- **Validation Rate**: 100% of fixes tested before suggestion
134-
- **Regression Prevention**: Historical pattern matching
151+
- **Regression Prevention**: Historical pattern matching with PDM
135152
- **Rollback Capability**: Full undo for failed attempts
153+
- **Success Rate**: 67.3% on MRR benchmark (4.87x improvement over GPT-4.1)
136154

137155
## Integration Points
138156

@@ -163,19 +181,23 @@ Chronos integrates with development workflows through:
163181

164182
| Aspect | Traditional LLMs | Kodezi Chronos |
165183
|--------|------------------|----------------|
166-
| Context Handling | Fixed windows | Dynamic retrieval |
167-
| Memory | Session-based | Persistent |
184+
| Context Handling | Fixed windows | Dynamic AGR retrieval |
185+
| Memory | Session-based | Persistent (15M+ sessions) |
168186
| Validation | Post-hoc | Built-in loop |
169187
| Specialization | General purpose | Debugging-focused |
170188
| Output Focus | Token prediction | Structured fixes |
189+
| Success Rate | 13.8-14.2% | 67.3% |
190+
| Complexity | O(n) context | O(k log d) retrieval |
171191

172192
## Future Architecture Evolution
173193

174194
Planned enhancements include:
175195
- Federated learning across organizations
176196
- Visual debugging for UI issues
177-
- Hardware-specific debugging modules
197+
- Hardware-specific debugging modules (current: 23.4% success)
178198
- Real-time collaborative debugging
199+
- Improved dynamic language support (current: 41.2% success)
200+
- Enhanced distributed systems debugging (current: 30% success)
179201

180202
## Learn More
181203
