Merge pull request #10 from johnburbridge/feature/red-black-tree-storage

johnburbridge · web-flow · commit b28d183c35ab · 2025-03-23T00:17:12.000-07:00
docs: update documentation to include Red-Black Tree implementation
diff --git a/docs/alternatives.md b/docs/alternatives.md
@@ -2,9 +2,42 @@
 
 This document explores the limitations of the current implementation of time-based storage using Python's native data structures and discusses alternative approaches with their respective trade-offs.
 
-## Limitations of Current Implementation
+## Current Implementations
 
-The current implementation uses Python's native data structures (dictionaries, lists, and heaps), which come with several limitations:
+The time-based storage package currently provides three implementations with different performance characteristics:
+
+### 1. TimeBasedStorage (Dictionary-Based)
+
+- **Data structure**: Python dictionaries with timestamp keys
+- **Characteristics**:
+  - Simple implementation with minimal dependencies
+  - O(1) lookup for specific timestamps
+  - O(n) insertion time, because sorted order must be maintained
+  - Works well for small to medium datasets
+
+### 2. TimeBasedStorageHeap (Heap-Based)
+
+- **Data structure**: Python's heapq module with a min-heap
+- **Characteristics**:
+  - O(log n) insertion time
+  - O(1) access to earliest event
+  - O(n log n) for range queries
+  - Efficient for event processing where earliest events are prioritized
+
+### 3. TimeBasedStorageRBTree (Red-Black Tree)
+
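+- **Data structure**: `SortedDict` from the sortedcontainers package (fills the role of a Red-Black Tree; sortedcontainers itself is built on sorted lists rather than a literal tree)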
+- **Characteristics**:
+  - Balanced O(log n) performance for both insertions and queries
+  - Efficient O(log n + k) range queries where k is the number of items in range
+  - Up to 470x speedup for small targeted range queries compared to the dictionary-based implementation
+  - Requires the sortedcontainers package dependency
+
+All implementations provide thread-safe variants for concurrent access and share the same core API.
+
+## Limitations of Current Implementations
+
+Despite having multiple implementations optimized for different use cases, all current implementations share some limitations:
 
 ### Memory Constraints
 
@@ -13,18 +46,6 @@ The current implementation uses Python's native data structures (dictionaries, l
 - **No compression**: Data is stored uncompressed, using more memory than necessary
 - **Copy semantics**: Range queries and other operations create copies of data
 
-### Performance Limitations
-
-- **TimeBasedStorage (sorted list/dictionary)**:
-  - O(n) insertion time as items must maintain sort order
-  - Not optimized for very large datasets (>100K entries)
-  - Full scan required for some operations
-
-- **TimeBasedStorageHeap**:
-  - O(n log n) for range queries which requires scanning the entire heap
-  - Inefficient for latest event access (requires a full heap traversal)
-  - Extra overhead for maintaining heap property
-
 ### Persistence Issues
 
 - **No built-in persistence**: Data is lost when the program terminates
@@ -58,11 +79,6 @@ The current implementation uses Python's native data structures (dictionaries, l
   - Better for larger datasets with frequent range access
   - More complex implementation than current approach
 
-- **Red-Black Trees**:
-  - Self-balancing with guaranteed O(log n) operations
-  - Consistent performance regardless of data distribution
-  - More complex than binary search trees
-
 - **Skip Lists**:
   - Probabilistic alternative to balanced trees
   - O(log n) average operations with simpler implementation
@@ -197,17 +213,22 @@ The current implementation uses Python's native data structures (dictionaries, l
 
 ### For Small to Medium-Scale Applications
 
-1. **Add Persistence Layer**:
+1. **Use the Right Implementation for Your Needs**:
+   - **TimeBasedStorage**: Simple use cases with small datasets
+   - **TimeBasedStorageHeap**: When you need fast insertion and earliest-event access
+   - **TimeBasedStorageRBTree**: When you need balanced performance and frequent range queries
+
+2. **Add Persistence Layer**:
    - Implement serialization/deserialization to/from disk
    - Consider using pickle, JSON, or MessagePack
    - Add options for periodic automatic saving
 
-2. **Implement Time-Based Partitioning**:
+3. **Implement Time-Based Partitioning**:
    - Separate storage by time periods (days/weeks/months)
    - Enable efficient archiving of older data
    - Reduce memory usage for full dataset
 
-3. **Add TTL and Cleanup**:
+4. **Add TTL and Cleanup**:
    - Automatic pruning of old data
    - Configurable retention policies
    - Background cleanup process
@@ -248,7 +269,7 @@ The current implementation uses Python's native data structures (dictionaries, l
 
 ## Conclusion
 
-The current implementation with Python's native data structures provides a simple, easy-to-understand approach for time-based storage. However, as requirements grow in terms of data volume, query complexity, or performance needs, alternative approaches may become necessary.
+The current implementations provide options for different use cases, with the Red-Black Tree offering a good balance of performance for most scenarios. However, as requirements grow in terms of data volume, query complexity, or performance needs, alternative approaches may become necessary.
 
 The right choice depends on specific requirements:
 - Data volume and growth rate
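
To illustrate the range-query complexities documented in this diff (O(n) for the dictionary scan versus O(log n + k) over sorted keys), here is a minimal sketch using only the standard library. The function names `range_scan` and `range_bisect` are hypothetical, the inclusive bounds are an assumption, and `bisect` over a pre-sorted key list stands in for the key-range lookup that `SortedDict` provides (for example through `irange`); it is not the package's actual code.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta


def range_scan(storage, start, end):
    """Dictionary-style range query: visit every key, then sort the matches."""
    hits = [ts for ts in storage if start <= ts <= end]
    return [storage[ts] for ts in sorted(hits)]


def range_bisect(sorted_keys, storage, start, end):
    """Sorted-key range query: two binary searches, then a slice of k keys."""
    lo = bisect_left(sorted_keys, start)    # first key >= start
    hi = bisect_right(sorted_keys, end)     # one past the last key <= end
    return [storage[ts] for ts in sorted_keys[lo:hi]]


base = datetime(2025, 3, 23)
storage = {base + timedelta(minutes=i): f"event-{i}" for i in range(1000)}
sorted_keys = sorted(storage)  # assumed to be kept sorted by the container

start = base + timedelta(minutes=10)
end = base + timedelta(minutes=12)
result = range_bisect(sorted_keys, storage, start, end)
print(result)  # ['event-10', 'event-11', 'event-12']
```

Both functions return the same values; the difference is that the scan touches all 1000 keys while the bisect version touches only the three keys in range after two binary searches, which is where the gap for small targeted queries comes from.
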
diff --git a/docs/architecture.md b/docs/architecture.md
@@ -11,11 +11,13 @@ time_based_storage/
 ├── core/
 │   ├── __init__.py
 │   ├── base.py        # Base implementation
-│   └── heap.py        # Heap-based implementation
+│   ├── heap.py        # Heap-based implementation
+│   └── rbtree.py      # Red-Black Tree implementation
 └── concurrent/
     ├── __init__.py
     ├── thread_safe.py          # Thread-safe wrapper for base implementation
-    └── thread_safe_heap.py     # Thread-safe wrapper for heap implementation
+    ├── thread_safe_heap.py     # Thread-safe wrapper for heap implementation
+    └── thread_safe_rbtree.py   # Thread-safe wrapper for RB-Tree implementation
 ```
 
 ## Class Hierarchy
@@ -59,6 +61,20 @@ classDiagram
         +is_empty(): bool
     }
     
+    class TimeBasedStorageRBTree~T~ {
+        -SortedDict _storage
+        +add(timestamp, value): void
+        +get_range(start_time, end_time): List~T~
+        +get_duration(duration): List~T~
+        +get_value_at(timestamp): Optional~T~
+        +remove(timestamp): bool
+        +clear(): void
+        +get_all(): List~T~
+        +get_timestamps(): List~datetime~
+        +size(): int
+        +is_empty(): bool
+    }
+    
     class ThreadSafeTimeBasedStorage~T~ {
         -RLock _lock
         -Condition _condition
@@ -79,12 +95,25 @@ classDiagram
         +notify_data_available(): void
     }
     
+    class ThreadSafeTimeBasedStorageRBTree~T~ {
+        -RLock _lock
+        -Condition _condition
+        +add(timestamp, value): void
+        +get_range(start_time, end_time): List~T~
+        +get_duration(duration): List~T~
+        +wait_for_data(timeout): bool
+        +notify_data_available(): void
+    }
+    
     Generic~T~ <|-- TimeBasedStorage~T~
     Generic~T~ <|-- TimeBasedStorageHeap~T~
+    Generic~T~ <|-- TimeBasedStorageRBTree~T~
     TimeBasedStorage~T~ <|-- ThreadSafeTimeBasedStorage~T~
     TimeBasedStorageHeap~T~ <|-- ThreadSafeTimeBasedStorageHeap~T~
+    TimeBasedStorageRBTree~T~ <|-- ThreadSafeTimeBasedStorageRBTree~T~
     Generic~T~ <|-- ThreadSafeTimeBasedStorage~T~
     Generic~T~ <|-- ThreadSafeTimeBasedStorageHeap~T~
+    Generic~T~ <|-- ThreadSafeTimeBasedStorageRBTree~T~
 ```
 
 ### Core Components
@@ -99,6 +128,11 @@ classDiagram
    - Optimized for accessing the earliest event
    - Maintains partial ordering based on timestamps
 
+3. **`TimeBasedStorageRBTree` (core/rbtree.py)**
+   - Provides Red-Black-Tree-style sorted storage through `SortedDict` from sortedcontainers
+   - Balanced for both insertion and range queries
+   - Maintains full ordering with efficient operations
+
 ### Concurrent Components
 
 1. **`ThreadSafeTimeBasedStorage` (concurrent/thread_safe.py)**
@@ -111,14 +145,20 @@ classDiagram
    - Uses locks to ensure thread safety
    - Maintains the efficiency of the underlying heap
 
+3. **`ThreadSafeTimeBasedStorageRBTree` (concurrent/thread_safe_rbtree.py)**
+   - Thread-safe wrapper around TimeBasedStorageRBTree
+   - Uses locks to ensure thread safety
+   - Preserves the balanced performance characteristics of the RB-Tree
+
 ## Design Decisions
 
 ### 1. Implementation Variants
 
-Two different implementations were created to support different access patterns:
+Three different implementations were created to support different access patterns:
 
-- **List-based implementation**: Prioritizes efficient range queries and timestamp lookups, with O(log n) complexity for these operations but O(n) for insertions.
+- **Dictionary-based implementation**: Prioritizes simplicity and fast timestamp lookups, with O(1) for lookups but O(n) for insertions and range queries.
 - **Heap-based implementation**: Prioritizes efficient insertion and earliest event access, with O(log n) complexity for insertions but O(n log n) for range queries.
+- **Red-Black Tree implementation**: Provides balanced performance with O(log n) for both insertions and range queries, making it suitable for a wide range of use cases.
 
 This allows users to choose the implementation that best matches their access patterns.
 
@@ -156,17 +196,21 @@ This approach allows users to decide how to handle conflicts rather than silentl
 
 ### Storage Backend
 
-Both implementations use Python's built-in data structures:
+The implementations use different data structures:
 
 1. **TimeBasedStorage**:
-   - Uses a dictionary (`self.values`) for O(1) lookup by timestamp
-   - Maintains a sorted list of timestamps (`self.timestamps`)
-   - Uses binary search for range queries
+   - Uses a dictionary (`self._storage`) for O(1) lookup by timestamp
+   - Uses sorted key iteration for range queries
 
 2. **TimeBasedStorageHeap**:
    - Uses a binary min-heap for fast insertion and earliest event access
    - Uses a dictionary for direct timestamp lookup
 
+3. **TimeBasedStorageRBTree**:
+   - Uses a SortedDict from the sortedcontainers package
+   - Provides O(log n) time for most operations
+   - Enables efficient range queries through key slicing operations
+
 ### Thread Safety Implementation
 
 Thread safety is achieved using Python's threading primitives:
@@ -189,15 +233,15 @@ The library follows these error handling principles:
 3. **Idempotent operations**: Some operations are designed to be safely repeated
 4. **Clear error messages**: Error messages clearly indicate the issue
 
-## Performance Considerations
+## Performance Characteristics
 
 ### TimeBasedStorage
 
 - **Space complexity**: O(n) where n is the number of stored events
 - **Time complexity**:
   - Insertion: O(n) due to maintaining sorted order
   - Lookup by timestamp: O(1)
-  - Range queries: O(log n) using binary search
+  - Range queries: O(n) linear scan through sorted dictionary keys
  - Iteration: O(n) to visit all events
 
 ### TimeBasedStorageHeap
@@ -209,6 +253,16 @@ The library follows these error handling principles:
   - Range queries: O(n log n)
   - Earliest event access: O(1)
 
+### TimeBasedStorageRBTree
+
+- **Space complexity**: O(n)
+- **Time complexity**:
+  - Insertion: O(log n) using Red-Black Tree
+  - Lookup by timestamp: O(log n)
+  - Range queries: O(log n + k) where k is the number of items in range
+  - Iteration: O(n) to visit all events in timestamp order
+  - Benchmark results: Up to 470x faster for small targeted range queries
+
 ## Testing Strategy
 
 The library employs a comprehensive testing strategy:
@@ -218,6 +272,7 @@ The library employs a comprehensive testing strategy:
 3. **Concurrency tests** for thread-safe variants
 4. **Stress tests** to ensure performance under load
 5. **Edge case tests** for boundary conditions
+6. **Benchmark tests** to compare performance between implementations
 
 ## Future Improvements
 
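
The heap-based trade-off described in this file (O(log n) insertion, O(1) earliest-event access, O(n log n) range queries) can be sketched with `heapq` and a companion dictionary. `HeapStorageSketch`, `peek_earliest`, and the `ValueError` on duplicate timestamps are assumptions made for illustration, not the package's `TimeBasedStorageHeap` API.

```python
import heapq
from datetime import datetime, timedelta


class HeapStorageSketch:
    """Hypothetical stand-in for a heap-backed time-based store."""

    def __init__(self):
        self._heap = []      # min-heap of timestamps (partial ordering only)
        self._values = {}    # timestamp -> value, for direct lookup

    def add(self, timestamp, value):
        # Assumption: duplicate timestamps are rejected rather than overwritten.
        if timestamp in self._values:
            raise ValueError(f"duplicate timestamp: {timestamp}")
        heapq.heappush(self._heap, timestamp)   # O(log n)
        self._values[timestamp] = value

    def peek_earliest(self):
        if not self._heap:
            return None
        earliest = self._heap[0]                # O(1): the heap root is the minimum
        return earliest, self._values[earliest]

    def get_range(self, start, end):
        # The heap is not fully sorted, so every entry is scanned
        # and the matches are sorted afterwards.
        hits = sorted(ts for ts in self._heap if start <= ts <= end)
        return [self._values[ts] for ts in hits]


base = datetime(2025, 3, 23)
store = HeapStorageSketch()
for offset in (5, 1, 3):                        # inserted out of order
    store.add(base + timedelta(seconds=offset), offset)

print(store.peek_earliest()[1])                 # 1
print(store.get_range(base + timedelta(seconds=2), base + timedelta(seconds=5)))  # [3, 5]
```
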
diff --git a/docs/concurrent_use_cases.md b/docs/concurrent_use_cases.md
@@ -79,6 +79,7 @@
 - Read-write locks for different operations
 - Fine-grained locking for better concurrency
 - Deadlock prevention strategies
+- Currently implemented in ThreadSafeTimeBasedStorage, ThreadSafeTimeBasedStorageHeap, and ThreadSafeTimeBasedStorageRBTree
 
 ### 2. Lock-Free Data Structures
 - Atomic operations where possible
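
The lock-based approach that the thread-safe variants are documented as using (an `RLock` plus a `Condition`, with `wait_for_data(timeout)` returning a bool) might look roughly like the sketch below. `ThreadSafeSketch` is a hypothetical class over a plain dictionary; here `add` notifies waiters directly, whereas the documented classes expose a separate `notify_data_available()` method whose exact behavior is not shown in this diff.

```python
import threading
from datetime import datetime


class ThreadSafeSketch:
    """Hypothetical lock-guarded store; not the package's implementation."""

    def __init__(self):
        self._storage = {}
        self._lock = threading.RLock()
        # The condition shares the re-entrant lock, so one lock guards everything.
        self._condition = threading.Condition(self._lock)

    def add(self, timestamp, value):
        with self._condition:
            self._storage[timestamp] = value
            self._condition.notify_all()        # wake any thread in wait_for_data

    def get_range(self, start, end):
        with self._lock:
            keys = sorted(ts for ts in self._storage if start <= ts <= end)
            return [self._storage[ts] for ts in keys]

    def wait_for_data(self, timeout=None):
        # Returns True once data is present, False if the timeout expires first.
        with self._condition:
            return self._condition.wait_for(lambda: bool(self._storage), timeout=timeout)


store = ThreadSafeSketch()
# A timer thread plays the producer and adds one event shortly after start.
producer = threading.Timer(0.05, store.add, args=(datetime(2025, 3, 23), "event"))
producer.start()
got_data = store.wait_for_data(timeout=2.0)
producer.join()
print(got_data)
```

Using `wait_for` with a predicate rather than a bare `wait` guards against spurious wakeups and against the case where data arrived before the consumer started waiting.
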