Commit b28d183

Merge pull request #10 from johnburbridge/feature/red-black-tree-storage
docs: update documentation to include Red-Black Tree implementation
2 parents 0667a23 + 173e7fa commit b28d183

3 files changed

Lines changed: 110 additions & 33 deletions

docs/alternatives.md

Lines changed: 44 additions & 23 deletions
@@ -2,9 +2,42 @@

 This document explores the limitations of the current implementation of time-based storage using Python's native data structures and discusses alternative approaches with their respective trade-offs.

-## Limitations of Current Implementation
+## Current Implementations

-The current implementation uses Python's native data structures (dictionaries, lists, and heaps), which come with several limitations:
+The time-based storage package currently provides three implementations with different performance characteristics:
+
+### 1. TimeBasedStorage (Dictionary-Based)
+
+- **Data structure**: Python dictionaries with timestamp keys
+- **Characteristics**:
+  - Simple implementation with minimal dependencies
+  - O(1) lookup for specific timestamps
+  - O(n) insertion time due to maintaining sorted access
+  - Works well for small to medium datasets
+
+### 2. TimeBasedStorageHeap (Heap-Based)
+
+- **Data structure**: Python's heapq module with a min-heap
+- **Characteristics**:
+  - O(log n) insertion time
+  - O(1) access to earliest event
+  - O(n log n) for range queries
+  - Efficient for event processing where earliest events are prioritized
+
+### 3. TimeBasedStorageRBTree (Red-Black Tree)
+
+- **Data structure**: SortedDict from sortedcontainers package (Red-Black Tree)
+- **Characteristics**:
+  - Balanced O(log n) performance for both insertions and queries
+  - Efficient O(log n + k) range queries where k is the number of items in range
+  - Up to 470x speedup for small targeted range queries compared to the dictionary-based implementation
+  - Requires the sortedcontainers package dependency
+
+All implementations provide thread-safe variants for concurrent access and share the same core API.
+
+## Limitations of Current Implementations
+
+Despite having multiple implementations optimized for different use cases, all current implementations share some limitations:

 ### Memory Constraints

@@ -13,18 +46,6 @@ The current implementation uses Python's native data structures (dictionaries, l
 - **No compression**: Data is stored uncompressed, using more memory than necessary
 - **Copy semantics**: Range queries and other operations create copies of data

-### Performance Limitations
-
-- **TimeBasedStorage (sorted list/dictionary)**:
-  - O(n) insertion time as items must maintain sort order
-  - Not optimized for very large datasets (>100K entries)
-  - Full scan required for some operations
-
-- **TimeBasedStorageHeap**:
-  - O(n log n) for range queries which requires scanning the entire heap
-  - Inefficient for latest event access (requires a full heap traversal)
-  - Extra overhead for maintaining heap property
-
 ### Persistence Issues

 - **No built-in persistence**: Data is lost when the program terminates
@@ -58,11 +79,6 @@ The current implementation uses Python's native data structures (dictionaries, l
   - Better for larger datasets with frequent range access
   - More complex implementation than current approach

-- **Red-Black Trees**:
-  - Self-balancing with guaranteed O(log n) operations
-  - Consistent performance regardless of data distribution
-  - More complex than binary search trees
-
 - **Skip Lists**:
   - Probabilistic alternative to balanced trees
   - O(log n) average operations with simpler implementation
@@ -197,17 +213,22 @@ The current implementation uses Python's native data structures (dictionaries, l

 ### For Small to Medium-Scale Applications

-1. **Add Persistence Layer**:
+1. **Use the Right Implementation for Your Needs**:
+   - **TimeBasedStorage**: Simple use cases with small datasets
+   - **TimeBasedStorageHeap**: When you need fast insertion and earliest-event access
+   - **TimeBasedStorageRBTree**: When you need balanced performance and frequent range queries
+
+2. **Add Persistence Layer**:
    - Implement serialization/deserialization to/from disk
    - Consider using pickle, JSON, or MessagePack
    - Add options for periodic automatic saving

-2. **Implement Time-Based Partitioning**:
+3. **Implement Time-Based Partitioning**:
    - Separate storage by time periods (days/weeks/months)
    - Enable efficient archiving of older data
    - Reduce memory usage for full dataset

-3. **Add TTL and Cleanup**:
+4. **Add TTL and Cleanup**:
    - Automatic pruning of old data
    - Configurable retention policies
    - Background cleanup process
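
The persistence and TTL recommendations in this hunk can be illustrated with a small standard-library sketch. Everything here is an assumption made for illustration: the function names `save_events`, `load_events`, and `prune_expired` are hypothetical, and a plain `dict` keyed by `datetime` stands in for the package's storage classes, whose internals this diff does not show.

```python
import json
import tempfile
from datetime import datetime, timedelta
from pathlib import Path
from typing import Dict


def save_events(events: Dict[datetime, str], path: Path) -> None:
    """Serialize events to JSON, encoding timestamps as ISO 8601 strings."""
    payload = {ts.isoformat(): value for ts, value in events.items()}
    path.write_text(json.dumps(payload), encoding="utf-8")


def load_events(path: Path) -> Dict[datetime, str]:
    """Load events written by save_events, restoring datetime keys."""
    payload = json.loads(path.read_text(encoding="utf-8"))
    return {datetime.fromisoformat(ts): value for ts, value in payload.items()}


def prune_expired(events: Dict[datetime, str], retention: timedelta, now: datetime) -> int:
    """Delete events older than `retention` relative to `now`; return the count removed."""
    cutoff = now - retention
    expired = [ts for ts in events if ts < cutoff]
    for ts in expired:
        del events[ts]
    return len(expired)


# Hypothetical data: one event falls outside a 7-day retention window.
now = datetime(2024, 1, 10, 12, 0, 0)
events = {
    now - timedelta(days=9): "old reading",
    now - timedelta(days=2): "recent reading",
    now: "latest reading",
}

removed = prune_expired(events, retention=timedelta(days=7), now=now)

# Round-trip the remaining events through a temporary JSON snapshot.
with tempfile.TemporaryDirectory() as tmp:
    snapshot = Path(tmp) / "events.json"
    save_events(events, snapshot)
    restored = load_events(snapshot)
```

A background cleanup process could call a function like `prune_expired` on a timer. JSON keeps the snapshot human-readable, but it only suits values that are JSON-serializable.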
@@ -248,7 +269,7 @@ The current implementation uses Python's native data structures (dictionaries, l

 ## Conclusion

-The current implementation with Python's native data structures provides a simple, easy-to-understand approach for time-based storage. However, as requirements grow in terms of data volume, query complexity, or performance needs, alternative approaches may become necessary.
+The current implementations provide options for different use cases, with the Red-Black Tree offering a good balance of performance for most scenarios. However, as requirements grow in terms of data volume, query complexity, or performance needs, alternative approaches may become necessary.

 The right choice depends on specific requirements:
 - Data volume and growth rate

docs/architecture.md

Lines changed: 65 additions & 10 deletions
@@ -11,11 +11,13 @@ time_based_storage/
 ├── core/
 │   ├── __init__.py
 │   ├── base.py # Base implementation
-│   └── heap.py # Heap-based implementation
+│   ├── heap.py # Heap-based implementation
+│   └── rbtree.py # Red-Black Tree implementation
 └── concurrent/
     ├── __init__.py
     ├── thread_safe.py # Thread-safe wrapper for base implementation
-    └── thread_safe_heap.py # Thread-safe wrapper for heap implementation
+    ├── thread_safe_heap.py # Thread-safe wrapper for heap implementation
+    └── thread_safe_rbtree.py # Thread-safe wrapper for RB-Tree implementation
 ```

 ## Class Hierarchy
@@ -59,6 +61,20 @@ classDiagram
         +is_empty(): bool
     }

+    class TimeBasedStorageRBTree~T~ {
+        -SortedDict _storage
+        +add(timestamp, value): void
+        +get_range(start_time, end_time): List~T~
+        +get_duration(duration): List~T~
+        +get_value_at(timestamp): Optional~T~
+        +remove(timestamp): bool
+        +clear(): void
+        +get_all(): List~T~
+        +get_timestamps(): List~datetime~
+        +size(): int
+        +is_empty(): bool
+    }
+
     class ThreadSafeTimeBasedStorage~T~ {
         -RLock _lock
         -Condition _condition
@@ -79,12 +95,25 @@ classDiagram
         +notify_data_available(): void
     }

+    class ThreadSafeTimeBasedStorageRBTree~T~ {
+        -RLock _lock
+        -Condition _condition
+        +add(timestamp, value): void
+        +get_range(start_time, end_time): List~T~
+        +get_duration(duration): List~T~
+        +wait_for_data(timeout): bool
+        +notify_data_available(): void
+    }
+
     Generic~T~ <|-- TimeBasedStorage~T~
     Generic~T~ <|-- TimeBasedStorageHeap~T~
+    Generic~T~ <|-- TimeBasedStorageRBTree~T~
     TimeBasedStorage~T~ <|-- ThreadSafeTimeBasedStorage~T~
     TimeBasedStorageHeap~T~ <|-- ThreadSafeTimeBasedStorageHeap~T~
+    TimeBasedStorageRBTree~T~ <|-- ThreadSafeTimeBasedStorageRBTree~T~
     Generic~T~ <|-- ThreadSafeTimeBasedStorage~T~
     Generic~T~ <|-- ThreadSafeTimeBasedStorageHeap~T~
+    Generic~T~ <|-- ThreadSafeTimeBasedStorageRBTree~T~
 ```

 ### Core Components
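
The thread-safe classes in the diagram above share a shape: a private `RLock`, a `Condition`, and the `wait_for_data` / `notify_data_available` pair. The sketch below shows one way that shape could be realized with the standard `threading` module. It is not the package's code: `SimpleStorage` and `ThreadSafeSimpleStorage` are hypothetical stand-ins, and integer timestamps replace `datetime` for brevity.

```python
import threading
from typing import Dict, Generic, List, Optional, TypeVar

T = TypeVar("T")


class SimpleStorage(Generic[T]):
    """Hypothetical stand-in for a core storage class (not the package's code)."""

    def __init__(self) -> None:
        self._storage: Dict[int, T] = {}

    def add(self, timestamp: int, value: T) -> None:
        self._storage[timestamp] = value

    def get_range(self, start_time: int, end_time: int) -> List[T]:
        return [self._storage[ts] for ts in sorted(self._storage) if start_time <= ts <= end_time]


class ThreadSafeSimpleStorage(SimpleStorage[T]):
    """Guards each operation with an RLock and signals waiting readers via a Condition."""

    def __init__(self) -> None:
        super().__init__()
        self._lock = threading.RLock()
        # The Condition shares the RLock, so nested acquisition is re-entrant.
        self._condition = threading.Condition(self._lock)

    def add(self, timestamp: int, value: T) -> None:
        with self._lock:
            super().add(timestamp, value)
            self.notify_data_available()

    def get_range(self, start_time: int, end_time: int) -> List[T]:
        with self._lock:
            return super().get_range(start_time, end_time)

    def wait_for_data(self, timeout: Optional[float] = None) -> bool:
        """Block until the storage is non-empty or the timeout elapses."""
        with self._condition:
            return self._condition.wait_for(lambda: bool(self._storage), timeout=timeout)

    def notify_data_available(self) -> None:
        with self._condition:
            self._condition.notify_all()


# A producer thread adds one event while the main thread waits for it.
storage: ThreadSafeSimpleStorage[str] = ThreadSafeSimpleStorage()
producer = threading.Thread(target=storage.add, args=(1, "first event"))
producer.start()
got_data = storage.wait_for_data(timeout=5.0)
producer.join()
values = storage.get_range(0, 10)
```

Using an `RLock` rather than a plain `Lock` lets `add` call `notify_data_available` while already holding the lock. That is one plausible reason for the `RLock` member in the diagram, though the diff itself does not state the motivation.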
@@ -99,6 +128,11 @@ classDiagram
    - Optimized for accessing the earliest event
    - Maintains partial ordering based on timestamps

+3. **`TimeBasedStorageRBTree` (core/rbtree.py)**
+   - Uses Red-Black Tree implementation through SortedDict
+   - Balanced for both insertion and range queries
+   - Maintains full ordering with efficient operations
+
 ### Concurrent Components

 1. **`ThreadSafeTimeBasedStorage` (concurrent/thread_safe.py)**
@@ -111,14 +145,20 @@ classDiagram
    - Uses locks to ensure thread safety
    - Maintains the efficiency of the underlying heap

+3. **`ThreadSafeTimeBasedStorageRBTree` (concurrent/thread_safe_rbtree.py)**
+   - Thread-safe wrapper around TimeBasedStorageRBTree
+   - Uses locks to ensure thread safety
+   - Preserves the balanced performance characteristics of the RB-Tree
+
 ## Design Decisions

 ### 1. Implementation Variants

-Two different implementations were created to support different access patterns:
+Three different implementations were created to support different access patterns:

-- **List-based implementation**: Prioritizes efficient range queries and timestamp lookups, with O(log n) complexity for these operations but O(n) for insertions.
+- **Dictionary-based implementation**: Prioritizes efficient range queries and timestamp lookups, with O(1) for lookups but O(n) for insertions.
 - **Heap-based implementation**: Prioritizes efficient insertion and earliest event access, with O(log n) complexity for insertions but O(n log n) for range queries.
+- **Red-Black Tree implementation**: Provides balanced performance with O(log n) for both insertions and range queries, making it suitable for a wide range of use cases.

 This allows users to choose the implementation that best matches their access patterns.
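
To make the access-pattern trade-off concrete, the sketch below runs the same range query three ways using only the standard library. It is a hedged illustration, not the package's implementation. The function names are hypothetical and integer timestamps stand in for `datetime`. A sorted list searched with `bisect` approximates the ordered-key behavior that the documentation attributes to `SortedDict`; unlike `SortedDict`, `insort` on a plain list costs O(n) per insertion.

```python
import heapq
from bisect import bisect_left, bisect_right, insort
from typing import Dict, List, Tuple


def range_from_dict(storage: Dict[int, str], start: int, end: int) -> List[str]:
    """Dictionary-backed: sort all keys, then filter (touches every entry)."""
    return [storage[ts] for ts in sorted(storage) if start <= ts <= end]


def range_from_heap(heap: List[Tuple[int, str]], start: int, end: int) -> List[str]:
    """Heap-backed: a heap is only partially ordered, so scan it and sort the matches."""
    matches = sorted(item for item in heap if start <= item[0] <= end)
    return [value for _, value in matches]


def range_from_sorted(keys: List[int], values: Dict[int, str], start: int, end: int) -> List[str]:
    """Sorted-key-backed: two binary searches bound the slice, O(log n + k)."""
    lo = bisect_left(keys, start)
    hi = bisect_right(keys, end)
    return [values[ts] for ts in keys[lo:hi]]


# Hypothetical events inserted out of order.
events = [(50, "e"), (10, "a"), (40, "d"), (20, "b"), (30, "c")]

dict_storage = dict(events)

heap_storage: List[Tuple[int, str]] = []
for item in events:
    heapq.heappush(heap_storage, item)  # O(log n) per insertion

sorted_keys: List[int] = []
for ts, _ in events:
    insort(sorted_keys, ts)  # keeps the key list ordered

results = [
    range_from_dict(dict_storage, 20, 40),
    range_from_heap(heap_storage, 20, 40),
    range_from_sorted(sorted_keys, dict_storage, 20, 40),
]
earliest = heap_storage[0]  # O(1) access to the smallest timestamp
```

All three calls return the same events. They differ in how much of the data each one has to touch, which is the basis for the complexity figures quoted in the list above.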

@@ -156,17 +196,21 @@ This approach allows users to decide how to handle conflicts rather than silentl

 ### Storage Backend

-Both implementations use Python's built-in data structures:
+The implementations use different data structures:

 1. **TimeBasedStorage**:
-   - Uses a dictionary (`self.values`) for O(1) lookup by timestamp
-   - Maintains a sorted list of timestamps (`self.timestamps`)
-   - Uses binary search for range queries
+   - Uses a dictionary (`self._storage`) for O(1) lookup by timestamp
+   - Uses sorted key iteration for range queries

 2. **TimeBasedStorageHeap**:
    - Uses a binary min-heap for fast insertion and earliest event access
    - Uses a dictionary for direct timestamp lookup

+3. **TimeBasedStorageRBTree**:
+   - Uses a SortedDict from the sortedcontainers package
+   - Provides O(log n) operations for most operations
+   - Enables efficient range queries through key slicing operations
+
 ### Thread Safety Implementation

 Thread safety is achieved using Python's threading primitives:
@@ -189,15 +233,15 @@ The library follows these error handling principles:
 3. **Idempotent operations**: Some operations are designed to be safely repeated
 4. **Clear error messages**: Error messages clearly indicate the issue

-## Performance Considerations
+## Performance Characteristics

 ### TimeBasedStorage

 - **Space complexity**: O(n) where n is the number of stored events
 - **Time complexity**:
   - Insertion: O(n) due to maintaining sorted order
   - Lookup by timestamp: O(1)
-  - Range queries: O(log n) using binary search
+  - Range queries: O(n) linear scan through sorted dictionary keys
   - Iteration: O(1) for accessing all events

 ### TimeBasedStorageHeap
@@ -209,6 +253,16 @@ The library follows these error handling principles:
   - Range queries: O(n log n)
   - Earliest event access: O(1)

+### TimeBasedStorageRBTree
+
+- **Space complexity**: O(n)
+- **Time complexity**:
+  - Insertion: O(log n) using Red-Black Tree
+  - Lookup by timestamp: O(log n)
+  - Range queries: O(log n + k) where k is the number of items in range
+  - Iteration: O(1) for accessing all events
+  - Benchmark results: Up to 470x faster for small targeted range queries
+
 ## Testing Strategy

 The library employs a comprehensive testing strategy:
@@ -218,6 +272,7 @@ The library employs a comprehensive testing strategy:
 3. **Concurrency tests** for thread-safe variants
 4. **Stress tests** to ensure performance under load
 5. **Edge case tests** for boundary conditions
+6. **Benchmark tests** to compare performance between implementations

 ## Future Improvements

docs/concurrent_use_cases.md

Lines changed: 1 addition & 0 deletions
@@ -79,6 +79,7 @@
 - Read-write locks for different operations
 - Fine-grained locking for better concurrency
 - Deadlock prevention strategies
+- Currently implemented in ThreadSafeTimeBasedStorage, ThreadSafeTimeBasedStorageHeap, and ThreadSafeTimeBasedStorageRBTree

 ### 2. Lock-Free Data Structures
 - Atomic operations where possible
