docs/alternatives.md: 44 additions & 23 deletions
@@ -2,9 +2,42 @@

 This document explores the limitations of the current implementation of time-based storage using Python's native data structures and discusses alternative approaches with their respective trade-offs.

-## Limitations of Current Implementation
+## Current Implementations

-The current implementation uses Python's native data structures (dictionaries, lists, and heaps), which come with several limitations:
+The time-based storage package currently provides three implementations with different performance characteristics:
+
+### 1. TimeBasedStorage (Dictionary-Based)
+
+- **Data structure**: Python dictionaries with timestamp keys
+- **Characteristics**:
+  - Simple implementation with minimal dependencies
+  - O(1) lookup for specific timestamps
+  - O(n) insertion time due to maintaining sorted order
+  - Works well for small to medium datasets
+
+### 2. TimeBasedStorageHeap (Heap-Based)
+
+- **Data structure**: Python's heapq module with a min-heap
+- **Characteristics**:
+  - O(log n) insertion time
+  - O(1) access to earliest event
+  - O(n log n) for range queries
+  - Efficient for event processing where earliest events are prioritized
+
+### 3. TimeBasedStorageRBTree (Red-Black Tree)
+
+- **Data structure**: `SortedDict` from the sortedcontainers package, used in the Red-Black Tree role (sortedcontainers itself is built on sorted lists rather than a literal Red-Black Tree)
+- **Characteristics**:
+  - Balanced O(log n) performance for both insertions and queries
+  - Efficient O(log n + k) range queries where k is the number of items in range
+  - Up to 470x speedup for small targeted range queries compared to the dictionary-based implementation
+  - Requires the sortedcontainers package dependency
+
+All implementations provide thread-safe variants for concurrent access and share the same core API.
+
+## Limitations of Current Implementations
+
+Despite having multiple implementations optimized for different use cases, all current implementations share some limitations:

 ### Memory Constraints

@@ -13,18 +46,6 @@ The current implementation uses Python's native data structures (dictionaries, l
 - **No compression**: Data is stored uncompressed, using more memory than necessary
 - **Copy semantics**: Range queries and other operations create copies of data

-### Performance Limitations
-
-- **TimeBasedStorage (sorted list/dictionary)**:
-  - O(n) insertion time as items must maintain sort order
-  - Not optimized for very large datasets (>100K entries)
-  - Full scan required for some operations
-
-- **TimeBasedStorageHeap**:
-  - O(n log n) for range queries which requires scanning the entire heap
-  - Inefficient for latest event access (requires a full heap traversal)
-  - Extra overhead for maintaining heap property
-
 ### Persistence Issues

 - **No built-in persistence**: Data is lost when the program terminates
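
The persistence limitation listed above (data is lost when the program terminates) could be addressed in several ways. The following is only a minimal sketch using JSON from the standard library. The helpers `save_events` and `load_events` are hypothetical names, not part of the package, and the sketch assumes events are a plain dict of integer timestamps to JSON-serializable values.

```python
import json
import os
import tempfile


def save_events(events: dict, path: str) -> None:
    """Write a {timestamp: value} mapping to disk as JSON (hypothetical helper)."""
    # JSON object keys must be strings, so store [timestamp, value] pairs instead.
    pairs = sorted(events.items())
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first, then rename it into place, so a crash
    # in the middle of a write does not leave a truncated file behind.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(pairs, handle)
    os.replace(tmp_path, path)


def load_events(path: str) -> dict:
    """Read back the mapping written by save_events (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        return {timestamp: value for timestamp, value in json.load(handle)}
```

Storing pairs rather than a JSON object keeps integer timestamps as integers on reload. Pickle or MessagePack would trade this readability for speed or compactness.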
@@ -58,11 +79,6 @@ The current implementation uses Python's native data structures (dictionaries, l
   - Better for larger datasets with frequent range access
   - More complex implementation than current approach

-- **Red-Black Trees**:
-  - Self-balancing with guaranteed O(log n) operations
-  - Consistent performance regardless of data distribution
-  - More complex than binary search trees
-
 - **Skip Lists**:
   - Probabilistic alternative to balanced trees
   - O(log n) average operations with simpler implementation
@@ -197,17 +213,22 @@ The current implementation uses Python's native data structures (dictionaries, l

 ### For Small to Medium-Scale Applications

-1. **Add Persistence Layer**:
+1. **Use the Right Implementation for Your Needs**:
+   - **TimeBasedStorage**: Simple use cases with small datasets
+   - **TimeBasedStorageHeap**: When you need fast insertion and earliest-event access
+   - **TimeBasedStorageRBTree**: When you need balanced performance and frequent range queries
+
+2. **Add Persistence Layer**:
    - Implement serialization/deserialization to/from disk
    - Consider using pickle, JSON, or MessagePack
    - Add options for periodic automatic saving

-2. **Implement Time-Based Partitioning**:
+3. **Implement Time-Based Partitioning**:
    - Separate storage by time periods (days/weeks/months)
    - Enable efficient archiving of older data
    - Reduce memory usage for full dataset

-3. **Add TTL and Cleanup**:
+4. **Add TTL and Cleanup**:
    - Automatic pruning of old data
    - Configurable retention policies
    - Background cleanup process
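
The partitioning and TTL recommendations above can be combined: if events are bucketed by time period, pruning old data reduces to dropping whole buckets. The sketch below assumes integer Unix timestamps in seconds and one bucket per day. `PartitionedEvents` and its methods are hypothetical names for illustration and do not exist in the package.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86_400


class PartitionedEvents:
    """Hypothetical sketch: events bucketed by day, with retention pruning."""

    def __init__(self, retention_days: int) -> None:
        self.retention_days = retention_days
        # Maps a day index to that day's {timestamp: value} bucket.
        self._partitions = defaultdict(dict)

    def add(self, timestamp: int, value: object) -> None:
        self._partitions[timestamp // SECONDS_PER_DAY][timestamp] = value

    def prune(self, now: int) -> int:
        """Drop partitions older than the retention window; return how many were dropped."""
        oldest_kept_day = now // SECONDS_PER_DAY - self.retention_days
        expired = [day for day in self._partitions if day < oldest_kept_day]
        for day in expired:
            del self._partitions[day]
        return len(expired)

    def __len__(self) -> int:
        return sum(len(bucket) for bucket in self._partitions.values())
```

A background cleanup process would then only need to call `prune` on a timer. The cost of a prune depends on the number of partitions rather than the number of events.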
@@ -248,7 +269,7 @@ The current implementation uses Python's native data structures (dictionaries, l

 ## Conclusion

-The current implementation with Python's native data structures provides a simple, easy-to-understand approach for time-based storage. However, as requirements grow in terms of data volume, query complexity, or performance needs, alternative approaches may become necessary.
+The current implementations provide options for different use cases, with the Red-Black Tree offering a good balance of performance for most scenarios. However, as requirements grow in terms of data volume, query complexity, or performance needs, alternative approaches may become necessary.

 The right choice depends on specific requirements:
+- Thread-safe wrapper around TimeBasedStorageRBTree
+- Uses locks to ensure thread safety
+- Preserves the balanced performance characteristics of the RB-Tree
+
 ## Design Decisions

 ### 1. Implementation Variants

-Two different implementations were created to support different access patterns:
+Three different implementations were created to support different access patterns:

-- **List-based implementation**: Prioritizes efficient range queries and timestamp lookups, with O(log n) complexity for these operations but O(n) for insertions.
+- **Dictionary-based implementation**: Prioritizes simplicity and fast timestamp lookups, with O(1) lookups but O(n) insertions and range queries.
 - **Heap-based implementation**: Prioritizes efficient insertion and earliest event access, with O(log n) complexity for insertions but O(n log n) for range queries.
+- **Red-Black Tree implementation**: Provides balanced performance with O(log n) for both insertions and range queries, making it suitable for a wide range of use cases.

 This allows users to choose the implementation that best matches their access patterns.

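
To make the access-pattern trade-off above concrete, here is a small standard-library sketch. It is not the library's code: `heap_range` and `sorted_range` are hypothetical helpers, and a sorted list searched with `bisect` stands in for the ordered structure behind the tree-based variant.

```python
import bisect
import heapq

events = [(30, "c"), (10, "a"), (40, "d"), (20, "b")]

# Heap pattern: O(log n) per push, O(1) peek at the earliest event.
heap = []
for event in events:
    heapq.heappush(heap, event)
earliest = heap[0]


def heap_range(heap, start, end):
    """A heap is only ordered at its root, so filter every entry, then sort the matches."""
    return sorted(item for item in heap if start <= item[0] <= end)


# Ordered pattern: keep timestamps sorted and binary-search both ends of the
# range, which costs O(log n) to locate plus O(k) to copy the k matches.
timestamps = sorted(ts for ts, _ in events)
values = dict(events)


def sorted_range(timestamps, values, start, end):
    lo = bisect.bisect_left(timestamps, start)
    hi = bisect.bisect_right(timestamps, end)
    return [(ts, values[ts]) for ts in timestamps[lo:hi]]
```

Both helpers return the same events for a given range. They differ in how much of the structure they must touch, which is why the heap suits earliest-event processing and the ordered structure suits frequent range queries.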
@@ -156,17 +196,21 @@ This approach allows users to decide how to handle conflicts rather than silentl

 ### Storage Backend

-Both implementations use Python's built-in data structures:
+The implementations use different data structures:

 1. **TimeBasedStorage**:
-   - Uses a dictionary (`self.values`) for O(1) lookup by timestamp
-   - Maintains a sorted list of timestamps (`self.timestamps`)
-   - Uses binary search for range queries
+   - Uses a dictionary (`self._storage`) for O(1) lookup by timestamp
+   - Uses sorted key iteration for range queries

 2. **TimeBasedStorageHeap**:
    - Uses a binary min-heap for fast insertion and earliest event access
    - Uses a dictionary for direct timestamp lookup

+3. **TimeBasedStorageRBTree**:
+   - Uses a SortedDict from the sortedcontainers package
+   - Provides O(log n) time for most operations
+   - Enables efficient range queries through key slicing operations
+
 ### Thread Safety Implementation

 Thread safety is achieved using Python's threading primitives:
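
As a rough illustration of lock-based thread safety (a sketch only, not the library's actual code), the wrapper below guards a dict-backed store with `threading.RLock`. The class `LockedStore`, its methods, and the choice to raise `ValueError` on a duplicate timestamp are all assumptions made for this example.

```python
import threading


class LockedStore:
    """Hypothetical thread-safe wrapper around a plain dict-backed store."""

    def __init__(self) -> None:
        self._events = {}
        # RLock lets a method that already holds the lock call another locked method.
        self._lock = threading.RLock()

    def add(self, timestamp: int, value: object) -> None:
        # The check and the write must happen under one lock acquisition;
        # otherwise two threads could both pass the check for the same timestamp.
        with self._lock:
            if timestamp in self._events:
                raise ValueError(f"timestamp {timestamp} already exists")
            self._events[timestamp] = value

    def size(self) -> int:
        with self._lock:
            return len(self._events)
```

Holding a single lock for the whole check-then-write step is the simplest correct design. Finer-grained or read/write locking would reduce contention at the cost of complexity.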
@@ -189,15 +233,15 @@ The library follows these error handling principles:
 3. **Idempotent operations**: Some operations are designed to be safely repeated
 4. **Clear error messages**: Error messages clearly indicate the issue

-## Performance Considerations
+## Performance Characteristics

 ### TimeBasedStorage

 - **Space complexity**: O(n) where n is the number of stored events
 - **Time complexity**:
   - Insertion: O(n) due to maintaining sorted order
   - Lookup by timestamp: O(1)
-  - Range queries: O(log n) using binary search
+  - Range queries: O(n) linear scan through sorted dictionary keys
   - Iteration: O(1) per event, O(n) to traverse all events

 ### TimeBasedStorageHeap
@@ -209,6 +253,16 @@ The library follows these error handling principles:
   - Range queries: O(n log n)
   - Earliest event access: O(1)

+### TimeBasedStorageRBTree
+
+- **Space complexity**: O(n)
+- **Time complexity**:
+  - Insertion: O(log n) using Red-Black Tree
+  - Lookup by timestamp: O(log n)
+  - Range queries: O(log n + k) where k is the number of items in range
+  - Iteration: O(1) per event, O(n) to traverse all events
+  - Benchmark results: Up to 470x faster for small targeted range queries
+
 ## Testing Strategy

 The library employs a comprehensive testing strategy:
@@ -218,6 +272,7 @@ The library employs a comprehensive testing strategy:
 3. **Concurrency tests** for thread-safe variants
 4. **Stress tests** to ensure performance under load
 5. **Edge case tests** for boundary conditions
+6. **Benchmark tests** to compare performance between implementations
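
A benchmark test of the kind listed above might look like the following sketch. It uses only the standard library and compares a linear scan with a binary search on a sorted list, as a stand-in for the dictionary-based and tree-based range queries. The function names and data sizes are illustrative assumptions, and the timings it prints say nothing about the package's own 470x figure.

```python
import bisect
import timeit

timestamps = list(range(100_000))   # already sorted
start, end = 50_000, 50_010         # a small targeted range


def linear_scan():
    """O(n): examine every timestamp."""
    return [ts for ts in timestamps if start <= ts <= end]


def binary_search():
    """O(log n + k): locate both ends, then slice out the k matches."""
    lo = bisect.bisect_left(timestamps, start)
    hi = bisect.bisect_right(timestamps, end)
    return timestamps[lo:hi]


if __name__ == "__main__":
    for fn in (linear_scan, binary_search):
        seconds = timeit.timeit(fn, number=20)
        print(f"{fn.__name__}: {seconds:.4f}s for 20 runs")
```

A useful benchmark should first confirm that both strategies return identical results, then compare timings across several dataset sizes and range widths, since the gap narrows as the queried range grows.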