Skip to content

Latest commit

 

History

History
104 lines (88 loc) · 7.98 KB

File metadata and controls

104 lines (88 loc) · 7.98 KB

Similarity Scoring Benchmarks

Benchmarks for string similarity and alignment algorithms across Rust and Python implementations, including CPU and GPU variants.

Overview

Edit Distance calculation is a common component of Search Engines, Data Cleaning, and Natural Language Processing, as well as in Bioinformatics. It's a computationally expensive operation, generally implemented using dynamic programming, with a quadratic time complexity upper bound. For biological sequences, the Needleman-Wunsch and Smith-Waterman algorithms are more appropriate, as they allow overriding the default substitution costs. Each of those has two flavors - with linear and affine gap penalties, also known as the "Gotoh" variation.

Performance is measured in MCUPS (Million Cell Updates Per Second).

Levenshtein Distance

Library ~100 bytes lines ~1,000 bytes lines
Rust
bio::levenshtein on 1x SPR 428 MCUPS 823 MCUPS
rapidfuzz::levenshtein<Bytes> on 1x SPR 4,633 MCUPS 14,316 MCUPS
rapidfuzz::levenshtein<Chars> on 1x SPR 3,877 MCUPS 13,179 MCUPS
stringzillas::LevenshteinDistances on 1x SPR 3,315 MCUPS 13,084 MCUPS
stringzillas::LevenshteinDistancesUtf8 on 1x SPR 3,283 MCUPS 11,690 MCUPS
stringzillas::LevenshteinDistances on 16x SPR 29,430 MCUPS 105,400 MCUPS
stringzillas::LevenshteinDistancesUtf8 on 16x SPR 38,954 MCUPS 103,500 MCUPS
stringzillas::LevenshteinDistances on RTX6000 32,030 MCUPS 901,990 MCUPS
stringzillas::LevenshteinDistances on H100 31,913 MCUPS 925,890 MCUPS
stringzillas::LevenshteinDistances on B200 32,960 MCUPS 998,620 MCUPS
stringzillas::LevenshteinDistances on 384x GNR 114,190 MCUPS 3,084,270 MCUPS
stringzillas::LevenshteinDistancesUtf8 on 384x GNR 103,590 MCUPS 2,938,320 MCUPS
Python
nltk.edit_distance 2 MCUPS 2 MCUPS
jellyfish.levenshtein_distance 81 MCUPS 228 MCUPS
rapidfuzz.Levenshtein.distance 108 MCUPS 9,272 MCUPS
editdistance.eval 89 MCUPS 660 MCUPS
edlib.align 82 MCUPS 7,262 MCUPS
polyleven.levenshtein 89 MCUPS 3,887 MCUPS
stringzillas.LevenshteinDistances on 1x SPR 53 MCUPS 3,407 MCUPS
stringzillas.LevenshteinDistancesUTF8 on 1x SPR 57 MCUPS 3,693 MCUPS
cudf.edit_distance batch on H100 24,754 MCUPS 6,976 MCUPS
stringzillas.LevenshteinDistances batch on 1x SPR 2,343 MCUPS 12,141 MCUPS
stringzillas.LevenshteinDistances batch on 16x SPR 3,762 MCUPS 119,261 MCUPS
stringzillas.LevenshteinDistances batch on H100 18,081 MCUPS 320,109 MCUPS

Needleman-Wunsch (Global Alignment)

Library ~100 bytes lines ~1,000 bytes lines
Rust
bio::pairwise::global on 1x SPR 51 MCUPS 57 MCUPS
stringzillas::NeedlemanWunschScores on 1x SPR 278 MCUPS 612 MCUPS
stringzillas::NeedlemanWunschScores on 16x SPR 4,057 MCUPS 8,492 MCUPS
stringzillas::NeedlemanWunschScores on 384x GNR 64,290 MCUPS 331,340 MCUPS
stringzillas::NeedlemanWunschScores on H100 131 MCUPS 12,113 MCUPS
Python
biopython.PairwiseAligner.score on 1x SPR 95 MCUPS 557 MCUPS
stringzillas.NeedlemanWunschScores on 1x SPR 30 MCUPS 481 MCUPS
stringzillas.NeedlemanWunschScores batch on 1x SPR 246 MCUPS 570 MCUPS
stringzillas.NeedlemanWunschScores batch on 16x SPR 3,103 MCUPS 9,208 MCUPS
stringzillas.NeedlemanWunschScores batch on H100 127 MCUPS 12,246 MCUPS

Smith-Waterman (Local Alignment)

Library ~100 bytes lines ~1,000 bytes lines
Rust
bio::pairwise::local on 1x SPR 49 MCUPS 50 MCUPS
stringzillas::SmithWatermanScores on 1x SPR 263 MCUPS 552 MCUPS
stringzillas::SmithWatermanScores on 16x SPR 3,883 MCUPS 8,011 MCUPS
stringzillas::SmithWatermanScores on 384x GNR 58,880 MCUPS 285,480 MCUPS
stringzillas::SmithWatermanScores on H100 143 MCUPS 12,921 MCUPS
Python
biopython.PairwiseAligner.score on 1x SPR 95 MCUPS 557 MCUPS
stringzillas.SmithWatermanScores on 1x SPR 28 MCUPS 440 MCUPS
stringzillas.SmithWatermanScores batch on 1x SPR 255 MCUPS 582 MCUPS
stringzillas.SmithWatermanScores batch on 16x SPR 3,535 MCUPS 8,235 MCUPS
stringzillas.SmithWatermanScores batch on H100 130 MCUPS 12,702 MCUPS

Needleman-Wunsch-Gotoh (Affine Gap Penalties)

Library ~100 bytes lines ~1,000 bytes lines
Rust
stringzillas::NeedlemanWunschScores on 1x SPR 83 MCUPS 354 MCUPS
stringzillas::NeedlemanWunschScores on 16x SPR 1,267 MCUPS 4,694 MCUPS
stringzillas::NeedlemanWunschScores on 384x GNR 42,050 MCUPS 155,920 MCUPS
stringzillas::NeedlemanWunschScores on H100 128 MCUPS 13,799 MCUPS

Smith-Waterman-Gotoh (Local with Affine Gaps)

Library ~100 bytes lines ~1,000 bytes lines
Rust
stringzillas::SmithWatermanScores on 1x SPR 79 MCUPS 284 MCUPS
stringzillas::SmithWatermanScores on 16x SPR 1,026 MCUPS 3,776 MCUPS
stringzillas::SmithWatermanScores on 384x GNR 38,430 MCUPS 129,140 MCUPS
stringzillas::SmithWatermanScores on H100 127 MCUPS 13,205 MCUPS

See README.md for dataset information and replication instructions.