Faster Integer Parsing
tl;dr: a Rust port of https://kholdstare.github.io/technical/2020/05/26/faster-integer-parsing.html that parses a fixed 16-digit integer.
Update 2021-04-04: gilescope sent a pull request to the repository showing that using target-cpu=native can speed up the runtime performance; the results are now similar to the original blog post. https://github.com/pickfire/parseint/pull/1 The updated results are at the bottom of the post.
Note that a lot of the text is taken from the original article, along with my additional comments, but the code will be written in Rust.
The problem
Let’s say, theoretically, you have some text-based protocol, or file that contains microsecond timestamps. You need to parse these timestamps as quickly as possible. Maybe it is JSON, maybe it is a CSV file, maybe something else bespoke. It’s 16 characters long, and this could also apply to credit card numbers.
timestamp,event_id
1585201087123567,a
1585201087123585,b
1585201087123621,c
In the end you have to implement a function similar to this:
fn parse_timestamp(s: &str) -> u64 {
todo!()
}
The preparation
We will be using nightly to be able to use the test feature, which only comes with the nightly release. Use rustup toolchain add nightly to install the nightly toolchain.
We create the project with cargo new parseint --lib to get a new library crate. Next, we pin this crate to the nightly toolchain with echo nightly > rust-toolchain.
Our benches/bench.rs
will be something like the following.
#![feature(test)]
extern crate test;
use parseint::*;
use test::{black_box, Bencher};
const EXAMPLE_TIMESTAMP: &str = "1585201087123789";
const EXPECTED_TIMESTAMP: u64 = 1585201087123789;
Every time we create a new function to benchmark, we will append a block like the following to benches/bench.rs, with the benchmark name prefixed by bench_ to keep naming simple. Say we have a function str_parse (coming up next); its benchmark looks like this, and these blocks will not be repeated later.
#[bench]
fn bench_str_parse(b: &mut Bencher) {
assert_eq!(str_parse(EXAMPLE_TIMESTAMP), EXPECTED_TIMESTAMP);
b.bytes = EXAMPLE_TIMESTAMP.len() as u64;
b.iter(|| str_parse(black_box(EXAMPLE_TIMESTAMP)));
}
Having the assert_eq is a safety measure so that our functions live up to their expectations. b.bytes gives us a nice throughput figure when we run cargo bench later. black_box prevents the compiler from optimizing away the benchmarked code, which would skew the results.
The native solution
Let us start with what is available. We have str::parse from the standard library, which acts as the baseline for the comparisons later.
From now on, we will only be showing the code we write in src/lib.rs; note that we also add a benchmark block to benches/bench.rs as shown above, but we will not spell it out each time.
pub fn str_parse(s: &str) -> u64 {
s.parse().unwrap()
}
Note that comparing str::parse with the other functions is not quite fair, since it is designed to validate its input and handle numbers of different lengths. And I would say the standard library version is already quite optimized. One can look at the assembly with cargo asm parseint::str_parse --rust.
pub fn str_parse(s: &str) -> u64 {
sub rsp, 24
mov rdx, rsi
mov rsi, rdi
lea rdi, [rsp, +, 8]
FromStr::from_str(self) (libcore/str/mod.rs:4331)
call qword, ptr, [rip, +, _ZN4core3num52_$LT$impl$u20$core..str..FromStr$u20$for$u20$u64$GT$8from_str17hb6ec24c4ee2a5da7E@GOTPCREL]
Ok(t) => t, (libcore/result.rs:1004)
cmp byte, ptr, [rsp, +, 8], 1
je .LBB5_1
Ok(t) => t, (libcore/result.rs:1004)
mov rax, qword, ptr, [rsp, +, 16]
}
add rsp, 24
ret
.LBB5_1:
Err(e) => unwrap_failed("called `Result::unwrap()` on an `Err` value", &e), (libcore/result.rs:1005)
mov al, byte, ptr, [rsp, +, 9]
mov byte, ptr, [rsp, +, 7], al
Err(e) => unwrap_failed("called `Result::unwrap()` on an `Err` value", &e), (libcore/result.rs:1005)
lea rdi, [rip, +, .L__unnamed_2]
lea rcx, [rip, +, .L__unnamed_1]
lea r8, [rip, +, .L__unnamed_3]
lea rdx, [rsp, +, 7]
mov esi, 43
call qword, ptr, [rip, +, _ZN4core6result13unwrap_failed17h9507895bc765c906E@GOTPCREL]
ud2
Running cargo bench gives us:
test bench_str_parse ... bench: 23 ns/iter (+/- 1) = 695 MB/s
Since we know our string contains the number we are trying to parse, and we do not need to do any whitespace skipping, can we be faster? Just how much time is spent in validation?
The naive solution
Let us write a good old for loop. Read the string character by character, and build up the result.
pub fn naive_chars(s: &str) -> u64 {
let mut result = 0;
for digit in s.chars() {
result *= 10;
result += digit as u64 - '0' as u64;
}
result
}
test bench_naive_chars ... bench: 18 ns/iter (+/- 0) = 888 MB/s
test bench_str_parse ... bench: 23 ns/iter (+/- 1) = 695 MB/s
We can see that the naive version is already slightly faster (~27%) than the original version.
The official Rust book says that using iterators can produce code that performs at least as well as explicit loops. Let us try that out here.
pub fn naive_chars_iter(s: &str) -> u64 {
s.chars().fold(0, |a, c| a * 10 + c as u64 - '0' as u64)
}
It does not produce faster code here, but the code is now shorter and, in my opinion, slightly easier to read.
Somewhere during a conversation in Rust 众, a Rust Telegram group with people mainly from China (Taiwan has a separate group, rust.tw), someone mentioned that it may be faster to get the value of a digit by using a bitwise AND. Looking at the ASCII table (man ascii), I noticed that every digit character ends with the digit's own value.
2 3 4 5 6 7 30 40 50 60 70 80 90 100 110 120
------------- ---------------------------------
0: 0 @ P ` p 0: ( 2 < F P Z d n x
1: ! 1 A Q a q 1: ) 3 = G Q [ e o y
2: " 2 B R b r 2: * 4 > H R \ f p z
3: # 3 C S c s 3: ! + 5 ? I S ] g q {
4: $ 4 D T d t 4: " , 6 @ J T ^ h r |
5: % 5 E U e u 5: # - 7 A K U _ i s }
6: & 6 F V f v 6: $ . 8 B L V ` j t ~
7: ' 7 G W g w 7: % / 9 C M W a k u DEL
8: ( 8 H X h x 8: & 0 : D N X b l v
9: ) 9 I Y i y 9: ' 1 ; E O Y c m w
A: * : J Z j z
B: + ; K [ k {
C: , < L \ l |
D: - = M ] m }
E: . > N ^ n ~
F: / ? O _ o DEL
Here, the digit 0 is decimal 48, hexadecimal 0x30, binary 0b0011_0000. Going up to 9: decimal 57, hexadecimal 0x39, binary 0b0011_1001. Notice that in the hexadecimal and binary forms the digit's value sits in the low nibble (0000 to 1001), so we can do a & 0x0f to keep just the last 4 bits.
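To make this concrete, here is a tiny check (my own illustration, not part of the benchmarked code; nibble_demo is a throwaway name):
fn nibble_demo() {
    // The low nibble of an ASCII digit is the digit's own value.
    assert_eq!(b'0', 0b0011_0000);
    assert_eq!(b'9', 0b0011_1001);
    assert_eq!(b'7' & 0x0f, 7);
    assert_eq!(b'7' - b'0', 7); // same result as the subtraction used earlier
}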
Being naive here, we simply ignore the characters after 9 in the same column of the ASCII table (: ; < = > ?) that would also slip through. Let us add this and see if it is faster.
pub fn naive_chars_and(s: &str) -> u64 {
s.chars().fold(0, |a, c| a * 10 + (c as u8 & 0x0f) as u64)
}
test bench_naive_chars ... bench: 18 ns/iter (+/- 0) = 888 MB/s
test bench_naive_chars_and ... bench: 17 ns/iter (+/- 1) = 941 MB/s
test bench_naive_chars_iter ... bench: 18 ns/iter (+/- 1) = 888 MB/s
test bench_str_parse ... bench: 23 ns/iter (+/- 1) = 695 MB/s
Nice, a 1 ns improvement - quite good for just changing an operator from minus to bitwise AND.
The bytes solution
Anyone who has started to explore parsing and string manipulation in Rust has likely heard that iterating over bytes is faster than iterating over characters.
This is because a Rust String is always UTF-8 encoded, and String::chars iterates over char, which represents a Unicode scalar value. But we know that 0 to 9 in ASCII are always a single byte, so String::bytes, which returns an iterator over u8, will work for us. I bet this is faster; let us add that.
pub fn naive_bytes(s: &str) -> u64 {
let mut result = 0;
for digit in s.bytes() {
result *= 10;
result += (digit - b'0') as u64;
}
result
}
pub fn naive_bytes_iter(s: &str) -> u64 {
s.bytes().fold(0, |a, c| a * 10 + (c - b'0') as u64)
}
pub fn naive_bytes_and(s: &str) -> u64 {
s.bytes().fold(0, |a, c| a * 10 + (c & 0x0f) as u64)
}
It is a very simple change: chars becomes bytes, and '0' becomes b'0'. But doing this improves the benchmark by roughly another 70%, a huge drop of 7 ns. We have now cut more than half the time off the baseline.
test bench_naive_bytes ... bench: 11 ns/iter (+/- 0) = 1454 MB/s
test bench_naive_bytes_and ... bench: 10 ns/iter (+/- 0) = 1600 MB/s
test bench_naive_bytes_iter ... bench: 11 ns/iter (+/- 0) = 1454 MB/s
test bench_naive_chars ... bench: 18 ns/iter (+/- 0) = 888 MB/s
test bench_naive_chars_and ... bench: 17 ns/iter (+/- 1) = 941 MB/s
test bench_naive_chars_iter ... bench: 18 ns/iter (+/- 1) = 888 MB/s
test bench_str_parse ... bench: 23 ns/iter (+/- 1) = 695 MB/s
The anti bound checking solution
After completing the port (including the parts below), I shared the results in the simd-json issues. @Licenser did a review and gave some feedback to improve the performance. One suggestion was to reduce bounds checking by specifying the number of bytes to take. I was surprised that this could further improve the result of naive_bytes_and.
pub fn naive_bytes_and_c16(s: &str) -> u64 {
s.bytes()
.take(16)
.fold(0, |a, c| a * 10 + (c & 0x0f) as u64)
}
It does look obvious to me that we are working with 16 characters, but the compiler does not know that. I really hope bounds checking like this can be removed automatically some day, so that users do not need to go godbolting to hunt down bounds checks (if there even is a way). For now, we can do it by calling take on the Iterator. I am surprised that this improves the timing by another 2 ns (20%).
test bench_naive_bytes ... bench: 11 ns/iter (+/- 0) = 1454 MB/s
test bench_naive_bytes_and ... bench: 10 ns/iter (+/- 0) = 1600 MB/s
test bench_naive_bytes_and_c16 ... bench: 8 ns/iter (+/- 0) = 2000 MB/s
test bench_naive_bytes_iter ... bench: 11 ns/iter (+/- 0) = 1454 MB/s
test bench_naive_chars ... bench: 18 ns/iter (+/- 0) = 888 MB/s
test bench_naive_chars_and ... bench: 17 ns/iter (+/- 1) = 941 MB/s
test bench_naive_chars_iter ... bench: 18 ns/iter (+/- 1) = 888 MB/s
test bench_str_parse ... bench: 23 ns/iter (+/- 1) = 695 MB/s
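As an aside of mine (not something benchmarked in this post nor suggested in the pull request), another way to hand the compiler the length up front is to convert the slice into a fixed-size array first, roughly like this sketch:
use std::convert::TryInto;

// Hypothetical variant: iterate over a [u8; 16], whose length is known at
// compile time, so the fold itself needs no bounds checks.
pub fn naive_bytes_and_arr(s: &str) -> u64 {
    let bytes: [u8; 16] = s.as_bytes()[..16].try_into().unwrap();
    bytes.iter().fold(0, |a, &c| a * 10 + (c & 0x0f) as u64)
}
The unwrap still carries a panic branch for short input, but the loop body itself is free of checks.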
The brute force solution
If we know it is 16 bytes, why even have a for loop? Let us unroll it!
pub fn unrolled(s: &str) -> u64 {
let mut result = 0;
let bytes = s.as_bytes();
result += (bytes[0] - b'0') as u64 * 1000000000000000;
result += (bytes[1] - b'0') as u64 * 100000000000000;
result += (bytes[2] - b'0') as u64 * 10000000000000;
result += (bytes[3] - b'0') as u64 * 1000000000000;
result += (bytes[4] - b'0') as u64 * 100000000000;
result += (bytes[5] - b'0') as u64 * 10000000000;
result += (bytes[6] - b'0') as u64 * 1000000000;
result += (bytes[7] - b'0') as u64 * 100000000;
result += (bytes[8] - b'0') as u64 * 10000000;
result += (bytes[9] - b'0') as u64 * 1000000;
result += (bytes[10] - b'0') as u64 * 100000;
result += (bytes[11] - b'0') as u64 * 10000;
result += (bytes[12] - b'0') as u64 * 1000;
result += (bytes[13] - b'0') as u64 * 100;
result += (bytes[14] - b'0') as u64 * 10;
result += (bytes[15] - b'0') as u64;
result
}
The original article shows that unrolling makes the code faster, but it looks like the compiler is more clever than we thought: it already auto-vectorizes the code before this. 16 ns! It is way slower than any implementation using byte iterators; we can see this from those movzx instructions in cargo asm parseint::naive_bytes_and_c16.
test bench_naive_bytes ... bench: 11 ns/iter (+/- 0) = 1454 MB/s
test bench_naive_bytes_and ... bench: 10 ns/iter (+/- 0) = 1600 MB/s
test bench_naive_bytes_and_c16 ... bench: 8 ns/iter (+/- 0) = 2000 MB/s
test bench_naive_bytes_iter ... bench: 11 ns/iter (+/- 0) = 1454 MB/s
test bench_naive_chars ... bench: 18 ns/iter (+/- 0) = 888 MB/s
test bench_naive_chars_and ... bench: 17 ns/iter (+/- 1) = 941 MB/s
test bench_naive_chars_iter ... bench: 18 ns/iter (+/- 1) = 888 MB/s
test bench_str_parse ... bench: 23 ns/iter (+/- 1) = 695 MB/s
test bench_unrolled ... bench: 16 ns/iter (+/- 0) = 1000 MB/s
I do not trust my eyes looking at this. I have heard others who always blame the safe code, claiming that using unsafe makes it faster. get_unchecked is a simple bit of unsafe, and we cannot go wrong by using it here, right?
pub fn unrolled_unsafe(s: &str) -> u64 {
let mut result = 0;
let bytes = s.as_bytes();
result += (unsafe { bytes.get_unchecked(0) } - b'0') as u64 * 1000000000000000;
result += (unsafe { bytes.get_unchecked(1) } - b'0') as u64 * 100000000000000;
result += (unsafe { bytes.get_unchecked(2) } - b'0') as u64 * 10000000000000;
result += (unsafe { bytes.get_unchecked(3) } - b'0') as u64 * 1000000000000;
result += (unsafe { bytes.get_unchecked(4) } - b'0') as u64 * 100000000000;
result += (unsafe { bytes.get_unchecked(5) } - b'0') as u64 * 10000000000;
result += (unsafe { bytes.get_unchecked(6) } - b'0') as u64 * 1000000000;
result += (unsafe { bytes.get_unchecked(7) } - b'0') as u64 * 100000000;
result += (unsafe { bytes.get_unchecked(8) } - b'0') as u64 * 10000000;
result += (unsafe { bytes.get_unchecked(9) } - b'0') as u64 * 1000000;
result += (unsafe { bytes.get_unchecked(10) } - b'0') as u64 * 100000;
result += (unsafe { bytes.get_unchecked(11) } - b'0') as u64 * 10000;
result += (unsafe { bytes.get_unchecked(12) } - b'0') as u64 * 1000;
result += (unsafe { bytes.get_unchecked(13) } - b'0') as u64 * 100;
result += (unsafe { bytes.get_unchecked(14) } - b'0') as u64 * 10;
result += (unsafe { bytes.get_unchecked(15) } - b'0') as u64;
result
}
Oh no, unsafe makes it worse. I guess it is not always true that get_unchecked makes things faster.
test bench_unrolled ... bench: 16 ns/iter (+/- 0) = 1000 MB/s
test bench_unrolled_unsafe ... bench: 17 ns/iter (+/- 0) = 941 MB/s
Let us apply the previous trick to reduce bounds checking here. We can use get(..16) on the &str, just like take on the Iterator.
pub fn unrolled_safe(s: &str) -> u64 {
let mut result = 0;
let bytes = s.get(..16).unwrap().as_bytes();
result += (bytes[0] - b'0') as u64 * 1000000000000000;
result += (bytes[1] - b'0') as u64 * 100000000000000;
result += (bytes[2] - b'0') as u64 * 10000000000000;
result += (bytes[3] - b'0') as u64 * 1000000000000;
result += (bytes[4] - b'0') as u64 * 100000000000;
result += (bytes[5] - b'0') as u64 * 10000000000;
result += (bytes[6] - b'0') as u64 * 1000000000;
result += (bytes[7] - b'0') as u64 * 100000000;
result += (bytes[8] - b'0') as u64 * 10000000;
result += (bytes[9] - b'0') as u64 * 1000000;
result += (bytes[10] - b'0') as u64 * 100000;
result += (bytes[11] - b'0') as u64 * 10000;
result += (bytes[12] - b'0') as u64 * 1000;
result += (bytes[13] - b'0') as u64 * 100;
result += (bytes[14] - b'0') as u64 * 10;
result += (bytes[15] - b'0') as u64;
result
}
Looks like this is more worthwhile than using unsafe: it reduces the timing by 1 ns rather than increasing it by 1 ns.
test bench_unrolled ... bench: 16 ns/iter (+/- 0) = 1000 MB/s
test bench_unrolled_safe ... bench: 15 ns/iter (+/- 1) = 1066 MB/s
test bench_unrolled_unsafe ... bench: 17 ns/iter (+/- 0) = 941 MB/s
Still, loop unrolling does not help here like it did in the original article. Time to look into some other trick.
The byteswap insight
Let us draw out the operations in the unrolled solution as a tree, on a simplified example of parsing ‘1234’ into a 32-bit integer:
('1' - '0') * 1000 = 1000
('2' - '0') * 100 = 200
('3' - '0') * 10 = 30
('4' - '0') * 1 = 4
1000
+ 200
= 1200
+ 30
= 1230
+ 4
= 1234
The original version has an image of a graph, but I like ASCII art more.
We can see that the number of multiplications and additions is linear in the number of characters. It is hard to see how to improve this, because every multiplication is by a different factor (so we cannot multiply "in one go"), and at the end of the day we need to add up all the intermediate results.
However, it is still very regular. For one thing, the first character in the string is multiplied by the largest factor, because it is the most significant digit.
On a little-endian machine (like x86), an integer’s first byte contains the least significant digits, while the first byte in a string contains the most significant digit.
"1234" (octal 0x34333231)
- "0000" (octal 0x30303030)
= 0x04030201
(byteswap) -> 0x01020304
Looking at the string as an integer we can get closer to the final parsed
state in fewer operations - see how the hex representation is almost what we
want
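Here is the same arithmetic as a quick Rust illustration (using u32::from_le_bytes purely for the demo; byteswap_demo is my own throwaway name, not code from the repository):
fn byteswap_demo() {
    // "1234" and "0000" read as little-endian u32 values
    let chunk = u32::from_le_bytes(*b"1234") - u32::from_le_bytes(*b"0000");
    assert_eq!(chunk, 0x04030201); // first character in the lowest byte
    assert_eq!(chunk.swap_bytes(), 0x01020304); // most significant digit on top
}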
Now, to reinterpret the bytes of a string as an integer, we can use std::ptr::copy_nonoverlapping (similar to memcpy in C), and u64::swap_bytes swaps the bytes in a single instruction.
Note that the code below will not actually be used; it is just a demonstration and proof of concept of the idea above. It was removed from the final code, so this version is reconstructed from memory.
// dangerous act, kids should not try this at home - look with your eyes, keep your hands still, or your backside will hurt
fn parse_8_rubbish(s: &str) -> u64 {
    let mut chunk: u64 = 0;
    unsafe {
        // copy the 8 string bytes into `chunk`, like memcpy in C
        std::ptr::copy_nonoverlapping(s.as_ptr(), &mut chunk as *mut u64 as *mut u8, std::mem::size_of_val(&chunk));
    }
    let mut zeros: u64 = 0;
    unsafe {
        std::ptr::copy_nonoverlapping(b"00000000".as_ptr(), &mut zeros as *mut u64 as *mut u8, std::mem::size_of_val(&zeros));
    }
    chunk = (chunk - zeros).swap_bytes();
    todo!()
}
But now that we have an integer that kind of, sort of looks like what we want, how do we get it across the finish line without too much work?
The divide and conquer insight
From the previous step, we end up with an integer whose bit representation has each digit placed in a separate byte. Even though one byte can represent up to 256 values, we have values 0-9 in each byte of the integer. They are also in the right little endian order. Now we just need to “smash” them together somehow.
We know that doing it linearly would be too slow, so what is the next possibility? O(log(n))! We need to combine adjacent digits into pairs in one step, then pairs of digits into groups of four, and so on, until we have the entire integer.
The key is working on adjacent digits simultaneously. This allows a tree of operations, running in O(log(n)) time.
This involves multiplying the even-index digits by a power of 10 and leaving the odd-index digits alone. This can be done with bitmasks to selectively apply operations.
0x04030201
+-----------------------------|--------------------+
| 1 byte mask trick v v--- one byte |
| +-+-+-+-+ |
| |1|2|3|4| |
| +-+-+-+-+ |
| | | |
| 0x0f000f00 <--+ +--> 0x000f000f |
| | | |
| v v |
| 0 2 0 4 1 0 3 0 |
| (>> 8) 2 0 4 0 (* 10) 10 0 30 0 |
| | | |
| +-----------+------------+ |
| (+) | |
| v v--- two bytes |
| +--+--+ |
| |12|34| |
| +--+--+ |
+-----------------------------|---------------------
|
+--------------------+
| 2 bytes mask trick |
+--------------------+
|
v v--- four bytes
+------+
| 1234 |
+------+
By using bitmasking, we can apply operations to more than one digit at a time, combining them into a larger group.
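To make the diagram concrete, here is the 4-character (u32) version of those two mask steps as plain Rust (an illustrative sketch of mine; parse_4_chars_demo is not part of the actual code, the real 8-character function follows below):
fn parse_4_chars_demo() -> u32 {
    // "1234" with '0' already masked away sits in memory as [1, 2, 3, 4],
    // i.e. 0x04030201 on a little-endian machine.
    let chunk = u32::from_le_bytes([1, 2, 3, 4]);
    // 1-byte mask trick (works on 2 pairs of single digits)
    let lower_digits = (chunk & 0x0f000f00) >> 8;
    let upper_digits = (chunk & 0x000f000f) * 10;
    let chunk = lower_digits + upper_digits; // 16-bit lanes now hold 12 and 34
    // 2-byte mask trick (works on a pair of two digits)
    let lower_digits = (chunk & 0x00ff0000) >> 16;
    let upper_digits = (chunk & 0x0000ffff) * 100;
    lower_digits + upper_digits // 1234
}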
Let us finish the parse_8_rubbish function we started earlier by employing this masking trick. As a neat side effect of the masking, we do not need to subtract '0', since it gets masked away.
fn parse_8_chars(s: &str) -> u64 { // no need to benchmark this, to be used later
    let mut chunk: u64 = 0;
    unsafe {
        // copy the 8 string bytes into `chunk` (the count is in `u8` elements, i.e. bytes)
        std::ptr::copy_nonoverlapping(
            s.as_ptr(),
            &mut chunk as *mut u64 as *mut u8,
            std::mem::size_of_val(&chunk),
        );
    }
    // 1-byte mask trick (works on 4 pairs of single digits)
    let lower_digits = (chunk & 0x0f000f000f000f00) >> 8;
    let upper_digits = (chunk & 0x000f000f000f000f) * 10;
    let chunk = lower_digits + upper_digits;
    // 2-byte mask trick (works on 2 pairs of two digits)
    let lower_digits = (chunk & 0x00ff000000ff0000) >> 16;
    let upper_digits = (chunk & 0x000000ff000000ff) * 100;
    let chunk = lower_digits + upper_digits;
    // 4-byte mask trick (works on a pair of four digits)
    let lower_digits = (chunk & 0x0000ffff00000000) >> 32;
    let upper_digits = (chunk & 0x000000000000ffff) * 10000;
    let chunk = lower_digits + upper_digits;
    chunk
}
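Since parse_8_chars is private and only exercised indirectly through the benchmarks of trick below, a quick sanity check could look like this (a hypothetical test of mine, not in the original repository):
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn parse_8_chars_works() {
        assert_eq!(parse_8_chars("12345678"), 12345678);
        assert_eq!(parse_8_chars("00000000"), 0);
    }
}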
The trick
Putting it all together: to parse our 16-digit integer, we break it up into two chunks of 8 bytes, run the parse_8_chars we have just written on each half, and benchmark it!
pub fn trick(s: &str) -> u64 {
let (upper_digits, lower_digits) = s.split_at(8);
parse_8_chars(upper_digits) * 100000000 + parse_8_chars(lower_digits)
}
Let us strip the benchmark output down to the useful lines.
test bench_naive_bytes_and ... bench: 10 ns/iter (+/- 0) = 1600 MB/s
test bench_naive_bytes_and_c16 ... bench: 8 ns/iter (+/- 0) = 2000 MB/s
test bench_naive_bytes_iter ... bench: 11 ns/iter (+/- 0) = 1454 MB/s
test bench_str_parse ... bench: 23 ns/iter (+/- 1) = 695 MB/s
test bench_unrolled_safe ... bench: 15 ns/iter (+/- 1) = 1066 MB/s
test bench_trick ... bench: 6 ns/iter (+/- 0) = 2666 MB/s
Nice! Looks like the trick we pulled off shaves yet another 2 ns (25%) off the previously fastest method, naive_bytes_and_c16. How about we let the CPU do all the hard work, like in the original article?
The SIMD trick
We have the main insight:
- Combine groups of digits simultaneously to achieve O(log(n)) time
We also have a 16-character, or 128-bit, string to parse - can we use SIMD? Of course we can! SIMD stands for Single Instruction, Multiple Data, and it is exactly what we are looking for. SSE and AVX instructions are supported on both Intel and AMD CPUs, and they typically work with wider registers (they are available even on my 10-year-old laptop).
Rust does provide intrinsics in std::arch, and we need unsafe to use them; the functions in the documentation will direct you to the Intel Intrinsics Guide itself.
Let us set up the digits in each of the 16 bytes first:
pub fn trick_simd(s: &str) -> u64 {
unsafe {
let chunk = _mm_lddqu_si128(std::mem::transmute_copy(&s));
let zeros = _mm_set1_epi8(b'0' as i8);
let chunk = _mm_sub_epi16(chunk, zeros);
// ...
}
}
And now, the star of the show: the madd functions. These SIMD functions do exactly what we did with our bitmask tricks - they take a wide register, interpret it as a vector of smaller integers, multiply each by a given multiplier, and add neighboring ones together into a vector of wider integers. All in one instruction!
As an example of taking every byte, multiplying the odd ones by 10 and adding adjacent pairs together, we can use _mm_maddubs_epi16.
// previous code
// The 1-byte "trick" in one instruction
let mult = _mm_set_epi8(1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10);
let chunk = _mm_maddubs_epi16(chunk, mult);
// ...
There is another instruction for the 2-byte trick, but unfortunately I could not find one for the 4-byte trick - that needed two instructions. Here is the completed trick_simd to parse 16 characters:
use core::arch::x86_64::{
_mm_cvtsi128_si64, _mm_lddqu_si128, _mm_madd_epi16, _mm_maddubs_epi16, _mm_packus_epi32,
_mm_set1_epi8, _mm_set_epi16, _mm_set_epi8, _mm_sub_epi16,
};
pub fn trick_simd(s: &str) -> u64 {
    unsafe {
        let chunk = _mm_lddqu_si128(std::mem::transmute_copy(&s));
        // subtract b'0' from every byte to get the digit values
        let zeros = _mm_set1_epi8(b'0' as i8);
        let chunk = _mm_sub_epi16(chunk, zeros);
        // 1-byte trick: multiply bytes and add adjacent pairs into 16-bit lanes
        let mult = _mm_set_epi8(1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10);
        let chunk = _mm_maddubs_epi16(chunk, mult);
        // 2-byte trick: multiply 16-bit lanes and add adjacent pairs into 32-bit lanes
        let mult = _mm_set_epi16(1, 100, 1, 100, 1, 100, 1, 100);
        let chunk = _mm_madd_epi16(chunk, mult);
        // 4-byte trick: pack the 32-bit lanes back into 16 bits, then madd again
        let chunk = _mm_packus_epi32(chunk, chunk);
        let mult = _mm_set_epi16(0, 0, 0, 0, 1, 10000, 1, 10000);
        let chunk = _mm_madd_epi16(chunk, mult);
        // combine the two 8-digit halves in scalar code
        let chunk = _mm_cvtsi128_si64(chunk) as u64;
        ((chunk & 0xffffffff) * 100000000) + (chunk >> 32)
    }
}
test bench_naive_bytes_and ... bench: 10 ns/iter (+/- 0) = 1600 MB/s
test bench_naive_bytes_and_c16 ... bench: 8 ns/iter (+/- 0) = 2000 MB/s
test bench_naive_bytes_iter ... bench: 11 ns/iter (+/- 0) = 1454 MB/s
test bench_str_parse ... bench: 23 ns/iter (+/- 1) = 695 MB/s
test bench_unrolled_safe ... bench: 15 ns/iter (+/- 1) = 1066 MB/s
test bench_trick ... bench: 6 ns/iter (+/- 0) = 2666 MB/s
test bench_trick_simd ... bench: 8 ns/iter (+/- 1) = 2000 MB/s
8 nanoseconds! Ugly. (2 ns slower than trick, and the same as bench_naive_bytes_and_c16.)
The original article has the SIMD version performing better than the trick version, but it seems SIMD performs worse here. Maybe it is because of my old laptop. Still, it is quite surprising to see SIMD code slower than non-SIMD.
It reminds me of what my dad always said in Cantonese: 邊度有咁大隻蛤乸隨街跳呀? In English: where would you ever find a big frog hopping around the street? Before you can even find it, it has probably already ended up in someone else's stomach. (I hate eating meat, but it is just a phrase.)
With that said, SIMD is not a silver bullet for everything.
We could also try to employ the anti-bounds-checking trick here, as suggested by @Licenser.
pub fn trick_simd_c16(s: &str) -> u64 {
let d: &mut [u8; 16] = &mut b"0000000000000000".clone();
let b: &[u8] = s.as_bytes();
d.copy_from_slice(b);
unsafe {
let chunk = _mm_lddqu_si128(std::mem::transmute_copy(&d));
let zeros = _mm_set1_epi8(b'0' as i8);
let chunk = _mm_sub_epi16(chunk, zeros);
let mult = _mm_set_epi8(1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10);
let chunk = _mm_maddubs_epi16(chunk, mult);
let mult = _mm_set_epi16(1, 100, 1, 100, 1, 100, 1, 100);
let chunk = _mm_madd_epi16(chunk, mult);
let chunk = _mm_packus_epi32(chunk, chunk);
let mult = _mm_set_epi16(0, 0, 0, 0, 1, 10000, 1, 10000);
let chunk = _mm_madd_epi16(chunk, mult);
let chunk = _mm_cvtsi128_si64(chunk) as u64;
((chunk & 0xffffffff) * 100000000) + (chunk >> 32)
}
}
Sadly, it does not help much (10 ns).
test bench_naive_bytes_and ... bench: 10 ns/iter (+/- 0) = 1600 MB/s
test bench_naive_bytes_and_c16 ... bench: 8 ns/iter (+/- 0) = 2000 MB/s
test bench_naive_bytes_iter ... bench: 11 ns/iter (+/- 0) = 1454 MB/s
test bench_str_parse ... bench: 23 ns/iter (+/- 1) = 695 MB/s
test bench_unrolled_safe ... bench: 15 ns/iter (+/- 1) = 1066 MB/s
test bench_trick ... bench: 6 ns/iter (+/- 0) = 2666 MB/s
test bench_trick_simd ... bench: 8 ns/iter (+/- 1) = 2000 MB/s
test bench_trick_simd_c16 ... bench: 10 ns/iter (+/- 0) = 1600 MB/s
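A side note from me (not in the original post): _mm_lddqu_si128 needs SSE3, _mm_maddubs_epi16 needs SSSE3 and _mm_packus_epi32 needs SSE4.1, while the default x86_64 target only guarantees SSE2, so outside of a benchmark one would usually gate the SIMD path at runtime, roughly like this hypothetical dispatcher:
// Hypothetical dispatcher: only take the SIMD path when the CPU supports
// the required extensions, otherwise fall back to the scalar trick.
pub fn parse_timestamp(s: &str) -> u64 {
    if is_x86_feature_detected!("sse3")
        && is_x86_feature_detected!("ssse3")
        && is_x86_feature_detected!("sse4.1")
    {
        trick_simd(s)
    } else {
        trick(s)
    }
}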
The 128 bit trick
This one is not in the original article: since Rust supports 128-bit integers, why not try them out? Using this is simpler than the previous trick function, since we do not need to break the 16 characters into two parts - we just work on them as a whole. Awesome (すごい! - "sugoi", the Japanese way of expressing amazement), we can even do an 8-byte mask trick.
pub fn trick_128(s: &str) -> u64 {
    let mut chunk: u128 = 0;
    unsafe {
        // copy the 16 string bytes into `chunk` (the count is in `u8` elements, i.e. bytes)
        std::ptr::copy_nonoverlapping(
            s.as_ptr(),
            &mut chunk as *mut u128 as *mut u8,
            std::mem::size_of_val(&chunk),
        );
    }
    // 1-byte mask trick (works on 8 pairs of single digits)
    let lower_digits = (chunk & 0x0f000f000f000f000f000f000f000f00) >> 8;
    let upper_digits = (chunk & 0x000f000f000f000f000f000f000f000f) * 10;
    let chunk = lower_digits + upper_digits;
    // 2-byte mask trick (works on 4 pairs of two digits)
    let lower_digits = (chunk & 0x00ff000000ff000000ff000000ff0000) >> 16;
    let upper_digits = (chunk & 0x000000ff000000ff000000ff000000ff) * 100;
    let chunk = lower_digits + upper_digits;
    // 4-byte mask trick (works on 2 pairs of four digits)
    let lower_digits = (chunk & 0x0000ffff000000000000ffff00000000) >> 32;
    let upper_digits = (chunk & 0x000000000000ffff000000000000ffff) * 10000;
    let chunk = lower_digits + upper_digits;
    // 8-byte mask trick (works on a pair of eight digits)
    let lower_digits = (chunk & 0x00000000ffffffff0000000000000000) >> 64;
    let upper_digits = (chunk & 0x000000000000000000000000ffffffff) * 100000000;
    let chunk = lower_digits + upper_digits;
    chunk as u64
}
Oh, this is even faster than trick_simd, but it is slightly (1 ns) slower than trick. Still, this method is much simpler than trick itself: it reads straightforwardly, without the explicit divide and conquer of splitting the string in half.
test bench_naive_bytes_and ... bench: 10 ns/iter (+/- 0) = 1600 MB/s
test bench_naive_bytes_and_c16 ... bench: 8 ns/iter (+/- 0) = 2000 MB/s
test bench_naive_bytes_iter ... bench: 11 ns/iter (+/- 0) = 1454 MB/s
test bench_str_parse ... bench: 23 ns/iter (+/- 1) = 695 MB/s
test bench_unrolled_safe ... bench: 15 ns/iter (+/- 1) = 1066 MB/s
test bench_trick ... bench: 6 ns/iter (+/- 0) = 2666 MB/s
test bench_trick_128 ... bench: 7 ns/iter (+/- 0) = 2285 MB/s
test bench_trick_simd ... bench: 8 ns/iter (+/- 1) = 2000 MB/s
Summary
Compilers are absolutely amazing pieces of technology. They regularly surprise the original author (or even blow the original author’s mind) at how well they can optimize code and see through what the original author is doing.
Any last words? I wish serde had support for reading fixed-size strings; if not, it would be good to have a crate that supports it, like num does for numbers. Other than that, I have nothing to add here except that trick is the fastest, and I am mind-blown that SIMD is not.
Having said all that, there is a culture of "optimization is the root of all evil", of saying that handwritten assembly or hand-optimization has no place anymore, and that we should just blindly rely on godbolting and hope that rustc does well.
I think both positions are complementary - trust your compiler (friend), trust your library vendor, but nothing beats carefully thought out code when you know your inputs and you have done your measurements to know it will make a difference.
用人不疑，疑人不用。 (Literal translation: if you use someone, do not doubt them; if you doubt someone, do not use them. Loosely: trust everyone, easy :P) If you do not trust the compiler, why use it?
Imagine your business revolved around parsing a firehose of telemetry data, and
you chose to use str::parse
. Would you buy more servers or spend a little time
optimizing your parsing?
test bench_naive_bytes ... bench: 11 ns/iter (+/- 0) = 1454 MB/s
test bench_naive_bytes_and ... bench: 10 ns/iter (+/- 0) = 1600 MB/s
test bench_naive_bytes_and_c16 ... bench: 8 ns/iter (+/- 0) = 2000 MB/s
test bench_naive_bytes_iter ... bench: 11 ns/iter (+/- 0) = 1454 MB/s
test bench_naive_chars ... bench: 18 ns/iter (+/- 0) = 888 MB/s
test bench_naive_chars_and ... bench: 17 ns/iter (+/- 1) = 941 MB/s
test bench_naive_chars_iter ... bench: 18 ns/iter (+/- 1) = 888 MB/s
test bench_str_parse ... bench: 23 ns/iter (+/- 1) = 695 MB/s
test bench_trick ... bench: 6 ns/iter (+/- 0) = 2666 MB/s
test bench_trick_128 ... bench: 7 ns/iter (+/- 0) = 2285 MB/s
test bench_trick_simd ... bench: 8 ns/iter (+/- 1) = 2000 MB/s
test bench_trick_simd_c16 ... bench: 10 ns/iter (+/- 0) = 1600 MB/s
test bench_unrolled ... bench: 16 ns/iter (+/- 0) = 1000 MB/s
test bench_unrolled_safe ... bench: 15 ns/iter (+/- 1) = 1066 MB/s
test bench_unrolled_unsafe ... bench: 17 ns/iter (+/- 0) = 941 MB/s
Update 2021-04-04: building with the rustc flag target-cpu=native gives results similar to the original blog post. This benchmark was run on a newer laptop CPU, so the numbers may differ from the ones above.
test bench_naive_bytes ... bench: 10 ns/iter (+/- 0) = 1600 MB/s
test bench_naive_bytes_and ... bench: 9 ns/iter (+/- 0) = 1777 MB/s
test bench_naive_bytes_and_c16 ... bench: 8 ns/iter (+/- 0) = 2000 MB/s
test bench_naive_bytes_iter ... bench: 10 ns/iter (+/- 0) = 1600 MB/s
test bench_naive_chars ... bench: 10 ns/iter (+/- 0) = 1600 MB/s
test bench_naive_chars_and ... bench: 10 ns/iter (+/- 0) = 1600 MB/s
test bench_naive_chars_iter ... bench: 9 ns/iter (+/- 0) = 1777 MB/s
test bench_str_parse ... bench: 17 ns/iter (+/- 0) = 941 MB/s
test bench_trick ... bench: 3 ns/iter (+/- 0) = 5333 MB/s
test bench_trick_128 ... bench: 3 ns/iter (+/- 0) = 5333 MB/s
test bench_trick_simd ... bench: 1 ns/iter (+/- 0) = 16000 MB/s
test bench_trick_simd_c16 ... bench: 1 ns/iter (+/- 0) = 16000 MB/s
test bench_unrolled ... bench: 5 ns/iter (+/- 0) = 3200 MB/s
test bench_unrolled_safe ... bench: 3 ns/iter (+/- 0) = 5333 MB/s
test bench_unrolled_unsafe ... bench: 3 ns/iter (+/- 0) = 5333 MB/s
Post Scriptum
- All benchmarks ran on a 2.70 GHz Intel(R) Core(TM) i7-2620M, on Linux, compiled with rustc 1.46.0-nightly (0cd7ff7dd 2020-07-04).
- I did not do comparison with the original code.
- I did not get to try parse_8_chars_simd because _mm_loadu_si64 is not available in stdarch (pull request).
- You can find the code and the benchmarks here.
- The new results with target-cpu=native (for the 2021-04-04 update) ran on an AMD Ryzen 5 4500U, on Linux, compiled with cargo 1.52.0-nightly (90691f2bf 2021-03-16).