GIF: Optimise deinterlacing of frame rows.

The original interlacing handling looked like this: ```c static inline uint32_t f1(int height, int y) { if ((y << 3) < height) { return (y << 3); /* Pass 1 */ } y -= ((height + 7) >> 3); if ((y << 3) < (height - 4)) { return (y << 3) + 4; /* Pass 2 */ } y -= ((height + 3) >> 3); if ((y << 2) < (height - 2)) { return (y << 2) + 2; /* Pass 3 */ } y -= ((height + 1) >> 2); return (y << 1) + 1; /* Pass 4 */ } ``` Annotated with comments to show the passes. Thought it should be possible to do the job without all the calculations and subtractions from `y`, and I also didn't like that it was not super obvious how it aligned with the spec or how it dealt with pass 4 last, when it is the most common case, happening for 50% of image rows. See the following excerpt from from the GIF89a Specification. > The rows of an Interlaced images are arranged in the following order: > > Group 1 : Every 8th. row, starting with row 0. (Pass 1) > Group 2 : Every 8th. row, starting with row 4. (Pass 2) > Group 3 : Every 4th. row, starting with row 2. (Pass 3) > Group 4 : Every 2nd. row, starting with row 1. (Pass 4) > > The Following example illustrates how the rows of an interlaced image are > ordered. > > Row Number Interlace Pass > > 0 ----------------------------------------- 1 > 1 ----------------------------------------- 4 > 2 ----------------------------------------- 3 > 3 ----------------------------------------- 4 > 4 ----------------------------------------- 2 > 5 ----------------------------------------- 4 > 6 ----------------------------------------- 3 > 7 ----------------------------------------- 4 > 8 ----------------------------------------- 1 > 9 ----------------------------------------- 4 > 10 ----------------------------------------- 3 > 11 ----------------------------------------- 4 > 12 ----------------------------------------- 2 > 13 ----------------------------------------- 4 > 14 ----------------------------------------- 3 > 15 ----------------------------------------- 4 > 16 ----------------------------------------- 1 > 17 ----------------------------------------- 4 > 18 ----------------------------------------- 3 > 19 ----------------------------------------- 4 -- https://www.w3.org/Graphics/GIF/spec-gif89a.txt To test performance I ran it for all possible GIF heights: ```c for (unsigned height = 1; height < UINT16_MAX; height++) { for (uint32_t y = 0; y < height; y++) { use_val(f1(height, y)); } } ``` For the original implementation (`f1`), I got the following performance: | Compiler | Time | | -------- | ------ | | Clang | 3.830s | | GCC | 4.737s | All tests used GCC 11.2 and Clang 13.0, with `-O2` optimisation. Next I rewrote it to more closely resemble my reading of the spec. The compiler can convert the multiplies to shifts and I found this more readable than `f1`, and more closely matching the spec. The `height / N * N` bits all use powers of two for N so they boil down the a simple bitwise AND with an appropriate mask to round down to the nearest multiple of N. ```c static inline uint32_t f2(uint32_t height, uint32_t y) { height--; if (y * 8 <= height) { return y * 8; /* Pass 1 */ } if (y * 4 <= height) { return y * 8 - height / 8 * 8 - 4; /* Pass 2 */ } if (y * 2 <= height) { return y * 4 - height / 4 * 4 - 2; /* Pass 3 */ } return y * 2 - height / 2 * 2 - 1; /* Pass 4 */ } ``` For `f2`, I got the following performance: | Compiler | Time | | -------- | ------ | | Clang | 2.758s | | GCC | 3.909s | So it's faster than `f1`, and Clang is still doing better than GCC. After that I reversed the order, so that pass 4 would be handled first. ```c static inline uint32_t f3(uint32_t height, uint32_t y) { height--; if (y * 2 > height) { return y * 2 - height / 2 * 2 - 1; } if (y * 4 > height) { return y * 4 - height / 4 * 4 - 2; } if (y * 8 > height) { return y * 8 - height / 8 * 8 - 4; } return y * 8; } ``` For `f3`, I got the following performance: | Compiler | Time | | -------- | ------ | | Clang | 3.329s | | GCC | 3.630s | Interestingly this was consistently faster than `f2` with GCC, but slower than `f2` when built with Clang. (It's still faster than the original `f1` with both compilers.) I thought it might be because the branches in `f2` are more predictable than in `f3`. For example the first branch in `f2` is only true one eighth of the time, while all the branches in `f3` are 50:50. However, running under `perf` showed this was not the case. Actually the faster `f2` for Clang had more branch misses than the Clang `f3`. | Compiler | Implementation | Branches | Branch misses | | -------- | -------------- | -------------- | ------------- | | Clang | `f2` | 13,690,378,148 | 656,517 | | Clang | `f3` | 10,738,688,711 | 275,147 | | GCC | `f2` | 10,738,659,995 | 336,992 | | GCC | `f3` | 10,201,842,339 | 1,314,445 | After this I tried changing the implementation completely. All of the previous implementations were based on the idea of mapping from an uninterlaced row number to an interlaced one. ```c static inline bool f4(uint32_t height, uint32_t *y, uint8_t *pass) { *y += *pass == 0 ? 8 : 16 >> *pass; if (*y < height) return true; switch (*pass) { case 0: *pass += 1; *y = 4; if (*y < height) return true; case 1: *pass += 1; *y = 2; if (*y < height) return true; case 2: *pass += 1; *y = 1; return true; } return false; } ``` With `f4` we simply step to the next line in interlaced order. The `16 >> *pass` generates the correct step size for all but the first pass, so the first pass is special cased. The switch statement is for handling advancing from one pass to the next, so it is reached a maximum of four times per frame. The rest of the time, we return on the `if (*y < height) return true;` above. The switch falls through to handle the case where the frame height is less than the start row for the given pass. For `f4`, I got the following performance: | Compiler | Time | | -------- | ------ | | Clang | 2.315s | | GCC | 3.479s | Again, this was tested for all possible GIF heights, but the `for` loop for rows changes to a do-while loop: ```c for (unsigned height = 1; height < UINT16_MAX; height++) { uint8_t pass = 0; uint32_t y = 0; do { use_val(y); } while (f4(height, &y, &pass)); } ``` This is the fastest yet for both compilers, with Clang still performing best. One thing I didn't like about `f4` is that it calculates the y step size every call, although the step only actually changes three times, when we change to the next pass. So for `f5` the switch was updated to set the step size. ```c static inline bool f5(uint32_t height, uint32_t *y, uint8_t *step) { *y += *step & 0xf; if (*y < height) return true; switch (*step) { case 24: *y = 4; *step = 8; if (*y < height) return true; case 8: *y = 2; *step = 4; if (*y < height) return true; case 4: *y = 1; *step = 2; return true; } return false; } ``` To avoid the caller having to maintain extra state, the pass number is no longer used, and we use the step size to keep track of which pass we are on. However, the step size for the first two passes is the same, so that led to the strange use of 24 for the initial step size, and the masking of step to ensure only the bottom four bits are used. For `f5`, I got the following performance: | Compiler | Time | | -------- | ------ | | Clang | 2.364s | | GCC | 2.891s | This is the fastest yet for both compilers, with Clang still winning easily. Note that when the functions are not allowed to inline, `f2` and `f3` are the fastest.
author: Michael Drake <tlsa@netsurf-browser.org> 2021-11-07 18:16:35 +0000
committer: Michael Drake <tlsa@netsurf-browser.org> 2021-11-08 10:42:36 +0000
commit: 69581f88b3dd5be49b4de539e3a836006a988235 (patch)
tree: ffa7b38eeb0dff16d16eec430d211d3ac27c853b /src/libnsgif.c
parent: 1b0c8759f5d96e96ded4cb50bb2772103780e2eb (diff)
download: libnsgif-69581f88b3dd5be49b4de539e3a836006a988235.tar.gz
libnsgif-69581f88b3dd5be49b4de539e3a836006a988235.tar.bz2
1 files changed, 60 insertions, 23 deletions
diff --git a/src/libnsgif.c b/src/libnsgif.c
index 7d48b7e..6046769 100644
--- a/src/libnsgif.c
+++ b/src/libnsgif.c
@@ -247,21 +247,56 @@ static gif_result gif__recover_frame(
 	return GIF_OK;
 }
 
-static uint32_t gif_interlaced_line(int height, int y)
+/**
+ * Get the next line for GIF decode.
+ *
+ * Note that the step size must be initialised to 24 at the start of the frame
+ * (when y == 0).  This is because of the first two passes of the frame have
+ * the same step size of 8, and the step size is used to determine the current
+ * pass.
+ *
+ * \param[in]     height     Frame height in pixels.
+ * \param[in,out] y          Current row, starting from 0, updated on exit.
+ * \param[in,out] step       Current step starting with 24, updated on exit.
+ * \return true if there is a row to process, false at the end of the frame.
+ */
+static inline bool gif__deinterlace(uint32_t height, uint32_t *y, uint8_t *step)
 {
-	if ((y << 3) < height) {
-		return (y << 3);
+	*y += *step & 0xf;
+
+	if (*y < height) return true;
+
+	switch (*step) {
+	case 24: *y = 4; *step = 8; if (*y < height) return true;
+	         /* Fall through. */
+	case  8: *y = 2; *step = 4; if (*y < height) return true;
+	         /* Fall through. */
+	case  4: *y = 1; *step = 2; if (*y < height) return true;
+	         /* Fall through. */
+	default:
+		break;
 	}
-	y -= ((height + 7) >> 3);
-	if ((y << 3) < (height - 4)) {
-		return (y << 3) + 4;
-	}
-	y -= ((height + 3) >> 3);
-	if ((y << 2) < (height - 2)) {
-		return (y << 2) + 2;
+
+	return false;
+}
+
+/**
+ * Get the next line for GIF decode.
+ *
+ * \param[in]     interlace  Non-zero if the frame is not interlaced.
+ * \param[in]     height     Frame height in pixels.
+ * \param[in,out] y          Current row, starting from 0, updated on exit.
+ * \param[in,out] step       Current step starting with 24, updated on exit.
+ * \return true if there is a row to process, false at the end of the frame.
+ */
+static inline bool gif__next_row(uint32_t interlace,
+		uint32_t height, uint32_t *y, uint8_t *step)
+{
+	if (!interlace) {
+		return (++*y != height);
+	} else {
+		return gif__deinterlace(height, y, step);
 	}
-	y -= ((height + 1) >> 2);
-	return (y << 1) + 1;
 }
 
 static gif_result gif__decode_complex(
@@ -276,9 +311,15 @@ static gif_result gif__decode_complex(
 		uint32_t *restrict frame_data,
 		uint32_t *restrict colour_table)
 {
-	uint32_t available = 0;
-	gif_result ret = GIF_OK;
 	lzw_result res;
+	gif_result ret = GIF_OK;
+	uint32_t available = 0;
+	uint8_t step = 24;
+	uint32_t y = 0;
+
+	if (height == 0) {
+		return GIF_OK;
+	}
 
 	/* Initialise the LZW decoding */
 	res = lzw_decode_init(gif->lzw_ctx, data[0],
@@ -288,17 +329,12 @@ static gif_result gif__decode_complex(
 		return gif_error_from_lzw(res);
 	}
 
-	for (uint32_t y = 0; y < height; y++) {
+	do {
 		uint32_t x;
-		uint32_t decode_y;
 		uint32_t *frame_scanline;
 
-		if (interlace) {
-			decode_y = gif_interlaced_line(height, y) + offset_y;
-		} else {
-			decode_y = y + offset_y;
-		}
-		frame_scanline = frame_data + offset_x + (decode_y * gif->width);
+		frame_scanline = frame_data + offset_x +
+				(y + offset_y) * gif->width;
 
 		x = width;
 		while (x > 0) {
@@ -338,7 +374,8 @@ static gif_result gif__decode_complex(
 				}
 			}
 		}
-	}
+	} while (gif__next_row(interlace, height, &y, &step));
+
 	return ret;
 }
author	Michael Drake <tlsa@netsurf-browser.org>	2021-11-07 18:16:35 +0000
committer	Michael Drake <tlsa@netsurf-browser.org>	2021-11-08 10:42:36 +0000
commit	69581f88b3dd5be49b4de539e3a836006a988235 (patch)
tree	ffa7b38eeb0dff16d16eec430d211d3ac27c853b /src/libnsgif.c
parent	1b0c8759f5d96e96ded4cb50bb2772103780e2eb (diff)
download	libnsgif-69581f88b3dd5be49b4de539e3a836006a988235.tar.gz libnsgif-69581f88b3dd5be49b4de539e3a836006a988235.tar.bz2