Maxim Dounin
February 22, 2023 04:42PM
Hello!

On Thu, Feb 23, 2023 at 12:10:27AM +0900, u5h wrote:

> # HG changeset patch
> # User Yugo Horie <u5.horie@gmail.com>
> # Date 1677077775 -32400
> # Wed Feb 22 23:56:15 2023 +0900
> # Node ID 1a9487706c6af90baf2ed770db29f689c3850721
> # Parent cffaf3f2eec8fd33605c2a37814f5ffc30371989
> core: return error when the first byte is above 0xf5 in utf-8
>
> * see https://datatracker.ietf.org/doc/html/rfc3629#section-4
>

Thanks for catching, this needs to be addressed.

The existing code accepts invalid UTF-8 sequences with the first
byte being 11111xxx (>= 0xf8), and interprets them as if the first
byte was 11110xxx (as a 4-byte sequence). This is going to
misinterpret future UTF-8 if it will be ever expanded to 5-byte
sequences (unlikely though), and might cause other problems
(including security ones, as the RFC rightfully warns, though also
unlikely in this particular case).

A more verbose commit log might be beneficial here, as well as
the one matching nginx style for commit logs. See
http://nginx.org/en/docs/contributing_changes.html for some basic
tips.

> diff -r cffaf3f2eec8 -r 1a9487706c6a src/core/ngx_string.c
> --- a/src/core/ngx_string.c Thu Feb 02 23:38:48 2023 +0300
> +++ b/src/core/ngx_string.c Wed Feb 22 23:56:15 2023 +0900
> @@ -1364,7 +1364,7 @@
>
> u = **p;
>
> - if (u >= 0xf0) {
> + if (u < 0xf5 && u >= 0xf0) {
>
> u &= 0x07;
> valid = 0xffff;

The suggested change doesn't look correct to me.

In particular, the "u == 0xf5" (as well as 0xf6 and 0xf7) case is
valid in terms of the codepoint decoding, though resulting value
will be in the invalid range as per the comment above the
function. So it does not really need any special handling, except
may be to simplify things. The interesting part here is "u >= 0xf8"
(which is complimentary to "u &= 0x07" being used in "if").

Further, as the condition is written, "u >= 0xf5" is now excluded
from the first "if", but will be matched by the "if (u >= 0xe0)"
below, and misinterpreted as a 3-byte sequence. This looks worse
than the existing behaviour.

I would rather consider something like:

diff --git a/src/core/ngx_string.c b/src/core/ngx_string.c
--- a/src/core/ngx_string.c
+++ b/src/core/ngx_string.c
@@ -1364,7 +1364,12 @@ ngx_utf8_decode(u_char **p, size_t n)

u = **p;

- if (u >= 0xf0) {
+ if (u >= 0xf8) {
+
+ (*p)++;
+ return 0xffffffff;
+
+ } else if (u >= 0xf0) {

u &= 0x07;
valid = 0xffff;


--
Maxim Dounin
http://mdounin.ru/
_______________________________________________
nginx-devel mailing list
nginx-devel@nginx.org
https://mailman.nginx.org/mailman/listinfo/nginx-devel
Subject Author Views Posted

[PATCH] Core: return error when the first byte is above 0xf5 in utf-8

u5h 525 February 22, 2023 10:12AM

Re: [PATCH] Core: return error when the first byte is above 0xf5 in utf-8

Maxim Dounin 129 February 22, 2023 04:42PM

Re: [PATCH] Core: return error when the first byte is above 0xf5 in utf-8

u5h 113 February 22, 2023 07:26PM

Re: [PATCH] Core: return error when the first byte is above 0xf5 in utf-8

Maxim Dounin 114 March 01, 2023 06:52PM

Re: [PATCH] Core: return error when the first byte is above 0xf5 in utf-8

u5h 103 March 02, 2023 03:18AM

Re: [PATCH] Core: return error when the first byte is above 0xf5 in utf-8

Maxim Dounin 128 March 02, 2023 10:50AM

Re: [PATCH] Core: return error when the first byte is above 0xf5 in utf-8

Roman Arutyunyan 151 March 13, 2023 08:56AM



Sorry, you do not have permission to post/reply in this forum.

Online Users

Guests: 263
Record Number of Users: 8 on April 13, 2023
Record Number of Guests: 421 on December 02, 2018
Powered by nginx      Powered by FreeBSD      PHP Powered      Powered by MariaDB      ipv6 ready