Welcome! Log In Create A New Profile

Advanced

Re: HTML parsing support for xslt module

Maxim Dounin
March 19, 2012 01:04PM
Hello!

On Wed, Mar 14, 2012 at 05:03:56PM +0000, Laurence Rowe wrote:

> I'd like to submit the attached patch which implements an
> ``xslt_html_parser`` directive for consideration. When enabled, the
> xslt module uses the libxml2 HTMLParser to parse the response body.
> This is useful for people who want to transform HTML using XSLT,
> including users of Diazo deploying on Nginx
> (http://docs.diazo.org/en/latest/deployment.html#nginx).
>
> The patch is generated from my repository at
> https://bitbucket.org/lrowe/nginx-xslt-html-parser, forked from
> http://mdounin.ru/hg/nginx-vendor-current/. The xslt_param patch from
> http://mailman.nginx.org/pipermail/nginx-devel/2012-March/001926.html
> is included in my repository. I'll discuss each of the individual
> changesets briefly below:
>
> changeset: 668:bf4d14f51436
> user: Laurence Rowe <laurence@lrowe.co.uk>
> date: Sun Jul 11 23:54:08 2010 +0100
> summary: Skip transform when there is no content (e.g. a proxied redirect)
>
> This was originally reviewed in
> http://mailman.nginx.org/pipermail/nginx-devel/2010-July/000390.html:
>
> > Currently the way to disable XSLT processing is MIME type, "text/xml"
> > by default. Redirects usually have "text/html" type.
>
> To parse HTML you have to set ``xslt_types text/html;``. This
> changeset prevents crashing on responses with an empty body.

I don't think that skipping transformation based on
content_length_n is a correct behaviour.

1) There are more than one way to represent empty response, and
content_length_n == 0 is just one of them.

2) And the empty response is just another case of malformed input,
I see no reason to handle this specially.

> changeset: 669:9487b3a0e3ff
> user: Laurence Rowe <laurence@lrowe.co.uk>
> date: Fri Mar 09 21:01:25 2012 +0000
> summary: Use xmlCtxtUseOptions to set options.
>
> This changeset moves most option setting to the xmlCtxtUseOptions
> (foundational to next commit.)

This doesn't produce identical ctxt->loadsubset, and hence need
more details. I tend to think the current code is wrong and
your's is ok, though it would be good to see confirmation.

(Additionally, there is a style issue in this patch.)

> changeset: 670:c8349ca87381
> user: Laurence Rowe <laurence@lrowe.co.uk>
> date: Fri Mar 09 21:08:18 2012 +0000
> summary: XML_PARSE_COMPACT to save memory
>
> We can use XML_PARSE_COMPACT as the parsed input document is not
> modified (XSLT creates a new result document.)

Is it supported in libxml2 versions used by various OSes now? It
appeared relatively recently, in libxml2 2.6.21, and if some OSes
still use older versions by default - this may need configure
test.

Another question to consider - is it actually safe to use? Note
that nginx does modify input document to provide DTDs.

It would be also cool to see some stats on real world memory usage
with and without this option set.

> changeset: 671:7145bd8cc1e2
> tag: tip
> user: Laurence Rowe <laurence@lrowe.co.uk>
> date: Fri Mar 09 21:47:46 2012 +0000
> summary: xslt_html_parser
>
> This changeset adds the ``xslt_html_parser`` directive and uses the
> HTMLParser when it is set. HTML parsing is performed with
> HTML_PARSE_RECOVER as real-world HTML may not be well formed, error
> handling is thus disabled when this option is set.

I in general like the feature, though I tend to think this patch
needs more polishing. In no particular order:

1) Hardcoded "utf-8" charset detection scares me.

2) The are no error handling for html code, and while it's
probably ok to ignore parsing problems - it's certainly not ok to
ignore fatal problems like memory allocation ones.

3) DTD and entities handling needs clarification.

Maxim Dounin

p.s. You may want to use patchbomb mercurial extension (hg email)
to submit patches.

_______________________________________________
nginx-devel mailing list
nginx-devel@nginx.org
http://mailman.nginx.org/mailman/listinfo/nginx-devel
Subject Author Views Posted

HTML parsing support for xslt module Attachments

Laurence Rowe 1469 March 14, 2012 01:06PM

Re: HTML parsing support for xslt module

Maxim Dounin 557 March 19, 2012 01:04PM

Re: HTML parsing support for xslt module

Laurence Rowe 476 March 20, 2012 12:06PM

Re: HTML parsing support for xslt module

Maxim Dounin 494 March 21, 2012 08:56AM

Re: HTML parsing support for xslt module

Laurence Rowe 479 March 21, 2012 09:56AM

Re: HTML parsing support for xslt module

Maxim Dounin 511 March 21, 2012 10:32AM



Sorry, you do not have permission to post/reply in this forum.

Online Users

Guests: 280
Record Number of Users: 8 on April 13, 2023
Record Number of Guests: 421 on December 02, 2018
Powered by nginx      Powered by FreeBSD      PHP Powered      Powered by MariaDB      ipv6 ready