
A History of HTML Parsing at Cloudflare: Part 1

What's this blog post about?

In November 2019, Cloudflare launched streaming HTML rewriting functionality and open sourced the Rust HTML rewriter (LOL HTML) used to back the Workers HTMLRewriter API. The first blog post explains the basics of a streaming HTML rewriter and its particular requirements. The story starts around eight years earlier, with a group of 'ad-hoc' parsers that were created for specific tasks such as rewriting e-mail addresses or minifying HTML. By 2016, the state machine defined in the HTML5 specification could be used to build a single spec-compliant, pluggable HTML rewriter to replace the existing collection of parsers. The source code for this rewriter is now public and available here: https://github.com/cloudflare/lazyhtml.

The second blog post describes the next iteration of the rewriter. With the launch of the edge compute platform Cloudflare Workers, they came to realise that developers wanted the same HTML rewriting capabilities with a JavaScript API. The post describes the thinking behind a low latency streaming HTML rewriter with a CSS-selector based API. They open-sourced the Rust library as it can also be used as a stand-alone HTML rewriting/parsing library.

A streaming HTML rewriter takes either an HTML string or a byte stream as input, parses it into tokens or some other structured intermediate representation (IR) such as an Abstract Syntax Tree (AST), performs transformations on the tokens, and then converts them back to HTML. This provides the ability to modify, extract or add to an existing HTML document as the bytes are being processed. Compare this with a standard HTML tree parser, which needs to retrieve the entire file to generate a full DOM tree. The tree-based rewriter will both take longer to deliver the first processed bytes and require significantly more memory.

The reader may already be wondering: “Isn’t this a solved problem, aren’t there many widely used open-source browsers out there with HTML parsers that can be used for this purpose?”. The reality is that writing code to run in 190+ PoPs around the world with a strict low latency requirement turns even seemingly trivial problems into complex engineering challenges. The following blog posts detail the journey of how starting with a simple idea of finding email addresses within an HTML page led to building an almost spec-compliant HTML parser, and then on to a CSS selector matching Virtual Machine. They learned a lot on this journey; hopefully you will find some of it as interesting as they did.

When rewriting content through Cloudflare, they do not want to impact site performance. The balance in designing a streaming HTML rewriter is to minimise the pause in response byte flow by holding onto as little information as possible whilst retaining the ability to rewrite matching tokens. The requirements differ from those of an HTML parser used in a browser in terms of output latency, parser throughput and memory limitations.

As most users must realise, browsers have the luxury of being able to consume memory. For example, this simple HTML markup, when opened in a browser, will consume a significant chunk of your system memory before eventually halting the browser tab (and all this memory will be consumed by the parser):

<script>
    document.write('<');
    while(true) {
        document.write('aaaaaaaaaaaaaaaaaaaaaaaa');
    }
</script>

Unfortunately, buffering of some fraction of the input is inevitable even for streaming HTML rewriting.
Consider these two HTML snippets:

<div foo="bar" qux="qux">

<div foo="bar" qux="qux"

These seemingly similar fragments of HTML will be treated completely differently when encountered at the end of an HTML page. The first fragment will be parsed as a start tag and the second one will be ignored. By just seeing a `<` character followed by a tag name, the parser can’t determine if it has found a start tag or not. It needs to traverse the input in search of the closing `>` to make a decision, buffering all content in between, so it can later be emitted to the consumer as a start tag token. This requirement forces browsers to buffer content indefinitely before eventually giving up with an out-of-memory error. In our case, we can’t afford to spend hundreds of megabytes of memory parsing a single HTML file (the actual constraints are even tighter - even using a dozen kilobytes for each request would be unacceptable). We need to be much more sophisticated than other implementations in terms of memory usage and gracefully handle all the situations where the provided memory capacity is insufficient to accomplish parsing.

The first version of their HTML rewriter was a collection of "ad-hoc parsers". It started with a simple email-matching regex along the lines of `[\w.]+@[\w.]+`. If the content that comes through contains an email address, it might arrive split across several chunks at an arbitrary point. In order to keep good Time To First Byte (TTFB) and consistent speed, they wanted to ensure that the preceding chunk is emitted as soon as it is determined not to be interesting for replacement purposes. The easiest way to do that is to transform the regex into a state machine, or finite automaton. While you could do that by hand, you would end up with hard-to-maintain and error-prone code. Instead, Ragel was chosen to transform regular expressions into efficient native state machine code. Ragel doesn’t try to take care of buffering or anything other than traversing the state machine. It provides a syntax that not only describes patterns, but can also associate custom actions (code in a host language) with any given state. In their case they can remember the position of the potential start of an email and, unless it was already discarded or replaced by the end of the current input, store the unhandled part in a permanent buffer. Then, when a new chunk comes, they can process it separately, resuming from a state Ragel remembers itself, but then use both the buffered chunk and the new one to either emit or obfuscate (a rough sketch of this chunk-spanning approach appears below).

Now that they had solved the problem of matching email patterns in text, they needed to deal with the fact that the emails have to be obfuscated on HTML pages. This is when the first hints of HTML “parsing” were introduced. “Parsing” is in quotes because, rather than implementing a whole parser, the email filter (as the module was called) didn’t attempt to replicate the whole HTML grammar, but rather added custom Ragel patterns just for skipping over comments and tags where emails should not be obfuscated. This was a reasonable approach, especially back in 2010 - four years before the HTML5 specification, when all browsers had their own quirks handling of HTML. However, as you can imagine, this approach did not scale well. If you’re trying to work around quirks in other parsers, you start gaining more and more quirks in your own, and then have to work around these too.
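The chunk-spanning matching described above can be illustrated with a small hand-written sketch. This is not Cloudflare's code and not Ragel output - just a simplified C approximation of the idea: a resumable state machine that emits uninteresting bytes as soon as possible and buffers only the fragment that might still turn into an email across a chunk boundary. The names (feed, emit, struct scanner) are made up for the example.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

enum state { TEXT, LOCAL_PART, DOMAIN };

struct scanner {
    enum state state;
    char pending[256];   /* potential email spanning a chunk boundary */
    size_t pending_len;
};

static int word_char(char c) { return isalnum((unsigned char)c) || c == '.'; }

static void emit(const char *data, size_t len) { fwrite(data, 1, len, stdout); }

/* Process one chunk; anything that cannot be part of an email is emitted
 * right away to keep Time To First Byte low. */
static void feed(struct scanner *s, const char *chunk, size_t len) {
    for (size_t i = 0; i < len; i++) {
        char c = chunk[i];
        switch (s->state) {
        case TEXT:
            if (word_char(c)) {
                s->pending_len = 0;
                s->pending[s->pending_len++] = c;
                s->state = LOCAL_PART;
            } else {
                emit(&c, 1);
            }
            break;
        case LOCAL_PART:
            if (word_char(c) || c == '@') {
                if (s->pending_len < sizeof(s->pending))
                    s->pending[s->pending_len++] = c;
                if (c == '@') s->state = DOMAIN;
            } else {
                emit(s->pending, s->pending_len);      /* not an email after all */
                emit(&c, 1);
                s->state = TEXT;
            }
            break;
        case DOMAIN:
            if (word_char(c)) {
                if (s->pending_len < sizeof(s->pending))
                    s->pending[s->pending_len++] = c;
            } else {
                emit("[obfuscated]", 12);              /* replace the matched email */
                emit(&c, 1);
                s->state = TEXT;
            }
            break;
        }
    }
    /* At the end of the chunk, s->pending holds the part that might still
     * become (or complete) an email; it is kept until the next chunk. */
}

int main(void) {
    struct scanner s = { .state = TEXT };
    /* the email arrives split across two chunks */
    feed(&s, "contact us at john.doe@exam", strlen("contact us at john.doe@exam"));
    feed(&s, "ple.org today\n", strlen("ple.org today\n"));
    return 0;  /* prints: contact us at [obfuscated] today */
}

A real implementation would also need a final flush step and a hard cap on how much it is willing to buffer - which is exactly the kind of bufferization limit mentioned later in this post.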
Simultaneously, new features started to be added which also required modifying HTML on the fly (like automatic insertion of the Google Analytics script), and the existing module seemed to be the best place for that. It grew to handle more and more tags, operations and syntactic edge cases.

In 2011, Cloudflare decided to also add minification to allow customers to speed up their websites even if they had not employed minification themselves. For that, they decided to use an existing streaming minifier - jitify. It already had NGINX bindings, which made it a great candidate for integration into the existing pipeline. Unfortunately, just like most other parsers from that time, as well as the one described above, it had its own processing rules for HTML, JavaScript and CSS, which weren’t precise but rather tried to parse content on a best-effort basis. This led to having two independent streaming parsers that were incompatible and could produce bugs either individually or only in combination.

By 2016, it was time to get out of the multiple ad hoc parsers business and do things ‘the right way’: build an HTML5 compliant parser starting from the specification state machine. Using only this state machine it should have been straightforward to build a parser. However, historically the parsing of HTML has not been entirely strict, which means that, to avoid breaking existing content, building an actual DOM is required during parsing. This is not possible for a streaming rewriter, so a simulator of the parser feedback was developed. In terms of performance, it is always better not to do something; the posts go on to describe why the rewriter can be ‘lazy’ and not perform the expensive encoding and decoding of text when rewriting HTML, and then the surprisingly difficult problem of deciding whether a response is HTML at all.

HTML5

By 2016, HTML5 had defined precise syntax rules for parsing and compatibility with legacy content and custom browser implementations. It was already implemented by all browsers and many third-party implementations. The HTML5 parsing specification defines basic HTML syntax in the form of a state machine. They already had experience with Ragel for similar use cases, so there was no question about what to use for the new streaming parser. Despite the complexity of the grammar, the translation of the specification to Ragel syntax was straightforward. The code looks simpler than the formal description of the state machine, thanks to the ability to mix regex syntax with explicit transitions. (The original post includes a visualisation of a small fraction of the HTML state machine. Source: https://twitter.com/RReverser/status/715937136520916992)

HTML grammar doesn’t allow the first character of a tag name to be anything except an ASCII alphabetical character, so reserving the numbers 0 to 5 (00000b-00101b) for digits and the numbers 6 to 31 (00110b-11111b) for ASCII alphabetical characters solves the problem of encoding each tag-name character in 5 bits (a sketch of one possible encoding appears after the LazyHTML introduction below).

LazyHTML

After taking everything mentioned above into consideration, the LazyHTML (https://github.com/cloudflare/lazyhtml) library was created. It is a fast streaming HTML parser and serializer with a token based C-API, derived from the HTML5 lexer written in Ragel. It provides a pluggable transformation pipeline to allow multiple transformation handlers to be chained together.
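Returning to the tag-name encoding mentioned above: the 5-bit character codes suggest that a whole tag name can be packed into a single 64-bit integer (up to 12 characters) and compared with one integer comparison. The sketch below is only a guess at such a scheme, not LazyHTML's actual code; pack_tag_name is a hypothetical helper, and the assumption that only the digits 1 to 6 matter (as in <h1>..<h6>) is made here, not stated in the post.

#include <stdint.h>
#include <stdio.h>

/* Pack a tag name into a 64-bit integer, 5 bits per character.
 * Letters take codes 6..31; codes 0..5 are assumed to cover digits 1..6. */
static uint64_t pack_tag_name(const char *name) {
    uint64_t packed = 0;
    int len = 0;
    for (const char *p = name; *p; p++, len++) {
        char c = *p;
        uint64_t code;
        if (c >= 'A' && c <= 'Z') c += 'a' - 'A';     /* tag names are case-insensitive */
        if (c >= 'a' && c <= 'z') {
            code = (uint64_t)(c - 'a') + 6;           /* letters -> 6..31 */
        } else if (c >= '1' && c <= '6') {
            code = (uint64_t)(c - '1');               /* digits 1..6 -> 0..5 */
        } else {
            return 0;                                 /* not packable; caller falls back */
        }
        if (len == 12) return 0;                      /* more than 12 chars won't fit in 64 bits */
        packed = (packed << 5) | code;
    }
    /* Because a tag name must start with an ASCII letter (code >= 6),
     * a real packed name is never 0, so 0 is safe as a "no match" value. */
    return packed;
}

int main(void) {
    printf("%d\n", pack_tag_name("H1") == pack_tag_name("h1"));   /* prints 1 */
    printf("%d\n", pack_tag_name("a")  == pack_tag_name("div"));  /* prints 0 */
    return 0;
}

If LazyHTML's tag type values are such packed names, then a check like `tag->type == LHTML_TAG_A` in the handler example below amounts to a single integer comparison rather than a string comparison.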
An example of a function that transforms the `href` attribute of links:

// define static string to be used for replacements
static const lhtml_string_t REPLACEMENT = {
    .data = "[REPLACED]",
    .length = sizeof("[REPLACED]") - 1
};

static void token_handler(lhtml_token_t *token, void *extra /* this can be your state */) {
    if (token->type == LHTML_TOKEN_START_TAG) { // we're interested only in start tags
        const lhtml_token_starttag_t *tag = &token->start_tag;
        if (tag->type == LHTML_TAG_A) { // check whether tag is of type <a>
            const size_t n_attrs = tag->attributes.count;
            const lhtml_attribute_t *attrs = tag->attributes.items;
            for (size_t i = 0; i < n_attrs; i++) { // iterate over attributes
                const lhtml_attribute_t *attr = &attrs[i];
                if (lhtml_name_equals(attr->name, "href")) { // match the attribute name
                    attr->value = REPLACEMENT; // set the attribute value
                }
            }
        }
    }
    lhtml_emit(token, extra); // pass transformed token(s) to next handler(s)
}

So, is it correct and how fast is it? It is HTML5 compliant, as tested against the official test suites. Unlike the previous parser(s), it didn’t bail out on any of the 2,382,625 documents from HTTP Archive, although 0.2% of documents exceeded the expected bufferization limits as they were in fact JavaScript, RSS or other types of content incorrectly served with Content-Type: text/html; since anything is valid HTML5, the parser tried to parse e.g. a<b; x=3; y=4 as an incomplete tag with attributes. This is very rare (and drops to 0.03% when two error-prone advertisement networks are excluded from those results), but still needs to be accounted for and is a valid case for bailing out.

As for benchmarks: in September 2016 they used an example which transforms the HTML spec itself (a 7.9 MB HTML file) by replacing every <a href> (only that attribute, and only in those tags) with a static value. It was compared against a few existing and popular HTML parsers (only tokenization mode was used for a fair comparison, so that they don't need to build an AST and so on), with timings measured in milliseconds over 100 iterations (lazy mode means that raw strings are used whenever possible, while the other mode serializes each token just for comparison); the timing table is reproduced in the original post. The results show that the LazyHTML parser is around an order of magnitude faster.

That concludes the first post in their series on HTML rewriters at Cloudflare. The next post describes how they built a new streaming rewriter on top of the ideas of LazyHTML. The major update was to provide an easier to use CSS selector API. It provides the back-end for the Cloudflare Workers HTMLRewriter JavaScript API.

Company
Cloudflare

Date published
Nov. 28, 2019

Author(s)
Andrew Galloni, Ingvar Stepanyan

Word count
4169

Language
English

Hacker News points
1


By Matt Makai. 2021-2024.