AngleSharp is still far from finished, but every release adds more features. The 0.6 versions mark quite a dramatic change. The whole API has been polished, and the core implementations for appending, removing, and otherwise manipulating elements have been rewritten so that they all follow the WHATWG algorithms. The parser, in turn, now bypasses these implementations, since it already knows exactly what to do (as one might guess).
The parser also uses a different model for closing (finalizing) elements. The performance gain is not dramatic, but it is certainly nice to have. The following chart compares the current performance against other C# solutions for parsing HTML content.
While HtmlAgilityPack should be avoided (no updates for quite a while, poor performance, no CSS selectors, not HTML5 compliant), CsQuery is a nice project. The API is really fun to use, and it contains practically everything needed for parsing plain HTML. However, it does not have a CSS parser, and it does not offer any extension points for building more advanced applications on top of it.
The core of the library will also shift a bit. The whole DocumentBuilder approach is historic and will probably be replaced by BrowsingContext. One will then either talk to a very minimalistic implementation or provide a custom implementation of IBrowsingContext. In the end, this will enable a lightweight version that only parses basic HTML (with no configuration, i.e. no CSS, no external requests, etc.) without any additional setup.
Version v0.7.0 will be released in September (or early October). It will be another major milestone on the way to v1.
It turns out that CsQuery can be faster when its internal indexing is not active. While indexing could be considered a crucial feature of the library, it is only fair to deactivate it when testing the parser alone. In the end I also deactivated CSS parsing in AngleSharp (well, I can easily justify that).
The following code is now used for parsing HTML using CsQuery.
    var factory = new ElementFactory(DomIndexProviders.Simple);

    // Parse the HTML from a UTF-8 encoded stream.
    using (var stream = html.ToStream())
    {
        var document = factory.Parse(stream, Encoding.UTF8);
    }
The creator of CsQuery, James Treworgy, wrote me an email with some interesting information, including a nice performance comparison he conducted. AngleSharp still shows acceptable performance, but the following pieces need additional work:
- The HTML parser. I already made it faster, but since I tried to stay really close to the algorithms as described, there might still be a lot of potential for performance gains. However, one needs to get creative here.
- The CSS selectors. I was quite fond of them, but indexing seems to be the key to scaling to larger DOM trees with thousands of nodes. I will most probably come up with a lazily loaded solution that takes the DOM size into account.
- The CSS parser. Okay, that one is not being tested in the scenario directly, but at some point it has to be evaluated and probably improved as well.
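To make the idea of a lazily loaded index that takes the DOM size into account more concrete, here is a minimal sketch of one possible shape. It is not AngleSharp code: the `Node` and `LazyTagIndex` types, the tag-name-only lookup, and the threshold of 1000 nodes are all assumptions made for illustration, and the sketch ignores DOM mutations after the index has been built.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a DOM element; not an AngleSharp type.
sealed class Node
{
    public string TagName { get; }
    public List<Node> Children { get; } = new List<Node>();

    public Node(string tagName) { TagName = tagName; }

    // Depth-first enumeration of this node and all of its descendants.
    public IEnumerable<Node> DescendantsAndSelf()
    {
        yield return this;
        foreach (var child in Children)
            foreach (var node in child.DescendantsAndSelf())
                yield return node;
    }
}

// Answers tag-name queries. Small trees are scanned directly; for trees with
// more nodes than the threshold, an index is built on the first query.
sealed class LazyTagIndex
{
    private readonly Node _root;
    private readonly Lazy<bool> _isLarge;
    private readonly Lazy<Dictionary<string, List<Node>>> _index;

    public LazyTagIndex(Node root, int threshold = 1000) // threshold is an assumed value
    {
        _root = root;
        // Bounded check: stops enumerating once threshold + 1 nodes were seen.
        _isLarge = new Lazy<bool>(() => _root.DescendantsAndSelf().Skip(threshold).Any());
        _index = new Lazy<Dictionary<string, List<Node>>>(() =>
            _root.DescendantsAndSelf()
                 .GroupBy(n => n.TagName)
                 .ToDictionary(g => g.Key, g => g.ToList()));
    }

    // True once the index has actually been built.
    public bool IsIndexed => _index.IsValueCreated;

    public IReadOnlyList<Node> ByTagName(string tagName)
    {
        if (!_isLarge.Value)
        {
            // Small tree: a plain linear scan, no index is ever created.
            return _root.DescendantsAndSelf().Where(n => n.TagName == tagName).ToList();
        }

        // Large tree: build the index on first use, then look up directly.
        return _index.Value.TryGetValue(tagName, out var hits) ? hits : new List<Node>();
    }
}

static class Program
{
    static void Main()
    {
        var root = new Node("div");
        for (var i = 0; i < 3; i++)
            root.Children.Add(new Node("p"));

        var small = new LazyTagIndex(root);               // 4 nodes, below the default threshold
        Console.WriteLine(small.ByTagName("p").Count);    // 3
        Console.WriteLine(small.IsIndexed);               // False

        var large = new LazyTagIndex(root, threshold: 2); // 4 nodes exceed a threshold of 2
        Console.WriteLine(large.ByTagName("p").Count);    // 3
        Console.WriteLine(large.IsIndexed);               // True
    }
}
```

The point of the sketch is only the decision logic: small documents never pay for building an index, while large documents pay once, on the first selector query. A real implementation would also need to keep the index in sync with DOM changes and cover more than tag names.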
On the way to v0.7 I will not attempt any further performance work. Performance will be considered again after v0.7 has been published. As long as the performance is acceptable, I will prioritize implementing standardized features over chasing a 1% speedup.
The priority towards v1 is as follows:
- Implement missing features
- Come up with useful extension methods to make the DOM enjoyable in C# / .NET
- Improve the performance
I will post some updated / different performance charts soon.