In this post I don't want to outline all the changes that happened from v0.7 to the latest, v0.8, release. I want to start by emphasizing that the name of the DOM namespace had to be changed to Dom. I know that this may cause some confusion and maybe even frustration, but it is better to do it now than for the final v1.0 release.
That being said, the DocumentBuilder is still alive. It will probably stay alive and make it into the v1.0 release, although that is not certain at the moment; the builder is still the easiest way to construct a document. However, for v0.9, the IBrowsingContext will be more interesting. The ideal way will probably involve creating a class that inherits from Context or just implements IBrowsingContext. The exact behavior is still to be specified.
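For code written against v0.7, the rename mostly means updating the using directives. The following is a minimal sketch, assuming the v0.8 NuGet package is referenced and that DocumentBuilder.Html accepts the markup as a string; the class name, method name and markup are made up for illustration.

```csharp
using System;
using AngleSharp;
using AngleSharp.Dom; // was AngleSharp.DOM up to v0.7

public static class BuilderExample
{
    // Hypothetical helper: parses the markup and counts the list items.
    public static int CountItems(string html)
    {
        // Assumption: DocumentBuilder.Html(string) as available in v0.8.
        IDocument document = DocumentBuilder.Html(html);
        return document.QuerySelectorAll("li").Length;
    }

    public static void Main()
    {
        Console.WriteLine(CountItems("<ul><li>One<li>Two<li>Three</ul>"));
    }
}
```

Since the builder may not survive unchanged beyond v0.8, keeping calls like this behind a small helper should make a later switch to a browsing context less painful.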
Fixed issues
So let's have a look at the great fixes that made it to v0.8:
- v0.8 nearly doubled the number of unit tests. The new tests also cover many CSS selectors, which resulted in fixing (or speeding up) some of them. It turned out that the nth-child selector(s) had not been working as they should. Now they do! See issue
- AngleSharp fixed a memory leak that will be discussed in detail below. Now AngleSharp is also really lightweight in memory consumption. See issue
- Encoding, especially with Chinese encodings such as GB18030, was problematic. Now AngleSharp handles all these cases correctly! See issue
- Another encoding issue involved content wrongly detected as Shift_JIS (it was in fact UTF-8). Now AngleSharp can handle such edge cases as well! See issue
- A memory leak due to leaving the response open has been handled. See issue
Finished features
AngleSharp v0.8 also fully implements the Url type (including relative path / scheme detection, directory movements and more). Every link can now be normalized and trusted. Besides, we now have full HTML5 constraint form validation, i.e., attributes such as required are correctly interpreted depending on the respective input type. All HTML5 input types (including datetime and datetime-local) are implemented according to the official specification. Their stepping behavior is also included.
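To make "directory movements" and "scheme detection" a bit more concrete, here is a small sketch of what resolving a reference against a base address looks like. Note that it deliberately uses System.Uri from the .NET base class library as a stand-in and not AngleSharp's own Url type, whose exact API is not shown in this post; the helper name and addresses are invented.

```csharp
using System;

public static class UrlExample
{
    // Resolves a reference against a base address with System.Uri,
    // used here only as a stand-in for AngleSharp's Url type.
    public static string Resolve(string baseAddress, string reference)
    {
        var baseUri = new Uri(baseAddress);
        return new Uri(baseUri, reference).AbsoluteUri;
    }

    public static void Main()
    {
        var page = "http://example.com/a/b/index.html";

        // Directory movements ("..", ".") are collapsed during resolution.
        Console.WriteLine(Resolve(page, "../c/./page.html?x=1"));

        // A reference that carries its own scheme replaces the base entirely.
        Console.WriteLine(Resolve(page, "https://other.example/x"));
    }
}
```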
Besides all those DOM features, the parser has been optimized again. It can certainly compete with other solutions. To guarantee not only speed but also memory efficiency, a memory leak has been identified and fixed. It turned out that the code in AngleSharp was only part of the problem. What happened exactly?
Memory leak
The .NET garbage collector does not seem to recognize unconnected documents as being disposable. In the end that does not matter much. What we can do is dispose the object manually. However, it turns out that this was not sufficient. The connected DOM had to be killed explicitly. So now I am removing all nodes that have been placed on the Document. That helped a little.
What surprised me was that every spawned task that had a continuation task was the source of a massive memory leak. Once a DOM node had a continuation task, it could no longer be freed, which in turn kept other nodes from being freed, thus starting a vicious circle of high memory consumption.
After just 3 minutes of runtime the memory consumption easily exceeds 100 MB. We also scratch the 100 MB mark if we explicitly dispose the document. In that case the following trend can be observed.
Of course, this is still unacceptable, especially if compared to other solutions like HAP.
In the screenshot above HAP only requires 32 MB after a few minutes. Disconnecting the elements and replacing all continuation tasks with elegant async solutions that use await resolved basically all of those leaks.
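The kind of rewrite meant here can be sketched as follows. This only illustrates the shape of the change, not AngleSharp's internal code, and it does not reproduce the leak itself; LoadAsync and both Measure methods are hypothetical names.

```csharp
using System;
using System.Threading.Tasks;

public static class ContinuationExample
{
    // Hypothetical stand-in for an asynchronous resource load.
    public static Task<string> LoadAsync(string address)
    {
        return Task.FromResult("content of " + address);
    }

    // Before: the result is processed in an explicit continuation task.
    public static Task<int> MeasureWithContinuation(string address)
    {
        return LoadAsync(address).ContinueWith(task => task.Result.Length);
    }

    // After: the same logic expressed with async and await.
    public static async Task<int> MeasureWithAwait(string address)
    {
        var content = await LoadAsync(address);
        return content.Length;
    }

    public static void Main()
    {
        var before = MeasureWithContinuation("http://example.com").Result;
        var after = MeasureWithAwait("http://example.com").Result;

        // Both variants compute the same value.
        Console.WriteLine(before == after);
    }
}
```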
In the end we now have a performance that is practically in the same ballpark as the behavior experienced with HAP. However, it is still required to call the Dispose() method manually, or to use a using construct.
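In practice the using variant could look like the sketch below. It assumes, as above, that DocumentBuilder.Html takes the markup as a string and that the returned document can be disposed; the class name, method name and markup are placeholders.

```csharp
using System;
using AngleSharp;

public static class DisposeExample
{
    // Hypothetical helper: counts paragraphs and releases the document afterwards.
    public static int CountParagraphs(string html)
    {
        // Assumption: DocumentBuilder.Html(string) as available in v0.8.
        using (var document = DocumentBuilder.Html(html))
        {
            return document.QuerySelectorAll("p").Length;
        }
        // Leaving the using block calls Dispose() on the document,
        // even if the body throws or returns early.
    }

    public static void Main()
    {
        Console.WriteLine(CountParagraphs("<p>Hello</p><p>World</p>"));
    }
}
```

Only plain values (here an int) leave the block; nodes taken from the document should not be used after it has been disposed.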
Now AngleSharp only consumes 33 MB after 3 minutes, which is basically the same as HAP, but with a far more complete and accurate DOM, events, observers and much more.
Upcoming developments
The road to v0.9 will also include formal changes. Starting with v0.8 there will be NuGet preview releases, which will be available on a daily or weekly basis. Perhaps these pre-releases will be coupled to the CI system (which is AppVeyor in this case).
Coming with v0.9 or earlier will be a move of the repository. The current AngleSharp repository will be moved to an organization called AngleSharp (surprise!). The repository will then also be split up. Basically every repository will represent a solution and these solutions will communicate via NuGet releases only.
The AngleSharp.Scripting project for instance will become AngleSharp.Scripting.JavaScript, which will be placed in the AngleSharp.Scripting repository. The main idea behind this restructuring is of course to bring in more structure and provide an easier overview for newcomers. Also it should be pretty obvious that parts like the JavaScript engine integration are external projects, which build upon the core AngleSharp project.
As always, AngleSharp is looking for contributors and welcomes every contribution. Feel free to write to me any time if you are interested.