In this post I don't want to outline all the changes that happened from v0.7 to the latest release, v0.8. I want to start by emphasizing that the DOM namespace had to be renamed to Dom. I know that this may cause some confusion and maybe even frustration, but it is better to do it now than to do it for the final v1.0 release.
That being said, the DocumentBuilder is still alive. It will probably stay alive and make it into the v1.0 release, although that is not certain yet; for now the builder is still the easiest way to construct a document. However, for v0.9, the IBrowsingContext will be more interesting. The ideal way will probably involve creating a class that inherits from Context or just implements IBrowsingContext. The exact behavior is still to be specified.
So let's have a look at the great fixes that made it to v0.8:
- v0.8 nearly doubled the number of unit tests. The new tests also cover many CSS selectors, which resulted in fixing (or speeding up) some of them. It turned out that the nth-child selector(s) had not been working as they should. Now they do! See issue
- AngleSharp fixed a memory leak that will be discussed in detail below. Now AngleSharp is also really lightweight in memory consumption. See issue
- Handling of encodings, especially Chinese encodings such as GB18030, was problematic. Now AngleSharp handles all these cases correctly! See issue
- Another encoding issue involved content wrongly detected as Shift_JIS that was in fact UTF-8. Now AngleSharp can handle such edge cases as well! See issue
- A memory leak due to leaving the response open has been handled. See issue
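To make the nth-child point above more concrete, here is a minimal sketch of the matching rule behind :nth-child(an+b): an element matches if its 1-based position among its siblings equals a*n + b for some integer n >= 0. This is not AngleSharp's implementation; the NthChild class and its method names are hypothetical and only illustrate the arithmetic that such unit tests have to cover.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of AngleSharp.
public static class NthChild
{
    // True if the 1-based sibling position `index` equals a*n + b
    // for some integer n >= 0.
    public static bool Matches(int a, int b, int index)
    {
        if (a == 0)
            return index == b;

        var diff = index - b;
        // n = diff / a must be a whole, non-negative number.
        return diff % a == 0 && diff / a >= 0;
    }

    // Returns the 1-based positions (out of `count` siblings) that match.
    public static List<int> Select(int a, int b, int count)
    {
        var result = new List<int>();

        for (var i = 1; i <= count; i++)
        {
            if (Matches(a, b, i))
                result.Add(i);
        }

        return result;
    }
}
```

For instance, NthChild.Select(2, 1, 5) corresponds to :nth-child(2n+1) and yields the positions 1, 3 and 5, while NthChild.Select(-1, 3, 5) corresponds to :nth-child(-n+3) and yields the first three positions.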
AngleSharp v0.8 also fully implemented the Url type (including relative path / scheme detection, directory movements and more). Every link can now be normalized and trusted. Besides, we now have full HTML5 constraint form validation, i.e., attributes such as required are correctly interpreted depending on the respective input type. All HTML5 input types (including datetime-local) are implemented according to the official specification. Their stepping behavior is also included.
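To illustrate what normalizing a link with relative paths and directory movements means, the following sketch uses the System.Uri class from the .NET base class library as a stand-in. It does not use AngleSharp's own Url type, whose exact API is not shown in this post; the UrlNormalization class and the example addresses are assumptions made for illustration.

```csharp
using System;

// Hypothetical helper built on System.Uri, not on AngleSharp's Url type.
public static class UrlNormalization
{
    // Resolves a (possibly relative) reference against an absolute base
    // address and returns the normalized absolute address.
    public static string Resolve(string baseAddress, string reference)
    {
        var baseUri = new Uri(baseAddress, UriKind.Absolute);
        var resolved = new Uri(baseUri, reference);
        return resolved.AbsoluteUri;
    }
}
```

With the base address http://example.com/a/b/c.html, the reference ../d.html resolves to http://example.com/a/d.html, and an already absolute reference such as https://other.example/x.html is returned unchanged.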
Besides all those DOM features, the parser has been optimized again. It can certainly compete with other solutions. To guarantee not only speed but also memory efficiency, a memory leak has been identified and fixed. It turned out that the code in AngleSharp was only part of the problem. What happened exactly?
The .NET garbage collector does not seem to recognize unconnected documents as being disposable. In the end that does not matter much. What we can do is to dispose of the object manually. However, it turned out that this was not sufficient. The connected DOM had to be killed explicitly. So now I am removing all nodes that have been placed on the Document. That helped a little.
What surprised me was that every spawned task that had a continuation task was the source of a massive memory leak. Once a DOM node had a continuation task, it could not be freed any more, which in turn prevented other nodes from being freed, thus starting a vicious circle of high memory consumption.
After just 3 minutes of runtime the memory consumption easily exceeds 100 MB. We also scratch the 100 MB mark if we explicitly dispose the document. In that case the following trend can be observed.
Of course, this is still unacceptable, especially if compared to other solutions like HAP.
In the screenshot above HAP only requires 32 MB after a few minutes. Fixing the issue by disconnecting the elements and replacing all continuation tasks with elegant async solutions that use await resolved basically all those leaks.
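The difference between the two styles mentioned above can be sketched as follows. This is only an illustration of the code shapes involved, not the actual AngleSharp code, and it does not attempt to reproduce the leak; LoadAsync and the ContinuationStyles class are hypothetical stand-ins for an asynchronous resource load.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical example, not taken from AngleSharp.
public static class ContinuationStyles
{
    // Stand-in for an asynchronous load of some resource.
    private static async Task<string> LoadAsync(string address)
    {
        await Task.Delay(10);
        return "content of " + address;
    }

    // Explicit continuation: a separate delegate is attached to the task
    // and runs once the task has completed.
    public static Task<int> LengthWithContinuation(string address)
    {
        return LoadAsync(address).ContinueWith(t => t.Result.Length);
    }

    // The same logic written with async / await.
    public static async Task<int> LengthWithAwait(string address)
    {
        var content = await LoadAsync(address);
        return content.Length;
    }
}
```

Both methods produce the same result; the await version simply reads like sequential code and leaves the wiring of the continuation to the compiler.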
In the end, memory consumption is now practically in the same ballpark as with HAP. However, it is still necessary to make the manual call to the
Dispose() method, or to use a
Now AngleSharp only consumes 33 MB after 3 minutes, which is basically the same as HAP, but with a far more complete and accurate DOM, events, observers and much more.
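The manual disposal described above follows the usual IDisposable pattern in C#. The sketch below uses a hypothetical SampleDocument class rather than a real AngleSharp document, and assumes (as an illustration of the fix described earlier) that disposing drops all node references. It shows that a using statement calls Dispose() automatically when the block is left.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a document; not an AngleSharp type.
public sealed class SampleDocument : IDisposable
{
    private readonly List<string> _nodes = new List<string>();

    public bool IsDisposed { get; private set; }

    public int NodeCount
    {
        get { return _nodes.Count; }
    }

    public void Append(string node)
    {
        if (IsDisposed)
            throw new ObjectDisposedException("SampleDocument");

        _nodes.Add(node);
    }

    // Drops all node references so that nothing keeps them alive.
    public void Dispose()
    {
        _nodes.Clear();
        IsDisposed = true;
    }
}

public static class DisposeDemo
{
    public static SampleDocument Run()
    {
        SampleDocument captured;

        using (var document = new SampleDocument())
        {
            document.Append("html");
            document.Append("body");
            captured = document;
        } // Dispose() is called here, even if the block throws.

        return captured;
    }
}
```

After DisposeDemo.Run() returns, the captured instance reports IsDisposed as true and NodeCount as 0, which is the same state an explicit call to Dispose() would leave behind.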
The road to v0.9 will also include formal changes. Starting with v0.8 there will be NuGet preview releases, which will be available on a daily or weekly basis. Perhaps these pre-releases will be coupled to the CI system (which is AppVeyor in this case).
Coming with v0.9 or earlier will be a move of the repository. The current AngleSharp repository will be moved to an organization called AngleSharp (surprise!). The repository will then also be split up. Basically every repository will represent a solution and these solutions will communicate via NuGet releases only.
As always, AngleSharp is looking for contributors and welcomes every contribution. Feel free to write to me any time if you are interested.