AngleSharp is definitely a big project that takes a huge chunk of my scarce spare time. In the beginning the project was aiming at something slightly different, but it has become apparent that AngleSharp should be a browser core, in the best case a completely interchangeable, modularized, and lightweight one. Right now AngleSharp is on its way to fulfilling some of these goals, but it is definitely not lightweight any more.
One of the problems is that the most important piece of the original AngleSharp project, the HTML parser, is just a very small part of the current AngleSharp library. The whole CSSOM is huge and probably takes up more than 50% of the core. Therefore the project should be separated into parts such as:
- The core AngleSharp library, defining interfaces and core functionality, such as the configuration system and the browsing context.
- The whole DOM with all algorithms and elements.
- The whole CSSOM that includes converters, properties, and special rules.
- The HTML / XML parser.
- The CSS parser.
But this separation has some severe problems. Let's consider the DOM, for instance. One important method (which is also part of what makes AngleSharp so popular) is the implementation of `querySelectorAll`. However, this method takes a CSS selector, which needs to be parsed. Hence the DOM needs to know about a CSS parser. On some occasions we also need the HTML parser, e.g., when we change the value of the `innerHTML` property. The parser, on the other hand, also has to know some things about the DOM. In special instances it needs to know specific methods or properties, in other scenarios just the constructor or the factory to call. Either way, there are some connection points that are hard to decouple.
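The connection points described above can be made concrete with a small sketch. It is written in TypeScript purely for illustration, and every name in it (`SelectorParser`, `FragmentParser`, `Services`, `ElementNode`) is hypothetical rather than part of AngleSharp. It assumes the DOM only sees the two parsers through interfaces, and the HTML parser only sees the DOM through a factory callback:

```typescript
// Hypothetical names throughout; this is not AngleSharp's actual API.

// What the DOM needs from a CSS parser: selector text in, predicate out.
interface SelectorParser {
  parse(selector: string): (el: ElementNode) => boolean;
}

// What the DOM needs from an HTML parser: markup in, child nodes out.
// The parser creates nodes only through the factory it is handed.
interface FragmentParser {
  parse(markup: string, create: (tag: string) => ElementNode): ElementNode[];
}

// Both services are bundled so that a node carries a single reference.
interface Services {
  selectors: SelectorParser;
  fragments: FragmentParser;
}

class ElementNode {
  children: ElementNode[] = [];

  constructor(readonly tagName: string, private readonly services: Services) {}

  // Depends on the CSS parser, but only through the interface.
  querySelectorAll(selector: string): ElementNode[] {
    const matches = this.services.selectors.parse(selector);
    const result: ElementNode[] = [];
    const visit = (node: ElementNode): void => {
      for (const child of node.children) {
        if (matches(child)) result.push(child);
        visit(child);
      }
    };
    visit(this);
    return result;
  }

  // Depends on the HTML parser; the parser in turn needs a node factory.
  set innerHTML(markup: string) {
    this.children = this.services.fragments.parse(
      markup,
      (tag) => new ElementNode(tag, this.services),
    );
  }
}

// Toy implementations, only good enough to exercise the wiring.
const toyServices: Services = {
  selectors: {
    // Supports plain type selectors such as "p" and nothing else.
    parse: (selector) => (el) => el.tagName === selector.trim().toLowerCase(),
  },
  fragments: {
    // Treats every opening tag as a flat, childless element.
    parse: (markup, create) =>
      (markup.match(/<[a-z][a-z0-9]*/gi) || []).map((t) =>
        create(t.slice(1).toLowerCase()),
      ),
  },
};

const root = new ElementNode("div", toyServices);
root.innerHTML = "<p>one</p><span>two</span><p>three</p>";
const paragraphs = root.querySelectorAll("p");
```

Even in this toy version, every call to `querySelectorAll` and every `innerHTML` assignment goes through an interface, and every node the parser creates goes through a callback. Those indirections are exactly the abstraction points whose cost has to be weighed.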
Of course decoupling these dependencies is possible. However, if we think about performance then we may have to reconsider. Decoupling is only possible via some abstraction, and unfortunately the mechanisms in the .NET Framework are far from ideal. These are not transparent layers. We will pay the cost for every abstraction. Delaying the construction of each of 10k DOM nodes by just 100 cycles adds a million cycles in total, which is up to a millisecond depending on the clock rate. This may sound like nitpicking, but some pages have many more nodes, and we may have to pay even more than 100 cycles per node.
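The arithmetic behind that estimate can be written down as a tiny helper. This is only a back-of-the-envelope sketch: `overheadMs` is a hypothetical function, and the clock rates are assumptions, since the text does not name one.

```typescript
// Extra time in milliseconds caused by a fixed per-node overhead.
// clockHz is an assumed clock rate; the text above does not specify one.
function overheadMs(nodes: number, cyclesPerNode: number, clockHz: number): number {
  return (nodes * cyclesPerNode * 1000) / clockHz;
}

// 10k nodes at 100 extra cycles each is one million cycles in total.
const at1GHz = overheadMs(10000, 100, 1e9); // 1 ms
const at3GHz = overheadMs(10000, 100, 3e9); // about a third of a millisecond
```

The overhead scales linearly in both the node count and the cycles per node, so a page with ten times as many nodes, or an abstraction costing ten times as many cycles, pushes the figure up tenfold.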
So what should we do about such problems? Well, first of all I see a good chance of separating large parts of the CSSOM, including the CSS parser, from the rest of the library. I also see a good chance of separating the core library. In the end we would then have a core library, something like `AngleSharp.Dom`, and another library, `AngleSharp.Css`. Obviously this is just a first sketch and unlikely to be realized in exactly this format, but the basic direction is definitely right.