AngleSharp v0.6 released

The latest release of AngleSharp marks another milestone.

The AngleSharp project is still ongoing. The latest release brings a lot of new features. The API has also been changed completely, to make room for a future-proof, robust and extensible version.

Major bug fixes

Some bugs have been fixed. The number was quite limited, which was expected, as a lot of unit tests ensure standards conformance and robustness. Nevertheless, as the code base is huge and the standard is complex, bugs are always possible.

The most annoying bug had to do with overlong character references. A character reference starts with an ampersand. It ends as soon as a non-name character (in particular a semicolon) is found. Then the longest valid character reference that matches is chosen. Unfortunately, some webpages apparently contain character references that go far beyond the 31 characters of the longest allowed character reference.
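The matching rule described above can be sketched in a few lines. This is only an illustration, not the code used by AngleSharp: the class name CharacterReferences, the method Resolve and the tiny entity table are made up for this example, and the cap of 31 name characters is taken from the figure quoted above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of AngleSharp.
public static class CharacterReferences
{
    // Tiny illustrative table; the real table of named references is far larger.
    private static readonly Dictionary<string, string> Entities = new Dictionary<string, string>
    {
        { "amp", "&" },
        { "amp;", "&" },
        { "not", "\u00AC" },
        { "not;", "\u00AC" },
        { "notin;", "\u2209" },
    };

    // Assumption: never collect more name characters than the longest
    // allowed reference has (31, as stated in the text).
    private const int MaxNameLength = 31;

    private static bool IsNameChar(char c)
    {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    // Resolves the reference whose name starts at `start` (the position right
    // after the ampersand). Returns the replacement text, or null if nothing
    // matches; `consumed` is the number of characters that belong to the match.
    public static string Resolve(string input, int start, out int consumed)
    {
        var end = start;

        // Collect name characters, but stop at the cap so that overlong
        // references cannot make the candidate grow without bound.
        while (end < input.Length && end - start < MaxNameLength && IsNameChar(input[end]))
            end++;

        // A terminating semicolon is part of the candidate.
        if (end < input.Length && input[end] == ';')
            end++;

        // Try the longest candidate first and shrink until a name matches.
        for (var length = end - start; length > 0; length--)
        {
            string value;

            if (Entities.TryGetValue(input.Substring(start, length), out value))
            {
                consumed = length;
                return value;
            }
        }

        consumed = 0;
        return null;
    }
}
```

With this sketch, the input "&notit;" resolves to the "not" entry and consumes three characters, while an ampersand followed by a hundred letters is rejected after looking at no more than 31 of them.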

Furthermore, bugs in the existing Location object, the CSSOM and other places have been identified and fixed.

New API

The new API of AngleSharp relies completely on interfaces. If an implementation is useful to the outside, it is public. An example is the Configuration class. It can be used directly, e.g., by inheriting from it and redefining only a subset of the provided methods. Alternatively, one can start from scratch by implementing IConfiguration. This is now up to the user to decide.

The whole DOM is represented only through interfaces. This makes sense, as the W3C specifies the whole API in IDL form. Attributes and interfaces carry the information from IDL over to C#. In addition, my first concept of making only very small changes to the official API naming is obsolete. I found that exposing the same API makes sense, at least for providing the same functionality in, e.g., JavaScript, while a C# naming that stays close to the one provided by the W3C is not so good. The new concept is to give everything either the W3C name adjusted to .NET conventions (if the name is good, or good enough) or a new name that is much closer to the .NET conventions.

An example: the innerHTML property of an Element is now the InnerHtml property of the IElement interface. The W3C name is good; only the upper- and lowercase usage has been adjusted. In contrast, the bubbles property of an Event object has been renamed to IsBubbling (on the IEvent interface). Most boolean properties start with "Is".
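To make the idea more concrete, here is a minimal sketch of how the official IDL name could travel along with the .NET name. Only IElement, IEvent, InnerHtml and IsBubbling come from the text above. The IdlNameAttribute and the IdlNames helper are hypothetical stand-ins for this example, and the interfaces are reduced to a single member each.

```csharp
using System;

// Hypothetical attribute that records the official (IDL) name of a member.
[AttributeUsage(AttributeTargets.Interface | AttributeTargets.Property | AttributeTargets.Method)]
public sealed class IdlNameAttribute : Attribute
{
    public IdlNameAttribute(string officialName)
    {
        OfficialName = officialName;
    }

    public string OfficialName { get; private set; }
}

[IdlName("Element")]
public interface IElement
{
    // W3C name kept, casing adjusted to .NET conventions.
    [IdlName("innerHTML")]
    string InnerHtml { get; set; }
}

[IdlName("Event")]
public interface IEvent
{
    // W3C name replaced by a name closer to .NET conventions.
    [IdlName("bubbles")]
    bool IsBubbling { get; }
}

// Hypothetical helper: finds the .NET property that carries a given official name.
public static class IdlNames
{
    public static string FindClrName(Type type, string officialName)
    {
        foreach (var property in type.GetProperties())
        {
            var attribute = (IdlNameAttribute)Attribute.GetCustomAttribute(property, typeof(IdlNameAttribute));

            if (attribute != null && attribute.OfficialName == officialName)
                return property.Name;
        }

        return null;
    }
}
```

A scripting engine could use such a lookup to expose bubbles to JavaScript, while C# code keeps working with IsBubbling.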

In general, the new API allows (re-)implementing some of the existing interfaces. Plugging such implementations in should also work in general; however, some parts depend on internals. This may be addressed in future versions, depending on usage, problems and solutions.

Features

AngleSharp v0.6 uses a new class called TextSource for accessing the source code. It partially replaces the SourceManager. The remains of the previous solution are now integrated directly into the BaseTokenizer, which is used by the HTML and CSS tokenizers.

Why this new way? The TextReader worked reasonably well, but in general it was unreliable and hard to control. There was no going back, and it was impossible to set a buffer limit (or to access the buffer). The only way to change, e.g., the encoding was to throw away the old reader and create a new instance. Now everything is under our control, and we can work with the text directly through the new class, which operates on a (network) stream or on a fixed (finished) source.

This makes it possible, e.g., to write to the source while it is being processed. A feature that uses this is the Write (and also the WriteLine) method of the Document class.
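The following sketch illustrates the two capabilities mentioned above: going back, and writing into the source while reading it. It is a simplified stand-in and not the actual TextSource. The class name RewindableSource and all of its members are assumptions made for this example, and stream decoding and encoding changes are left out entirely.

```csharp
using System;
using System.Text;

// Hypothetical, in-memory stand-in for a controllable text source.
public sealed class RewindableSource
{
    // Returned once the buffered content is exhausted.
    public const char EndOfContent = '\uffff';

    private readonly StringBuilder _buffer;
    private int _index;

    public RewindableSource(string initial)
    {
        _buffer = new StringBuilder(initial);
    }

    // The current read position inside the buffer.
    public int Index
    {
        get { return _index; }
    }

    // Reads the next character and advances the position.
    public char Read()
    {
        return _index < _buffer.Length ? _buffer[_index++] : EndOfContent;
    }

    // Steps back by one character, which a plain TextReader cannot do.
    public void Back()
    {
        if (_index > 0)
            _index--;
    }

    // Inserts text at the read position, so that it is read next.
    // This is the kind of operation a document.write call needs.
    public void Insert(string text)
    {
        _buffer.Insert(_index, text);
    }
}
```

Because the buffer is owned by the class, a tokenizer reading from it sees inserted text immediately, without having to create a new reader.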

With v0.6 nearly the whole CSSOM is included. It is still too rough (and too early) to work with, though. I will most probably do some major rewriting for v0.7, but we can already see where this is going. The biggest problem at this point is the distinction between raw values (color, angle, number, percentage, ...) and CSS values (which could be just raw values, or more complicated ones, such as a computed value). Right now the outcome is quite mixed, and I will try to make it simple, clean and easy to work with.

Finally (and this is probably the greatest new feature) I managed to integrate the API for adding and removing scripting and styling engines. There will be more such integration points (e.g., storage access, ...) in the future. Right now this means that it is possible to register a scripting engine, for instance one for JavaScript files. The only warning is that execution is currently implemented in a very limited fashion. There are many different scenarios, and only one particular scenario has been implemented (non-async, inline ...). Nevertheless, the integration is available, and the other scenarios may be implemented in v0.6.x (or in v0.7.0 at the latest).

Styling is already completely in there, which means that you can easily integrate other styling engines in addition to the default CSS engine (which is registered by default). A word of warning: currently there are two separate options. One determines WHICH engine(s) are available, the other IF an engine may be used. Therefore, just providing, e.g., a scripting engine is not sufficient. Scripting also needs to be activated.
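The difference between the two options can be shown with a small model. None of the names below are taken from the AngleSharp API: IScriptEngineSketch, RecordingEngine and OptionsSketch are hypothetical and exist only to show that registering an engine and activating scripting are separate steps.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical engine contract for this sketch.
public interface IScriptEngineSketch
{
    string Type { get; }
    void Evaluate(string source);
}

// Hypothetical engine that only records what it was asked to run.
public sealed class RecordingEngine : IScriptEngineSketch
{
    public readonly List<string> Evaluated = new List<string>();

    public string Type
    {
        get { return "text/javascript"; }
    }

    public void Evaluate(string source)
    {
        Evaluated.Add(source);
    }
}

// Hypothetical options object with the two independent settings.
public sealed class OptionsSketch
{
    private readonly List<IScriptEngineSketch> _engines = new List<IScriptEngineSketch>();

    // IF engines may be used at all.
    public bool IsScripting { get; set; }

    // WHICH engines are available.
    public void Register(IScriptEngineSketch engine)
    {
        _engines.Add(engine);
    }

    // Returns an engine only if scripting is active and the type is registered.
    public IScriptEngineSketch GetEngine(string type)
    {
        if (!IsScripting)
            return null;

        return _engines.FirstOrDefault(
            e => string.Equals(e.Type, type, StringComparison.OrdinalIgnoreCase));
    }
}
```

In this model, a registered engine stays invisible until IsScripting is set to true, which mirrors the warning above.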
