The future of AngleSharp

The latest release v0.5 has been an important milestone for the AngleSharp project.

AngleSharp is a fantastic library for getting information about webpages. With the latest release v0.5, submitting forms, getting information from the CSSOM, and customizing the library's internals with dependency injection are all possible. What is upcoming?

The next release v0.6 officially focuses on three major points:

  • CSS model implemented (e.g. getComputedStyle works)
  • Draft interfaces for optional resource and rendering defined
  • Most important parts of HTML DOM implemented

The CSSOM is not fully implemented in v0.5, so the first point means that the full CSS object model will be available in v0.6. Even better, with v0.6 it will be possible to get computed style information for specific nodes in the HTML DOM. This works by emulating a browser: one will be able to set properties like the height or width of a viewport, which will then be used by a style computation algorithm to deduce style information.

This implicitly makes use of another important feature in v0.6: interfaces to define a browser rendering context / environment. Resources also need to be downloaded (optionally, based on the same options that are already available in v0.5). A resource might be an image, for example, in which case its natural width and height can be determined.

Finally, this will complete some parts of the HTML DOM that have only been waiting for these pieces to be in place. But wait... there will be more things in v0.6!

DOM interfaces

AngleSharp has tried to follow the W3C specification accurately. Some people liked this approach, others did not fall in love with it. AngleSharp v0.6 could bring a breaking change: the naming of most DOM classes will change to a custom name (which could be the same as, or similar to, the current one). Additionally, the official DOM API will be covered by interfaces, which will be decorated with attributes that carry the official names.

Right now there are some interfaces; however, they are only internal and not very useful. They look like the following snippet:

interface INode
{
	String BaseURI { get; }
	Node CloneNode(Boolean deep = true);
	Boolean Contains(Node otherNode);
	Node FirstChild { get; }
	Boolean HasAttributes { get; }
	Boolean HasChildNodes { get; }
	/* ... */
}

So we do use the original names, only starting with an uppercase letter instead of a lowercase one. To make things worse, the interface is already strongly coupled to the concrete class, i.e. it uses Node instead of INode. Of course there have been reasons for this, but it renders the whole interface pointless.

In one of the next releases (hopefully v0.6) this will change. AngleSharp will only return interfaces. The code snippet from above would then look as follows:

[DOM("Node")]
interface INode
{
	[DOM("baseURI")]
	String BaseUri { get; }
	[DOM("cloneNode")]
	INode Clone(Boolean deep = true);
	[DOM("contains")]
	Boolean Contains(INode otherNode);
	[DOM("firstChild")]
	INode FirstChild { get; }
	[DOM("hasAttributes")]
	Boolean HasAttributes { get; }
	[DOM("hasChildNodes")]
	Boolean HasChildNodes { get; }
	/* ... */
}

The biggest advantage of this approach is that the official DOM API is itself specified only in terms of interfaces, not implementations, so AngleSharp will match it more closely. It also makes automatic wrapping of DOM classes possible, e.g. for using JavaScript libraries. This has been a huge pain with the current model.
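
To illustrate how such attributes could help with automatic wrapping, here is a minimal sketch. The DOMAttribute class, its OfficialName property and the DomNames.Resolve helper are assumptions made for this example; they are not taken from the AngleSharp code base. The idea is that a wrapper (e.g. a JavaScript bridge) receives the official name, such as baseURI, and uses reflection to find the .NET member that carries it.

```csharp
using System;
using System.Linq;
using System.Reflection;

// Hypothetical attribute that stores the official DOM name of a type or member.
[AttributeUsage(AttributeTargets.Interface | AttributeTargets.Method | AttributeTargets.Property)]
sealed class DOMAttribute : Attribute
{
	public DOMAttribute(String officialName)
	{
		OfficialName = officialName;
	}

	public String OfficialName { get; private set; }
}

// Reduced version of the interface from above, just enough for the lookup.
[DOM("Node")]
interface INode
{
	[DOM("baseURI")]
	String BaseUri { get; }
	[DOM("hasChildNodes")]
	Boolean HasChildNodes { get; }
}

// Hypothetical helper that a wrapper could use to translate names.
static class DomNames
{
	// Returns the member of the given type that is decorated with the
	// official DOM name, or null if no member carries that name.
	public static MemberInfo Resolve(Type type, String officialName)
	{
		return type.GetMembers().FirstOrDefault(member =>
			member.GetCustomAttributes(typeof(DOMAttribute), false)
				.Cast<DOMAttribute>()
				.Any(attribute => attribute.OfficialName == officialName));
	}
}
```

With this in place, DomNames.Resolve(typeof(INode), "baseURI") would find the BaseUri property, so the .NET side can use idiomatic names while a script still sees the official ones.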

Give me cookies!

Okay, so with v0.5 we can finally submit forms and everything works like a charm. The current page gets reloaded and we can do some automatic file uploading or whatever we are up to. In fact, this makes AngleSharp an excellent tool for unit testing server-generated webpages.

What is currently missing? Well, cookies would be nice (at least session cookies). Why? If we use AngleSharp to log into some webpage, the login only lasts for that one request. Since AngleSharp "forgets" the session (to be more explicit: throws away the session cookie), every following request is treated as if we had never logged in. AngleSharp v0.6 or later should be able to see the cookie, store it, and send it with the next request.
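
The store-and-resend cycle can be sketched with the CookieContainer class that already ships with .NET. The SessionCookies wrapper and its method names are hypothetical and only serve this example; how AngleSharp will actually integrate cookies is not decided here.

```csharp
using System;
using System.Net;

// Hypothetical helper, not part of AngleSharp: remembers cookies between requests.
sealed class SessionCookies
{
	readonly CookieContainer _container = new CookieContainer();

	// Stores the value of a Set-Cookie response header for the given address.
	public void Store(Uri address, String setCookieHeader)
	{
		_container.SetCookies(address, setCookieHeader);
	}

	// Returns the value for the Cookie request header of the next request.
	// The result is an empty string if no stored cookie matches the address.
	public String HeaderFor(Uri address)
	{
		return _container.GetCookieHeader(address);
	}
}
```

A requester would call Store with the Set-Cookie header of the login response and then add the result of HeaderFor as the Cookie header of every following request to the same site.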

What about HTTPS?

Well, encryption is all over the place, but still missing in AngleSharp. It is unlikely that v0.6 will already have what it needs to deal with all concerns, but it is definitely on the list. The question is: Which version will implement HTTPS?

At the moment my guess is that with v0.8 we will see HTTPS requests being possible with AngleSharp out-of-the-box. Until then, feel free to implement your own web requester, as this is easily pluggable into AngleSharp. The key question is: When is AngleSharp too heavy? It is possible that the requester will be excluded from AngleSharp in an upcoming version, or that only the very basic version (as it is now) will stay in there, with more advanced (pre-made) alternatives being available as plugin NuGet packages.
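
As a rough idea of what such a custom requester could look like, here is a sketch built on HttpClient, which handles HTTPS on its own. The IPageRequester interface and its members are assumptions for this example; the contract that AngleSharp really expects from a requester may look different.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical contract; the real requester interface in AngleSharp may differ.
interface IPageRequester
{
	Boolean Supports(Uri address);
	Task<Stream> GetAsync(Uri address);
}

// Delegates the actual work to HttpClient, which speaks both HTTP and HTTPS.
sealed class HttpClientRequester : IPageRequester
{
	readonly HttpClient _client = new HttpClient();

	// Only absolute http:// and https:// addresses are handled here.
	public Boolean Supports(Uri address)
	{
		return address.Scheme == Uri.UriSchemeHttp || address.Scheme == Uri.UriSchemeHttps;
	}

	// Downloads the resource and returns its body as a stream.
	// Throws an HttpRequestException for non-success status codes.
	public async Task<Stream> GetAsync(Uri address)
	{
		var response = await _client.GetAsync(address).ConfigureAwait(false);
		response.EnsureSuccessStatusCode();
		return await response.Content.ReadAsStreamAsync().ConfigureAwait(false);
	}
}
```

Shipping something like this as a separate package would keep the core library small while still offering HTTPS to those who need it.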

MathML and SVG

Those two beasts have not been touched lately; however, they will be touched in the future. The next iteration v0.6 will not improve their support, but we will most likely see them being integrated starting with v0.7. I consider this a very important subject, as transforming MathML could be very interesting for e.g. Sumerics or other applications.

Conclusion

AngleSharp is still in active development and I am back on track with the roadmap. I do not know if the due dates can be met, but I promise to ensure that AngleSharp becomes an outstanding library that makes the web easily accessible from any .NET language.
