author: John Mark Bell <jmb@netsurf-browser.org> 2007-06-23 22:40:25 +0000
committer: John Mark Bell <jmb@netsurf-browser.org> 2007-06-23 22:40:25 +0000
commit: 7b30a5520cfb56e651f0eb4da85a3e07747da7dc
tree: 5d6281c071c089e1e7a8ae6f8044cecaf6a7db16 /docs
Import hubbub -- an HTML parsing library.
Plenty of work still to do (like tree generation ;)

svn path=/trunk/hubbub/; revision=3359
Diffstat (limited to 'docs')
-rw-r--r-- docs/Architecture | 83
-rw-r--r-- docs/Todo | 12
2 files changed, 95 insertions, 0 deletions
diff --git a/docs/Architecture b/docs/Architecture
new file mode 100644
index 0000000..73966eb
--- /dev/null
+++ b/docs/Architecture
@@ -0,0 +1,83 @@
+Hubbub parser architecture
+==========================
+
+Introduction
+------------
+
+ Hubbub is a flexible HTML parser. It offers two interfaces:
+
+ * a SAX-style event interface
+ * a DOM-style tree-based interface
+
+Overview
+--------
+
+ Hubbub comprises four parts:
+
+ * a charset handler
+ * an input stream veneer
+ * a tokeniser
+ * a tree builder
+
+ Charset handler
+ ---------------
+
+ The charset handler converts the raw data input into a requested encoding.
+
+ Input stream veneer
+ -------------------
+
+ The input stream veneer provides an abstract stream-like interface over
+ the document buffer. This is used by the tokeniser. The document buffer
+ will be encoded in either UTF-8 or UTF-16 (this is client-selectable).
+
+ Tokeniser
+ ---------
+
+ The tokeniser divides the data held in the document buffer into chunks.
+ It sends SAX-style events for each chunk. The tokeniser is agnostic to
+ the charset the document buffer is stored in.
+
+ Tree builder
+ ------------
+
+ The tree builder constructs a DOM tree from the SAX events emitted by the
+ tokeniser. The tree builder is tied to the document buffer charset.
+
+Memory usage and ownership
+--------------------------
+
+ Memory usage within the library is well defined, as is ownership of allocated
+ memory.
+
+ Raw input data provided by the library client is owned by the client.
+
+ The document buffer is allocated on the fly by the library.
+
+ The document buffer is created and resized by the charset handler. Its
+ location is passed to the tree builder through a dedicated event. While
+ parsing is occurring, the ownership of the document buffer lies with the
+ charset handler. Upon parse completion, the tree builder may request
+ ownership of the buffer. If it does not, the buffer will be freed on parser
+ destruction.
+
+ SAX events which refer to document segments contain direct references into
+ the document buffer (i.e. no copying of data held in the document buffer
+ occurs).
+
+ The tree builder will allocate memory for use as DOM nodes. References to
+ strings in the document buffer will be direct and will operate a
+ copy-on-write strategy. All strings (excepting those which comprise part of
+ the document buffer) and nodes within the DOM are reference counted. Upon a
+ reference count reaching 0, the item is freed.
+
+ The above strategy permits data copying to be kept to a minimum, hence
+ minimising memory usage.
+
+Parse errors
+------------
+
+ Notification of parse errors is made through a dedicated event similar to
+ that used for notification of movement of the document buffer. This event
+ contains the line/column offset of the error location, along with a message
+ detailing the error.
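
The Architecture document describes two "dedicated events": one announcing the location of the document buffer and one reporting parse errors with a line/column offset and a message. The following is a minimal sketch of how a client might receive both through a single callback. It is not part of this commit, and every name in it (`event_type`, `event`, `client_state`, `handle_event`) is hypothetical rather than Hubbub's actual API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical event kinds; illustrative only, not Hubbub's API. */
typedef enum {
    EVENT_BUFFER_MOVED, /* document buffer was created or resized */
    EVENT_PARSE_ERROR   /* a parse error was encountered */
} event_type;

/* One event structure carrying the payload for either kind. */
typedef struct {
    event_type type;
    union {
        struct {
            const unsigned char *data; /* new location of the buffer */
            size_t len;                /* length of the buffer in bytes */
        } buffer;
        struct {
            unsigned int line;   /* line offset of the error */
            unsigned int col;    /* column offset of the error */
            const char *message; /* message detailing the error */
        } error;
    } u;
} event;

/* Client-side state updated by the callback. */
typedef struct {
    const unsigned char *doc; /* current document buffer location */
    size_t doc_len;
    unsigned int error_count;
    char last_error[128];     /* formatted as "line:col: message" */
} client_state;

/* Callback: tracks buffer movement and records parse errors. */
static void handle_event(const event *ev, void *pw)
{
    client_state *st = pw;

    assert(ev != NULL && st != NULL);

    switch (ev->type) {
    case EVENT_BUFFER_MOVED:
        /* Any pointers into the old buffer are now stale. */
        st->doc = ev->u.buffer.data;
        st->doc_len = ev->u.buffer.len;
        break;
    case EVENT_PARSE_ERROR:
        st->error_count++;
        snprintf(st->last_error, sizeof st->last_error, "%u:%u: %s",
                 ev->u.error.line, ev->u.error.col, ev->u.error.message);
        break;
    }
}
```

Because events refer directly into the document buffer, a consumer in this sketch would need to refresh its base pointer on every buffer-moved event before dereferencing any stored offsets.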
diff --git a/docs/Todo b/docs/Todo
new file mode 100644
index 0000000..2abce2b
--- /dev/null
+++ b/docs/Todo
@@ -0,0 +1,12 @@
+TODO list
+=========
+
+ + Update tokeniser to comply with latest spec draft (currently complies
+ with 2007-06-12 draft)
+ + Implement one or more tree builders
+ + More charset convertors (or make the iconv codec significantly faster)
+ + Parse error reporting from the tokeniser
+ + Implement extraneous chunk insertion/tokenisation
+ + Statistical charset autodetection
+ + Shared library, for those platforms that support such things
+ + Optimise it