summaryrefslogtreecommitdiff
path: root/test/data/tokeniser2/README.md
diff options
context:
space:
mode:
Diffstat (limited to 'test/data/tokeniser2/README.md')
-rw-r--r--test/data/tokeniser2/README.md104
1 files changed, 104 insertions, 0 deletions
diff --git a/test/data/tokeniser2/README.md b/test/data/tokeniser2/README.md
new file mode 100644
index 0000000..4218c26
--- a/dev/null
+++ b/test/data/tokeniser2/README.md
@@ -0,0 +1,104 @@
+Tokenizer tests
+===============
+
+The test format is [JSON](http://www.json.org/). This has the advantage
+that the syntax allows backward-compatible extensions to the tests and
+the disadvantage that it is relatively verbose.
+
+Basic Structure
+---------------
+
+ {"tests": [
+     {"description": "Test description",
+     "input": "input_string",
+     "output": [expected_output_tokens],
+     "initialStates": [initial_states],
+     "lastStartTag": last_start_tag,
+     "ignoreErrorOrder": ignore_error_order
+     }
+ ]}
+
+Multiple tests per file are allowed simply by adding more objects to the
+"tests" list.
+
+`description`, `input` and `output` are always present. The other values
+are optional.
+
+### Test set-up
+
+`test.input` is a string containing the characters to pass to the
+tokenizer. Specifically, it represents the characters of the **input
+stream**, and so implementations are expected to perform the processing
+described in the spec's **Preprocessing the input stream** section
+before feeding the result to the tokenizer.
+
+If `test.doubleEscaped` is present and `true`, then `test.input` is not
+quite as described above. Instead, it must first be subjected to another
+round of unescaping (i.e., in addition to any unescaping involved in the
+JSON import), and the result of *that* represents the characters of the
+input stream. Currently, the only unescaping required by this option is
+to convert each sequence of the form \\uHHHH (where H is a hex digit)
+into the corresponding Unicode code point. (Note that this option also
+affects the interpretation of `test.output`.)
+
+`test.initialStates` is a list of strings, each being the name of a
+tokenizer state. The test should be run once for each string, using it
+to set the tokenizer's initial state for that run. If
+`test.initialStates` is omitted, it defaults to `["data state"]`.
+
+`test.lastStartTag` is a lowercase string that should be used as "the
+tag name of the last start tag to have been emitted from this
+tokenizer", referenced in the spec's definition of **appropriate end tag
+token**. If it is omitted, it is treated as if "no start tag has been
+emitted from this tokenizer".
+
+### Test results
+
+`test.output` is a list of tokens, ordered with the first produced by
+the tokenizer the first (leftmost) in the list. The list must mach the
+**complete** list of tokens that the tokenizer should produce. Valid
+tokens are:
+
+ ["DOCTYPE", name, public_id, system_id, correctness]
+ ["StartTag", name, {attributes}*, true*]
+ ["StartTag", name, {attributes}]
+ ["EndTag", name]
+ ["Comment", data]
+ ["Character", data]
+ "ParseError"
+
+`public_id` and `system_id` are either strings or `null`. `correctness`
+is either `true` or `false`; `true` corresponds to the force-quirks flag
+being false, and vice-versa.
+
+When the self-closing flag is set, the `StartTag` array has `true` as
+its fourth entry. When the flag is not set, the array has only three
+entries for backwards compatibility.
+
+All adjacent character tokens are coalesced into a single
+`["Character", data]` token.
+
+If `test.doubleEscaped` is present and `true`, then every string within
+`test.output` must be further unescaped (as described above) before
+comparing with the tokenizer's output.
+
+`test.ignoreErrorOrder` is a boolean value indicating that the order of
+`ParseError` tokens relative to other tokens in the output stream is
+unimportant, and implementations should ignore such differences between
+their output and `expected_output_tokens`. (This is used for errors
+emitted by the input stream preprocessing stage, since it is useful to
+test that code but it is undefined when the errors occur). If it is
+omitted, it defaults to `false`.
+
+xmlViolation tests
+------------------
+
+`tokenizer/xmlViolation.test` differs from the above in a couple of
+ways:
+
+- The name of the single member of the top-level JSON object is
+ "xmlViolationTests" instead of "tests".
+- Each test's expected output assumes that implementation is applying
+ the tweaks given in the spec's "Coercing an HTML DOM into an
+ infoset" section.
+