Having explained where the requirement came from and how the data from the website and CMS is connected to the application, we now move on to processing the product descriptions. This was the part for which I already had a lot of the code worked out, although it ended up quite different from the original Delphi version I wrote. It is interesting how the internet has grown: when I first wrote the Delphi version there were no hits at all for XPress Tags or Quark Tags on Yahoo, as it was then. Now there is much more information out there on how the tags work and what they are, thank goodness!
Wouldn’t it be simple if Quark used SGML? Nope, it uses XPress Tags to mark up the document. Here is a very small taste:
<-> means subscript
<+> means superscript
<b> means bold
<t-2> means adjust character tracking by minus two
<\-> means -
A lot of it works on style sheets: @MainStyle: means we are now using a paragraph style of “MainStyle”, which stays in force until it is broken by one of a selection of other tags. I’d like to say they were closing tags, but they are not; some tags act as inferred closing tags depending on the context in which you encounter them. The parser moves these styles into class="MainStyle" attributes on the elements they apply to, which allows the designer to address them intelligently from CSS.
When processed, the document gets broken up into paragraph styles, then within those, character styles, then within those, tabbed tables, and within those, further paragraph and character styles. You start to see the recursive nature of this.
Some of the processing was easier: <b> means bold to Quark and <i> italic, as they do in HTML. There is no closing tag, though; rather, any one of a selection of other tags may ultimately end up closing an open bold or italic. The fun goes on and on.
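To make the style-to-class idea concrete, here is a minimal sketch in Python (not the VB.NET of the real application). It assumes a style marker always starts a line and stays in force until the next marker; the real parser also has to cope with the other tags that break a style. The names STYLE_LINE, css_class and paragraphs_to_html are made up for this illustration.

```python
import html
import re

# Hypothetical pattern: a paragraph style marker such as "@MainStyle:" at the
# start of a line, followed by the text of that paragraph.
STYLE_LINE = re.compile(r"^@([^:<>]+):(.*)$")

def css_class(style_name):
    """Turn a style sheet name into something safe to use as a CSS class."""
    return re.sub(r"[^A-Za-z0-9_-]+", "-", style_name.strip())

def paragraphs_to_html(tagged_text, default_style="Normal"):
    """Emit one <p class="..."> per non-empty line.

    Assumption: a style stays in force until another @Style: marker appears.
    """
    current = default_style
    out = []
    for line in tagged_text.splitlines():
        match = STYLE_LINE.match(line)
        if match:
            current, line = match.group(1), match.group(2)
        if line.strip():
            out.append('<p class="%s">%s</p>' % (css_class(current), html.escape(line)))
    return "\n".join(out)

print(paragraphs_to_html("@MainStyle:First paragraph\nSecond paragraph"))
```

Both paragraphs come out with class="MainStyle", which is what lets the designer target them from a style sheet.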
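One way to picture the missing closing tags is as a stack of open styles. The Python sketch below is an illustration only, not the original VB.NET code. It assumes that a repeated <b> or <i> switches that style off again and that the end of the text run closes whatever is still open; the real converter has more tags that can act as an inferred close. The function name close_inline is hypothetical.

```python
import re

TOKEN = re.compile(r"<([bi])>", re.IGNORECASE)

def close_inline(tagged_text):
    """Rewrite toggle-style <b>/<i> markers as properly nested HTML."""
    open_tags = []  # stack of currently open HTML tags, innermost last
    out = []
    pos = 0
    for match in TOKEN.finditer(tagged_text):
        out.append(tagged_text[pos:match.start()])
        pos = match.end()
        tag = match.group(1).lower()
        if tag in open_tags:
            # Close anything opened after this tag, close the tag itself,
            # then reopen the inner tags so the HTML stays well nested.
            reopen = []
            while open_tags[-1] != tag:
                inner = open_tags.pop()
                out.append("</%s>" % inner)
                reopen.append(inner)
            open_tags.pop()
            out.append("</%s>" % tag)
            for inner in reversed(reopen):
                open_tags.append(inner)
                out.append("<%s>" % inner)
        else:
            open_tags.append(tag)
            out.append("<%s>" % tag)
    out.append(tagged_text[pos:])
    # The end of the run acts as an inferred closing tag for anything left open.
    while open_tags:
        out.append("</%s>" % open_tags.pop())
    return "".join(out)

print(close_inline("a <b>bold <i>both<b> italic"))
```

The stack is what keeps the output valid when bold is switched off while italic is still open.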
Tabbed text tables
Some of the specifications for products are in tables made up of tabbed text, defined with the syntax <*t(144,0,"1 "181.5,0,"1 "217.5,0,"1 "251.039,0,"1 ")>, where the numbers are tab positions and alignment information.
For these, a tab settings class was created to handle the tab settings and parse them into HTML tables. Sometimes tabbed tables have a spacer between them and then continue on. In this case the converter must not start another HTML table, or the columns will not line up. So we keep track of the tables and keep the current table open when two matching sets of tab definitions occur with only a specified subset of elements between them, none of which warrants a new table.
Other times tables are defined in a style sheet like this:
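The tab handling can be sketched roughly as follows, in Python rather than the original VB.NET. The reading of each tab stop as position, alignment and a quoted string is an assumption based on the example tag above, and the sketch only merges blocks that directly follow each other; it does not model the spacer elements the real class allows between them. All names are hypothetical.

```python
import html
import re

TAB_DEF = re.compile(r"<\*t\(([^)]*)\)>")
TAB_STOP = re.compile(r'([\d.]+),(\d+),"([^"]*)"')

def parse_tab_stops(tag):
    """Return a tuple of (position, alignment, quoted_text) per tab stop."""
    body = TAB_DEF.search(tag).group(1)
    return tuple((float(p), int(a), q) for p, a, q in TAB_STOP.findall(body))

def rows_to_tables(blocks):
    """blocks is a list of (tab_tag, [tab-separated lines]).

    Consecutive blocks whose tab stops match share one <table>, so their
    columns keep lining up; a different definition starts a new table.
    """
    out = []
    previous = None
    for tag, lines in blocks:
        stops = parse_tab_stops(tag)
        if stops != previous:
            if previous is not None:
                out.append("</table>")
            out.append("<table>")
            previous = stops
        for line in lines:
            cells = "".join("<td>%s</td>" % html.escape(c) for c in line.split("\t"))
            out.append("<tr>%s</tr>" % cells)
    if previous is not None:
        out.append("</table>")
    return "\n".join(out)

tag = '<*t(144,0,"1 "181.5,0,"1 ")>'
print(rows_to_tables([(tag, ["A\t1\t2"]), (tag, ["B\t3\t4"])]))
```

Comparing the parsed tab stops, rather than the raw tag text, is the design choice that lets two separately declared but identical definitions continue the same table.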
@MSIC TABLE:<@Misc Table Header><*t(52.8,0,"1 "103.8,0,"1 "156,0,"1 "221.399,0,"1 ")>
So here you can see a table header in the style definition. Yes, that means <th></th> elements to worry about and track that they close, as well as table data elements <td></td>.
The interaction between all this gets real fun…
Bullet points are used quite a number of times, and again a custom class was written to deal with them, producing an unordered list that can then be restyled by addressing it with CSS.
Some tags were easy to deal with. For example, <t-1> means decrease the tracking (for non-typographical people, this means the space between the letters in the words). This does not translate to the web, as we have more of a flow layout where you don’t worry as much about millimetre perfection. A quick regular expression Regex.Replace with String.Empty got rid of these, along with many others.
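The equivalent of that replace call looks something like this in Python (a sketch only; the original used the .NET regex classes, and the exact pattern it used is not shown here, so the one below is an assumption covering tags such as <t-1> and <t-2>).

```python
import re

# Assumed pattern: "<t", an optional minus sign, a number, then ">".
# A digit is required after the "t", so HTML tags such as <td> are not touched.
TRACKING = re.compile(r"<t-?\d+(?:\.\d+)?>")

def strip_tracking(text):
    """Remove tracking adjustments, which have no useful meaning in HTML."""
    return TRACKING.sub("", text)

print(strip_tracking("Wid<t-2>get<t-1> 5mm"))
```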
Dozens of special characters are defined in Quark that need converting, for example:
«M,14\[Omega\]» should convert to the &Omega; HTML entity, which then renders as Ω in the browser.
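A conversion along these lines could be sketched as below, in Python rather than the original VB.NET. Only the Omega token above comes from the real data; the sketch assumes the name in square brackets happens to match an HTML entity name, and it leaves any token it does not recognise untouched. The real converter presumably used its own lookup table.

```python
import re
from html.entities import name2codepoint

# Matches the «...[Name]» form, with or without a backslash before each bracket.
SPECIAL = re.compile(r"«[^«»\[\]\\]*\\?\[([A-Za-z0-9]+)\\?\]»")

def convert_special(text):
    """Replace recognised special-character tokens with HTML entities."""
    def replace(match):
        name = match.group(1)
        if name in name2codepoint:      # a known HTML entity name
            return "&%s;" % name
        return match.group(0)           # unknown: leave the token as it was
    return SPECIAL.sub(replace, text)

print(convert_special(r"10 k«M,14\[Omega\]»"))
```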
HTML Anchor tags
There was also a desire to convert any instances of product SKU codes into anchor <a href=""> tags so they became proper links. This was another easy win, as one more regular expression caught these instances. Where false positive matches occurred, it was simple to change the source text slightly so that it no longer matched a product code, for example by adding a space somewhere in the text.
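As an illustration of the linking step, here is a small Python sketch. The SKU format (two to four capital letters followed by three to six digits) and the /product/ URL are invented for the example; the real codes and link targets are not given in this article.

```python
import re

# Hypothetical SKU shape: 2-4 capital letters followed directly by 3-6 digits.
SKU = re.compile(r"\b[A-Z]{2,4}\d{3,6}\b")

def link_skus(text, url_template="/product/{sku}"):
    """Wrap anything that looks like a product code in an anchor tag."""
    def to_anchor(match):
        sku = match.group(0)
        return '<a href="%s">%s</a>' % (url_template.format(sku=sku), sku)
    return SKU.sub(to_anchor, text)

print(link_skus("See AB1234, but not AB 1234"))
```

The second code in the example shows the workaround described above: a space in the source text is enough to stop the pattern from matching.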
In anticipation of this project, the data in CMS is fairly regimented, as is the structure of the item groups we are attempting to publish. There are paragraphs of text, standard headings and tables in the text but not much in terms of complicated layout elements like fonts. A lot of the supporting page elements have been pushed into extra fields in CMS to keep the item descriptions clean.
The Q Document class pulls it all together (see class diagram below).
You might be curious about the AfterValidationError event and the BeforeHTMLValidation event. I wanted to make certain we were not letting anything nasty slip through to our website, and the only way to ensure that was to do some DTD validation on the resulting HTML document after the item group has been processed.
I wrote a DTD validation procedure to validate the resulting HTML. This ended up spotting a couple of extra bugs once the application was released, so it was good foresight.
    ' These two Imports belong at the top of the file.
    Imports System
    Imports System.Xml

    Private oDTDSchemaCollection As Xml.Schema.XmlSchemaSet
    Private oXmlCachingResolver As New CachedXmlResolver()

    Private Sub ValidateXML()
        RaiseEvent BeforeHTMLValidation(Me, New EventArgs)
        Dim oHtmlToCheckTextReader As New System.IO.StringReader(Me.ConvertedDocumentAsHTMLPage)
        Dim HtmlToCheckXMLReader As New Xml.XmlTextReader(oHtmlToCheckTextReader)
        Try
            ' Resolve the DTD through the caching resolver, not the network.
            HtmlToCheckXMLReader.XmlResolver = oXmlCachingResolver

            Dim ReaderSettings As New XmlReaderSettings
            ' ProhibitDtd is obsolete from .NET 4; use DtdProcessing = DtdProcessing.Parse there.
            ReaderSettings.ProhibitDtd = False
            ReaderSettings.ValidationType = ValidationType.DTD
            ReaderSettings.XmlResolver = oXmlCachingResolver
            AddHandler ReaderSettings.ValidationEventHandler, New Xml.Schema.ValidationEventHandler(AddressOf ValidationEventHandler)

            ' Reading to the end of the document is what performs the validation.
            Dim HtmlToCheckValidatingReader As XmlReader = XmlReader.Create(HtmlToCheckXMLReader, ReaderSettings)
            While HtmlToCheckValidatingReader.Read()
            End While
        Catch ex As Exception
            RaiseEvent AfterValidationError(Me, New EventArgs, ex)
        Finally
            ' The listing has a gap here; Finally is assumed so this fires on both paths.
            RaiseEvent AfterHTMLValidation(Me, New EventArgs)
        End Try
    End Sub
Validation caused a problem, as it takes a long time over a poor internet connection to pull down the schemas from the internet. I therefore implemented a CachedXmlResolver that inherits from XmlUrlResolver. It keeps a local cache of the schemas, which sped each validation up from about fifteen seconds to almost instantaneous. To further improve the user experience, validation was moved onto a separate thread so that it does not get in the way. The caching resolver had to be changed a little to put a SyncLock on the cache, so that multiple validations do not fight each other to update the resolver cache at the same time. This leaves a very responsive application while validation is occurring.
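The caching and locking idea can be shown in a few lines of Python. This is a sketch of the general technique, not the CachedXmlResolver itself: threading.Lock stands in for VB.NET's SyncLock, and the fetch function passed in stands in for the slow download that XmlUrlResolver would otherwise perform. The class and parameter names are made up for the example.

```python
import threading

class CachingResolver:
    """Fetch each URL at most once, even when several threads ask at once."""

    def __init__(self, fetch):
        self._fetch = fetch            # callable(url) -> bytes, the slow network call
        self._cache = {}
        self._lock = threading.Lock()  # plays the role of SyncLock

    def resolve(self, url):
        # Holding the lock during the fetch makes other validations wait,
        # but guarantees they then read the cached copy, not download again.
        with self._lock:
            if url not in self._cache:
                self._cache[url] = self._fetch(url)
            return self._cache[url]

def run_in_background(work):
    """Run a validation callable on its own thread so the UI stays responsive."""
    worker = threading.Thread(target=work, daemon=True)
    worker.start()
    return worker
```

A single lock around the whole lookup is the simplest safe choice; it serialises downloads, which matters little once the handful of DTD files are cached.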
Read part 4