MFML
From CommerceNet Wiki
Pronounced 'miffle' :) Not to be confused with MiniML.
Goals
Goals for a pidgin XML representation of microformatted data:
- make it easy for traditional XML tools to present, store, & transform microformatted data recovered from XHTML
- secondarily, make it easier to apply SemWeb tools.
- make it easier to design new microformats.
Starting points (assumptions):
- the only normative form for data recovered from a microformat is a DOM tree. (Dicts, hashmaps, lists, and combinations of the above all fail when presented with semi-ugly XHTML input. Property lists might be an even more accurate term for the s-expy crowd...)
- each microformat has a doppelganger XML tag set
- more "semantic" constraints on valid data recovered from microformats should be specifiable in schema languages (RELAX NG, Schematron, DTD, etc.). E.g. if there can only be one instance of family-name, don't expect a base microformat parser to enforce that.
Obstacles
For an XML-savvy programmer, there are several obstacles to working with semi-structured data "hidden" in XHTML:
- the class attribute is hard to process in XML tools
- microformats can appear on a very wide range of HTML elements
- those elements need not be co-extensive -- a single element can be microformatted multiple times, or the 'members' of a microformat structure can be scattered around a page
- microformats require special-case handling for several constructs
  - abbr[@title]
  - link[@href]
  - a[@href] and the content of the a element
  - img[@src] and img[@alt]
- html authors require roundtripping without data loss
  - html attribute information needs to be preserved
  - html element names need to be preserved
- the element subsumption hierarchy cannot be inferred from XMDP alone
- misspellings and extensions -- there are no validity constraints
- microformats can embed other microformats -- including those your code predates
- xhtml fragments must be evaluated with respect to several non-local parameters
  - base[@href]
  - head[@profile] (XMDP)
  - XHTML namespaces and/or DOCTYPEs
- other problematic parts
  - [@id] and [@name] uniqueness
  - non-microformat classnames
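The first obstacle is easy to reproduce. The sketch below is only an illustration, assuming Python's standard xml.etree.ElementTree; the helper name elements_with_class is made up for this page. An exact-match test on @class misses a cell whose class attribute holds several tokens, so the attribute has to be split on whitespace first.

```python
import xml.etree.ElementTree as ET

XHTML = """<table xmlns="http://www.w3.org/1999/xhtml">
  <tr class="vcard">
    <td class="Given-Name">Rohit</td>
    <td class="tel work mobile">867-5309</td>
  </tr>
</table>"""

def elements_with_class(root, token):
    """Yield every element whose class attribute contains `token` as a whole word."""
    for el in root.iter():
        if token in el.get("class", "").split():
            yield el

root = ET.fromstring(XHTML)
# Exact match on the attribute value: finds nothing, the value is "tel work mobile".
exact = root.findall(".//*[@class='tel']")
# Whitespace-split match: finds the phone cell.
tokenized = list(elements_with_class(root, "tel"))
print(len(exact), len(tokenized))
```

The same problem shows up in XPath 1.0 and XSLT 1.0, where the usual workaround is a contains() test on a space-padded copy of the attribute.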
Architecture
µformatted XHTML
An example:
- start with an Excel table of names and phone numbers, export to XHTML
- microformat the XHTML using hCard
- miracle...
- print out the phone numbers, sorted by last name using XML tools
Excel table as input:
| First Name | Last Name | Phone |
|---|---|---|
| Ben | Sittler | 555-1212 |
| Rohit | Khare | 867-5309 |
The same data at each of the four stages (XHTML, µformatted XHTML, miracle..., "clean" XML):

XHTML:
<table xmlns="http://www.w3.org/1999/xhtml">
  <tr>
    <th>First Name</th>
    <th>Last Name</th>
    <th>Phone </th>
  </tr>
  <tr>
    <td>Ben</td>
    <td>Sittler</td>
    <td>555-1212 </td>
  </tr>
  <tr>
    <td>Rohit</td>
    <td>Khare</td>
    <td>867-5309 </td>
  </tr>
</table>

µformatted XHTML:
<table xmlns="http://www.w3.org/1999/xhtml">
  <tr>
    <th>First Name</th>
    <th>Last Name</th>
    <th>Phone </th>
  </tr>
  <tr class="vcard">
    <td class="Given-Name">Ben</td>
    <td class="Family-Name">Sittler</td>
    <td class="iridium tel">555-1212 </td>
  </tr>
  <tr class="vcard">
    <td class="Given-Name">Rohit</td>
    <td class="Family-Name">Khare</td>
    <td class="tel work mobile">867-5309 </td>
  </tr>
</table>

miracle...: MiniML

"clean" XML:
<vcard>
  <n>
    <Family-Name>Sittler</Family-Name>
    <Given-Name>Ben</Given-Name>
  </n>
  <tel> 555-1212 </tel>
</vcard>
<vcard>
  <n>
    <Family-Name>Khare</Family-Name>
    <Given-Name>Rohit</Given-Name>
  </n>
  <tel> 867-5309 <mobile/> <work/> </tel>
</vcard>
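Once the "clean" XML exists, the last step of the example (print the phone numbers, sorted by last name) needs nothing microformat-specific. A minimal sketch, assuming Python's standard xml.etree.ElementTree; the cards wrapper element and the function name phones_by_last_name are not part of MFML, they are only there so the two vcard elements have a single root.

```python
import xml.etree.ElementTree as ET

# The two <vcard> elements from the "clean" XML, wrapped in a
# hypothetical <cards> root so the document is well-formed.
CLEAN = """<cards>
  <vcard><n><Family-Name>Sittler</Family-Name><Given-Name>Ben</Given-Name></n>
    <tel> 555-1212 </tel></vcard>
  <vcard><n><Family-Name>Khare</Family-Name><Given-Name>Rohit</Given-Name></n>
    <tel> 867-5309 <mobile/> <work/> </tel></vcard>
</cards>"""

def phones_by_last_name(xml_text):
    """Return (family name, phone number) pairs sorted by family name."""
    root = ET.fromstring(xml_text)
    rows = []
    for card in root.iter("vcard"):
        family = card.findtext("n/Family-Name", default="").strip()
        for tel in card.findall("tel"):
            # The number is the leading text of <tel>; type tags such as
            # <work/> are child elements and do not get in the way.
            rows.append((family, (tel.text or "").strip()))
    return sorted(rows)

for family, number in phones_by_last_name(CLEAN):
    print(family, number)
```

An XSLT stylesheet with xsl:sort would do the same job; the point is only that ordinary XML tooling suffices at this stage.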
miracle...
We are told that
- vcard is the root element
  - n is a child
    - Given-Name is a child
    - Family-Name is a child
    - (... there are others in the spec)
  - tel is a child
    - it can contain zero or more tags from: work, home, tel, fax, pref etc. (... as enumerated in the spec and at IANA)
  - (...there are others in the spec, such as adr)
Find µformat Classes
First, collect the list of "symbols", or unique classnames that can occur in this µformat. Put "--" in front of each one.
<table xmlns="http://www.w3.org/1999/xhtml">
  <tr>
    <th>First Name</th>
    <th>Last Name</th>
    <th>Phone </th>
  </tr>
  <tr class="--vcard">
    <td class="--Given-Name">Ben</td>
    <td class="--Family-Name">Sittler</td>
    <td class="iridium --tel">555-1212 </td>
  </tr>
  <tr class="--vcard">
    <td class="--Given-Name">Rohit</td>
    <td class="--Family-Name">Khare</td>
    <td class="--mobile --tel --work">867-5309 </td>
  </tr>
</table>
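The marking step can be sketched in a few lines. This is an illustration only, assuming Python's standard xml.etree.ElementTree; the symbol set below lists just the tokens used in this example (a real one would come from the µformat's profile), and mark_symbols is a made-up name. Unlike the listing above, the sketch keeps the original token order rather than sorting the tokens.

```python
import xml.etree.ElementTree as ET

# Assumed symbol list: only the hCard tokens that appear in this example.
HCARD_SYMBOLS = {"vcard", "n", "Given-Name", "Family-Name", "tel", "work", "mobile"}

def mark_symbols(root, symbols):
    """Prefix every recognized class token with "--"; leave other tokens alone."""
    for el in root.iter():
        cls = el.get("class")
        if cls is None:
            continue
        marked = ["--" + t if t in symbols else t for t in cls.split()]
        el.set("class", " ".join(marked))
    return root

row = ET.fromstring(
    '<tr class="vcard"><td class="Given-Name">Ben</td>'
    '<td class="iridium tel">555-1212</td></tr>'
)
mark_symbols(row, HCARD_SYMBOLS)
print(ET.tostring(row, encoding="unicode"))
```

Non-microformat tokens such as iridium pass through untouched, which is what lets later steps tell them apart from µformat symbols.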
Disambiguate µformat Hierarchy (mftidy?)
Note that multiple --name classes are not allowed in MiniML (since an element has only one name), so we need to apply each microformat's rules to break ties and insert missing levels of hierarchy.
- µformats on leaf XHTML elements (img, and potentially also hr, br, isindex, area, param, col, frame, iframe, input, select, option (arguable), meta, link, base, basefont) need to be enclosed in a new container element, e.g. <img class="--photo" src="..."/> becomes <div class="--photo"><img src="..."/></div>
- "tag" classes (e.g. class="work pref mobile" inside .vcard and .tel) need to become empty child elements
- unambiguous inheritance (e.g. class="vcard n")
- omitted intermediate levels (e.g. class="vcard Given-Name") -- adjacent nodes missing the same parent share one
- shared content, either in a single microformat (e.g. class="vcard n fn")
Expand the element hierarchy to reflect each microformat's constraints.
<table xmlns="http://www.w3.org/1999/xhtml">
  <tr>
    <th>First Name</th>
    <th>Last Name</th>
    <th>Phone </th>
  </tr>
  <tr class="--vcard">
    <td class="--Given-Name">Ben</td>
    <td class="--Family-Name">Sittler</td>
    <td class="iridium --tel">555-1212 </td>
  </tr>
  <tr class="--vcard">
    <td class="--Given-Name">Rohit</td>
    <td class="--Family-Name">Khare</td>
    <td class="--tel">867-5309 <span class="--mobile"/><span class="--work"/></td>
  </tr>
</table>
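Of the rules above, the "tag" class rule is the one exercised by this example. A rough sketch of it follows, assuming Python's standard xml.etree.ElementTree; the TAG_TOKENS table is an assumption standing in for whatever each microformat's rules actually say, and split_tag_classes is a made-up name.

```python
import xml.etree.ElementTree as ET

# Assumed rule table: for each container token, the "tag" tokens it may carry.
TAG_TOKENS = {"--tel": {"--work", "--home", "--mobile", "--fax", "--pref"}}

def split_tag_classes(root):
    """Move tag tokens off a container's class and into empty child <span>s."""
    # Snapshot the element list first, because children are added below.
    for el in list(root.iter()):
        tokens = el.get("class", "").split()
        for container, tags in TAG_TOKENS.items():
            if container not in tokens:
                continue
            moved = [t for t in tokens if t in tags]
            kept = [t for t in tokens if t not in tags]
            el.set("class", " ".join(kept))
            for t in moved:
                ET.SubElement(el, "span", {"class": t})
    return root

cell = ET.fromstring('<td class="--mobile --tel --work">867-5309 </td>')
split_tag_classes(cell)
print(ET.tostring(cell, encoding="unicode"))
```

After this step the cell carries a single --name class, so it can be renamed unambiguously; the other rules (leaf elements, omitted intermediate levels, shared content) are not covered by the sketch.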
Convert to MiniML XML
Next, convert the above to MiniML XML. (The following example is controversial in that it does not preserve the original content of @class.)
<xh:table xmlns:miniml="..." xmlns:xh="http://www.w3.org/1999/xhtml">
  <xh:tr>
    <xh:th>First Name</xh:th>
    <xh:th>Last Name</xh:th>
    <xh:th>Phone </xh:th>
  </xh:tr>
  <vcard miniml:element="xh:tr">
    <Given-Name miniml:element="xh:td">Ben</Given-Name>
    <Family-Name miniml:element="xh:td">Sittler</Family-Name>
    <tel miniml:element="xh:td">
      <miniml:attr miniml:name="xh:class">iridium</miniml:attr>
      555-1212
    </tel>
  </vcard>
  <vcard miniml:element="xh:tr">
    <Given-Name miniml:element="xh:td">Rohit</Given-Name>
    <Family-Name miniml:element="xh:td">Khare</Family-Name>
    <tel miniml:element="xh:td">
      <miniml:attr miniml:name="xh:class">work mobile</miniml:attr>
      867-5309
    </tel>
  </vcard>
</xh:table>
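One possible reading of this conversion, as a sketch only: an element with exactly one "--" token is renamed after that token, its original tag is recorded in miniml:element, and any leftover class tokens move into a miniml:attr child. It assumes Python's standard xml.etree.ElementTree. The namespace URI for miniml is not given on this page, so the sketch uses an obvious placeholder; to_miniml is a made-up name.

```python
import xml.etree.ElementTree as ET

XH = "http://www.w3.org/1999/xhtml"
# Placeholder only: the miniml namespace URI is left unspecified ("...") above.
MINIML = "urn:example:miniml-placeholder"
ET.register_namespace("xh", XH)
ET.register_namespace("miniml", MINIML)

def to_miniml(el):
    """Rename an element after its single "--" class token, recording the original tag."""
    tokens = el.get("class", "").split()
    names = [t[2:] for t in tokens if t.startswith("--")]
    rest = [t for t in tokens if not t.startswith("--")]
    if len(names) == 1:
        local = el.tag.split("}")[-1]  # "{namespace}td" -> "td"
        el.attrib.pop("class")
        el.set("{%s}element" % MINIML, "xh:" + local)
        el.tag = names[0]
        if rest:
            # Leftover non-microformat tokens are kept in a miniml:attr child,
            # placed before the element's original text.
            attr = ET.Element("{%s}attr" % MINIML, {"{%s}name" % MINIML: "xh:class"})
            attr.text = " ".join(rest)
            attr.tail = el.text
            el.text = None
            el.insert(0, attr)
    for child in list(el):
        to_miniml(child)
    return el

cell = ET.fromstring(
    '<td xmlns="http://www.w3.org/1999/xhtml" class="iridium --tel">555-1212 </td>'
)
to_miniml(cell)
print(ET.tostring(cell, encoding="unicode"))
```

Elements with zero or several "--" tokens are left alone, on the assumption that the disambiguation step has already run. Note that tag tokens split into child spans by that step would come out here as child elements, not inside miniml:attr as in the listing above.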
MiniML XHTML generated from the "clean" XML
<div class="--vcard" xmlns="http://www.w3.org/1999/xhtml">
  <div class="--n">
    <span class="--Family-Name">Sittler</span>
    <span class="--Given-Name">Ben</span>
  </div>
  <span class="--tel"> 555-1212 </span>
</div>
<div class="--vcard">
  <div class="--n">
    <span class="--Family-Name">Khare</span>
    <span class="--Given-Name">Rohit</span>
  </div>
  <div class="--tel"> 867-5309 <span class="--mobile"/> <span class="--work"/> </div>
</div>
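The reverse direction is mechanical. The sketch below assumes Python's standard xml.etree.ElementTree and one simple rule that appears to match the choice of div and span in the listing above: an element with children becomes a div, a leaf becomes a span. The function name to_xhtml is made up, and the output is left without the XHTML namespace for brevity.

```python
import xml.etree.ElementTree as ET

def to_xhtml(el):
    """Map <name>...</name> to <div class="--name"> (has children) or <span class="--name"> (leaf)."""
    out = ET.Element("div" if len(el) else "span", {"class": "--" + el.tag})
    out.text, out.tail = el.text, el.tail
    out.extend([to_xhtml(child) for child in el])
    return out

card = ET.fromstring(
    "<vcard><n><Family-Name>Khare</Family-Name><Given-Name>Rohit</Given-Name></n>"
    "<tel> 867-5309 <mobile/> <work/> </tel></vcard>"
)
xhtml = to_xhtml(card)
print(ET.tostring(xhtml, encoding="unicode"))
```

Because the element names carry all the structure, no µformat-specific knowledge is needed in this direction.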
"Rules"
- microformat tokens become tagnames: values that occur in an XMDP profile (all valid classnames, rel/rev properties, and misspellings of same) become the names of tags in MFML.
- the highlander rule: if there can only be one of something, should it become an attribute?
- the abbr rule: the TITLE of an ABBR element is substituted for the entire list of childNodes. [moral: if you want the original pretty-printed HTML, look elsewhere]
- the scoping rule: since microformats can occur within other microformats, returning a flattened list of all microformat data found in a page discards information (e.g. "was that a relTag of 'cool' on the hCalendar entry, or only on the hCard of the organizer?"). How should we indicate the relative tree occurrence order?
- the repeat-yourself rule: should a second occurrence of a token force the creation of a copy of the parent node ("page break")?
- the subsumption hierarchy: a topological sort of valid tokens in a microformat must be provided as an exogenous input to Miffy. With that, the occurrence of any tag forces the creation of intermediate parent tags.
- thus
<em class="locality vcard">Galway</em>
becomes
<mfml>
<vcard>
<adr>
<locality> Galway </locality>
</adr>
</vcard>
</mfml>
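The Galway example can be reproduced with a small parent table. This is a sketch, assuming Python's standard xml.etree.ElementTree; the PARENT table is a hand-written fragment of the hCard hierarchy standing in for the exogenous input, and it assumes all recognized tokens on an element lie on one root-to-leaf chain.

```python
import xml.etree.ElementTree as ET

# Assumed fragment of the hCard hierarchy, supplied from outside the parser.
PARENT = {"vcard": None, "n": "vcard", "adr": "vcard", "locality": "adr"}

def path_to_root(token):
    """Return [root, ..., token] by following PARENT links."""
    path = []
    while token is not None:
        path.append(token)
        token = PARENT[token]
    return path[::-1]

def expand(el):
    """Build the <mfml> tree implied by the class tokens on one element."""
    tokens = [t for t in el.get("class", "").split() if t in PARENT]
    # The deepest token determines the chain; shallower tokens are implied by it.
    deepest = max(tokens, key=lambda t: len(path_to_root(t)))
    root = node = ET.Element("mfml")
    for name in path_to_root(deepest):
        node = ET.SubElement(node, name)
    node.text = el.text
    return root

source = ET.fromstring('<em class="locality vcard">Galway</em>')
result = ET.tostring(expand(source), encoding="unicode")
print(result)  # <mfml><vcard><adr><locality>Galway</locality></adr></vcard></mfml>
```

The adr level never appears in the input; it is created purely because the table says locality lives under adr.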
Open questions:
- abbr
- base urls / scoping
- XSD-like data typing
  - text
  - XHTML marked up text
  - ISO8601 date/interval
  - number (float, int, ...?)
  - enumerations ??
- hierarchy
- mispelin's
- are link microformats special? (rel/rev)
- support for "x-" classnames?
- IMG / A / LINK
- support "dict list as hash" pattern?
  - e.g. a credits listing for a movie using an open-ended role vocabulary might be a DL of hCard DDs with role-types as DTs
Examples
(From http://gbraad.survion.com/site/?p=profile )
<div class="vcard">
<img style="float:right; margin:4px" src="http://gbraad.survion.com/photos/profile/0.jpg" alt="Profile photo" class="photo"/>
<a class="url fn" href="http://gbraad.survion.com/" title="Full name">Gerard Braad Jr.</a>
<span class="bday" title="Date of Birth">1981-02-22</span>
<div class="org" title="Organisation"><a class="url work" href="http://www.survion.com/">Sur-V-ioN</a></div>
<span class="role" title="Role">(Freelance) Software Developer</span>
<div class="adr">
<div class="street-address" title="Street">Rustenburgstraat 224</div>
<span class="postal-code" title="Postal code">7311JC</span>
<span class="locality" title="City">Apeldoorn</span>
<span class="country" title="Country">The Netherlands</span>
</div>
<div class="tel">
<span class="pref work voice" title="Work phonenumber">+31 (0)87 1901 799</span>
<span class="home voice" title="Home phonenumber">+31 (0)55 521 2488</span>
<span class="cell voice" title="Cell phonenumber">+31 (0)6 4256 7996</span>
</div>
<div class="email">
<span class="pref internet" title="Primary email">g_braad@survion.com</span>
<span class="internet" title="Alternate email">g_braad@spotsnel.nl</span>
</div>
</div>
could become
<vcard>
<photo> http://gbraad.survion.com/photos/profile/0.jpg </photo>
<fn> Gerard Braad Jr. </fn>
<url> http://gbraad.survion.com/ </url>
<bday> 1981-02-22 </bday>
<org> Sur-V-ioN </org>
<url>
http://www.survion.com/
<work />
</url>
<role> (Freelance) Software Developer </role>
<adr>
<Street-Address> Rustenburgstraat 224 </Street-Address>
<Postal-Code> 7311JC </Postal-Code>
<Locality> Apeldoorn </Locality>
<Country> The Netherlands </Country>
</adr>
<tel>
+31 (0)87 1901 799
<pref />
<work />
<voice />
</tel>
<tel>
+31 (0)55 521 2488
<home />
<voice />
</tel>
<tel>
+31 (0)6 4256 7996
<cell />
<voice />
</tel>
<email>
g_braad@survion.com
<pref />
<internet />
</email>
<email>
g_braad@spotsnel.nl
<internet />
</email>
</vcard>
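In this rendering, photo and url take their values from attributes while fn takes its value from the element text, and a single a element with class="url fn" contributes both. A sketch of that choice, assuming Python's standard xml.etree.ElementTree; the ATTR_VALUE table and the name property_values are made up, and type tokens such as work or pref are not handled.

```python
import xml.etree.ElementTree as ET

# Assumed rule: these (element, property) pairs read an attribute, not the text.
ATTR_VALUE = {("img", "photo"): "src", ("a", "url"): "href"}

def property_values(el):
    """Yield (property, value) for each class token on one XHTML element."""
    for token in el.get("class", "").split():
        attr = ATTR_VALUE.get((el.tag, token))
        value = el.get(attr) if attr else "".join(el.itertext()).strip()
        yield token, value

link = ET.fromstring(
    '<a class="url fn" href="http://gbraad.survion.com/" title="Full name">'
    "Gerard Braad Jr.</a>"
)
photo = ET.fromstring(
    '<img src="http://gbraad.survion.com/photos/profile/0.jpg" '
    'alt="Profile photo" class="photo"/>'
)
for element in (photo, link):
    for name, value in property_values(element):
        print(name, value)
```

Whether the value then lands in element content, an attribute, or a content child is exactly the kind of design decision this page leaves open.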
Alternative design decisions might include:
<vcard>
<photo href="http://gbraad.survion.com/photos/profile/0.jpg"/>
OR
<photo>
<content> http://gbraad.survion.com/photos/profile/0.jpg </content>
</photo>
</vcard>
BTW, prior art includes: http://www.w3.org/TR/vcard-rdf and http://www.imc.org/rfc2426 and http://www.jabber.org/jeps/jep-0054.html ; see http://xml.coverpages.org/vcard.html for a comprehensive discussion
