An aerial view of news markup schemes

Here’s a quick summary of various schemes for marking up semantic data in online news.

hAtom

From the microformats wiki:

hAtom is a microformat for content that can be syndicated, primarily but not exclusively blog postings. hAtom is based on a subset of the Atom syndication format.

While it was created with blogs in mind rather than more ‘traditional’ news articles, it actually covers most of the basics: title, author, text content, dates…

hAtom has pretty good adoption: many CMSs support it by default. For example, many WordPress themes implement hAtom, including the (currently) default Twenty Twelve theme.
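
To give a feel for what that markup looks like to a machine, here is a minimal sketch that pulls the basic hAtom fields out of a page using only the Python standard library. The sample HTML, the `HAtomParser` class and the `parse_hatom` helper are all invented for illustration, and the sketch assumes a single entry per page with dates written using the `abbr` pattern. A real consumer would want a proper microformats parser.

```python
from html.parser import HTMLParser

# Hypothetical sample entry, invented for illustration.
SAMPLE = """
<div class="hentry">
  <h1 class="entry-title">Council approves new bridge</h1>
  <span class="author vcard"><span class="fn">Jo Bloggs</span></span>
  <abbr class="published" title="2012-11-05T09:30:00Z">5 November 2012</abbr>
  <div class="entry-content"><p>The council voted on Monday.</p></div>
</div>
"""

VOID = {"br", "hr", "img", "input", "link", "meta"}  # elements with no end tag
TEXT_FIELDS = {"entry-title", "fn", "entry-content"}  # fields read from element text


class HAtomParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.entry = {}
        self._open = []  # (tag, field-or-None) for each currently open element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        # Dates: the machine-readable value is assumed to sit in title=""
        for name in ("published", "updated"):
            if name in classes and attrs.get("title"):
                self.entry[name] = attrs["title"]
        if tag not in VOID:
            field = next(iter(classes & TEXT_FIELDS), None)
            self._open.append((tag, field))

    def handle_endtag(self, tag):
        # Pop back to the matching start tag, tolerating unclosed children.
        if any(t == tag for t, _ in self._open):
            while self._open.pop()[0] != tag:
                pass

    def handle_data(self, data):
        # Text belongs to every enclosing element that carries a field class.
        for _, field in self._open:
            if field:
                self.entry[field] = self.entry.get(field, "") + data


def parse_hatom(html):
    parser = HAtomParser()
    parser.feed(html)
    parser.close()
    return {key: value.strip() for key, value in parser.entry.items()}


entry = parse_hatom(SAMPLE)
print(entry)
```

The point is mostly that the title, author, date and body text each come out as a separate field, keyed by class name alone.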

hNews

Again, from the microformats wiki:

hNews extends hAtom, introducing a number of fields that more completely describe a journalistic work.

To hAtom, hNews adds fields for source organization, dateline, geographic location, license and statement of principles.
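
As a rough sketch, a checker for those extra fields might look like the following. It assumes the draft's naming, which should be verified against the spec: an `hnews` root class, `source-org`, `dateline` and `geo` as class names, and `item-license` and `principles` as `rel` values. The sample page and the `check_hnews` helper are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample article, invented for illustration.
SAMPLE = """
<div class="hnews hentry">
  <h1 class="entry-title">Council approves new bridge</h1>
  <span class="source-org vcard"><span class="fn org">Example Gazette</span></span>
  <span class="dateline">Springfield</span>
  <a rel="item-license" href="http://example.com/license">Licence</a>
  <a rel="principles" href="http://example.com/principles">Principles</a>
</div>
"""

# Assumed field names: three carried as class names, two as rel values.
HNEWS_CLASSES = {"source-org", "dateline", "geo"}
HNEWS_RELS = {"item-license", "principles"}


class HNewsDetector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.is_hnews = False
        self.found = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        rels = set((attrs.get("rel") or "").split())
        if "hnews" in classes:
            self.is_hnews = True
        self.found |= (classes & HNEWS_CLASSES) | (rels & HNEWS_RELS)


def check_hnews(html):
    """Report which of the hNews-specific fields appear anywhere in the page."""
    detector = HNewsDetector()
    detector.feed(html)
    detector.close()
    expected = HNEWS_CLASSES | HNEWS_RELS
    return {
        "is_hnews": detector.is_hnews,
        "found": sorted(detector.found),
        "missing": sorted(expected - detector.found),
    }


report = check_hnews(SAMPLE)
print(report)
```

It only checks for presence, not whether the fields are nested correctly inside the entry.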

The Associated Press spin-off NewsRight uses hNews in conjunction with a tracking beacon for syndicating news. NewsRight claims that they’ve got 900 source news websites participating (ie using hNews and a beacon), syndicating articles to nearly 50,000 other sites (source).

I did a rough survey of US newspaper sites in 2010 looking for hNews, and found a surprisingly high number. Would be interesting to update that survey…

(disclaimer: I did some work on the drafting of hNews, as part of my work at the Media Standards Trust)

HTML5

HTML5 is fast becoming standard on the web.

<article> and <section> look like they should be really useful for denoting where the actual content of an article lies, but I’m skeptical. I feel the terminology is too open to interpretation and that they’ll be abused, losing their semantic value.

<aside> is for tangential content. ‘Related articles’ boxes seem to be an obvious example.

<figure> and <figcaption> look great for photos, charts or whatever.

The new <time> element finally provides HTML with a proper way to denote dates and times, in both human and machine readable forms. This was always an annoying sticking point with microformats and HTML 4. Of course, you still need some extra semantic information to say what the timestamp is (eg date of first publication, last modified etc), but at least there is now a nice way to encode it.
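
A small sketch of reading both forms out of a `<time>` element with the Python standard library. The sample markup and the `extract_times` helper are invented for illustration, and the sketch assumes the `datetime` attribute holds a full ISO 8601 timestamp with a numeric UTC offset.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser

# Hypothetical sample markup, invented for illustration.
SAMPLE = '<p>Published <time datetime="2012-11-05T09:30:00+00:00">5 November 2012</time></p>'


class TimeParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.times = []       # (machine-readable value, human-readable text)
        self._current = None  # the <time> element being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "time":
            self._current = [dict(attrs).get("datetime"), ""]

    def handle_data(self, data):
        if self._current is not None:
            self._current[1] += data

    def handle_endtag(self, tag):
        if tag == "time" and self._current is not None:
            machine, human = self._current
            self.times.append((machine, human.strip()))
            self._current = None


def extract_times(html):
    parser = TimeParser()
    parser.feed(html)
    parser.close()
    # Skip <time> elements that have no datetime attribute at all.
    return [(datetime.fromisoformat(machine), human)
            for machine, human in parser.times if machine]


times = extract_times(SAMPLE)
print(times)
```

Note that nothing here says whether the timestamp is a publication date or a modification date. That still has to come from a class name, `itemprop` or similar.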

HTML5 also introduces microdata. Like microformats, microdata is a way of attaching semantic meaning to HTML elements. Whereas microformats tend to use CSS class names, microdata uses separate attributes.
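
To make the attribute-based approach concrete, here is a simplified sketch of a microdata reader. The vocabulary URL, the property names and the `parse_microdata` helper are all hypothetical, and the sketch handles only a single flat item with no nested `itemscope`. The full microdata algorithm has more rules than this.

```python
from html.parser import HTMLParser

# Hypothetical sample using a made-up vocabulary, invented for illustration.
SAMPLE = """
<div itemscope itemtype="http://example.com/vocab/Article">
  <h1 itemprop="title">Council approves new bridge</h1>
  <span itemprop="author">Jo Bloggs</span>
  <time itemprop="published" datetime="2012-11-05">5 November 2012</time>
  <meta itemprop="wordcount" content="350">
</div>
"""

# For these elements the value comes from an attribute, not the text.
# Simplification: a <time> without a datetime attribute yields None here.
ATTR_VALUE = {"meta": "content", "time": "datetime", "a": "href",
              "link": "href", "img": "src"}


class MicrodataParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.itemtype = None
        self.props = {}
        self._text_prop = None  # property currently collecting text
        self._text_tag = None   # tag whose end closes that property

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs and self.itemtype is None:
            self.itemtype = attrs.get("itemtype")
        prop = attrs.get("itemprop")
        if not prop:
            return
        if tag in ATTR_VALUE:
            self.props[prop] = attrs.get(ATTR_VALUE[tag])
        else:
            self._text_prop, self._text_tag = prop, tag
            self.props[prop] = ""

    def handle_data(self, data):
        if self._text_prop:
            self.props[self._text_prop] += data

    def handle_endtag(self, tag):
        # Simplification: assumes no same-named tag is nested inside a property.
        if tag == self._text_tag:
            self._text_prop = self._text_tag = None


def parse_microdata(html):
    parser = MicrodataParser()
    parser.feed(html)
    parser.close()
    props = {k: v.strip() if v else v for k, v in parser.props.items()}
    return parser.itemtype, props


itemtype, props = parse_microdata(SAMPLE)
print(itemtype, props)
```

Unlike the class-name approach, the `itemtype` URL says which vocabulary the property names belong to, so there is less guessing about what `title` means.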

I don’t think HTML5 provides a total solution, but it’s a much better place to start from.

schema.org

Schema.org defines a shared vocabulary for marking up certain types of semantic data on the web. Its NewsArticle schema seems to fit the bill nicely.

Schema.org doesn’t say how the metadata should be encoded. Any method, such as HTML5 microdata, RDFa, or whatever could be used. You can get some of the flavour of this here.
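
As an illustration of one vocabulary with several encodings, this sketch renders the same two NewsArticle properties first as HTML5 microdata and then as RDFa Lite. The `ARTICLE` data and the helper functions are invented for the example, and wrapping every value in a `<span>` is just a convenience rather than recommended markup.

```python
from html import escape

# Hypothetical article data; headline and datePublished are schema.org properties.
ARTICLE = {
    "headline": "Council approves new bridge",
    "datePublished": "2012-11-05",
}


def _render(wrapper_attrs, prop_attr, props):
    """Wrap each property in a <span>, named via the given attribute."""
    lines = [f"<div {wrapper_attrs}>"]
    for name, value in props.items():
        lines.append(f'  <span {prop_attr}="{escape(name)}">{escape(value)}</span>')
    lines.append("</div>")
    return "\n".join(lines)


def as_microdata(props):
    # Microdata: itemscope + itemtype on the wrapper, itemprop on the values.
    return _render('itemscope itemtype="http://schema.org/NewsArticle"',
                   "itemprop", props)


def as_rdfa_lite(props):
    # RDFa Lite: vocab + typeof on the wrapper, property on the values.
    return _render('vocab="http://schema.org/" typeof="NewsArticle"',
                   "property", props)


microdata_html = as_microdata(ARTICLE)
rdfa_html = as_rdfa_lite(ARTICLE)
print(microdata_html)
print(rdfa_html)
```

The property names are identical in both outputs. Only the attributes carrying them differ.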

I’d expect schema.org to gain a lot of traction (it was launched in June 2011) as it is designed to make things more semantic for the major search engines. And if there’s one thing news organisations like, it’s googlejuice.

It’d be interesting to do a survey to track adoption.

rNews

Like schema.org, rNews defines a shared vocabulary rather than prescribing a concrete way to encode things. So you can pick an appropriate representation for your site.

It seems a lot of the rNews properties have been incorporated into schema.org. Yay!

Open Graph protocol

The Open Graph protocol is a standard designed to let you mark things up for social networks (eg Facebook). It has a few properties that are useful for news publications, in particular the article type, which covers author, publication dates, title, tags and more.
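
A minimal sketch of collecting those properties from a page’s `<meta>` tags with the Python standard library. The sample values and the `parse_open_graph` helper are made up for illustration. Every property is stored as a list because some, such as `article:tag`, may legitimately repeat.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical sample head section, invented for illustration.
SAMPLE = """
<head>
  <meta property="og:type" content="article">
  <meta property="og:title" content="Council approves new bridge">
  <meta property="article:published_time" content="2012-11-05T09:30:00Z">
  <meta property="article:author" content="http://example.com/staff/jo-bloggs">
  <meta property="article:tag" content="transport">
  <meta property="article:tag" content="local government">
</head>
"""


class OpenGraphParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.props = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop, content = attrs.get("property"), attrs.get("content")
        # Keep only the basic og: properties and the article: ones.
        if prop and content is not None and prop.startswith(("og:", "article:")):
            self.props[prop].append(content)


def parse_open_graph(html):
    parser = OpenGraphParser()
    parser.feed(html)
    parser.close()
    return dict(parser.props)


og = parse_open_graph(SAMPLE)
print(og)
```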

It seems to have pretty good adoption. News sites love to be linked in to social networks.

Other RDF schemas

There are a bunch of RDF schemas which cover news article metadata to varying degrees, and I’m sure they can all be mapped to and fro between most of the other systems mentioned here.

But I’ve never quite got to grips with RDF enough to really talk about it with any fluency.

Twitter Cards

Twitter Cards were designed to allow Twitter to display a little summary of pages linked to by a tweet. From a news article point of view, the spec covers provision of a title, description and attribution (assuming the author has a Twitter account :-)
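
Here is a short sketch of reading a card, assuming the `twitter:`-prefixed `<meta name="...">` tags used in Twitter’s documentation. The sample values, the `@jobloggs` handle and the `parse_twitter_card` helper are invented for illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample head section, invented for illustration.
SAMPLE = """
<head>
  <meta name="twitter:card" content="summary">
  <meta name="twitter:title" content="Council approves new bridge">
  <meta name="twitter:description" content="The council voted on Monday.">
  <meta name="twitter:creator" content="@jobloggs">
</head>
"""


class TwitterCardParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.card = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if name and content is not None and name.startswith("twitter:"):
            # Strip the prefix so keys read as "title", "creator" and so on.
            self.card[name[len("twitter:"):]] = content


def parse_twitter_card(html):
    parser = TwitterCardParser()
    parser.feed(html)
    parser.close()
    return parser.card


card = parse_twitter_card(SAMPLE)
# Attribution is only available if the author has an account.
author_handle = card.get("creator")
print(card, author_handle)
```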

Rel Attribute

There are also a bunch of “rel” attribute conventions which are relevant to news markup:

  • rel-author. Supposed to unambiguously identify an author by linking to a unique URL (eg their blog, Twitter account, whatever). In practice every publication will just link to a bio page on their own site, so any authors who write for multiple publications will be known by lots of different rel-author links. But at least the potential is there :-)
  • rel-tag to mark up tags/topics/whatever.
  • rel-canonical for the One True URL of an article.
  • rel-shortlink for specifying short versions of URLs (via bit.ly, tinyurl etc…).
  • rel-next and rel-prev for indicating pagination.
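
The conventions above can all be harvested the same way. Here is a minimal sketch that groups link targets by `rel` value, with the sample URLs and the `collect_rels` helper invented for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical sample page, invented for illustration.
SAMPLE = """
<head>
  <link rel="canonical" href="http://example.com/news/2012/bridge-approved">
  <link rel="shortlink" href="http://example.com/s/abc">
  <link rel="next" href="http://example.com/news/2012/bridge-approved?page=2">
</head>
<body>
  <a rel="author" href="http://example.com/staff/jo-bloggs">Jo Bloggs</a>
  <a rel="tag" href="http://example.com/tags/transport">transport</a>
  <a rel="tag nofollow" href="http://example.com/tags/bridges">bridges</a>
</body>
"""


class RelParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = defaultdict(list)  # rel value -> list of hrefs

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        # rel is a space-separated list, so one link can carry several values.
        for rel in (attrs.get("rel") or "").lower().split():
            self.links[rel].append(href)


def collect_rels(html):
    parser = RelParser()
    parser.feed(html)
    parser.close()
    return dict(parser.links)


links = collect_rels(SAMPLE)
print(links)
```

Note that the rel-author link here points at a bio page on the publication’s own site, which is exactly the weakness described above.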