Over the last few years, I’ve done a lot of thinking about the information contained in online news articles - not the news itself, but the structure and metadata. There can be a surprising amount lurking there. Here is a list of the things I’ve thought of so far…
- The Content
- I’m thinking mainly of text here, but an article might very well contain
other types of media. Especially images - Photos, charts, whatever.
For each image you might also have:
- URL of the image - maybe it’s a thumbnail for a bigger image?
- A caption
- Attribution. Who took the photo?
- Copyright & Licensing information
- The title of the article. It’s worth noting that some newspapers use different headlines for the print and online versions of articles. Presumably the print version is for a local audience, whereas the online version is for an international audience and tailored to optimise search engine results.
- The standfirst (in the UK) or kicker (USA) is a short summary of the article, usually appearing in larger text above or below the main headline.
- Publication Date(s)
At the very least, you need to know when the article was first published. Without this information, bad things can happen.
Often, there will be a “Last modified” time.
An important aspect of any date on the web is the time zone. It might be implicit, taken from the geographic location of the publication. But it’s best to be explicit and state the time zone of any given date. There are ways to encode a precise, unambiguous machine-readable form alongside a nice, comfortable human-readable version.
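As a rough illustration of pairing the two forms, here is a minimal Python sketch that renders a date as an HTML `<time>` element: an ISO 8601 timestamp with an explicit UTC offset in the `datetime` attribute, and a friendlier version as the visible text. The choice of the `<time>` element, the function name `time_element` and the sample date are my own assumptions for the example; other encodings would do the job too.

```python
from datetime import datetime, timedelta, timezone


def time_element(published: datetime) -> str:
    """Render a timezone-aware datetime as an HTML <time> element."""
    # Refuse naive datetimes: an implicit time zone is exactly the problem.
    if published.tzinfo is None or published.utcoffset() is None:
        raise ValueError("publication date needs an explicit time zone")
    machine = published.isoformat()  # ISO 8601, includes the UTC offset
    human = published.strftime("%d %B %Y, %H:%M %Z")  # for human readers
    return f'<time datetime="{machine}">{human}</time>'


# Hypothetical publication time: a fixed +01:00 offset labelled "BST".
bst = timezone(timedelta(hours=1), "BST")
published = datetime(2012, 7, 5, 9, 30, tzinfo=bst)
print(time_element(published))
```

A fixed offset keeps the sketch dependency-free; a real site would more likely work from a named zone such as Europe/London.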
- Dateline
The dateline is where and when the story occurred, was written or filed.
- Byline
A byline might be as simple as “By Bob Smith”. Or you might have something like “By Bob Smith, Chief Political Correspondent in London and Fred Bloggs in New York”. Add a few more authors, job titles, locations and spelling mistakes, throw in some inconsistent use of commas and capitalisation, and you’ve suddenly got something that’s a real challenge for a machine to make sense of.
Other information in a byline may include things like email addresses, Twitter accounts and links to profile pages. This is all good stuff - it helps to disambiguate journalists with the same name.
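To give a feel for why bylines are awkward, here is a deliberately naive Python sketch that handles the two examples above and not much else. The `Author` record and the `parse_byline` function are hypothetical names made up for illustration, and the rules (split on “and”, treat “in X” as a location, treat whatever follows the first comma as a job title) are assumptions that real bylines will break quickly.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Author:
    name: str
    job_title: Optional[str] = None
    location: Optional[str] = None


def parse_byline(byline: str) -> List[Author]:
    """Naively split a byline into authors, job titles and locations."""
    # Drop a leading "By", whatever its capitalisation.
    text = re.sub(r"^\s*by\s+", "", byline.strip(), flags=re.IGNORECASE)
    authors = []
    # Assumption: " and " only ever separates authors.
    for chunk in re.split(r"\s+and\s+", text):
        location = None
        # Assumption: a trailing "in <place>" is the author's location.
        match = re.search(r"\s+in\s+(.+)$", chunk)
        if match:
            location = match.group(1).strip()
            chunk = chunk[:match.start()]
        # Assumption: anything after the first comma is a job title.
        name, _, title = chunk.partition(",")
        authors.append(Author(name.strip(), title.strip() or None, location))
    return authors


example = ("By Bob Smith, Chief Political Correspondent in London "
           "and Fred Bloggs in New York")
for author in parse_byline(example):
    print(author)
```

A job title like “Arts and Culture Editor” is already enough to defeat the “and” rule, which is rather the point.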
- Canonical URL
You might think that if you’ve managed to pull up an online news article, you must have its URL.
You have one of its URLs. There may be more. In some cases there may be a virtually infinite number of them. What you really want is the canonical URL. Most sites (but not all!) indicate a canonical URL, either using HTTP redirection, or via metadata embedded in the page.
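For the embedded-metadata case, a minimal Python sketch using only the standard library might look like the following. It looks for a `<link rel="canonical">` element and resolves its `href` against the URL the page was fetched from. The class and function names and the example.com URLs are made up for illustration, and it does not cover the HTTP redirection case at all.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class CanonicalLinkParser(HTMLParser):
    """Remember the href of the first <link rel="canonical"> seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel is a space-separated list of tokens, compared case-insensitively.
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.canonical = attr["href"]


def canonical_url(html: str, fetched_url: str) -> Optional[str]:
    """Return the page's declared canonical URL, or None if it has none."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    if parser.canonical is None:
        return None
    # The href may be relative, so resolve it against the fetched URL.
    return urljoin(fetched_url, parser.canonical)


page = '<html><head><link rel="canonical" href="/news/story-123"></head></html>'
print(canonical_url(page, "http://www.example.com/news/story-123?utm_source=feed"))
```

Returning `None` rather than falling back to the fetched URL keeps “no canonical URL declared” distinguishable from “this is the canonical URL”.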
- Statement of Principles
One of the things that helps define the fuzzy line between “journalism” and “not journalism” is that the former is usually written under some explicitly stated set of principles. Usually this is a publication-wide thing, such as the BBC Editorial Guidelines or the Press Complaints Commission Editors’ Code of Practice. So the Statement of Principles is usually implicit, determined by the site or publication. However, guidelines do change from time to time, so I would argue that it’s important for articles to explicitly state which principles they were written under. The rel-principles convention was designed to provide a standard way of indicating this online (disclaimer: I’m a contributor to rel-principles).
- Publication
Which publication was the article published in?
It’s obvious that an article on, say, www.nytimes.com is a New York Times article.
But how about an article on www.walesonline.co.uk? Is it from the Western Mail, the South Wales Echo, Wales on Sunday or Celtic Weekly? Or is it a web-only article?
Some sites are good about stating this information on article pages, and some are not.
- Sections and Tags
Which section is the article in? Science? Politics? Lifestyle? It’s nice to know. Some sites also apply tags to articles.
This data is nice to have, but if you want to compare between publications, you’ll find most of them use their own vocabulary - different terms referring to similar concepts.
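One crude way to compare sections across publications is to map each site’s labels onto a shared vocabulary. The Python sketch below shows the idea; every label in the alias table is invented for the example, and a real mapping would have to be built by hand for each publication.

```python
# Hypothetical aliases: publication-specific label -> shared term.
SECTION_ALIASES = {
    "sci/tech": "science",
    "science & environment": "science",
    "uk politics": "politics",
    "westminster": "politics",
    "life & style": "lifestyle",
}


def normalise_section(label: str) -> str:
    """Map a site's section label onto the shared vocabulary."""
    # Lower-case and collapse whitespace so trivial variants match.
    key = " ".join(label.lower().split())
    # Unknown labels pass through in their cleaned-up form.
    return SECTION_ALIASES.get(key, key)


print(normalise_section("  Sci/Tech "))
print(normalise_section("Sport"))
```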
- Comments
A lot of sites allow readers to comment on articles, even if the quality of that discussion varies rather a lot from place to place…
Even a simple comment count is an interesting metric, to gauge the degree of interest and debate an article generates.
- Related Articles
Listing related articles can help provide a bit of perspective and background, especially for a complex or long-running story.
However, the usefulness of most ‘related articles’ boxes is limited by the fact that they only link to articles from the same publication. But hey. It’s better than nothing.
- Prominence
Is the article on the front page? Has it been bumped to the top of its section? Is it being actively promoted in other areas of the site?
- Multiple Pages
Large articles are often split up into multiple pages. In this case, we want to know how to navigate the whole thing:
- how many pages are there?
- are there links to each page?
- are there next/previous page links?
- is there a link to a single-page version (often termed “printer-friendly”)?
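Where a site marks up its pagination with `rel="next"` and `rel="prev"` on `<link>` or `<a>` elements, the next/previous links can be picked out mechanically. The Python sketch below assumes that markup is present, which plenty of sites do not bother with, and the names `PageLinkParser` and `page_links` are invented for the example.

```python
from html.parser import HTMLParser
from typing import Dict


class PageLinkParser(HTMLParser):
    """Collect the first rel="next" and rel="prev" links on a page."""

    def __init__(self):
        super().__init__()
        self.links: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag not in ("link", "a"):
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        for rel in (attr.get("rel") or "").lower().split():
            if rel in ("next", "prev", "previous"):
                # Treat "previous" as a synonym for "prev".
                key = "prev" if rel == "previous" else rel
                # Keep only the first link found for each direction.
                self.links.setdefault(key, href)


def page_links(html: str) -> Dict[str, str]:
    """Return a dict with whichever of 'next' and 'prev' the page declares."""
    parser = PageLinkParser()
    parser.feed(html)
    return parser.links


sample = ('<link rel="prev" href="/story?page=1">'
          '<a rel="next" href="/story?page=3">Next</a>')
print(page_links(sample))
```

Following the `next` link until it runs out would also answer the “how many pages are there?” question, at the cost of fetching every page.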
- Kind of Article
Is it a video, photo slide show or podcast? Is there a textual summary to go with it?
Is it a live blog? They change rapidly and can get very large.
Is it some sort of infographic or fancy interactive data visualisation?
- Provenance
Where did the article come from?
Was it wire copy?
Was it syndicated from another newspaper or agency?
What is the copyright and licensing?
This can be a really fine line - the original source might have been so heavily modified as to constitute a new work. Or the changes might have been minor, e.g. converting units of measurement and currency into local equivalents.
And that’s the information I think you could reasonably expect to glean from a news article online.