Updates to this document will be made in-place - this version : 2014-04-25
domain?
Feed (Hyperlink) Autodiscovery
+DOAP?
NewsMonitor requires an initial list of topic-relevant feed URLs. This may be provided in a plain text file, for example :
# RSS 1.0
http://aabs.wordpress.com/category/semanticweb/rdf
# RSS 2.0
http://www.wasab.dk/morten/blog/archives/category/semweb/feed/
# RSS 2.0
http://www.w3.org/community/rww/feed/
# Atom
http://www.jenitennison.com/blog/atom/feed
# Atom
http://markwatson.com/blog/atom.xml
Lines beginning with '#' are treated as comments. Blank lines and leading/trailing whitespace on URL lines are ignored.
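The rules above for reading the feed list can be sketched in Java roughly as follows. This is a minimal illustration, not the NewsMonitor implementation: the class name FeedListReader and its method names are hypothetical, the file is assumed to be UTF-8, and a '#' preceded only by whitespace is assumed to count as a comment as well.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; the name is not taken from the NewsMonitor code base. */
final class FeedListReader {

    /** Returns the feed URLs found in the given lines, in their original order. */
    static List<String> parseLines(List<String> lines) {
        List<String> urls = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();              // ignore leading/trailing whitespace
            if (line.isEmpty() || line.startsWith("#")) {
                continue;                          // skip blank lines and comments
            }
            urls.add(line);
        }
        return urls;
    }

    /** Reads a text file (UTF-8 is an assumption) and extracts the URLs. */
    static List<String> readFile(Path file) throws IOException {
        return parseLines(Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "# Atom",
            "  http://markwatson.com/blog/atom.xml  ",
            "",
            "# RSS 2.0",
            "http://www.w3.org/community/rww/feed/");
        parseLines(sample).forEach(System.out::println);
    }
}
```

The parsing is kept separate from the file access so that the same rules can be applied to a list supplied from any other source.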
Syndication feeds are typically published in one of three formats (or their variants) : RSS 1.0, RSS 2.0 and Atom. RSS 1.0 is defined as an RDF vocabulary (expressed in RDF/XML), whereas RSS 2.0 and Atom are plain XML formats. For integration within NewsMonitor, the data contained in these feeds will be converted to a common RDF model.
There are various restrictions on the values that can appear in the elements and attributes in the RSS and Atom specifications. These are of little use as constraints, and for modeling purposes most can be reduced to simple strings. More problematic is the way the same kind of data is formatted differently across formats, e.g. RSS 1.0 dates follow the W3CDTF (ISO 8601) format, whereas RSS 2.0 uses the (obsolete) RFC 822 format.
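As an illustration of the date problem, the sketch below tries each format in turn using the standard java.time formatters. The class name FeedDates is hypothetical. Two assumptions are worth stating: a W3CDTF value with a date only is taken to mean midnight UTC, and the RFC 822 branch relies on DateTimeFormatter.RFC_1123_DATE_TIME, which expects a four-digit year and does not accept named zones such as EST, so real-world feeds would need further fallbacks.

```java
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

/** Hypothetical helper that normalises feed dates to OffsetDateTime. */
final class FeedDates {

    static OffsetDateTime parse(String text) {
        String s = text.trim();
        try {
            // W3CDTF with time and zone, e.g. 2014-04-25T10:30:00+01:00
            return OffsetDateTime.parse(s, DateTimeFormatter.ISO_OFFSET_DATE_TIME);
        } catch (DateTimeParseException ignored) {
            // fall through to the next format
        }
        try {
            // RFC 822 style with four-digit year, e.g. Fri, 25 Apr 2014 10:30:00 GMT
            return OffsetDateTime.parse(s, DateTimeFormatter.RFC_1123_DATE_TIME);
        } catch (DateTimeParseException ignored) {
            // fall through to the next format
        }
        try {
            // W3CDTF date only, e.g. 2014-04-25 (midnight UTC is an assumption)
            return LocalDate.parse(s).atStartOfDay().atOffset(ZoneOffset.UTC);
        } catch (DateTimeParseException e) {
            throw new IllegalArgumentException("Unrecognised feed date: " + text, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("2014-04-25T10:30:00Z"));
        System.out.println(parse("Fri, 25 Apr 2014 10:30:00 GMT"));
    }
}
```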
(There are existing libraries for feed parsing, e.g. the (dormant) Apache Jakarta FeedParser and Rome, but none of these were considered suitable for NewsMonitor, because of excessive complexity/dependencies and inappropriate handling of RSS 1.0 (RDF) feeds.)
A large proportion of feeds are invalid according to their declared specification. It is very common to find incorrect media types and encoding errors, and not unusual to find format errors. The media type issue can be sidestepped by simply ignoring the Content-Type HTTP header and instead determining the format by examining ('sniffing') the feed content. Problems with the format are dealt with by substituting the initial strict XML or RDF/XML parser/reader with a liberal, fault-tolerant one, the NewsMonitor SoupParser (so named by analogy with HTML Tag Soup).
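Content sniffing of this kind could look something like the sketch below. It is an assumption-laden illustration rather than the NewsMonitor code: the class FeedSniffer and its Format enum are hypothetical, only the first 4096 characters are inspected, and an rdf:RDF root element is simply assumed to indicate RSS 1.0. The first element name is located with a regular expression rather than an XML parser, so that malformed feeds can still be classified.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical format sniffer; ignores the HTTP Content-Type entirely. */
final class FeedSniffer {

    enum Format { RSS_1_0, RSS_2_0, ATOM, UNKNOWN }

    // First start tag whose name begins with a letter or underscore. The XML
    // declaration, comments and DOCTYPE are skipped because '<' is followed
    // by '?' or '!' there. Group 1 is the local name without any prefix.
    private static final Pattern FIRST_ELEMENT =
        Pattern.compile("<(?:[A-Za-z_][\\w.-]*:)?([A-Za-z_][\\w.-]*)");

    static Format sniff(String content) {
        // Only the start of the document is examined (4096 is arbitrary).
        String head = content.substring(0, Math.min(content.length(), 4096));
        Matcher m = FIRST_ELEMENT.matcher(head);
        if (!m.find()) {
            return Format.UNKNOWN;
        }
        String root = m.group(1);
        if (root.equals("RDF")) {
            return Format.RSS_1_0;   // assumption: an RDF root means RSS 1.0
        }
        if (root.equals("rss")) {
            return Format.RSS_2_0;
        }
        if (root.equals("feed")) {
            return Format.ATOM;
        }
        return Format.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(sniff("<?xml version=\"1.0\"?><rss version=\"2.0\"></rss>"));
    }
}
```

A stricter variant would also check the namespace declarations on the root element, at the cost of rejecting some of the invalid feeds this approach is meant to tolerate.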
Entity\Representation | RSS 1.0 | RSS 2.0 | Atom | NewsMonitor Java | NewsMonitor RDF |
---|---|---|---|---|---|
Feed | rss:channel | channel | atom:feed | feed | rss:channel |
Entry | rss:item | item | atom:entry | entry | schema:article |
Title | dc:title | title | atom:title | x.getTitle() | dcterms:title |
Date | dc:date | pubDate | atom:published, | x.getDate() | dcterms:date |
Source | link | dcterms:source | | | |
Content | dc:description, | description, | atom:content | x.getContent() | schema:articleBody |
Author | dc:creator | author | atom:author | x.getCreator() | dcterms:creator -> foaf:name |
Note : dcterms:creator has a range of dcterms:Agent, so the text name will be represented as e.g.
<#feed> dcterms:creator [foaf:name "John Smith"] .
rss: http://purl.org/rss/1.0/
atom: http://www.w3.org/2005/Atom
dcterms: http://purl.org/dc/terms/
foaf: http://xmlns.com/foaf/0.1/
schema: http://schema.org/
nm: http://purl.org/stuff/newsmonitor/
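To make the creator mapping from the note above concrete, here is a small sketch that writes one entry as a Turtle fragment using the prefixes just listed. It is illustrative only: the class EntryTurtle and its methods are hypothetical, the prefix declarations are left out, the entry URI is assumed to be a valid IRI already, and the property and class names are copied from the NewsMonitor RDF column of the table.

```java
/** Hypothetical serialiser for a single entry; emits a Turtle fragment. */
final class EntryTurtle {

    /** Escapes the characters that may not appear raw in a short Turtle string. */
    static String escape(String s) {
        return s.replace("\\", "\\\\")   // backslash first, so later escapes survive
                .replace("\"", "\\\"")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
    }

    /** The creator name becomes a blank node, since dcterms:creator expects an agent. */
    static String toTurtle(String entryUri, String title, String creatorName) {
        return "<" + entryUri + "> a schema:article ;\n"
             + "    dcterms:title \"" + escape(title) + "\" ;\n"
             + "    dcterms:creator [ foaf:name \"" + escape(creatorName) + "\" ] .\n";
    }

    public static void main(String[] args) {
        System.out.print(toTurtle("http://example.org/post/1", "Hello", "John Smith"));
    }
}
```

In practice an RDF library would normally handle serialisation and escaping; plain string building is used here only to keep the example free of dependencies.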
Use a sigmoid function to normalise values to the range 0..1.
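The note does not say which sigmoid is intended. Assuming the standard logistic function, a normaliser might be sketched as below. The class name Normaliser and the midpoint and steepness parameters are hypothetical additions for illustration.

```java
/** Hypothetical normaliser based on the logistic function. */
final class Normaliser {

    /** Standard logistic sigmoid: maps any real x into the open interval (0, 1). */
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    /**
     * Shifted and scaled variant: values equal to midpoint map to 0.5, and
     * steepness controls how quickly the output approaches 0 or 1.
     */
    static double normalise(double value, double midpoint, double steepness) {
        return sigmoid(steepness * (value - midpoint));
    }

    public static void main(String[] args) {
        System.out.println(normalise(12.0, 10.0, 0.5));
    }
}
```

Because the output only approaches 0 and 1 asymptotically, very large or very small raw values are compressed rather than clipped, which keeps outliers from dominating.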