NewsMonitor Manual

Fusepool module for discovering and monitoring RSS/Atom feeds on predetermined topics

Updates to this document will be made in place. This version: 2014-04-25

Danny Ayers

danny.ayers@gmail.com

Contents


Introduction

Quick Start

domain?

Features

Feed (Hyperlink) Autodiscovery

+FOAF Autodiscovery

+DOAP?

Installation

Seed Feed List

NewsMonitor requires an initial list of topic-relevant feed URLs. This may be provided in a plain text file, for example:


        # RSS 1.0
        http://aabs.wordpress.com/category/semanticweb/rdf

        # RSS 2.0
        http://www.wasab.dk/morten/blog/archives/category/semweb/feed/

        # RSS 2.0
        http://www.w3.org/community/rww/feed/

        # Atom
        http://www.jenitennison.com/blog/atom/feed

        # Atom
        http://markwatson.com/blog/atom.xml

Lines beginning with '#' are treated as comments. Blank lines and leading/trailing whitespace on URL lines are ignored.
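
The rules above can be sketched in a few lines of Java. This is only an illustration, not the NewsMonitor implementation: the class and method names (SeedListReader, parse) are hypothetical, and it assumes that whitespace is trimmed before the '#' check, since the comment lines in the example file are themselves indented.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of NewsMonitor): reads a seed feed list. */
final class SeedListReader {

    /** Returns the feed URLs found in the seed list text, in file order. */
    static List<String> parse(String text) {
        List<String> urls = new ArrayList<>();
        for (String line : text.split("\\R")) {        // split on any line break
            String trimmed = line.trim();              // ignore leading/trailing whitespace
            if (trimmed.isEmpty()) {
                continue;                              // blank line
            }
            if (trimmed.startsWith("#")) {
                continue;                              // comment line
            }
            urls.add(trimmed);
        }
        return urls;
    }
}
```

Calling SeedListReader.parse on the example file above would yield the five feed URLs in the order listed. No URL validation is attempted here; whether NewsMonitor validates seed URLs at load time is not stated in this manual.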


For Users

For Developers

Contents

Feed Formats

Species

Syndication feeds are typically published in one of three formats (or their variants): RSS 1.0, RSS 2.0 and Atom. RSS 1.0 is defined as an RDF vocabulary (expressed in RDF/XML), whereas RSS 2.0 and Atom are plain XML formats. For integration within NewsMonitor, the data contained in these feeds will be converted to a common RDF model.

The RSS and Atom specifications place various restrictions on the values that can appear in elements and attributes. These are of little use as constraints, and for modeling purposes most values can be reduced to simple strings. More problematic is that the same kind of data is formatted differently from one format to the next, e.g. RSS 1.0 dates follow the W3CDTF (ISO 8601) format, whereas RSS 2.0 uses the (obsolete) RFC 822 format.
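
As an illustration of the date problem, the following sketch normalises both date styles to a single java.time.Instant. It is a minimal example under stated assumptions rather than NewsMonitor's own code: the class name FeedDates is hypothetical, date-only W3CDTF values are assumed to mean midnight UTC, and the JDK's RFC 1123 formatter is used as a stand-in for RFC 822, so two-digit years and named zones such as "EST" are not handled.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

/** Hypothetical helper (not part of NewsMonitor): normalises feed dates. */
final class FeedDates {

    /** Parses a W3CDTF or RFC 822 style date; throws DateTimeParseException otherwise. */
    static Instant parse(String raw) {
        String text = raw.trim();
        try {
            // W3CDTF with time and zone, e.g. 2014-04-25T12:00:00Z
            return OffsetDateTime.parse(text).toInstant();
        } catch (DateTimeParseException notFullDateTime) {
            // try the next form
        }
        try {
            // W3CDTF date only, e.g. 2014-04-25 (assumed here to be midnight UTC)
            return LocalDate.parse(text).atStartOfDay(ZoneOffset.UTC).toInstant();
        } catch (DateTimeParseException notDateOnly) {
            // try the next form
        }
        // RFC 822 style with four-digit year, e.g. Sat, 07 Sep 2002 00:00:01 GMT
        return ZonedDateTime.parse(text, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant();
    }
}
```

With this sketch, "2002-09-07T00:00:01Z" and "Sat, 07 Sep 2002 00:00:01 GMT" parse to the same instant. Real-world feeds contain many more malformed variants, so a production parser would need further fallbacks.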

(There are existing libraries for feed parsing, e.g. the (dormant) Apache Jakarta FeedParser and Rome, but none was considered suitable for NewsMonitor because of excessive complexity and dependencies, and inappropriate handling of RSS 1.0 (RDF) feeds.)

Invalid Feeds

A large proportion of feeds are invalid according to their declared specification. It is very common to find incorrect media types and encoding errors, and not unusual to find format errors. The media type issue can be sidestepped by simply ignoring the Content-Type HTTP header and instead determining the format by examining ('sniffing') the feed content. Problems with the format are dealt with by substituting the initial strict XML or RDF/XML parser/reader with a liberal, fault-tolerant one, the NewsMonitor SoupParser (so named by analogy with HTML Tag Soup).
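
One possible way to sniff the format is to look only at the root element of the content, ignoring the Content-Type header entirely. The sketch below does this with the JDK's StAX reader. It is an assumption-laden illustration and not the NewsMonitor sniffer or SoupParser: the names FormatSniffer and Format are hypothetical, any rdf:RDF root is simply assumed to be RSS 1.0, and content that is too broken for a strict XML reader is reported as UNKNOWN (the point at which a fault-tolerant parser would take over).

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper (not part of NewsMonitor): guesses the feed format from its content. */
final class FormatSniffer {

    enum Format { RSS_1_0, RSS_2_0, ATOM, UNKNOWN }

    private static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    private static final String ATOM_NS = "http://www.w3.org/2005/Atom";

    /** Inspects the first element of the document and maps it to a format. */
    static Format sniff(String content) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false); // do not process DTDs
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(content));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    String ns = reader.getNamespaceURI();
                    if ("RDF".equals(name) && RDF_NS.equals(ns)) {
                        return Format.RSS_1_0;   // assumption: RDF root means RSS 1.0
                    }
                    if ("rss".equals(name)) {
                        return Format.RSS_2_0;
                    }
                    if ("feed".equals(name) && ATOM_NS.equals(ns)) {
                        return Format.ATOM;
                    }
                    return Format.UNKNOWN;       // some other XML document
                }
            }
        } catch (XMLStreamException e) {
            // not well-formed enough for a strict reader; fall through
        }
        return Format.UNKNOWN;
    }
}
```

A caller would fetch the feed body, pass it to FormatSniffer.sniff, and choose the parser from the result rather than from the media type the server declared.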

Additionally it is common to find invalid HTML markup in the entry/item content of feeds. NewsMonitor deals with this by using the JTidy library to clean up the markup.

Mappings

Entity   RSS 1.0                          RSS 2.0                  Atom                          NewsMonitor Java  NewsMonitor RDF
-------  -------------------------------  -----------------------  ----------------------------  ----------------  ----------------------------
Feed     rss:channel                      channel                  atom:feed                     feed              rss:channel
Entry    rss:item                         item                     atom:entry                    entry             schema:article
Title    dc:title                         title                    atom:title                    x.getTitle()      dcterms:title
Date     dc:date                          pubDate                  atom:published, atom:updated  x.getDate()       dcterms:date
Source   -                                link                     -                             -                 dcterms:source
Content  dc:description, content:encoded  description, xhtml:body  atom:content                  x.getContent()    schema:articleBody
Author   dc:creator                       author                   atom:author                   x.getCreator()    dcterms:creator -> foaf:name

Note: dcterms:creator has a range of dcterms:Agent, so the text name will be represented as e.g.

        <#feed> dcterms:creator [foaf:name "John Smith"] .
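
To make the blank-node pattern in the note concrete, the following sketch emits a small Turtle description for a title and an author name. It is purely illustrative: the class CreatorTurtle and its methods are hypothetical, a real implementation would more likely build the graph through an RDF library than by string concatenation, and only the most basic literal escaping is done.

```java
/** Hypothetical helper (not part of NewsMonitor): renders a title and creator as Turtle. */
final class CreatorTurtle {

    /** Wraps a value as a Turtle string literal, escaping backslashes, quotes and line breaks. */
    static String literal(String value) {
        String escaped = value
                .replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
        return "\"" + escaped + "\"";
    }

    /** The creator is a blank node carrying foaf:name, because dcterms:creator expects an Agent. */
    static String describe(String subject, String title, String creatorName) {
        return "@prefix dcterms: <http://purl.org/dc/terms/> .\n"
             + "@prefix foaf: <http://xmlns.com/foaf/0.1/> .\n"
             + "\n"
             + "<" + subject + "> dcterms:title " + literal(title) + " ;\n"
             + "    dcterms:creator [ foaf:name " + literal(creatorName) + " ] .\n";
    }
}
```

For example, CreatorTurtle.describe("#feed", "Example", "John Smith") produces a description in which the name hangs off an anonymous agent node instead of being attached to dcterms:creator directly as a string.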
Namespaces

NewsMonitor Vocabulary


Poller

Extractors

Discovery Engine

Relevance Heuristics

Use a sigmoid function to normalise values to the range 0..1.
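
The manual does not say which sigmoid is intended, so the sketch below assumes the standard logistic function. The class name Relevance and the midpoint and steepness parameters are hypothetical, introduced only to show how a raw heuristic score might be centred and scaled before squashing.

```java
/** Hypothetical helper (not part of NewsMonitor): squashes raw scores into (0, 1). */
final class Relevance {

    /** Standard logistic sigmoid: 1 / (1 + e^-x). */
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    /**
     * Normalises a raw score. A raw value equal to midpoint maps to 0.5;
     * steepness controls how quickly the result approaches 0 or 1.
     */
    static double normalise(double raw, double midpoint, double steepness) {
        return sigmoid(steepness * (raw - midpoint));
    }
}
```

Because the logistic function is monotonic, the ranking of raw scores is preserved; only their scale changes, which makes scores from different heuristics easier to combine.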