NewsMonitor Manual

Seed Feed List

NewsMonitor requires an initial list of topic-relevant feed URLs. This may be provided in a plain text file with one URL per line.

Lines beginning with '#' are treated as comments. Blank lines and leading/trailing whitespace on URL lines are ignored.
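
The rules above can be sketched in a few lines of Java. This is only an illustration, not NewsMonitor's actual loader: the class name `SeedListReader`, the method `parse` and the example.org URLs are all hypothetical. It also assumes that a '#' preceded only by whitespace still marks a comment, which the description above does not state explicitly.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical seed-list reader, for illustration only. */
final class SeedListReader {

    /** Returns the feed URLs found in the text, in order of appearance. */
    static List<String> parse(String text) {
        List<String> urls = new ArrayList<>();
        for (String line : text.split("\\R")) {      // split on any line break
            String trimmed = line.trim();            // ignore leading/trailing whitespace
            if (trimmed.isEmpty()) {
                continue;                            // ignore blank lines
            }
            if (trimmed.startsWith("#")) {
                continue;                            // ignore comments (assumption: indented '#' too)
            }
            urls.add(trimmed);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up seed list content; the URLs are placeholders.
        String sample = String.join("\n",
                "# Seed feeds",
                "",
                "  http://example.org/news/feed.rss  ",
                "http://example.org/blog/atom.xml");
        parse(sample).forEach(System.out::println);
    }
}
```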

For Users

For Developers

Feed Formats

Species

Syndication feeds are typically published in one of three formats (or their variants): RSS 1.0, RSS 2.0 and Atom. RSS 1.0 is defined as an RDF vocabulary (expressed in RDF/XML), whereas RSS 2.0 and Atom are plain XML formats. For integration within NewsMonitor, the data contained in these feeds is converted to a common RDF model.

The RSS and Atom specifications place various restrictions on the values that can appear in elements and attributes. These are of little use as constraints, and for modeling purposes most values can be reduced to simple strings. More problematic is that the same kind of data is formatted differently across formats, e.g. RSS 1.0 dates follow the W3CDTF format (a profile of ISO 8601), whereas RSS 2.0 uses the (obsolete) RFC 822 format.
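
As an illustration of the date problem, the following sketch normalizes both styles to a `java.time.Instant`. It is not NewsMonitor's own code: the class `FeedDates` and its `parse` method are hypothetical, and the coverage is deliberately partial. It handles W3CDTF values with a full date and time plus offset, or a date only (assumed here to mean midnight UTC), and RFC 822 style dates with a four-digit year and a numeric or GMT zone. Two-digit years, named zones such as EST, and the W3CDTF year-only and year-month forms are not handled.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

/** Hypothetical date normalizer, for illustration only. */
final class FeedDates {

    /**
     * Tries W3CDTF first, then an RFC 822 style date.
     * Throws DateTimeParseException if no attempt succeeds.
     */
    static Instant parse(String raw) {
        String s = raw.trim();
        try {
            // W3CDTF with time and offset, e.g. 2010-03-04T12:30:00Z
            return OffsetDateTime.parse(s).toInstant();
        } catch (DateTimeParseException ignored) {
            // fall through to the next format
        }
        try {
            // W3CDTF date only, e.g. 2010-03-04 (assumption: treat as midnight UTC)
            return LocalDate.parse(s).atStartOfDay(ZoneOffset.UTC).toInstant();
        } catch (DateTimeParseException ignored) {
            // fall through to the next format
        }
        // RFC 822 style with four-digit year, e.g. Thu, 04 Mar 2010 12:30:00 GMT
        return OffsetDateTime.parse(s, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant();
    }

    public static void main(String[] args) {
        System.out.println(parse("2010-03-04T13:30:00+01:00"));
        System.out.println(parse("Thu, 04 Mar 2010 12:30:00 GMT"));
    }
}
```

Trying the formats in sequence keeps the sketch short. A production parser would probably need additional fallbacks for the malformed dates that occur in real feeds.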

(There are existing libraries for feed parsing, e.g. the (dormant) Apache Jakarta FeedParser and Rome, but none of these was considered suitable for NewsMonitor, because of excessive complexity and dependencies and inappropriate handling of RSS 1.0 (RDF) feeds.)

Invalid Feeds

A large proportion of feeds are invalid according to their declared specification. It is very common to find incorrect media types and encoding errors, and not unusual to find format errors. The media type issue can be sidestepped by simply ignoring the Content-Type HTTP header and instead determining the format by examining ('sniffing') the feed content. Problems with the format are dealt with by replacing the initial strict XML or RDF/XML parser/reader with a liberal, fault-tolerant one, the NewsMonitor SoupParser (named by analogy with HTML 'tag soup').
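
A minimal sketch of content sniffing is given below. It is not the NewsMonitor implementation: `FormatSniffer`, `FeedFormat` and `sniff` are hypothetical names. The sketch skips the XML declaration, comments and any DOCTYPE, then classifies the feed by the local name of the root element. It assumes that an `RDF` root means RSS 1.0 and that an `rss` root means RSS 2.0 (or one of its variants). Plain string scanning is used rather than an XML parser, so the check still works on feeds that are not well-formed.

```java
/** Hypothetical feed format sniffer, for illustration only. */
final class FormatSniffer {

    enum FeedFormat { RSS_1_0, RSS_2_0, ATOM, UNKNOWN }

    /** Guesses the feed format from the first element start tag in the content. */
    static FeedFormat sniff(String content) {
        int pos = 0;
        while (true) {
            int lt = content.indexOf('<', pos);
            if (lt < 0 || lt + 1 >= content.length()) {
                return FeedFormat.UNKNOWN;            // no markup found
            }
            char next = content.charAt(lt + 1);
            int end;
            if (next == '?') {                        // XML declaration or processing instruction
                end = content.indexOf("?>", lt);
                pos = end + 2;
            } else if (content.startsWith("<!--", lt)) {   // comment
                end = content.indexOf("-->", lt);
                pos = end + 3;
            } else if (next == '!') {                 // DOCTYPE (internal subsets are not handled)
                end = content.indexOf('>', lt);
                pos = end + 1;
            } else {
                return classify(rootName(content, lt + 1));
            }
            if (end < 0) {
                return FeedFormat.UNKNOWN;            // unterminated construct
            }
        }
    }

    /** Reads the element name starting at the given index. */
    private static String rootName(String content, int start) {
        int end = start;
        while (end < content.length()) {
            char c = content.charAt(end);
            if (Character.isWhitespace(c) || c == '>' || c == '/') {
                break;
            }
            end++;
        }
        return content.substring(start, end);
    }

    /** Maps a root element name, with or without a prefix, to a format. */
    private static FeedFormat classify(String qName) {
        String local = qName.substring(qName.indexOf(':') + 1);
        switch (local) {
            case "RDF":  return FeedFormat.RSS_1_0;   // assumption: RDF root means RSS 1.0
            case "rss":  return FeedFormat.RSS_2_0;
            case "feed": return FeedFormat.ATOM;
            default:     return FeedFormat.UNKNOWN;
        }
    }

    public static void main(String[] args) {
        System.out.println(sniff("<?xml version=\"1.0\"?><rss version=\"2.0\"><channel/></rss>"));
    }
}
```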

Additionally, it is common to find invalid HTML markup in the entry/item content of feeds. NewsMonitor deals with this by using the JTidy library to clean up the markup.

Mappings

| Entity / Representation | RSS 1.0 | RSS 2.0 | Atom | NewsMonitor Java | NewsMonitor RDF |
|---|---|---|---|---|---|
| Feed | rss:channel | channel | atom:feed | feed | rss:channel |
| Entry | rss:item | item | atom:entry | entry | schema:article |
| Title | dc:title | title | atom:title | x.getTitle() | dcterms:title |
| Date | dc:date | pubDate | atom:published, atom:updated | x.getDate() | dcterms:date |
| Source | | link | | | dcterms:source |
| Content | dc:description, content:encoded | description, xhtml:body | atom:content | x.getContent() | schema:articleBody |
| Author | dc:creator | author | atom:author | x.getCreator() | dcterms:creator -> foaf:name |
