Fusepool NewsMonitor Module Requirements

Fusepool module for discovering and monitoring RSS/Atom feeds on predetermined topics

Updates to this document will be made in-place - this version : 2014-04-21

Danny Ayers

danny.ayers@gmail.com

Overview
Motivation
Data Sources
Potential Users
Use Cases
Legacy for Fusepool
System Architecture
Program Flow
External Interfaces
Format Support
Vocabularies
Protocol Support
Miscellaneous Functionality
Roadmap & Milestones
Open Questions
Future Enhancements

Overview

NewsMonitor is a proposed OSGi module for Fusepool/Apache Stanbol. Its primary operation will be that of a (RSS/Atom) feed reader/aggregator, not unlike the online Planet-style aggregators which tend to be focused on a specific subject domain (e.g. Planet Python, Electra Atlantis: Digital Approaches to Antiquity, Star Astronomy). However, it will employ Semantic Web technologies to go far beyond the functionality typically found in feed readers. By aggregating feed metadata and extracting data from (human-readable) feed content, it will provide linked data access to much more information. Additionally, there will be feed discovery functionality, allowing the system to automatically find and subscribe to feeds which are relevant to the given topic.

In summary, it will :

discover feeds containing posts in the given subject domain
aggregate content
identify and extract information from feed metadata and content
expose the results as linked data using established vocabularies
display results via Web UI

The components and functionality of NewsMonitor map to aspects of the Fusepool Data|Hack|Award|2014 call as follows :

Use cases (described below)
Refine data - RDF will be extracted from the structured data found in feeds, both through refining RSS/Atom metadata and entity extraction applied to the content
Reuse data - the data obtained will be aggregated according to topic and displayed as a combined news channel, with corresponding linked data exposed
Release data - as above, with a live demo

Developer : the NewsMonitor module will be developed by Danny Ayers as a lone coder (aided by the Fusepool group), devoting as much time as necessary to the project.

License : Apache 2

Code Repository : initially danja/feedreader-prototype on GitHub

Motivation - why feeds?

There is a significant quantity of useful user-generated content on the Web, and much of it is found in blogs. This represents a hugely valuable resource that is currently underused. By automatically aggregating and distilling the information found in this content, it may be exposed in a fashion that provides much more value to the end user. A large proportion of blogs and news sites provide RSS or Atom feeds which, although rarely 'clean', are structured data and so much more suitable for conversion to Semantic Web models than arbitrary HTML.

The advantages of this information include :

more human-digestible than typical Semantic Web data
generated in real time, without the latency found in traditional journalism and formal research
not constrained by national borders, usually open and free

It should be noted there are disadvantages with this source of information, some of which will have implications for the implementation of NewsMonitor. The disadvantages of feed material include :

informal language, more ambiguity
less rigorous than conventional journalism/research
potential legal/copyright complications

Data Sources

To bootstrap the system, a set of URLs of feeds associated with the desired topic will be required. Additionally, RDF vocabularies and keyword dictionaries will be required to determine relevance of given sites and to identify entities in content.

The system will be domain-agnostic, but to demonstrate that it may be reused for different topics, at least two should be tried out during development. There are many vocabularies available on the Web, and there are existing dictionaries/vocabularies for string/dictionary matching within Fusepool (plus those to be provided by ProcessMonitor [Alexandro] and PlanningMapper [Andreas]).

Candidate topics include :

European Parliament - focusing on MEPs. As well as being informative, this would also provide data for sentiment analysis, e.g. estimating the popularity of individual MEPs. This should be a good starting point during development as the domain is relatively constrained so a reasonably small dictionary may be used.

Medical - diseases, pharmaceuticals etc. This is a broad domain and will require a much larger keyword dictionary. As such it should make a good test of relevance determination at the other extreme to MEPs.

Semantic Web - RDF, linked data etc. As well as (hopefully) being of interest to Fusepool developers, there are two reasons why this may make a good choice of topic, both relating to the Planet RDF site. Firstly this site offers a feedlist (blogger.rdf) that has been manually compiled over several years and should make an excellent seed list for NewsMonitor. Secondly, as the site is of known relevance, it should make a good reference against which to compare the results of feed discovery by NewsMonitor.

Potential Users

At least four demographics are most likely to be interested in NewsMonitor, for viewing and/or deployment :

domain professionals
lay users - you and I
journalists
linked data developers

Use Cases

The main use case for NewsMonitor is to allow an end user to monitor articles on topics relevant to them. The fact that the system may be configured for a specific domain leaves the door open for use cases applicable to the chosen domain. For example, as mentioned earlier, in the European Parliament domain the data generated may be used for determining popularity of particular MEPs. In the medical domain, a possible use case could be the indication of potential drug side effects. In a way this could be seen as passively crowdsourcing research. Given the informality and immediacy of blogs they are excellent for augmenting information obtained from more formal sources.

It should also be noted that thanks to the use of Semantic Web technologies, there will be plenty of scope for unanticipated reuse and serendipity.

Legacy for Fusepool

Assuming everything goes according to plan, upon completion it is believed that NewsMonitor will add significant value to the Fusepool project. Specifically it will provide :

Showcase Application - good for quick start, and offers an application that has immediate value for end users
Application reusable for different domains
Poller/Aggregator component (in module)
Intelligent Discovery Engine component (in module)

System Architecture

The main components of the system are the Data/Content Store (Fusepool ECS, shown in the diagram as the three blue cylinders), the Poller (NewsMonitor component), a set of data Extractors (NewsMonitor RDFizers and existing Fusepool components) and the Discovery Engine (NewsMonitor component).

Data/Content Store

Three sets of data will be contained in the store:

Dictionaries/Ontologies will be used to guide data extraction and help determine page relevance within the Discovery Engine
Discovery Rules will be applied against entities extracted from candidate page content (and page metadata) to determine page relevance
Page Metadata & Content will be the feedlist and material derived from subscribed feeds. This is the primary output of the system

Poller

The role of the Poller is to periodically step through the URLs in the feedlist and to obtain any new material found in those feeds.

Extractors

The role of the extractors is to :

convert feed content to RDF
extract entities from entry content
extract linked URLs as potential sources of further relevant feeds
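
The first of these roles, converting feed content to RDF, might be sketched as follows. This is only an illustration using the Python standard library, not the planned RDFizer: the function name `atom_entries_to_ntriples` and the mapping of Atom `title` and `updated` to Dublin Core `title` and `modified` are assumptions made for the example, and it relies on each entry's `atom:id` being usable as a subject URI.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
DCT = "http://purl.org/dc/terms/"

def _literal(text):
    # Escape the characters that must be escaped inside an N-Triples literal.
    escaped = (text.replace("\\", "\\\\").replace('"', '\\"')
                   .replace("\n", "\\n").replace("\r", "\\r"))
    return '"' + escaped + '"'

def atom_entries_to_ntriples(atom_xml):
    """Return a list of N-Triples lines describing each entry in an Atom feed."""
    root = ET.fromstring(atom_xml)
    triples = []
    for entry in root.iter(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        if not entry_id:
            continue  # no identifier, so no subject URI to attach triples to
        subject = "<" + entry_id.strip() + ">"
        # Hypothetical mapping from Atom elements to Dublin Core terms.
        for atom_name, dct_name in (("title", "title"), ("updated", "modified")):
            value = entry.findtext(ATOM + atom_name)
            if value:
                triples.append(
                    f"{subject} <{DCT}{dct_name}> {_literal(value.strip())} .")
    return triples
```

Each returned line is a complete triple, so the list could be joined with newlines and loaded into a triplestore.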

Discovery Engine

The role of the Discovery Engine is to take the URLs and entities from a given entry and by means of a series of heuristics determine the relevance of the target page. Part of the heuristics will be rule-driven, with the rules contained in the data store.
Simple heuristics will also be applied to find feeds linked from the target page (Feed Autodiscovery and scraping).
by a set of relevance measures

Program Flow

The system will be initialized by loading the feedlist into working memory (the seed feedlist will previously have been loaded from file).

There will then be three main threads running concurrently. The Poller thread will loop through feed URLs getting new content/data and passing it into the store. The Extractor thread will examine new content and identify entities and any other potentially useful data found in the feeds, including any URLs pointing to other sites. This data will also be passed into the store. The Discovery thread will take the remote URLs found by the Extractor, and for each URL determine a relevance score for the target page. If the score is above the given threshold, feed autodiscovery will be applied to the target page to find any associated RSS/Atom feeds, the URLs of which will be added to the feedlist. Relevance checks will also periodically be made on feeds already on the list, and should they fall below a given threshold their URLs will be removed from the active list (along with any feeds that are unavailable).

The exact details of all these are to be decided; it is anticipated that considerable experimentation will be needed, especially around determining the relevance score.
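
The subscribe/remove logic described for the Discovery thread could look something like the sketch below. The function name `update_feedlist`, the 0.0 to 1.0 score range and the default threshold values are all assumptions for illustration; the real thresholds are meant to be set from the administration page.

```python
def update_feedlist(active, candidate_scores, active_scores, unavailable,
                    subscribe_threshold=0.6, remove_threshold=0.3):
    """Return the new set of active feed URLs after one discovery pass.

    candidate_scores: {feed URL: relevance score} for newly discovered feeds
    active_scores:    {feed URL: relevance score} for feeds already subscribed
    unavailable:      feed URLs that could not be retrieved
    """
    updated = set(active)
    # Subscribe to discovered feeds whose target page scored above the threshold.
    for url, score in candidate_scores.items():
        if score > subscribe_threshold:
            updated.add(url)
    # Remove subscribed feeds whose relevance has fallen below the threshold.
    for url, score in active_scores.items():
        if score < remove_threshold:
            updated.discard(url)
    # Remove feeds that are unavailable.
    updated -= set(unavailable)
    return updated
```

Using two separate thresholds, with the removal threshold lower than the subscription one, avoids feeds near the boundary being repeatedly added and dropped.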

Poller Thread

for each URL :

Conditional GET to retrieve feed content
feed content translated to RDF
(on first run only) feed top-level metadata passed to store
new content and post-level metadata passed to store
entity extraction applied to

Extractor Thread

This will apply existing Fusepool/Stanbol components to identify entities in entries, along with whatever custom extraction is deemed necessary (e.g. URLs and link labels for remote sites).
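
As an illustration of the custom extraction mentioned here, the following sketch collects URLs and link labels for remote sites from the HTML content of an entry. It uses only the Python standard library; the names `LinkCollector` and `remote_links` are hypothetical, and "remote" is assumed to mean "on a different host from the entry itself".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect (absolute URL, label) pairs for every <a href> in a fragment."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # URL of the anchor currently being read, if any
        self._label = []    # text pieces seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._label = []

    def handle_data(self, data):
        if self._href is not None:
            self._label.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._label).strip()))
            self._href = None

def remote_links(entry_html, entry_url):
    """Return (URL, label) pairs for links pointing away from the entry's host."""
    collector = LinkCollector(entry_url)
    collector.feed(entry_html)
    collector.close()
    own_host = urlparse(entry_url).netloc
    return [(url, label) for url, label in collector.links
            if urlparse(url).scheme in ("http", "https")
            and urlparse(url).netloc != own_host]
```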

Discovery Engine Thread

This will apply heuristics to the data retrieved from feeds alongside dictionaries/vocabularies to determine likely relevance of candidate sites. Initially this will be based on simple keyword matching, with other approaches being added as time permits (ideally the relevance tools will be pluggable, so the configuration can easily be changed and other tools added according to what is found to work well with the given subject domain).
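
A minimal sketch of simple keyword matching with pluggable relevance tools is given below. The scoring formula (fraction of dictionary keywords present), the averaging of scorers and the function names are assumptions chosen for illustration; only single-word keywords are handled.

```python
import re

def keyword_score(text, keywords):
    """Fraction of dictionary keywords that occur in the text (0.0 to 1.0)."""
    if not keywords:
        return 0.0
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    hits = sum(1 for keyword in keywords if keyword.lower() in words)
    return hits / len(keywords)

def relevance(text, scorers):
    """Average the results of pluggable scorers, each a callable text -> float."""
    if not scorers:
        return 0.0
    return sum(scorer(text) for scorer in scorers) / len(scorers)
```

Because each scorer is just a callable, a different relevance tool can be swapped in by changing the list passed to `relevance`, for example `relevance(page_text, [lambda t: keyword_score(t, mep_names)])` where `mep_names` is a hypothetical dictionary for the European Parliament domain.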

External Interfaces

Summary

The interfaces will primarily be accessing the graph(s) associated with the Content, Page Metadata and Entities extracted from the feeds, using the HTTP protocol. For machine-machine communications, this should be mostly achievable using existing Fusepool functionality, although access to NewsMonitor components is desirable. The user interfaces will be in-browser using standard HTML, CSS & Javascript technologies.

The interfaces will expose the following:

system data
feed data
aggregated posts, keyword searchable
linked data (using established vocabularies)
SPARQL Endpoint
RESTful interfaces to system components as appropriate

Administration Page

The admin page will include UI elements to control the following :

shutdown
start/stop polling
polling period (time between refreshes)
start/stop extraction
extraction period
start/stop discovery
discovery period
add feed URL
relevancy threshold to subscribe to new sources
relevancy threshold to remove feeds from the active list
expiry time for entries (from one day to infinite)

The admin page will also include a list of subscribed feeds. Each of these will have a control to allow removal of the feed from the active list.

It is likely that other controls will be desirable, such as choice of extraction and relevancy tools. These will be added as time permits.

Again, as time permits, per-feed statistics will also be provided.

Activity Log

Each of the components of the system will log its activity to a common location (ideally the store). These logs will be viewable as a Web page, ideally with control over the level of detail.

Feed View Pages

1-Pane : “River of News” - recent posts sorted by individual post date and/or most recently updated blog (mockup)
3-Pane : feedlist, entry titles, individual entry content (mockup)

RDF View Pages

To explore the data generated by the system, some form of navigable interface will be provided over it. In the first instance the existing Fusepool RDF UI will be used, which (as time permits and necessity demands) may later be customized. Additional custom UIs may be added later.

Format Support

RSS 1.0 (for input)
RSS 2.0 (for input)
Atom (for input and output)
OPML (the de facto standard for feedlist import/export)
RDF formats (provided via existing components)
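
As an example of the OPML support listed above, feed URLs can be read from the `xmlUrl` attribute of `outline` elements. The sketch below assumes a well-formed OPML document; the function name `feed_urls_from_opml` is hypothetical.

```python
import xml.etree.ElementTree as ET

def feed_urls_from_opml(opml_text):
    """Return the feed URLs found in an OPML document, in order, without duplicates."""
    root = ET.fromstring(opml_text)
    urls = []
    # Outlines may be nested (e.g. grouped into folders), so walk the whole tree.
    for outline in root.iter("outline"):
        url = outline.get("xmlUrl")
        if url and url not in urls:
            urls.append(url)
    return urls
```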

Vocabularies

Dublin Core
FOAF
Schema.org
W3C HTTP Vocabulary
Fusepool dictionary vocabularies
Custom terms where necessary for internal operations

Protocol Support

Feed Reading

HTTP/1.1 (targeting HTTPbis)
Conditional GET :

Last-Modified/If-Modified-Since

ETag/If-None-Match
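
A Conditional GET using both validator pairs might be sketched as follows, using Python's standard `urllib`. The cache entry layout (a dict with `etag` and `last_modified` keys) and the function names are assumptions for illustration.

```python
import urllib.error
import urllib.request

def conditional_headers(cache_entry):
    """Build request headers from the validators stored on the previous fetch."""
    headers = {}
    if cache_entry.get("etag"):
        headers["If-None-Match"] = cache_entry["etag"]
    if cache_entry.get("last_modified"):
        headers["If-Modified-Since"] = cache_entry["last_modified"]
    return headers

def fetch_if_changed(url, cache_entry):
    """Return (body, new_cache_entry); body is None when the server answers 304."""
    request = urllib.request.Request(url, headers=conditional_headers(cache_entry))
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            body = response.read()
            # Remember the validators so the next poll can be conditional.
            new_entry = {
                "etag": response.headers.get("ETag"),
                "last_modified": response.headers.get("Last-Modified"),
            }
            return body, new_entry
    except urllib.error.HTTPError as error:
        if error.code == 304:  # Not Modified: keep using the cached copy
            return None, cache_entry
        raise
```

On the first poll of a feed the cache entry is empty, so an ordinary GET is sent; later polls send whichever validators the server supplied.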

Syndication Client Actions on HTTP Status Codes
Status Code	Reason Phrase	Appropriate Action
100	Continue	Record - if persistent, unsubscribe
101	Switching Protocols	Record - if persistent, unsubscribe
200	OK	* Read data
201	Created	Record - if persistent, unsubscribe
202	Accepted	Record - if persistent, unsubscribe
203	Non-Authoritative Information	Record - if persistent, unsubscribe
204	No Content	Record - if persistent, unsubscribe
205	Reset Content	Record - if persistent, unsubscribe
206	Partial Content	Record - if persistent, unsubscribe
300	Multiple Choices	* Follow redirection (keep current URI)
301	Moved Permanently	* Follow redirection, replace URI
302	Found (Temporary Redirect)	* Follow redirection (keep current URI)
303	See Other	* Follow redirection (keep current URI)
304	Not Modified	* Use current cache
305	Use Proxy	* Follow redirection (keep current URI)
307	Temporary Redirect	* Follow redirection (keep current URI)
400	Bad Request	Record - if persistent, unsubscribe
401	Unauthorized	* Unsubscribe
402	Payment Required	* Unsubscribe
403	Forbidden	* Unsubscribe
404	Not Found	* Record - if persistent, unsubscribe
405	Method Not Allowed	Record - if persistent, unsubscribe
406	Not Acceptable	Record - if persistent, unsubscribe
407	Proxy Authentication Required	Record - if persistent, unsubscribe
408	Request Time-out	Record - if persistent, unsubscribe
409	Conflict	Record - if persistent, unsubscribe
410	Gone	* Unsubscribe
411	Length Required	Record - if persistent, unsubscribe
412	Precondition Failed	Record - if persistent, unsubscribe
413	Request Entity Too Large	Record - if persistent, unsubscribe
414	Request-URI Too Large	Record - if persistent, unsubscribe
415	Unsupported Media Type	Record - if persistent, unsubscribe
416	Requested range not satisfiable	Record - if persistent, unsubscribe
417	Expectation Failed	Record - if persistent, unsubscribe
500	Internal Server Error	Record - if persistent, unsubscribe
501	Not Implemented	Record - if persistent, unsubscribe
502	Bad Gateway	Record - if persistent, unsubscribe
503	Service Unavailable	Record - if persistent, unsubscribe
504	Gateway Time-out	Record - if persistent, unsubscribe
505	HTTP Version not supported	Record - if persistent, unsubscribe

[table from Beginning RSS and Atom Programming, Ayers/Watt 2005]
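
The table above can be expressed as a small lookup, sketched here. The `Action` names are illustrative, and the table does not define "persistent": treating five consecutive "record" responses as persistent is purely an assumption for the example.

```python
from enum import Enum

class Action(Enum):
    READ = "read data"
    REDIRECT_KEEP_URI = "follow redirection (keep current URI)"
    REDIRECT_REPLACE_URI = "follow redirection, replace URI"
    USE_CACHE = "use current cache"
    UNSUBSCRIBE = "unsubscribe"
    RECORD = "record - if persistent, unsubscribe"

# Only codes with a specific action are listed; everything else is RECORD.
_ACTIONS = {200: Action.READ, 301: Action.REDIRECT_REPLACE_URI, 304: Action.USE_CACHE}
_ACTIONS.update({code: Action.REDIRECT_KEEP_URI for code in (300, 302, 303, 305, 307)})
_ACTIONS.update({code: Action.UNSUBSCRIBE for code in (401, 402, 403, 410)})

def action_for(status_code):
    """Return the action the table gives for an HTTP status code."""
    return _ACTIONS.get(status_code, Action.RECORD)

def should_unsubscribe(recent_statuses, persistence_limit=5):
    """Decide on unsubscription from a feed's recent status codes (newest last)."""
    actions = [action_for(code) for code in recent_statuses]
    if actions and actions[-1] is Action.UNSUBSCRIBE:
        return True
    # Assumed meaning of "persistent": the last N responses were all RECORD.
    tail = actions[-persistence_limit:]
    return len(tail) == persistence_limit and all(a is Action.RECORD for a in tail)
```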

Admin

HTTP/1.1

GET to examine status

POST to add feed sources

SPARQL (via existing Fusepool components)

Miscellaneous Functionality

Feedlist Loader : read list of feed URLs from text file, load into store (formats: text URL list, Turtle, OPML)
Feed Autodiscovery : given an HTML page, find URLs of any linked feeds (either explicitly stated in <head> metadata or merely linked in <body>)
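
Feed Autodiscovery could be approached along the lines of the sketch below, which looks for `<link rel="alternate">` elements with a feed media type and, as a cruder fallback, for anchors whose URL looks like a feed. The suffix list used for the fallback is a guess made for this example, not a tested heuristic, and the names `FeedLinkFinder` and `discover_feeds` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}
# Assumed URL endings that suggest a feed when no media type is declared.
FEED_SUFFIXES = (".rss", ".atom", "/feed", "/rss", "/atom")

class FeedLinkFinder(HTMLParser):
    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        url = urljoin(self.page_url, href)
        rel = (attributes.get("rel") or "").lower().split()
        mime = (attributes.get("type") or "").lower().strip()
        if tag == "link" and "alternate" in rel and mime in FEED_TYPES:
            self._add(url)  # explicit autodiscovery metadata
        elif tag == "a" and (mime in FEED_TYPES or
                             urlparse(url).path.lower().endswith(FEED_SUFFIXES)):
            self._add(url)  # feed merely linked from the page

    def _add(self, url):
        if url not in self.feeds:
            self.feeds.append(url)

def discover_feeds(page_html, page_url):
    """Return absolute URLs of feeds referenced by an HTML page."""
    finder = FeedLinkFinder(page_url)
    finder.feed(page_html)
    finder.close()
    return finder.feeds
```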

Roadmap & Milestones

The plan is to approach development in a very iterative fashion, aiming to get a quick & dirty system featuring core functionality together as soon as possible. This will then be refactored as functionality is extended.

identify existing Fusepool/Stanbol components that fulfill parts of the requirements
hack prototype system, virtually standalone - will use external triplestore (Fuseki) for storage, accessing via SPARQL 1.1
minimum viable product... (i.e. get it working!)
replace/refactor prototype components into OSGi and integrate with Fusepool
custom UIs (if necessary)
live demo deployment

Development will take place over two months. The first month will be devoted to getting a basic prototype up and running. The second month will be devoted to integrating this with Fusepool/Stanbol, making the system robust and adding extra functionality as time permits. Different areas may be addressed in parallel, though to ensure progress is being made, the provisional roadmap looks like this :

end of week :

1. identify usable Fusepool/Stanbol components, clarify requirements & design
2. prototype poller
3. prototype extractor & discovery components
4. prototype system integration
5. details to be decided
6. details to be decided
7. completed components, OSGi module integrated with Fusepool
8. live demo

Open Questions

As development has only just started, many aspects have yet to be pinned down. It is hoped that the following points will be clarified by the time the prototype is ready (end of week 4) :

time, cost, scope..?
existing components?
custom UI?
discovery rules engine : use RDFS/OWL reasoning? SPARQL? Drools?
copyright issues?
subproject name(s)..?

Future Enhancements

It is expected that the project as described above is quite ambitious for the time available, though it will be open to lots of extension. It is hoped that development will be continued, especially with community input. Here are some possible enhancements :

real-time actions triggered in response to search queries
user interaction, voting up/down sources
per-user customization
UI integration with more of LOD cloud
enrichment with schema.org, RDFa annotation of human-oriented views
exploit any FOAF or other personal profile data about bloggers to build a social dimension into the system