Fusepool NewsMonitor Module Requirements

Fusepool module for discovering and monitoring RSS/Atom feeds on predetermined topics

Updates to this document will be made in-place - this version : 2014-04-21

Danny Ayers

danny.ayers@gmail.com

Contents


Overview

NewsMonitor is a proposed OSGi module for Fusepool/Apache Stanbol. It's primary operation will be that of a (RSS/Atom) feed reader/aggregator, not unlike the online Planet-style aggregators which tend to be focused on a specific subject domain (e.g. Planet Python, Electra Atlantis: Digital Approaches to Antiquity, Star Astronomy). However, it will employ Semantic Web technologies to go far beyond the functionality typically found in feed readers. By aggregating feed metadata and extracting data from (human-readable) feed content, it will provide linked data access to much more information. Additionally, there will be feed discovery functionality, allowing the system to automatically find and subscribe to feeds which are relevant to the given topic.

In summary, it will :

The components and functionality of NewsMonitor map to aspects of the Fusepool Data|Hack|Award|2014 call as follows :

Developer : the NewsMonitor module will be developed by Danny Ayers as a lone coder (aided by the Fusepool group), devoting as much time as necessary to the project.

License : Apache 2

Code Repository : initially danja/feedreader-prototype on GitHub


Motivation - why feeds?

There is a significant quantity of useful user-generated content on the Web, and much of it is found in blogs. This represents a hugely valuable resource that is currently underused. By automatically aggregating and distilling the information found in this content, it may be exposed in a fashion that provides much more value to the end user. A large proportion of blogs and news sites provide RSS or Atom feeds, which although rarely 'clean', is structured data which is much more suitable for conversion to Semantic Web models than arbitrary HTML.

The advantages of this information include :

It should be noted there are disadvantages with this source of information, some of which will have implications for the implementation of NewsMonitor. The disadvantages of feed material include :

Data Sources

To bootstrap the system, a set of URLs of feeds associated with the desired topic will be required. Additionally, RDF vocabularies and keyword dictionaries will be required to determine relevance of given sites and to identify entities in content.

The system will be domain-agnostic, but to demonstrate that it may be reused for different topics, at least two should be tried out during development. There are many vocabularies available on the Web, and there are existing dictionaries/vocabularies for string/dictionary matching within Fusepool (plus those to be provided by ProcessMonitor [Alexandro] and PlanningMapper [Andreas]).

Candidate topics include :

European Parliament - focusing on MEPs. As well as being informative, this would also provide data for sentiment analysis, e.g. estimating the popularity of individual MEPs. This should be a good starting point during development as the domain is relatively constrained so a reasonably small dictionary may be used.

Medical - diseases, pharmaceuticals etc. This is a broad domain and will require a much larger keyword dictionary. As such it should make a good test of relevance determination at the other extreme to MEPs.

Semantic Web - RDF, linked data etc. As well as (hopefully) being of interest to Fusepool developers, there are two reasons why this may make a good choice of topic, both relating to the Planet RDF site. Firstly this site offers a feedlist (blogger.rdf) that has been manually compiled over several years and should make an excellent seed list for NewsMonitor. Secondly, as the site is of known relevance, it should make a good reference against which to compare the results of feed discovery by NewsMonitor.

Potential Users

At least four demographics are most likely to be interested in NewsMonitor, for viewing and/or deployment :

Use Cases

The main use case for NewsMonitor is to allow an end user to monitor articles on topics relevant to them. The fact that the system may be configured for a specific domain leaves the door open for use cases applicable to the chosen domain. For example, as mentioned earlier, in the European Parliament domain the data generated may be used for determining popularity of particular MEPs. In the medical domain, a possible use case could be the indication of potential drug side effects. In a way this could be seen as passively crowdsourcing research. Given the informality and immediacy of blogs they are excellent for augmenting information obtained from more formal sources.

It should also be noted that thanks to the use of Semantic Web technologies, there will be plenty of scope for unanticipated reuse and serendipity.

Legacy for Fusepool

Assuming everything goes according to plan, upon completion it is believed that NewsMonitor will add significant value to the Fusepool project. Specifically it will provide :

System Architecture

architecture

The main components of the system are the Data/Content Store Fusepool ECS (shown in the diagram as the three blue cylinders), the Poller NewsMonitor component, a set of data Extractors NewsMonitor RDFizers and existing Fusepool components and the Discovery Engine NewsMonitor component.

Data/Content Store

Three sets of data will be contained in the store:

Poller

The role of the Poller is to periodically step through the URLs in the feedlist and to obtain any new material found in those feeds.

Extractors

The role of the extractors is to :

Discovery Engine

The role of the Discovery Engine is to take the URLs and entities from a a given entry and by means of a series of heuristics determine the relevance of the target page. Part of the heuristics will be rule-driven, with the rules contained in the data store.
Simple heuristics will also be applied to find feeds linked from the target page (Feed Autodiscovery and scraping).
by a set of relevance measures

Program Flow

The system will be initialized by loading the feedlist into working memory (the seed feedlist will previously have been loaded from file).

There will then be three main threads running concurrently. The Poller thread will loop through feed URLs getting new content/data and passing it into the store. The Extractor thread will examine new content and identify entities and any other potentially useful data found in the feeds, including any URLs pointing to other sites. This data will also be passed into the store. The Discovery thread will take the remote URLs found by the Extractor, and for each URL determine a relevance score for the target page. If the score is above the given threshold, feed autodiscovery will be applied to the target page to find any associated RSS/Atom feeds, the URLs of which will be added to the feedlist. Relevance checks will also periodically be made on feeds already on the list, and should they fall below a given threshold their URLs will be removed from the active list (along with any feeds that are unavailable).

The exact details of all these are to be decided, it's anticipated that considerable experimentation will be needed, especially around determining the relevance score.

Poller Thread

for each URL :

Extractor Thread

This will apply existing Fusepool/Stanbol components to identify entities in entries, along with whatever custom extraction is deemed necessary (e.g. URLs and link labels for remote sites.

Discovery Engine Thread

This will apply heuristics to the data retrieved from feeds alongside dictionaries/vocabularies to determine likely relevance of candidate sites. Initially this will be based on simple keyword matching, with other approaches being added as time permits (ideally the relevance tools will be pluggable, so the configuration can easily be changed and other tools added according to what isfound to work well with the given subject domain).

External Interfaces

Summary

The interfaces will primarily be accessing the graph(s) associated with the Content, Page Metadata and Entities extracted from the feeds, using the HTTP protocol. For machine-machine communications, this should be mostly achievable using existing Fusepool functionality, although access to NewsMonitor components is desirable. The user interfaces will be in-browser using standard HTML, CSS & Javascript technologies.

The interfaces will expose the following:

Administration Page

The admin page with include UI elements to control the following :

The admin page will also include a list of subscribed feeds. Each of these will have a control to allow removal of feed from the active list.

It is likely that other controls will be desirable, such as choice of extraction and relevancy tools. This will be added as time permits.

Again, as time permits, per-feed statistics will also be provided.

Activity Log

Each of the components of the system will log its activity to a common location (ideally the store). These logs will be viewable as a Web page, ideally with control over the level of detail.

Feed View Pages

RDF View Pages

To explore the data generated by the system, some form of navigable interface will be provided over it. In the first instance the existing Fusepool RDF UI will be used, which (as time permits and necessity demands) may later be customized. Additional custom UIs may be added later.

Format Support

Vocabularies

Protocol Support

Feed Reading

Syndication Client Actions on HTTP Status Codes
Status Code Reason Phrase Appropriate Action
100 Continue Record - if persistent, unsubscribe
101 Switching Protocols Record - if persistent, unsubscribe
200 OK * Read data
201 Created Record - if persistent, unsubscribe
202 Accepted Record - if persistent, unsubscribe
203 Non-Authoritative Information Record - if persistent, unsubscribe
204 No Content Record - if persistent, unsubscribe
205 Reset Content Record - if persistent, unsubscribe
206 Partial Content Record - if persistent, unsubscribe
300 Multiple Choices * Follow redirection (keep current URI)
301 Moved Permanently * Follow redirection, replace URI
302 Found (Temporary Redirect) * Follow redirection (keep current URI)
303 See Other * Follow redirection (keep current URI)
304 Not Modified * Use current cache
305 Use Proxy * Follow redirection (keep current URI)
307 Temporary Redirect * Follow redirection (keep current URI)
400 Bad Request Record - if persistent, unsubscribe
401 Unauthorized * Unsubscribe
402 Payment Required * Unsubscribe
403 Forbidden * Unsubscribe
404 Not Found * Record - if persistent, unsubscribe
405 Method Not Allowed Record - if persistent, unsubscribe
406 Not Acceptable Record - if persistent, unsubscribe
407 Proxy Authentication Required Record - if persistent, unsubscribe
408 Request Time-out Record - if persistent, unsubscribe
409 Conflict Record - if persistent, unsubscribe
410 Gone * Unsubscribe
411 Length Required Record - if persistent, unsubscribe
412 Precondition Failed Record - if persistent, unsubscribe
413 Request Entity Too Large Record - if persistent, unsubscribe
414 Request-URI Too Large Record - if persistent, unsubscribe
415 Unsupported Media Type Record - if persistent, unsubscribe
416 Requested range not satisfiable Record - if persistent, unsubscribe
417 Expectation Failed Record - if persistent, unsubscribe
500 Internal Server Error Record - if persistent, unsubscribe
501 Not Implemented Record - if persistent, unsubscribe
502 Bad Gateway Record - if persistent, unsubscribe
503 Service Unavailable Record - if persistent, unsubscribe
504 Gateway Time-out Record - if persistent, unsubscribe
505 HTTP Version not supported Record - if persistent, unsubscribe

[table from Beginning RSS and Atom Programming, Ayers/Watt 2005]

Admin

Miscellaneous Functionality

Roadmap & Milestones

The plan is to approach development in a very iterative fashion, aiming to get a quick & dirty system featuring core functionality together as soon as possible. This will then be refactored as functionality is extended.

Development will take place over two months. The first month will be devoted to getting a basic prototype up and running. The second month will be devoted to integrating this with Fusepool/Stanbol, making the system robust and adding extra functionality as time permits. Different areas may be addressed in parallel, though to ensure progress is being made, the provisional roadmap  looks like this :

end of week
  1. identify usable Fusepool/Stanbol components, clarify requirements & design
  2. prototype poller
  3. prototype extractor & discovery components
  4. prototype system integration
  5. details to be decided
  6. details to be decided
  7. completed components, OSGi module integrated with Fusepool
  8. live demo

Open Questions

As development has only just started, many aspects have yet to be pinned down. It is hoped that the following points will be clarified by the time the prototype is ready (end of week 4) :

Future Enhancements

It is expected that the project as described above is quite ambitious for the time available, though it will be open to lots of extension. It is hoped that development will be continued, especially with community input. Here are some possible enhancements :