
PRIV F31 RSS Grabber: Get Full Posts


Page no: F31

We checked all of our RSS feeds and realized that many of them give us only an excerpt (the beginning) of each post. But we need the whole content, so that we can cut it where we want. So we found another problem.

Implementation

We made a simple RSS grabber: when you give it an RSS feed that contains only excerpts, it turns it into a feed with the full content. You can see this in the screenshots below.

The technology behind this is not so simple, but it works smoothly.

Input: Link from RSS.
Procedure:

  1. The grabber opens the link from the RSS.
  2. It finds the whole content of the page.
  3. It matches the first sentence of the excerpt.
  4. It then parses the whole content and finds where the content ends.
  5. It generates an RSS identical to the old one, but with the full content.

Output: The same RSS; the only difference is that it shows the full content.
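
The matching step can be illustrated with a small Python sketch. This is not the grabber's actual code: the names ParagraphCollector, first_sentence and full_content are hypothetical, and the rule used to find the end of the content (keep every paragraph that sits in the same div as the matched one) is only an assumed heuristic.

```python
import re
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect (container_id, text) for every <p>; container_id marks the enclosing <div>."""

    def __init__(self):
        super().__init__()
        self.div_stack = [0]   # ids of the currently open <div>s; 0 is the document root
        self.next_id = 1
        self.in_p = False
        self.buf = []
        self.paragraphs = []   # list of (container_id, text)

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.div_stack.append(self.next_id)
            self.next_id += 1
        elif tag == "p":
            self.in_p = True
            self.buf = []

    def handle_endtag(self, tag):
        if tag == "div" and len(self.div_stack) > 1:
            self.div_stack.pop()
        elif tag == "p" and self.in_p:
            text = " ".join("".join(self.buf).split())
            self.paragraphs.append((self.div_stack[-1], text))
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.buf.append(data)


def first_sentence(excerpt):
    """Return the excerpt up to the first '.', '!' or '?' (or the whole excerpt)."""
    m = re.match(r"(.+?[.!?])(\s|$)", excerpt.strip(), re.S)
    return " ".join((m.group(1) if m else excerpt).split())


def full_content(page_html, excerpt):
    """Return the paragraphs that share a <div> with the excerpt's first sentence."""
    parser = ParagraphCollector()
    parser.feed(page_html)
    needle = first_sentence(excerpt)
    for container, text in parser.paragraphs:
        if needle in text:
            # Assumed heuristic: the content ends where this <div> ends.
            return [t for c, t in parser.paragraphs if c == container]
    return None  # first sentence not found on the page
```

A real page will often need more than this, for example when the post paragraphs are spread over several nested containers.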

Grabber How to 

After checking all the feeds, we make a list of the feeds whose links need to be replaced with the new RssGrabber link.

Feed in WordPress

1) First we open the needed feed in WordPress. The feed link is located under Links -> All links -> Edit link of the blog.

RSS Address Link

2) Check the content of the old RSS. As we can see, it contains only 100-150 words.

Old RSS

How to call RSS Grabber

3) So we need to change the RSS link using our new tool. It is a simple correction: we only need to add "http://economicblogs.org/rss/get.php?url=" before the real link and the problem is solved.
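
As a small sketch, the correction can be written as a Python helper. The function name grabber_url and the optional encode flag are hypothetical; the text above only says to put the prefix in front of the real link, so plain concatenation is the default. Percent-encoding is offered as an assumption for feed links that carry their own query string.

```python
from urllib.parse import quote

GRABBER_PREFIX = "http://economicblogs.org/rss/get.php?url="


def grabber_url(feed_url, encode=False):
    """Put the grabber prefix in front of the original feed link."""
    if encode:
        # Assumption: encode the link so its own '?' and '&' stay inside the url parameter.
        feed_url = quote(feed_url, safe="")
    return GRABBER_PREFIX + feed_url


# Hypothetical feed link, for illustration only.
print(grabber_url("http://example.com/feed"))
```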

 

 

New RSS with Full Content

Advanced Filters Original project: FeedWP Advanced Filters, as of 2013, latest version

 

Author in Twitter

Custom configuration: Correcting feeds with RssGrabber

Sometimes the Grabber can't recognize where the real content starts and ends. Sometimes it treats both the main column and the sidebar as our content. In this case we have to manually write rules for which content to get.

For that purpose we use the XPath language. All custom rules are stored on the server in txt format, in a file named after the blog. The file content is the rule that we "tell" the grabber to follow: what exactly our content is, and what to skip.

FileZilla Path to RSS

For example, this is the custom rule for businessinsider:

title://div[@class="sl-layout-post"]/h1
body: //div[contains(@class, 'post-content') or contains(@class, 'KonaBody')]
strip: //div[contains(@class, "post-sidebar")]
strip: //div[@id='related-links']
strip: //div[@id='ooyalaplayer_popvideo']
strip: //div[contains(concat(' ',normalize-space(@class),' '),' KonaBody ')]
author://div[@class="byline"]/a
date://div[@class="byline"]/span[@class="date"]
prune: no

test_url: http://www.businessinsider.com/as-europe-booms-on-bailout-deal-john-boehner-just-confirmed-that-the-us-is-nowhere-2011-7?IR=T
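
To make the file format concrete, here is a minimal Python sketch that reads such a rule file into a dictionary. It is not the grabber's own parser: the function name parse_rules is hypothetical, and treating "prune: yes" as the opposite of "prune: no" is an assumption. Each line is split on the first colon only, so XPath expressions and the test_url keep their own colons. Typographic quotes are converted to straight ones, because XPath needs straight quotes and rule text copied from a web page may carry the curly kind.

```python
# Map typographic quotes to the straight quotes that XPath expects.
QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def parse_rules(text):
    """Parse 'key: value' rule lines; 'strip' may appear several times."""
    rules = {"strip": []}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split(":", 1)  # split on the first colon only
        key = key.strip()
        value = value.strip().translate(QUOTES)
        if key == "strip":
            rules["strip"].append(value)
        elif key == "prune":
            rules["prune"] = value.lower() == "yes"  # assumption: yes/no flag
        else:
            rules[key] = value
    return rules
```

With the businessinsider rule above, rules["strip"] would hold the four strip expressions in file order, and rules["test_url"] would hold the complete test link.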

 

 

Custom Rule for Businessinsider

 

 

2nd Example for RSS Grabber:

Example Post: 

Grab Instructions:

strip: //div[contains(@class, "prebid-helper")]
strip: //div[contains(@class, "dialog-base")]
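
What a strip rule does can be shown with Python's standard library, under some loud assumptions. xml.etree.ElementTree supports only a subset of XPath, so this sketch handles exact-attribute rules such as //div[@id='related-links'] only; rules that use contains() need a full XPath 1.0 engine. The input must be well-formed XML, and the function name strip_nodes is hypothetical.

```python
import xml.etree.ElementTree as ET


def strip_nodes(root, xpath):
    """Remove every element matched by a simple '//tag[@attr="value"]' rule."""
    # ElementTree elements do not know their parent, so build a lookup first.
    parents = {child: parent for parent in root.iter() for child in parent}
    # ElementTree rejects absolute paths on an element, hence the leading '.'.
    for node in root.findall("." + xpath):
        parent = parents.get(node)
        if parent is not None:
            parent.remove(node)  # note: this also drops text that directly follows the node


page = ET.fromstring(
    "<html><body><div class='post-content'><p>Text</p>"
    "<div id='related-links'><a href='#'>More</a></div></div></body></html>"
)
strip_nodes(page, "//div[@id='related-links']")
```

After the call, the related-links div is gone from the tree and the paragraph is untouched.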

Types of strips:

  • div
  • span
  • p (paragraphs)
  • a (links)
  • b (bold text)
  • section
What is stripped:
Full HTML:

Default configuration

title://div[@class="title"]/h1
body: //div[contains(@class, 'post-content')]
author://div[@class="author"]/a
date://div[@class="byline"]/span[@class="date"]
prune: no

Default Configuration

 

Text Modifications in the Grabber

LSE Strip Author out of Text

Strip Authors from Zerohedge RSS

List of blogs that needed correction

  • FT Alphaville on CHF
  • Never mind the markets
  • Seeking Alpha articles on CHF
  • Swiss National Bank News
  • Zerohedge Tags on SNB
  • London School of Eco. Blog
  • Ben Bernanke at Brookings
  • Steve Keen’s Debt Watch
  • Max Keiser
  • Mises Institute United States
  • Schiff Gold

All of these configurations are fixed now.

 
