Musings from Mars
Mars Report:

Anthracite: Scrape and Mine the Web Your Way

Published August 27th, 2006
Anthracite: Web Mining Desktop Toolkit. Originally downloaded 8/27/06. Anthracite appears to be a very sophisticated take on the age-old (if 10 years can be an age) art of screen-scraping... whereby you write a little script (originally in Perl) that parses a web page with a known structure to scrape away the little bit of data you're interested in. You can then automatically display that scraped data in another web page, and the script will keep it updated for you (if you run the script regularly). My first screen-scraping was for stock quotes, which I wanted to display on my company's Intranet. With today's web services, RSS, and XML APIs offered by many companies for their data, screen-scraping is less important. But it's still necessary if you really want to "syndicate" someone else's data (with or without their permission) on a personal or company website... simply because the web still has goo-gobs of great information that isn't available any other way.
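For anyone who hasn't written one of these little scripts, here's roughly what old-school screen-scraping looks like, sketched in Python rather than Perl. Everything in it is made up for illustration: the sample page, the "symbol" and "price" class names, and the QuoteScraper and scrape_quotes names are my own assumptions, not anything from a real quote site or from Anthracite. A real script would fetch the page over HTTP on a schedule and would break the moment the site changed its layout.

```python
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a fetched quotes page.
SAMPLE_PAGE = """
<html><body>
<table>
  <tr><td class="symbol">ACME</td><td class="price">12.34</td></tr>
  <tr><td class="symbol">MARS</td><td class="price">56.78</td></tr>
</table>
</body></html>
"""


class QuoteScraper(HTMLParser):
    """Collects (symbol, price) pairs from cells with a known class name."""

    def __init__(self):
        super().__init__()
        self.quotes = []      # finished (symbol, price) pairs
        self._field = None    # class of the <td> currently open, if relevant
        self._symbol = None   # most recent symbol awaiting its price

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            cls = dict(attrs).get("class")
            if cls in ("symbol", "price"):
                self._field = cls

    def handle_data(self, data):
        text = data.strip()
        if not text or self._field is None:
            return
        if self._field == "symbol":
            self._symbol = text
        elif self._field == "price" and self._symbol is not None:
            self.quotes.append((self._symbol, float(text)))
            self._symbol = None

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None


def scrape_quotes(html):
    """Return a list of (symbol, price) tuples found in the given HTML."""
    parser = QuoteScraper()
    parser.feed(html)
    return parser.quotes


if __name__ == "__main__":
    for symbol, price in scrape_quotes(SAMPLE_PAGE):
        print(f"{symbol}: {price:.2f}")
```

The whole trick depends on the "known structure" part: the script only knows which table cells to look for, so it is cheap to write and fragile to maintain.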

I haven't yet tried Anthracite, but its tag line is very intriguing: "Visually construct Spiders and Scrapers without scripting!" It's designed specifically for Mac OS X users and offers ways of integrating scraped data (e.g., SEC filings data) into your daily life in a way not possible without a great deal of effort and programming skill. Anthracite also offers itself as a solution for complex data-processing projects that would require a great deal more programming effort than mining a single page for a single data set. The interface suggests you can do all of this visually, without programming. The end result appears to be a customized RSS feed from a source where one didn't exist before! Inasmuch as the result is an RSS feed, you could theoretically construct data feeds for users on any platform using Anthracite. A tool with claims this sophisticated simply must be given an opportunity to "strut its stuff," so I'm downloading it and adding it to my evaluation queue. Did I mention that your Anthracite download comes with a boatload of example Anthracite "workflows" and example outputs to get you started?
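To make the "RSS feed where one didn't exist before" idea concrete, here's a minimal sketch of the hand-coded version: take some already-scraped items and wrap them in an RSS 2.0 document. This says nothing about how Anthracite does it internally (I haven't tried it yet). The build_rss function, the sample filings, and the example.com URLs are all hypothetical placeholders of mine.

```python
import xml.etree.ElementTree as ET

# Hypothetical scraped items; in practice these would come from a scraper.
SAMPLE_ITEMS = [
    {
        "title": "Q2 filing: Acme & Co.",
        "link": "http://example.com/filings/acme-q2",
        "description": "Quarterly report scraped from a filings page.",
    },
    {
        "title": "Q2 filing: Mars Corp.",
        "link": "http://example.com/filings/mars-q2",
        "description": "Another scraped quarterly report.",
    },
]


def build_rss(title, link, description, items):
    """Return an RSS 2.0 document (as a string) for the given items."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires these three elements on every channel.
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
        ET.SubElement(node, "description").text = item["description"]
    # ElementTree escapes characters such as "&" in the text for us.
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    print(build_rss(
        "Scraped filings (example)",
        "http://example.com/filings",
        "A feed built from scraped data.",
        SAMPLE_ITEMS,
    ))
```

Because the output is plain RSS, any feed reader on any platform can consume it, which is the cross-platform point above.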

Funny, in other techno-babble circles, this kind of activity is being called "mashups", although usually that involves starting with preexisting web services and RSS feeds. Anthracite offers the prospect of building mashups from non-existent web services and combining them with preexisting ones... combining the best of the new "mashup" possibilities with a great new "old" way of doing it. :-)

Version as tested: 1.6.1

    