git clone 'https://github.com/tol/webmine.git'
A web mining toolkit for Clojure. A Swiss Army knife for processing text, images, and feeds from HTML.
The library gives you the most common tools you need out of the box, but is fine-grained enough to let you build your own custom processing tools.
Get a seq of urls, or remove all urls, from an arbitrary string containing urls.
(url-seq raw-text)
(remove-urls raw-text)
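To give a feel for what this kind of extraction involves, here is a minimal sketch built on a plain regex. The names `simple-url-pattern`, `find-urls`, and `strip-urls` are hypothetical and are not part of webmine; the pattern is deliberately crude and is an assumption, not the one the library uses.

```clojure
(require '[clojure.string :as str])

;; Assumption: anything starting with http:// or https:// and running up to
;; the next whitespace counts as a URL. Trailing punctuation is not trimmed.
(def simple-url-pattern #"https?://\S+")

(defn find-urls
  "Returns a vector of URL-like substrings of text, in order of appearance."
  [text]
  (vec (re-seq simple-url-pattern text)))

(defn strip-urls
  "Returns text with every URL-like substring removed."
  [text]
  (str/replace text simple-url-pattern ""))
```

A real extractor would likely also need to handle trailing punctuation and URLs without a scheme.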
Expand a shortened URL (e.g. from bitly).
Webmine wraps TagSoup for parsing, so it handles malformed HTML.
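Expanding a short link usually comes down to following HTTP redirects until none remain. The sketch below does that with the JDK's `HttpURLConnection`. The names `redirect-target` and `expand-url`, the 10-hop limit, and the 5-second timeouts are all assumptions made for illustration; this is not webmine's implementation.

```clojure
(import '(java.net URL URI HttpURLConnection))

(defn redirect-target
  "Sends a HEAD request without following redirects. Returns the absolute
   redirect target if the response is a 3xx with a Location header, else nil."
  [^String url]
  (let [^HttpURLConnection conn (.openConnection (URL. url))]
    (try
      (.setInstanceFollowRedirects conn false)
      (.setRequestMethod conn "HEAD")
      (.setConnectTimeout conn 5000)
      (.setReadTimeout conn 5000)
      (let [code     (.getResponseCode conn)
            location (.getHeaderField conn "Location")]
        (when (and (<= 300 code 399) location)
          ;; Location may be relative, so resolve it against the request URL.
          (str (.resolve (URI. url) ^String location))))
      (finally
        (.disconnect conn)))))

(defn expand-url
  "Follows redirects from url, giving up after 10 hops. lookup maps a URL to
   its redirect target or nil, and defaults to a live HTTP request."
  ([url] (expand-url url redirect-target))
  ([url lookup]
   (loop [current url, hops 0]
     (if-let [next-url (when (< hops 10) (lookup current))]
       (recur next-url (inc hops))
       current))))
```

Passing the lookup as an argument keeps the redirect-following logic testable without network access, since a plain map can stand in for the HTTP call.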
You get a dom tree from a raw html string with:
(def d (dom source-string))
From there, you can do all sorts of things:
(text-from-dom d)
(strip-non-content d)
(attr-map d)
(links-from-dom d)
(divs d)
If you don't find what you need, you can write arbitrary transformations on the DOM tree using walk-dom:
(walk-dom dom visit-node-fn accum-res-fn)
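The general shape of a visit-then-accumulate walk can be sketched as below. This assumes a standard `org.w3c.dom` tree and guesses at the roles of the two function arguments; `parse-dom-document`, `child-nodes`, `walk-tree`, and `count-elements` are hypothetical names, and the exact contract of webmine's `walk-dom` may differ.

```clojure
(import '(javax.xml.parsers DocumentBuilderFactory)
        '(java.io ByteArrayInputStream)
        '(org.w3c.dom Node))

(defn parse-dom-document
  "Parses a well-formed XML string into an org.w3c.dom.Document."
  [^String s]
  (let [builder (.newDocumentBuilder (DocumentBuilderFactory/newInstance))]
    (.parse builder (ByteArrayInputStream. (.getBytes s "UTF-8")))))

(defn child-nodes
  "Returns the direct children of node as a vector."
  [^Node node]
  (let [kids (.getChildNodes node)]
    (mapv #(.item kids %) (range (.getLength kids)))))

(defn walk-tree
  "Depth-first walk. Calls (visit node) on each node, then merges that result
   with the results for its children via (combine own-result child-results)."
  [^Node node visit combine]
  (combine (visit node)
           (map #(walk-tree % visit combine) (child-nodes node))))

;; Example: count the element nodes in a document.
(defn count-elements [^Node root]
  (walk-tree root
             (fn [^Node n] (if (= Node/ELEMENT_NODE (.getNodeType n)) 1 0))
             (fn [own child-counts] (reduce + own child-counts))))
```

For instance, `(count-elements (parse-dom-document "<a><b/><c>hi</c></a>"))` counts the three elements and ignores the text node.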
Get the current entries for a feed.
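As a rough illustration of what reading feed entries involves, the sketch below pulls titles and links out of an RSS 2.0 string using only `clojure.xml`. It does not fetch anything or handle Atom, and `parse-xml-string`, `child-text`, and `rss-items` are hypothetical names rather than webmine functions.

```clojure
(require '[clojure.xml :as xml])
(import '(java.io ByteArrayInputStream))

(defn parse-xml-string
  "Parses an XML string into clojure.xml's {:tag :attrs :content} maps."
  [^String s]
  (xml/parse (ByteArrayInputStream. (.getBytes s "UTF-8"))))

(defn child-text
  "Returns the text of the first child element of el with the given tag, or nil."
  [el tag]
  (->> (:content el)
       (filter #(= tag (:tag %)))
       first
       :content
       first))

(defn rss-items
  "Returns a vector of {:title .. :link ..} maps, one per <item> element."
  [rss-string]
  (->> (xml-seq (parse-xml-string rss-string))
       (filter #(= :item (:tag %)))
       (mapv (fn [item]
               {:title (child-text item :title)
                :link  (child-text item :link)}))))
```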
Identify the canonical rss/atom feeds from a given seq of urls.
Given the url of a blog's homepage or rss feed, find the outlinks to feeds from both the homepage, and all the entries currently in this blog's feed.
Get the blogroll someone is following from their opml.
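OPML stores each subscription as an `outline` element whose `xmlUrl` attribute points at the feed, so a blogroll can be read by collecting those attributes. The following is a minimal sketch with `clojure.xml`; `parse-opml` and `opml-feed-urls` are hypothetical names, not webmine's API.

```clojure
(require '[clojure.xml :as xml])
(import '(java.io ByteArrayInputStream))

(defn parse-opml
  "Parses an OPML string into clojure.xml's {:tag :attrs :content} maps."
  [^String s]
  (xml/parse (ByteArrayInputStream. (.getBytes s "UTF-8"))))

(defn opml-feed-urls
  "Returns the xmlUrl of every outline element that has one, in document
   order. Outlines used only as folders carry no xmlUrl and are skipped."
  [opml-string]
  (->> (xml-seq (parse-opml opml-string))
       (filter #(= :outline (:tag %)))
       (keep #(get-in % [:attrs :xmlUrl]))
       vec))
```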
Get the most relevant image at a particular url.
Get all the images and their sizes out of a dom.
(def some-imgs (imgs d))
Size fu checks attrs and style tags for the most likely main image. We can also fetch the images to get their dimensions if the attrs and style tags both fail.
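The attrs-then-style fallback described above can be sketched roughly as follows. This assumes image attributes arrive as a map with string values under `:width`, `:height`, and `:style`; the function names are hypothetical, only pixel values are understood, and the final step of fetching the image is left out.

```clojure
(defn parse-px
  "Parses strings like \"300\" or \"300px\" into a long. Returns nil for
   anything else, including percentages."
  [s]
  (when s
    (when-let [[_ digits] (re-matches #"\s*(\d+)\s*(?:px)?\s*" s)]
      (Long/parseLong digits))))

(defn style-dimension
  "Looks up a property such as \"width\" in an inline style string."
  [style prop]
  (when style
    (when-let [[_ v] (re-find (re-pattern (str "(?i)(?:^|;)\\s*" prop
                                               "\\s*:\\s*([^;]+)"))
                              style)]
      (parse-px v))))

(defn img-size
  "Returns {:width w :height h}, preferring explicit attributes and falling
   back to the inline style. Returns nil when either dimension is unknown."
  [attrs]
  (let [style (:style attrs)
        w (or (parse-px (:width attrs)) (style-dimension style "width"))
        h (or (parse-px (:height attrs)) (style-dimension style "height"))]
    (when (and w h)
      {:width w :height h})))
```

A `nil` result is the point at which a caller would fall back to downloading the image to measure it.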
We have dom scrubbing fu in webmine.readability based on readability.js. A nice feature of our readability port is that it's easy to change how a div is scored for readability to suit the data you're working with. webmine.readability is used in webmine.images to find the div most likely to contain the main page image.
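One way to picture a swappable scoring step is to pass the scoring function as an argument, as in this sketch. It works on plain text blocks instead of DOM nodes, and the heuristic (a base point, a point per comma, and up to three points for length) is only loosely modeled on readability.js. `default-score` and `best-block` are hypothetical names, not webmine.readability's API.

```clojure
(defn comma-score
  "One point per comma in text."
  [text]
  (count (filter #{\,} text)))

(defn length-score
  "One point per 100 characters, capped at 3."
  [text]
  (min 3 (quot (count text) 100)))

(defn default-score
  "Base point plus comma and length points."
  [text]
  (+ 1 (comma-score text) (length-score text)))

(defn best-block
  "Returns the block with the highest score under score-fn, or nil when
   blocks is empty."
  [score-fn blocks]
  (when (seq blocks)
    (apply max-key score-fn blocks)))
```

Swapping in another heuristic is then a matter of passing a different function, for example `(best-block count blocks)` to rank purely by length.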
For leiningen: [webmine "0.1.1"]