davidsantiago.hickory

https://github.com/davidsantiago/hickory.git

git clone 'https://github.com/davidsantiago/hickory.git'

(ql:quickload :davidsantiago.hickory)
403

Hickory

Hickory parses HTML into Clojure data structures, so you can analyze, transform, and output back to HTML. HTML can be parsed into hiccup vectors, or into a map-based DOM-like format very similar to that used by clojure.xml. It can be used from both Clojure and Clojurescript.

There is API documentation available.

Usage

Parsing

To start, you will want to process your HTML into a parsed representation. Once the HTML is in this form, it can be converted to either Hiccup or Hickory format for further processing. There are two parsing functions, parse and parse-fragment. Both take a string containing HTML and return the parser objects representing the document. (It happens that these parser objects are Jsoup Documents and Nodes, but I do not consider this to be an aspect worth preserving if a change in parser should become necessary).

The first function, parse expects an entire HTML document, and parses it using an HTML5 parser (Jsoup on Clojure and the browser's DOM parser in Clojurescript), which will fix up the HTML as much as it can into a well-formed document. The second function, parse-fragment, expects some smaller fragment of HTML that does not make up a full document, and thus returns a list of parsed fragments, each of which must be processed individually into Hiccup or Hickory format. For example, if parse-fragment is given “<p><br>” as input, it has no common parent for them, so it must simply give you the list of nodes that it parsed.

These parsed objects can be turned into either Hiccup vector trees or Hickory DOM maps using the functions as-hiccup or as-hickory.

Here's a usage example.

user=> (use 'hickory.core)
nil
user=> (def parsed-doc (parse "<a href=\"foo\">foo</a>"))
#'user/parsed-doc
user=> (as-hiccup parsed-doc)
([:html {} [:head {}] [:body {} [:a {:href "foo"} "foo"]]])
user=> (as-hickory parsed-doc)
{:type :document, :content [{:type :element, :attrs nil, :tag :html, :content [{:type :element, :attrs nil, :tag :head, :content nil} {:type :element, :attrs nil, :tag :body, :content [{:type :element, :attrs {:href "foo"}, :tag :a, :content ["foo"]}]}]}]}
user=> (def parsed-frag (parse-fragment "<a href=\"foo\">foo</a> <a href=\"bar\">bar</a>"))
#'user/parsed-frag
user=> (as-hiccup parsed-frag)
IllegalArgumentException No implementation of method: :as-hiccup of protocol: #'hickory.core/HiccupRepresentable found for class: clojure.lang.PersistentVector  clojure.core/-cache-protocol-fn (core_deftype.clj:495)

user=> (map as-hiccup parsed-frag)
([:a {:href "foo"} "foo"] " " [:a {:href "bar"} "bar"])
user=> (map as-hickory parsed-frag)
({:type :element, :attrs {:href "foo"}, :tag :a, :content ["foo"]} " " {:type :element, :attrs {:href "bar"}, :tag :a, :content ["bar"]})

In the example above, you can see an HTML document that is parsed once and then converted to both Hiccup and Hickory formats. Similarly, a fragment is parsed, but it cannot be directly used with as-hiccup (or as-hickory), it must have those functions called on each element in the list instead.

The namespace hickory.zip provides zippers for both Hiccup and Hickory formatted data, with the functions hiccup-zip and hickory-zip. Using zippers, you can easily traverse the trees in any order you desire, make edits, and get the resulting tree back. Here is an example of that.

user=> (use 'hickory.zip)
nil
user=> (require '[clojure.zip :as zip])
nil
user=> (-> (hiccup-zip (as-hiccup (parse "<a href=foo>bar<br></a>"))) zip/node)
([:html {} [:head {}] [:body {} [:a {:href "foo"} "bar" [:br {}]]]])
user=> (-> (hiccup-zip (as-hiccup (parse "<a href=foo>bar<br></a>"))) zip/next zip/node)
[:html {} [:head {}] [:body {} [:a {:href "foo"} "bar" [:br {}]]]]
user=> (-> (hiccup-zip (as-hiccup (parse "<a href=foo>bar<br></a>"))) zip/next zip/next zip/node)
[:head {}]
user=> (-> (hiccup-zip (as-hiccup (parse "<a href=foo>bar<br></a>")))
           zip/next zip/next
           (zip/replace [:head {:id "a"}])
           zip/node)
[:head {:id "a"}]
user=> (-> (hiccup-zip (as-hiccup (parse "<a href=foo>bar<br></a>")))
           zip/next zip/next
           (zip/replace [:head {:id "a"}])
           zip/root)
([:html {} [:head {:id "a"}] [:body {} [:a {:href "foo"} "bar" [:br {}]]]])
user=> (-> (hickory-zip (as-hickory (parse "<a href=foo>bar<br></a>")))
           zip/next zip/next
           (zip/replace {:type :element :tag :head :attrs {:id "a"} :content nil})
           zip/root)
{:type :document, :content [{:type :element, :attrs nil, :tag :html, :content [{:content nil, :type :element, :attrs {:id "a"}, :tag :head} {:type :element, :attrs nil, :tag :body, :content [{:type :element, :attrs {:href "foo"}, :tag :a, :content ["bar" {:type :element, :attrs nil, :tag :br, :content nil}]}]}]}]}
user=> (hickory-to-html *1)
"<html><head id=\"a\"></head><body><a href=\"foo\">bar<br></a></body></html>"

In this example, we can see a basic document being parsed into Hiccup form. Then, using zippers, the HEAD element is navigated to, and then replaced with one that has an id of “a”. The final tree, including the modification, is also shown using zip/root. Then the same modification is made using Hickory forms and zippers. Finally, the modified Hickory version is printed back to HTML using the hickory-to-html function.

Selectors

Hickory also comes with a set of CSS-style selectors that operate on hickory-format data in the hickory.select namespace. These selectors do not exactly mirror the selectors in CSS, and are often more powerful. There is no version of these selectors for hiccup-format data, at this point.

A selector is simply a function that takes a zipper loc from a hickory html tree data structure as its only argument. The selector will return its argument if the selector applies to it, and nil otherwise. Writing useful selectors can often be involved, so most of the hickory.select package is actually made up of selector combinators; functions that return useful selector functions by specializing them to the data given as arguments, or by combining together multiple selectors. For example, if we wanted to figure out the dates of the next Formula 1 race weekend, we could do something like this:

user=> (use 'hickory.core)
nil
user=> (require '[hickory.select :as s])
nil
user=> (require '[clj-http.client :as client])
nil
user=> (require '[clojure.string :as string])
nil
user=> (def site-htree (-> (client/get "http://formula1.com/default.html") :body parse as-hickory))
#'user/site-htree
user=> (-> (s/select (s/child (s/class "subCalender") ; sic
                              (s/tag :div)
                              (s/id :raceDates)
                              s/first-child
                              (s/tag :b))
                     site-htree)
           first :content first string/trim)
"10, 11, 12 May 2013"

In this example, we get the contents of the homepage and use select to give us any nodes that satisfy the criteria laid out by the selectors. The selector in this example is overly precise in order to illustrate more selectors than we need; we could have gotten by just selecting the contents of the P and then B tags inside the element with id “raceDates”.

Using the selectors allows you to search large HTML documents for nodes of interest with a relatively small amount of code. There are many selectors available in the hickory.select namespace, including:

There are also selector combinators, which take as argument some number of other selectors, and return a new selector that combines them into one larger selector. An example of this is the child selector in the example above. Here's a list of some selector combinators in the package (see the API Documentation for the full list):

We can illustrate the selector combinators by continuing the Formula 1 example above. We suspect, to our dismay, that Sebastian Vettel is leading the championship for the fourth year in a row.

user=> (-> (s/select (s/descendant (s/class "subModule")
                                   (s/class "standings")
                                   (s/and (s/tag :tr)
                                          s/first-child)
                                   (s/and (s/tag :td)
                                          (s/nth-child 2))
                                   (s/tag :a))
                     site-htree)
           first :content first string/trim)
"Sebastian Vettel"           

Our fears are confirmed, Sebastian Vettel is well on his way to a fourth consecutive championship. If you were to inspect the page by hand (as of around May 2013, at least), you would see that unlike the child selector we used in the example above, the descendant selector allows the argument selectors to skip stages in the tree; we've left out some elements in this descendant relationship. The first table row in the driver standings table is selected with the and, tag and first-child selectors, and then the second td element is chosen, which is the element that has the driver's name (the first table element has the driver's standing) inside an A element. All of this is dependent on the exact layout of the HTML in the site we are examining, of course, but it should give an idea of how you can combine selectors to reach into a specific node of an HTML document very easily.

Finally, it's worth noting that the select function itself returns the hickory zipper nodes it finds. This is most useful for analyzing the contents of nodes. However, sometimes you may wish to examine the area around a node once you've found it. For this, you can use the select-locs function, which returns a sequence of hickory zipper locs, instead of the nodes themselves. This will allow you to navigate around the document tree using the zipper functions in clojure.zip. If you wish to go further and actually modify the document tree using zipper functions, you should not use select-locs. The problem is that it returns a bunch of zipper locs, but once you modify one, the others are out of date and do not see the changes (just as with any other persistent data structure in Clojure). Thus, their presence was useless and possibly confusing. Instead, you should use the select-next-loc function to walk through the document tree manually, moving through the locs that satisfy the selector function one by one, which will allow you to make modifications as you go. As with modifying any data structure as you traverse it, you must still be careful that your code does not add the thing it is selecting for, or it could get caught in an infinite loop. Finally, for more specialized selection needs, it should be possible to write custom selection functions that use the selectors and zipper functions without too much work. The functions discussed in this paragraph are very short and simple, you can use them as a guide.

The doc strings for the functions in the hickory.select namespace provide more details on most of these functions.

For more details, see the API Documentation.

Hickory format

Why two formats? It's very easy to see in the example above, Hiccup is very convenient to use for writing HTML. It has a compact syntax, with CSS-like shortcuts for specifying classes and ids. It also allows parts of the vector to be skipped if they are not important.

It's a little bit harder to process data in Hiccup format. First of all, each form has to be checked for the presence of the attribute map, and the traversal adjusted accordingly. Raw Hiccup vectors might also have information about class and id in one of two different places. Finally, not every piece of an HTML document can be expressed in Hiccup without resorting to writing HTML in strings. For example, if you want to put a doctype or comment on your document, it has to be done as a string in your Hiccup form containing “<!DOCTYPE html>” or “<!--stuff-->”.

The Hickory format is another data format intended to allow a roundtrip from HTML as text, into a data structure that is easy to process and modify, and back into equivalent (but not identical, in general) HTML. Because it can express all parts of an HTML document in a parsed form, it is easier to search and modify the structure of the document.

A Hickory node is either a map or a string. If it is a map, it will have some subset of the following four keys, depending on the :type:

Text and CDATA nodes are represented as strings.

This is almost the exact same structure used by clojure.xml, the only difference being the addition of the :type field. Having this field allows us to process nodes that clojure.xml leaves out of the parsed data, like doctype and comments.

Obtaining

To get hickory, add

[hickory "0.7.1"]

to your project.clj, or an equivalent entry for your Maven-compatible build tool.

ClojureScript support

Hickory expects a DOM implementation and thus won't work out of the box on node. On browsers it works for IE9+ (you can find a workaround for IE9 here).

Changes

License

Copyright © 2012 David Santiago

Distributed under the Eclipse Public License, the same as Clojure.