PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig or Cascading but you don't need to know much about either of them to use it.
Getting started with Clojure and PigPen is really easy.
Note: It is strongly recommended to familiarize yourself with Clojure before using PigPen.
Note: PigPen is not a Clojure wrapper for writing Pig scripts you can hand edit. While entirely possible, the resulting scripts are not intended for human consumption.
PigPen is available from Maven.
With Leiningen:
;; core library
[com.netflix.pigpen/pigpen "0.3.3"]
;; pig support
[com.netflix.pigpen/pigpen-pig "0.3.3"]
;; cascading support
[com.netflix.pigpen/pigpen-cascading "0.3.3"]
;; rx support
[com.netflix.pigpen/pigpen-rx "0.3.3"]
The platform libraries all reference the core library, so you only need to reference the platform-specific artifact you require; the core library is included transitively.
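For example, a minimal Leiningen project targeting the Pig platform only needs the platform artifact (the project name and version here are hypothetical):

```clojure
;; project.clj (hypothetical project) -- referencing pigpen-pig
;; pulls in the core pigpen library transitively
(defproject my-pigpen-job "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [com.netflix.pigpen/pigpen-pig "0.3.3"]])
```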
Note: PigPen requires Clojure 1.5.1 or greater
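To get a feel for the API, here is a small sketch (assuming the dependencies above) that builds a query from ordinary Clojure functions and evaluates it locally. The query names and data are hypothetical:

```clojure
(require '[pigpen.core :as pig])

;; a query is just a value; nothing runs until you dump it
;; or generate a script from it
(def my-query
  (->> (pig/return [1 2 3 4])
       (pig/map inc)
       (pig/filter even?)))

;; evaluate locally, in-process
(pig/dump my-query)
```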
To use the parquet loader, add this to your dependencies:
[com.netflix.pigpen/pigpen-parquet-pig "0.3.3"]
Here is an example of how to write Parquet data:
(require '[pigpen.core :as pig])
(require '[pigpen.parquet :as pqt])

;; assuming that `data` is a sequence of tuples:
;;
;; [["John" "Smith" 28]
;;  ["Jane" "Doe" 21]]
(defn save-to-parquet
  [output-file data]
  (->> data
       ;; turn each tuple into a map
       (pig/map (partial zipmap [:firstname :lastname :age]))
       ;; then store to Parquet files
       (pqt/store-parquet
         output-file
         (pqt/message "test-schema"
           ;; the field names here MUST match the map's keys
           (pqt/binary "firstname")
           (pqt/binary "lastname")
           (pqt/int64 "age")))))
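Generating a Pig script from that query might look like the following sketch. The file names are hypothetical, and `pigpen.pig/write-script` is assumed to take the output location followed by the query:

```clojure
(require '[pigpen.pig])

;; produce a Pig script that stores the sample tuples as Parquet
(->> (pig/return [["John" "Smith" 28]
                  ["Jane" "Doe" 21]])
     (save-to-parquet "out/people.parquet")
     (pigpen.pig/write-script "store-people.pig"))
```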
And how to load the records back:
(defn load-from-parquet
  [input-file]
  ;; the output will be a sequence of maps
  (pqt/load-parquet
    input-file
    (pqt/message "test-schema"
      (pqt/binary "firstname")
      (pqt/binary "lastname")
      (pqt/int64 "age"))))
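Reading the records back locally might then be a one-liner; this is a sketch with a hypothetical file name, assuming local evaluation of the Parquet loader is available:

```clojure
;; evaluate the load locally; yields a sequence of maps like
;; {:firstname "John" :lastname "Smith" :age 28}
(pig/dump (load-from-parquet "out/people.parquet"))
```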
And check out the pigpen.parquet namespace for usage.
Note: Parquet is currently only supported by Pig
To use the avro loader (alpha), add this to your dependencies:
[com.netflix.pigpen/pigpen-avro-pig "0.3.3"]
And check out the pigpen.avro namespace for usage.
Note: Avro is currently only supported by Pig
Release notes:

- Handling of *print-length* and *print-level* when generating scripts
- Added pigpen.pig/set-options to explicitly set Pig options in a script. This was previously available (though undocumented) by setting {:pig-options {...}} in any options block, but is now official.
- Added pigpen.cascading/generate-flow - generate a Cascading flow from a PigPen query
- Added pigpen.cascading/load-tap - load data from an existing Cascading tap
- Added pigpen.cascading/store-tap - store data using an existing Cascading tap
- Added pigpen.core/keys-fn, a new convenience macro to support named anonymous functions. Like keys destructuring, but less verbose.
- Added the pigpen.core.fn and pigpen.core.op namespaces
- pigpen.core/script is now pigpen.core/store-many
- pigpen.core/generate-script is now pigpen.pig/generate-script
- pigpen.core/write-script is now pigpen.pig/write-script
- pigpen.core/show is now pigpen.viz/show (requires the dependency [com.netflix.pigpen/pigpen-viz "..."])
- pig/dump has changed. The old version was based on rx-java and still exists as pigpen.rx/dump. The replacement, pigpen.core/dump, is now entirely Clojure based. The Clojure version is better for unit tests and small data: all stages are evaluated eagerly, so the stack traces are simpler to read. The rx version is lazy, including the load-* commands, which means you can load a large file, take a few rows, and process them without loading the entire file into memory. The downside is confusing stack traces and extra dependencies.
- Added load-avro in the pigpen-avro project: http://avro.apache.org/
- Added gradlew nRepl to start an nRepl
- Added load-csv, which allows for quoting per RFC 4180
- Rebinding pigpen.local/*max-load-records* is no longer required
- Fixed a bug in pigpen.fold/avg where some collections would produce an NPE
- Added pigpen.local/*max-load-records* - set it to the maximum number of records you want to read locally when reading large files. This defaults to nil (no limit).
- Fixes and speed-ups for load-tsv and load-lazy
- Fixed (pig/return []), e.g. (pig/dump (pig/reduce + (pig/return [])))
- Fixed Longs in scripts that are larger than an Integer in pig/dump
- Added the :pigpen-jar-location option to pig/generate-script and pig/write-script
- Removed dump&show and dump&show+ in favor of pigpen.oven/bake. Call bake once and pass the result to as many outputs as you want. This is a breaking change, but I didn't increment the version because dump&show was just a tool to be used in the REPL. No scripts should break because of this change.
- Removed dump-async. It appeared to be broken and was a bad idea from the start.
- Fixed a bug in pigpen.oven/clean: when pruning the graph, it was also removing REGISTER commands.
- Added support for cogroup with fold over multiple relations
- Added the :partition-by option to distinct
- Added load-json, store-json, load-string, and store-string
- Added filter-by and remove-by
- Fixed a bug in pigpen.fold/vec that also caused fold/map and fold/filter to not work when run in the cluster
- Fixed using for to generate scripts
- Fixed map followed by reduce or fold
- Changed the syntax of join and cogroup:

  ;; before
  (pig/join (foo on :f)
            (bar on :b optional)
            (fn [f b] ...))

  ;; after
  (pig/join [(foo :on :f)
             (bar :on :b :type :optional)]
            (fn [f b] ...))

  Each of the select clauses must now be wrapped in a vector - there is no longer a varargs overload for either of these forms. Within each select clause, :on is now a keyword instead of a symbol, but a symbol will still work if used. If optional or required were used, they must be updated to :type :optional and :type :required, respectively.
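The same vector syntax applies to cogroup, where the consuming function receives the group key plus one grouped sequence per relation. A sketch, with hypothetical relations foo and bar:

```clojure
(pig/cogroup [(foo :on :f)
              (bar :on :b :type :required)]
             (fn [key fs bs]
               ;; one result per key: the key and each group's size
               [key (count fs) (count bs)]))
```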