Writing Apache Pig UDFs in Clojure
A previous post explored writing Apache Pig user defined functions (UDFs) in JRuby.
This post explores writing equivalent UDFs in Clojure.
Prima facie, because Clojure compiles directly to Java bytecode, a Clojure UDF should integrate extremely well with Pig’s Java code. There shouldn’t be any impedance mismatch between the two languages, and performance should be essentially the same as if the UDF were written in Java itself.
But there doesn’t seem to have been much prior experimentation with Clojure UDFs. Perhaps this quote, from a Pig wiki page written when Clojure was a candidate for adding control flow and modularity constructs to Pig, helps to explain the reticence:
“Clojure is a functional language, a paradigm which seems to engender one of love, terror, or confusion. As such, it probably is not a good choice for Pig Latin. Also, Cascalog already exists for those who like Clojure.”
Even we junior members of the Parentherati would view this as short-sighted.
However, I did find one old-ish but particularly useful example: Matt Kangas’s proof-of-concept mountain-pig. Matt’s code was useful to me both for the Java-Clojure interop (new territory for me) and for how to comply with Pig’s UDF implementation requirements (API). You will see snippets of Matt’s code in the example below. Thanks Matt!
It turns out that writing Pig UDFs in Clojure is as easy as writing them in JRuby; the setup is a bit more involved but after that it is plain sailing.
But the potential to exploit the Java UDF API fully is (IMHO) more obvious in Clojure than in JRuby, Python or JavaScript: I think (qualitatively) the Clojure-Java interop makes it easier to write more closely integrated UDFs.
The rest of this post follows broadly the same structure as the JRuby post, and that post is worth reading first as it’s very much a scene setter. Here I focus on writing Clojure UDFs that return scalars (strings), maps (hashes) and (Pig) tuples. As before, the results are destined to be stored in HBase.
Also as before, the example data comes from an Ubuntu audit subsystem (auditd) log file.
Writing Apache Pig UDFs in JRuby
Apache Pig is part of the Hadoop MapReduce ecosystem.
Pig describes itself as:
“a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.”
Pig pre-empts the need to write lower-level MapReduce jobs to process large datasets. It has its own “high level language” for data manipulation and relationships called Pig Latin. Pig Latin is really an external domain specific language (DSL) written in Java.
Pig is a very good tool for extract-transform-and-load (etl) ‘duct tape’ tasks, e.g. taking data from text-based log files, parsing the records and loading the extracted fields into, say, an HBase table.
But Pig Latin is both a strength and a weakness. A strength because, once the basics have been learnt, useful stuff can be done pretty quickly. A weakness because it’s another thing to learn and quite orthogonal to anything else in the Hadoop world. And although Pig itself is written in Java, a Pig Latin script can’t contain Java directly.
Contrast this with Cascalog, a competitor to Pig, which provides its own data manipulation and relationship DSL but in a Clojure environment: Cascalog programs can directly define and use Clojure functions, and use all the facilities of the Clojure language to do their work.
Recognising this limitation, the authors of Pig support the creation of user defined functions (UDFs). The pretty good documentation for UDFs shows how to create them in Java, Python, JavaScript and, with the latest release (0.10.0) of Pig, JRuby.
Writing JRuby UDFs is very straightforward once the recipe has been learnt. The authors of the JRuby support have done a very good job of minimising the impedance mismatch. (The chronology in this link demonstrates how the shape of the final support was arrived at by smart, mutually respectful and talented people. An example, IMHO, of the best in open source.)
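For flavour, a JRuby UDF is just a Ruby class extending Pig’s PigUdf base class, with an outputSchema declaration telling Pig the type a method returns. The sketch below stubs out a minimal PigUdf so it can run outside Pig (a real UDF script would instead get the genuine base class with require 'pigudf'); the class and method names here are my own invention.

```ruby
# Stand-in for Pig's PigUdf base class so this sketch runs outside Pig;
# a real UDF script gets the genuine class with: require 'pigudf'
class PigUdf
  def self.outputSchema(schema)
    @output_schema = schema # Pig reads this to type the UDF's return value
  end
end

# A minimal scalar UDF in the Pig 0.10 JRuby style: declare the output
# schema, then define an ordinary Ruby method.
class AuditUdf < PigUdf
  outputSchema "value:chararray"

  # Pull the value of a key=value field out of an auditd-style record
  def field(line, key)
    m = /\b#{Regexp.escape(key)}=(\S+)/.match(line.to_s)
    m && m[1]
  end
end
```

Registered with something like register 'audit_udf.rb' using jruby as audit;, the method is then callable from Pig Latin as audit.field(line, 'syscall').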
BTW I was led to looking at JRuby UDFs by this very good post by Russell Jurney.
In the rest of the post, I explore the creation of a JRuby UDF to etl a log file into HBase. In my example, the log file was from an Ubuntu audit subsystem log (auditd).
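To give a flavour of the parsing involved (independent of Pig): an auditd record is essentially a series of key=value fields, which a few lines of Ruby can turn into a hash. A sketch, with the function name my own and quoted-value handling simplified:

```ruby
# Parse one auditd record, e.g.
#   type=SYSCALL msg=audit(1364481363.243:24287): syscall=2 success=no comm="cat"
# into a hash of its key=value fields; quoted values lose their quotes.
def parse_audit_line(line)
  line.scan(/(\w+)=("[^"]*"|\S+)/).each_with_object({}) do |(key, value), fields|
    fields[key] = value.delete('"')
  end
end

rec = parse_audit_line('type=SYSCALL syscall=2 success=no comm="cat"')
rec["type"]    # => "SYSCALL"
rec["comm"]    # => "cat"
```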
A service-oriented administrative architecture
About a year ago, over a few beers with a fellow infrastructure veteran, we fell to discussing the challenges of managing large scale infrastructures.
He had been facing similar headaches in his environment. I had also been discussing these challenges with another infrastructure mate who manages thousands of servers, explaining some thoughts I had had on how a scalable, coherent and consistent administration approach could be implemented.
We decided to put our thoughts down on paper to show to some others and I pulled together a first cut. The following notes are broadly those thoughts, with some updates (by me) to fix typos and general improvements for clarity.
The solution described below is something of a swiss army knife: it can be sensibly used not only to manage the bread’n’butter administration of an infrastructure but also to perform many value-add functions easily. To illustrate, in another post I’ve shown an example “data acquisition” handler written in Ruby.
Some notes on using Ruby handlers with mongrel2
The first version of mongrel was a well-known, very quick and well-respected web server.
The author of mongrel1, Zed Shaw, has been working on version 2 for some time with the intention of addressing the lessons he learnt developing mongrel1.
mongrel2 has some novel and compelling features that make it an ideal and flexible centrepiece of a web environment.
Installing ZeroMQ 2.2.0 with the Ruby gem on Ubuntu 12.04
ZeroMQ calls itself an intelligent transport layer or, colloquially, sockets on steroids.
Put simply, ZeroMQ is a relatively low level socket-like interface that allows two cooperating scripts, applications, whatever to communicate.
The website has a 100 word definition:
ØMQ (ZeroMQ, 0MQ, zmq) looks like an embeddable networking library but acts like a concurrency framework. It gives you sockets that carry atomic messages across various transports like in-process, inter-process, TCP, and multicast. You can connect sockets N-to-N with patterns like fanout, pub-sub, task distribution, and request-reply. It’s fast enough to be the fabric for clustered products. Its asynchronous I/O model gives you scalable multicore applications, built as asynchronous message-processing tasks. It has a score of language APIs and runs on most operating systems. ØMQ is from iMatix and is LGPLv3 open source.
This post shows how to install ZeroMQ with the Ruby gem using Ubuntu packages and also from source.