At LiveRamp, much of the code we write is built around the Hadoop ecosystem. While the tools in the ecosystem are very powerful, the APIs are evolving rapidly and many have “gotcha” methods which can cause serious bugs when misused. We’ve found that certain methods were especially common sources of frustration among new developers. For example:
- org.apache.hadoop.io.BytesWritable:getBytes Many of our MapReduce jobs involve translating BytesWritable object into bytes and vice versa. The BytesWritable object has a method getBytes which at first glance seems perfect, but on closer inspection returns a padded array. This can cause major problems when joining two datastores, since objects will not match the padded versions of each other.
- java.nio.ByteBuffer:array Similar problem as to getBytes–the underlying array can be padded, causing issues during joins.
Other times, Hadoop or other APIs provide methods too dangerous to expose to usercode:
- org.apache.hadoop.fs.FileSystem:delete The command line Hadoop utilities are configured to allow all deleted files to be sent to the Trash before deletion, allowing engineers to recover files from the trash when deleted in error. The FileSystem:delete method, however, deletes files immediately, making any mistaken deletes extremely difficult to reverse.
Other times, there are classes which are useful for testing, but cause chaos when they make it into production code:
- cascading.operation.Debug Cascading’s Debug pipe is invaluable during local debugging, by dumping the contents of a pipe to standard out. However, the pipe can cause disasters in production, because all of the standard out is piped into the TaskTracker’s log and can easily fill up all of the node’s disk space.
When possible, we’ve wrapped the functionality for each of these methods in helper classes (for example, Bytes and TrashHelper), but we couldn’t find any obvious way to prevent developers from accidentally using the original methods. To fix this, we took a look at the PMD source code analyzer.
The PMD source code analyzer ships with a set of rules designed to catch common programming mistakes, but also provides an API for building custom rules. We ended up building three custom rules and have released them as a small project on github (pmd_extensions):
- BlacklistMethods: this rule can be given a list of class:method pairs, and will flag any Java file where that class+method is invoked
- BlacklistClassUsages: this rule can be given a list of classes, and any class importing or instantiating one of these classes will be flagged
- NoLoggingInClasses: this rule is a bit more subtle. Similar to the damage the Debug operation can cause, accidentally leaving a stray System.out.println in a MapReduce job can cause the task logs to explode in size. This rule can be given a list of classes, and if the children any of those classes invoke System.out.println internally, the class will be flagged. For example, if passed “cascading.operation.BaseOperation”, PMD will flag logging in any class extending BaseOperation.
These rules can be configured via a custom ruleset. For usage, see, example_ruleset.xml. Classes and methods are passed in via properties:
<rule name="BlacklistedClassUsage" message="Blacklisted class found!" class="com.liveramp.pmd_extensions.BlacklistClassUsages"> <description> Black-list classes which should not be used in production. </description> <priority>1</priority> <properties> <property name="BlacklistedClasses" value=" cascading.operation.Debug "/> </properties> </rule>
The flags can also be suppressed via annotations if necessary (for example, if code is intentionally calling .array() in performance-critical code):
To integrate PMD into our continuous integration build, we use the PMD Maven plugin . The custom ruleset can be added off of the classpath in the plugin configuration:
<plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-pmd-plugin</artifactId> <version>3.1</version> <configuration> <rulesets> <ruleset>/com/liveramp/build_tools/pmd/liveramp_ruleset.xml</ruleset> </rulesets> ... snip </configuration> </plugin>
Of course, PMD can’t replace a complete code review, but auto-blacklisting methods and classes has helped prevent a bit of frustration and a number of unnecessary bugs making it into production.