Apache Eagle keeps an eye on big data usage


Apache Eagle, originally developed at eBay and then donated to the Apache Software Foundation, fills a big data security niche that remains thinly populated, if not bare: It sniffs out possible security and performance issues with big data frameworks.

To do this, Eagle uses other Apache open source components, such as Kafka, Spark, and Storm, to generate and analyze machine learning models from the behavioral data of big data clusters.

Looking in from the inside

Data for Eagle can come from activity logs for various data sources (HDFS, Hive, MapR FS, Cassandra, etc.) or from performance metrics harvested directly from frameworks like Spark. The data can then be piped by the Kafka streaming framework into a real-time detection system that’s built with Apache Storm, or into a model-training system built on Apache Spark. The former is for generating alerts and reports based on existing policies; the latter is for creating machine learning models to drive new policies.
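To make the detection half of that pipeline concrete, here is a minimal sketch in plain Python of policy-based alerting over a stream of audit events. It is not Eagle's API: the `AuditEvent`, `Policy`, and `detect` names, the event fields, and the sample policy are all hypothetical, and an in-memory list stands in for the Kafka topic that would feed a real deployment.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class AuditEvent:
    """Hypothetical shape of one audit-log record."""
    user: str
    command: str   # e.g. "open" or "delete"
    resource: str  # e.g. an HDFS path


@dataclass(frozen=True)
class Policy:
    """A named predicate over events (illustrative, not Eagle's policy format)."""
    name: str
    matches: Callable[[AuditEvent], bool]


def detect(events: Iterable[AuditEvent],
           policies: List[Policy]) -> Iterator[Tuple[str, AuditEvent]]:
    """Yield (policy name, event) for every event that trips a policy."""
    for event in events:
        for policy in policies:
            if policy.matches(event):
                yield policy.name, event


policies = [
    Policy(
        "delete-under-secure",
        lambda e: e.command == "delete" and e.resource.startswith("/secure/"),
    ),
]

# Stand-in for events consumed from a streaming source.
events = [
    AuditEvent("alice", "open", "/tmp/report.csv"),
    AuditEvent("bob", "delete", "/secure/payroll.db"),
]

alerts = list(detect(events, policies))
for policy_name, event in alerts:
    print(f"ALERT {policy_name}: {event.user} ran {event.command} on {event.resource}")
```

Because `detect` is a generator, it handles one event at a time, which is the property a streaming engine needs when the input never ends.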

This emphasis on real-time behavior tops the list of “key qualities” cited for Eagle. It’s followed by “scalability,” “metadata driven” (meaning changes to policies are deployed automatically when their metadata is changed), and “extensibility.” This last one means the data sources, alerting systems, and policy engines used by Eagle are supplied by plugins and aren’t limited to what’s in the box.
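The “metadata driven” and “extensibility” ideas can be sketched roughly as follows. This is an illustration only, under the assumption that a policy is a small record of field, operator, and value; the `PolicyStore` class, the `register_sink` decorator, and the metadata keys are hypothetical names rather than anything taken from Eagle.

```python
import operator
from typing import Any, Callable, Dict, List

# Operators a policy's metadata may name (an illustrative subset).
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "eq": operator.eq,
    "gt": operator.gt,
    "lt": operator.lt,
}


def compile_policy(meta: Dict[str, Any]) -> Callable[[Dict[str, Any]], bool]:
    """Turn a metadata record into a predicate over event dicts."""
    field, op, value = meta["field"], OPERATORS[meta["op"]], meta["value"]
    return lambda event: op(event[field], value)


class PolicyStore:
    """Holds compiled policies; updating metadata swaps behavior immediately."""

    def __init__(self) -> None:
        self._compiled: Dict[str, Callable[[Dict[str, Any]], bool]] = {}

    def upsert(self, meta: Dict[str, Any]) -> None:
        self._compiled[meta["name"]] = compile_policy(meta)

    def evaluate(self, event: Dict[str, Any]) -> List[str]:
        return sorted(n for n, pred in self._compiled.items() if pred(event))


# Plugin registry for alert sinks: new sinks register themselves by name.
SINKS: Dict[str, Callable[[str], None]] = {}


def register_sink(name: str):
    def decorator(fn: Callable[[str], None]) -> Callable[[str], None]:
        SINKS[name] = fn
        return fn
    return decorator


sent: List[str] = []


@register_sink("memory")
def memory_sink(message: str) -> None:
    """Toy sink that collects alerts in a list instead of sending them."""
    sent.append(message)


store = PolicyStore()
store.upsert({"name": "large-read", "field": "bytes", "op": "gt", "value": 1000})
event = {"user": "carol", "bytes": 5000}

before = store.evaluate(event)
for policy_name in before:
    SINKS["memory"](f"{policy_name} triggered by {event['user']}")

# Changing only the metadata changes what the running store flags.
store.upsert({"name": "large-read", "field": "bytes", "op": "gt", "value": 10000})
after = store.evaluate(event)
```

The point of the design is that neither the policy change nor a new sink requires touching the evaluation loop: one is a data update, the other is a registration.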

Besides tasks like analyzing job performance and monitoring for anomalous behavior, Eagle can also analyze user behaviors. This isn’t about, say, analyzing data from a web application to learn about the public users of that app, but rather the users of the big data framework itself: the folks building and managing the Hadoop or Spark back end. An example of how to run such analysis is available, and it could be deployed as-is or modified.
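As a rough illustration of what per-user behavior analysis can look like, the sketch below builds a baseline for each cluster user from past activity counts and flags observations that stray far from it. The z-score approach, the function names, and the sample numbers are assumptions made for this example; they are not a description of the models Eagle itself trains.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Profile = Dict[str, Tuple[float, float]]  # user -> (mean, standard deviation)


def build_profile(history: Dict[str, List[int]]) -> Profile:
    """Summarize each user's past per-interval operation counts."""
    return {user: (mean(counts), pstdev(counts)) for user, counts in history.items()}


def is_anomalous(profile: Profile, user: str, observed: int,
                 threshold: float = 3.0) -> bool:
    """Flag a count more than `threshold` standard deviations from the user's mean."""
    mu, sigma = profile[user]
    if sigma == 0:
        # No variation in the history: any different value stands out.
        return observed != mu
    return abs(observed - mu) / sigma > threshold


# Hypothetical counts of file operations per hour for two cluster users.
history = {
    "dave": [10, 12, 11, 9, 13],
    "erin": [200, 210, 190, 205, 195],
}
profile = build_profile(history)

print(is_anomalous(profile, "dave", 200))  # far above dave's usual activity
print(is_anomalous(profile, "erin", 200))  # ordinary for erin
```

The same raw count is suspicious for one user and routine for another, which is why the baseline is kept per user rather than for the cluster as a whole.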

Eagle also allows application data access to be classified. Only HDFS, Hive, and HBase applications can make use of this feature right now, but its interaction with them provides a model for how other data sources could also be classified.

Let’s keep this under control

Big data frameworks are fast-moving creations, and Eagle’s premise is that it can provide policy-based analysis and alerting as a possible complement to other projects like Apache Ranger. Ranger provides authentication and access control across Hadoop and its related technologies; Eagle gives you some idea of what people are doing once they’re allowed inside.

that commercial offerings could compete on.