Cybersecurity and the Big Yellow Elephant

Categories: General Security, Risk, and Compliance

Cybersecurity has become a leading topic of conversation for organizations across every industry. With the average cost per breach reaching $12.7 million in 2014[1], organizations are turning to new technologies to avoid massive reputational and monetary losses. In recent years a convergence of factors has put Hadoop at the forefront of the cybersecurity arms race. Consider the three factors in the following Venn diagram.



  • Threats – with more of our lives and infrastructure connected via the Internet, the attack surface is much larger and attackers know this
  • Data – we have more data, from more sources, than ever before at our disposal
  • Technology – the Hadoop ecosystem has reached a level of maturity and capability such that more and more organizations can use it for more and more use cases

What we are beginning to see is that Hadoop is being deployed to harness large volumes of diverse, fast moving data in order to fight against this new generation of threats. The next generation of agile, data-driven cybersecurity platforms is augmenting, and at times replacing, traditional systems as the good guys seek to stay ahead of the bad guys.

A typical Information Security organization employs the following individuals in the fight to protect its data and infrastructure:

  • CISO/CIO/CTO – they just want to keep the company off of the front pages; responsible for strategy, budget, and policy enforcement
  • Incident Responder – they monitor, triage, and contain attacks; take a primarily defensive posture, investigate false positives, and use standard techniques/procedures
  • Forensic Threat Analyst – they remediate and analyze attacks in order to revise the protection approach; focus on the kill chain, 360-degree topic of interest, and chain of custody

The tool these teams use most frequently is a Security Information and Event Management system, or SIEM. This class of software collects and analyzes (primarily) log data, uses rule engines for signature-based threat detection, fires alerts when threats are suspected, and creates correlated events. SIEM systems also frequently offer only limited visualization and exploration of data and events. They have served the enterprise quite well over the years, but in the face of new and emerging data sources and threats, several shortcomings are exposed.
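To make signature-based detection concrete, here is a minimal sketch of a rule engine that matches raw log lines against known patterns. The signature names and regular expressions are hypothetical and chosen only for illustration; a real SIEM rule language is far richer than this.

```python
import re

# Hypothetical signatures: each pairs a name with a regex over a raw log line.
SIGNATURES = {
    "ssh_failed_login": re.compile(
        r"Failed password for (?:invalid user )?\S+ from (\S+)"
    ),
    "sql_injection_probe": re.compile(r"(?i)union\s+select"),
}


def match_signatures(log_line):
    """Return the sorted names of every signature that matches the line."""
    return sorted(
        name for name, pattern in SIGNATURES.items() if pattern.search(log_line)
    )


def alerts(log_lines):
    """Yield (line_number, signature_name) for each signature match."""
    for number, line in enumerate(log_lines, start=1):
        for name in match_signatures(line):
            yield number, name


sample_log = [
    "sshd[1]: Failed password for root from 10.0.0.5 port 22 ssh2",
    "GET /index.html 200",
    "GET /item?id=1 UNION SELECT password FROM users",
]
found = list(alerts(sample_log))
```

A rule set like this only fires on patterns someone has already written down, so an event it has no signature for passes through without an alert.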

As the attack surface expands rapidly because of the social, mobile, and connected environment that we now operate in, organizations need to evolve in order to block these new entry points. However, SIEM tools have a hard time keeping up with the volume and variety of data available today, certainly without breaking the bank. As raw source data is forced into the required schema, fidelity loss occurs. It is difficult, if not impossible, for these tools to keep up with the growing sophistication of threats. Finally, they are usually optimized for ingest, not query, so exploration and analysis options are limited.
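The fidelity loss mentioned above can be sketched in a few lines. The fixed schema and the sample event below are hypothetical; the point is only that any field the schema does not anticipate is discarded at ingest time.

```python
import json

# Hypothetical fixed schema that every ingested event must fit.
SCHEMA_FIELDS = ("timestamp", "src_ip", "action")


def force_into_schema(raw_event):
    """Keep only the schema's fields; missing fields become None."""
    event = json.loads(raw_event)
    return {field: event.get(field) for field in SCHEMA_FIELDS}


def dropped_fields(raw_event):
    """List the fields of the raw event that the schema cannot hold."""
    event = json.loads(raw_event)
    return sorted(set(event) - set(SCHEMA_FIELDS))


raw = json.dumps({
    "timestamp": "2015-06-01T12:00:00Z",
    "src_ip": "10.0.0.5",
    "action": "allow",
    "user_agent": "curl/7.35.0",
    "referrer": "http://example.com/login",
})
normalized = force_into_schema(raw)
lost = dropped_fields(raw)
```

Storing the raw event alongside, or instead of, the normalized record keeps those fields available for later questions nobody thought to ask at ingest time.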

Given this reality, there is a risk that the good guys will fall too far behind in the struggle. There are two primary ways that the Hadoop ecosystem, including the many technologies that integrate with it, is being used to overcome these shortcomings. The first way is to Assist the Human, and the second way is to Assist the Machine.

Assist the Human

Incident Responders and Forensic Threat Analysts typically have relatively few tools at their disposal. Because SIEM systems primarily collect data with known structure, and not at extreme volumes, the responders and analysts have less data from fewer sources. Once the data arrives, they have essentially just rule-based detection/correlation and search-based exploration at their disposal. These tools can be quite sophisticated, and they perform well, but there is much more that can be done.

A Hadoop-based cybersecurity system can unlock “dark data” by managing greater volumes, from more sources, and retaining it for much longer. This provides a deeper and wider data set for incident responders and forensic analysts to draw from.

With the myriad processing options, query engines, and algorithms available within Hadoop, incident responders and forensic analysts can now perform more kinds of analysis as they piece together what is happening and what might happen. A common strategy exemplifying this approach is depicted in the diagram below, which shows how a Cloudera enterprise data hub (EDH) can augment Splunk. The items on the left are examples of the types of Deep Analytics that the EDH provides.


  • Arbitrary SQL queries without the need to force all data into a schema up front.
  • Assess the performance of threat models more frequently using more test data.
  • Correlate data from structured, unstructured, and textual sources for 360-degree views of actors, assigning risk scores that incorporate behavior, sentiment analysis, and identity.
  • Large scale data processing and storage to create derivative data sets for exploration of hypotheses.
  • Modify rules or machine learning models and evaluate the changes, which are then fed back into an external SIEM system.

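As a rough illustration of the SQL and correlation items in the list above, the sketch below joins two log sources to surface users who both moved a lot of data and failed several logins. It uses Python's built-in sqlite3 module purely as a stand-in for a SQL engine on Hadoop; the table names, columns, sample rows, and thresholds are all assumptions made for the example.

```python
import sqlite3

# In-memory SQLite stands in for a SQL-on-Hadoop engine; tables are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE proxy_log (username TEXT, bytes_out INTEGER);
CREATE TABLE ad_login (username TEXT, failed INTEGER);
INSERT INTO proxy_log VALUES ('alice', 1200), ('bob', 980000), ('bob', 750000);
INSERT INTO ad_login VALUES ('alice', 0), ('bob', 1), ('bob', 1), ('carol', 1);
""")

# Aggregate each source per user, then join the two views of the same actor.
QUERY = """
SELECT p.username, p.total_bytes_out, a.failed_logins
FROM (SELECT username, SUM(bytes_out) AS total_bytes_out
      FROM proxy_log GROUP BY username) AS p
JOIN (SELECT username, SUM(failed) AS failed_logins
      FROM ad_login GROUP BY username) AS a
  ON a.username = p.username
WHERE p.total_bytes_out > 1000000 AND a.failed_logins >= 2
ORDER BY p.username
"""
suspects = conn.execute(QUERY).fetchall()
```

With this sample data only one user crosses both thresholds. The same query shape applies when the two tables are large, long-retained data sets rather than a handful of rows.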
What if you could ingest not just firewall, proxy, and Active Directory logs, but also arbitrarily structured clickstream data from your web properties, log files from your company travel agent service, the Twitter firehose, public mailing lists, and so on? All in its raw, full-fidelity form? And store it for years instead of purging it after 90 days? And index, correlate, and join it to your heart’s content? With Hadoop, you can. And the CISO/CIO/CTO will be happier.

Assist the Machine


One of the tactics used by attackers these days is to design exploits that explicitly avoid detection by traditional SIEM tools, many of whose detection strategies are widely known. This type of attack is known as an Advanced Persistent Threat, or APT.

One example of this is to deploy a data exfiltration attack that lasts for 95 days, knowing that most SIEM tools only retain data for 90 days, due to cost or scalability limitations. Another example is, quite simply, to employ a sequence of events that is novel so the signature-based strategies cannot detect the attack.
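The retention example can be made concrete with a small sketch. The event list is hypothetical: day 1 is the initial compromise and days 2 through 95 are small outbound transfers. The function simply drops anything older than the retention window.

```python
def retained(events, today, retention_days):
    """Keep only events still inside the retention window on `today`."""
    oldest_kept = today - retention_days + 1
    return [event for event in events if event[0] >= oldest_kept]


# Hypothetical 95-day exfiltration, as (day_number, description) pairs.
attack = [(1, "initial compromise")] + [
    (day, "small outbound transfer") for day in range(2, 96)
]

# View with a 90-day window versus a multi-year window, both on day 95.
siem_view = retained(attack, today=95, retention_days=90)
long_view = retained(attack, today=95, retention_days=365 * 3)
```

On day 95 the 90-day view begins at day 6, so the initial compromise is no longer available to an analyst tracing the attack back to its start, while the longer window still holds the full sequence.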

In addition to the ability to store and process more data for a longer period of time, several algorithms are available within the Hadoop ecosystem for anomaly detection. Anomaly detection is a statistical modeling technique used to identify behavior that is out of the ordinary without having to know a priori what to look for, which signature-based approaches require. Tools that provide such algorithms include Spark MLlib, H2O, R, and SAS, among others. While this use case is difficult to do well, it is the best available option to sniff out the unknown unknowns that organizations are faced with.
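As a minimal sketch of the idea, the example below flags outliers with a modified z-score based on the median absolute deviation (MAD). This is one simple, widely used statistical technique implemented with the Python standard library; it is not taken from any of the tools named above, and the traffic figures and the 3.5 threshold are illustrative assumptions.

```python
from statistics import median


def mad_anomalies(values, threshold=3.5):
    """Return indices whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median of the absolute deviations from the median.
    """
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # All deviations collapse to zero; this simple score is undefined.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - center) / mad > threshold
    ]


# Hypothetical outbound traffic per day for one host, in megabytes.
daily_mb = [48, 52, 50, 47, 51, 49, 53, 50, 400, 52]
flagged = mad_anomalies(daily_mb)
```

No rule describing the 400 MB day was written in advance; it stands out only because it differs from the host's own baseline. Real deployments would model many features at once, which is part of why the use case is hard to do well.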

Elephants and Cybersecurity

As Information Security departments are being asked to do more to protect their infrastructure and data, their tools need to do more. With the maturity of Hadoop, the big yellow elephant is stepping in to fill this gap. To be sure, this type of cybersecurity platform is not trivial to implement, and it’s still early in the game. But the promise for the good guys to stay ahead of the bad guys – with a little help from the yellow elephant – is great.

Next Steps: Learn more on our June 10th webinar, “Data Powered Threat Detection”.

