Open Data Models Accelerate Machine Learning in Cybersecurity


Cyber threats used to be something that humans could handle with the right tools, but today’s threats have grown too big, too fast, and too complex for existing solutions or methodology to handle. Cybersecurity has become a machine-scale problem, and the threats of the future will require a machine-scale solution.

Advanced Persistent Threats (APTs) exploded in 2016 and continue to grow. Machine learning and artificial intelligence will be the most important allies for CISOs and SOC analysts confronting these threats.

In a contributed post dated March 11, 2016, John Lovelock said it best: “Your new security opponent will be a smart machine, so your new defender must be an algorithm.”

But There Are Serious Roadblocks to ML Success in Cybersecurity

The key factor in making ML and AI successful in cybersecurity is data. Any ML system needs the right data, from many different sources across the business; and that data needs to be available in one place, in a known format.

In the real world, most businesses don’t work that way. They’re siloed. Sales has their own data. IT has their own. Security teams, and application teams, and operations teams each have their own political and technical fiefdoms. Active Directory, badge swipes, access logs, activity logs, NetFlow, endpoint logs, DNS, proxy logs… the list of data types relevant to security goes on and on, but they might as well be on different continents for how difficult it is to bring them together.

As a result of this separation, we’re living in a world where security point solutions, each solving a narrow problem, are like the fabled blind men touching an elephant. Each one has a different perspective, and none can grasp the full nature of the beast.

Forward-thinking CISOs recognize this challenge, and are aggressively looking for solutions. It is rapidly becoming a best practice to centralize all the data from throughout an organization into an enterprise data hub built on Hadoop, to make it available for business intelligence, operations optimization, and of course, cybersecurity.

But centralizing all this data is easier said than done. A few core challenges, endemic to many large enterprises, make it particularly thorny:

  1. Internal politics: The bigger the business, the more data silos there are and, more importantly, the more separate owners that data has. When data owners in an organization don’t share their data, cybersecurity becomes much less effective.
  2. Data sources and infrastructure: They’re all over the place! When you need different collection and ingestion methods for dozens of sources, the manual setup work becomes untenable. You’re stuck in data-ingestion hell.
  3. Proprietary formats: Even if you can get all your data centralized, you’ve likely got a handful of different formats, many of them proprietary. If any of your data is in a vendor-owned format that doesn’t play nicely with your analytics, you still can’t actually make use of it.

So, What’s A Modern CISO To Do?

Fortunately, times are changing. More executives are realizing the scale of the impact cybersecurity has on their business, and CISOs are finally getting the board-level influence they need to break down political and cost barriers. Even so, doing this without outside assistance is incredibly difficult. Even the best CISOs can benefit from an unbiased management consultancy to navigate this labyrinth, and if you’re trying to do it now, I would highly recommend getting some “organizational psychology” air cover to make it happen.

The technical challenges are easier, but still nontrivial. We at Versive are partnering with Cloudera to use Hadoop’s massive scalability, Apache Spark’s fast distributed computing capabilities, and the Open Data Models (ODM) of Apache Spot (incubating) to make data collection and formatting far simpler and more standardized.
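To make the idea of a shared data model concrete, here is a minimal Python sketch of what normalizing two siloed sources into one common record might look like. Everything in it is an assumption for illustration: the `CommonEvent` fields, the two input formats, and the helper names `from_proxy_log` and `from_badge_swipe` are hypothetical, and they are not the actual Apache Spot ODM schema or API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class CommonEvent:
    """Hypothetical shared schema; NOT the real Apache Spot ODM fields."""
    timestamp: datetime      # always stored as timezone-aware UTC
    source: str              # which silo the record came from
    user: str                # normalized to lowercase
    action: str              # free-text description of what happened
    src_ip: Optional[str] = None  # not every source has an IP address


def from_proxy_log(line: str) -> CommonEvent:
    # Assumed input format: "<epoch seconds> <client ip> <user> <method> <url>"
    epoch, ip, user, method, url = line.split(maxsplit=4)
    return CommonEvent(
        timestamp=datetime.fromtimestamp(int(epoch), tz=timezone.utc),
        source="proxy",
        user=user.lower(),
        action=f"{method} {url}",
        src_ip=ip,
    )


def from_badge_swipe(record: dict) -> CommonEvent:
    # Assumed input format: dict with an ISO-8601 "time" (with UTC offset),
    # an "employee" name, and a "door" identifier.
    return CommonEvent(
        timestamp=datetime.fromisoformat(record["time"]).astimezone(timezone.utc),
        source="badge",
        user=record["employee"].lower(),
        action=f"swipe {record['door']}",
    )


# Two records from different silos, in different formats.
events = [
    from_proxy_log("1457687700 10.0.0.5 ALICE GET http://example.com/"),
    from_badge_swipe(
        {"time": "2016-03-11T09:10:00+00:00", "employee": "alice", "door": "lobby"}
    ),
]

# Once both share one schema, building a per-user timeline is a simple sort.
timeline = sorted(events, key=lambda e: e.timestamp)
for event in timeline:
    print(event.timestamp.isoformat(), event.source, event.user, event.action)
```

The point of the sketch is the shape of the work, not the specifics: each source gets one small adapter, and everything downstream (analytics, machine learning) only ever sees the common record.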

The final piece of the puzzle is machine learning. There is a growing list of options, each with its own pros and cons. We’ve been building a product, the Versive Security Engine, which provides machine-learning-driven APT detection that we think is unmatched by any other platform, but I’ll let you judge for yourself.

Either way, the mission is clear: Your customers are counting on you to protect their data. Forward-looking corporate boards are raising the political sway of their CISOs so they can bring data from the entire enterprise together to do this. And the technical solution is crystallizing: Store everything centrally in Hadoop. Get it there easily, and in open data formats with Apache Spot. And then layer on the critical machine learning required to find and stop the attackers that are increasingly already inside your environment, lying in wait, expanding their access, and stealing or destroying valuable data.
