How GoPro uses Apache Hadoop in the Cloud

Have you ever wondered what it’s like to soar like an eagle, scale the face of an icy mountain, or ski behind a horse galloping at 40 miles per hour?

OK, even if skijoring (somehow this is a real thing) isn’t your idea of fun, it’s still pretty cool to watch someone else do it.

GoPro, the company that made everyone want to own a camera again, is now running a premier user-generated content network. That network spans several channels, from desktops, mobile devices, and airplane seat backs to embedded apps in TVs and gaming consoles.

The goal of this content network, aside from hosting some pretty entertaining videos, is to create a virtuous cycle of influence that ultimately leads to a new purchase. That cycle looks something like this:

[Image: diagram of the cycle of influence]

Naturally, for a data analyst, this channel provides a trove of insights into product usage patterns, including which camera features are most popular with customers. Understanding user interaction across the ecosystem helps guide GoPro’s research and development spend. Additionally, if GoPro can better understand the profile of a user who is likely to share videos, the company can better tailor its marketing and predict revenue.

The key for the GoPro Data Science and Engineering (DS&E) team was to figure out how to take this data in, make sense of it, and report their findings to executives. This data is generated in the cloud, and it’s important for GoPro to be able to manage and process that data where it lives. That’s why their Hadoop-based data management platform includes Cloudera Enterprise on Amazon Web Services (AWS).

Cloudera, Tableau, and Trifacta recently hosted a webinar on this subject with GoPro data architect Josh Byrd and principal data engineer David Winters. We covered a variety of topics, from their use of Apache Kafka and Spark Streaming for massive data ingest to ETL cluster setup to TPS report cover sheets. In case you don’t have an hour to spare (that doesn’t leave much time for surfing with your dog), here are a few highlights:

GoPro is ingesting logs containing product analytics, social media data, web traffic, GoPro channels, third-party systems, and internal ERP systems, some streaming and some batch. Those logs are then streamed into Kafka and Apache Spark before landing in HDFS. This pipeline supports synchronous and asynchronous reads and writes as well as rapid processing. You can get a closer look at their ingest cluster in the diagram below.

[Image: diagram of GoPro’s ingest cluster]
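
To make the ingest flow a little more concrete, here is a minimal, pure-Python sketch of the general idea: events from mixed sources are grouped into small batches and landed under a path partitioned by source. This is not GoPro’s code and it does not use the Kafka or Spark APIs. The function names, the `/data/raw/source=...` layout, and the count-based batching are all assumptions made for illustration (Spark Streaming forms its micro-batches by time interval, not by record count).

```python
from collections import defaultdict
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple


def micro_batches(events: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive batches of at most batch_size events."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def land_batch(batch: List[dict], storage: Dict[str, List[dict]]) -> None:
    """Append each event to an in-memory 'directory' keyed by its source.

    The path layout is hypothetical; a dict stands in for HDFS here.
    """
    for event in batch:
        path = f"/data/raw/source={event['source']}"
        storage[path].append(event)


def run_pipeline(events: Iterable[dict], batch_size: int = 2) -> Tuple[Dict[str, List[dict]], int]:
    """Process the stream batch by batch; return landed data and batch count."""
    storage: Dict[str, List[dict]] = defaultdict(list)
    batch_count = 0
    for batch in micro_batches(events, batch_size):
        land_batch(batch, storage)
        batch_count += 1
    return dict(storage), batch_count
```

In a real deployment the loop would be replaced by a streaming job reading from Kafka topics, but the shape is the same: small batches in, partitioned files out.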

GoPro doesn’t refer to its data platform as a “data lake” because the company puts a great deal of emphasis on data governance and access control. They use Apache Sentry for role-based permissions at the HDFS and Apache Hive level, with three permission levels (read, read/write, and admin) and several Active Directory groups, and they authenticate through Kerberos. That governance structure extends through Trifacta and Tableau, so analysts with the proper permissions can slice and dice data in ways they were unable to before and share those reports with key decision makers.
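
As a rough illustration of that permission model, the sketch below maps directory groups to one of the three levels per resource and checks a request against them. It is not Sentry’s API or GoPro’s configuration: the group names, resource names, and the assumption that the levels are strictly ordered (admin implies read/write, which implies read) are all hypothetical.

```python
from enum import IntEnum
from typing import Dict, Iterable, Optional


class Permission(IntEnum):
    """The three levels named in the post, assumed here to be ordered."""
    READ = 1
    READ_WRITE = 2
    ADMIN = 3


# Hypothetical directory groups and the level each holds per resource.
GROUP_GRANTS: Dict[str, Dict[str, Permission]] = {
    "analysts": {"sales_db": Permission.READ},
    "data_engineers": {
        "sales_db": Permission.READ_WRITE,
        "raw_logs": Permission.READ_WRITE,
    },
    "platform_admins": {
        "sales_db": Permission.ADMIN,
        "raw_logs": Permission.ADMIN,
    },
}


def effective_permission(groups: Iterable[str], resource: str) -> Optional[Permission]:
    """Return the highest level any of the user's groups holds, or None."""
    levels = [
        GROUP_GRANTS[group][resource]
        for group in groups
        if resource in GROUP_GRANTS.get(group, {})
    ]
    return max(levels) if levels else None


def is_allowed(groups: Iterable[str], resource: str, required: Permission) -> bool:
    """True if the user's effective level meets or exceeds the required one."""
    granted = effective_permission(groups, resource)
    return granted is not None and granted >= required
```

For example, `is_allowed(["analysts"], "sales_db", Permission.READ_WRITE)` is `False` under these grants, while adding `"data_engineers"` to the group list makes it `True`.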

GoPro’s data platform infrastructure runs entirely on AWS EC2 nodes. AWS provides the DS&E team with speed and flexibility. No need to rack hardware. The team is running a Cloudera Enterprise cluster with local storage because it provides them with “best-of-breed” technologies along with the ability to select the analytical engine required for the specific job they’re trying to run. For example, SQL queries run on Apache Impala (incubating), transactional and streaming workloads leverage Kafka and Spark, and operational database-type jobs use Apache HBase.
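
The engine-per-workload idea can be written down as a simple lookup. The sketch below only restates the pairings from the paragraph above; the workload labels and the `pick_engine` helper are made up for illustration and do not correspond to any Cloudera tool.

```python
# Workload labels are hypothetical; the engine pairings come from the post.
ENGINE_BY_WORKLOAD = {
    "sql": "Apache Impala",
    "transactional": "Kafka + Spark",
    "streaming": "Kafka + Spark",
    "operational": "Apache HBase",
}


def pick_engine(workload: str) -> str:
    """Return the engine paired with a workload type, or raise if unknown."""
    try:
        return ENGINE_BY_WORKLOAD[workload.lower()]
    except KeyError:
        raise ValueError(f"no engine configured for workload {workload!r}") from None
```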

Cameras don’t just take photos and videos. They capture moments and tell stories. To hear the story of GoPro’s ambitious data modernization strategy, watch the webinar in full: “Extreme Sports and Beyond: Exploring a New Frontier in Data with GoPro.”


Related Links:
Learn more about running Cloudera on Amazon Web Services
Discover more about the Cloudera + Tableau partnership
Find out more about the Cloudera + Trifacta partnership

