Data Science at Scale Using Apache Spark and Apache Hadoop: Q&A from the Webinar

In a recent webinar, “Breaking Through The Noise: Defining Data Science Skills, Tools, and Training”, Cloudera’s Director of Data Science, Sean Owen, and Principal Curriculum Developer, Tom Wheeler, presented on the state of the data science industry and went over these topics:

  • What is data science, and who are data scientists?
  • What tools does a data scientist use?
  • Why do companies need data scientists?
  • Who should take introductory data science training?

We had a number of great questions that both Sean Owen and Tom Wheeler helped answer, so we've written up some of them in the Q&A below.

Training & Certification

Q1: What path do you suggest to get certified as a data scientist?

A1: Our course is a great place to start, but certification success is less a function of a training course than of hands-on experience with the tools: Hadoop, Spark, Linux, data formats, the ideas of machine learning, and how to apply all of these to a real problem. Our recommendation is to solve a Kaggle competition on a Hadoop cluster to get the experience you need to prepare for certification.

Q2: What are the prerequisites to getting started with data science?

A2: You need a little engineering and a little stats. Pick up at least one language. Python is good, but Java and Scala are great places to start for the Hadoop ecosystem. For a technology to learn, start with Spark. If you want to learn basic theory, check out Stanford’s Machine Learning course.

Q3: Do we offer certification for data science?

A3: Yes, we do! In addition to our introductory data science course, we offer a certification exam, CCP Data Scientist, that follows it. You can think of it as a compressed Kaggle competition: you're given a real problem and a real cluster and asked to produce a solution.

Q4: Are our courses offered just online or in-person?

A4: Our courses are delivered in both formats! Classes are available around the globe, so you can sign up for a public course today no matter where you're located or which format you prefer. If a course isn't available in person, we have regularly scheduled virtual classroom offerings in both North American and European time zones.

Q5: Do you have to take the course in order to take the certification?

A5: No, you are not required to take any course before taking certification. That said, we would like to advise that the CCP Data Scientist exam is very challenging. While the course will prepare you for the certification, you should also study on your own before the exam. This is not a multiple-choice certification: you will interact with a live cluster and produce a solution.

Q6: How is learning data science different from learning another engineering language?

A6: Learning data science can involve learning a new language (e.g. R, Python, Scala), but it mainly involves learning new statistical ideas: training and test sets, evaluation metrics, noise, overfitting, and so on. This is why we try to present a course that addresses both the language background and the theory background.
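For instance, two of those ideas, the train/test split and an evaluation metric, can be sketched in a few lines of plain Python (the dataset here is synthetic, made up purely for illustration):

```python
import random

# Synthetic data for illustration: y = 2*x plus Gaussian noise.
random.seed(42)
data = [(x, 2 * x + random.gauss(0, 1)) for x in range(100)]

# Hold out a test set so the evaluation measures generalization,
# not memorization of the training data.
random.shuffle(data)
train, test = data[:80], data[80:]

# Fit a one-parameter model (a line through the origin) by least squares.
slope = sum(x * y for x, y in train) / sum(x * x for x, y in train)

# Evaluation metric: mean squared error, reported on held-out data.
def mse(pairs, m):
    return sum((y - m * x) ** 2 for x, y in pairs) / len(pairs)

print(f"slope is about {slope:.2f}, test MSE is about {mse(test, slope):.2f}")
```

A model flexible enough to drive training error toward zero while test error climbs is exactly the overfitting mentioned above, which is why the metric is always reported on held-out data.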

Q7: I am from a SAP BI/BW/Analytics background and I'm thinking of moving toward Big Data / Data Science. What do you suggest?

A7: The thing to focus on is most likely the language environment rather than the size of the data or the data itself. You would want to start by getting used to Hadoop and its tools, brushing up on your programming skills, and looking into Spark. A great place to start, once you feel comfortable with Scala (or Python) and the basic Linux command line, is our Developer Training for Spark and Hadoop. You will get an introduction to Apache Spark and how it integrates with the entire Hadoop ecosystem.

Q8: MLlib is pretty nice, but Spark's direction is toward ML (and from RDDs to DataFrame/Dataset). Does the course look toward ML?

A8: Yes, it does to some degree. It covers the basic concepts such as RDDs and MLlib, and it also touches on ML and its integration with DataFrames.

Languages & Tools

Q9: Which language is most suitable for data science: R, Python or Scala?

A9: If you take Hadoop out of the picture, one can argue that Python and R are more suitable because of their rich libraries. In the Hadoop ecosystem, which is fundamentally JVM-based, Scala is the most natural language in which to use Spark. You can use Python and, to some degree, R, but Spark was developed in Scala, so Scala carries no runtime overhead penalty. A Python developer can be comfortable using something like PySpark and can certainly get started there, but if we had to choose one language, it would be Scala.

Q10: In the course structure R comes just before Spark MLlib. But for data evaluation don’t we need R?

A10: No. If you want to evaluate the models of something that Spark puts out, you can do it all within Spark.

Q11: How is Spark different from other massively parallel programming frameworks?

A11: This is a fairly broad question, but we'll highlight a few major features. In two words, Spark is: distributed Scala. It takes functional programming ideas that Scala has popularized, such as operating on immutable datasets to produce other immutable datasets, and translates them to the big data distributed cluster paradigm. Why is that important? If you're operating largely in terms of immutable datasets, the engine can optimize many operations because it knows the data is not changing and can be recomputed, dropped, or cached as needed. The net effect is that Spark can produce a more complicated plan for a large distributed program, optimize it, and even cache intermediate results in memory, making the overall execution quite fast. Spark is really good at trading memory for I/O, which pays off because reading from memory is much faster than disk I/O. To boil all this down: Spark is Scala made for Hadoop, and it gains speed from using memory.
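The immutable-transformation style described above can be illustrated in plain Python (standing in for Scala here, with a made-up three-line "dataset"): each step derives a new collection instead of mutating the previous one, which is the property that lets an engine like Spark freely recompute or cache intermediate results.

```python
from functools import reduce

# An immutable input "dataset" (a tuple). Each transformation below
# produces a new collection; nothing is modified in place.
lines = ("spark is fast", "spark is distributed scala", "hadoop stores data")

words = tuple(w for line in lines for w in line.split())   # like flatMap
hits  = tuple(w for w in words if w == "spark")            # like filter
count = reduce(lambda acc, _: acc + 1, hits, 0)            # like a reduce/count

# The original dataset is untouched, so it could be dropped, cached, or
# recomputed safely -- the property a distributed planner relies on.
print(count)  # 2
```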

Q12: What about leveraging Notebooks to integrate different languages?

A12: Right now it's not hard to add a notebook, like Jupyter, to a Hadoop cluster and put it in front of PySpark or the Scala shell to get a notebook-style interface to Spark.
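As one common example of that setup, PySpark reads two standard environment variables that let you launch its driver through Jupyter instead of the plain REPL (this assumes `pyspark` and `jupyter` are already installed and on your `PATH`):

```shell
# Standard PySpark environment variables: run the driver through Jupyter
# rather than the default Python shell.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

# Launching pyspark now opens a Jupyter notebook with a SparkContext
# already available in each new notebook.
pyspark
```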

Q13: What about SparkR?

A13: There is a subproject that integrates Spark and R to a certain degree, letting you access the Spark framework from R. While it has its uses, SparkR does not ship with CDH today, but we are always looking to ship and support new features as they become more stable.

Q14: For someone who doesn’t know Java or Unix, is it hard to learn Spark and Mahout?

A14: Yes. In general, Hadoop is a developer framework. To understand what’s going on, you would need to have working knowledge of Unix or Linux. If you are looking to start from the basics, start with Java before getting into Hadoop and data science.

Q15: When to use SparkR, MLlib, or PySpark for a data science project?

A15: MLlib through the Scala API is going to be the easiest and the least hassle, since Scala is Spark's native language and that API is the broadest and most up to date. PySpark is usable, but it can be troublesome in the long term; there's a runtime performance penalty and an extra layer of indirection. We simply don't recommend SparkR at this time.

Use Cases

Q16: Why would you use Spark over Apache Flink?

A16: Simply put? Maturity. Spark started a couple of years earlier, has an order of magnitude more contributors and activity, and has more integrations. For example, Cloudera ships and supports only Spark, so you can get production support for it. On a purely technical level, Flink is a great tool. Its streaming model is well regarded, and while it is not that different from Spark's, it certainly has its uses. If we had to choose a framework right now, it would be Spark.

Q17: Do you have examples of “cognitive learning” solutions in the Hadoop ecosystem? This goes beyond machine learning and utilizes graph databases to exploit relationships to achieve insights beyond traditional statistics and machine learning?

A17: There are graph databases, and there are graph processing engines. There aren't any graph databases in the Hadoop ecosystem itself, but there is a graph processing engine within Spark named GraphX. If you're looking to solve graph problems, you can do that with GraphX without requiring a graph database.
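As a toy sketch of the kind of computation a graph processing engine performs (the four-edge graph below is invented for illustration), here is a PageRank-style power iteration in plain Python; GraphX expresses the same idea over a distributed edge list, with no graph database involved.

```python
# A tiny directed graph as an edge list (made-up data).
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
nodes = sorted({n for e in edges for n in e})

# Out-links per node, uniform initial rank.
out_links = {n: [d for s, d in edges if s == n] for n in nodes}
rank = {n: 1.0 / len(nodes) for n in nodes}

# Power iteration with the usual 0.85 damping factor: each node splits
# its rank among its out-links, then ranks are recombined.
for _ in range(20):
    contrib = {n: 0.0 for n in nodes}
    for src, dests in out_links.items():
        for d in dests:
            contrib[d] += rank[src] / len(dests)
    rank = {n: 0.15 / len(nodes) + 0.85 * contrib[n] for n in nodes}

print({n: round(r, 3) for n, r in rank.items()})
```

The relationship-exploiting insight here is exactly what the question asks about: node importance emerges from the link structure, not from per-row statistics.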

Q18: Use cases for data science?

A18: Recommender engines (taught in our course), anomaly detection (spotting things like fraudulent transactions or unusual sensor data from a car), and classification of all kinds (e.g. spam filtering, predicting whether a customer is likely to buy, sentiment analysis through natural language processing, and so on). The use cases and questions you can ask are limited only by the data sets you have access to.
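To make the anomaly-detection case concrete, here is a minimal sketch in plain Python (the sensor readings are invented): flag any value more than two standard deviations from the mean.

```python
import statistics

# Made-up sensor readings containing one obvious outlier.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0, 10.1]

mean = statistics.fmean(readings)
sd = statistics.stdev(readings)

# Flag readings more than 2 standard deviations from the mean.
anomalies = [x for x in readings if abs(x - mean) > 2 * sd]
print(anomalies)  # [25.0]
```

Production systems use far more robust models, but the shape of the problem, learn what "normal" looks like and flag deviations from it, is the same.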

We hope you found this information useful! Start your journey towards data science today!

P.S.: If you have more questions, please feel free to drop us a note in the comments section!

