Big Data Processing in Cloud Computing

What is Big Data?

Big Data is the practice of analyzing large amounts of data that may arrive very fast and in different formats. We’ve actually had Big Data for a long time in the form of data warehouses; we’ve been doing that for a couple of decades. In a modern architecture, you have a lot of tools in your arsenal, such as Hadoop and Spark. Those tools are what really enable us to analyze data arriving at gigabytes or terabytes per day, in different formats and from different data sources.

How does a Big Data Architect interface and provide value for data scientists?

The data scientist needs that data available in real time in dashboards, BI tools, notebooks and so forth. So the Big Data architect will architect their notebook spaces and their pipelines so they get that data as fast as they need it, in a way that’s reliable and useful for them.

Why is architecting for perfection with Big Data important?

It’s important because these days you have dozens, maybe even hundreds, of different data sources, and that number just keeps growing for businesses. They get data from many places: third-party market research, relational databases, surveys, even manual entry. All of that needs to get into a central Data Lake so data scientists can access it and start extracting business insights to make their organizations more data-driven.
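To make the idea of funneling different sources into one place more concrete, here is a minimal sketch in Python. It assumes two hypothetical sources, a survey export in CSV and a CRM export in JSON Lines, and normalizes both into JSON Lines records tagged with their origin. The source names, field names and helper functions are all made up for illustration; a real pipeline would more likely use a tool like Spark and a columnar format.

```python
import csv
import io
import json


def from_csv(text, source):
    """Parse CSV text into dict records tagged with the source name."""
    return [{"source": source, **row} for row in csv.DictReader(io.StringIO(text))]


def from_json_lines(text, source):
    """Parse JSON Lines text into dict records tagged with the source name."""
    return [
        {"source": source, **json.loads(line)}
        for line in text.splitlines()
        if line.strip()
    ]


def to_lake_format(records):
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


# Hypothetical sample data standing in for two very different sources.
survey_csv = "respondent,score\nalice,4\nbob,5\n"
crm_json = '{"customer": "alice", "plan": "pro"}\n'

records = from_csv(survey_csv, "survey") + from_json_lines(crm_json, "crm")
lake_blob = to_lake_format(records)
print(lake_blob)
```

Tagging every record with its source is a small design choice that pays off later: once everything sits in one Data Lake, data scientists can still filter or join by where the data came from.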

How has cloud computing affected Big Data processing, especially for small companies that now have access to tools that would have cost millions of dollars even just five years ago?

For small to medium enterprises, the cloud brings a lot of capabilities that were not accessible to them before. For example, with Amazon S3 you have virtually unlimited storage with very high durability, allowing you to easily create a Data Lake. Normally, storing and processing gigabytes and terabytes of data would have cost you thousands of dollars to acquire and set up the hardware. Now you can have a Data Lake at your fingertips in a matter of seconds or minutes. The same goes for Big Data processing through Hadoop and Spark: on-premises you need to buy and set up commodity hardware, whereas in the cloud you can spin up a Hadoop or Spark cluster in minutes.
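As a rough illustration of how a Data Lake on object storage is often organized, the sketch below builds partitioned object keys of the form `source=.../dt=...`, a directory convention that Spark can discover as partition columns. The `lake_key` helper, the `raw` prefix and the file name are assumptions for this example, not something prescribed by S3. Uploading is left out; with an S3 client such as boto3 it would be a call like `put_object` with this key.

```python
from datetime import date


def lake_key(source, day, filename, prefix="raw"):
    """Build a partitioned object key for a data lake (hypothetical layout)."""
    return f"{prefix}/source={source}/dt={day.isoformat()}/{filename}"


# Hypothetical example: one day's survey data.
key = lake_key("survey", date(2024, 1, 15), "part-0000.jsonl")
print(key)  # raw/source=survey/dt=2024-01-15/part-0000.jsonl
```

Partitioning by source and date keeps each day's data under its own prefix, so a Hadoop or Spark job that only needs one source or one date range can skip everything else.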