Data Fundamentals: Setting the Stage for Big Ideas

Data Engineering and Beyond

Ankit Rathi
15 min read · Nov 7, 2024

Hello data enthusiasts,

Welcome to the first post of the Data Engineering and Beyond blog!

We hear a lot of buzz around AI these days, and it’s true — AI is everywhere.

But here’s the key: modern AI is built on a fundamental building block — data.

Before we dive into exciting topics like machine learning and AI, it’s important to start with a solid understanding of data, because without data, none of the AI we talk about would exist.

In this post, we’ll walk through the fundamentals of data.

Here’s what you will learn:

  • We’ll start by understanding what data really is and why it’s so important in our lives and businesses.
  • Data isn’t just numbers and text. It can also be images, videos, or even something as abstract as a pattern. We’ll explore these forms.
  • I’ll explain how raw data becomes meaningful insights and even wisdom, using something called the DIKW Pyramid.
  • We’ll talk about two types of systems: one for daily operations (OLTP) and one for deeper analysis (OLAP).
  • I’ll highlight why having good-quality data is essential and how poor data can lead to bad decisions.
  • Lastly, we’ll take a look at Big Data — huge datasets that open up exciting new opportunities.
DIKW stands for Data, Information, Knowledge, and Wisdom.
OLTP stands for Online Transaction Processing.
OLAP stands for Online Analytical Processing.

Don’t worry about the terms — they’re just ways of handling data, whether it’s for daily operations or for analysis.

By the end of this post, you’ll have a solid grasp of the fundamentals of data, which is essential for understanding the world of AI.

And best of all, we’ll dive into these data fundamentals through our favorite use case — myPizza bakery!

Let’s get started!

Data Fundamentals

We’ll first cover what data is, why it’s important, and how we use it in our daily lives and in business.

What, Why and How of Data

What is data? It’s actually pretty simple — data is just information.

That’s not exactly right, but it will do for now; I’ll explain the difference later.

It’s anything we collect, store, and analyze to make better decisions.

Data can be in many forms: numbers, text, images, or anything that can be recorded.

Let’s take myPizza bakery as an example.

Data at myPizza bakery could be the number of pizzas sold each day, the types of pizzas customers order the most, or the ingredients purchased for baking.

All of this is data.

Now, why is data important?

Data is essential because it helps us make informed decisions.

Imagine if, at myPizza bakery, you tracked which pizza flavors are the most popular.

You could then focus on those flavors and reduce the ones that don’t sell as well, right?

This not only saves you costs but also improves customer satisfaction.

You’re giving the people what they want while cutting back on waste.

And that’s not all.

Data can also help you manage stock — so you’re never running out of dough or cheese — track employee performance, and even plan promotions more effectively.

The truth is, we all use data all the time, sometimes without even realizing it.

For example, when you check the weather before deciding what to wear, you’re using data.

Or when you check your bank balance before making a purchase — that’s data helping you make a decision.

Now, bringing it back to myPizza bakery again, you might use sales data to determine how much dough or cheese you need to prepare for the next day.

By looking at the data, you can ensure you don’t run out of ingredients, but at the same time, you don’t waste too much by over-preparing. That’s data in action!
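
To make this concrete, here is a minimal sketch of that dough estimate. The sales figures, the 250 g of dough per pizza, and the 10% safety buffer are all made-up assumptions for illustration, not real myPizza numbers.

```python
# Hypothetical pizzas sold per day over the last week (made-up numbers).
daily_pizzas_sold = [48, 52, 50, 47, 60, 71, 64]

DOUGH_PER_PIZZA_KG = 0.25   # assumed dough per pizza
SAFETY_BUFFER = 0.10        # assumed 10% extra so we do not run out


def estimate_dough_kg(sales, dough_per_pizza=DOUGH_PER_PIZZA_KG, buffer=SAFETY_BUFFER):
    """Estimate tomorrow's dough from the average of recent daily sales."""
    average_sales = sum(sales) / len(sales)
    return average_sales * dough_per_pizza * (1 + buffer)


print(f"Prepare about {estimate_dough_kg(daily_pizzas_sold):.1f} kg of dough")
```

A plain average is the simplest possible choice here. A real bakery would probably also account for weekdays versus weekends, but the idea is the same: past sales data drives tomorrow’s preparation.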

Now that you understand what data is and how it’s used, the next question is:

If you were handed some data to work with, what form would it be in?

What are the formats of data you’d be dealing with?

That’s what we’re going to dive into next!

Different Data Formats

Now that we’ve talked about the basics of data, let’s explore the different formats of data you might come across.

Structured data is the most organized type.

It’s data that’s neatly arranged in a clear, defined format, often stored in rows and columns, like in a table.

This makes it really easy to search, analyze, and understand because it follows a consistent pattern.

Let’s go back to myPizza bakery for an example.

Imagine you have a spreadsheet that lists your daily sales.

It includes items like the pizza name, the number of pizzas sold, and the price.

That’s structured data.

It’s nice and tidy, making it super simple to work with.

On the other hand, we have unstructured data.

This type of data doesn’t have a set format, which makes it much harder to organize and analyze.

It includes things like text, images, videos, or even customer reviews.

The information is more freeform, and it doesn’t fit neatly into rows and columns.

For example, if customers are leaving feedback or reviews about your bakery on social media or in emails, that’s unstructured data.

A customer might write:

The Margherita pizza was delicious, but the crust was a bit too chewy.

Now, while this feedback is valuable, it’s not easy to analyze like a spreadsheet would be.

Then there’s semi-structured data, which falls somewhere between structured and unstructured.

It has some organization — like tags or labels — but it’s not as neatly organized as structured data.

A good example of semi-structured data would be an email from a supplier.

It might list ingredients and prices, but the format is informal.

It could look something like this:

Ingredients Ordered:
- Cheese: 100 kg @ $5/kg
- Flour: 200 kg @ $2/kg

There’s some structure here, but it’s not as clean as a spreadsheet. It’s a mix.
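
Because there is some structure, a little code can often turn semi-structured data into structured rows. The sketch below parses the supplier email lines shown above. The regular expression assumes every line follows the same "item: quantity kg @ $price/kg" shape, which a real supplier email may not, and the field names are my own choice.

```python
import re

# The supplier email snippet from above, as plain text.
email_body = """Ingredients Ordered:
- Cheese: 100 kg @ $5/kg
- Flour: 200 kg @ $2/kg"""

# Assumed line shape: "- <item>: <quantity> kg @ $<price>/kg"
LINE_PATTERN = re.compile(r"-\s*(\w+):\s*(\d+)\s*kg\s*@\s*\$(\d+(?:\.\d+)?)/kg")


def parse_order_lines(text):
    """Turn the loosely formatted lines into structured rows."""
    rows = []
    for item, quantity, price in LINE_PATTERN.findall(text):
        rows.append({
            "item": item,
            "quantity_kg": int(quantity),
            "price_per_kg": float(price),
        })
    return rows


for row in parse_order_lines(email_body):
    print(row)
```

Each resulting row has the same fields, so it could now sit in a spreadsheet or database table just like the daily sales example.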

Now, a lot of people get confused between data, information, knowledge, and wisdom.

These terms often get thrown around, but they’re not the same thing.

To clear up the confusion, let’s take a look at something called the DIKW Pyramid, which explains the progression from data to wisdom.

DIKW Pyramid

Let’s now talk about the DIKW Pyramid, which helps us understand how we move from simple facts to making wise decisions.

DIKW stands for Data, Information, Knowledge, and Wisdom, and I’ll explain each step using our myPizza bakery example.

First, we have data. Data is just raw facts or figures without any context. On its own, it doesn’t tell you much.

For example, if we know that 50 Margherita pizzas were sold each day at our myPizza bakery, that’s just a number.

It’s data, but it doesn’t explain much by itself.

When we organize that data and give it some context, it becomes information.

Information answers the question, “What is happening?”

So, if we take that sales data and organize it, we might learn that Margherita pizzas are the best-selling item at the bakery.

Now, we’ve turned raw data into something that actually tells us what’s going on in our business.

The next step is knowledge. This comes when we analyze the information and start seeing patterns or relationships.

Knowledge helps us understand, “Why is this happening?”

For example, after analyzing the sales data, we might realize that Margherita pizzas sell the most because they’re cheaper than other options.

Now we’re starting to see the reasons behind what’s happening, and we understand the bigger picture.

Finally, we reach wisdom, which is about using that knowledge to make smart decisions.

Wisdom answers, “What should I do next?”

In our case, based on the knowledge that Margherita pizzas are the most popular and affordable, we could decide to stock extra ingredients for Margherita pizzas on Fridays — because that’s when sales tend to be the highest.

Wisdom helps us make better choices for the business based on what we’ve learned.

So, that’s the DIKW Pyramid.

It’s a progression from data, to information, to knowledge, and finally, to wisdom, which helps you make better decisions.
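
The four levels can be sketched in a few lines of code. The order log and prices below are hypothetical, and the "knowledge" step is deliberately simplistic: it only checks whether the best seller is also the cheapest pizza, which hints at a reason but does not prove it.

```python
from collections import Counter

# DATA: raw facts with no context (hypothetical order log: pizza name, price).
orders = [
    ("Margherita", 8.0), ("Pepperoni", 11.0), ("Margherita", 8.0),
    ("Veggie", 10.0), ("Margherita", 8.0), ("Pepperoni", 11.0),
]

# INFORMATION: organize the data to see what is happening.
sales_by_pizza = Counter(name for name, _ in orders)
best_seller, best_count = sales_by_pizza.most_common(1)[0]

# KNOWLEDGE: look for a possible reason, here by comparing prices.
prices = dict(orders)  # pizza name -> price
cheapest = min(prices, key=prices.get)
best_seller_is_cheapest = best_seller == cheapest

# WISDOM: turn the finding into a decision (a simple, assumed rule).
if best_seller_is_cheapest:
    decision = f"Stock extra ingredients for {best_seller}"
else:
    decision = "Review pricing before changing stock levels"

print(best_seller, best_count, decision)
```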

Now, while understanding the DIKW Pyramid is great, you’re probably wondering, how do businesses move from data to information systematically?

This is where two important systems come in:

OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing).

These two types of systems serve different purposes and are used in different ways.

Let’s break down what they are and how they work, especially in the context of myPizza bakery.

OLTP vs OLAP

Let’s talk about two important systems that businesses use to manage and process data: OLTP and OLAP systems.

First, we have OLTP systems — these handle the day-to-day operations.

OLTP systems are designed to process many small transactions quickly, and they deal with real-time data.

Their job is to insert, update, or delete records right away.

In the case of myPizza bakery, every time a customer places an order for a pizza, the OLTP system is at work.

It instantly records the order details — like the type of pizza, the price, and the quantity.

The system ensures this data is up-to-date and accurate at all times.

The key features of OLTP systems are fast, real-time processing, handling multiple small transactions, and focusing on recording data efficiently.

Next, we have OLAP systems. These are designed for analyzing large amounts of historical data.

Instead of handling transactions, OLAP focuses on answering complex queries, generating reports, and helping with decision-making.

So, while your OLTP system is recording daily pizza orders, the OLAP system can help you analyze broader trends — like how sales have changed over the past year.

It answers deeper questions, such as:

  • “Which pizza is most popular during the winter months?”
  • “Which day of the week brings in the highest sales?”

OLAP helps you make important business decisions by identifying trends and patterns over time.

The key features of OLAP systems include data analysis and reporting, working with large amounts of data, and focusing on summarizing and analyzing data.
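
Here is a small sketch of the two workloads using Python’s built-in sqlite3 module. In practice OLTP and OLAP usually run on separate systems; a single in-memory SQLite database stands in for both here, and the table layout and order rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE orders (order_date TEXT, pizza TEXT, quantity INTEGER, price REAL)"
)

# OLTP-style work: small writes that record individual orders.
new_orders = [
    ("2024-01-05", "Margherita", 2, 8.0),
    ("2024-01-05", "Pepperoni", 1, 11.0),
    ("2024-01-06", "Margherita", 1, 8.0),
    ("2024-07-12", "Veggie", 3, 10.0),
]
with conn:  # commits the inserts as one transaction
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", new_orders)

# OLAP-style work: one read that summarizes many rows.
winter_sales = conn.execute(
    """
    SELECT pizza, SUM(quantity) AS total_sold
    FROM orders
    WHERE strftime('%m', order_date) IN ('12', '01', '02')
    GROUP BY pizza
    ORDER BY total_sold DESC
    """
).fetchall()

print(winter_sales)
```

The inserts touch one row each and need to be fast, while the query scans and aggregates the whole table to answer the "most popular in winter" question.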

Now that we know how businesses make use of transactional and analytical processing, there’s an important point we need to consider:

What if the quality of the data being processed by these systems is compromised?

If the data isn’t accurate or reliable, it’s a classic case of “garbage in, garbage out.”

Let’s dive into why data quality is so critical and what impact it can have on businesses.

Data Quality

Alright, now let’s talk about data quality.

Data quality refers to how accurate, complete, consistent, and reliable the data is.

The better the quality of your data, the better the decisions you can make.

On the other hand, poor-quality data can lead to all kinds of mistakes.

Let’s think about myPizza bakery.

If your customer orders were recorded incorrectly — maybe the wrong pizza names or quantities were entered — that would cause confusion.

You wouldn’t know which pizzas were actually popular, and this would hurt your business.

So, maintaining good data quality ensures you have the right information to run your business smoothly.

It allows you to make informed decisions.

For example, if you have accurate sales data, you can confidently decide how much dough or cheese to order, or which pizzas to promote.

But if your data is incorrect, you might overstock ingredients you don’t need, or run out of your most popular pizzas, which would affect both sales and customer satisfaction.

Let’s break down the six main elements of data quality:

1. Accuracy: This means that the data you’re working with is correct.

For example, if 50 Margherita pizzas were sold, that needs to be recorded accurately.

If you mix that up and record 30 instead of 50, your orders will be off.

So, double-check customer orders and your inventory.

2. Completeness: All the necessary data should be collected.

Missing information can cause problems.

For instance, every pizza order should include details like the type, size, and quantity.

If you miss out on any of these, you could have confusion in preparing the order.

3. Consistency: The data should be consistent across different systems.

For example, the sales data in your bakery’s register should match what’s recorded in the accounting system.

If the systems don’t match, you might have a hard time trusting the numbers and figuring out what’s really going on in the business.

4. Timeliness: Your data should be up-to-date.

This ensures that decisions are made based on the latest information.

At myPizza bakery, regularly updating your inventory helps you avoid running out of ingredients or overstocking items you don’t need.

5. Integrity: This means that the data remains accurate and reliable throughout its entire lifecycle.

For instance, at myPizza bakery, your sales data should be properly backed up, and customer information should be stored securely.

If a system error causes you to lose important order details, it could lead to financial losses or unhappy customers.

6. Validity: Data should follow the defined rules and meet expectations.

It should fall within a proper range or format.

For example, when entering a pizza order, the size should be one of the valid options: “small,” “medium,” or “large.”

If someone accidentally enters “extra-large” when that isn’t on the menu, it could cause problems during preparation and affect your data’s accuracy.
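
Some of these checks are easy to automate. The sketch below covers only completeness and validity for a single order; the field names and the rule that quantity must be a positive whole number are assumptions made for this example.

```python
VALID_SIZES = {"small", "medium", "large"}          # sizes named in the text
REQUIRED_FIELDS = ("pizza", "size", "quantity")     # assumed required fields


def check_order(order):
    """Return a list of data quality problems found in one order."""
    problems = []

    # Completeness: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if order.get(field) in (None, ""):
            problems.append(f"missing {field}")

    # Validity: values that are present must follow the defined rules.
    size = order.get("size")
    if size not in (None, "") and size not in VALID_SIZES:
        problems.append(f"invalid size: {size}")
    quantity = order.get("quantity")
    if quantity not in (None, "") and (not isinstance(quantity, int) or quantity < 1):
        problems.append(f"invalid quantity: {quantity}")

    return problems


print(check_order({"pizza": "Margherita", "size": "medium", "quantity": 2}))
print(check_order({"pizza": "Margherita", "size": "extra-large"}))
```

An empty list means the order passed both checks. Accuracy, consistency, timeliness, and integrity are harder to verify from a single record, because they depend on comparing against the real world or against other systems.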

Now that we understand the importance of data quality, we need to dive into the data lifecycle.

This will help us understand how data moves from the moment it’s created all the way to when it’s no longer needed and disposed of.

Knowing the data lifecycle ensures that data remains useful, secure, and reliable at every stage.

This is key for anyone using data to make decisions or run a business efficiently.

Data Lifecycle

Now let’s talk about the data lifecycle and how it applies to a business like myPizza bakery.

The data lifecycle refers to the different stages data goes through, from the moment it’s created until it’s no longer needed.

Let’s break it down step by step.

1. Data Creation/Collection: This is the first stage, where data is generated or collected.

Think about myPizza bakery.

Data is created when customers place their orders, when sales are recorded, or when inventory is tracked.

All these activities generate data that you can later use for business decisions.

2. Data Storage: Once you’ve collected data, it needs to be stored somewhere safely for future use.

For example, at myPizza bakery, you’d store order details, sales figures, or customer information in a database or cloud storage.

This makes sure the data is there when you need it.

3. Data Processing: After the data is stored, it needs to be processed to make it more useful.

This stage could involve organizing or cleaning up the data.

At myPizza bakery, you might organize customer orders by date or sort your inventory data to figure out what ingredients need to be restocked.

Processing the data helps turn raw information into something more meaningful.

4. Data Analysis: Once the data is processed, the next step is analyzing it to get insights.

For instance, you might analyze the data at myPizza bakery to see which pizzas sell the most on weekends or what times of day are busiest.

Analyzing the data helps you make better business decisions based on patterns or trends you discover.

5. Data Sharing: After the data has been analyzed, the results often need to be shared with others.

In the case of myPizza bakery, you might share the analysis with the management team, who can then use that information to plan promotions or make changes in operations.

Sharing data ensures the right people have access to the insights that can improve the business.

6. Data Archiving: Not all data is needed all the time.

When data is no longer actively used but still valuable, it can be archived.

At myPizza bakery, you might archive last year’s sales records or customer feedback.

This frees up space for current data but keeps the old data accessible for future reference.

7. Data Purging: The final stage is purging data that is no longer needed.

This means safely deleting or destroying outdated data to protect privacy.

For example, myPizza bakery might delete old customer information to comply with data privacy regulations.

This ensures you’re only keeping the data that’s still useful and necessary.

Each stage of the data lifecycle is important for effectively managing data.

It helps ensure that myPizza bakery can use its data to improve operations, make smart decisions, and keep everything secure and organized.
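
The last two stages, archiving and purging, usually come down to a rule based on a record’s age. Here is a minimal sketch of such a rule. The one-year and three-year thresholds are assumptions picked for illustration; real retention periods depend on the business and on the regulations that apply to it.

```python
from datetime import date, timedelta

ARCHIVE_AFTER = timedelta(days=365)      # assumed: archive after one year
PURGE_AFTER = timedelta(days=365 * 3)    # assumed: purge after three years


def lifecycle_stage(record_date, today):
    """Decide whether a record stays active, is archived, or is purged."""
    age = today - record_date
    if age > PURGE_AFTER:
        return "purge"
    if age > ARCHIVE_AFTER:
        return "archive"
    return "active"


today = date(2024, 11, 7)
# Hypothetical record IDs and creation dates.
records = {
    "order-1001": date(2024, 10, 1),
    "order-0420": date(2023, 3, 15),
    "order-0007": date(2020, 6, 30),
}
for record_id, created in records.items():
    print(record_id, lifecycle_stage(created, today))
```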

Now that we understand how the data lifecycle works, let’s talk about something bigger — big data.

As businesses grow, so does the amount of data they collect, which brings both challenges and opportunities.

Big data helps us make sense of these large, complex datasets, and it can be a game-changer when it comes to making better, more informed decisions.

Big Data

Ok, let’s dive into Big Data and what it means for businesses, like myPizza bakery.

Big Data refers to huge, complex sets of data that traditional systems can’t easily handle.

Think about how much data is being generated today from the internet, social media, and all kinds of connected devices.

It’s a lot!

This data holds valuable insights, but because of its size and complexity, we need special tools and techniques to manage and analyze it.

To help us better understand Big Data, we can break it down into four key characteristics, known as the 4 Vs.

1. Volume: The first V is Volume, which refers to the sheer amount of data.

Big Data involves massive volumes of information, way more than traditional systems can handle efficiently.

For example, at myPizza bakery, we collect tons of data: online orders, customer reviews, sales at multiple locations, inventory records, and employee schedules.

With so much data coming in, we need special storage and analysis systems to handle it properly.

2. Velocity: Next up is Velocity — this is the speed at which data is generated and needs to be processed.

Big Data often comes in fast, and businesses need to process it in real-time or near real-time.

At myPizza bakery, when a customer places an order online, the system needs to process that order instantly so the kitchen can start preparing it.

The same goes for customer feedback coming in from social media or delivery apps — everything has to be processed quickly to keep up with the fast pace of business and ensure customers are happy.

3. Variety: The third V is Variety, which refers to the different types of data Big Data includes.

You have structured data, like numbers, dates, and online transactions.

But you also have unstructured data, like customer reviews, videos, social media posts, or images.

At myPizza bakery, we get structured data from our online transactions — like customer names, addresses, and order amounts.

But we also get unstructured data, like photos of pizzas shared on social media, customer reviews, or emails from suppliers.

Managing both structured and unstructured data can be a challenge, but it’s necessary if you want a full picture of the business.

4. Veracity: Finally, we have Veracity, which refers to the accuracy or trustworthiness of the data.

Not all data is reliable, and poor-quality data can lead to bad decisions.

At myPizza bakery, some customer reviews might be biased, or transaction data could have errors — like a customer accidentally entering the wrong delivery address.

We need to ensure that the data we use for decision-making is accurate and trustworthy.

This means cleaning up the data, verifying its accuracy, and making sure we’re basing our decisions on solid information.

So, those are the 4 Vs of Big Data — Volume, Velocity, Variety, and Veracity.

Big Data is all about handling large, fast-moving, and varied data sets while making sure the information is reliable and useful for the business.

Alright, let’s wrap things up!

We’ve covered a lot on the core aspects of data, moving from the basics to more advanced topics.

Here’s a quick recap of what we discussed:

  • We started with understanding what data is and why it’s so important.
  • We looked at the different formats data can take, from numbers and text to images.
  • We explored the DIKW Pyramid, which shows how data transforms into insights and wisdom.
  • We talked about OLTP and OLAP systems — how businesses use data for daily operations versus analysis.
  • We emphasized the importance of data quality, because better data leads to better decisions.
  • We walked through the data lifecycle, from creation and storage to archiving and purging.
  • Finally, we took a peek into Big Data, where massive datasets open up new possibilities.

Having a strong foundation in data is essential in today’s world.

It helps you make smarter decisions and improve efficiency, whatever you’re working on.

Think of it as your toolkit for navigating a data-driven world.

Whether you’re just starting or looking to deepen your understanding, you’re now better equipped to tackle the challenges and opportunities that come with working with data.

With all this knowledge under your belt, you’re ready to move forward confidently in this data-driven era.

Stay tuned for the upcoming posts in this series!
