Understanding Data Platform Projects – A Primer for Project & Program Managers
As businesses evolve into data-driven organizations, data platform projects are becoming increasingly common. For PMs/PgMs who haven't worked in this space before, here's a quick primer on the key concepts, components, and terminology.
What is a Data Project?
A data project focuses on collecting, processing, storing, and delivering data to support decision-making, analytics, and product features. It’s not about building user-facing apps—it’s about enabling data flows, quality, insights, and governance.
Examples include:
Building a centralized data warehouse
Creating a customer 360° view
Enabling real-time analytics or dashboards
Developing a machine learning pipeline
What is Data Engineering?
Data Engineering is the backbone of any data platform project. It involves:
Ingesting data from multiple sources (APIs, databases, files, etc.)
Cleaning and transforming the data (ETL/ELT)
Moving it into storage systems (like data lakes or warehouses)
Making it available for consumption by analysts, data scientists, or other systems
Think of data engineers as the plumbers of the data world—making sure data flows efficiently, reliably, and securely.
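To make the ingest-clean-store-serve flow concrete, here is a minimal, hedged sketch of an ETL pipeline in Python. The CSV input, table name, and cleaning rules are all illustrative, not from any real system; SQLite stands in for a warehouse.

```python
import csv
import io
import sqlite3

# Illustrative raw source: messy sign-up data with a duplicate and a missing amount.
RAW_CSV = """email,country,amount
ALICE@EXAMPLE.COM,us,100
bob@example.com,US,
alice@example.com,us,100
"""

def extract(raw: str) -> list[dict]:
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows: list[dict]) -> list[dict]:
    """Transform: normalize case, drop incomplete rows, deduplicate."""
    seen, cleaned = set(), []
    for r in rows:
        email = r["email"].strip().lower()
        if not r["amount"] or email in seen:
            continue  # skip records with no amount, or already-seen emails
        seen.add(email)
        cleaned.append({"email": email,
                        "country": r["country"].upper(),
                        "amount": float(r["amount"])})
    return cleaned

def load(rows: list[dict], conn: sqlite3.Connection) -> None:
    """Load: write cleaned rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS signups (email TEXT, country TEXT, amount REAL)")
    conn.executemany("INSERT INTO signups VALUES (:email, :country, :amount)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
count = conn.execute("SELECT COUNT(*) FROM signups").fetchone()[0]
print(count)  # 1 — bob is dropped (no amount), the second alice row is deduplicated
```

Real pipelines swap in tools like Spark or dbt, but the extract → transform → load shape is the same.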
What is a Data Product?
A Data Product is a curated, trustworthy, and reusable dataset or insight that serves a specific business need.
Examples:
A customer segmentation dataset for marketing
A sales performance dashboard
A recommendation engine input dataset
Data products are owned, versioned, and maintained like software products.
Key Layers of a Data Product / Platform
Data Ingestion Layer – Pulls data from various sources (CRM, ERP, logs, APIs, etc.)
Data Storage Layer – Stores raw and processed data (e.g., Data Lakes, Data Warehouses)
Data Transformation Layer – Cleans, joins, filters, and reshapes the data using pipelines (ETL/ELT)
Semantic/Business Logic Layer – Defines KPIs, metrics, and business rules (used by BI tools)
Consumption Layer – Dashboards, APIs, machine learning models, or data apps that use the processed data
Data Governance & Security Layer – Ensures compliance, data quality, lineage, access controls, and auditability
What is Data Governance?
Data Governance ensures that data is:
Accurate
Secure
Compliant with policies (e.g., GDPR, HIPAA)
Well-documented and easily discoverable
Key aspects include:
Data Catalogs (e.g., Alation, Collibra)
Data Lineage (track origin and changes)
Access Control & Policies
Quality Rules & Monitoring
Data Product Details
A data product is a high-quality, reusable dataset or data service that delivers value to end-users—such as analysts, business teams, or downstream systems—just like a software product.
It is not just raw data; it’s data that is:
Curated
Governed
Reliable
Purpose-built
Discoverable & usable
Think of it as the final output of your data platform pipelines that supports business decision-making or operational processes.
Examples of Data Products
Key Characteristics of a Data Product
A true data product follows these principles:
Where Do Data Products Fit in the Architecture?
Data products usually live in the Gold layer of the Medallion Architecture, and are exposed via:
BI dashboards
Data marts
APIs or data services
ML pipelines
Data marketplace/catalogs
Why Data Products Matter
Shift from “data as a byproduct” to “data as a product”
Promotes accountability and trust in data
Enables self-service analytics
Reduces dependency on IT and data engineering
Improves data reusability
One product, multiple consumers (dashboards, ML models, reports)
Scales with business
New domains or teams can plug into existing data products instead of building from scratch
Common Pitfalls When Delivering Data Products
Stages of Data: Bronze, Silver & Gold Explained for Project Managers
In modern data platforms, especially those following medallion architecture (popular in Delta Lake and Lakehouse models), data is processed and organized into three core layers: Bronze, Silver, and Gold.
These stages reflect the level of refinement, trust, and usability of data as it moves through the platform.
Bronze Layer – Raw / Ingested Data
What it is:
This layer contains raw, unprocessed data ingested from various sources like databases, APIs, flat files, logs, and streaming platforms.
Key Characteristics:
Data is stored as-is, with minimal or no transformation
May include duplicates, nulls, or inconsistent formats
Primarily used for auditing, backup, and replay purposes
Useful for data exploration and lineage tracking
Purpose: Act as the source of truth, untouched, useful for audit and reprocessing
Characteristics:
Schema-on-read (flexible)
Minimal validation or transformation
Often large and semi-structured (JSON, CSV)
Example:
Customer sign-up logs from the website in their original format, with all columns including noise or junk data
Project Implication:
Ensure scalable and secure ingestion pipelines with metadata tracking
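A hedged sketch of what a bronze-layer landing step can look like: each payload is stored exactly as received, with only ingestion metadata (source name, load timestamp) added so the record can be audited or replayed later. The table and source names are illustrative.

```python
import json
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE bronze_events (
    source TEXT, ingested_at TEXT, payload TEXT)""")

def ingest_raw(source: str, record: dict) -> None:
    # Schema-on-read: store the record untouched, with no cleaning or validation.
    conn.execute("INSERT INTO bronze_events VALUES (?, ?, ?)",
                 (source, datetime.now(timezone.utc).isoformat(),
                  json.dumps(record)))

# Even a malformed-looking record is kept as-is at the bronze stage.
ingest_raw("web_signups", {"email": "ALICE@EXAMPLE..COM", "plan": None})
ingest_raw("web_signups", {"email": "bob@example.com", "plan": "pro"})

rows = conn.execute("SELECT source, payload FROM bronze_events").fetchall()
print(len(rows))  # 2 — nothing was filtered at this stage
```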
Silver Layer – Cleansed / Structured Data
What it is:
This layer consists of cleaned and standardized data, typically after applying transformation rules such as joins, filters, deduplication, and type casting.
Key Characteristics:
Data quality improves (nulls removed, data types aligned)
Applied business logic begins (e.g., mapping country codes to names)
Used by analysts and data scientists for deeper exploration
Purpose: Create a trusted, query-ready foundation for analysis and modeling
Characteristics:
Applied data quality rules
Standardized schema and formats
Joins across tables, null handling, type casting
Example:
Cleaned customer data with valid email addresses, duplicate accounts removed, and all timestamps standardized
Project Implication:
Coordinate closely with business/data SMEs to define transformation rules; implement data quality checks
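The cleansing example above (valid emails, duplicates removed, timestamps standardized) can be sketched as a few silver-layer rules. The input records, email regex, and date formats are illustrative assumptions.

```python
import re
from datetime import datetime, timezone

EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")

raw = [
    {"email": "alice@example.com", "signed_up": "2024-03-01 10:00:00"},
    {"email": "alice@example.com", "signed_up": "2024-03-01 10:00:00"},  # duplicate account
    {"email": "not-an-email",      "signed_up": "2024-03-02 09:30:00"},  # invalid address
    {"email": "bob@example.com",   "signed_up": "02/03/2024 09:30"},     # different source format
]

def parse_ts(value: str) -> str:
    """Try each known source format; emit one standardized UTC ISO-8601 string."""
    for fmt in ("%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M"):
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc).isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp: {value}")

seen, silver = set(), []
for rec in raw:
    email = rec["email"].lower()
    if email in seen or not EMAIL_RE.match(email):
        continue  # drop duplicates and invalid emails
    seen.add(email)
    silver.append({"email": email, "signed_up": parse_ts(rec["signed_up"])})

print(len(silver))  # 2 clean records: alice (deduplicated) and bob (timestamp reformatted)
```

The transformation rules themselves (what counts as "valid", which formats to accept) are exactly what the business/data SMEs need to sign off on.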
Gold Layer – Curated / Business-Ready Data
What it is:
This is the final, most refined layer, containing aggregated and domain-specific data products. It supports business intelligence, analytics, and ML models.
Key Characteristics:
Tailored to specific business use cases (sales, marketing, operations)
High trust, high usability datasets
Often consumed via dashboards, reports, and APIs
Purpose: Serve business users, dashboards, and downstream systems
Characteristics:
Highly reliable and fast for querying
Often used in BI tools and ML models
Built with stakeholder-defined KPIs
Example:
Monthly sales revenue by product category and region, enriched with customer segmentation
Project Implication:
Align gold-layer design with business stakeholders; measure adoption and value of the data products
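A hedged sketch of a gold-layer build, following the "monthly revenue by product category and region" example above: the gold table is a stakeholder-facing aggregate derived from cleansed silver rows. Schema and values are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE silver_orders (month TEXT, category TEXT, region TEXT, revenue REAL)")
conn.executemany("INSERT INTO silver_orders VALUES (?, ?, ?, ?)", [
    ("2024-03", "electronics", "EMEA", 1200.0),
    ("2024-03", "electronics", "EMEA", 800.0),
    ("2024-03", "apparel",     "APAC", 500.0),
])

# The gold table holds the stakeholder-defined aggregate, rebuilt from silver.
conn.execute("""CREATE TABLE gold_monthly_revenue AS
    SELECT month, category, region, SUM(revenue) AS revenue
    FROM silver_orders
    GROUP BY month, category, region""")

rows = conn.execute(
    "SELECT category, revenue FROM gold_monthly_revenue ORDER BY revenue DESC").fetchall()
print(rows)  # [('electronics', 2000.0), ('apparel', 500.0)]
```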
How Medallion Architecture Supports Data Projects
As a PM, using this architecture allows you to:
Phase the delivery: You can deliver Bronze/Silver early and Gold iteratively.
Isolate issues: Data quality problems can be fixed at the Silver layer without touching raw data.
Standardize pipelines: Create reusable ETL patterns across use cases.
Improve stakeholder confidence: Gold datasets are always vetted and production-grade.
OLTP vs. OLAP – Key Concepts for Data Platform PMs
When managing data platform projects, it's important to understand the distinction between OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) systems. These two serve very different purposes and require different data architectures, processing strategies, and expectations.
What is OLTP? (Online Transaction Processing)
OLTP systems are designed to handle day-to-day business transactions quickly and reliably. These are the systems your business operations depend on.
Examples:
Banking applications
E-commerce checkout systems
Inventory management systems
CRM applications
Key Characteristics:
Handles a large number of short, atomic transactions
Supports real-time inserts, updates, and deletes
Data is highly normalized (to reduce redundancy)
Prioritizes speed and consistency
Technology Examples:
MySQL, PostgreSQL, Oracle DB, SQL Server
Project Implications for PMs:
OLTP systems are data sources in data projects
Data ingestion pipelines must extract data without affecting live operations
Often require CDC (Change Data Capture) mechanisms for real-time sync
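One lightweight change-capture pattern, sketched here under illustrative schema assumptions, is incremental extraction with an `updated_at` watermark: each sync reads only rows modified since the previous run, rather than re-scanning the live OLTP table. (Log-based CDC tools work differently; this is the simplest variant.)

```python
import sqlite3

# Stand-in for a live OLTP table with a modification timestamp.
oltp = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE customers (id INTEGER, name TEXT, updated_at TEXT)")
oltp.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    (1, "Alice", "2024-03-01T10:00:00"),
    (2, "Bob",   "2024-03-05T12:00:00"),
    (3, "Cara",  "2024-03-09T08:00:00"),
])

def extract_changes(conn, last_watermark: str):
    """Return rows modified after the previous sync, plus the advanced watermark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers WHERE updated_at > ? "
        "ORDER BY updated_at", (last_watermark,)).fetchall()
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

changed, watermark = extract_changes(oltp, "2024-03-04T00:00:00")
print(len(changed), watermark)  # 2 changed rows (Bob, Cara); watermark advances
```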
What is OLAP? (Online Analytical Processing)
OLAP systems are designed for data analysis, reporting, and decision support. These systems process aggregated and historical data, often derived from OLTP systems.
Examples:
Sales performance dashboards
Customer lifetime value analysis
Trend forecasting
Key Characteristics:
Handles complex queries on large datasets
Data is often denormalized for fast querying
Supports slice-and-dice, drill-down, roll-up analysis
Optimized for read-heavy workloads
Technology Examples:
Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse, Apache Druid
Project Implications for PMs:
OLAP is often the end product of your data pipelines
Business users rely on OLAP systems for insights and reporting
Performance and latency must be tuned for query efficiency, not write speed
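The contrast between the two workload shapes can be illustrated in one in-memory database: OLTP issues many short, row-level writes against normalized tables, while OLAP runs a single wide aggregate across the whole dataset. The schemas here are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
""")

# OLTP-style: short atomic transactions, one row at a time.
with conn:
    conn.execute("INSERT INTO customers VALUES (1, 'EMEA')")
    conn.execute("INSERT INTO orders VALUES (10, 1, 40.0)")
with conn:
    conn.execute("INSERT INTO customers VALUES (2, 'APAC')")
    conn.execute("INSERT INTO orders VALUES (11, 2, 60.0)")
    conn.execute("INSERT INTO orders VALUES (12, 2, 25.0)")

# OLAP-style: one read-heavy aggregate joining across the whole dataset.
result = conn.execute("""
    SELECT c.region, SUM(o.amount) AS revenue
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.region ORDER BY revenue DESC
""").fetchall()
print(result)  # [('APAC', 85.0), ('EMEA', 40.0)]
```

In production these run on separate systems precisely so the aggregate query cannot contend with the transactional inserts.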
Architecture of a Modern Data Platform – A PM's Guide
A data platform architecture is the foundation that supports the ingestion, processing, storage, and consumption of data across the organization. It brings together various tools, layers, and design principles to deliver trusted, scalable, and usable data products.
Here's a high-level breakdown of a modern data architecture:
1. Data Sources (Upstream Systems)
These are the origin points of your data. Examples include:
OLTP systems (CRM, ERP, POS, web apps)
Third-party APIs (weather, market data)
Logs, IoT streams, spreadsheets
Flat files (CSV, Excel), SFTP sources
PM Tip:
Define source systems early and plan for secure access + refresh cadence (real-time, hourly, daily, etc.)
2. Data Ingestion Layer
This layer collects data from the source systems into the data platform.
Ingestion Types:
Batch (e.g., daily file loads)
Streaming (e.g., Kafka, real-time logs)
Change Data Capture (CDC) for real-time updates
Common Tools:
Apache NiFi, Fivetran, Airbyte, Kafka, AWS Glue
PM Tip:
Watch for performance bottlenecks and source system impacts during ingestion.
3. Data Storage Layer
Once ingested, data is stored in stages:
Bronze → Silver → Gold
Bronze (Raw Layer): As-is data
Silver (Cleaned Layer): Structured, deduplicated, joined
Gold (Curated Layer): Aggregated, business-consumable datasets
Storage Types:
Data Lake: S3, ADLS, GCS (for raw and semi-structured data)
Data Warehouse: Snowflake, Redshift, BigQuery (for structured, query-ready data)
PM Tip:
Clarify data retention, backup, and archival policies early in planning.
4. Data Processing / Transformation Layer
This layer handles ETL (Extract, Transform, and Load) or ELT pipelines to convert raw data into meaningful formats.
Common Tools:
Apache Spark, dbt, Dataflow, Airflow, Azure Data Factory
Tasks Performed:
Data cleaning
Business rule application
Joining multiple datasets
Creating KPIs/metrics
PM Tip:
Map transformations to business logic; involve domain SMEs during development.
5. Semantic / Business Logic Layer
This is where business logic is centralized—so analysts and BI tools use consistent metrics (e.g., revenue, active users).
Examples:
Looker semantic models, dbt models, Power BI datasets
PM Tip:
Helps avoid "multiple versions of the truth" in reports—centralize this early.
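The "one version of the truth" idea can be sketched simply: define each metric once and have every consumer call the same definition. The metric rule and record shape below are illustrative assumptions, not from a real semantic-layer tool.

```python
# Illustrative order records.
ORDERS = [
    {"status": "complete",  "amount": 120.0},
    {"status": "complete",  "amount": 80.0},
    {"status": "cancelled", "amount": 300.0},
]

def revenue(orders) -> float:
    """Single source of truth: by definition, revenue excludes cancelled orders."""
    return sum(o["amount"] for o in orders if o["status"] == "complete")

# Both consumers reuse the same definition, so their numbers can never disagree.
dashboard_value = revenue(ORDERS)
monthly_report_value = revenue(ORDERS)
print(dashboard_value == monthly_report_value, dashboard_value)  # True 200.0
```

Tools like dbt or LookML do the same thing at scale: the metric is defined once in the model and every dashboard or report compiles against it.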
6. Consumption Layer (BI & ML)
This is where users interact with the data via tools and apps.
Consumers Include:
Dashboards (Power BI, Tableau, Looker)
Reports and ad-hoc queries
ML pipelines and models
APIs for other apps or clients
PM Tip:
Plan training or enablement sessions—dashboards are only useful if people can interpret and trust them.
7. Data Governance & Security Layer
Ensures your platform meets compliance, security, and data quality standards.
Functions:
Role-based access control (RBAC)
Data catalogs and lineage tracking
Auditing and compliance logging
Data quality rules and alerting
Tools:
Collibra, Alation, Unity Catalog, Great Expectations
PM Tip:
Prioritize governance from the start; it's hard to retrofit later.
8. Monitoring & Orchestration Layer
Ensures data jobs run as expected and issues are detected early.
Tools:
Airflow, Dagster, Prefect (orchestration)
Grafana, Datadog (monitoring)
PM Tip:
Use alerts and dashboards to track data freshness, pipeline health, and failures.
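A hedged sketch of one such check, a data-freshness monitor: compare each table's last successful load time against its SLA and flag anything stale. Table names, SLAs, and the fixed "now" are all illustrative.

```python
from datetime import datetime, timedelta, timezone

now = datetime(2024, 3, 10, 12, 0, tzinfo=timezone.utc)  # fixed "now" for the demo

# Illustrative last-load times and per-table freshness SLAs.
LAST_LOADED = {
    "gold_monthly_revenue": now - timedelta(hours=2),
    "silver_customers":     now - timedelta(hours=30),
}
FRESHNESS_SLA = {
    "gold_monthly_revenue": timedelta(hours=6),
    "silver_customers":     timedelta(hours=24),
}

def stale_tables(current_time):
    """Return the tables whose last load breaches their freshness SLA."""
    return [t for t, loaded in LAST_LOADED.items()
            if current_time - loaded > FRESHNESS_SLA[t]]

alerts = stale_tables(now)
print(alerts)  # ['silver_customers'] — 30h old against a 24h SLA
```

In practice a monitoring tool would page or post the alert; the check itself is this simple.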
Summary Architecture Diagram (Text Format)
+-----------------------+
| Source Systems |
| (CRM, APIs, Logs) |
+----------+------------+
|
[Ingestion Layer]
|
+-------v--------+
| Raw Storage | <== Bronze
+-------+--------+
|
[Transformation Layer]
|
+-------v--------+
| Cleaned Storage| <== Silver
+-------+--------+
|
[Business Logic Layer]
|
+-------v--------+
| Curated Data | <== Gold
+-------+--------+
|
+------------+--------------+
| BI Tools / ML / APIs etc. |
+---------------------------+
⚖️ Why OLTP and OLAP Should Be Separate in a Data Architecture
In any data-driven organization, separating OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) systems is a best practice for both performance and architectural clarity.
This separation ensures that operational efficiency and analytical capability can coexist without compromising each other.
1. They Serve Different Purposes
Why separate?
You don’t want analytics queries slowing down your live order processing, or transaction issues impacting dashboards.
⚙️ 2. Different Workload Patterns
OLTP: Write-heavy, lots of small transactions (e.g., placing an order, updating inventory)
OLAP: Read-heavy, large-scale complex queries (e.g., “What was the month-over-month growth?”)
Why separate?
Combining them leads to resource contention—reporting queries can slow down business-critical operations.
3. Different Data Models
OLTP: Highly normalized for write efficiency and data integrity
OLAP: Often denormalized for query performance (e.g., star/snowflake schemas)
Why separate?
What’s efficient for storing orders and customers separately (OLTP) is inefficient for summarizing trends (OLAP).
⚠️ 4. Risk of Performance Bottlenecks
If OLTP and OLAP share the same database or infrastructure:
A slow dashboard refresh can block incoming transactions
High-volume transactions can delay reporting refreshes
Why separate?
Ensures high availability and scalability on both fronts.
5. Different Data Retention & Volume Needs
OLTP only needs recent data (e.g., last 30 days for operations)
OLAP often stores years of data for trends, predictions, audits
Why separate?
Reduces storage and performance strain on OLTP systems.
6. Enables Better Scalability
OLTP systems scale vertically (e.g., bigger servers, higher IOPS)
OLAP systems scale horizontally (e.g., distributed query engines)
Why separate?
You can scale each system based on its usage profile without over-provisioning.
7. Enables a Robust Data Pipeline Architecture
Separating OLTP and OLAP enables the creation of dedicated ingestion pipelines, data quality checks, transformation logic, and semantic layers—without touching live operational systems.
PM Insight:
This enables better governance, reusability, and visibility across the data lifecycle.
✅ Summary – Key Benefits of Separation
Real-World Analogy
Think of it like a kitchen (OLTP) vs. a restaurant review dashboard (OLAP):
The kitchen must operate fast, reliably, and consistently to serve customers (OLTP).
The dashboard helps management analyze popular dishes, customer feedback trends, and supply chain performance (OLAP).
You don’t want customers waiting for food because someone is running a quarterly report!
Sample Tech Stack for a Modern Data Platform
A typical data platform is made up of multiple layers, each with a set of tools and technologies for data ingestion, storage, processing, governance, and consumption.
1️⃣ Data Ingestion Layer
Purpose: Bring data from source systems (OLTP, APIs, files, streaming) into the platform.
2️⃣ Data Storage Layer
Purpose: Store raw and processed data at scale.
3️⃣ Data Processing & Transformation Layer
Purpose: Clean, transform, and structure data (ETL/ELT pipelines).
4️⃣ Semantic / Business Logic Layer
Purpose: Apply business definitions and KPIs to transformed data.
Tools:
dbt (data build tool)
LookML (Looker modeling layer)
Power BI Datasets
Tableau Data Models
5️⃣ Data Consumption Layer
Purpose: Expose data to users via dashboards, APIs, notebooks, and ML models.
6️⃣ Data Governance, Security & Cataloging
Purpose: Manage data access, quality, lineage, and compliance.
7️⃣ Monitoring & DevOps
Purpose: Monitor pipelines, data health, and platform performance.
Tools:
Grafana, Prometheus, CloudWatch, Datadog
CI/CD: GitHub Actions, GitLab CI/CD, Jenkins
Terraform, Pulumi (for infra as code)
dbt Cloud / dbt Core for deployment & testing
✅ Platform Hosting Options
Other Supporting Details
1. Data Platform Roles & Responsibilities
Help PMs understand who does what.
Why include this: PMs must manage these roles, coordinate tasks, and resolve dependencies.
2. Data Quality Dimensions
Highlight the importance of data quality and how to monitor it.
Key dimensions:
Accuracy – Is the data correct?
Completeness – Are all required fields present?
Timeliness – Is the data fresh?
Consistency – Is it standardized across sources?
Validity – Is data within expected formats/ranges?
Why include this: PMs should track quality KPIs and know when data is "ready for use."
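The quality dimensions above can be turned into automated checks; a PM-facing report could track the pass rate of each dimension as a quality KPI. The rules and records here are illustrative sketches, not a real rules engine.

```python
import re

# Illustrative records with deliberate failures.
RECORDS = [
    {"email": "alice@example.com", "age": 34},
    {"email": "",                  "age": 29},   # completeness failure: missing email
    {"email": "bob@example.com",   "age": 210},  # validity failure: age out of range
]

def check_completeness(records) -> bool:
    """Completeness: are all required fields present?"""
    return all(r["email"] for r in records)

def check_validity(records) -> bool:
    """Validity: is data within expected formats/ranges?"""
    email_ok = all(re.match(r"^[\w.+-]+@[\w-]+\.\w+$", r["email"])
                   for r in records if r["email"])
    age_ok = all(0 < r["age"] < 120 for r in records)
    return email_ok and age_ok

results = {
    "completeness": check_completeness(RECORDS),
    "validity": check_validity(RECORDS),
}
print(results)  # {'completeness': False, 'validity': False}
```

Frameworks such as Great Expectations express the same idea declaratively and add scheduling and alerting.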
3. BI & Reporting Layer
How data is consumed by business users.
Include:
Common BI tools (Power BI, Tableau, Looker)
Embedded analytics vs. self-service
Real-time vs scheduled dashboards
Row-level security (RLS), dashboard governance
Why include this: PMs often get judged on dashboard delivery timelines and usability.
⚙️ 4. Orchestration & Scheduling
Managing pipelines and data jobs.
Tools: Airflow, Dagster, Prefect
Concepts: DAGs, dependencies, retries
Use cases: Scheduling daily refresh, triggering downstream tasks
Why include this: Helps PMs identify potential delays and bottlenecks in pipeline runs.
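The core orchestration concepts (DAGs, dependencies, retries) can be sketched in plain Python without a real scheduler; task names and the simulated failure below are illustrative, and a tool like Airflow expresses the same structure with operators and a scheduler.

```python
from graphlib import TopologicalSorter

# Each task lists the tasks it depends on — the shape of a DAG definition.
DAG = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "refresh_dashboard": {"aggregate"},
}

attempts = {"clean": 0}

def run_task(name: str) -> None:
    # Simulate a flaky task that succeeds on its second attempt.
    if name == "clean":
        attempts["clean"] += 1
        if attempts["clean"] < 2:
            raise RuntimeError("transient failure")

def run_with_retries(name: str, max_retries: int = 2) -> None:
    """Retry a failed task up to max_retries times before giving up."""
    for attempt in range(max_retries + 1):
        try:
            run_task(name)
            return
        except RuntimeError:
            if attempt == max_retries:
                raise

# TopologicalSorter yields each task only after all of its dependencies ran.
order = list(TopologicalSorter(DAG).static_order())
for task in order:
    run_with_retries(task)

print(order)  # ['ingest', 'clean', 'aggregate', 'refresh_dashboard']
```

For a PM, the useful takeaway is that a failed upstream task blocks everything downstream of it, which is where pipeline delays come from.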
5. Data Cataloging & Discoverability
Make data usable and discoverable by business users.
Include:
Metadata management
Data lineage tools
Tools: Alation, Collibra, Unity Catalog
Why include this: PMs can proactively reduce “where is my data?” queries.
6. Machine Learning & Advanced Analytics Layer (optional, depending on scope)
If your platform supports ML use cases, mention:
Feature stores
Model training & tracking
Data drift monitoring
Tools: MLflow, SageMaker, Vertex AI
Why include this: Clarifies what infra/data setup is needed to support ML.
7. Security, Privacy & Compliance
Critical for regulated industries (e.g., finance, healthcare).
Include:
Role-based access control (RBAC)
Data masking & encryption
PII detection & anonymization
Audit logging
GDPR, HIPAA, ISO considerations
Why include this: PMs must plan secure environments and coordinate with InfoSec.
8. Typical Phases of a Data Platform Project
Break it down like a delivery roadmap:
Discovery & stakeholder alignment
Data source inventory & access setup
Ingestion pipelines (Bronze)
Data transformations & validations (Silver)
Data product definition (Gold)
BI/ML layer build
Governance setup
User onboarding, training & adoption
Monitoring, automation, support
Why include this: Helps PMs structure sprints and deliverables.
9. KPIs & Success Metrics for PMs
What defines success for the PM in a data platform project?
Examples:
% of critical data sources integrated
Time to deliver first dashboard
Data freshness SLAs met
Number of active data products used by business
User adoption rate
Why include this: Aligns PM goals with business value.
10. Challenges & Pitfalls to Watch For
Forewarned is forearmed. Common issues:
Scope creep (“just one more dataset”)
Lack of clear data ownership
Delayed access to source systems
Unclear data definitions (leading to conflicting reports)
Data quality gaps due to upstream changes
Gold layer built without Silver being stable
Why include this: Helps PMs proactively mitigate risks and set realistic expectations.