Module 20: Data Discovery and Classification
You cannot protect what you do not know about. The exam treats data discovery and classification as prerequisites for every other data security control. Without knowing what data exists and how sensitive it is, encryption, data loss prevention (DLP), and access controls are applied blindly.
Data Discovery
Data discovery is the process of identifying what data exists across your cloud environment. In organizations using multiple cloud services, data proliferates rapidly — databases, object storage, SaaS applications, file shares, logs, backups, and temporary processing stores. Discovery must find all of it.
Discovery Techniques
- Automated scanning: Tools that crawl cloud storage, databases, and file systems looking for data patterns (PII, financial data, health records).
- Metadata analysis: Examining file properties, database schemas, and service configurations to identify likely sensitive data stores.
- Network traffic analysis: Monitoring data flows to discover where data is created, stored, and transferred.
- API integration: Using cloud provider APIs to inventory all data stores and their configurations.
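The pattern-matching approach behind automated scanning can be sketched in a few lines. This is a minimal, illustrative scanner; the regexes and pattern names are assumptions for demonstration, and real discovery tools layer far more robust detection (checksum validation, context analysis, ML) on top of simple patterns like these.

```python
import re

# Hypothetical pattern set for a minimal PII scanner (illustrative only;
# production tools validate matches, e.g. Luhn checks for card numbers).
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def scan_text(text: str) -> dict[str, int]:
    """Return a count of matches per pattern type found in the text."""
    return {name: len(rx.findall(text)) for name, rx in PATTERNS.items()}

sample = "Contact: alice@example.com, SSN 123-45-6789"
print(scan_text(sample))  # {'ssn': 1, 'credit_card': 0, 'email': 1}
```

In practice a scanner like this would be pointed at object storage and database exports via the provider's APIs, with findings fed into an inventory rather than printed.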
Data Classification
Classification assigns sensitivity levels to data based on its content, context, and regulatory requirements. The exam uses a standard classification hierarchy:
- Public: No impact if disclosed. Marketing materials, published reports.
- Internal: Low impact if disclosed. Internal policies, non-sensitive communications.
- Confidential: Moderate to high impact. Business plans, financial reports, employee records.
- Restricted/Highly Confidential: Severe impact. Personal health information, credit card data, trade secrets.
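One way to make this hierarchy enforceable in code is an ordered enum, so policy checks can compare tiers directly. A minimal sketch (the `requires_dlp_monitoring` helper is a hypothetical example of a classification-driven policy check, not a standard API):

```python
from enum import IntEnum

# The four-tier hierarchy from the text, ordered so that higher values
# mean greater sensitivity and stricter required controls.
class Classification(IntEnum):
    PUBLIC = 1        # no impact if disclosed
    INTERNAL = 2      # low impact
    CONFIDENTIAL = 3  # moderate to high impact
    RESTRICTED = 4    # severe impact (PHI, cardholder data, trade secrets)

# Ordering lets policy checks compare tiers instead of string labels.
def requires_dlp_monitoring(level: Classification) -> bool:
    """DLP monitoring applies from Confidential upward."""
    return level >= Classification.CONFIDENTIAL

print(requires_dlp_monitoring(Classification.INTERNAL))    # False
print(requires_dlp_monitoring(Classification.RESTRICTED))  # True
```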
The exam tests who is responsible for classification. The data owner classifies data — not the IT department, not the CSP, not the data custodian. Classification is a business decision based on data sensitivity and regulatory requirements.
Exam trap: Technical teams often classify data based on technical characteristics (file type, location). The exam expects classification based on business impact and sensitivity. A text file containing credit card numbers is more sensitive than an encrypted database of public records, regardless of technical format.
Classification in Cloud Environments
Cloud introduces unique classification challenges:
- Volume: Cloud data grows rapidly. Manual classification cannot scale. Automated tools are necessary.
- Distributed data: The same data may exist in multiple services, regions, and formats. Classification must be consistent across all instances.
- Dynamic data: Data classification can change as data is combined, analyzed, or enriched. Individually non-sensitive data points may become sensitive when aggregated.
- Shared responsibility: The customer classifies data; the CSP provides tools and infrastructure to implement classification-based controls.
Automated Classification Tools
Cloud providers and third-party tools offer automated classification using pattern matching, machine learning, and context analysis. The exam recognizes that automated classification is necessary for cloud scale but requires human oversight to handle edge cases and verify accuracy.
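The "automated with human oversight" pattern is often implemented as confidence-based triage: high-confidence findings are labeled automatically, low-confidence ones go to a review queue. A sketch under assumed structures (the threshold value and the finding fields are hypothetical):

```python
# Hypothetical triage of automated classification results: high-confidence
# findings are auto-labeled, low-confidence ones are queued for human review.
REVIEW_THRESHOLD = 0.85  # assumed cutoff, tuned per organization

def triage(findings: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split scanner findings into auto-applied labels and a review queue."""
    auto, review = [], []
    for f in findings:
        (auto if f["confidence"] >= REVIEW_THRESHOLD else review).append(f)
    return auto, review

findings = [
    {"object": "reports/q3.csv", "label": "confidential", "confidence": 0.97},
    {"object": "notes/draft.txt", "label": "internal", "confidence": 0.60},
]
auto, review = triage(findings)
print(len(auto), len(review))  # 1 1
```

The threshold is the dial between scale and accuracy: lower it and more edge cases reach human reviewers, raise it and more labels apply unverified.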
Classification Driving Controls
Classification determines which controls apply:
- Public data: Integrity controls (prevent tampering), availability controls.
- Internal data: Basic access controls, standard encryption.
- Confidential data: Strong access controls, encryption at rest and in transit, DLP monitoring, audit logging.
- Restricted data: All of the above plus customer-managed encryption keys, enhanced monitoring, strict access reviews, and potentially dedicated infrastructure.
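The tier-to-controls mapping above can be expressed as a cumulative lookup, assuming (as the "all of the above plus" wording suggests) that each tier inherits the controls of the tiers below it. Control names here are illustrative labels, not a specific vendor's features:

```python
# Controls introduced at each tier; lower tiers' controls are inherited.
TIER_CONTROLS = {
    "public": {"integrity_checks", "availability_monitoring"},
    "internal": {"basic_access_controls", "standard_encryption"},
    "confidential": {"strong_access_controls", "encryption_at_rest",
                     "encryption_in_transit", "dlp_monitoring",
                     "audit_logging"},
    "restricted": {"customer_managed_keys", "enhanced_monitoring",
                   "strict_access_reviews"},
}
TIERS = ["public", "internal", "confidential", "restricted"]

def controls_for(tier: str) -> set[str]:
    """Accumulate controls from the lowest tier up to the given one."""
    controls: set[str] = set()
    for t in TIERS[: TIERS.index(tier) + 1]:
        controls |= TIER_CONTROLS[t]
    return controls

print(sorted(controls_for("confidential")))
```

Encoding the mapping once and enforcing it everywhere is what the exam means by consistent, classification-driven controls across services.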
Key Takeaways
Discovery comes before classification. Classification drives all other controls. Data owners classify; technical teams implement. Automated tools are necessary at cloud scale but require human oversight. Classification must account for data aggregation sensitivity. Map classifications to control tiers and enforce them consistently across all cloud services.