Comprehensive and correct classification is fundamental for data security. Only if you know which data is (particularly) worth protecting can you implement appropriate measures. This is all the more important in times of AI assistants such as Microsoft Copilot.
The tool’s access to information depends largely on the user’s access rights and the classification of the data. AI risks can only be reduced if both factors are cleverly controlled.
Accordingly, data classification is a fundamental element of a strong security strategy. However, many classification projects fail because the scanning engine cannot process large data sets or produces too many false positives. When looking for the right solution, you should look for providers whose applications are successfully used by companies similar to your own (in terms of size and type of data). There are five key capabilities that a classification solution should have.
1. Scalable and efficient scanning in real time
There is a big difference between scanning a terabyte of storage space for a medium-sized company and scanning 12 petabytes of data for a global corporation. Relying on products that cannot scale due to latency issues means that scans are never completed or provide outdated and incomplete data classification insights. This, of course, has a negative impact on important security decisions.
Environments with hundreds of large data stores require a distributed multi-threaded engine that can process multiple systems simultaneously without consuming too many resources. You should also rely on solutions that use incremental real-time scanning methods and only scan the data that has been newly created or changed since the last scan.
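The two ideas above can be sketched in a few lines. This is a minimal illustration, not any vendor's actual engine: a thread pool scans multiple files concurrently, and a change-tracking dictionary (here keyed on modification time) ensures that only new or modified files are rescanned. The `classify` function is a deliberately naive placeholder.

```python
import concurrent.futures
from pathlib import Path

def needs_scan(path: Path, last_seen: dict) -> bool:
    """Incremental scanning: skip files unchanged since the last run."""
    mtime = path.stat().st_mtime_ns
    if last_seen.get(str(path)) == mtime:
        return False
    last_seen[str(path)] = mtime
    return True

def classify(path: Path) -> str:
    """Placeholder classifier -- a real engine would apply pattern matching here."""
    text = path.read_text(errors="ignore")
    return "sensitive" if "confidential" in text.lower() else "public"

def incremental_scan(root: Path, last_seen: dict, workers: int = 8) -> dict:
    """Scan only changed files, concurrently; returns {path: label}."""
    targets = [p for p in root.rglob("*") if p.is_file() and needs_scan(p, last_seen)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip((str(p) for p in targets), pool.map(classify, targets)))
```

Run twice against the same unchanged directory and the second scan returns an empty result, which is exactly the property that keeps large environments scannable.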
2. Precise classification
Accuracy is the essential element of data classification. Unreliable data recognition and analysis undermine data loss prevention (DLP) policies, cloud access security broker (CASB) capabilities and threat detection. According to Gartner, one in three DLP projects fails due to inadequate data classification and recognition.
Many classification tools rely on third-party libraries or open source packages with unchecked and unvalidated regular expressions, dictionaries and patterns to find sensitive data. Modern classification solutions have their own databases and specific patterns (e.g. for GDPR), proximity matching and algorithmic verification to deliver precise results. When evaluating a tool, its accuracy should always be tested by using test data from dlptest.com, for example.
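Proximity matching and algorithmic verification can be illustrated with a standard example: credit card numbers. The sketch below (generic, not any specific product's logic) only reports a regex match if a supporting keyword appears nearby and the number passes the Luhn checksum, which filters out random 16-digit strings and drastically cuts false positives.

```python
import re

def luhn_valid(number: str) -> bool:
    """Algorithmic verification: the Luhn checksum rejects random digit strings."""
    digits = [int(d) for d in number if d.isdigit()]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

CARD_RE = re.compile(r"\b(?:\d[ -]?){15}\d\b")
KEYWORDS = ("card", "visa", "payment", "kreditkarte")  # illustrative keyword list

def find_card_numbers(text: str, window: int = 40) -> list:
    """Pattern match, then require a supporting keyword nearby plus a valid checksum."""
    hits = []
    for m in CARD_RE.finditer(text):
        context = text[max(0, m.start() - window): m.end() + window].lower()
        if any(k in context for k in KEYWORDS) and luhn_valid(m.group()):
            hits.append(m.group())
    return hits
```

A 16-digit ticket number without nearby payment keywords, or a keyword next to a number that fails the checksum, is ignored; both conditions must hold.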
3. Complete results
Classification is only successful if all areas are covered by it. This applies to both storage locations and file types. If a scan stops halfway through (for example, for the reasons mentioned above), you only get half the picture and therefore only half the protection. While sampling can be an effective tool for databases, this unfortunately does not apply to data storage such as NAS arrays or object storage such as AWS S3 and Azure Blob. If you have scanned 2 TB of an S3 account and found no sensitive content, you cannot assume that the other 500 TB of data is also not sensitive.
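The weakness of sampling at this scale can be quantified with a standard back-of-the-envelope calculation (my illustration, not a figure from the text): even a clean sample of 10,000 objects only bounds the prevalence of sensitive data at roughly 0.03% with 95% confidence, and 0.03% of half a billion objects is still on the order of 150,000 potentially sensitive objects.

```python
def zero_hit_upper_bound(sample_size: int, confidence: float = 0.95) -> float:
    """95% upper bound on prevalence after a sample with zero sensitive hits.

    Solves (1 - p) ** sample_size <= 1 - confidence for p (exact binomial,
    commonly approximated as the 'rule of three': p <= 3 / sample_size).
    """
    return 1 - (1 - confidence) ** (1 / sample_size)

# Illustrative numbers: 10,000 sampled objects, 500 million objects total.
bound = zero_hit_upper_bound(10_000)
possibly_sensitive = bound * 500_000_000
```

In other words, "we sampled and found nothing" is a weak statement for large object stores; only full coverage closes the gap.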
Before deciding on a classification tool, you should make sure that it supports the data types that are important for your company. This could be CAD drawings, Office documents or even images. A robust classification solution should be able to scan and classify all structured, semi-structured and unstructured data, regardless of type, format, storage location or platform. This is the only way to provide security managers with a centralized and comprehensive overview of data protection across SaaS, IaaS, databases, on-premises file shares and hybrid NAS devices. Cloud-based solutions are also able to respond to updated regulatory requirements or standards and quickly deploy new classification policies without time-consuming upgrades, package downloads or patches.
4. Context-related findings
While classifying data is an important first step, it is usually not enough to protect valuable data. Security managers need additional context to achieve their security goals. In particular, access rights and file activity play a key role. The key to minimizing risk lies in identifying it and knowing who can access the data. That access must then be reduced without compromising productivity.
At the same time, security managers must be able to detect and stop unusual behavior, determine who is accessing data and prevent excessive access in a way that does not affect productivity. All of this is only possible with the appropriate context. By analyzing the metadata, key questions such as “Which data is sensitive, has overly broad access rights or is no longer needed?” can be answered.
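The question quoted above can be expressed as a simple query over classification results enriched with metadata. The record layout and thresholds below are illustrative assumptions, not a real product's schema:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class FileRecord:
    path: str
    classification: str   # e.g. "sensitive" or "public"
    accessible_by: int    # number of accounts with read access
    last_accessed: datetime

def high_risk(records, max_readers: int = 25, stale_after_days: int = 365) -> list:
    """Flag data that is sensitive AND either overexposed or no longer needed."""
    now = datetime.now()
    return [
        r.path for r in records
        if r.classification == "sensitive"
        and (r.accessible_by > max_readers
             or now - r.last_accessed > timedelta(days=stale_after_days))
    ]
```

Classification alone answers "what is sensitive?"; only the joined metadata answers "what is sensitive and at risk?", which is what drives remediation.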
5. Ongoing review
Some tools rely on periodic or scheduled scans that examine data at fixed intervals. However, files change constantly, and their criticality often changes with them. Data can only be protected permanently if the classification solution keeps pace with this. It should therefore maintain a real-time audit log of all activities.
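A real-time audit log is, at its core, an append-only event record: every create, modify or delete event triggers an immediate re-classification and one log entry, rather than waiting for the next scheduled scan. The sketch below (illustrative, using a JSON-lines file) shows the shape of that flow; the change-event source itself is assumed to exist:

```python
import json
import time
from pathlib import Path

def record_activity(audit_log: Path, path: str, action: str, label: str) -> None:
    """Append-only JSON-lines log: one entry per event, history is never rewritten."""
    entry = {"ts": time.time(), "path": path, "action": action, "classification": label}
    with audit_log.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

def reclassify_on_change(audit_log: Path, path: Path, classify) -> str:
    """Handle a change event: re-evaluate criticality immediately and log it."""
    label = classify(path)
    record_activity(audit_log, str(path), "modified", label)
    return label

def read_audit(audit_log: Path) -> list:
    return [json.loads(line) for line in audit_log.read_text(encoding="utf-8").splitlines()]
```

Because entries are only ever appended, the log doubles as forensic evidence during incident investigations.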
The right classification solution can help security managers prevent data breaches, quickly investigate incidents and ensure compliance with increasingly stringent regulations. When choosing one, pay particular attention to coverage of the data stores your company actually uses (including SaaS and IaaS), high precision and scalability. This is the only way to optimally protect valuable and dynamic assets such as data in the future.