From Wikipedia, the free encyclopedia - View original article
Data loss/leak prevention solution is a system that is designed to detect potential data breach / data ex-filtration transmissions and prevent them by monitoring, detecting and blocking sensitive data while in-use (endpoint actions), in-motion (network traffic), and at-rest (data storage). In data leakage incidents, sensitive data is disclosed to unauthorized personnel either by malicious intent or inadvertent mistake. Such sensitive data can come in the form of private or company information, intellectual property (IP), financial or patient information, credit-card data, and other information depending on the business and the industry.
The terms "data loss" and "data leak" are closely related and are often used interchangeably, though they are somewhat different. Data loss incidents turn into data leak incidents in cases where media containing sensitive information is lost and subsequently acquired by unauthorized party. However, a data leak is possible without the data being lost in the originating side. Some other terms associated with data leakage prevention are information leak detection and prevention (ILDP), information leak prevention (ILP), content monitoring and filtering (CMF), information protection and control (IPC), and extrusion prevention system (EPS), as opposed to intrusion prevention system.
The technological means employed for dealing with data leakage incidents can be divided into the following categories: standard security measures, advanced/intelligent security measures, access control and encryption, and designated DLP systems .
Standard security measures, such as firewalls, intrusion detection systems (IDSs), and antivirus software, are commonly available mechanisms that guard computers against outsider as well as insider attacks. The use of firewall, for example, limits the access of outsiders to the internal network, and an intrusion detection system detects intrusion attempts by outsiders. Inside attacks can be averted through antivirus scans that detect Trojan horses installed on PCs which send confidential information, and by the use of thin clients, which operate in a client-server architecture with no personal or sensitive data stored on a client's computer.
Advanced security measures employ machine learning and temporal reasoning algorithms for detecting abnormal access to data (i.e., databases or information retrieval systems) or abnormal email exchange, honeypots for detecting authorized personnel with malicious intentions, and activity-based verification (e.g., recognition of keystrokes dynamics) for detecting abnormal access to data.
Designated DLP solutions detect and prevent unauthorized attempts to copy or send sensitive data, intentionally or unintentionally, without authorization, mainly by personnel who are authorized to access the sensitive information. In order to classify certain information as sensitive, these solutions use mechanisms, such as exact data matching, structured data fingerprinting, statistical methods, rule and regular expression matching, published lexicons, conceptual definitions, and keywords .
Typically a software or hardware solution that is installed at network egress points near the perimeter. It analyzes network traffic to detect sensitive data that is being sent in violation of information security policies.
Such systems run on end-user workstations or servers in the organization. Like network-based systems, endpoint-based can address internal as well as external communications, and can therefore be used to control information flow between groups or types of users (e.g. 'Chinese walls'). They can also control email and Instant Messaging communications before they are stored in the corporate archive, such that a blocked communication (i.e., one that was never sent, and therefore not subject to retention rules) will not be identified in a subsequent legal discovery situation. Endpoint systems have the advantage that they can monitor and control access to physical devices (such as mobile devices with data storage capabilities) and in some cases can access information before it has been encrypted. Some endpoint-based systems can also provide application controls to block attempted transmissions of confidential information, and provide immediate feedback to the user. They have the disadvantage that they need to be installed on every workstation in the network, cannot be used on mobile devices (e.g., cell phones and PDAs) or where they cannot be practically installed (for example on a workstation in an internet café).
DLP solutions include a number of techniques for identifying confidential or sensitive information. Sometimes confused with discovery, data identification is a process by which organizations use a DLP technology to determine what to look for (in motion, at rest, or in use).
Data is classified as structured or unstructured. Structured data resides in fixed fields within a file such as a spreadsheet while unstructured data refers to free-form text as in text documents or PDF files. An estimated 80% of all data is unstructured and 20% to be structured. Data classification is divided into content analysis, focused on structured data, and contextual analysis which looks at the place of origin or the application or system that generated the data.
Methods for describing sensitive content are abundant. They can be divided into two categories: precise methods and imprecise methods.
Precise methods are, by definition, those that involve Content Registration and trigger almost zero false positive incidents.
All other methods are imprecise and can include: keywords, lexicons, regular expressions, extended regular expressions, meta data tags, bayesian analysis, statistical analysis such as Machine Learning, etc.
The strength of the analysis engine directly correlates to its accuracy. The accuracy of DLP identification is important to lowering/avoiding false positives and negatives. Accuracy can depend on many variables, some of which may be situational or technological. Testing for accuracy is recommended to ensure a solution has virtually zero false positives/negatives. High False Positive Rates will cause the system to be DLD not DLP.
Sometimes a data distributor gives sensitive data to a set of third parties. Some time later, some of the data is found in an unauthorized place (e.g., on the web or on a user's laptop). The distributor must then investigate if data leaked from one or more of the third parties, or if it was independently gathered by other means.
"Data at rest" specifically refers to old archived information that is stored on either a client PC hard drive, on a network storage drive or remote file server, or even data stored on a backup system, such as a tape or CD media. This information is of great concern to businesses and government institutions simply because the longer data is left unused in storage, the more likely it might be retrieved by unauthorized individuals outside the Network.