Spirion Sensitive Data Platform Scan Performance
Discovery
For on-premises Endpoint data protection, Spirion’s local agent is ideal for discovering and classifying data across any number of endpoints and reporting back to the appropriate SDP Console.
Scans can be performed on any schedule.
For the use case of cloud discovery, there are two factors that any vendor encounters when performing Data-At-Rest discovery:
1. Retrieval of data off the cloud store
Spirion has only limited influence on this item, as it is up to the host to provide sufficient throughput and API access for retrieval.
A. In the M365 case, Microsoft can “throttle” API calls at any time to maintain its quality of service (QoS), effectively lengthening a scan.
B. Spirion has built in throttle detection and exponential back-off to minimize the impact of throttling.
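The back-off behavior described above can be sketched in a few lines. This is an illustrative pattern, not Spirion’s actual implementation; `call_with_backoff`, `ThrottledError`, and the delay parameters are assumptions standing in for the real throttle handling.

```python
import random
import time

class ThrottledError(Exception):
    """Stand-in for a throttle response (e.g. HTTP 429) from the host."""

def call_with_backoff(request, max_retries=5, base_delay=1.0):
    """Retry `request` with exponential back-off plus random jitter."""
    for attempt in range(max_retries):
        try:
            return request()
        except ThrottledError:
            # Sleep base_delay * 2**attempt (1s, 2s, 4s, ...) plus jitter,
            # so retries spread out instead of hammering the throttled API.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise RuntimeError("still throttled after %d retries" % max_retries)
```

With each retry the wait doubles, so a briefly throttled call recovers quickly while sustained throttling backs the scanner off for longer and longer intervals.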
2. Consumption of cloud data stores in a timely manner
Spirion can apply any number of Scan Nodes to consume data on a given scan.
A. For example, a scan of an O365 tenant could use 50 scan nodes, or even 100, each retrieving mailbox data.
B. API throttling notwithstanding, the limits of a given scan are now only:
I. Throughput of storage devices
II. Networking bandwidth
III. CPU resources
Throughput vs. Consumption – “What is the rate of file discovery per day?”
A single Agent or Node on a workstation with 4 cores has a throughput of about 1.8 GB per hour when scanning a NAS file share, for example.
This means that, for every in-scope file, it can connect, enumerate, download, pre-process (for example, “unzip”, “filter”, “OCR”) into a text stream, analyze with the enabled identifiers, package the results (if any) and metadata, and stream them to the Console.
- Common file types: With the out-of-the-box “Common” file list, it is usually unnecessary to scan every file on a given data store.
- Skipping out-of-scope files improves scan performance on a target store and is often perceived as “speed”.
- Mixed file types: On an average file store with mixed file types, Spirion appears to surpass single-agent throughput when the entirety of the data store is taken into account (that is, consumption of the data store).
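Assuming the ~1.8 GB/hour single-node figure above and ideal linear scaling (a best case that throttling and shared bandwidth will erode), daily consumption is simple arithmetic; `daily_throughput_gb` is a hypothetical helper, not part of the product:

```python
# Back-of-the-envelope consumption rate, using the ~1.8 GB/hour
# single-node figure quoted above (4-core workstation, NAS share).
GB_PER_HOUR_PER_NODE = 1.8

def daily_throughput_gb(nodes: int) -> float:
    """Aggregate GB/day a group of `nodes` scan nodes could consume,
    assuming linear scaling and no API throttling."""
    return GB_PER_HOUR_PER_NODE * 24 * nodes

# One node covers roughly 43 GB/day; fifty nodes, roughly 2.2 TB/day.
```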
Distributed Scanning and Scan Nodes
Discovery Teams
To consume large data stores, Spirion can leverage “Discovery Teams”, which are collections of scan nodes working together to split up the workload.
At a high level, one scan node is chosen as the “queue manager”; it queries the Target(s) and collects locations to scan, creating a work list.
All other nodes (two or more) connect to the queue manager node and retrieve Targets to scan. There are several advantages to using Discovery Teams:
- Fault Tolerance – As long as at least one node remains active, the scan continues
- Speed multiplier – The more nodes available, the faster the scan completes
- Flexibility – Nodes can be reallocated flexibly to accommodate new scan Targets. For example:
  - Discovery Team 1 is scanning on-premises server assets
  - Discovery Team 2 can be pointed at either on-premises or cloud assets and manually redirected to another asset once its search has completed
  - Discovery Team 3 is scanning cloud-based assets
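The queue-manager pattern described above can be sketched with standard worker-pool code. This is a minimal illustration using Python threads and an in-memory queue, not Spirion’s implementation; all names here are hypothetical.

```python
import queue
import threading

def queue_manager(targets, work_queue):
    """Enumerate scan locations from the Target(s) and publish them as a work list."""
    for location in targets:
        work_queue.put(location)

def scan_node(work_queue, results):
    """Pull locations until the queue is drained. A node that fails simply
    stops pulling; the remaining nodes finish the scan (fault tolerance)."""
    while True:
        try:
            location = work_queue.get_nowait()
        except queue.Empty:
            return
        results.append("scanned " + location)  # stand-in for the real scan

# One manager enumerates 10 mailboxes; a "team" of 3 nodes splits the work.
work_queue = queue.Queue()
results = []
queue_manager(["mailbox-%d" % i for i in range(10)], work_queue)
nodes = [threading.Thread(target=scan_node, args=(work_queue, results)) for _ in range(3)]
for n in nodes:
    n.start()
for n in nodes:
    n.join()
```

Adding more worker threads (nodes) drains the queue faster, which is the “speed multiplier” effect; pointing the manager at a different target list is the reallocation described under “Flexibility”.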
The Importance of Accuracy
When considering the entirety of a data discovery scan, it is important to also consider the time spent reviewing results and mitigating false positives.
- No solution is or will be 100% accurate, so this time must be accounted for.
- Spirion excels here by producing fewer false positives and more accurate results, significantly decreasing the time spent reviewing them.
- At the scale of a typical endpoint deployment, in your environment or any other, the time saved is substantial.
Example
Here is a simple example:
- 10,000 machines × 100 matches each at 85% accuracy = 150,000 false positives
- 10,000 machines × 100 matches each at 95% accuracy = 50,000 false positives
- 10,000 machines × 100 matches each at 98.5% accuracy = 15,000 false positives
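The figures above follow from multiplying the total match count by the false-positive rate (1 − accuracy). A hypothetical helper makes the arithmetic explicit:

```python
def false_positives(machines: int, matches_per_machine: int, accuracy: float) -> int:
    """Matches flagged incorrectly across the fleet: total matches * (1 - accuracy)."""
    return round(machines * matches_per_machine * (1 - accuracy))

# 10,000 machines x 100 matches = 1,000,000 total matches, so even a few
# points of accuracy translate into tens of thousands of results to review.
for accuracy in (0.85, 0.95, 0.985):
    print(accuracy, false_positives(10_000, 100, accuracy))
```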