How to Build a SearchAPI

This is a practical guide to building a SearchAPI/SearchDLL.

It’s based on the standard SearchDLL architecture and patterns:

RegEx first-pass
C++ validation/cleaning second-pass

1. What you’re building

A SearchAPI/SearchDLL is a custom C++ DLL plugin that the Spirion/Identity Finder engine loads to detect content that built-in detectors don’t cover (or don’t cover accurately).

Two-stage detection model (recommended):

Stage 1 (Regex/keyword): capture candidates using labels/context (high recall).
Stage 2 (C++ validation + cleaning): extract the real ID, normalize it, and strictly validate it (high precision).

2. High-level workflow

Define the identifier
- Name, purpose, examples (positive/negative)
- Format rules (length, allowed chars, prefixes)
- Checksum rules (Luhn, mod-89, etc.) if applicable
- Context labels (e.g., “DL”, “Driver’s License #”, “ABN”, “ICCID”)
Design Stage-1 regex (context-anchored)
- Prefer labels + same-line capture
- Support “label before number” and “number before label” when needed
- Use numeric boundaries like (?<!\d) and (?!\d) to avoid partial matches
Implement Stage-2 C++
- Tokenize match text, ignore label words
- Reconstruct OCR-split IDs by merging tokens
- Validate format/state/checksum
- Implement Clean() to return a canonical normalized value
Compile the DLL
Deploy/configure in Spirion console
Test and tune your code
- Ensure true positives match
- Ensure false positives are rejected
- Validate OCR/noisy variants

3. Required exported functions (the “plugin contract”)

A typical SearchDLL exports these functions (names and signatures must match what the engine expects):

GetDisplayName()
- Returns the detector name displayed in the console.
GetResultType()
- Returns a numeric result type ID (must be in your custom range and mapped/configured).
GetSearchItemData(SearchItemData*& pData)
- Provides the pattern used for the engine’s first-pass search:
  - dataType = 1 → regex
  - dataType = 2 → keyword
  - pData->data → the regex/keyword string
DeleteSearchItemData(SearchItemData* pData)
- Frees what you allocated in GetSearchItemData.
DoTest() / DoTestEx()
- Called after the first-pass search finds a match; return true only if the match is a valid finding.
- Best practice: put all logic in DoTestEx and have DoTest call into it.
Clean() / FreeCleanedResult()
- Clean() normalizes the match into a canonical representation (digits-only, uppercase alphanumeric, etc.); FreeCleanedResult() frees the memory that Clean() allocated.
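
As a rough illustration of the contract above, the sketch below shows how GetSearchItemData and DeleteSearchItemData might pair up so that the DLL frees exactly what it allocated. The SearchItemData layout, the int return type, the SEARCHDLL_API macro, and the IMEI pattern are assumptions made for illustration; the authoritative definitions come from the header shipped with your Spirion version.

```cpp
#include <cstring>

// Hypothetical export macro: dllexport on Windows, plain C linkage elsewhere.
#if defined(_WIN32)
#define SEARCHDLL_API extern "C" __declspec(dllexport)
#else
#define SEARCHDLL_API extern "C"
#endif

// Assumed layout: only the dataType and data fields are named in this guide.
struct SearchItemData {
    int   dataType;  // 1 = regex, 2 = keyword
    char* data;      // pattern string, allocated and owned by the DLL
};

// Illustrative Stage-1 pattern: label, then 15 digits on the same line.
static const char kPattern[] = "(?i)IMEI[^\\r\\n]*?(?<!\\d)\\d{15}(?!\\d)";

// Hands the first-pass pattern to the engine. The return type is an assumption.
SEARCHDLL_API int GetSearchItemData(SearchItemData*& pData) {
    pData = new SearchItemData;
    pData->dataType = 1;
    pData->data = new char[sizeof(kPattern)];
    std::memcpy(pData->data, kPattern, sizeof(kPattern));  // copies the trailing NUL too
    return 1;
}

// Releases what GetSearchItemData allocated, using the matching delete forms.
SEARCHDLL_API void DeleteSearchItemData(SearchItemData* pData) {
    if (pData == nullptr) return;
    delete[] pData->data;
    delete pData;
}
```

Allocating and freeing inside the same DLL avoids mixing heaps between the plugin and the host process.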

4. Build the Stage-1 regex correctly

Core rules

Make it case-insensitive: (?i)
Anchor to labels/context (avoid naked numbers)
Keep matches on the same logical line with [^\r\n]*?
Avoid partial numeric overlaps with (?<!\d) and (?!\d)
If label can be before or after the number, include both orders

Example pattern shapes

Label before OR after number (ICCID/IMEI style):

LABEL ... NUMBER
NUMBER ... LABEL

Driver’s License label-first context capture:

Capture label + up to ~80 chars of same line
Let C++ extract/validate actual ID from that context (state-specific formats, OCR splits)
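
The two pattern shapes can be assembled from the core rules. The following is a minimal sketch, assuming the engine accepts PCRE-style syntax (the inline (?i) flag and lookarounds used in this guide). The helper names, the 40-character gap, and the digit ranges are hypothetical choices, and labels are assumed to contain no regex metacharacters. Note that std::regex does not support lookbehind, so these strings are meant for the engine, not for local matching with the C++ standard library.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Joins labels into a non-capturing alternation, e.g. (?:ICCID|IMEI).
std::string JoinLabels(const std::vector<std::string>& labels) {
    std::string out = "(?:";
    for (std::size_t i = 0; i < labels.size(); ++i) {
        if (i > 0) out += "|";
        out += labels[i];
    }
    return out + ")";
}

// LABEL ... NUMBER or NUMBER ... LABEL, same line, with numeric boundaries.
std::string BuildLabelNumberPattern(const std::vector<std::string>& labels,
                                    int minDigits, int maxDigits) {
    const std::string label = JoinLabels(labels);
    const std::string number = "(?<!\\d)\\d{" + std::to_string(minDigits) + "," +
                               std::to_string(maxDigits) + "}(?!\\d)";
    const std::string gap = "[^\\r\\n]{0,40}?";  // stay on the same logical line
    return "(?i)(?:" + label + gap + number + "|" + number + gap + label + ")";
}

// Label-first context capture: label plus up to maxChars of the same line.
std::string BuildContextCapture(const std::vector<std::string>& labels, int maxChars) {
    return "(?i)" + JoinLabels(labels) + "[^\\r\\n]{0," + std::to_string(maxChars) + "}";
}
```

For example, BuildLabelNumberPattern({"ICCID", "IMEI"}, 15, 20) yields a pattern that accepts either order, while BuildContextCapture({"DL"}, 80) leaves the real extraction to the C++ second pass.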

5. Stage-2 C++ validation patterns (what to implement)

5.1 Normalize and tokenize

Common helper logic:

ToUpperCopy()
NormalizeId() → keep only letters/digits; uppercase
TokenizeAlnum() → split into alphanumeric tokens
IsIgnoredWord() → drop tokens like “DL”, “LICENSE”, “NO”, “DRIVER”, etc.
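
One possible shape for these helpers is sketched below. The function names come from the list above; the bodies and the exact ignored-word list are assumptions for illustration, not a reference implementation.

```cpp
#include <cctype>
#include <string>
#include <unordered_set>
#include <vector>

// Returns an uppercase copy of s.
std::string ToUpperCopy(const std::string& s) {
    std::string out = s;
    for (char& c : out) c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    return out;
}

// Keeps only letters and digits, uppercased.
std::string NormalizeId(const std::string& s) {
    std::string out;
    for (unsigned char c : s)
        if (std::isalnum(c)) out += static_cast<char>(std::toupper(c));
    return out;
}

// Splits s into maximal runs of alphanumeric characters.
std::vector<std::string> TokenizeAlnum(const std::string& s) {
    std::vector<std::string> tokens;
    std::string cur;
    for (unsigned char c : s) {
        if (std::isalnum(c)) {
            cur += static_cast<char>(c);
        } else if (!cur.empty()) {
            tokens.push_back(cur);
            cur.clear();
        }
    }
    if (!cur.empty()) tokens.push_back(cur);
    return tokens;
}

// True for label words that should never be treated as part of an ID.
// The word list here is illustrative only.
bool IsIgnoredWord(const std::string& token) {
    static const std::unordered_set<std::string> kIgnored = {
        "DL", "LICENSE", "LICENCE", "NO", "NUMBER", "DRIVER", "DRIVERS"};
    return kIgnored.count(ToUpperCopy(token)) > 0;
}
```

Note that punctuation splits tokens, so "Driver's" becomes "Driver" and "s"; later guardrails need to cope with such stray fragments.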

5.2 Candidate extraction (handles OCR splitting)

Typical approach:

From the match text, produce tokens
Filter out ignored words (and optionally state codes)
Try:
- single-token candidates
- 2-token merges
- 3-token merges (for heavily chopped OCR)
Pick the best candidate using guardrails:
- must contain at least one digit
- reject overly short all-digit values (avoid addresses like “1802”)
- enforce an upper length bound as a sanity check
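
The approach above can be sketched as follows. This is a simplified illustration under stated assumptions: letter-only tokens (which covers label words such as "DL" or "LICENSE") end a run instead of being filtered through an explicit ignore list, "best" means the longest candidate that passes the guardrails, and the length limits (6 and 20) are hypothetical. A real detector would rank candidates against its format rules.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Splits text into uppercase alphanumeric tokens.
static std::vector<std::string> UpperTokens(const std::string& s) {
    std::vector<std::string> tokens;
    std::string cur;
    for (unsigned char c : s) {
        if (std::isalnum(c)) {
            cur += static_cast<char>(std::toupper(c));
        } else if (!cur.empty()) {
            tokens.push_back(cur);
            cur.clear();
        }
    }
    if (!cur.empty()) tokens.push_back(cur);
    return tokens;
}

static bool HasDigit(const std::string& s) {
    for (unsigned char c : s) if (std::isdigit(c)) return true;
    return false;
}

static bool AllDigits(const std::string& s) {
    for (unsigned char c : s) if (!std::isdigit(c)) return false;
    return !s.empty();
}

// Guardrails with hypothetical limits.
static bool PassesGuardrails(const std::string& c) {
    if (c.size() > 20) return false;                 // upper bound for sanity
    if (!HasDigit(c)) return false;                  // must contain a digit
    if (AllDigits(c) && c.size() < 6) return false;  // rejects "1802"-style noise
    return true;
}

// Tries 1-, 2-, and 3-token merges of adjacent digit-bearing tokens and
// returns the longest candidate that passes the guardrails, or "" if none does.
std::string ExtractMergedCandidate(const std::string& matchText) {
    const std::vector<std::string> tokens = UpperTokens(matchText);
    std::string best;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        std::string merged;
        for (std::size_t n = 0; n < 3 && i + n < tokens.size(); ++n) {
            if (!HasDigit(tokens[i + n])) break;  // plain words end the run
            merged += tokens[i + n];
            if (PassesGuardrails(merged) && merged.size() > best.size()) best = merged;
        }
    }
    return best;
}
```

With these assumptions, "DL NO D123 4567" yields "D1234567", while "Driver License 1802" yields an empty string because the only digit-bearing token is too short.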

5.3 Validation

Depending on ID type:

Length/structure rules (letters + digits pattern)
Region/state rules (e.g., US DL state tables)
Checksum rules (e.g., ABN mod-89; CCN Luhn)
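
For the checksum rules, a minimal sketch of the two algorithms named above follows. Both functions expect an already-normalized, digits-only string; the function names are illustrative. The ABN check uses the published modulus-89 scheme: subtract 1 from the first digit, apply the weights 10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, and require the weighted sum to be divisible by 89.

```cpp
#include <cstddef>
#include <string>

// Luhn check over a digits-only string (used for card numbers and IMEIs).
bool LuhnValid(const std::string& digits) {
    if (digits.empty()) return false;
    int sum = 0;
    bool doubleIt = false;  // every second digit from the right is doubled
    for (auto it = digits.rbegin(); it != digits.rend(); ++it) {
        if (*it < '0' || *it > '9') return false;
        int d = *it - '0';
        if (doubleIt) {
            d *= 2;
            if (d > 9) d -= 9;
        }
        sum += d;
        doubleIt = !doubleIt;
    }
    return sum % 10 == 0;
}

// Australian Business Number check: 11 digits, weighted sum divisible by 89.
bool AbnValid(const std::string& digits) {
    static const int kWeights[11] = {10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19};
    if (digits.size() != 11) return false;
    int sum = 0;
    for (std::size_t i = 0; i < 11; ++i) {
        if (digits[i] < '0' || digits[i] > '9') return false;
        int d = digits[i] - '0';
        if (i == 0) d -= 1;  // the first digit is reduced by one before weighting
        sum += d * kWeights[i];
    }
    return sum % 89 == 0;
}
```

Run these only after length and structure checks pass, so that the checksum is the final and strictest gate.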

5.4 Context-based validation (optional but powerful)

If formats vary by state/region:

Use fileDataPtr + location in DoTestEx to pull surrounding line context
Detect state code/name near match
Validate candidate against that state’s rule set
If no state is found, fall back to an “any-state” rule set, but keep it strict
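
A sketch of the context lookup is shown below. It assumes fileDataPtr can be treated as a byte buffer of known length and that location is a byte offset into it; the actual parameter types depend on your Spirion version. The function names and the three-entry state table are hypothetical, and state codes are matched in uppercase only to avoid hits on ordinary words such as "in" or "or".

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Returns the full line of text that contains [location, location + matchLen).
std::string SurroundingLine(const char* fileData, std::size_t fileLen,
                            std::size_t location, std::size_t matchLen) {
    if (fileData == nullptr || location > fileLen) return std::string();
    std::size_t begin = location;
    while (begin > 0 && fileData[begin - 1] != '\n' && fileData[begin - 1] != '\r') --begin;
    std::size_t end = location + matchLen;
    if (end > fileLen) end = fileLen;
    while (end < fileLen && fileData[end] != '\n' && fileData[end] != '\r') ++end;
    return std::string(fileData + begin, end - begin);
}

// Returns the first known state code that appears as a maximal run of letters
// in the line, or "" if none is found.
std::string FindStateCode(const std::string& line) {
    static const char* const kStates[] = {"CA", "NY", "TX"};  // illustrative subset
    std::string word;
    for (std::size_t i = 0; i <= line.size(); ++i) {
        // A trailing sentinel space flushes the last word.
        const unsigned char c = i < line.size() ? static_cast<unsigned char>(line[i]) : ' ';
        if (std::isalpha(c)) {
            word += static_cast<char>(c);
            continue;
        }
        for (const char* s : kStates)
            if (word == s) return word;
        word.clear();
    }
    return std::string();
}
```

The returned code would then select the rule set to validate against; an empty result triggers the any-state fallback.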

6. Clean() behavior (how results should be stored)

Clean() should return the canonical representation you want stored:

Digits-only for numeric IDs (ICCID/IMEI/ABN/CCN)
Uppercase alphanumeric for mixed IDs (many DLs)

Rule of thumb: Clean() and DoTestEx() must use the same candidate extraction logic so the thing you validate is the thing you store.
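
One way to enforce this rule is to route both entry points through a single extraction function, as in the sketch below. The function names, the plain C-string parameters, and the 15-digit rule are assumptions for illustration; the exported DoTestEx, Clean, and FreeCleanedResult would be thin wrappers with whatever signatures your Spirion version requires.

```cpp
#include <cstring>
#include <string>

// Shared extraction: the single source of truth for "which ID is in this match".
static std::string ExtractImeiCandidate(const std::string& matchText) {
    std::string digits;
    for (char c : matchText)
        if (c >= '0' && c <= '9') digits += c;
    return digits;
}

// Hypothetical validation rule: exactly 15 digits (length check only).
static bool IsValidImeiCandidate(const std::string& id) {
    return id.size() == 15;
}

// What DoTestEx would delegate to.
bool ValidateMatch(const char* matchText) {
    return matchText != nullptr && IsValidImeiCandidate(ExtractImeiCandidate(matchText));
}

// What Clean() would delegate to: a heap copy of the canonical value,
// or nullptr if the match does not validate.
char* CleanMatch(const char* matchText) {
    if (matchText == nullptr) return nullptr;
    const std::string id = ExtractImeiCandidate(matchText);
    if (!IsValidImeiCandidate(id)) return nullptr;
    char* out = new char[id.size() + 1];
    std::memcpy(out, id.c_str(), id.size() + 1);  // includes the trailing NUL
    return out;
}

// What FreeCleanedResult() would delegate to.
void FreeCleaned(char* cleaned) {
    delete[] cleaned;
}
```

Because both paths call ExtractImeiCandidate, a match such as "IMEI: 49-015420-323751-8" is validated and stored as the same string, "490154203237518".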

7. Build & deployment checklist

Build

Compile as a Windows DLL
Ensure exports are visible (extern "C" __declspec(dllexport))
Match expected calling conventions/signatures used by your Spirion version

Deploy

Place the DLL where the scanner/service can load it (your org’s standard plugin directory)
Register/configure:
- Display name
- Result type (custom range)
- Any console mapping to show findings under the desired category

(Exact deployment steps vary by Spirion/Identity Finder version and your environment; the DLL interface and two-stage approach are the stable core.)

8. How to Test

Create a test suite with:

Must match

Realistic examples with label + valid ID
OCR-split variants (spaces inside the ID; chopped tokens)
“Number before label” cases if your regex supports it

Must NOT match

Label present but no valid ID
Short numeric noise (addresses, years, zip codes)
IDs failing checksum (for checksum-based detectors)
Near-misses (wrong length, invalid prefix, etc.)
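
The two lists map naturally onto a table-driven harness. The sketch below uses a stand-in IMEI detector (label present, exactly 15 digits, Luhn-valid) purely so the example is self-contained; all names and test strings are hypothetical, and in practice the detector argument would call into your own DoTestEx logic.

```cpp
#include <string>
#include <vector>

struct DetectorCase {
    std::string text;
    bool expectMatch;
};

// Stand-in detector: "IMEI" label present and exactly 15 Luhn-valid digits.
bool ImeiDetector(const std::string& text) {
    if (text.find("IMEI") == std::string::npos) return false;
    std::string digits;
    for (char c : text)
        if (c >= '0' && c <= '9') digits += c;
    if (digits.size() != 15) return false;
    int sum = 0;
    bool doubleIt = false;
    for (auto it = digits.rbegin(); it != digits.rend(); ++it) {
        int d = *it - '0';
        if (doubleIt) { d *= 2; if (d > 9) d -= 9; }
        sum += d;
        doubleIt = !doubleIt;
    }
    return sum % 10 == 0;
}

// Must-match and must-not-match cases following the two lists above.
std::vector<DetectorCase> DefaultCases() {
    return {
        {"IMEI: 490154203237518", true},     // label + valid ID
        {"IMEI 49 015420 323751 8", true},   // OCR-split variant
        {"490154203237518 IMEI", true},      // number before label
        {"IMEI: none on file", false},       // label but no ID
        {"1802 Main St", false},             // short numeric noise
        {"IMEI: 490154203237519", false},    // fails checksum
        {"IMEI: 49015420323751", false},     // near-miss: wrong length
    };
}

// Returns the inputs whose result differs from the expectation.
std::vector<std::string> RunSuite(const std::vector<DetectorCase>& cases,
                                  bool (*detector)(const std::string&)) {
    std::vector<std::string> failures;
    for (const DetectorCase& tc : cases)
        if (detector(tc.text) != tc.expectMatch) failures.push_back(tc.text);
    return failures;
}
```

Returning the failing inputs, not just a pass/fail flag, makes it easier to see which variant a regex or validation change broke.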

9. Common pitfalls

Regex too broad → floods DoTestEx with junk, slows scanning and increases false positives
Naked number regex (no labels) → high false positives unless you have strong checksum validation
No numeric boundary checks → partial matches inside longer digit sequences
Clean() extracts differently from DoTestEx() → validated value ≠ stored value → confusing results
No OCR token merging → misses real IDs from OCR’d images/PDF text