How to Build a SearchAPI

Below is practical documentation on how to build a SearchAPI/SearchDLL.

It’s based on the standard SearchDLL architecture and patterns:

  1. RegEx first-pass
  2. C++ validation/cleaning second-pass


1. What you’re building

A SearchAPI/SearchDLL is a custom C++ DLL plugin that the Spirion/Identity Finder engine loads to detect content that built-in detectors don’t cover (or don’t cover accurately).

Two-stage detection model (recommended):

  1. Stage 1 (Regex/keyword): capture candidates using labels/context (higher recall).
  2. Stage 2 (C++ validation + cleaning): extract the real ID, normalize it, and strictly validate it (high precision).


2. High-level workflow

  1. Define the identifier
    • Name, purpose, examples (positive/negative)
    • Format rules (length, allowed chars, prefixes)
    • Checksum rules (Luhn, mod-89, etc.) if applicable
    • Context labels (e.g., “DL”, “Driver’s License #”, “ABN”, “ICCID”)
  2. Design Stage-1 regex (context-anchored)
    • Prefer labels + same-line capture
    • Support “label before number” and “number before label” when needed
    • Use numeric boundaries like (?<!\d) and (?!\d) to avoid partial matches
  3. Implement Stage-2 C++
    • Tokenize match text, ignore label words
    • Reconstruct OCR-split IDs by merging tokens
    • Validate format/state/checksum
    • Implement Clean() to return a canonical normalized value
  4. Compile the DLL
  5. Deploy/configure in Spirion console
  6. Test and tune your code
    • Ensure true positives match
    • Ensure false positives are rejected
    • Validate OCR/noisy variants


3. Required exported functions (the “plugin contract”)

A typical SearchDLL exports these functions (names and signatures must match what the engine expects):

  • GetDisplayName()
    • Returns the detector name displayed in the console.
  • GetResultType()
    • Returns a numeric result type ID (must be in your custom range and mapped/configured).
  • GetSearchItemData(SearchItemData*& pData)
    • Provides the pattern used for the engine’s first-pass search:
      • dataType = 1 → regex
      • dataType = 2 → keyword
      • pData->data → the regex/keyword string
  • DeleteSearchItemData(SearchItemData* pData)
    • Frees what you allocated in GetSearchItemData.
  • DoTest() / DoTestEx()
    • Called after a match is found; you return true only if it’s a valid finding.
    • Best practice: put all logic in DoTestEx and have DoTest call into it.
  • Clean() / FreeCleanedResult()
    • Normalizes the match into a canonical representation (digits-only, uppercase alnum, etc.) and frees memory.
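To make the contract concrete, here is a minimal sketch of the export surface. The real `SearchItemData` layout, return types, and calling conventions come from your Spirion SDK headers and version, so everything below (the struct fields, the `bool` return, the result-type value 50001) is an assumption for illustration only:

```cpp
#include <cstring>
#include <string>

// Assumed layout — the real struct is defined by the Spirion SDK headers.
struct SearchItemData {
    int   dataType;   // 1 = regex, 2 = keyword (assumed encoding)
    char* data;       // pattern string owned by the DLL
};

// On Windows these would be declared: extern "C" __declspec(dllexport)
extern "C" {

const char* GetDisplayName() {
    return "Example Driver's License Detector";
}

int GetResultType() {
    return 50001;   // hypothetical custom-range result type ID
}

bool GetSearchItemData(SearchItemData*& pData) {
    pData = new SearchItemData;
    pData->dataType = 1;   // regex first pass
    const std::string pattern = "(?i)driver'?s?\\s*licen[cs]e[^\\r\\n]{0,80}";
    pData->data = new char[pattern.size() + 1];
    std::strcpy(pData->data, pattern.c_str());
    return true;
}

void DeleteSearchItemData(SearchItemData* pData) {
    if (pData) {
        delete[] pData->data;   // free exactly what GetSearchItemData allocated
        delete pData;
    }
}

} // extern "C"
```

Note that `DeleteSearchItemData` mirrors the allocations made in `GetSearchItemData`; mismatched allocation/deallocation across the DLL boundary is a common source of crashes.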


4. Build the Stage-1 regex correctly

Core rules

  • Make it case-insensitive: (?i)
  • Anchor to labels/context (avoid naked numbers)
  • Keep matches on the same logical line with [^\r\n]*?
  • Avoid partial numeric overlaps with (?<!\d) and (?!\d)
  • If label can be before or after the number, include both orders

Example pattern shapes

Label before OR after number (ICCID/IMEI style):

  • LABEL ... NUMBER
  • NUMBER ... LABEL

Driver’s License label-first context capture:

  • Capture label + up to ~80 chars of same line
  • Let C++ extract/validate actual ID from that context (state-specific formats, OCR splits)
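The label-before-OR-after shape can be prototyped locally before handing the pattern to the engine. One caveat: `std::regex` (ECMAScript grammar) supports neither the inline `(?i)` flag nor lookbehind, so this sketch substitutes `std::regex::icase` and `\b` word boundaries; the pattern you ship to the engine would use `(?i)` and the `(?<!\d)`/`(?!\d)` guards described above. The ICCID label and 18–22 digit length are illustrative:

```cpp
#include <regex>
#include <string>

// Stage-1 shape: "LABEL ... NUMBER | NUMBER ... LABEL", kept on one line.
// std::regex has no inline (?i) or lookbehind, so we use the icase flag and
// \b here; the engine-facing pattern would use (?<!\d)/(?!\d) instead.
bool stage1Match(const std::string& line) {
    static const std::regex re(
        R"(ICCID[^\r\n]{0,40}?\b\d{18,22}\b|\b\d{18,22}\b[^\r\n]{0,40}?ICCID)",
        std::regex::icase);
    return std::regex_search(line, re);
}
```

Because `\b\d{18,22}\b` requires boundaries on both sides, a 24-digit run will not yield a partial 22-digit match — the same property the lookaround guards provide in the engine pattern.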


5. Stage-2 C++ validation patterns (what to implement)

5.1 Normalize and tokenize

Common helper logic:

  • ToUpperCopy()
  • NormalizeId() → keep only letters/digits; uppercase
  • TokenizeAlnum() → split into alphanumeric tokens
  • IsIgnoredWord() → drop tokens like “DL”, “LICENSE”, “NO”, “DRIVER”, etc.
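A minimal sketch of these helpers (the ignored-word list is illustrative — extend it per detector):

```cpp
#include <cctype>
#include <string>
#include <unordered_set>
#include <vector>

std::string ToUpperCopy(const std::string& s) {
    std::string out(s);
    for (char& c : out)
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    return out;
}

// Keep only letters/digits, uppercased.
std::string NormalizeId(const std::string& s) {
    std::string out;
    for (unsigned char c : s)
        if (std::isalnum(c)) out += static_cast<char>(std::toupper(c));
    return out;
}

// Split into runs of alphanumeric characters.
std::vector<std::string> TokenizeAlnum(const std::string& s) {
    std::vector<std::string> tokens;
    std::string cur;
    for (unsigned char c : s) {
        if (std::isalnum(c)) cur += static_cast<char>(c);
        else if (!cur.empty()) { tokens.push_back(cur); cur.clear(); }
    }
    if (!cur.empty()) tokens.push_back(cur);
    return tokens;
}

// Label words to drop before candidate extraction.
bool IsIgnoredWord(const std::string& token) {
    static const std::unordered_set<std::string> kIgnored{
        "DL", "DRIVER", "DRIVERS", "LICENSE", "LICENCE", "NO", "NUM", "NUMBER"};
    return kIgnored.count(ToUpperCopy(token)) > 0;
}
```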

5.2 Candidate extraction (handles OCR splitting)

Typical approach:

  • From the match text, produce tokens
  • Filter out ignored words (and optionally state codes)
  • Try:
    • single-token candidates
    • 2-token merges
    • 3-token merges (for heavily chopped OCR)
  • Pick the best candidate using guardrails:
    • must contain at least one digit
    • reject overly short all-digit values (avoid addresses like “1802”)
    • upper bounds for sanity
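The merge-and-guardrail approach can be sketched as below. The thresholds (minimum 5 for all-digit values, 4–20 length bounds) and the "longest plausible candidate wins" tiebreak are illustrative choices, not fixed rules; real detectors tune them per ID type. Helpers are inlined so the example stands alone:

```cpp
#include <cctype>
#include <string>
#include <unordered_set>
#include <vector>

namespace {
std::vector<std::string> Tokens(const std::string& s) {
    std::vector<std::string> t; std::string cur;
    for (unsigned char c : s) {
        if (std::isalnum(c)) cur += static_cast<char>(std::toupper(c));
        else if (!cur.empty()) { t.push_back(cur); cur.clear(); }
    }
    if (!cur.empty()) t.push_back(cur);
    return t;
}
bool Ignored(const std::string& t) {
    static const std::unordered_set<std::string> k{
        "DL", "DRIVER", "DRIVERS", "LICENSE", "LICENCE", "NO", "NUMBER"};
    return k.count(t) > 0;
}
bool Plausible(const std::string& c) {
    bool hasDigit = false, allDigit = true;
    for (unsigned char ch : c) {
        if (std::isdigit(ch)) hasDigit = true; else allDigit = false;
    }
    if (!hasDigit) return false;                 // must contain a digit
    if (allDigit && c.size() < 5) return false;  // reject "1802"-style noise
    return c.size() >= 4 && c.size() <= 20;      // sanity bounds (illustrative)
}
}  // namespace

// Try single tokens, then 2- and 3-token merges (OCR often splits one ID
// across several tokens); keep the longest plausible candidate.
std::string ExtractCandidate(const std::string& matchText) {
    std::vector<std::string> t;
    for (const std::string& tok : Tokens(matchText))
        if (!Ignored(tok)) t.push_back(tok);
    std::string best;
    for (size_t span = 1; span <= 3; ++span) {
        for (size_t i = 0; i + span <= t.size(); ++i) {
            std::string merged;
            for (size_t j = 0; j < span; ++j) merged += t[i + j];
            if (Plausible(merged) && merged.size() > best.size()) best = merged;
        }
    }
    return best;   // empty if nothing acceptable
}
```

On input like `"DL: D123 4567 89"` the 3-token merge reconstructs `D123456789`, while `"DL number 1802"` yields nothing because the lone all-digit token is too short.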

5.3 Validation

Depending on ID type:

  • Length/structure rules (letters + digits pattern)
  • Region/state rules (e.g., US DL state tables)
  • Checksum rules (e.g., ABN mod-89; CCN Luhn)
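As a concrete checksum example, here is the Luhn mod-10 check used by credit card numbers and IMEIs (ABN mod-89 follows the same shape with different weights and modulus):

```cpp
#include <cctype>
#include <string>

// Luhn mod-10: walking right to left, double every second digit, subtract 9
// from doubled values above 9, and require the total to be divisible by 10.
bool LuhnValid(const std::string& digits) {
    if (digits.empty()) return false;
    int sum = 0;
    bool doubleIt = false;   // rightmost digit is not doubled
    for (size_t i = digits.size(); i-- > 0; ) {
        if (!std::isdigit(static_cast<unsigned char>(digits[i]))) return false;
        int d = digits[i] - '0';
        if (doubleIt) { d *= 2; if (d > 9) d -= 9; }
        sum += d;
        doubleIt = !doubleIt;
    }
    return sum % 10 == 0;
}
```

Run this on the normalized (digits-only) candidate, not on the raw match text.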

5.4 Context-based validation (optional but powerful)

If formats vary by state/region:

  • Use fileDataPtr + location in DoTestEx to pull surrounding line context
  • Detect state code/name near match
  • Validate candidate against that state’s rule set
  • If no state found: “any-state” fallback, but still strict
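A sketch of state-keyed validation with a strict any-state fallback. The three per-state formats shown (CA: 1 letter + 7 digits; NY: 9 digits; FL: 1 letter + 12 digits) are illustrative entries, not an authoritative state table:

```cpp
#include <regex>
#include <string>
#include <unordered_map>

// Validate a normalized candidate against the detected state's rule.
// If no state context was found, fall back to "matches at least one
// known state format" — still strict, never anything-goes.
bool ValidForState(const std::string& state, const std::string& candidate) {
    static const std::unordered_map<std::string, std::regex> kRules = {
        {"CA", std::regex(R"([A-Z]\d{7})")},    // 1 letter + 7 digits
        {"NY", std::regex(R"(\d{9})")},         // 9 digits
        {"FL", std::regex(R"([A-Z]\d{12})")},   // 1 letter + 12 digits
    };
    auto it = kRules.find(state);
    if (it == kRules.end()) {
        for (const auto& kv : kRules)           // any-state fallback
            if (std::regex_match(candidate, kv.second)) return true;
        return false;
    }
    return std::regex_match(candidate, it->second);
}
```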


6. Clean() behavior (how results should be stored)

Clean() should return the canonical representation you want stored:

  • Digits-only for numeric IDs (ICCID/IMEI/ABN/CCN)
  • Uppercase alphanumeric for mixed IDs (many DLs)

Rule of thumb: Clean() and DoTestEx() must use the same candidate extraction logic so the thing you validate is the thing you store.
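A minimal sketch of the Clean/Free pair. The `char*` signatures and the ownership convention (the DLL allocates, `FreeCleanedResult` releases) are assumptions — match your SDK headers:

```cpp
#include <cctype>
#include <cstring>
#include <string>

// Canonicalize: keep letters/digits, uppercase. For purely numeric IDs
// (ICCID/IMEI/ABN/CCN) you would keep digits only instead.
static std::string Canonical(const std::string& raw) {
    std::string out;
    for (unsigned char c : raw)
        if (std::isalnum(c)) out += static_cast<char>(std::toupper(c));
    return out;
}

extern "C" char* Clean(const char* matchText) {
    // In a real detector, run the SAME candidate extraction used by
    // DoTestEx() here, so the validated value is the stored value.
    std::string canon = Canonical(matchText ? matchText : "");
    char* out = new char[canon.size() + 1];
    std::strcpy(out, canon.c_str());
    return out;
}

extern "C" void FreeCleanedResult(char* p) {
    delete[] p;
}
```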


7. Build & deployment checklist

Build

  • Compile as a Windows DLL
  • Ensure exports are visible (extern "C" __declspec(dllexport))
  • Match expected calling conventions/signatures used by your Spirion version

Deploy

  • Place the DLL where the scanner/service can load it (your org’s standard plugin directory)
  • Register/configure:
    • Display name
    • Result type (custom range)
    • Any console mapping to show findings under the desired category

(Exact deployment steps vary by Spirion/Identity Finder version and your environment; the DLL interface and two-stage approach are the stable core.)


8. How to Test

Create a test suite with:

Must match

  • Realistic examples with label + valid ID
  • OCR-split variants (spaces inside the ID; chopped tokens)
  • “Number before label” cases if your regex supports it

Must NOT match

  • Label present but no valid ID
  • Short numeric noise (addresses, years, zip codes)
  • IDs failing checksum (for checksum-based detectors)
  • Near-misses (wrong length, invalid prefix, etc.)
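The suite above is naturally table-driven: pair each sample with its expected verdict and count mismatches. The `DetectorStub` below (1 letter + 7 digits as a standalone token) is a hypothetical stand-in — a real suite would call your exported `DoTestEx`/`Clean` instead:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in validator: a 1-letter + 7-digit standalone token.
bool DetectorStub(const std::string& text) {
    for (size_t i = 0; i + 8 <= text.size(); ++i) {
        if (!std::isalpha(static_cast<unsigned char>(text[i]))) continue;
        bool digitsOk = true;
        for (size_t j = 1; j <= 7; ++j)
            if (!std::isdigit(static_cast<unsigned char>(text[i + j]))) {
                digitsOk = false; break;
            }
        bool leftOk  = (i == 0) ||
            !std::isalnum(static_cast<unsigned char>(text[i - 1]));
        bool rightOk = (i + 8 == text.size()) ||
            !std::isalnum(static_cast<unsigned char>(text[i + 8]));
        if (digitsOk && leftOk && rightOk) return true;
    }
    return false;
}

// Table-driven suite: must-match and must-NOT-match cases side by side.
int RunSuite() {
    struct Case { const char* text; bool expect; };
    const std::vector<Case> cases = {
        {"DL: A1234567 issued 2020", true},   // label + valid ID
        {"Driver's License A1234567", true},  // alternate label
        {"DL number missing here", false},    // label, no valid ID
        {"Apt 1802, built 1999", false},      // short numeric noise
        {"ID A123456 too short", false},      // near-miss length
    };
    int failures = 0;
    for (const Case& c : cases)
        if (DetectorStub(c.text) != c.expect) ++failures;
    return failures;   // 0 means every case behaved as expected
}
```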


Common Pitfalls

  • Regex too broad → floods DoTestEx with junk, slows scanning and increases false positives
  • Naked number regex (no labels) → high false positives unless you have strong checksum validation
  • No numeric boundary checks → partial matches inside longer digit sequences
  • Clean differs from validation → validated value ≠ stored value → confusing results
  • No OCR token merging → misses real IDs from OCR’d images/PDF text


