How to Build a SearchAPI
Below is practical documentation on how to build a SearchAPI/SearchDLL.
It’s based on the standard SearchDLL architecture and patterns:
- RegEx first-pass
- C++ validation/cleaning second-pass
1. What you’re building
A SearchAPI/SearchDLL is a custom C++ DLL plugin that the Spirion/Identity Finder engine loads to detect content that built-in detectors don’t cover (or don’t cover accurately).
Two-stage detection model (recommended):
- Stage 1 (Regex/keyword): capture candidates using labels/context (higher recall).
- Stage 2 (C++ validation + cleaning): extract the real ID, normalize it, and strictly validate it (high precision).
2. High-level workflow
- Define the identifier
- Name, purpose, examples (positive/negative)
- Format rules (length, allowed chars, prefixes)
- Checksum rules (Luhn, mod-89, etc.) if applicable
- Context labels (e.g., “DL”, “Driver’s License #”, “ABN”, “ICCID”)
- Design Stage-1 regex (context-anchored)
- Prefer labels + same-line capture
- Support “label before number” and “number before label” when needed
- Use numeric boundaries like (?<!\d) and (?!\d) to avoid partial matches
- Implement Stage-2 C++
- Tokenize match text, ignore label words
- Reconstruct OCR-split IDs by merging tokens
- Validate format/state/checksum
- Implement Clean() to return a canonical normalized value
- Compile the DLL
- Deploy/configure in Spirion console
- Test and tune your code
- Ensure true positives match
- Ensure false positives are rejected
- Validate OCR/noisy variants
3. Required exported functions (the “plugin contract”)
A typical SearchDLL exports these functions (names and signatures must match what the engine expects):
GetDisplayName()
- Returns the detector name displayed in the console.
GetResultType()
- Returns a numeric result type ID (must be in your custom range and mapped/configured).
GetSearchItemData(SearchItemData*& pData)
- Provides the pattern used for the engine’s first-pass search:
- dataType = 1 → regex
- dataType = 2 → keyword
- pData->data → the regex/keyword string
DeleteSearchItemData(SearchItemData* pData)
- Frees what you allocated in GetSearchItemData.
DoTest() / DoTestEx()
- Called after a match is found; you return true only if it’s a valid finding.
- Best practice: put all logic in DoTestEx and have DoTest call into it.
Clean() / FreeCleanedResult()
- Normalizes the match into a canonical representation (digits-only, uppercase alnum, etc.) and frees memory.
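The contract above could be sketched roughly as follows. This is a minimal illustration, not the engine's actual header: the SearchItemData layout (only dataType and data are named in this document), the return types, the DoTest/DoTestEx parameter lists, the result type value 9001, and the ICCID pattern are all assumptions. Take the real declarations and calling convention from the SDK for your Spirion version.

```cpp
#include <cstddef>
#include <cstring>

// On Windows this expands to the dllexport form; elsewhere it only fixes linkage.
#if defined(_WIN32)
#  define SEARCH_API extern "C" __declspec(dllexport)
#else
#  define SEARCH_API extern "C"
#endif

// ASSUMPTION: hypothetical layout. Use the definition from your SDK header.
struct SearchItemData {
    int   dataType;  // 1 = regex, 2 = keyword
    char* data;      // pattern string allocated by this DLL
};

// ASSUMPTION: illustrative Stage-1 pattern (label, same-line filler, bounded digits).
static const char kPattern[] =
    R"re((?i)ICCID[^\r\n]*?(?<!\d)\d{19,20}(?!\d))re";

SEARCH_API const char* GetDisplayName() { return "Example ICCID"; }

// ASSUMPTION: 9001 stands in for an ID from your configured custom range.
SEARCH_API int GetResultType() { return 9001; }

SEARCH_API void GetSearchItemData(SearchItemData*& pData) {
    pData = new SearchItemData;
    pData->dataType = 1;  // regex
    pData->data = new char[sizeof(kPattern)];
    std::memcpy(pData->data, kPattern, sizeof(kPattern));
}

// Frees exactly what GetSearchItemData allocated.
SEARCH_API void DeleteSearchItemData(SearchItemData* pData) {
    if (pData == nullptr) return;
    delete[] pData->data;
    delete pData;
}

// ASSUMPTION: hypothetical parameters. All validation logic lives here.
SEARCH_API bool DoTestEx(const char* match, std::size_t matchLen,
                         const char* fileDataPtr, std::size_t location) {
    (void)fileDataPtr;  // context is unused in this sketch
    (void)location;
    std::size_t digits = 0;
    for (std::size_t i = 0; i < matchLen; ++i) {
        if (match[i] >= '0' && match[i] <= '9') ++digits;
    }
    return digits >= 19 && digits <= 20;  // placeholder length rule only
}

// DoTest simply forwards to DoTestEx, as recommended above.
SEARCH_API bool DoTest(const char* match, std::size_t matchLen) {
    return DoTestEx(match, matchLen, nullptr, 0);
}
```

Keeping allocation and release in the same DLL (new[] paired with delete[]) avoids freeing memory across module boundaries.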
4. Build the Stage-1 regex correctly
Core rules
- Make it case-insensitive: (?i)
- Anchor to labels/context (avoid naked numbers)
- Keep matches on the same logical line with [^\r\n]*?
- Avoid partial numeric overlaps with (?<!\d) and (?!\d)
- If label can be before or after the number, include both orders
Example pattern shapes
Label before OR after number (ICCID/IMEI style):
LABEL ... NUMBER
NUMBER ... LABEL
Driver’s License label-first context capture:
- Capture label + up to ~80 chars of same line
- Let C++ extract/validate actual ID from that context (state-specific formats, OCR splits)
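One way to assemble a pattern that follows the core rules is to build it from parts. The helper names below (LabelGroup, BuildStage1Pattern) are hypothetical, labels are assumed to be regex-escaped already, and the result assumes the engine's regex flavor accepts (?i) and lookbehind as used in this document. Note that std::regex does not support lookbehind, so this sketch only builds the string and does not run it.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Joins label alternatives into a non-capturing group, e.g. (?:ICCID|IMEI).
std::string LabelGroup(const std::vector<std::string>& labels) {
    std::string out = "(?:";
    for (std::size_t i = 0; i < labels.size(); ++i) {
        if (i != 0) out += '|';
        out += labels[i];
    }
    return out + ")";
}

// Builds (?i)(?:LABEL filler NUMBER|NUMBER filler LABEL), where filler stays
// on one line and NUMBER is a digit run with numeric boundaries on both sides.
std::string BuildStage1Pattern(const std::vector<std::string>& labels,
                               int minDigits, int maxDigits) {
    const std::string label  = LabelGroup(labels);
    const std::string filler = R"re([^\r\n]*?)re";
    const std::string number = R"re((?<!\d)\d{)re" + std::to_string(minDigits) +
                               "," + std::to_string(maxDigits) +
                               R"re(}(?!\d))re";
    return "(?i)(?:" + label + filler + number + "|" +
           number + filler + label + ")";
}
```

For example, BuildStage1Pattern({"ICCID", "IMEI"}, 15, 20) yields a single pattern covering both "label before number" and "number before label" on the same line.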
5. Stage-2 C++ validation patterns (what to implement)
5.1 Normalize and tokenize
Common helper logic:
- ToUpperCopy()
- NormalizeId() → keep only letters/digits; uppercase
- TokenizeAlnum() → split into alphanumeric tokens
- IsIgnoredWord() → drop tokens like “DL”, “LICENSE”, “NO”, “DRIVER”, etc.
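A minimal sketch of these helpers, assuming single-byte (ASCII) match text. The function names come from the list above; their signatures and the exact ignored-word list are assumptions for illustration.

```cpp
#include <cctype>
#include <string>
#include <unordered_set>
#include <vector>

// Returns an uppercased copy of s.
std::string ToUpperCopy(const std::string& s) {
    std::string out = s;
    for (char& c : out) {
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    return out;
}

// Keeps only letters and digits, uppercased.
std::string NormalizeId(const std::string& s) {
    std::string out;
    for (unsigned char c : s) {
        if (std::isalnum(c)) out += static_cast<char>(std::toupper(c));
    }
    return out;
}

// Splits s into runs of alphanumeric characters.
std::vector<std::string> TokenizeAlnum(const std::string& s) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : s) {
        if (std::isalnum(c)) {
            current += static_cast<char>(c);
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

// True for label words that should never be treated as part of an ID.
// ASSUMPTION: illustrative list; extend it for your detector's labels.
bool IsIgnoredWord(const std::string& token) {
    static const std::unordered_set<std::string> kIgnored = {
        "DL", "LICENSE", "NO", "DRIVER", "DRIVERS", "NUMBER"};
    return kIgnored.count(ToUpperCopy(token)) != 0;
}
```

With these, TokenizeAlnum("DL No: A123 4567") produces the tokens DL, No, A123, and 4567, and IsIgnoredWord filters out the first two.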
5.2 Candidate extraction (handles OCR splitting)
Typical approach:
- From the match text, produce tokens
- Filter out ignored words (and optionally state codes)
- Try:
- single-token candidates
- 2-token merges
- 3-token merges (for heavily chopped OCR)
- Pick the best candidate using guardrails:
- must contain at least one digit
- reject overly short all-digit values (avoid addresses like “1802”)
- upper bounds for sanity
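The approach above can be sketched as a merge over adjacent tokens. The length bounds and the "longest candidate wins" rule are assumptions chosen for illustration; a real detector would more likely rank candidates by how well they satisfy the format rules for the specific ID. PassesGuardrails and BestCandidate are hypothetical names.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// ASSUMPTION: illustrative bounds; tune per identifier type.
constexpr std::size_t kMinAllDigitLen = 6;   // rejects values like "1802"
constexpr std::size_t kMaxCandidateLen = 16; // sanity upper bound

static bool IsDigitChar(unsigned char c) { return std::isdigit(c) != 0; }

// Applies the guardrails: non-empty, bounded length, at least one digit,
// and not a short all-digit value.
bool PassesGuardrails(const std::string& c) {
    if (c.empty() || c.size() > kMaxCandidateLen) return false;
    if (!std::any_of(c.begin(), c.end(), IsDigitChar)) return false;
    const bool allDigits = std::all_of(c.begin(), c.end(), IsDigitChar);
    if (allDigits && c.size() < kMinAllDigitLen) return false;
    return true;
}

// tokens: alphanumeric tokens with label words already filtered out.
// Tries every run of 1, 2, or 3 adjacent tokens and returns the longest
// merged value that passes the guardrails, or an empty string if none does.
std::string BestCandidate(const std::vector<std::string>& tokens) {
    std::string best;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        std::string merged;
        for (std::size_t n = 0; n < 3 && i + n < tokens.size(); ++n) {
            merged += tokens[i + n];
            if (PassesGuardrails(merged) && merged.size() > best.size()) {
                best = merged;
            }
        }
    }
    return best;
}
```

Because the longest merge wins here, trailing tokens that are not part of the ID can be absorbed; the strict validation step that follows is what should reject those cases.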
5.3 Validation
Depending on ID type:
- Length/structure rules (letters + digits pattern)
- Region/state rules (e.g., US DL state tables)
- Checksum rules (e.g., ABN mod-89; CCN Luhn)
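The two checksums named above can be sketched as follows. Both functions expect a digits-only string (already cleaned); the function names are hypothetical. The ABN check follows the published algorithm: subtract 1 from the first digit, apply the weights 10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, and require the weighted sum to be divisible by 89.

```cpp
#include <cstddef>
#include <string>

// Luhn check: walking from the rightmost digit, double every second digit,
// subtract 9 from any doubled value above 9, and require sum % 10 == 0.
bool LuhnValid(const std::string& digits) {
    if (digits.empty()) return false;
    int sum = 0;
    bool doubleIt = false;
    for (std::size_t i = digits.size(); i-- > 0;) {
        const char c = digits[i];
        if (c < '0' || c > '9') return false;
        int d = c - '0';
        if (doubleIt) {
            d *= 2;
            if (d > 9) d -= 9;
        }
        sum += d;
        doubleIt = !doubleIt;
    }
    return sum % 10 == 0;
}

// ABN check: 11 digits, weighted sum (first digit minus 1) divisible by 89.
bool AbnValid(const std::string& digits) {
    static const int kWeights[11] = {10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19};
    if (digits.size() != 11) return false;
    int sum = 0;
    for (std::size_t i = 0; i < 11; ++i) {
        const char c = digits[i];
        if (c < '0' || c > '9') return false;
        int d = c - '0';
        if (i == 0) d -= 1;
        sum += d * kWeights[i];
    }
    return sum % 89 == 0;
}
```

For instance, the commonly used test card number 4111111111111111 passes LuhnValid, and 51824753556 passes AbnValid, while changing the last digit of either makes the check fail.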
5.4 Context-based validation (optional but powerful)
If formats vary by state/region:
- Use fileDataPtr + location in DoTestEx to pull surrounding line context
- Detect state code/name near match
- Validate candidate against that state’s rule set
- If no state found: “any-state” fallback, but still strict
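A sketch of the context lookup, under several assumptions: the file data is a single-byte text buffer of known size, location is a byte offset of the match within it, and only exact uppercase two-letter tokens count as state codes (so ordinary words such as "in" or "or" are not mistaken for states). LineAround and FindStateCode are hypothetical helpers, and the state table is truncated to a small sample.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_set>

// Returns the line of text containing byte offset `location`.
std::string LineAround(const char* fileDataPtr, std::size_t size,
                       std::size_t location) {
    if (fileDataPtr == nullptr || location >= size) return {};
    std::size_t begin = location;
    while (begin > 0 && fileDataPtr[begin - 1] != '\n' &&
           fileDataPtr[begin - 1] != '\r') {
        --begin;
    }
    std::size_t end = location;
    while (end < size && fileDataPtr[end] != '\n' && fileDataPtr[end] != '\r') {
        ++end;
    }
    return std::string(fileDataPtr + begin, end - begin);
}

// Returns the first token that exactly matches a known state code, or "".
std::string FindStateCode(const std::string& line) {
    // ASSUMPTION: sample only; a real table lists every supported state.
    static const std::unordered_set<std::string> kStates = {"CA", "NY", "TX", "FL"};
    std::string token;
    for (std::size_t i = 0; i <= line.size(); ++i) {
        // A virtual trailing space flushes the final token.
        const unsigned char c =
            i < line.size() ? static_cast<unsigned char>(line[i]) : ' ';
        if (std::isalnum(c)) {
            token += static_cast<char>(c);
            continue;
        }
        if (token.size() == 2 && kStates.count(token) != 0) return token;
        token.clear();
    }
    return {};
}
```

When FindStateCode returns an empty string, the caller would fall back to the stricter "any-state" rule set described above.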
6. Clean() behavior (how results should be stored)
Clean() should return the canonical representation you want stored:
- Digits-only for numeric IDs (ICCID/IMEI/ABN/CCN)
- Uppercase alphanumeric for mixed IDs (many DLs)
Rule of thumb: Clean() and DoTestEx() must use the same candidate extraction logic so the thing you validate is the thing you store.
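The rule of thumb can be enforced structurally by routing both functions through one extraction helper. In the sketch below, the signatures are hypothetical (export decoration is omitted for brevity), ExtractCandidate and IsValidCandidate are illustrative names, and the digits-only extraction with a 15 to 20 digit length rule is a placeholder for your detector's real logic.

```cpp
#include <cctype>
#include <cstddef>
#include <cstring>
#include <string>

// Single source of truth for "what is the ID inside this match?".
// Placeholder logic: keep digits only.
static std::string ExtractCandidate(const char* match, std::size_t len) {
    std::string digits;
    for (std::size_t i = 0; i < len; ++i) {
        if (std::isdigit(static_cast<unsigned char>(match[i]))) digits += match[i];
    }
    return digits;
}

// Placeholder validation (ASSUMPTION: 15 to 20 digits).
static bool IsValidCandidate(const std::string& c) {
    return c.size() >= 15 && c.size() <= 20;
}

// Validation uses the shared extraction...
bool DoTestEx(const char* match, std::size_t len) {
    return IsValidCandidate(ExtractCandidate(match, len));
}

// ...and so does cleaning, so the stored value is the validated value.
// Returns a heap-allocated NUL-terminated string, or nullptr if invalid.
char* Clean(const char* match, std::size_t len) {
    const std::string c = ExtractCandidate(match, len);
    if (!IsValidCandidate(c)) return nullptr;
    char* out = new char[c.size() + 1];
    std::memcpy(out, c.c_str(), c.size() + 1);
    return out;
}

// Releases a buffer returned by Clean.
void FreeCleanedResult(char* cleaned) { delete[] cleaned; }
```

For a match such as "IMEI: 49-015420-323751-8", DoTestEx accepts it and Clean returns the digits-only form 490154203237518.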
7. Build & deployment checklist
Build
- Compile as a Windows DLL
- Ensure exports are visible (extern "C" __declspec(dllexport))
- Match expected calling conventions/signatures used by your Spirion version
Deploy
- Place the DLL where the scanner/service can load it (your org’s standard plugin directory)
- Register/configure:
- Display name
- Result type (custom range)
- Any console mapping to show findings under the desired category
(Exact deployment steps vary by Spirion/Identity Finder version and your environment; the DLL interface and two-stage approach are the stable core.)
8. How to Test
Create a test suite with:
Must match
- Realistic examples with label + valid ID
- OCR-split variants (spaces inside the ID; chopped tokens)
- “Number before label” cases if your regex supports it
Must NOT match
- Label present but no valid ID
- Short numeric noise (addresses, years, zip codes)
- IDs failing checksum (for checksum-based detectors)
- Near-misses (wrong length, invalid prefix, etc.)
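Such a suite can be kept as a small table of cases run against the detector. The sketch below uses a stand-in detector (an "IMEI" label, exactly 15 digits, and a Luhn check) purely so the table has something to exercise; in practice the callable would wrap your own DoTestEx. All names here are hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

struct TestCase {
    std::string text;
    bool shouldMatch;
};

// Runs the detector over every case and returns the number of mismatches.
int CountFailures(const std::vector<TestCase>& cases,
                  const std::function<bool(const std::string&)>& detector) {
    int failures = 0;
    for (const TestCase& tc : cases) {
        if (detector(tc.text) != tc.shouldMatch) {
            ++failures;
            std::fprintf(stderr, "FAIL: \"%s\"\n", tc.text.c_str());
        }
    }
    return failures;
}

// Stand-in detector: label present, exactly 15 digits, Luhn-valid.
bool ExampleDetector(const std::string& text) {
    if (text.find("IMEI") == std::string::npos) return false;
    std::string digits;
    for (unsigned char c : text) {
        if (std::isdigit(c)) digits += static_cast<char>(c);
    }
    if (digits.size() != 15) return false;
    int sum = 0;
    bool doubleIt = false;
    for (std::size_t i = digits.size(); i-- > 0;) {
        int d = digits[i] - '0';
        if (doubleIt) {
            d *= 2;
            if (d > 9) d -= 9;
        }
        sum += d;
        doubleIt = !doubleIt;
    }
    return sum % 10 == 0;
}

// One entry per bullet in the "must match" and "must NOT match" lists.
std::vector<TestCase> ExampleCases() {
    return {
        {"IMEI: 490154203237518", true},      // label + valid ID
        {"IMEI: 49 015420 323751 8", true},   // OCR-split variant
        {"490154203237518 (IMEI)", true},     // number before label
        {"IMEI: not provided", false},        // label but no ID
        {"IMEI: 1802", false},                // short numeric noise
        {"IMEI: 490154203237519", false},     // checksum failure
        {"IMEI: 4901542032375180", false},    // near-miss: wrong length
    };
}
```

Calling CountFailures(ExampleCases(), ExampleDetector) should report zero failures; any nonzero count points at a regression in either the regex assumptions or the validation logic.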
Common Pitfalls
- Regex too broad → floods DoTestEx with junk, slows scanning and increases false positives
- Naked number regex (no labels) → high false positives unless you have strong checksum validation
- No numeric boundary checks → partial matches inside longer digit sequences
- Clean differs from validation → validated value ≠ stored value → confusing results
- No OCR token merging → misses real IDs from OCR’d images/PDF text