Using Custom Data Types: Regular Expression (RegEx) Examples

Using the RegEx custom data type, organizations can configure search criteria to meet their specific needs.

About Regular Expressions (RegEx)

Spirion includes the ability to create custom data types for exact-match, proprietary, and patterned data using Keywords and Regular Expressions.

  • Using the RegEx custom data type, organizations can configure search criteria to meet their specific needs.
  • Regular expressions can be run directly from the Client interface, or via a Console policy. 

What is a Regular Expression?

A regular expression (regex or regexp) is a specially formatted text string that describes a search pattern.

  • Regular Expressions are a compact way of describing data and the many forms it can take by including:
    • Booleans
    • Grouping
    • Quantifiers
    • Wildcards

Learning how to create regular expressions can be fun and challenging. Many free tutorials are available on the internet and can be found by searching for terms such as "regular expression tutorial", "learning regular expressions", or "introduction to regular expressions".

A few that we have found useful are:

Spirion Regular Expression Implementation

There are dozens of regular expression libraries that provide various features.

  • Spirion uses the Boost library with Perl syntax, which is one of the few libraries with a full feature set.
  • Each application implements its regex library in its own way, and some flags or modifiers may be enabled by default.
  • These default flags can be overridden with specific syntax when they conflict with the desired functionality.

Spirion-enabled flags:

  • Global
  • Insensitive
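
When experimenting outside of Spirion, the two default flags can be approximated in most regex tools. The following is a minimal sketch in Python. Python's built-in re module is a different engine from Boost and is used here only for illustration; the sample text and the find_all helper are assumptions made for this example.

```python
import re

def find_all(pattern: str, text: str) -> list[str]:
    """Return every match in the text while ignoring letter case."""
    # re.finditer walks the whole text, which approximates the Global flag.
    # re.IGNORECASE approximates the Insensitive flag.
    return [m.group(0) for m in re.finditer(pattern, text, re.IGNORECASE)]

sample = "Confidential memo. See the CONFIDENTIAL annex; confidential until 2030."
print(find_all(r"confidential", sample))
```

With both behaviors enabled, all three spellings of the word are returned rather than only the first or only the lower-case one.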

Example: Passwords with Specific Requirements

The following expression searches for passwords with specific requirements that require case sensitivity.

Expression

This regular expression searches for strings that contain: at least one digit, at least one lower case letter, at least one capital letter, at least one symbol in the list, and the string must be 8 to 16 characters in length.

(?-i)(?<!\S)(?=\S*?[A-Z])(?=\S*?[a-z])(?=\S*?[0-9])(?=\S*?[~`!@#$%^&*()_\-=+{}\[\]|\\:;"'<>,.\/?])\S{8,16}(?!\S)
Note: The (?-i) at the beginning of the expression turns off the case-insensitive flag, which is enabled by default.
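
The behavior of this expression can be tried out before it is entered in Spirion. The sketch below uses Python's re module, which is an assumption of this example and not the engine Spirion uses. Because Python is case sensitive by default, the leading (?-i) is left out; the rest of the expression is unchanged. The find_passwords helper and the sample strings are hypothetical.

```python
import re

# Same expression as above minus the leading (?-i): Python's re module is
# case sensitive by default, so the flag does not need to be switched off.
PASSWORD = re.compile(
    r'''(?<!\S)(?=\S*?[A-Z])(?=\S*?[a-z])(?=\S*?[0-9])'''
    r'''(?=\S*?[~`!@#$%^&*()_\-=+{}\[\]|\\:;"'<>,.\/?])\S{8,16}(?!\S)'''
)

def find_passwords(text: str) -> list[str]:
    """Return whitespace-delimited strings that meet all four requirements."""
    return [m.group(0) for m in PASSWORD.finditer(text)]

print(find_passwords("temp login: Passw0rd! (expires soon)"))
print(find_passwords("password1!"))  # no capital letter, so nothing is found
```

Each lookahead checks one requirement without consuming text, and the final \S{8,16} enforces the length.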

Example: Western Currency

The following expression searches for various western currency amounts using simple matching. It finds amounts in the following formats, each of which requires one of these currency symbols: $ £ €

Note: https://www.languageediting.com/format-numbers-eu-vs-us/

  • $ 0.99
  • $ .09
  • £123
  • €123,456
  • $123,456.99

Expression:

(^|\s)(?:\$|\£|\€)[ ]?(?:[1-9]{1}[0-9]{0,2}(?:(,)[0-9]{3})*(?:(\.)[0-9]{2})?|[1-9]{1}[0-9]{0,2}(?:(\.)[0-9]{2})?|0(?:(\.)[0-9]{2})|(?:(\.)[0-9]{2}))($|\s)
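
The sample amounts listed above can be checked with a short script. The following is a simplified, hypothetical variant of the currency pattern written for Python's re module; it is not the Spirion expression itself, and the CURRENCY name and the extra non-matching sample are assumptions made for illustration.

```python
import re

# Simplified variant: a currency symbol, an optional space, then either a
# comma-grouped amount with optional cents or a cents-only amount.
CURRENCY = re.compile(
    r"(?<!\S)[$£€][ ]?"
    r"(?:[1-9][0-9]{0,2}(?:,[0-9]{3})*(?:\.[0-9]{2})?"  # 123 / 123,456 / 123,456.99
    r"|0?\.[0-9]{2})"                                   # 0.99 / .09
    r"(?!\S)"
)

samples = ["$ 0.99", "$ .09", "£123", "€123,456", "$123,456.99", "$1234"]
for s in samples:
    print(s, bool(CURRENCY.search(s)))
```

This variant uses lookarounds in place of (^|\s) and ($|\s) so that the surrounding whitespace is not part of the reported match.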

Example: Medicare Beneficiary ID numbers

The following expression searches for MBIs (Medicare beneficiary ID numbers).

Note: https://www.cms.gov/Medicare/New-Medicare-Card/Understanding-the-MBI.pdf

Expression

(?:^|\b)(?:(?:mbi|medicare|HICN)\b.{0,25}?\b\K(?:(?:[1-9])(?!(?:S|L|O|I|B|Z))[A-Z](?:[0-9]|(?!(?:S|L|O|I|B|Z))[A-Z])(?:[0-9])((?:\s|[-])?)(?!(?:S|L|O|I|B|Z))[A-Z](?:[0-9]|(?!(?:S|L|O|I|B|Z))[A-Z])[0-9]\1(?!(?:S|L|O|I|B|Z))[A-Z](?!(?:S|L|O|I|B|Z))[A-Z][0-9][0-9]\b)|(?:(?:[1-9])(?!(?:S|L|O|I|B|Z))[A-Z](?:[0-9]|(?!(?:S|L|O|I|B|Z))[A-Z])(?:[0-9])((?:\s|[-])?)(?!(?:S|L|O|I|B|Z))[A-Z](?:[0-9]|(?!(?:S|L|O|I|B|Z))[A-Z])[0-9]\2(?!(?:S|L|O|I|B|Z))[A-Z](?!(?:S|L|O|I|B|Z))[A-Z][0-9][0-9]\b)(?=(?:.|\n){0,25}\b(?:mbi|medicare|HICN)\b))
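
The character layout that this expression encodes is easier to see in a shorter form: eleven characters in which the letters S, L, O, I, B, and Z are never used, with an optional space or dash after the fourth and seventh characters. The sketch below covers only that layout, without the nearby mbi/medicare/HICN keyword check, and is written for Python's re module as an illustration. The names LETTER, ALNUM, and is_mbi_like and the sample values are assumptions. Unlike Spirion, Python is case sensitive by default.

```python
import re

LETTER = "[AC-HJKMNP-RT-Y]"        # A-Z without S, L, O, I, B, Z
ALNUM = "[0-9AC-HJKMNP-RT-Y]"      # a digit or one of the allowed letters

MBI = re.compile(
    rf"\b[1-9]{LETTER}{ALNUM}[0-9]"       # characters 1-4
    rf"([\s-]?){LETTER}{ALNUM}[0-9]"      # optional separator, characters 5-7
    rf"\1{LETTER}{LETTER}[0-9]{{2}}\b"    # same separator again, characters 8-11
)

def is_mbi_like(text: str) -> bool:
    """Return True when the text contains a string with the MBI layout."""
    return MBI.search(text) is not None

for value in ["1EG4-TE5-MK73", "1EG4TE5MK73", "1SG4-TE5-MK73", "1EG4-TE5 MK73"]:
    print(value, is_mbi_like(value))
```

The back reference \1 plays the same role as \1 and \2 in the full expression: both separators must be the same character, or both must be absent.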

Example: IPv4 Addresses

The following expression searches for IPv4 addresses with a simple pattern.

Expression

(^|\s)(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])($|\s)

Pattern: IPv4 addresses, does not capture repeating and sequential digits

The following is an alternative that does not capture repeating and sequential digits such as: 0.0.0.0 and 1.2.3.4

Expression

\b(?!([0-9])\.\1\.\1\.\1\b)(?!(0\.)?1\.2\.3\.4\b|2\.3\.4\.5\b|3\.4\.5\.6\b|4\.5\.6\.7\b|5\.6\.7\.8\b|6\.7\.8\.9\b)((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
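
This alternative uses only constructs that Python's re module also supports, so its effect can be checked with a short script. The sketch below is an illustration outside of Spirion; the IPV4 name and the sample addresses are assumptions.

```python
import re

# The alternative expression above, unchanged, split across lines for reading.
IPV4 = re.compile(
    r"\b(?!([0-9])\.\1\.\1\.\1\b)"
    r"(?!(0\.)?1\.2\.3\.4\b|2\.3\.4\.5\b|3\.4\.5\.6\b|4\.5\.6\.7\b|5\.6\.7\.8\b|6\.7\.8\.9\b)"
    r"((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b"
)

for candidate in ["192.168.1.10", "0.0.0.0", "1.2.3.4"]:
    match = IPV4.search(candidate)
    print(candidate, match.group(0) if match else None)
```

The two negative lookaheads at the start reject repeating and sequential addresses before the four octets are matched.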

Pattern: IPv4 addresses (no prefixes)

The following is an alternative that captures IP Addresses (no prefixes)

Expression

(?:^|[^a-zA-Z0-9_.])\K(?!([012345679])\.\1\.\1\.\1\b)(?!(0\.)?1\.2\.3\.4\b|2\.3\.4\.5\b|3\.4\.5\.6\b|4\.5\.6\.7\b|5\.6\.7\.8\b|6\.7\.8\.9\b)((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(?!\.\d)(?=[\D.]|$)

Example: IPv6 Addresses

The following expression searches for IPv6 addresses with a simple pattern.

Expression

(^|\s)(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}($|\s)
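
A simple pattern of this kind matches only the full form of an address. The sketch below assumes the expression is intended to match eight colon-separated groups of one to four hexadecimal digits, and uses Python's re and ipaddress modules for illustration. The find_full_ipv6 helper and the sample addresses are assumptions.

```python
import ipaddress
import re

# Assumed intended form: eight groups of 1-4 hex digits separated by colons.
IPV6_FULL = re.compile(r"(^|\s)(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}($|\s)")

def find_full_ipv6(text: str) -> list[str]:
    """Return full-form IPv6 addresses, without the surrounding whitespace."""
    return [m.group(0).strip() for m in IPV6_FULL.finditer(text)]

full = "2001:0db8:85a3:0000:0000:8a2e:0370:7334"
short = "2001:db8::1"

print(find_full_ipv6(f"src {full} dst {short}"))
# The compressed form is a valid address but is not matched by the simple pattern.
print(ipaddress.IPv6Address(short))
```

If compressed addresses (those containing ::) need to be found, a broader expression is required.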

Example: MAC Addresses

The following expression searches for MAC addresses with a simple pattern.

Expression

(^|\s)((?:(?'1'[0-9a-fA-F]{2})(?:\:|-)?(?'2'[0-9a-fA-F]{2})(?:\:|-)?(?'3'[0-9a-fA-F]{2})(?:\:|-)?(?'4'[0-9a-fA-F]{2})(?:\:|-)?(?'5'[0-9a-fA-F]{2})(?:\:|-)?(?'6'[0-9a-fA-F]{2})))($|\s)

Example: Machine Readable Passport Number

Our Passport AnyFind finds US Passports.

This pattern finds the machine-readable code at the bottom of passports.

  • This pattern is tuned to work with OCR vagaries.

Expression

([A-Z0-9<]{9}[O0-9]{1}[A-Z]{3}[0-9OBDlI]{7}[MF]{1}[0-9BDOlI]{7}[A-Z0-9<]{14}[0-9BDOlI<]{2})
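
The OCR tolerance comes from the character classes for the numeric fields, which also accept letters that OCR commonly confuses with digits (O, B, D, l, I). The sketch below runs the expression unchanged in Python's re module as an illustration. The specimen line is fictitious, and the ocr_noise value is a hand-made variant in which some zeros and ones are replaced by look-alike letters; both are assumptions of this example.

```python
import re

# The expression above, unchanged, split across two lines for reading.
MRZ_LINE = re.compile(
    r"([A-Z0-9<]{9}[O0-9]{1}[A-Z]{3}[0-9OBDlI]{7}[MF]{1}"
    r"[0-9BDOlI]{7}[A-Z0-9<]{14}[0-9BDOlI<]{2})"
)

specimen = "L898902C36UTO7408122F1204159ZE184226B<<<<<10"
ocr_noise = "L898902C36UTO74O8l22F12O4I59ZE184226B<<<<<1O"

for line in (specimen, ocr_noise, specimen[:-1]):
    print(len(line), bool(MRZ_LINE.search(line)))
```

The last line tested is one character short of the 44 characters the expression requires, so it is not matched.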

Example: International Telephone Number

The Spirion Telephone AnyFind finds US telephone numbers. This pattern finds a variety of international telephone numbers that begin with a country code, either preceded by a “+” symbol or introduced by a simple prefix such as “mobile” or “(m)”. For example: +44 (0) 1483 926 095

Although there is no real standard for writing “+” versus “0” at the start of a phone number, requiring “+” significantly reduces noise and false positives. It is possible to write a variation of the expression that uses a preceding “0” instead, but as a trade-off it would require nearby context validation such as t, tel, m, mob, c, or cell.

Note: The Spirion Telephone Number AnyFind validates the US +1 country code, so when both the AnyFind and this pattern are used, US numbers with the +1 country code produce duplicate matches.

Expression

(?:(?:(?:\(m\)|\(c\)|\(o\)|cell|mob|mobile|phone|tel|telephone).{0,4}\K0?|(?:\+[ ]?)))(?:1|7|20|27|30|31|32|33|34|36|39|40|41|43|44|45|46|47|48|49|51|52|53|54|55|56|57|58|60|61|62|63|64|65|66|81|82|84|86|90|91|92|93|94|95|98|211|212|213|216|218|220|221|222|223|224|225|226|227|228|229|230|231|232|233|234|235|236|237|238|239|240|241|242|243|244|245|246|247|248|249|250|251|252|252|253|254|255|256|257|258|260|261|262|263|264|265|266|267|268|269|290|290|291|297|298|299|350|351|352|353|354|355|356|357|358|359|370|371|372|373|374|375|376|377|378|380|381|382|385|386|387|389|420|421|423|500|501|502|503|504|505|506|507|508|509|590|591|592|593|594|595|596|597|598|599|670|672|673|674|675|676|677|678|679|680|681|682|683|685|686|687|688|689|690|691|692|800|808|850|852|853|855|856|875|876|877|878|879|880|881|882|883|886|888|960|961|962|963|964|965|966|967|968|970|971|972|973|974|975|976|977|979|992|993|994|995|996|998)[ ]?(\.|\-)?[ ]?(?:\((?:0|\d{2,4})\))?[ ]?(\.|\-)?[ ]?[0-9\/ .-]{7,14}(?=(?:$|\s|\,))

Example: Dates, using Month, Day, and Year

AnyFind locates birthdates; however, there is a use case for finding date patterns with no validation that are easily incorporated into Sensitive Data Definitions.

This pattern does a good job of finding the following patterns with commas, spaces, dashes, and slashes. Back references are used to make sure the separators are consistent within the date match, which reduces false positives.

  • M DD YYYY such as 1-18-1981
  • MM DD YYYY such as 01/18/1981
  • YYYY AAA DD such as 1981 Jan 01
  • AAA DD YYYY such as January 1, 1981
  • AAA DD(st,rd,th) YYYY such as January 1st, 1981 (ordinal date formatting support is limited)

Expression

(?:^|\s)\K(?:(?:(?:1[89]|20)\d\d([- /.])(?:(?:0[1-9]|1[012])|(?:jan|feb|mar|apr|jun|jul|aug|sep|sept|oct|nov|dec|january|february|march|april|may|june|july|august|september|october|november|december))\1(?:0[1-9]|[12][0-9]|3[01]))|(?:(?:(?:0?[1-9]|1[012])|(?:jan|feb|mar|apr|jun|jul|aug|sep|sept|oct|nov|dec|january|february|march|april|may|june|july|august|september|october|november|december))([- /.])(?:0[1-9]|[12][0-9]|3[01])\2(?:(?:1[89]|20)\d\d))|(?:(?:0[1-9]|[12][0-9]|3[01])([- /.])(?:(?:0[1-9]|1[012])|(?:jan|feb|mar|apr|jun|jul|aug|sep|sept|oct|nov|dec|january|february|march|april|may|june|july|august|september|october|november|december))\3(?:(?:(?:1[89]|20)\d\d)|\d\d))|(?:(?:jan|feb|mar|apr|jun|jul|aug|sep|sept|oct|nov|dec|january|february|march|april|may|june|july|august|september|october|november|december) (?:st|nd|rd|th)?((,[ ])|,|[ ])(?:18|19|20)\d{2}))(?:$|\s)
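
The back-reference technique is easier to see in a shorter form. The following is a hypothetical, numeric-only simplification of the month-day-year branch, written for Python's re module as an illustration. It is not the full Spirion expression, and the DATE name and the sample strings are assumptions.

```python
import re

DATE = re.compile(
    r"(?<!\S)(?:0?[1-9]|1[012])"     # month, with or without a leading zero
    r"([- /.])"                      # separator, captured as group 1
    r"(?:0[1-9]|[12][0-9]|3[01])"    # day
    r"\1"                            # back reference: the same separator again
    r"(?:1[89]|20)\d\d(?!\S)"        # four-digit year from 1800 to 2099
)

for candidate in ["01/18/1981", "1-18-1981", "01/18-1981"]:
    print(candidate, bool(DATE.search(candidate)))
```

The last sample mixes a slash and a dash, so the back reference rejects it even though each part is valid on its own.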

Example: Birth Dates with Simple Validation

AnyFind locates birthdates; however, its text validation is usually tight.

This pattern builds on the Date pattern above and adds simple built-in text validation, for use cases where the validation text is within 100 characters before the date match. This eliminates the need to create a Sensitive Data Definition (SDD), which is search engine logic created by end users to find custom data types with accuracy.

Validation words are birth/dob/born and can be easily added to or changed.

Expression

(?:(?:(?:birth|dob|born))(?:.|\n){0,100})(?:^|\s|\n)\K(?:(?:(?:1[89]|20)\d\d([- /.])(?:(?:0[1-9]|1[012])|(?:jan|feb|mar|apr|jun|jul|aug|sep|sept|oct|nov|dec|january|february|march|april|may|june|july|august|september|october|november|december))\1(?:0[1-9]|[12][0-9]|3[01]))|(?:(?:(?:0?[1-9]|1[012])|(?:jan|feb|mar|apr|jun|jul|aug|sep|sept|oct|nov|dec|january|february|march|april|may|june|july|august|september|october|november|december))([- /.])(?:0[1-9]|[12][0-9]|3[01])\2(?:(?:1[89]|20)\d\d))|(?:(?:0[1-9]|[12][0-9]|3[01])([- /.])(?:(?:0[1-9]|1[012])|(?:jan|feb|mar|apr|jun|jul|aug|sep|sept|oct|nov|dec|january|february|march|april|may|june|july|august|september|october|november|december))\3(?:(?:(?:1[89]|20)\d\d)|\d\d))|(?:(?:jan|feb|mar|apr|jun|jul|aug|sep|sept|oct|nov|dec|january|february|march|april|may|june|july|august|september|october|november|december) (?:st|nd|rd|th)?(?:(?:,[ ])|,|[ ])(?:18|19|20)\d{2}))(?:$|\s)
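
In this expression, \K resets the start of the reported match, so the validation word and the text after it are required but are not part of the result. When testing in a tool that does not support \K, a capture group can stand in for it. The sketch below shows that idea with Python's re module and a simplified numeric date; it is an illustration under those assumptions, and the names DATE, BIRTH_DATE, and find_birth_dates are hypothetical.

```python
import re

# Simplified numeric date with a consistent separator (named back reference).
DATE = r"(?:0?[1-9]|1[012])(?P<sep>[- /.])(?:0[1-9]|[12][0-9]|3[01])(?P=sep)(?:1[89]|20)\d\d"

# A validation word, up to 100 characters of any text, then the date.
# The named group "date" plays the role that \K plays in the expression above.
BIRTH_DATE = re.compile(
    rf"(?:birth|dob|born)[\s\S]{{0,100}}?(?<!\S)(?P<date>{DATE})(?!\S)",
    re.IGNORECASE,
)

def find_birth_dates(text: str) -> list[str]:
    """Return only the date portion of each validated match."""
    return [m.group("date") for m in BIRTH_DATE.finditer(text)]

print(find_birth_dates("DOB: 01/18/1981"))
print(find_birth_dates("Invoice date 01/18/1981"))
```

The second sample contains a date but no validation word, so nothing is returned.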

Example: Expiration Dates with simple validation

AnyFind locates birthdates; however, there are use cases for locating dates of issue or expiry, such as those on identification cards.

This pattern utilizes the above Date pattern and adds simple built-in text validation to eliminate the need to create an SDD for use cases where validation text is within 100 characters before the date match.

Validation words are expir/issue and can be easily added to or changed.

Expression

(?:(?:(?:expir|issue))(?:.|\n){0,100})(?:^|\s|\n)\K(?:(?:(?:1[89]|20)\d\d([- /.])(?:(?:0[1-9]|1[012])|(?:jan|feb|mar|apr|jun|jul|aug|sep|sept|oct|nov|dec|january|february|march|april|may|june|july|august|september|october|november|december))\1(?:0[1-9]|[12][0-9]|3[01]))|(?:(?:(?:0?[1-9]|1[012])|(?:jan|feb|mar|apr|jun|jul|aug|sep|sept|oct|nov|dec|january|february|march|april|may|june|july|august|september|october|november|december))([- /.])(?:0[1-9]|[12][0-9]|3[01])\2(?:(?:1[89]|20)\d\d))|(?:(?:0[1-9]|[12][0-9]|3[01])([- /.])(?:(?:0[1-9]|1[012])|(?:jan|feb|mar|apr|jun|jul|aug|sep|sept|oct|nov|dec|january|february|march|april|may|june|july|august|september|october|november|december))\3(?:(?:(?:1[89]|20)\d\d)|\d\d))|(?:(?:jan|feb|mar|apr|jun|jul|aug|sep|sept|oct|nov|dec|january|february|march|april|may|june|july|august|september|october|november|december) (?:st|nd|rd|th)?(?:(?:,[ ])|,|[ ])(?:18|19|20)\d{2}))(?:$|\s)

ALTERNATE Expression for OCR/Driver’s license work:

(???:ex?p|[1i][s]{2})\b).{0,8})\K(?:0[1-9]|1[012])- /.- /.\b

Example: Student GPA with simple validation

FERPA compliance requires location of student records and grades in files, attachments, etc. This pattern utilizes simple text validation for the term GPA (loosely defined for OCR vagaries) within 25 characters prior to a single digit number, a dot, and up to four additional digits.

  • Valid matches are:
    • 3.0
    • 2.15
    • 4.336
    • 3.6657

Expression

(?:(?:G|C)PA)(?:.|\n){0,24}\D\K(?:\d\.\d{1,4})

Example: Find Office Document Author

This pattern, when scanning for metadata, finds the author of an office document.

Expression

^dc:creator:\s+\K(?:[^\s:]+(?:\s+[^\s:]+)*)(?=\s+(?:dc:|dcterms:)|$)

Example: Find Adobe PDF Document Author

This pattern, when scanning for metadata, finds the author of an Adobe PDF document.

Expression

^Author:\s+\K(?:[^\s:]+(?:\s+[^\s:]+)*)$

Example: ASP.NET Machinekey

This pattern finds symmetric keys in XML configuration.

Example

<configuration>
    <system.web>
        <machineKey
            validationKey="Validation Key here"
            decryptionKey="Decryption key here"
            validation="SHA1"
            decryption="AES" />
    </system.web>
</configuration>

Expression:

\<machineKey.{0,50}?\K(validation|decryption)Key\=\".*?\"

Example: SQL Connection String Credentials

This pattern finds SQL Connection Strings with username/password pairs.

Does not conflict with Password AnyFind.

Examples

Azure SQL Database connection strings - ConnectionStrings.com

Expression

((Provider|Server)\=.{0,100}?\K((user|username|user\s?id|uid|db\_user|Service Account)\=.{1,20}\@?.{1,20}\;(password|pwd|db_pass|\s|\b)\=.?\;|(password|pwd|db_pass)\=.?\;(user\=|username|user\s?id|uid|db\_user|Service Account)\={1,20}\@?.{1,20}\;))

Example: Slack Access Tokens 

Expression

(\b|\<|\=|^)\Kxox[baps]\-[a-z-]{10,38}?(\s|\>|\;|$)

Example: US DOT Number

Expression

(?:^|\s)US[ ]?DOT[ -_]{0,2}?\d{6,8}?(?:$|\s)

Example: I9 Form OMB number

Expression

(?i)(?:^|\s)OMB No\. (1615\-0047|1545-1621)(?:$|\s)

Example: FAFSA Form OMB number with context

Looks for the word “Form” (case insensitive) up to 300 characters before the OMB number.

Expression

(?i)Form.{0,300}\KOMB No\. (1845\-0001)(?:$|\s)

References

Tips

  • Characters ^ and $ are often not needed
    • Websites posting pre-created regular expressions often include the /^ characters (beginning of string) and the $/ characters (end of string)
    • These should not be used in Spirion because the common intention is to search an entire string of text or the whole document
  • Non-capture groups are not typically beneficial
    • Spirion uses the entire match, so non-capture groups do not help exclude a section of a regex from a match
    • A negative lookbehind is a better option
  • Do not include preceding or trailing spaces in the regular expression
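
The effect of the first tip can be seen with a small experiment. The sketch below uses Python's re module and a hypothetical badge-number pattern, both of which are assumptions made for illustration. How a given tool treats ^ and $ depends on its flags, so results in other engines may differ.

```python
import re

document = "Employee record\nBadge: AB-1234, issued 2021\nEnd of record"

# Typical "validate a whole string" form copied from a website.
anchored = re.compile(r"^[A-Z]{2}-\d{4}$")
# The same pattern without anchors searches anywhere in the text.
unanchored = re.compile(r"[A-Z]{2}-\d{4}")

print(anchored.search(document))             # nothing found in a longer document
print(unanchored.search(document).group(0))  # finds the badge number
```

The anchored form only succeeds when the entire text is the value itself, which is rarely the case when searching files.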

How to Test Regular Expressions

To work out bugs and reduce the occurrence of false positives, it is best to test the regular expression against sample data that resembles what the expression will ultimately target.

  • Building and testing regular expressions is most easily accomplished in a tool specifically designed to do so.
  • Many of these are free websites and there are also downloadable programs that work as well.

How to Use a RegEx Tester

Using online testers is a good way to create a regex, see what it is supposed to do, and test it against data. 

When using testers, it is advised to enable the flags/modifiers that are enabled by default in the Spirion implementation (global and insensitive) so that your results will be aligned with what will be found using the created regex search in the Spirion client. 

How to Test in the Spirion Client

Before a regular expression is widely implemented it is best to perform final testing within the Spirion client to validate expected behavior. 

Once the regular expression is created, assemble a folder of sample data or copies of production text documents that contain patterns that should match and should not match.

Testing procedure:

  1. Assemble sample/test data into a folder
  2. Open the Spirion client
  3. Enable the Regex data type
  4. Specify the target for the search
  5. Start search
  6. Review and verify results

Detailed steps for searching with Regex in the client are shown in the following section: "How to Input and Search for Regular Expressions using the Spirion Client."
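
The first step of the procedure, assembling sample data, can be scripted so that the same test folder is easy to rebuild. The following is a minimal sketch in Python. The file names, sample contents, and the build_sample_folder helper are hypothetical, and the samples should be replaced with data that resembles what the expression will ultimately target.

```python
from pathlib import Path
import tempfile

def build_sample_folder(root: Path) -> list[Path]:
    """Write one small text file per sample so results can be checked by file name."""
    samples = {
        "should_match_01.txt": "Contact: +44 (0) 1483 926 095",
        "should_match_02.txt": "DOB: 01/18/1981",
        "should_not_match_01.txt": "Order 0118-1981 shipped",
    }
    root.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, content in samples.items():
        path = root / name
        path.write_text(content, encoding="utf-8")
        paths.append(path)
    return paths

folder = Path(tempfile.mkdtemp()) / "regex_test_data"
for p in build_sample_folder(folder):
    print(p)
```

Naming each file after the expected outcome makes the final review step quicker, because any result in a should_not_match file points to a false positive.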

How to Input and Search for Regular Expressions using the Spirion Client

Use these steps to input and search for a regular expression using the Spirion client.

  1. Open the Spirion client.

  2. Select the Data Types tab.
  3. Deselect all data types and select Regular Expression. A data type is enabled when a darkened box appears around the selection.

    1. If no regular expressions have previously been specified, the Settings menu for Custom Sensitive Data Types opens automatically.
    2. Otherwise, click "File > Settings > Custom Types" to open the panel for Regex entry.
  4. Paste the pre-tested regular expression into the field labeled RegEx.
  5. Specify a name label for the regex that can be used in reporting and workflows.
  6. Click the Add button.
  7. Click the OK button, or the Apply button and then the OK button.


     
  8. Click the Locations tab at the top of the screen to select the target folder.
  9. Confirm that only Files is enabled in the Search Locations section.
  10. Confirm that only Custom Folder is selected in the File Locations section.

  11. Click the lower half (text) of the Custom Folder icon to open the Customize Folder List.


     
  12. Browse to or paste the location for the test data and click the Add button.
  13. Click OK.
  14. Select the "Main" tab and click the Start button.

  15. Verify the search results are accurate.

How to Add Regular Expressions using the Spirion Sensitive Data Platform Console

Use the following steps to add regular expressions using the Spirion Sensitive Data Platform:

  1. From the left side navigation menu select Settings.
  2. Next, click Global Data Types.
  3. The "Data Types" page opens.
  4. Click the CUSTOM DATA TYPES tab.
  5. Click the Actions button in the top right corner of the page.

  6. Click Add Custom Data Type from the sub-menu that appears.
  7. Select Regular Expression in the drop-down box.
  8. Complete the following fields in the box:
    1. Name - Name of your Regular Expression
    2. Regular Expression - Enter the Regular Expression to be used
    3. Value - An Integer value for the Regular Expression
    4. Dollar Value - A numeric value to be used in the SVD3 module.

  9. Click the Save & Update button at the bottom of the page.
  10. This Regular Expression can now be used in a Scan Playbook.

