Created: December 13th 2024
Last updated: December 13th 2024
Categories: Cyber Security
Author: Marcus Fleuti

Advanced Email Fraud Detection: Using Regex Patterns to Catch Sophisticated DHL Phishing Attempts with Unicode Homoglyphs

Introduction

Email fraud continues to evolve, with scammers employing increasingly sophisticated techniques to bypass traditional security measures. One common target is DHL, the international shipping company, whose brand is frequently exploited in phishing attempts. In this article, we'll dive deep into an advanced regex pattern designed to catch these fraudulent emails, even when they use Unicode homoglyphs and special characters to evade detection.

Understanding the Challenge

Modern phishing attempts often manipulate the email From header with techniques such as homoglyph attacks, in which similar-looking characters replace legitimate ones. For example, scammers might replace the Latin "H" in "DHL" with the Cyrillic "Н", which looks identical to the human eye but has a different Unicode code point.
Here's a real-world example:

ᎠНᏞ_Express <DHL_Express.13972@sinectis.com.ar>

Most people will not notice that the first "DHL" is not written in Latin letters at all: it uses different Unicode characters that merely look like "DHL" in order to impersonate the brand. For the same reason, many spam filters will not flag this From header. Try it yourself: press CTRL+F in your browser and search for "DHL". Your browser will not highlight the first "DHL".
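To make the difference visible, the following minimal Python sketch lists the code point and Unicode name of each character in the display name from the example and in the plain ASCII brand. The helper name `describe` is our own choice for illustration.

```python
import unicodedata

# Display name copied from the phishing example, plus the plain ASCII brand.
lookalike = "ᎠНᏞ"
genuine = "DHL"


def describe(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' entry per character in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in text]


for label, text in (("lookalike", lookalike), ("genuine", genuine)):
    print(label, describe(text))

# The two strings render alike but are different sequences of code points.
print("equal:", lookalike == genuine)
```

A plain substring search for "DHL" fails on the lookalike for exactly this reason: none of its characters is an ASCII letter.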

Breaking Down the Regex Pattern

Let's analyze our fraud detection pattern piece by piece:

(?:my-?)?(?:[_]|\b)(?:d|D|ᴅ|ď|đ|𝐝|𝑑|𝒅|𝓭|𝔡|𝕕|𝖉|𝗱|𝘥|𝙙|𝚍|Ꭰ).?(?:h|H|Н|н|ʜ|ℎ|ħ|ḣ|𝐡|𝒉|𝓱|𝔥|𝕙|𝖍|𝗵|𝘩|𝙝|𝚑).?(?:l|L|ʟ|ι|ℓ|ŀ|𝐥|𝑙|𝒍|𝓵|𝔩|𝕝|𝖑|𝗹|𝘭|𝙡|𝚕|Ꮮ)(?:[_]|\b).*?<(?!.*?@(?:.*?\.)?(?:dhl(?:-news)?|dhlfreight-news)\.(?:com|ch|de|ru|it|fr|at)(?:>|$))

1. General description (in simple words)

The pattern implements an advanced detection mechanism for email sender verification, specifically targeting potential DHL impersonation attempts in email communications. It employs a sophisticated regular expression to analyze the sender information in email headers, with particular focus on the FROM field. The pattern is designed to identify instances where "DHL" appears in the sender's display name while simultaneously verifying that the associated email domain does not correspond to any of DHL's legitimate international domains.

This dual-verification approach creates an effective filter against common phishing tactics where attackers attempt to exploit DHL's brand recognition by using the company name in the sender's display name while sending from unrelated domains. The pattern is particularly powerful as it accounts for various Unicode homoglyphs and character substitutions that malicious actors might employ to bypass simpler detection methods.

When implemented within SpamAssassin's rule framework, this pattern assigns a significant spam score to messages that match these characteristics, effectively filtering potential phishing attempts while maintaining a low false-positive rate through careful consideration of DHL's legitimate international domain portfolio.

2. The Word Boundary Pattern Problem

One of the components of our pattern is (?:[_]|\b). This elegant solution addresses a specific limitation in regex word boundaries. Let's break down why this is important:

  • The standard word boundary \b in regex considers underscore (_) as a word character
    • This means \bDHL\b wouldn't match within the string "DHL_Express"
  • Our pattern (?:[_]|\b) creates a custom boundary that also treats underscores as separators
  • This catches both traditional word boundaries AND underscore-separated text
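The boundary behaviour described above can be checked with a short Python sketch. Python's `re` module treats the underscore as a word character in the same way, so it serves as a stand-in here. The sample strings are made up for illustration.

```python
import re

# Standard word boundaries versus the custom underscore-aware boundary.
plain = re.compile(r"\bDHL\b")
custom = re.compile(r"(?:_|\b)DHL(?:_|\b)")

samples = ["DHL Express", "DHL_Express", "my_DHL_team", "DHLX"]

# Map each sample to (plain boundary matched, custom boundary matched).
results = {s: (bool(plain.search(s)), bool(custom.search(s))) for s in samples}

for sample, (plain_hit, custom_hit) in results.items():
    print(f"{sample!r}: plain={plain_hit} custom={custom_hit}")
```

Only the custom boundary matches the underscore-separated samples, while "DHLX" is rejected by both because the brand is embedded in a longer word.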

3. Homoglyph Detection

The pattern includes extensive character sets for each letter in "DHL":
D: d|D|ᴅ|ď|đ|𝐝|𝑑|𝒅|𝓭|𝔡|𝕕|𝖉|𝗱|𝘥|𝙙|𝚍|Ꭰ
H: h|H|Н|н|ʜ|ℎ|ħ|ḣ|𝐡|𝒉|𝓱|𝔥|𝕙|𝖍|𝗵|𝘩|𝙝|𝚑
L: l|L|ʟ|ι|ℓ|ŀ|𝐥|𝑙|𝒍|𝓵|𝔩|𝕝|𝖑|𝗹|𝘭|𝙡|𝚕|Ꮮ
These character sets include:

  • Standard ASCII characters (both upper and lowercase). The rule also enables the /i flag, so listing both cases may look redundant, and for ASCII letters it is. Depending on how the engine handles character encoding, however, /i may not fold non-ASCII pairs such as "é" and "É". To ensure proper matching, we combine a full set of characters with the /i flag.
  • Unicode mathematical variants
  • Similar-looking characters from different scripts
  • Decorated versions of the letters
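The reason for listing both cases explicitly can be illustrated with Python as an analogy. This sketch assumes the engine compares raw UTF-8 bytes rather than decoded characters; whether that applies to a given SpamAssassin setup depends on its configuration, so treat it as an illustration rather than a description of SpamAssassin internals.

```python
import re

lower = "\u00e9"  # é
upper = "\u00c9"  # É

# Text pattern: Unicode-aware case folding applies, so é matches É.
text_match = re.search(lower, upper, re.IGNORECASE) is not None

# Bytes pattern: only ASCII letters are case-folded, and the UTF-8
# encodings of é and É differ, so the case-insensitive search fails.
bytes_match = (
    re.search(lower.encode("utf-8"), upper.encode("utf-8"), re.IGNORECASE)
    is not None
)

# Listing both cases explicitly works regardless of the matching mode.
explicit = re.compile(f"(?:{lower}|{upper})".encode("utf-8"))
explicit_match = explicit.search(upper.encode("utf-8")) is not None

print(text_match, bytes_match, explicit_match)
```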

Here is the full character map we have compiled over the past years, which may help you build your own regex patterns:

a = (?:a|A|А|а|Α|α|ä|ą|𝐚|𝑎|𝒂|𝓪|𝔞|𝕒|𝖆|𝗮|𝘢|𝙖|𝚊)
b = (?:b|B|ʙ|В|в|β|ḃ|ḅ|𝐛|𝑏|𝒃|𝔟|𝕓|𝖇|𝗯|𝘣|𝙗|𝚋)
c = (?:c|C|С|ᥴ|с|ᴄ|ϲ|ċ|ç|𝐜|𝑐|𝒄|𝓬|𝔠|𝕔|𝖈|𝗰|𝘤|𝙘|𝚌)
d = (?:d|D|ᴅ|ď|đ|𝐝|𝑑|𝒅|𝓭|𝔡|𝕕|𝖉|𝗱|𝘥|𝙙|𝚍|Ꭰ)
e = (?:e|E|Е|з|є|ɘ|ɛ|ə|𝑒|𝖾|Ԑ|ԑ|ε|ë|ė|ẹ|𝐞|𝒆|𝓮|𝔢|𝖊|𝗲|𝘦|𝙚|𝚎)
f = (?:f|F|ꜰ|𝑓|𝖿|ƒ|𝘧|Ϝ|ϝ|ḟ|𝐟|𝒇|𝔣|𝕗|𝖋|𝗳|𝙛|𝚏)
g = (?:g|G|ɢ|Ԍ|ԍ|Ꮐ|ġ|ğ|𝐠|𝑔|𝒈|𝓰|𝔤|𝕘|𝖌|𝗴|𝘨|𝙜|𝚐)
h = (?:h|H|Н|н|ʜ|ℎ|ħ|ḣ|𝐡|𝒉|𝓱|𝔥|𝕙|𝖍|𝗵|𝘩|𝙝|𝚑)
i = (?:i|I|1|\||𝑖|𝖎|𝓲|!|і|í|ı|ɪ|ị|𝐢|𝒊|𝔦|𝕚|𝖏|𝗶|𝘪|𝙞|𝚒)
j = (?:j|J|Ј|ј|ᴊ|ϳ|ɉ|𝐣|𝑗|𝒋|𝓳|𝔧|𝕛|𝖏|𝗷|𝘫|𝙟|𝚓)
k = (?:k|K|К|к|ᴋ|ĸ|ҟ|𝐤|𝑘|𝒌|𝓴|𝔨|𝕜|𝖐|𝗸|𝘬|𝙠|𝚔)
l = (?:l|L|ʟ|ι|ℓ|ŀ|𝐥|𝑙|𝒍|𝓵|𝔩|𝕝|𝖑|𝗹|𝘭|𝙡|𝚕|Ꮮ)
m = (?:m|M|М|м|ᴍ|ṁ|𝐦|𝑚|𝒎|𝓶|𝔪|𝕞|𝖒|𝗺|𝘮|𝙢|𝚖)
n = (?:n|N|и|И|п|П|η|ɴ|ŋ|ɲ|ň|ṅ|𝐧|𝑛|𝒏|𝓷|𝔫|𝕟|𝖓|𝗻|𝘯|𝙣|𝚗|Ո)
o = (?:o|O|О|о|ᴏ|օ|σ|ȯ|ọ|𝐨|𝑜|𝒐|𝓸|𝔬|𝕠|𝖔|𝗼|𝘰|𝙤|𝚘)
p = (?:p|P|Р|р|ᴘ|ρ|ṗ|𝐩|𝑝|𝒑|𝓹|𝔭|𝕡|𝖕|𝗽|𝘱|𝙥|𝚙)
q = (?:q|Q|ԛ|զ|𝑞|𝖖|𝓆|q|ϙ|𝐪|𝒒|𝓺|𝔮|𝕢|𝗾|𝘲|𝙦|𝚚)
r = (?:r|R|Я|я|ʀ|г|𝑟|𝖗|r|ŕ|ṙ|𝐫|𝒓|𝓻|𝔯|𝕣|𝗋|𝘳|𝙧|𝚛)
s = (?:s|S|ꜱ|Ѕ|ѕ|ṡ|ș|𝐬|𝑠|𝒔|𝓼|𝔰|𝕤|𝖘|𝗌|𝘴|𝙨|𝚜)
t = (?:t|T|Т|𝑡|𝖙|𝓉|ŧ|𝘵|t|τ|ṫ|𝐭|𝒕|𝓽|𝔱|𝕥|𝖙|𝗍|𝚝)
u = (?:u|U|ц|Ц|ս|ʊ|𝑢|𝖚|𝓊|u|μ|ü|ụ|𝐮|𝒖|𝓾|𝔲|𝕦|𝖚|𝗎|𝘶|𝙪|𝚞)
v = (?:v|V|ѵ|ν|ⅴ|𝑣|𝖛|𝓋|v|ṿ|𝐯|𝒗|𝓿|𝔳|𝕧|𝖝|𝗏|𝘷|𝙫|𝚟)
w = (?:w|W|ѡ|ω|ш|Ш|𝑤|𝖜|𝓌|w|ẇ|𝐰|𝒘|𝔀|𝔴|𝕨|𝖜|𝗐|𝘸|𝙬|𝚠)
x = (?:x|X|Х|х|𝑥|𝘹|x|χ|𝖷|𝓍|ẋ|𝐱|𝒙|𝔁|𝔵|𝕩|𝖝|𝗑|𝙭|𝚡)
y = (?:y|Y|У|у|ʏ|ý|ɏ|𝐲|𝑦|𝒚|𝓎|𝔂|𝔶|𝕪|𝖞|𝗒|𝘺|𝙮|𝚢)
z = (?:z|Z|ᴢ|ʐ|ź|ż|ž|𝑧|𝖟|𝓏|z|ẓ|𝐳|𝒛|𝔃|𝔷|𝕫|𝖟|𝗓|𝘻|𝙯|𝚣)
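A map like this lends itself to generating patterns instead of typing them by hand. The following Python sketch shows one way to do that, assuming only the d, h and l rows are needed. The names `HOMOGLYPHS` and `brand_pattern` are hypothetical and not part of any library.

```python
import re

# Alternatives copied from the d, h and l rows of the map above.
HOMOGLYPHS = {
    "d": "d|D|ᴅ|ď|đ|𝐝|𝑑|𝒅|𝓭|𝔡|𝕕|𝖉|𝗱|𝘥|𝙙|𝚍|Ꭰ",
    "h": "h|H|Н|н|ʜ|ℎ|ħ|ḣ|𝐡|𝒉|𝓱|𝔥|𝕙|𝖍|𝗵|𝘩|𝙝|𝚑",
    "l": "l|L|ʟ|ι|ℓ|ŀ|𝐥|𝑙|𝒍|𝓵|𝔩|𝕝|𝖑|𝗹|𝘭|𝙡|𝚕|Ꮮ",
}


def brand_pattern(brand: str) -> re.Pattern:
    """Build a homoglyph-aware pattern for a brand name.

    Each letter becomes a non-capturing group of its lookalikes, the
    groups are joined with `.?`, and the result is wrapped in the
    underscore-aware boundary used throughout this article.
    """
    groups = [f"(?:{HOMOGLYPHS[ch]})" for ch in brand.lower()]
    return re.compile(r"(?:_|\b)" + ".?".join(groups) + r"(?:_|\b)")


dhl = brand_pattern("dhl")
print(bool(dhl.search("ᎠНᏞ_Express")))
print(bool(dhl.search("Hello world")))
```

Extending `HOMOGLYPHS` with further rows from the map would let the same helper generate patterns for other brand names.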

4. Flexible Matching with `.?`

Between the letters, we use `.?` to allow one optional separator character. This catches variants like:

- D-H-L
- D.H.L
- D H L
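The effect of `.?` can be isolated with an ASCII-only stand-in for the homoglyph groups. This is a simplified sketch, not the full rule, and the sample strings are invented for illustration.

```python
import re

# ASCII-only stand-in for the homoglyph groups, to isolate the effect of `.?`.
spaced = re.compile(r"(?:_|\b)d.?h.?l(?:_|\b)", re.IGNORECASE)

variants = ["DHL", "D-H-L", "D.H.L", "D H L", "D--H--L", "Roald Dahl"]
hits = {v: bool(spaced.search(v)) for v in variants}

for variant, hit in hits.items():
    print(f"{variant!r}: {hit}")
```

Two details are worth noting. A double separator such as "D--H--L" is not caught, because `.?` allows at most one character. A display name containing "Dahl" does match, because `.?` accepts any character, including a letter. In the full rule such a sender is only flagged in combination with the domain check, but it is a case to keep in mind when reviewing false positives.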

Legitimate Domain Validation

The pattern ends with a negative lookahead to ensure the email doesn't come from legitimate DHL domains:

(?!.*?@(?:.*?\.)?(?:dhl(?:-news)?|dhlfreight-news)\.(?:com|ch|de|ru|it|fr|at)(?:>|$))

This checks that the email domain isn't:

  • dhl.com, dhl.de, etc.
  • dhl-news.com, dhl-news.de, etc.
  • dhlfreight-news.com, dhlfreight-news.de, etc.

The pattern's domain validation approach might seem counterintuitive at first, as it employs a negative lookahead to check for legitimate DHL domains. Instead of explicitly matching suspicious domains, we validate against a list of known legitimate DHL domains. This inverse logic serves a crucial purpose: it allows the pattern to match any sender that uses "DHL" in their display name while sending from a non-authorized domain.

This approach is particularly effective because:

  1. The list of legitimate DHL domains is finite and well-known
  2. The list of potential malicious domains is infinite and constantly changing
  3. Any email claiming to be from DHL should only come from their official domains

By structuring the pattern this way, we create a more robust detection system that automatically flags any new, unauthorized domains that attackers might use, without requiring constant updates to our pattern's domain list.
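The inverse logic can be tried out in isolation. The Python sketch below applies only the lookahead part of the pattern to a few From headers. The function name `from_unlisted_domain` and all addresses other than the one from the real-world example are made up for illustration.

```python
import re

# "<" followed by anything that is NOT an address on a listed DHL domain.
UNLISTED_SENDER = re.compile(
    r"<(?!.*?@(?:.*?\.)?(?:dhl(?:-news)?|dhlfreight-news)"
    r"\.(?:com|ch|de|ru|it|fr|at)(?:>|$))",
    re.IGNORECASE,
)


def from_unlisted_domain(from_header: str) -> bool:
    """Return True when the address is not on one of the listed domains."""
    return UNLISTED_SENDER.search(from_header) is not None


headers = [
    "DHL <noreply@dhl.com>",                            # listed domain
    "DHL <news@mail.dhl-news.de>",                      # subdomain of a listed domain
    "DHL_Express <DHL_Express.13972@sinectis.com.ar>",  # unrelated domain
    "DHL <info@dhl.com.example.org>",                   # listed name used as a prefix
    "DHL <info@notdhl.com>",                            # listed name used as a suffix
]

for header in headers:
    print(f"{header}: unlisted={from_unlisted_domain(header)}")
```

Because the lookahead requires the listed domain to be followed directly by `>` or the end of the line, lookalike constructions such as dhl.com.example.org are still treated as unlisted.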

Alternative Approaches and Comparisons

| Feature | Our Regex Pattern | Simple Text Match | Domain Whitelist |
| --- | --- | --- | --- |
| Homoglyph Detection | Yes | No | No |
| Special Character Handling | Yes | Limited | No |
| False Positive Rate | Low | High | Very Low |
| Implementation Complexity | Medium | Low | Low |

Implementing the ruleset in SpamAssassin

Introduction

SpamAssassin is a powerful tool in the fight against email fraud, and with proper configuration, it can effectively detect sophisticated phishing attempts targeting DHL customers. In this guide, we'll walk through implementing a custom rule that catches homoglyph-based DHL phishing attempts while maintaining a low false-positive rate.

The SpamAssassin Rule

Let's start with the complete rule configuration that you'll add to your SpamAssassin setup:

header          PHISHING_DHL                       From =~ /(?:my-?)?(?:[_]|\b)(?:d|D|ᴅ|ď|đ|𝐝|𝑑|𝒅|𝓭|𝔡|𝕕|𝖉|𝗱|𝘥|𝙙|𝚍|Ꭰ).?(?:h|H|Н|н|ʜ|ℎ|ħ|ḣ|𝐡|𝒉|𝓱|𝔥|𝕙|𝖍|𝗵|𝘩|𝙝|𝚑).?(?:l|L|ʟ|ι|ℓ|ŀ|𝐥|𝑙|𝒍|𝓵|𝔩|𝕝|𝖑|𝗹|𝘭|𝙡|𝚕|Ꮮ)(?:[_]|\b).*?<(?!.*?@(?:.*?\.)?(?:dhl(?:-news)?|dhlfreight-news)\.(?:com|ch|de|ru|it|fr|at)(?:>|$))/ims
describe        PHISHING_DHL                       High Probability DHL Phishing/Scam
score           PHISHING_DHL                       8.0

Understanding the Components

1. Header Rule Definition

The rule consists of the following parts:

  • header: Indicates this is a header-based rule
  • PHISHING_DHL: The unique identifier for this rule
  • From =~ : Specifies that we're matching against the From header
  • describe: The description shown to the end user in the spam report
  • score: The number of points added when the rule matches (with SpamAssassin's default required_score of 5.0, a message whose total reaches 5 points is classified as spam)

2. Pattern Flags

The pattern uses three important flags:

  • /i: CASE-INSENSITIVE mode
  • /m: MULTILINE mode (^ and $ match line start/end)
  • /s: DOTALL mode (Makes dot match any character INCLUDING newlines)
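Python's `re.IGNORECASE`, `re.MULTILINE` and `re.DOTALL` behave like /i, /m and /s, so the effect of the flags can be sketched there. The folded header below is a constructed example. Whether a real From header reaches the rule with embedded newlines depends on how SpamAssassin presents it.

```python
import re

# A constructed, folded From header: display name and address on separate lines.
folded = "DHL Express\n <sender@example.org>"

# Without DOTALL the dot stops at the newline, so `.*?<` cannot reach the "<".
without_s = re.search(r"dhl.*?<", folded, re.IGNORECASE)
with_s = re.search(r"dhl.*?<", folded, re.IGNORECASE | re.DOTALL)

# With MULTILINE, $ also matches just before a newline inside the string.
without_m = re.search(r"express$", folded, re.IGNORECASE)
with_m = re.search(r"express$", folded, re.IGNORECASE | re.MULTILINE)

print(bool(without_s), bool(with_s), bool(without_m), bool(with_m))
```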

3. Scoring Configuration

score PHISHING_DHL 8.0
The score of 8.0 is deliberately high because:

  • DHL phishing attempts are common and dangerous
  • The pattern is specifically designed to minimize false positives
  • Most legitimate DHL communication comes from official domains

Implementation Guide

1. Installation Location

Add the rule to your SpamAssassin configuration:

Locate your local.cf:

# Common locations:
- /etc/spamassassin/local.cf
- /etc/mail/spamassassin/local.cf
- /usr/local/etc/spamassassin/local.cf

Back up your existing configuration (example):

sudo cp /etc/spamassassin/local.cf /etc/spamassassin/local.cf.backup

Add the rule to local.cf

2. Testing the Configuration

After adding the rule:

Check syntax:

spamassassin --lint

Test with a sample email:

spamassassin -D --test-mode < sample-email.txt
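If no suitable sample is at hand, a test message can be generated with Python's standard `email` package. This is a minimal sketch: the From header reuses the example from the beginning of this article, while the recipient, subject and body are placeholders. The helper name `build_sample_message` is our own.

```python
from email.headerregistry import Address
from email.message import EmailMessage
from pathlib import Path


def build_sample_message() -> EmailMessage:
    """Build a test message whose display name uses DHL lookalike characters."""
    msg = EmailMessage()
    msg["From"] = Address(
        display_name="ᎠНᏞ_Express",
        username="DHL_Express.13972",
        domain="sinectis.com.ar",
    )
    msg["To"] = "recipient@example.org"  # placeholder recipient
    msg["Subject"] = "Test message for the PHISHING_DHL rule"
    msg.set_content("This is a test message.\n")
    return msg


if __name__ == "__main__":
    # The non-ASCII display name is written as an RFC 2047 encoded word.
    Path("sample-email.txt").write_bytes(build_sample_message().as_bytes())
```

The resulting sample-email.txt can then be passed to the test command shown above.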

Restart SpamAssassin:

sudo systemctl restart spamassassin

Performance Considerations

| Aspect | Impact | Mitigation |
| --- | --- | --- |
| Pattern Complexity | Medium | Pattern compilation caching |
| Memory Usage | Low | No action needed |
| Processing Time | Low-Medium | Early rule positioning |

Monitoring and Maintenance

1. Regular Checks

  • Monitor SpamAssassin logs for rule hits
  • Track false positives/negatives
  • Review scoring effectiveness
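One way to track rule hits is to evaluate the X-Spam-Status header that SpamAssassin adds to processed messages. The Python sketch below assumes the common `tests=RULE_A,RULE_B` layout of that header. The sample header values are invented, and the function name `rules_hit` is hypothetical.

```python
import re
from collections import Counter


def rules_hit(status_header: str) -> list[str]:
    """Extract rule names from the tests= field of an X-Spam-Status header."""
    # The rule list may be folded across lines; it ends at the next
    # "key=" field or at the end of the header.
    match = re.search(r"tests=(.*?)(?:\s+\w+=|\s*$)", status_header, re.DOTALL)
    if match is None:
        return []
    names = re.sub(r"\s+", "", match.group(1))
    return [] if names in ("", "none") else names.split(",")


# Invented sample values for illustration.
sample_headers = [
    "X-Spam-Status: Yes, score=8.2 required=5.0 tests=HTML_MESSAGE,\n"
    "\tPHISHING_DHL autolearn=no version=4.0.0",
    "X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=ham",
]

hit_counts = Counter(name for h in sample_headers for name in rules_hit(h))
print(hit_counts["PHISHING_DHL"])
```

Run over a mailbox or quarantine folder, a counter like this gives a rough picture of how often the rule fires and which other rules it tends to fire with.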

2. Rule Updates

Keep the rule updated when:

  • New DHL domains are added
  • New Unicode homoglyphs are discovered
  • False positives are reported

3. Performance Monitoring

Monitor these metrics:

  • Rule processing time
  • Memory usage
  • Cache hit rates

Troubleshooting Common Issues

Rule Not Triggering

  • Verify rule syntax
  • Check SpamAssassin debug logs
  • Confirm character encoding support

False Positives

  • Review legitimate DHL domains
  • Adjust scoring if needed
  • Consider adding whitelist rules

Performance Issues

  • Monitor system resources
  • Consider rule optimization
  • Check regex engine performance

Conclusion

This advanced regex pattern demonstrates how to catch sophisticated email fraud attempts that use Unicode homoglyphs and special characters. By understanding and implementing these techniques, you can better protect your users from phishing attempts that try to impersonate trusted brands like DHL.

Remember that this pattern is just one part of a comprehensive email security strategy. It should be combined with other security measures like SPF, DKIM, and DMARC for optimal protection against email fraud.

The SpamAssassin configuration provides robust protection against sophisticated DHL phishing attempts while maintaining good performance. Regular monitoring and updates ensure continued effectiveness against evolving threats.