How lookalike domain detection works: algorithms and methods

Technical guide to detecting lookalike domains: Damerau-Levenshtein distance, homoglyph substitution, and structural pattern analysis with code examples.

A lookalike domain is a domain name designed to resemble a legitimate one visually or phonetically. Detecting them systematically relies on three families of techniques: edit distance algorithms, homoglyph substitution, and structural pattern analysis. For a domain like "acme.com", a complete scan can generate several hundred candidate variants before filtering. This article covers each method with concrete examples and code, then explains how Domain Sentinel combines them into a continuous monitoring pipeline.

Edit distance: Levenshtein and Damerau-Levenshtein

The Levenshtein distance between two strings is the minimum number of single-character insertions, deletions, and substitutions needed to transform one into the other. The Damerau-Levenshtein variant adds transpositions of adjacent characters, which covers the most frequent real-world typing errors.

Concrete examples for the string "acme":

"acme" vs "acmee": distance 1 (insertion of a character)
"acme" vs "acne": distance 1 (substitution of "m" by "n")
"acme" vs "amce": Damerau distance 1 (transposition), Levenshtein distance 2

For brand monitoring, a distance threshold of 1 or 2 is typically used depending on name length. Short names (4-5 characters) need a threshold of 1 to avoid false positives; longer names can tolerate 2.
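
The length-dependent threshold can be written down as a small helper. This is a minimal sketch, not a definitive rule: the cutoff at 5 characters is taken from the rule of thumb above, the names max_distance_for and is_lookalike are hypothetical, and distance_fn stands for any edit-distance callable (for example a Damerau-Levenshtein implementation).

```python
from typing import Callable


def max_distance_for(name: str) -> int:
    """Return the edit-distance threshold for a brand name."""
    # Assumption: "short" means 5 characters or fewer, per the rule of thumb.
    return 1 if len(name) <= 5 else 2


def is_lookalike(brand: str, candidate: str,
                 distance_fn: Callable[[str, str], int]) -> bool:
    """True when candidate differs from brand but stays within the threshold."""
    if candidate == brand:
        return False
    return distance_fn(brand, candidate) <= max_distance_for(brand)
```

Passing the distance function in keeps the threshold policy separate from the algorithm, so the cutoff can be tuned per brand without touching the distance code.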

Python implementation

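def damerau_levenshtein(s1, s2):
    """Edit distance counting insertions, deletions, substitutions, and
    adjacent transpositions (the restricted "optimal string alignment" form)."""
    # d[(i, j)] is the distance between s1[:i+1] and s2[:j+1]; index -1 is the empty prefix.
    d = {}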
    len1, len2 = len(s1), len(s2)
    for i in range(-1, len1 + 1):
        d[(i, -1)] = i + 1
    for j in range(-1, len2 + 1):
        d[(-1, j)] = j + 1
    for i in range(len1):
        for j in range(len2):
            cost = 0 if s1[i] == s2[j] else 1
            d[(i, j)] = min(
                d[(i - 1, j)] + 1,       # deletion
                d[(i, j - 1)] + 1,       # insertion
                d[(i - 1, j - 1)] + cost # substitution
            )
            if i > 0 and j > 0 and s1[i] == s2[j-1] and s1[i-1] == s2[j]:
                d[(i, j)] = min(d[(i, j)], d[(i-2, j-2)] + cost)  # transposition
    return d[(len1 - 1, len2 - 1)]

def generate_variants_distance_1(name):
    alphabet = 'abcdefghijklmnopqrstuvwxyz0123456789-'
    variants = set()
    for i in range(len(name)):
        variants.add(name[:i] + name[i+1:])                             # deletion
        for c in alphabet:
            variants.add(name[:i] + c + name[i+1:])                    # substitution
            variants.add(name[:i] + c + name[i:])                      # insertion
        if i < len(name) - 1:
            variants.add(name[:i] + name[i+1] + name[i] + name[i+2:]) # transposition
    variants.update(name + c for c in alphabet)  # trailing insertion
    return variants - {name}

Generating variants at distance 1

The four categories of modifications and the volume they produce for a name of length n:

Deletions: n variants
Substitutions: n x 36 variants (a 37-character alphabet of 26 letters, 10 digits, and the hyphen, minus the original character)
Insertions: (n+1) x 37 variants, before deduplication
Transpositions: (n-1) variants

For a 4-character name like "acme", that produces roughly 330 raw variants before filtering out DNS-invalid strings and deduplication.
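
The per-category volumes can be checked empirically. The sketch below is illustrative only: it assumes the same 37-character alphabet as the generator above, and count_distance_1_variants is a hypothetical helper name. It counts distinct strings per category, so insertions come out slightly lower than the raw positions-times-alphabet product.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"  # 26 letters + 10 digits + hyphen


def count_distance_1_variants(name: str) -> dict:
    """Count the distinct distance-1 variants of name, per category."""
    n = len(name)
    deletions = {name[:i] + name[i + 1:] for i in range(n)}
    substitutions = {
        name[:i] + c + name[i + 1:]
        for i in range(n) for c in ALPHABET if c != name[i]
    }
    insertions = {
        name[:i] + c + name[i:]
        for i in range(n + 1) for c in ALPHABET
    }
    transpositions = {
        name[:i] + name[i + 1] + name[i] + name[i + 2:]
        for i in range(n - 1) if name[i] != name[i + 1]
    }
    unique = (deletions | substitutions | insertions | transpositions) - {name}
    return {
        "deletions": len(deletions),
        "substitutions": len(substitutions),
        "insertions": len(insertions),
        "transpositions": len(transpositions),
        "unique_total": len(unique),
    }
```

Calling count_distance_1_variants("acme") returns the size of each set, which is the number of candidates left after deduplication but before any DNS validity filtering.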

Homoglyphs: the Unicode character substitution attack

A homoglyph is a character that looks identical, or nearly identical, to a different character. Homoglyph attacks work because internationalized domain names (IDNs) allow non-ASCII characters, which are encoded in Punycode at the DNS level. The Cyrillic "а" (U+0430) renders identically to the Latin "a" (U+0061) in virtually all fonts. An attacker can register "аpple.com" using a Cyrillic "а" and it will display as "apple.com" in many contexts.

Frequently exploited homoglyph pairs:

Original	Common substitutes
a	а (U+0430 Cyrillic), ɑ (U+0251), α (U+03B1 Greek)
o	о (U+043E Cyrillic), 0 (zero), ο (U+03BF Greek)
e	е (U+0435 Cyrillic), ё (U+0451)
c	с (U+0441 Cyrillic)
p	р (U+0440 Cyrillic)
l	1 (one), I (capital i), | (vertical bar)
n	п (U+043F Cyrillic)
i	í (U+00ED), ï (U+00EF), 1 (one)

Programmatic homoglyph detection

The Unicode Consortium maintains a public "confusables" dataset that maps characters to their visual equivalents. A practical Python approach:

# Simplified homoglyph dictionary (subset)
HOMOGLYPHS = {
    'a': ['а', 'ɑ', 'α'],
    'o': ['о', '0', 'ο'],
    'e': ['е', 'ё'],
    'c': ['с'],
    'l': ['1', 'I', '|'],
    'p': ['р'],
}

def generate_homoglyph_variants(domain):
    name, tld = domain.rsplit('.', 1)
    variants = set()

    def recurse(index, current):
        if index == len(name):
            if current != name:
                variants.add(f"{current}.{tld}")
            return
        char = name[index]
        recurse(index + 1, current + char)
        for substitute in HOMOGLYPHS.get(char, []):
            recurse(index + 1, current + substitute)

    recurse(0, '')
    return variants

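A word of warning on combinatorial explosion: in a 6-letter name where each letter has 3 possible homoglyphs, every position has 4 choices (the original character or one of 3 substitutes), so full enumeration yields 4^6 - 1 = 4,095 variants from homoglyphs alone. In practice, you apply a maximum substitution depth (typically 2 substitutions per domain) to keep the set manageable; for the same example, a cap of 2 leaves 6 x 3 + 15 x 9 = 153 variants.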
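
The substitution cap can be sketched by choosing which positions to replace instead of recursing over every character. This is a hedged illustration, not Domain Sentinel's implementation: the dictionary is a small subset written with Unicode escapes, and homoglyph_variants and max_subs are hypothetical names.

```python
from itertools import combinations, product

# Illustrative subset; a real list would come from the Unicode confusables data.
HOMOGLYPHS = {
    "a": ["\u0430", "\u0251", "\u03b1"],  # Cyrillic a, Latin alpha, Greek alpha
    "o": ["\u043e", "0", "\u03bf"],       # Cyrillic o, digit zero, Greek omicron
    "e": ["\u0435", "\u0451"],            # Cyrillic ie, Cyrillic io
    "c": ["\u0441"],                      # Cyrillic es
    "p": ["\u0440"],                      # Cyrillic er
}


def homoglyph_variants(name: str, max_subs: int = 2) -> set:
    """Return variants of name with between 1 and max_subs substitutions."""
    positions = [i for i, ch in enumerate(name) if ch in HOMOGLYPHS]
    variants = set()
    for k in range(1, max_subs + 1):
        # Pick k positions to replace, then every combination of substitutes.
        for chosen in combinations(positions, k):
            options = [HOMOGLYPHS[name[i]] for i in chosen]
            for replacement in product(*options):
                chars = list(name)
                for i, substitute in zip(chosen, replacement):
                    chars[i] = substitute
                variants.add("".join(chars))
    return variants
```

Because the cap bounds the number of replaced positions, the output grows polynomially with name length rather than exponentially.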

How modern browsers handle IDNs

Chrome and Firefox display the Punycode representation (starting with xn--) instead of the Unicode domain name when a domain mixes scripts (e.g., Latin and Cyrillic in the same label). So "аpple.com" with a Cyrillic "а" appears in the browser address bar as xn--pple-43d.com. However, single-script Cyrillic domains (where the entire label is Cyrillic) may still display in Unicode, which is why homoglyph detection remains relevant. The Punycode for аcme.com (Cyrillic а) is xn--cme-5cd.com.
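
Both ideas, the xn-- form and the mixed-script check, can be approximated with the Python standard library. Treat this as a rough sketch: it uses the raw "punycode" codec without full IDNA normalization, it infers scripts from the first word of each Unicode character name (an approximation, not the official Script property), and to_ascii_label and scripts_in are hypothetical helper names.

```python
import unicodedata


def to_ascii_label(label: str) -> str:
    """Return the xn-- form of a label, or the label unchanged if it is ASCII."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def scripts_in(label: str) -> set:
    """Approximate the scripts used by the letters of a label."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # Character names start with the script, e.g. "CYRILLIC SMALL LETTER A".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(label: str) -> bool:
    """True when the letters of a label come from more than one script."""
    return len(scripts_in(label)) > 1
```

For the label "аpple" with a Cyrillic first letter, to_ascii_label returns "xn--pple-43d" and is_mixed_script returns True, which matches the browser behavior described above.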

Structural analysis: TLD variants and prefix/suffix patterns

The third family does not manipulate individual characters but the structure of the domain itself.

TLD swapping

Replacing the TLD is the simplest variant to generate and often the most commercially relevant. For "acme.com", this produces acme.net, acme.io, acme.co, acme.app, acme.shop. TLDs most frequently used for abuse include:

.co (extremely common: visually close to .com and a legitimate ccTLD)
.net (long-standing, widely recognized)
.io (tech sector standard, growing confusion with .com)
.app (mobile and SaaS)
.shop, .store (e-commerce)
.xyz, .top, .tk, .ml, .ga (often free or near-free; disproportionately used for phishing)
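
TLD swapping is simple enough to sketch in a few lines. The list below is an assumption assembled from the TLDs named above, not a vetted monitoring set, and tld_swaps is a hypothetical helper name. The split on the last dot is also a simplification: multi-label suffixes such as co.uk would need the Public Suffix List.

```python
# Assumption: illustrative list built from the TLDs mentioned in this section.
MONITORED_TLDS = ["com", "co", "net", "io", "app", "shop", "store",
                  "xyz", "top", "tk", "ml", "ga"]


def tld_swaps(domain: str, tlds=MONITORED_TLDS) -> list:
    """Return the domain re-rooted under every monitored TLD except its own."""
    name, dot, original_tld = domain.rpartition(".")
    if not dot or not name:
        raise ValueError(f"expected a name and a TLD, got {domain!r}")
    return [f"{name}.{tld}" for tld in tlds if tld != original_tld]
```

The output order follows the list order, so placing the highest-risk TLDs first gives a natural checking priority.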

Common prefix and suffix patterns

These patterns are often more dangerous than pure typos because they can deceive sophisticated users:

get{brand}.com        try{brand}.com       use{brand}.com
{brand}app.com        {brand}hq.com        {brand}online.com
{brand}-login.com     {brand}-secure.com   {brand}-account.com
{brand}official.com   {brand}support.com   {brand}-help.com
my{brand}.com         {brand}pro.com       {brand}plus.com
sign{brand}.com       {brand}portal.com    {brand}-verify.com

Words like "login", "secure", "account", and "verify" in a domain are red flags, not trust signals. Legitimate services rarely need to include them in their primary domain name.
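
The pattern table translates directly into generation code. The sketch below is illustrative: the prefix and suffix lists are copied from the patterns above, the keyword tuple from the red-flag words, and affix_variants and has_high_risk_keyword are hypothetical names.

```python
PREFIXES = ["get", "try", "use", "my", "sign"]
SUFFIXES = ["app", "hq", "online", "official", "support", "pro", "plus",
            "portal", "-login", "-secure", "-account", "-help", "-verify"]
HIGH_RISK_KEYWORDS = ("login", "secure", "account", "verify")


def affix_variants(brand: str, tld: str = "com") -> list:
    """Return the brand combined with each known prefix and suffix."""
    variants = [f"{prefix}{brand}.{tld}" for prefix in PREFIXES]
    variants += [f"{brand}{suffix}.{tld}" for suffix in SUFFIXES]
    return variants


def has_high_risk_keyword(domain: str) -> bool:
    """True when the part before the TLD contains a red-flag keyword."""
    label = domain.rsplit(".", 1)[0]
    return any(keyword in label for keyword in HIGH_RISK_KEYWORDS)
```

Note that the keyword check scans the whole label, so a brand whose own name contains one of these words would need an exception.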

Separators and compound variations

If your brand name is compound ("Domain Sentinel"), monitor: domainsentinel.com, domain-sentinel.com, sentineldomain.com, sentinel-domain.com. If your brand uses a hyphen, monitor the unhyphenated version and vice versa.
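
For compound names, the variants are every word order joined with every separator. A minimal sketch, assuming the brand is supplied as a list of lowercase words; compound_variants is a hypothetical name, and the permutation count grows factorially, so this suits two- or three-word brands.

```python
from itertools import permutations


def compound_variants(words, tld: str = "com", separators=("", "-")) -> set:
    """Return every word order of a compound brand, joined by each separator."""
    variants = set()
    for order in permutations(words):
        for separator in separators:
            variants.add(separator.join(order) + "." + tld)
    return variants
```

The result includes the brand's own domain, so subtract it before monitoring.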

Combining methods: the full detection pipeline

Domain Sentinel combines all three families in a sequential pipeline:

Generate candidates using Damerau-Levenshtein variants (distance 1-2), homoglyph substitutions (up to 2 substitutions), TLD swaps across a curated TLD list, and structural pattern variants (prefixes/suffixes).
Deduplicate and filter to remove DNS-invalid strings (labels over 63 characters, invalid character combinations) and reduce the candidate set.
Batch RDAP verification for each candidate to determine: registered, available, or pending/reserved.
Classify results: new registrations trigger alerts; available domains can be flagged for potential preventive registration.
Prioritize alerts by risk: domains combining multiple similarity signals (typo + suspicious keyword + recently registered) rank higher than simple TLD swaps with a long registration history.

The "noise" problem is real: exhaustive generation can produce thousands of candidates. Prioritization based on edit distance, TLD risk level, and presence of high-risk keywords (login, secure, official, verify, support) keeps the alert queue actionable.
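
The filtering and prioritization steps might look like the sketch below. It is not Domain Sentinel's actual scoring: the weights are invented for illustration, the candidate dictionaries use a hypothetical shape (label, tld, distance, recently_registered), and the label check covers only length, the ASCII letter-digit-hyphen alphabet, and leading or trailing hyphens.

```python
import re

# ASCII label: 1-63 characters, letters, digits, hyphens, no hyphen at either end.
LABEL_RE = re.compile(r"(?!-)[a-z0-9-]{1,63}(?<!-)")

HIGH_RISK_KEYWORDS = ("login", "secure", "official", "verify", "support")
HIGH_RISK_TLDS = {"xyz", "top", "tk", "ml", "ga"}


def is_valid_label(label: str) -> bool:
    """True when label is a syntactically valid ASCII DNS label."""
    return LABEL_RE.fullmatch(label) is not None


def risk_score(candidate: dict) -> int:
    """Sum assumed weights for each similarity signal present."""
    score = 0
    if candidate.get("distance") == 1:
        score += 3
    elif candidate.get("distance") == 2:
        score += 1
    if candidate["tld"] in HIGH_RISK_TLDS:
        score += 2
    if any(keyword in candidate["label"] for keyword in HIGH_RISK_KEYWORDS):
        score += 3
    if candidate.get("recently_registered"):
        score += 3
    return score


def prioritize(candidates: list) -> list:
    """Drop DNS-invalid candidates and sort the rest by descending risk."""
    valid = [c for c in candidates if is_valid_label(c["label"])]
    return sorted(valid, key=risk_score, reverse=True)
```

An additive score is the simplest way to make domains with several signals outrank single-signal ones; the weights would need tuning against real alert data.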

Open-source tools and APIs

For teams that want to build their own pipeline or validate results:

dnstwist (Python, open source) is the reference tool for lookalike domain generation. It implements most of the methods described here and can check DNS resolution and, optionally, WHOIS data for each variant. Use it for one-off scans or to build an offline variant database.
urlcrazy is similar, focused on typosquatting variant generation.
The public rdap.org service redirects each query to the authoritative RDAP server for the TLD, so it handles individual RDAP lookups for free and without authentication.
Domain Sentinel automates the entire pipeline continuously, sends alerts when new registrations are detected, and handles the RDAP querying and rate-limiting complexity.
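
A single RDAP lookup against rdap.org can be sketched with the standard library alone. The status mapping is an assumption for illustration: a 404 usually means no registration record was found, but it can also mean the TLD has no RDAP server, so it is not proof of availability. The names rdap_url, classify_status, and check_domain are hypothetical.

```python
import urllib.error
import urllib.request

RDAP_BASE = "https://rdap.org/domain/"


def rdap_url(domain: str) -> str:
    """Build the rdap.org lookup URL for a domain."""
    return RDAP_BASE + domain


def classify_status(http_status: int) -> str:
    """Map an RDAP HTTP status to a coarse registration state."""
    if http_status == 200:
        return "registered"
    if http_status == 404:
        return "not_found"      # often available, but not guaranteed
    if http_status == 429:
        return "rate_limited"   # back off and retry later
    return "unknown"


def check_domain(domain: str, timeout: float = 10.0) -> str:
    """Query RDAP for one domain and return its coarse state."""
    request = urllib.request.Request(
        rdap_url(domain), headers={"Accept": "application/rdap+json"}
    )
    try:
        # urlopen follows the redirect to the authoritative RDAP server.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as err:
        return classify_status(err.code)
    except OSError:
        # Network failures and timeouts: state cannot be determined.
        return "unknown"
```

For batch use, calls to check_domain would need pacing and retry logic around the "rate_limited" result.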

The key limitation of open-source tools: they generate variants but do not monitor continuously. Running dnstwist once tells you what is registered today. You need to run it again tomorrow, and the day after, and build your own notification system on top. This is the gap Domain Sentinel fills.
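
The core of continuous monitoring is comparing one scan with the next. A minimal sketch, assuming each scan has already been reduced to a set of registered lookalike domains; diff_snapshots is a hypothetical name, and storage, scheduling, and notification are left out.

```python
def diff_snapshots(previous: set, current: set) -> dict:
    """Compare two scans and report what changed between them."""
    return {
        "new": sorted(current - previous),      # registered since the last scan
        "dropped": sorted(previous - current),  # no longer registered
    }
```

Only the "new" entries would normally trigger an alert; "dropped" entries can signal a lapsed registration worth acquiring.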

In practice, starting with TLD swaps and structural patterns gives the fastest signal with the least noise. Add Damerau-Levenshtein variants to cover typos, then layer in homoglyph detection to cover IDN-based spoofing. A full scan for a 6-character brand name across 20 TLDs and a set of 30 prefix/suffix patterns produces roughly 2,000-3,000 candidate domains worth checking against RDAP. Run your brand through Domain Sentinel to see which of those are already registered.