Product Matching Across E-Commerce Sites: Algorithms That Actually Work

A technical deep-dive into fuzzy matching, AI-powered semantic matching, and vector embeddings for cross-site product comparison in e-commerce.

product matching, algorithms, AI, vector embeddings

Scraping competitor prices is the easy part. The hard part is figuring out which of your products corresponds to which of theirs.

"Premium Glass Jar 4oz Clear" from your catalog might be the same as "4 oz Clear Glass Container - Premium Quality" from a competitor. Or it might not — maybe theirs has a different closure type. Getting this right is the foundation of useful competitive intelligence.

Here's a technical look at the algorithms that solve this problem, their tradeoffs, and how to combine them for reliable results.

The Challenge

Product matching across e-commerce sites is hard because:

  • Naming conventions vary wildly. "1oz Mylar Bag" vs "Mylar Pouch 1 Ounce" vs "1-oz Flat Pouch, Mylar": all the same product, all described differently.
  • Attributes are embedded in titles. Size, color, material, and closure type are mashed into the product name rather than structured as separate fields. Extracting and comparing them requires parsing.
  • Partial matches matter. Two products might be 80% similar: same material, same size, but different closure type. Whether that's a "match" depends on your business context.
  • Scale compounds the problem. With 500 brand products and 5,000 competitor products, there are 2.5 million potential pairs to evaluate. Brute-force comparison doesn't work.
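To make the "attributes are embedded in titles" point concrete, here is a minimal sketch of pulling a size out of a product title and normalizing its unit. The `UNIT_ALIASES` table, the regex, and the `extract_size` helper are illustrative assumptions, not part of any particular library, and they only cover ounces.

```python
import re
from typing import Optional, Tuple

# Hypothetical alias table: maps spelled-out units to one canonical form.
UNIT_ALIASES = {"oz": "oz", "ounce": "oz", "ounces": "oz"}

# Matches "1oz", "1 oz", "1-oz", "1 Ounce", "2.5 ounces", and similar.
_SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*-?\s*(oz|ounces?)\b", re.IGNORECASE)


def extract_size(title: str) -> Optional[Tuple[float, str]]:
    """Return (quantity, canonical_unit) for the first size in a title, or None."""
    match = _SIZE_RE.search(title)
    if match is None:
        return None
    return float(match.group(1)), UNIT_ALIASES[match.group(2).lower()]


titles = ["1oz Mylar Bag", "Mylar Pouch 1 Ounce", "1-oz Flat Pouch, Mylar"]
sizes = [extract_size(t) for t in titles]
```

All three titles from the example above normalize to the same `(1.0, "oz")` pair, which is the kind of structured attribute that can be compared directly instead of through string similarity.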

Algorithm 1: Fuzzy Text Matching

Fuzzy matching compares product name strings using edit distance or token-based similarity metrics. The most practical implementation for product titles is token sort ratio.

How Token Sort Ratio Works

  • Tokenize both strings (split into words)
  • Sort tokens alphabetically
  • Compare the sorted strings using an edit-distance measure such as Levenshtein distance
  • Return a similarity score from 0-100

Example:

  • "Premium Glass Jar 4oz Clear" → sorted: "4oz Clear Glass Jar Premium"
  • "4 oz Clear Glass Container Premium" → sorted: "4 Clear Container Glass Premium oz"

    Token sort handles word reordering, which is the most common difference between product names across stores.
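The steps above can be sketched in plain Python. This is a simplified, standard-library illustration, not RapidFuzz's implementation: the `levenshtein`, `sort_tokens`, and `token_sort_ratio` helpers are written here for clarity, the normalization (distance divided by the longer string's length) is an assumption, and real libraries may score the same pair somewhat differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def sort_tokens(text: str) -> str:
    """Split on whitespace and rejoin the tokens in sorted order (case-sensitive)."""
    return " ".join(sorted(text.split()))


def token_sort_ratio(a: str, b: str) -> float:
    """Score two titles from 0 to 100 after sorting their tokens."""
    sorted_a, sorted_b = sort_tokens(a), sort_tokens(b)
    longest = max(len(sorted_a), len(sorted_b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(sorted_a, sorted_b) / longest)


score = token_sort_ratio("Premium Glass Jar 4oz Clear",
                         "4 oz Clear Glass Container Premium")
```

Because the sort is case-sensitive, "oz" lands after the capitalized words, as in the example above. Lowercasing both titles before sorting is a common preprocessing choice if casing differs between stores.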

    Strengths

    • Fast. RapidFuzz processes thousands of comparisons per second
    • No external services. Runs locally with no API calls
    • Predictable. Same inputs always produce the same score
    • Good for obvious matches. Products with similar names score 85-95%

    Weaknesses

    • Semantic blind spots. Doesn't know that "CR" means "child-resistant" or that "1oz" and "1 ounce" are the same
    • Noise sensitivity. Marketing language ("Best Seller!", "NEW!") in titles reduces match accuracy
    • No attribute awareness. Treats all words equally — can't distinguish size from color from material

    When to Use

    Fuzzy matching is a great first pass. Set a threshold of 75-85% and you'll catch the straightforward matches with high confidence. Products below the threshold need a smarter approach.
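A first pass along these lines might look like the sketch below. To stay dependency-free it swaps in `difflib.SequenceMatcher` from the standard library as the scorer, which is not the same metric as RapidFuzz's token sort ratio. The `score` and `first_pass` names and the 80% default threshold are assumptions chosen from within the range suggested above.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> float:
    """Stand-in scorer: difflib ratio over sorted, lowercased tokens (0-100)."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return 100.0 * SequenceMatcher(None, sorted_a, sorted_b).ratio()


def first_pass(brand, competitors, threshold=80.0):
    """Accept each brand product's best-scoring competitor if it clears the threshold.

    Returns (matched, unmatched): a dict of accepted pairs and a list of
    brand products that need a smarter second pass.
    """
    matched, unmatched = {}, []
    for name in brand:
        best = max(competitors, key=lambda c: score(name, c), default=None)
        if best is not None and score(name, best) >= threshold:
            matched[name] = best
        else:
            unmatched.append(name)
    return matched, unmatched


matched, unmatched = first_pass(
    ["Glass Jar 4oz Clear", "Mylar Bag 1oz"],
    ["Clear 4oz Glass Jar", "Kraft Paper Box Large"],
)
```

Here the reordered glass jar title is accepted outright, while the mylar bag has no close counterpart and falls through to the unmatched list.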

    Algorithm 2: AI-Powered Semantic Matching

    Language models (like GPT-4o-mini) capture product semantics that string metrics miss. They can recognize that "4oz" and "4 ounce" are equivalent, that "mylar" and "metalized polyester" usually refer to the same material, and that "pop top" is a closure type.

    How It Works

  • Send your brand product and a batch of competitor products to the LLM
  • The prompt includes industry context (e.g., "packaging" or "supplements")
  • The model returns matched pairs with confidence scores and reasoning
  • Results are stored for future reference
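The request and response handling around the model call can be sketched as follows. The prompt wording, the JSON schema, and the `build_prompt` and `parse_matches` helpers are all hypothetical, and the actual API call is left out. The part worth noting is the validation step, which discards entries that point at candidates that were never sent.

```python
import json


def build_prompt(brand_product, candidates, industry):
    """Assemble an illustrative matching prompt with numbered candidates."""
    lines = [
        f"You match products for the {industry} industry.",
        f"Brand product: {brand_product}",
        "Competitor candidates:",
    ]
    lines += [f"{i}. {name}" for i, name in enumerate(candidates, start=1)]
    lines.append(
        'Reply with JSON only: {"matches": [{"candidate": <number>, '
        '"confidence": <0-100>, "reasoning": "<short text>"}]}'
    )
    return "\n".join(lines)


def parse_matches(raw_reply, candidates):
    """Parse the model's JSON reply and keep only well-formed entries."""
    data = json.loads(raw_reply)
    results = []
    for item in data.get("matches", []):
        index = item.get("candidate")
        confidence = item.get("confidence")
        if not isinstance(index, int) or not 1 <= index <= len(candidates):
            continue  # candidate number was never in the prompt
        if not isinstance(confidence, (int, float)) or not 0 <= confidence <= 100:
            continue  # confidence missing or out of range
        results.append({
            "candidate": candidates[index - 1],
            "confidence": float(confidence),
            "reasoning": str(item.get("reasoning", "")),
        })
    return results
```

Numbering the candidates and asking for the number back, rather than the title, makes it easy to detect replies that reference a product the model invented.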

    Strengths

    • Semantic understanding. Handles synonyms, abbreviations, and domain knowledge
    • Context-aware. Can use industry profile (categories, common terms) to improve accuracy
    • Explains reasoning. The model can articulate why two products match or don't match
    • Handles ambiguity. Can flag "possible matches" for human review

    Weaknesses

    • Cost. Each matching call costs money (though GPT-4o-mini is cheap at ~$0.15 per 1M input tokens)
    • Latency. API calls take 1-5 seconds per batch
    • Non-deterministic. The same inputs might produce slightly different results across runs
    • Hallucination risk. The model might confidently match products that aren't actually the same

    When to Use

    AI matching is ideal as a second pass after fuzzy matching. Run fuzzy first to catch the easy matches, then send the remaining unmatched products to the LLM for semantic analysis.

    Algorithm 3: Vector Embeddings

    Vector embeddings represent product names as high-dimensional numerical vectors. Similar products have vectors that are close together in embedding space, regardless of how differently they're worded.

    How It Works

  • Generate embeddings for all product names using a model like OpenAI's text-embedding-3-small (1536 dimensions)
  • Store embeddings in a vector database (pgvector in PostgreSQL)
  • For each brand product, find the nearest competitor product vectors using cosine similarity
  • Products within a similarity threshold are potential matches
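The retrieval step can be illustrated without a database. The sketch below uses tiny hand-made vectors in place of real embeddings and a linear scan in place of an index, so it shows the cosine-similarity logic only. The `cosine_similarity` and `nearest` helpers and the toy catalog are assumptions for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def nearest(query, catalog, k=3, min_similarity=0.0):
    """Return up to k (name, similarity) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in catalog.items()]
    scored = [pair for pair in scored if pair[1] >= min_similarity]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings"; real ones would have hundreds of dimensions or more.
catalog = {"jar A": [1.0, 0.0], "bag B": [0.0, 1.0], "jar C": [1.0, 1.0]}
top = nearest([1.0, 0.0], catalog, k=2)
```

In production the linear scan would be replaced by an indexed query in the vector database, but the ranking criterion is the same.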

    Strengths

    • Scales efficiently. Embedding generation is a one-time cost per product. Similarity search is fast with HNSW indexes
    • Wording-agnostic similarity. Captures semantic meaning without explicit rules, so differently phrased titles can still land close together
    • Incrementally updateable. New products get embedded once and are immediately searchable

    Weaknesses

    • Black box. Hard to explain why two products matched or didn't
    • Requires infrastructure. Needs a vector database (though the pgvector extension adds vector search to an existing PostgreSQL database)
    • Embedding quality varies. General-purpose embeddings may not capture domain-specific nuances (e.g., packaging terminology)

    When to Use

    Vector search works well as a candidate retrieval step. Find the top 10-20 nearest neighbors for each product, then use fuzzy matching or AI to confirm the actual match.

    The Hybrid Approach

    The most reliable strategy combines all three algorithms, followed by human review of whatever remains uncertain:

    Pass 1: Candidate Retrieval with Embeddings

    Generate embeddings for all products. For each brand product, retrieve the top 20 most similar competitor products by cosine similarity. This reduces the search space from thousands to a manageable candidate set.

    Pass 2: Fuzzy Scoring

    Run token sort ratio on all candidate pairs. Products scoring above 85% are high-confidence matches. Products scoring 60-85% go to the AI pass.

    Pass 3: AI Confirmation

    Send ambiguous candidates (60-85% fuzzy score) to GPT-4o-mini for semantic evaluation. The model provides a confidence score and reasoning.

    Pass 4: Human Review

    Products that the AI is uncertain about (confidence below 70%) get flagged for manual review. This is typically 5-10% of the total — a manageable workload.
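The routing between passes 2 to 4 can be written as one small decision function. This is a sketch under stated assumptions: the `route` name and its return labels are hypothetical, and treating pairs below 60% as non-matches is an assumption, since only the 60-85% and above-85% bands are described above.

```python
def route(fuzzy_score, ai_verdict=None, ai_confidence=None):
    """Decide what happens to a candidate pair.

    fuzzy_score:   token sort ratio, 0-100
    ai_verdict:    True/False from the model, or None if not yet asked
    ai_confidence: the model's confidence in its verdict, 0-100
    """
    if fuzzy_score > 85:
        return "match"          # high-confidence fuzzy match
    if fuzzy_score < 60:
        return "no_match"       # assumption: too dissimilar to be worth an AI call
    if ai_verdict is None or ai_confidence is None:
        return "needs_ai"       # ambiguous band, model not consulted yet
    if ai_confidence < 70:
        return "human_review"   # model is uncertain
    return "match" if ai_verdict else "no_match"
```

Keeping the thresholds in one place makes them easy to tune per catalog, since the right cutoffs depend on how noisy the titles are.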

    Practical Considerations

    Price Ratio Guards

    If your product costs $5 and the potential match costs $500, they're probably not the same product regardless of name similarity. Apply a price ratio guard (e.g., reject matches where the price ratio exceeds 10x).
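A guard like this takes only a few lines. The `passes_price_guard` name and the handling of missing or zero prices are assumptions; the 10x default follows the example above.

```python
def passes_price_guard(price_a, price_b, max_ratio=10.0):
    """Return False when the larger price exceeds max_ratio times the smaller one."""
    if price_a <= 0 or price_b <= 0:
        return False  # assumption: treat missing or zero prices as failing the guard
    ratio = max(price_a, price_b) / min(price_a, price_b)
    return ratio <= max_ratio
```

Taking the larger price over the smaller one makes the check symmetric, so it does not matter which side is yours and which is the competitor's.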

    Stale Product Detection

    Products that haven't been updated by a scrape in over 30 days should be flagged. They might be discontinued or out of stock, making matches unreliable.
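As a minimal sketch, staleness is a timestamp comparison. The `is_stale` helper and the assumption that each product carries a timezone-aware `last_scraped_at` value are illustrative.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_scraped_at, now=None, max_age_days=30):
    """Return True if the last successful scrape is older than max_age_days."""
    now = now or datetime.now(timezone.utc)
    return now - last_scraped_at > timedelta(days=max_age_days)
```

Passing `now` explicitly keeps the check deterministic in tests and lets a batch job evaluate every product against the same reference time.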

    Confidence Tracking

    Track the confidence distribution of your matches over time. If average confidence is dropping, it might indicate that competitors are changing their naming conventions or that your product catalog has shifted.
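One simple way to watch for this is to compare the mean confidence of two periods. The `confidence_drift` helper and the 5-point alert threshold are assumptions. A fuller version might compare whole distributions rather than means.

```python
from statistics import mean


def confidence_drift(previous, current, max_drop=5.0):
    """Compare mean match confidence across two periods.

    Returns (drop, alert), where drop is the decrease in mean confidence
    and alert is True when the drop exceeds max_drop points.
    """
    drop = mean(previous) - mean(current)
    return drop, drop > max_drop
```

For example, `confidence_drift([90, 80], [70, 80])` reports a 10-point drop and raises the alert flag.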

    Implementation in VantageDash

    VantageDash implements all three algorithms. The Comparison page shows matched products with confidence scores, and you can run fuzzy matching, AI matching, or hybrid matching from the dashboard. Product embeddings are stored via pgvector in Supabase, enabling fast similarity search across thousands of products.

    Match results include confidence scores, reasoning from the AI model, and price-per-unit comparisons to help you make informed pricing decisions.
