AI-Powered Matching

How True Record uses machine learning to find duplicates that rule-based systems miss.

Overview

True Record combines AI vector embeddings with traditional matching rules to find duplicates. This hybrid approach catches both exact matches and fuzzy/semantic similarities.

Semantic Understanding

Understands meaning, not just text

Sub-second Search

Finds matches in milliseconds

Hybrid Approach

AI + rules for best accuracy

Matching Pipeline

Every scan runs through a multi-stage pipeline to find and score potential duplicates.

1

Data Ingestion

Records are fetched from Salesforce via API and normalized (lowercase, trimmed, standardized formats).
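The normalization step can be sketched in Python as follows; the function names are illustrative, not True Record's actual code:

```python
import re

def normalize(value: str) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return re.sub(r"\s+", " ", value.strip().lower())

def normalize_phone(value: str) -> str:
    """Keep digits only, so '(555) 123-4567' and '555.123.4567' compare equal."""
    return re.sub(r"\D", "", value)
```

Normalizing before embedding and comparison means formatting noise (casing, spacing, punctuation) never reaches the matcher.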

2

Embedding Generation

Key fields are concatenated and sent to OpenAI to generate a 1536-dimensional vector embedding.
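The concatenation might look like the sketch below; the helper name and field separator are assumptions, and the OpenAI call is shown only as a comment:

```python
def build_embedding_input(record: dict, fields: list[str]) -> str:
    """Concatenate the configured fields into one string for the embedding API.
    Empty or missing fields are skipped so sparse records still embed cleanly."""
    parts = [str(record[f]) for f in fields if record.get(f)]
    return " | ".join(parts)

# The resulting text is then sent to OpenAI's embeddings endpoint, e.g.:
#   client.embeddings.create(model="text-embedding-3-small", input=text)
# which returns a 1536-dimensional vector.
```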

3

Vector Indexing

Embeddings are stored in PostgreSQL with pgvector and indexed using HNSW for fast similarity search.
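Assuming a pgvector-backed schema, the storage and index setup could look like the DDL below, held here as constants a migration script would execute (table and column names are hypothetical):

```python
# Hypothetical PostgreSQL DDL for pgvector-backed embedding storage.
CREATE_EXTENSION = "CREATE EXTENSION IF NOT EXISTS vector;"

CREATE_TABLE = """
CREATE TABLE record_embeddings (
    record_id  text PRIMARY KEY,
    embedding  vector(1536) NOT NULL  -- matches text-embedding-3-small
);
"""

# vector_cosine_ops selects cosine distance, the metric used for K-NN search.
CREATE_INDEX = """
CREATE INDEX ON record_embeddings
    USING hnsw (embedding vector_cosine_ops);
"""
```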

4

K-NN Search

For each record, we find the K nearest neighbors by cosine similarity (K=5 by default).
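Cosine similarity is simple to state in code; the trailing comment sketches how the same K-NN query would run against pgvector, where `<=>` is the cosine-*distance* operator (so similarity = 1 − distance):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(a, b) = dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Equivalent pgvector query (parameterized; table name is hypothetical):
#   SELECT record_id, 1 - (embedding <=> %s) AS similarity
#   FROM record_embeddings
#   ORDER BY embedding <=> %s
#   LIMIT 5;  -- K = 5
```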

5

Candidate Filtering

Neighbors are filtered by minimum similarity threshold and blocking rules to reduce false positives.

6

Confidence Scoring

Final match score combines vector similarity with field-level comparison weights.

Vector Embeddings

Embeddings capture the semantic meaning of records, allowing us to find duplicates even when fields are formatted differently or contain typos.

Embedding Model

Model: OpenAI text-embedding-3-small
Dimensions: 1536
Similarity metric: cosine

We use OpenAI's text-embedding-3-small model, which offers an excellent balance of accuracy and performance for entity matching tasks.

Fields Used for Embedding

Lead: Name, Company, Email, Phone, Title
Contact: Name, Email, Phone, Title, Account.Name
Account: Name, Website, Phone, BillingCity, Industry

Custom Field Selection

You can configure which fields are used for embedding in the Settings tab. Choose fields that uniquely identify records for best results.

K-NN Search

K-Nearest Neighbors (K-NN) search finds records with the most similar embeddings. We use approximate nearest neighbor (ANN) search for scalability.

HNSW Index

Hierarchical Navigable Small World (HNSW) is a graph-based ANN algorithm that provides near-perfect recall with logarithmic search time.

Time complexity: O(log n)

Recall: >99% at typical settings

pgvector Extension

We use PostgreSQL's pgvector extension for native vector storage and similarity search without external dependencies.

Index type: HNSW with cosine distance

Tested to 10M+ records

Search Parameters

k (neighbors) = 5: number of neighbors to find per record
Min. similarity = 0.85: threshold below which matches are discarded
Batch size = 50: records processed in parallel

Hybrid Matching

AI-only matching can surface false positives. We combine K-NN results with blocking rules for precision.

K-NN (Recall)

Casts a wide net using semantic similarity. Catches typos, abbreviations, and alternate formats.

Blocking Rules (Precision)

Filters candidates using exact-match or rule-based conditions (same domain, same phone, etc.).
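A sketch of how blocking rules might gate K-NN candidates; the function names and the specific rule are illustrative, not True Record's actual implementation:

```python
def _domain(email: str) -> str:
    """Lowercased domain part of an email, or "" if malformed."""
    return email.rsplit("@", 1)[-1].lower() if "@" in email else ""

def same_email_domain(a: dict, b: dict) -> bool:
    """Blocking rule: both records share a non-empty email domain."""
    da, db = _domain(a.get("Email") or ""), _domain(b.get("Email") or "")
    return bool(da) and da == db

def filter_candidates(record, neighbors, min_similarity=0.85,
                      rules=(same_email_domain,)):
    """Keep K-NN neighbors above the similarity threshold that also pass
    at least one blocking rule."""
    return [(cand, sim) for cand, sim in neighbors
            if sim >= min_similarity and any(r(record, cand) for r in rules)]
```

Semantic search proposes, rules dispose: a near neighbor that fails every blocking rule never reaches scoring.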

Benefits of Hybrid Approach

  • Higher precision than AI-only (fewer false positives)
  • Higher recall than rules-only (catches semantic matches)
  • Tunable balance via confidence thresholds
  • Explainable matches with field-level breakdowns

Confidence Scoring

Each match receives a confidence score from 0-100% based on weighted field comparisons.

Score Calculation

confidence = (weightedFieldScore / totalWeight) × 100

Each field has a configurable weight. The final score is the weighted average of individual field match scores (exact match, fuzzy match, or no match). Cross-object matches receive a 5% penalty to reduce false positives.
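The calculation can be sketched as follows. Two assumptions to flag: the mapping of match types to per-field scores in [0, 1] is illustrative, and whether the 5% cross-object penalty is multiplicative or subtractive is not specified here, so this sketch assumes multiplicative:

```python
def confidence(field_scores: dict[str, float], weights: dict[str, float],
               cross_object: bool = False) -> float:
    """confidence = (weightedFieldScore / totalWeight) * 100.
    Each field score is in [0, 1], e.g. 1.0 exact, ~0.5 fuzzy, 0.0 no match."""
    total_weight = sum(weights.values())
    weighted = sum(weights[f] * field_scores.get(f, 0.0) for f in weights)
    score = weighted / total_weight * 100
    if cross_object:
        score *= 0.95  # 5% cross-object penalty (assumed multiplicative)
    return score
```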

50-69%

Review Carefully

70-89%

Likely Match

90-100%

Very High Confidence

Embedding Cache

Embeddings are expensive to generate. We cache them aggressively to minimize API costs and improve scan speed.

How Caching Works

Each record's embedding is cached with a hash of the input fields. When fields change, a new embedding is generated.
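One way to implement such a cache key (a sketch; the exact hashing scheme is an assumption):

```python
import hashlib

def embedding_cache_key(record: dict, fields: list[str]) -> str:
    """Hash the field values that feed the embedding, in a stable order.
    If any of these values change, the key changes and a fresh embedding
    is generated; otherwise the cached vector is reused."""
    material = "\x1f".join(str(record.get(f, "")) for f in sorted(fields))
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Sorting the field names makes the key independent of configuration order, and the unit-separator joint avoids collisions between adjacent values.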

Cache Invalidation

Embeddings are invalidated when: (1) source fields are modified in Salesforce, (2) you change which fields are used for matching, or (3) manually via the Settings page.

Configuration

AI matching settings can be tuned per object type.

Go to Dashboard > Select Object > Settings tab > Matching section to configure similarity thresholds, embedding fields, and blocking rules.

Frequently Asked Questions

How accurate is AI matching?

In testing, our hybrid approach achieves 95%+ precision and 98%+ recall on typical CRM data. Accuracy depends on data quality and field selection.

Does AI matching work for non-English data?

Yes. OpenAI's embedding model supports 100+ languages. Matching works across languages, though accuracy is highest for English.

How much does AI matching cost?

AI matching requires AI Credits, which can be purchased separately. We cache embeddings to minimize credit usage—most scans use 90%+ cached embeddings, significantly reducing costs.

Can I disable AI matching and use only rules?

Yes. In Settings → Matching Rules, select 'Rules Only' mode to rely entirely on blocking rules and field comparisons without AI embeddings.

Why do some obvious duplicates have low confidence?

Low confidence usually means the records differ in key fields. Check which fields are used for embedding and consider adding more identifying fields.