Skip to content

CRM 2011 and Duplicate Detection

August 7, 2011

Introduction

Recently I have worked on a duplicate detection report (you can find some of the results and assets referenced in this article here: CRM 2011 Duplicate Detection Toolkit) and I would like to share some of my findings related to the duplicate detection features in CRM.

I should point out that most of the statements related in this article are not derived from official sources; they are mostly the result of research and reverse engineering. As such I will be grateful if you point out inconsistencies, errors or omissions.

Let’s start from how duplicate detection rules are modelled.

Matchcodes: How are they calculated?

When you create a rule, you need to specify one or more conditions (which are contained in the DuplicateRuleConditionBase table) and some attributes related to the rule, such as its Name, the Base and Matching record type, and whether matching should be Case-insensitive. The rule is stored in the DuplicateRuleBase table.

A new feature in CRM 2011 (which was present only in an embryonic form in CRM 4.0) allows comparing different entities while searching for duplicates. A possible implementation of this feature is to detect duplicates between Leads and Opportunities, to stop a sales rep to enter an opportunity when a matching lead already exists, (since in this case the sales rep should promote the lead into an opportunity rather than creating an opportunity from scratch).

In the remainder of this article I will use the example above, since the simpler case of de-duplication within the same entity can be easily derived by simplification.

I’ve created a rule to de-duplicate Opportunities against Leads on two conditions (first ten characters of Topic and Currency), Case-sensitive:

Let’s see how the rule is stored in the database:

SELECT BaseEntityTypeCode, BaseEntityName, BaseEntityMatchCodeTable,
       MatchingEntityTypeCode, MatchingEntityName, MatchingEntityMatchCodeTable
FROM DuplicateRuleBase
BaseEntityTypeCode BaseEntityTypeName BaseEntityMatchCodeTable MatchingEntityTypeCode MatchingEntityTypeName MatchingEntityMatchCodeTable
3 opportunity dbo.MatchCodeb5196707a4fc4fb8aa2f8321c1035fcf 4 lead dbo.MatchCode515f6855f6044941874186839057a129

Table 1 – DuplicateRuleBase

Most of the columns have an obvious meaning, with the exception of BaseEntityMatchCodeTable and MatchingEntityMatchCodeTable. These are two tables that are created and filled at the moment of publishing the rule. They contain matchcodes for each object of the relevant types (in this case opportunity and lead) based on the rule conditions defined above. Once the rule is published, CRM will make sure that the matchcode is updated each time an object is created or updated.

Incidentally, this is the reason why it takes a long time to publish a de-duplication rule for an entity of which there are many instances, and why it is a background job instead than a synchronous process. For example, if you have a million accounts, when you publish a new rule for an account CRM will have to create one million new matchcodes.

Let’s have a look on how matchcodes look like for the example above; I have created an opportunity with the Topic “A Very Long Opportunity Topic” and currency USD.

SELECT mc.MatchCode, Opty.Name, Opty.TransactionCurrencyId
FROM MatchCodeb5196707a4fc4fb8aa2f8321c1035fcf AS mc
INNER JOIN OpportunityBase AS Opty ON mc.ObjectId = Opty.OpportunityId
MatchCode Name TransactionCurrencyId
A Very Lon•cd7265e3-2fb3-e011-b1af-0800277b84f8 A Very Long Opportunity Topic CD7265E3-2FB3-E011-B1AF-0800277B84F8

Table 2 – Case-sensitive Matchcodes

Since the conditions include the first 10 characters of the Topic and the currency code, the matchcode is created by concatenating these two values (using a control character as a separator). Note that because the Case-sensitive flag was set, the matchcode preserves the case. This is how the matchcode would look if the rule were ignoring case:

MatchCode Name TransactionCurrencyId
a very lon•cd7265e3-2fb3-e011-b1af-0800277b84f8 A Very Long Opportunity Topic CD7265E3-2FB3-E011-B1AF-0800277B84F8

Table 3 – Case-insensitive Matchcodes

This mechanism ensures that any opportunity with the same first 10 characters and the same currency code will have exactly the same matchcode, including the case when the Case-sensitive flag is not set and the two opportunity differ only in case, since the matchcode is then always converted to lowercase.

Matchcodes: How are they used to detect duplicates?

Let’s assume that we have published the following rules:

  • Rule 1: The rule described above to de-duplicate opportunities against leads
  • Rule 2: A rule to de-duplicate opportunities against themselves on Topic and Potential Customer

The process that happens behind the scenes when a user creates or updates an opportunity is then quite straightforward:

  • For each rule that involves the opportunity, a matchcode is generated based on the updated attributes for each rule
  • The matchcode is then checked against the matchcode table of the matching entity. In this case the matchcode for rule 1 is checked against a matchcode table for leads, while for rule 2 the check is performed against a matchcode table for opportunity.

If a match is found, the opportunity is a duplicate.

Matchcodes: How are they are used in duplicate detection jobs?

Now, let’s run a duplicate detection job against opportunities. The job will fill the DuplicateRecordBase table with detected duplicates. Let’s assume that this is the complete set of data I have in my database:

  • Rules:
    • Rule 1: The rule described above to de-duplicate opportunities against leads (first 10 characters of Topic and Currency)
    • Rule 2: A rule to de-duplicate opportunities against themselves on Topic and Potential Customer
  • Leads:
    • One lead, with Topic “A Very Long Lead Topic” and currency USD
  • Opportunities:
    • Opportunity 1: Topic “A Very Long Opportunity Topic”, Currency USD and Potential Customer “Test Account”
    • Opportunity 2: Topic “A Very Long Opportunity Topic”, Currency EUR and Potential Customer “Test Account”

This is where things start to become complicated. Let’s have a look at the DuplicateRecordBase table:

SELECT DuplicateId, BaseRecordId, AsyncOperationId, BaseRecordIdTypeCode,
       DuplicateRuleId, DuplicateRecordId, BaseRecordIdName,
       DuplicateRecordIdName, DuplicateRecordIdYomiName, BaseRecordIdYomiName,
       DuplicateRecordIdTypeCode
FROM DuplicateRecordBase

This query returns values only in the first four columns, the rest is, surprisingly, all NULL.

DuplicateId BaseRecordId AsyncOperationId BaseRecordIdTypeCode <…>
9AD478A8-C1ED-4A6E-A933-020F372BBBC2 9EA61913-80BF-E011-9D7A-0800277B84F8 9BCE8A40-84BF-E011-9D7A-0800277B84F8 3 NULL
79D02A66-F362-405D-9A6C-D99B2F20BAE9 EE9836E3-E8BE-E011-9D7A-0800277B84F8 9BCE8A40-84BF-E011-9D7A-0800277B84F8 3 NULL

Table 4 – DuplicateRecordBase

Here is some information about the columns of the DuplicateRecordBase table:

Column Filled? Description
DuplicateId Yes This is the primary Key of the table. This is surprising because the usual name convention would suggest DuplicateRecordId as the name of the primary Key, and the name DuplicateId might confuse in thinking that this is a Foreign Key to a table that contains Duplicates. It isn’t.
BaseRecordId Yes This is the Id of the duplicate record. In this case this is a Foreign Key of the Opportunity table.
AsyncOperationId Yes This is the Foreign Key of the Asynchronous Job, i.e. the duplicate detection job (points to AsyncOperationBase)
BaseRecordIdTypeCode Yes The ObjectTypeCode of the base entity (in this case 3 = Opportunity)
DuplicateRuleId No This is a Foreign Key of the DuplicateRuleBase table, but it is not filled, and I suspect that the reason is that there might be multiple rules that assert that this is a duplicate.
DuplicateRecordId No This is NOT the Primary Key of this table, despite what the name might suggest. Seems to be always NULL.
BaseRecordIdName No This is NOT the Primary Attribute of the base object, despite what the name might suggest. Seems to be always NULL.
DuplicateRecordIdName No This is NOT the Primary Attribute of the duplicate object, despite what the name might suggest. Seems to be always NULL.
DuplicateRecordIdYomiName No Yomi versions of the above. Seems to be always NULL.
BaseRecordIdYomiName No Yomi versions of the above. Seems to be always NULL.
DuplicateRecordIdTypeCode No This is a NOT the ObjectTypeCode of the duplicate object. Seems to be always NULL.

Table 5 – DuplicateRecordBase Columns

The only useful piece of information that this table contains is the Id of the record (opportunity) that has been detected as a duplicate. What about information about which are the duplicates and which are the rules that apply? Bear with me; we’re getting to the interesting part.

Matchcodes: How can we use them to find duplicates?

Suppose that you want to understand how many duplicates you have in your database and the effectiveness of each de-duplication rule, how would you go about it? The CRM 2011 UI allows checking duplicates only one by one, which clearly does not give you a good picture of the quality of data you have in your database.

To get a clear picture, you need to understand how a “duplicate” (that is, more accurately, a “base record”) is related to its siblings (i.e. the “matching records”).

Let’s start with the DuplicateRecordBase table described above. This table will show you that a particular record has duplicates.

In my cases I have two duplicates:

9EA61913-80BF-E011-9D7A-0800277B84F8
EE9836E3-E8BE-E011-9D7A-0800277B84F8

I also know, looking at the BaseRecordIdTypeCode value of 3, that these are opportunities. The next step is to look at all the active (published) duplicate detection rules that have Opportunities as base entity:

SELECT Name, IsCaseSensitive, BaseEntityMatchCodeTable,
       MatchingEntityTypeCode, MatchingEntityName,
       MatchingEntityMatchCodeTable
FROM DuplicateRuleBase
WHERE
       StatusCode = 2
   AND BaseEntityTypeCode = 3
Name IsCaseSensitive BaseEntityMatchCodeTable MatchingEntityTypeCode MatchingEntityName MatchingEntityMatchCodeTable
Opty to Lead – Topic (10) and Currency 1 dbo.MatchCode348c665ccf3547c39ba5cec9b9c2a555 4 lead dbo.MatchCode31dadcb269ae4f3898c7c038f2a0860f
Opty – Topic and Customer 0 dbo.MatchCodedbf78af156b74bd2b623274e8938fbf5 3 opportunity dbo.MatchCodedbf78af156b74bd2b623274e8938fbf5

Table 5 – Active Duplicate Rules for Opportunities

As shown in Table 5, there are two active rules, as expected: Rule 1 and Rule 2 as described above. The important bits of information here are the names of the Matchcode tables involved:

Rule 1 Base Entity Matchcode Table: dbo.MatchCode348c665ccf3547c39ba5cec9b9c2a555
Rule 1 Matching Entity Matchcode Table: dbo.MatchCode31dadcb269ae4f3898c7c038f2a0860f
Rule 2 Base Entity Matchcode Table: dbo.MatchCodedbf78af156b74bd2b623274e8938fbf5
Rule 2 Matching Entity Matchcode Table: dbo.MatchCodedbf78af156b74bd2b623274e8938fbf5

Note that Rule 1 involves two matchcode tables: one for opportunity and one for leads, since de-duplication is cross-entity. For Rule 2, the two matchcode table are the same.

To know all the duplicates for a particular opportunity, I need to find the opportunity and its matchcode in the base matchcode table, then lookup all the records in the matching matchcode table that have the same matchcode. Then repeat this for each de-duplication rule.

Let’s see how this works for Rule 2. Let’s focus on the duplicate with the following Id:

9EA61913-80BF-E011-9D7A-0800277B84F8

And remembering that the base matchcode table for Rule 2 is

dbo.MatchCodedbf78af156b74bd2b623274e8938fbf5

I have:

SELECT MatchCode
FROM MatchCodedbf78af156b74bd2b623274e8938fbf5
WHERE ObjectId = '9EA61913-80BF-E011-9D7A-0800277B84F8'
MatchCode
a very long opportunity topic•2de762dd-32b3-e011-b1af-0800277b84f8

Table 5 – Opportunity Matchcode

Then I use the matchcode to query the matching matchcode table (which is the same of the base for Rule 2). When using the matchcode for an inner join, I need to check that the matchcodes are not null, and that they are binary equal (so that independently on the Case sensitivity settings of my database, two matchcodes that differ by case are recognised as different):

SELECT
    matching.ObjectId
FROM
    MatchCodedbf78af156b74bd2b623274e8938fbf5 AS base INNER JOIN
    MatchCodedbf78af156b74bd2b623274e8938fbf5 AS matching
    ON     base.MatchCode = matching.MatchCode
       AND matching.MatchCode IS NOT NULL
       AND CAST(base.MatchCode AS varbinary(900)) = CAST(matching.MatchCode AS varbinary(900))
WHERE
    base.ObjectId = '9EA61913-80BF-E011-9D7A-0800277B84F8'
AND matching.ObjectId <> '9EA61913-80BF-E011-9D7A-0800277B84F8'

This will give me the matching Opportunities. Note that the last line of the query above ensures that I’m not returning the base record as a duplicate (the base record will of course have the same matchcode as itself).

Rule 1 is just a little more complicated, since it involves two entities. This query will return Ids of Leads:

SELECT
    matching.ObjectId
FROM
    MatchCode348c665ccf3547c39ba5cec9b9c2a555 AS base INNER JOIN
    MatchCode31dadcb269ae4f3898c7c038f2a0860f AS matching
    ON     base.MatchCode = matching.MatchCode
       AND matching.MatchCode IS NOT NULL
       AND CAST(base.MatchCode AS varbinary(900)) = CAST(matching.MatchCode AS varbinary(900))
WHERE
    base.ObjectId = '9EA61913-80BF-E011-9D7A-0800277B84F8'

Now I know which the duplicates are, and which are the rules that determine them. Mission accomplished (albeit through quite some winding roads).

As a final thought, I have published a duplication detection toolkit in codeplex (http://crm2011dupdetection.codeplex.com) which contains some stored procedures to facilitate the process described above. Let’s see how the toolkit can provide the same information with just one line of T-SQL:

EXECUTE usp_ag_showdup 'Opportunity'
RuleId BaseEntityTypeCode BaseObjectId MatchingEntityTypeCode MatchingObjectId Matchcode
C52955A0-E8BE-E011-9D7A-0800277B84F8 3 9EA61913-80BF-E011-9D7A-0800277B84F8 4 45D97FC3-E8BE-E011-9D7A-0800277B84F8 A Very Lon•cd7265e3-2fb3-e011-b1af-0800277b84f8
3DA46345-7CBF-E011-9D7A-0800277B84F8 3 EE9836E3-E8BE-E011-9D7A-0800277B84F8 3 9EA61913-80BF-E011-9D7A-0800277B84F8 a very long opportunity topic•2de762dd-32b3-e011-b1af-0800277b84f8

Table 5 – usp_ag_showdup output

Et voilà!
Alberto “Hashing-the-matches” Gemin

Advertisements
5 Comments leave one →
  1. MzM permalink
    October 22, 2013 15:21

    Thanks, save a lot of my time.
    M.

Trackbacks

  1. CRM 2011 Duplicate Detection Toolkit available in Codeplex « XRM 2011
  2. Custom Workflow Activities by Alberto Gemin « MSCRM Bing'd
  3. Custom Workflow Activities by Alberto Gemin - MSCRM Bing’d – Informative and Technical - CRM Technical Blogs - Microsoft Dynamics Community
  4. Matchcode in CRM 2011 Bulk detection rules | Sanghamitra Samantaray

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: