CRM 2011 and Duplicate Detection
Recently I have worked on a duplicate detection report (you can find some of the results and assets referenced in this article here: CRM 2011 Duplicate Detection Toolkit) and I would like to share some of my findings related to the duplicate detection features in CRM.
I should point out that most of the statements related in this article are not derived from official sources; they are mostly the result of research and reverse engineering. As such I will be grateful if you point out inconsistencies, errors or omissions.
Let’s start from how duplicate detection rules are modelled.
Matchcodes: How are they calculated?
When you create a rule, you need to specify one or more conditions (which are contained in the DuplicateRuleConditionBase table) and some attributes related to the rule, such as its Name, the Base and Matching record type, and whether matching should be Case-insensitive. The rule is stored in the DuplicateRuleBase table.
A new feature in CRM 2011 (which was present only in an embryonic form in CRM 4.0) allows comparing different entities while searching for duplicates. A possible implementation of this feature is to detect duplicates between Leads and Opportunities, to stop a sales rep to enter an opportunity when a matching lead already exists, (since in this case the sales rep should promote the lead into an opportunity rather than creating an opportunity from scratch).
In the remainder of this article I will use the example above, since the simpler case of de-duplication within the same entity can be easily derived by simplification.
I’ve created a rule to de-duplicate Opportunities against Leads on two conditions (first ten characters of Topic and Currency), Case-sensitive:
Let’s see how the rule is stored in the database:
SELECT BaseEntityTypeCode, BaseEntityName, BaseEntityMatchCodeTable, MatchingEntityTypeCode, MatchingEntityName, MatchingEntityMatchCodeTable FROM DuplicateRuleBase
Table 1 – DuplicateRuleBase
Most of the columns have an obvious meaning, with the exception of BaseEntityMatchCodeTable and MatchingEntityMatchCodeTable. These are two tables that are created and filled at the moment of publishing the rule. They contain matchcodes for each object of the relevant types (in this case opportunity and lead) based on the rule conditions defined above. Once the rule is published, CRM will make sure that the matchcode is updated each time an object is created or updated.
Incidentally, this is the reason why it takes a long time to publish a de-duplication rule for an entity of which there are many instances, and why it is a background job instead than a synchronous process. For example, if you have a million accounts, when you publish a new rule for an account CRM will have to create one million new matchcodes.
Let’s have a look on how matchcodes look like for the example above; I have created an opportunity with the Topic “A Very Long Opportunity Topic” and currency USD.
SELECT mc.MatchCode, Opty.Name, Opty.TransactionCurrencyId FROM MatchCodeb5196707a4fc4fb8aa2f8321c1035fcf AS mc INNER JOIN OpportunityBase AS Opty ON mc.ObjectId = Opty.OpportunityId
|A Very Lon•cd7265e3-2fb3-e011-b1af-0800277b84f8||A Very Long Opportunity Topic||CD7265E3-2FB3-E011-B1AF-0800277B84F8|
Table 2 – Case-sensitive Matchcodes
Since the conditions include the first 10 characters of the Topic and the currency code, the matchcode is created by concatenating these two values (using a control character as a separator). Note that because the Case-sensitive flag was set, the matchcode preserves the case. This is how the matchcode would look if the rule were ignoring case:
|a very lon•cd7265e3-2fb3-e011-b1af-0800277b84f8||A Very Long Opportunity Topic||CD7265E3-2FB3-E011-B1AF-0800277B84F8|
Table 3 – Case-insensitive Matchcodes
This mechanism ensures that any opportunity with the same first 10 characters and the same currency code will have exactly the same matchcode, including the case when the Case-sensitive flag is not set and the two opportunity differ only in case, since the matchcode is then always converted to lowercase.
Matchcodes: How are they used to detect duplicates?
Let’s assume that we have published the following rules:
- Rule 1: The rule described above to de-duplicate opportunities against leads
- Rule 2: A rule to de-duplicate opportunities against themselves on Topic and Potential Customer
The process that happens behind the scenes when a user creates or updates an opportunity is then quite straightforward:
- For each rule that involves the opportunity, a matchcode is generated based on the updated attributes for each rule
- The matchcode is then checked against the matchcode table of the matching entity. In this case the matchcode for rule 1 is checked against a matchcode table for leads, while for rule 2 the check is performed against a matchcode table for opportunity.
If a match is found, the opportunity is a duplicate.
Matchcodes: How are they are used in duplicate detection jobs?
Now, let’s run a duplicate detection job against opportunities. The job will fill the DuplicateRecordBase table with detected duplicates. Let’s assume that this is the complete set of data I have in my database:
- Rule 1: The rule described above to de-duplicate opportunities against leads (first 10 characters of Topic and Currency)
- Rule 2: A rule to de-duplicate opportunities against themselves on Topic and Potential Customer
- One lead, with Topic “A Very Long Lead Topic” and currency USD
- Opportunity 1: Topic “A Very Long Opportunity Topic”, Currency USD and Potential Customer “Test Account”
- Opportunity 2: Topic “A Very Long Opportunity Topic”, Currency EUR and Potential Customer “Test Account”
This is where things start to become complicated. Let’s have a look at the DuplicateRecordBase table:
SELECT DuplicateId, BaseRecordId, AsyncOperationId, BaseRecordIdTypeCode, DuplicateRuleId, DuplicateRecordId, BaseRecordIdName, DuplicateRecordIdName, DuplicateRecordIdYomiName, BaseRecordIdYomiName, DuplicateRecordIdTypeCode FROM DuplicateRecordBase
This query returns values only in the first four columns, the rest is, surprisingly, all NULL.
Table 4 – DuplicateRecordBase
Here is some information about the columns of the DuplicateRecordBase table:
|DuplicateId||Yes||This is the primary Key of the table. This is surprising because the usual name convention would suggest DuplicateRecordId as the name of the primary Key, and the name DuplicateId might confuse in thinking that this is a Foreign Key to a table that contains Duplicates. It isn’t.|
|BaseRecordId||Yes||This is the Id of the duplicate record. In this case this is a Foreign Key of the Opportunity table.|
|AsyncOperationId||Yes||This is the Foreign Key of the Asynchronous Job, i.e. the duplicate detection job (points to AsyncOperationBase)|
|BaseRecordIdTypeCode||Yes||The ObjectTypeCode of the base entity (in this case 3 = Opportunity)|
|DuplicateRuleId||No||This is a Foreign Key of the DuplicateRuleBase table, but it is not filled, and I suspect that the reason is that there might be multiple rules that assert that this is a duplicate.|
|DuplicateRecordId||No||This is NOT the Primary Key of this table, despite what the name might suggest. Seems to be always NULL.|
|BaseRecordIdName||No||This is NOT the Primary Attribute of the base object, despite what the name might suggest. Seems to be always NULL.|
|DuplicateRecordIdName||No||This is NOT the Primary Attribute of the duplicate object, despite what the name might suggest. Seems to be always NULL.|
|DuplicateRecordIdYomiName||No||Yomi versions of the above. Seems to be always NULL.|
|BaseRecordIdYomiName||No||Yomi versions of the above. Seems to be always NULL.|
|DuplicateRecordIdTypeCode||No||This is a NOT the ObjectTypeCode of the duplicate object. Seems to be always NULL.|
Table 5 – DuplicateRecordBase Columns
The only useful piece of information that this table contains is the Id of the record (opportunity) that has been detected as a duplicate. What about information about which are the duplicates and which are the rules that apply? Bear with me; we’re getting to the interesting part.
Matchcodes: How can we use them to find duplicates?
Suppose that you want to understand how many duplicates you have in your database and the effectiveness of each de-duplication rule, how would you go about it? The CRM 2011 UI allows checking duplicates only one by one, which clearly does not give you a good picture of the quality of data you have in your database.
To get a clear picture, you need to understand how a “duplicate” (that is, more accurately, a “base record”) is related to its siblings (i.e. the “matching records”).
Let’s start with the DuplicateRecordBase table described above. This table will show you that a particular record has duplicates.
In my cases I have two duplicates:
I also know, looking at the BaseRecordIdTypeCode value of 3, that these are opportunities. The next step is to look at all the active (published) duplicate detection rules that have Opportunities as base entity:
SELECT Name, IsCaseSensitive, BaseEntityMatchCodeTable, MatchingEntityTypeCode, MatchingEntityName, MatchingEntityMatchCodeTable FROM DuplicateRuleBase WHERE StatusCode = 2 AND BaseEntityTypeCode = 3
|Opty to Lead – Topic (10) and Currency||1||dbo.MatchCode348c665ccf3547c39ba5cec9b9c2a555||4||lead||dbo.MatchCode31dadcb269ae4f3898c7c038f2a0860f|
|Opty – Topic and Customer||0||dbo.MatchCodedbf78af156b74bd2b623274e8938fbf5||3||opportunity||dbo.MatchCodedbf78af156b74bd2b623274e8938fbf5|
Table 5 – Active Duplicate Rules for Opportunities
As shown in Table 5, there are two active rules, as expected: Rule 1 and Rule 2 as described above. The important bits of information here are the names of the Matchcode tables involved:
Rule 1 Base Entity Matchcode Table: dbo.MatchCode348c665ccf3547c39ba5cec9b9c2a555 Rule 1 Matching Entity Matchcode Table: dbo.MatchCode31dadcb269ae4f3898c7c038f2a0860f Rule 2 Base Entity Matchcode Table: dbo.MatchCodedbf78af156b74bd2b623274e8938fbf5 Rule 2 Matching Entity Matchcode Table: dbo.MatchCodedbf78af156b74bd2b623274e8938fbf5
Note that Rule 1 involves two matchcode tables: one for opportunity and one for leads, since de-duplication is cross-entity. For Rule 2, the two matchcode table are the same.
To know all the duplicates for a particular opportunity, I need to find the opportunity and its matchcode in the base matchcode table, then lookup all the records in the matching matchcode table that have the same matchcode. Then repeat this for each de-duplication rule.
Let’s see how this works for Rule 2. Let’s focus on the duplicate with the following Id:
And remembering that the base matchcode table for Rule 2 is
SELECT MatchCode FROM MatchCodedbf78af156b74bd2b623274e8938fbf5 WHERE ObjectId = '9EA61913-80BF-E011-9D7A-0800277B84F8'
|a very long opportunity topic•2de762dd-32b3-e011-b1af-0800277b84f8|
Table 5 – Opportunity Matchcode
Then I use the matchcode to query the matching matchcode table (which is the same of the base for Rule 2). When using the matchcode for an inner join, I need to check that the matchcodes are not null, and that they are binary equal (so that independently on the Case sensitivity settings of my database, two matchcodes that differ by case are recognised as different):
SELECT matching.ObjectId FROM MatchCodedbf78af156b74bd2b623274e8938fbf5 AS base INNER JOIN MatchCodedbf78af156b74bd2b623274e8938fbf5 AS matching ON base.MatchCode = matching.MatchCode AND matching.MatchCode IS NOT NULL AND CAST(base.MatchCode AS varbinary(900)) = CAST(matching.MatchCode AS varbinary(900)) WHERE base.ObjectId = '9EA61913-80BF-E011-9D7A-0800277B84F8' AND matching.ObjectId <> '9EA61913-80BF-E011-9D7A-0800277B84F8'
This will give me the matching Opportunities. Note that the last line of the query above ensures that I’m not returning the base record as a duplicate (the base record will of course have the same matchcode as itself).
Rule 1 is just a little more complicated, since it involves two entities. This query will return Ids of Leads:
SELECT matching.ObjectId FROM MatchCode348c665ccf3547c39ba5cec9b9c2a555 AS base INNER JOIN MatchCode31dadcb269ae4f3898c7c038f2a0860f AS matching ON base.MatchCode = matching.MatchCode AND matching.MatchCode IS NOT NULL AND CAST(base.MatchCode AS varbinary(900)) = CAST(matching.MatchCode AS varbinary(900)) WHERE base.ObjectId = '9EA61913-80BF-E011-9D7A-0800277B84F8'
Now I know which the duplicates are, and which are the rules that determine them. Mission accomplished (albeit through quite some winding roads).
As a final thought, I have published a duplication detection toolkit in codeplex (http://crm2011dupdetection.codeplex.com) which contains some stored procedures to facilitate the process described above. Let’s see how the toolkit can provide the same information with just one line of T-SQL:
EXECUTE usp_ag_showdup 'Opportunity'
|C52955A0-E8BE-E011-9D7A-0800277B84F8||3||9EA61913-80BF-E011-9D7A-0800277B84F8||4||45D97FC3-E8BE-E011-9D7A-0800277B84F8||A Very Lon•cd7265e3-2fb3-e011-b1af-0800277b84f8|
|3DA46345-7CBF-E011-9D7A-0800277B84F8||3||EE9836E3-E8BE-E011-9D7A-0800277B84F8||3||9EA61913-80BF-E011-9D7A-0800277B84F8||a very long opportunity topic•2de762dd-32b3-e011-b1af-0800277b84f8|
Table 5 – usp_ag_showdup output
Alberto “Hashing-the-matches” Gemin