We have just started a new data cleansing project. One of the aims of the
project is to identify similar records so that we can eliminate them. The
problem lies in how to identify those similar records.
Consider a table that has the columns: (ID, name â?¦). A record whose name is
â'XYZâ' is similar to â'XYâ'. Moreover â'XYZ Companyâ' should be similar to â'XYZ
Coâ'.
We have to ideas:
1) For each record in the table, get the records that match it based on a
function, candidate function. Then for each match get a percentage that
indicate how close the 2 records are, scoring function.
2) For each record create a function that gives a score to a record, Sum of
its name characters ASCII codes + any other function, items with nearby
scores are considered similar. There could be more than one function; 2
functions means search in 2D, 3 functions means search in 3D.
Can you lead me to ideas/articles/books that show how to identify the
columns that will be included in the scoring functions, or that illustrate a
better way of searching?
Note:
Matreials needn't be targeted at SQL Server, we can do some part in SQL
Server and the other in C#.On 28 Apr, 09:52, Shehab Kamal <ShehabKa...@.discussions.microsoft.com>
wrote:
> We have just started a new data cleansing project. One of the aims of the
> project is to identify similar records so that we can eliminate them. The
> problem lies in how to identify those similar records.
> Consider a table that has the columns: (ID, name ...). A record whose name is
> "XYZ" is similar to "XY". Moreover "XYZ Company" should be similar to "XYZ
> Co".
> We have to ideas:
> 1) For each record in the table, get the records that match it based on a
> function, candidate function. Then for each match get a percentage that
> indicate how close the 2 records are, scoring function.
> 2) For each record create a function that gives a score to a record, Sum of
> its name characters ASCII codes + any other function, items with nearby
> scores are considered similar. There could be more than one function; 2
> functions means search in 2D, 3 functions means search in 3D.
> Can you lead me to ideas/articles/books that show how to identify the
> columns that will be included in the scoring functions, or that illustrate a
> better way of searching?
> Note:
> Matreials needn't be targeted at SQL Server, we can do some part in SQL
> Server and the other in C#.
Data cleansing is supported by Integration Services using the Fuzzy
Lookup feature:
http://msdn.microsoft.com/msdnmag/issues/05/09/sqlserver2005/default.aspx
You'll find that other integration tools offer the same kind of
functionality.
--
David Portas, SQL Server MVP
Whenever possible please post enough code to reproduce your problem.
Including CREATE TABLE and INSERT statements usually helps.
State what version of SQL Server you are using and specify the content
of any error messages.
SQL Server Books Online:
http://msdn2.microsoft.com/library/ms130214(en-US,SQL.90).aspx
--|||The ETI idea is brilliant. Thanks for the link
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment