Another Excel question

Kevy Baby · 04-25-2008, 02:03 PM

Alex taught me how to dedupe in a previous thread - thank you. But now I have a new challenge. I generate a list of names (LName, FName), that I will use to generate an index in a 700 page directory. The program I use finds all instances of each record within the 700 page PDF file.

The problem I get is that some names listed sometimes without a middle name, just a middle initial, or with the whole middle name:

Smith, John
Smith, John A
Smith, John Adam

In this instance, the first (without a middle initial or name), will pick up all instances of all three names since "Smith, John" is a part "Smith, John A" and "Smith, John Adam". The second version also picks up all instances of the third version.

To minimize this, I manually go through and delete (using this example) the second and third option. However, I did that when there were only a couple hundred records. This one has 8,000 records.

Any ideas?

Alex · 04-25-2008, 02:06 PM

Are you confident that every time you have

Smith, John
Smith, John A
Smith, John Adam

that you'll want to delete the last two? Or do you need to evaluate and sometimes Smith, John A is actually a different person from Smith, John Adam so you need to keep both?

Kevy Baby · 04-25-2008, 02:10 PM

Quote:

Originally Posted by Alex

Are you confident that every time you have

Smith, John
Smith, John A
Smith, John Adam

that you'll want to delete the last two? Or do you need to evaluate and sometimes Smith, John A is actually a different person from Smith, John Adam so you need to keep both?

Doesn't matter. When the index is being created, if the first one is on page one, the second on page 2 and so on, I would get Index results showing:

Smith, John... 1, 2, 3
Smith, John A... 2, 3
Smith, John Adam... 3

Even though they may be "different" John Smiths, they would still be combined anyways. And the likelihood of different John Smiths (with the names I am dealing with) is pretty minimal and not worth worrying about.

Alex · 04-25-2008, 02:14 PM

So the index lists for each individual needs to be combined into a single list as well?

Kevy Baby · 04-25-2008, 02:25 PM

Quote:

Originally Posted by Alex

So the index lists for each individual needs to be combined into a single list as well?

I think so (if I understand your question).

To expand on what I do with this. We create the body of this directory using FileMaker Pro and output it to a PDF document. There is no way within FMP (the way we are running this program) to capture the indexing data as we are running the directory. We found a program called Acrobotics PDF Index Creator, which takes a TXT file of the data to be found and processes it against the PDF file to find all instances of each record in the TXT file. I did try adding quote marks around each separate instance ("Smith, John" etc) but then it didn't find anything.

All data is cleaned in Excel before being imported into FMP which makes it simple for me to capture the name data for the TXT file.

That is why I am trying to suppress records in Excel.

Kevy Baby · 04-25-2008, 02:33 PM

FWIW, I just found out that Fox Pro can do this using a "near match" dedupe (I don't know if that is the formal name of the process - it is just what my source told me).

Alex · 04-25-2008, 03:05 PM

Here's a brute force method. I'm sure I'll think of something graceful later when I can't actually test it out at work.

Assume your name list is in Column A and consists of the following five names (starting in cell A1):

Jones, Greg
Smith, John
Smith, John A.
Smith, John Adam
Wilson, Peter

Then, on the assumption that a first and last name match is sufficient for your purposes of assuming that they are the same person:

In cell B1 enter the following formula:

Code:

 =IF(ISERR(LEFT(A1,FIND(" ",A1,1))&LEFT(RIGHT(A1,LEN(A1)-FIND(" ",A1,1)),FIND(" ",RIGHT(A1,LEN(A1)-FIND(" ",A1,1)),1))), LEFT(A1,FIND(" ",A1,1)) & TRIM(RIGHT(A1,FIND(" ",A1,1)-3)),TRIM(LEFT(A1,FIND(" ",A1,1))&LEFT(RIGHT(A1,LEN(A1)-FIND(" ",A1,1)),FIND(" ",RIGHT(A1,LEN(A1)-FIND(" ",A1,1)),1))))

Hopefully that comes out in a cut and pastable way. If you hand type it you are going to screw up an impossible to find parenthesis.

This will result in cell B1 saying "Jones, Greg" which is no change from A1. However, when you copy that formula down the column it will result in this:

Code:

 Jones, Greg            Jones, Greg
Smith, John            Smith, John
Smith, John A.         Smith, John
Smith, John Adam       Smith, John 
Wilson, Peter          Wilson, Peter

So, all of the names that did consist of more than just first and last are now trimmed.

Into cell C2, copy this formula (note, you're leaving C1 blank):

Code:

=IF(B2=B1, "DUP", " ")

Copy that formula down the rest of the C column.

Now what you'll have is:

Code:

  Jones, Greg            Jones, Greg     
 Smith, John            Smith, John
 Smith, John A.         Smith, John      DUP
 Smith, John Adam       Smith, John      DUP
 Wilson, Peter          Wilson, Peter

In other words, each row that has a trimmed name that is identical to the row before is now labelled "DUP". You can now quickly scroll through the list and see the duplicates. If you need to manually combine the index lists for each name this may be fine. If you don't need to do that and just want to delete each duplicate row then do this:

Insert a new column between B and C.

Right click on the "B" at the top of the B column so the entire column is highlighted and press ctrl-c. Right click on the new C column and select "Paste special" from the options list. Select the "Values" radio button from the list of options in the window that results. This will paste the values of the B column into C without the underlying formulas.

Update the formula in what is now column D to reference Column C instead of B. Now you can sort the entire spreadsheet by column D, which will group all the duplicate rows and you can mass select and delete them. Resort by Column A to get them back in alphabetical order and voila.

Very brute force and lacking grace.

Alex · 04-25-2008, 03:08 PM

Here's a brute force method. I'm sure I'll think of something graceful later when I can't actually test it out at work.

Assume your name list is in Column A and consists of the following five names (starting in cell A1):

Jones, Greg
Smith, John
Smith, John A.
Smith, John Adam
Wilson, Peter

Then, on the assumption that a first and last name match is sufficient for your purposes of assuming that they are the same person:

In cell B1 enter the following formula:

Code:

 =IF(ISERR(LEFT(A1,FIND(" ",A1,1))&LEFT(RIGHT(A1,LEN(A1)-FIND(" ",A1,1)),FIND(" ",RIGHT(A1,LEN(A1)-FIND(" ",A1,1)),1))), LEFT(A1,FIND(" ",A1,1)) & TRIM(RIGHT(A1,FIND(" ",A1,1)-3)),TRIM(LEFT(A1,FIND(" ",A1,1))&LEFT(RIGHT(A1,LEN(A1)-FIND(" ",A1,1)),FIND(" ",RIGHT(A1,LEN(A1)-FIND(" ",A1,1)),1))))

Hopefully that comes out in a cut and pastable way. If you hand type it you are going to screw up an impossible to find parenthesis.

This will result in cell B1 saying "Jones, Greg" which is no change from A1. However, when you copy that formula down the column it will result in this:

Code:

Jones, Greg            Jones, Greg
Smith, John            Smith, John
Smith, John A.         Smith, John
Smith, John Adam       Smith, John 
Wilson, Peter          Wilson, Peter

So, all of the names that did consist of more than just first and last are now trimmed.

Into cell C2, copy this formula (note, you're leaving C1 blank):

Code:

=IF(B2=B1, "DUP", " ")

Copy that formula down the rest of the C column.

Now what you'll have is:

Code:

 Jones, Greg            Jones, Greg     
 Smith, John            Smith, John
 Smith, John A.         Smith, John      DUP
 Smith, John Adam       Smith, John      DUP
 Wilson, Peter          Wilson, Peter

In other words, each row that has a trimmed name that is identical to the row before is now labelled "DUP". You can now quickly scroll through the list and see the duplicates. If you need to manually combine the index lists for each name this may be fine. If you don't need to do that and just want to delete each duplicate row then do this:

Insert a new column between B and C.

Right click on the "B" at the top of the B column so the entire column is highlighted and press ctrl-c. Right click on the new C column and select "Paste special" from the options list. Select the "Values" radio button from the list of options in the window that results. This will paste the values of the B column into C without the underlying formulas.

Update the formula in what is now column D to reference Column C instead of B. Now you can sort the entire spreadsheet by column D, which will group all the duplicate rows and you can mass select and delete them. Resort by Column A to get them back in alphabetical order and voila.

Very brute force and lacking grace.

Kevy Baby · 04-25-2008, 03:37 PM

As much as I appreciate the work, I don't think this will work as many of the names include three or four (or more!) names. One actual name (I added an X in the middle of each name only so that a slurp spider won't readily find the name):

ABXDEL LAXTIF, MOXHAMED AHXMED

Alex · 04-25-2008, 04:12 PM

It can be reworked so that it keys off the comma instead of spaces. This would give you "mark as duplicate if everything to the left of the comma plus the first single word after the comma is identical."

But yeah, if you have complexly different name structures it will quickly get very complex for formula creation. I wouldn't be surprised if there are some good purchasable macros or addons but a quick search didn't find any.

The "near match" stuff I found for Excel was all just someone making complex search strings through.

But like I said, I'm sure that some graceful solution will occur to me as soon as I'm on the train home and can't do anything about it.

If you want the alteration for using comma it is:

Code:

=LEFT(A15,FIND(",",A15,1)) & IF(ISERR(TRIM(LEFT(RIGHT(A15,LEN(A15)-FIND(",",A15,1)-1),FIND(" ",RIGHT(A15,LEN(A15)-FIND(",",A15,1)-1),1)))), RIGHT(A15,LEN(A15)-FIND(",",A15,1)-1),TRIM(LEFT(RIGHT(A15,LEN(A15)-FIND(",",A15,1)-1),FIND(" ",RIGHT(A15,LEN(A15)-FIND(",",A15,1)-1),1))))

This will result in

Smith, Tom Jones
Smith Jones, Tom
Smith Jones, Tom John
Smith Jones, Tom I Really Have Too Many Names But Mom Didn't Want To Insult Family Members
Wilson, Tom

Being stripped down to (for comparison as duplicate)

Smith, Tom
Smith Jones, Tom
Smith Jones, Tom
Smith Jones, Tom
Wilson, Tom

So only the third and fourth ones would get listed as a duplicate.

04-25-2008, 02:03 PM	#1
Kevy Baby Chowder Head Join Date: Jan 2005 Location: Yes Posts: 18,500	Another Excel question Alex taught me how to dedupe in a previous thread - thank you. But now I have a new challenge. I generate a list of names (LName, FName), that I will use to generate an index in a 700 page directory. The program I use finds all instances of each record within the 700 page PDF file. The problem I get is that some names listed sometimes without a middle name, just a middle initial, or with the whole middle name: Smith, John Smith, John A Smith, John Adam In this instance, the first (without a middle initial or name), will pick up all instances of all three names since "Smith, John" is a part "Smith, John A" and "Smith, John Adam". The second version also picks up all instances of the third version. To minimize this, I manually go through and delete (using this example) the second and third option. However, I did that when there were only a couple hundred records. This one has 8,000 records. Any ideas? __________________ The thing about quotes on the internet is that you cannot verify their validity. - Abraham Lincoln
	Submit to Quotes

04-25-2008, 02:06 PM	#2
Alex . Join Date: Feb 2005 Posts: 13,354	Are you confident that every time you have Smith, John Smith, John A Smith, John Adam that you'll want to delete the last two? Or do you need to evaluate and sometimes Smith, John A is actually a different person from Smith, John Adam so you need to keep both?
	Submit to Quotes

04-25-2008, 02:14 PM	#4
Alex . Join Date: Feb 2005 Posts: 13,354	So the index lists for each individual needs to be combined into a single list as well?
	Submit to Quotes

04-25-2008, 02:33 PM	#6
Kevy Baby Chowder Head Join Date: Jan 2005 Location: Yes Posts: 18,500	FWIW, I just found out that Fox Pro can do this using a "near match" dedupe (I don't know if that is the formal name of the process - it is just what my source told me). __________________ The thing about quotes on the internet is that you cannot verify their validity. - Abraham Lincoln
	Submit to Quotes

04-25-2008, 03:05 PM	#7
Alex . Join Date: Feb 2005 Posts: 13,354	Here's a brute force method. I'm sure I'll think of something graceful later when I can't actually test it out at work. Assume your name list is in Column A and consists of the following five names (starting in cell A1): Jones, Greg Smith, John Smith, John A. Smith, John Adam Wilson, Peter Then, on the assumption that a first and last name match is sufficient for your purposes of assuming that they are the same person: In cell B1 enter the following formula: Code: =IF(ISERR(LEFT(A1,FIND(" ",A1,1))&LEFT(RIGHT(A1,LEN(A1)-FIND(" ",A1,1)),FIND(" ",RIGHT(A1,LEN(A1)-FIND(" ",A1,1)),1))), LEFT(A1,FIND(" ",A1,1)) & TRIM(RIGHT(A1,FIND(" ",A1,1)-3)),TRIM(LEFT(A1,FIND(" ",A1,1))&LEFT(RIGHT(A1,LEN(A1)-FIND(" ",A1,1)),FIND(" ",RIGHT(A1,LEN(A1)-FIND(" ",A1,1)),1)))) Hopefully that comes out in a cut and pastable way. If you hand type it you are going to screw up an impossible to find parenthesis. This will result in cell B1 saying "Jones, Greg" which is no change from A1. However, when you copy that formula down the column it will result in this: Code: Jones, Greg Jones, Greg Smith, John Smith, John Smith, John A. Smith, John Smith, John Adam Smith, John Wilson, Peter Wilson, Peter So, all of the names that did consist of more than just first and last are now trimmed. Into cell C2, copy this formula (note, you're leaving C1 blank): Code: =IF(B2=B1, "DUP", " ") Copy that formula down the rest of the C column. Now what you'll have is: Code: Jones, Greg Jones, Greg Smith, John Smith, John Smith, John A. Smith, John DUP Smith, John Adam Smith, John DUP Wilson, Peter Wilson, Peter In other words, each row that has a trimmed name that is identical to the row before is now labelled "DUP". You can now quickly scroll through the list and see the duplicates. If you need to manually combine the index lists for each name this may be fine. If you don't need to do that and just want to delete each duplicate row then do this: Insert a new column between B and C. Right click on the "B" at the top of the B column so the entire column is highlighted and press ctrl-c. Right click on the new C column and select "Paste special" from the options list. Select the "Values" radio button from the list of options in the window that results. This will paste the values of the B column into C without the underlying formulas. Update the formula in what is now column D to reference Column C instead of B. Now you can sort the entire spreadsheet by column D, which will group all the duplicate rows and you can mass select and delete them. Resort by Column A to get them back in alphabetical order and voila. Very brute force and lacking grace.
	Submit to Quotes

04-25-2008, 03:08 PM	#8
Alex . Join Date: Feb 2005 Posts: 13,354	Here's a brute force method. I'm sure I'll think of something graceful later when I can't actually test it out at work. Assume your name list is in Column A and consists of the following five names (starting in cell A1): Jones, Greg Smith, John Smith, John A. Smith, John Adam Wilson, Peter Then, on the assumption that a first and last name match is sufficient for your purposes of assuming that they are the same person: In cell B1 enter the following formula: Code: =IF(ISERR(LEFT(A1,FIND(" ",A1,1))&LEFT(RIGHT(A1,LEN(A1)-FIND(" ",A1,1)),FIND(" ",RIGHT(A1,LEN(A1)-FIND(" ",A1,1)),1))), LEFT(A1,FIND(" ",A1,1)) & TRIM(RIGHT(A1,FIND(" ",A1,1)-3)),TRIM(LEFT(A1,FIND(" ",A1,1))&LEFT(RIGHT(A1,LEN(A1)-FIND(" ",A1,1)),FIND(" ",RIGHT(A1,LEN(A1)-FIND(" ",A1,1)),1)))) Hopefully that comes out in a cut and pastable way. If you hand type it you are going to screw up an impossible to find parenthesis. This will result in cell B1 saying "Jones, Greg" which is no change from A1. However, when you copy that formula down the column it will result in this: Code: Jones, Greg Jones, Greg Smith, John Smith, John Smith, John A. Smith, John Smith, John Adam Smith, John Wilson, Peter Wilson, Peter So, all of the names that did consist of more than just first and last are now trimmed. Into cell C2, copy this formula (note, you're leaving C1 blank): Code: =IF(B2=B1, "DUP", " ") Copy that formula down the rest of the C column. Now what you'll have is: Code: Jones, Greg Jones, Greg Smith, John Smith, John Smith, John A. Smith, John DUP Smith, John Adam Smith, John DUP Wilson, Peter Wilson, Peter In other words, each row that has a trimmed name that is identical to the row before is now labelled "DUP". You can now quickly scroll through the list and see the duplicates. If you need to manually combine the index lists for each name this may be fine. If you don't need to do that and just want to delete each duplicate row then do this: Insert a new column between B and C. Right click on the "B" at the top of the B column so the entire column is highlighted and press ctrl-c. Right click on the new C column and select "Paste special" from the options list. Select the "Values" radio button from the list of options in the window that results. This will paste the values of the B column into C without the underlying formulas. Update the formula in what is now column D to reference Column C instead of B. Now you can sort the entire spreadsheet by column D, which will group all the duplicate rows and you can mass select and delete them. Resort by Column A to get them back in alphabetical order and voila. Very brute force and lacking grace.
	Submit to Quotes

04-25-2008, 03:37 PM	#9
Kevy Baby Chowder Head Join Date: Jan 2005 Location: Yes Posts: 18,500	As much as I appreciate the work, I don't think this will work as many of the names include three or four (or more!) names. One actual name (I added an X in the middle of each name only so that a slurp spider won't readily find the name): ABXDEL LAXTIF, MOXHAMED AHXMED __________________ The thing about quotes on the internet is that you cannot verify their validity. - Abraham Lincoln
	Submit to Quotes

04-25-2008, 04:12 PM	#10
Alex . Join Date: Feb 2005 Posts: 13,354	It can be reworked so that it keys off the comma instead of spaces. This would give you "mark as duplicate if everything to the left of the comma plus the first single word after the comma is identical." But yeah, if you have complexly different name structures it will quickly get very complex for formula creation. I wouldn't be surprised if there are some good purchasable macros or addons but a quick search didn't find any. The "near match" stuff I found for Excel was all just someone making complex search strings through. But like I said, I'm sure that some graceful solution will occur to me as soon as I'm on the train home and can't do anything about it. If you want the alteration for using comma it is: Code: =LEFT(A15,FIND(",",A15,1)) & IF(ISERR(TRIM(LEFT(RIGHT(A15,LEN(A15)-FIND(",",A15,1)-1),FIND(" ",RIGHT(A15,LEN(A15)-FIND(",",A15,1)-1),1)))), RIGHT(A15,LEN(A15)-FIND(",",A15,1)-1),TRIM(LEFT(RIGHT(A15,LEN(A15)-FIND(",",A15,1)-1),FIND(" ",RIGHT(A15,LEN(A15)-FIND(",",A15,1)-1),1)))) This will result in Smith, Tom Jones Smith Jones, Tom Smith Jones, Tom John Smith Jones, Tom I Really Have Too Many Names But Mom Didn't Want To Insult Family Members Wilson, Tom Being stripped down to (for comparison as duplicate) Smith, Tom Smith Jones, Tom Smith Jones, Tom Smith Jones, Tom Wilson, Tom So only the third and fourth ones would get listed as a duplicate.
	Submit to Quotes