When analysing data in a Microsoft Excel spreadsheet, there are two things that I find I typically need to do:
- Identify duplicate rows
- Find matches against a lookup table
This article explains how to do these two things in excel using formulas and functions.
You might be thinking, "bah!, I'm a developer why do I need to do anything in Excel?". However I recommend you get your Excel skills up to scratch — there are lot of powerful things you can do with it. Plus all the users out there will be using it, so you might as well know a bit about it.
Identifying Duplicate Rows
Say you have the following list of operators:
| A | |
| 1 | OPRID |
|---|---|
| 2 | TIMMSR |
| 3 | JOHNSP |
| 4 | NIXONL |
| 5 | JOHNSP |
| 6 | BROWNK |
| 7 | SMITHJ |
How do you identify if a column (e.g. operator IDs) has duplicate values? To start with, you can sort the data in that column. This can help you see patterns of repeating data. However to be accurate, you should use COUNTIF function:
=COUNTIF(range, criteria)
The range is your data column and the criteria is the cell. the COUNTIF function returns a count of the number of matches, so if you have more than one in a column (e.g. A:A) then you have a duplicate.
So in the above example, you would create a second column with formulas as shown (duplicated column). Note that I'm showing the formula and the result, but all you would see in Excel is the result in the duplicated column.
| A | B | C | |
| 1 | OPRID | FORMULA | DUPLICATED |
|---|---|---|---|
| 2 | TIMMSR | =COUNTIF(A:A, A1) > 1 | FALSE |
| 3 | JOHNSP | =COUNTIF(A:A, A2) > 1 | TRUE |
| 4 | NIXONL | =COUNTIF(A:A, A3) > 1 | FALSE |
| 5 | JOHNSP | =COUNTIF(A:A, A4) > 1 | TRUE |
| 6 | BROWNK | =COUNTIF(A:A, A5) > 1 | FALSE |
| 7 | SMITHJ | =COUNTIF(A:A, A6) > 1 | FALSE |
Notice that the two JOHNSP rows return true, indicating that they are duplicated. Obviously, in this small example you could just spot the duplicated rows but try doing that with 5000 rows. A filter on the duplicated column for true values would show you just the duplicated rows.
Note that in Microsoft Excel 2007 there is a Remove Duplicates button on the Data ribbon. This is handy if you just want to clear out duplicates, but not so useful if you want to identify the duplicated rows.
Finding Matches against a Lookup Table
Say you have the following list of operator IDs and roles along with a lookup table of roles you want to find in amongst the list:
| A | B | |
| 1 | OPRID | ROLE |
|---|---|---|
| 2 | TIMMSR | PeopleSoft Administrator |
| 3 | TIMMSR | PeopleSoft User |
| 4 | TIMMSR | PeopleTools |
| 5 | TIMMSR | Standard Non-Page Permissions |
| 6 | JOHNSP | Employee |
| 7 | JOHNSP | PeopleSoft User |
| 8 | NIXONL | Employee |
| 9 | NIXONL | Standard Non-Page Permissions |
| 10 | NIXONL | PeopleSoft User |
| 11 | BROWNK | PeopleSoft User |
| 12 | BROWNK | PeopleTools |
| 13 | BROWNK | Standard Non-Page Permissions |
| 14 | SMITHJ | PeopleSoft Administrator |
| 15 | SMITHJ | PeopleSoft User |
| 16 | ||
| 17 | LOOKUP | |
| 18 | Employee | |
| 19 | PeopleTools |
How would you find which of the operator IDs have the roles listed in the lookup table at the bottom (either Employee or PeopleTools)?
In this case, use the MATCH function. Note you can also use the VLOOKUP function, but I find the MATCH function easier.
The syntax of the MATCH function is:
=MATCH(lookup_value, lookup_array, [match_type])
The parameters you need to specify are the lookup value (the cell), your lookup array (lookup table) and the match type. Match types can be 0 (exact match), -1 (less than), and 1 (greater than).
MATCH returns the row number that matched in the lookup table or #N/A if there is no match. Typically you want to return a boolean result like true or false or yes or no, so you need to encapsulate MATCH with the ISNA function which returns true when the result is #N/A and false when it isn't and then wrap that in an IF statement to return your desired result (e.g. NO_MATCH or MATCH):
=IF(ISNA(MATCH(lookup_value, lookup_array, [match_type])), "NO_MATCH", "MATCH")
So for the above example, you would create a MATCH column and use the formulas as shown:
| A | B | C | D | |
| 1 | OPRID | ROLE | FORMULA | MATCH |
|---|---|---|---|---|
| 2 | TIMMSR | PeopleSoft Administrator | =IF(ISNA(MATCH(B2, $A$18:$A$19, 0)), "NO", "YES") | NO |
| 3 | TIMMSR | PeopleSoft User | =IF(ISNA(MATCH(B3, $A$18:$A$19, 0)), "NO", "YES") | NO |
| 4 | TIMMSR | PeopleTools | =IF(ISNA(MATCH(B4, $A$18:$A$19, 0)), "NO", "YES") | YES |
| 5 | TIMMSR | Standard Non-Page Permissions | =IF(ISNA(MATCH(B5, $A18:$A$19, 0)), "NO", "YES") | NO |
| 6 | JOHNSP | Employee | =IF(ISNA(MATCH(B6, $A$18:$A$19, 0)), "NO", "YES") | YES |
| 7 | JOHNSP | PeopleSoft User | =IF(ISNA(MATCH(B7, $A$18:$A$19, 0)), "NO", "YES") | NO |
| 8 | NIXONL | Employee | =IF(ISNA(MATCH(B8, $A$18:$A$19, 0)), "NO", "YES") | YES |
| 9 | NIXONL | Standard Non-Page Permissions | =IF(ISNA(MATCH(B9, $A$18:$A$19, 0)), "NO", "YES") | NO |
| 10 | NIXONL | PeopleSoft User | =IF(ISNA(MATCH(B10, $A$18:$A$19, 0)), "NO", "YES") | NO |
| 11 | BROWNK | PeopleSoft User | =IF(ISNA(MATCH(B11, $A$18:$A$19, 0)), "NO", "YES") | NO |
| 12 | BROWNK | PeopleTools | =IF(ISNA(MATCH(B12, $A$18:A$19, 0)), "NO", "YES") | YES |
| 13 | BROWNK | Standard Non-Page Permissions | =IF(ISNA(MATCH(B13, $A$18:$A$19, 0)), "NO", "YES") | NO |
| 14 | SMITHJ | PeopleSoft Administrator | =IF(ISNA(MATCH(B14, $A$18:$A$19, 0)), "NO", "YES") | NO |
| 15 | SMITHJ | PeopleSoft User | =IF(ISNA(MATCH(B15, $A$18:$A$19, 0)), "NO", "YES") | NO |
| 16 | ||||
| 17 | LOOKUP | |||
| 18 | Employee | |||
| 19 | PeopleTools |
Note the dollar signs ($) around the lookup array ($A$18:$A$19) in the MATCH function. This is important as it prevents the values from auto-adjusting (incrementing) as you perform an auto-fill for all rows in the column. That way the location of the lookup table stays static (cells A18 and A19) as your formulas are adjusted by the auto-fill.
