Finding Needles With ‘sort’ and ‘uniq’

I had to do this recently, and I thought it would be worth sharing for two reasons:

  1. Someone else may need to do it and find this technique useful
  2. Someone else may know a better way of doing this

Quick ‘n’ dirty explanation: you have two lists. One list is a superset of the other list. You want to identify all of the items that exist *only* in the larger list. Here’s how you do that:

cat small_list >> large_list; sort large_list | uniq -u

Note that ‘uniq -u’ is not the same as ‘sort -u’. The former will display only the lines in the file that occur once. The latter displays all lines in the file, *once*, regardless of how many times they occur in the file.
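To make the difference concrete, here is a small sketch using a made-up file of names (the file and its contents are assumptions for illustration, not anything from my actual setup):

```shell
# Build a small sample file with one repeated line (hypothetical data).
tmpdir=$(mktemp -d)
printf '%s\n' alice bob alice carol > "$tmpdir/names"

# uniq -u: only the lines that occur exactly once in the sorted input.
only_once=$(sort "$tmpdir/names" | uniq -u)
echo "$only_once"

# sort -u: every distinct line, printed a single time.
distinct=$(sort -u "$tmpdir/names")
echo "$distinct"

rm -rf "$tmpdir"
```

The first command should leave out ‘alice’ entirely, because she appears twice; the second should list her once along with everyone else.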

Longer example explanation: I have an LDAP server, and at some point we added an objectclass and associated attribute to every user account. However, new accounts weren’t being *created* with the objectclass and attribute. Eventually I noticed the inconsistency between account objects, and decided I had better get a list of accounts that didn’t have the objectclass and attribute so I could correct the situation. Problem is, you can’t negate a search using the standard ‘ldapsearch’ command line tools. So I can’t ask for all objects where ‘objectclass != myobjectclass’ or something.

What I did was run two ldapsearches: one for all of the objects in that part of the tree, and another for all objects in that part of the tree with the objectclass in place. Of course, the former list is a superset of the latter, so I ran ‘cat subset >> superset; sort superset | uniq -u’. The output is the list of people who do *not* have the objectclass associated with their account entry in the directory server.
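Here is a rough sketch of that last step, with two made-up files standing in for the ldapsearch results (the ‘uid=’ entries are hypothetical sample data). It is a slight variation on the command above: it hands both files to ‘sort’ directly instead of appending, so the superset file is left untouched. It also assumes that neither list contains duplicates of its own, and that the subset really is a subset.

```shell
# Hypothetical stand-ins for the two search results, one account per line.
workdir=$(mktemp -d)
printf '%s\n' uid=alice uid=bob uid=carol uid=dave > "$workdir/superset"
printf '%s\n' uid=alice uid=carol > "$workdir/subset"

# Sort both files together. Entries present in both lists appear twice,
# so uniq -u keeps only the entries found in the superset alone.
missing=$(sort "$workdir/superset" "$workdir/subset" | uniq -u)
echo "$missing"

rm -rf "$workdir"
```

If an entry showed up twice within one list, it would be dropped from the output even if it were missing from the other list, so it may be worth running each list through ‘sort -u’ first.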

  • solenopsis6

    A lot of people don’t know about ‘comm’. And that’s a real shame, ’cause it’s such a useful tool!

    comm exists to compare contents of two files. It has 3 columns available in its output — the lines only in file 1, the lines only in file 2, and the lines in both. I don’t think they even need to be sorted first.

    This will show you lines only in test1.txt:
    comm -23 test1.txt test2.txt

    This will show you lines only in test2.txt:
    comm -13 test1.txt test2.txt

    This will show you lines common to both files:
    comm -12 test1.txt test2.txt

    Enjoy!

  • http://m0j0.wordpress.com/ m0j0

    Wow – that totally rocks! Thanks for posting that!

    By the way – from the comm man page:

    The comm utility reads file1 and file2, which should be sorted lexically,
    and produces three text columns as output: lines only in file1; lines
    only in file2; and lines in both files.

    What’s more, it’s installed on my mac as well as my linux box :-)
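
Putting the man page note together with the comm commands above, a minimal sketch might sort each file first and then compare the sorted copies. The file contents and the ‘.sorted’ names here are made up for illustration:

```shell
# Two unsorted sample files (hypothetical data).
tmp=$(mktemp -d)
printf '%s\n' pear apple fig > "$tmp/test1.txt"
printf '%s\n' fig banana apple > "$tmp/test2.txt"

# comm expects sorted input, so sort each file into a new copy first.
sort "$tmp/test1.txt" > "$tmp/test1.sorted"
sort "$tmp/test2.txt" > "$tmp/test2.sorted"

# -23 hides columns 2 and 3, leaving lines only in the first file.
only_in_1=$(comm -23 "$tmp/test1.sorted" "$tmp/test2.sorted")
# -13 hides columns 1 and 3, leaving lines only in the second file.
only_in_2=$(comm -13 "$tmp/test1.sorted" "$tmp/test2.sorted")
# -12 hides columns 1 and 2, leaving lines found in both files.
in_both=$(comm -12 "$tmp/test1.sorted" "$tmp/test2.sorted")

echo "$only_in_1"
echo "$only_in_2"
echo "$in_both"

rm -rf "$tmp"
```

Unlike the ‘uniq -u’ approach, this works when neither file is a superset of the other, since each direction of the difference gets its own column.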