Finding Duplicates with sort and uniq

Imagine this: You have two text files full of information, with one data entry on each line. You want to find out which lines occur in both files. Now, if the files are mostly the same, it’s probably best to use a program called diff. However, if the files are mostly different, you can use this little incantation:

cat file1.txt file2.txt | sort -n | uniq -d

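This will join file1 and file2, sort the joined data numerically (sort -n), and display only the lines that are not unique (uniq -d). One caveat: a line that is repeated within a single file will also be reported, even if it never appears in the other file.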

This came in handy for manipulating electrode files today. Our electrode files just contain lists of node numbers. The simulator gets unhappy when you try to do things with overlapping electrodes, so in this way we were able to remove the offending overlap without too much trouble.
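
If a file might contain repeated lines of its own, one way around that is to de-duplicate each file before joining them. Here is a minimal sketch, assuming two made-up files of node numbers (the sample data and the temporary directory are just for illustration, not our real electrode files):

```shell
# Hypothetical sample data: two small lists of node numbers.
workdir=$(mktemp -d)
printf '%s\n' 12 7 30 7 > "$workdir/file1.txt"   # 7 appears twice, but only in file1
printf '%s\n' 30 45 12 > "$workdir/file2.txt"

# The one-liner as written: also reports 7, which is only duplicated inside file1.
naive=$(cat "$workdir/file1.txt" "$workdir/file2.txt" | sort -n | uniq -d)

# De-duplicate each file first (sort -u), so a line can only repeat
# if it occurs in both files.
shared=$( { sort -u "$workdir/file1.txt"; sort -u "$workdir/file2.txt"; } | sort -n | uniq -d)

# Unquoted expansion on purpose: it collapses the newlines into spaces for display.
echo "naive: " $naive
echo "shared:" $shared

rm -rf "$workdir"
```

With this sample data the first pipeline reports 7, 12, and 30, while the second reports only 12 and 30, which are the nodes that really are in both files.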

Project Management, Priorities, and Office Hours

Recently, a friend emailed me to ask about GTD. He was tasked with a presentation on project management, and had heard of GTD from my writings and from others. I had good and bad news for him.

The good news is that GTD is an excellent system for keeping your tasks organized. The bad news is that it doesn’t do much else. Sure, the GTD books talk about these different altitudes, about taking different views of your goals, projects, hopes, and dreams, but they don’t really offer much insight into what you’re supposed to do at those ‘altitudes’.

On top of this, I’ve had some problems recently with becoming sidetracked. I’ve been getting a lot of questions from people in lab lately, I have some exciting side-projects that I’ve been coaxing along, and I’ve not been hacking away at my most important projects with the necessary zeal to really move them forward. Serendipitously, Readeroo recently sent me to an old bookmark on Slashdot — an excerpted chapter from the acclaimed The Art of Project Management. I can see why they sent the chapter excerpt out — it’s Project Management gold in and of itself.

Here are the points that really grabbed me:

  • Prioritized lists at the goal, ‘feature’ (software-oriented, yes), and task level are the ultimate arbiters of what to do next
  • There are really only two priority levels — necessary (or 1) and everything else (2 through infinity or whatever). Priority 1 must be done. The rest is fluff after priority 1 items are accomplished.
  • Rigorous separation of the prioritized lists into priority 1 and everything else, both at the outset of a project and during any reviews and revamping, is essential.

Between managing the cluster, helping lab members with things, and getting caught up in my own little side-projects, I have not been doing these things. Priority 1 items have been submerged below a sea of other things. Yesterday, inspired by that excerpt, I re-focused. I refined my project lists and drew the all-important dividing line between priority 1 and everything else.

In order to help stick to these priorities, I’m enacting “office hours”. I’ve found myself doing this lately anyway, and it’s been working well. I’m declaring before-lunch time my time. If someone comes to me with an issue (other than “there’s a fire in the server room”) before lunch, my reply is now, “I’ll talk to you about it after lunch.” Since I’m a morning person, and most people in my lab are not, this works pretty smoothly. Most people aren’t here in the morning anyway. This gives me a good 4-5 hours of priority-1 time, without neglecting my “team” duties. Perhaps if I do this long enough, people will naturally come to me after noon all of the time.

As a last side note on The Art of Project Management, it unfortunately no longer seems to be sold directly by Amazon. I have no idea why. Luckily the Hopkins library has it, so I’ll be checking it out soon.

Maintaining My Posting Rate

I had one major New Year’s resolution (though I had resolved it before then): post to my blog on average once per day. This sounds simple, but I don’t just want one post on each day. I’m happy to let my posting muse cycle between wordlessness and logorrhea. Therefore, if I post three things one day, I’m off scot-free for the next two days. In practice, this gets pretty hard to keep track of. I’ve already started to find it difficult 23 days into the year.

However, I was able to remedy this with a little PHP and MySQL code. WordPress (this blogging software) runs on those technologies, and so it was trivial for me to tap into the database and produce this page. It does something very simple. It goes from the beginning of the year to the present day, tallying posts along the way, and dividing by the number of days in the year that have passed. This gives me a total post count and a ratio of posts to days. As you can see if you look at that page, I have been coasting for a little while, but was getting dangerously close to “1”. This post should remedy that, which is kind of cheating, but I’m willing to accept it.
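
The arithmetic itself is tiny. As a rough sketch of just that part (this is shell, not the PHP and MySQL code behind the page; the function name posts_per_day and the post count of 25 are made up for illustration, and the real count would come from the WordPress database):

```shell
# posts_per_day POSTS DAYS: print posts divided by days, to two decimals.
# (Hypothetical helper, not part of the actual PHP page.)
posts_per_day() {
    awk -v posts="$1" -v days="$2" 'BEGIN { printf "%.2f\n", posts / days }'
}

# Days elapsed this year, counting today. date +%j is zero-padded (e.g. 023);
# awk converts it to a plain number when it does the division.
day_of_year=$(date +%j)

# Assumed post count for the year so far.
post_count=25

echo "$post_count posts in $day_of_year days: $(posts_per_day "$post_count" "$day_of_year") per day"
```

A ratio of 1.00 or more means the resolution is being kept; anything below that means it’s time to write something.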

Probably this could be dressed up into a widget or plugin or something, and anyone is welcome to use it to do that, but I have neither the time nor sufficient interest to learn how.
Here’s the PHP code, if you’re interested: postcountphp.gz (1 kB)

ADDENDUM: Updated 2008-01-26 to only count published posts.