java-libpst & pst2gmail

January 26th, 2010

I like Gmail.

I like the huge storage, the reliability (I know they’ve had the occasional issue, but it’s nothing compared to some other services), the interface with the vim-like shortcuts and I simply love the searching capabilities.

When Google finally enabled IMAP support for Gmail I decided it might be worthwhile to move my work email over to it (people may recall my love for IMAP).  This meant I could still use Outlook with IMAP and keep all of its benefits, such as being able to read calendar invitations, RTF-formatted emails and the horrendous HTML generated by clients using Outlook, while also gaining the added features of Gmail.

Unfortunately I had about 4GB of email stored in my old Outlook PST.  I really wanted to migrate this over to Gmail as well to take advantage of those sweet searching features to help me navigate through the vast minefield which is my email archive.

So! I tried drag-dropping the emails from my PST file over to the Gmail IMAP store through Outlook, and started the migration.  Outlook almost instantly hung while each email was uploaded to Gmail at what seemed a glacial pace.  Indeed, I left it for a weekend and managed to get a couple of gigabytes uploaded, but Outlook had completely crashed by the time I walked in on Monday.

So, I thought to myself “there has to be a better way”, and started searching for a migration tool.  Unfortunately I couldn’t find one.  No probs, I thought, I’ll just write my own… only there wasn’t much in the way of libraries that could read PST files.

First I tried libpst, which is a cute little C library.  I hadn’t really done any development in straight C before, so I decided I’d try to make a little ncurses app to read my PST files to test out the library.  I proceeded with moderate levels of success.  Unfortunately I ran into a few issues, the first being my deep-seated and intense hatred for C as a language.  The second and slightly more fatal one was that the library didn’t work particularly well for me; it would crash when reading any emails under the Inbox and it would simply hang when trying to read my larger 4GB PST file.

I then came across libpff.  This is another C library that was built from libpst. By this time however, I had had a gutful of programming in C.  I also suspected it may have some of the scalability issues present in the original libpst.  It was also a little difficult for me to work out how exactly to use the library as C isn’t one of my strong points and there isn’t much in the way of usage / API documentation.

So, I did the only reasonable thing that a self-respecting geek would do: I started to write my own library in Java.  Thankfully, Joachim Metz of Hoffmann Investigations, who was responsible for the vast majority of the work in libpff, also studiously documented the file format as he went, and it looked like there was more than enough there to make a start.

The library (java-libpst):

Okay, so I’ll admit that I was biting off a bit more than I could effectively chew, let alone swallow.  It turns out that PST files are all little-endian, and Java’s I/O classes are, of course, big-endian by default.  Oh, did I mention that PST files are also encrypted by default using a substitution cipher?  But… not completely encrypted… just enough of them to make it difficult.  Oh! All the RTF items are compressed as well.
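To give a feel for the byte-level fiddling involved, here is a minimal sketch of reading a little-endian 32-bit value in Java and of undoing a byte-for-byte substitution cipher.  It is not code from java-libpst: the class and method names are made up for illustration, and the substitution table is a toy one, not the real table that PST files use.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; names and the toy table are assumptions.
class PstByteSketch {

    // Reads an unsigned 32-bit little-endian value using ByteBuffer.
    static long readUInt32LE(byte[] data, int offset) {
        return ByteBuffer.wrap(data, offset, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt() & 0xFFFFFFFFL;
    }

    // The same read done by hand with shifts: lowest byte comes first.
    static long readUInt32LEManual(byte[] data, int offset) {
        return (data[offset] & 0xFFL)
                | (data[offset + 1] & 0xFFL) << 8
                | (data[offset + 2] & 0xFFL) << 16
                | (data[offset + 3] & 0xFFL) << 24;
    }

    // Maps every byte through a 256-entry substitution table.
    static byte[] substitute(byte[] data, byte[] table) {
        byte[] out = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = table[data[i] & 0xFF];
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] raw = {0x78, 0x56, 0x34, 0x12};
        System.out.println(Long.toHexString(readUInt32LE(raw, 0))); // prints 12345678

        // Toy table (NOT the real PST table): shift every byte value up by one.
        byte[] encode = new byte[256];
        byte[] decode = new byte[256];
        for (int i = 0; i < 256; i++) {
            encode[i] = (byte) (i + 1);
            decode[(i + 1) & 0xFF] = (byte) i;
        }
        byte[] scrambled = substitute("hello".getBytes(StandardCharsets.US_ASCII), encode);
        byte[] restored = substitute(scrambled, decode);
        System.out.println(new String(restored, StandardCharsets.US_ASCII)); // prints hello
    }
}
```

Decoding a substitution cipher is just a second table lookup with the inverse table, which is why it is more of a nuisance than real protection.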

There is also the matter of the reasonably important unknown bits of the file format which were not yet documented.  For example, there is this B-Tree in the file format that allows you to quickly locate an item within the file.  Which is great! Only, there didn’t seem to be a way to find out what child items may be attached to a parent item (like a folder).  Oh! You can find out who your parent is easily enough, the id of the parent is sitting right there… but the child items… no sir!  This was a bit of a problem, you see, because as with most tree-structured files, you start from the top and work your way down.

I managed to work around this by reading the entire contents of the B-Tree and creating my own inverted map of parent to child objects.  It turns out that this wasn’t really that bad of a solution and only takes a few seconds to read through them all and create the map even on huge PST files.
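The idea is roughly the following.  This is a simplified sketch rather than the library’s actual code: `NodeEntry` is a hypothetical stand-in for whatever a B-Tree entry really holds, reduced to just an id and a parent id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; NodeEntry is a hypothetical, stripped-down B-Tree entry.
class ChildIndexSketch {

    static final class NodeEntry {
        final long id;
        final long parentId;

        NodeEntry(long id, long parentId) {
            this.id = id;
            this.parentId = parentId;
        }
    }

    // One pass over every entry, grouping item ids under their parent's id.
    static Map<Long, List<Long>> buildChildIndex(List<NodeEntry> entries) {
        Map<Long, List<Long>> children = new HashMap<>();
        for (NodeEntry entry : entries) {
            children.computeIfAbsent(entry.parentId, key -> new ArrayList<>())
                    .add(entry.id);
        }
        return children;
    }

    public static void main(String[] args) {
        // Made-up ids: folder 1 holds items 2 and 3, and item 2 holds item 4.
        List<NodeEntry> entries = Arrays.asList(
                new NodeEntry(2, 1),
                new NodeEntry(3, 1),
                new NodeEntry(4, 2));
        Map<Long, List<Long>> index = buildChildIndex(entries);
        System.out.println(index.get(1L)); // prints [2, 3]
        System.out.println(index.get(2L)); // prints [4]
    }
}
```

Because it is a single linear pass with one hash map insert per entry, the cost grows with the number of items rather than the size of the file, which fits with it only taking a few seconds.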

Frustratingly, soon after I had managed to get this part of the library to work, a new section of the file format was documented which describes “related” nodes, which contain, among other things, a list of all the child items.  Sigh.

I also had a few difficulties understanding some of the finer points of the documentation, like when data tables fall across different blocks with external lookup tables.  There was also a fun point when I started to hit “named items”, which are basically keys that change meaning from file to file.  The identifier that denotes a recipient’s email address may, for example, be a different identifier in a different PST file!  Once again, this is only in some instances (yay for consistency!).
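One way to picture the “named items” problem is as a lookup table that has to be rebuilt for every file.  The sketch below is only an illustration of that idea: the class, the property name and the numeric ids are all invented, and it says nothing about how the mapping is actually stored in a PST file.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; names and ids are invented for the example.
class NamedItemSketch {

    // Per-file table: stable property name -> identifier used in THIS file.
    private final Map<String, Integer> nameToId = new HashMap<>();

    void register(String name, int localId) {
        nameToId.put(name, localId);
    }

    // Returns the file-local identifier, or -1 if this file does not define the name.
    int resolve(String name) {
        return nameToId.getOrDefault(name, -1);
    }

    public static void main(String[] args) {
        NamedItemSketch fileA = new NamedItemSketch();
        NamedItemSketch fileB = new NamedItemSketch();
        // The same logical property ends up under different ids in different files.
        fileA.register("recipientEmail", 0x8010);
        fileB.register("recipientEmail", 0x8042);

        System.out.println(fileA.resolve("recipientEmail") == fileB.resolve("recipientEmail")); // prints false
        System.out.println(fileB.resolve("somethingElse")); // prints -1
    }
}
```

The practical upshot is that code reading such a property cannot hard-code the identifier and has to ask the current file’s table first.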

After much head scratching, reading, and staring at massive amounts of hex-encoded data, I managed to get the library working well enough to navigate the PST structure and extract items including emails, attachments, contacts, calendar items, etc. relatively quickly without crashing too horribly.

The result is available on Google Code here:  http://code.google.com/p/java-libpst/

But my work was not yet done.

The migration app (pst2gmail):

Of course, the real goal was to create an application that could migrate all of your email from your legacy PST file to Gmail.

With the library working, this part was a lot more straightforward, and a mostly working alpha can be downloaded from here: http://code.google.com/p/pst2gmail/

I did have grand plans of having the system migrate your contacts and calendar items as well, which wouldn’t actually be that difficult to do.  However, my personal development focus has once again shifted, and this functionality will probably remain disabled and unimplemented until people start to howl for it.

Until then, please enjoy!  Emails are uploaded through IMAP, flagged messages are passed through as starred items, folders are automatically created as tags, and attachments and all that work.  You can even resume your upload through the generated log files if you need to stop the process midway through!
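For the curious, resuming from a log can be as simple as the sketch below.  This is not pst2gmail’s actual code or log format: it assumes a log with one already-uploaded message id per line, and the IMAP upload itself is replaced by a comment so the example runs on its own.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; the one-id-per-line log format is an assumption.
class ResumeLogSketch {

    // Loads the ids recorded by a previous run (empty if there is no log yet).
    static Set<String> loadDone(Path log) throws IOException {
        Set<String> done = new HashSet<>();
        if (Files.exists(log)) {
            done.addAll(Files.readAllLines(log, StandardCharsets.UTF_8));
        }
        return done;
    }

    // Skips anything already logged, and logs each message right after handling it.
    static List<String> uploadAll(List<String> messageIds, Path log) throws IOException {
        Set<String> done = loadDone(log);
        List<String> uploadedNow = new ArrayList<>();
        for (String id : messageIds) {
            if (done.contains(id)) {
                continue; // handled in an earlier run
            }
            // A real tool would append the message to the IMAP folder here.
            uploadedNow.add(id);
            Files.write(log,
                    (id + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
        return uploadedNow;
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("upload", ".log");
        System.out.println(uploadAll(Arrays.asList("a", "b"), log));      // prints [a, b]
        System.out.println(uploadAll(Arrays.asList("a", "b", "c"), log)); // prints [c]
        Files.delete(log);
    }
}
```

Writing the log entry immediately after each message means that stopping midway loses at most the message that was in flight.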

New PHP MSSQL Driver

August 12th, 2008

Further to my previous post about FreeTDS for PHP on Windows, I’ve just come across a new Microsoft PHP 5 extension that apparently provides compatibility with the latest versions of MSSQL (including the 2008 pre-releases).

I’m yet to check it out, but it looks promising: http://www.microsoft.com/downloads/details.aspx?FamilyID=61bf87e0-d031-466b-b09a-6597c21a2e2a&DisplayLang=en

Hacked

August 11th, 2008

Well, I guess I was one of the many WordPress victims who have suffered at the hands of an unscrupulous blog-spamming bot.

Unfortunately I didn’t keep backups of my posts.  It’s strange, you don’t really count these sorts of things as important, and therefore they were overlooked in my normal backups.  Thankfully, I was saved by the Wayback Machine, which had a copy of my Jasper post which was destroyed.

Google is currently in the process of removing me from their blacklist.  Apparently it can take a bit of time.

There are a few reasons for the lack of posts and for generally not being around.  Firstly, I went through Europe and caught up with Nick, Haley, Havard and Alex, which was awesome.  Secondly, I have been completely swamped at work, and it is a little difficult to be geeky at home when you have had a rather full-on day.

Hopefully this need to update my blog will prompt a bit of a resurgence in posting.