Moving HashDecrypt to the GPU: bumps and bruises

It’s been a while since I blogged about the HashDecrypt project.
During my absence from the blog, we've been covering some really cool CUDA development topics in the DPS915 class with Chris Szalwinski that will definitely come in handy for the HashDecrypt implementation.

So, as far as progress goes, a few things have happened.

Before starting to port our MD5-cracking application, we wanted to make sure it was working properly and reasonably optimized on the CPU.
After some changes here and there we managed to get a 70% performance increase by improving the algorithm that generates and compares the MD5 hashes.
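For context, the biggest win in a loop like that usually comes from not rebuilding every candidate string from scratch. Here's a minimal sketch of that kind of optimization, incrementing the previous candidate odometer-style over the ASCII range 32 to 127; this is a hypothetical illustration, not our actual code:

#include <stdio.h>

/* Hypothetical sketch: advance "word" to the next candidate over
 * ASCII 32..127, odometer-style, instead of regenerating it.
 * Returns 0 once every position has wrapped, i.e. the space is exhausted. */
int next_candidate(char *word, int len) {
    for (int i = len - 1; i >= 0; --i) {
        if (word[i] < 127) { word[i]++; return 1; }
        word[i] = 32;  /* wrap this position and carry into the next one */
    }
    return 0;
}

int main(void) {
    char word[4] = {32, 32, 32, 0};  /* 3-char candidate, starts at "   " */
    long count = 1;
    while (next_candidate(word, 3))
        count++;
    printf("%ld candidates of length 3\n", count);  /* 96^3 = 884736 */
    return 0;
}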

Having the application running well on the CPU, we started sketching ideas for how to use the GPU to squeeze out more performance. We had, and still have, high hopes that the GPU will improve our app's performance by huge leaps.
However, it's not that simple.

At first we thought all we needed to do was take the application that was running on the CPU, make some small modifications to make it CUDA-compatible, and that would solve all our problems. We were wrong.

For the problem domain we are working on, there are some major areas that need to be thought through carefully when moving the application to the GPU. I'll get to them later; first I want to describe our first idea (there's a code sketch right after the list):

  • Have all the strings generated on the host and copied to the device.
  • Cache the strings in registers on the device's CUDA cores.
  • Generate MD5 hashes for the strings sitting in the registers.
  • Compare the generated hashes with the one we were trying to break.
  • If we found a match, copy it to a pointer we passed to our kernel so we could later identify it back on the host.
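As a rough sketch, the kernel for that design would have looked something like the following, with one fixed-length word assigned per thread. Note that md5_device() is a hypothetical stand-in for a device-side MD5 implementation we would have to supply:

__device__ void md5_device(const char* msg, int len, unsigned char out[16]);  // assumed helper

__global__ void naiveCrack(const char* words, int wordLen, int numWords,
                           const char* hashToBreak, int* matchIdx) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one candidate per thread
    if (i >= numWords) return;

    unsigned char hash[16];
    md5_device(words + i * wordLen, wordLen, hash);  // hash this thread's word

    bool match = true;
    for (int b = 0; b < 16; ++b)
        if (hash[b] != (unsigned char)hashToBreak[b]) { match = false; break; }

    if (match)
        *matchIdx = i;  // report the winning index back to the host
}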

Well, in theory it sounded plausible.
However there was one MAJOR problem with that approach.
For strings of up to 8 characters built from the 96 ASCII values 32 to 127, there are 96^1 + 96^2 + … + 96^8 = 7,289,831,534,994,528 possible combinations.
At 8 bytes per candidate, that means roughly 58 petabytes of data.
Passing all of it back and forth between the device and the host would take forever, especially since memory transfer between the device and the host is one of the biggest bottlenecks.
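If you want to double-check those numbers, a few lines of host code reproduce them:

#include <stdio.h>

int main(void) {
    double total = 0.0, pow96 = 1.0;
    for (int len = 1; len <= 8; ++len) {
        pow96 *= 96.0;      /* 96 usable characters, ASCII 32..127 */
        total += pow96;     /* 96^1 + 96^2 + ... + 96^8 */
    }
    printf("candidates: %.0f\n", total);             /* ~7.2898e+15 */
    printf("petabytes:  %.1f\n", total * 8 / 1e15);  /* ~58.3 PB at 8 bytes each */
    return 0;
}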

With that idea out of the way we started brainstorming other possibilities.

The one that made the most sense was:
Launch a grid of blocks, one block per character in the set we were trying to match; in our case, 96 of them.
Write a separate kernel for each word length, for example:

__global__ void wordLength1(const char* hashToBreak) {
// Generate words and hashes
}

__global__ void wordLength2(const char* hashToBreak) {
// Generate words and hashes
}

__global__ void wordLength3(const char* hashToBreak) {
// Generate words and hashes
}
...
......
.........

Well, but why are we creating different kernels for different word lengths?
Because we want to keep thread divergence to a minimum and memory accesses coalesced.
By launching a grid of 96 blocks, one per character, we can write exactly the loops each word length needs.
This way all warps run the same instructions during every cycle.
Plus we can also cache the candidate string in registers to speed up the hash generation process.
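To make that concrete, here's a hypothetical sketch of what wordLength3 could look like under that scheme, again assuming a device-side md5_device() helper: each block owns the first character, each thread owns the second, and every thread walks an identical loop over the third, keeping the candidate entirely in registers:

#define CHARSET_START 32   // first usable ASCII value
#define CHARSET_SIZE  96   // ASCII 32..127

__device__ void md5_device(const char* msg, int len, unsigned char out[16]);  // assumed helper

// Launched as wordLength3<<<CHARSET_SIZE, CHARSET_SIZE>>>(hashToBreak, result)
__global__ void wordLength3(const char* hashToBreak, char* result) {
    char word[3];                           // candidate stays in registers
    word[0] = CHARSET_START + blockIdx.x;   // one block per first character
    word[1] = CHARSET_START + threadIdx.x;  // one thread per second character

    for (int c = 0; c < CHARSET_SIZE; ++c) {
        word[2] = CHARSET_START + c;        // all threads run the same loop
        unsigned char hash[16];
        md5_device(word, 3, hash);          // hypothetical MD5 helper

        bool match = true;
        for (int b = 0; b < 16; ++b)
            if (hash[b] != (unsigned char)hashToBreak[b]) { match = false; break; }
        if (match)
            for (int b = 0; b < 3; ++b)
                result[b] = word[b];        // hand the answer back to the host
    }
}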

We haven't fully implemented the idea above; we are still testing a few areas to make sure the approach is feasible.

In the end one thing is for sure: GPU programming is definitely not easy, but it sure as hell is a lot of fun.
The goal is to have a working implementation of the HashDecrypt project ready by mid-December.
We plan to create a Node.js web client to interact with the application, with some default fallbacks for when the user's computer doesn't support CUDA, but that's for the future.

In the coming weeks I should be blogging more often about this topic, so stay tuned!
