Moving HashDecrypt to the GPU, bumps and bruises

It’s been a while since I blogged about the HashDecrypt project.
During my absence from the blog we’ve been covering some really cool topics about CUDA development in the DPS915 class with Chris Szalwinski that will definitely come in handy in the implementation of HashDecrypt.

So as far as progress goes a few things happened.

Before starting to port our MD5 decrypt application, we wanted to make sure it was working properly and somewhat optimized on the CPU.
After some changes here and there we managed to get a 70% performance increase by improving the algorithm that generates and compares the MD5 hashes.
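
For illustration, here is a simplified sketch of the kind of improvement involved (not our exact code): instead of rebuilding every candidate string from scratch, treat it as an odometer and increment it in place:

// Advance a candidate in place over the ASCII range 32..127,
// odometer-style, instead of regenerating it every time.
bool nextCandidate(char* word, int len) {
    for (int i = len - 1; i >= 0; i--) {
        if (word[i] < 127) {    // bump this position and stop
            word[i]++;
            return true;
        }
        word[i] = 32;           // wrap around and carry to the left
    }
    return false;               // every combination of this length exhausted
}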

With the application running well on the CPU, we started sketching ideas on how to use the GPU to gain some performance. We had, and still have, high hopes that the GPU will improve the performance of our app by huge leaps.
However, it’s not that simple.

First we thought all we needed to do was to take the application that was running on the CPU, make some small modifications to make it CUDA compatible, and that would solve all our problems. We were wrong.

For the problem domain we are working on, there are some major areas that need to be thought through specifically when moving the application to the GPU. I’ll get to them later; first I want to describe our first idea:

  • Have all the strings generated on the host and copied to the device.
  • Cache the strings in registers on the device’s CUDA cores.
  • Generate MD5 hashes for the strings sitting in the registers.
  • Compare the generated hashes with the one we were trying to break.
  • If we found a match, copy it to a pointer we passed to our kernel so we could later identify it back on the host (a rough sketch follows below).
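
A rough sketch of that pipeline, with illustrative names (generateBatchOnHost and hashAndCompare are placeholders, not our actual functions):

// Idea #1, sketched: generate candidates on the host, ship them to the
// device in batches, and do the hashing and comparing there.
const int BATCH = 1 << 20;                       // candidates per transfer
char* h_words = new char[BATCH * 8];             // 8 bytes per candidate
char* d_words;
cudaMalloc(&d_words, BATCH * 8);
generateBatchOnHost(h_words, BATCH);             // hypothetical host-side generator
cudaMemcpy(d_words, h_words, BATCH * 8, cudaMemcpyHostToDevice);
hashAndCompare<<<96, 96>>>(d_words, hashToBreak, match);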

Well, in theory it sounded plausible.
However, there was one MAJOR problem with that approach.
For strings of up to 8 characters, there are 7.289831534994528e+15 possible combinations using the 96 ASCII values from 32 to 127.
At 8 bytes per candidate, that means roughly 58 petabytes of data.
Passing all this data back and forth between the device and the host would take forever.
Especially since memory transfer between the device and the host is one of the biggest bottlenecks.
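
To put a number on “forever” (assuming a PCIe transfer rate of around 8 GB/s, which is the right ballpark for our hardware):

7.29 × 10^15 candidates × 8 bytes ≈ 5.8 × 10^16 bytes ≈ 58 PB
5.8 × 10^16 bytes ÷ 8 × 10^9 bytes/s ≈ 7.3 × 10^6 seconds ≈ 84 days

And that is just moving the strings, before hashing a single one of them.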

With that idea out of the way we started brainstorming other possibilities.

The one that made the most sense was:
Launch a grid with one block for each character we were trying to cover, in our case 96.
Then write a separate kernel for each word length, for example:

__global__ void wordLength1(const char* hashToBreak) {
// Generate words and hashes
}

__global__ void wordLength2(const char* hashToBreak) {
// Generate words and hashes
}

__global__ void wordLength3(const char* hashToBreak) {
// Generate words and hashes
}
...
......
.........

Well, but why are we creating different kernels for different word lengths?
Because we want to keep thread divergence to a minimum and memory accesses coalesced.
By having a grid of 96 blocks we can create the exact number of loop iterations to match the needs of each word length.
This way all warps run the same instructions during all cycles.
Plus we can also cache the string in registers to speed up the hash generation process.
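
Here is a minimal sketch of what one of those kernels could look like, assuming a device-side MD5 routine (device_md5_matches is a placeholder for whatever we end up writing, not a real API):

// Launched as wordLength3<<<96, 96>>>(hashToBreak, match);
__device__ bool device_md5_matches(const char* word, int len,
                                   const unsigned char* hash); // assumed helper

__global__ void wordLength3(const unsigned char* hashToBreak, char* match) {
    const char FIRST = 32;            // start of the ASCII range
    char word[3];                     // candidate lives in registers
    word[0] = FIRST + blockIdx.x;     // one block per first character
    word[1] = FIRST + threadIdx.x;    // one thread per second character
    for (int i = 0; i < 96; i++) {    // every thread loops exactly 96 times
        word[2] = FIRST + i;
        if (device_md5_matches(word, 3, hashToBreak))
            for (int j = 0; j < 3; j++)
                match[j] = word[j];
    }
}

Every thread executes exactly the same 96 iterations, so the warps never diverge, which is the whole point of splitting the kernels by word length.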

We haven’t fully implemented the idea above; we are still testing a few areas to make sure the approach is feasible.

In the end one thing is for sure: GPU programming is definitely not easy, but it sure as hell is a lot of fun.
The goal is to have a working implementation of the HashDecrypt project ready by mid December.
We plan to create a Node.js web client to interact with the application and have some default fallbacks for when the user’s computer doesn’t have support for CUDA, but that’s for the future.

In the coming weeks I should be blogging more often about this topic so stay tuned!

Time to select a project, profiling ffmpeg2theora

Time has come to choose a project for the DPS915 CUDA programming course.

After looking online, a project that caught my eye was ffmpeg2theora.

ffmpeg2theora is built on top of the ffmpeg project and its goal is to provide a command line interface to convert videos to the Theora format wrapped in an Ogg container.

My idea for the project is to add GPU optimization support to the converter, specifically using the CUDA API for Nvidia graphics cards.
At the moment it is not clear how, or if it’s even possible, to do that since the converter itself has a lot of dependencies. Talking with some developers in the #vorbis channel, I was told that the optimizations would have to be done in libtheora, and a big chunk of that library is already written in assembly for performance reasons.

So for now I’m trying to gather as much information as possible.

To get an idea of the project I decided to build it and play around with the converter.

As expected, building ffmpeg2theora from source requires a bunch of dependencies.
The developers created two scripts that make the process easy.
One script clones the latest stable release of ffmpeg from their git repository and builds it, and the other does the same thing for libkate.

Besides installing ffmpeg and libkate I also needed to install

  • libvorbis
  • libogg

On Ubuntu I also had to install

  • yasm
  • gawk

The Build system

For the build system they use SCons.
SCons is a software construction tool implemented in Python; it is a replacement for the famous make.

I have to say that at first I was kind of skeptical, but after reading their user docs and hacking around some scripts I fell in love immediately.
SCons doesn’t try to solve all the problems in the world; it takes a very pragmatic approach towards build tools and has some info to back it up.

Here is the SCons script used in the ffmpeg2theora project:

# SCons build specification
# vi:si:et:sw=2:sts=2:ts=2
from glob import glob
import os

import SCons

def version():
    f = os.popen("./version.sh")
    version = f.read().strip()
    f.close()
    return version

pkg_version="0.29"

pkg_name="ffmpeg2theora"

scons_version=(1,2,0)

try:
    EnsureSConsVersion(*scons_version)
except TypeError:
    print 'SCons %d.%d.%d or greater is required, but you have an older version' % scons_version
    Exit(2)

opts = Variables()
opts.AddVariables(
  BoolVariable('static', 'Set to 1 for static linking', 0),
  BoolVariable('debug', 'Set to 1 to enable debugging', 0),
  BoolVariable('build_ffmpeg', 'Set to 1 to build local copy of ffmpeg', 0),
  ('prefix', 'install files in', '/usr/local'),
  ('bindir', 'user executables', 'PREFIX/bin'),
  ('mandir', 'man documentation', 'PREFIX/man'),
  ('destdir', 'extra install time prefix', ''),
  ('APPEND_CCFLAGS', 'Additional C/C++ compiler flags'),
  ('APPEND_LINKFLAGS', 'Additional linker flags'),
  BoolVariable('libkate', 'enable libkate support', 1),
  BoolVariable('crossmingw', 'Set to 1 for crosscompile with mingw', 0)
)
env = Environment(options = opts)
Help(opts.GenerateHelpText(env))

pkg_flags="--cflags --libs"
if env['static']:
  pkg_flags+=" --static"
  env.Append(LINKFLAGS=["-static"])

if env['crossmingw']:
    env.Tool('crossmingw', toolpath = ['scons-tools'])

prefix = env['prefix']
if env['destdir']:
  if prefix.startswith('/'): prefix = prefix[1:]
  prefix = os.path.join(env['destdir'], prefix)
man_dir = env['mandir'].replace('PREFIX', prefix)
bin_dir = env['bindir'].replace('PREFIX', prefix)

env.Append(CPPPATH=['.'])
env.Append(CCFLAGS=[
  '-DPACKAGE_VERSION=\\"%s\\"' % pkg_version,
  '-DPACKAGE_STRING=\\"%s-%s\\"' % (pkg_name, pkg_version),
  '-DPACKAGE=\\"%s\\"' % pkg_name,
  '-D_FILE_OFFSET_BITS=64'
])

env.Append(CCFLAGS = Split('$APPEND_CCFLAGS'))
env.Append(LINKFLAGS = Split('$APPEND_LINKFLAGS'))

if env['debug'] and env['CC'] == 'gcc':
  env.Append(CCFLAGS=["-g", "-O2", "-Wall"])

if GetOption("help"):
    Return()

def ParsePKGConfig(env, name):
  if os.environ.get('PKG_CONFIG_PATH', ''):
    action = 'PKG_CONFIG_PATH=%s pkg-config %s "%s"' % (os.environ['PKG_CONFIG_PATH'], pkg_flags, name)
  else:
    action = 'pkg-config %s "%s"' % (pkg_flags, name)
  return env.ParseConfig(action)

def TryAction(action):
    import os
    ret = os.system(action)
    if ret == 0:
        return (1, '')
    return (0, '')

def CheckPKGConfig(context, version):
  context.Message( 'Checking for pkg-config... ' )
  ret = TryAction('pkg-config --atleast-pkgconfig-version=%s' % version)[0]
  context.Result( ret )
  return ret

def CheckPKG(context, name):
  context.Message( 'Checking for %s... ' % name )
  if os.environ.get('PKG_CONFIG_PATH', ''):
    action = 'PKG_CONFIG_PATH=%s pkg-config --exists "%s"' % (os.environ['PKG_CONFIG_PATH'], name)
  else:
    action = 'pkg-config --exists "%s"' % name
  ret = TryAction(action)[0]
  context.Result( ret )
  return ret

env.PrependENVPath ('PATH', os.environ['PATH'])

conf = Configure(env, custom_tests = {
  'CheckPKGConfig' : CheckPKGConfig,
  'CheckPKG' : CheckPKG,
})

if env["build_ffmpeg"]:
  if env.GetOption('clean'):
    TryAction("cd ffmpeg;make distclean")
  else:
    TryAction("./build_ffmpeg.sh")

if not env.GetOption('clean'):
  pkgconfig_version='0.15.0'
  if not conf.CheckPKGConfig(pkgconfig_version):
     print 'pkg-config >= %s not found.' % pkgconfig_version
     Exit(1)

  if not conf.CheckPKG("ogg >= 1.1"):
    print 'ogg >= 1.1 missing'
    Exit(1)

  if not conf.CheckPKG("vorbis"):
    print 'vorbis missing'
    Exit(1)

  if not conf.CheckPKG("vorbisenc"):
    print 'vorbisenc missing'
    Exit(1)

  if not conf.CheckPKG("theoraenc >= 1.1.0"):
    print 'theoraenc >= 1.1.0 missing'
    Exit(1)

  XIPH_LIBS="ogg >= 1.1 vorbis vorbisenc theoraenc >= 1.1.0"

  if not conf.CheckPKG(XIPH_LIBS):
    print 'some xiph libs are missing, ffmpeg2theora depends on %s' % XIPH_LIBS
    Exit(1)
  ParsePKGConfig(env, XIPH_LIBS)

  FFMPEG_LIBS=[
      "libavdevice",
      "libavformat",
      "libavfilter",
      "libavcodec >= 52.30.0",
      "libpostproc",
      "libswscale",
      "libswresample",
      "libavutil",
  ]
  if os.path.exists("./ffmpeg"):
    pkg_path = list(set(map(os.path.dirname, glob('./ffmpeg/*/*.pc'))))
    pkg_path.append(os.environ.get('PKG_CONFIG_PATH', ''))
    os.environ['PKG_CONFIG_PATH'] = ':'.join(pkg_path)
    env.Append(CCFLAGS=[
      '-Iffmpeg'
    ])

  if not conf.CheckPKG(' '.join(FFMPEG_LIBS)):
    print """
        Could not find %s.
        You can install it via
         sudo apt-get install %s
        or update PKG_CONFIG_PATH to point to ffmpeg's source folder
        or run ./get_ffmpeg.sh (for more information see INSTALL)
    """ %(" ".join(FFMPEG_LIBS), " ".join(["%s-dev"%l.split()[0] for l in FFMPEG_LIBS]))
    Exit(1)

  for lib in FFMPEG_LIBS:
      ParsePKGConfig(env, lib)

  if conf.CheckCHeader('libavformat/framehook.h'):
      env.Append(CCFLAGS=[
        '-DHAVE_FRAMEHOOK'
      ])

  KATE_LIBS="oggkate"
  if env['libkate']:
    if os.path.exists("./libkate/misc/pkgconfig"):
      os.environ['PKG_CONFIG_PATH'] = "./libkate/misc/pkgconfig:" + os.environ.get('PKG_CONFIG_PATH', '')
    if os.path.exists("./libkate/pkg/pkgconfig"):
      os.environ['PKG_CONFIG_PATH'] = "./libkate/pkg/pkgconfig:" + os.environ.get('PKG_CONFIG_PATH', '')
    if conf.CheckPKG(KATE_LIBS):
      ParsePKGConfig(env, KATE_LIBS)
      env.Append(CCFLAGS=['-DHAVE_KATE', '-DHAVE_OGGKATE'])
    else:
      print """
          Could not find libkate. Subtitles support will be disabled.
          You can also run ./get_libkate.sh (for more information see INSTALL)
          or update PKG_CONFIG_PATH to point to libkate's source folder
      """

  if conf.CheckCHeader('iconv.h'):
      env.Append(CCFLAGS=[
        '-DHAVE_ICONV'
      ])
      if conf.CheckLib('iconv'):
          env.Append(LIBS=['iconv'])

  if env['crossmingw']:
      env.Append(CCFLAGS=['-Wl,-subsystem,windows'])
      env.Append(LIBS=['m'])
  elif env['static']:
      env.Append(LIBS=['m', 'dl'])

# Flags for profiling
env.Append(CCFLAGS=['-pg'])
env.Append(CCFLAGS=['-g'])
env.Append(CCFLAGS=['-DDEBUG'])
env.Append(LINKFLAGS=['-pg'])

env = conf.Finish()

# ffmpeg2theora
ffmpeg2theora = env.Clone()
ffmpeg2theora_sources = glob('src/*.c')
ffmpeg2theora.Program('ffmpeg2theora', ffmpeg2theora_sources)

ffmpeg2theora.Install(bin_dir, 'ffmpeg2theora')
ffmpeg2theora.Install(man_dir + "/man1", 'ffmpeg2theora.1')
ffmpeg2theora.Alias('install', prefix)

The script just sets some configuration for the build and checks for some dependencies.
I added some extra flags because I wanted to generate a profile of the application:

env.Append(CCFLAGS=['-pg'])
env.Append(CCFLAGS=['-g'])
env.Append(CCFLAGS=['-DDEBUG'])
env.Append(LINKFLAGS=['-pg'])
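
The -pg flag makes the compiled binary dump profiling data to a gmon.out file every time it runs, and gprof can then read that file. For example, from the build directory:

gprof -p ./ffmpeg2theora gmon.out

The -p option prints the flat profile, showing how much time was spent in each function.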

Summarizing the steps to build ffmpeg2theora:

Download the source code

Run:

sudo ./get_ffmpeg.sh
sudo ./get_libkate.sh
sudo scons
sudo scons install

**If you need to install any other dependencies, the configure scripts will output them to the terminal
**On the Mac I had some problems running “sudo scons”: the pkg-config path would get corrupted and the build would fail. Logging into the shell as root and sourcing the environment variables of my profile solved the problem (I didn’t have this issue on Ubuntu)
**If you don’t run the get_ffmpeg script as root, the libraries won’t be installed in the system and the build will fail during the linking stage

Profiling

The next step was to generate a profile of the program and see which areas of the application were consuming most of the CPU time.
I used the Instruments Time Profiler to create a profile of the application.
I have previously blogged about how to use the Instruments Time Profiler on the Mac.

instruments -t "/Applications/Xcode.app/Contents/Applications/Instruments.app/Contents/Resources/templates/Time Profiler.tracetemplate" ./ffmpeg2theora myvideo.mp4

and the profile information was generated.

Looking at the profile gave me a better idea of how the converter works, but I still need to run the converter with a larger video to see where the heavy processing takes place.

What’s next?

This is a very intimidating project considering that I’m not very familiar with video encoding or CUDA programming, but what better way to learn something than by doing it? 🙂
I still remember taking the Topics in Open Source Development course last year with David Humphrey here at Seneca College and how we started hacking on Firefox. At the beginning it was very hard and overwhelming, but after a while the beast didn’t look as scary as before. That proved to me that as long as you put the time into something, you will get results no matter what. In the end, hard work does pay off, indeed.

With that being said, I’m a little scared about diving into an area that I don’t know much about and trying to implement something new, but at the same time I welcome the challenge and will try to learn as much as I can during the process.
Video Processing and GPU Programming are two topics that interest me so I’m sure it will be a lot of fun 🙂


Building GNU Scientific Library (GSL) from source

In the DPS915 workshop 2 we started working with the GNU Scientific Library (GSL) to perform some matrix operations.

GSL includes an implementation of the Basic Linear Algebra Subprograms (BLAS).

By default, the GSL libraries don’t come preinstalled on OS X, so if you want to use them you have to install them yourself.

Download

You can download the latest stable release from here.
At this time, the latest stable one is version 1.15
gsl-1.15.tar.gz

Install

Unzip the files and navigate to the folder.
The steps to install:

Check if all the dev dependencies are installed

./configure

Build the libraries.

time make -j12 > /dev/null

  • time will output how long the build process took
  • -j lets you specify how many jobs to run in parallel during the build
  • > /dev/null redirects the build output to /dev/null, which makes the build a bit faster

Finally you need to install the libraries

sudo make install

The header files are installed in:
/usr/local/include/gsl

And the libraries in:
/usr/local/lib

Test it

To check if the gsl libs are working you can try to run a sample program:

 // matMult.cpp

 #include <iostream>
 #include <iomanip>
 #include <cstdlib>
 #include <ctime>
 extern "C" {
 #include <gsl/gsl_cblas.h>
 }
 using namespace std;
 #define WIDTH 5

 int main(int argc, char** argv) {
     if (argc != 2)
         cerr << "** invalid number of arguments**" << endl;
     else {
         int n = atoi(argv[1]);
         float* a = new float[n * n];
         float* b = new float[n * n];
         float* c = new float[n * n];
         int k = 0;
         srand(time(nullptr));
         double f = 1.0 / RAND_MAX;
         for (int i = 0; i < n; i++)
             for (int j = 0; j < n; j++)
                 a[k++] = rand() * f;
         k = 0;
         for (int i = 0; i < n; i++)
             for (int j = 0; j < n; j++)
                 b[k++] = rand() * f;

         // compute matrix product
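         // cblas_sgemm computes c = alpha*a*b + beta*c; all three matrices
         // here are n x n and row-major, with alpha = 1.0 and beta = 0.0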
         cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
          n, n, n, 1.0, a, n, b, n, 0.0, c, n);

         // output result
         k = 0;
         cout << setprecision(6) << fixed;
         for (int i = 0; i < n; i++) {
             cout << setw(5) << i+1 << ':';
             for (int j = 0; j < n; j++) {
                 cout << setw(10) << c[k++] << ' ';
                 if (j % WIDTH == WIDTH - 1)
                     cout << endl << setw(5) << i+1 << ':';
             }
             cout << endl;
         }
         cout << endl;
         delete [] a;
         delete [] b;
         delete [] c;
     }
 }

To compile:

g++ -std=c++0x -Wall matMult.cpp -o matMult -lgslcblas

-std=c++0x enables c++0x support

To run the program:

./matMult 2

The first argument is the size of the matrix you want to generate.

If you see a 2 by 2 matrix then it means it worked 🙂

**The code sample was taken from https://scs.senecac.on.ca/~gpu610/pages/content/lines.html


DPS915 Workshop 1 – Initial Profile

In the first workshop for the DPS915 course (Parallel Programming Fundamentals) we had to profile a simple application.
I wrote a previous blog post listing the steps to profile an application on OS X.

The application we had to profile was:

// Profile a Serial Application - Workshop 1
 // w1.cpp

 #include <iostream>
 #include <iomanip>
 #include <cstdlib>
 #include <ctime>
 using namespace std;

 void init(float** a, int n) {
     float f = 1.0f / RAND_MAX;
     for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++)
             a[i][j] = rand() * f;
 }

 void add(float** a, float** b, float** c, int n) {
     for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++)
             c[i][j] = a[i][j] + 3.0f * b[i][j];
 }

 void multiply(float** a, float** b, float** c, int n) {
     for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++) {
             float sum = 0.0f;
             for (int k = 0; k < n; k++)
                 sum += a[i][k] * b[k][j];
             c[i][j] = sum;
         }
 }

 int main(int argc, char* argv[]) {
     // start timing
     time_t ts, te;
     ts = time(nullptr);

     // interpret command-line arguments
     if (argc != 3) {
         cerr << "**invalid number of arguments**" << endl;
         return 1;
     }
     int n  = atoi(argv[1]);   // size of matrices
     int nr = atoi(argv[2]);   // number of runs

     float** a = new float*[n];
     for (int i = 0; i < n; i++)
        a[i] = new float[n];
     float** b = new float*[n];
     for (int i = 0; i < n; i++)
        b[i] = new float[n];
     float** c = new float*[n];
     for (int i = 0; i < n; i++)
        c[i] = new float[n];
     srand(time(nullptr));
     init(a, n);
     init(b, n);

     for (int i = 0; i < nr; i++) {
         add(a, b, c, n);
         multiply(a, b, c, n);
     }

     for (int i = 0; i < n; i++)
        delete [] a[i];
     delete [] a;
     for (int i = 0; i < n; i++)
        delete [] b[i];
     delete [] b;
     for (int i = 0; i < n; i++)
        delete [] c[i];
     delete [] c;

     // elapsed time
     te = time(nullptr);
     cout << setprecision(0);
     cout << "Elapsed time : " << difftime(te, ts) << endl;
 }

We had to run the application with 12 different combinations to see how much time the program spent executing the “add” and “multiply” functions.
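
Before looking at the numbers, a quick sanity check on what to expect: multiply does n³ multiply-adds per run while add only does n² additions, so the ratio between them is a factor of n. For the largest case (n = 320, nr = 50):

multiply: 320³ × 50 ≈ 1.6 × 10⁹ operations
add: 320² × 50 ≈ 5.1 × 10⁶ operations

So multiply should dominate the profile, and every doubling of n means 8× more work for multiply but only 4× for add.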

Here are the profile results:

To ease the process of generating the profile data, I created a bash script to automate the runs:

#!/bin/bash

# First Set
N[0]=80
NR[0]=50

N[1]=160
NR[1]=50

N[2]=320
NR[2]=50


# Second Set
N[3]=80
NR[3]=100

N[4]=160
NR[4]=100

N[5]=320
NR[5]=100


# Third Set
N[6]=80
NR[6]=200

N[7]=160
NR[7]=200

N[8]=320
NR[8]=200


# Fourth Set
N[9]=80
NR[9]=400

N[10]=160
NR[10]=400

N[11]=320
NR[11]=400


if [ $(uname) = "Darwin" ]
then
  OS="mac"
  CC="g++-4.7"
else
  OS="linux"
  CC="g++"
fi

echo "OS $OS"

OPTIONS="-std=c++0x -O2 -g -pg"
OBJ="w1"
SRC="w1.cpp"

INSTRUMENT_TEMPLATE="/Applications/Xcode.app/Contents/Applications/Instruments.app/Contents/Resources/templates/Time Profiler.tracetemplate"
#compile workshop
$CC $OPTIONS -o $OBJ $SRC

#generate profile info
for i in {0..11}
do
  echo "Running ${i}th set"
  if [ $OS = "mac" ]
  then
    echo "Running on MacOS"
    instruments -t "$INSTRUMENT_TEMPLATE" -D results/mac/"${N[$i]}x${NR[$i]}.log" $OBJ ${N[$i]} ${NR[$i]}
  else
    echo "Running some linux distro."
    ./$OBJ ${N[$i]} ${NR[$i]}
    gprof -p $OBJ > "results/linux/${N[$i]}x${NR[$i]}.log"
  fi
done

The script works on both Mac and Linux.
If it’s running on a Mac, it uses the Instruments Time Profiler; on a Linux distro it uses gprof.

I’m committing all my course work to GitHub.

Any suggestions are more than welcome 🙂