Using ‘HugeCollections’ To Manage Big Data

Introduction

When writing complex software, things don’t always go as planned. You implement a new feature that works perfectly well locally and in a test environment, but when your code hits the real world everything falls apart. Sadly, that’s something we all have to deal with as software developers. On a recent project for a major telecommunications client we needed to process more than 20 million records every night. That equated to around 5GB of data, yet the environment where our process was running had at most 4GB of memory.

Processing such a vast amount of data brings a lot of challenges with it, especially when you also need to combine it with a few million more records stored in a database. Adding code to retrieve associated information and transform raw data might only take an extra few milliseconds per record. However, when you repeat that operation 20 million times those few milliseconds can easily turn into hours.

So there we were asking ourselves why it was taking so long. Is it an index we forgot to add to the database? A network latency problem? These things can be very hard to pin down.

We needed to think outside the box to get around this one.

Searching for alternatives

When we deployed the software to our staging environment we were confronted with the sad reality – our process took nearly 14 hours to run! In contrast we had estimated it should only take 2 hours. Not good.

We added some instrumentation and analysed the timings for each operation. We quickly discovered that 4 of the 14 hours were spent executing queries for each of the 20 million entries we had.

Our first cut at getting around these horrendous times was to introduce everyone’s favourite in-memory key/value store: Redis. It’s super-fast and reliable, and we had used it in other projects to handle large amounts of data. Redis is also very effective in terms of memory utilisation, so we thought it would do the job nicely for us.

The idea was to have a few preloaded key/value mappings so that we could quickly find the values associated with each record. Another part of the solution was putting the results of some transformations in other maps. We thought this would solve all of our problems, but we were in for another surprise. Once we ran the process again we realised that it was still taking 10 hours. Redis had shaved off 4 hours, but 10 hours was still cripplingly slow.

After some deeper investigation we realised that the overhead of going to Redis over TCP/IP was taking most of that time. What may seem like an insignificant amount of processing time per record (around 2 milliseconds) was actually adding almost 9 hours to the whole run.

Back to the drawing board, we thought to ourselves.

If only we could have all the data and mappings stored in HashMaps within the same JVM. Well, guess what? It turns out that you can. I had attended a MelbJVM meet-up a few months prior, and remembered that one of the presentations of the night was about a library called ‘HugeCollections’, which facilitates the handling of large amounts of data off-heap and comes with phenomenal performance.

‘HugeCollections’

Huge Collections is an open source solution and it’s part of the OpenHFT (High Frequency Trading) project. This library allows concurrent access to a very large set of off-heap key/value pairs. Sounds pretty awesome, right? But how does it actually go in practice?

As I’m sure you already know, the JVM allocates every byte of memory on the heap by default. So how does Huge Collections work around this? Basically, it implements the Map and ConcurrentMap interfaces using sun.misc.Unsafe to allocate memory off-heap.
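
To give a feel for what off-heap allocation looks like, here’s a minimal sketch that uses sun.misc.Unsafe directly. This is not HugeCollections’ actual implementation, just an illustration of the JDK mechanism it builds on (Unsafe is an internal API, so treat this as a curiosity rather than production code):

// Illustrative only: raw off-heap allocation with sun.misc.Unsafe.
// HugeCollections hides this kind of memory management behind the Map API.
import sun.misc.Unsafe;
import java.lang.reflect.Field;

public class OffHeapExample {

    public static void main(String[] args) throws Exception {
        // Unsafe.getUnsafe() is blocked for application code, so grab the
        // singleton instance via reflection.
        Field field = Unsafe.class.getDeclaredField("theUnsafe");
        field.setAccessible(true);
        Unsafe unsafe = (Unsafe) field.get(null);

        long address = unsafe.allocateMemory(8);         // 8 bytes outside the heap
        try {
            unsafe.putLong(address, 42L);                // write a long at that address
            System.out.println(unsafe.getLong(address)); // prints 42
        } finally {
            unsafe.freeMemory(address);                  // no GC here: we free it ourselves
        }
    }
}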

The SharedHashMap interface and its implementations allow maps to be shared between runs through memory-mapped files. It supports concurrency within the same process as well as between multiple processes. You can find more details about concurrency handling and thread safety here.
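
Memory-mapped files themselves are a standard JDK facility. The sketch below is plain java.nio rather than HugeCollections code, and just shows the underlying idea: a file mapped into memory so that processes mapping the same file see the same bytes:

// Illustrative only: mapping a file into memory with java.nio.
// SharedHashMap layers its off-heap hash map structure on top of a mapping like this.
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedFileExample {

    public static void main(String[] args) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile("shared.dat", "rw");
             FileChannel channel = file.getChannel()) {

            // Map the first 1024 bytes of the file into memory.
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_WRITE, 0, 1024);

            buffer.putInt(0, 42);                 // written straight into the mapped region
            System.out.println(buffer.getInt(0)); // prints 42
        }
    }
}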

Stop Press: SharedHashMap has just been deprecated in favour of the new ChronicleMap which supports Java 8. However, the basic principles remain the same.

Since all of this memory is allocated off the heap, we could now avoid one of the biggest issues we had hit in the JVM when allocating large amounts of memory on the heap: garbage collection (GC) hogging resources while the process is running. Sweet!

GC is great because it hides all the hideous complexities associated with managing memory inside the JVM. Thanks to GC, developers no longer need to concentrate on allocating the right amount of memory and deallocating it when it is appropriate to do so. Instead they can focus on writing code to solve the real problems, and in turn become more productive.

However, nothing comes for free. GC has an overhead associated with it – specifically having to traverse a huge number of objects in the heap when trying to determine which objects are eligible for collection. GC can take up to a few seconds to complete in some cases.

Things can get even worse when you are producing more ‘garbage’ than the GC can collect. When this happens the queue of objects waiting to be garbage-collected grows continuously, meaning the GC kicks in more often and takes much longer to run. Eventually the GC gives up and the JVM throws an OutOfMemoryError with a ‘GC overhead limit exceeded’ message.

Fortunately, with Huge Collections we can add, retrieve and remove 100 million entries without triggering a single GC. And all of this can be done with a very low persistence latency – as low as 100 nanoseconds in some cases. Pretty impressive, isn’t it?!

Using this we were able to bring the whole processing time back down to 1.5 hours. That’s an average of 0.24 milliseconds per entry, an improvement in performance by a factor of 8.3. Now we were cooking!

Using SharedHashMap

SharedHashMap allows you to store gigantic maps of key/value entries. The entries are persisted to a file, which allows the maps to be shared between different processes. It also implements the Map interface, so adding and retrieving information is as easy as calling put and get respectively.

Of course, this was very useful to us since it allowed us to cache all our records as if they were sitting in native Maps in the JVM.

Since SharedHashMap uses off-heap memory, we needed to specify the entry size and the maximum number of entries that we were going to add. The file that serves as the memory-mapped storage also needs to be specified at creation time, as do the types of the key and the value. Here’s an example:

...
SharedHashMap<CharSequence, CharSequence> sharedHashMap = new SharedHashMapBuilder()
        .entrySize(64)
        .entries(1000)
        .create(new File("sharedMap.bin"), CharSequence.class, CharSequence.class);
...

This is a very simple case of a SharedHashMap with a maximum of 1000 entries, each up to 64 bytes in size.

If the file already exists we will have access to whatever entries were previously added, and they can be retrieved by calling the get method. This makes the solution ideal for caching information between runs.
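
For instance, if a previous run created sharedMap.bin using the builder shown above, a later run can simply point at the same file and read the cached entries back. A small sketch (the key here is just a placeholder):

...
// Re-opening the same file in a later run: entries stored previously are still available.
SharedHashMap<CharSequence, CharSequence> cachedMap = new SharedHashMapBuilder()
        .entrySize(64)
        .entries(1000)
        .create(new File("sharedMap.bin"), CharSequence.class, CharSequence.class);

CharSequence previouslyStored = cachedMap.get("someKey"); // null if the key was never added
...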

Once you have specified the entry size and the number of entries, the file will be allocated at its full size. If for some reason you need larger entries, or more than 1000 entries, you will need to delete the file and recreate it.

It’s always a good idea to put the initialisation parameters in a properties file so that you don’t need to re-compile and redeploy your program. In our case we have everything in a properties file, and our initialisation code can easily re-populate some of the HashMaps when they are empty. So for us it’s as easy as changing the properties and removing the shared map file.
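
As a rough sketch of that approach (the property names and defaults below are purely illustrative, not taken from our actual codebase), the sizing can be read from a properties file before building the map:

...
// Hypothetical sketch: driving the map sizing from a properties file, so a resize
// only requires editing the properties and deleting the old map file.
Properties props = new Properties();
try (InputStream in = new FileInputStream("sharedmap.properties")) {
    props.load(in);
}

int entrySize = Integer.parseInt(props.getProperty("sharedmap.entrySize", "64"));
int maxEntries = Integer.parseInt(props.getProperty("sharedmap.entries", "1000"));
File mapFile = new File(props.getProperty("sharedmap.file", "sharedMap.bin"));

SharedHashMap<CharSequence, CharSequence> sharedHashMap = new SharedHashMapBuilder()
        .entrySize(entrySize)
        .entries(maxEntries)
        .create(mapFile, CharSequence.class, CharSequence.class);
...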

We can add information to the Map:

...

StringBuilder key = new StringBuilder();
StringBuilder value = new StringBuilder();

for (int i = 0; i < 1000; i++) {
    populateKey(key, i);
    populateValue(value, i);
    sharedHashMap.put(key, value); // StringBuilder implements CharSequence, so it can be used directly
}

...

// Reuses the same StringBuilder for every entry to avoid creating garbage.
private void populateValue(StringBuilder value, int counter) {
    value.setLength(0);
    value.append("Value ");
    value.append(counter);
}

...

private void populateKey(StringBuilder key, int counter) {
    key.setLength(0);
    key.append(counter);
}

And we can also get values using keys:

...
for (int i = 0; i < 1000; i++) {
    populateKey(key, i);
    CharSequence savedValue = sharedHashMap.get(key);
    System.out.println(savedValue);
}

...

You can also insert values of more complex types than those that implement CharSequence (such as String and StringBuilder). You’ll find a good example of how to write your own code to marshall other types here.
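
As a rough, hypothetical illustration of what such a value type might look like (this is not taken from the linked example, and you may still need to write the library-specific marshalling code it describes), a simple custom class could be defined like this:

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

// Hypothetical custom value type. How it gets marshalled into the off-heap map
// depends on the library; the linked example covers the marshalling code itself.
public class Position implements Externalizable {
    private double latitude;
    private double longitude;

    public Position() {
        // Externalizable requires a public no-arg constructor
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeDouble(latitude);
        out.writeDouble(longitude);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        latitude = in.readDouble();
        longitude = in.readDouble();
    }
}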

Finally, after using the SharedHashMap we need to close it by calling the close method:

...
sharedHashMap.close();
...
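
To make sure the map is always closed, even if processing fails part-way through, a simple try/finally around the work (using only the methods shown above) does the job:

...
SharedHashMap<CharSequence, CharSequence> sharedHashMap = new SharedHashMapBuilder()
        .entrySize(64)
        .entries(1000)
        .create(new File("sharedMap.bin"), CharSequence.class, CharSequence.class);
try {
    // ... put and get entries as shown above ...
} finally {
    sharedHashMap.close(); // always release the memory-mapped file
}
...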

Conclusion

HugeCollections gave us a great alternative for processing large amounts of data, allowing us to fetch associated information with very low latency. All of this with the convenience of using Java.

Using this solution we were able to avoid the TCP/IP latency we had hit when using Redis. That latency really made the difference when processing Big Data.
