Framework.DataStorage


Introduction

This article presents a simple, fast, lightweight data storage library. It is specifically aimed at storing very large series of small objects and was developed for application in the financial industry; however it can be applied in any data intensive environment due to its very high performance architecture.


The data storage toolkit is designed to store and extract data directly to and from binary files, and implements common operations such as reading, writing and appending, as well as item search. Both fixed and variable item sizes are supported. Besides a highly optimized file format, the toolkit also features advanced “search by index” techniques that allow quickly locating items in both fixed and variable item size modes. All access to the items is based on their index, so the order of writing is always important.


An alternative approach to data management, with excellent potential

In the modern world of data management, SQL databases dominate; still, there are quite a few scenarios where their general-purpose design becomes a disadvantage because it compromises performance.


The library presented here is an attempt to show an alternative solution that proves to be largely superior in quite a few cases. The Data Storage framework, as it currently stands, is aimed at solving a specific problem: storing large series (millions or billions) of small objects to files, for sequential or index-based access. Because it is so focused on this task, the framework provides a highly optimized solution that can outperform a general data storage solution (like a typical RDBMS, or even an OODBMS) by up to 50 times. It also reduces the development time and complexity of a solution, since all operations are performed directly in the native .NET language, with no need to generate LINQ, SQL, stored procedures or other types of queries.


Feature Overview

Variable and fixed item size formats

Support for both variable and fixed item size.


Consistent, separator based file format

The file format is consistent and resilient: even if errors occur or only part of a file is available, the remaining entries can still be extracted.


Little size-overhead (4B per entry)

The storage space overhead is minimal, at only 4 Bytes per item. If you plan on using very small individual items (say 8 Bytes per item or less) it may be a good idea to group individual items in packs of 128, 256 etc. and store these packs instead. This will reduce data overhead to negligible levels.
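
The effect of packing can be checked with a little arithmetic. The following is a minimal sketch, not part of the library: the class and method names are hypothetical, and the only figure taken from the article is the 4 Byte per-entry overhead.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the Data Storage library.
public static class OverheadEstimate
{
    // Per-entry overhead stated in the article.
    public const int EntryOverheadBytes = 4;

    // Fraction of the stored bytes spent on overhead when `itemsPerPack`
    // items of `itemSizeBytes` each are written as a single entry.
    public static double OverheadFraction(int itemSizeBytes, int itemsPerPack)
    {
        double payload = (double)itemSizeBytes * itemsPerPack;
        return EntryOverheadBytes / (payload + EntryOverheadBytes);
    }
}
```

For 8 Byte items stored individually, OverheadEstimate.OverheadFraction(8, 1) gives one third (4 of every 12 Bytes), while packs of 256 give 4 / 2052, or roughly 0.2%.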


Optimized seek algorithms

Variable-size item indexing uses statistics to estimate the most likely position of the requested item; with fixed-size entries the requested item is accessed directly.
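
The difference between the two modes can be sketched as follows. This is an illustration under stated assumptions, not the library's actual algorithm: it assumes every entry occupies its payload plus the 4 Byte overhead, that there is no file header, and that the "statistics" are simply the average entry size seen so far. All names are hypothetical.

```csharp
using System;

// Hypothetical sketch of index-to-position mapping; not the library's implementation.
public static class SeekSketch
{
    // Per-entry overhead stated in the article.
    public const int EntryOverheadBytes = 4;

    // Fixed-size mode: the position follows directly from the index
    // (assumes no file header precedes the first entry).
    public static long FixedOffset(long index, int itemSizeBytes)
    {
        return index * (itemSizeBytes + EntryOverheadBytes);
    }

    // Variable-size mode: guess the position from the average entry size
    // observed so far; a real reader would then scan from this guess
    // to the nearest entry boundary and correct the estimate.
    public static long EstimateVariableOffset(long index, long bytesSeen, long entriesSeen)
    {
        if (entriesSeen <= 0)
        {
            return 0; // no statistics yet, start from the beginning
        }
        double averageEntrySize = (double)bytesSeen / entriesSeen;
        return (long)(index * averageEntrySize);
    }
}
```

In fixed-size mode the calculation is exact, which is why access is immediate; in variable-size mode the result is only a starting point for a short local search.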


"Paged" reading and writing operations

Reading and writing operations are performed in a “paged” mode to improve performance and come close to the native way hard disk drives operate.
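
The idea of paged access can be sketched with plain .NET streams. This is a generic illustration rather than the library's code: the class name, method name and parameters are hypothetical, and the records here are raw fixed-size byte blocks without separators.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch of paged reading; not the library's implementation.
public static class PagedReadSketch
{
    // Reads fixed-size records from `source`, fetching `recordsPerPage`
    // records per I/O request instead of issuing one read per record.
    public static IEnumerable<byte[]> ReadRecords(Stream source, int recordSize, int recordsPerPage)
    {
        byte[] page = new byte[recordSize * recordsPerPage];
        while (true)
        {
            // Stream.Read may return fewer bytes than requested,
            // so keep reading until the page is full or the stream ends.
            int filled = 0;
            while (filled < page.Length)
            {
                int read = source.Read(page, filled, page.Length - filled);
                if (read == 0)
                {
                    break; // end of stream
                }
                filled += read;
            }

            // Hand out only the complete records contained in this page.
            for (int offset = 0; offset + recordSize <= filled; offset += recordSize)
            {
                byte[] record = new byte[recordSize];
                Buffer.BlockCopy(page, offset, record, 0, recordSize);
                yield return record;
            }

            if (filled < page.Length)
            {
                yield break; // the last page was partial or empty
            }
        }
    }
}
```

A larger page means fewer, larger requests to the drive; a trailing incomplete record is silently dropped in this sketch.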


Index or sequence only based access

The only way to access an item is by its index; this makes it important to store items in the same sequence in which they will be accessed, as otherwise performance will suffer.

Operation Overview

The following diagram gives an overview of the operation of the library when performing read and write operations:

Library operation overview.


Application areas

• Financial, storing market data

Financial market data consists of a great many sequentially generated items such as prices, volumes or news.

• Engineering, real time sensor data

Real time streaming data coming in from sensors of heat, speed, light etc.

• Statistics, Simulations results, performance data logs


Implementation

The File Format

This diagram gives an idea how the file format is organized. Both the variable and fixed size file formats are extremely simple:


File formats.


Using a separator-based file format has a few advantages. It is simple (one file only, no additional index file), it is very resilient (any part of the file can be read, even if other parts are damaged) and it is fast (both reading and writing are straightforward). The major downside is relatively low performance when searching for items in variable size mode; to address this, the library implements a few advanced techniques that optimize index-based access to variable-size items.
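
The resilience comes from the fact that a reader can resynchronize on the next separator wherever it starts. Below is a minimal sketch of that idea. The separator value is hypothetical (the article does not give the real one), all names are invented for illustration, and the sketch assumes the payload never contains the separator bytes, which in the library is presumably what the escape sequence handling takes care of.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of separator-based recovery; not the library's implementation.
public static class SeparatorScanSketch
{
    // Assumed 4-byte separator; the library's real marker is not documented here.
    public static readonly byte[] Separator = { 0xFF, 0xFE, 0xFD, 0xFC };

    // Returns the offsets at which entry payloads start, i.e. the positions
    // immediately after each separator found in `data`. Bytes before the
    // first separator (for example a damaged region) are skipped.
    public static List<int> FindEntryStarts(byte[] data)
    {
        List<int> starts = new List<int>();
        for (int i = 0; i + Separator.Length <= data.Length; i++)
        {
            bool match = true;
            for (int j = 0; j < Separator.Length; j++)
            {
                if (data[i + j] != Separator[j])
                {
                    match = false;
                    break;
                }
            }

            if (match)
            {
                starts.Add(i + Separator.Length);
                i += Separator.Length - 1; // continue after the matched separator
            }
        }
        return starts;
    }
}
```

Given a buffer that begins with two damaged bytes followed by two intact entries, the scan skips the damage and still reports both entries.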


UML Diagram

The UML diagram shows the classes of the library, how they relate to each other, as well as their public properties and methods.

UML class diagram.


Classes

DataSeriesManipulator class

Base class for instances that perform operations on data series. Stores common information such as the item header size and the file stream, and provides static methods that convert data buffers to objects and vice versa using the assigned serializer.
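
To make the buffer-to-object conversion concrete, here is a minimal sketch of what a fixed-size conversion might look like. It does not use the library's serializer interface, which is not shown in this article; the Tick type, the TickSerializerSketch class and the 16 Byte layout are all assumptions made for illustration.

```csharp
using System;

// Hypothetical item type; not part of the library.
public struct Tick
{
    public long TimestampTicks;
    public double Price;
}

// Hypothetical fixed-size conversion between a Tick and a byte buffer.
public static class TickSerializerSketch
{
    // 8 bytes for the timestamp followed by 8 bytes for the price.
    public const int ItemSizeBytes = 16;

    public static byte[] ToBuffer(Tick item)
    {
        byte[] buffer = new byte[ItemSizeBytes];
        Buffer.BlockCopy(BitConverter.GetBytes(item.TimestampTicks), 0, buffer, 0, 8);
        Buffer.BlockCopy(BitConverter.GetBytes(item.Price), 0, buffer, 8, 8);
        return buffer;
    }

    public static Tick FromBuffer(byte[] buffer, int offset)
    {
        Tick item = new Tick();
        item.TimestampTicks = BitConverter.ToInt64(buffer, offset);
        item.Price = BitConverter.ToDouble(buffer, offset + 8);
        return item;
    }
}
```

Because every item has the same size, a buffer holding many items can be decoded by calling FromBuffer with offsets that are multiples of ItemSizeBytes. BitConverter uses the byte order of the machine it runs on, so this sketch only round-trips between machines with the same endianness.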


DataSeriesReaderRaw class

Provides basic reading capabilities and contains most of the read logic.


DataSeriesReader class

The class instantiated by the user to perform the actual reading. After creating the instance, make sure to call the Initialize() method, pointing it to the file to read from (see the Example section for details).


DataSeriesWriter class

The class instantiated by the user to perform writing operations.


DataSeriesSequenceHelper class

Helper class that assists with processing escape sequences in the data stream.


Example

Here is an example of how to use the writer class. As you can see, it is extremely simple and only requires instantiating and initializing the object.

            using (DataSeriesWriter writer = new DataSeriesWriter())
            {
                // Open the target file; a false result means the writer could not be set up.
                if (writer.Initialize(CreateSerializer(), FilePath, true) == false)
                {
                    throw new OperationCanceledException("Writer init failed.");
                }

                for (int i = 0; i < valuesCount; i++)
                { // Write items one by one.
                    writer.Write(values[i]);
                }
            }

We can also write multiple items with a single call, like this:

            writer.Write(values);

where values is an IEnumerable<ItemType> compatible collection.


Unit Testing

The source code also contains a unit testing project (Matrix.Framework.DataStorage.UnitTest), designed to test the capabilities and proper operation of the library. To run it you will need the NUnit testing framework. The test project contains three test classes: for writing, for reading and for combined operation. Both fixed size and variable size performance can be tested by switching between FixedItemSerializerHelper and VariableItemSerializerHelper.


Performance

Compared to a typical SQL database solution, the Data Storage library has the potential to be many times faster, thanks to its simplified functionality and straightforward design. A couple of SQL setups running on the test machine delivered 20-35K items / sec in sequential read operations and much less in write operations. These limitations come mostly from the complex middle layers between the data and the code that needs to process it; the many features they provide come at a price, and this price is performance. We also tested one of the most popular OODB solutions and the results were in the 30-45K items / sec range; this proved insufficient when access to millions of items is required. The Data Storage library can easily outperform these results by 15-30 times in read operations and by up to 20-100 times in write operations.


Another advantage of the library is that its performance does not typically change as the file/database grows. File system size is the only limit on size, and the speed of operations is not influenced by the number of items stored. This becomes very significant when dealing with a very large number of items, say 100 million or more.


Here are some test results obtained by executing the unit tests on a typical Intel Core 2 8200E machine with a 7200rpm HDD; each item is a fixed 32 Byte structure:

Data series write of 10 million items takes 5.04 sec or 1.98 million / sec.

Data series read of 10 million items takes 14.9 sec or 0.670 million / sec.
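
The rates above can be reproduced, and converted to an approximate data rate, with a short calculation. This is only a sketch: the class and method names are hypothetical, and the 36 Bytes per entry used in the note below assumes that the 4 Byte per-entry overhead also applies to the fixed-size format and that nothing else is written to the file.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the Data Storage library.
public static class ThroughputEstimate
{
    // Items processed per second.
    public static double ItemsPerSecond(long itemCount, double seconds)
    {
        return itemCount / seconds;
    }

    // Approximate data rate in megabytes (10^6 bytes) per second.
    public static double MegabytesPerSecond(long itemCount, int bytesPerEntry, double seconds)
    {
        return itemCount * (double)bytesPerEntry / seconds / 1e6;
    }
}
```

With 10 million items and 36 Bytes per entry, the write test corresponds to roughly 71 MB / sec and the read test to roughly 24 MB / sec.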


Reading is much more CPU intensive due to some specifics of the .NET framework. HDD limits were not reached in any of these tests. Further performance optimization is possible through the use of a non-managed (e.g. C++) file access layer. These tests were conducted in single-threaded mode for a more detailed performance estimation. Utilizing all 2/4/8 cores of a modern CPU in asynchronous mode should yield corresponding performance gains, up to the performance limit of the drive (modern HDDs have very high sequential read and write speeds but slow down considerably in seek operations; SSDs are much better suited to random seeks). Please note that all these results are intended for rough personal evaluation only and come with no guarantee. If you require commercial application quality, make sure to perform your own additional testing with your chosen hardware, software and data configuration.


The Matrix Platform & Sister Toolkits

The Data Storage system is part of the Matrix Platform (www.matrixplatform.com). The Platform is free, LGPL open source software and contains systems that can be reused across multiple software solutions. Besides data handling, the platform also covers diagnostics, component management, thread control and more. The source code attached to this article does not use the diagnostics code inside the Data Storage project, in order to avoid a mandatory reference to the Matrix Platform Diagnostics assemblies. All diagnostics-related source code is enclosed in #if Matrix_Diagnostics … #endif sections.
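
For readers unfamiliar with conditional compilation in C#, the pattern looks roughly like this. The Matrix_Diagnostics symbol is the one named above; the class and method are hypothetical and only demonstrate how code guarded by the symbol is included or left out at compile time.

```csharp
using System;

// Hypothetical example of the conditional compilation pattern; not library code.
public static class DiagnosticsSketch
{
    // Returns true only when the project defines the Matrix_Diagnostics symbol.
    public static bool DiagnosticsCompiledIn()
    {
#if Matrix_Diagnostics
        // This branch is compiled only when the symbol is defined,
        // so it may safely reference the diagnostics assemblies.
        return true;
#else
        // Without the symbol the diagnostics code is left out entirely.
        return false;
#endif
    }
}
```

The symbol can be defined per project (for example in the project's conditional compilation symbols setting) or with a #define directive at the top of a source file.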


Conclusion

Custom fit solutions like this one have great potential and offer significant advantages such as top performance, compact solution size and improved manageability. With the advancement and improvement of general-purpose technologies like SQL databases, it is common to see them applied in nearly all custom scenarios; however, custom fit solutions are often superior and should not be underestimated when choosing a technology and implementation, especially when performance and memory requirements are the highest priority. Finally, the advancement of general programming technologies like the .NET framework has made it possible to develop custom fit solutions faster than ever before, which helps make them a competitive and viable option once again.

modified on 6 July 2010 at 14:25 ••• 23,361 views