Saturday, December 9, 2017

KAFKA - The Future Of Modern Architecture

Software architecture has evolved into completely new dimensions in the last few years. One of the major changes is the advent of open source technology, which has enabled a whole new range of software architecture styles. I must say that 10 years back it was easier to design a system. Why? Imagine having to choose from a limited set of technologies. There was always a known set of technologies that played perfectly with each other, e.g. .Net, SQL Server, SSIS etc. Compare this to today's stack: we have more options than ever, disparate tech stacks need to communicate with each other, making inter-communication harder to manage, and on top of that, each tech stack has its own future growth path, with no one ensuring compatibility with the other stacks. This poses a new maintainability challenge for architects: to keep all these stacks as loosely coupled as possible. Enter Kafka, which solves most of our coupling woes.

I am going to take an example of a typical data-driven architecture -

This highlights the typical flow of a data-driven application: import raw data, batch process and correlate it to transform it into facts, pass it over to the data warehouse layer for analytics, serve the front layer, and let others consume the facts using pub-sub. This traditional design suffers from various downsides: it is slow (not real-time), has a low maintainability index and a low scalability index, and is complex to code. While the others are obvious, let's talk about maintainability. In this example, every stack has to know how the communicating stacks work: Node.js, the importing service, and the pub-sub service all need to know how to read/write to SQL, the database engines need to know each other, and so on. Trust me, it becomes more complex as you start adding more producers and consumers, and adding things like a separate analytical layer or machine learning will probably lead to its demise.

Now let's introduce Kafka into this. First, Kafka is a real-time, scalable and distributed streaming platform. That means there is minimal data-processing performance impact, plus the power of load-sensitive scaling.
It serves as a great intermediate layer and a perfect solution to our disparate tech stack problem, as every stack has to interact with just one technology: Kafka. Let's see if we can introduce Kafka into our broken design and make it awesome.

It does look a lot better! All the stacks are now loosely coupled and independently scalable, and guess what, adding any other layer is fun. If you want to add a machine learning or Hadoop layer, just read it off Kafka and there is no impact on performance. There are thousands of open source projects supporting Kafka producers and consumers in almost any language. I use .NET as a producer and have all types of consumers: Spark, Kafka Streams, Node.js, Drools etc. Being real-time, I publish in real-time to the cloud for machine learning and to other clients using pub-sub. It is stable, and my current peak workload is around 50,000 messages a second, dealing with TBs worth of data.
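To give a feel for how thin the producer side is, here is a minimal sketch of a .NET producer. It assumes the Confluent.Kafka NuGet package and a broker on localhost; the topic name and payload are placeholders, not part of my actual setup:

```csharp
using System;
using System.Threading.Tasks;
using Confluent.Kafka; // NuGet package: Confluent.Kafka

public static class FactProducer
{
    // Placeholder topic name, for illustration only.
    public const string Topic = "facts";

    public static async Task PublishAsync(string key, string json)
    {
        var config = new ProducerConfig { BootstrapServers = "localhost:9092" };
        using var producer = new ProducerBuilder<string, string>(config).Build();

        // The producer only knows about Kafka; the consumers (Spark, Node.js, ...)
        // stay completely decoupled from it.
        var result = await producer.ProduceAsync(
            Topic, new Message<string, string> { Key = key, Value = json });
        Console.WriteLine($"Delivered to {result.TopicPartitionOffset}");
    }
}
```

Each consumer then subscribes to the same topic through its own client library, so adding a new consumer never touches the producer.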

Sunday, October 23, 2016

Run Windows Nano Server As Docker Container

Setting Context 

Microsoft recently came out with their version of Docker support with the release of Windows Server 2016 and Windows Nano Server. Although Microsoft has a long way to go to make Docker support as seamless and stable as on Linux and OS X, a start is always welcome in the .Net developer fraternity, and a huge number of developers are evaluating this new buzz in town.
I work on an open source technology stack (Angular, React, Node etc.) and adopted Docker on Linux early this year. I must say it's really cool to finally see this in the Microsoft tech stack. It's too early to say, but it's not ready for production; I would limit its use to research or perhaps a small development environment.


I am running Windows Server 2016 Standard edition. You can download the VHD with a 180-day trial, but be aware that it can't be upgraded to Standard edition after the 180 days expire, at least at the time of writing. So let's jump into some action: take your VHD and use VirtualBox or any tool of your choice to mount it as a VM. In my case I am using Parallels.

First Things First [Very Important :) ]

Once you are done, you need to go through a few basic steps to enable Docker. Oh wait, the most important thing you will ever learn with Windows: update it first. If you don't have the latest updates, you might hit several issues; at least I got several weird exceptions that simply disappeared after updating.

Run Docker in Windows 2016 server 

Run these commands in PowerShell with elevated privileges

Install-Module -Name DockerMsftProvider -Repository PSGallery -Force
Install-Package -Name docker -ProviderName DockerMsftProvider
Restart-Computer -Force

Once your computer restarts, run docker version and you should see output something like this

This means your Docker service is running. You can check it by running Get-Service docker; the service should be in the running state.

Get Nano Server 

Since our goal is to run Nano Server, let's fetch the Nano Server image from Docker Hub. Use PowerShell running with elevated permissions to run

  • docker pull microsoft/nanoserver 

This should fetch the nanoserver image, and you should now be able to see it by running
  • docker images

Now that we have the image locally, run it with

  • docker run -it [ImageID from above command]
It will open up a new prompt, which is essentially a command prompt inside the Nano Server. Voila, you are running Nano Server successfully.

You can open up PowerShell by typing PowerShell and hitting enter. Since this is just a bare Windows platform, you need to install everything as per your needs. In the next blog I will try to demonstrate how you can run IIS inside the Nano Server. Till then, enjoy Dockering......!

Friday, September 4, 2015

Automatic Restore from Nuget

Package managers like npm, NuGet etc. have in a short span of time become the first choice for managing any external/internal dependency. They not only make resolving dependencies easy but also encourage best practices like versioning. If you are still skeptical about it, I recommend opening your Visual Studio and giving it a try. Here is a nice article about it

I spent quite an amount of time figuring out the best way to use NuGet and how to make it hassle free. I don't recommend throwing every internal dependency in as a NuGet package; rather, you should divide your application into independent components, and if each component is a candidate for independent development and can have multiple versions in the future, then NuGet is your best bet.

Working with NuGet in Visual Studio is very easy: all you have to do is add your local NuGet server address to the available package sources and use it to resolve any dependency.

When any NuGet reference is added to a solution, NuGet will maintain its own artifacts at the solution and project level.
The project level will contain only one file, called packages.config, which contains the dependent package information such as version, framework and id.
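For illustration, a packages.config typically looks like this (the first package id and the versions here are made up):

```xml
<?xml version="1.0" encoding="utf-8"?>
<packages>
  <package id="MyCompany.Logging" version="1.2.0" targetFramework="net45" />
  <package id="Newtonsoft.Json" version="7.0.1" targetFramework="net45" />
</packages>
```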

The solution level will contain a hidden folder called .nuget which contains
  1. NuGet Config 
  2. NuGet.exe (Can be excluded from source control)
  3. NuGet.targets

The .nuget folder is important if you want automatic restore, which means that when you build, Visual Studio will try to look for external dependencies and, if it is not able to locate them, will launch the nuget.exe application to resolve them using the configuration present either in NuGet.Config in the solution directory or in %ProgramData%\NuGet\Config\. NuGet also maintains its own local cache at %LOCALAPPDATA%\NuGet\Cache\, and if it finds the NuGet package in the local cache it won't go out to probe your NuGet feed server.

So if we want automatic restore, this .nuget folder is of great significance. I think the best way to maintain a coherent restore mechanism is to define the configuration as part of the NuGet config file and add it to source control. When another team member downloads a version of the code, they will have exactly the same configuration to restore with, which will save a lot of hassle. To enable automatic restore you can either right-click the solution, where there will be an option to enable automatic restore, or you can do it via configuration.

So there are two important things to add to the configuration to enable automatic restore: set the package restore key to true, and optionally add the NuGet server address, in case you want to use your own server.
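As a sketch, the relevant NuGet.Config sections look like this (the feed name and URL are placeholders for your own server):

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <packageRestore>
    <!-- Allow NuGet to download missing packages -->
    <add key="enabled" value="true" />
    <!-- Restore automatically during build in Visual Studio -->
    <add key="automatic" value="true" />
  </packageRestore>
  <packageSources>
    <add key="InternalFeed" value="http://your-nuget-server/nuget" />
  </packageSources>
</configuration>
```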

This allows us to enforce the configuration for everyone who downloads the code. To make this more fun, we can automate the NuGet restore by creating a small batch file.

 @echo off  
 REM Cache nuget.exe under Local App Data so repeated restores can skip the download  
 IF NOT DEFINED CACHED_NUGET_LOCATION set CACHED_NUGET_LOCATION=%LocalAppData%\NuGet\NuGet.exe  
 REM If nuget.exe is already in the local .nuget folder, go straight to restore  
 IF EXIST .nuget\nuget.exe goto restore  
 REM If it is already cached, skip the download and just copy it locally  
 IF EXIST %CACHED_NUGET_LOCATION% goto copy  
 echo Downloading latest version of NuGet.exe...  
 IF NOT EXIST %LocalAppData%\NuGet md %LocalAppData%\NuGet  
 REM Invoke-WebRequest needs a recent PowerShell version, so it is left commented out for machines still on Windows 7  
 REM powershell -NoProfile -ExecutionPolicy unrestricted -Command "$ProgressPreference = 'SilentlyContinue'; Invoke-WebRequest '' -OutFile '%CACHED_NUGET_LOCATION%'"  
 powershell -NoProfile -ExecutionPolicy unrestricted -Command "$ProgressPreference = 'SilentlyContinue'; (New-Object System.Net.WebClient).DownloadFile('', '%CACHED_NUGET_LOCATION%')"  
 :copy  
 REM Copy the cached nuget.exe into the local .nuget folder  
 IF NOT EXIST .nuget md .nuget  
 copy %CACHED_NUGET_LOCATION% .nuget\nuget.exe > nul  
 :restore  
 echo Deleting cached packages.....  
 del %LocalAppData%\NuGet\Cache\*.nupkg /q  
 echo Initiating package restore.....  
 .nuget\NuGet.exe restore -ConfigFile ".nuget\NuGet.Config"  
 echo Success!!!  

Place this batch file in the solution root folder, in the same path as .nuget, and execute it. Remember not to add the packages folder to source control; that would defeat the whole purpose of using NuGet.

Okay so guys enjoy flexibility!

Update - Newer versions of MSBuild already have the NuGet restore feature out of the box, so if you need to restore during the build, just enable NuGet Restore and you are good.

Tuesday, May 5, 2015

Parse Large Flat File using Hash Table

Recently I stumbled upon a very interesting requirement which was simple and tricky at the same time. Let me outline the objective for you.
The requirement is a flat file reader tool which can parse flat files (any number of them, but for each new file the code should be extensible with minimal changes). The flat file format is fixed for now: each line starts with a key, followed by a set of values separated by commas. The objective is to read a particular key from the user and, for that key, display all the records present in all the flat files. There can be multiple files and the file sizes can grow.

To solve this type of requirement, I zeroed on below approaches-

  1. On-demand parsing: the simplest approach; it parses the data source each time a customer record is requested
  2. Index-based parsing: builds an index for the data source and provides random retrieval of data from it
  3. Index-based parsing with memory-mapped files: has all the features of the above and additionally maps sections of the file into memory. This is used for very large files.
Analyze - Option 1  

At first glance the problem seems very simple and boils down to parsing the key, finding the set of values for the key, and repeating until you hit the end of the file. However, there is a serious issue: for every query, the file has to be parsed completely, as there can be more than one value per key. Performing the same task repeatedly is not only suboptimal but a waste of CPU, memory and time :). Also, as the file grows in size, the time taken will increase, so I quickly discarded this naive approach and looked at the other options.

Analyze - Option 3  

Option 3 is the best approach when dealing with GBs of data. In this approach we can't bring the entire file into memory as it would be too huge, so jumping to a particular record index or offset is not enough by itself. It involves dividing the entire file into virtual buckets of records and storing the index/offset of a particular record relative to the bucket's starting position. This lets us open a specific portion of the file as a file stream, map it to memory, and read the data for a particular index. This approach makes more sense for very large files, but my requirement is not for very large files, rather for marginally large files on the order of ~1 GB.

Analyze - Option 2

This seemed to be the best option, as my file size is not of order >1 GB. I refactored my approach to option 2 and used a hash table, or Dictionary (the hash table implementation in C#). The idea was to parse the whole file once and index all the records available in the flat files using record offsets. This allows a query to look in the index once and find the values straight away, rather than scanning the file repeatedly.

Index Based parsing 

 Index Structure

I decided to have an index like this -


My index is a dictionary with a key (Int32) and values (a list of file pointers, i.e. longs).
The index maintains the CustomerId as the key and, as values, the list of all the offsets which point to records for that key. So the approach was to read the files and maintain an offset location for each key, or customer id.
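In C# this structure maps directly onto a dictionary; a minimal sketch (the key and offset values here are illustrative):

```csharp
using System.Collections.Generic;

// CustomerId -> list of byte offsets at which records for that key start.
var index = new Dictionary<int, List<long>>();

// Record that customer 42 has records starting at offsets 0 and 1024.
if (!index.TryGetValue(42, out var offsets))
    index[42] = offsets = new List<long>();
offsets.Add(0L);
offsets.Add(1024L);
```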

File Scanning or Index Building Logic

The file scanning logic simply reads each record, extracts the key from it, and adds the key and record position to the dictionary. If another record with the same key is found, its record position is appended to the file-pointer list.
  • Start reading the file from offset 0 and extract each line
  • Split the line into its comma-separated values and extract the value to be used as the key
  • Add the key and the record position, as a list, to the index
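The steps above can be sketched as follows. This is a simplified version that assumes ASCII content, '\n' line endings (use +2 for \r\n files), and the key as the first comma-separated field:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class FlatFileIndexer
{
    // Build the index: CustomerId -> byte offsets of every record with that key.
    public static Dictionary<int, List<long>> BuildIndex(string path)
    {
        var index = new Dictionary<int, List<long>>();
        long offset = 0;
        foreach (var line in File.ReadLines(path))
        {
            // The first comma-separated value is the key (customer id).
            var key = int.Parse(line.Split(',')[0]);
            if (!index.TryGetValue(key, out var offsets))
                index[key] = offsets = new List<long>();
            offsets.Add(offset);
            offset += line.Length + 1; // one byte per char plus '\n' assumed
        }
        return index;
    }
}
```

The whole file is read exactly once, after which every lookup is a dictionary hit.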

Retrieval Logic

While retrieving, jump straight to the offset location, retrieve the complete line and display it to the user. This works well if the index is built prior to the query, but if the user doesn't opt to build the index, or a record was added after the index was built (addition of records with a new key), we need to fall back to eager loading, i.e. rescan the file for the record.
  • For a particular requested key, check the index for the value
  • If the value is found, retrieve the value from each file for that key and return it to the user
  • If the value is not found, run the rescanning logic for that key
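With the offsets in hand, retrieval becomes a seek plus a single line read; a sketch under the same ASCII/'\n' assumptions as the index builder:

```csharp
using System.IO;
using System.Text;

static class FlatFileReaderUtil
{
    // Jump to a known byte offset and read one record line from there.
    public static string ReadRecordAt(string path, long offset)
    {
        using var fs = new FileStream(path, FileMode.Open, FileAccess.Read);
        fs.Seek(offset, SeekOrigin.Begin);
        using var reader = new StreamReader(fs, Encoding.ASCII);
        return reader.ReadLine();
    }
}
```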

Rescanning logic

The rescanning logic is used for eager loading when the index does not contain the key. It invokes the same logic used in scanning, but with a limit on the number of records. There may be a file where the key is a primary key, which means we can have only one record per key, or a constraint that for one key we can have at most 10 records. With such a key-to-record mapping constraint, it is better to look only for the maximum number of records in the file rather than scanning the whole file. So my logic goes like this.
  • Define a FileMaxRecordPerKey value (the maximum number of records a file can have for a particular key); if undefined, assume Int.Max
  • Scan the file for the key and, for every record found, increment the counter by 1
  • If the counter value equals FileMaxRecordPerKey, stop further scanning
  • Add the key and value to the index and return to the user
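The rescanning steps can be sketched like this (FileMaxRecordPerKey appears here as the maxRecordsPerKey parameter; the same ASCII/'\n' assumptions apply):

```csharp
using System.Collections.Generic;
using System.IO;

static class FlatFileRescanner
{
    // Scan for a single key, stopping early once maxRecordsPerKey matches are found.
    public static List<long> Rescan(string path, int key, int maxRecordsPerKey = int.MaxValue)
    {
        var offsets = new List<long>();
        long offset = 0;
        foreach (var line in File.ReadLines(path))
        {
            if (int.Parse(line.Split(',')[0]) == key)
            {
                offsets.Add(offset);
                if (offsets.Count == maxRecordsPerKey) break; // FileMaxRecordPerKey reached
            }
            offset += line.Length + 1; // '\n' line endings assumed
        }
        return offsets;
    }
}
```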


  • The tool maintains the index for every record stored in all the flat files in a hash table. This index is in-memory and is recreated when the tool starts
  • The tool provides the functionality to prebuild the index, or to skip index building and fall back to on-demand parsing
  • For fallback scenarios where on-demand parsing is used, it creates the index for the requested record, which makes any subsequent query for that id less expensive
  • In fallback cases the files are scanned completely, but when the record count per key is limited, further scanning is not required once the maximum number of records has been found. This is implemented as the FileMaxRecordPerKey property
  • Any record not found in the prebuilt index is located using on-demand parsing
  • Dynamic addition of records to any of the flat files is not supported in the above logic, and there is no implementation of a dirty index
For working source code, check out the Git repo -

Tuesday, June 17, 2014

Extract Property Value from Object/Object Graph

The code below can parse an object for the value of a given property path and return the value (or null) as the response. String literals of the following kinds can be parsed with this logic

            string s = "P[0].A";
            string g = "T[0].U.D";
            string t = "S[0].P[7].D";
            string NegativeValue = "P[0].WRONGVALUE";

using System;
using System.Collections;
using System.Linq;
using System.Reflection;

public static object GetPropertyValue(string name, object obj, Type type)
{
    var parts = name.Split('.').ToList();
    var currentPart = parts[0];
    parts.RemoveAt(0); // consume the current part so the recursion terminates

    int index = 0;
    if (currentPart.Contains("["))
    {
        // Extract the array index, e.g. "P[0]" -> property "P", index 0
        index = Int32.Parse(currentPart.Substring(currentPart.IndexOf('[')).Replace('[', ' ').Replace(']', ' ').Trim());
        currentPart = currentPart.Substring(0, currentPart.IndexOf('['));
    }

    PropertyInfo info = type.GetProperty(currentPart);
    if (info == null) { return null; }

    // Strings also implement IEnumerable, so exclude them from the collection branch
    if (info.PropertyType != typeof(string) && info.PropertyType.GetInterface("IEnumerable") != null)
    {
        int itemNb = 0;
        foreach (object item in (IEnumerable)info.GetValue(obj, null))
        {
            if (itemNb == index)
                return parts.Count == 0 ? item : GetPropertyValue(String.Join(".", parts), item, item.GetType());
            itemNb++;
        }
        // index is not in range of the values provided
        throw new ArgumentOutOfRangeException();
    }

    if (parts.Count > 0)
        return GetPropertyValue(String.Join(".", parts), info.GetValue(obj, null), info.PropertyType);

    if (info.PropertyType.IsValueType || info.PropertyType == typeof(string))
        return info.GetValue(obj, null).ToString();

    return null;
}