Parse Large Flat File using Hash Table
Recently I stumbled on very interesting requirement which was simple and tricky at same time.Let me outline the objective for ya.
Requirement is to have a flat file reader tool which has ability to parse the flat file (can be of any number of flat files but for each file, the code should be extensible with minimum code). Now the flat file format is fixed for now and each line starts with a key, and then set of values separated with comma. Objective is to read a particular key from user and for that key, display all the records present in all the flat files. There can be multiple files and the file size can grow.
To solve this type of requirement, I zeroed on below approaches-
- On-Demand Parsing : This approach is simplest and parses the data source, once the customer is requested
- Index based Parsing: This approach builds the index for data source and provides Random retrieval of data from data source.
- Index based parsing and Memory Mapped file – This approach has all the feature of the above parsing and additionally maps the section of the file in memory. This is used for very huge files.
Analyze - Option 1
At first glance problem seems to be very simple and boils down to parsing the key and, finding the set of values for the key and repeat it, till you find the end of file. However there is serious issues, for every query, file has to be parsed completely, as there can be more than one value. This is serious problem as it performs same task repeatedly which is not only optimum but a waste of CPU, memory and time :). Also with file growing in size, the time taken with increase, so I quickly discarded this naive approach and looked at other options available.
Analyze - Option 3
Option 3 is the best approach when dealing with GB of data. In this approach we can’t bring the entire file in memory as it will be too huge, so jumping to a particular index of record or offset of records is not very good approach. This will involve dividing the entire file into Virtual Bucket of records and storing the index/offset of a particular record with respect to the bucket starting position. This will help to open a specific portion of file as filestream and get it mapped to memory and reading the data for the particular index.This approach will make more sense for very large files but my requirement is not for very large files but rather for marginally large file of order ~1GB.
Analyze - Option 2
This seemed to be the best option as my file size is not or order >1Gb. I re-factored my approach to option 2 and used hash table or Dictionary (Hash table implementation in C#).Idea was to parse the whole file once and let it index all the records available in the flat files using record offset. This will allow the query to look for index once and find the values straight, rather than scanning file repeatedly.
Index Based parsing
I decided to have index like -
My index will have dictionary with key (Int32) and Values (list of File Pointers or Long).
This Index will maintain the key as CustomerId and Values as list of all the index which points to the record for that key. So approach was to read the files and maintain an offset location for that key or customer id.
File Scanning or Index Building Logic
File scanning logic is simply reading each records and extract the key from the record and add the key and record position as part of dictionary. If any record with same key id found then add the record position in filepointer list.
- Start reading the file from index 0 and extract each line
- Split the line into various comma separated values and extract the value to be defined as key
- Add the key and the record position as list in the index
While retrieving jump straight to location of offset and retrieve the complete line and display it to user. This will work well if index is built prior to query but if user doesn’t opt to build the index or record was added after the index is build (addition of records with new key), in those cases we need to fallback to eager loading or rescan the file for the record.
- For a particular key requested, check in Index for the value
- If value found, retrieve the value from each file for the particular key and return to user
- If value not found, run Rescanning logic for the particular key
Rescanning logic will be used in eager loading when index doesnot contain the key. It will invoke the same logic which was used in Scanning but with a limit of records. There may be a file where key is primary key which means we can have only one record per key or a limitation that for 1 key we can have maximum 10 records. So for a key to record mapping constraint, it better to look for only the maximum number of records in file rather than scanning whole file. So my logic goes like this.
- Define FileMaxRecordPerKey value(max number of records a file can have for particular key), if undefined assume to be Int.Max
- Scan the file for key and for every record found increment the counter by 1
- If the CounterValue is equal to the FileMaxRecordPerKey, stop further scan
- Add the key and Value to the index and return to user
- Tool maintains the index for every record stored in all flat files in hashtable .This index is in-memory and will be recreated on start of tool
- Tool provides the functionality to prebuild the index or ignore the index building and fallback to on demand parsing
- For Fallback scenarios where on demand parsing is used, it will creates the index for requested record. Which allows any subsequent query for the id to be less expensive
- For Fallback cases the files will be scanned completely but in cases when record number is limited, further scanning is not required once we have found the maximum number of record. This is implemented as FileMaxRecordPerKey property
- For any record not found in prebuilt index, it will try to locate using on-demand parsing
- Dynamic addition of record in any of flat file is not supported in above logic and there is no implementation of dirty index
For a working source control checkout Git repo - https://bitbucket.org/anupkumarsharma/flatfileparser/src