Monday, June 30, 2008

Regular Expression for parsing a CSV file with C#

I have to parse CSV (Comma Separated Value) files regularly. It is quite surprising alot of information exchange is still done in CSV format. I have also googled for a efficient way to parse a CSV file. A median performance solution for small CSV file is using regular expression. Regular expression used to parse CSV file has been quite complex.

After a few trial and error, I found a regular expression tokenise the CSV line quite nicely. It is surprisingly simple

[^,"]*

It says ignor comma and quote in your match, hence it will only match other characters. BUT NOTE, this simple expression can only handle very simple situation only, if it is possible there are comma (Separater) in a token, e.g. "abcd,efg"; the simple expression is not adaquate.

A more complex and comprehensive expression can be found from OmegaMan's Blog page. On his page you can find some very useful tips about using regular expression in C#.

Based on OmegaMan's expression for tokenise CSV, to parse a single line of CSV

((?(?=[\x22\x27])(?:[\x22\x27]+)(?[^\x22\x27]*)(?:[\x22\x27]+)|(?[^,\r\n]*))(?:,?)) #\x22 is double quote ", \x27 is single quote '

This expression says
if found double quote or single quote then //(?(?=[\x22\x27])
{
match column and strip quotes and commas //(?:[\x22\x27]+)(?[^\x22\x27]*)(?:[\x22\x27]+)
}
else
{
match column and strip commas //(?[^,\r\n]*))(?:,?)
}

I am yet to figure out one final problem, the above regular expression will alway return an empty match at the end. Need to figure out how to stop it from happening?

Update: I figure out what make it to do an extra match at the end of line if I modify the expression as following this problem will not occur

((?(?=[\x22\x27])(?:[\x22\x27]+)(?[^\x22\x27]*)(?:[\x22\x27]+)|(?[^,\r\n]+))(?:,?))

The problem is then how to make sure that empty columns are handle, the original one can hande empty columns such as "abc",,"efg". But the new one will ignor the empty column, if the regular expression represent some tabular data then we need to make the empty columns are handled.

One way to handle this might be using Match.Groups["column"].Captures[0].Index property and using the original expression. The index of match will equal to the length of input string, i.e. inputstring.lastIndex + 1 or inputstring.length.

Update: OK I finally got it, a expression:
(?(?=[\x22\x27])(?:[\x22\x27]+)(?[^\x22\x27]*)(?:[\x22\x27]+)(?:,?)|((?[^,\r\n]+)(?:,?)|(?\W*)(?:,)))

This one will handle empty column too. FINALLY.

UPDATE, damn, unfortunately the previous one is not perfect either, it can not handle 2 consecutive empty column e.g. "abc",,,"efg".
it will miss the second empty one. Still need to figure this one out.

AFTER much of try for couple of days, I think I will give up the idea of a pure regular expression solution to this problem. a combination of regular expression and C# code logic is need to solve this problem comprehensively.

No comments: