I have to parse CSV (Comma-Separated Values) files regularly. It is quite surprising how much information exchange is still done in CSV format. I have googled for an efficient way to parse a CSV file. For small CSV files, a regular expression is a solution with middling performance. The regular expressions I found for parsing CSV were quite complex.
After some trial and error, I found a regular expression that tokenises a CSV line quite nicely. It is surprisingly simple:
[^,"]*
It says to exclude commas and double quotes from the match, so it only matches runs of other characters. BUT NOTE: this simple expression can only handle very simple cases. If a token may contain a comma (the separator), e.g. "abcd,efg", the simple expression is not adequate.
A more complex and comprehensive expression can be found on OmegaMan's blog page. His page also has some very useful tips about using regular expressions in C#.
Based on OmegaMan's expression for tokenising CSV, this one parses a single line of CSV:
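To make the limitation concrete, here is a minimal C# sketch that applies the simple expression to two lines. The class and method names (SimpleCsvTokenizer, Tokenize) are made up for illustration. Because the * quantifier also matches the empty string at every comma and quote, the sketch skips zero-length matches, which is an assumption of mine and means empty fields are dropped as well.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class SimpleCsvTokenizer
{
    // The simple expression from the text: a run of characters
    // that are neither a comma nor a double quote.
    private static readonly Regex Token = new Regex("[^,\"]*");

    // Hypothetical helper: returns the non-empty tokens found in one CSV line.
    public static List<string> Tokenize(string line)
    {
        var tokens = new List<string>();
        foreach (Match m in Token.Matches(line))
        {
            // The * quantifier also produces zero-length matches at each
            // comma or quote; skip them (this drops empty fields too).
            if (m.Length > 0)
                tokens.Add(m.Value);
        }
        return tokens;
    }

    static void Main()
    {
        // Plain line: prints "abc | def | ghi"
        Console.WriteLine(string.Join(" | ", Tokenize("abc,def,ghi")));

        // Quoted comma: prints "abcd | efg | xyz", the quoted token is split in two
        Console.WriteLine(string.Join(" | ", Tokenize("\"abcd,efg\",xyz")));
    }
}
```

The second call shows why the simple expression is not enough: the quoted token "abcd,efg" comes back as two separate tokens.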
((?(?=[\x22\x27])(?:[\x22\x27]+)(?<column>[^\x22\x27]*)(?:[\x22\x27]+)|(?<column>[^,\r\n]*))(?:,?)) #\x22 is double quote ", \x27 is single quote '
I have yet to figure out one final problem: the above regular expression will always return an empty match at the end. I still need to figure out how to stop that from happening.
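One possible workaround, sketched below in C#, is to leave the expression alone and simply ignore any match that consumed no characters. The names CsvLineParser and ParseLine are hypothetical, and the pattern is the one above with the trailing comment removed so that it works without RegexOptions.IgnorePatternWhitespace. This is only a sketch under those assumptions, not a full CSV parser: it also discards a genuinely empty last field after a trailing comma, and it does not handle escaped quotes inside a quoted token.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class CsvLineParser
{
    // Same expression as in the text: \x22 is the double quote, \x27 the single quote.
    // If the token starts with a quote, capture what is between the quotes;
    // otherwise capture everything up to the next comma or line break.
    private static readonly Regex Column = new Regex(
        @"((?(?=[\x22\x27])(?:[\x22\x27]+)(?<column>[^\x22\x27]*)(?:[\x22\x27]+)|(?<column>[^,\r\n]*))(?:,?))");

    // Hypothetical helper: returns the column values of a single CSV line.
    public static List<string> ParseLine(string line)
    {
        var columns = new List<string>();
        foreach (Match m in Column.Matches(line))
        {
            // The match at the very end of the line consumes nothing,
            // so a zero-length match is treated as the spurious one and skipped.
            if (m.Length == 0)
                continue;
            columns.Add(m.Groups["column"].Value);
        }
        return columns;
    }

    static void Main()
    {
        // Prints "abcd,efg | x | y"
        Console.WriteLine(string.Join(" | ", ParseLine("\"abcd,efg\",x,y")));
    }
}
```

An empty field in the middle of a line, as in a,,b, still comes through, because that match consumes the comma and therefore has a non-zero length.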