Monday, June 30, 2008

Regular Expression Parse CSV Code

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

/// <summary>
/// Helper class for tokenising CSV lines, mainly using regular expressions.
/// </summary>
public class CSVStringTokeniser
{
    /// <summary>
    /// This expression says:
    /// if a double quote or single quote is found then   // (?(?=[\x22\x27])
    /// {
    ///     match the column and strip quotes and commas  // (?:[\x22\x27]+)(?<column>[^\x22\x27]*)(?:[\x22\x27]+)
    /// }
    /// else
    /// {
    ///     match the column and strip commas             // (?<column>[^,\r\n]*))(?:,?)
    /// }
    ///
    /// This expression has the problem of producing an extra empty match at the end when it is not needed.
    /// </summary>
    public static readonly string _expression =
        @"((?(?=[\x22\x27])(?:[\x22\x27]+)(?<column>[^\x22\x27]*)(?:[\x22\x27]+)|(?<column>[^,\r\n]*))(?:,?))";

    /// <summary>
    /// This simple expression cannot handle a comma within a column, e.g. "abc,efg",hijk
    /// </summary>
    public static readonly string _expressionSimple =
        @"[^,\x22\x27]*";

    /// <summary>
    /// This expression is similar to _expression but does not produce the extra empty match at the end.
    /// However, it cannot handle consecutive empty columns, e.g. abc,,,efg
    /// </summary>
    public static readonly string _expressionTrial =
        @"(?(?=[\x22\x27])(?:[\x22\x27]+)(?<column>[^\x22\x27]*)(?:[\x22\x27]+)(?:,?)|((?<column>[^,\r\n]+)(?:,?)|(?<column>\W*)(?:,)))";

    /// <summary>
    /// Static method to tokenise a CSV (comma-separated values) string.
    /// The limitation of this method is that it cannot handle quoted empty columns, e.g. "abc","","","efg"
    /// </summary>
    /// <param name="inputStr">CSV string</param>
    /// <returns>List of values in the CSV with quotes and commas removed</returns>
    public static List<string> Tokenise(string inputStr)
    {
        // Use _expression here because it is the most generic and handles most situations.
        Regex reg = new Regex(_expression);

        List<string> rs = new List<string>();

        foreach (Match match in reg.Matches(inputStr))
        {
            foreach (Capture capture in match.Groups["column"].Captures)
            {
                if (capture.Index < inputStr.Length)
                {
                    rs.Add(capture.Value);
                }
                // The expression always yields one extra empty match at the end of the input.
                // Keep it only when the input ends with a comma, i.e. the last column really is empty.
                // EndsWith is also safe for an empty input string, where Substring would throw.
                else if (inputStr.EndsWith(",", StringComparison.Ordinal))
                {
                    rs.Add(string.Empty);
                }
            }
        }
        return rs;
    }
}

Regular Expression for parsing a CSV file with C#

I have to parse CSV (comma-separated values) files regularly. It is quite surprising how much information exchange is still done in CSV format. I googled for an efficient way to parse a CSV file. For small CSV files, a regular expression is a solution with middling performance. The regular expressions used to parse CSV files tend to be quite complex.

After some trial and error, I found a regular expression that tokenises a CSV line quite nicely. It is surprisingly simple:

[^,"]*

It says to exclude commas and quotes from the match, so it matches only the other characters. BUT NOTE: this simple expression can only handle very simple situations. If a token can contain a comma (the separator), e.g. "abcd,efg", the simple expression is not adequate.
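As a quick illustration, here is a minimal C# sketch that runs the simple expression over two lines. The names SimpleCsvDemo and Split are made up for this example, and skipping the empty matches is my own assumption (it also means empty columns are lost).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SimpleCsvDemo
{
    // Hypothetical helper: returns the non-empty matches of [^,"]* in the line.
    public static List<string> Split(string line)
    {
        var tokens = new List<string>();
        foreach (Match m in Regex.Matches(line, "[^,\"]*"))
        {
            // The * quantifier also yields empty matches (for example at each comma and quote); skip them.
            if (m.Length > 0)
            {
                tokens.Add(m.Value);
            }
        }
        return tokens;
    }

    public static void Main()
    {
        // Plain columns come out as expected.
        Console.WriteLine(string.Join("|", Split("abc,efg,hijk")));      // abc|efg|hijk
        // The quoted column "abcd,efg" is wrongly split in two.
        Console.WriteLine(string.Join("|", Split("\"abcd,efg\",hijk"))); // abcd|efg|hijk
    }
}
```

The second line should have two columns, but the simple expression returns three tokens.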

A more complex and comprehensive expression can be found on OmegaMan's blog page. On his page you can find some very useful tips about using regular expressions in C#.

Based on OmegaMan's expression for tokenising CSV, this one parses a single line of CSV:

((?(?=[\x22\x27])(?:[\x22\x27]+)(?<column>[^\x22\x27]*)(?:[\x22\x27]+)|(?<column>[^,\r\n]*))(?:,?)) #\x22 is double quote ", \x27 is single quote '

This expression says
if found double quote or single quote then //(?(?=[\x22\x27])
{
match column and strip quotes and commas //(?:[\x22\x27]+)(?<column>[^\x22\x27]*)(?:[\x22\x27]+)
}
else
{
match column and strip commas //(?<column>[^,\r\n]*))(?:,?)
}

I have yet to figure out one final problem: the above regular expression will always return an empty match at the end. I need to figure out how to stop that from happening.
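To see the extra match, a small sketch like the following can list every "column" capture together with its index. TrailingMatchDemo and Columns are hypothetical names used only for this illustration; the expression itself is the one above, unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TrailingMatchDemo
{
    // The expression from the post, unchanged.
    private const string Expression =
        @"((?(?=[\x22\x27])(?:[\x22\x27]+)(?<column>[^\x22\x27]*)(?:[\x22\x27]+)|(?<column>[^,\r\n]*))(?:,?))";

    // Hypothetical helper: lists every "column" capture as "index:value".
    public static List<string> Columns(string line)
    {
        var result = new List<string>();
        foreach (Match match in Regex.Matches(line, Expression))
        {
            foreach (Capture capture in match.Groups["column"].Captures)
            {
                result.Add(capture.Index + ":" + capture.Value);
            }
        }
        return result;
    }

    public static void Main()
    {
        string line = "\"abc,efg\",hijk";
        foreach (string column in Columns(line))
        {
            Console.WriteLine(column);
        }
        // Prints:
        // 1:abc,efg
        // 10:hijk
        // 14:
        // The last capture is empty and its index equals line.Length (14).
    }
}
```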

Update: I figured out what makes it produce an extra match at the end of the line. If I modify the expression as follows (the unquoted column uses + instead of *), the problem does not occur:

((?(?=[\x22\x27])(?:[\x22\x27]+)(?<column>[^\x22\x27]*)(?:[\x22\x27]+)|(?<column>[^,\r\n]+))(?:,?))

The problem is then how to make sure that empty columns are handled. The original expression can handle empty columns such as "abc",,"efg", but the new one ignores them. If the CSV represents tabular data, we need to make sure the empty columns are kept.

One way to handle this might be to keep the original expression and check the Match.Groups["column"].Captures[0].Index property. The index of the extra match equals the length of the input string, i.e. the last index of the input string + 1.

Update: OK, I finally got it. The expression:
(?(?=[\x22\x27])(?:[\x22\x27]+)(?<column>[^\x22\x27]*)(?:[\x22\x27]+)(?:,?)|((?<column>[^,\r\n]+)(?:,?)|(?<column>\W*)(?:,)))

This one handles empty columns too. FINALLY.

UPDATE: damn, unfortunately the previous one is not perfect either. It cannot handle two consecutive empty columns, e.g. "abc",,,"efg"; it misses the second empty one. I still need to figure this one out.

After trying for a couple of days, I think I will give up on the idea of a pure regular expression solution to this problem. A combination of regular expression and C# code logic is needed to solve it comprehensively.
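One possible shape for such a combination is sketched below. This is not OmegaMan's expression or any library API, just an illustration under my own assumptions: a regular expression matches a single column plus its separator, and a C# loop decides where to start the next match and when to stop. The names CsvLineSplitter, Split, Field and the "sep" group are hypothetical. It assumes the line has no trailing line terminator (e.g. it came from StreamReader.ReadLine) and it does not handle doubled quotes inside a quoted column.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CsvLineSplitter
{
    // One column per match: a double-quoted, single-quoted, or unquoted value,
    // followed by a comma or the end of the line. \G anchors each match at the start position.
    private static readonly Regex Field = new Regex(
        @"\G(?:\x22(?<column>[^\x22]*)\x22|\x27(?<column>[^\x27]*)\x27|(?<column>[^,\r\n]*))(?<sep>,|$)");

    public static List<string> Split(string line)
    {
        var columns = new List<string>();
        int position = 0;
        while (true)
        {
            Match match = Field.Match(line, position);
            if (!match.Success)
            {
                throw new FormatException("Cannot parse column at index " + position);
            }
            columns.Add(match.Groups["column"].Value);

            // An empty separator means the column ended at the end of the line, so stop here
            // instead of letting the regex produce one more empty match.
            if (match.Groups["sep"].Length == 0)
            {
                break;
            }
            position = match.Index + match.Length;
        }
        return columns;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join("|", Split("\"abc\",,,\"efg\"")));         // abc|||efg
        Console.WriteLine(string.Join("|", Split("\"abc\",\"\",\"\",\"efg\""))); // abc|||efg
        Console.WriteLine(string.Join("|", Split("\"abc,efg\",hijk,")));         // abc,efg|hijk|
    }
}
```

Because the loop, not the regex, decides when the line is finished, consecutive empty columns, quoted empty columns and a trailing comma each produce exactly one column.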

Sunday, June 29, 2008

Regular Expression for parsing a CSV file with C#

I have to parse CSV (comma-separated values) files regularly. It is quite surprising how much information exchange is still done in CSV format. I googled for an efficient way to parse a CSV file. For small CSV files, a regular expression is a solution with middling performance. The regular expressions used to parse CSV files tend to be quite complex.

After some trial and error, I found a regular expression that tokenises a CSV line quite nicely. It is surprisingly simple:

[^,"]*

It says to exclude commas and quotes from the match, so it matches only the other characters. BUT NOTE: this simple expression can only handle very simple situations. If a token can contain a comma (the separator), e.g. "abcd,efg", the simple expression is not adequate.

A more complex and comprehensive expression can be found on OmegaMan's blog page. On his page you can find some very useful tips about using regular expressions in C#.

Based on OmegaMan's expression for tokenising CSV, this one parses a single line of CSV:

((?(?=[\x22\x27])(?:[\x22\x27]+)(?<column>[^\x22\x27]*)(?:[\x22\x27]+)|(?<column>[^,\r\n]*))(?:,?)) #\x22 is double quote ", \x27 is single quote '

I have yet to figure out one final problem: the above regular expression will always return an empty match at the end. I need to figure out how to stop that from happening.



Friday, June 27, 2008

OSX Keyboard shortcuts, Way of doing things

This is just a reminder for myself

OS:
Shut down OS X: ctrl + option + cmd + eject
Restart: ctrl + cmd + eject
Force Quit window: option + cmd + esc

Finder:
Empty the trash: cmd + shift + backspace
Info for a single item: cmd + i
Info for multiple items: cmd + option + i

VMWare Fusion 1.X.X Windows XP Pro x64 Problem

I have VMWare Fusion installed on my MacBook Pro to run Windows XP so that I can do Visual Studio C# development on my Mac (best of both worlds?). Most of the time it works well, but there are problems.

I have some problems running Visual Studio 2008 and SQL Management Studio 2005 on Windows XP Pro x64 Edition. I constantly get memory access violation messages from the .NET Framework, and Visual Studio 2008 quits randomly.

VMWare support suggested setting a flag in the settings file: mainmem.useNamedFile = "FALSE". I tried it; it seems to have some effect but does not solve the problem.

I installed another instance of Windows XP, the 32-bit edition, and it has no problems.

I would think this indicates that something is not right in Fusion's support for Windows XP Pro x64.

Wednesday, June 25, 2008

Vodafone NZ HSDPA Broadband XU870 on MAC OS X

I use the Vodafone NZ Mobile Broadband (HSDPA) service with a Novatel Merlin XU870 ExpressCard on my MacBook Pro running OS X 10.5.3. Most of the time, the connection has been somewhat unstable.

When I first moved to OS X 10.5 Leopard, Vodafone NZ did not support the new operating system. I had to resort to reading tons of forum posts and to trial and error to figure out a way to make it work. I did manage to get it working using just Leopard's built-in software and driver for the XU870.
Go to System Preferences -> Network -> select XU870 -> Advanced options -> WWAN tab -> select Novatel in the Vendor drop-down list, then click Apply (important, otherwise the next option will not appear), then select GSM in the Model drop-down list, then enter www.vodafone.net.nz into the APN text box and leave CID as the default (1). Click Apply.
Go back to the general settings page and click Connect; you should be able to connect. After this, you will be able to see the signal strength, and connect and disconnect, from the menu bar icon.

The above workaround works, but it is not very stable (at least for me). The connection sometimes just hangs: you can send requests (upload traffic) to the server but there does not seem to be any reply (download traffic). You then need to disconnect, remove the XU870 card, reinsert it and reconnect to make it work again. This is rather annoying. I am not entirely sure whether this problem is due to the hacked way of setting it up or simply due to Vodafone NZ's crappy service. One way to keep it connected is to keep something open that constantly uses the internet connection, e.g. remote desktop or a radio player.

Then in May 2008 Vodafone NZ finally supported OS X 10.5 with new software, version 2.08.05.04, six months after the launch of Leopard. (Link to download Vodafone software) I thought it might be better to use the Mobile Connect software, and that it might solve the unstable connection problem. However, I was hoping for too much; that is not the case. It makes things worse than before. It is still unstable (maybe a little better, but not much), and the connection still hangs every so often. It is also more difficult to connect and disconnect: to do these you need to go to the Network page in System Preferences.

Then today I came across the Launch2Net software. It is supposed to be really good judging by the reviews, such as this one. But it costs 75 euros for a single license. Very expensive. Is it too much to spend just for more convenience? I will have to think again.

Tuesday, June 24, 2008

Windows XP Update Failed After Refresh Install (upgrade on top of existing system)

In this CNET forum post, you can find a very simple and effective solution to the Windows Update failure problem.

I had a problem with my Windows XP instance. After a refresh install (reinstalling Windows XP using the upgrade option), Windows Update reported that it failed to apply updates, with no obvious reason. I have had this problem several times before. After lots of searching, I finally came across this effective and simple solution.

Windows XP updates fail to install

by fnlvn2 - 28/07/07 00:16
In reply to: Windows Update KB928366 by jackfrost64

This sounds like a problem that I have after I've performed a repair installation. It's taken me about 6 months of research through tech sites, including Microsoft's, to find a repair that always works. In summary, all of my updates would download, but all failed to install.
The solution is to register the DLL files associated with the Windows Update program. To do this, do the following:
Go to the Start button, then the Run button, and type the following:

regsvr32 wuapi.dll (press Enter; you will receive a confirmation message)
regsvr32 wuaueng.dll
regsvr32 wuaueng1.dll
regsvr32 wucltui.dll
regsvr32 wups.dll
regsvr32 wups2.dll
regsvr32 wuweb.dll

Reboot and go to the windows update site and try to install your security update. I do understand that my problem was somewhat different than yours. However, one of the items that I was attempting to update was the same as yours. Therefore I'm quite confident that the solution is the same.

First Blog

Nothing to say really, I just started a blog to try it out.

I think I am going to use it to record interesting things (mainly IT technical) around the web that I want to remember.