Optimizing a custom Trim() function in C#
I needed to write a Trim() function that removed anything that wasn't a digit from the beginning and end of a string. What usually happens when I write a blog entry like this is that someone posts a comment a few days later saying "hey, you didn't need to write that, it's already in the .NET library." If it is, I haven't been able to find it.
Here is my TrimNonDigits() function first attempt:
protected string TrimNonDigits_v1(string text)
{
while (text.Length > 0 && !Char.IsDigit(text, 0))
{
text = text.Substring(1);
}
while (text.Length > 0 && !Char.IsDigit(text, text.Length - 1))
{
text = text.Substring(0, text.Length - 1);
}
return text;
}
Now this function works and for what I needed it for (trivially short strings) it would do the job. But it is my intention to add it to a string extension library and I had a fear that someone would pass in an enormous string and with all the Substring() operations creating new strings this may become extremely inefficient.
To see how inefficient this function was I wrote another unit test to specifically measure the speed at which the code ran. Here is that unit test:
[TestMethod]
public void trim_non_digits_big_text()
{
int repeat = 10000;
UnitTestDerived utd = new UnitTestDerived();
string lotsOfCharacters = String.Join("", Enumerable.Repeat("abcdef", repeat).ToArray());
string text = lotsOfCharacters + "1" + lotsOfCharacters;
string expected = "1";
string actual = utd.TrimNotDigits(text);
Assert.AreEqual(expected, actual);
}
The unit test passes but takes 21 seconds to run. Although extremely unlikely that this amount of non-digit-text on either side of a number might appear it's still a valid use case so the code needed reworking so that the Substring() function was only called once. This is the resulting change to the code:
protected string TrimNonDigits(string text)
{
int start = 0, end = text.Length - 1;
while (text.Length > start && !Char.IsDigit(text, start))
{
start++;
}
if (start != end)
{
while (end > start && !Char.IsDigit(text, end))
{
end--;
}
}
return text.Substring(start, 1 + end - start);
}
The unit test runs against this new function in under 1 second. That's a dramatic difference and good demonstration for the case of not using Substring() repeatedly in a loop if you can help it.
A side note. I also had several other unit tests that checked the validity of the results of the algorithm before I did the optimization. These included the testing of edge cases and extremes. This made the optimization refactor a cinch because I was fairly confident at the end that the algorithm will work in all situations.