Code Review: A simple markup processor

The purpose of this post is to find a better way to write the code shown below. The examples below are not meant to be best practices or examples I would expect to be used in any system. I came up with this and am posting it to get a code review from anyone who stumbles upon it. I would love to find a better way to accomplish this, so if you have ideas or suggestions, please post them in the comments.

I have a side project I’m working on which requires users to be able to enter in comments. I wanted to support a very limited set of formatting so things like TinyMCE, FCKEditor, Markdown and the like are out. I have some basic requirements:

  1. Convert *italics* to <em>italics</em> and likewise for _bold_ to <strong>bold</strong>, but not in code blocks.
  2. For code blocks, convert lines starting with at least four spaces to be wrapped in <pre></pre> tags. Consecutive lines should be grouped together.
  3. Encode any HTML.
  4. Convert line breaks to HTML line breaks outside of code blocks.

I’ve created a text file you can look at that has some sample input.

For the first requirement, I’ve created a method to take a string input, a delimiter and an HTML tag to replace the delimiters with. I’ve also created a list of special characters used in regular expressions so I can assemble a proper regex if the delimiter is one of those characters.

private static List<string> SpecialRegexChars =

    new List<string> { "$", "^", "{", "[", "(", "|", ")", "]", "}", "*", "+", "?", @"\" };

 

private static string ReplaceWithHtml(string input, string delimiter, string tag)

{

    if (SpecialRegexChars.Contains(delimiter))

        delimiter = @"\" + delimiter;

 

    string regex = delimiter + "(.+)" + delimiter;

 

    string output = input;

    Regex r = new Regex(regex);

 

    foreach (Match match in r.Matches(input))

    {

        if (match.Groups.Count > 1)

        {

            output = output.Replace(match.Groups[0].Value, string.Format("<{1}>{0}</{1}>", match.Groups[1].Value, tag));

        }

    }

    return output;

}

Originally I ran this method using the entire sample input and it worked great, except for that if you had text surrounded by * or _ in a code block, it was replaced with a tag. So I set this aside and moved on to detecting code blocks so I could know where to avoid performing the replacement.

A code block for this scenario is any single or consecutive group a lines of text that start with four or more spaces. The beginning of every block needs to be prepended with a <pre> tag and appended with a </pre> closing tag. I considered using regex to handle this, but I couldn’t (and honestly didn’t want to) get my head around it. If there is a simple, easily-readable regex (there’s an oxymoron) that accomplishes what I’m about to demonstrate, please share it in the comments.

Short of an elegant regex, the next option I decided to try is splitting the input into an array of lines and looping through each line and determining whether or not it is a code block along with the lines before and after it. The vast majority of input will be relatively short so while a loop isn’t perfect, it actually works pretty well in this case. It also allows me to process the tag replacement on a per-line basis depending on whether or not that line is code.

Not too long after starting on my Codify method I had a morass of complicated and nearly unreadable ‘if’ statements. But, it worked! I was getting exactly the desired output. But I knew I could never come back and read the code later if the requirements changed. I worked at cleaning it up as much as possible focusing on improving readability, and this is the result:

private static string Codify(string input)

{

    bool isInCodeBlock = false;

    string[] lines = input.Split(new[] { Environment.NewLine }, StringSplitOptions.None);

 

    for (int i = 0; i < lines.Length; i++)

    {

        bool isFirst = i == 0;

        bool isLast = i == lines.Length – 1;

        bool isNotFirst = i > 0;

        bool isNotLast = i < lines.Length – 1;

        string nextLine = (isNotLast ? lines[i + 1] : "").TrimEnd();

        string prevLine = (isNotFirst ? lines[i - 1] : "").TrimEnd();

        bool nextLineIsCode = isLast ? false : nextLine.StartsWith("    ");

        bool prevLineIsCode = prevLine.StartsWith("    ");

        string prefix = "";

        string suffix = "";

        string contents = HttpUtility.HtmlEncode(lines[i].TrimEnd());

        bool thisLineIsCode = contents.StartsWith("    ");

 

        if (((isFirst) || (isNotFirst && !isInCodeBlock)) && thisLineIsCode)

        {

            prefix = "<pre>" + Environment.NewLine;

            isInCodeBlock = true;

        }

 

        if (!isInCodeBlock)

        {

            contents = ReplaceWithHtml(contents, "*", "em");

            contents = ReplaceWithHtml(contents, "_", "strong");

        }

 

        if (isInCodeBlock && !nextLineIsCode)

        {

            suffix = Environment.NewLine + "</pre>";

            isInCodeBlock = false;

        }

 

        if (!thisLineIsCode && !nextLineIsCode && !prevLineIsCode)

            suffix = "<br />";

 

        lines[i] = string.Concat(prefix, contents, suffix);

    }

 

    return string.Join(Environment.NewLine, lines);

}

While this isn’t the prettiest 45 lines of code I’ve ever written, it does work.

If you see something, anything, wrong with this code, tell me about it in the comments! Don’t hold back, I want to know everything you think is wrong with this code. I’m trying to work on not being attached to code I write so this will be good practice.

Posted December 29th, 12:43 AM
Read more posts about .NET, C#, Code Review.

View Comments
Link

My Projects

ManagedAssembly

RestSharp

jQuery Snippets for Visual Studio 2010

@dotnetlinks on Twitter

SnapLeague