Code Review: A simple markup processor

The purpose of this post is to find a better way to write the code shown below. The examples below are not meant to be best practices or examples I would expect to be used in any system. I came up with this and am posting it to get a code review from anyone who stumbles upon it. I would love to find a better way to accomplish this, so if you have ideas or suggestions, please post them in the comments.

I have a side project I’m working on which requires users to be able to enter in comments. I wanted to support a very limited set of formatting so things like TinyMCE, FCKEditor, Markdown and the like are out. I have some basic requirements:

  1. Convert *italics* to <em>italics</em> and likewise for _bold_ to <strong>bold</strong>, but not in code blocks.
  2. For code blocks, convert lines starting with at least four spaces to be wrapped in <pre></pre> tags. Consecutive lines should be grouped together.
  3. Encode any HTML.
  4. Convert line breaks to HTML line breaks outside of code blocks.

I’ve created a text file you can look at that has some sample input.

For the first requirement, I’ve created a method to take a string input, a delimiter and an HTML tag to replace the delimiters with. I’ve also created a list of special characters used in regular expressions so I can assemble a proper regex if the delimiter is one of those characters.

private static List<string> SpecialRegexChars =

    new List<string> { "$", "^", "{", "[", "(", "|", ")", "]", "}", "*", "+", "?", @"\" };

 

private static string ReplaceWithHtml(string input, string delimiter, string tag)

{

    if (SpecialRegexChars.Contains(delimiter))

        delimiter = @"\" + delimiter;

 

    string regex = delimiter + "(.+)" + delimiter;

 

    string output = input;

    Regex r = new Regex(regex);

 

    foreach (Match match in r.Matches(input))

    {

        if (match.Groups.Count > 1)

        {

            output = output.Replace(match.Groups[0].Value, string.Format("<{1}>{0}</{1}>", match.Groups[1].Value, tag));

        }

    }

    return output;

}

Originally I ran this method using the entire sample input and it worked great, except for that if you had text surrounded by * or _ in a code block, it was replaced with a tag. So I set this aside and moved on to detecting code blocks so I could know where to avoid performing the replacement.

A code block for this scenario is any single or consecutive group a lines of text that start with four or more spaces. The beginning of every block needs to be prepended with a <pre> tag and appended with a </pre> closing tag. I considered using regex to handle this, but I couldn’t (and honestly didn’t want to) get my head around it. If there is a simple, easily-readable regex (there’s an oxymoron) that accomplishes what I’m about to demonstrate, please share it in the comments.

Short of an elegant regex, the next option I decided to try is splitting the input into an array of lines and looping through each line and determining whether or not it is a code block along with the lines before and after it. The vast majority of input will be relatively short so while a loop isn’t perfect, it actually works pretty well in this case. It also allows me to process the tag replacement on a per-line basis depending on whether or not that line is code.

Not too long after starting on my Codify method I had a morass of complicated and nearly unreadable ‘if’ statements. But, it worked! I was getting exactly the desired output. But I knew I could never come back and read the code later if the requirements changed. I worked at cleaning it up as much as possible focusing on improving readability, and this is the result:

private static string Codify(string input)

{

    bool isInCodeBlock = false;

    string[] lines = input.Split(new[] { Environment.NewLine }, StringSplitOptions.None);

 

    for (int i = 0; i < lines.Length; i++)

    {

        bool isFirst = i == 0;

        bool isLast = i == lines.Length – 1;

        bool isNotFirst = i > 0;

        bool isNotLast = i < lines.Length – 1;

        string nextLine = (isNotLast ? lines[i + 1] : "").TrimEnd();

        string prevLine = (isNotFirst ? lines[i - 1] : "").TrimEnd();

        bool nextLineIsCode = isLast ? false : nextLine.StartsWith("    ");

        bool prevLineIsCode = prevLine.StartsWith("    ");

        string prefix = "";

        string suffix = "";

        string contents = HttpUtility.HtmlEncode(lines[i].TrimEnd());

        bool thisLineIsCode = contents.StartsWith("    ");

 

        if (((isFirst) || (isNotFirst && !isInCodeBlock)) && thisLineIsCode)

        {

            prefix = "<pre>" + Environment.NewLine;

            isInCodeBlock = true;

        }

 

        if (!isInCodeBlock)

        {

            contents = ReplaceWithHtml(contents, "*", "em");

            contents = ReplaceWithHtml(contents, "_", "strong");

        }

 

        if (isInCodeBlock && !nextLineIsCode)

        {

            suffix = Environment.NewLine + "</pre>";

            isInCodeBlock = false;

        }

 

        if (!thisLineIsCode && !nextLineIsCode && !prevLineIsCode)

            suffix = "<br />";

 

        lines[i] = string.Concat(prefix, contents, suffix);

    }

 

    return string.Join(Environment.NewLine, lines);

}

While this isn’t the prettiest 45 lines of code I’ve ever written, it does work.

If you see something, anything, wrong with this code, tell me about it in the comments! Don’t hold back, I want to know everything you think is wrong with this code. I’m trying to work on not being attached to code I write so this will be good practice.

Posted December 29th, 2008 12:43 AM
Read more posts about .NET, C#, Code Review.

View Comments
Link

  • You could use Regex.Escape instead of checking SpecialRegexChars.
  • Good to know! Thanks
  • Matt
    Incidentally I know you said you didn't want markdown but given that it is available for .Net as a port (http://www.aspnetresources.com/blog/markdown_announced.aspx) have you considered just accepting using that as away to get started since it should do all you need (and the source is available if you really wanted to restrict it to the very limited form described)?

    Being a subset of a wider format would allow you to expand easily in future...
  • Matt
    All other simple markup rules I ever used had *foo* for bold and _bar_ for italics. Just a note

    You are doing a simple state machine but are spreading your state around amongst multiple variables.
    You are disallowing bold/italics across lines and also doing greedy matching so
    *foo* and *bar*
    becomes:
    foo* and *bar

    Also how do you want to handle the following?

    _*foo*_ vs _*foo_*

    The former should trigger em and strong, the latter might if you are being nice (though you will have to make you code more complex to output legit xhtml)

    Assuming you will only accept well formed input do it as a state machine with an explicit stack
    something like (not complied - assumes a Deque like the powercollections one)

    private class TextAndToken
    {
    public TextAndToken() : this(char.MinValue) {}
    public TextAndToken(char c) { this.Token = c; }
    public readonly char Token;
    public readonly StringBuilder Text = new StringBuilder();

    public void Append(TextAndToken t)
    {
    if (t.Token != char.MinValue)
    this.Text.Append(t.Token);
    this.Text.Append(t.Text);
    }
    }

    private string bool IsCodeBlock(string line)
    {
    return line.StartsWith(" "); // could allow empty lines already in code block later by passing in the state variable
    }

    private bool MatchesTopOfStack(char c, Deque<textandtoken> stack, out string tag)
    {
    tag = null;
    switch (c)
    {
    case '*':
    tag = "em";
    break;
    case '_':
    tag = "strong";
    break;
    }
    return (stack.Back.Token != c);
    }

    private void FlattenParseStack(Deque<textandtoken> parseStack)
    {
    if (parseStack.Count == 1)
    return;
    TextAndToken flattened = new TextAndToken();
    while (parseStack.Count > 1)
    {
    flattened.Append(parseStack.RemoveFromFront());
    }
    parseStack.AddToBack(flattened);
    }

    private string Codify(string input)
    {
    bool isInCodeBlock = false;
    Deque<textandtoken> parseStack = new Deque<textandtoken>();
    parseStack.Add(new Token());
    string[] lines = input.Split(new string[] { Environment.NewLine }, StringSplitOptions.None);
    foreach (var l in lines)
    {
    // decision here - no matter what we encode http characters
    string contents = HttpUtility.HtmlEncode(line.TrimEnd());
    if (IsCodeBlock(contents))
    {
    if (!inCodeBlock)
    {
    // decision - a code block terminates all previous possible markup
    FlattenParseStack(parseStack);
    parseStack.Back.Text.Append("
    ");				
    
    inCodeBlock= true;
    }
    continue;
    }

    if (isInCodeBlock)
    {
    isInCodeBlock = false;
    parseStack.Back.Text.Append("
    ");
    }
    // all tokens are characters - simple code as a result
    foreach (char c in contents)
    {
    if (MatchesTopOfStack(c, parseStack, out tag))
    {
    var block = parseStack.RemoveFromBack();
    praseStack.Back.Text.AppendFormat("<{0}>{1}", tag, block.Text);
    }
    else if (tag != null)
    {
    parseStack.AddToFront(new TextAndToken(c));
    }
    else
    {
    parseStack.Back.Text.Append(c);
    }
    }
    parseStack.Back.Text.Append(Environment.NewLine);
    }
    FlattenParseStack(parseStack);
    return parseStack.Back.Text.ToString();
    }</textandtoken></textandtoken></textandtoken></textandtoken>
  • A slight tweak to the regex fixes the spanning issue: \*[^\*]+\* It doesn't fix multiline italics or bold though.

    I tried getting your code up and running but didn't get it to work once it compiled. If you're interested in following up, email me. I'll let you in on what the project is.
  • Matt
    yuck - sorry the fomatting is lost - Disqus doesn't appear to do it's own markup and I can't edit
  • Matt
    oh - and .StartsWith(" "); should contain 4 spaces - I now officially hate Disqus
    plus I actually fluffed the code blocks - I failed to strip the leading 4 spaces :(
blog comments powered by Disqus