
Content Parsing with Femur

A look at Femur's parsing architecture, designed for content manipulation, transformation, and rendering through custom AST walkers.

This website is built on chtml, a templating system I wrote that takes a lot of inspiration from Razor. Razor compiles templates into classes and creates instances behind the scenes for every template it processes. I wanted to try the same approach — templates embedded directly in C#, compiled ahead of time — but with fewer allocations and less framework complexity. That meant accepting fewer features, which was a trade-off I was fine with.

That context matters here because it's what drove the need for these parsing libraries. I needed to compile Markdown articles into C# code at build time. The goal was zero-allocation rendering at runtime: no string manipulation, no runtime parsing, just generated code that writes directly to an output stream.

Part of the motivation was wanting to learn by building. Parsers are one of those things I'd always worked around rather than through. But there was a real technical gap too. Most existing Markdown libraries don't expose the AST in a way that supports that kind of output. They parse to HTML, or they give you a tree you can read but not easily use to drive code generation. I needed something I could walk and use as input to a code generator. So I built it.

That's where the Femur parsing libraries came from. The Markdown and HTML parsers are both built around the same idea: parse content into an AST you can walk, transform, and render to any output format you want. They're not serialization tools. They don't map content to typed objects the way JSON or XML libraries do. They're for manipulating documents: transforming structure, extracting information, generating output.

The Problem Space

When working with content-heavy applications (documentation sites, blogs, CMSs, static site generators), you often need to:

  1. Parse content from various formats (Markdown, HTML, custom markup)
  2. Analyze or transform the content (extract links, rewrite URLs, inject components)
  3. Render to different output formats (HTML, plain text, custom formats)

Most existing parsing libraries fall into one of two camps:

  • String-based manipulation: Fast but brittle, error-prone, and can't handle nesting properly
  • Heavy DOM libraries: Powerful but allocate heavily, often tied to a specific output format

Femur parsing libraries sit in between: lightweight AST-based parsing with streaming support and extensible rendering.

Architecture Overview

The Femur parsing ecosystem is built in layers:

┌─────────────────────────────────────────────────┐
│     Concrete Parsers & Renderers                │
│  (MarkdownParser, HtmlParser, Renderers)        │
├─────────────────────────────────────────────────┤
│     Domain-Specific Abstractions                │
│  (MarkdownNodeType, MarkupNodeType, Walkers)    │
├─────────────────────────────────────────────────┤
│     Base Parsing Infrastructure                 │
│  (StreamParser<T>, Node, ParentNode)            │
└─────────────────────────────────────────────────┘

Base Infrastructure: StreamParser<T>

At the foundation is StreamParser<T>, a generic streaming parser that handles reading from streams efficiently:

public abstract class StreamParser<TDocument> : IDisposable
{
    protected char[] Buffer;           // Pooled from ArrayPool
    protected int Position;            // Current position in buffer
    protected int Length;              // Valid data length
    protected int TotalCharsRead;      // Absolute position tracking
    
    // Template methods for subclasses
    protected abstract TDocument CreateDocument();
    protected abstract void InitializeParsing(TDocument document);
    protected abstract void ProcessCharacter(char ch, TDocument document);
    protected virtual void Cleanup();
    
    // Utility methods for parsing
    protected bool ReadMore();
    protected int GetAbsolutePosition();
    protected void SkipWhitespace();
    protected string ReadUntil(char delimiter);
}

Key design points:

  • Streaming: Reads in 4KB chunks, handles files of any size without loading entirely into memory
  • Buffer pooling: Uses ArrayPool<char> to avoid allocations
  • Position tracking: Maintains absolute position across buffer boundaries for source location tracking
  • Template Method pattern: Subclasses implement domain logic, base class handles streaming complexity

Subclasses focus on what to parse, not how to read efficiently.
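
The rent/refill/return pattern at the heart of this is worth seeing concretely. Here is a standalone sketch of that loop (an illustration of the technique, not Femur's actual implementation), assuming the 4KB chunk size mentioned above:

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Text;

// Rent a buffer, refill it as characters are consumed, return it when done.
var input = new MemoryStream(Encoding.UTF8.GetBytes("hello stream"));
using var reader = new StreamReader(input);

char[] buffer = ArrayPool<char>.Shared.Rent(4096);
var consumed = new StringBuilder();
try
{
    int read;
    while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        // A real parser would dispatch each char to ProcessCharacter and
        // track an absolute position that survives buffer refills.
        for (int i = 0; i < read; i++)
            consumed.Append(buffer[i]);
    }
}
finally
{
    ArrayPool<char>.Shared.Return(buffer);
}

Console.WriteLine(consumed.ToString()); // hello stream
```

The try/finally around the pooled buffer matters: a parser that throws mid-document must still return the rented array, or the pool's benefit evaporates.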

Node System: Extensible AST

The AST is built on a simple node hierarchy:

public class Node
{
    public NodeType NodeType { get; set; }
    public SourceLocation SourceLocation { get; set; }
    
    // Navigation
    public Node? GetParent();
    public Node? GetNextSibling();
    public Node? GetPreviousSibling();
    public IEnumerable<Node> GetAncestors();
}

public class ParentNode : Node
{
    private List<Node>? _children;  // Lazy initialized!
    
    public List<Node> Children => _children ??= new();
    public bool HasChildren => _children?.Count > 0;
}

Design highlights:

  • Navigation in all directions: Walk up (parent/ancestors), down (children), or sideways (siblings)
  • Lazy initialization: Children list only allocated if the node actually has children
  • Extensible types: NodeType.Custom("MyType") lets parsers define domain-specific node types
  • Source tracking: Every node knows where it came from in the source document
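
The lazy-initialization point matters more than it looks: in a typical document, leaf nodes (text, code spans) vastly outnumber containers, so never allocating their child lists saves real memory. A standalone sketch of the idea (mirroring the pattern above, not Femur's exact types):

```csharp
using System;
using System.Collections.Generic;

var leaf = new SketchNode();        // no list allocated yet
var parent = new SketchNode();
parent.Children.Add(leaf);          // first access to Children allocates

Console.WriteLine(leaf.HasChildren);    // False
Console.WriteLine(parent.HasChildren);  // True

class SketchNode
{
    private List<SketchNode>? _children;   // stays null for leaf nodes

    public List<SketchNode> Children => _children ??= new();
    public bool HasChildren => _children?.Count > 0;
}
```

Note that `HasChildren` checks the backing field rather than the property, so asking the question never triggers the allocation it is trying to avoid.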

Markdown Parser: Two-Phase CommonMark

The MarkdownParser implements the CommonMark spec using a two-phase approach:

Phase 1: Block Structure (line-by-line)

  • Headings (ATX: # Title, Setext: underlined)
  • Code blocks (fenced: ```, indented: 4 spaces)
  • Lists (ordered/unordered, nested)
  • Block quotes (>)
  • Thematic breaks (---)
  • HTML blocks
  • Paragraphs

Phase 2: Inline Structure (character-by-character within blocks)

  • Emphasis (*emphasis*, _emphasis_)
  • Strong emphasis (**strong**, __strong__)
  • Code spans (`code`)
  • Links ([text](url))
  • Images (![alt](url))
  • Line breaks

Optional Phase 3: Smart Punctuation

  • Curly quotes ("quotes" → “quotes”)
  • Dashes (-- → en-dash, --- → em-dash)
  • Ellipsis (... → …)

Basic usage:

// Simple API
using var stream = File.OpenRead("document.md");
var document = MarkdownParser.Parse(stream);

// Or from string
var document = MarkdownParser.Parse("# Hello\n\nParagraph text");

Extended Markdown: Front Matter Support

ExtendedMarkdownParser extends the base parser to support YAML front matter:

var markdown = @"---
title: My Article
author: Nathan Quandt
tags: [parsing, markdown, femur]
---

# Content starts here

Article body...";

var document = ExtendedMarkdownParser.Parse(markdown);

// Access front matter
var title = document.FrontMatterBlock?.ParsedData["title"];
var tags = document.FrontMatterBlock?.ParsedData["tags"] as List<object>;

Front matter is parsed as a separate FrontMatterBlockNode attached to the document, containing both the raw YAML text and parsed key-value pairs.

HTML Parser: Standard Markup Parsing

The HtmlParser provides streaming HTML parsing with AST generation:

  • Elements with attributes: case-preserved tag names, lazy attribute dictionary
  • Self-closing tags: <br />, <img /> detected and marked
  • Void elements: HTML void elements (img, br, input, etc.) automatically recognized
  • Comments: <!-- comment --> parsed as CommentNode
  • CDATA sections: <![CDATA[...]]> supported
  • DOCTYPE declarations: <!DOCTYPE html> parsed as DocumentTypeNode
  • SVG/MathML: delegates to XML parser for proper namespace handling
  • Script/style preservation: content inside <script> and <style> preserved exactly

Node types:

// Core nodes from Femur.Markup.Abstractions
DocumentNode     // Root document
ElementNode      // HTML elements (<div>, <p>, etc.)
  ├─ TagName: string
  ├─ Attributes: Dictionary<string, string>
  ├─ IsSelfClosing: bool
  └─ IsVoidElement: bool

TextNode         // Text content
CommentNode      // <!-- comments -->
CDataNode        // <![CDATA[...]]>
DocumentTypeNode // <!DOCTYPE html>
XmlElementNode   // SVG/MathML elements

Usage example:

using Femur.Html.Parser;
using Femur.Markup.Abstractions.Nodes;

var html = @"
<!DOCTYPE html>
<div class='container' id='main'>
  <h1>Hello World</h1>
  <p>This is a <strong>paragraph</strong>.</p>
  <!-- Comment here -->
</div>";

var document = HtmlParser.Parse(html);

// Navigate AST
foreach (var node in document.Children)
{
    if (node is ElementNode element)
    {
        Console.WriteLine($"Tag: {element.TagName}");
        
        if (element.HasAttributes)
        {
            foreach (var attr in element.Attributes)
                Console.WriteLine($"  {attr.Key}={attr.Value}");
        }
        
        WalkChildren(element);
    }
}

void WalkChildren(ContainerNode parent)
{
    foreach (var child in parent.Children)
    {
        if (child is ElementNode el)
            WalkChildren(el);
        else if (child is TextNode text)
            Console.WriteLine($"Text: {text.Content}");
    }
}

Navigation example:

var document = HtmlParser.Parse("<div><p>First</p><p>Second</p></div>");
var div = (ElementNode)document.Children[0];
var firstP = (ElementNode)div.Children[0];

// Navigate up
var parent = firstP.GetParent(); // Returns div

// Navigate sideways
var nextSibling = firstP.GetNextSibling(); // Returns second <p>
var allSiblings = firstP.GetElementSiblings(); // All element siblings

// Navigate to ancestors
var ancestors = firstP.GetAncestors(); // [div, document]

// Location tracking
Console.WriteLine($"Offset: {firstP.Location.Offset}");
Console.WriteLine($"Length: {firstP.Location.Length}");

Rendering: AST Walkers

The AST approach pays off at rendering time: you can walk the tree in whatever order you need and produce whatever output you want. Femur uses the Visitor pattern for this through MarkdownAstWalker.

Rendering to HTML

public class MarkdownHtmlRenderer : MarkdownAstWalker
{
    private readonly StringBuilder _output = new();
    
    public string Render(MarkdownDocumentNode document)
    {
        _output.Clear();
        Walk(document);
        return _output.ToString();
    }
    
    protected override void VisitHeading(HeadingNode node)
    {
        _output.Append($"<h{node.Level}>");
        WalkChildren(node);  // Render inline content
        _output.Append($"</h{node.Level}>");
    }
    
    protected override void VisitCodeBlock(CodeBlockNode node)
    {
        var lang = !string.IsNullOrEmpty(node.Info)
            ? $" class=\"language-{EscapeHtml(node.Info)}\""
            : "";
        
        _output.Append($"<pre><code{lang}>");
        _output.Append(EscapeHtml(node.Content));
        _output.Append("</code></pre>");
    }
    
    protected override void VisitLink(LinkNode node)
    {
        _output.Append($"<a href=\"{EscapeAttribute(node.Url)}\"");
        if (!string.IsNullOrEmpty(node.Title))
            _output.Append($" title=\"{EscapeAttribute(node.Title)}\"");
        _output.Append('>');
        WalkChildren(node);
        _output.Append("</a>");
    }
}

// Usage
var renderer = new MarkdownHtmlRenderer();
string html = renderer.Render(document);

You only override the visitor methods you care about. The walker handles tree traversal.

Rendering to Other Formats

You're not limited to HTML. The same walker approach works for any output:

public class MarkdownToPlainTextRenderer : MarkdownAstWalker
{
    private readonly StringBuilder _output = new();
    
    protected override void VisitHeading(HeadingNode node)
    {
        var text = GetTextContent(node);
        _output.AppendLine(text.ToUpper());
        _output.AppendLine(new string('=', text.Length));
        _output.AppendLine();
    }
    
    protected override void VisitCodeBlock(CodeBlockNode node)
    {
        foreach (var line in node.Content.Split('\n'))
            _output.AppendLine($"    {line}");
        _output.AppendLine();
    }
    
    protected override void VisitLink(LinkNode node)
    {
        WalkChildren(node);
        _output.Append($" [{node.Url}]");
    }
}

Extracting Information

Walk the AST without generating output to extract information:

public class LinkCollector : MarkdownAstWalker
{
    public List<string> ExternalLinks { get; } = new();
    public List<string> InternalLinks { get; } = new();
    
    protected override void VisitLink(LinkNode node)
    {
        if (node.Url.StartsWith("http://") || node.Url.StartsWith("https://"))
            ExternalLinks.Add(node.Url);
        else if (node.Url.StartsWith("#") || node.Url.StartsWith("/"))
            InternalLinks.Add(node.Url);
        
        base.VisitLink(node);
    }
}

var collector = new LinkCollector();
collector.Walk(document);

Console.WriteLine($"Found {collector.ExternalLinks.Count} external links");
Console.WriteLine($"Found {collector.InternalLinks.Count} internal links");

Use cases for this pattern:

  • Extract all links for validation
  • Generate a table of contents from headings
  • Find all images that need optimization
  • Build search indexes
  • Validate document structure
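
The table-of-contents case is a good example of how little code the pattern needs. This sketch uses flat (level, text) records standing in for what a walker's VisitHeading override would collect, rather than Femur's actual node types:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Heading records as a VisitHeading override might accumulate them,
// in document order.
var headings = new List<(int Level, string Text)>
{
    (1, "Intro"), (2, "Setup"), (2, "Usage"), (3, "Advanced"),
};

// Render a nested markdown list, indenting two spaces per heading level.
var toc = new StringBuilder();
foreach (var (level, text) in headings)
    toc.Append($"{new string(' ', (level - 1) * 2)}- {text}\n");

Console.Write(toc.ToString());
// - Intro
//   - Setup
//   - Usage
//     - Advanced
```

A real implementation would also slugify each heading into an anchor id so the list items can link into the document.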

Modifying the AST

You can also modify nodes in place as you walk, which lets you chain multiple passes together:

public class UrlRewriter : MarkdownAstWalker
{
    private readonly string _baseUrl;
    
    public UrlRewriter(string baseUrl) => _baseUrl = baseUrl;
    
    protected override void VisitLink(LinkNode node)
    {
        if (!node.Url.StartsWith("http"))
            node.Url = $"{_baseUrl.TrimEnd('/')}/{node.Url.TrimStart('/')}";
        
        base.VisitLink(node);
    }
    
    protected override void VisitImage(ImageNode node)
    {
        if (!node.Url.StartsWith("http"))
            node.Url = $"{_baseUrl.TrimEnd('/')}/{node.Url.TrimStart('/')}";
        
        base.VisitImage(node);
    }
}

// Chain multiple passes
var document = MarkdownParser.Parse(markdown);
new UrlRewriter("https://example.com").Walk(document);
new ImageOptimizer().Walk(document);
new HeadingIdGenerator().Walk(document);
var html = new MarkdownHtmlRenderer().Render(document);

HTML Transformation

The HTML parser doesn't include a built-in walker, but the node callback feature or manual recursion covers most cases:

// Extract all links using the node callback
var links = new List<string>();

HtmlParser.Parse(html, node =>
{
    if (node is ElementNode { TagName: "a" } anchor)
    {
        if (anchor.Attributes.TryGetValue("href", out var href))
            links.Add(href);
    }
});

// Build a custom walker for more complex operations
public class HtmlToMarkdownConverter
{
    private readonly StringBuilder _output = new();
    
    public string Convert(DocumentNode document)
    {
        _output.Clear();
        foreach (var child in document.Children)
            VisitNode(child);
        return _output.ToString();
    }
    
    private void VisitNode(Node node)
    {
        switch (node)
        {
            case ElementNode element:
                ConvertElement(element);
                break;
            case TextNode text:
                _output.Append(text.Content);
                break;
        }
    }
    
    private void ConvertElement(ElementNode element)
    {
        switch (element.TagName.ToLower())
        {
            case "h1":
                _output.Append("# ");
                VisitChildren(element);
                _output.AppendLine();
                break;
            case "h2":
                _output.Append("## ");
                VisitChildren(element);
                _output.AppendLine();
                break;
            case "p":
                VisitChildren(element);
                _output.AppendLine("\n");
                break;
            case "strong" or "b":
                _output.Append("**");
                VisitChildren(element);
                _output.Append("**");
                break;
            case "em" or "i":
                _output.Append("*");
                VisitChildren(element);
                _output.Append("*");
                break;
            case "a":
                _output.Append('[');
                VisitChildren(element);
                _output.Append("](");
                _output.Append(element.Attributes.GetValueOrDefault("href", ""));
                _output.Append(')');
                break;
            default:
                VisitChildren(element);
                break;
        }
    }
    
    private void VisitChildren(ContainerNode parent)
    {
        foreach (var child in parent.Children)
            VisitNode(child);
    }
}

// Convert HTML to Markdown
var htmlDoc = HtmlParser.Parse("<h1>Title</h1><p>Text with <strong>bold</strong>.</p>");
var converter = new HtmlToMarkdownConverter();
var markdown = converter.Convert(htmlDoc);
// Result: "# Title\nText with **bold**.\n\n"

How I Use This on This Website

Since chtml compiles templates to static C# methods rather than instantiating template classes at runtime, the same approach applies to content. Articles are written in Markdown and compiled to C# code at build time. No Markdown is parsed at runtime. The build step walks the AST and generates a C# method that writes the rendered HTML directly to an output stream.

// At build time: compile markdown to C# code
public class MarkdownCompiler
{
    public string CompileToCode(MarkdownDocumentNode document)
    {
        var sb = new StringBuilder();
        sb.AppendLine("public static async ValueTask RenderAsync(...)");
        sb.AppendLine("{");
        
        var walker = new CodeGeneratingWalker(sb);
        walker.Walk(document);
        
        sb.AppendLine("}");
        return sb.ToString();
    }
}

public class CodeGeneratingWalker : MarkdownAstWalker
{
    private readonly StringBuilder _code;
    
    public CodeGeneratingWalker(StringBuilder code) => _code = code;
    
    protected override void VisitHeading(HeadingNode node)
    {
        _code.AppendLine($"writer.Write(\"<h{node.Level}>\");");
        WalkChildren(node);
        _code.AppendLine($"writer.Write(\"</h{node.Level}>\");");
    }
    
    protected override void VisitText(MarkdownTextNode node)
    {
        var escaped = EscapeForCode(node.Content);
        _code.AppendLine($"writer.Write(\"{escaped}\");");
    }
}

The generated output is a normal C# method. At runtime, rendering an article is just a method call with no parsing, no allocation, no string building overhead. The AST approach made this straightforward to implement. You get full control over output by walking the tree yourself.
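
For a concrete sense of what "just a method call" means, here is roughly what the emitted code might look like for a two-block document. The method name and writer parameter are hypothetical, not the actual generated signature:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Hypothetical output for "# Hello\n\nHi there." — straight-line writes,
// no parsing, no intermediate strings at runtime.
static async ValueTask RenderAsync(TextWriter writer)
{
    await writer.WriteAsync("<h1>Hello</h1>");
    await writer.WriteAsync("<p>Hi there.</p>");
}

var output = new StringWriter();
await RenderAsync(output);
Console.WriteLine(output.ToString()); // <h1>Hello</h1><p>Hi there.</p>
```

All escaping and tag construction happened at build time; the runtime cost is a sequence of writes against the output stream.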

Other Real-World Use Cases

Static Site Generator

var articles = Directory.GetFiles("content/", "*.md")
    .Select(file => {
        using var stream = File.OpenRead(file);
        return (Slug: Path.GetFileNameWithoutExtension(file),
                Document: ExtendedMarkdownParser.Parse(stream));
    })
    .ToList();

// Extract metadata for index page
var articleList = articles.Select(a => new {
    Title = a.Document.FrontMatterBlock?.ParsedData["title"]?.ToString(),
    Date = a.Document.FrontMatterBlock?.ParsedData["date"],
    a.Slug
}).ToList();

// Render each article
foreach (var (slug, doc) in articles)
{
    var renderer = new MarkdownHtmlRenderer();
    var html = renderer.Render(doc);
    var layout = ApplyLayout(html, doc.FrontMatterBlock);
    File.WriteAllText($"output/{slug}.html", layout);
}

Documentation Validator

public class DocumentationValidator : MarkdownAstWalker
{
    public List<string> Errors { get; } = new();
    private HashSet<string> _headingIds = new();
    
    protected override void VisitHeading(HeadingNode node)
    {
        var id = GenerateId(GetTextContent(node));
        if (_headingIds.Contains(id))
            Errors.Add($"Duplicate heading ID: {id}");
        else
            _headingIds.Add(id);
        
        base.VisitHeading(node);
    }
    
    protected override void VisitLink(LinkNode node)
    {
        if (node.Url.StartsWith("#"))
        {
            var fragment = node.Url.Substring(1);
            if (!_headingIds.Contains(fragment))
                Errors.Add($"Broken internal link: {node.Url}");
        }
        
        base.VisitLink(node);
    }
    
    protected override void VisitCodeBlock(CodeBlockNode node)
    {
        if (node.IsFenced && string.IsNullOrEmpty(node.Info))
            Errors.Add("Code block missing language specification");
        
        base.VisitCodeBlock(node);
    }
}

Multi-Format Publishing

// Parse once, render to multiple formats
var document = MarkdownParser.Parse(sourceMarkdown);

var htmlRenderer = new MarkdownHtmlRenderer();
File.WriteAllText("output.html", htmlRenderer.Render(document));

var textRenderer = new MarkdownToPlainTextRenderer();
File.WriteAllText("output.txt", textRenderer.Render(document));

var jsonRenderer = new MarkdownToJsonRenderer();
File.WriteAllText("search-index.json", jsonRenderer.Render(document));

Component-Based Content

The fenced div syntax (:::) lets you embed components while keeping content readable in plain Markdown:

# Article Title

Regular markdown content here.

:::C:Callout 
This is a custom callout component with markdown inside!
:::

More content...

:::C:CodeExample 
public void Example() 
:::
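
Recognizing the marker line is a small amount of string work. The `:::C:` prefix and name extraction below are assumptions inferred from the example above, not Femur's actual grammar:

```csharp
using System;

// Pull a component name out of a fence line like ":::C:Callout".
// Returns null for a plain closing fence or an empty name.
static string? ParseComponentName(string line)
{
    var trimmed = line.Trim();
    const string prefix = ":::C:";   // assumed marker format
    if (!trimmed.StartsWith(prefix, StringComparison.Ordinal))
        return null;
    var name = trimmed.Substring(prefix.Length).Trim();
    return name.Length > 0 ? name : null;
}

Console.WriteLine(ParseComponentName(":::C:Callout "));   // Callout
Console.WriteLine(ParseComponentName(":::") ?? "(none)"); // (none)
```

A parser would then collect the lines up to the closing `:::` and parse them as nested Markdown, attaching the result as a custom node.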

What Femur Parsing Is NOT

It's worth being clear about what these libraries are not designed for.

Not for Object Serialization

If you need to serialize/deserialize strongly-typed objects, use:

  • System.Text.Json for JSON
  • XmlSerializer for XML
  • YamlDotNet for YAML

Femur parsers build ASTs for content manipulation, not data binding.

Not for Real-Time Editing

Femur parsers are single-parse and immutable by design. If you need incremental parsing, syntax highlighting in real-time, or IDE-style code completion, look at tree-sitter or Roslyn (for C#).

Not for Web Scraping

If you're scraping HTML from the web, you want:

  • HtmlAgilityPack (robust HTML parsing with XPath)
  • AngleSharp (full HTML5 parsing with CSS selectors)

Femur's HTML parser is designed for trusted markup (your own templates), not hostile HTML from arbitrary websites.

Not a Full CMS

Femur parsing is a library, not a framework. It doesn't include database integration, authentication, an admin UI, or content workflows. It's a building block for content systems, not a replacement for one.

Performance Characteristics

Memory Efficiency

  • Streaming: 4KB chunks, handles large files without loading everything into memory
  • Pooling: Uses ArrayPool<char> for buffers
  • Lazy initialization: Node children and attributes only allocated if needed
  • Single pass: No separate tokenization phase

Trade-offs

AST approach (Femur default):

  • ✅ Can transform/analyze the AST multiple times
  • ✅ Navigation in any direction (parent/child/sibling)
  • ✅ Easier to implement complex transformations
  • ❌ Allocates node objects

Streaming approach (possible with custom renderers):

  • ✅ Zero AST allocations
  • ✅ Can use ReadOnlySpan<char> for zero-copy
  • ✅ Slightly faster for simple renders
  • ❌ Can't walk the AST multiple times
  • ❌ Harder to implement transformations

For most use cases the AST approach is the right choice. Streaming-only is worth considering for high-throughput scenarios like rendering thousands of Markdown files in a batch job.
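
The zero-copy point in the streaming column looks like this in practice: scanning with ReadOnlySpan<char> slices instead of substrings, so no intermediate strings are allocated. A minimal sketch of the technique, not tied to Femur's API:

```csharp
using System;

// Count heading lines without allocating a single substring:
// every "line" is a slice of the original buffer.
ReadOnlySpan<char> source = "# One\ntext\n## Two\n";
int headings = 0;

while (!source.IsEmpty)
{
    int newline = source.IndexOf('\n');
    ReadOnlySpan<char> line = newline >= 0 ? source[..newline] : source;
    if (!line.IsEmpty && line[0] == '#')
        headings++;
    // Advance past the newline, or finish if there wasn't one.
    source = newline >= 0 ? source[(newline + 1)..] : default;
}

Console.WriteLine(headings); // 2
```

The same slicing style extends to a full streaming renderer, at the cost described above: once the span is consumed, there is no tree to revisit.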

Extensibility

Custom Parsers

Extend StreamParser<TDocument>:

public class WikitextParser : StreamParser<WikitextDocument>
{
    protected override WikitextDocument CreateDocument() => new WikitextDocument();
    
    protected override void ProcessCharacter(char ch, WikitextDocument document)
    {
        // Implement wikitext parsing logic
    }
}

Custom Renderers

Extend MarkdownAstWalker or create your own walker:

public class CustomRenderer : MarkdownAstWalker
{
    protected override void VisitHeading(HeadingNode node)
    {
        // Custom heading rendering
    }
}

Custom Node Types

public class CalloutNode : MarkdownContainerNode
{
    public string CalloutType { get; set; }  // "note", "warning", "tip"
    public string? Title { get; set; }
}

// In parser
var calloutNode = new CalloutNode 
{ 
    NodeType = NodeType.Custom("Callout"),
    CalloutType = "warning"
};

Parser Composition

Compose parsers via inheritance:

public class GithubFlavoredMarkdownParser : ExtendedMarkdownParser
{
    protected override void InitializeParsing(MarkdownDocumentNode document)
    {
        base.InitializeParsing(document);
        // Add GitHub-specific features (tables, task lists, etc.)
    }
}

Getting Started

The Femur parsing libraries are available as NuGet packages:

dotnet add package Femur.Parsing
dotnet add package Femur.Markdown.Parser
dotnet add package Femur.Markdown.Extended.Parser
dotnet add package Femur.Html.Parser

Simple example:

using Femur.Markdown.Parser;
using Femur.Markdown.Renderer;

var markdown = @"
# Hello World

This is a **bold** statement with a [link](https://example.com).

    public void Example() { }
";

var document = MarkdownParser.Parse(markdown);

var renderer = new MarkdownHtmlRenderer();
var html = renderer.Render(document);

Console.WriteLine(html);

Wrapping Up

The source code is available on GitHub, and I'm actively using these libraries in production for this website. The core idea is simple: parse content into a tree you can walk, and make the walker extensible enough that you can produce whatever output you need. That turned out to be the right abstraction for compile-time rendering, and it generalizes well to other content-heavy scenarios.

If you're building a static site generator, documentation system, content migration tool, or anything else that needs to read and transform documents, these libraries give you a foundation to work from without pulling in a heavy dependency.
