Friday, November 9, 2007

PowerShell syntax highlighting with HTML

When I decided to start this blog, I thought it would be nice to display PowerShell code examples with proper formatting and syntax highlighting. I tried a few freely available tools that advertised PowerShell syntax support, but they all fell short in a category or two. None of them correctly handled multi-line strings or here-strings, and none of them correctly highlighted PowerShell variables enclosed in curly braces, e.g. "${this is a variable}".
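
For anyone who has not run into them, here is a small sketch of the two constructs that gave those tools trouble. The variable name and the text are made up for illustration.

```powershell
# a variable name enclosed in curly braces may contain spaces
${this is a variable} = 42

# a double-quoted here-string spans multiple lines and still expands variables;
# the closing "@ must be at the very start of its own line
$hereString = @"
The value is ${this is a variable}.
"@

$hereString
```

A highlighter has to treat everything between the here-string delimiters as one string, and everything between `${` and `}` as one variable.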

I thought it would be fun to try to write my own syntax highlighting tool with PowerShell. It was a little more difficult than I originally thought it would be, but it really was fun.

The script takes a string parameter that can be either a code snippet or a path to a PowerShell script file. A switch parameter, -LineNumbers, adds line numbers to the output. The script highlights strings, comments, operators, numbers, keywords (including keyword-like words such as the scope modifiers global, script, local, and private), types (specifically the shortcut types available in PowerShell, like [string] and [regex]), variables, and cmdlet names. The colors used for each of these items, along with the background color, default foreground color, and line number color, can be customized by changing the values of the variables declared at the top of the script.

Here is the script (highlighted with itself):

# Highlight-Syntax.ps1
# version 1.0
# by Jeff Hillman
#
# this script uses regular expressions to highlight PowerShell
# syntax with HTML.

param( [string] $code, [switch] $LineNumbers )

if ( Test-Path $code -ErrorAction SilentlyContinue )
{
    $code = Get-Content $code | Out-String
}

$backgroundColor = "#DDDDDD"
$foregroundColor = "#000000"
$stringColor     = "#800000"
$commentColor    = "#008000"
$operatorColor   = "#C86400"
$numberColor     = "#800000"
$keywordColor    = "#C86400"
$typeColor       = "#404040"
$variableColor   = "#000080"
$cmdletColor     = "#C86400"
$lineNumberColor = "#404040"

filter Html-Encode( [switch] $Regex )
{
    # some regular expressions operate on strings that have already
    # been through this filter, so the patterns need to be updated
    # to look for the encoded characters instead of the literal ones.
    # we do it with this filter instead of directly in the regular 
    # expression so the expressions can be a bit more readable (ha!)

    $_ = $_ -replace "&", "&amp;"
    
    if ( $Regex )
    {
        $_ = $_ -replace "(?<!\(\?)<", "&lt;"
        $_ = $_ -replace "(?<!\(\?)>", "&gt;"
    }
    else
    {
        $_ = $_ -replace "\t", "    "
        $_ = $_ -replace " ", "&nbsp;"
        $_ = $_ -replace "<", "&lt;"
        $_ = $_ -replace ">", "&gt;"
    }
    
    $_
}

# regular expressions

$operatorRegex =  @"
((?x:
 (?# assignment operators)
 =|\+=|-=|\*=|/=|%=|
 (?# arithmetic operators)
 (?<!\de)
 (\+|-|\*|/|%)(?![a-z])|
 (?# unary operators)
 \+\+|\-\-|
 (?# logical operators)
 (-and|-or|-not)\b|!|
 (?# bitwise operators)
 (-band|-bor)\b|
 (?# redirection and pipeline operators)
 2>>|>>|2>&1|1>&2|2>|>|<|\||
 (?# comparison operators)
 (
  -[ci]? (?# case and case-insensitive variants)
  (eq|ne|ge|gt|lt|le|like|notlike|match|notmatch|replace|contains|notcontains)\b
 )|
 (?# type operators)
 (-is|-isnot|-as)\b|
 (?# range and miscellaneous operators)
 \.\.|(?<!\d)\.(?!\d)|&|::|:|,|``|
 (?# string formatting operator)
 -f\b
))
"@ | Html-Encode -Regex

$numberRegex = @"
((?x:
 (
  (?# hexadecimal numbers)
  (\b0x[0-9a-f]+)|
  (?# regular numbers)
  (?<!&)
  ((\b[0-9]+(\.(?!\.))?[0-9]*)|((?<!\.)\.[0-9]+))
  (?!(>>|>&[12]|>))
  (?# scientific notation)
  (e(\+|-)?[0-9]+)?
 )
 (
  (?# type specifiers)
  (l|ul|u|f|ll|ull)?
  (?# size shorthand)
  (b|kb|mb|gb)?
  \b
 )?
))
"@ | Html-Encode -Regex

$keyWordRegex = @"
((?x:
 \b(
 (?# don't match anything that looks like a variable or a parameter)
 (?<![-$])
 (
  (?# condition keywords)
  if|else|elseif|(?<!\[)switch(?!\])|
  (?# loop keywords)
  for|(?<!\|</span>&nbsp;)foreach(?!-object)|in|do|while|until|default|break|continue|
  (?# scope keywords)
  global|script|local|private|
  (?# block keywords)
  begin|process|end|
  (?# other keywords)
  function|filter|param|throw|trap|return
 )
 )\b
))
"@

$typeRegex = @"
((?x:
 \[
 (
  (?# primitive types and arrays of those types)
  ((int|long|string|char|bool|byte|double|decimal|float|single)(\[\])?)|
  (?# other types)
  regex|array|xml|scriptblock|switch|hashtable|type|ref|psobject|wmi|wmisearcher|wmiclass
 )
 \]
))
"@

$cmdletNames = Get-Command -Type Cmdlet | Foreach-Object { $_.Name }

function Highlight-Other( [string] $code )
{
    $highlightedCode = $code | Html-Encode
    
    # operators
    $highlightedCode = $highlightedCode -replace 
        $operatorRegex, "<span style='color: $operatorColor'>`$1</span>"

    # numbers
    $highlightedCode = $highlightedCode -replace 
        $numberRegex, "<span style='color: $numberColor'>`$1</span>"

    # keywords
    $highlightedCode = $highlightedCode -replace 
        $keyWordRegex, "<span style='color: $keywordColor'>`$1</span>"

    # types
    $highlightedCode = $highlightedCode -replace 
        $typeRegex, "<span style='color: $typeColor'>`$1</span>"

    # Cmdlets
    $cmdletNames | Foreach-Object {
        $highlightedCode = $highlightedCode -replace 
            "\b($_)\b", "<span style='color: $cmdletColor'>`$1</span>"
    }

    $highlightedCode
}

$RegexOptions = [System.Text.RegularExpressions.RegexOptions]

$highlightedCode = ""

# we treat variables, strings, and comments differently because we don't 
# want anything inside them to be highlighted.  we combine the regular 
# expressions so they are mutually exclusive

$variableRegex = '(\$(\w+|{[^}`]*(`.[^}`]*)*}))'

$stringRegex = @"
(?x:
 (?# here strings)
 @[`"'](.|\n)*?^[`"']@|
 (?# double-quoted strings)
 `"[^`"``]*(``.[^`"``]*)*`"|
 (?# single-quoted strings)
 '[^'``]*(``.[^'``]*)*'
)
"@

$commentRegex = "#[^\r\n]*"

[regex]::Matches( $code, 
                  "(?<before>(.|\n)*?)" + 
                  "((?<variable>$variableRegex)|" + 
                  "(?<string>$stringRegex)|" + 
                  "(?<comment>$commentRegex))",
                  $RegexOptions::MultiLine ) | Foreach-Object {
    # highlight everything before the variable, string, or comment    
    $highlightedCode += Highlight-Other $_.Groups[ "before" ].Value

    if ( $_.Groups[ "variable" ].Value )
    {
        $highlightedCode += 
            "<span style='color: $variableColor'>" + 
            ( $_.Groups[ 'variable' ].Value | Html-Encode ) + 
            "</span>"
    }
    elseif ( $_.Groups[ "string" ].Value )
    {
        $string = $_.Groups[ 'string' ].Value | Html-Encode
        
        $string = "<span style='color: $stringColor'>$string</span>"

        # we have to highlight each piece of multi-line strings
        if ( $string -match "\r\n" )
        {
            # highlight any line continuation characters as operators
            $string = $string -replace 
                "(``)(?=\r\n)", "<span style='color: $operatorColor'>``</span>"

            $string = $string -replace 
                "\r\n", "</span>`r`n<span style='color: $stringColor'>"
        }

        $highlightedCode += $string
    }
    else
    {
        $highlightedCode += 
            "<span style='color: $commentColor'>" + 
            $( $_.Groups[ 'comment' ].Value | Html-Encode ) + 
            "</span>"
    }

    # we need to keep track of the last position of a variable, string, 
    # or comment, so we can highlight everything after it
    $lastMatch = $_
}

if ( $lastMatch )
{
    # highlight everything after the last variable, string, or comment   
    $highlightedCode += Highlight-Other $code.SubString( $lastMatch.Index + $lastMatch.Length )
}
else
{
    $highlightedCode = Highlight-Other $code
}

# add line breaks
$highlightedCode = 
    [regex]::Replace( $highlightedCode, '(?=\r\n)', '<br />', $RegexOptions::MultiLine )

# put the highlighted code in the pipeline
"<div style='width: 100%; " + 
            "/*height: 100%;*/ " +
            "overflow: auto; " +
            "font-family: Consolas, `"Courier New`", Courier, mono; " +
            "font-size: 12px; " +
            "background-color: $backgroundColor; " +
            "color: $foregroundColor; " + 
            "padding: 2px 2px 2px 2px; white-space: nowrap'>"

if ( $LineNumbers )
{
    $digitCount = 
        ( [regex]::Matches( $highlightedCode, "^", $RegexOptions::MultiLine ) ).Count.ToString().Length

    $highlightedCode = [regex]::Replace( $highlightedCode, "^", 
        "<li style='color: $lineNumberColor; padding-left: 5px'><span style='color: $foregroundColor'>",
        $RegexOptions::MultiLine )

    $highlightedCode = [regex]::Replace( $highlightedCode, "<br />", "</span><br />",
        $RegexOptions::MultiLine )
    
    "<ol start='1' style='border-left: " +
                         "solid 1px $lineNumberColor; " +
                         "margin-left: $( ( $digitCount * 10 ) + 15 )px; " +
                         "padding: 0px;'>"
}

$highlightedCode

if ( $LineNumbers )
{
    "</ol>"
}

"</div>"


As you might have guessed, most of the work with this script was getting the regular expressions right. I have always loved the support for regular expressions offered by the .NET Framework, and PowerShell makes them even easier to use. It turns out that I was able to reuse the expressions in a grammar file for my new favorite text editor, Intype. I like that my code examples look identical to what I see in my editor.

The script relies heavily on these regular expressions, which makes it more prone to edge-case problems, but it seems to do a pretty good job. With all of the matching and string processing, the script can also be fairly slow.
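
To make the central idea easier to see, here is a stripped-down sketch of how the version 1.0 script separates variables, strings, and comments from everything else. The three patterns are deliberately simplified stand-ins (no here-strings, no escapes, no braced variables), and the sample line of code is made up, so treat this as an illustration rather than a replacement for the real expressions.

```powershell
# simplified stand-ins for the variable, string, and comment expressions
$variable = '\$\w+'
$string   = "'[^']*'"
$comment  = '#[^\r\n]*'

# lazily capture whatever comes before the next variable, string, or comment;
# combining the alternatives in one expression makes them mutually exclusive
$pattern = "(?<before>(.|\n)*?)" +
           "((?<variable>$variable)|(?<string>$string)|(?<comment>$comment))"

$code = "`$x = 'a # b' # note"

$results = foreach ( $match in [regex]::Matches( $code, $pattern ) )
{
    foreach ( $name in 'variable', 'string', 'comment' )
    {
        if ( $match.Groups[ $name ].Success )
        {
            '{0}: {1}' -f $name, $match.Groups[ $name ].Value
        }
    }
}

$results
```

The `#` inside the quoted string is reported as part of the string, not as the start of a comment, because the string alternative matches first at that position. That is the reason for combining the expressions instead of running them one after another.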

Then along came the CTP for Windows PowerShell 2.0. One of the new classes available to developers is System.Management.Automation.PSParser, which can be used to tokenize PowerShell code. As you might imagine, a task like syntax highlighting becomes much easier.
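
As a quick illustration of what the tokenizer returns, here is a minimal sketch that tokenizes a single command line and lists each token. The sample command is made up; Tokenize and the token properties shown (Type, Content, Start, Length) belong to the PSParser and PSToken classes.

```powershell
$code = 'Get-ChildItem $home | Where-Object { $_.Length -gt 1kb }'

# the second argument receives any parse errors
$errors = $null
$tokens = [System.Management.Automation.PSParser]::Tokenize( $code, [ref] $errors )

# each token knows its type, its content, and where it sits in the source
$tokens | Format-Table Type, Content, Start, Length -AutoSize
```

Each token carries its own position, so a highlighter only has to walk the list and wrap each piece of the source in the color chosen for its type.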

Below is an equivalent highlighting script that uses the System.Management.Automation.PSParser class. It is used in the same way as the PowerShell version 1.0 script.

#requires -version 2.0

# Highlight-Syntax.ps1
# version 2.0
# by Jeff Hillman
#
# this script uses the System.Management.Automation.PsParser class
# to highlight PowerShell syntax with HTML.

param( [string] $code, [switch] $LineNumbers )

if ( Test-Path $code -ErrorAction SilentlyContinue )
{
    $code = Get-Content $code | Out-String
}

$backgroundColor = "#DDDDDD"
$foregroundColor = "#000000"
$lineNumberColor = "#404040"

$PSTokenType = [System.Management.Automation.PSTokenType]

$colorHash = @{ 
#    $PSTokenType::Unknown            = $foregroundColor; 
    $PSTokenType::Command            = "#C86400";
#    $PSTokenType::CommandParameter   = $foregroundColor;
#    $PSTokenType::CommandArgument    = $foregroundColor;
    $PSTokenType::Number             = "#800000";
    $PSTokenType::String             = "#800000";
    $PSTokenType::Variable           = "#000080";
#    $PSTokenType::Member             = $foregroundColor;
#    $PSTokenType::LoopLabel          = $foregroundColor;
#    $PSTokenType::Attribute          = $foregroundColor;
    $PSTokenType::Type               = "#404040";
    $PSTokenType::Operator           = "#C86400";
#    $PSTokenType::GroupStart         = $foregroundColor;
#    $PSTokenType::GroupEnd           = $foregroundColor;
    $PSTokenType::Keyword            = "#C86400";
    $PSTokenType::Comment            = "#008000";
    $PSTokenType::StatementSeparator = "#C86400";
#    $PSTokenType::NewLine            = $foregroundColor;
    $PSTokenType::LineContinuation   = "#C86400";
#    $PSTokenType::Position           = $foregroundColor;
    
}

filter Html-Encode
{
    $_ = $_ -replace "&", "&amp;"
    $_ = $_ -replace " ", "&nbsp;"
    $_ = $_ -replace "<", "&lt;"
    $_ = $_ -replace ">", "&gt;"

    $_
}

# replace the tabs with spaces
$code = $code -replace "\t", ( " " * 4 )

if ( $LineNumbers )
{
    $highlightedCode = "<li style='color: $lineNumberColor; padding-left: 5px'>"
}
else
{
    $highlightedCode = ""
}

$parser = [System.Management.Automation.PsParser]
$lastColumn = 1
$lineCount = 1

foreach ( $token in $parser::Tokenize( $code, [ref] $null ) | Sort-Object StartLine, StartColumn )
{
    # get the color based on the type of the token
    $color = $colorHash[ $token.Type ]
    
    if ( $color -eq $null ) 
    { 
        $color = $foregroundColor
    }

    # add whitespace
    if ( $lastColumn -lt $token.StartColumn )
    {
        $highlightedCode += ( "&nbsp;" * ( $token.StartColumn - $lastColumn ) )
    }

    switch ( $token.Type )
    {
        $PSTokenType::String {
            $string = "<span style='color: {0}'>{1}</span>" -f $color, 
                ( $code.SubString( $token.Start, $token.Length ) | Html-Encode )

            # we have to highlight each piece of multi-line strings
            if ( $string -match "\r\n" )
            {
                # highlight any line continuation characters as operators
                $string = $string -replace "(``)(?=\r\n)", 
                    ( "<span style='color: {0}'>``</span>" -f $colorHash[ $PSTokenType::Operator ] )

                $stringHtml = "</span><br />`r`n"
                
                if ( $LineNumbers )
                {
                     $stringHtml += "<li style='color: $lineNumberColor; padding-left: 5px'>"
                }

                $stringHtml += "<span style='color: $color'>"

                $string = $string -replace "\r\n", $stringHtml
            }

            $highlightedCode += $string
            break
        }

        $PSTokenType::NewLine {
            $highlightedCode += "<br />`r`n"
            
            if ( $LineNumbers )
            {
                $highlightedCode += "<li style='color: $lineNumberColor; padding-left: 5px'>"
            }
            
            $lastColumn = 1
            ++$lineCount
            break
        }

        default {
            if ( $token.Type -eq $PSTokenType::LineContinuation )
            {
                $lastColumn = 1
                ++$lineCount
            }

            $highlightedCode += "<span style='color: {0}'>{1}</span>" -f $color, 
                ( $code.SubString( $token.Start, $token.Length ) | Html-Encode )
        }
    }

    $lastColumn = $token.EndColumn
}

# put the highlighted code in the pipeline
"<div style='width: 100%; " + 
            "/*height: 100%;*/ " +
            "overflow: auto; " +
            "font-family: Consolas, `"Courier New`", Courier, mono; " +
            "font-size: 12px; " +
            "background-color: $backgroundColor; " +
            "color: $foregroundColor; " + 
            "padding: 2px 2px 2px 2px; white-space: nowrap'>"

if ( $LineNumbers )
{
    $digitCount =  $lineCount.ToString().Length

    "<ol start='1' style='border-left: " +
                         "solid 1px $lineNumberColor; " +
                         "margin-left: $( ( $digitCount * 10 ) + 15 )px; " +
                         "padding: 0px;'>"
}

$highlightedCode

if ( $LineNumbers )
{
    "</ol>"
}

"</div>"


Besides being much faster, the PSParser technique offers much more room for customization. This script highlights the same kinds of things as the 1.0 version, but other token types are available, including CommandParameter, CommandArgument (these two would be very difficult to define with a regular expression), and Member. All of the token types are listed in the script; those that I ignore are commented out.
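
To see those extra token types in action, here is a small sketch that lists every member of the PSTokenType enumeration and then tokenizes a command with a parameter and an argument. The command line and path are made up for illustration.

```powershell
$PSTokenType = [System.Management.Automation.PSTokenType]

# every token type the parser can report
[Enum]::GetNames( $PSTokenType )

# a command followed by a parameter and an argument
$tokens = [System.Management.Automation.PSParser]::Tokenize(
    'Get-ChildItem -Recurse C:\Temp', [ref] $null )

$tokens | Foreach-Object { '{0,-16} {1}' -f $_.Type, $_.Content }
```

Giving CommandParameter or CommandArgument its own color is then just a matter of adding an entry for it to $colorHash in the script above.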

As a bonus, here is a little script that highlights PowerShell console commands with HTML:

# Highlight-Commands.ps1
# by Jeff Hillman
#
# this script highlights PowerShell commands with HTML.

param( [string] $commands )

$backgroundColor = "#000000"
$foregroundColor = "#FFC400"

filter Html-Encode( [switch] $Regex )
{
    $_ = $_ -replace "&", "&amp;"
    $_ = $_ -replace "\t", "    "
    $_ = $_ -replace " ", "&nbsp;"
    $_ = $_ -replace "<", "&lt;"
    $_ = $_ -replace ">", "&gt;"
    
    $_
}

# encode the commands for HTML
$highlightedCommands = $commands | Html-Encode

# make each line bold and add line breaks
$highlightedCommands = [regex]::Replace( $highlightedCommands, "^", 
    "<span style='font-weight: bold;'>",
    [System.Text.RegularExpressions.RegexOptions]::MultiLine )

$highlightedCommands = [regex]::Replace( $highlightedCommands, "(?=\r\n)", "</span><br />",
    [System.Text.RegularExpressions.RegexOptions]::MultiLine )


# put the highlighted commands in the pipeline
"<div style='width: 100%; " + 
            "/*height: 100%;*/ " +
            "overflow: auto; " +
            "font-family: `"Courier New`", Courier, mono; " +
            "font-size: 12px; " +
            "background-color: $backgroundColor; " +
            "color: $foregroundColor; " + 
            "padding: 2px 2px 2px 2px; white-space: nowrap'>"

$highlightedCommands

"</div>"

C:\Users\hillman\Documents\WindowsPowerShell\Utilities

PSH$ ls


    Directory: Microsoft.PowerShell.Core\FileSystem::C:\Users\hillman\Documents\WindowsPowerShell\Utilities


Mode                LastWriteTime     Length Name
----                -------------     ------ ----
-a---         07-Nov-07   4:05 PM      38117 Compile-Help.ps1
-a---         10-Nov-07   2:53 PM       8047 Highlight-1.0Syntax.ps1
-a---         10-Nov-07   3:09 PM       5182 Highlight-2.0Syntax.ps1
-a---         10-Nov-07   3:27 PM       1296 Highlight-Commands.ps1
-a---         09-Nov-07   2:49 PM      14741 Utilities.ps1


Well, I hope these scripts come in handy for someone else out there.

1 comment:

Unknown said...

6 years later it came in handy ;-)

Thanks.

I updated it a bit to cater for indention, let me know if you want an updated version.