Creating a Markdown Tooling System
In the last newsletter, I shared my experiences in creating a PowerShell tool to process Markdown files retrieved using the Buttondown CLI. These are files representing each newsletter I have published since 2022. Each file has a Markdown metadata header.
---
id: 379f1f33-af6c-4483-967f-c7b0d955fa51
subject: WPF PowerShell Applications
status: sent
email_type: premium
slug: wpf-powershell-applications
publish_date: 2024-03-22T17:16:45.726699Z
---
My goal is to parse this data and create a PowerShell object representation of the metadata. My initial approach to this task was to use regular expressions. Maybe you can apply what I demonstrated last time to your work.
But there is almost always a different way to achieve a goal in PowerShell. Today, I want to share another approach. This is a technique I often use when parsing or working with text files. I'm going to use a generic collection for each file and extract the metadata from there.
Using a Generic Collection
I'll initialize a generic collection designed to hold strings and define the path to a sample file.
$list = [System.Collections.Generic.List[string]]::new()
$Path = 'd:\buttondown\emails\wpf-powershell-applications.md'
It is possible to combine the following steps, but I'll keep them separate for clarity. First, I will read the file's contents.
$content = Get-Content -Path $Path
Pay close attention here. When using regular expressions, I included the -Raw parameter to read the file as a single string. However, when using a generic collection, I do not use this parameter. The Get-Content cmdlet reads the file line by line and returns an array of strings. I can then add this array to the generic collection.
$list.AddRange([string[]]$content)
I know the metadata is bounded by --- lines. If I can find the list index of the opening and closing lines, I can parse the lines in between.
$i = $list.IndexOf('---', 0)
I will use the IndexOf method to find the index of the first occurrence of the --- line. The second parameter is optional; it is the index where I want to begin searching. Since the metadata header is the first thing in the file, $i will be 0. To find the closing marker, I can use the same method and start searching from the line that follows.
$j = $list.IndexOf('---', $i + 1)
Now I have the start and end indices of the metadata header. I can extract the lines in between.
PS C:\> $list[$i..$j]
---
id: 379f1f33-af6c-4483-967f-c7b0d955fa51
subject: WPF PowerShell Applications
status: sent
email_type: premium
slug: wpf-powershell-applications
publish_date: 2024-03-22T17:16:45.726699Z
---
Or to be more precise:
PS C:\> $list[($i+1)..($j - 1)]
id: 379f1f33-af6c-4483-967f-c7b0d955fa51
subject: WPF PowerShell Applications
status: sent
email_type: premium
slug: wpf-powershell-applications
publish_date: 2024-03-22T17:16:45.726699Z
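One thing to keep in mind: IndexOf returns -1 when it finds no match, and a range such as 1..0 counts backward instead of returning nothing. Here is a minimal sketch of a guard, using hypothetical sample lines in place of a real file. The variable names and the use of GetRange() are illustrative assumptions, not part of the code above.

```powershell
# Hypothetical sample lines standing in for a file's content
$sample = [System.Collections.Generic.List[string]]::new()
$sample.AddRange([string[]]@('---', 'id: 123', 'subject: Demo', '---', 'Body text'))

$start = $sample.IndexOf('---', 0)
# Only look for the closing marker if the opening marker was found
$end = if ($start -ge 0) { $sample.IndexOf('---', $start + 1) } else { -1 }

if ($start -ge 0 -and $end -gt $start + 1) {
    # GetRange(index, count) returns the lines between the two markers
    $header = $sample.GetRange($start + 1, $end - $start - 1)
}
else {
    $header = @()
    Write-Warning 'No metadata header found'
}
$header
```

With these sample lines, $header holds the two lines between the markers. If either marker is missing, or the markers are adjacent, the sketch writes a warning and leaves $header empty.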
The IndexOf method uses a simple comparison: the line must equal the search string exactly. For more complex scenarios, you can use the FindIndex method, which allows you to specify a predicate to match against the items in the collection. In PowerShell, you can write the predicate as a script block.
$i = $list.FindIndex(0, { $args[0] -eq '---' }) + 1
$j = $list.FindIndex($i, { $args[0] -eq '---' }) - 1
The first parameter is the index number to begin the search. The predicate is the script block. $args[0] represents an unnamed parameter that evaluates to the contents of each line in the collection. Functionally, my syntax for FindIndex is doing the same thing as IndexOf because I am using the -eq operator to compare the line contents to the --- string. For more complicated searches, I might use the -like or -match operators or a compound expression. Think of the predicate conceptually as a Where-Object filtering script block.
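As a minimal sketch of a more forgiving predicate, assume some files have trailing whitespace after the --- marker. An exact comparison would miss those lines, but a -match predicate can tolerate them. The sample lines and the $isMarker variable are hypothetical.

```powershell
# Hypothetical sample lines; note the trailing spaces after the markers
$sample = [System.Collections.Generic.List[string]]::new()
$sample.AddRange([string[]]@('--- ', 'id: 123', 'subject: Demo', '---  ', 'Body text'))

# A reusable predicate: three dashes followed only by optional whitespace
$isMarker = [System.Predicate[string]] { param($line) $line -match '^---\s*$' }

$first = $sample.FindIndex(0, $isMarker) + 1
$last = $sample.FindIndex($first, $isMarker) - 1

# An exact comparison finds nothing here because of the trailing space
$exact = $sample.IndexOf('---', 0)

$sample[$first..$last]
```

Storing the predicate in a variable is a design choice that lets the same test be used for both markers.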
My code also adjusts the index numbers. Adding 1 to the first result and subtracting 1 from the second means $i is the first line of the metadata and $j is the last. I don't have to deal with the --- markers. With this information, I can process each line and split it into key/value pairs on the colon.
$list[$i..$j] | ForEach-Object -Begin {
    # Initialize the hashtable
    $meta = @{}
} -Process {
    #split into two strings
    $split = $_ -split ':', 2
    $meta.Add($split[0].Trim(), $split[1].Trim())
} -End {
    $meta
}
Note that I am using the -split operator with a limit of 2. This ensures that I only get two strings for each split. Otherwise, the -split operator would split on every colon in the line, which would be a problem for the publish_date line. I also use the Trim() method to remove any leading or trailing whitespace from the key and value strings, just in case.
This leaves me with a hashtable that contains the metadata.
Name                           Value
----                           -----
publish_date                   2024-03-22T17:16:45.726699Z
id                             379f1f33-af6c-4483-967f-c7b0d955fa51
email_type                     premium
subject                        WPF PowerShell Applications
slug                           wpf-powershell-applications
status                         sent
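To see why the split limit matters, here is a short sketch that splits the publish_date line both ways. The variable names are for illustration only.

```powershell
$line = 'publish_date: 2024-03-22T17:16:45.726699Z'

# Without a limit, every colon is a split point, including those in the time
$unlimited = $line -split ':'

# With a limit of 2, only the first colon splits the line
$limited = $line -split ':', 2

"Unlimited pieces: $($unlimited.Count)"
"Limited pieces:   $($limited.Count)"
"Value: $($limited[1].Trim())"
```

The unlimited split breaks the timestamp apart, while the limited split keeps everything after the first colon together as the value.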
Looks pretty easy, right? Remember what I said last time about knowing your data and trusting that it will be consistent? It turns out that is not the case with these files. Here's a metadata header from another file:
---
id: 68dbe992-4dd1-41bd-81e5-a09a8de76066
subject: Toolmaking Toolmaking
status: sent
email_type: premium
slug: toolmaking-toolmaking
publish_date: 2024-03-07T18:13:13Z
attachments:
- 31079a9f-58ce-48ed-a7dc-9c66c1cf966e
---
My code will fail on this because the line that follows attachments: has no colon. The split returns a single string, so $split[1] is null and the call to Trim() throws an error. Fortunately, I can ignore attachment metadata. I will add a filtering check so that I only process lines that contain a colon. This prevents the code from failing on lines that do not conform to the expected format.
$list = [System.Collections.Generic.List[string]]::new()
$Path = 'd:\buttondown\emails\toolmaking-toolmaking.md'
$content = Get-Content -Path $Path
#add to the list
$list.AddRange([string[]]$content)
$i = $list.FindIndex(0, { $args[0] -eq '---' }) + 1
$j = $list.FindIndex($i, { $args[0] -eq '---' }) - 1
$list[$i..$j] | Where-Object { $_ -match ':' } | ForEach-Object -Begin {
    # Initialize the hashtable
    $meta = @{}
} -Process {
    #split into two strings
    $split = $_ -split ':', 2
    $meta.Add($split[0].Trim(), $split[1].Trim())
} -End {
    $meta
}
This provides the expected result.
Name                           Value
----                           -----
publish_date                   2024-03-07T18:13:13Z
id                             68dbe992-4dd1-41bd-81e5-a09a8de76066
email_type                     premium
subject                        Toolmaking Toolmaking
slug                           toolmaking-toolmaking
status                         sent
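One detail worth checking against your own files: a line such as attachments: contains a colon, so a plain -match ':' filter lets it through with an empty value. Here is a minimal sketch, using hypothetical sample lines, that compares the simple filter with a stricter pattern requiring something after the colon. The stricter pattern is an assumption about what a usable key/value line looks like, not a rule from Buttondown.

```powershell
# Hypothetical sample lines modeled on the header above
$lines = 'slug: toolmaking-toolmaking', 'attachments:', '- 31079a9f-58ce-48ed-a7dc-9c66c1cf966e'

# The simple filter keeps any line containing a colon
$loose = @($lines | Where-Object { $_ -match ':' })

# The stricter filter requires a key, a colon, and a non-empty value
$strict = @($lines | Where-Object { $_ -match '^\s*[\w-]+\s*:\s*\S' })

"Simple filter keeps:   $($loose -join ' | ')"
"Stricter filter keeps: $($strict -join ' | ')"
```

Whether an empty attachments entry matters depends on how the hashtable is used later. If you only read known keys such as subject and slug, the simple filter is enough.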
With the $meta hashtable, I can create my custom object using some of the logic I shared last time.
$web = 'https://buttondown.com/behind-the-powershell-pipeline/archive/'
$link = $web + $meta.slug
$obj = [PSCustomObject]@{
    Title     = $meta.subject
    Published = $meta.publish_date -as [datetime]
    Category  = $meta.email_type
    Link      = $link
    Path      = $Path
    Status    = $meta.status
}
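A note on the -as [datetime] conversion: the trailing Z marks the timestamp as UTC, and as far as I know the conversion returns the equivalent local time. If you would rather keep the UTC value, here is a minimal sketch of two options. The variable names are illustrative.

```powershell
$raw = '2024-03-22T17:16:45.726699Z'

# The conversion used above; the result is expressed in local time
$local = $raw -as [datetime]

# Option 1: convert the local value back to UTC
$utc = $local.ToUniversalTime()

# Option 2: parse as a DateTimeOffset, which keeps the offset with the value
$offset = [datetimeoffset]::Parse($raw, [cultureinfo]::InvariantCulture)

$local
$utc
$offset
```

Both options describe the same moment in time. Which one to use depends on whether the Published property should show the time on your machine or the time recorded in the file.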

Let's try the whole directory.
$Path = 'd:\buttondown\emails\*.md'
$list = [System.Collections.Generic.List[string]]::new()
$files = Get-ChildItem $Path
# $web is the archive URL defined earlier
$r = foreach ($file in $files) {
    #write-host $file.fullname
    $content = Get-Content -Path $file.FullName
    #add to the list
    $list.AddRange([string[]]$content)
    $i = $list.FindIndex(0, { $args[0] -eq '---' }) + 1
    # Use -eq for the closing marker too, so only a bare --- line matches
    $j = $list.FindIndex($i, { $args[0] -eq '---' }) - 1
    $list[$i..$j] | Where-Object { $_ -match ':' } |
        ForEach-Object -Begin {
            # Initialize the hashtable
            $meta = @{}
        } -Process {
            #split into two strings
            $split = $_ -split ':', 2
            $meta.Add($split[0].Trim(), $split[1].Trim())
        }
    [PSCustomObject]@{
        Title     = $meta.subject
        Published = $meta.publish_date -as [datetime]
        Category  = $meta.email_type
        Link      = $web + $meta.slug
        Path      = $file.FullName
        Status    = $meta.status
    }
    # Clear the list for the next file
    $list.Clear()
}
This took about a second to complete, which is a bit longer than the regular expression approach. The generic collection adds a slight overhead, but it gives me index-based access to each line and methods like IndexOf and FindIndex. I get the same results as before.
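If you want to experiment without a folder of newsletter files, here is a minimal sketch that wraps the parsing steps in a function and runs it against a temporary file. The function name Get-FrontMatter, the temporary file, and its contents are all hypothetical. The guards for missing markers are an addition of this sketch and are not part of the loop above.

```powershell
# Hypothetical helper; the name and the guard logic are illustrative assumptions
function Get-FrontMatter {
    param([Parameter(Mandatory)][string]$Path)

    # A fresh list per call, so nothing has to be cleared afterwards
    $lines = [System.Collections.Generic.List[string]]::new()
    $lines.AddRange([string[]]@(Get-Content -Path $Path))

    # FindIndex returns -1 when nothing matches, which leaves $i at 0
    $i = $lines.FindIndex(0, { $args[0] -eq '---' }) + 1
    if ($i -eq 0) { return @{} }
    $j = $lines.FindIndex($i, { $args[0] -eq '---' }) - 1
    if ($j -lt $i) { return @{} }

    $meta = @{}
    foreach ($line in $lines.GetRange($i, $j - $i + 1)) {
        if ($line -match ':') {
            $split = $line -split ':', 2
            $meta[$split[0].Trim()] = $split[1].Trim()
        }
    }
    $meta
}

# Try it against a throwaway file with made-up content
$tmp = Join-Path ([System.IO.Path]::GetTempPath()) 'frontmatter-demo.md'
Set-Content -Path $tmp -Value '---', 'subject: Demo Post', 'publish_date: 2024-03-07T18:13:13Z', 'attachments:', '- abc', '---', 'Body text'
$result = Get-FrontMatter -Path $tmp
Remove-Item -Path $tmp
$result
```

Creating the list inside the function avoids the need to call Clear() between files, at the cost of allocating a new list each time.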
Summary
What I showed you today gives me another way to generate the information and data I want to work with. My code writes a valuable object to the pipeline. However, when building a toolset, it is essential to consider ways to add value beyond the raw data. How might the user want to use this information? What can you add to make the process as effortless as possible? It is these little things that separate a good tool from a great one. We'll dive into this next time.