Creating a Markdown File Analyzer
I am always looking for ways to use PowerShell. The other day, I found the seeds for a new project in my Inbox. As you know, I use a platform from Buttondown.com to manage my newsletter. I've written in the past about building PowerShell tools to manage previously published content using their API. I also use the API to help me in writing each issue. So I was very excited and curious when I received an email from Buttondown about a new command-line tool. I could use this to pull all content from my newsletter archive as well as publish new content.
The tool itself isn't important here. The valuable part is that I was able to download all of my past newsletters in Markdown format to my computer. I wasn't looking for it, but this makes it easy to back up my content. That, too, isn't the focus of this article. What I found interesting was that each Markdown file included a metadata section at the beginning of the file.
---
id: 379f1f33-af6c-4483-967f-c7b0d955fa51
subject: WPF PowerShell Applications
status: sent
email_type: premium
slug: wpf-powershell-applications
publish_date: 2024-03-22T17:16:45.726699Z
---
I immediately recognized that I could parse this metadata from each file and turn it into something useful. You, too, may have text files you want to analyze or parse. Let's look at some ways to approach this problem using PowerShell.
Results First
The first step is to consider the desired result. When I analyze each file I want to extract specific bits of information. Don't write the bits as strings. Instead, think about an object. I want to create an object based on each file that contains the following properties:
- Title
- Published
- Category
- Link
- Path
- Status
I can get most of these values from the metadata. The other values I can construct. The Path property will be the path to the Markdown file. The Link property will be the URL to the published content. I can build this from the slug property in the metadata.
$web = 'https://buttondown.com/behind-the-powershell-pipeline/archive/'
$slug = 'wpf-powershell-applications'
$link = $web + $slug
I manually created a sample object to validate my plan.
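$sample = [PSCustomObject]@{
    Title     = 'WPF PowerShell Applications'
    Published = '3/22/2024 1:16:45 PM'
    Category  = 'Premium'
    Link      = $link
    #the path to the Markdown file this object describes
    Path      = 'd:\buttondown\emails\wpf-powershell-applications.md'
    Status    = 'sent'
}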

Now that I know where I want to go, I can figure out how to get there. The property values are in the metadata header. I need to find the best way to extract that data from the file using PowerShell.
Using Regular Expressions
The first idea that immediately came to mind was to use regular expressions. I could use a regex pattern to match the metadata section and extract the values. The pattern would look for lines that start with a key followed by a colon and then the value.
[regex]$rxTitle = '(?<=subject:\s).*'
[regex]$rxPublished = '(?<=publish_date:\s).*'
[regex]$rxCategory = '(?<=email_type:\s).*'
[regex]$rxSlug = '(?<=slug:\s).*'
[regex]$rxStatus = '(?<=status:\s).*'
Each pattern uses a positive lookbehind assertion, e.g. (?<=subject:\s), to match the value after the key. The match value is the rest of the line that follows the key, the colon, and a single whitespace character. I can then use these patterns to extract the values from the file content.
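To see the lookbehind in isolation, here is a minimal sketch that runs one pattern against an inline string instead of a file. The $sampleHeader variable is a hypothetical stand-in I made up for illustration, and the Trim() call is an extra precaution that is not part of the original patterns.

```powershell
# $sampleHeader is a hypothetical stand-in for file content, using LF line endings
$sampleHeader = "---`nsubject: WPF PowerShell Applications`nstatus: sent`n---"

[regex]$rxTitle = '(?<=subject:\s).*'
$m = $rxTitle.Match($sampleHeader)

# Success indicates whether the key was found at all
$m.Success

# Value holds the rest of the line; the key itself is not part of the match
$m.Value

# In .NET the dot stops at a line feed but not at a carriage return,
# so trimming guards against files saved with Windows line endings
$title = $m.Value.Trim()
```

Checking the Success property before using the value is a cheap way to spot a file that is missing a key.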
#the base URL for the newsletter archive
$web = 'https://buttondown.com/behind-the-powershell-pipeline/archive/'
#the path to the Markdown file
$Path = 'd:\buttondown\emails\wpf-powershell-applications.md'
#get the content of the file as a single string
$content = Get-Content -Path $Path -Raw
#get the regex matches for each property
$title = $rxTitle.Match($content).Value
$published = $rxPublished.Match($content).Value
$category = $rxCategory.Match($content).Value
$slug = $rxSlug.Match($content).Value
$link = $web + $slug
$status = $rxStatus.Match($content).Value
Let's pause for a moment to mention a few key points. First, I'm reading the file contents as a single string using the -Raw parameter. If I didn't do this, then $content would be an array of strings, which would have seriously complicated the regex matching. Second, regular expressions assume that the data is consistent and predictable, at least if you want to avoid adding more complexity than regex already introduces.
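If you want to see the difference that -Raw makes, this short sketch writes a throwaway file and reads it both ways. The file name and its contents are assumptions made up for the demonstration.

```powershell
# Write a small sample file to the temp folder (hypothetical name and content)
$tmp = Join-Path ([System.IO.Path]::GetTempPath()) 'raw-demo.md'
Set-Content -Path $tmp -Value '---', 'subject: Demo Issue', 'status: sent', '---'

# Without -Raw, Get-Content emits one string per line
$lines = Get-Content -Path $tmp
$lines.Count          # 4
$lines.GetType().Name # Object[]

# With -Raw, the whole file comes back as a single string
$raw = Get-Content -Path $tmp -Raw
$raw.GetType().Name   # String

# Clean up the sample file
Remove-Item -Path $tmp
```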
By the way, I could have used Select-String with the regular expression patterns.
PS C:\> Select-String -InputObject $content -Pattern $rxTitle |
    Select-Object -ExpandProperty matches
Groups : {0}
Success : True
Name : 0
Captures : {0}
Index : 54
Length : 27
Value : WPF PowerShell Applications
ValueSpan :
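If all you need is the text, you can reach into the match object for its Value property. This is a small sketch that uses an inline string as a hypothetical stand-in for the file content.

```powershell
# Hypothetical stand-in for the content of a Markdown file
$content = "---`nsubject: WPF PowerShell Applications`nslug: wpf-powershell-applications`n---"
[regex]$rxTitle = '(?<=subject:\s).*'

# Select-String returns a MatchInfo object; Matches holds the regex match objects
$result = Select-String -InputObject $content -Pattern $rxTitle

# Take the value of the first match
$title = $result.Matches[0].Value
$title
```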
All that remains is to create the object with the extracted values.
$object = [PSCustomObject]@{
    Title     = $title
    Published = $published -as [datetime]
    Category  = $category
    Link      = $link
    Path      = $Path
    Status    = $status
}
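The date conversion deserves a closer look. Here is a minimal sketch of what the -as operator does with the publish_date string from the sample header. The 'not-a-date' value is made up to show the failure case.

```powershell
# The publish_date value from the sample header
$published = '2024-03-22T17:16:45.726699Z'

# -as returns $null instead of throwing if the conversion fails
$date = $published -as [datetime]

# The trailing Z marks the string as UTC; the converted value is in local time
$date.Kind

# Converting back recovers the original UTC hour
$date.ToUniversalTime().Hour

# A value that is not a date simply becomes $null
$bad = 'not-a-date' -as [datetime]
$null -eq $bad
```

This is why the sample object shows an afternoon time in my time zone even though the metadata records 17:16 UTC.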

If I can do this for one file, I can do it for all of them. I can use Get-ChildItem to get all the Markdown files in the directory and then loop through each file to extract the metadata.
$files = Get-ChildItem D:\buttondown\emails\*.md
$r = foreach ($file in $files) {
    $content = Get-Content -Path $file.FullName -Raw
    $title = $rxTitle.Match($content).Value
    $published = $rxPublished.Match($content).Value
    $category = $rxCategory.Match($content).Value
    $slug = $rxSlug.Match($content).Value
    $link = $web + $slug
    $status = $rxStatus.Match($content).Value
    [PSCustomObject]@{
        Title     = $title
        Published = $published -as [datetime]
        Category  = $category
        Link      = $link
        Path      = $file.FullName
        Status    = $status
    }
}
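Because the results are objects, the usual pipeline cmdlets apply. The sketch below uses two hand-built objects as hypothetical stand-ins for the loop output so that it runs on its own. The second title and the 'public' category are made-up values for illustration.

```powershell
# Hypothetical stand-ins for the objects the loop produces
$r = @(
    [PSCustomObject]@{
        Title     = 'WPF PowerShell Applications'
        Published = [datetime]'2024-03-22'
        Category  = 'premium'
        Status    = 'sent'
    }
    [PSCustomObject]@{
        Title     = 'Sample Free Issue'
        Published = [datetime]'2024-01-05'
        Category  = 'public'
        Status    = 'sent'
    }
)

# Newest issues first
$sorted = $r | Sort-Object -Property Published -Descending

# Count issues per category
$byCategory = $r | Group-Object -Property Category -NoElement

# Only premium issues published in 2024
$premium = $r | Where-Object { $_.Category -eq 'premium' -and $_.Published.Year -eq 2024 }
```

Sorting and filtering on Published only works this cleanly because the value is a datetime object and not a string.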