Behind the PowerShell Pipeline

June 6, 2025

Creating a Markdown File Analyzer

I am always looking for ways to use PowerShell. The other day, I found the seeds for a new project in my inbox. As you know, I use a platform from Buttondown.com to manage my newsletter. I've written in the past about building PowerShell tools to manage previously published content using their API. I also use the API to help me write each issue. So I was very excited and curious when I received an email from Buttondown about a new command-line tool. I could use this to pull all content from my newsletter archive as well as publish new content.

The tool itself isn't important here. The valuable part is that I was able to download all of my past newsletters in Markdown format to my computer. I wasn't looking for it, but this makes it easy to back up my content. That, too, isn't the focus of this article. What I found interesting was that each Markdown file included a metadata section at the beginning of the file.

---
id: 379f1f33-af6c-4483-967f-c7b0d955fa51
subject: WPF PowerShell Applications
status: sent
email_type: premium
slug: wpf-powershell-applications
publish_date: 2024-03-22T17:16:45.726699Z
---

I immediately recognized that I could parse this metadata from each file and turn it into something useful. You, too, may have text files you want to analyze or parse. Let's look at some ways to approach this problem using PowerShell.

Results First

The first step is to consider the desired result. When I analyze each file, I want to extract specific bits of information. Don't think of these bits as loose strings. Instead, think about an object. I want to create an object based on each file that contains the following properties:

  • Title
  • Published
  • Category
  • Link
  • Path
  • Status

I can get most of these values from the metadata. The other values I can construct. The Path property will be the path to the Markdown file. The Link property will be the URL to the published content. I can build this from the slug property in the metadata:

$web = 'https://buttondown.com/behind-the-powershell-pipeline/archive/'
$slug = 'wpf-powershell-applications'
$link = $web + $slug

I manually created a sample object to validate my plan.

$sample = [PSCustomObject]@{
    Title     = 'WPF PowerShell Applications'
    Published = '3/22/2024 1:16:45 PM'
    Category  = 'Premium'
    Link      = $link
    Path      = $Path
    Status    = 'sent'
}
Figure 1: Prototype object

Now that I know where I want to go, I can figure out how to get there. The property values are in the metadata header. I need to find the best way to extract that data from the file using PowerShell.

Using Regular Expressions

The first idea that came to mind was to use regular expressions. I could use a regex pattern for each key in the metadata section and extract its value. Each pattern would look for a line that starts with a key followed by a colon and then the value.

[regex]$rxTitle = '(?<=subject:\s).*'
[regex]$rxPublished = '(?<=publish_date:\s).*'
[regex]$rxCategory = '(?<=email_type:\s).*'
[regex]$rxSlug = '(?<=slug:\s).*'
[regex]$rxStatus = '(?<=status:\s).*'

Each pattern uses a positive lookbehind assertion, e.g. (?<=subject:\s), to match the value after the key. Because the lookbehind includes a single \s, the match value is everything after the colon and the one whitespace character that follows it, up to the end of the line. I can then use these patterns to extract the values from the file content.
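To see the lookbehind in isolation, here is a minimal sketch that runs one of the patterns against a small here-string instead of a real file. The sample text is copied from the metadata shown earlier. The Trim() call is my own precaution, not something the original patterns require: in .NET regular expressions the dot matches a carriage return, so a file saved with Windows line endings could leave a stray character at the end of the value.

```powershell
# Sample front matter as one multi-line string (values copied from the metadata above)
$sample = @'
---
subject: WPF PowerShell Applications
status: sent
slug: wpf-powershell-applications
---
'@

[regex]$rxTitle = '(?<=subject:\s).*'
$m = $rxTitle.Match($sample)

# The lookbehind must be present before the match, but it is not part of the match,
# so the Value holds only the text after "subject: "
$m.Success

# Assumption: trim the value in case the text uses CRLF line endings
$title = $m.Value.Trim()
$title
```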

#the base URL for the newsletter archive
$web = 'https://buttondown.com/behind-the-powershell-pipeline/archive/'
#the path to the Markdown file
$Path  = 'd:\buttondown\emails\wpf-powershell-applications.md'
#get the content of the file as a single string
$content = Get-Content -Path $Path -Raw
#get the regex matches for each property
$title = $rxTitle.Match($content).Value
$published = $rxPublished.Match($content).Value
$category = $rxCategory.Match($content).Value
$slug = $rxSlug.Match($content).Value
$link = $web + $slug
$status = $rxStatus.Match($content).Value

Let's pause for a moment to mention a few key points. First, I'm reading the file contents as a single string using the -Raw parameter. If I didn't do this, then $content would be an array of strings, which would have seriously complicated the regex matching. Second, regular expressions assume that the data is consistent and predictable, at least if you want to avoid adding more complexity than regex already introduces.
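If the difference that -Raw makes isn't obvious, this short sketch may help. It writes a few metadata lines to a temporary file, which is a hypothetical stand-in for one of the downloaded Markdown files, and then reads it back both ways.

```powershell
# Hypothetical temp file standing in for a downloaded Markdown file
$tmp = Join-Path ([System.IO.Path]::GetTempPath()) 'raw-demo.md'
Set-Content -Path $tmp -Value '---', 'subject: WPF PowerShell Applications', 'status: sent', '---'

$lines = Get-Content -Path $tmp        # an array with one string per line
$raw   = Get-Content -Path $tmp -Raw   # the whole file as a single string

$lines.Count        # 4
$raw -is [string]   # True

# The regex object can be run directly against the single string
[regex]$rxStatus = '(?<=status:\s).*'
$status = $rxStatus.Match($raw).Value.Trim()
$status

Remove-Item -Path $tmp
```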

By the way, I could have used Select-String with the regular expression patterns.

PS C:\> Select-String -InputObject $content -Pattern $rxTitle |
Select-Object -ExpandProperty matches

Groups    : {0}
Success   : True
Name      : 0
Captures  : {0}
Index     : 54
Length    : 27
Value     : WPF PowerShell Applications
ValueSpan :
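If all I want from Select-String is the matched text, I can reach into the Matches property of the result. This is a small sketch using an inline sample string in place of the file content.

```powershell
# Inline sample in place of the file content read with Get-Content -Raw
$content = "subject: WPF PowerShell Applications`nstatus: sent"

# Select-String returns a MatchInfo object; its Matches property holds the regex matches
$result = Select-String -InputObject $content -Pattern '(?<=subject:\s).*'
$title = $result.Matches[0].Value
$title
```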

All that remains is to create the object with the extracted values.

$object = [PSCustomObject]@{
    Title     = $title
    Published = $published -as [datetime]
    Category  = $category
    Link      = $link
    Path      = $Path
    Status    = $status
}
Figure 2: Metadata object
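A note on the -as [datetime] conversion, shown here as a short sketch. The publish_date value ends in Z, which marks it as UTC, and the converted value is expressed in local time. That is why the sample object earlier shows 1:16:45 PM for a timestamp of 17:16:45Z. The -as operator also returns $null instead of throwing an error when a string can't be converted, which is handy if a file is missing the value.

```powershell
$published = '2024-03-22T17:16:45.726699Z'

# The converted value is in the local time zone of the computer running the code
$local = $published -as [datetime]
$local

# Convert back to UTC if you need to compare against the original timestamp
$utc = $local.ToUniversalTime()
$utc

# A failed conversion gives $null rather than an exception
$bad = 'not a date' -as [datetime]
$null -eq $bad
```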

If I can do this for one file, I can do it for all of them. I can use Get-ChildItem to get all the Markdown files in the directory and then loop through each file to extract the metadata.

$files = Get-ChildItem D:\buttondown\emails\*.md
$r = foreach ($file in $files) {
    $content = Get-Content -Path $file.FullName -Raw
    $title = $rxTitle.Match($content).Value
    $published = $rxPublished.Match($content).Value
    $category = $rxCategory.Match($content).Value
    $slug = $rxSlug.Match($content).Value
    $link = $web + $slug
    $status = $rxStatus.Match($content).Value

    [PSCustomObject]@{
        Title     = $title
        Published = $published -as [datetime]
        Category  = $category
        Link      = $link
        Path      = $file.FullName
        Status    = $status
    }
}
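One way to make the loop reusable is to wrap it in a function. What follows is only a sketch of that idea, not necessarily where this project ends up. The function name Get-NewsletterMetadata, the BaseUrl parameter, and the lookup table of key names are all my own inventions for illustration. The Trim() call is also an assumption, there to guard against Windows line endings.

```powershell
function Get-NewsletterMetadata {
    [CmdletBinding()]
    param(
        # Binds to the FullName property of piped file objects
        [Parameter(Mandatory, ValueFromPipelineByPropertyName)]
        [Alias('FullName')]
        [string]$Path,

        # Hypothetical parameter: the base URL for the newsletter archive
        [string]$BaseUrl = 'https://buttondown.com/behind-the-powershell-pipeline/archive/'
    )
    begin {
        # Map each metadata key in the file to the name used in this function
        $keys = @{
            Title     = 'subject'
            Published = 'publish_date'
            Category  = 'email_type'
            Slug      = 'slug'
            Status    = 'status'
        }
    }
    process {
        $content = Get-Content -Path $Path -Raw
        $values = @{}
        foreach ($name in $keys.Keys) {
            # Same lookbehind pattern as before, built from the key name
            $pattern = '(?<=' + $keys[$name] + ':\s).*'
            $values[$name] = [regex]::Match($content, $pattern).Value.Trim()
        }
        [PSCustomObject]@{
            Title     = $values.Title
            Published = $values.Published -as [datetime]
            Category  = $values.Category
            Link      = $BaseUrl + $values.Slug
            Path      = $Path
            Status    = $values.Status
        }
    }
}

# Example usage, assuming the same folder as above:
# Get-ChildItem D:\buttondown\emails\*.md | Get-NewsletterMetadata | Sort-Object Published
```

Because the function writes objects to the pipeline, the results can be sorted, grouped, or exported like any other PowerShell output.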