PowerShell Regex Groupies
Learn how to use regular expression named captures in PowerShell to extract data from log files and create custom objects for structured data storage.
My friend Gladys reached out the other day for help with a regular expression problem. She told me she was trying to use capture groups with data from a log file using PowerShell. I assumed she was using a regex pattern to define named captures. This is a technique where you can assign a name to the matching text in a capture group. This makes it easier to reference the captured text later in your script. The more information I learned from her, the more I realized the term she was using didn't mean what she thought it did.
But her problem is not unique, and I thought it offered a terrific "learning opportunity." Ultimately, named captures will work for her, but she had some additional challenges as well. Let's work through the problem.
Named Captures
I will use this string as an example.
$t = "198274-banana_foo:monkey"
You might have a log file with lines of structured data that you want to parse. When you are working with regular expressions in PowerShell, I recommend thinking about creating object output rather than focusing merely on matching text.
In my silly example, I want to break the string into several parts:
- Number (198274)
- Fruit (banana)
- Animal (monkey)
Here is a regular expression pattern that will match the string and capture the parts I want.
[regex]$rx = "(?<number>\d+)-(?<fruit>\w+)_\w+:(?<animal>\w+)"
The named capture is the part in parentheses. You can define a name for each pattern. The first pattern I am searching for is 1 or more digits (\d+). Before this pattern I define a name using the syntax ?<number>.
The regex pattern then looks for a hyphen followed by a named capture group for the fruit. The pattern \w+ matches one or more word characters. After that, the pattern looks for the underscore followed by a string of word characters and the colon. The pattern ends with a named capture for the animal.
Here's how it looks in PowerShell.
PS C:\> $t -match $rx
True
PS C:\> $matches
Name Value
---- -----
number 198274
animal monkey
fruit banana
0 198274-banana_foo:monkey
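Once -match returns True, each named capture is a key in the automatic $Matches hashtable, so individual values can be read by name. Here is a minimal sketch using the same sample string. The variable names $fruit and $number are only for illustration.

```powershell
$t = "198274-banana_foo:monkey"
[regex]$rx = "(?<number>\d+)-(?<fruit>\w+)_\w+:(?<animal>\w+)"

if ($t -match $rx) {
    # $Matches is a hashtable; named captures are keyed by their names
    $fruit = $Matches['fruit']
    # dot notation works too; captured values are strings,
    # so cast when you need a number
    $number = [int]$Matches.number
    "$fruit / $number"
}
```

Keep in mind that $Matches is only refreshed by a successful match, so test the result of -match before reading from it.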
Because I defined my regular expression capture as a regex object, I can use the object's methods and properties. This will be much easier when I get to scripting.
PS C:\> $rx.match($t) | tee -Variable m
Groups : {0, number, fruit, animal}
Success : True
Name : 0
Captures : {0}
Index : 0
Length : 24
Value : 198274-banana_foo:monkey
ValueSpan :
PS C:\> $m.groups | Format-Table
Groups Success Name Captures Index Length Value
------ ------- ---- -------- ----- ------ -----
{0, number, fruit, animal} True 0 {0} 0 24 198274-banana_f…
True number {number} 0 6 198274
True fruit {fruit} 7 6 banana
True animal {animal} 18 6 monkey
PS C:\> $m.groups["fruit"]
Success : True
Name : fruit
Captures : {fruit}
Index : 7
Length : 6
Value : banana
ValueSpan :
PS C:\> $m.groups["fruit"].value
banana
To turn this output into an object, I can work with each named group. Each match object has a Groups property that contains the named captures. I can reference each named capture by name, although I need to skip the group with a name of 0.
PS C:\> $m.groups.where({$_.name -ne 0}) |
>> ForEach-Object -Begin {
>> $h = [ordered]@{}
>> } -Process {
>> $h.Add($_.Name,$_.value)
>> } -End {
>> New-Object -TypeName PSObject -Property $h
>> }
number fruit animal
------ ----- ------
198274 banana monkey
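If you find yourself repeating this hashtable-to-object step, it could be wrapped in a small helper. The following is only a sketch: Convert-MatchToObject is a hypothetical function name, not a built-in command, and it uses a [pscustomobject] cast in place of New-Object.

```powershell
# Convert-MatchToObject is a hypothetical helper name used for illustration
function Convert-MatchToObject {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [System.Text.RegularExpressions.Match]$Match
    )
    process {
        # nothing to convert if the pattern did not match
        if (-not $Match.Success) { return }
        $h = [ordered]@{}
        foreach ($group in $Match.Groups) {
            # skip group 0, which is the entire match
            if ($group.Name -ne '0') {
                $h.Add($group.Name, $group.Value)
            }
        }
        [pscustomobject]$h
    }
}

[regex]$rx = "(?<number>\d+)-(?<fruit>\w+)_\w+:(?<animal>\w+)"
$obj = $rx.Match("198274-banana_foo:monkey") | Convert-MatchToObject
$obj
```

Because the hashtable is ordered, the properties appear in the same order as the named captures in the pattern.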
Here's a test file.
#fruityanimal.txt
#this is a test file
198274-banana_foo:monkey
124533-cherry_bar:dog
308452-pineapple_acx:elephant
501283-apricot_xyz:bluejay
452456-apple_pqr:eagle
#eof
And here's some PowerShell code that will read the file, parse out the data I want, and create an object for each line. This is a one-line solution.
Get-Content .\fruityanimal.txt |
Where {$_ -match $rx} |
Foreach-Object {
$($rx.match($_)).groups.where({$_.name -ne 0}) |
ForEach-Object -Begin {
#create a temporary hashtable for each line
$h = [ordered]@{}
} -Process {
$h.Add($_.Name,$_.Value)
} -End {
#turn the hashtable into an object
New-Object -TypeName PSObject -Property $h
}
}
This code gives me the following output.
number fruit animal
------ ----- ------
198274 banana monkey
124533 cherry dog
308452 pineapple elephant
501283 apricot bluejay
452456 apple eagle
These are custom objects that I can use any way I want. Yes, I parsed text, but my result is a rich object that I can use in my scripts.
The one-liner might be difficult to follow. In a script, it might make sense to break it down into smaller steps. Here's a script that does the same thing.
[regex]$rx = "(?<number>\d+)-(?<fruit>\w+)_\w+:(?<animal>\w+)"
$FilePath = ".\fruityanimal.txt"
$Text = Get-Content $FilePath | Where {$_ -match $rx}
#don't use $Matches as a variable name
$rxMatches = $rx.matches($Text)
$result = for ($i= 0;$i -lt $rxMatches.count;$i++) {
Write-Host "Processing match $i" -foregroundColor green
$Groups = $rxMatches[$i].groups.where({$_.name -ne 0})
$h = [ordered]@{}
foreach ($group in $Groups) {
Write-Host "Processing property: $($Group.Name)" -foregroundColor yellow
$h.Add($group.Name,$group.Value)
}
New-Object -TypeName PSObject -Property $h
}
$result
MultiLine Captures
Now that we have an idea of how regex named captures work, let's turn to Gladys' problem. She has a log file with multiple lines of data. Each group is separated by a varying number of blank lines.
Here's a text file with sample data.
12345
This is a line
So is this
We end here
What about here?
54321
Kittens are cute!
Oh, what a tangled web we weave
Alas, poor Yorick
Monkeys are fun
98765
My first line is different
Here's something else
regular expressions are fun
roll with the changes
Her ultimate goal is to create CSV output:
12345,This is a line,So is this,We end here,What about here?
54321,Kittens are cute!,Oh, what a tangled web we weave,Alas, poor Yorick,Monkeys are fun
98765,My first line is different,Here's something else,regular expressions are fun,roll with the changes
Her challenge is to match a cluster of lines. Each capture group begins with a number. Named captures will work here, but with a little extra effort. In order for this process to work, the file should be read as a single string. This is because the regex pattern will match multiple lines.
$text = Get-Content .\logdata.txt -Raw
Without -Raw, Get-Content reads the file line by line, which means $text is an array of strings. With -Raw, Get-Content reads the file as a single string.
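The difference is easy to see with a throwaway file. This sketch writes a few CRLF-terminated lines to a temporary path (generated here, not part of Gladys' data) and reads them back both ways.

```powershell
# create a small CRLF sample in a temporary file (path is generated for this demo)
$tmp = Join-Path ([System.IO.Path]::GetTempPath()) "rawdemo-$([guid]::NewGuid()).txt"
[System.IO.File]::WriteAllText($tmp, "12345`r`nThis is a line`r`nSo is this`r`n")

$lines = Get-Content -Path $tmp        # one string per line
$raw   = Get-Content -Path $tmp -Raw   # a single string, line breaks included

$lines.Count           # 3
$raw.GetType().Name    # String

# a pattern that spans a line break can only match the raw string
$crossLine = $raw -match "\d{5}\r\nThis"
$crossLine             # True

Remove-Item -Path $tmp
```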
Let's do a quick proof-of-concept on the first named capture.
[regex]$rx = "(?<head>\d{5})"
The capture works as expected.
PS C:\> $rx.matches($Text) | Format-Table
Groups Success Name Captures Index Length Value ValueSpan
------ ------- ---- -------- ----- ------ ----- ---------
{0, head} True 0 {0} 0 5 12345
{0, head} True 0 {0} 72 5 54321
{0, head} True 0 {0} 173 5 98765
Sometimes, it helps to write out what you want to match.
Match a line that begins with a number. Ignore the line return and capture the next line. Repeat this for each line to capture.
In a regular expression, you can use \r to indicate a carriage return. When faced with a complex pattern, start small and add to it.
[regex]$rx = "(?<head>\d{5}(?=\r))\r\n(?<line1>.*(?=\r))"
This pattern creates a named capture called head that matches exactly 5 digits (\d{5}), but only when a carriage return immediately follows. The (?=\r) is the lookahead pattern. Its match is not included in the value for head. The pattern continues with \r\n to indicate a carriage return and new line. I can then repeat the concept for the next named capture. The line1 capture will match any number of characters (.*) until it encounters a carriage return. Let's test it out.
PS C:\> $rx.matches($Text).groups.where({$_.name -ne 0}) | Select-Object Name,Value
Name Value
---- -----
head 12345
line1 This is a line
head 54321
line1 Kittens are cute!
head 98765
line1 My first line is different
This looks promising. I can now add the rest of the lines to the pattern.
[regex]$rx = "(?<head>\d{5}(?=\r))\r\n(?<line1>.*(?=\r))\r\n(?<line2>.*(?=\r))\r\n(?<line3>.*(?=\r))\r\n(?<line4>.*(?=\r))"
And again, verify.
PS C:\> $rx.matches($Text).groups | tee -variable g | Where Name -ne 0 | Format-Table -Property Name,Value -AutoSize
Name Value
---- -----
head 12345
line1 This is a line
line2 So is this
line3 We end here
line4 What about here?
head 54321
line1 Kittens are cute!
line2 Oh, what a tangled web we weave
line3 Alas, poor Yorick
line4 Monkeys are fun
head 98765
line1 My first line is different
line2 Here's something else
line3 regular expressions are fun
line4 roll with the changes
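The (?=\r) lookahead is what keeps the carriage return out of each captured value. A small sketch makes the difference visible by comparing a group that consumes the \r with one that only looks ahead for it. The variable names $consume and $peek are just for this illustration.

```powershell
$sample = "12345`r`nThis is a line`r`n"

# consuming \r inside the group makes it part of the captured value
[regex]$consume = "(?<head>\d{5}\r)"
# the lookahead only asserts that \r follows; it is not captured
[regex]$peek = "(?<head>\d{5}(?=\r))"

$consumedLength = $consume.Match($sample).Groups['head'].Length
$peekedLength   = $peek.Match($sample).Groups['head'].Length

$consumedLength   # 6, the five digits plus the carriage return
$peekedLength     # 5, digits only
```

A stray carriage return in a property value is easy to miss on screen but can cause trouble later in comparisons or exported data, which is why the lookahead is worth the extra characters.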
Creating Objects
I want to create objects from this data. Again, develop with a small piece of the data and then expand.
$k = $g | where name -eq 0 | Select-Object -first 1
Each group will have a set of nested groups.
PS C:\> $k.groups | where name -ne 0 | Format-Table
Success Name Captures Index Length Value ValueSpan
------- ---- -------- ----- ------ ----- ---------
True head {head} 0 5 12345
True line1 {line1} 7 14 This is a line
True line2 {line2} 23 10 So is this
True line3 {line3} 35 11 We end here
True line4 {line4} 48 16 What about here?
I'm going to create a hashtable and add each named capture. The hashtable key will be the capture name, and the value will be the capture value.
$h = [ordered]@{}
$k.groups | where name -ne 0 | ForEach-Object {$h.Add($_.Name,$_.Value)}
$r = New-Object -TypeName PSObject -Property $h
I now have a custom object in $r.
head : 12345
line1 : This is a line
line2 : So is this
line3 : We end here
line4 : What about here?
Having an object gives me options. If Gladys wants a CSV, this is easy.
PS C:\> $r | ConvertTo-csv -NoHeader
"12345","This is a line","So is this","We end here","What about here?"
With this code, I can scale it up to process the entire file.
$Groups = $rx.matches($Text).groups | Where-Object name -EQ 0
$out = Foreach ($Group in $Groups) {
$Named = $group.groups.where({ $_.name -ne 0 })
$Named | ForEach-Object -Begin {
$h = [ordered]@{}
} -Process {
#Write-Host "Processing $($_.Name)" -ForegroundColor yellow
$h.Add($_.Name, $_.Value)
} -End {
New-Object -TypeName PSObject -Property $h
}
} #foreach group
Here's my output.
PS C:\> $out
head : 12345
line1 : This is a line
line2 : So is this
line3 : We end here
line4 : What about here?
head : 54321
line1 : Kittens are cute!
line2 : Oh, what a tangled web we weave
line3 : Alas, poor Yorick
line4 : Monkeys are fun
head : 98765
line1 : My first line is different
line2 : Here's something else
line3 : regular expressions are fun
line4 : roll with the changes
PS C:\> $out | Get-Member -MemberType Properties
TypeName: System.Management.Automation.PSCustomObject
Name MemberType Definition
---- ---------- ----------
head NoteProperty string head=12345
line1 NoteProperty string line1=This is a line
line2 NoteProperty string line2=So is this
line3 NoteProperty string line3=We end here
line4 NoteProperty string line4=What about here?
I put everything together in a script file.
#requires -version 5.1
#ConvertLog.ps1
Param(
[Parameter(Position = 0, Mandatory)]
[ValidateNotNullOrEmpty()]
[ValidateScript({ Test-Path $_ })]
[string]$FilePath,
[Parameter(HelpMessage = 'Specify a custom type name for the output object')]
[alias('Type')]
[string]$CustomTypeName
)
#this script will only work with files that match this pattern
#you might want to make the pattern a parameter
[regex]$rx = '(?<head>\d{5}(?=\r))\r\n(?<line1>.*(?=\r))\r\n(?<line2>.*(?=\r))\r\n(?<line3>.*(?=\r))\r\n(?<line4>.*(?=\r))'
Write-Verbose "Getting raw content from $FilePath"
$text = Get-Content -Path $FilePath -Raw
$Groups = $rx.matches($Text).groups | Where-Object name -EQ 0
Write-Verbose "Processing $($Groups.count) groups"
#initialize a group counter
$i = 1
$out = Foreach ($Group in $Groups) {
Write-Verbose "Group $i"
$Named = $group.groups.where({ $_.name -ne 0 })
$Named | ForEach-Object -Begin {
Write-Verbose "Initializing hash table for group $i"
if ($CustomTypeName) {
$h = [ordered]@{PSTypeName = $CustomTypeName }
}
else {
$h = [ordered]@{ }
}
} -Process {
Write-Verbose "...Adding property $($_.Name) to group $i"
$h.Add($_.Name, $_.Value)
} -End {
Write-Verbose "Creating object for group $i"
New-Object -TypeName PSObject -Property $h
}
#increment the group counter
$i++
} #foreach group
$out
This script is designed for a specific file layout. I have hard-coded the regex pattern. You might make it more flexible by making the pattern a parameter.
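One way to loosen the hard-coded pattern, short of exposing the whole regex as a parameter, would be to build it from the number of lines expected after the header. This is only a sketch: New-LogPattern and its parameter names are hypothetical, and it still assumes CRLF line endings like the original pattern.

```powershell
# New-LogPattern is a hypothetical helper; parameter names are illustrative
function New-LogPattern {
    param(
        [int]$LineCount = 4,
        [int]$HeadDigits = 5
    )
    # start with the header capture, then add one capture per line
    $parts = @("(?<head>\d{$HeadDigits}(?=\r))")
    for ($n = 1; $n -le $LineCount; $n++) {
        $parts += "(?<line$n>.*(?=\r))"
    }
    # join the pieces with the regex escape sequence for CRLF
    [regex]($parts -join '\r\n')
}

$rx = New-LogPattern -LineCount 2
$m = $rx.Match("12345`r`nfirst`r`nsecond`r`n")
$m.Groups['line2'].Value   # second
```

With the default values, the function produces the same pattern string that is hard-coded in the script.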
I added Verbose output to help visualize what the script is doing.
PS C:\> $r = .\ConvertLog.ps1 .\logdata.txt -Type PSGladys -Verbose
VERBOSE: Getting raw content from .\logdata.txt
VERBOSE: Processing 3 groups
VERBOSE: Group 1
VERBOSE: Initializing hash table for group 1
VERBOSE: ...Adding property head to group 1
VERBOSE: ...Adding property line1 to group 1
VERBOSE: ...Adding property line2 to group 1
VERBOSE: ...Adding property line3 to group 1
VERBOSE: ...Adding property line4 to group 1
VERBOSE: Creating object for group 1
VERBOSE: Group 2
VERBOSE: Initializing hash table for group 2
VERBOSE: ...Adding property head to group 2
VERBOSE: ...Adding property line1 to group 2
VERBOSE: ...Adding property line2 to group 2
VERBOSE: ...Adding property line3 to group 2
VERBOSE: ...Adding property line4 to group 2
VERBOSE: Creating object for group 2
VERBOSE: Group 3
VERBOSE: Initializing hash table for group 3
VERBOSE: ...Adding property head to group 3
VERBOSE: ...Adding property line1 to group 3
VERBOSE: ...Adding property line2 to group 3
VERBOSE: ...Adding property line3 to group 3
VERBOSE: ...Adding property line4 to group 3
VERBOSE: Creating object for group 3
The output is a collection of custom objects.
PS C:\> $r
head : 12345
line1 : This is a line
line2 : So is this
line3 : We end here
line4 : What about here?
head : 54321
line1 : Kittens are cute!
line2 : Oh, what a tangled web we weave
line3 : Alas, poor Yorick
line4 : Monkeys are fun
head : 98765
line1 : My first line is different
line2 : Here's something else
line3 : regular expressions are fun
line4 : roll with the changes
If Gladys wants a CSV:
PS C:\> $r | ConvertTo-Csv -NoHeader
"12345","This is a line","So is this","We end here","What about here?"
"54321","Kittens are cute!","Oh, what a tangled web we weave","Alas, poor Yorick","Monkeys are fun"
"98765","My first line is different","Here's something else","regular expressions are fun","roll with the changes"
Use Export-Csv to create a CSV file. Creating object output and not text is the key!
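As a quick sketch of the file-based version, the following exports two stand-in objects to a temporary CSV file and imports them again. The sample objects and the file name are made up for the demo. Note that one value contains a comma, which is exactly why the quoting matters.

```powershell
# stand-in objects shaped like the script output (shortened for the demo)
$r = @(
    [pscustomobject]@{ head = '12345'; line1 = 'This is a line'; line2 = 'So is this' }
    [pscustomobject]@{ head = '54321'; line1 = 'Kittens are cute!'; line2 = 'Oh, what a tangled web we weave' }
)

# temporary output path chosen for this demo
$csvPath = Join-Path ([System.IO.Path]::GetTempPath()) "gladys-demo-$([guid]::NewGuid()).csv"
$r | Export-Csv -Path $csvPath -NoTypeInformation

# the embedded comma survives the round trip because the field is quoted
$back = Import-Csv -Path $csvPath
$back[1].line2   # Oh, what a tangled web we weave

Remove-Item -Path $csvPath
```

If the quotes are unwanted, PowerShell 7 and later add a -UseQuotes parameter to the CSV cmdlets. The AsNeeded value limits quoting to fields that require it, such as those with embedded commas.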
I added a parameter to the script to allow for a custom object type name. This is useful if you want to use a custom formatting file to control the output.
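A full formatting file is beyond this article, but here is a sketch of a lighter-weight alternative that also depends on the type name: Update-TypeData can set a default display property set for the type. The choice of head and line1 as default properties is only an assumption for the demo.

```powershell
# an object tagged with the custom type name used in the article
$obj = [pscustomobject]@{
    PSTypeName = 'PSGladys'
    head       = '12345'
    line1      = 'This is a line'
    line2      = 'So is this'
}

$obj.PSObject.TypeNames[0]   # PSGladys

# show only selected properties by default for this type (session-scoped)
Update-TypeData -TypeName PSGladys -DefaultDisplayPropertySet head, line1 -Force

# default display now uses the property set; the other properties still exist
$obj
$obj | Select-Object -Property *
```

The type data only lasts for the current session, so a script or module would need to apply it each time it is loaded.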
A Non-Regex Solution
In Gladys' situation, there is also a non-regex option. She can describe what she wants to do in plain English.
Beginning with a line that starts with a number, get the next four lines, making each line a property of an object.
We can use a generic list to hold the file data.
$List = [System.Collections.Generic.List[string]]::new()
We don't need the blank lines, so we can filter them out and add each line to the list.
Get-Content .\logdata.txt | where {$_ -match "^\w"} |
foreach { $List.Add($_) }
The list has all the file content except for blank lines. Now I can use a simple For loop to process the data.
$r = for ($i = 0;$i -lt $List.Count;$i+=5) {
#get the line that matches 5 digits
if ( $List[$i] -match "\d{5}") {
# Write-Host "Starting a new object at $i" -ForegroundColor Cyan
$h = [ordered]@{}
$h.Add("head",$List[$i])
$h.Add("line1",$List[$i+1])
$h.Add("line2",$List[$i+2])
$h.Add("line3",$List[$i+3])
$h.Add("line4",$List[$i+4])
New-Object -TypeName PSObject -Property $h
}
}

I'll leave it to you to build a PowerShell function or script based on this code.
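If you want a starting point, here is one possible sketch. ConvertFrom-LogLine is a hypothetical function name, and the code assumes the same layout as the sample data: a 5-digit header line followed by exactly four lines of text.

```powershell
# ConvertFrom-LogLine is a hypothetical name; it accepts lines from the pipeline
function ConvertFrom-LogLine {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [AllowEmptyString()]
        [string]$Line
    )
    begin {
        $List = [System.Collections.Generic.List[string]]::new()
    }
    process {
        # keep only lines that start with a word character, dropping blank lines
        if ($Line -match '^\w') { $List.Add($Line) }
    }
    end {
        for ($i = 0; $i -lt $List.Count; $i += 5) {
            # only start an object on a line that is exactly 5 digits
            if ($List[$i] -match '^\d{5}$') {
                [pscustomobject][ordered]@{
                    head  = $List[$i]
                    line1 = $List[$i + 1]
                    line2 = $List[$i + 2]
                    line3 = $List[$i + 3]
                    line4 = $List[$i + 4]
                }
            }
        }
    }
}

# inline sample standing in for Get-Content .\logdata.txt
$sample = '12345', 'This is a line', 'So is this', 'We end here', 'What about here?', '', '',
          '54321', 'Kittens are cute!', 'Oh, what a tangled web we weave', 'Alas, poor Yorick', 'Monkeys are fun'
$objs = $sample | ConvertFrom-LogLine
$objs
```

In practice you would pipe Get-Content to the function. A production version would also want to handle a file whose last group is incomplete.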
Summary
When you face a challenge like this, try to verbalize what you want to accomplish. Writing it down is even better. If you can't articulate it, it will be very difficult to code it. I also recommend that whatever solution you develop should write objects to the pipeline. Avoid returning text.
If you found any of this confusing or mysterious, don't be shy. You probably aren't the only one. Please don't hesitate to leave a comment or question.