Using PowerShell + Word Interop to Extract Tables from PDFs

I needed to extract some tables from PDFs. If you’ve ever done this before you realise PDF format doesn’t really contain table information, it just contains lines and positions of text. Testing out Tabula http://tabula.technology/ which worked reasonably well, but didn’t handle tables split across pages.

So I used PowerShell and Word Interop with Microsoft Office 2016 to extract the tables into CSV files.

This script will extract every table into an individual CSV file named after the PDF. i.e. Document.pdf will produce Document1.csv, Document2.csv etc

Currently header names are not taken from the tables, and are instead labelled Column1, Column2, etc. This is due to the fact I was working with a lot of tables that didn’t have header names.

The script can be downloaded here:

https://onedrive.live.com/redir?resid=93A8E9D387076121!23690&authkey=!ADJ_3Kn6-tMBO4Q&ithint=file%2czip

# USAGE
# Extract-PdfTables -FileName <PDF FILE>
# Extracts a single CSV in same directory as PDF file
# For every table found in the PDF file
# PDFFile1.CSV, PDFFile2.CSV etc

Set-StrictMode -Version Latest

# function from http://stackoverflow.com/questions/5544844/how-to-call-a-complex-com-method-from-powershell
Function Invoke-NamedParameter {
    [CmdletBinding(DefaultParameterSetName = "Named")]
    param(
        [Parameter(ParameterSetName = "Named", Position = 0, Mandatory = $true)]
        [Parameter(ParameterSetName = "Positional", Position = 0, Mandatory = $true)]
        [ValidateNotNull()]
        [System.Object]$Object
        ,
        [Parameter(ParameterSetName = "Named", Position = 1, Mandatory = $true)]
        [Parameter(ParameterSetName = "Positional", Position = 1, Mandatory = $true)]
        [ValidateNotNullOrEmpty()]
        [String]$Method
        ,
        [Parameter(ParameterSetName = "Named", Position = 2, Mandatory = $true)]
        [ValidateNotNull()]
        [Hashtable]$Parameter
        ,
        [Parameter(ParameterSetName = "Positional")]
        [Object[]]$Argument
    )

    end {  ## Just being explicit that this does not support pipelines
        if ($PSCmdlet.ParameterSetName -eq "Named") {
            ## Invoke method with parameter names
            ## Note: It is ok to use a hashtable here because the keys (parameter names) and values (args)
            ## will be output in the same order.  We don't need to worry about the order so long as
            ## all parameters have names
            $Object.GetType().InvokeMember($Method, [System.Reflection.BindingFlags]::InvokeMethod,
                $null,  ## Binder
                $Object,  ## Target
                ([Object[]]($Parameter.Values)),  ## Args
                $null,  ## Modifiers
                $null,  ## Culture
                ([String[]]($Parameter.Keys))  ## NamedParameters
            )
        } else {
            ## Invoke method without parameter names
            $Object.GetType().InvokeMember($Method, [System.Reflection.BindingFlags]::InvokeMethod,
                $null,  ## Binder
                $Object,  ## Target
                $Argument,  ## Args
                $null,  ## Modifiers
                $null,  ## Culture
                $null  ## NamedParameters
            )
        }
    }
}


Function Extract-PdfTables
{
    param([string]$FileName)

    Add-Type -Assembly "Microsoft.Office.Interop.Word"

    $item = Get-ChildItem $filename
    $baseFile = [System.IO.Path]::Combine($item.DirectoryName,$item.BaseName)

    $word = New-Object -ComObject Word.Application 
    $word.Visible = $true 

    # check Word Version, required for PDF conversion capability
    if ($word -eq $null -or $word.Version -lt 16)
    {
        Throw "Requires Microsoft Word 2016 or later"
    }

    # disable convert PDF warning prompt
    Set-ItemProperty -Path HKCU:\SOFTWARE\Microsoft\Office\16.0\Word\Options -Name DisableConvertPdfWarning -Value 0x1
    
    # open PDF with Microsoft Word
    $doc = Invoke-NamedParameter $word.Documents "Open" @{ "FileName" = $filename ; "AddToRecentFiles" = $false ; "ConfirmConversions" = $false ; "ReadOnly" = $true ; "Visible"=$false } 

    $tableCount = 0
    $lastTable = $null
    ForEach ($table in $doc.Tables)
    {
        # tables split across pages in PDFs will show up as separate tables
        # we will try to auto-detect if table is continuing across page
        # in this case $newTable will be set to false
        $newTable = $true
        
        # check for continuing table
        if ($lastTable -ne $null)
        {
            # Get the range between previous table and current table
            $RangeBetweenTables = $lastTable.Range
            $RangeBetweenTables.Collapse([Microsoft.Office.Interop.Word.WdCollapseDirection]::wdCollapseEnd)
            $RangeStartOfNextTable = $table.Range
            $RangeStartOfNextTable.Collapse([Microsoft.Office.Interop.Word.WdCollapseDirection]::wdCollapseStart)
            $RangeBetweenTables.End = $RangeStartOfNextTable.Start -1

            # Count lines separating tables
            $lineCount = $RangeBetweenTables.ComputeStatistics([Microsoft.Office.Interop.Word.WdStatistic]::wdStatisticLines)
            if ($lastTable.Columns.Count -eq $table.Columns.Count -and $lineCount -eq 0)
            {
                $newTable = $false
            }
        }

        $lastTable = $table
    
        if ($newTable)
        {
            $rowOffset = 0
            $tableCoun
            "Table # $tableCount"
            $dataTable = New-Object System.Data.DataTable "Table$tableCount"

            # build the columns for our table - these will get default names Column1,Column2
            For ($i = 0; $i -le $table.Columns.Count; $i++ )
            {
                $dataTable.Columns.Add()
            }
        }
        else
        {
            $rowOffset = $dataTable.Rows.Count - 1
        }

        For ($i = 0; $i -le $table.Rows.Count; $i++)
        {
            [void]$dataTable.Rows.Add() 
        }


        ForEach ($row in $table.Rows)
        {
            ForEach ($cell in $row.Cells)
            {
                # remove some unprintable characters that may be at end of cell value
                $text = $cell.Range.Text.Trim()
                $text = $text.Replace([string][char]13,"")
                $text = $text.Replace([string][char]7,"")

                $dataTable.Rows[$rowOffset + $cell.RowIndex -1][$cell.ColumnIndex-1] = $text
            }
        }

        $dataTable | Export-Csv -NoTypeInformation -Path "$baseFile$tableCount.csv"
    }
    $doc.Close
    $word.Quit
}

About chentiangemalc

specializes in end-user computing technologies. disclaimer 1) use at your own risk. test any solution in your environment. if you do not understand the impact/consequences of what you're doing please stop, and ask advice from somebody who does. 2) views are my own at the time of posting and do not necessarily represent my current view or the view of my employer and family members/relatives. 3) over the years Microsoft/Citrix/VMWare have given me a few free shirts, pens, paper notebooks/etc. despite these gifts i will try to remain unbiased.
This entry was posted in Office, PowerShell and tagged . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s