Python API

PDF

class RPA.PDF.PDF

PDF is a library for managing PDF documents.

It can be used to extract text from PDFs, add watermarks to pages, and decrypt/encrypt documents.

Merging and splitting PDFs is supported by the Add Files To PDF keyword. Read the keyword documentation for examples.

There is also limited support for updating form field values (see Set Field Value and Save Field Values for more information).

The input PDF file can be passed as an argument to the keywords, or it can be omitted if you first call Open PDF. A reference to the currently active PDF is stored in the library instance and can be changed by calling the Switch To PDF keyword with another PDF file path, so you can work with multiple PDFs in the same run by switching between them.
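
As a minimal sketch of the active-document behaviour described above, the helper below makes several files active in turn and reads a page count from each. The helper name `count_pages` and the file names are hypothetical; it relies only on `switch_to_pdf` and `get_number_of_pages` as documented on this page, and it takes the library instance as an argument so nothing is assumed about how that instance was created.

```python
def count_pages(pdf, paths):
    """Return {path: page count}, making each PDF active in turn.

    `pdf` is expected to be an RPA.PDF.PDF instance.
    """
    counts = {}
    for path in paths:
        # Opens the file if needed and makes it the active document.
        pdf.switch_to_pdf(path)
        # No path argument: operates on the currently active document.
        counts[path] = pdf.get_number_of_pages()
    return counts


# Usage (file names are placeholders):
#   from RPA.PDF import PDF
#   print(count_pages(PDF(), ["invoice.pdf", "form.pdf"]))
```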

Attention

Keep in mind that this library works with text-based PDFs and can’t extract information from an image-based (scanned) PDF file. For accurate results with such files, use specialized external services wrapped by the RPA.DocumentAI library.

Portal example with video recording demo for parsing PDF invoices: https://github.com/robocorp/example-parse-pdf-invoice

Examples

Robot Framework

*** Settings ***
Library    RPA.PDF
Library    String

*** Tasks ***
Extract Data From First Page
    ${text} =    Get Text From PDF    report.pdf
    ${lines} =     Get Lines Matching Regexp    ${text}[${1}]    .+pain.+
    Log    ${lines}

Get Invoice Number
    Open Pdf    invoice.pdf
    ${matches} =  Find Text    Invoice Number
    Log List      ${matches}

Fill Form Fields
    Switch To Pdf    form.pdf
    ${fields} =     Get Input Fields   encoding=utf-16
    Log Dictionary    ${fields}
    Set Field Value    Given Name Text Box    Mark
    Save Field Values    output_path=${OUTPUT_DIR}${/}completed-form.pdf
    ...                  use_appearances_writer=${True}

Python

from RPA.PDF import PDF
from robot.libraries.String import String

pdf = PDF()
string = String()

def extract_data_from_first_page():
    text = pdf.get_text_from_pdf("report.pdf")
    lines = string.get_lines_matching_regexp(text[1], ".+pain.+")
    print(lines)

def get_invoice_number():
    pdf.open_pdf("invoice.pdf")
    matches = pdf.find_text("Invoice Number")
    for match in matches:
        print(match)

def fill_form_fields():
    pdf.switch_to_pdf("form.pdf")
    fields = pdf.get_input_fields(encoding="utf-16")
    for key, value in fields.items():
        print(f"{key}: {value}")
    pdf.set_field_value("Given Name Text Box", "Mark")
    pdf.save_field_values(
        output_path="completed-form.pdf",
        use_appearances_writer=True
    )
ENCODING = 'utf-8'
FIELDS_ENCODING = 'iso-8859-1'
RE_FLAGS = 24
ROBOT_LIBRARY_DOC_FORMAT = 'REST'
ROBOT_LIBRARY_SCOPE = 'GLOBAL'
property active_pdf_document
add_files_to_pdf(files: list | None = None, target_document: str | None = None, append: bool = False) None

Add images and/or PDFs to a new PDF document.

Supports merging and splitting PDFs.

Image formats supported are JPEG, PNG and GIF.

A file can be added with extra properties by appending a colon (:) and the properties to the filename. Multiple properties are separated by commas.

Supported extra properties for PDFs are:

  • page and/or page ranges

  • no extras means that all source PDF pages are added into new PDF

Supported extra properties for images are:

  • format, the PDF page format, for example Letter or A4

  • rotate, how many degrees image is rotated counter-clockwise

  • align, only possible value at the moment is center

  • orientation, the PDF page orientation for the image, possible values P (portrait) or L (landscape)

  • x/y, coordinates for adjusting image position on the page
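
To make the `filename:extras` convention above concrete, here is a small sketch that assembles such strings. `build_file_spec` is a hypothetical helper, not part of the library; it only joins the pieces in the format described above and does not validate property names.

```python
def build_file_spec(path, pages=None, **properties):
    """Build a 'filename:extras' string for add_files_to_pdf.

    pages      -- page selection for a PDF, e.g. "1" or "2-10,15"
    properties -- image extras such as align="center" or x=0, y=0
    """
    extras = []
    if pages is not None:
        extras.append(str(pages))
    extras.extend(f"{key}={value}" for key, value in properties.items())
    # No extras: the bare path means "add the whole file".
    return f"{path}:{','.join(extras)}" if extras else path


files = [
    build_file_spec("invoice.pdf"),
    build_file_spec("robot.pdf", pages="2-10,15"),
    build_file_spec("approved.png", align="center"),
    build_file_spec("landscape_image.png", rotate=-90, orientation="L"),
]
```

The resulting `files` list can be passed as the `files` argument of `add_files_to_pdf`.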

Examples

Robot Framework

*** Keywords ***
Add files to pdf
    ${files}=    Create List
    ...    ${TESTDATA_DIR}${/}invoice.pdf
    ...    ${TESTDATA_DIR}${/}approved.png:align=center
    ...    ${TESTDATA_DIR}${/}robot.pdf:1
    ...    ${TESTDATA_DIR}${/}approved.png:x=0,y=0
    ...    ${TESTDATA_DIR}${/}robot.pdf:2-10,15
    ...    ${TESTDATA_DIR}${/}approved.png
    ...    ${TESTDATA_DIR}${/}landscape_image.png:rotate=-90,orientation=L
    ...    ${TESTDATA_DIR}${/}landscape_image.png:format=Letter
    Add Files To PDF    ${files}    newdoc.pdf

Merge pdfs
    ${files}=    Create List
    ...    ${TESTDATA_DIR}${/}invoice.pdf
    ...    ${TESTDATA_DIR}${/}robot.pdf:1
    ...    ${TESTDATA_DIR}${/}robot.pdf:2-10,15
    Add Files To Pdf    ${files}    merged-doc.pdf

Split pdf
    ${files}=    Create List
    ...    ${OUTPUT_DIR}${/}robot.pdf:2-10,15
    Add Files To Pdf     ${files}    split-doc.pdf

Python

from RPA.PDF import PDF

pdf = PDF()

def example_addfiles():
    list_of_files = [
        'invoice.pdf',
        'approved.png:align=center',
        'robot.pdf:1',
        'approved.png:x=0,y=0',
    ]
    pdf.add_files_to_pdf(
        files=list_of_files,
        target_document="output/output.pdf"
    )

def example_merge():
    list_of_files = [
        'invoice.pdf',
        'robot.pdf:1',
        'robot.pdf:2-10,15',
    ]
    pdf.add_files_to_pdf(
        files=list_of_files,
        target_document="output/merged-doc.pdf"
    )

def example_split():
    list_of_files = [
        'robot.pdf:2-10,15',
    ]
    pdf.add_files_to_pdf(
        files=list_of_files,
        target_document="output/split-doc.pdf"
    )
Parameters:
  • files – list of filepaths to add into PDF (can be either images or PDFs)

  • target_document – filepath of target PDF

  • append – appends files to existing document if append is True
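
The page selection used in the examples above (for instance `robot.pdf:2-10,15`) reads as a comma-separated list of single pages and page ranges. The sketch below expands such a selection into explicit page numbers. It illustrates the notation only: `expand_pages` is a hypothetical helper, not the parser the library uses, and treating ranges as inclusive is an assumption.

```python
def expand_pages(selection):
    """Expand a selection such as "2-10,15" into a list of page numbers."""
    pages = []
    for part in selection.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(n) for n in part.split("-", 1))
            pages.extend(range(start, end + 1))  # assumed inclusive
        else:
            pages.append(int(part))
    return pages
```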

add_watermark_image_to_pdf(image_path: str | Path, output_path: str | Path, source_path: str | Path | None = None, coverage: float = 0.2) None

Add an image into an existing or new PDF.

If no source path is given, assume a PDF is already opened.

Examples

Robot Framework

*** Keywords ***
Indicate approved with watermark
    Add Watermark Image To PDF
    ...             image_path=approved.png
    ...             source_path=/tmp/sample.pdf
    ...             output_path=output/output.pdf

Python

from RPA.PDF import PDF

pdf = PDF()

def indicate_approved_with_watermark():
    pdf.add_watermark_image_to_pdf(
        image_path="approved.png",
        source_path="/tmp/sample.pdf",
        output_path="output/output.pdf"
    )
Parameters:
  • image_path – filepath to image file to add into PDF

  • source_path – filepath to the source PDF; if not given, the image is added to the currently active PDF

  • output_path – filepath of target PDF

  • coverage – how the watermark image should be scaled on page, defaults to 0.2

close_all_pdfs() None

Close all opened PDF file descriptors.

Examples

Robot Framework

*** Keywords ***
Close Multiple PDFs
    Close all pdfs
close_pdf(source_pdf: str | None = None) None

Close PDF file descriptor for a certain file.

Examples

Robot Framework

*** Keywords ***
Close just one pdf
    Close pdf   path/to/the/pdf/file.pdf
Parameters:

source_pdf – filepath to the source pdf.

Raises:

ValueError – if file descriptor for the file is not found.

convert(source_path: str | None = None, trim: bool = True, pagenum: str | int | None = None)

Parse source PDF into entities.

These entities can be used for text searches or XML dumping for example. The conversion will be done automatically when using the dependent keywords directly.

Parameters:
  • source_path – source PDF filepath

  • trim – trim whitespace from the text if set to True (default)

  • pagenum – page number to convert, defaults to None, meaning all pages get converted (numbers start from 1)

Examples

Robot Framework

***Settings***
Library    RPA.PDF

***Tasks***
Example Keyword
    Convert    /tmp/sample.pdf

Python

from RPA.PDF import PDF

pdf = PDF()

def example_keyword():
    pdf.convert("/tmp/sample.pdf")
decrypt_pdf(source_path: str, output_path: str, password: str) bool

Decrypt PDF with password.

If no source path given, assumes a PDF is already opened.

Examples

Robot Framework

*** Keywords ***
Make PDF human readable
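    # Output path and password below are placeholders; all three arguments are required.
    ${success}=  Decrypt PDF    /tmp/sample.pdf    /tmp/decrypted.pdf    my_password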

Python

from RPA.PDF import PDF

pdf = PDF()

def make_pdf_human_readable():
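    # Output path and password are placeholders; all three arguments are required.
    success = pdf.decrypt_pdf("/tmp/sample.pdf", "/tmp/decrypted.pdf", "my_password")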
Parameters:
  • source_path – filepath to the source pdf.

  • output_path – filepath to the decrypted pdf.

  • password – password as a string.

Returns:

True if decryption was successful, otherwise False (or an exception is raised).

Raises:

ValueError – on decryption errors.
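
Since the keyword can signal failure either by returning False or by raising ValueError, callers may want to fold both into a single boolean. A minimal sketch, assuming `pdf` is an RPA.PDF.PDF instance; `try_decrypt` is a hypothetical wrapper name.

```python
def try_decrypt(pdf, source_path, output_path, password):
    """Return True if the PDF was decrypted, False otherwise."""
    try:
        return bool(pdf.decrypt_pdf(source_path, output_path, password))
    except ValueError as err:
        # Documented above as the error raised on decryption problems.
        print(f"Could not decrypt {source_path}: {err}")
        return False
```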

dump_pdf_as_xml(source_path: str | None = None) str

Get PDFMiner format XML dump of the PDF

Examples

Robot Framework

***Settings***
Library    RPA.PDF

***Tasks***
Example Keyword
    ${xml}=  Dump PDF as XML    /tmp/sample.pdf

Python

from RPA.PDF import PDF

pdf = PDF()

def example_keyword():
    xml = pdf.dump_pdf_as_xml("/tmp/sample.pdf")
Parameters:

source_path – filepath to the source PDF

Returns:

XML content as a string

encrypt_pdf(source_path: str | None = None, output_path: str | None = None, user_pwd: str = '', owner_pwd: str | None = None, use_128bit: bool = True) None

Encrypt a PDF document.

If no source path given, assumes a PDF is already opened.

Examples

Robot Framework

*** Keywords ***
Secure this PDF
    Encrypt PDF    /tmp/sample.pdf

Secure this PDF and set passwords
    Encrypt PDF
    ...    source_path=/tmp/sample.pdf
    ...    output_path=/tmp/new/sample_encrypted.pdf
    ...    user_pwd=complex_password_here
    ...    owner_pwd=different_complex_password_here
    ...    use_128bit=${TRUE}

Python

from RPA.PDF import PDF

pdf = PDF()

def secure_this_pdf():
    pdf.encrypt_pdf("/tmp/sample.pdf")
Parameters:
  • source_path – filepath to the source pdf.

  • output_path – filepath to the target pdf, stored by default in the robot output directory as output.pdf

  • user_pwd – allows opening and reading PDF with restrictions.

  • owner_pwd – allows opening PDF without any restrictions, by default the same as user_pwd.

  • use_128bit – whether to use 128-bit encryption; when False, 40-bit encryption is used. Default True.
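
The Python example above only shows the single-argument form. As a sketch of the call with both passwords set, mirroring the second Robot Framework example: `secure_with_passwords` is a hypothetical wrapper, and `pdf` is assumed to be an RPA.PDF.PDF instance.

```python
def secure_with_passwords(pdf, source_path, output_path, user_pwd, owner_pwd):
    """Encrypt source_path into output_path with separate passwords."""
    pdf.encrypt_pdf(
        source_path=source_path,
        output_path=output_path,
        user_pwd=user_pwd,
        owner_pwd=owner_pwd,
        use_128bit=True,  # the documented default, spelled out for clarity
    )
```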

extract_pages_from_pdf(source_path: str | None = None, output_path: str | None = None, pages: int | str | List[int] | List[str] | None = None) None

Extract pages from source PDF and save to a new PDF document.

Page numbers start from 1.

If no source path given, assumes a PDF is already opened.

Examples

Robot Framework

*** Keywords ***
Save PDF pages to a new document
    Extract Pages From PDF
    ...          source_path=/tmp/sample.pdf
    ...          output_path=/tmp/output.pdf
    ...          pages=5

Save PDF pages from open PDF to a new document
    Extract Pages From PDF
    ...          output_path=/tmp/output.pdf
    ...          pages=5

Python

from RPA.PDF import PDF

pdf = PDF()

def save_pdf_pages_to_a_new_document():
    pdf.extract_pages_from_pdf(
        source_path="/tmp/sample.pdf",
        output_path="/tmp/output.pdf",
        pages=5
    )
Parameters:
  • source_path – filepath to the source pdf.

  • output_path – filepath to the target pdf, stored by default in the robot output directory as output.pdf

  • pages – page numbers to extract from the PDF (numbers start from 1); if None, all pages are extracted.

find_text(locator: str, pagenum: int | str = 1, direction: str = 'right', closest_neighbours: str | int | None = 1, strict: bool = False, regexp: str | None = None, trim: bool = True, ignore_case: bool = False) List[Match]

Find the closest text elements near the set anchor(s) through locator.

The PDF will be parsed automatically before elements can be searched.

Parameters:
  • locator – Element to set anchor to. This can be prefixed with either “text:”, “subtext:”, “regex:” or “coords:” to find the anchor by text or coordinates. The “text” strategy is assumed if no such prefix is specified. (text search is case-sensitive; use ignore_case param for controlling it)

  • pagenum – Page number on which the search is performed, defaults to 1 (first page).

  • direction – In which direction to search for text elements. This can be any of ‘top’/‘up’, ‘bottom’/‘down’, ‘left’ or ‘right’. (defaults to ‘right’)

  • closest_neighbours – How many neighbours to return at most, sorted by the distance from the current anchor.

  • strict – If element’s margins should be used for matching those which are aligned to the anchor. (turned off by default)

  • regexp – Expected format of the searched text value. By default all the candidates in range are considered valid neighbours.

  • trim – Automatically trim leading/trailing whitespace from the text elements. (switched on by default)

  • ignore_case – Do a case-insensitive search when set to True. (affects the passed locator and regexp filtering)

Returns:

A list of Match objects, where every match has the following attributes: .anchor, the text matched by the locator; .neighbours, a list of adjacent texts found in the specified direction.
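
A short sketch of consuming the returned list: it picks the closest neighbour from the first match that has one, relying on the neighbours being sorted by distance as described for closest_neighbours. `first_neighbour` is a hypothetical helper; it only reads the `.neighbours` attribute documented above.

```python
def first_neighbour(matches):
    """Return the closest neighbour of the first match that has one, or None."""
    for match in matches:
        if match.neighbours:
            return match.neighbours[0]
    return None


# Usage (assumes invoice.pdf exists and `pdf` is an RPA.PDF.PDF instance):
#   pdf.open_pdf("invoice.pdf")
#   invoice_number = first_neighbour(pdf.find_text("Invoice Number"))
```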

Attention

Keep in mind that this keyword works with text-based PDFs and can’t extract information from an image-based (scanned) PDF file. For accurate results with such files, use specialized external services wrapped by the RPA.DocumentAI library.

Portal example with video recording demo for parsing PDF invoices: https://github.com/robocorp/example-parse-pdf-invoice

Examples

Robot Framework

PDF Invoice Parsing
    Open Pdf    invoice.pdf
    ${matches} =  Find Text    Invoice Number
    Log List      ${matches}
List has one item:
Match(anchor='Invoice Number', direction='right', neighbours=['INV-3337'])

Python

from RPA.PDF import PDF

pdf = PDF()

def pdf_invoice_parsing():
    pdf.open_pdf("invoice.pdf")
    matches = pdf.find_text("Invoice Number")
    for match in matches:
        print(match)

pdf_invoice_parsing()
Match(anchor='Invoice Number', direction='right', neighbours=['INV-3337'])
static fit_dimensions_to_box(width: int, height: int, max_width: int, max_height: int) Tuple[int, int]

Fit dimensions of width and height to a given box.
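
The description is brief, so here is one plausible reading of "fitting" as a sketch: shrink the dimensions, preserving the aspect ratio, until both fit inside the box. `fit_to_box` is a hypothetical stand-in written for illustration; the library's own rounding and edge-case handling may differ.

```python
def fit_to_box(width, height, max_width, max_height):
    """Scale (width, height) down, keeping aspect ratio, to fit the box."""
    # Never scale up: the factor is capped at 1.0.
    scale = min(max_width / width, max_height / height, 1.0)
    return int(width * scale), int(height * scale)
```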

get_all_figures(source_path: str | None = None) dict

Return all figures in the PDF document.

If no source path given, assumes a PDF is already opened.

Examples

Robot Framework

*** Keywords ***
Image fetch
    &{figures}=  Get All Figures    /tmp/sample.pdf

Image fetch from open PDF
    &{figures}=  Get All Figures

Python

from RPA.PDF import PDF

pdf = PDF()

def image_fetch():
    figures = pdf.get_all_figures("/tmp/sample.pdf")
Parameters:

source_path – filepath to the source pdf.

Returns:

dictionary of figures divided into pages.

get_input_fields(source_path: str | None = None, replace_none_value: bool = False, encoding: str | None = 'iso-8859-1') dict

Get input fields in the PDF.

Stores input fields internally so that they can be used without parsing the PDF again.

Parameters:
  • source_path – Filepath to source, if not given use the currently active PDF.

  • replace_none_value – Enable this to conveniently visualize the fields. (replaces the null value with the field’s default, or its name if absent)

  • encoding – Use an explicit encoding for field name/value parsing. (defaults to “iso-8859-1”, but “utf-8” or “utf-16” might be the one working for you)

Returns:

A dictionary with all the found fields. Use their key names when setting values into them.

Raises:

KeyError – If no input fields are enabled in the PDF.
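
When processing arbitrary documents, it can be convenient to treat "no form fields" as an empty result rather than an error. A minimal sketch, assuming `pdf` is an RPA.PDF.PDF instance; `get_fields_or_empty` is a hypothetical wrapper name.

```python
def get_fields_or_empty(pdf, source_path=None):
    """Return the input fields of a PDF, or {} if it has no form fields."""
    try:
        return pdf.get_input_fields(source_path)
    except KeyError:
        # Documented above: raised when the PDF has no input fields enabled.
        return {}
```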

Examples

Robot Framework

Example Keyword
    ${fields} =     Get Input Fields    form.pdf
    Log Dictionary    ${fields}

Python

from RPA.PDF import PDF

pdf = PDF()

def example_keyword():
    fields = pdf.get_input_fields("form.pdf")
    print(fields)

example_keyword()
get_number_of_pages(source_path: str | None = None) int

Get number of pages in the document.

If no source path given, assumes a PDF is already opened.

Examples

Robot Framework

*** Keywords ***
Number of pages in PDF
    ${page_count}=    Get Number Of Pages    /tmp/sample.pdf

Number of pages in opened PDF
    ${page_count}=    Get Number Of Pages

Python

from RPA.PDF import PDF

pdf = PDF()

def number_of_pages_in_pdf():
    page_count = pdf.get_number_of_pages("/tmp/sample.pdf")
Parameters:

source_path – filepath to the source pdf

Raises:

PdfReadError – if file is encrypted or other restrictions are in place

get_pdf_info(source_path: str | None = None) dict

Get metadata from a PDF document.

If no source path given, assumes a PDF is already opened.

Examples

Robot Framework

*** Keywords ***
Get PDF metadata
    ${metadata}=    Get PDF Info    /tmp/sample.pdf

*** Keywords ***
Get metadata from an already opened PDF
    ${metadata}=    Get PDF Info

Python

from RPA.PDF import PDF

pdf = PDF()

def get_pdf_metadata():
    metadata = pdf.get_pdf_info("/tmp/sample.pdf")
Parameters:

source_path – filepath to the source PDF.

Returns:

dictionary of PDF information.

get_text_from_pdf(source_path: str | None = None, pages: int | str | List[int] | List[str] | None = None, details: bool = False, trim: bool = True) dict

Get text from set of pages in source PDF document.

If no source path given, assumes a PDF is already opened.

Examples

Robot Framework

*** Keywords ***
Text extraction from PDF
    ${text}=    Get Text From PDF    /tmp/sample.pdf

Text extraction from open PDF
    ${text}=    Get Text From PDF

Python

from RPA.PDF import PDF

pdf = PDF()

def text_extraction_from_pdf():
    text = pdf.get_text_from_pdf("/tmp/sample.pdf")
Parameters:
  • source_path – filepath to the source pdf.

  • pages – page numbers to get text (numbers start from 1).

  • details – set to True to return textboxes, default False.

  • trim – set to False to return raw texts, default True means whitespace is trimmed from the text

Returns:

dictionary of pages and their texts.
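
A sketch of working with the returned dictionary, assuming it is keyed by page number as in the `text[1]` lookup in the library introduction. `pages_containing` is a hypothetical helper that operates on the plain dictionary.

```python
def pages_containing(text_by_page, needle):
    """Return the page numbers whose text contains `needle`."""
    return [page for page, text in sorted(text_by_page.items()) if needle in text]


# Usage (assumes `pdf` is an RPA.PDF.PDF instance):
#   text = pdf.get_text_from_pdf("/tmp/sample.pdf")
#   print(pages_containing(text, "Invoice"))
```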

html_to_pdf(content: str | List[str], output_path: str, encoding: str = 'utf-8', margin: float = 0, working_directory: str | None = None) None

Generate a PDF file from HTML content.

Note that input must be well-formed and valid HTML.

Examples

Robot Framework

*** Keywords ***
Create PDF from HTML
    HTML to PDF    ${html_content_as_string}  /tmp/output.pdf

Multi Page PDF
    @{pages}=    Create List    ${page1_html}    ${page2_html}
    HTML To PDF   ${pages}    output.pdf
    ...  margin=10
    ...  working_directory=subdir

Python

from RPA.PDF import PDF

pdf = PDF()

def create_pdf_from_html():
    pdf.html_to_pdf(html_content_as_string, "/tmp/output.pdf")

def multi_page_pdf():
    pages = [page1_html, page2_html, page3_html]
    pdf.html_to_pdf(pages, "output.pdf", margin=10)
    # if I have images in the HTML in the 'subdir'
    pdf.html_to_pdf(pages, "output.pdf",
        margin=10, working_directory="subdir"
    )
Parameters:
  • content – HTML content

  • output_path – filepath where to save the PDF document

  • encoding – codec used for text I/O

  • margin – page margin, default is set to 0

  • working_directory – directory where to look for HTML linked resources, by default uses the current working directory

is_pdf_encrypted(source_path: str | None = None) bool

Check if PDF is encrypted.

If no source path given, assumes a PDF is already opened.

Parameters:

source_path – filepath to the source pdf.

Returns:

True if file is encrypted.

Examples

Robot Framework

*** Keywords ***
Is PDF encrypted
    ${is_encrypted}=    Is PDF Encrypted    /tmp/sample.pdf

*** Keywords ***
Is open PDF encrypted
    ${is_encrypted}=    Is PDF Encrypted

Python

from RPA.PDF import PDF

pdf = PDF()

def example_keyword():
    is_encrypted = pdf.is_pdf_encrypted("/tmp/sample.pdf")
property logger
open_pdf(source_path: str | Path) None

Open a PDF document for reading.

This is called automatically in the other PDF keywords when a path to the PDF file is given as an argument.

Examples

Robot Framework

*** Keywords ***
Open my pdf file
    Open PDF    /tmp/sample.pdf

Python

from RPA.PDF import PDF

pdf = PDF()

def example_keyword():
    pdf.open_pdf("/tmp/sample.pdf")
Parameters:

source_path – filepath to the source pdf.

Raises:

ValueError – if PDF is already open.

property pages_count

Get number of pages in active PDF document.

static resolve_input(path: str | Path) str

Normalizes input path and returns as string.

static resolve_output(path: str | Path | None = None) str

Normalizes output path and returns as string.

rotate_page(pages: int | str | List[int] | List[str] | None, source_path: str | None = None, output_path: str | None = None, clockwise: bool = True, angle: int = 90) None

Rotate pages in source PDF document and save to target PDF document.

If no source path given, assumes a PDF is already opened.

Examples

Robot Framework

*** Keywords ***
PDF page rotation
    Rotate Page
    ...          source_path=/tmp/sample.pdf
    ...          output_path=/tmp/output.pdf
    ...          pages=5

Python

from RPA.PDF import PDF

pdf = PDF()

def pdf_page_rotation():
    pdf.rotate_page(
        source_path="/tmp/sample.pdf",
        output_path="/tmp/output.pdf",
        pages=5
    )
Parameters:
  • pages – page numbers to rotate (numbers start from 1).

  • source_path – filepath to the source pdf.

  • output_path – filepath to the target pdf, stored by default in the robot output directory as output.pdf

  • clockwise – direction in which the page will be rotated, default True.

  • angle – number of degrees to rotate, default 90.

save_field_values(source_path: str | None = None, output_path: str | None = None, newvals: dict | None = None, use_appearances_writer: bool = False) None

Save field values in PDF if it has fields.

Parameters:
  • source_path – Source PDF with fields to update.

  • output_path – Updated target PDF.

  • newvals – New values when updating many at once.

  • use_appearances_writer – For some PDF documents the updated fields won’t be visible (or will look strange). When this happens, try setting this to True so the previewer will re-render the fields based on their actual values. (viewing the output PDF in a browser might then display the field values correctly)

Examples

Robot Framework

Example Keyword
    Open PDF    ./tmp/sample.pdf
    Set Field Value    phone_nr    077123123
    Save Field Values    output_path=./tmp/output.pdf

Multiple operations
    &{new_fields}=       Create Dictionary
    ...                  phone_nr=077123123
    ...                  title=dr
    Save Field Values    source_path=./tmp/sample.pdf
    ...                  output_path=./tmp/output.pdf
    ...                  newvals=${new_fields}

Python

from RPA.PDF import PDF

pdf = PDF()

def example_keyword():
    pdf.open_pdf("./tmp/sample.pdf")
    pdf.set_field_value("phone_nr", "077123123")
    pdf.save_field_values(output_path="./tmp/output.pdf")

def multiple_operations():
    new_fields = {"phone_nr": "077123123", "title": "dr"}
    pdf.save_field_values(
        source_path="./tmp/sample.pdf",
        output_path="./tmp/output.pdf",
        newvals=new_fields
    )
save_figure_as_image(figure: Figure, images_folder: str = '.', file_prefix: str = '') str | None

Try to save the image data from a Figure object and return the file name if successful.

The Figure must contain a byte stream that is recognized as an image format for the save to succeed.

Examples

Robot Framework

*** Keywords ***
Figure to Image
    ${image_file_path} =     Save figure as image
    ...             figure=${pdf_figure_object}
    ...             images_folder=/tmp/images
    ...             file_prefix=file_name_here

Python

from RPA.PDF import PDF

pdf = PDF()

def figure_to_image(pdf_figure_object):
    # pdf_figure_object is a Figure, e.g. one obtained via get_all_figures()
    image_file_path = pdf.save_figure_as_image(
        figure=pdf_figure_object,
        images_folder="/tmp/images",
        file_prefix="file_name_here"
    )
Parameters:
  • figure – PDF Figure object which will be saved as an image. The PDF Figure object can be determined from the Get All Figures keyword

  • images_folder – directory where image files will be created

  • file_prefix – image filename prefix

Returns:

image filepath or None

save_figures_as_images(source_path: str | None = None, images_folder: str = '.', pages: str | None = None, file_prefix: str = '') List[str]

Save figures from given PDF document as image files.

If no source path given, assumes a PDF is already opened.

Examples

Robot Framework

*** Keywords ***
Figures to Images
    ${image_filenames} =    Save figures as images
    ...             source_path=/tmp/sample.pdf
    ...             images_folder=/tmp/images
    ...             pages=${4}
    ...             file_prefix=file_name_here

Python

from RPA.PDF import PDF

pdf = PDF()

def figures_to_images():
    image_filenames = pdf.save_figures_as_images(
        source_path="/tmp/sample.pdf",
        images_folder="/tmp/images",
        pages=4,
        file_prefix="file_name_here"
    )
Parameters:
  • source_path – filepath to PDF document

  • images_folder – directory where image files will be created

  • pages – target figures in the pages, can be single page or range, default None means that all pages are scanned for figures to save (numbers start from 1)

  • file_prefix – image filename prefix

Returns:

list of image filenames created

save_pdf(output_path: str, reader: PdfReader | None = None)

Save the contents of a pypdf reader to a new file.

Examples

Robot Framework

*** Keywords ***
Save changes to PDF
    Save PDF    /tmp/output.pdf

Python

from RPA.PDF import PDF

pdf = PDF()

def save_changes_to_pdf():
    pdf.save_pdf(output_path="output/output.pdf")
Parameters:
  • output_path – filepath to target PDF

  • reader – a pypdf reader (defaults to currently active document)

set_anchor_to_element(locator: str, trim: bool = True, pagenum: int | str = 1, ignore_case: bool = False) bool

Sets main anchor point in the document for further searches.

This is used internally in the library and can work with multiple anchors at the same time if such are found.

Parameters:
  • locator – Element to set anchor to. This can be prefixed with either “text:”, “subtext:”, “regex:” or “coords:” to find the anchor by text or coordinates. The “text” strategy is assumed if no such prefix is specified. (text search is case-sensitive; use ignore_case param for controlling it)

  • trim – Automatically trim leading/trailing whitespace from the text elements. (switched on by default)

  • pagenum – Page number on which the search is performed, defaults to 1 (first page).

  • ignore_case – Do a case-insensitive search when set to True.

Returns:

True if at least one anchor was found.

Examples

Robot Framework

Example Keyword
     ${success} =  Set Anchor To Element    Invoice Number

Python

from RPA.PDF import PDF

pdf = PDF()

def example_keyword():
    success = pdf.set_anchor_to_element("Invoice Number")
set_convert_settings(line_margin: float | None = None, word_margin: float | None = None, char_margin: float | None = None, boxes_flow: float | None = 0.5)

Change settings for PDFMiner document conversion.

line_margin controls how textboxes are grouped. If the conversion results in texts merged into one group, set this to a lower value.

word_margin controls how spaces are inserted between words. If the conversion results in text without spaces, set this to a lower value.

char_margin controls how characters are grouped into words. If the conversion results in individual characters instead of words, set this to a higher value.

boxes_flow controls how much the horizontal and vertical position of the text matters when determining the order of text boxes. The value can range from -1.0 (only horizontal position matters) to +1.0 (only vertical position matters). This feature (advanced layout analysis) can be disabled by setting the value to None, in which case the bottom left corner of each text box is used to determine the order of the text boxes.

Parameters:
  • line_margin – relative margin between bounding lines, default 0.5

  • word_margin – relative margin between words, default 0.1

  • char_margin – relative margin between characters, default 2.0

  • boxes_flow – positioning of the text boxes based on text, default 0.5

Examples

Robot Framework

***Settings***
Library    RPA.PDF

***Tasks***
Example Keyword
    Set Convert Settings  line_margin=0.00000001
    ${texts}=  Get Text From PDF  /tmp/sample.pdf

Python

from RPA.PDF import PDF

pdf = PDF()

def example_keyword():
    pdf.set_convert_settings(boxes_flow=None)
    texts = pdf.get_text_from_pdf("/tmp/sample.pdf")
set_field_value(field_name: str, value: Any, source_path: str | None = None) None

Set value for field with given name on the active document.

Tries to match with field’s identifier directly or its label. When ticking checkboxes, try with the /Yes string value as simply Yes might not work with most previewing apps.

Parameters:
  • field_name – Field to update.

  • value – New value for the field.

  • source_path – Source PDF file path.

Raises:

ValueError – When field can’t be found or more than one field matches the given field_name.
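
When setting several fields at once, the ValueError above can be collected per field instead of stopping at the first failure. A minimal sketch, assuming `pdf` is an RPA.PDF.PDF instance with an active document; `set_fields` is a hypothetical helper and the field names in the usage comment are placeholders.

```python
def set_fields(pdf, values):
    """Set several fields; return the names that could not be set."""
    failed = []
    for name, value in values.items():
        try:
            pdf.set_field_value(name, value)
        except ValueError:
            # Raised when the name matches no field or more than one.
            failed.append(name)
    return failed


# Usage ("/Yes" ticks a checkbox, as noted above):
#   failed = set_fields(pdf, {"phone_nr": "077123123", "newsletter": "/Yes"})
```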

Examples

Robot Framework

Example Keyword
    Open PDF    ./tmp/sample.pdf
    Set Field Value    phone_nr    077123123
    Save Field Values    output_path=./tmp/output.pdf

Python

from RPA.PDF import PDF

pdf = PDF()

def example_keyword():
    pdf.open_pdf("./tmp/sample.pdf")
    pdf.set_field_value("phone_nr", "077123123")
    pdf.save_field_values(output_path="./tmp/output.pdf")
switch_to_pdf(source_path: str | Path | None = None) None

Switch the library’s current file object to an already opened file, or open the file if it is not open yet.

This is done automatically in the PDF library keywords.

Examples

Robot Framework

*** Keywords ***
Jump to another PDF
    Switch to PDF    /tmp/another.pdf

Python

from RPA.PDF import PDF

pdf = PDF()

def jump_to_another_pdf():
    pdf.switch_to_pdf("/tmp/sample.pdf")
Parameters:

source_path – filepath to the source pdf.

Raises:

ValueError – if the PDF filepath is not given and there is no active file to activate.

template_html_to_pdf(template: str, output_path: str, variables: dict | None = None, encoding: str = 'utf-8', margin: float = 0, working_directory: str | None = None) None

Use HTML template file or content to generate PDF file.

It provides an easy method of generating a PDF document from an HTML formatted template file or HTML content string with variable substitution.

Examples

Robot Framework

*** Keywords ***
Create PDF from HTML template file
    ${TEMPLATE}=    Set Variable    order.template
    ${PDF}=         Set Variable    result.pdf
    &{DATA}=        Create Dictionary
    ...             name=Robot Generated
    ...             email=robot@domain.com
    ...             zip=00100
    ...             items=Item 1, Item 2
    Template HTML to PDF
    ...    template=${TEMPLATE}
    ...    output_path=${PDF}
    ...    variables=${DATA}

Create PDF from HTML content
    ${HTML}=        Set Variable    <html><body><h1>{{title}}</h1><p>{{content}}</p></body></html>
    ${PDF}=         Set Variable    result.pdf
    &{DATA}=        Create Dictionary
    ...             title=My Report
    ...             content=This is the content
    Template HTML to PDF
    ...    template=${HTML}
    ...    output_path=${PDF}
    ...    variables=${DATA}

Python

from RPA.PDF import PDF

p = PDF()

# Using template file
orders = ["item 1", "item 2", "item 3"]
data = {
    "name": "Robot Process",
    "email": "robot@domain.com",
    "zip": "00100",
    "items": "<br/>".join(orders),
}
p.template_html_to_pdf("order.template", "order.pdf", data)

# Using HTML content directly
html_content = "<html><body><h1>{{title}}</h1><p>{{content}}</p></body></html>"
data = {"title": "My Report", "content": "This is the content"}
p.template_html_to_pdf(html_content, "report.pdf", data)
Parameters:
  • template – filepath to HTML template file OR HTML content string. If the file path exists, it will be read as a template file. If the file path does not exist, the string is treated as HTML content.

  • output_path – filepath where to save PDF document

  • variables – dictionary of variables to fill into template, defaults to {}

  • encoding – codec used for text I/O

  • margin – page margin, default is set to 0

  • working_directory – directory where to look for HTML linked resources, by default uses the current working directory
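
To illustrate what "variable substitution" means for the `{{name}}` placeholders in the examples above, the sketch below performs a plain string replacement. It is an illustration of the placeholder convention only; `render_template` is a hypothetical function, and the library's actual substitution logic may differ (for example in how it treats missing variables).

```python
def render_template(template, variables):
    """Replace each {{name}} placeholder with its value from `variables`."""
    for name, value in variables.items():
        template = template.replace("{{" + name + "}}", str(value))
    return template


html = render_template(
    "<html><body><h1>{{title}}</h1><p>{{content}}</p></body></html>",
    {"title": "My Report", "content": "This is the content"},
)
```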

Note

Line Break Handling: When using HTML content with <br/> or <br> tags, these are automatically converted to paragraph breaks to improve text extraction. However, for optimal line break preservation in get_text_from_pdf(), consider using proper HTML block elements like <p> or <div> instead of <br/> tags.

Example for better line breaks:

# Instead of: "<p>Line 1<br/>Line 2</p>"
# Use: "<p>Line 1</p><p>Line 2</p>"
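
Following the recommendation above, a list of lines can be wrapped in `<p>` elements before being passed as a template variable, instead of being joined with `<br/>`. A minimal sketch; `lines_to_paragraphs` is a hypothetical helper, and escaping the text with `html.escape` is an added precaution rather than something the library requires.

```python
from html import escape


def lines_to_paragraphs(lines):
    """Wrap each line in its own <p> element instead of joining with <br/>."""
    return "".join(f"<p>{escape(line)}</p>" for line in lines)


items_html = lines_to_paragraphs(["Line 1", "Line 2"])
```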