java.lang.Object

org.apache.drill.exec.store.pdf.PdfUtils

public class PdfUtils extends Object

Field Summary

Fields

Modifier and Type

Field

Description

static final technology.tabula.extractors.ExtractionAlgorithm

DEFAULT_ALGORITHM
Constructor Summary

Constructors

Constructor

Description

PdfUtils()
Method Summary

Modifier and Type

Method

Description

static List<String>

convertRowToStringArray(List<technology.tabula.RectangularTextContainer> input)

static List<String>

extractFirstRowValues(technology.tabula.Table table)

Returns the values contained in the first row of a PDF table.

static List<technology.tabula.Table>

extractTablesFromPDF(org.apache.pdfbox.pdmodel.PDDocument document)

Returns a list of tables found in a given PDF document.

static List<technology.tabula.Table>

extractTablesFromPDF(org.apache.pdfbox.pdmodel.PDDocument document, technology.tabula.extractors.ExtractionAlgorithm algorithm)

Returns a list of tables found in a given PDF document.

static List<technology.tabula.RectangularTextContainer>

getRow(technology.tabula.Table table, int rowIndex)

Returns the contents of a specific row in a PDF table as a list of RectangularTextContainer objects.

static List<String>

getRowAsStringList(technology.tabula.Table table, int rowIndex)

Returns the contents of a specific row in a PDF table as a list of Strings.

static technology.tabula.Table

getSpecificTable(org.apache.pdfbox.pdmodel.PDDocument document, int tableIndex, technology.tabula.extractors.ExtractionAlgorithm algorithm)

Returns a specific table from a PDF document.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- DEFAULT_ALGORITHM
  
  public static final technology.tabula.extractors.ExtractionAlgorithm DEFAULT_ALGORITHM
Constructor Details
- PdfUtils
  
  public PdfUtils()
Method Details
- extractTablesFromPDF
  
  public static List<technology.tabula.Table> extractTablesFromPDF(org.apache.pdfbox.pdmodel.PDDocument document)
  
  Returns a list of tables found in a given PDF document. There are several extraction algorithms available and this function uses the default Basic Extraction Algorithm.
  
  Parameters:
  
  document - The input PDF document to search for tables
  
  Returns:
  
  A list of tables found in the document.
- extractTablesFromPDF
  
  public static List<technology.tabula.Table> extractTablesFromPDF(org.apache.pdfbox.pdmodel.PDDocument document, technology.tabula.extractors.ExtractionAlgorithm algorithm)
  
  Returns a list of tables found in a given PDF document. There are several extraction algorithms available and this function allows the user to select which to use.
  
  Parameters:
  
  document - The input PDF document to search for tables
  
  algorithm - The extraction algorithm
  
  Returns:
  
  A list of tables found in the document.
- getSpecificTable
  
  public static technology.tabula.Table getSpecificTable(org.apache.pdfbox.pdmodel.PDDocument document, int tableIndex, technology.tabula.extractors.ExtractionAlgorithm algorithm)
  
  Returns a specific table from a PDF document. Returns null if the requested table does not exist. Throws a UserException if there is an error with the document.
  
  Parameters:
  
  document - The source PDF document
  
  tableIndex - The index of the desired table
  
  algorithm - The extraction algorithm
  
  Returns:
  
  The desired Table, or null if the table index is not valid or the document has no tables.
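
The null-return contract described above can be illustrated with a minimal sketch. It does not use Drill, PDFBox, or Tabula: plain Java lists stand in for the extracted tables so that it runs without dependencies, and `TableLookupSketch` and `tableAt` are hypothetical names, not part of PdfUtils. Treating a negative index as invalid is also an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class TableLookupSketch {

  // Hypothetical stand-in for the documented contract: return the element at
  // tableIndex, or null when there are no tables or the index is out of range.
  // Rejecting negative indexes is an assumption, not confirmed by the source.
  public static <T> T tableAt(List<T> tables, int tableIndex) {
    if (tables == null || tableIndex < 0 || tableIndex >= tables.size()) {
      return null;
    }
    return tables.get(tableIndex);
  }

  public static void main(String[] args) {
    List<String> tables = Arrays.asList("table-0", "table-1");

    String found = tableAt(tables, 1);
    String missing = tableAt(tables, 5);
    String fromEmpty = tableAt(Collections.<String>emptyList(), 0);

    // Callers should check for null before using the result.
    System.out.println(found != null ? found : "no such table");
    System.out.println(missing != null ? missing : "no such table");
    System.out.println(fromEmpty != null ? fromEmpty : "no such table");
  }
}
```

Because a missing table is reported as null rather than as an exception, callers of getSpecificTable need an explicit null check before reading rows from the result.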
- extractFirstRowValues
  
  public static List<String> extractFirstRowValues(technology.tabula.Table table)
  
  Returns the values contained in the first row of a PDF table.
  
  Parameters:
  
  table - The source table
  
  Returns:
  
  A list of the values in the first (header) row.
- getRowAsStringList
  
  public static List<String> getRowAsStringList(technology.tabula.Table table, int rowIndex)
  
  Returns the contents of a specific row in a PDF table as a list of Strings.
  
  Parameters:
  
  table - The table containing the data.
  
  rowIndex - The desired row index
  
  Returns:
  
  A list of Strings with the data.
- convertRowToStringArray
  
  public static List<String> convertRowToStringArray(List<technology.tabula.RectangularTextContainer> input)
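
This method has no description here, but its signature suggests that it maps each text container in a row to its string value. A minimal sketch of that kind of conversion follows, assuming each container exposes its text through a `getText()` accessor. `TextCell`, `RowToStringsSketch`, and `toStrings` are hypothetical stand-ins so that the example runs without Tabula on the classpath; this is not the Drill implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RowToStringsSketch {

  // Hypothetical stand-in for a Tabula text container; only getText() is modeled.
  public interface TextCell {
    String getText();
  }

  // Collects the text of each cell, in row order, into a new list.
  public static List<String> toStrings(List<? extends TextCell> row) {
    List<String> values = new ArrayList<>();
    for (TextCell cell : row) {
      values.add(cell.getText());
    }
    return values;
  }

  public static void main(String[] args) {
    // Each lambda implements TextCell.getText() for one cell.
    List<TextCell> row = Arrays.asList(() -> "id", () -> "name", () -> "price");
    System.out.println(toStrings(row)); // prints [id, name, price]
  }
}
```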
- getRow
  
  public static List<technology.tabula.RectangularTextContainer> getRow(technology.tabula.Table table, int rowIndex)
  
  Returns the contents of a specific row in a PDF table as a list of RectangularTextContainer objects.
  
  Parameters:
  
  table - The table containing the data.
  
  rowIndex - The desired row index
  
  Returns:
  
  A list of RectangularTextContainer objects holding the row's data.
