Class PdfUtils

java.lang.Object
org.apache.drill.exec.store.pdf.PdfUtils

public class PdfUtils extends Object
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final technology.tabula.extractors.ExtractionAlgorithm
    DEFAULT_ALGORITHM
     
  • Constructor Summary

    Constructors
    Constructor
    Description
    PdfUtils()
     
  • Method Summary

    Modifier and Type
    Method
    Description
    static List<String>
    convertRowToStringArray(List<technology.tabula.RectangularTextContainer> input)
     
    static List<String>
    extractFirstRowValues(technology.tabula.Table table)
    Returns the values contained in the first row of a PDF table.
    static List<technology.tabula.Table>
    extractTablesFromPDF(org.apache.pdfbox.pdmodel.PDDocument document)
    Returns a list of tables found in a given PDF document.
    static List<technology.tabula.Table>
    extractTablesFromPDF(org.apache.pdfbox.pdmodel.PDDocument document, technology.tabula.extractors.ExtractionAlgorithm algorithm)
    Returns a list of tables found in a given PDF document.
    static List<technology.tabula.RectangularTextContainer>
    getRow(technology.tabula.Table table, int rowIndex)
    Returns the contents of a specific row in a PDF table as a list of RectangularTextContainer cells.
    static List<String>
    getRowAsStringList(technology.tabula.Table table, int rowIndex)
    Returns the contents of a specific row in a PDF table as a list of Strings.
    static technology.tabula.Table
    getSpecificTable(org.apache.pdfbox.pdmodel.PDDocument document, int tableIndex, technology.tabula.extractors.ExtractionAlgorithm algorithm)
    Returns a specific table from a PDF document.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • DEFAULT_ALGORITHM

      public static final technology.tabula.extractors.ExtractionAlgorithm DEFAULT_ALGORITHM
  • Constructor Details

    • PdfUtils

      public PdfUtils()
  • Method Details

    • extractTablesFromPDF

      public static List<technology.tabula.Table> extractTablesFromPDF(org.apache.pdfbox.pdmodel.PDDocument document)
      Returns a list of tables found in a given PDF document. Several extraction algorithms are available; this method uses the default, the Basic Extraction Algorithm.
      Parameters:
      document - The input PDF document to search for tables
      Returns:
      A list of tables found in the document.
    • extractTablesFromPDF

      public static List<technology.tabula.Table> extractTablesFromPDF(org.apache.pdfbox.pdmodel.PDDocument document, technology.tabula.extractors.ExtractionAlgorithm algorithm)
      Returns a list of tables found in a given PDF document. Several extraction algorithms are available; this method lets the caller select which one to use.
      Parameters:
      document - The input PDF document to search for tables
      algorithm - The extraction algorithm
      Returns:
      A list of tables found in the document.
    • getSpecificTable

      public static technology.tabula.Table getSpecificTable(org.apache.pdfbox.pdmodel.PDDocument document, int tableIndex, technology.tabula.extractors.ExtractionAlgorithm algorithm)
      Returns a specific table from a PDF document. Returns null if the requested table does not exist. If there is an error with the document, the method throws a UserException.
      Parameters:
      document - The source PDF document
      tableIndex - The index of the desired table
      algorithm - The extraction algorithm
      Returns:
      The desired Table, or null if the table index is not valid or the document has no tables.
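      Because a missing table is reported as null rather than as an exception, callers need a null check before using the result. The following is a minimal sketch of that contract using only the standard library. It is not the Drill implementation: TableLookupSketch and getSpecific are hypothetical names, a plain List stands in for the extracted tables, and zero-based indexing is an assumption that the description above does not confirm.

```java
import java.util.List;

// Hypothetical stand-in: any list element plays the role of an extracted table.
final class TableLookupSketch {

    // Returns the element at tableIndex, or null when the list is null or empty
    // or the index is out of range (assumed zero-based), mirroring the
    // documented "null if the table does not exist" contract.
    static <T> T getSpecific(List<T> tables, int tableIndex) {
        if (tables == null || tableIndex < 0 || tableIndex >= tables.size()) {
            return null;
        }
        return tables.get(tableIndex);
    }

    public static void main(String[] args) {
        List<String> tables = List.of("table-0", "table-1");

        String found = getSpecific(tables, 1);
        String missing = getSpecific(tables, 5);

        System.out.println(found);
        // The caller must handle the null case explicitly.
        System.out.println(missing == null ? "no such table" : missing);
    }
}
```

      A caller of getSpecificTable would apply the same null check to the returned Table before reading rows from it.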
    • extractFirstRowValues

      public static List<String> extractFirstRowValues(technology.tabula.Table table)
      Returns the values contained in the first row of a PDF table.
      Parameters:
      table - The source table
      Returns:
      A list of the header row values.
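      One common use for first-row values is as column names for the remaining rows. The sketch below illustrates that idea with plain lists of Strings standing in for table rows. HeaderRowSketch and toRecords are hypothetical names, and nothing here is taken from the Drill source; it only shows how a caller might pair header values with data values.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: rows are modeled as lists of Strings.
final class HeaderRowSketch {

    // Treats rows.get(0) as the header and maps each later row's values to
    // those names. Values beyond the header width are ignored, and columns
    // with no value in a short row are left out of that row's map.
    static List<Map<String, String>> toRecords(List<List<String>> rows) {
        List<Map<String, String>> records = new ArrayList<>();
        if (rows.isEmpty()) {
            return records;
        }
        List<String> header = rows.get(0);
        for (List<String> row : rows.subList(1, rows.size())) {
            Map<String, String> record = new LinkedHashMap<>();
            int width = Math.min(header.size(), row.size());
            for (int i = 0; i < width; i++) {
                record.put(header.get(i), row.get(i));
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        List<List<String>> rows = List.of(
            List.of("id", "name"),
            List.of("1", "Ada"),
            List.of("2", "Grace"));
        // prints [{id=1, name=Ada}, {id=2, name=Grace}]
        System.out.println(toRecords(rows));
    }
}
```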
    • getRowAsStringList

      public static List<String> getRowAsStringList(technology.tabula.Table table, int rowIndex)
      Returns the contents of a specific row in a PDF table as a list of Strings.
      Parameters:
      table - The table containing the data.
      rowIndex - The desired row index
      Returns:
      A list of Strings with the data.
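      Turning a row into Strings amounts to selecting the row by index and reading the text of each cell in order. The sketch below models that with standard-library types only. RowToStringsSketch and its Cell interface are hypothetical stand-ins for a Tabula table and its cells, and the assumption that each cell exposes its text through a getText() accessor is made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class RowToStringsSketch {

    // Hypothetical stand-in for a table cell that exposes its text.
    interface Cell {
        String getText();
    }

    // Converts one row of cells into a list of their text values, in order.
    static List<String> rowToStrings(List<? extends Cell> row) {
        List<String> values = new ArrayList<>(row.size());
        for (Cell cell : row) {
            values.add(cell.getText());
        }
        return values;
    }

    // Selects the row at rowIndex from a table modeled as a list of rows,
    // then converts it. An out-of-range index throws
    // IndexOutOfBoundsException in this sketch.
    static List<String> rowAsStringList(List<? extends List<? extends Cell>> rows, int rowIndex) {
        return rowToStrings(rows.get(rowIndex));
    }

    public static void main(String[] args) {
        Cell id = () -> "id";
        Cell name = () -> "name";
        Cell one = () -> "1";
        Cell ada = () -> "Ada";

        List<List<Cell>> table = List.of(List.of(id, name), List.of(one, ada));
        System.out.println(rowAsStringList(table, 1));
    }
}
```

      Splitting the work into a row lookup and a cell-to-text conversion matches the shape of the getRow, convertRowToStringArray, and getRowAsStringList signatures on this page, although how those methods are actually composed is not stated here.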
    • convertRowToStringArray

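      public static List<String> convertRowToStringArray(List<technology.tabula.RectangularTextContainer> input)
      Converts a row of table cells into a list of Strings.
      Parameters:
      input - The row of cells to convert
      Returns:
      A list of Strings with the data.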
    • getRow

      public static List<technology.tabula.RectangularTextContainer> getRow(technology.tabula.Table table, int rowIndex)
      Returns the contents of a specific row in a PDF table as a list of RectangularTextContainer cells.
      Parameters:
      table - The table containing the data.
      rowIndex - The desired row index
      Returns:
      A list of RectangularTextContainer cells with the row data.