Package org.apache.drill.exec.store.pdf
Class PdfUtils
java.lang.Object
org.apache.drill.exec.store.pdf.PdfUtils
-
Field Summary
Modifier and Type, Field, Description
static final technology.tabula.extractors.ExtractionAlgorithm DEFAULT_ALGORITHM
-
Constructor Summary
-
Method Summary
Modifier and Type, Method, Description
convertRowToStringArray(List<technology.tabula.RectangularTextContainer> input)
extractFirstRowValues(technology.tabula.Table table)
    Returns the values contained in the first row of a PDF table.
static List<technology.tabula.Table> extractTablesFromPDF(org.apache.pdfbox.pdmodel.PDDocument document)
    Returns a list of tables found in a given PDF document.
static List<technology.tabula.Table> extractTablesFromPDF(org.apache.pdfbox.pdmodel.PDDocument document, technology.tabula.extractors.ExtractionAlgorithm algorithm)
    Returns a list of tables found in a given PDF document.
static List<technology.tabula.RectangularTextContainer> getRow(technology.tabula.Table table, int rowIndex)
    Returns the contents of a specific row in a PDF table as a list of RectangularTextContainer cells.
getRowAsStringList(technology.tabula.Table table, int rowIndex)
    Returns the contents of a specific row in a PDF table as a list of Strings.
static technology.tabula.Table getSpecificTable(org.apache.pdfbox.pdmodel.PDDocument document, int tableIndex, technology.tabula.extractors.ExtractionAlgorithm algorithm)
    Returns a specific table from a PDF document.
-
Field Details
-
DEFAULT_ALGORITHM
public static final technology.tabula.extractors.ExtractionAlgorithm DEFAULT_ALGORITHM
-
-
Constructor Details
-
PdfUtils
public PdfUtils()
-
-
Method Details
-
extractTablesFromPDF
public static List<technology.tabula.Table> extractTablesFromPDF(org.apache.pdfbox.pdmodel.PDDocument document)
Returns a list of tables found in a given PDF document. There are several extraction algorithms available, and this function uses the default Basic Extraction Algorithm.
Parameters:
document - The input PDF document to search for tables.
Returns:
A list of tables found in the document.
-
extractTablesFromPDF
public static List<technology.tabula.Table> extractTablesFromPDF(org.apache.pdfbox.pdmodel.PDDocument document, technology.tabula.extractors.ExtractionAlgorithm algorithm)
Returns a list of tables found in a given PDF document. There are several extraction algorithms available, and this function allows the user to select which one to use.
Parameters:
document - The input PDF document to search for tables.
algorithm - The extraction algorithm.
Returns:
A list of tables found in the document.
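The relationship between the two overloads can be illustrated with a minimal, self-contained sketch. It does not use the Tabula or PDFBox classes: Extractor, DEFAULT_EXTRACTOR, and extractTables are hypothetical stand-ins, pages are modeled as plain strings, and the delegation from the one-argument overload to the two-argument one is an assumption about a plausible arrangement, not the confirmed implementation.

```java
import java.util.ArrayList;
import java.util.List;

final class AlgorithmChoiceSketch {
  /** Hypothetical stand-in for an extraction algorithm: maps one page to the tables found on it. */
  interface Extractor {
    List<String> extract(String page);
  }

  /** Stand-in default strategy (assumption): treats each whole page as a single table. */
  static final Extractor DEFAULT_EXTRACTOR = page -> List.of(page);

  /** One-argument overload: falls back to the default strategy. */
  static List<String> extractTables(List<String> pages) {
    return extractTables(pages, DEFAULT_EXTRACTOR);
  }

  /** Two-argument overload: the caller chooses the strategy. */
  static List<String> extractTables(List<String> pages, Extractor extractor) {
    List<String> tables = new ArrayList<>();
    for (String page : pages) {
      // Collect every table the chosen strategy finds on this page.
      tables.addAll(extractor.extract(page));
    }
    return tables;
  }
}
```

With these stand-ins, a caller that is happy with the default passes only the pages, while a caller that needs different behavior passes its own Extractor and leaves the rest of the calling code unchanged.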
-
getSpecificTable
public static technology.tabula.Table getSpecificTable(org.apache.pdfbox.pdmodel.PDDocument document, int tableIndex, technology.tabula.extractors.ExtractionAlgorithm algorithm)
Returns a specific table from a PDF document. Returns null if the user requests a table that does not exist. If there is an error with the document, the function throws a UserException.
Parameters:
document - The source PDF document.
tableIndex - The index of the desired table.
algorithm - The extraction algorithm.
Returns:
The desired Table, or null if the table index is not valid or the document has no tables.
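Because getSpecificTable can return null, callers need a guard before using the result. The sketch below models that contract with hypothetical stand-ins (TableLookupSketch, tableAt, findTable) over a plain List rather than a PDDocument. Treating a negative index as out of range is an assumption, since the description only states that null is returned for a table that does not exist.

```java
import java.util.List;
import java.util.Optional;

final class TableLookupSketch {
  /**
   * Hypothetical stand-in for the lookup: returns the table at tableIndex,
   * or null when the list is empty or the index is out of range.
   */
  static <T> T tableAt(List<T> tables, int tableIndex) {
    if (tables == null || tableIndex < 0 || tableIndex >= tables.size()) {
      return null; // assumption: negative indexes are treated like any other missing table
    }
    return tables.get(tableIndex);
  }

  /** Wraps the nullable result so that callers cannot forget the null check. */
  static <T> Optional<T> findTable(List<T> tables, int tableIndex) {
    return Optional.ofNullable(tableAt(tables, tableIndex));
  }
}
```

Wrapping the result in Optional is one design choice among several; an explicit `if (table == null)` check at the call site serves the same purpose.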
-
extractFirstRowValues
Returns the values contained in the first row of a PDF table.
Parameters:
table - The source table.
Returns:
A list of the values in the first (header) row.
-
getRowAsStringList
Returns the contents of a specific row in a PDF table as a list of Strings.
Parameters:
table - The table containing the data.
rowIndex - The desired row index.
Returns:
A list of Strings with the data.
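Converting a row of cells to strings might look like the following sketch. Cell is a hypothetical stand-in for technology.tabula.RectangularTextContainer that models only a text accessor, and the table is modeled as a list of rows. The helper names are illustrative, and the behavior for an out-of-range rowIndex is not specified here, so the sketch simply lets List.get throw.

```java
import java.util.ArrayList;
import java.util.List;

final class RowTextSketch {
  /** Hypothetical stand-in for a table cell; only the text accessor is modeled. */
  interface Cell {
    String getText();
  }

  /** Maps each cell in a row to its text, preserving column order. */
  static List<String> convertRowToStrings(List<? extends Cell> row) {
    List<String> values = new ArrayList<>(row.size());
    for (Cell cell : row) {
      values.add(cell.getText());
    }
    return values;
  }

  /** Looks up one row by index and returns its cell text as strings. */
  static List<String> getRowAsStrings(List<List<Cell>> rows, int rowIndex) {
    // Assumption: an invalid rowIndex is reported by List.get's IndexOutOfBoundsException.
    return convertRowToStrings(rows.get(rowIndex));
  }
}
```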
-
convertRowToStringArray
-
getRow
public static List<technology.tabula.RectangularTextContainer> getRow(technology.tabula.Table table, int rowIndex)
Returns the contents of a specific row in a PDF table as a list of RectangularTextContainer cells.
Parameters:
table - The table containing the data.
rowIndex - The desired row index.
Returns:
A list of the cells in the requested row.
-