API Reference
The APIs are segregated into 3 modules:
- Common
- COS
- PD
The Common module provides general system access, file access and parsing APIs.
The COS module represents the low-level file format of PDF. Carousel Object Structure (COS) was the original term used inside Adobe; Carousel was the project that later became Acrobat. The COS layer holds the object structure, the object definitions and the cross references used to access them.
The PD module is the higher-level document access layer. Accessing PDF pages, extracting their content, or understanding how a document renders using fonts or image objects is typically done in this layer.
A detailed explanation of these layers and their rationale is given in the Architecture and Design section.
Common
PDFIO.Common.CDTextString
— Type CDTextString
The PDF file format provides two primary string types: the hexadecimal string CosXString and the literal string CosLiteralString. However, these are mere binary representations of strings without any encoding associated for semantic interpretation. The encoding is mostly determined by the associated fonts and character maps in the content stream. There are also strings used in descriptions and other attributes of a PDF file where no font or mapping information is provided. CDTextString represents the string type in such situations. Typically, strings in PDFs are of 3 types:
1. Text string
   a. PDFDocEncoded string - similar to ISO_8859-1
   b. UTF-16BE string
2. ASCII string
3. Byte string - pure binary data, no interpretation
Types 1 and 2 can be represented by CDTextString. convert methods are provided to translate a CosString to a CDTextString.
Ref: PDF Specification Section 7.9.2
Note: Internally, CDTextString is a String object of Julia.
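To make the two text string encodings concrete, here is a minimal sketch in plain Julia. It is not PDFIO's implementation: `decode_text_string` is a hypothetical helper, and the single-byte branch assumes Latin-1, which only approximates PDFDocEncoding.

```julia
# Hypothetical helper (not part of PDFIO): decode the raw bytes of a PDF text string.
function decode_text_string(bytes::Vector{UInt8})
    # A leading byte order mark 0xFE 0xFF marks the string as UTF-16BE.
    if length(bytes) >= 2 && bytes[1] == 0xFE && bytes[2] == 0xFF
        body = bytes[3:end]
        # Combine each big-endian byte pair into one UTF-16 code unit.
        units = UInt16[(UInt16(body[i]) << 8) | UInt16(body[i + 1])
                       for i in 1:2:(length(body) - 1)]
        return transcode(String, units)
    end
    # Assumption: treat every byte as a Latin-1 code point. PDFDocEncoding
    # differs from Latin-1 for some byte values, so this is an approximation.
    return String(Char.(bytes))
end

utf16_text = decode_text_string(UInt8[0xFE, 0xFF, 0x00, 0x41, 0x00, 0x42])
single_byte_text = decode_text_string(UInt8[0x41, 0xE9])
```

A byte string (type 3) should not be passed through a helper like this at all, since it carries no text interpretation.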
PDFIO.Common.CDDate
— Type CDDate
Internally represented as string objects, these are timezone enabled date and time objects.
PDF files support the string format: (D:YYYYMMDDHHmmSSOHH'mm)
PDFIO.Common.CDDate
— Method CDDate(s::CDTextString)
PDF files support the string format: (D:YYYYMMDDHHmmSSOHH'mm)
Example
julia> date = CDDate("D:20190425173659+05'30")
D:20190425173659+05'30
julia> date.d
2019-04-25T17:36:59
julia> date.tz
5 hours, 30 minutes
julia> date.ahead
true
PDFIO.Common.getUTCTime
— Function getUTCTime(d::CDDate) -> CDDate
Removes the timezone information and returns the CDDate at UTC.
Example
julia> getUTCTime(CDDate("D:20190425173659+05'30"))
D:20190425120659Z
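The conversion shown above can be reproduced with Julia's Dates standard library. The following is a minimal sketch, not PDFIO's implementation: `parse_pdf_date` is a hypothetical helper that only accepts the full form of the date string with an optional timezone part.

```julia
using Dates

# Hypothetical helper (not part of PDFIO): split a date string of the form
# D:YYYYMMDDHHmmSSOHH'mm into the local DateTime and the signed UTC offset.
function parse_pdf_date(s::AbstractString)
    m = match(r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})(?:Z|([+-])(\d{2})'(\d{2})'?)?$", s)
    m === nothing && throw(ArgumentError("unsupported PDF date string: " * s))
    y, mo, d, h, mi, sec = (parse(Int, m.captures[i]) for i in 1:6)
    local_time = DateTime(y, mo, d, h, mi, sec)
    # A missing timezone part or a plain Z is treated as a zero offset.
    sign = m.captures[7] == "-" ? -1 : 1
    hh = m.captures[8] === nothing ? 0 : parse(Int, m.captures[8])
    mm = m.captures[9] === nothing ? 0 : parse(Int, m.captures[9])
    return local_time, Minute(sign * (60 * hh + mm))
end

local_time, offset = parse_pdf_date("D:20190425173659+05'30")
# Local time is ahead of UTC by the offset, so UTC is local time minus the offset.
utc_time = local_time - offset
```

For the date used in the examples above, `utc_time` is 2019-04-25T12:06:59, which agrees with the getUTCTime output.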
PDFIO.Common.CDRect
— Type CDRect
CosArray representation of a rectangle in the lower left and upper right point format.
Note: CDRect maps to a Rect object in the Rectangle package.
Example
julia> CDRect(CosArray(CosObject[CosInt(0), CosInt(0), CosInt(840), CosFloat(640)]))
Rect:[0.0 0.0 840.0 640.0]
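As an illustration of the lower left and upper right point format, the sketch below derives the width and height from the four corner values. It works on a plain vector, and `rect_size` is a hypothetical helper rather than a PDFIO or Rectangle function.

```julia
# Hypothetical helper: corner values are ordered [llx, lly, urx, ury],
# matching the order of the entries in the CosArray above.
function rect_size(corners::AbstractVector{<:Real})
    length(corners) == 4 || throw(ArgumentError("expected 4 corner values"))
    llx, lly, urx, ury = corners
    return (width = urx - llx, height = ury - lly)
end

page_size = rect_size(Float32[0, 0, 840, 640])
```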
COS Objects
PDFIO.Cos.CosObject
— Type CosObject
PDF is a structured document format with many internal data structures such as dictionaries, arrays and trees. CosObject is the interface to access these objects and gather detailed information from them. Although defined in the COS layer, objects of these types are returned from almost all the APIs. Hence, the objects are significant whether or not you use the Cos layer directly. Below is the object hierarchy.
CosObject                 Abstract
  CosNull                 Value (CosNullType)
  CosString               Abstract
  CosName                 Concrete
  CosNumeric              Abstract
    CosInt                Concrete
    CosFloat              Concrete
  CosBoolean              Concrete
    CosTrue               Value (CosBoolean)
    CosFalse              Value (CosBoolean)
  CosDict                 Concrete
  CosArray                Concrete
  CosStream               Concrete (always wrapped as an indirect object)
  CosIndirectObjectRef    Concrete (only useful when CosDoc is available)
Note: As a user of the reader API, you may not need to instantiate any of the CosObject types. They are normally populated as a result of parsing a PDF file.
PDFIO.Cos.CosNull
— Constant CosNull
PDF representation of a null object. Can be applied to CosObject of any type.
PDFIO.Cos.CosString
— Type CosString
Abstract type that represents a PDF string. In PDF, string objects are mere byte representations. They translate to actual text strings only by application of fonts and associated encodings.
PDFIO.Cos.CosName
— Type CosName
Name objects are symbols used in PDF documents.
PDFIO.Cos.@cn_str
— Macro @cn_str(str) -> CosName
A string decorator for easier instantiation of a CosName
Example:
julia> cn"Name"
/Name
PDFIO.Cos.CosNumeric
— Type
PDFIO.Cos.CosInt
— Type CosInt
An integer in PDF document.
PDFIO.Cos.CosFloat
— Type CosFloat
A numeric float data type.
PDFIO.Cos.CosBoolean
— Type CosBoolean
A boolean object in PDF which is either CosTrue or CosFalse.
PDFIO.Cos.CosDict
— Type CosDict
Name-value pairs of PDF objects. The object is very similar to the Dict object. The key has to be of CosName type.
PDFIO.Cos.set!
— Method set!(dict::CosDict, name::CosName, obj::CosObject) -> CosDict
set!(stm::CosStream, name::CosName, obj::CosObject) -> CosStream
Sets the value on a dictionary object. Setting a CosNull object deletes the entry from the dictionary.
In case of CosStream objects, the data is added to the extent dictionary.
Example
julia> set!(catalog, cn"Version", cn"1.4")
<<
...
/Version /1.4
...
>>
PDFIO.Cos.CosArray
— Type CosArray
An array in a PDF file. The objects can be any combination of CosObject.
Base.length
— Method length(o::CosArray) -> Int
Number of elements in CosArray
Example
julia> a = CosArray(CosObject[CosInt(1), CosFloat(2f0),
CosInt(3), CosFloat(4f0)])
[1 2.0 3 4.0 ]
julia> length(a)
4
PDFIO.Cos.CosStream
— Type CosStream
A stream object in a PDF. Stream objects have an extent dictionary, followed by binary data.
PDFIO.Cos.CosIndirectObjectRef
— Type CosIndirectObjectRef
A parsed data structure that stores the reference (object number and generation number) to an indirect object. It has no meaning without an associated CosDoc. When a reference object is encountered, the object should be looked up in the CosDoc and returned.
Base.get
— Function get(o::CosObject) -> val
get(o::CosIndirectObjectRef) -> (objnum, gennum)
get(o::CosArray, isNative=false) -> Vector{CosObject}
An array in a PDF file. The objects can be any combination of CosObject.
isNative = true will return the underlying native objects inside the CosArray by invoking the get method on each of them.
Example
julia> a = CosArray(CosObject[CosInt(1), CosFloat(2f0), CosInt(3), CosFloat(4f0)])
[1 2.0 3 4.0 ]
julia> get(a)
4-element Array{CosObject,1}:
1
2.0
3
4.0
julia> get(a, true)
4-element Array{Real,1}:
1
2.0f0
3
4.0f0
get(dict::CosDict, name::CosName, defval::T = CosNull) where T ->
Union{CosObject, T}
get(stm::CosStream, name::CosName, defval::T = CosNull) where T ->
Union{CosObject, T}
Returns the value as a CosObject for the key name, or the defval provided.
In case of CosStream objects, the data is collected from the extent dictionary.
Example
julia> get(catalog, cn"Version")
null
julia> get(catalog, cn"Version", cn"1.4")
/1.4
julia> get(catalog, cn"Version", "1.4")
"1.4"
get(stm::CosStream) -> IO
Decodes the stream and provides the output as an IO.
Example
julia> stm
448 0 obj
<<
/FFilter /FlateDecode
/F (/tmp/tmpIyGPhL/tmp9hwwaG)
/Length 437
>>
stream
...
endstream
endobj
julia> io = get(stm)
IOBuffer(data=UInt8[...], readable=true, writable=true, ...)
PD
PDFIO.PD.PDDoc
— Type PDDoc
An in-memory representation of a PDF document. Mostly used as an opaque handle to be passed on to other methods.
See pdDocOpen.
PDFIO.PD.pdDocOpen
— Function pdDocOpen(filepath::AbstractString) -> PDDoc
Opens a PDF document and provides the PDDoc document object for subsequent queries into the PDF file. filepath is the path to the PDF file, in either relative or absolute form.
Remember to release the document with pdDocClose once the object is no longer required. Although doc has certain members, it should normally be considered an opaque handle.
Example
julia> doc = pdDocOpen("test/PDFTest-0.0.4/stillhq/3.pdf")
PDDoc ==>
CosDoc ==>
filepath: /home/sambit/.julia/dev/PDFIO/test/PDFTest-0.0.4/stillhq/3.pdf
size: 817945
hasNativeXRefStm: false
Trailer dictionaries:
<<
/Info 146 0 R
/Prev 814755
/Size 163
/Root 154 0 R
/ID [<2ff783c9846ab546bd49f709cb7be307> <2ff783c9846ab546bd49f709cb7be307> ]
>>
<<
/Size 153
/ID [<2ff783c9846ab546bd49f709cb7be307> <2ff783c9846ab546bd49f709cb7be307> ]
>>
Catalog:
154 0 obj
<<
/Type /Catalog
/Pages 152 0 R
>>
endobj
isTagged: none
PDFIO.PD.pdDocClose
— Function pdDocClose(doc::PDDoc) -> Nothing
Reclaims the resources associated with a PDDoc object. Once called, the PDDoc object cannot be used any further.
Example
julia> pdDocClose(doc)
PDFIO.PD.pdDocGetCatalog
— Function pdDocGetCatalog(doc::PDDoc) -> CosObject
Catalog is considered the topmost-level object in a PDF document; it is subsequently used to traverse and extract information from the document. To be used for accessing PDF internal objects from the document structure when no direct API is available.
Example
julia> pdDocGetCatalog(doc)
154 0 obj
<<
/Pages 152 0 R
/Type /Catalog
>>
endobj
PDFIO.PD.pdDocGetNamesDict
— Function pdDocGetNamesDict(doc::PDDoc) -> CosObject
Some information in a PDF is stored as name and value pairs that are not necessarily held in a dictionary. They are all aggregated and can be accessed from one names dictionary object in the document catalog. This method provides access to such values in a PDF file. Not all PDF documents have a names dictionary. In such cases, a CosNull object may be returned.
Please refer to the PDF specification for further details.
Example
julia> pdDocGetNamesDict(doc)
220 0 obj
<<
/IDS 123 0 R
/Dests 119 0 R
/URLS 124 0 R
>>
endobj
PDFIO.PD.pdDocGetInfo
— Function pdDocGetInfo(doc::PDDoc) -> Dict
Given a PDF document, provides the document information available in the Document Info dictionary. The information typically includes creation date, modification date, author, creator used etc. However, none of these entries is mandatory. Hence, not all the information needed may be available in a document. If the document does not have an Info dictionary at all, this method returns nothing.
Please refer to the PDF specification for further details.
Example
julia> pdDocGetInfo(doc)
Dict{String,Union{CDDate, String, CosObject}} with 7 entries:
"Subject" => "AU-B Australian Documents"
"Producer" => "HPA image bureau 1998-1999"
"Author" => "IP Australia"
"ModDate" => D:19990527113911Z
"Keywords" => "Patents"
"Creator" => "HPA image bureau 1998-1999"
"Title" => "199479714D"
PDFIO.PD.pdDocGetCosDoc
— Function pdDocGetCosDoc(doc::PDDoc) -> CosDoc
The PDF document format is developed in two layers. The logical PDF document information is represented over a physical file structure called COS. CosDoc is an access object to the physical file structure of the PDF document. To be used for accessing PDF internal objects from the document structure when no direct API is available.
One can access any aspect of a PDF using the COS level APIs alone. However, they may require you to know the PDF specification in detail, and they are not the most intuitive.
Example
julia> cosdoc = pdDocGetCosDoc(doc)
CosDoc ==>
filepath: /home/sambit/.julia/dev/PDFIO/test/PDFTest-0.0.4/stillhq/3.pdf
size: 817945
hasNativeXRefStm: false
Trailer dictionaries:
<<
/ID [<2ff783c9846ab546bd49f709cb7be307> <2ff783c9846ab546bd49f709cb7be307> ]
/Size 163
/Prev 814755
/Info 146 0 R
/Root 154 0 R
>>
<<
/ID [<2ff783c9846ab546bd49f709cb7be307> <2ff783c9846ab546bd49f709cb7be307> ]
/Size 153
>>
PDFIO.PD.pdDocGetPage
— Function pdDocGetPage(doc::PDDoc, num::Int) -> PDPage
pdDocGetPage(doc::PDDoc, ref::CosIndirectObjectRef) -> PDPage
Given a document absolute page number or object reference, provides the associated page object.
Example
julia> page = pdDocGetPage(doc, 1)
PDFIO.PD.PDPageImpl(...)
julia> page = pdDocGetPage(doc, CosIndirectObjectRef(155, 0))
PDFIO.PD.PDPageImpl(...)
PDFIO.PD.pdDocGetPageCount
— Function pdDocGetPageCount(doc::PDDoc) -> Int
Returns the number of pages associated with the document.
Example
julia> pdDocGetPageCount(doc)
30
PDFIO.PD.pdDocGetPageRange
— Function pdDocGetPageRange(doc::PDDoc, nums::AbstractRange{Int}) -> Vector{PDPage}
pdDocGetPageRange(doc::PDDoc, label::AbstractString) -> Vector{PDPage}
Given a range of page numbers or a label, returns an array of pages associated with it. For a detailed explanation of page labels, refer to the method pdDocHasPageLabels.
Example
julia> pages = pdDocGetPageRange(doc, 1:4);
julia> typeof(pages)
Array{PDFIO.PD.PDPageImpl,1}
julia> length(pages)
4
PDFIO.PD.pdDocHasPageLabels
— Function pdDocHasPageLabels(doc::PDDoc) -> Bool
Returns true if the document has page labels defined.
As per PDF Specification 1.7 Section 12.4.2, a document may optionally define page labels (PDF 1.3) to identify each page visually on the screen or in print. Page labels and page indices need not coincide: the indices shall be fixed, running consecutively through the document starting from 0 for the first page, but the labels may be specified in any way that is appropriate for the particular document.
Example
julia> PDFIO.PD.pdDocHasPageLabels(doc)
false
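To illustrate how fixed page indices and free-form labels can diverge, here is a minimal sketch in plain Julia. It does not use PDFIO and does not read a real page label tree: `LabelRange`, `roman` and `page_label` are hypothetical names, and only decimal and lowercase roman styles are modelled.

```julia
# Hypothetical model of one labelling range (not a PDFIO type).
struct LabelRange
    first_index::Int   # zero-based index of the first page in the range
    style::Symbol      # :D for decimal, :r for lowercase roman numerals
    start::Int         # numeric value of the first label in the range
end

# Convert a positive integer to lowercase roman numerals.
function roman(n::Int)
    n > 0 || throw(ArgumentError("roman numerals need a positive integer"))
    vals = (1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1)
    syms = ("m", "cm", "d", "cd", "c", "xc", "l", "xl", "x", "ix", "v", "iv", "i")
    out = IOBuffer()
    for (v, s) in zip(vals, syms)
        while n >= v
            print(out, s)
            n -= v
        end
    end
    return String(take!(out))
end

# Label for the page at zero-based `index`; `ranges` must be sorted by first_index.
function page_label(ranges::Vector{LabelRange}, index::Int)
    k = findlast(rng -> rng.first_index <= index, ranges)
    # Assumption of this sketch: pages outside every range get their ordinal.
    k === nothing && return string(index + 1)
    r = ranges[k]
    n = r.start + (index - r.first_index)
    return r.style == :r ? roman(n) : string(n)
end

# Four front-matter pages in roman numerals, then decimal numbering from 1.
ranges = [LabelRange(0, :r, 1), LabelRange(4, :D, 1)]
labels = [page_label(ranges, i) for i in 0:5]
```

Here the page at index 4 carries the label "1", which is why an API needs both a page number and a label lookup.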
PDFIO.PD.pdDocGetPageLabel
— Function pdDocGetPageLabel(doc::PDDoc, pageno::Int) -> String
Returns the page label if the page has a page label associated with it.
As per PDF Specification 1.7 Section 12.4.2, a document may optionally define page labels (PDF 1.3) to identify each page visually on the screen or in print. Page labels and page indices need not coincide: the indices shall be fixed, running consecutively through the document starting from 0 for the first page, but the labels may be specified in any way that is appropriate for the particular document.
Example
julia> pdDocGetPageLabel(doc, 3)
"ii"
PDFIO.PD.pdDocGetOutline
— Function pdDocGetOutline(doc::PDDoc) -> PDOutline
Given a PDF document, provides the document Outline (Table of Contents) available in the Document Catalog dictionary. If the document does not have an Outline, this method returns nothing.
A PDF document may contain a document outline that the conforming reader may display on the screen, allowing the user to navigate interactively from one part of the document to another. The outline consists of a tree-structured hierarchy of outline items (sometimes called bookmarks), which serve as a visual table of contents to display the document’s structure to the user. The user may interactively open and close individual items by clicking them with the mouse. When an item is open, its immediate children in the hierarchy shall become visible on the screen; each child may in turn be open or closed, selectively revealing or hiding further parts of the hierarchy. When an item is closed, all of its descendants in the hierarchy shall be hidden. Clicking the text of any visible item activates the item, causing the conforming reader to jump to a destination or trigger an action associated with the item. - Section 12.3.3 - Document management — Portable document format — Part 1: PDF 1.7
Example
julia> outline = pdDocGetOutline(doc)
555 0 R
julia> iob = IOBuffer();
julia> using AbstractTrees; print_tree(iob, outline)
julia> write(stdout, iob.data)
Contents
├─ Table of Contents
├─ 1. Introduction
├─ 2. Quick Steps - Kernel Compile
│ ├─ 2.1. Precautionary Preparations
│ ├─ 2.2. Minor Upgrading of Kernel
│ ├─ 2.3. For the Impatient
│ ├─ 2.4. Building New Kernel - Explanation of Steps
│ ├─ 2.5. Troubleshooting
...
PDFIO.PD.pdDocHasSignature
— Function pdDocHasSignature(doc::PDDoc) -> Bool
Returns true when the document has at least one signature field.
This does not mean there is an actual digital signature embedded in the document. A PDF document can be signed and content can be approved by one or more reviewers. Signature fields are placeholders for storing and rendering such information.
Example
julia> pdDocHasSignature(doc)
true
PDFIO.PD.pdDocValidateSignatures
— Function pdDocValidateSignatures(doc::PDDoc; export_certs=false) -> Vector{Dict{Symbol, Any}}
Input
param | Description |
---|---|
doc | The document for which all the signatures are to be validated. |
export_certs | Optional keyword parameter. When set, exports all the certificates that are embedded in the PDF document. These certificates can be for end-entities or one or more certifying authorities. Certificates are exported to the file <PDF filename>.pem. |
Output
Vector of dictionary objects representing one dictionary object for each signature. The dictionary objects map the symbols to output as per the following table.
Symbol | Description |
---|---|
:Name | The name of the person or authority signing the document. |
:P | Object reference of the page in which the signature is found. |
:M | The CDDate when the document was signed. |
:certs | The certificates associated with every signature object. |
:subfilter | The subfilter of PDF signature object. |
:FQT | Fully qualified title of the signature form. |
:chain | The certificate chain that validated the signature. |
:passed | Validation status of the signature (true / false) |
:error_message | Error message returned during the validation |
:stacktrace | The stack dump of where the validation failure occurred |
Notes
- Any additional certificates needed for validating a certificate trust chain have to be added manually, in the PEM format, to the Trust Store file at <Package Directory>/data/certs/cacerts.pem. Normally, certificate authorities (root as well as intermediate) are represented in the trust store.
- Presence of an end-entity certificate in the Trust Store ensures that the chain validation for the certificate does not have to be carried out. However, this is not considered a good practice, as certificate validation is an important safeguard against security breaches in the chain. In case of self-signed certificates with no CA capabilities, this may be the only option.
- Validation of digital signatures is limited to approval signature validation as per section 12.8.1 of PDF Spec. 1.7. Signatures for permissions and usage rights are not validated by this method. This API only provides a validation report. It does not modify access to any parts of the document based on the validation output. The consumer of the API needs to take appropriate action based on the validation report as desired in their applications.
- Revocation - When time is embedded in the signature as a signing-time attribute or a signed timestamp, or the PDF signature dictionary has the M attribute, then those are picked up for validation. However, revocation information is not used during validation.
- PDF 2.0 Support - The support is only experimental. Some subfilters like /ETSI.CAdES.detached are supported, but Document Security Store (DSS) and Document Time Stamp (DTS) have not been implemented.
Example
julia> r = pdDocValidateSignatures(doc);
julia> r[1] # Failure case
Dict{Symbol,Any} with 8 entries:
:Name => "JAYANT KUMAR ARORA"
:P => 1 0 R
:M => D:20190425173659+05'30
:error_message => "Error in Crypto Library:
140322274480320:error:02001002:system library:..."
:subfilter => /adbe.pkcs7.sha1
:stacktrace => ["error(::String) at error.jl:33",
"openssl_error(::Int32) at PDCrypt.jl:96",
"PDFIO.PD.PDCertStore() at PDCrypt.jl:148",
...]
:FQT => "Signature1"
:passed => false
julia> r[1] # Passed case
Dict{Symbol,Any} with 8 entries:
:Name => "JAYANT KUMAR ARORA"
:P => 1 0 R
:M => D:20190425173659+05'30
:certs => Dict{Symbol,Any}[Certificate Parameters...]
:subfilter => /adbe.pkcs7.sha1
:FQT => "Signature1"
:chain => Dict{Symbol,Any}[Certificate Parameters...]
:passed => true
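Since the API only returns a report, the caller has to act on it. Below is a minimal sketch of turning such a report into one summary line per signature. It relies only on the documented keys :Name, :passed and :error_message; `summarize_signatures` is a hypothetical helper, and the report here is a hand-built stand-in for the output of pdDocValidateSignatures(doc).

```julia
# Hypothetical helper: one summary line per signature dictionary.
function summarize_signatures(report::Vector{Dict{Symbol,Any}})
    lines = String[]
    for (i, sig) in enumerate(report)
        # Use defaults so that a missing key does not abort the summary.
        name = get(sig, :Name, "<unknown signer>")
        if get(sig, :passed, false)
            push!(lines, "signature $i by $name: passed")
        else
            reason = get(sig, :error_message, "no error message")
            push!(lines, "signature $i by $name: FAILED ($reason)")
        end
    end
    return lines
end

# Hand-built stand-in shaped like the documented output (not real validation data).
report = Dict{Symbol,Any}[
    Dict{Symbol,Any}(:Name => "JAYANT KUMAR ARORA", :passed => true),
    Dict{Symbol,Any}(:Name => "JAYANT KUMAR ARORA", :passed => false,
                     :error_message => "Error in Crypto Library"),
]
summary = summarize_signatures(report)
foreach(println, summary)
```

Treating a missing :passed key as a failure is a deliberately conservative choice of this sketch.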
PDFIO.PD.pdPageGetContents
— Function pdPageGetContents(page::PDPage) -> CosObject
Page rendering objects are normally stored in a CosStream object in a PDF file. This method provides access to the stream object.
Please refer to the PDF specification for further details.
Example
julia> pdPageGetContents(page)
448 0 obj
<<
/Length 437
/FFilter /FlateDecode
/F (/tmp/tmpZnGGFn/tmp5J60vr)
>>
stream
...
endstream
endobj
PDFIO.PD.pdPageIsEmpty
— Function pdPageIsEmpty(page::PDPage) -> Bool
Returns true when the page has no associated content object.
Example
julia> pdPageIsEmpty(page)
false
PDFIO.PD.pdPageGetCosObject
— Function pdPageGetCosObject(page::PDPage) -> CosObject
PDF document format is developed in two layers. A logical PDF document information is represented over a physical file structure called COS. This method provides the internal COS object associated with the page object.
PDFIO.PD.pdPageGetContentObjects
— Function pdPageGetContentObjects(page::PDPage) -> CosObject
Page rendering objects are normally stored in a CosStream object in a PDF file. This method provides access to the stream object.
PDFIO.PD.pdPageGetMediaBox
— Function pdPageGetMediaBox(page::PDPage) -> CDRect{Float32}
pdPageGetCropBox(page::PDPage) -> CDRect{Float32}
Returns the media box associated with the page. See 14.11.2 PDF 1.7 Spec.
It is typically the designated size of the paper for the page. When a crop box is not defined, it defaults to the media box.
Example
julia> pdPageGetMediaBox(page)
Rect:[0.0 0.0 595.0 792.0]
julia> pdPageGetCropBox(page)
Rect:[0.0 0.0 595.0 792.0]
PDFIO.PD.pdPageGetFonts
— Function pdPageGetFonts(page::PDPage) -> Dict{CosName, PDFont}
Returns a dictionary of fonts in the page.
Example
julia> pdPageGetFonts(page)
Dict{CosName,PDFIO.PD.PDFont} with 4 entries:
/F0 => PDFont(…
/F4 => PDFont(…
/F8 => PDFont(…
/F9 => PDFont(…
PDFIO.PD.pdPageExtractText
— Function pdPageExtractText(io::IO, page::PDPage) -> IO
Extracts the text from the page. This extraction works best for tagged PDF files. For PDFs that are not tagged, some line and word breaks will not be extracted properly.
Example
Following code will extract the text from a full PDF file.
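using PDFIO

function getPDFText(src, out)
    # Open the document; the handle must be released once we are done.
    doc = pdDocOpen(src)
    try
        # Document metadata (may be nothing if there is no Info dictionary).
        docinfo = pdDocGetInfo(doc)
        open(out, "w") do io
            npage = pdDocGetPageCount(doc)
            for i = 1:npage
                # Extract the text of each page into the output file.
                page = pdDocGetPage(doc, i)
                pdPageExtractText(io, page)
            end
        end
        return docinfo
    finally
        # Close the document even if extraction fails part way.
        pdDocClose(doc)
    end
end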
PDFIO.PD.pdPageGetPageNumber
— Function pdPageGetPageNumber(page::PDPage)
Returns the page number of the document page.
Example
julia> pdPageGetPageNumber(page)
1
PDFIO.PD.pdFontIsBold
— Function pdFontIsBold(pdfont::PDFont) -> Bool
Returns true if the font has the specified attribute.
PDFIO.PD.pdFontIsItalic
— Function pdFontIsItalic(pdfont::PDFont) -> Bool
Returns true if the font has the specified attribute.
PDFIO.PD.pdFontIsFixedW
— Function pdFontIsFixedW(pdfont::PDFont) -> Bool
Returns true if the font has the specified attribute.
PDFIO.PD.pdFontIsAllCap
— Function pdFontIsAllCap(pdfont::PDFont) -> Bool
Returns true if the font has the specified attribute.
PDFIO.PD.pdFontIsSmallCap
— Function pdFontIsSmallCap(pdfont::PDFont) -> Bool
Returns true if the font has the specified attribute.
PDFIO.PD.PDOutline
— Type PDOutline
Representation of PDF document Outline (Table of Contents).
Use the methods from the AbstractTrees package to traverse the elements.
PDFIO.PD.PDOutlineItem
— Type PDOutlineItem
Representation of PDF document Outline item.
PDFIO.PD.PDDestination
— Type PDDestination
Used for a variety of purposes to locate a rectangular region in a PDF document. Particularly used in outlines, actions etc.
The structure can denote a location outside of a document as well, as in remote GoTo (GoToR) actions. In such cases, it is best used together with a filename. Moreover, page references have no meaning in remote file references. Hence, the pageno attribute has been set to Int, unlike the PDF Spec 32000-2008 12.3.2.2.
- pageno::Int - Page number location
- layout::CosName - Various view layouts are possible. Please review the PDF spec for details.
- values::Vector{Float32} - [left, bottom, right, top] sequence array. Not all values are used. The usage depends on the layout parameter.
- zoom::Float32 - Zoom value for the view. Can be zero depending on layout, where it is intrinsic and hence redundant.
PDFIO.PD.pdOutlineItemGetAttr
— Function pdOutlineItemGetAttr(item::PDOutlineItem) -> Dict{Symbol, Any}
Attributes stored with a PDOutlineItem object. The traversal parameters like Prev, Next, First, Last and Parent are stored with the structure.
The following keys are stored in the dictionary object returned:
- :Title - The title assigned to the item (shows up in the table of contents)
- :Count - A representation of the number of items open under the outline item. Please refer to the PDF Spec 32000-2008 section 12.3.2.2 for details. Mostly used for rendering on a user interface.
- :Destination - (filepath, PDDestination) value. Filepath is an empty string if the destination refers to a location in the same PDF file. This parameter is a combination of the /Dest and /A attributes in the PDF specification. The action element is analyzed, and the data is extracted and stored with the PDDestination as the final referred location.
- :C - The color of the outline in the DeviceRGB space.
- :F - Flags for title text rendering: italic=1, bold=2
Example
julia> pdOutlineItemGetAttr(outlineitem)
Dict{Symbol,Any} with 5 entries:
:F => 0x00
:Title => "Table of Contents"
:Count => 0
:Destination => ("", PDDestination(2, /XYZ, Float32[0.0, 0.0, 0.0, 756.0], 0.0))
:C => Float32[0.0, 0.0, 0.0]
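The :F value is a bit field, so the two flags can be tested independently. A minimal sketch in plain Julia, using the documented values italic=1 and bold=2; `title_style` and the constant names are hypothetical and not part of PDFIO.

```julia
# Flag values as documented for the :F attribute.
const OUTLINE_ITALIC = 0x01
const OUTLINE_BOLD   = 0x02

# Hypothetical helper: decode the flags into a named tuple of booleans.
title_style(flags::Integer) = (italic = (flags & OUTLINE_ITALIC) != 0,
                               bold   = (flags & OUTLINE_BOLD) != 0)

plain = title_style(0x00)        # the value shown in the example above
bold_italic = title_style(0x03)  # both bits set
```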
PDF Page objects
PDFIO.PD.PDPageObject
— Type PDPageObject
The content streams associated with PDF pages contain the objects that can be rendered. These objects are represented by PDPageObject. Such an object can be an operator in postfix notation, preceded by its operands, like:
(Draw this text) Tj
As can be seen above, the string object is a CosString which is an operand to the operator Tj (draw text). This class of objects is represented by PDPageElement.
However, there are certain objects which only provide grouping information or begin and end markers for grouping information. For example, a text object:
BT
/F1 11 Tf %selectfont
(Draw this text) Tj
ET
These kinds of objects are represented by PDPageObjectGroup. In this case, the PDPageObjectGroup contains four PDPageElement objects, represented by the operators BT, Tf, Tj and ET.
PDPageElement and PDPageObjectGroup can be extended by composition. Hence, there are more specialized objects that can be seen as well.
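The postfix convention described above can be sketched with a toy model in plain Julia. This is an illustration only: `Element`, `OPERATORS` and `group_elements` are hypothetical names, the tokens are already split, and PDFIO's own parser and types are considerably richer.

```julia
# Illustrative stand-in for a content element: one operator plus its operands.
struct Element
    operator::String
    operands::Vector{String}
end

# Operators known to this sketch (a tiny subset of the PDF operator set).
const OPERATORS = Set(["BT", "ET", "Tf", "Tj"])

# Operands accumulate until an operator arrives and closes the element.
function group_elements(tokens::Vector{String})
    elements = Element[]
    pending = String[]
    for tok in tokens
        if tok in OPERATORS
            push!(elements, Element(tok, pending))
            pending = String[]
        else
            push!(pending, tok)
        end
    end
    return elements
end

# The text object from the example above, already split into tokens.
tokens = ["BT", "/F1", "11", "Tf", "(Draw this text)", "Tj", "ET"]
elements = group_elements(tokens)
```

The result has four elements, one per operator, which matches the four PDPageElement objects of the text object group.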
PDFIO.PD.PDPageElement
— Type PDPageElement
A representation of a content object with operator and operands. See PDPageObject for more details.
PDFIO.PD.PDPageObjectGroup
— Type PDPageObjectGroup
A representation of a content object that encloses other content objects. See PDPageObject for more details.
PDFIO.PD.PDPageTextObject
— Type PDPageTextObject
A PDPageObjectGroup object that represents a block of text. See PDPageObject for more details.
PDFIO.PD.PDPageTextRun
— Type PDPageTextRun
In PDF, text may not be contiguous, as there may be a change of font, style or graphics rendering parameters. PDPageTextRun is a unit of text which can be rendered without any change to the graphical parameters. There is no guarantee that a text run will represent a meaningful word or sentence.
PDPageTextRun is a composition implementation of PDPageElement.
PDFIO.PD.PDPageMarkedContent
— Type PDPageMarkedContent
A PDPageObjectGroup object that represents a group of objects that are logically grouped together in a structured PDF document.
PDFIO.PD.PDPageInlineImage
— Type PDPageInlineImage
Most images in PDF documents are defined in the PDF document and referenced from the page content stream. PDPageInlineImage objects are directly defined in the page content stream.
PDFIO.PD.PDPage_BeginGroup
— Type PDPage_BeginGroup
A PDPageElement that represents the beginning of a group object.
PDFIO.PD.PDPage_EndGroup
— Type PDPage_EndGroup
A PDPageElement that represents the end of a group object.
COS Methods
PDFIO.Cos.CosDoc
— Type CosDoc
The PDF file structure defines how the objects are arranged in a PDF file. PDF is designed to be accessed in random-access order. Some objects in a PDF, such as fonts, can be referred to from multiple page objects. To address these concerns, objects are given reference identifiers, and mappings to them are provided from various locations in the PDF file. Moreover, to reduce the size of the files, objects can be put inside stream containers and compressed. Access to a specific object reference may need several lookups before the actual object can be traced. All this leads to a fairly complex arrangement of objects. CosDoc wraps all the object reference schemes and provides a simplified API called cosDocGetObject that simplifies object lookup. Any PDF object can be classified into the following forms based on how it is represented in a document:
- Direct Objects: Direct objects are defined where they are referred to or used.
- Indirect Objects: Indirect objects have reference identifiers; their location in a PDF document is described through an object reference identifier.
One can access any aspect of a PDF using the COS level APIs alone. However, they may require you to know the PDF specification in detail, and they are not the most intuitive.
PDFIO.Cos.cosDocOpen
— Function cosDocOpen(filepath::AbstractString) -> CosDoc
Provides access to the physical file and file structure of the PDF document. Returns a CosDoc which can subsequently be used for all queries into the PDF file. Remember to release the document with cosDocClose once the object is no longer needed.
PDFIO.Cos.cosDocClose
— Function cosDocClose(doc::CosDoc)
Reclaims all system resources consumed by the CosDoc. The CosDoc should not be used after this method is called. cosDocClose only needs to be called explicitly if you have opened the document with cosDocOpen. Documents opened with pdDocOpen do not need to use this method.
PDFIO.Cos.cosDocGetRoot
— Function cosDocGetRoot(doc::CosDoc) -> CosObject
The structural starting point of a PDF document. Also known as document catalog dictionary.
PDFIO.Cos.cosDocGetObject
— Function cosDocGetObject(doc::CosDoc, obj::CosObject) -> CosObject
PDF objects are distributed in the file and can be cross referenced from one location to another. This is called indirect object referencing. However, to extract actual information one needs access to the complete object (direct object). This method provides access to the direct object after searching for the object in the document structure. If an indirect object reference is passed as the obj parameter, the complete indirect object (the reference as well as all content of the object) is returned. A direct object passed to the method is returned as is, without any translation. This ensures the user does not have to check the type of the objects before accessing the contents.
Example
julia> cosDocGetObject(doc.cosDoc, CosIndirectObjectRef(555, 0))
555 0 obj
<<
/Count 18
/Last 629 0 R
/First 556 0 R
>>
endobj
julia> cosDocGetObject(doc.cosDoc, cn"DirectObject")
/DirectObject
cosDocGetObject(doc::CosDoc, dict::CosObject, key::Union{CosName, CosNullType}) -> CosObject
Returns the object referenced inside the dict dictionary.
- dict can be a PDF dictionary object reference, an indirect object or a direct CosDict object.
- key can be CosNull as well. In such a case, a replicated CosDict with direct or indirect objects will be returned for all the input dict keys.
Example
julia> catalog
652 0 obj
<<
/Outlines 555 0 R
/PageLayout /SinglePage
/PageMode /UseOutlines
/Pages 446 0 R
/Type /Catalog
/OpenAction [447 0 R /XYZ null null 0 ]
>>
endobj
julia> pages = cosDocGetObject(doc.cosDoc, catalog, cn"Pages")
446 0 obj
<<
/Kids [447 0 R 449 0 R 451 0 R]
/Count 3
/Type /Pages
>>
endobj
julia> cosDocGetObject(doc.cosDoc, catalog, cn"PageLayout")
/SinglePage
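The lookup rule described above (a reference is resolved through the document, a direct object is returned as is) maps naturally onto multiple dispatch. The following is a toy model in plain Julia, not PDFIO code: `ObjRef`, `ToyDoc` and `resolve` are hypothetical names, and plain Dict and String values stand in for the COS objects.

```julia
# Illustrative stand-ins; these are not PDFIO types.
struct ObjRef
    objnum::Int
    gennum::Int
end

struct ToyDoc
    objects::Dict{ObjRef,Any}   # plays the role of the cross-reference lookup
end

# A reference is looked up in the document; anything else is already direct.
resolve(doc::ToyDoc, ref::ObjRef) = doc.objects[ref]
resolve(doc::ToyDoc, obj) = obj

# Dictionary variant: resolve the container first, then the value under `key`.
resolve(doc::ToyDoc, dict, key::String) = resolve(doc, resolve(doc, dict)[key])

doc = ToyDoc(Dict{ObjRef,Any}(
    ObjRef(652, 0) => Dict{String,Any}("Pages" => ObjRef(446, 0),
                                       "PageLayout" => "/SinglePage"),
    ObjRef(446, 0) => Dict{String,Any}("Count" => 3),
))

catalog = resolve(doc, ObjRef(652, 0))
pages   = resolve(doc, catalog, "Pages")              # follows the reference
layout  = resolve(doc, ObjRef(652, 0), "PageLayout")  # value is already direct
```

Because both methods share one name, calling code never has to check whether it holds a reference or a direct object, which is the convenience the method above aims for.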