The examples directory has a few scripts which use the library. But that assumes several things, including that the personnel know how to operate the hardware and software. The cat.
The alter. One difference is that, since cat is creating a new PDF structure, and alter is attempting to modify an existing PDF structure, the PDF produced by alter and also by watermark.
For example, the alter. If you ever want to print something that is like a small booklet, but needs to be spiral bound, you either have to do some fancy rearranging, or just waste half your paper. But, every other page is flipped, so that you can print double-sided and the pages will line up properly and be pre-collated.
The copy. If you know about reportlab, you know that if you can faithfully render a PDF to a reportlab canvas, you can do pretty much anything else with that PDF you want. This kind of low level manipulation should be done only if you really need to. The philosophy of the library portion of pdfrw is to provide intuitive functions to read, manipulate, and write PDF files. There should be minimal leakage between abstraction layers, although getting useful work done makes "pure" functionality separation difficult.
A key concept supported by the library is the use of Form XObjects, which allow easy embedding of pieces of one PDF into another. Addition of core support to the library is typically done carefully and thoughtfully, so as not to clutter it up with too many special cases. There are a lot of incorrectly formatted PDFs floating around; support for these is added in some cases.
The decision is often based on what acroread and okular do with the PDFs; if they can display them properly, then eventually pdfrw should, too, if it is not too difficult or costly.
Contributions are welcome; one user has contributed some decompression filters and the ability to process PDF 1. Additional functionality that would obviously be useful includes additional decompression filters, the ability to process password-protected PDFs, and the ability to output linearized PDFs. The philosophy of the examples is to provide small, easily-understood examples that showcase pdfrw functionality. In general, PDF files conceptually map quite well to Python.
The major objects to think about are:. To flatten out a circular reference, an indirect object is referred to instead of being directly included in another object.
PDF files have a global mechanism for locating indirect objects, and they all have two reference numbers: a reference number and a "generation" number (in case you wanted to append to the PDF file rather than just rewriting the whole thing). When pdfrw encounters an indirect PDF file object, the corresponding Python object it creates will have an 'indirect' attribute with a value of True. When writing a PDF file, if you have created arbitrary data, you just need to make sure that circular references are broken up by putting an attribute named 'indirect' which evaluates to True on at least one object in every cycle.
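The cycle-breaking rule above can be illustrated with a small toy serializer. This is only a sketch under stated assumptions: `Node` and `serialize` are hypothetical names, the output format is simplified, and none of this is pdfrw's actual writer. It shows why a cycle made only of direct objects cannot be written, while marking one object in the cycle as indirect lets it be replaced by a reference.

```python
# Toy model only: this is NOT pdfrw's writer, and Node/serialize are
# hypothetical names used for illustration.

class Node(dict):
    """Dictionary-like object with an 'indirect' flag, False by default."""
    indirect = False


def serialize(root):
    """Return (text for root, {object number: body of each indirect object})."""
    numbers = {}  # id(object) -> object number
    bodies = {}   # object number -> serialized dictionary body

    def body(node, active):
        items = " ".join("%s %s" % (key, emit(value, active))
                         for key, value in node.items())
        return "<< %s >>" % items

    def emit(obj, active):
        if isinstance(obj, list):
            return "[%s]" % " ".join(emit(item, active) for item in obj)
        if not isinstance(obj, Node):
            return str(obj)
        if obj.indirect:
            # Indirect objects are written once and referenced by number,
            # so re-entering one never recurses forever.
            if id(obj) not in numbers:
                number = len(numbers) + 1
                numbers[id(obj)] = number
                bodies[number] = body(obj, frozenset())
            return "%d 0 R" % numbers[id(obj)]
        # Direct objects are inlined; meeting one again on the same path
        # means the cycle has no indirect object to break it.
        if id(obj) in active:
            raise ValueError("reference cycle contains no indirect object")
        return body(obj, active | {id(obj)})

    return emit(root, frozenset()), bodies


# A page pointing at its parent, and the parent listing the page: a cycle.
pages = Node({"/Type": "/Pages"})
pages.indirect = True          # breaks the cycle
page = Node({"/Type": "/Page", "/Parent": pages})
pages["/Kids"] = [page]
text, table = serialize(pages)
```

Here the generation number is always written as 0, which is a simplification for the sketch.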
Another PDF file concept that doesn't quite map to regular Python is a "stream". Streams are dictionaries which each have an associated unformatted data block. The usage model for pdfrw treats most objects as strings (it takes their string representation when writing them to a file).
The two main exceptions are the PdfArray object and the PdfDict object. PdfArray is a subclass of list with two special features. First, it has an 'indirect' attribute. Second, pdfrw reads files lazily, so PdfArray knows about, and resolves, references to other indirect objects on an as-needed basis. PdfDict is a subclass of dict that also has an indirect attribute and lazy reference resolution.
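The lazy, as-needed resolution described above can be sketched with a small list subclass. This is a toy illustration, not pdfrw's PdfArray: `Ref` and `LazyList` are hypothetical names, and only integer indexing and iteration are handled.

```python
# Toy illustration of lazy reference resolution; not pdfrw's PdfArray.

class Ref:
    """Hypothetical placeholder for an object that has not been read yet."""

    def __init__(self, loader):
        self.loader = loader  # zero-argument callable that returns the object


class LazyList(list):
    """List that resolves Ref placeholders the first time they are accessed."""
    indirect = False

    def __getitem__(self, index):
        value = list.__getitem__(self, index)
        if isinstance(value, Ref):
            # Resolve once, then cache the real object in place.
            value = value.loader()
            list.__setitem__(self, index, value)
        return value

    def __iter__(self):
        # Route iteration through __getitem__ so callers never see a Ref.
        for index in range(len(self)):
            yield self[index]


calls = []

def load_page():
    calls.append("read")
    return "page object"

arr = LazyList([Ref(load_page), 42])
first = arr[0]    # triggers the load
again = arr[0]    # served from the cached value
```

The point of the design is that client code only ever sees resolved objects, while the cost of reading is paid on first access.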
The subclassed IndirectPdfDict has indirect automatically set to True. PdfDict also has an optional associated stream. Usage of PdfDict objects is normally via attribute access, although non-standard names (though still with a leading slash) can be accessed via dictionary index lookup.
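The attribute-versus-index access pattern can be sketched with a plain dict subclass. This is a minimal toy, assuming keys are ordinary strings with a leading slash; `SlashDict` is a hypothetical name and the behaviour shown (missing names read as None, assigning None deletes) is a design choice of this sketch, not a statement about pdfrw's implementation.

```python
# Toy sketch of attribute access mapped onto slash-prefixed keys.
# SlashDict is a hypothetical name; this is not pdfrw's PdfDict.

class SlashDict(dict):
    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        return self.get("/" + name)

    def __setattr__(self, name, value):
        key = "/" + name
        if value is None:
            self.pop(key, None)   # assigning None removes the entry
        else:
            self[key] = value


d = SlashDict()
d.Type = "/Page"                  # stored under the key "/Type"
d["/My.Custom-Name"] = 1          # not a valid identifier, so use indexing
```

Names that are not valid Python identifiers cannot be reached through attribute syntax, which is why index lookup remains available alongside it.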
In addition to the tree structure, pdfrw creates a special attribute named pages, which is a list of all the pages in the document. Each entry in the pages list is the PdfDict object for one of the pages in the file, in order. As you can see, it is quite easy to dig down into a PDF document. But what about when it's time to write it out?
That's all it takes to create a new PDF. You may still need to read the Adobe PDF reference manual to figure out what needs to go into the PDF, but at least you don't have to sweat actually building it and getting the file offsets right. For the most part, pdfrw tries to be agnostic about the contents of PDF files, and support them as containers, but to do useful work, something a little higher-level is required, so pdfrw works to understand a bit about the contents of the containers.
For example:. However, in most cases, you can do a lot of useful work with PDFs without actually removing compression, because only certain elements inside PDFs are actually compressed. One feature that all the PDF object classes have in common is the inclusion of an 'indirect' attribute. If 'indirect' exists and evaluates to True, then when the object is written out, it is written out as an indirect object.
That is to say, it is addressable in the PDF file, and could be referenced by any number (including zero) of container objects.
This indirect object capability saves space in PDF files by allowing objects such as fonts to be referenced from multiple pages, and also allows PDF files to contain internal circular references. This latter capability is used, for example, when each page object has a "parent" object in its dictionary. It can be used either by calling it or getting an attribute, e.
In the example above, there is a slight difference between the objects returned from PdfName, and the object returned from PdfObject. This is important, because only these may be used as keys in PdfDict objects. The class has encode and decode methods for the strings.
A regular list could be used instead, but use of the PdfArray class allows for an indirect attribute to be set, and also allows for proxying of unresolved indirect objects that haven't been read in yet in a manner that is transparent to pdfrw clients. A regular dict could be used instead, but the PdfDict class matches the requirements of PDF files more closely:.
If a PdfDict has an associated data stream in the PDF file, the stream is accessed via the 'stream' (all lower-case) attribute. To set private attributes on a dictionary (attributes that will not be written out to a new PDF file), use the 'private' attribute.
Some attributes of PDF pages are "inheritable": they may be defined on a parent node in the page tree rather than on the page itself. The "inheritable" attribute allows for easy discovery of these. Although these are non-transparent inside the library, client code should never see one of these -- they exist inside the PdfArray and PdfDict container types, but are resolved before being returned to a client of those types.
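The idea of inheritable page attributes can be sketched as a walk up the chain of parent dictionaries. This is a minimal illustration using plain dicts; the function name `inherited` is hypothetical and this is not how pdfrw implements its 'inheritable' attribute.

```python
# Toy lookup of an inheritable page attribute using plain dicts.
# 'inherited' is a hypothetical helper, not part of pdfrw.

def inherited(page, key, parent_key="/Parent"):
    """Return the nearest value for key on the page or its ancestors."""
    node = page
    seen = set()
    while node is not None and id(node) not in seen:
        seen.add(id(node))        # guard against malformed parent loops
        if key in node:
            return node[key]
        node = node.get(parent_key)
    return None


root = {"/Type": "/Pages", "/MediaBox": [0, 0, 612, 792]}
group = {"/Type": "/Pages", "/Parent": root, "/Rotate": 90}
page = {"/Type": "/Page", "/Parent": group}

media_box = inherited(page, "/MediaBox")   # found on the root node
rotate = inherited(page, "/Rotate")        # found on the intermediate node
```

A value set directly on the page would take precedence, since the walk starts at the page and stops at the first match.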
It uses the PdfTokens class in tokens. The PdfReader class does not, in general, parse into containers e. It will have a private attribute set on it that is named 'pages' that is a list containing all the pages in the file. When instantiating a PdfReader object, there are options available for decompressing all the objects in the file.
Also, there are no options for decryption yet. If you have PDF files that are encrypted or heavily compressed, you may find that using another program like pdftk on them can make them readable by pdfrw. In general, the objects are read from the file lazily, but this is not currently true with compressed object streams -- all of these are decompressed and read in when the PdfReader is instantiated.
In the simplest case, an instance of PdfWriter is instantiated, pages are added to it from one or more source files (or created programmatically), and then the write method is called to dump the results out to a file.
If you have a source PDF and do not want to disturb the structure of it too badly, then you may pass its trailer directly to PdfWriter rather than letting PdfWriter construct one for you. There is an example of this alter.
These may be reused in new PDFs essentially as if they were images. It is normally used in conjunction with buildxobj, to be able to reuse parts of existing PDFs when using reportlab. It contains classes to create a new page or overlay an existing page using one or more rectangles from other pages. There are examples showing its use for watermarking, scaling, 4-up output, splitting each page in 2, etc.
The extract. Very few filters are currently supported, so an external tool like pdftk might be good if you require the ability to decompress or, for that matter, decrypt PDF files.
The tests associated with pdfrw require a large number of PDFs, which are not distributed with the library. It can do decompression and decryption and seems to know a lot about items inside at least some kinds of PDF files.
The Form XObject capability of pdfrw means that, in many cases, it does not actually need to decompress objects -- they can be left compressed. My understanding is that pagecatcher would have done exactly what I wanted when I built pdfrw.
But I was on a zero budget, so I've never had the pleasure of experiencing pagecatcher. I do, however, use and like reportlab (open source, from the people who make pagecatcher), so I'm sure pagecatcher is great, better documented and much more full-featured than pdfrw. This looks like a useful, actively-developed program. It is quite large, but then, it is trying to actively comprehend a full PDF document. From the website: Unlike other PDF-related tools, it allows to obtain the exact location of texts in a page, as well as other extra information such as font information or ruled lines.
It has an extensible PDF parser that can be used for other purposes instead of text analysis.