There are two types of PDF: native (text-based) PDFs and scanned (image-based) PDFs.
Does the type of PDF matter? Yes. When a PDF is converted, its type determines whether the text can be read directly by software or must first be recognized from an image.
Native PDFs are generated from an electronic source document, for example:
.... which have an internal structure that can be read and interpreted by software.
These "generated" native PDF documents therefore already contain characters that have an electronic character designation. In most cases, the PDF creation software takes information from the structure of the source document, such as character information and word placement data, and retains it in the PDF output. This is why you can run a word search on a text-based PDF document.
A scanned PDF comes about when a physical paper document needs to be converted into electronic form (i.e. when it is inefficient or not viable to retype or recreate the document manually in electronic form and then convert it to PDF).
The solution is to scan the document using an electronic scanning device. The scanner digitally captures an image of the physical document, creating a "snapshot" picture of it. (Note: the scanner does not reconstruct the characters of each word when it creates this scanned image.) This snapshot is then turned into a PDF by software integrated with the scanner.
The result is a scanned PDF document.
However, even though the image may be of a document that contains words, the computer recognizes those words only as "images", which it displays without any information structure behind them.
This is why a text search of the document returns no results.
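The difference described above can be checked programmatically. The following is a minimal sketch, assuming the text of each page has already been extracted with a PDF library of your choice (that extraction step is not shown). The function names and the `min_chars` threshold are hypothetical choices for illustration, not part of any particular tool.

```python
def classify_pages(page_texts, min_chars=20):
    """Label each page 'native' or 'scanned' from its extracted text.

    page_texts: one string per page, as returned by a PDF text extractor
    (assumed input). A page with almost no extractable characters is
    treated as an image-only, i.e. scanned, page.
    """
    labels = []
    for text in page_texts:
        # Count visible characters only; whitespace alone is not real text.
        visible = sum(1 for ch in text if not ch.isspace())
        labels.append("native" if visible >= min_chars else "scanned")
    return labels


def classify_pdf(page_texts, min_chars=20):
    """Summarize a whole document as 'native', 'scanned', 'mixed' or 'empty'."""
    labels = set(classify_pages(page_texts, min_chars))
    if not labels:
        return "empty"
    if len(labels) == 2:
        return "mixed"
    return labels.pop()


pages = ["Total assets 1,250,000 Total liabilities 840,000", ""]
print(classify_pages(pages))  # per-page labels
print(classify_pdf(pages))    # overall label
```

A scanned PDF that has already been through OCR carries a hidden text layer, so a check like this would label it as native even though its text may contain recognition errors.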
To convert a scanned PDF into a searchable/editable format, OCR (optical character recognition) software is required to analyze the "image" of each character and match it to an electronic character. This process is often not error-free, and it may be difficult to determine whether the character "recognized" by the OCR software is indeed the character on the scanned document.
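The idea of matching a character image to a known character can be illustrated with a toy example. This is a sketch only: real OCR engines use far more sophisticated models, and the 3x5 glyph shapes, the `TEMPLATES` table and the function names below are all invented for illustration.

```python
# Toy 3x5 glyph templates: "1" = ink, "0" = blank paper.
# These shapes are hypothetical and exist only for this illustration.
TEMPLATES = {
    "1": ["010", "110", "010", "010", "111"],
    "7": ["111", "001", "010", "010", "010"],
    "L": ["100", "100", "100", "100", "111"],
}


def hamming(a, b):
    """Count the pixels that differ between two glyphs of equal size."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognize(glyph):
    """Return (best_char, best_distance, runner_up_distance).

    A small gap between the two distances means the match is uncertain
    and the character deserves a manual check.
    """
    scored = sorted((hamming(glyph, tpl), ch) for ch, tpl in TEMPLATES.items())
    (best_d, best_ch), (next_d, _) = scored[0], scored[1]
    return best_ch, best_d, next_d


# A smudged "1": the top-left flag and part of the base are missing.
smudged = ["010", "010", "010", "010", "110"]
print(recognize(smudged))
```

The smudged glyph is still closest to the "1" template, but no longer at distance zero. As scan quality drops, the gap to the runner-up shrinks, which is how misrecognized characters arise.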
One should note that the quality of OCR output is affected by matters such as:
For financial statements, of course, the quality of OCR conversion is of paramount importance. Accordingly, these files need to be processed very carefully, followed by manual verification and correction of the OCR output to ensure the accuracy of the results.
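Part of this verification can be assisted by software. The sketch below flags tokens that look like amounts but contain letters often mistaken for digits, so that a person can check them against the scanned page. The `CONFUSABLE` table and the function name are illustrative assumptions, not an exhaustive list or an established tool, and the suggestions are prompts for review rather than automatic corrections.

```python
import re

# Letters that are easily confused with digits (illustrative, not exhaustive).
CONFUSABLE = {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}

# A plausible amount: digits with optional thousands separators, decimals
# and accounting-style brackets for negatives.
AMOUNT = re.compile(r"\(?[\d,]+(\.\d+)?\)?")


def flag_suspect_amounts(tokens):
    """Return (token, suggestion) pairs that need a manual check."""
    flagged = []
    for token in tokens:
        has_digit = any(ch.isdigit() for ch in token)
        has_confusable = any(ch in CONFUSABLE for ch in token)
        if has_digit and has_confusable:
            suggestion = "".join(CONFUSABLE.get(ch, ch) for ch in token)
            # Flag only when swapping the letters yields a plausible amount.
            if AMOUNT.fullmatch(suggestion):
                flagged.append((token, suggestion))
    return flagged


ocr_tokens = ["1,2O0", "Total", "45l", "3,500", "Q1"]
print(flag_suspect_amounts(ocr_tokens))
```

A check like this will also flag some legitimate tokens (for example, a label such as "B2B"), which is acceptable here because every flagged item is reviewed by a person.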
Once the OCR output has been verified and corrected, the file is ready for conversion to iXBRL or XBRL.
An easy way to avoid this OCR stage is to obtain and provide us with the source document from which the paper document was printed (and then scanned). This is likely to be:
.... created just before signature of the financial statements. Note that the actual signature is neither needed nor used for iXBRL/XBRL conversion.