• 0 Posts
  • 24 Comments
Joined 2 years ago
Cake day: July 9th, 2023





  • The problem lies in the PDFs themselves. Inside are objects that represent lines of glyphs, if you are lucky. A conversion tool can guess which of those lines belong together and produce the text.

    It cannot know the intent behind the layout, though. Take a numbered list. The first line consists of two line objects: the number plus the "." or the ")", and the first line of text. The conversion tool can only guess: since the line blocks with the numbers all sit to the left of the line blocks with text, this could be a numbered list. Or it could be a table with two columns. Nothing in the PDF gives any hint.
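    To make the guessing concrete, here is a minimal sketch in Python. It is not how any particular conversion tool works; the `LineBlock` type, the `guess_structure` function, and the tolerance value are all made up for illustration. It only looks at positions and text, which is all a converter has to go on.

```python
import re
from dataclasses import dataclass


@dataclass
class LineBlock:
    """Hypothetical stand-in for one line object found in a PDF page."""
    x: float   # left edge of the block
    y: float   # vertical position of the baseline
    text: str


# Matches list markers such as "1." or "12)"
MARKER = re.compile(r"^\d+[.)]$")


def guess_structure(blocks, y_tolerance=2.0):
    """Group blocks with a similar y into rows, then guess what the rows are."""
    rows = {}
    for block in blocks:
        rows.setdefault(round(block.y / y_tolerance), []).append(block)

    # Sort each row left to right
    ordered = [sorted(row, key=lambda b: b.x) for row in rows.values()]

    # Anything that is not strictly two blocks per row is treated as plain text
    if not ordered or any(len(row) != 2 for row in ordered):
        return "plain text"

    # Two blocks per row, left one looks like a marker: probably a list
    if all(MARKER.match(row[0].text) for row in ordered):
        return "numbered list"

    # Two blocks per row otherwise: probably a table
    return "two-column table"
```

    Note that a real two-column table whose first column happens to contain "1." and "2." would be reported as a numbered list. The heuristic has no way to tell the difference, which is exactly the problem.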

    And that is the easy part. This assumes that the document either uses default fonts or keeps its embedded fonts untouched. If it uses embedded fonts and a PDF optimizer that embeds only the characters actually used and renumbers them, any copy or conversion tool is bound to fail.
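    A toy model of the renumbering, again in Python. This is a deliberately simplified illustration, not the actual PDF font machinery: the function names are hypothetical, and the assumption is that the optimizer hands out new codes in order of first use. The page then only stores those codes. Without a table that maps the codes back to characters (in a real PDF this is the job of the font's ToUnicode map, which may be missing), a copy tool has nothing meaningful to decode.

```python
def subset_and_renumber(text):
    """Assign each distinct character a new code in order of first use."""
    glyph_ids = {}
    for ch in text:
        if ch not in glyph_ids:
            glyph_ids[ch] = len(glyph_ids) + 1
    # The page content stores only these codes, not the characters
    codes = [glyph_ids[ch] for ch in text]
    return codes, glyph_ids


def naive_copy(codes):
    """What a copy tool gets if it treats the codes as character codes."""
    return "".join(chr(code) for code in codes)


def copy_with_map(codes, glyph_ids):
    """What a copy tool gets if the reverse mapping is still available."""
    reverse = {code: ch for ch, code in glyph_ids.items()}
    return "".join(reverse[code] for code in codes)
```

    For the text "Hello" the stored codes are 1, 2, 3, 3, 4. The viewer still draws the right glyph shapes, but `naive_copy` returns control characters instead of letters, while `copy_with_map` recovers the text only because the mapping was kept.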

    The same goes for protected PDFs, where you cannot copy the text to begin with.

    And then there are PDFs that consist of nothing but scanned pages. Here you need OCR software to get anything readable out of them.

    PDF is an archival output format, the end of a process, not something to work from.

    Always preserve the original file. Keep it safe. If you change tools, make sure you have a conversion path into something editable. The PDF is for giving away, nothing else.







  • I use my former PC as the home server. It is probably 10+ years old and has no M.2 slot or anything like that, but it does have an SSD for the OS. It is more than big and fast enough for all my needs: file service (Samba), web service (apache2), wiki service (MediaWiki), database (MySQL), calendar service (Radicale), project service (Subversion), and probably some others I forgot. All of it runs on Ubuntu Server, administered with Webmin.

    The only investment I made when I turned it into a server was putting 2x8TB drives in it as a RAID for bulk storage. I dump the family PCs' backups on that machine, too.






  • I wonder when I will be able to set default options for importing images. I have been using Open/LibreOffice almost since its beginnings as StarOffice (I think I still have the CD somewhere), and one thing really irks me: whenever I import an image (which I do rather regularly), it is always imported with “Anchor->To Character”. For me, this is the wrong default; I always need “Anchor->As Character”. For thousands of images, I have had to set this manually.