Unless the only thing you want is a description of the image, the real answer is NO. You can get an LLM to do something like "If you encounter an image that is not easily convertible to standard markdown, insert [[DESCRIPTION OF IMAGE]] here" as a placeholder, but at that point you've lost information that may be salient to the original PDF.
The reason is that these multimodal LLMs can give you descriptions, OCR, and so on, but they cannot give you quantifiable information about placement.
So if there were a picture of a tiger in the middle of a page that had been converted to a bitmap, you couldn't get the LLM to give you something like "Image detected at pixel position (120, 200) - (240, 500)" - and that's really what you want.
You almost need segmentation-system middleware that the LLM can forward to, something that can cut these images out for use in markdown syntax:
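A minimal sketch of what that middleware step could look like, assuming a born-digital PDF, using PyMuPDF (fitz) to read image bounding boxes straight out of the PDF structure; the file names and the `extract_images_to_markdown` helper are illustrative, and a scanned page would need an actual layout-segmentation model instead:

```python
# Sketch: pull embedded images + bounding boxes from a PDF page,
# save the crops, and emit markdown image references with placement info.
import fitz  # PyMuPDF

def extract_images_to_markdown(pdf_path: str, page_number: int = 0) -> str:
    doc = fitz.open(pdf_path)
    page = doc[page_number]
    lines = []
    for i, img in enumerate(page.get_images(full=True)):
        xref = img[0]
        bbox = page.get_image_bbox(img)  # placement in page coordinates
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha > 3:  # convert CMYK etc. to RGB before saving
            pix = fitz.Pixmap(fitz.csRGB, pix)
        out = f"page{page_number}_img{i}.png"
        pix.save(out)
        # Keep the quantifiable placement the LLM can't give you.
        lines.append(
            f"![image at ({bbox.x0:.0f}, {bbox.y0:.0f}) - "
            f"({bbox.x1:.0f}, {bbox.y1:.0f})]({out})"
        )
    doc.close()
    return "\n".join(lines)

print(extract_images_to_markdown("input.pdf"))
```

For scanned pages, the same idea holds: a detector gives you boxes, a cropper gives you files, and the markdown carries the coordinates the LLM alone would have dropped.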
Is this something that could be used to build a chatbot with a knowledge base covering all pages, and all snapshots of those pages, of a specific domain archived by web.archive.org?
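In principle, yes: the Wayback Machine exposes a public CDX API that can enumerate every captured snapshot of a domain, which you could then fetch, convert, and index. A rough sketch of just the enumeration step, assuming the standard CDX endpoint (everything beyond it, like the `list_snapshots` helper, is illustrative):

```python
# Sketch: list archived snapshots of a domain via the Wayback CDX API,
# as a starting point for building a knowledge base. Fetching each
# snapshot and indexing it for a chatbot is left out.
import json
import urllib.parse
import urllib.request

def list_snapshots(domain: str, limit: int = 100) -> list[tuple[str, str]]:
    params = urllib.parse.urlencode({
        "url": f"{domain}/*",        # all pages under the domain
        "output": "json",
        "fl": "timestamp,original",  # just the fields we need
        "collapse": "digest",        # skip captures whose content didn't change
        "limit": limit,
    })
    with urllib.request.urlopen(
        f"https://web.archive.org/cdx/search/cdx?{params}"
    ) as resp:
        rows = json.load(resp)
    # First row is the field-name header; each capture is replayable
    # at https://web.archive.org/web/<timestamp>/<original-url>.
    return [
        (ts, f"https://web.archive.org/web/{ts}/{orig}")
        for ts, orig in rows[1:]
    ]

for ts, url in list_snapshots("example.com", limit=10):
    print(ts, url)
```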