Project Naptha

Words on the web exist in two forms: there’s the text of articles, emails, tweets, chats and blogs— which can be copied, searched, translated, edited and selected— and then there’s the text which is shackled to images, found in comics, document scans, photographs, posters, charts, diagrams, screenshots and memes. Interaction with this second type of text has always been a second class experience, the only way to search or copy a sentence from an image would be to do as the ancient monks did, manually transcribing regions of interest.

This entire webpage is a live demo. You can watch as moving your cursor over a block of words changes it into the little I-beam. You can drag over a few lines and watch as a semitransparent blue box highlights the text, helping you keep track of where you are and what you’re reading. Hit Ctrl+C to copy the text, where you can paste it into a search bar, a Word document, an email or a chat window. Right-click and you can erase the words from an image, edit the words, or even translate it into a different language.

This was made by @antimatter15 (+KevinKwok on Google+), and Guillermo Webster.

If you stare at these three animated gifs long and hard enough, you might not need to read anything.

Early in October 2013, coincidentally less than a week before I developed the first prototype of this extension, xkcd published a comic (shown on the right) which somewhat ironically depicts the impetus for the extension.

The comic decries websites which arbitrarily hinder users from absentmindedly selecting random blocks of text— but the irony is that xkcd should count himself among the long list of offenders because up until now, it simply wasn't possible to select text inside a comic.

An interesting thing to note is the language agnostic nature of Project Naptha's underlying SWT algorithm (see the technical details by scrolling down a bit more) makes it detect the little squiggles as text as well. Depending on how you look at it, this can be seen as a bug, or a feature.

Also, because handwriting detection is particularly difficult (in particular, the issue is character segmentation, it's quite difficult to separate apart letters which are smushed so close as to be connected), if you try to copy and paste text from a comic, it ends up jumbled. This might be improved in the future, because certain parts of the Naptha stack do lag behind the present state-of-the-art by a few years.

It usually takes some special software to convert a scan into a PDF document that you can highlight and copy from, and this extra step means that a lot of the time, you aren't dealing with a nicely formatted and processed PDF, but a raw scan distributed as a TIFF or JPEG.

Usually, that just meant suffering through the document, or in the worst case, printing it out so that I could scribble with a pen along, while I read. But with this extension, it's possible to just select text from a picture, attached to an email, or linked from a class action lawsuit overview.

It's even possible for files you have locally on your computer. Simply drag the image file over to your browser window. Note that you might have to go to chrome://extensions and check the "Allow access to file URLs" checkbox.

The algorithm used by Project Naptha (Stroke Width Transform) was actually designed for detecting text in natural scenes and photographs (a more technically challenging and general problem than most regular images).

Naptha actually also supports rotated text (though it is still absolutely hopeless if the text is rotated by more than 30 degrees or so— sorry vertical text, I'll figure you out later!), which actually took a really long time to implement.

But with these types of images, the actual text recognition becomes somewhat of a crapshoot. While it's quite possible that the quality could improve in future versions, with better trained models and algorithms, and the inclusion of human-aided transcription services, you should probably calibrate your expectations fairly low to avoid disappointment.

Diagrams are cool. There are charts and diagrams all over the web, and sometimes you'll want to look up one of the chart axes, and it's pretty convenient to be able to do that without needing to type it up again. Maybe there's a circuit diagram and you want to check out where a certain component can be bought— just highlight its label and copy and paste it into the search bar.

This particular diagram was found on Blake Masters' course notes for Peter Thiel's Stanford class. I haven't actually read it, but it was on Hacker News so it just happened to be one of the tabs that I currently have open.

The truth is that I've spent way too much time on reddit and 4chan in search of test images for the text detection and layout analysis algorithms. Time really does go by when you can rationalize procrastination as something "productive". The result is that my test corpus is something on the order of 50% internet meme (In particular, I'm a fan of Doge, in part because Comic Sans is interpreted remarkably well by the built-in Ocrad text recognizer).

It's actually a bit difficult to recognize the text of the standard-template internet meme (mad props to CaptionBot, bro). Bold Impact font is actually notoriously hard to recognize with general-purpose text recognizers because a lot of what distinguishes letters isn't the overall shape, but rather the subtle rounding of corners (compare D, 0, O) or relatively short protrusions (the stubby little tail for L that differentiates it from an I).

I started building a text recognizer algorithm specifically designed for Impact font, and it was actually working pretty well, but I kind of misplaced the code somewhere. So, until I find it or replace it, you'll have to use Tesseract configured with the "Internet Meme" language.

Screenshots are a nice way to save things in a state that you can recall later in a more or less complete form— the only caveat being the fact that you would have to re-type the text later if you find a need for it. On the other hand, copying and saving just the text of something ends up losing the spatial context of its origin.

Project Naptha kind of transforms static screenshots into something more akin to an interactive snapshot of the computer as it was when the screen was captured. While clicking on buttons won't submit forms or upload documents, the cursor changes when hovering over different parts, and blocks of text become selectable, just like they were before frozen in carbonite.

While it's not a perfect substitute— the text recognition screws up every once in a while, so the reconstruction isn't reliably perfect, it still has a rather significant and profound effect.

There's always been the dream of a universal translation machine— something like the Babel fish out of Hitchhiker's Guide that will allow anyone to magically communicate with anyone else and to fully appreciate the art and culture of any society (Vogon poetry notwithstanding).

I'm here to say that it's still a ways off, but at least I have enough to do a pretty impressive demo.

Try it out: Highlight some region of text on that image. Right click on it and navigate to the "Translate" menu. Select whatever language you want.

This is actually the first step in translating an image: to erase the text from the image so new words can be put on top of it. This is done by something called "Inpainting" and these types of algorithms are most famously deployed as Adobe Photoshop's "Content-Aware Fill" feature.

It extrapolates solid colors from the regions surrounding the text, and propagates the colors inwards until the entire area is covered. From a distance, it usually does a pretty good job, but it's hardly a substitute for a true original.

Try it out: Highlight the text over the cat's face. Right click on the selection squig and click "Erase Text", which can be found under the "Translate" menu.

With the same trick that Translation uses— it's possible to substitute in your own text. This will probably work better in the future, once there's some actual font detecting logic besides if uppercase and super bold, then Impact font, if uppercase otherwise then XKCD font, and for everything else, Helvetica Neue.

I don't know where else to mention this, because it's one of those little things that simultaneously applies to everything and nothing at once— but it's also possible to select multiple regions by holding the shift key. I spent way too long writing the algorithms to merge multiple selection regions when appropriate.

Try it out: Highlight some meme text. Right click on the selection squig and click "Reprint Text", which can be found under the "Translate" menu. After that, select the text on one region that you'd like to edit and click "Modify Text" which should appear in the context menu.

During May 2012, I was reading about seam carving, an interesting and almost magical algorithm which could rescale images without apparently squishing it. After playing with the the little seams that the seam carver tended to generate, I noticed that they tended to converge arrange themselves in a way that cut through the spaces in between letters (dynamic programming approaches are actually fairly common when it comes to letter segmentation, but I didn't know that). It was then, while reading a particularly verbose smbc comic, I thought that it should be possible to come up with something which would read images (with <canvas>), figure out where lines and letters were, and draw little selection overlays to assuage a pervasive text-selection habit.

My first attempt was simple. It projected the image onto the side, forming a vertical pixel histogram. The significant valleys of the resulting histograms served as a signature for the ends of text lines. Once horizontal lines were found, it cropped each line, and repeated the histogram process, but vertically this time, in order to determine the letter positions. It only worked for strictly horizontal machine printed text, because otherwise the projection histograms would end up too noisy. For one reason or another, I decided that the problem either wasn't worth tackling, or that I wasn't ready to.

Fast forward a year and a half, I'm a freshman at MIT during my second month of school. There's a hackathon that I think I might have signed up for months in advance, marketed as MIT's biggest. I slept late the night before for absolutely no particular reason, and woke up at 7am because I wanted to make sure that my registration went through. I walked into the unfrozen ice rink, where over 1,000 people were claiming tables and splaying laptop cables on the ground— so this is what my first ever hackathon is going to look like.

Everyone else was "plugged in" or something; big headphones, staring intently at dozens of windows of Sublime Text. Fair enough, it was pretty loud. I had no idea what I would have ended up doing, and I wasn't able to meet anyone else who was both willing to collaborate and had an idea interesting enough for me to want to. So I decided to walk back to my dorm and take a nap.

I woke up from that nap feeling only slightly more tired, and nowhere closer to figuring out what I was going to do. I decided to make my way back to the hackathon, because there's free food there or something.

If you paid attention to the permissions requested in the installation dialog, you might have wondered about why exactly this extension requires such sweeping access to your information. Project Naptha operates a very low level, it's actually ideally the kind of functionality that gets built in to browsers and operating systems natively. In order to allow you to highlight and interact with images everywhere, it needs the ability to read images located everywhere.

One of the more impressive things about this project is the fact that it's almost entirely written in client side javascript. That means that it's pretty much totally functional without access to a remote server. That does come at a bit of a caveat, which is that online translation running offline is an oxymoron, and lacking access to a cached OCR service running in the cloud means reduced performance and lower transcription accuracy.

So there is a trade-off that has to happen between privacy and user experience. And I think the default settings strike a delicate balance between having all the functionality made available and respecting user privacy. I've heard complaints on both sides (roughly equal in quantity, actually, which is kind of intriguing)— lots of people want high quality transcription to the default, and others want no server communication whatsoever as the default.

By default, when you begin selecting text, it sends a secure HTTPS request containing the URL of the specific image and literally nothing else (no user tokens, no website information, no cookies or analytics) and the requests are not logged. The server responds with a list of existing translations and OCR languages that have been done. This allows you to recognize text from an image with much more accuracy than otherwise possible. However, this can be disabled simply by checking the "Disable Lookup" item under the Options menu.

The translation feature is currently in limited rollout, due to scalability issues. The online OCR service also has per-user metering, and so such requests include a unique identifier token. However, the token is entirely anonymous and is not linked with any personally identifiable information (it handled entirely separately from the lookup requests).

So actually, the thing that is running on this page isn't the fully fledged Project Naptha. It's essentially just the front-end, so it lacks all of the computational heavy lifting that actually makes it cool. All the text metrics and layout analyses were precomputed. Before you raise your pitchforks, there's actually a good reason this demo page runs what amounts to a Weenie Hut Jr. version of the script.

The computationally expensive backend uses WebWorkers extensively, which, although has fairly good modern browser support, has subtle differences between platforms. Safari has some weird behavior when it comes to sending around ImageData instances, and transferrable typed arrays are slightly different in Firefox and Chrome. Most importantly though, the current stable version (34) of Google Chrome, at time of writing actually suffers from a debilitatingly broken WebWorkers implementation. Rather fortunately, Chrome extensions don't seem to suffer from the same problem.

The dichotomy between words expressed as text and those trapped within images is so firmly engrained into the browsing experience, that you might not even recognize it as counter-intuitive. For a technical crowd, the limitation is natural, lying in the fact that images are fundamentally “raster” entities, devoid of the semantic information necessary to indicate which regions should be selectable and what text is contained.

Computer vision is an active field of research essentially about teaching computers how to actually “see” things, recognizing letters, shapes and objects, rather than simply pushing copies of pixels around.

In fact, optical character recognition (OCR) is nothing new. It has been used by libraries and law firms to digitize books and documents for at least 30 years. More recently, it has been combined with text detection algorithms to read words off photographs of street signs, house numbers and business cards.

The primary feature of Project Naptha is actually the text detection, rather than optical character recognition. It runs an algorithm called the Stroke Width Transform, invented by Microsoft Research in 2008, which is capable of identifying regions of text in a language-agnostic manner. In a sense that’s kind of like what a human can do: we can recognize that a sign bears written language without knowing what language it's written in, nevermind what it means.

However, half a second is still quite noticeable, as studies have shown that users not only discern, but feel readily annoyed by delays as short as a hundred milliseconds. To get around that, Project Naptha is actually continually watching cursor movements and extrapolating half a second into the future so that it can kick off the processing in advance, so it feels instantaneous.

In conjunction with other algorithms, like connected components analysis (identifying distinct letters), otsu thresholding (determining word spacing), disjoint set forests (identifying lines of text), Project Naptha can very quickly build a model of text regions, words, and letters— all while completely unaware of the specifics, what specific letters exist.

Once a user begins to select some text, however, it scrambles to run character recognition algorithms in order to determine what exactly is being selected. This recognition process happens on a per-region basis, so there’s no wasted effort in doing it before the user is done with the final selection.

The recognition process involves blowing up the region of interest so that each line is on the order of 100 pixels tall, which can be as large as a 5x magnification. It then does an intelligent color masking filter before sending it to a built-in pure-javascript port of the open source Ocrad OCR engine.

Because this process is relatively computational expensive, it makes sense to do this type of “lazy” recognition- staving off until the last possible moment to run the process. It can take as much as five to ten seconds to complete, depending on the size of the image and selection. So there’s a good chance that by the time you hit Ctrl+C and the text gets copied into your clipboard, the OCR engine still isn’t done processing the text.

That’s all okay though, because in place of the text which is still getting processed, it inserts a little flag describing where the selection is and which part of the image to read from. For the next 60 seconds, Naptha tracks that flag and substitutes it with the final, recognized text as soon as it can.

Sometimes, the built-in OCR engine isn’t good enough. It only supports languages with the Latin alphabet and a limited number of diacritics, and doesn’t contain a language model so that it outputs a series of letters dependent on the probability given its context (for instance, the algorithm may decide that “he1|o” is a better match than “hello” because it only looks at the letter shape). So there’s the option of sending the selected region over to a cloud based text recognition service powered by Tesseract, Google’s (formerly HP’s) award-winning open-source OCR engine which supports dozens of languages, and uses an advanced language model.

If anyone triggers the Tesseract engine on a public image, the recognition result is saved, so that future users who stumble upon the same image will instantaneously load the cached version of the text.

There is a class of algorithms for something called “Inpainting”, which is about reconstructing pictures or videos in spite of missing pieces. This is widely used for film restoration, and commonly found in Adobe Photoshop as the “Content-Aware Fill” feature.

Project Naptha uses the regions detected as text as a mask for a particular inpainting algorithm developed in 2004 based on the Fast Marching Method by Alexandru Telea. This mask can be used to fill in the spots where the text is taken from, creating a blank slate for which new content can be printed.

With some rudimentary layout analysis and text metrics, Project Naptha can figure out the alignment parameters of the text (centered, justified, right or left aligned), the font size and font weight (bold, light or normal). With that information, it can reprint the text in a similar font, in the same place. Or, you can even change the text to say whatever you want it to say.

It can even be chained to an online translation service, Google Translate, Microsoft Translate, or Yandex Translate in order to do automatic document translations. With Tesseract’s advanced OCR engine, this means it’s possible to read text in languages with different scripts (Chinese, Japanese, or Arabic) which you might not be able to type into a translation engine.

The prototype which was demonstrated at HackMIT 2013, later winning 2nd place, was rather blandly dubbed "Images as Text". Sure, it pretty aptly summed up the precise function of the extension, but it really lacked that little spark of life.

So from then, I set forth searching for a new name, something that would be rife with puntastic possibilities. One of the possibilities was "Pyranine", the chemical used in making the ink for flourescent highlighters (my roommate, a chemistry major, happened to be rather fond of the name). I slept on that idea for a few nights, and realized that I had totally forgotten how to spell it, and so it was crossed off the candidate list.

Naptha, its current name, is drawn from an even more tenuous association. See, it comes from the fact that "highlighter" kind of sounds like "lighter", and that naptha is a type of fuel often used for lighters. It was in fact one of the earliest codenames of the project, and brought rise to a rather fun little easter egg which you can play with by quickly clicking about a dozen times over some block of text inside a picture.

Project Naptha

highlight, copy, and translate text from any image.

Animated GIF tl;dr

Chronology

Security & Privacy

About this Demo...

How it Works

What's in a Name?