While reading Andres Riancho’s blog post about weak CAPTCHAs yesterday, it quickly occurred to me that it would be ridiculously simple to adjust his script to defeat an unofficial CAPTCHA solution for BlogEngine.NET that I myself used for a few days on my blog.
The image-creation procedure used for this CAPTCHA, a gradient technique mixing blue and red, seems to be mentioned in a few places if one searches on Google.
So here is what we have:
- Only letters and numbers are used.
- The letters and numbers are not rotated.
- The letters and numbers are pretty clean.
- All letters and numbers have the same height.
- The letters and numbers do not have the same color, but given the way the colors are used, this should not pose any major problems.
- Given the way the background noise is used, we should be able to get rid of it pretty quickly.
We just need to add a few lines to Andres Riancho’s script.
1. First, convert the image to grayscale.
2. Then adjust the contrast a little bit to get black letters and numbers with some grey background noise.
3. By now we can use the original script: since the letters and numbers should be black, we can filter out the background noise, just like in the original script.
The results:
I tested 18 different CAPTCHA images of this kind, and it missed just 3 (for example, it confused O with 0). YMMV.
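To see why steps 1 to 3 are enough, here is a rough plain-Python sketch of what they do to individual pixels. It only imitates the PIL calls: the luma weights are the ones PIL documents for convert('L'), the contrast step is my approximation of what ImageEnhance.Contrast does (scaling each value's distance from the mean), and the helper names, the sample colors and the factor 2.0 are all made up for illustration.

```python
def luminance(rgb):
    """Approximate grayscale value using the ITU-R 601 luma weights."""
    r, g, b = rgb
    return (r * 299 + g * 587 + b * 114) // 1000


def stretch_contrast(gray, factor):
    """Scale each value's distance from the mean by `factor`, clamped to 0..255."""
    flat = [v for row in gray for v in row]
    mean = sum(flat) / float(len(flat))
    return [[max(0, min(255, int(round(mean + factor * (v - mean)))))
             for v in row] for row in gray]


def keep_only_black(gray):
    """Anything that is not pure black becomes white, as in the noise filter."""
    return [[0 if v == 0 else 255 for v in row] for row in gray]


# Hypothetical 4x2 sample: dark blue, dark red and purple "glyph" pixels,
# light grey "noise" pixels, and a white background.
WHITE = (255, 255, 255)
sample = [
    [(0, 0, 160), (200, 200, 200), WHITE, (150, 0, 0)],
    [WHITE, (100, 0, 100), (180, 180, 180), WHITE],
]

gray = [[luminance(p) for p in row] for row in sample]
boosted = stretch_contrast(gray, 2.0)  # assumed factor
cleaned = keep_only_black(boosted)
print(cleaned)
```

The dark blue and red pixels both end up with a low grayscale value, so the contrast boost pushes them down to 0 while the light grey noise stays above 0 and is wiped by the filter. Glyph colors that are not dark enough would need a higher factor to reach pure black.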
from PIL import Image
from PIL import ImageEnhance
# Convert it to grayscale
img = Image.open('input2.gif')
img = img.convert('L')
# Save the intermediate result so the next step can open it
img.save('1.gif')
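# Adjust the contrast a little bit
test = Image.open('1.gif').convert('L')
# Contrast() only builds an enhancer; enhance() returns the adjusted image.
# The factor 2.0 is an assumed value (the original one is not given), tune it as needed.
img = ImageEnhance.Contrast(test).enhance(2.0)
img.save('2.gif')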
img = Image.open('2.gif')
img = img.convert("RGBA")
pixdata = img.load()
# Clean the background noise, if color != black, then set to white.
# img.size is a (width, height) tuple, so index it for each loop.
for y in range(img.size[1]):
    for x in range(img.size[0]):
        if pixdata[x, y] != (0, 0, 0, 255):
            pixdata[x, y] = (255, 255, 255, 255)
img.save('3.gif', 'GIF')
# Make the image bigger (needed for OCR)
im_orig = Image.open('3.gif')
big = im_orig.resize((116, 56), Image.NEAREST)
ext = ".tif"
big.save("4" + ext)
# Perform OCR using the pytesser library
from pytesser import *
image = Image.open('4.tif')
print(image_to_string(image))
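The script above round-trips through intermediate files (1.gif to 4.tif). As a sketch only, the same cleaning steps can be chained in memory. The function name clean_captcha and its contrast and scale defaults are my own assumptions, not part of the original script, and point() stands in for the per-pixel loop.

```python
from PIL import Image, ImageEnhance


def clean_captcha(img, contrast=2.0, scale=2):
    """Return a black-on-white, enlarged copy of `img` ready for OCR.

    `contrast` and `scale` are assumed defaults; tune them per CAPTCHA.
    """
    # Step 1: grayscale
    gray = img.convert('L')
    # Step 2: boost the contrast so the glyphs become pure black
    boosted = ImageEnhance.Contrast(gray).enhance(contrast)
    # Step 3: everything that is not pure black becomes white
    cleaned = boosted.point(lambda v: 0 if v == 0 else 255)
    # Enlarge for OCR; NEAREST keeps the edges hard
    width, height = cleaned.size
    return cleaned.resize((width * scale, height * scale), Image.NEAREST)
```

The result of clean_captcha(Image.open('input2.gif')) can then be passed to pytesser's image_to_string() in place of the image loaded from 4.tif.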