
IMAPClient and fetching body parts

Fetching IMAP server body parts with Python and IMAPClient is worth it. The reduction in download times is significant.

26 June 2020 Updated 21 July 2020
In Email
https://unsplash.com/@tobiastu

I decided to temporarily shift focus from developing the software for my CMS / Blog to a smaller project. The main reason is that I hoped to learn new, useful things about Python.

I always wanted to have my own IMAP client software. My choice was probably also heavily influenced by some annoyance with Dekko2, the IMAP client for Ubuntu Touch, the OS of my mobile phone. I know that I should be happy that Dekko2 exists, and I am, really: it runs fine on my OnePlus One. But some actions are slow or do not deliver the requested result. For example, searching is slow, and sometimes the results appear to be incomplete.

Dekko2 is a nice piece of software. It saves bandwidth and space on your phone, but personally I do not care about this. Mobile subscriptions are getting cheaper and phone memory is often 8 GB minimum.

My approach

Write a basic IMAP client. Store the email HEADER data, the BODY TEXT (text/plain) and the BODY HTML (text/html) in a local database. Then we have most of the data we need on our PC or phone. Attachments and inline images are only downloaded on request.

SQLite is the way to go of course: it is well supported and stable (but never use it in a high performance multi-threaded application). Searching is fast and does not take any bandwidth.

I must say that I was also fascinated by the statement on the SQLite website that storing data in a BLOB was faster than storing it as a file. Well, at least as long as the size is not too big. Most emails I receive are below 20-30 KB so this fits the rule.

I decided to use the IMAPClient package to make my life a bit easier. It is very easy to use ... when you start.

How much data are we talking about?

I have an email account with an INBOX of some 5000 messages. With the IMAPClient package it is easy to download the ENVELOPE. This contains information like:

  • Date and time
  • Subject
  • Email addresses: to, from, cc, bcc, etc.

When we fetch the ENVELOPE we can fetch the BODYSTRUCTURE as well. This is essential as we need the BODYSTRUCTURE to download the BODY TEXT and BODY HTML parts. We need these parts for searching.

I do not know about you, but I do not care about images. If we need an image we can always download it later. The same is true for attachments.

I decided to store the BODYSTRUCTURE with the message and decode it later, when I request the download of the required parts. So loading messages consists of two stages.

In the first stage we download the ENVELOPE and BODYSTRUCTURE. The ENVELOPE is decoded into date, subject, email addresses and then all this information is stored. After this stage we have a nice list of messages to show, but without any body parts. In the second stage we decode the BODYSTRUCTURE, download the BODY TEXT and BODY HTML, and store this information in the database.
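A minimal sketch of stage one, assuming `client` is a logged-in IMAPClient instance with a folder selected. The helper names (`to_text`, `format_address`, `fetch_stage_one`) and the layout of the returned dictionaries are my own illustration, not the code of the actual client. It relies on IMAPClient returning a dictionary keyed by uid, with the items under the byte keys `b"ENVELOPE"` and `b"BODYSTRUCTURE"`.

```python
def to_text(value):
    """Decode bytes from the server to str; leave None and str untouched.

    MIME encoded-words in subjects (=?utf-8?...?=) are NOT decoded here.
    """
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return value


def format_address(address):
    """Turn an address with .mailbox and .host into 'user@example.org'."""
    return "{}@{}".format(to_text(address.mailbox), to_text(address.host))


def fetch_stage_one(client, uids):
    """Fetch ENVELOPE and BODYSTRUCTURE and return one dict per message."""
    rows = []
    response = client.fetch(uids, ["ENVELOPE", "BODYSTRUCTURE"])
    for uid, data in response.items():
        envelope = data[b"ENVELOPE"]
        rows.append({
            "uid": uid,
            "date": envelope.date,
            "subject": to_text(envelope.subject),
            # Address fields are None when the header is absent.
            "from": [format_address(a) for a in envelope.from_ or ()],
            "to": [format_address(a) for a in envelope.to or ()],
            "cc": [format_address(a) for a in envelope.cc or ()],
            # Stored undecoded; it is only flattened in stage two.
            "bodystructure": data[b"BODYSTRUCTURE"],
        })
    return rows
```

In a real run the uids would come from something like `client.search()`, and for 5000 messages it is probably wise to pass them to `fetch_stage_one` in chunks rather than all at once.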

Time for some tests. 

+-------+---------------------------------+---------+-----------+--------------+---------------+
| Stage | Action                          | DB size | Debug log | Time on PC   | Time on phone |
+-------+---------------------------------+---------+-----------+--------------+---------------+
|   1   | Download and store              | 25 MB   | on        | 1.5 minutes  | 3 minutes     |
|       | ENVELOPE and BODYSTRUCTURE data |         | off       | 0.45 seconds | 3 minutes     |
+-------+---------------------------------+---------+-----------+--------------+---------------+
|   2   | Download and store              | 182 MB  | on        | 5 minutes    | 15 minutes    |
|       | BODY TEXT and BODY HTML         |         | off       | 4 minutes    | 12 minutes    |
+-------+---------------------------------+---------+-----------+--------------+---------------+

There you have it. My PC is an i7 and gets its data via an internet connection. My phone is a OnePlus One running Ubuntu Touch and gets its data via 3G/4G. Of course there are many factors affecting these times so this is just an indication.

There are two times per item. The first is with debug messages on (the log file is 1.4 GB), the second is without debug messages.

The times are for loading an empty database with data for 5000 messages. In a typical situation you have all messages already and only get new ones (or delete removed ones).

I did not download attachments, including attached (forwarded) messages. What is the size difference with a full download? My mail server uses Dovecot. The Maildir directory for my INBOX:

  • new: 5.7M
  • cur: 681M

This is a total of about 690 MB. This means we did not download 690 MB - 170 MB = 520 MB. In other words, by downloading 170/690 = 25% of all data, we can search by email address and search the BODY TEXT and BODY HTML by querying our database, without connecting to the IMAP server.

SQLite and optimizations

To store the data I use the following tables:

  • imap_server
  • imap_server_folder
  • imap_mail_msg
  • imap_mail_msg_addr
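To give an idea of how local searching can work with two of these tables, here is a small sketch. The table names are the ones listed above, but every column name and the `search_messages` helper are assumptions made for this example; the real tables hold more fields.

```python
import sqlite3

# Hypothetical, heavily reduced schema for two of the tables.
SCHEMA = """
CREATE TABLE imap_mail_msg (
    id        INTEGER PRIMARY KEY,
    uid       INTEGER NOT NULL,
    subject   TEXT,
    body_text TEXT,
    body_html TEXT
);
CREATE TABLE imap_mail_msg_addr (
    id     INTEGER PRIMARY KEY,
    msg_id INTEGER NOT NULL REFERENCES imap_mail_msg(id),
    field  TEXT NOT NULL,   -- 'from', 'to', 'cc', ...
    email  TEXT NOT NULL
);
CREATE INDEX idx_imap_mail_msg_addr_email ON imap_mail_msg_addr(email);
"""


def search_messages(conn, term):
    """Return (uid, subject) of messages whose addresses or bodies contain term."""
    pattern = "%{}%".format(term)
    return conn.execute(
        """
        SELECT DISTINCT m.uid, m.subject
        FROM imap_mail_msg AS m
        LEFT JOIN imap_mail_msg_addr AS a ON a.msg_id = m.id
        WHERE a.email LIKE ? OR m.body_text LIKE ? OR m.body_html LIKE ?
        ORDER BY m.uid
        """,
        (pattern, pattern, pattern),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO imap_mail_msg (id, uid, subject, body_text) VALUES (?, ?, ?, ?)",
    (1, 101, "Invoice", "Please find the invoice attached"),
)
conn.execute(
    "INSERT INTO imap_mail_msg_addr (msg_id, field, email) VALUES (?, ?, ?)",
    (1, "from", "billing@example.org"),
)
conn.commit()
print(search_messages(conn, "billing"))
```

A plain LIKE scan is the simplest option. For larger mailboxes a full-text index might be worth a look, but I did not test that.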

Performance optimizations with SQLite are easy when you use executemany. I have a table for the messages and a table for the email addresses, and I only commit after every 100 messages. The difference in time was some 50%.
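The batching could look roughly like the sketch below. The tables `msg` and `msg_addr`, their columns and the function names are simplified stand-ins invented for this example. The point is only the pattern: collect rows, write them with executemany, and commit once per 100 messages.

```python
import sqlite3

BATCH_SIZE = 100  # commit after this many messages


def flush(conn, msg_rows, addr_rows):
    """Write the collected rows in one transaction and empty the buffers."""
    if not msg_rows and not addr_rows:
        return
    conn.executemany("INSERT INTO msg (uid, subject) VALUES (?, ?)", msg_rows)
    conn.executemany(
        "INSERT INTO msg_addr (msg_uid, field, email) VALUES (?, ?, ?)", addr_rows
    )
    conn.commit()
    msg_rows.clear()
    addr_rows.clear()


def store_messages(conn, messages, batch_size=BATCH_SIZE):
    """Insert messages and their addresses, committing once per batch."""
    msg_rows, addr_rows = [], []
    for count, msg in enumerate(messages, start=1):
        msg_rows.append((msg["uid"], msg["subject"]))
        addr_rows.extend(
            (msg["uid"], field, email) for field, email in msg["addresses"]
        )
        if count % batch_size == 0:
            flush(conn, msg_rows, addr_rows)
    flush(conn, msg_rows, addr_rows)  # write the last, partial batch


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE msg (uid INTEGER PRIMARY KEY, subject TEXT)")
conn.execute("CREATE TABLE msg_addr (msg_uid INTEGER, field TEXT, email TEXT)")

messages = [
    {"uid": n, "subject": "Message {}".format(n),
     "addresses": [("from", "a@example.org"), ("to", "b@example.org")]}
    for n in range(1, 251)
]
store_messages(conn, messages)
```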

But the biggest performance gain I achieved was by limiting the number of email addresses. I am a member of some email lists and some lists send a message to over 400 visible email addresses, in the to-field or cc-field. I decided to store the ENVELOPE locally and limit the number of email addresses per to, cc, bcc, etc. to 20. If I ever want to see more, which is unlikely, I can always read the ENVELOPE again and store and show them. For example, when showing the email, we can show the first 20 email addresses with a link to the 380 more. Pretty useless in most cases.
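The limit itself is a one-liner. A sketch, where the function name and the returned pair are my own choice for this example:

```python
MAX_ADDRESSES_PER_FIELD = 20


def limit_addresses(addresses, limit=MAX_ADDRESSES_PER_FIELD):
    """Return the first `limit` addresses and the number that was left out."""
    kept = list(addresses[:limit])
    omitted = max(len(addresses) - limit, 0)
    return kept, omitted


to_field = ["member{}@example.org".format(n) for n in range(400)]
kept, omitted = limit_addresses(to_field)
print(len(kept), omitted)  # prints: 20 380
```

The `omitted` count is what the "380 more" link would show.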

I did not optimize the UPDATE operation when storing the BODY TEXT and BODY HTML. I read somewhere that this would not change much but every second counts so I will investigate this later.

IMAPClient and fetching body parts BODY TEXT (text/plain) and BODY HTML (text/html)

Using IMAPClient to fetch the uids of all messages is easy. But fetching body parts is a challenge. The returned BODYSTRUCTURE is converted by IMAPClient into nested tuples and lists. To fetch a body part we need its body number, and this number is not in the BODYSTRUCTURE.

This means we must flatten the BODYSTRUCTURE ourselves and assign body numbers. I found some code on the internet that was very helpful, see links below: 'simple modern command line client'.
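Here is a minimal sketch of such a flatten operation. It is not the code from the linked project, and it assumes the layout I see in IMAPClient responses: a multipart node is a tuple whose first element is a list of child parts, and a single part is a tuple starting with type, subtype, parameters, id, description, encoding and size. The numbering follows the IMAP rules: children of a multipart are numbered 1, 2, ... and nested parts get dotted numbers such as 1.2. Attached messages (message/rfc822) are not descended into.

```python
def as_text(value):
    """Decode bytes to str; leave str untouched."""
    return value.decode("ascii", errors="replace") if isinstance(value, bytes) else value


def flatten_bodystructure(body, prefix=""):
    """Return one dict per leaf part, each with its IMAP body number."""
    if isinstance(body[0], list):
        # Multipart: body[0] holds the children, numbered from 1.
        parts = []
        for index, child in enumerate(body[0], start=1):
            number = "{}.{}".format(prefix, index) if prefix else str(index)
            parts.extend(flatten_bodystructure(child, number))
        return parts
    # Leaf part. A message that is not multipart has the single part "1".
    return [{
        "body_number": prefix or "1",
        "content_type": "{}/{}".format(as_text(body[0]), as_text(body[1])).lower(),
        "encoding": as_text(body[5]).lower() if body[5] else None,
        "size": body[6],
    }]


# Hand-written sample: multipart/mixed holding a multipart/alternative and an image.
sample = (
    [
        (
            [
                (b"TEXT", b"PLAIN", (b"charset", b"utf-8"), None, None, b"7BIT", 120, 4),
                (b"TEXT", b"HTML", (b"charset", b"utf-8"), None, None, b"QUOTED-PRINTABLE", 480, 9),
            ],
            b"ALTERNATIVE",
        ),
        (b"IMAGE", b"PNG", (b"name", b"logo.png"), None, None, b"BASE64", 2048),
    ],
    b"MIXED",
)

for part in flatten_bodystructure(sample):
    print(part["body_number"], part["content_type"])
# prints:
# 1.1 text/plain
# 1.2 text/html
# 2 image/png
```

With these numbers the text parts of the sample can be requested as BODY[1.1] and BODY[1.2].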

After the flatten operation, we must select the BODY TEXT (text/plain) and BODY HTML (text/html) parts for download. I decided to create a body parts class that holds all relevant data to download a part. When updating the table imap_mail_msg with the fetched BODY TEXT and BODY HTML, an instance of this class is also pickled and stored in table imap_mail_msg in case we want to download attachments later. Here is the class I use:

class BodystructurePart:

    def __init__(self, 
        part=None,
        body_number=None,
        content_type=None,
        charset=None,
        size=None,
        decoder=None,
        is_inline_or_attachment=None,
        is_inline=None,
        inline_or_attachment_info=None,
        is_attached_message=None
        ):
        self.part = part
        self.body_number = body_number
        self.content_type = content_type
        self.charset = charset
        self.size = size
        self.decoder = decoder
        self.is_inline_or_attachment = is_inline_or_attachment
        self.is_inline = is_inline
        self.inline_or_attachment_info = inline_or_attachment_info
        self.is_attached_message = is_attached_message

The inline_or_attachment_info attribute is a dictionary with properties of the attachment. I have not yet looked into decoding forwarded messages.

Downloading and decoding body parts

This worked fine for many messages but for 20 of the 5000 (0.4%) there was a decoding exception. For example, one message said that the charset was 'us-ascii' but decoding with this charset caused the following error:

'ascii' codec can't decode byte 0xfb in position 6494: ordinal not in range(128)

Fortunately there is a package called chardet that tries to detect the encoding of a string. It suggested the encoding charset was 'ISO-8859-1' and with this charset decoding gave no errors. Another message said the charset was 'utf-8' but gave decoding error:

'utf-8' codec can't decode byte 0xe8 in position 2773: invalid continuation byte

Chardet suggested the charset was 'Windows-1252' and decoding with this charset gave no errors. I manually checked the decoded messages and they looked fine. This is the relevant part of the code:

    import base64
    import logging
    import quopri

    import chardet

    # Fragment from inside a loop over the flattened body parts of one message.
    # imap_server, msg_uid, fname and dbg are defined by the enclosing code.
    if bodystructure_part.content_type in ['text/plain', 'text/html']:

        # fetch only this body part, e.g. BODY[1.2]
        BODY = 'BODY[{}]'.format(bodystructure_part.body_number)
        fetch_result = imap_server.fetch([msg_uid], [BODY])

        if msg_uid not in fetch_result:
            if dbg: logging.error(fname + ': msg_uid = {} not in fetch_result'.format(msg_uid))
            continue

        if BODY not in fetch_result[msg_uid]:
            if dbg: logging.error(fname + ': BODY not in fetch_result[msg_uid = {}]'.format(msg_uid))
            continue

        data = fetch_result[msg_uid][BODY]

        # undo the content transfer encoding
        if bodystructure_part.decoder == b'base64':
            decoded_data = base64.b64decode(data)
        elif bodystructure_part.decoder == b'quoted-printable':
            decoded_data = quopri.decodestring(data)
        else:
            decoded_data = data

        # first try the charset from the BODYSTRUCTURE, this may fail if the charset is wrong
        text = None
        is_decoded = False
        try:
            text = decoded_data.decode(bodystructure_part.charset)
            is_decoded = True
        except Exception as e:
            logging.error(fname + ': msg_uid = {}, problem decoding decoded_data with bodystructure_part.charset = {}, e = {}, decoded_data = {}'.format(msg_uid, bodystructure_part.charset, e, decoded_data))

        if not is_decoded:
            # fall back to the charset detected by chardet
            r = chardet.detect(decoded_data)
            charset = r['encoding']
            try:
                text = decoded_data.decode(charset)
                is_decoded = True
            except Exception as e:
                logging.error(fname + ': msg_uid = {}, problem decoding decoded_data with detected charset = {}, e = {}, decoded_data = {}'.format(msg_uid, charset, e, decoded_data))

        if not is_decoded:
            logging.error(fname + ': msg_uid = {}, cannot decode'.format(msg_uid))
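The same two-step idea can be packed into small functions that are easy to test without an IMAP server. This is a sketch, not the code I run: the function names are invented, the charset is assumed to be a str, and the detector is passed in as a parameter. In real use that parameter would be `chardet.detect`; the stand-in below only mimics its return value.

```python
import base64
import quopri


def decode_transfer_encoding(data, encoding):
    """Undo base64 or quoted-printable; return other data unchanged."""
    encoding = (encoding or b"").lower()
    if encoding == b"base64":
        return base64.b64decode(data)
    if encoding == b"quoted-printable":
        return quopri.decodestring(data)
    return data


def try_decode(data, charset):
    """Return the decoded text, or None if the charset is missing or wrong."""
    if not charset:
        return None
    try:
        return data.decode(charset)
    except (LookupError, UnicodeDecodeError):
        return None


def decode_text(data, declared_charset, detect):
    """Try the declared charset first, then the detected one.

    Returns (text, charset_used), or (None, None) if both attempts fail.
    """
    text = try_decode(data, declared_charset)
    if text is not None:
        return text, declared_charset
    detected = detect(data).get("encoding")
    text = try_decode(data, detected)
    if text is not None:
        return text, detected
    return None, None


def fake_detect(data):
    """Stand-in for chardet.detect, hard-coded for this example."""
    return {"encoding": "Windows-1252", "confidence": 0.73}


raw = decode_transfer_encoding(b"caf=E8", b"QUOTED-PRINTABLE")
print(decode_text(raw, "utf-8", fake_detect))
```

Returning the charset that was actually used makes it possible to log how often the declared charset was wrong.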

How do we know we can decode all mail?

Here we touch on a big problem when developing an email client. My code was capable of decoding all 5000 messages without errors. But will it also work for message 5001? And there is an even bigger problem. If the message is decoded without errors, then how do we know that the decoded message is correct?

There are a few ways to solve this. One way is to create a huge test set of email messages and manually approve the decoded message parts. But a certainly faster way is to use an existing proven email client, feed it with our emails, and compare the decoded message parts with our results.
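As an illustration of the comparison idea, the sketch below uses the parser from Python's standard email package as the reference instead of a real email client. That is a substitution I make only to keep the example self-contained, and I have not run my 5000 messages through it. The function names are invented. The input is the full raw message plus the text parts our own code decoded.

```python
import email
from email import policy
from email.message import EmailMessage


def normalise(text):
    """Ignore differences in line endings and trailing whitespace."""
    lines = text.replace("\r\n", "\n").strip().split("\n")
    return "\n".join(line.rstrip() for line in lines)


def reference_parts(raw_message):
    """Decode text/plain and text/html with the standard library parser."""
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    parts = {}
    for subtype in ("plain", "html"):
        part = msg.get_body(preferencelist=(subtype,))
        if part is not None:
            parts["text/" + subtype] = part.get_content()
    return parts


def find_mismatches(our_parts, raw_message):
    """Return the content types where our text differs from the reference."""
    reference = reference_parts(raw_message)
    return sorted(
        content_type
        for content_type, expected in reference.items()
        if normalise(our_parts.get(content_type) or "") != normalise(expected)
    )


# Build a small test message instead of fetching one from a server.
message = EmailMessage()
message["Subject"] = "Test"
message.set_content("Héllo wörld")
message.add_alternative("<p>Héllo wörld</p>", subtype="html")
raw_message = message.as_bytes()

ours = {"text/plain": "Héllo wörld", "text/html": "<p>Héllo wörld</p>"}
print(find_mismatches(ours, raw_message))  # prints: []
```

A check like this needs the full message, so it is something to run over a test set once, not on every download.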

Viewing the emails

Flask and Bootstrap are the perfect tools for this. In a few hours I built a frontend that shows a single page consisting of two parts. The upper part is the list of emails, the bottom part is an IFRAME that shows the email BODY TEXT or BODY HTML.

Summary

Most examples on the internet only deal with downloading the full message and then decoding it. Fetching and decoding IMAP body parts is a challenge because we must convert the BODYSTRUCTURE and then flatten it to get the numbers of the body parts. The IMAPClient package certainly helps but lacks good examples. To be sure that our decoding works we need a method to check it. Finally, by storing the ENVELOPE, BODY TEXT and BODY HTML data in a SQLite database I have almost all the information I want, and searching is very fast because it does not have to interact with the IMAP server.

Links / credits

IMAPClient
https://imapclient.readthedocs.io/en/2.1.0/

simple modern command line client
https://github.com/christianwengert/mail
