Non-Unicode strings in .msg files are encoded using an encoding that is
defined in a separate message property (PR_INTERNET_CPID for the body,
PR_MESSAGE_CODEPAGE for everything else).
The specification says that this property is required, but some
real-world .msg files do not have it. This is why the decoding code has a
fallback to "cp1252" (Windows code page 1252, "Western Europe").
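
A minimal sketch of the fallback, not the actual implementation: the
helper name `decode_bytes` and the simplified "cp<number>" mapping from
Windows code page numbers to Python codec names are assumptions made for
illustration.

```python
import codecs

FALLBACK_ENCODING = "cp1252"  # Windows code page 1252, "Western Europe"


def decode_bytes(raw, codepage=None):
    """Decode a non-Unicode string, falling back to cp1252.

    `codepage` stands for the numeric value of PR_INTERNET_CPID or
    PR_MESSAGE_CODEPAGE, or None when the property is missing.
    """
    encoding = FALLBACK_ENCODING
    if codepage is not None:
        # Assumption: "cp<number>" is a good enough mapping for this sketch.
        candidate = "cp%d" % codepage
        try:
            codecs.lookup(candidate)
            encoding = candidate
        except LookupError:
            pass  # unknown code page: keep the fallback
    return raw.decode(encoding, errors="replace")
```

With the property missing, `decode_bytes(b"caf\xe9")` decodes the bytes
as cp1252.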
Fixes #24
Newer versions of Python complain about "\s" not being correct syntax
(SyntaxError during import); changing the string to a raw string fixes
the issue.
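
For illustration only (the pattern and the function name below are
hypothetical, not the ones from the affected module), the difference
looks roughly like this:

```python
import re

# In a normal string literal, "\s" is an invalid escape sequence that
# newer Python versions complain about. A raw string passes the
# backslash through to the regex engine unchanged.
WHITESPACE = re.compile(r"\s+")


def collapse_whitespace(text):
    """Replace every run of whitespace with a single space."""
    return WHITESPACE.sub(" ", text).strip()
```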
Co-authored-by: Martijn van de Streek <martijn.vandestreek@exxellence.nl>
We've found some .msg files in the wild that have a CREATION_TIME set to
the largest signed 64-bit value (0x7FFFFFFFFFFFFFFF): 9223372036854775807.
Adding this number of 100ns intervals to the base timestamp of
1601-01-01 results in a timestamp somewhere in the year 30828 which is
not supported by Python's datetime module, as datetime.MAXYEAR is
currently 9999.
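
A rough sketch of the conversion with a guard for this case. The
function name and the choice to return None for out-of-range values are
assumptions for illustration, not necessarily what the code does:

```python
from datetime import datetime, timedelta, timezone

# Timestamps count 100ns intervals since 1601-01-01 (UTC).
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_datetime(filetime):
    """Convert a count of 100ns intervals to a datetime, or None."""
    try:
        # 10 intervals of 100ns make up one microsecond.
        return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)
    except OverflowError:
        # The result would lie beyond datetime.MAXYEAR.
        return None
```

Here `filetime_to_datetime(9223372036854775807)` returns None instead of
raising.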
Co-authored-by: Martijn van de Streek <martijn.vandestreek@exxellence.nl>
We've found that messages with RTF formatting that contain embedded images
contain attachments without a "__properties_version1.0" stream.
As the current code is built around the "__properties_version1.0" stream,
attachments without it are skipped for now.
These image attachments do contain streams named "Ole" and "MailStream"
that should help with decoding/parsing in the future, but that's a bigger
project.
The fourth argument to `re.sub` is `count`, but `re.I` (a flag) was passed
instead.
Because of this, messages with a lower-case "content-type" header would
never have their content-type header removed, leading to parse errors.
By explicitly naming the parameter (`flags=`) to re.sub, the match
actually becomes case-insensitive.
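
A small reproduction of the problem. The pattern and function names are
made up for this sketch; the old call is written with an explicit
`count=` to show what the positional argument actually meant:

```python
import re

# Hypothetical pattern; the real one in the code may differ.
CONTENT_TYPE_RE = r"Content-Type:[^\r\n]*\r?\n"


def strip_content_type_old(headers):
    # Equivalent to passing re.I as the fourth positional argument:
    # re.I has the integer value 2, so this means "replace at most 2
    # matches", and the match stays case-sensitive.
    return re.sub(CONTENT_TYPE_RE, "", headers, count=re.I)


def strip_content_type_new(headers):
    # Naming the parameter makes the match case-insensitive.
    return re.sub(CONTENT_TYPE_RE, "", headers, flags=re.I)
```

Given the input "content-type: text/plain\r\nSubject: hi\r\n", the old
variant returns the headers unchanged, while the new one leaves only the
Subject line.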