Commit Graph

12 Commits

Author SHA1 Message Date
Joshua Tauberer
ce796116a5 Refactor how encodings are handled 2024-02-23 08:41:20 -05:00
Martijn van de Streek
674896d603 Decode byte strings in .msg files correctly
Non-Unicode strings in .msg files are encoded using an encoding that is
defined in a separate message property (PR_INTERNET_CPID for the body,
PR_MESSAGE_CODEPAGE for everything else).

The specification says that this property is required, however some real
world .msg files do not have it. This is why the decoding code has a
fallback to "cp1252" (Windows code page 1252, "Western Europe").

fixes #24
2024-02-23 08:01:46 +01:00
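
The fallback described in commit 674896d603 could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper name `decode_msg_string` and the `codepage` parameter are hypothetical, and it assumes the code page number read from the message property maps onto a Python codec named `cp<number>`.

```python
import codecs

FALLBACK_ENCODING = "cp1252"  # the fallback named in the commit message


def decode_msg_string(raw, codepage=None):
    """Decode a non-Unicode string using its code page (hypothetical helper).

    `codepage` stands in for the integer read from PR_INTERNET_CPID or
    PR_MESSAGE_CODEPAGE; it is None when the property is missing.
    """
    encoding = FALLBACK_ENCODING
    if codepage is not None:
        candidate = f"cp{codepage}"
        try:
            codecs.lookup(candidate)  # raises LookupError for unknown codecs
            encoding = candidate
        except LookupError:
            pass  # unknown code page: keep the fallback
    return raw.decode(encoding, errors="replace")
```

With no code page property at all, `decode_msg_string(b"caf\xe9")` decodes the bytes as cp1252.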
Martijn van de Streek
6f1a6e4b4a Use a raw string in re, so \n and \s work (#23)
Newer versions of Python complain about "\s" not being correct syntax
(SyntaxError during import); changing the string to a raw string fixes
the issue.

Co-authored-by: Martijn van de Streek <martijn.vandestreek@exxellence.nl>
2022-03-10 18:35:07 -05:00
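
As a small illustration of the raw-string fix in commit 6f1a6e4b4a: in a raw string the backslashes reach the `re` module untouched, so `\n` and `\s` are interpreted by the regex engine rather than by Python's string parser. The pattern and the `unfold` helper below are hypothetical, not taken from the project.

```python
import re

# Raw string: "\n" and "\s" are regex escapes here, not Python string escapes.
FOLDED_LINE = re.compile(r"\n\s+")


def unfold(header):
    """Join a folded header line back into one line (hypothetical helper)."""
    return FOLDED_LINE.sub(" ", header)
```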
Martijn van de Streek
5fa8976f86 Fix a crash when all 64 bits in timestamp are 1 (#22)
We've found some .msg files in the wild that have a CREATION_TIME that
has all 64 bits set: 9223372036854775807.

Adding this number of 100ns intervals to the base timestamp of
1601-01-01 results in a timestamp somewhere in the year 30828 which is
not supported by Python's datetime module, as datetime.MAXYEAR is
currently 9999.

Co-authored-by: Martijn van de Streek <martijn.vandestreek@exxellence.nl>
2022-02-10 11:41:08 -05:00
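
The overflow described in commit 5fa8976f86 can be reproduced with a short sketch. The function name and the choice to return `None` for out-of-range values are assumptions made for illustration; the commit message does not say how the project handles the case.

```python
from datetime import datetime, timedelta

FILETIME_EPOCH = datetime(1601, 1, 1)


def filetime_to_datetime(ticks):
    """Convert a count of 100 ns intervals since 1601-01-01 to a datetime.

    Returns None when the result falls outside the range that the
    datetime module supports (hypothetical behaviour).
    """
    try:
        # 10 ticks of 100 ns make one microsecond.
        return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)
    except OverflowError:
        return None
```

Here `filetime_to_datetime(9223372036854775807)` returns `None` instead of raising, because the addition would land past `datetime.MAXYEAR`.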
Martijn van de Streek
64c07db5b0 Use logging to log parse errors (#19)
Use `logging` to log parse errors, replacing print()
2021-07-21 10:05:27 -04:00
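
A minimal sketch of the pattern from commit 64c07db5b0, reporting a parse error through `logging` instead of `print()`. The function and the message text are hypothetical; only the use of the standard `logging` module comes from the commit.

```python
import logging

logger = logging.getLogger(__name__)


def parse_int_property(raw):
    """Parse an integer property, logging failures (hypothetical helper)."""
    try:
        return int(raw)
    except ValueError:
        # Callers decide via logging configuration whether this is shown.
        logger.warning("Could not parse property value %r", raw)
        return None
```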
Martijn van de Streek
560a513349 Skip attachments without "__properties_version1.0" streams (#18)
We've found that messages with RTF formatting that contain embedded images
contain attachments without a "__properties_version1.0" stream.

As the current code is built around the "__properties_version1.0" stream,
these are skipped for now.

These image attachments do contain streams named "Ole" and "MailStream"
that should help with decoding/parsing in the future, but that's a bigger
project.
2021-07-21 10:03:19 -04:00
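
The skip logic from commit 560a513349 might look something like the sketch below. It assumes, purely for illustration, that each attachment is represented as a mapping from stream name to stream contents; the real code reads these from the compound file and may be structured differently.

```python
PROPERTIES_STREAM = "__properties_version1.0"


def parsable_attachments(attachments):
    """Keep only attachments that carry a properties stream.

    `attachments` is a list of dicts mapping stream names to bytes
    (a hypothetical representation).
    """
    return [a for a in attachments if PROPERTIES_STREAM in a]
```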
Martijn van de Streek
a057080bad Fix removing of Content-Type header from transport headers (#16)
The fourth argument to `re.sub` is `count`, but `re.I` (a flag) was passed
instead.

Because of this, messages with a lower-case "content-type" header would
never have their content-type header removed, leading to parse errors.

By explicitly naming the parameter (`flags=`) to re.sub, the match
actually becomes case-insensitive.
2021-05-03 17:56:05 -04:00
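
The difference fixed in commit a057080bad can be shown with a short example. The pattern and header text here are illustrative assumptions, not the project's actual regex.

```python
import re

PATTERN = r"Content-Type: .*(\n\s.*)*\n"
headers = "content-type: text/plain\nSubject: hi\n"

# Passing re.I as the fourth positional argument sets `count`, not `flags`.
# It is equivalent to the call below, so matching stays case-sensitive.
buggy = re.sub(PATTERN, "", headers, count=int(re.I))

# Naming the parameter makes the match case-insensitive as intended.
fixed = re.sub(PATTERN, "", headers, flags=re.I)
```

In this sketch `buggy` is left unchanged because the lower-case header never matches, while `fixed` has the header removed.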
Manabu Niseki
7f80b8e6bc Improve attachment filename normalization (#14)
Use `os.path.basename` instead of `urllib.parse.quote_plus` to improve filename normalization.
2020-07-05 18:47:28 -04:00
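
To illustrate the change in commit 7f80b8e6bc, the two functions treat a filename containing a path quite differently. The sample filename is made up for the example.

```python
import os.path
from urllib.parse import quote_plus

name = "reports/Q1 summary.pdf"

# quote_plus keeps the directory part and percent-encodes the separator.
quoted = quote_plus(name)

# basename drops the directory part and leaves the filename readable.
base = os.path.basename(name)
```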
Alfredo
73fac36c80 Check for ATTACH_LONG_FILENAME before ATTACH_FILENAME (#7) 2019-05-22 06:21:36 -04:00
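
The lookup order from commit 73fac36c80 could be sketched as below, assuming for illustration that the attachment properties are held in a dict keyed by property name (the real keys and container may differ).

```python
def attachment_filename(props):
    """Prefer the long filename, then the short one (hypothetical helper)."""
    for key in ("ATTACH_LONG_FILENAME", "ATTACH_FILENAME"):
        if key in props:
            return props[key]
    return None
```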
Alfredo
eee84c759f Check for key in props (#8) 2019-05-22 06:20:10 -04:00
Joshua Tauberer
4779154c8c URL-encode attachment filenames to avoid a "recursion depth exceeded" error when the message is converted to bytes 2018-03-16 17:35:24 -04:00
Joshua Tauberer
3f72102e4b initial commit 2018-03-14 16:24:47 -04:00