17 Commits

Author SHA1 Message Date
Joshua Tauberer 4104dc937d Use rtfparse to extract HTML message bodies from RTF containers and create multipart/alternative messages if both plain text and HTML are available
Also fixes #20.
2024-02-23 09:57:15 -05:00
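As an illustration of the multipart/alternative behaviour described in this commit, here is a minimal sketch using the standard library's `email.message.EmailMessage`. The function name `build_body` and its arguments are hypothetical and are not taken from the repository; the RTF extraction step with rtfparse is not shown.

```python
from email.message import EmailMessage


def build_body(plain_text, html):
    """Hypothetical helper: build a message body from optional plain-text and HTML parts."""
    msg = EmailMessage()
    if plain_text is not None:
        # The plain-text part goes first so it is the fallback representation.
        msg.set_content(plain_text)
        if html is not None:
            # Adding an alternative converts the message to multipart/alternative.
            msg.add_alternative(html, subtype="html")
    elif html is not None:
        # Only HTML is available: emit a single text/html part.
        msg.set_content(html, subtype="html")
    return msg
```

With both bodies present the result is a `multipart/alternative` container; with only one, the message stays a single part.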
Joshua Tauberer 6fc382e9a6 Merge pull request #25 from MartijnVdS/string8_encoding
Decode byte strings in .msg files correctly
2024-02-23 09:30:54 -05:00
Joshua Tauberer ce796116a5 Refactor how encodings are handled 2024-02-23 08:41:20 -05:00
Martijn van de Streek 674896d603 Decode byte strings in .msg files correctly
Non-Unicode strings in .msg files are encoded using an encoding that is
defined in a separate message property (PR_INTERNET_CPID for the body,
PR_MESSAGE_CODEPAGE for everything else).

The specification says that this property is required; however, some
real-world .msg files do not have it. This is why the decoding code has a
fallback to "cp1252" (Windows code page 1252, "Western Europe").

fixes #24
2024-02-23 08:01:46 +01:00
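The fallback described in this commit could look roughly like the sketch below. The names `codepage_to_encoding` and `decode_string8` are hypothetical, as is the small table of special cases; the sketch assumes the code page arrives as an integer (or `None` when the property is missing) and is not the repository's actual implementation.

```python
import codecs

FALLBACK_ENCODING = "cp1252"  # fallback named in the commit message


def codepage_to_encoding(codepage):
    """Map a Windows code page number to a Python codec name (hypothetical helper)."""
    if codepage is None:
        # Property missing from the .msg file: use the fallback.
        return FALLBACK_ENCODING
    # A few code pages whose Python codec name is not simply "cp<number>".
    special = {65001: "utf-8", 20127: "ascii", 28591: "latin-1"}
    name = special.get(codepage, "cp%d" % codepage)
    try:
        codecs.lookup(name)
    except LookupError:
        # Unknown code page: fall back rather than fail.
        return FALLBACK_ENCODING
    return name


def decode_string8(raw, codepage=None):
    """Decode a non-Unicode byte string using the message's code page."""
    return raw.decode(codepage_to_encoding(codepage), errors="replace")
```

Using `errors="replace"` is a design choice of this sketch: a damaged string yields replacement characters instead of aborting the whole conversion.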
Martijn van de Streek 6f1a6e4b4a Use a raw string in re, so \n and \s work (#23)
Newer versions of Python complain about "\s" not being correct syntax
(SyntaxError during import); changing the string to a raw string fixes
the issue.

Co-authored-by: Martijn van de Streek <martijn.vandestreek@exxellence.nl>
2022-03-10 18:35:07 -05:00
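A small sketch of the raw-string change described above. The pattern and the helper `collapse_whitespace` are illustrative only and are not the pattern used in the repository.

```python
import re

# In a normal string literal "\s" is an invalid escape sequence, which newer
# Python versions warn about. A raw string passes the backslash through to
# the regex engine unchanged.
WHITESPACE = re.compile(r"\s+")


def collapse_whitespace(text):
    """Replace each run of whitespace (including newlines) with one space."""
    return WHITESPACE.sub(" ", text)
```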
Martijn van de Streek 5fa8976f86 Fix a crash when all 64 bits in timestamp are 1 (#22)
We've found some .msg files in the wild that have a CREATION_TIME that
has all 64 bits set: 9223372036854775807.

Adding this number of 100ns intervals to the base timestamp of
1601-01-01 results in a timestamp somewhere in the year 30828 which is
not supported by Python's datetime module, as datetime.MAXYEAR is
currently 9999.

Co-authored-by: Martijn van de Streek <martijn.vandestreek@exxellence.nl>
2022-02-10 11:41:08 -05:00
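The overflow described in this commit can be reproduced and guarded against as in the sketch below. The function name `filetime_to_datetime` and the choice to return `None` for out-of-range values are assumptions; the commit message does not say how the repository handles the case.

```python
from datetime import datetime, timedelta

# Timestamps in .msg files count 100 ns intervals since 1601-01-01.
FILETIME_EPOCH = datetime(1601, 1, 1)


def filetime_to_datetime(value):
    """Convert a count of 100 ns intervals to a datetime, or None if out of range."""
    try:
        # 10 intervals of 100 ns make one microsecond.
        return FILETIME_EPOCH + timedelta(microseconds=value // 10)
    except OverflowError:
        # The result would fall past datetime.MAXYEAR (9999).
        return None
```

For the value quoted in the commit message, `filetime_to_datetime(9223372036854775807)` returns `None` instead of raising.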
Martijn van de Streek 64c07db5b0 Use logging to log parse errors (#19)
Use `logging` to log parse errors, replacing print()
2021-07-21 10:05:27 -04:00
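A minimal sketch of reporting parse errors through `logging` rather than `print()`. The function `parse_property` and its arguments are hypothetical and only illustrate the pattern.

```python
import logging

# A module-level logger lets the calling application decide where messages go.
logger = logging.getLogger(__name__)


def parse_property(tag, raw, parser):
    """Apply parser to raw; log the failure and return None if parsing fails."""
    try:
        return parser(raw)
    except Exception as exc:
        logger.error("Error parsing property %s: %s", tag, exc)
        return None
```

A caller can then silence or redirect these messages with the usual `logging` configuration, which is not possible with bare `print()` calls.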
Martijn van de Streek 560a513349 Skip attachments without "__properties_version1.0" streams (#18)
We've found that messages with RTF formatting that contain embedded images
contain attachments without a "__properties_version1.0" stream.

As the current code is built around the "__properties_version1.0" stream,
these are skipped for now.

These image attachments do contain streams named "Ole" and "MailStream"
that should help with decoding/parsing in the future, but that's a bigger
project.
2021-07-21 10:03:19 -04:00
Martijn van de Streek a057080bad Fix removing of Content-Type header from transport headers (#16)
The fourth argument to `re.sub` is `count`, but `re.I` (a flag) was passed
instead.

Because of this, messages with a lower-case "content-type" header would
never have their content-type header removed, leading to parse errors.

By explicitly naming the parameter (`flags=`) to re.sub, the match
actually becomes case-insensitive.
2021-05-03 17:56:05 -04:00
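The difference between the two calls can be seen in the sketch below. The header text and the pattern are illustrative and not copied from the repository; `count=re.I` is written out explicitly because it is equivalent to passing `re.I` as the fourth positional argument.

```python
import re

HEADERS = "content-type: text/plain\nSubject: hi\n"
PATTERN = r"^Content-Type:.*\n"

# Bug: re.I lands in the count parameter, so it only limits the number of
# replacements and the match stays case-sensitive.
buggy = re.sub(PATTERN, "", HEADERS, count=re.I)

# Fix: naming the parameter makes the match case-insensitive.
fixed = re.sub(PATTERN, "", HEADERS, flags=re.I)
```

Here `buggy` is identical to the input, while `fixed` has the lower-case header line removed.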
Martijn van de Streek d9edd0d32f Make package "pip install"able (#15)
By specifying "py_modules" instead of "packages" in setup.py, the
single-file module is found and installed in site-packages correctly.
2020-10-22 10:39:37 -04:00
Manabu Niseki 7f80b8e6bc Improve attachment filename normalization (#14)
Use `os.path.basename` instead of `urllib.parse.quote_plus` to improve filename normalization.
2020-07-05 18:47:28 -04:00
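To illustrate the difference between the two normalizations named in this commit, here is a short comparison. The sample filename and the variable names are made up for the example.

```python
import os.path
from urllib.parse import quote_plus

name = "dir/sub/report 2020.pdf"

# quote_plus keeps every path component and percent-encodes the result.
quoted = quote_plus(name)

# basename drops any directory components and leaves the filename readable.
base = os.path.basename(name)
```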
Rodrigo Salvador d4a5944aba Include dependencies by requirements.txt (#10) 2019-09-16 19:49:04 -04:00
Alfredo 73fac36c80 Check for ATTACH_LONG_FILENAME before ATTACH_FILENAME (#7) 2019-05-22 06:21:36 -04:00
Alfredo eee84c759f Check for key in props (#8) 2019-05-22 06:20:10 -04:00
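The two changes above (preferring ATTACH_LONG_FILENAME and checking that a key exists before reading it) might combine as in this sketch. It assumes `props` is a plain dict keyed by property name; the function name and the default value are hypothetical.

```python
def attachment_filename(props, default="attachment"):
    """Return the long filename if present, then the short one, else a default."""
    for key in ("ATTACH_LONG_FILENAME", "ATTACH_FILENAME"):
        # Test for the key first so a missing property cannot raise KeyError.
        if key in props:
            return props[key]
    return default
```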
Jeff Kerr a8e1e8f064 Create LICENSE (#6) 2018-11-12 11:10:02 -05:00
Joshua Tauberer 4779154c8c urlencode attachment filenames to avoid a "recursion depth exceeded" error when the message is converted to bytes 2018-03-16 17:35:24 -04:00
Joshua Tauberer 3f72102e4b initial commit 2018-03-14 16:24:47 -04:00