XLSX Invalid Characters Fix

Question

Look at this code:

text
        def export_results_to_xlsx(self, results, file_path):
            """Export analysis results to XLSX."""
            try:
                wb = openpyxl.Workbook()
                ws = wb.active
                ws.title = "Analysis Results"

                # Add data
                ws.append(["Digital Forensics dconf Analysis Results"])
                ws.append([])

                # Analysis Results
                ws.append(["Analysis Results"])
                analysis = results["Analysis Results"]
                for key in ["Desktop Environment", "Hidden Files Visible", "Value Bytes"]:
                    value = analysis[key]
                    ws.append([key, str(value)])
                ws.append([])

                # File Metadata
                ws.append(["File Metadata"])
                metadata = results["File Metadata"]
                for key, value in metadata.items():
                    ws.append([key, str(value)])
                ws.append([])

                # System Info
                ws.append(["System Information"])
                sys_info = results["System Info"]
                for key, value in sys_info.items():
                    ws.append([key, str(value)])
                ws.append([])

                # File Hashes
                ws.append(["File Hashes"])
                hashes = results["File Hashes"]
                for key, value in hashes.items():
                    ws.append([key, str(value)])
                ws.append([])

                # Keys
                ws_keys = wb.create_sheet("Keys")
                ws_keys.append(["Key", "Value"])
                for key, value in results["All Keys"].items():
                    ws_keys.append([key, str(value)])

                # Hex Dump
                ws_hex = wb.create_sheet("Hex Dump")
                ws_hex.append(["Offset", "Hex Values", "ASCII"])
                for line in results["Combined Hex and ASCII Dump"]:
                    if len(line) > 9:
                        offset = line[:8]
                        separator_pos = line.find("|")
                        if separator_pos != -1:
                            hex_part = line[9:separator_pos].strip()
                            ascii_part = line[separator_pos + 1:].strip()
                            ws_hex.append([offset, hex_part, ascii_part])

                # Logged Output
                ws_log = wb.create_sheet("Logged Output")
                ws_log.append(["Timestamp", "Event"])
                ws_log.append([results["Current Timestamp"], "Analysis Started"])
                ws_log.append([results["Current Timestamp"], f"Analyst: {results['Analyst']}"])
                ws_log.append([results["Current Timestamp"], f"Desktop Environment: {analysis['Desktop Environment']}"])
                ws_log.append([results["Current Timestamp"], f"Hidden Files Visible: {analysis['Hidden Files Visible']}"])

                wb.save(file_path)
                return True

            except Exception as e:
                print(f"Error exporting to XLSX: {e}")
                return False

    app = QtWidgets.QApplication(sys.argv)
    gui = DconfAnalysisToolGUI()
    gui.show()
    sys.exit(app.exec())
    pass
else:
    print_banner()
    print(
        f"{Fore.YELLOW}{' Digital Forensics dconf Analysis Tool ':─^80}{Style.RESET_ALL}"
    )
    print(
        "Please select the dconf folder (usually located at ~/.config/dconf on Linux)"
    )

    user_file = select_dconf_file()
    if not user_file:
        print("No file selected. Exiting.")
        sys.exit(1)
    print(f"Analysing file: {user_file}")
    process_file_and_output(user_file)

What is causing it so that when I open the Excel XLSX that it generates it gives me the error "We found a problem with some content in XXXX.xlsx'. Do you want us to try to recover as much as we can? If you trust the source of this workbook, click Yes,"

Answer 1

In almost all cases like this, the culprit is that Excel’s XML-based XLSX files cannot handle certain raw (non-printable) characters. Inside your loop for the “Hex Dump” sheet, you’re feeding potentially invalid XML characters into openpyxl (particularly in the ASCII portion). Those characters might include things like control characters (ASCII 0–31, except for tab/newline/carriage-return) which are not valid in XML and cause Excel to report corrupted content.

Why this happens
When Excel (or any XML-based parser) sees those control characters, it can’t parse the file properly and displays the error:

“We found a problem with some content in XXXX.xlsx…”

Once you click “Yes,” Excel attempts to remove or replace those invalid characters, and the file will open (albeit with those parts sanitized).

How to fix it
You need to sanitize the strings before calling ws_hex.append(...):

Strip out non-printable control characters. A quick way is to replace or remove characters outside the valid Unicode range for XML (0x09, 0x0A, 0x0D, plus everything from 0x20 and up).
Or at least filter out the known control characters from the ASCII side of the dump.

For example:

python
import re

def remove_invalid_xml_chars(s):
    # Regex to match valid XML 1.0 characters: 
    # tab (\u0009), newline (\u000A), carriage return (\u000D),
    # plus the “printable” range [\u0020-\uD7FF\uE000-\uFFFD]
    return re.sub(
        r'[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD]+',
        '',
        s
    )

Then in your code:

python
ascii_part = remove_invalid_xml_chars(ascii_part)
ws_hex.append([offset, hex_part, ascii_part])

This ensures that any raw binary or control characters from your dconf hex dump are removed (or replaced) so that the XLSX can be parsed without errors.

Answer 2

The most common reason this error persists—despite removing invalid XML characters from the “ASCII” portion—is that other fields may also contain invalid characters (or characters not allowed by Excel’s XML). In your current code, you only call remove_invalid_xml_chars() on some values (e.g., the ASCII portion of the hex dump, the “value” of metadata, etc.), but not on:

The key fields (e.g., in “All Keys” you only sanitize value, not key).
The offset and/or hex_part in the “Hex Dump” sheet.
The logged output strings, timestamps, or any other fields that you’re appending without sanitizing.

If any field in any row contains an invalid XML character, Excel will flag the entire workbook as corrupted.

What to Do

Sanitize all strings you write to the workbook—keys, offsets, timestamps, everything. A good rule of thumb is:
```
python
ws.append([
    remove_invalid_xml_chars(str(value)) 
    for value in [field1, field2, field3, ...]
])
```
so that every cell in that row is guaranteed to be clean.

Sanitize the “keys” in “All Keys”:

python
ws_keys.append([
    remove_invalid_xml_chars(key),
    remove_invalid_xml_chars(value)
])

Sanitize offset and hex_part in the hex dump:

python
offset = remove_invalid_xml_chars(offset)
hex_part = remove_invalid_xml_chars(hex_part)
ascii_part = remove_invalid_xml_chars(ascii_part)
ws_hex.append([offset, hex_part, ascii_part])

Sanitize logged output (timestamps, analyst name, etc.) in “Logged Output”:

python
ws_log.append([
    remove_invalid_xml_chars(results["Current Timestamp"]), 
    remove_invalid_xml_chars("Analysis Started")
])
# same pattern for each row you append

By ensuring all text you insert is run through remove_invalid_xml_chars(), you eliminate the possibility that hidden control characters or other invalid XML bytes will sneak in and break the workbook.

Why This Happens

The XLSX file format is really a collection of XML files in a ZIP container. If any cell text contains bytes that XML 1.0 disallows (e.g., ASCII control characters 0–31 other than tab/newline/CR), Excel’s parser complains that the file is corrupt. Even a single bad character in a single cell can break the entire file.

Once you sanitize every string, the error about “We found a problem with some content” should disappear.

Answer 3

Potential Reasons & Debugging Steps

Even with all strings being sanitized, Excel can still report corruption in certain scenarios. Here are some additional (and somewhat hidden) causes you should check:

1. There might still be invalid characters slipping through

Ensure you are importing re only once at the top of the file and not inside the function, or at least consistently.
Double-check your remove_invalid_xml_chars() function covers all invalid Unicode ranges for XML 1.0. (For instance, if you are dealing with surrogate pairs or characters outside the Basic Multilingual Plane, it can be trickier; \uD800-\uDFFF are surrogate code points that can also cause trouble in XML.)

Tip: You can insert a quick debug print in your loop to spot-check the sanitized values:

python
cleaned_value = remove_invalid_xml_chars(original_value)
if cleaned_value != original_value:
    print(f"Sanitized something:\nOriginal: {repr(original_value)}\nCleaned: {repr(cleaned_value)}")
ws.append([cleaned_value])

If you see something suspicious, that might still be your culprit.

2. A non-string object is sneaking in

Sometimes if a value is not a string (e.g., None, a dictionary, or a list), openpyxl might mis-handle it or attempt to coerce it into a string in ways that produce invalid XML. Make sure everything you write to the cell is a proper string, especially if you are iterating over something like for key, value in metadata.items():.

Quick check:

python
for key, value in metadata.items():
    if not isinstance(key, str):
        key = str(key)
    if not isinstance(value, str):
        value = str(value)
    ws.append([remove_invalid_xml_chars(key), remove_invalid_xml_chars(value)])

3. Worksheet names

You are already sanitizing them, but keep in mind that Excel limits sheet names to 31 characters. Also, certain characters ([ ] : * ? / \) are disallowed in sheet names. If you have a result dictionary key that attempts to create a sheet name longer than 31 characters or with invalid symbols, Excel can flag it as corrupt.

4. Extremely long cell values

In very rare cases, extremely large cell values (multiple megabytes of text) can trip up Excel. If any single “ASCII dump” line is huge, consider splitting it or just confirm it’s not unreasonably large.

5. A style or formula corruption

If you (or openpyxl by default) are applying styles or formulas with invalid references, it can lead to “We found a problem with some content.” This is less common with the usage above, but worth checking if you have extra code that’s not shown.

6. The file is truly being corrupted upon save (I/O issues)

Verify that:

You’re not re-opening or overwriting a file while it’s still in use.
You’re not saving multiple times to the same path in quick succession.
You have the correct permissions (especially on Windows).
The version of openpyxl is recent. Older versions have had known issues.

7. Find exactly where corruption occurs (Incremental Debugging)

If all else fails, do incremental (step-by-step) workbook building:

Create the workbook and save just the “Analysis Results” sheet. Open in Excel. Does it work?
If yes, add the “File Metadata” sheet. Save. Open in Excel.
Continue adding one sheet at a time until you hit the corruption. The last sheet (or row) you added is the culprit.

Often, you’ll discover that one row (e.g., a single line in your hex/ASCII dump) is injecting something invalid that your regex hasn’t stripped. Or maybe a key has an unusual character. This step-by-step approach is the most reliable way to isolate the root cause.

Summary

Sanitize everything—including keys, offsets, timestamps, sheet names, etc.
Check for non-string objects (convert them to strings first).
Watch out for invalid sheet names or extremely large text.
Incrementally debug by building your workbook one sheet at a time until you spot the problem.

One of these steps will almost certainly pinpoint where the corruption is sneaking in, and from there, you can handle or sanitize it properly.

XLSX Invalid Characters Fix

Câu hỏi

Suy nghĩ

Trả lời

Câu hỏi

Suy nghĩ

Trả lời

What to Do

Why This Happens

Câu hỏi

Suy nghĩ

Trả lời

1. There might still be invalid characters slipping through

2. A non-string object is sneaking in

3. Worksheet names

4. Extremely long cell values

5. A style or formula corruption

6. The file is truly being corrupted upon save (I/O issues)

7. Find exactly where corruption occurs (Incremental Debugging)

Summary

Chia sẻ Q&A này