XLSX Invalid Characters Fix
Created on: January 10, 2025
Created on: January 10, 2025
Look at this code:
textdef export_results_to_xlsx(self, results, file_path): """Export analysis results to XLSX.""" try: wb = openpyxl.Workbook() ws = wb.active ws.title = "Analysis Results" # Add data ws.append(["Digital Forensics dconf Analysis Results"]) ws.append([]) # Analysis Results ws.append(["Analysis Results"]) analysis = results["Analysis Results"] for key in ["Desktop Environment", "Hidden Files Visible", "Value Bytes"]: value = analysis[key] ws.append([key, str(value)]) ws.append([]) # File Metadata ws.append(["File Metadata"]) metadata = results["File Metadata"] for key, value in metadata.items(): ws.append([key, str(value)]) ws.append([]) # System Info ws.append(["System Information"]) sys_info = results["System Info"] for key, value in sys_info.items(): ws.append([key, str(value)]) ws.append([]) # File Hashes ws.append(["File Hashes"]) hashes = results["File Hashes"] for key, value in hashes.items(): ws.append([key, str(value)]) ws.append([]) # Keys ws_keys = wb.create_sheet("Keys") ws_keys.append(["Key", "Value"]) for key, value in results["All Keys"].items(): ws_keys.append([key, str(value)]) # Hex Dump ws_hex = wb.create_sheet("Hex Dump") ws_hex.append(["Offset", "Hex Values", "ASCII"]) for line in results["Combined Hex and ASCII Dump"]: if len(line) > 9: offset = line[:8] separator_pos = line.find("|") if separator_pos != -1: hex_part = line[9:separator_pos].strip() ascii_part = line[separator_pos + 1:].strip() ws_hex.append([offset, hex_part, ascii_part]) # Logged Output ws_log = wb.create_sheet("Logged Output") ws_log.append(["Timestamp", "Event"]) ws_log.append([results["Current Timestamp"], "Analysis Started"]) ws_log.append([results["Current Timestamp"], f"Analyst: {results['Analyst']}"]) ws_log.append([results["Current Timestamp"], f"Desktop Environment: {analysis['Desktop Environment']}"]) ws_log.append([results["Current Timestamp"], f"Hidden Files Visible: {analysis['Hidden Files Visible']}"]) wb.save(file_path) return True except Exception as e: print(f"Error exporting to XLSX: {e}") return False app = QtWidgets.QApplication(sys.argv) gui = DconfAnalysisToolGUI() gui.show() sys.exit(app.exec()) pass else: print_banner() print( f"{Fore.YELLOW}{' Digital Forensics dconf Analysis Tool ':─^80}{Style.RESET_ALL}" ) print( "Please select the dconf folder (usually located at ~/.config/dconf on Linux)" ) user_file = select_dconf_file() if not user_file: print("No file selected. Exiting.") sys.exit(1) print(f"Analysing file: {user_file}") process_file_and_output(user_file)
What is causing it so that when I open the Excel XLSX that it generates it gives me the error "We found a problem with some content in XXXX.xlsx'. Do you want us to try to recover as much as we can? If you trust the source of this workbook, click Yes,"
In almost all cases like this, the culprit is that Excel’s XML-based XLSX files cannot handle certain raw (non-printable) characters. Inside your loop for the “Hex Dump” sheet, you’re feeding potentially invalid XML characters into openpyxl (particularly in the ASCII portion). Those characters might include things like control characters (ASCII 0–31, except for tab/newline/carriage-return) which are not valid in XML and cause Excel to report corrupted content.
Why this happens
When Excel (or any XML-based parser) sees those control characters, it can’t parse the file properly and displays the error:
“We found a problem with some content in XXXX.xlsx…”
Once you click “Yes,” Excel attempts to remove or replace those invalid characters, and the file will open (albeit with those parts sanitized).
How to fix it
You need to sanitize the strings before calling ws_hex.append(...)
:
For example:
pythonimport re def remove_invalid_xml_chars(s): # Regex to match valid XML 1.0 characters: # tab (\u0009), newline (\u000A), carriage return (\u000D), # plus the “printable” range [\u0020-\uD7FF\uE000-\uFFFD] return re.sub( r'[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD]+', '', s )
Then in your code:
pythonascii_part = remove_invalid_xml_chars(ascii_part) ws_hex.append([offset, hex_part, ascii_part])
This ensures that any raw binary or control characters from your dconf hex dump are removed (or replaced) so that the XLSX can be parsed without errors.
I did this:
textdef export_results_to_xlsx(self, results, file_path): """Export analysis results to XLSX.""" try: def remove_invalid_xml_chars(s): # Regex to match valid XML 1.0 characters import re return re.sub( r'[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD]+', '', str(s) ) wb = openpyxl.Workbook() ws = wb.active ws.title = "Analysis Results" # Add data ws.append(["Digital Forensics dconf Analysis Results"]) ws.append([]) # Analysis Results ws.append(["Analysis Results"]) analysis = results["Analysis Results"] for key in ["Desktop Environment", "Hidden Files Visible", "Value Bytes"]: value = analysis[key] ws.append([key, remove_invalid_xml_chars(value)]) ws.append([]) # File Metadata ws.append(["File Metadata"]) metadata = results["File Metadata"] for key, value in metadata.items(): ws.append([key, remove_invalid_xml_chars(value)]) ws.append([]) # System Info ws.append(["System Information"]) sys_info = results["System Info"] for key, value in sys_info.items(): ws.append([key, remove_invalid_xml_chars(value)]) ws.append([]) # File Hashes ws.append(["File Hashes"]) hashes = results["File Hashes"] for key, value in hashes.items(): ws.append([key, remove_invalid_xml_chars(value)]) ws.append([]) # Keys ws_keys = wb.create_sheet("Keys") ws_keys.append(["Key", "Value"]) for key, value in results["All Keys"].items(): ws_keys.append([key, remove_invalid_xml_chars(value)]) # Hex Dump ws_hex = wb.create_sheet("Hex Dump") ws_hex.append(["Offset", "Hex Values", "ASCII"]) for line in results["Combined Hex and ASCII Dump"]: if len(line) > 9: offset = line[:8] separator_pos = line.find("|") if separator_pos != -1: hex_part = line[9:separator_pos].strip() ascii_part = line[separator_pos + 1:].strip() # Sanitize the ASCII part specifically ascii_part = remove_invalid_xml_chars(ascii_part) ws_hex.append([offset, hex_part, ascii_part]) # Logged Output ws_log = wb.create_sheet("Logged Output") ws_log.append(["Timestamp", "Event"]) ws_log.append([results["Current Timestamp"], "Analysis Started"]) ws_log.append([results["Current Timestamp"], f"Analyst: {results['Analyst']}"]) ws_log.append([results["Current Timestamp"], f"Desktop Environment: {analysis['Desktop Environment']}"]) ws_log.append([results["Current Timestamp"], f"Hidden Files Visible: {analysis['Hidden Files Visible']}"]) wb.save(file_path) return True except Exception as e: print(f"Error exporting to XLSX: {e}") return False app = QtWidgets.QApplication(sys.argv) gui = DconfAnalysisToolGUI() gui.show() sys.exit(app.exec()) pass else: print_banner() print( f"{Fore.YELLOW}{' Digital Forensics dconf Analysis Tool ':─^80}{Style.RESET_ALL}" ) print( "Please select the dconf folder (usually located at ~/.config/dconf on Linux)" ) user_file = select_dconf_file() if not user_file: print("No file selected. Exiting.") sys.exit(1) print(f"Analysing file: {user_file}") process_file_and_output(user_file)
But same error
The most common reason this error persists—despite removing invalid XML characters from the “ASCII” portion—is that other fields may also contain invalid characters (or characters not allowed by Excel’s XML). In your current code, you only call remove_invalid_xml_chars()
on some values (e.g., the ASCII portion of the hex dump, the “value” of metadata, etc.), but not on:
key
fields (e.g., in “All Keys” you only sanitize value
, not key
).offset
and/or hex_part
in the “Hex Dump” sheet.If any field in any row contains an invalid XML character, Excel will flag the entire workbook as corrupted.
Sanitize all strings you write to the workbook—keys, offsets, timestamps, everything. A good rule of thumb is:
pythonws.append([ remove_invalid_xml_chars(str(value)) for value in [field1, field2, field3, ...] ])
so that every cell in that row is guaranteed to be clean.
Sanitize the “keys” in “All Keys”:
pythonws_keys.append([ remove_invalid_xml_chars(key), remove_invalid_xml_chars(value) ])
Sanitize offset
and hex_part
in the hex dump:
pythonoffset = remove_invalid_xml_chars(offset) hex_part = remove_invalid_xml_chars(hex_part) ascii_part = remove_invalid_xml_chars(ascii_part) ws_hex.append([offset, hex_part, ascii_part])
Sanitize logged output (timestamps, analyst name, etc.) in “Logged Output”:
pythonws_log.append([ remove_invalid_xml_chars(results["Current Timestamp"]), remove_invalid_xml_chars("Analysis Started") ]) # same pattern for each row you append
By ensuring all text you insert is run through remove_invalid_xml_chars()
, you eliminate the possibility that hidden control characters or other invalid XML bytes will sneak in and break the workbook.
The XLSX file format is really a collection of XML files in a ZIP container. If any cell text contains bytes that XML 1.0 disallows (e.g., ASCII control characters 0–31 other than tab/newline/CR), Excel’s parser complains that the file is corrupt. Even a single bad character in a single cell can break the entire file.
Once you sanitize every string, the error about “We found a problem with some content” should disappear.
textdef export_results_to_xlsx(self, results, file_path): """Export analysis results to XLSX.""" try: def remove_invalid_xml_chars(s): # Regex to match valid XML 1.0 characters import re return re.sub( r'[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD]+', '', str(s) ) wb = openpyxl.Workbook() ws = wb.active ws.title = remove_invalid_xml_chars("Analysis Results") # Add data ws.append([remove_invalid_xml_chars("Digital Forensics dconf Analysis Results")]) ws.append([]) # Analysis Results ws.append([remove_invalid_xml_chars("Analysis Results")]) analysis = results["Analysis Results"] for key in ["Desktop Environment", "Hidden Files Visible", "Value Bytes"]: ws.append([ remove_invalid_xml_chars(key), remove_invalid_xml_chars(analysis[key]) ]) ws.append([]) # File Metadata ws.append([remove_invalid_xml_chars("File Metadata")]) metadata = results["File Metadata"] for key, value in metadata.items(): ws.append([ remove_invalid_xml_chars(key), remove_invalid_xml_chars(value) ]) ws.append([]) # System Info ws.append([remove_invalid_xml_chars("System Information")]) sys_info = results["System Info"] for key, value in sys_info.items(): ws.append([ remove_invalid_xml_chars(key), remove_invalid_xml_chars(value) ]) ws.append([]) # File Hashes ws.append([remove_invalid_xml_chars("File Hashes")]) hashes = results["File Hashes"] for key, value in hashes.items(): ws.append([ remove_invalid_xml_chars(key), remove_invalid_xml_chars(value) ]) ws.append([]) # Keys ws_keys = wb.create_sheet(remove_invalid_xml_chars("Keys")) ws_keys.append([ remove_invalid_xml_chars("Key"), remove_invalid_xml_chars("Value") ]) for key, value in results["All Keys"].items(): ws_keys.append([ remove_invalid_xml_chars(key), remove_invalid_xml_chars(value) ]) # Hex Dump ws_hex = wb.create_sheet(remove_invalid_xml_chars("Hex Dump")) ws_hex.append([ remove_invalid_xml_chars("Offset"), remove_invalid_xml_chars("Hex Values"), remove_invalid_xml_chars("ASCII") ]) for line in results["Combined Hex and ASCII Dump"]: if len(line) > 9: offset = line[:8] separator_pos = line.find("|") if separator_pos != -1: hex_part = line[9:separator_pos].strip() ascii_part = line[separator_pos + 1:].strip() ws_hex.append([ remove_invalid_xml_chars(offset), remove_invalid_xml_chars(hex_part), remove_invalid_xml_chars(ascii_part) ]) # Logged Output ws_log = wb.create_sheet(remove_invalid_xml_chars("Logged Output")) ws_log.append([ remove_invalid_xml_chars("Timestamp"), remove_invalid_xml_chars("Event") ]) timestamp = remove_invalid_xml_chars(results["Current Timestamp"]) ws_log.append([timestamp, remove_invalid_xml_chars("Analysis Started")]) ws_log.append([timestamp, remove_invalid_xml_chars(f"Analyst: {results['Analyst']}")]) ws_log.append([timestamp, remove_invalid_xml_chars(f"Desktop Environment: {analysis['Desktop Environment']}")]) ws_log.append([timestamp, remove_invalid_xml_chars(f"Hidden Files Visible: {analysis['Hidden Files Visible']}")]) wb.save(file_path) return True except Exception as e: print(f"Error exporting to XLSX: {e}") return False app = QtWidgets.QApplication(sys.argv) gui = DconfAnalysisToolGUI() gui.show() sys.exit(app.exec()) pass else: print_banner() print( f"{Fore.YELLOW}{' Digital Forensics dconf Analysis Tool ':─^80}{Style.RESET_ALL}" ) print( "Please select the dconf folder (usually located at ~/.config/dconf on Linux)" ) user_file = select_dconf_file() if not user_file: print("No file selected. Exiting.") sys.exit(1) print(f"Analysing file: {user_file}") process_file_and_output(user_file)
Same error
Potential Reasons & Debugging Steps
Even with all strings being sanitized, Excel can still report corruption in certain scenarios. Here are some additional (and somewhat hidden) causes you should check:
re
only once at the top of the file and not inside the function, or at least consistently.remove_invalid_xml_chars()
function covers all invalid Unicode ranges for XML 1.0. (For instance, if you are dealing with surrogate pairs or characters outside the Basic Multilingual Plane, it can be trickier; \uD800-\uDFFF
are surrogate code points that can also cause trouble in XML.)Tip: You can insert a quick debug print in your loop to spot-check the sanitized values:
pythoncleaned_value = remove_invalid_xml_chars(original_value) if cleaned_value != original_value: print(f"Sanitized something:\nOriginal: {repr(original_value)}\nCleaned: {repr(cleaned_value)}") ws.append([cleaned_value])
If you see something suspicious, that might still be your culprit.
Sometimes if a value is not a string (e.g., None
, a dictionary, or a list), openpyxl might mis-handle it or attempt to coerce it into a string in ways that produce invalid XML. Make sure everything you write to the cell is a proper string, especially if you are iterating over something like for key, value in metadata.items():
.
Quick check:
pythonfor key, value in metadata.items(): if not isinstance(key, str): key = str(key) if not isinstance(value, str): value = str(value) ws.append([remove_invalid_xml_chars(key), remove_invalid_xml_chars(value)])
[ ] : * ? / \
) are disallowed in sheet names. If you have a result dictionary key that attempts to create a sheet name longer than 31 characters or with invalid symbols, Excel can flag it as corrupt.In very rare cases, extremely large cell values (multiple megabytes of text) can trip up Excel. If any single “ASCII dump” line is huge, consider splitting it or just confirm it’s not unreasonably large.
If you (or openpyxl by default) are applying styles or formulas with invalid references, it can lead to “We found a problem with some content.” This is less common with the usage above, but worth checking if you have extra code that’s not shown.
Verify that:
openpyxl
is recent. Older versions have had known issues.If all else fails, do incremental (step-by-step) workbook building:
Often, you’ll discover that one row (e.g., a single line in your hex/ASCII dump) is injecting something invalid that your regex hasn’t stripped. Or maybe a key has an unusual character. This step-by-step approach is the most reliable way to isolate the root cause.
One of these steps will almost certainly pinpoint where the corruption is sneaking in, and from there, you can handle or sanitize it properly.