In standard Unix fashion you can stuff almost anything into a filename: they're just byte strings, with only NUL and "/" off limits. But a lot of programs aren't equipped to deal with this and will often break in spectacular ways. https://dwheeler.com/essays/fixing-unix-linux-filenames.html is a fantastic read on the subject and dives into not just characters but their positioning, such as dashes at the start of filenames being interpreted as option flags.
The article concludes with these recommendations/conversation starters:
- Forbid/escape ASCII control characters (bytes 1-31 and 127) in filenames, including newline, escape, and tab. I know of no user or program that actually requires this capability. As far as I can tell, this capability exists only to make it hard to write correct software, to ease the job of attackers, and to create interoperability problems. Chuck it.
- Forbid/escape leading “-”. This way, you can always distinguish option flags from filenames, eliminating a host of stupid errors. Nobody in their right mind writes programs that depend on having dash-prefixed files on a Unix system. Even on Windows systems they’re a bad idea, because many programs use “-” instead of “/” to identify options.
- Forbid/escape filenames that aren’t a valid UTF-8 encoding. This way, filenames can always be correctly displayed. Trying to use environment values like LC_ALL (or other LC_* values) or LANG is just a hack that often fails. This will take time, as people slowly transition and minor tool problems get fixed, but I believe that transition is already well underway.
- Forbid/escape leading/trailing space characters, at least trailing spaces. Adjacent spaces are somewhat dodgy, too. These confuse users when they happen, with no utility. In particular, filenames that are only space characters are nothing but trouble. Some systems may want to go further and forbid space characters outright, but I doubt that’ll be acceptable everywhere, and with the other approaches these are less necessary. As noted above, an interesting alternative would be to quietly convert (in the API) all spaces into unbreakable spaces.
- Forbid/escape “problematic” characters that get specially interpreted by shells, other interpreters (such as perl), and HTML/XML. This is less important, and I would expect this to happen (at most) on specific systems. With the steps above, a lot of programs and statements like “cat *” just work correctly. But funny characters cause troubles for shell scripts and perl, because they need to quote them when typing in commands, and they often forget to do so. They can also be a cause for trouble when they’re passed down to other programs, especially if they run “exec” and so on. They’re also helpful for web applications, again, because the characters that should be escaped are sometimes not escaped. A short list would be “*”, “?”, and “[”; by eliminating those three characters and control characters from filenames, and removing the space character from IFS, you can process filenames in shells without quoting variable references, eliminating a common source of errors. Forbidding/escaping “<” and “>” would eliminate a source of nasty errors for perl programs, web applications, and anyone using HTML or XML. A more stringent list would be “*?:[]"<>|(){}&'!\;” (this is Glindra’s “safe” list with ampersand, single-quote, bang, backslash, and semicolon added). This list is probably a little extreme, but let’s try and see. As noted earlier, I’d need to go through a complete analysis of all characters for a final list; for security, you want to identify everything that is permissible, and disallow everything else, but its manifestation can be either way as long as you’ve considered all possible cases. But if this set can be determined locally, based on local requirements, there’s less need to get complete agreement on a list.
- Forbid/escape leading “~” (tilde). Shells specially interpret such filenames. This is definitely low priority.
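To make the less controversial rules above concrete, here is a rough sketch of a checker in Python. The function name `filename_problems` and the wording of the messages are my own invention, not anything from the article, and it deliberately skips the "problematic shell characters" rule since the article leaves that list open. It works on bytes because that is what the kernel actually stores.

```python
def filename_problems(name: bytes) -> list[str]:
    """Return the reasons one filename (a single path component) breaks the rules above."""
    problems = []
    # Control characters: bytes 1-31 and 127. Byte 0 cannot appear in a
    # filename anyway, so treating it the same way here is harmless.
    if any(b < 32 or b == 127 for b in name):
        problems.append("control character")
    if name.startswith(b"-"):
        problems.append("leading dash")
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("invalid UTF-8")
    if name.startswith(b" ") or name.endswith(b" "):
        problems.append("leading or trailing space")
    if name.startswith(b"~"):
        problems.append("leading tilde")
    return problems
```

You could run it over `os.listdir(b".")`, which returns bytes when given a bytes path, to see which existing files would be rejected. An empty list means the name passes every rule checked here.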
Forbidding spaces as well could be interesting. A whitelist might be a good approach; from the article:
"In the end, you're safer if filenames are limited to the characters that are never misused. In a system where security is at a premium, I can see configuring it to only permit filenames with characters in the set A-Za-z0-9_-, with the additional rule that it must not begin with a dash. These display everywhere, are unambiguous, and this limitation cuts off many attack avenues."
You could implement rich filenames using an xattr (extended attribute), while keeping the actual filenames easy to type and easy for programs to parse.
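A minimal sketch of that idea, assuming Linux (Python only exposes `os.setxattr` and `os.getxattr` there) and a filesystem that allows user xattrs. The attribute name `user.display_name` and all three function names are made up for illustration; nothing standard reads this attribute.

```python
import os
import re

# Whitelist from the quoted passage: A-Za-z0-9_- with no leading dash.
SAFE_NAME = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_-]*")

# Hypothetical attribute name; "user." is the namespace Linux gives to
# xattrs that unprivileged processes may set.
DISPLAY_ATTR = "user.display_name"


def is_safe_name(name: str) -> bool:
    """True if the whole name uses only whitelisted characters."""
    return SAFE_NAME.fullmatch(name) is not None


def set_display_name(path: str, display_name: str) -> None:
    """Attach a rich, human-facing name to a file whose real name is plain."""
    if not is_safe_name(os.path.basename(path)):
        raise ValueError("on-disk name is outside the whitelist")
    os.setxattr(path, DISPLAY_ATTR, display_name.encode("utf-8"))


def get_display_name(path: str) -> str:
    """Return the rich name, or the plain on-disk name if none is stored."""
    try:
        return os.getxattr(path, DISPLAY_ATTR).decode("utf-8")
    except OSError:
        # No such attribute, or the filesystem does not support xattrs.
        return os.path.basename(path)
```

So a file stored as `q3_report` could show up in a file manager as "Q3 report (final).txt" while scripts only ever see the plain name. Note that the quoted whitelist has no ".", so extensions would have to live in the xattr too, or the set would need widening.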