In the polyseed-examples repository, the `utf8_nfc` and `utf8_nfkd` functions will never return a value exceeding `POLYSEED_STR_SIZE - 1`
In your code, the utf8_norm function has variable return behavior that seems odd
In case of a normalization error, the underlying normalizer will return a negative value, at which point your function just returns POLYSEED_STR_SIZE (this is unclear)
In case the buffer isn't large enough, the normalizer will return the required buffer size but have undefined internal behavior, at which point your function returns a value exceeding POLYSEED_STR_SIZE
Otherwise, it uses the normalizer's return value (indicating the written size) to continue with re-encoding
tobtoht: Czarek Nakamoto: polyseed asserts that the return value < POLYSEED_STR_SIZE, so if normalization fails the program crashes..
> I think my idea was to have have polyseed check the return value and return an error code instead of asserting, which would in turn throw the "Unicode normalization failed" error
> I'll upstream that. In the meantime you can replace the injected function with
```cpp
inline size_t utf8_norm(const char* str, polyseed_str norm, utf8proc_option_t options) {
utf8proc_int32_t buffer[POLYSEED_STR_SIZE];
utf8proc_ssize_t result;
result = utf8proc_decompose(reinterpret_cast<const uint8_t*>(str), 0, buffer, POLYSEED_STR_SIZE, options);
if (result < 0 || result > (POLYSEED_STR_SIZE - 1)) {
throw std::runtime_error("Unicode normalization failed");
}
result = utf8proc_reencode(buffer, result, options);
if (result < 0 || result > POLYSEED_STR_SIZE) {
throw std::runtime_error("Unicode normalization failed");
}
strcpy(norm, reinterpret_cast<const char*>(buffer));
sodium_memzero(buffer, sizeof(buffer));
return result;
}
```
tobtoht:
Since only the composed languages are broken, it could also be that canonical composition is producing weird output. Try dumping whatever seed string is being fed to polyseed_decode to hex and we should be able to tell.
Or try removing UTF8PROC_LUMP from utf8_nfc