If you want to resubmit PA1, please read this section carefully. You need to pass all the tests in the original PA1, while also implementing an extra function described below.
Takes a UTF-8 encoded string and a codepoint index. Calculates the codepoint at that index. Then, calculates the code point with value one higher (so e.g. for “é” (U+00E9) that would be “ê” (U+00EA), and for “🐩” (U+1F429) that would be “🐪” (U+1F42A)). Saves the encoding of that code point in the result array starting at index 0.
char str[] = "Joséph";
char result[100];
int32_t idx = 3;
next_utf8_char(str, idx, result);
printf("Next Character of Codepoint at Index 3: %s\n",result);
// 'é' is the 4th codepoint represented by the bytes 0xC3 0xA9
// 'ê' in UTF-8 hex bytes is represented as 0xC3 0xAA
=== Output ===
Next Character of Codepoint at Index 3: ê
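The last step, turning a code point value back into bytes, can be sketched as below. This is only a minimal sketch under stated assumptions, not the reference solution: the helper name encode_utf8_sketch and its return convention are made up for illustration, and it does not reject surrogate values (U+D800 to U+DFFF).

```c
#include <stdint.h>

// Hypothetical helper (not part of the assignment's required functions):
// writes the UTF-8 encoding of cp into out, adds a null terminator,
// and returns the number of encoded bytes, or -1 if cp is out of range.
// out must have room for at least 5 bytes.
int32_t encode_utf8_sketch(int32_t cp, char out[]) {
    if (cp < 0 || cp > 0x10FFFF) {
        return -1;
    }
    if (cp <= 0x7F) {                       // 0xxxxxxx
        out[0] = (char)cp;
        out[1] = '\0';
        return 1;
    }
    if (cp <= 0x7FF) {                      // 110xxxxx 10xxxxxx
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        out[2] = '\0';
        return 2;
    }
    if (cp <= 0xFFFF) {                     // 1110xxxx 10xxxxxx 10xxxxxx
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        out[3] = '\0';
        return 3;
    }
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    out[4] = '\0';
    return 4;
}
```

For example, calling encode_utf8_sketch(0xEA, result) would write the bytes 0xC3 0xAA followed by a null terminator, matching the 'ê' in the example above.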
Your final output on running the utfanalyzer code that will be graded should now contain this extra line:
Next Character of Codepoint at Index 3: FILL
Note: If the number of codepoints in the input string is less than 4, this added line would only have the prompt without any character as follows:
Next Character of Codepoint at Index 3:
For example, the complete program output should look like:
$ ./utf8analyzer
Enter a UTF-8 encoded string: My 🐩’s name is Erdős.
Valid ASCII: false
Uppercased ASCII: "MY 🐩’S NAME IS ERDőS."
Length in bytes: 27
Number of code points: 21
Bytes per code point: 1 1 1 4 3 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
Substring of the first 6 code points: "My 🐩’s"
Code points as decimal numbers: 77 121 32 128041 8217 115 32 110 97 109 101 32 105 115 32 69 114 100 337 115 46
Animal emojis: 🐩
Next Character of Codepoint at Index 3: 🐪
(All our tests will check for this newly added line, in addition to lines from the original PA)
Consider the 3-byte sequence 11100000 10000000 10100001. Answer the following questions:
- What code point does it encode in UTF-8, and what character is that?
- What are the three other ways to encode that character?
- Give an example of a character that has exactly three encodings (but not four, as the character in the previous example does).
- What are some problems with having these multiple encodings, especially for ASCII characters? A web search for “overlong UTF-8 encoding” may be useful here.
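As background for these questions: one way a decoder can detect an overlong sequence is to compare the number of bytes actually used against the fewest bytes UTF-8 needs for the decoded value. The sketch below illustrates that check only; both helper names are hypothetical and not part of the assignment.

```c
#include <stdint.h>

// Hypothetical helper: fewest bytes UTF-8 needs for code point cp,
// or -1 if cp is outside the Unicode range.
int32_t shortest_utf8_len(int32_t cp) {
    if (cp < 0 || cp > 0x10FFFF) return -1;
    if (cp <= 0x7F) return 1;      // fits in 7 bits
    if (cp <= 0x7FF) return 2;     // fits in 11 bits
    if (cp <= 0xFFFF) return 3;    // fits in 16 bits
    return 4;                      // up to 21 bits
}

// Hypothetical helper: returns 1 if encoding cp in used_len bytes
// would be longer than necessary (overlong), else 0.
int32_t is_overlong(int32_t cp, int32_t used_len) {
    int32_t shortest = shortest_utf8_len(cp);
    return shortest != -1 && used_len > shortest;
}
```

With this sketch, is_overlong(0x41, 2) is 1 because 'A' needs only one byte, while is_overlong(0xE9, 2) is 0 because 'é' genuinely needs two.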
- Most important: The test_script we shared has an unfortunate bug: if a .txt.expect file doesn't have a blank line at the end, it may skip checking that the last line of the .expect file correctly matches the program's output. So, to be sure your tests are working, make sure your .txt.expect files have a blank line/newline at the end. Our two sample tests didn't! So if you were confused about why a test was passing that shouldn't, that could be the reason.
- Some people noticed that in our provided test we didn't include the quotes around the output for uppercased ASCII:
Uppercased ASCII: "MY 🐩’S NAME IS ERDőS."
vs.
Uppercased ASCII: MY 🐩’S NAME IS ERDőS.
Either is fine. If you want to pick one, include the quotes.
- The problem didn't say what to do with utf8_substring if the end index is larger than the utf8_strlen for the string. In that case, it should act as if the end index was exactly utf8_strlen of the string. That makes it so if you take the substring of the first 6 code points of a string with fewer than 6, you get the whole string.
Representing text is straightforward using ASCII: one byte per character fits well within char[] and it represents most English text. However, there are many more than 256 characters in the text we use, from non-Latin alphabets (Cyrillic, Arabic, and Chinese character sets, etc.) to emojis and other symbols like €, to accented characters like é and ü.
The UTF-8 encoding is the default encoding of text in the majority of software today. If you've opened a web page, read a text message, or sent an email in the past 15 years that had any special characters, the text was probably UTF-8 encoded.
Not all software handles UTF-8 correctly! For example, Joe got a marketing email recently with a header “Take your notes further with Connect​”. We're guessing that was supposed to be an ellipsis (…), UTF-8 encoded as the three bytes 11100010 10000000 10100110 (0xE2 0x80 0xA6 in hex), and likely the software used to author the email mishandled the encoding and treated it as three extended ASCII characters.
This can cause serious problems for real people. For example, people with accented letters in their names can run into issues with sign-in forms (check out Twitter/X account @yournameisvalid for some examples). People with names best written in an alphabet other than Latin can have their names mangled in official documents, and need to have a "Latinized" version of their name for business in the US. Joe had trouble writing lecture notes because LaTeX does not support UTF-8 by default.
UTF-8 bugs can and do cause security vulnerabilities in products we use every day. A simple search for UTF-8 in the CVE database of security vulnerabilities turns up hundreds of results.
It's useful to get some experience with UTF-8 so you understand how it's supposed to work and can recognize when it doesn't. To that end, you'll write several functions that work with UTF-8 encoded text, and use them to analyze some example texts.
To get started, visit the GitHub Classroom assignment link. Select your username from the list (or if you don't see it, you can skip and use your GitHub username). A repository will be created for you to use to do your work. You can do your programming however you like; a Codespace will keep you in the environment we are using in class and lab.
The functions described below are organized into milestones; you should definitely finish the functions in a milestone set before moving on to the next.
In general, you should work one function at a time, and earlier functions may be useful in implementing later functions.
A good first task is to implement only is_ascii and the corresponding part of main needed to read input and print the result for is_ascii, and make sure you can test that. Then move on to capitalize_ascii, and so on.
You can and should save your work by using git commits (if you're comfortable with that), or even just saving copies of your .c file when you hit important milestones. We may ask to see your work from an earlier milestone if you ask us for help on a function from a later one.
Some reminders and information about the function signatures:
- int32_t is a 32-bit (4-byte) integer. You can think of it like int in Java; we just want to be explicit about sizes of things when we program in C, and int can mean different things on different systems. This type is defined in stdint.h, so #include <stdint.h> at the top of a program will make it usable.
- We use cpi as an abbreviation in some variable names; it stands for “code point index”.
- We use bi as an abbreviation in some variable names; it stands for “byte index”.
Takes a UTF-8 encoded string and returns whether it is valid ASCII (i.e. all bytes are 127 or less): 1 if it is, 0 otherwise.
printf("Is 🔥 ASCII? %d\n", is_ascii("🔥"));
=== Output ===
Is 🔥 ASCII? 0
printf("Is abcd ASCII? %d\n", is_ascii("abcd"));
=== Output ===
Is abcd ASCII? 1
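One detail worth checking when comparing bytes against 127: on many platforms char is signed, so a byte such as 0xF0 is stored as a negative number and a plain comparison like c <= 127 is always true. Here is a minimal sketch of a per-byte check that sidesteps this by converting to unsigned char first. The helper name byte_is_ascii is an assumption for illustration, not something the assignment requires.

```c
#include <stdint.h>

// Hypothetical helper: returns 1 if the single byte c is in the ASCII
// range 0..127, else 0. The cast makes the comparison independent of
// whether plain char is signed on this platform.
int32_t byte_is_ascii(char c) {
    return (unsigned char)c <= 127;
}
```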
Takes a UTF-8 encoded string and changes it in-place so that any ASCII lowercase characters a-z are changed to their uppercase versions. Leaves all other characters unchanged. It returns the number of characters updated from lowercase to uppercase.
int32_t ret = 0;
char str[] = "abcd";
ret = capitalize_ascii(str);
printf("Capitalized String: %s\nCharacters updated: %d\n", str, ret);
=== Output ===
Capitalized String: ABCD
Characters updated: 4
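The per-character step can be sketched as follows, relying on the fact that in ASCII each lowercase letter is exactly 32 greater than its uppercase counterpart. The helper name is hypothetical, and counting the updated characters is left out.

```c
// Hypothetical helper: returns the uppercase version of c if c is an
// ASCII lowercase letter, and c unchanged otherwise. Bytes that belong
// to multi-byte UTF-8 sequences are never in the range 'a'..'z', so
// they pass through untouched.
char upper_if_ascii_lower(char c) {
    if (c >= 'a' && c <= 'z') {
        return (char)(c - ('a' - 'A'));  // 'a' - 'A' is 32 in ASCII
    }
    return c;
}
```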
Given the start byte of a UTF-8 sequence, return how many bytes it indicates the sequence will take (start byte + continuation bytes).
Returns 1 for ASCII characters, and -1 if byte is not a valid start byte.
char s[] = "Héy"; // same as { 'H', 0xC3, 0xA9, 'y', 0 }, é is start byte + 1 cont. byte
printf("Width: %d bytes\n", width_from_start_byte(s[1])); // start byte 0xC3 indicates 2-byte sequence
=== Output ===
Width: 2 bytes
printf("Width: %d bytes\n", width_from_start_byte(s[2])); // start byte 0xA9 is a continuation byte, not a start byte
=== Output ===
Width: -1 bytes
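The start-byte patterns can be tested with bit masks. Below is a sketch under the usual UTF-8 layout (0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx); the function name width_sketch is hypothetical so that it is not mistaken for a reference solution.

```c
#include <stdint.h>

// Hypothetical sketch: classify a byte by its leading bits.
int32_t width_sketch(char start_byte) {
    unsigned char b = (unsigned char)start_byte;
    if ((b & 0x80) == 0x00) return 1;  // 0xxxxxxx: ASCII
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return -1;                         // 10xxxxxx or 11111xxx
}
```

Each mask keeps only the leading bits that identify the pattern, so the payload bits of the byte do not affect the result.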
Takes a UTF-8 encoded string and returns the number of UTF-8 codepoints it represents.
Returns -1 if there are any errors encountered in processing the UTF-8 string.
char str[] = "Joséph";
printf("Length of string %s is %d\n", str, utf8_strlen(str)); // 6 codepoints, (even though 7 bytes)
=== Output ===
Length of string Joséph is 6
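One common way to count code points in well-formed UTF-8 is to count every byte that is not a continuation byte (10xxxxxx). The sketch below shows only that idea: unlike the function described above, it performs no validation and never returns -1. The name is hypothetical.

```c
#include <stdint.h>

// Hypothetical sketch: counts bytes that do not match 10xxxxxx.
// Assumes str is well-formed, null-terminated UTF-8.
int32_t count_code_points_unchecked(const char str[]) {
    int32_t count = 0;
    for (int32_t bi = 0; str[bi] != '\0'; bi++) {
        unsigned char b = (unsigned char)str[bi];
        if ((b & 0xC0) != 0x80) {  // not a continuation byte
            count++;
        }
    }
    return count;
}
```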
Given a UTF-8 encoded string, and a codepoint index, return the byte index in the string where the Unicode character at the given codepoint index starts.
Returns -1 if there are any errors encountered in processing the UTF-8 string.
char str[] = "Joséph";
int32_t idx = 4;
printf("Codepoint index %d is byte index %d\n", idx, codepoint_index_to_byte_index(str, idx));
=== Output ===
Codepoint index 4 is byte index 5
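The walk from code point index to byte index can be sketched by stepping over one whole code point at a time. The helper names below are hypothetical, and the error handling is reduced to a few basic checks (bad start byte, bad continuation byte, index beyond the end of the string), which may not match everything a full solution should handle.

```c
#include <stdint.h>

// Hypothetical helper: sequence length implied by a start byte, or -1.
static int32_t seq_len(unsigned char b) {
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return -1;
}

// Hypothetical sketch: byte index where code point number cpi starts.
// Returns -1 if cpi is negative, lies beyond the end of the string,
// or a malformed byte is found along the way. If cpi equals the number
// of code points, the index of the null terminator is returned.
int32_t cpi_to_bi_sketch(const char str[], int32_t cpi) {
    if (cpi < 0) return -1;
    int32_t bi = 0;
    for (int32_t i = 0; i < cpi; i++) {
        if (str[bi] == '\0') return -1;          // ran off the end
        int32_t len = seq_len((unsigned char)str[bi]);
        if (len == -1) return -1;
        // Check continuation bytes so we never step past the terminator.
        for (int32_t k = 1; k < len; k++) {
            if (((unsigned char)str[bi + k] & 0xC0) != 0x80) return -1;
        }
        bi += len;
    }
    return bi;
}
```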
Takes a UTF-8 encoded string and start (inclusive) and end (exclusive) codepoint indices, and writes the substring between those indices to result, with a null terminator. Assumes that result has sufficient bytes of space available. (Hint: result will be created beforehand with a given size and passed as input here. Can any of the above functions be used to determine what the size of result should be?)
If cpi_start is greater than cpi_end or either is negative, the function should have no effect.
char str[] = "🦀🦮🦮🦀🦀🦮🦮";
char result[17];
utf8_substring(str, 3, 7, result);
printf("String: %s\nSubstring: %s", str, result); // these emoji are 4 bytes long
=== Output ===
String: 🦀🦮🦮🦀🦀🦮🦮
Substring: 🦀🦀🦮🦮
Takes a UTF-8 encoded string and a codepoint index, and returns the codepoint at that index as a decimal number.
char str[] = "Joséph";
int32_t idx = 4;
printf("Codepoint at %d in %s is %d\n", idx, str, codepoint_at(str, idx)); // 'p' is the codepoint at index 4
=== Output ===
Codepoint at 4 in Joséph is 112
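Decoding a code point means stripping the marker bits from each byte and concatenating the remaining payload bits. Here is a minimal sketch that decodes the sequence starting at a given byte index (not a code point index). The name and the reduced error handling are assumptions for illustration.

```c
#include <stdint.h>

// Hypothetical sketch: decodes the UTF-8 sequence that starts at byte
// index bi. Returns the code point, or -1 if the start byte or a
// continuation byte is malformed. Does not reject overlong forms.
int32_t decode_at_byte_sketch(const char str[], int32_t bi) {
    unsigned char b0 = (unsigned char)str[bi];
    int32_t len;
    int32_t cp;
    if (b0 < 0x80)                { len = 1; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    else return -1;
    for (int32_t k = 1; k < len; k++) {
        unsigned char b = (unsigned char)str[bi + k];
        if ((b & 0xC0) != 0x80) return -1;   // expected 10xxxxxx
        cp = (cp << 6) | (b & 0x3F);         // append 6 payload bits
    }
    return cp;
}
```

For the bytes 0xC3 0xA9, the start byte contributes the payload 00011 and the continuation byte contributes 101001, which together give 0xE9 (233), the code point for 'é'.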
Takes a UTF-8 encoded string and a codepoint index, and returns whether the code point at that index is an animal emoji.
For simplicity for this question, we will define that the “animal emojis” are in two ranges: from 🐀 to 🐿️ and from 🦀 to 🦮. (Yes, this technically includes things like 🐽 which are only related to or part of an animal, and excludes a few things like 🙊 and 😸, which are animal faces.) You may find the Wikipedia page on Unicode emoji helpful here.
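Once a code point has been decoded to a number, the range test itself is a pair of comparisons. In the sketch below, the endpoints are the code points for 🐀 (U+1F400), 🐿 (U+1F43F), 🦀 (U+1F980), and 🦮 (U+1F9AE); it is worth double-checking them against the emoji chart yourself. The helper name is hypothetical, and it takes a code point rather than a string and index.

```c
#include <stdint.h>

// Hypothetical helper: 1 if cp falls in either range given above, else 0.
// Endpoints: U+1F400..U+1F43F and U+1F980..U+1F9AE.
int32_t in_animal_ranges(int32_t cp) {
    return (cp >= 0x1F400 && cp <= 0x1F43F) ||
           (cp >= 0x1F980 && cp <= 0x1F9AE);
}
```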
You'll also write a program that reads UTF-8 input and prints out some information about it.
Here's what the output of a sample run of your program should look like:
$ ./utf8analyzer
Enter a UTF-8 encoded string: My 🐩’s name is Erdős.
Valid ASCII: false
Uppercased ASCII: "MY 🐩’S NAME IS ERDőS."
Length in bytes: 27
Number of code points: 21
Bytes per code point: 1 1 1 4 3 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
Substring of the first 6 code points: "My 🐩’s"
Code points as decimal numbers: 77 121 32 128041 8217 115 32 110 97 109 101 32 105 115 32 69 114 100 337 115 46
Animal emojis: 🐩
Next Character of Codepoint at Index 3: 🐪
You can also test the contents of files by using the < operator:
$ cat utf8test.txt
My 🐩’s name is Erdős.
$ ./utf8analyzer < utf8test.txt
Enter a UTF-8 encoded string:
Valid ASCII: false
Uppercased ASCII: "MY 🐩’S NAME IS ERDőS."
Length in bytes: 27
Number of code points: 21
Bytes per code point: 1 1 1 4 3 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
Substring of the first 6 code points: "My 🐩’s"
Code points as decimal numbers: 77 121 32 128041 8217 115 32 110 97 109 101 32 105 115 32 69 114 100 337 115 46
Animal emojis: 🐩
Next Character of Codepoint at Index 3: 🐪
We provide 2 basic tests in the tests folder, which contain simple tests for detecting errors in your code when identifying valid ASCII and converting ASCII lowercase to uppercase characters. We have provided a bash test script that checks if your program output contains each line in the .expect file. You can use the following commands to run the tests (you may need to make the test_script file executable with the command chmod u+x test_script):
gcc *.c -o utfanalyzer   # compiles your C code into an executable called utfanalyzer
./test_script utfanalyzer
The script will then print the results in your terminal.
You can see the result for a single test by using:
./utfanalyzer < test-file
Here are some other ideas for tests you should write. They aren't necessarily comprehensive (you should design your own!) but they should get you started. For each of these kinds of strings, you should check how UTF-8 analyzer handles them:
- Strings with a single UTF-8 character that is 1, 2, 3, 4 bytes
- Strings with two UTF-8 characters in all combinations of 1/2/3/4 bytes (e.g. "aa", "aá", "áa", "áá", and so on)
- Strings with and without animal emojis, including at the beginning, middle, and end of the string, and at the beginning, middle, and end of the range
- Strings of exactly 5 characters
Answer each of these with a few sentences or paragraphs; don't write a whole essay, but use good writing practice to communicate the essence of the idea. A good response doesn't need to be long, but it needs to have attention to detail and be clear. Examples help!
- Another encoding of Unicode is UTF-32, which encodes all Unicode code points in 4 bytes. For things like ASCII, the leading 3 bytes are all 0's. What are some tradeoffs between UTF-32 and UTF-8?
- UTF-8 has a leading 10 on all the bytes past the first for multi-byte code points. This seems wasteful: if the encoding for 3 bytes were instead 1110XXXX XXXXXXXX XXXXXXXX (where X can be any bit), that would fit 20 bits, which is over a million code points worth of space, removing the need for a 4-byte encoding. What are some tradeoffs or reasons the leading 10 might be useful? Can you think of anything that could go wrong with some programs if the encoding didn't include this restriction on multi-byte code points?
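As a starting point for thinking about the second question, note that with the current encoding any single byte can be classified on its own as a start byte or a continuation byte. The sketch below, with a hypothetical function name, relies on that to find the start of the code point containing an arbitrary byte index. It assumes well-formed input and is meant only as an illustration, not as a complete answer.

```c
#include <stdint.h>

// Hypothetical sketch: given any byte index bi inside well-formed UTF-8,
// step backward over continuation bytes (10xxxxxx) until a byte that is
// not a continuation byte is reached, and return that index.
int32_t start_of_code_point(const char str[], int32_t bi) {
    while (bi > 0 && ((unsigned char)str[bi] & 0xC0) == 0x80) {
        bi--;
    }
    return bi;
}
```

For "Joséph", byte index 4 is the second byte of 'é', so the sketch returns 3. It is worth considering how you would write this function if continuation bytes could hold any bit pattern.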
Refer to the policies on assignments for working with others or appropriate use of tools like ChatGPT or GitHub Copilot.
You can use any code from class, lab, or discussion in your work.
- Any .c files you wrote (can be one file or many; it's totally reasonable to only have one). We will run gcc *.c -o utfanalyzer to compile your code, so you should make sure it works when we do that.
- A file DESIGN.md (with exactly that name) containing the answers to the design questions
- Your tests with expected output in files tests/*.txt, tests/*.txt.expect
Hand in to the pa1 assignment on Gradescope. The submission system will show you the output of compiling and running your program on the test input described above to make sure the baseline format of your submission works. You will not get feedback about your overall grade before the deadline.