On 9/21/25 04:47, Alyssa Ross wrote:
Demi Marie Obenour <demiobenour@gmail.com> writes:
Right now, the makefiles in host/rootfs, vm/sys/net, and img/app have manually-maintained lists of files and symlinks. These duplicate the information in the git repository and can easily get out of sync or cause unnecessary merge conflicts. Fix all of these issues by having the git repository be the source of truth, and using a script to generate the file lists. Developers can regenerate the lists before every commit, or even add a git hook to do that.
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com> ---
I like where this is going. :)
Yay!
Changes in v3: - Only include the file list generator. Move the rest to separate patch series. - Remove the update-file-list make targets from img/app/Makefile and vm/sys/net/Makefile. - Link to v2: https://lore.kernel.org/r/20250910-genfiles-v2-0-37ebe07a3cdc@gmail.com
Changes in v2: - Drop the last patch (switching to /etc/s6-rc/compiled) as it is controversial and should be reviewed separately. - Add missing copyright notices. - Use a wrapper shell script to make the awk code easier to read. - Improve documentation. - Add helper scripts for use in git hooks and rebasing. - Link to v1: https://spectrum-os.org/lists/archives/spectrum-devel/20250903-genfiles-v1-0... --- Documentation/development/built-in-vms.adoc | 17 ++++ host/rootfs/Makefile | 102 +---------------------- host/rootfs/file-list.mk | 99 +++++++++++++++++++++++ img/app/Makefile | 80 +++---------------- img/app/file-list.mk | 65 +++++++++++++++ lib/common.mk | 1 + scripts/genfiles.awk | 120 ++++++++++++++++++++++++++++ scripts/genfiles.sh | 29 +++++++ scripts/git-rebase | 17 ++++ scripts/pre-commit.sh | 11 +++
Let's take git-rebase and pre-commit.sh out of this patch, and focus on the generated file lists first.
Will change.
vm/sys/net/Makefile | 50 ++---------- vm/sys/net/file-list.mk | 42 ++++++++++ 12 files changed, 422 insertions(+), 211 deletions(-)
diff --git a/Documentation/development/built-in-vms.adoc b/Documentation/development/built-in-vms.adoc index e90009ee5a3c2c254a7ae11e36121576b819eee7..0addc7d1a2fd322fa12918656baa3d169478504d 100644 --- a/Documentation/development/built-in-vms.adoc +++ b/Documentation/development/built-in-vms.adoc
Copyright header please!
Will fix. Also, in the future you have permission to fix missing copyright headers when you commit. It's fine if you aren't comfortable doing that.
@@ -44,6 +44,23 @@ NOTE: As a special convenience, it's not necessary to run `make clean` if the only change to the Nix files is modifying the packages installed in the VM.
+The list of files used for the VM image is stored in a separate file, +`file-lists.mk`. To update it, run `scripts/genfiles.sh`
Typo: file-list*s*.mk. Also, so far we haven't used code syntax for file names.
Maybe "used for images" would be better, since this also applies to host/rootfs. (Obviously the ideal would be if this documentation wasn't only written for VM images but that's out of scope. We'll get to it.)
Will fix.
+which will regenerate it from the output of `git ls-files`. Any +changes you made will be lost. This script uses uses Git's index to
I think "Any changes you made will be lost." is a bit scary, because it's not clear it only means changes to those files. The sentence could probably just be dropped altogether — I think it's implied by "regenerate".
Will fix.
+generate the list, so you need to use `git add`, `git rm`, and `git mv` +to ensure that Git knows about your changes. It is not necessary to +commit the changes.
"so only staged changes will be reflected"? All the extra stuff has potential for confusion I think — for example "It is not necessary to commit the changes." could be read as "when you make a commit, do not include changes to file-list.mk".
Will fix.
diff --git a/lib/common.mk b/lib/common.mk index 277c3544036d9a9057f8ba4ad37fe2207548cc59..0a03ff440cc671264d2b859a2ae048db9252d047 100644 --- a/lib/common.mk +++ b/lib/common.mk @@ -1,5 +1,6 @@ # SPDX-License-Identifier: EUPL-1.2+ # SPDX-FileCopyrightText: 2021, 2023, 2025 Alyssa Ross <hi@alyssa.is> +# SPDX-FileCopyrightText: 2025 Demi Marie Obenour <demiobenour@gmail.com>
BACKGROUND = background CPIO = cpio
Accident?
Yes.
diff --git a/scripts/genfiles.awk b/scripts/genfiles.awk new file mode 100644 index 0000000000000000000000000000000000000000..6fe327fd0a314d226dbce23854aa8f119e9c8f34 --- /dev/null +++ b/scripts/genfiles.awk @@ -0,0 +1,120 @@ +#!/usr/bin/env -S LC_ALL=C LANGUAGE=C awk -E +# SPDX-License-Identifier: EUPL-1.2+ +# SPDX-FileCopyrightText: 2025 Demi Marie Obenour <demiobenour@gmail.com> +BEGIN { + RS = "\n"; + FS = "\t"; + file_count = 0; + symlink_count = 0; + rc_count = 0; + is_rc = 0; + exit_code = 0; + done = 0;
awk variables are implicitly initialized to 0 when you try to do arithmetic on an undefined variable, so no need for these.
GNU Awk can lint against that. I used its lint mode because it also warns against non-portable constructs. Also, an undefined awk variable used as an array subscript is treated as the empty string, not 0, which could lead to confusion.
+ modes["120000"] = "symlink"; + modes["040755"] = "directory"; + modes["100644"] = "regular"; + modes["100755"] = "regular"; +} + +function fail(msg, status) { + if (status ~ /^([1-9][0-9]?|1[0-9]{2}|2[0-4][1-9]|25[1-5])$/) { + exit_code = status; + } else { + exit_code = 1; + status = 1; + } + print ("FATAL: " msg) > "/dev/stderr"; + exit status;
Do we ever want to exit something other than 1 from this function?
Nope.
+} +done { fail("Junk after DONE", 1); } +/^DONE$/ { + done = 1 + next +} + +# Make sure git produced valid output. +!/^[0-7]{6}\t[ -~]+$/ { + fail("git ls-files produced invalid output", 1); +} +
This is very unlikely to happen, and if it does, it will be obvious from the diff.
Will drop.
+# Extract data from built-in variables. +{ + filename = $2; + raw_mode = $1; + # awk autocreates empty string entries if the key is invalid, + # but the code exits in this case so that is okay. + mode = modes[raw_mode]; +} + +# Another check for a git bug. +filename ~ /^\/|((^|\/)\.{0,2}($|\/))/ { + fail("git ls-files output non-canonical or absolute path '" filename "'", 1); +} +
If there are git bugs, we will notice and report them. We do not need to be the test suite for git here.
Okay, fair!
+filename ~ /[^[:alnum:]_.+@/-]/ { + fail("filename '" filename "' has forbidden characters", 1); +} + +/\.license$/ { + if (raw_mode != "100644") { + fail("License file '" filename "' is executable or not regular file", 1); + } + next; +}
This is also not really in scope for a script that does not care about license files.
Fair. I will leave that to the reuse check.
+ +mode == "directory" { next }
Getting a directory from git ls-files would be sufficiently unexpected that I don't think we should treat it any differently from an unrecognized mode.
Will fix.
+ +filename ~ /^image\/etc\/s6-rc\// { + if (mode != "regular") { + fail("s6-rc-compile input '" filename "' isn't a regular file"); + } + rc_count += 1; + rc_files[rc_count] = filename;
rc_files[rc_count++]
(will make it 0-indexed though so update the loops too)
I think this might break without explicit variable initialization.
+ next; +} + +mode == "symlink" { + symlink_count += 1; + symlinks[symlink_count] = filename; + next; +} + +mode == "regular" { + file_count += 1; + files[file_count] = filename; + next; +} + +{ fail("File '" filename "' is not regular file, directory, or symlink (mode " raw_mode ")"); } + +END { + if (exit_code) { + exit exit_code; + } + if (!done) { + fail("Did not receive DONE line", 1); + } + printf ("# SPDX-License-Identifier: CC0-1.0\n" \ + "# SPDX-FileCopyrightText: 2025 Demi Marie Obenour <demiobenour@gmail.com>\n" \
Okay, so, it's silly that this needs to have a copyright header on it at all, but since we have to have one to make reuse happy, I think it should be mine from 2021, because the comment about links is the closest thing to creative expression in here.
Will fix.
+ "# Generated by scripts/genfile.sh. Any changes will be overwritten.\n" \ + "FILES ::=") > out_file;
I note the change to ::=. Do you think we should do that across the board in our Makefiles?
POSIX specifies ::= and it has better semantics in most cases, but I don't know if the BSD makes implement it. ::= causes the RHS to be expanded immediately, so subsequent changes in variables referenced by it do not affect the LHS.
+ for (array_index = 1; array_index <= file_count; array_index += 1) { + printf " \\\n\t%s", files[array_index] > out_file; + } + printf ("\n\n" \ +"# These are separate because they need to be included, but putting\n" \ +"# them as make dependencies would confuse make.\n" \ +"LINKS ::=") > out_file; + for (array_index = 1; array_index <= symlink_count; array_index += 1) { + printf " \\\n\t%s", symlinks[array_index] > out_file; + } + printf "\n\nS6_RC_FILES ::=" > out_file; + for (array_index = 1; array_index <= rc_count; array_index += 1) { + printf " \\\n\t%s", rc_files[array_index] > out_file; + } + printf "\n" > out_file; + if (close(out_file)) { + print ("Cannot close output file: " ERRNO "\n") > "/dev/stderr"; + exit 1; + } +} diff --git a/scripts/genfiles.sh b/scripts/genfiles.sh new file mode 100755 index 0000000000000000000000000000000000000000..77a8d95e88b6851be9447698556efe4f1eab174b --- /dev/null +++ b/scripts/genfiles.sh @@ -0,0 +1,29 @@ +#!/usr/bin/env -S LC_ALL=C LANGUAGE=C bash --
env -S is not portable, and I don't think anything here needs bash specifically.
$'\t' doesn't work with all shells, though I believe it is either part of the current POSIX standard or will be added. I'll use /usr/bin/env bash, which breaks if the script is renamed to something starting with '-'.
We can set the locale variables after the script starts, because I don't think this wrapper script is going to do anything locale-specific. (And shouldn't they be C.UTF-8?)
The C locale is actually what I intended. The script does not rely on support for non-ASCII characters, and it does use the fact that negated character classes match all bytes. Admittedly, this will only be needed if there is a git bug.
+set -euo pipefail +unset output_file astatus
This is a bit overly defensive IMO. Both of these variables are assigned before use, and if they weren't, the person making those changes would be very unlikely to not notice because they had those variables defined in their environment.
Fair!
+case $0 in +(/*) cd "${0%/*}/..";; +(*/*) cd "./${0%/*}/..";; +(*) cd ..;; +esac
Perhaps we could use git rev-parse --show-toplevel?
git ls-files doesn't have that option.
+for i in host/rootfs img/app vm/sys/net; do + output_file=$i/file-list.mk + { + git -C "$i" -c core.quotePath=true ls-files $'--format=%(objectmode)\t%(path)' -- image | + sort -t $'\t' -k 2
TIL sort -t and -k! 🤯
+ echo DONE
Why do we need this?
To avoid producing any output file if the input is truncated.
+ } | + gawk -v "out_file=$output_file.tmp" -E scripts/genfiles.awk
Why not stdout?
The output file is created by awk so that it is only created if nothing went wrong.
And why gawk? I didn't immediately notice anything non-POSIX, and as usual would prefer to stick to it.
POSIX does not specify -E. I can use -f instead, though.
+ if [ -f "$output_file" ]; then + # Avoid changing output file if it is up to date, as that + # would cause unnecessary rebuilds. + if cmp -s -- "$output_file.tmp" "$output_file"; then + rm -- "$output_file.tmp" + continue + else + astatus=$? + if [ "$astatus" != 1 ]; then exit "$astatus"; fi
Could avoid the need for the variable and multiple ifs. Up to you whether you prefer it:
set +e cmp -s -- "$output_file.tmp" "$output_file" set -e case $? in 0) rm -- "$output_file.tmp" continue ;; 1) ;; *) exit $? ;; esac
This might set $? to the return value of 'set -e' (0). Whether or not it actually does is at least not obvious from reading the code.
+ fi + fi + mv -- "$output_file.tmp" "$output_file" +done
-- Sincerely, Demi Marie Obenour (she/her/hers)