Conversation

@asomers (Contributor) commented Jul 1, 2025

Specifying the verbose flag twice will display a list of all corrupt sectors within each corrupt file, as opposed to just the name of the file.

Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored by: ConnectWise

Motivation and Context

Displays the record number of every corrupt record in every corrupt file. I find this very useful when cleaning up the fallout from #16626.

Description

The kernel already tracks the blkid of every corrupt record, and already transmits that information to userland. But libzfs has always thrown it away, until now. This PR adds a -vv option to zpool status. When used, it will print the level and blkid of every corrupt record. It works in combination with -j, too.
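
For illustration, here is a rough sketch of the consumer side. This is not the PR's code; the ZPOOL_ERR_LEVEL/ZPOOL_ERR_BLKID names are taken from the diff excerpt quoted later in this thread, and the rest follows the existing libzfs error-log conventions.

#include <sys/param.h>
#include <stdio.h>
#include <libzfs.h>

/*
 * Sketch only: walk the per-error nvlists returned by zpool_get_errlog()
 * and print the path, indirection level, and block id of each corrupt
 * record.  ZPOOL_ERR_DATASET/ZPOOL_ERR_OBJECT already exist today;
 * ZPOOL_ERR_LEVEL/ZPOOL_ERR_BLKID are the pairs this PR adds when the
 * verbose error log is requested.
 */
static void
print_verbose_errors(zpool_handle_t *zhp, nvlist_t *nverrlist)
{
	char path[MAXPATHLEN * 2];
	nvpair_t *elem = NULL;

	while ((elem = nvlist_next_nvpair(nverrlist, elem)) != NULL) {
		nvlist_t *nv = fnvpair_value_nvlist(elem);
		uint64_t dsobj = fnvlist_lookup_uint64(nv, ZPOOL_ERR_DATASET);
		uint64_t obj = fnvlist_lookup_uint64(nv, ZPOOL_ERR_OBJECT);
		uint64_t lvl = fnvlist_lookup_uint64(nv, ZPOOL_ERR_LEVEL);
		uint64_t blkid = fnvlist_lookup_uint64(nv, ZPOOL_ERR_BLKID);

		/* Resolve the dataset/object pair to a file path. */
		zpool_obj_to_path(zhp, dsobj, obj, path, sizeof (path));
		(void) printf("\t%s L%llu record %llu\n", path,
		    (unsigned long long)lvl, (unsigned long long)blkid);
	}
}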

How Has This Been Tested?

  • Manually tested on about half a dozen production datasets that had on-disk corruption as a result of #16626, in both L0 and L1 blocks.
  • Manually tested on a test dataset that I intentionally corrupted. That one had multiple corrupted records in multiple files.

Example output, in human-readable mode:

...
errors: Permanent errors have been detected in the following files:

        /testpool/randfile.7 L0 record 3
        /testpool/randfile.7 L0 record 9
        /testpool/randfile.7 L0 record 16
        /testpool/randfile.9 L0 record 8
        /testpool/randfile.9 L0 record 15
        /testpool/randfile.10 L0 record 3
        /testpool/randfile.10 L0 record 11
        /testpool/randfile.5 L0 record 17
        /testpool/randfile.8 L0 record 11
        /testpool/randfile.8 L0 record 19
        /testpool/randfile.6 L0 record 3
        /testpool/randfile.6 L0 record 12
Example output, in JSON mode:
{
  "output_version": {
    "command": "zpool status",
    "vers_major": 0,
    "vers_minor": 1
  },
  "pools": {
    "testpool": {
      "name": "testpool",
      "state": "ONLINE",
      "pool_guid": "10305967396160717712",
      "txg": "1523",
      "spa_version": "5000",
      "zpl_version": "5",
      "status": "One or more devices has experienced an error resulting in data\n\tcorruption.  Applications may be affected.\n",
      "action": "Restore the file in question if possible.  Otherwise restore the\n\tentire pool from backup.\n",
      "msgid": "ZFS-8000-8A",
      "moreinfo": "https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A",
      "scan_stats": {
        "function": "SCRUB",
        "state": "FINISHED",
        "start_time": "Tue Jul  1 12:11:14 2025",
        "end_time": "Tue Jul  1 12:11:14 2025",
        "to_examine": "56.0M",
        "examined": "56.0M",
        "skipped": "92K",
        "processed": "0B",
        "errors": "12",
        "bytes_per_scan": "0B",
        "pass_start": "1751393474",
        "scrub_pause": "-",
        "scrub_spent_paused": "0",
        "issued_bytes_per_scan": "55.9M",
        "issued": "55.9M"
      },
      "vdevs": {
        "testpool": {
          "name": "testpool",
          "vdev_type": "root",
          "guid": "10305967396160717712",
          "class": "normal",
          "state": "ONLINE",
          "alloc_space": "56.0M",
          "total_space": "112M",
          "def_space": "112M",
          "read_errors": "0",
          "write_errors": "0",
          "checksum_errors": "0",
          "vdevs": {
            "/tmp/zfs.img": {
              "name": "/tmp/zfs.img",
              "vdev_type": "file",
              "guid": "1719526601577822810",
              "path": "/tmp/zfs.img",
              "class": "normal",
              "state": "ONLINE",
              "alloc_space": "56.0M",
              "total_space": "112M",
              "def_space": "112M",
              "rep_dev_size": "116M",
              "self_healed": "1.50K",
              "phys_space": "128M",
              "read_errors": "0",
              "write_errors": "0",
              "checksum_errors": "27",
              "slow_ios": "0"
            }
          }
        }
      },
      "error_count": "12",
      "errlist": [
        {
          "path": "/testpool/randfile.7",
          "level": 0,
          "record": 3
        },
        {
          "path": "/testpool/randfile.7",
          "level": 0,
          "record": 9
        },
        {
          "path": "/testpool/randfile.7",
          "level": 0,
          "record": 16
        },
        {
          "path": "/testpool/randfile.9",
          "level": 0,
          "record": 8
        },
        {
          "path": "/testpool/randfile.9",
          "level": 0,
          "record": 15
        },
        {
          "path": "/testpool/randfile.10",
          "level": 0,
          "record": 3
        },
        {
          "path": "/testpool/randfile.10",
          "level": 0,
          "record": 11
        },
        {
          "path": "/testpool/randfile.5",
          "level": 0,
          "record": 17
        },
        {
          "path": "/testpool/randfile.8",
          "level": 0,
          "record": 11
        },
        {
          "path": "/testpool/randfile.8",
          "level": 0,
          "record": 19
        },
        {
          "path": "/testpool/randfile.6",
          "level": 0,
          "record": 3
        },
        {
          "path": "/testpool/randfile.6",
          "level": 0,
          "record": 12
        }
      ]
    }
  }
}

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@gamanakis (Contributor) commented:

On a first pass it looks good to me, thanks! Though I'm not really sure why the checks are failing. Could you squash and re-push?

@asomers (Contributor Author) commented Jul 20, 2025

Well, I think the "checkstyle" check is failing because I didn't update libzfs.abi. But I can find no instructions for how to do that. @ixhamza you were the last to do it. Could you please tell me how to update libzfs.abi due to a function prototype change?

@gmelikov (Member) commented Jul 20, 2025

In addition to the ABI issue, I see:

cmd/zpool/zpool_main.c: In function ‘errors_nvlist’:
cmd/zpool/zpool_main.c:9590:41: error: ‘errnvl’ may be used uninitialized [-Werror=maybe-uninitialized]
 9590 |                                         fnvlist_add_nvlist_array(item,
      |                                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 9591 |                                             "errlist",
      |                                             ~~~~~~~~~~
 9592 |                                             (const nvlist_t **)errnvl,
      |                                             ~~~~~~~~~~~~~~~~~~~~~~~~~~
 9593 |                                             count);
      |                                             ~~~~~~
cmd/zpool/zpool_main.c:9532:44: note: ‘errnvl’ was declared here
 9532 |                                 nvlist_t **errnvl;
      |                                            ^~~~~~
cc1: all warnings being treated as errors

You can get the new ABI here: https://github.com/openzfs/zfs/actions/runs/16008660282 (see the artifact; direct link: https://github.com/openzfs/zfs/actions/runs/16008660282/artifacts/3443942581r ).
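
For reference, this class of warning is usually silenced by giving the pointer a definite value on every path. A generic reduction, with hypothetical names and not the PR's actual fix, looks like:

#include <stdlib.h>

/*
 * Hypothetical reduction of the -Wmaybe-uninitialized warning above:
 * 'arr' is only assigned on one branch, so GCC cannot prove it is
 * initialized at the later use.  Initializing it to NULL and guarding
 * the use silences the warning without changing behavior.
 */
void
build_list(int verbose, size_t count)
{
	char **arr = NULL;	/* was: char **arr; (set on one branch only) */

	if (verbose)
		arr = calloc(count, sizeof (char *));

	if (arr != NULL) {
		/* ... populate and consume arr ... */
		free(arr);
	}
}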

@amotin (Member) left a comment:

I have no critical objections, but there are a few things I would do differently. Also, please rebase it onto the latest master and clean up the commit history.

Comment on lines 9559 to 9653
if (cb->cb_verbosity < 2) {
	errl[i] = safe_malloc(len);
	zpool_obj_to_path(zhp, dsobj,
	    obj, errl[i++], len);
} else {
	uint64_t lvl, blkid;

	errnvl[i] = fnvlist_alloc();
	lvl = fnvlist_lookup_uint64(nv,
	    ZPOOL_ERR_LEVEL);
	blkid = fnvlist_lookup_uint64(
	    nv, ZPOOL_ERR_BLKID);
	zpool_obj_to_path(zhp, dsobj,
	    obj, pathbuf, len);
	fnvlist_add_string(errnvl[i],
	    "path", pathbuf);
	fnvlist_add_uint64(errnvl[i],
	    "level", lvl);
	fnvlist_add_uint64(errnvl[i++],
	    "record", blkid);
}
@amotin (Member):

Couldn't we always do it the verbose way here to simplify the code, and just not print the additional information later? It does not look like a very performance-sensitive code path.

@asomers (Contributor Author):

Not really, because if verbosity is < 2, then nverrlist won't contain the "level" and "blockid" fields. And I don't want zpool_get_errlog to always supply those fields, because it could result in enormous nvlists if a single file has many corrupt records.

@behlendorf (Contributor) commented Aug 13, 2025:

enormous nvlists if a single file has many corrupt records.

Along these same lines I'm a bit concerned with logging a line per block. That could be overwhelming.

Thinking about this from a user perspective, I really don't care about how ZFS decided to internally lay out the file (objid, level, blkid). What is useful to me are the file offsets which are corrupt. That's a little more work to generate but shouldn't be too bad. Maybe something like:

errors: Permanent errors have been detected in the following files:

        /testpool/randfile.7 393216-524288,1048576-1179647,1966080-2097151
        /testpool/randfile.5 917504-1048575
        ...

@amotin (Member) commented Aug 14, 2025:

And I don't want zpool_get_errlog to always supply those fields

I didn't mean to always supply all the ranges, merely to use the same nvlist-based data structure, just with one vs. many entries per file, if that would make the code cleaner. But if not, I won't insist.

@behlendorf added the "Status: Code Review Needed" and "Status: Revision Needed" labels on Aug 6, 2025.
@github-actions bot removed the "Status: Revision Needed" label on Aug 12, 2025.
@asomers requested a review from @amotin on August 12, 2025, 19:46.
@asomers (Contributor Author) commented Aug 12, 2025

I applied your suggestions, rebased, and squashed, @amotin.

@amotin (Member) commented Aug 12, 2025

@asomers You seem to have addressed 1.5 of my comments. What about the other 1.5?

@asomers (Contributor Author) commented Aug 13, 2025

@asomers You seem to have addressed 1.5 of my comments. What about the other 1.5?

Do you mean the comment about "Couldn't we always do it the verbose way here to simplify the code"? I explained why I thought that would be a bad idea. Or do you mean that I didn't replace one safe_malloc call with calloc? I thought it better to stick with safe_malloc so I wouldn't need to add extra error handling.

@amotin (Member) commented Aug 13, 2025

Do you mean the comment about "Couldn't we always do it the verbose way here to simplify the code"? I explained why I thought that would be a bad idea.

I'm sorry if I lost it in context switches, but I don't remember. Could you point out where?

Or do you mean that I didn't replace one safe_malloc call with calloc? I thought it better to stick with safe_malloc so I wouldn't need to add extra error handling.

That one too. You've added error checks for the two calloc() cases, while relying on safe_malloc() in the third case, which seems to do exactly the same thing. Why not use safe_malloc() in all three places and let it handle things? Sure, calloc() might look nicer for arrays, but do we really care?

@asomers (Contributor Author) commented Aug 13, 2025

Do you mean the comment about "Couldn't we always do it the verbose way here to simplify the code"? I explained why I thought that would be a bad idea.

I'm sorry if I lost it in context switches, but I don't remember. Could you point out where?

I was referring to #17502 (comment).

Or do you mean that I didn't replace one safe_malloc call with calloc? I thought it better to stick with safe_malloc so I wouldn't need to add extra error handling.

That one too. You've added error checks for the two calloc() cases, while relying on safe_malloc() in the third case, which seems to do exactly the same thing. Why not use safe_malloc() in all three places and let it handle things? Sure, calloc() might look nicer for arrays, but do we really care?

Yes, exactly. I chose to use calloc precisely because I was allocating an array. I think it's best to always use calloc for arrays, not just because it looks nicer, but because it protects against overflow in the size multiplication. Do you really want me to change it?
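
For what it's worth, the overflow argument is easy to show in isolation (generic C, not code from this PR):

#include <stdint.h>
#include <stdlib.h>

/*
 * malloc(nelem * elsize) silently wraps if the product overflows SIZE_MAX
 * and returns a too-small buffer; calloc() is required to detect the
 * overflow and fail instead.  The explicit check below is what calloc()
 * effectively does for you.
 */
void *
alloc_array(size_t nelem, size_t elsize)
{
	if (nelem != 0 && elsize > SIZE_MAX / nelem)
		return (NULL);			/* nelem * elsize would overflow */
	return (calloc(nelem, elsize));		/* zeroed and overflow-checked */
}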

@amotin (Member) commented Aug 13, 2025

I was referring to #17502 (comment).

Right. I see my question, but not your response. Did you send it? ;)

Do you really want me to change it?

No. I won't fight over it.

@asomers (Contributor Author) commented Aug 13, 2025

I was referring to #17502 (comment).

Right. I see my question, but not your response. Did you send it? ;)

Ahh, that's it. It was stuck in the "Pending" state. It's posted now.

errors_nvlist(zpool_handle_t *zhp, status_cbdata_t *cb, nvlist_t *item)
{
	uint64_t nerr;
	int verbosity = cb->cb_verbosity;
Contributor:

We should use cb->cb_verbosity throughout and remove the local variable.

@asomers (Contributor Author):

The only reason I created the local variable is so I didn't have to split a long line at 80 columns. IMHO it looks better this way. But I'll change it if you want me to.

Contributor:

I'd just like to make sure we're using either the local variable or cb->cb_verbosity consistently in this function. I don't feel strongly about which one, so if you want to stick with the local variable we should update the other places it's used.

@asomers (Contributor Author) commented Aug 18, 2025

Along these same lines I'm a bit concerned with logging a line per block. That could be overwhelming.

Me too. That's why I don't want it to be the default, but only opted into with "-vv".

Thinking about this from a user perspective, I really don't care about how ZFS decided to internally lay out the file (objid, level, blkid). What is useful to me are the file offsets which are corrupt. That's a little more work to generate but shouldn't be too bad.

I have two thoughts about this:

  • While it may not be for everyone, I actually do find the level and blkid to be useful when I'm trying to recover from the damage caused by #16626 ("Occasional panics with 'blkptr at XXX has invalid YYY'").
  • In order to switch the display to byte offsets I would need to know the recsize of the object, not just the dataset. There's no ioctl to get that for a given object id. I could use st_blksize, if zpool_obj_to_path succeeds (see the sketch below). But if it does not, then I don't know of any way to get the object's recsize. Do you?
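
A hypothetical illustration of that second point (invented helper name, not code from the PR): when path resolution succeeds, st_blksize gives the record size and an L0 blkid maps directly to a byte range.

#include <sys/stat.h>
#include <stdint.h>

/*
 * Hypothetical helper: if zpool_obj_to_path() produced a usable path,
 * stat() the file and use st_blksize as the record size.  An L0 block id
 * then covers bytes [blkid * blksize, (blkid + 1) * blksize - 1].
 */
int
l0_blkid_to_byte_range(const char *path, uint64_t blkid,
    uint64_t *start, uint64_t *end)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return (-1);	/* no path, no record size */

	*start = blkid * (uint64_t)st.st_blksize;
	*end = *start + (uint64_t)st.st_blksize - 1;
	return (0);
}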

@behlendorf (Contributor) commented:

That's fair. Yeah, now that you point it out I don't see a great solution for getting the record size. You could imagine either extending or adding a new ioctl interface, but that's more complexity and compatibility code I'd really prefer to avoid. Reporting block IDs it is. Perhaps then something just a little more concise?

        /testpool/randfile.7 L0=3-4,7,10 L1=1

* Use a local variable more consistently
* Condense error reports into runs of contiguous blocks
@asomers (Contributor Author) commented Aug 20, 2025

@behlendorf with the latest push, error reports look like this:

errors: Permanent errors have been detected in the following files:

        /testpool/tmp/randfile.6 L0=0-2,L0=4
        /testpool/tmp/randfile.5 L0=0-7,L0=22-62
        ...
        /testpool2/randfile L1=5

Combining runs of contiguous blocks is probably good. But I'm not sure that I like combining discontiguous runs onto a single line. That means a file with many discontiguous errors could end up being printed as an extremely long line.
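
For reference, collapsing a sorted list of block IDs into contiguous runs is a small amount of code. A generic sketch (not the PR's implementation) follows.

#include <stdint.h>
#include <stdio.h>

/*
 * Generic sketch: print a sorted array of block ids as comma-separated
 * runs, e.g. {0, 1, 2, 4} -> "0-2,4".  Per-level prefixes such as "L0="
 * would be layered on top of the same idea.
 */
void
print_runs(const uint64_t *blkids, size_t n)
{
	for (size_t i = 0; i < n; ) {
		size_t j = i;

		/* Extend the run while the next id is consecutive. */
		while (j + 1 < n && blkids[j + 1] == blkids[j] + 1)
			j++;
		if (i > 0)
			(void) printf(",");
		if (j == i)
			(void) printf("%llu", (unsigned long long)blkids[i]);
		else
			(void) printf("%llu-%llu",
			    (unsigned long long)blkids[i],
			    (unsigned long long)blkids[j]);
		i = j + 1;
	}
	(void) printf("\n");
}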

@asomers requested a review from @behlendorf on August 20, 2025, 23:29.
@behlendorf (Contributor) commented Aug 22, 2025

@asomers I finally realized why this felt familiar. PR #9781 was working on adding exactly this same functionality, but it unfortunately ended up stalling out. #9781 is unsurprisingly very similar to yours, but it has a few additions we should incorporate.

For example, I think the output format they settled on is quite nice. It collapses contiguous ranges, prints the byte offsets of each range, and even gives a nice summary of the damaged blocks. Here's the example output:

errors: Permanent errors have been detected in the following files:

    /var/tmp/testdir/10m_file: found 9 corrupted 128K blocks
       [0x0-0x1ffff] (128K)
       [0x100000-0x1fffff] (1M)

    /var/tmp/testdir/1m_file: found 1 corrupted 128K block
       [0x0-0x1ffff] (128K)

The original PR extends the ZFS_IOC_OBJ_TO_STATS ioctl to accomplish this. We can't do exactly that because it's one of the legacy ioctl interfaces, and we don't want to break the user/kernel ioctl ABI by adding fields to zfs_stat_t, which is embedded in zfs_cmd_t. But we could register a new ioctl that uses the modern in/out nvlists and use that.

@tonyhutter (Contributor) commented:

@behlendorf whoa, I totally forgot about that old PR (which, ironically, is itself an updated version of an even older PR, #8902).

@asomers feel free to revive whatever you want from that PR. I remember the range tree being a nice way to collapse error ranges. If you do use bits from the old PR, please credit the original author: TulsiJain <tulsi.jain@delphix.com>.

Along with that, since we now support JSON output (zpool status --json), you'll want to add the JSON-ified versions of the error ranges.

@asomers (Contributor Author) commented Sep 24, 2025

@tonyhutter I forgot to mention that the PR as-is already works with JSON output. It looks like this:

      "error_count": "2",
      "errlist": [
        {
          "path": "POOL/DATASET@SNAPSHOT1:/FILE",
          "level": 0,
          "record": 867511
        },
        {
          "path": "POOL/DATASET@SNAPSHOT2:/FILE",
          "level": 0,
          "record": 867511
        },
