Adds data loaders for Roman SSC spectral files by havok2063 · Pull Request #1303 · astropy/specutils

havok2063 · 2026-01-14T19:31:06Z

This PR adds new data loaders for Roman SSC spectral data products. It adds loaders to handle the bundled per-detector 1d extracted spectral files, both the per-exposure 1d_individual files and the 1d_combined files. These files contain multiple sources, loadable as SpectrumList. If loading with Spectrum, the source keyword argument can be used to specify which source to grab; when no source id is specified, the first source is extracted by default. It also adds a Spectrum loader for a single-source Roman spectral file. These files will be created dynamically on demand by MAST via astrocut.

Marking as draft until (close to) final spectral data products are available. Updates may be needed to account for data model or structural changes.

The loaders work fine for files with <1000 sources (0.3 seconds to load), but won't scale to the likely numbers of 10-50,000 sources per detector. A test with 100,000 sources took ~44 seconds to load into a SpectrumList. The bulk of the time is spent creating the Spectrum objects in the SpectrumList, but 12 seconds of it is spent reading the asdf file.

Edit: After the initial lazy loading implementation, the test file with 100,000 sources takes the initial 12 seconds to load the file once. Subsequent list access or SpectrumList.read calls use the cached file and spectrum objects.

The PR now updates SpectrumList to

allow for lazy loading of Spectrum objects in the list
apply a mapping of an alternate string label to index the list by

Lazy Loading

Currently, the lazy loader is opt-in per data loader. The data_loader decorator has a new lazy_loader kwarg that accepts a function to be called on each list item index. Each item is cached after it is loaded, so accessing the same item again does not re-call the loader. Once a lazy loader is specified, you tell SpectrumList to load the list lazily by passing lazy_load=True to SpectrumList.read. Optionally, specify cache_asdf to instruct it to cache the open asdf file itself.

# create a lazy loadable spectrum list
s = SpectrumList.read(tmp, lazy_load=True, cache_asdf=True)

# the list of n spectra with placeholder objects
s
[<object object at 0x118878500>, <object object at 0x118878500>, <object object at 0x118878500>, <object object at 0x118878500>, <object object at 0x118878500>, <object object at 0x118878500>]

# load the first item
s[0]
<Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>

# repr is updated
s
[<Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>, <object object at 0x118878500>, <object object at 0x118878500>, <object object at 0x118878500>, <object object at 0x118878500>, <object object at 0x118878500>]
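
For readers less familiar with the pattern, the load-on-first-access and caching behavior described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `LazyList`, `_PLACEHOLDER`, and `fake_loader` are hypothetical names, and the sketch handles integer indices only. It uses one shared sentinel object for unloaded slots, which would be consistent with the identical `<object object at ...>` addresses in the repr above.

```python
from typing import Any, Callable

# Hypothetical shared sentinel marking a slot that has not been loaded yet.
_PLACEHOLDER = object()


class LazyList(list):
    """Hypothetical list that loads each item on first access and caches it."""

    def __init__(self, length: int, loader: Callable[[int], Any]):
        # Fill every slot with the same placeholder; nothing is loaded yet.
        super().__init__([_PLACEHOLDER] * length)
        self._loader = loader

    def __getitem__(self, index):
        if not isinstance(index, int):
            raise TypeError("this sketch only supports integer indices")
        # Normalize negative indices and raise IndexError when out of range.
        index = range(len(self))[index]
        item = super().__getitem__(index)
        if item is _PLACEHOLDER:
            # First access: call the loader once and cache the result in place.
            item = self._loader(index)
            super().__setitem__(index, item)
        return item


# Usage: the loader records how often it is called for each index.
calls = []


def fake_loader(i):
    calls.append(i)
    return f"spectrum-{i}"


lazy = LazyList(3, fake_loader)
first = lazy[0]
again = lazy[0]  # served from the cache, the loader is not called again
```

Storing the loaded object back into the same slot keeps the list length fixed and means a second access never touches the file again.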

Without lazy_load=True, it falls back to eager loading of all spectra using the standard defined data loader.

s = SpectrumList.read(tmp)
s
[<Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>, <Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>, <Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>, <Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>, <Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>, <Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>]

s['402849']
<Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>

Alternate Labels

You can specify string labels in your data loader to use as alternate indices for your SpectrumList. This can be done independently of lazy loading (with SpectrumList.set_id_map) or in conjunction with it (by passing labels to SpectrumList.from_lazy). For the Roman loaders, I pass in the source ids as string labels.

# labels specified independently of lazy loading
s = SpectrumList.read(tmp, lazy_load=True, cache_asdf=True)
s
[<object object at 0x118878500>, <object object at 0x118878500>, <object object at 0x118878500>, <object object at 0x118878500>, <object object at 0x118878500>, <object object at 0x118878500>]

s['402849']
<Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>

# labels specified in the lazy loader callable
s = SpectrumList.read(tmp, lazy_load=True, cache_asdf=True)
s
['402849', '403613', '403686', '404935', '404979', '414981']

s[0]
<Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>

s['404935']
<Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>

s
[<Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>, '403613', '403686', <Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>, '404979', '414981']
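
As a rough illustration of the alternate-label idea, a string label can be resolved through a label-to-index map before falling back to normal list indexing. This is a sketch under stated assumptions, not the PR's implementation: `LabelledList` and `_id_map` are hypothetical names, and the uniqueness check only mirrors the intent of the "add checks for unique labels" commit.

```python
class LabelledList(list):
    """Hypothetical list that can also be indexed by unique string labels."""

    def __init__(self, items, labels):
        items = list(items)
        labels = list(labels)
        if len(labels) != len(items):
            raise ValueError("need exactly one label per item")
        if len(set(labels)) != len(labels):
            # Duplicate labels would make string lookups ambiguous.
            raise ValueError("labels must be unique")
        super().__init__(items)
        # Map each label to its integer position in the list.
        self._id_map = {label: i for i, label in enumerate(labels)}

    def __getitem__(self, key):
        if isinstance(key, str):
            try:
                key = self._id_map[key]
            except KeyError:
                raise KeyError(f"unknown label {key!r}") from None
        # Integers and slices behave exactly as they do for a plain list.
        return super().__getitem__(key)


# Usage with source ids taken from the example above as labels.
spectra = LabelledList(["spec-a", "spec-b", "spec-c"],
                       ["402849", "403613", "403686"])
by_label = spectra["403613"]
by_index = spectra[1]
```

Because the map only translates a label into an integer, integer indexing keeps working unchanged alongside the string lookup.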

SDSS example of lazy loading + alt id

This uses the extension name as alt string labels.

sdss='spec-015004-59190-4400696424.fits'

s = SpectrumList.read(sdss, format='SDSS-V spec', lazy_load=True)
s
['COADD', 'MJD_EXP_59190-00', 'MJD_EXP_59190-01', 'MJD_EXP_59190-02']

s['COADD']
<Spectrum(flux=[-35847924.0 ... -1.905903697013855] 1e-17 erg / (Angstrom s cm2) (shape=(4648,), mean=-102516.42969 1e-17 erg / (Angstrom s cm2)); spectral_axis=<SpectralAxis [ 3566.9744  3567.795   3568.6177 ... 10394.412  10396.809  10399.206 ] Angstrom> (length=4648); uncertainty=InverseVariance)>

s
[<Spectrum(flux=[-35847924.0 ... -1.905903697013855] 1e-17 erg / (Angstrom s cm2) (shape=(4648,), mean=-102516.42969 1e-17 erg / (Angstrom s cm2)); spectral_axis=<SpectralAxis [ 3566.9744  3567.795   3568.6177 ... 10394.412  10396.809  10399.206 ] Angstrom> (length=4648); uncertainty=InverseVariance)>, 
'MJD_EXP_59190-00', 'MJD_EXP_59190-01', 'MJD_EXP_59190-02']

havok2063 · 2026-01-26T19:26:22Z

@rosteen I think this is ready for an initial look with respect to the lazy loading and alternate list indexing. Let me know what you think. This does not implement any file caching for FITS files, but we could probably do that if we wanted. Once we're happy with the implementation I can start updating the Sphinx docs.

havok2063 · 2026-02-02T17:01:44Z

@rosteen I'd like to start writing up some docs for the lazy loading of SpectrumList. Should I start a draft of that now or wait for feedback on the implementation?

rosteen · 2026-02-02T17:46:49Z

> @rosteen I'd like to start writing up some docs for the lazy loading of SpectrumList. Should I start a draft of that now or wait for feedback on the implementation?

I'll take a look at this this afternoon, if you want to hold off until tomorrow.

rosteen · 2026-02-02T21:47:54Z

This is looking good I think. The only thing that makes me hesitate is the repr behavior when lazy loaded. I'm just worried that users will print out their loaded spectrum list and see something like ['402849', '403613', '403686', '404935', '404979', '414981'] and go "that's not right, where are my spectra!?" having not read the documentation. Maybe in that case the repr could include a short inline statement that it's been lazy loaded and accessing the indices will return the actual Spectrum object? I'm not sure what the best strategy is, maybe it's fine as-is. I don't see any issues with the actual functionality/implementation.

havok2063 · 2026-02-03T19:50:51Z

I can remove the labels for lazy list if you think that's too confusing. The items could still be accessed with the string source id. It just wouldn't be clear that you could do that. Would you prefer the default placeholder of [<object object at 0x10be7c500>, <object object at 0x10be7c500>,...]?

I considered [placeholder, placeholder, placeholder,...] but didn't like how that looked.

Alternatively, what about this as an inline reference?

s = SpectrumList.read(tmp, lazy_load=True, cache_asdf=True)
s
lazy list: 0 items loaded; access an index to load a spectrum:
['402849', '403613', '403686', '404935', '404979', '414981']

s[0]

s
lazy list: 1 items loaded; access an index to load a spectrum:
[<Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>, '403613', '403686', '404935', '404979', '414981']

Once all items are loaded, it uses the normal list repr and the inline statement no longer appears.
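
A repr along these lines could be sketched as below. This is a minimal illustration, assuming unloaded slots hold a shared placeholder object and that one label exists per slot; `lazy_repr` and its parameters are hypothetical names, not the function added in this PR. The note text is copied from the example above.

```python
def lazy_repr(items, labels, placeholder):
    """Build a repr showing labels for unloaded slots and a short inline note."""
    # Show the label where the slot is still a placeholder, else the item repr.
    shown = [
        repr(label) if item is placeholder else repr(item)
        for item, label in zip(items, labels)
    ]
    body = "[" + ", ".join(shown) + "]"
    n_loaded = sum(item is not placeholder for item in items)
    if n_loaded == len(items):
        # Fully loaded: plain list repr, no inline note.
        return body
    note = f"lazy list: {n_loaded} items loaded; access an index to load a spectrum:"
    return note + "\n" + body


# Usage: one of three slots has been loaded.
unloaded = object()
partial = lazy_repr(["spec-a", unloaded, unloaded],
                    ["402849", "403613", "403686"],
                    unloaded)
complete = lazy_repr(["spec-a", "spec-b"], ["402849", "403613"], unloaded)
```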

How do you feel about the logic of adding a lazy loader? I do wonder if it's still too manual a process, but I couldn't think of a solution that would work completely automatically for existing SpectrumList loaders without modification.

rosteen · 2026-02-03T20:38:56Z

> Alternatively, what about this as an inline reference?

I quite like this, I think it's a good solution.

I don't think this is too manual for adding a lazy loader, personally, and probably not all missions need one anyway.

havok2063 · 2026-02-04T15:15:00Z

Sounds good. I'll start working on the docs today. Locally I get 4 failing tests, all from test_loaders.py. Should I worry about fixing those, or any of the CI issues?

FAILED tests/test_loaders.py::test_tabular_fits_compressed[bzip2] - astropy.io.registry.base.IORegistryError: Format could not be identified based on the file name or contents, please provide a 'format' argument.
FAILED tests/test_loaders.py::test_tabular_fits_compressed[xz] - astropy.io.registry.base.IORegistryError: Format could not be identified based on the file name or contents, please provide a 'format' argument.
FAILED tests/test_loaders.py::test_wcs1d_fits_compressed[bzip2] - astropy.io.registry.base.IORegistryError: Format could not be identified based on the file name or contents, please provide a 'format' argument.
FAILED tests/test_loaders.py::test_wcs1d_fits_compressed[xz] - astropy.io.registry.base.IORegistryError: Format could not be identified based on the file name or contents, please provide a 'format' argument.

havok2063 · 2026-02-04T19:44:39Z

Another thing is that, for these Roman products, if you don't have the correct datamodel package installed, you get warnings like these each time the file is opened.

/Users/bcherinka/anaconda3/envs/specutils/lib/python3.12/site-packages/asdf/yamlutil.py:363: AsdfConversionWarning: asdf://asdf-pydantic/examples/tags/g2dp-meta-1.0.0 is not recognized, converting to raw Python data structure
  warnings.warn(
/Users/bcherinka/anaconda3/envs/specutils/lib/python3.12/site-packages/asdf/yamlutil.py:363: AsdfConversionWarning: asdf://asdf-pydantic/examples/tags/g2dp-1d-spectra-1.0.0 is not recognized, converting to raw Python data structure
  warnings.warn(
/Users/bcherinka/anaconda3/envs/specutils/lib/python3.12/site-packages/asdf/_asdf.py:274: AsdfPackageVersionWarning: File 'file:///Users/bcherinka/Work/roman/v3.3/spectra/g2dp_ver3.3_prism/wfi_spec_combined_1d_r0000201001001001001_0002_WFI01.asdf'was created with extension URI 'asdf://asdf-pydantic/examples/extensions/G2DP-extension-1.0.0', which is not currently installed
  warnings.warn(msg, AsdfPackageVersionWarning)

You can pass kwargs to asdf.open to disable these, by setting ignore_unrecognized_tag=True and ignore_missing_extensions=True. Should we explicitly set these in the asdf.open calls, or leave them be and let the user pass them into Spectrum.read?
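
One possible middle ground between the two options is a loader that silences the warnings by default but lets the caller opt out and still forwards any extra kwargs. The sketch below shows that shape using only the standard `warnings` module. It is an assumption-laden illustration rather than a proposal for the actual loader code: `open_quietly`, `FakeConversionWarning`, and `noisy_opener` are hypothetical stand-ins, and a real loader would pass `asdf.open` and the asdf warning classes instead.

```python
import warnings


class FakeConversionWarning(UserWarning):
    """Hypothetical stand-in for a warning such as AsdfConversionWarning."""


def open_quietly(opener, path, quiet=True, categories=(Warning,), **kwargs):
    """Call opener(path, **kwargs), ignoring the given warnings when quiet."""
    with warnings.catch_warnings():
        if quiet:
            for category in categories:
                # The filter is local to this block and restored on exit.
                warnings.simplefilter("ignore", category)
        # Extra kwargs from the caller are forwarded to the opener untouched.
        return opener(path, **kwargs)


def noisy_opener(path, **kwargs):
    # Hypothetical opener that warns the way an unrecognized tag might.
    warnings.warn("tag is not recognized", FakeConversionWarning)
    return {"path": path, "kwargs": kwargs}


# Usage: record what escapes with the default (quiet) and with quiet=False.
with warnings.catch_warnings(record=True) as caught_quiet:
    warnings.simplefilter("always")
    result = open_quietly(noisy_opener, "example.asdf",
                          categories=(FakeConversionWarning,), mode="r")

with warnings.catch_warnings(record=True) as caught_loud:
    warnings.simplefilter("always")
    open_quietly(noisy_opener, "example.asdf", quiet=False)
```

Note that this only covers warnings raised during the call itself; anything emitted later, for example on deferred access to a cached file, would not be caught by this wrapper.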

havok2063 added 9 commits January 12, 2026 15:31

adding init roman ssc loaders

b9c6f2a

adding tests for roman

37bef2b

initial lazy loader for spectrum list for roman

f42dc8d

some cleanup and init object caching; tests

fda0758

tweaking tests

bdc44f7

renaming lazy kwarg

7fed771

cleanup

6e57522

adding example lazy loader for sdss-v spec files

950f3d2

cleanup

66812d6

rosteen added enhancement io performance labels Feb 2, 2026

havok2063 added 2 commits February 4, 2026 10:05

updating lazy repr with inline note

7e50553

adding lazy repr test

0b05358

fixing test

53044ac

havok2063 added 2 commits February 4, 2026 16:02

add checks for unique labels

611aa93

updating docs

b0e2d02

havok2063 wants to merge 14 commits into astropy:main from havok2063:romansscloaders
