
Adds data loaders for Roman SSC spectral files #1303

Draft
havok2063 wants to merge 14 commits into astropy:main from havok2063:romansscloaders

Conversation

@havok2063
Contributor

@havok2063 havok2063 commented Jan 14, 2026

This PR adds new data loaders for Roman SSC spectral data products. It adds loaders for the bundled per-detector 1D extracted spectral files: the per-exposure 1d_individual files and the 1d_combined files. These files contain multiple sources and are loadable as a SpectrumList. When loading with Spectrum, the source keyword argument specifies which source to grab; if no source id is given, the first source is extracted. It also adds a Spectrum loader for single-source Roman spectral files, which MAST will create dynamically on demand via astrocut.

Marking as draft until (close to) final spectral data products are available. Updates may be needed to account for data model or structural changes.

Loaders work OK for files with <1000 sources (0.3 seconds to load), but won't scale to the likely 10,000-50,000 sources per detector. A test with 100,000 sources took ~44 seconds to load into a SpectrumList. The bulk of the time is spent creating the Spectrum objects in the SpectrumList, but 12 seconds of it is reading the asdf file.

Edit: After the initial lazy loading implementation, the test file with 100,000 sources takes the initial 12 seconds to load the file once. Subsequent list access or SpectrumList.read calls use the cached file and spectrum objects.

The PR now updates SpectrumList to

  1. allow for lazy loading of Spectrum objects in the list
  2. apply a mapping of an alternate string label to index the list by

Lazy Loading

Currently, the lazy loader is opt-in per data loader. The data_loader decorator has a new lazy_loader kwarg that accepts a function to be called on each list item index. Each item is cached after load, so accessing the same item again does not re-call the loader. Once a lazy loader is specified, you tell SpectrumList to load the list lazily by passing lazy_load=True to SpectrumList.read. Optionally, pass cache_asdf to cache the open asdf file itself.

# create a lazy loadable spectrum list
s = SpectrumList.read(tmp, lazy_load=True, cache_asdf=True)

# the list of n spectra with placeholder objects
s
[<object object at 0x118878500>, <object object at 0x118878500>, <object object at 0x118878500>, <object object at 0x118878500>, <object object at 0x118878500>, <object object at 0x118878500>]

# load the first item
s[0]
<Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>

# repr is updated
s
[<Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>, <object object at 0x118878500>, <object object at 0x118878500>, <object object at 0x118878500>, <object object at 0x118878500>, <object object at 0x118878500>]

Without lazy_load=True, it falls back to eager loading of all spectra using the standard defined data loader.

s = SpectrumList.read(tmp)
s
[<Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>, <Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>, <Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>, <Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>, <Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>, <Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>]

s['402849']
<Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>

Alternate Labels

You can specify string labels in your data loader to use as alternate indices for your SpectrumList. This can be done independently of lazy loading (with SpectrumList.set_id_map) or in conjunction with it (by passing labels to SpectrumList.from_lazy). For the Roman loaders, I pass in the source ids as string labels.

# labels specified independently of lazy loading
s = SpectrumList.read(tmp, lazy_load=True, cache_asdf=True)
s
[<object object at 0x118878500>, <object object at 0x118878500>, <object object at 0x118878500>, <object object at 0x118878500>, <object object at 0x118878500>, <object object at 0x118878500>]

s['402849']
<Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>

# labels specified in the lazy loader callable
s = SpectrumList.read(tmp, lazy_load=True, cache_asdf=True)
s
['402849', '403613', '403686', '404935', '404979', '414981']

s[0]
<Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>

s['404935']
<Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>

s
[<Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>, '403613', '403686', <Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>, '404979', '414981']
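
The string-label lookup can be thought of as a dictionary from label to integer position, consulted before normal list indexing. The sketch below is only an illustration under that assumption; LabeledList and its _id_map attribute are hypothetical names and are not the SpectrumList.set_id_map implementation in this PR.

```python
class LabeledList(list):
    """Hypothetical list that accepts string labels as alternate indices."""

    def __init__(self, items, labels):
        super().__init__(items)
        if len(labels) != len(self):
            raise ValueError("need exactly one label per item")
        # Map each label to its integer position in the list.
        self._id_map = {label: i for i, label in enumerate(labels)}

    def __getitem__(self, key):
        if isinstance(key, str):
            # Unknown labels raise KeyError from the dictionary lookup.
            key = self._id_map[key]
        return super().__getitem__(key)


source_ids = ["402849", "403613", "403686"]
spectra = LabeledList(["spec-a", "spec-b", "spec-c"], source_ids)

by_label = spectra["403613"]   # lookup by source id
by_index = spectra[1]          # ordinary positional lookup
```

Both lookups return the same item, so integer indexing keeps working unchanged when labels are present.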

SDSS example of lazy loading + alt id

This uses the extension name as alt string labels.

sdss='spec-015004-59190-4400696424.fits'

s = SpectrumList.read(sdss, format='SDSS-V spec', lazy_load=True)
s
['COADD', 'MJD_EXP_59190-00', 'MJD_EXP_59190-01', 'MJD_EXP_59190-02']

s['COADD']
<Spectrum(flux=[-35847924.0 ... -1.905903697013855] 1e-17 erg / (Angstrom s cm2) (shape=(4648,), mean=-102516.42969 1e-17 erg / (Angstrom s cm2)); spectral_axis=<SpectralAxis [ 3566.9744  3567.795   3568.6177 ... 10394.412  10396.809  10399.206 ] Angstrom> (length=4648); uncertainty=InverseVariance)>

s
[<Spectrum(flux=[-35847924.0 ... -1.905903697013855] 1e-17 erg / (Angstrom s cm2) (shape=(4648,), mean=-102516.42969 1e-17 erg / (Angstrom s cm2)); spectral_axis=<SpectralAxis [ 3566.9744  3567.795   3568.6177 ... 10394.412  10396.809  10399.206 ] Angstrom> (length=4648); uncertainty=InverseVariance)>, 
'MJD_EXP_59190-00', 'MJD_EXP_59190-01', 'MJD_EXP_59190-02']

@havok2063
Contributor Author

@rosteen I think this is ready for an initial look with respect to the lazy loading and alternate list indexing. Let me know what you think. This does not implement any file caching for FITS files, but we could probably do that if we wanted. Once we're happy with the implementation I can start updating the Sphinx docs.

@havok2063
Contributor Author

@rosteen I'd like to start writing up some docs for the lazy loading of SpectrumList. Should I start a draft of that now or wait for feedback on the implementation?

@rosteen
Contributor

rosteen commented Feb 2, 2026

> @rosteen I'd like to start writing up some docs for the lazy loading of SpectrumList. Should I start a draft of that now or wait for feedback on the implementation?

I'll take a look at this this afternoon, if you want to hold off until tomorrow.

@rosteen
Contributor

rosteen commented Feb 2, 2026

This is looking good I think. The only thing that makes me hesitate is the repr behavior when lazy loaded. I'm just worried that users will print out their loaded spectrum list and see something like ['402849', '403613', '403686', '404935', '404979', '414981'] and go "that's not right, where are my spectra!?" having not read the documentation. Maybe in that case the repr could include a short inline statement that it's been lazy loaded and accessing the indices will return the actual Spectrum object? I'm not sure what the best strategy is, maybe it's fine as-is. I don't see any issues with the actual functionality/implementation.

@havok2063
Contributor Author

I can remove the labels for lazy list if you think that's too confusing. The items could still be accessed with the string source id. It just wouldn't be clear that you could do that. Would you prefer the default placeholder of [<object object at 0x10be7c500>, <object object at 0x10be7c500>,...]?

I considered [placeholder, placeholder, placeholder,...] but didn't like how that looked.

Alternatively, what about this as an inline reference?

s = SpectrumList.read(tmp, lazy_load=True, cache_asdf=True)
s
lazy list: 0 items loaded; access an index to load a spectrum:
['402849', '403613', '403686', '404935', '404979', '414981']

s[0]

s
lazy list: 1 items loaded; access an index to load a spectrum:
[<Spectrum(flux=[nan ... nan] W / (nm m2) (shape=(275,), mean=0.00000 W / (nm m2)); spectral_axis=<SpectralAxis [ 750.          752.47542943  754.95902919 ... 1837.848077   1843.91402794
 1850.        ] nm> (length=275); uncertainty=StdDevUncertainty)>, '403613', '403686', '404935', '404979', '414981']

Once all items are loaded, it uses the normal list repr and the inline statement no longer appears.
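
For discussion, here is a rough sketch of how a repr like the one proposed above could be produced. It is an assumption-laden illustration rather than the PR's code: LazyReprList, _labels, and n_loaded are hypothetical names, and items are assigned directly instead of going through a real loader.

```python
# Assumption: one shared sentinel marks slots that are not loaded yet.
_PLACEHOLDER = object()


class LazyReprList(list):
    """Hypothetical list whose repr shows labels for unloaded slots."""

    def __init__(self, labels):
        super().__init__([_PLACEHOLDER] * len(labels))
        self._labels = list(labels)

    def n_loaded(self) -> int:
        # Count the slots that no longer hold the placeholder.
        return sum(item is not _PLACEHOLDER for item in self)

    def __repr__(self) -> str:
        loaded = self.n_loaded()
        if loaded == len(self):
            # Everything is loaded: fall back to the normal list repr.
            return super().__repr__()
        shown = [
            repr(label) if item is _PLACEHOLDER else repr(item)
            for label, item in zip(self._labels, self)
        ]
        header = (
            f"lazy list: {loaded} items loaded; "
            "access an index to load a spectrum:"
        )
        return header + "\n[" + ", ".join(shown) + "]"


lst = LazyReprList(["402849", "403613"])
before = repr(lst)      # header plus labels only
lst[0] = "spec-a"       # stand-in for loading the first spectrum
partial = repr(lst)     # header plus a mix of items and labels
lst[1] = "spec-b"
full = repr(lst)        # plain list repr, no header
```

Keeping the header out of the fully loaded case means a completely loaded lazy list prints the same as an eagerly loaded one.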

How do you feel about the logic of adding a lazy loader? I do wonder if it's still too manual a process, but I couldn't think of a solution that would work automatically for existing SpectrumList loaders without modification.

@rosteen
Contributor

rosteen commented Feb 3, 2026

> Alternatively, what about this as an inline reference?

I quite like this, I think it's a good solution.

I don't think this is too manual for adding a lazy loader, personally, and probably not all missions need one anyway.

@havok2063
Contributor Author

Sounds good. I'll start working on the docs today. Locally I get 4 failing tests, all from test_loaders.py. Should I worry about fixing those, or any of the CI issues?

FAILED tests/test_loaders.py::test_tabular_fits_compressed[bzip2] - astropy.io.registry.base.IORegistryError: Format could not be identified based on the file name or contents, please provide a 'format' argument.
FAILED tests/test_loaders.py::test_tabular_fits_compressed[xz] - astropy.io.registry.base.IORegistryError: Format could not be identified based on the file name or contents, please provide a 'format' argument.
FAILED tests/test_loaders.py::test_wcs1d_fits_compressed[bzip2] - astropy.io.registry.base.IORegistryError: Format could not be identified based on the file name or contents, please provide a 'format' argument.
FAILED tests/test_loaders.py::test_wcs1d_fits_compressed[xz] - astropy.io.registry.base.IORegistryError: Format could not be identified based on the file name or contents, please provide a 'format' argument.

@havok2063
Contributor Author

Another thing is that, for these Roman products, if you don't have the correct datamodel package installed, you get warnings like these each time the file is opened.

/Users/bcherinka/anaconda3/envs/specutils/lib/python3.12/site-packages/asdf/yamlutil.py:363: AsdfConversionWarning: asdf://asdf-pydantic/examples/tags/g2dp-meta-1.0.0 is not recognized, converting to raw Python data structure
  warnings.warn(
/Users/bcherinka/anaconda3/envs/specutils/lib/python3.12/site-packages/asdf/yamlutil.py:363: AsdfConversionWarning: asdf://asdf-pydantic/examples/tags/g2dp-1d-spectra-1.0.0 is not recognized, converting to raw Python data structure
  warnings.warn(
/Users/bcherinka/anaconda3/envs/specutils/lib/python3.12/site-packages/asdf/_asdf.py:274: AsdfPackageVersionWarning: File 'file:///Users/bcherinka/Work/roman/v3.3/spectra/g2dp_ver3.3_prism/wfi_spec_combined_1d_r0000201001001001001_0002_WFI01.asdf'was created with extension URI 'asdf://asdf-pydantic/examples/extensions/G2DP-extension-1.0.0', which is not currently installed
  warnings.warn(msg, AsdfPackageVersionWarning)

You can pass kwargs to asdf.open to disable these: ignore_unrecognized_tag=True and ignore_missing_extensions=True. Should we explicitly set these in the asdf.open calls, or leave them be and let the user pass them into Spectrum.read?
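
One middle ground, sketched below purely for discussion, is to set the two flags as defaults inside the loader while still letting user-supplied kwargs override them. To stay self-contained the sketch uses fake_asdf_open, a hypothetical stand-in that just echoes what it receives; open_roman_file and _QUIET_DEFAULTS are also made-up names, not part of this PR.

```python
def fake_asdf_open(path, **kwargs):
    """Hypothetical stand-in for asdf.open; returns the arguments it was given."""
    return {"path": path, **kwargs}


# Assumed defaults that would silence the warnings shown above.
_QUIET_DEFAULTS = {
    "ignore_unrecognized_tag": True,
    "ignore_missing_extensions": True,
}


def open_roman_file(path, opener=fake_asdf_open, **user_kwargs):
    """Open a file with quiet defaults that the caller can still override."""
    # Later keys win, so anything the user passes replaces the default.
    open_kwargs = {**_QUIET_DEFAULTS, **user_kwargs}
    return opener(path, **open_kwargs)


quiet = open_roman_file("example.asdf")
verbose = open_roman_file("example.asdf", ignore_unrecognized_tag=False)
```

With this pattern the default read is quiet, and a user who wants the warnings back only has to pass the flag explicitly.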

