Adds data loaders for Roman SSC spectral files #1303
|
@rosteen I think this is ready for an initial look with respect to the lazy loading and alternate list indexing. Let me know what you think. This does not implement any file caching for FITS files, but we could probably do that if we wanted. Once we're happy with the implementation I can start updating the Sphinx docs. |
|
@rosteen I'd like to start writing up some docs for the lazy loading of SpectrumList. Should I start a draft of that now or wait for feedback on the implementation? |
I'll take a look at this this afternoon, if you want to hold off until tomorrow. |
|
This is looking good I think. The only thing that makes me hesitate is the |
|
I can remove the labels for the lazy list if you think that's too confusing. The items could still be accessed with the string source id. It just wouldn't be clear that you could do that. Would you prefer the default placeholder of I considered Alternatively, what about this as an inline reference? Once all items are loaded, it uses the normal list repr and the inline statement no longer appears. How do you feel about the logic of adding a lazy loader? I do wonder if it's still too manual a process, but I couldn't think of a solution that would work automatically for existing SpectrumList loaders without modification. |
I quite like this, I think it's a good solution. I don't think this is too manual for adding a lazy loader, personally, and probably not all missions need one anyway. |
|
Sounds good. I'll start working on the docs today. Locally I get 4 tests that fail, all from |
|
Another thing is that, for these Roman products, if you don't have the correct datamodel package installed, you get warnings like these each time the file is opened. You can pass kwargs to |
This PR adds new data loaders for Roman SSC spectral data products. It adds loaders to handle the bundled per-detector 1d extracted spectral files for the per-exposure `1d_individual` and the `1d_combined` files. These files contain multiple sources, loadable as `SpectrumList`. If loading with `Spectrum`, the `source` keyword argument can be used to specify which source to grab; when no source id is specified, the first source is extracted by default. It also adds a `Spectrum` loader for a single-source Roman spectral file. These files will be created dynamically on demand by MAST via astrocut.

Marking as draft until (close to) final spectral data products are available. Updates may be needed to account for data model or structural changes.
Loaders work ok for files with <1000 sources (0.3 seconds to load), but won't scale to the likely numbers of 10,000-50,000 sources per detector. A test of 100,000 sources took ~44 seconds to load into a `SpectrumList`. The bulk of the time is spent creating the `Spectrum` objects in the `SpectrumList`, but 12 seconds is spent reading the asdf file.

Edit: After the initial lazy loading implementation, a test file with 100,000 sources takes the initial 12 seconds to load the file once. Subsequent list access or `SpectrumList.read` uses the cached file and spectrum objects.
The PR now updates `SpectrumList` to

Lazy Loading
Currently, the lazy loader is opt-in per data loader. The `data_loader` decorator has a new `lazy_loader` kwarg that accepts a function to be called on each list item index. Each item is cached after load, so once accessed, the same item does not re-call the loader. Once a lazy loader is specified, you tell `SpectrumList` to load the list lazily by passing `lazy_load=True` to `SpectrumList.read`. Optionally specify `cache_asdf` to instruct it to cache the open asdf file itself.

Without `lazy_load=True`, it falls back to eager loading of all spectra using the standard defined data loader.

Alternate Labels
You can specify string labels in your data loader to use as alternate indices for your `SpectrumList`. This can be independent of (with `SpectrumList.set_id_map`) or used in conjunction with (pass `labels` to `SpectrumList.from_lazy`) lazy loading. For the Roman loaders, I pass in the source ids as string labels.

SDSS example of lazy loading + alt id

This uses the extension name as alt string labels.
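
As a rough illustration of the two ideas described in this PR (per-index loading with caching, and string labels as alternate indices), here is a minimal pure-Python sketch. It does not use specutils: `LazyLabeledList`, `fake_loader`, and the label values are hypothetical names invented for this example, and the actual `SpectrumList` implementation may differ.

```python
from typing import Any, Callable, Dict, Optional, Sequence, Union


class LazyLabeledList:
    """Toy stand-in for a lazily loaded list with optional string labels.

    Hypothetical class for illustration only; not the specutils API.
    """

    def __init__(
        self,
        length: int,
        loader: Callable[[int], Any],
        labels: Optional[Sequence[str]] = None,
    ) -> None:
        if labels is not None and len(labels) != length:
            raise ValueError("labels must have one entry per item")
        self._length = length
        self._loader = loader              # called once per accessed index
        self._cache: Dict[int, Any] = {}   # index -> loaded item
        # Map each string label to its integer position in the list.
        self._id_map: Dict[str, int] = (
            {label: i for i, label in enumerate(labels)} if labels else {}
        )

    def __len__(self) -> int:
        return self._length

    def __getitem__(self, key: Union[int, str]) -> Any:
        if isinstance(key, str):
            index = self._id_map[key]      # raises KeyError for unknown labels
        else:
            index = key + self._length if key < 0 else key
            if not 0 <= index < self._length:
                raise IndexError("list index out of range")
        # Load on first access only; later accesses reuse the cached item.
        if index not in self._cache:
            self._cache[index] = self._loader(index)
        return self._cache[index]


loader_calls = []


def fake_loader(index: int) -> dict:
    """Pretend to read one spectrum from disk and record the call."""
    loader_calls.append(index)
    return {"index": index, "flux": [float(index)] * 3}


spectra = LazyLabeledList(3, fake_loader, labels=["1001", "1002", "1003"])
by_label = spectra["1002"]   # triggers the loader for index 1
by_index = spectra[1]        # served from the cache, no second load
```

In this sketch the loader runs only for items that are actually accessed, and the label lookup resolves to the same cached object as the integer index.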