cas package
Subpackages
- cas.accession package
- cas.ingest package
- cas.linkml_ops package
- cas.matrix_file package
- cas.utils package
- Submodules
- cas.utils.conversion_utils module
calculate_labelset_rank()
calculate_labelset()
add_labelsets_to_cas()
get_cell_ids()
get_cl_annotations_from_anndata()
collect_parent_cell_ids()
generate_parent_cell_lookup()
update_parent_info()
add_parent_cell_hierarchy()
add_parent_hierarchy_to_annotations()
get_authors_from_doi()
reformat_json()
convert_complex_type()
copy_and_update_file_path()
fetch_anndata()
retrieve_schema()
create_accession_mapping()
- cas.utils.validation_utils module
- Module contents
Submodules
cas.abc_cas_converter module
- cas.abc_cas_converter.validate_dataframe_columns(df, required_columns)[source]
Validates a DataFrame for the required columns.
- Parameters:
df (pandas.DataFrame) – DataFrame to validate.
required_columns (list) – List of required column names.
- cas.abc_cas_converter.generate_catset_dataframe(cas)[source]
Generate a DataFrame representing the Cluster Annotation Term Set (cat_set) from the given Cell Annotation Schema (CAS) dictionary.
- Parameters:
cas (Dict[str, Any]) – The Cell Annotation Schema (CAS) dictionary.
- Returns:
DataFrame representing the Cluster Annotation Term Set (cat_set).
- Return type:
pd.DataFrame
- cas.abc_cas_converter.generate_cat_dataframe(cas)[source]
Generate a DataFrame representing the Cluster Annotation Term (cat) from the given Cell Annotation Schema (CAS) dictionary.
- Parameters:
cas (Dict[str, Any]) – The Cell Annotation Schema (CAS) dictionary.
- Returns:
DataFrame representing the Cluster Annotation Term (cat).
- Return type:
pd.DataFrame
- cas.abc_cas_converter.calculate_order_mapping(order_values)[source]
Calculate a mapping dictionary based on the order values.
- Parameters:
order_values (pandas.Series) – Series containing the order values.
- Returns:
Mapping dictionary where keys are order values and values are rank values.
- Return type:
Dict[str, str]
- cas.abc_cas_converter.abc2cas(cat_set_file_path, cat_file_path, output_file_path)[source]
Converts the given ABC files to a Cell Annotation Schema (CAS) JSON and writes it to the file named by output_file_path.
- Parameters:
cat_set_file_path (str) – Path to the Cluster Annotation Term Set file.
cat_file_path (str) – Path to the Cluster Annotation Term file.
output_file_path (str) – Output CAS file name (default: output.json).
- cas.abc_cas_converter.add_annotations(cas, cat)[source]
Adds annotations to the Cell Annotation Schema (CAS) based on the data from the Cluster Annotation Term DataFrame.
- Parameters:
cas (Dict[str, Any]) – Dictionary representing the Cell Annotation Schema.
cat (pd.DataFrame) – DataFrame containing Cluster Annotation Term data.
- cas.abc_cas_converter.add_labelsets(cas, cat_set)[source]
Adds labelsets to the Cell Annotation Schema (CAS) based on the data from the Cluster Annotation Term Set DataFrame.
- Parameters:
cas (Dict[str, Any]) – Cell Annotation Schema dictionary.
cat_set (pandas.DataFrame) – DataFrame containing Cluster Annotation Term Set data.
- cas.abc_cas_converter.init_metadata()[source]
Initializes metadata for Cell Annotation Schema (CAS).
- Returns:
Metadata dictionary containing default values for various fields.
- Return type:
Dict[str, Any]
- cas.abc_cas_converter.cas2abc(cas_file_path, cat_set_file_path, cat_file_path)[source]
Converts given Cell Annotation Schema (CAS) to ABC files: cluster_annotation_term and cluster_annotation_term_set, and writes them to files with cat_file_path and cat_set_file_path.
- Parameters:
cas_file_path (str) – Path to the Cell Annotation Schema (CAS) file.
cat_set_file_path (str) – Path to the Cluster Annotation Term Set file.
cat_file_path (str) – Path to the Cluster Annotation Term file.
cas.anndata_conversion module
- cas.anndata_conversion.merge(cas_file_path, anndata_path, validate, output_file_name)[source]
Tests if CAS json and AnnData are compatible and merges CAS into AnnData if possible.
- This function performs the following checks:
Verifies that all cell barcodes (cell IDs) in CAS exist in AnnData and vice versa.
Identifies matching labelset names between CAS and AnnData.
Validates that cell sets associated with each annotation match between CAS and AnnData.
Checks if the cell labels are identical; if not, provides options to update or terminate.
- Parameters:
cas_file_path (str) – The path to the CAS JSON file.
anndata_path (Optional[str]) – The path to the AnnData file.
validate (bool) – Boolean to determine if validation checks will be performed before writing to the output AnnData file.
output_file_name (str) – Output AnnData file name.
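The first check listed above (cell IDs must agree in both directions) can be sketched with plain sets. This is an illustration only, not the library's implementation: compare_cell_ids and the sample data are hypothetical, and the sketch assumes each CAS annotation carries a cell_ids list.

```python
from typing import Any, Dict, List, Set, Tuple


def compare_cell_ids(cas: Dict[str, Any], obs_index: List[str]) -> Tuple[Set[str], Set[str]]:
    """Return (IDs found only in CAS, IDs found only in the AnnData obs index)."""
    cas_ids: Set[str] = set()
    for annotation in cas.get("annotations", []):
        # Assumption: annotations list their member cells under "cell_ids".
        cas_ids.update(annotation.get("cell_ids", []))
    obs_ids = set(obs_index)
    return cas_ids - obs_ids, obs_ids - cas_ids


# Hypothetical sample data for illustration.
cas_example = {
    "annotations": [
        {"labelset": "Cluster", "cell_label": "L2/3 IT", "cell_ids": ["c1", "c2"]},
        {"labelset": "Cluster", "cell_label": "Pvalb", "cell_ids": ["c3"]},
    ]
}
only_in_cas, only_in_obs = compare_cell_ids(cas_example, ["c1", "c2", "c4"])
print(only_in_cas, only_in_obs)
```

Two empty sets would mean the cell barcodes match in both directions; any leftover ID points at the side that is missing it.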
- cas.anndata_conversion.merge_cas_object(input_json, anndata_file_path, validate, output_file_path, download_dir=None)[source]
Tests if CAS json and AnnData are compatible and merges CAS into AnnData if possible.
- This function performs the following checks:
Verifies that all cell barcodes (cell IDs) in CAS exist in AnnData and vice versa.
Identifies matching labelset names between CAS and AnnData.
Validates that cell sets associated with each annotation match between CAS and AnnData.
Checks if the cell labels are identical; if not, provides options to update or terminate.
- Parameters:
input_json (dict) – The CAS JSON object.
anndata_file_path (Optional[str]) – The path to the AnnData file.
validate (bool) – Boolean to determine if validation checks will be performed before writing to the output AnnData file.
output_file_path (str) – Output AnnData file name.
download_dir (Optional[str]) – The directory to download AnnData files.
- cas.anndata_conversion.test_compatibility(anndata_obs, input_json, validate)[source]
Tests if CAS and AnnData can be merged.
- Parameters:
anndata_obs – The AnnData obs object.
input_json – The CAS data JSON object.
validate – Boolean to determine if validation checks will be performed before writing to the output AnnData file.
cas.anndata_splitter module
- cas.anndata_splitter.split_anndata_to_file(anndata_file_path, cas_json_paths, multiple_outputs, compression_method='gzip')[source]
Splits an AnnData file into multiple files based on provided CAS JSON files and writes them to disk.
- Parameters:
anndata_file_path (Optional[str]) – Path to the AnnData file.
cas_json_paths (List[str]) – List of CAS JSON file paths.
multiple_outputs (bool) – If True, outputs multiple files, one for each CAS JSON file; otherwise, outputs a single file.
compression_method (Optional[Literal['gzip', 'lzf']]) – Compression method used by the AnnData write function. Default is “gzip”.
- cas.anndata_splitter.split_anndata(adata, cas, multiple_outputs)[source]
Splits an AnnData object into multiple or single AnnData objects based on the provided CAS data.
- Parameters:
adata (AnnData) – AnnData object.
cas (Dict[str, Dict[str, Any]]) – Dictionary representing the CAS data with its file name as keys.
multiple_outputs (bool) – Determines if the output should be multiple AnnData objects or a single one.
- Returns:
A list of AnnData objects if multiple_outputs is True, otherwise a single AnnData object.
- Return type:
List[AnnData]
- Raises:
ValueError – If any required terms do not exist in the CAS data under ‘parent_cell_set_name’.
cas.anndata_to_cas module
- cas.anndata_to_cas.anndata2cas(anndata_file_path, labelsets, output_file_path, include_hierarchy, accession_columns=None)[source]
Convert an AnnData file to Cell Annotation Schema (CAS) JSON.
- Parameters:
anndata_file_path (str) – Path to the AnnData file.
labelsets (List[str]) – List of labelsets, which are names of observation (obs) fields used to record author cell type names. The labelsets should be provided in order, starting from rank 0 to higher ranks.
output_file_path (str) – Output CAS file name.
include_hierarchy (bool) – Flag indicating whether to include hierarchy in the output.
accession_columns (List[str], optional) – List of columns in the AnnData obs that contain accession information. If provided, these columns will be used to populate the ‘cell_set_accession’ field in the CAS annotations. Otherwise, accession IDs will be automatically generated using a hash of the cells in each cell set. Defaults to None.
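The fallback described for accession_columns (an ID derived from a hash of the cells in each cell set) can be sketched as follows. The hash function, the digest length, and the name hash_cell_set are assumptions made for illustration; the library may derive its IDs differently.

```python
import hashlib
from typing import Iterable


def hash_cell_set(cell_ids: Iterable[str], length: int = 12) -> str:
    """Derive a short, deterministic ID from the members of a cell set."""
    # Sort and de-duplicate so the same set of cells always yields the
    # same ID, regardless of the order in which the cells are listed.
    joined = "\n".join(sorted(set(cell_ids)))
    # Assumption: SHA-256 truncated to `length` hex characters.
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:length]


first = hash_cell_set(["cell_2", "cell_1", "cell_3"])
second = hash_cell_set(["cell_3", "cell_1", "cell_2"])
print(first == second)
```

Hashing the sorted members keeps the ID stable across runs, and the ID changes whenever the membership of the cell set changes.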
- cas.anndata_to_cas.generate_cas_metadata(uns)[source]
Generates CAS metadata based on the provided ‘uns’ dictionary.
- Parameters:
uns (Dict[str, Any]) – The ‘uns’ dictionary containing metadata.
- Returns:
The generated CAS metadata dictionary.
- Return type:
Dict[str, Any]
- cas.anndata_to_cas.add_annotations_to_cas(cas, labelset_dict, parent_cell_look_up)[source]
Generates CAS annotations based on the provided AnnData object and updates the CAS dictionary with new annotations. This function can optionally use a precomputed parent cell lookup dictionary to enrich the annotations with hierarchical information.
- Parameters:
cas (Dict[str, Any]) – The CAS dictionary to be updated with annotations. Expected to have a key ‘annotations’ where new annotations will be appended.
labelset_dict (Dict[str, Any]) – A dictionary defining labelsets and their members. This is used to match cell labels with their respective metadata and annotations.
parent_cell_look_up (Dict[str, Any]) – A precomputed dictionary containing hierarchical metadata about cell labels.
- Returns:
The function directly updates the cas dictionary with new annotations. The parent_cell_look_up is used for enrichment and must be generated beforehand if hierarchical information is to be included.
- Return type:
None
cas.cas_splitter module
- cas.cas_splitter.split_cas_to_file(cas_json_path, split_terms, multiple_outputs)[source]
Splits a CAS JSON file into files based on provided terms, and writes them to disk.
- Parameters:
cas_json_path (str) – Path to the CAS JSON file.
split_terms (Union[List[str], str]) – Terms used to determine how to split the CAS file; can be a string or a list of strings.
multiple_outputs (bool) – If True, outputs multiple files, one for each split term; otherwise, outputs a single file.
- cas.cas_splitter.split_cas(cas, split_terms, multiple_outputs)[source]
Splits a CAS dictionary into multiple or single dictionary based on split terms.
- Parameters:
cas (Dict[str, Any]) – Dictionary representing the CAS data.
split_terms (Union[List[str], str]) – Terms used to filter and split the CAS data; can be a string or a list of strings.
multiple_outputs (bool) – Determines if the output should be multiple dictionaries or a single dictionary.
- Returns:
A list of dictionaries if multiple_outputs is True, otherwise a single dictionary.
- Return type:
Union[List[Dict[str, Any]], Dict[str, Any]]
- Raises:
ValueError – If any split_terms do not exist in the CAS data under ‘parent_cell_set_name’.
- cas.cas_splitter.filter_and_copy_cas_entries(cas, label_to_copy_list)[source]
Copies entries from the CAS based on a list of labels to copy.
- Parameters:
cas (Dict[str, Any]) – Dictionary representing the original CAS data.
label_to_copy_list (List[str]) – List of labels indicating which entries to copy.
- Returns:
A dictionary with filtered CAS entries.
- Return type:
Dict[str, Any]
- cas.cas_splitter.get_split_terms(parent_dict, split_terms)[source]
Resolves split terms into a comprehensive list of terms based on a parent-child relationship dictionary.
- Parameters:
parent_dict (Dict[str, List[str]]) – Dictionary mapping parent terms to lists of child terms.
split_terms (Union[List[str], str]) – Initial terms to resolve; can be a string or a list of strings.
- Returns:
A list of all terms, resolved from the parent_dict.
- Return type:
List[str]
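Resolving split terms against a parent-to-children dictionary amounts to a graph traversal. The sketch below shows one way this could work; resolve_terms is a hypothetical name, and whether the real function includes the starting terms themselves or visits children in this order is an assumption.

```python
from typing import Dict, List, Union


def resolve_terms(parent_dict: Dict[str, List[str]], split_terms: Union[List[str], str]) -> List[str]:
    """Collect each split term together with all of its descendants."""
    pending = [split_terms] if isinstance(split_terms, str) else list(split_terms)
    resolved: List[str] = []
    while pending:
        term = pending.pop(0)
        if term in resolved:
            continue  # already visited; also guards against cycles
        resolved.append(term)
        # Queue the children of this term, if it has any.
        pending.extend(parent_dict.get(term, []))
    return resolved


# Hypothetical hierarchy for illustration.
hierarchy = {"Neuron": ["GABAergic", "Glutamatergic"], "GABAergic": ["Pvalb", "Sst"]}
print(resolve_terms(hierarchy, "Neuron"))
```

Accepting either a single string or a list mirrors the Union[List[str], str] type of split_terms.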
cas.cas_to_rdf module
- cas.cas_to_rdf.export_to_rdf(cas_schema, data, ontology_namespace, ontology_iri, output_path=None, validate=True, include_cells=True)[source]
Generates and returns an RDF graph from the provided data and CAS schema, with an option to write the RDF graph to a file.
- Parameters:
cas_schema (Optional[Union[str, dict]]) – Name of the CAS release (such as base, cap, bican), path to the CAS schema file, or CAS schema JSON object. If not provided, reads the base CAS schema from the CAS module.
data (Union[str, dict]) – The data JSON file path or JSON object dictionary.
ontology_namespace (str) – Ontology namespace (e.g., MTG).
ontology_iri (str) – Ontology IRI (e.g., https://purl.brain-bican.org/ontology/AIT_MTG/).
labelsets (Optional[List[str]]) – Labelsets used in the taxonomy, such as [“Cluster”, “Subclass”, “Class”].
output_path (Optional[str]) – Path to the output RDF file, if specified.
validate (bool) – Determines if data-schema validation checks will be performed. True by default.
include_cells (bool) – Determines if cell data will be included in the RDF output. True by default.
- Return type:
Graph
- Returns:
An RDFlib graph object.
cas.cxg_utils module
cxg_utils.py
This module provides utility functions for working with AnnData datasets in the context of the CellxGene Census library.
- cas.cxg_utils.download_dataset_with_id(dataset_id, file_path=None)[source]
Download an AnnData dataset with the specified ID.
- Parameters:
dataset_id (str) – The ID of the dataset to download.
file_path (Optional[str], optional) – The file path to save the downloaded AnnData. If not provided, the dataset will be saved in the current working directory with the dataset_id as the file name. Supports both absolute and relative paths.
- Returns:
The path to the downloaded AnnData dataset
- Return type:
str
cas.file_utils module
- cas.file_utils.read_json_file(file_path)[source]
Reads and parses a JSON file into a Python dictionary.
- Parameters:
file_path (str) – The path to the JSON file.
- Returns:
The JSON data as a Python dictionary.
- Return type:
dict
Returns None if the file does not exist or if there is an issue parsing the JSON content.
Example
json_data = read_json_file('path/to/your/file.json')
if json_data is not None:
    # Use the parsed JSON data as a dictionary
    print(json_data)
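The "return None on failure" contract described for read_json_file can be sketched with the standard library alone. read_json_or_none is a hypothetical stand-in, not the library's code, and the set of exceptions it catches is an assumption.

```python
import json
import os
import tempfile
from typing import Optional


def read_json_or_none(file_path: str) -> Optional[dict]:
    """Parse a JSON file, returning None if it is missing or malformed."""
    try:
        with open(file_path, "r", encoding="utf-8") as handle:
            return json.load(handle)
    except (FileNotFoundError, json.JSONDecodeError):
        return None


with tempfile.TemporaryDirectory() as tmp_dir:
    good_path = os.path.join(tmp_dir, "example.json")
    with open(good_path, "w", encoding="utf-8") as handle:
        json.dump({"title": "example"}, handle)
    loaded = read_json_or_none(good_path)
    missing = read_json_or_none(os.path.join(tmp_dir, "absent.json"))

print(loaded, missing)
```

Because failure is signalled by None rather than an exception, callers need the is not None check shown in the example above before using the result.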
- cas.file_utils.read_cas_json_file(file_path)[source]
Reads and parses a JSON file into a CAS object.
- Parameters:
file_path (str) – The path to the JSON file.
- Returns:
The JSON data as a CAS object.
- Return type:
dict
- cas.file_utils.read_cas_from_anndata(anndata_path)[source]
Reads the CAS JSON from the AnnData uns and parses it into a CAS object.
- Parameters:
anndata_path (str) – The path to the AnnData file.
- Returns:
CellTypeAnnotation object.
- cas.file_utils.write_json_file(cas, out_file, print_undefined=False)[source]
Writes a cell type annotation object to a JSON file.
- Parameters:
cas (CellTypeAnnotation) – Cell type annotation object to serialize.
out_file (str) – Output file path.
print_undefined (bool) – Prints null values to the output JSON if true. Omits undefined values from the JSON output if false.
- cas.file_utils.write_dict_to_json_file(output_file_path, dictionary)[source]
- Parameters:
output_file_path (str)
dictionary (dict)
- cas.file_utils.read_anndata_file(file_path)[source]
Load anndata object from a file.
- Parameters:
file_path (str) – The path to the file containing the anndata object.
- Returns:
The loaded anndata object if successful, else None.
- Return type:
Optional[AnnData]
- cas.file_utils.read_table_to_dict(table_path, id_column=0, generated_ids=False)[source]
Reads table file content into a dict. Key is the first column value and the value is a dict representation of the row values (each header is a key and column value is the value).
- Parameters:
table_path – Path of the table file.
id_column – Id column becomes the key of the dict. This column should be unique. Default value is first column.
generated_ids – If ‘True’, uses row number as the key of the dict. Initial key is 0.
- Returns:
Two values: first, the headers of the table; second, the table content dict. Keys of the content dict are the first column values and the values are dicts of row values.
- cas.file_utils.read_tsv_to_dict(tsv_path, id_column=0, generated_ids=False)[source]
Reads TSV file content into a dict. Key is the first column value and the value is a dict representation of the row values (each header is a key and column value is the value).
- Parameters:
tsv_path – Path of the TSV file.
id_column – Id column becomes the key of the dict. This column should be unique. Default value is first column.
generated_ids – If ‘True’, uses row number as the key of the dict. Initial key is 0.
- Returns:
Two values: first, the headers of the table; second, the TSV content dict. Keys of the content dict are the first column values and the values are dicts of row values.
- cas.file_utils.read_csv_to_dict(csv_path, id_column=0, id_column_name='', delimiter=',', id_to_lower=False, generated_ids=False)[source]
Reads CSV file content into a dict. Key is the first column value and the value is a dict representation of the row values (each header is a key and column value is the value).
- Parameters:
csv_path – Path of the CSV file.
id_column – Id column becomes the keys of the dict. This column should be unique. Default is the first column.
id_column_name – Alternative to the numeric id_column, id_column_name specifies id_column by its header string.
delimiter – Value delimiter. Default is comma.
id_to_lower – Applies string lowercase operation to the key.
generated_ids – If ‘True’, uses row number as the key of the dict. Initial key is 1.
- Returns:
Two values: first, the headers of the table; second, the CSV content dict. Keys of the content dict are the first column values and the values are dicts of row values.
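The "headers plus keyed rows" return shape shared by the table readers can be sketched with the standard csv module. table_to_dict is hypothetical and, unlike the documented functions, takes the table text directly rather than a file path so that the example stays self-contained.

```python
import csv
import io
from typing import Dict, List, Tuple


def table_to_dict(text: str, id_column: int = 0, delimiter: str = ",") -> Tuple[List[str], Dict[str, Dict[str, str]]]:
    """Return (headers, records), with records keyed by the id column value."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    headers = next(reader)
    records: Dict[str, Dict[str, str]] = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Each row becomes a dict of header -> cell value.
        records[row[id_column]] = dict(zip(headers, row))
    return headers, records


# Hypothetical sample table for illustration.
sample = "accession,label\nCS1,Pvalb\nCS2,Sst\n"
headers, records = table_to_dict(sample)
print(headers)
print(records["CS2"])
```

Keying by the id column is why that column should be unique: a repeated value would silently overwrite the earlier row.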
- cas.file_utils.read_json_config(file_path)[source]
Reads the configuration object from the given path.
- Parameters:
file_path (str) – Path to the JSON file.
- Returns:
Configuration object (list of data column config items).
- Return type:
dict
- cas.file_utils.read_yaml_config(file_path)[source]
Reads the configuration object from the given path.
- Parameters:
file_path (str) – Path to the YAML file.
- Returns:
Configuration object (list of data column config items).
- Return type:
dict
- cas.file_utils.read_config(file_path)[source]
Reads the configuration object from the given path.
- Parameters:
file_path (str) – Path to the configuration file.
- Returns:
Configuration object (list of data column config items).
- Return type:
dict
- cas.file_utils.update_obs(obs, data)[source]
Updates the obs with data dict.
- Parameters:
obs (CapAnnDataDF) – Dataset representing the obs field in the AnnData file.
data (dict) – Dictionary containing flattened data.
- cas.file_utils.update_uns(uns, data)[source]
Updates the uns with data dict.
- Parameters:
uns (CapAnnDataDF) – The HDF5 group to write data to.
data (dict) – Dictionary containing the data to be written.
- cas.file_utils.get_cas_schema_names()[source]
Returns the list of available CAS schema names.
- Returns:
The available CAS schema names.
- Return type:
dict
- cas.file_utils.get_cas_schema(schema_name='base')[source]
Reads the schema file from the CAS module and returns it as a dictionary.
- Parameters:
schema_name (Optional[str]) – The name of the schema to be returned. Default is ‘base’.
- Returns:
The schema as a dictionary.
- Return type:
dict
cas.flatten_data_to_anndata module
- cas.flatten_data_to_anndata.is_list_of_strings(var)[source]
Check if a value is a list of strings.
- Parameters:
var (list or any) – The value to be checked.
- Returns:
True if the value is a list containing only string elements, False otherwise.
- Return type:
bool
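A check of this kind is typically a one-line type test. The sketch below uses a hypothetical name, looks_like_list_of_strings, and treats an empty list as a list of strings; how the library handles the empty case is not stated here, so that behavior is an assumption.

```python
from typing import Any


def looks_like_list_of_strings(var: Any) -> bool:
    """Return True only for list instances whose elements are all str."""
    # A plain string is iterable but is not a list, so it is rejected.
    return isinstance(var, list) and all(isinstance(item, str) for item in var)


print(looks_like_list_of_strings(["CL:0000540", "neuron"]))
print(looks_like_list_of_strings(["neuron", 1]))
print(looks_like_list_of_strings("neuron"))
```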
- cas.flatten_data_to_anndata.export2cap(cas_file_path, anndata_file_path, output_file_path, fill_na)[source]
Processes and integrates information from a CAS JSON file and an AnnData file, creating a new AnnData object that incorporates metadata. The resulting AnnData object is then saved to a new file.
Note
At least one of cas_file_path or anndata_file_path must be provided. If cas_file_path is not supplied, the CAS JSON will be loaded from the AnnData file’s ‘uns’ section. Conversely, if anndata_file_path is not provided, the AnnData file will be downloaded using the matrix file id from the CAS JSON.
- Parameters:
cas_file_path (Optional[str]) – Optional path to the CAS JSON file. If not provided, the CAS JSON will be extracted from the AnnData file’s ‘uns’ section.
anndata_file_path (Optional[str]) – Optional path to the AnnData file. If not provided, the AnnData file will be downloaded using the matrix file id from the CAS JSON.
output_file_path (str) – Output AnnData file name.
fill_na (bool) – Boolean flag indicating whether to fill missing values in the ‘obs’ field with pd.NA. If True, missing values will be replaced with pd.NA; if False, they will remain as empty strings.
- cas.flatten_data_to_anndata.export_cas_object2cap(input_json, anndata_file_path, output_file_path, fill_na)[source]
Processes and integrates information from a CAS JSON and an AnnData (Annotated Data) file, creating a new AnnData object that incorporates metadata. If a CAS JSON object is not provided via the input parameter, it is extracted from the AnnData file’s ‘uns’ section. Conversely, if the AnnData file is not provided, it will be downloaded using the matrix file id from the CAS JSON.
Note
At least one of input_json or anndata_file_path must be provided. If neither is provided, the operation cannot proceed.
- Parameters:
input_json (Optional[dict]) – Optional CAS JSON object. If not provided, the CAS JSON will be extracted from the AnnData file’s ‘uns’ section.
anndata_file_path (Optional[str]) – Optional path to the AnnData file. If not provided, the AnnData file will be downloaded using the matrix file id from the CAS JSON.
output_file_path (str) – Output AnnData file name.
fill_na (bool) – Boolean flag indicating whether to fill missing values in the ‘obs’ field with pd.NA. If True, missing values will be replaced with pd.NA; if False, they will remain as empty strings.
- cas.flatten_data_to_anndata.process_annotations(annotations, obs_index, parent_cell_ids, fill_na)[source]
Processes annotations and generates flattened data for obs dataset.
- Parameters:
annotations (list) – List of annotations.
obs_index (np.ndarray) – Array representing the index of the obs dataset.
parent_cell_ids (dict) – Dictionary containing parent cell ids.
fill_na (bool)
- Returns:
Dictionary containing flattened data.
- Return type:
dict
- cas.flatten_data_to_anndata.generate_uns_json(input_json)[source]
Generates a dictionary representing the uns (unstructured) field in an AnnData object from a given JSON input.
This function processes information from a JSON input and generates a dictionary that represents the uns (unstructured) field in an AnnData object. The resulting dictionary can be used to populate the uns field in the AnnData object.
- Parameters:
input_json (dict) – A dictionary representing the input CAS JSON data containing annotations.
- Returns:
A dictionary representing the uns (unstructured) field in an AnnData object, ready to be used as input for writing to an AnnData file.
- Return type:
dict
- cas.flatten_data_to_anndata.unflatten(json_file_path, anndata_file_path, output_file_path, output_json_path)[source]
Unflatten an Anndata file and save it. Also creates a CAS json file as output.
- Parameters:
json_file_path (Optional[str]) – The path to the CAS JSON file.
anndata_file_path (str) – The path to the AnnData file.
output_file_path (str) – Output AnnData file name.
output_json_path (str) – Output CAS JSON file name.
- cas.flatten_data_to_anndata.unflatten_obs(obs_df, uns_df, cas_json, cellhash_lookup)[source]
Reverse the flattening process to update the “annotations” section in a CAS object.
- Parameters:
obs_df (DataFrame) – DataFrame containing the flattened obs columns from an AnnData object.
uns_df (Dict[str, Any]) – Dictionary containing the flattened uns section from an AnnData object.
cas_json (Optional[Dict[str, Any]]) – Optional CAS JSON object.
cellhash_lookup (Dict[str, Any]) – Cell hash lookup dictionary.
- Returns:
Updated CAS JSON with revised annotations.
- Return type:
Dict[str, Any]
- cas.flatten_data_to_anndata.create_cell_label_lookup(df_dict)[source]
Create a lookup dictionary for cell labels with corresponding observations.
- Parameters:
df_dict (Dict[str, DataFrame]) – A dictionary of DataFrames keyed by label sets.
- Returns:
A nested dictionary where keys are cell labels and values are the observations from the obs field in the AnnData object.
- Return type:
dict
- cas.flatten_data_to_anndata.update_cas_annotation(cas_dict, cas_json, cellhash_lookup)[source]
Update the annotations in the CAS JSON using the provided lookup dictionary.
This function checks the CAS JSON annotations against a lookup dictionary. It updates annotations where cell labels or cell hashes match and discards mismatches. It also adds new annotations from the lookup dictionary that are not in the CAS JSON.
- Parameters:
cas_dict (Dict[str, Dict[str, Any]]) – A lookup dictionary where keys are cell labels or hashes, and values are dictionaries with annotation data.
cas_json (Dict[str, Any]) – The CAS JSON object containing existing annotations.
cellhash_lookup (Dict[str, Any]) – The lookup for cell hashes.
- Returns:
The updated annotations based on the lookup dictionary.
- Return type:
List[Dict[str, Any]]
- cas.flatten_data_to_anndata.generate_cas_json(uns_data, cas_dict, schema_name=None)[source]
Generates a CAS JSON object from provided annotation metadata and schema.
This function constructs a CAS JSON object by retrieving a specified schema (or the default “cap” schema if none is provided) and using properties and metadata from the uns_data and cas_dict arguments. It populates the top-level properties, annotations, and label sets for the CAS JSON structure.
- Parameters:
uns_data (Dict[str, Any]) – A dictionary containing unstructured annotation metadata, which provides values for the CAS JSON’s top-level properties and cell annotation metadata.
cas_dict (Dict[str, Dict[str, Any]]) – A dictionary of CAS annotation sets, each representing an annotation structure used to populate the CAS JSON annotations.
schema_name (Optional[str]) – An optional schema name to retrieve the schema for building the CAS JSON. If not provided, defaults to “cap”.
- Returns:
A dictionary representing the CAS JSON structure populated with data from uns_data and cas_dict, following the specified schema format.
- Return type:
Dict[str, Any]
cas.flatten_data_to_tables module
- cas.flatten_data_to_tables.serialize_to_tables(cta, file_name_prefix, out_folder, project_config)[source]
Writes cell type annotation object to a series of tsv files. Tables to generate:
Annotation table (main)
Labelset table
Metadata
Annotation transfer
- Parameters:
cta – cell type annotation object to serialize.
file_name_prefix – Name prefix for table names
out_folder – output folder path.
project_config – project configuration with extra metadata
- cas.flatten_data_to_tables.generate_annotation_transfer_table(cta, out_folder)[source]
Generates annotation transfer table.
- Parameters:
cta – cell type annotation object to serialize.
out_folder – output folder path.
- cas.flatten_data_to_tables.generate_metadata_table(cta, project_config, out_folder)[source]
Generates the metadata table.
- Parameters:
cta – cell type annotation object to serialize.
project_config – metadata coming from project config
out_folder – output folder path.
- cas.flatten_data_to_tables.generate_labelset_table(cta, out_folder)[source]
Generates labelset table.
- Parameters:
cta – cell type annotation object to serialize.
out_folder – output folder path.
- cas.flatten_data_to_tables.generate_annotation_table(accession_prefix, cta, out_folder)[source]
Generates annotation table.
- Parameters:
cta – cell type annotation object to serialize.
out_folder – output folder path.
accession_prefix – accession id prefix
- cas.flatten_data_to_tables.generate_reviews_table(cta, out_folder)[source]
Generates annotation reviews table.
- Parameters:
cta – cell type annotation object to serialize.
out_folder – output folder path.
- cas.flatten_data_to_tables.list_to_string(my_list)[source]
Converts a list to its string representation. Nanobot has problems with single quotation marks, so they are removed as well.
- Parameters:
my_list (list) – List to serialize.
- Returns:
String representation of the list.
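One way to read this description is "take the default string form of the list and strip the single quotes". The sketch below follows that reading; stringify_list is a hypothetical name, and the exact output format of the library function is an assumption.

```python
from typing import List


def stringify_list(my_list: List[str]) -> str:
    """Render a list as text with the single quotes removed."""
    # str() on a list of strings quotes each element with single quotes;
    # removing them leaves a bracketed, comma-separated form.
    return str(my_list).replace("'", "")


rendered = stringify_list(["CL:0000540", "neuron"])
print(rendered)
```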
- cas.flatten_data_to_tables.assign_parent_accession_ids(accession_manager, std_parent_records, std_parent_records_dict, labelsets)[source]
Assigns accession ids to parent clusters and updates their references from the child clusters.
- Parameters:
accession_manager – accession ID generator
std_parent_records – list of all parents to assign accession ids
std_parent_records_dict – parent cluster - child clusters dictionary
labelsets – labelsets list
- cas.flatten_data_to_tables.assign_parent_cell_set_names(id_index)[source]
Assigns parent cell set names to the child cell sets.
- Parameters:
id_index (dict) – Dictionary of cell set accessions and their corresponding records.
- cas.flatten_data_to_tables.normalize_column_name(column_name)[source]
Normalizes a column name for URL compatibility. URL-compatible column name requirement: all names must match ‘^[\w_ ]+$’ for to_url().
- Parameters:
column_name (str) – Current column name.
- Returns:
Normalized column_name.
- Return type:
str
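The stated requirement is that names contain only word characters, underscores, and spaces. A minimal sketch of a normalizer that satisfies it is shown below; to_url_safe_name is hypothetical, the pattern is written here as ^[\w_ ]+$, and replacing each disallowed character with an underscore is an assumption about how the library normalizes.

```python
import re

# Pattern that a URL-compatible column name must match.
URL_SAFE_NAME = re.compile(r"^[\w_ ]+$")


def to_url_safe_name(column_name: str) -> str:
    """Replace every character outside [\\w ] with an underscore."""
    return re.sub(r"[^\w ]", "_", column_name)


normalized = to_url_safe_name("cell_set.accession-id")
print(normalized)
print(bool(URL_SAFE_NAME.match(normalized)))
```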
cas.model module
- class cas.model.EncoderMixin[source]
Bases: DataClassJsonMixin
- dataclass_json_config: Optional[dict] = {'exclude': <function EncoderMixin.<lambda>>, 'undefined': Undefined.EXCLUDE}
- class cas.model.AutomatedAnnotation(algorithm_name, algorithm_version, algorithm_repo_url, reference_location)[source]
Bases: EncoderMixin
- Parameters:
algorithm_name (str)
algorithm_version (str)
algorithm_repo_url (str)
reference_location (str | None)
- algorithm_name: str
The name of the algorithm used. It MUST be a string of the algorithm’s name.
- algorithm_version: str
The version of the algorithm used (if applicable). It MUST be a string of the algorithm’s version, which is typically in the format ‘[MAJOR].[MINOR]’, but other versioning systems are permitted (based on the algorithm’s versioning).
- algorithm_repo_url: str
This field denotes the URL of the version control repository associated with the algorithm used (if applicable). It MUST be a string of a valid URL.
- reference_location: Optional[str]
This field denotes a valid URL of the annotated dataset that was the source of annotated reference data. This MUST be a string of a valid URL. The concept of a ‘reference’ specifically refers to ‘annotation transfer’ algorithms, whereby a ‘reference’ dataset is used to transfer cell annotations to the ‘query’ dataset.
- class cas.model.Labelset(name, description=None, annotation_method=None, automated_annotation=None, rank=None)[source]
Bases:
EncoderMixin
- Parameters:
name (str)
description (str | None)
annotation_method (str | None)
automated_annotation (AutomatedAnnotation | None)
rank (int | None)
-
name:
str
Name of the annotation key.
-
description:
Optional[str] = None
Some text describing what types of cell annotation this annotation key is used to record.
-
annotation_method:
Optional[str] = None
The method used for creating the cell annotations. This MUST be one of the following strings: ‘algorithmic’, ‘manual’, or ‘both’.
-
automated_annotation:
Optional[AutomatedAnnotation] = None
A set of fields for recording the details of the automated annotation algorithm used. (Common ‘automated annotation methods’ would include PopV, Azimuth, CellTypist, scArches, etc.)
-
rank:
Optional[int] = None
A number indicating relative granularity, with 0 being the most specific. Use this where a single dataset has multiple keys that are used consistently to record annotations at different levels of granularity.
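The Labelset fields above serialize to plain JSON objects. The sketch below writes two labelset records as dictionaries, rather than through cas.model, so that it runs without the package installed. The labelset names, descriptions and algorithm details are made-up illustration values, and `check_labelsets` is a hypothetical helper, not part of the library.

```python
import json

# Values permitted for annotation_method, as listed in the field description.
ALLOWED_METHODS = {"algorithmic", "manual", "both"}

# Hypothetical labelset records using the documented field names.
labelsets = [
    {
        "name": "cluster",  # illustrative name
        "description": "Leaf-level clusters",
        "annotation_method": "algorithmic",
        "automated_annotation": {
            "algorithm_name": "ExampleClassifier",  # made-up algorithm
            "algorithm_version": "1.0",
            "algorithm_repo_url": "https://example.org/example-classifier",
        },
        "rank": 0,  # 0 marks the most specific level
    },
    {"name": "class", "annotation_method": "manual", "rank": 1},
]


def check_labelsets(records):
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    for record in records:
        method = record.get("annotation_method")
        if method is not None and method not in ALLOWED_METHODS:
            problems.append(f"{record['name']}: invalid annotation_method {method!r}")
    return problems


print(json.dumps(labelsets, indent=2))
print(check_labelsets(labelsets))
```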
- class cas.model.AnnotationTransfer(transferred_cell_label=None, source_taxonomy=None, source_node_accession=None, algorithm_name=None, comment=None)[source]
Bases:
EncoderMixin
- Parameters:
transferred_cell_label (str | None)
source_taxonomy (str | None)
source_node_accession (str | None)
algorithm_name (str | None)
comment (str | None)
-
transferred_cell_label:
Optional[str] = None
Transferred cell label.
-
source_taxonomy:
Optional[str] = None
PURL of source taxonomy.
-
source_node_accession:
Optional[str] = None
Accession of the node that the label was transferred from.
-
algorithm_name:
Optional[str] = None
The name of the algorithm used.
-
comment:
Optional[str] = None
Free-text comment on the annotation transfer.
- class cas.model.Review(datestamp=None, reviewer=None, review=None, explanation=None)[source]
Bases:
EncoderMixin
Annotation review.
- Parameters:
datestamp (datetime | None)
reviewer (str | None)
review (str | None)
explanation (str | None)
-
datestamp:
Optional[datetime] = None
Time and date the review was last edited.
-
reviewer:
Optional[str] = None
Review author.
-
review:
Optional[str] = None
Reviewer’s verdict on the annotation. Must be ‘Agree’ or ‘Disagree’.
-
explanation:
Optional[str] = None
Free-text review of the annotation. This is required if the verdict is ‘Disagree’ and should include reasons for disagreement.
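The two constraints stated for Review (an allowed verdict, and an explanation whenever the verdict is ‘Disagree’) can be checked with a few lines of code. This is a minimal sketch; `check_review` is a hypothetical helper and is not provided by cas.model.

```python
from typing import List, Optional

# Verdicts permitted by the 'review' field description.
ALLOWED_VERDICTS = {"Agree", "Disagree"}


def check_review(review: Optional[str], explanation: Optional[str]) -> List[str]:
    """Hypothetical check of the constraints documented for Review."""
    problems = []
    if review is not None and review not in ALLOWED_VERDICTS:
        problems.append("review must be 'Agree' or 'Disagree'")
    if review == "Disagree" and not explanation:
        problems.append("explanation is required when the verdict is 'Disagree'")
    return problems


print(check_review("Agree", None))
print(check_review("Disagree", None))
```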
- class cas.model.Annotation(labelset, cell_label, cell_set_accession=None, cell_fullname=None, cell_ontology_term_id=None, cell_ontology_term=None, cell_ids=None, rationale=None, rationale_dois=None, marker_gene_evidence=None, synonyms=None, parent_cell_set_name=None, parent_cell_set_accession=None, author_annotation_fields=None, neurotransmitter_accession=None, neurotransmitter_rationale=None, neurotransmitter_marker_gene_evidence=None, transferred_annotations=None, reviews=None)[source]
Bases:
EncoderMixin
A collection of fields recording a cell type/class/state annotation on some set of cells, supporting evidence and provenance. As this is intended as a general schema, compulsory fields are kept to a minimum. However, tools using this schema are encouraged to specify a larger set of compulsory fields for publication. Note: This schema deliberately allows for additional fields in order to support ad hoc user fields, new formal schema extensions and project/tool-specific metadata.
- Parameters:
labelset (str)
cell_label (str)
cell_set_accession (str | None)
cell_fullname (str | None)
cell_ontology_term_id (str | None)
cell_ontology_term (str | None)
cell_ids (List[str] | None)
rationale (str | None)
rationale_dois (List[str] | None)
marker_gene_evidence (List[str] | None)
synonyms (List[str] | None)
parent_cell_set_name (str | None)
parent_cell_set_accession (str | None)
author_annotation_fields (dict | None)
neurotransmitter_accession (str | None)
neurotransmitter_rationale (str | None)
neurotransmitter_marker_gene_evidence (List[str] | None)
transferred_annotations (List[AnnotationTransfer] | None)
reviews (List[Review] | None)
-
labelset:
str
The unique name of the set of cell annotations. Each cell within the AnnData/Seurat file MUST be associated with a ‘cell_label’ value in order for this to be a valid ‘cellannotation_setname’.
-
cell_label:
str
This denotes any free-text term which the author uses to label cells.
-
cell_set_accession:
Optional[str] = None
An identifier that can be used to consistently refer to the set of cells being annotated, even if the cell_label changes.
-
cell_fullname:
Optional[str] = None
This MUST be the full-length name for the biological entity listed in cell_label by the author. (If the value in cell_label is the full-length term, this field will contain the same value.) NOTE: any reserved word used in the field ‘cell_label’ MUST match the value of this field.
-
cell_ontology_term_id:
Optional[str] = None
This MUST be a term from either the Cell Ontology or from some ontology that extends it by classifying cell types under terms from the Cell Ontology, e.g. the Provisional Cell Ontology.
-
cell_ontology_term:
Optional[str] = None
This MUST be the human-readable name assigned to the value of ‘cell_ontology_term_id’.
-
cell_ids:
Optional[List[str]] = None
List of cell barcode sequences/UUIDs used to uniquely identify the cells.
-
rationale:
Optional[str] = None
The free-text rationale which users provide as justification/evidence for their cell annotations. Researchers are encouraged to use this field to cite relevant publications in-line using standard academic citations of the form (Zheng et al., 2020). This human-readable free-text MUST be encoded as a single string. All references cited SHOULD be listed using DOIs under rationale_dois. There MUST be a 2000-character limit.
-
rationale_dois:
Optional[List[str]] = None
A list of valid publication DOIs cited by the author to support or provide justification/evidence/context for ‘cell_label’.
-
marker_gene_evidence:
Optional[List[str]] = None
List of gene names explicitly used as evidence for this cell annotation.
-
synonyms:
Optional[List[str]] = None
This field denotes any free-text term of a biological entity which the author associates as synonymous with the biological entity listed in the field ‘cell_label’.
-
parent_cell_set_name:
Optional[str] = None
-
parent_cell_set_accession:
Optional[str] = None
A list of accessions of cell sets that subsume this cell set. This can be used to compose hierarchies of annotated cell sets, built from a fixed set of clusters.
-
author_annotation_fields:
Optional[dict] = None
A dictionary of author-defined key-value pairs annotating the cell set. The names and aims of these fields MUST NOT clash with official annotation fields.
-
neurotransmitter_accession:
Optional[str] = None
Accessions of cell neurotransmitter associated with this cell set.
-
neurotransmitter_rationale:
Optional[str] = None
The free-text rationale which users provide as justification/evidence for supporting the neurotransmitter association.
-
neurotransmitter_marker_gene_evidence:
Optional[List[str]] = None
List of gene names used as evidence for neurotransmitter association. Each gene MUST be included in the matrix of the AnnData/Seurat file.
-
transferred_annotations:
Optional[List[AnnotationTransfer]] = None
- add_user_annotation(user_annotation_set, user_annotation_label)[source]
Adds a user-defined annotation which is not supported by the standard schema.
- Parameters:
user_annotation_set – name of the user annotation set
user_annotation_label – label of the user annotation set
- class cas.model.CellTypeAnnotation(author_name, annotations, title, description=None, matrix_file_id=None, labelsets=None, author_contact=None, orcid=None, cellannotation_schema_version=None, cellannotation_timestamp=None, cellannotation_version=None, cellannotation_url=None, author_list=None)[source]
Bases:
EncoderMixin
- Parameters:
author_name (str)
annotations (List[Annotation])
title (str)
description (str | None)
matrix_file_id (str | None)
labelsets (List[Labelset] | None)
author_contact (str | None)
orcid (str | None)
cellannotation_schema_version (str | None)
cellannotation_timestamp (str | None)
cellannotation_version (str | None)
cellannotation_url (str | None)
author_list (List[str] | None)
-
author_name:
str
This MUST be a string in the format [FIRST NAME] [LAST NAME].
-
annotations:
List[Annotation]
A collection of fields recording a cell type/class/state annotation on some set of cells, supporting evidence and provenance. As this is intended as a general schema, compulsory fields are kept to a minimum. However, tools using this schema are encouraged to specify a larger set of compulsory fields for publication.
-
title:
str
The title of the dataset. This MUST be less than or equal to 200 characters, e.g. ‘Human retina cell atlas - retinal ganglion cells’.
-
description:
Optional[str] = None
The description of the dataset, e.g. ‘A total of 15 retinal ganglion cell clusters were identified from over 99K retinal ganglion cell nuclei in the current atlas. Utilizing previous characterized markers from macaque, 5 clusters can be annotated.’
-
matrix_file_id:
Optional[str] = None
A resolvable ID for a cell by gene matrix file in the form namespace:accession, e.g. CellXGene_dataset:8e10f1c4-8e98-41e5-b65f-8cd89a887122. Please see https://github.com/cellannotation/cell-annotation-schema/registry/registry.json for supported namespaces.
-
author_contact:
Optional[str] = None
This MUST be a valid email address of the author.
-
orcid:
Optional[str] = None
This MUST be a valid ORCID for the author.
-
cellannotation_schema_version:
Optional[str] = None
The schema version, the cell annotation open standard. Current version MUST follow 0.1.0. This versioning MUST follow the format ‘[MAJOR].[MINOR].[PATCH]’ as defined by Semantic Versioning 2.0.0, https://semver.org/
-
cellannotation_timestamp:
Optional[str] = None
The timestamp of all cell annotations published (per dataset). This MUST be a string in the format ‘%yyyy-%mm-%dd %hh:%mm:%ss’.
-
cellannotation_version:
Optional[str] = None
The version for all cell annotations published (per dataset). This MUST be a string. The recommended versioning format is ‘[MAJOR].[MINOR].[PATCH]’ as defined by Semantic Versioning 2.0.0, https://semver.org/
-
cellannotation_url:
Optional[str] = None
A persistent URL of all cell annotations published (per dataset).
-
author_list:
Optional[List[str]] = None
This field stores a list of users who are included in the project as collaborators, regardless of their specific role. An example list: John Smith|Cody Miller|Sarah Jones.
- add_annotation_object(obj)[source]
Adds the given object to the annotation objects list.
- Parameters:
obj – Annotation object to add
- get_all_annotations(show_cell_ids=False, labels=None)[source]
Lists all annotations.
- Parameters:
show_cell_ids (bool) – whether the result has a ‘cell_ids’ column. Default value is False.
labels (Optional[list]) – list of key (labelset), value (cell_label) pairs to filter annotations
- Return type:
DataFrame
- Returns:
Annotations data frame
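Several constraints documented for CellTypeAnnotation (the three required fields, the 200-character title limit and the timestamp layout) lend themselves to a quick pre-check on a CAS dictionary. The sketch below is one way to do that; `check_top_level` is a hypothetical helper, the author name and timestamp are made-up values, and reading ‘%yyyy-%mm-%dd %hh:%mm:%ss’ as the strftime pattern `%Y-%m-%d %H:%M:%S` is an assumption.

```python
from datetime import datetime

# Assumed strftime equivalent of the documented timestamp layout.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
# Constructor arguments without a default value in the class signature.
REQUIRED_FIELDS = ("author_name", "annotations", "title")


def check_top_level(cas: dict) -> list:
    """Hypothetical pre-check of top-level CAS fields; returns found problems."""
    problems = [f"missing required field: {key}" for key in REQUIRED_FIELDS if key not in cas]
    title = cas.get("title")
    if title is not None and len(title) > 200:
        problems.append("title exceeds 200 characters")
    timestamp = cas.get("cellannotation_timestamp")
    if timestamp is not None:
        try:
            datetime.strptime(timestamp, TIMESTAMP_FORMAT)
        except ValueError:
            problems.append("cellannotation_timestamp has an unexpected format")
    return problems


example = {
    "author_name": "Jane Doe",  # made-up author
    "title": "Human retina cell atlas - retinal ganglion cells",
    "cellannotation_timestamp": "2024-01-31 12:00:00",  # made-up timestamp
    "annotations": [{"labelset": "cluster", "cell_label": "RGC1"}],
}
print(check_top_level(example))
```

A check like this does not replace schema validation; it only catches the handful of rules spelled out in the field descriptions.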
cas.populate_cell_ids module
- cas.populate_cell_ids.populate_cell_ids(cas_json_path, anndata_path, labelsets=None, validate=False)[source]
Add/update CellIDs in a CAS JSON file using matching data from an AnnData file.
This function reads a CAS JSON file and an AnnData file, validates their consistency, and updates CellIDs in CAS based on matching labelsets from the AnnData obs DataFrame. The modified CAS JSON is then saved back to the original file.
- Parameters:
cas_json_path (str) – Path to the CAS JSON file.
anndata_path (str) – Path to the AnnData file.
labelsets (list, optional) – A list of labelsets to update with CellIDs from AnnData. If None, the labelset with rank ‘0’ is used by default.
validate (bool, optional) – If True, runs validation checks to ensure labelset consistency. The program will exit with an error if validation fails. Defaults to False.
- Raises:
Exception – If the AnnData file cannot be read.
- Returns:
None
- cas.populate_cell_ids.update_cas_with_cell_ids(cas_json, anndata_obs, labelsets=None)[source]
Update a CAS dictionary by adding or modifying CellIDs using matching AnnData observations.
This function takes a CAS dictionary and an AnnData obs DataFrame and updates the CAS with cell IDs extracted from the specified labelsets in the AnnData.
- Parameters:
cas_json (dict) – The CAS dictionary to update with cell IDs from AnnData.
anndata_obs (CapAnnDataDF) – The obs DataFrame extracted from an AnnData object.
labelsets (list, optional) – A list of labelsets to update with IDs from AnnData. If None, the labelset with rank ‘0’ is used.
- Returns:
The updated CAS dictionary with CellIDs populated.
- Return type:
dict
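The update described above amounts to grouping cell IDs by (labelset, cell_label) and attaching each group to the matching annotation. The sketch below shows that idea with plain dictionaries standing in for the obs DataFrame, so it runs without AnnData or pandas. `add_cell_ids_sketch`, the `obs_rows` layout and all sample values are assumptions for illustration and do not reproduce the library's implementation.

```python
from collections import defaultdict


def add_cell_ids_sketch(cas: dict, obs_rows: dict, labelsets=None) -> dict:
    """Hypothetical stand-in: obs_rows maps cell ID -> {obs column: value}."""
    if labelsets is None:
        # Default described above: fall back to the labelset with rank 0.
        labelsets = [ls["name"] for ls in cas.get("labelsets", []) if ls.get("rank") in (0, "0")]

    # Group cell IDs by (labelset, cell_label).
    cells_by_label = defaultdict(list)
    for cell_id, row in obs_rows.items():
        for labelset in labelsets:
            if labelset in row:
                cells_by_label[(labelset, row[labelset])].append(cell_id)

    # Attach each group to the annotation with the same labelset and label.
    for annotation in cas.get("annotations", []):
        key = (annotation["labelset"], annotation["cell_label"])
        if key in cells_by_label:
            annotation["cell_ids"] = cells_by_label[key]
    return cas


cas = {
    "labelsets": [{"name": "cluster", "rank": 0}, {"name": "class", "rank": 1}],
    "annotations": [
        {"labelset": "cluster", "cell_label": "c1"},
        {"labelset": "cluster", "cell_label": "c2"},
        {"labelset": "class", "cell_label": "neuron"},
    ],
}
obs_rows = {
    "AAAC-1": {"cluster": "c1", "class": "neuron"},
    "AAAG-1": {"cluster": "c2", "class": "neuron"},
    "AAAT-1": {"cluster": "c1", "class": "neuron"},
}
updated = add_cell_ids_sketch(cas, obs_rows)
print(updated["annotations"])
```

Because no labelsets are passed, only the rank 0 labelset (`cluster`) receives cell IDs here; the `class` annotation is left unchanged.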
- cas.populate_cell_ids.add_cell_ids(cas, ad_obs, labelsets=None, validate=False)[source]
Add/update CellIDs to CAS from matching AnnData file.
- Parameters:
cas (dict) – CAS JSON object.
ad_obs (Union[DataFrame, CapAnnDataDF]) – Obs DataFrame extracted from an AnnData object.
labelsets (Optional[list]) – List of labelsets to update with IDs from AnnData. If the value is null, the rank ‘0’ labelset is used. The labelsets should be provided in order, starting from rank 0.
validate (bool, optional) – If True, runs validation checks to ensure labelset consistency. The program will exit with an error if validation fails. Defaults to False.
- cas.populate_cell_ids.get_obs_cluster_identifier_column(obs_keys, labelsets=None, rank_zero_labelset=None)[source]
AnnData files may use different column names to uniquely identify clusters. Gets the cluster identifier column name for the current file.
- Parameters:
obs_keys (List[str]) – AnnData observation keys.
labelsets (Optional[list]) – List of labelsets to update with IDs from AnnData. The labelsets should be provided in order, starting from rank 0 (leaf nodes).
rank_zero_labelset (Optional[str]) – rank 0 labelset name
- Returns:
cluster identifier column name
cas.reports module
- cas.reports.get_all_annotations(cas, show_cell_ids=False, labels=None)[source]
Lists all annotations.
- Parameters:
cas (dict) – Cell Annotation Schema JSON object.
show_cell_ids (bool) – whether the result has a ‘cell_ids’ column. Default value is False.
labels (Optional[list]) – list of key (labelset), value (cell_label) pairs to filter annotations
- Return type:
DataFrame
- Returns:
Annotations data frame
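The filtering behaviour described by `show_cell_ids` and `labels` can be sketched without pandas as follows. The function `list_annotations` is hypothetical, it returns a list of row dictionaries instead of a DataFrame, and treating `labels` as a list of (labelset, cell_label) tuples is an assumption about the expected input shape.

```python
def list_annotations(cas: dict, show_cell_ids: bool = False, labels=None) -> list:
    """Hypothetical sketch: one row per annotation, optionally filtered."""
    wanted = set(labels) if labels else None
    rows = []
    for annotation in cas.get("annotations", []):
        key = (annotation.get("labelset"), annotation.get("cell_label"))
        if wanted is not None and key not in wanted:
            continue
        # Drop the potentially very long 'cell_ids' list unless requested.
        rows.append({k: v for k, v in annotation.items() if show_cell_ids or k != "cell_ids"})
    return rows


cas = {
    "annotations": [
        {"labelset": "cluster", "cell_label": "c1", "cell_ids": ["AAAC-1"]},
        {"labelset": "class", "cell_label": "neuron", "cell_ids": ["AAAC-1", "AAAG-1"]},
    ]
}
print(list_annotations(cas, labels=[("class", "neuron")]))
```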
cas.spreadsheet_to_cas module
- cas.spreadsheet_to_cas.read_spreadsheet(file_path, sheet_name, schema)[source]
Read the specific sheet from the Excel file into a pandas DataFrame.
- Parameters:
file_path (str) – Path to the Excel file.
sheet_name (str, optional) – Target sheet name. If not provided, reads the first sheet.
schema (dict) – Cell annotation schema.
- Returns:
Tuple containing metadata (dict), column names (list), and raw data (pd.DataFrame).
- Return type:
tuple
- cas.spreadsheet_to_cas.custom_lowercase_transform(s)[source]
Transforms the given string to lowercase except for words that are acronyms or specific cell type names which are three characters or fewer.
- Parameters:
s (str) – The input string.
- Returns:
The transformed string.
- Return type:
str
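One possible reading of this rule is that a word is left untouched when it is at most three characters long and already fully upper case, and lowercased otherwise. The sketch below implements that reading only; `lowercase_except_short_acronyms` is a hypothetical name and the actual function may use different criteria, for example a curated list of cell type names.

```python
def lowercase_except_short_acronyms(s: str) -> str:
    """Hypothetical sketch: keep upper-case words of three characters or fewer."""
    words = []
    for word in s.split():
        # str.isupper() is True only if every cased character is upper case,
        # so tokens such as 'NK' or 'L5' are preserved.
        keep = len(word) <= 3 and word.isupper()
        words.append(word if keep else word.lower())
    return " ".join(words)


print(lowercase_except_short_acronyms("NK Cell"))
print(lowercase_except_short_acronyms("L5 ET Neuron"))
```

Note that `split()` followed by `join` collapses runs of whitespace into single spaces, which may or may not be wanted for spreadsheet values.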
- cas.spreadsheet_to_cas.spreadsheet2cas(spreadsheet_file_path, sheet_name, anndata_file_path, labelset_list, schema_name, output_file_path)[source]
Convert a spreadsheet to Cell Annotation Schema (CAS) JSON.
- Parameters:
spreadsheet_file_path (str) – Path to the spreadsheet file.
sheet_name (Optional[str]) – Target sheet name in the spreadsheet. Can be a string or None.
anndata_file_path (Optional[str]) – The path to the AnnData file.
labelset_list (Optional[List[str]]) – List of names of observation (obs) fields used to record author cell type names, which determine the rank of labelsets in a spreadsheet.
schema_name (Optional[str]) – Name of the CAS schema, can be one of ‘base’, ‘bican’ or ‘cap’.
output_file_path (str) – Output CAS file name.
- cas.spreadsheet_to_cas.add_annotations_to_cas(cas, raw_data_result, columns, schema, parent_cell_look_up)[source]
Adds processed annotations from raw data to the CAS structure and tracks labelsets. Assumes certain external definitions for column names and transformation functions.
- Parameters:
cas (dict) – The CAS structure to update with annotations.
raw_data_result (DataFrame) – Raw annotation data.
columns (list) – Column names of raw data to process.
schema (dict) – Cell annotation schema.
parent_cell_look_up (Dict[str, Any]) – A precomputed dictionary containing hierarchical metadata about cell labels.
- Returns:
Tracks labelsets encountered, initialized to None.
- Return type:
OrderedDict
Note
Requires custom_lowercase_transform, get_cell_ids, and column constants to be defined.
- cas.spreadsheet_to_cas.initialize_cas_structure(matrix_file_id, meta_data_result)[source]
Initializes the Cell Annotation Schema (CAS) structure with basic information and placeholders for annotations and labelsets. Fields initialized with None values are omitted in the final output.
- Parameters:
matrix_file_id (str) – The ID of the matrix file, used within the CAS for identification.
meta_data_result (dict) – Metadata containing at least the ‘matrix_file_id’ for the CAS URL.
- Returns:
The initial CAS structure with the matrix file ID, annotation URL, and placeholders for future data. Excludes fields that remain None.
- Return type:
dict
- cas.spreadsheet_to_cas.load_or_fetch_anndata(anndata_file_path, meta_data_result)[source]
Loads or fetches an AnnData file, based on a local path or a matrix file ID from metadata.
- Parameters:
anndata_file_path (str) – Path to an AnnData file, or None to fetch using metadata.
meta_data_result (dict) – Metadata with ‘matrix_file_id’ for fetching the dataset.
- Returns:
(AnnData object, matrix file ID), ready for use.
- Return type:
tuple
- Raises:
ValueError – If ‘matrix_file_id’ is missing from metadata.
cas.validate module
- cas.validate.validate(schema_name, data_path)[source]
Validates all instances in data_path against the given schema. Assumes all *.json files in the test_dir should validate against the schema. Logs all validation errors and throws an exception if any of the test files is invalid.
- Parameters:
schema_name (str) – One of ‘base’, ‘bican’ or ‘cap’. Identifies the CAS schema to validate data against.
data_path (str) – Path to the data file (or folder) to validate.