PixelGuard
PixelGuard is a Python package and application for medical image de-identification. The proposed method offers up to 72% compression relative to the original DICOM files. This matters for long-term storage of these large files, and it also allows substantially more data to be held in short-term storage for machine learning applications (e.g., batch processing). Images can be saved as lossless JPEGs or DICOMs. The package is designed to be modular, with separate functions for those who need only the de-identification part of the pipeline. That part of the pipeline may be easily extended to other imaging modalities such as magnetic resonance imaging (MRI), computed tomography (CT), and other radiographs. Additional tools are included for stand-alone use, including common image processing and statistical tools: Sørensen–Dice coefficient (Dice) scoring and plotting functions, and functions that take a NumPy array and embed the bytes associated with the image into JPEG format.
Data Flow
Overview of the data flow: the directory supplied to the software is walked to find all compatible images within that parent directory. File paths that have already been processed successfully are skipped, according to the program's own log files (stored in the output file path). If a file is not supported or an error occurs, this is written to the log files.
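The walk-and-skip behaviour described above can be illustrated with a minimal sketch. This is not the package's implementation: the function name `find_unprocessed`, the extension set, and the log format (one processed path per line) are all assumptions made for illustration.

```python
import os

# Hypothetical extension list; the formats the package actually supports may differ.
SUPPORTED_EXTENSIONS = {".dcm", ".jpg", ".jpeg", ".png"}


def find_unprocessed(directory_path, processed_log_path):
    """Return supported files under directory_path that are not listed in the log.

    Assumes the log holds one successfully processed file path per line.
    """
    already_done = set()
    if os.path.exists(processed_log_path):
        with open(processed_log_path, encoding="utf-8") as log:
            already_done = {line.strip() for line in log if line.strip()}

    todo = []
    for root, _dirs, files in os.walk(directory_path):
        for name in sorted(files):
            path = os.path.join(root, name)
            extension = os.path.splitext(name)[1].lower()
            # Skip unsupported files and anything the log says is already done.
            if extension in SUPPORTED_EXTENSIONS and path not in already_done:
                todo.append(path)
    return todo
```

Re-running such a function after an interrupted batch would return only the files that still need processing, which is the effect the log files are described as providing.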
Customization
Various options exist for processing medical images, depending on the user's preferences. These range from solely de-identification to de-identification and image cleaning/cropping in preparation for machine learning experiments.
Data cleaning and anonymization begin by loading the data, which may be a DICOM, a JPEG stack, or a video file, and subjecting it to text detection and removal through an OCR and masking procedure. Any identified text is extracted and saved to a separate .csv file. If the user chooses to obfuscate the file names, a random series of alphanumeric characters is generated as the new name and the original filename is added to the .csv file, which then also serves as a crosswalk file. The images may be further cleaned by eliminating extraneous elements through a sequence of filtering, morphological, and geometric operations (the available variants are explained below).
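The filename obfuscation step can be sketched as follows. The helper names `random_name` and `build_crosswalk` are hypothetical, and the 10-character length mirrors the rename_files = 'random' option documented for deid.deid; how the package itself draws and checks names is not shown here.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def random_name(existing, length=10):
    """Draw a random alphanumeric name that is not already in `existing`."""
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if candidate not in existing:  # redraw on the (unlikely) collision
            existing.add(candidate)
            return candidate


def build_crosswalk(original_names):
    """Map each original filename to a unique random name."""
    used = set()
    return {name: random_name(used) for name in original_names}
```

Writing the resulting mapping to the .csv file alongside the extracted text is what lets that file double as a crosswalk back to the original names.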
Import
from pylogik import deid
from pylogik import im_analysis
De-ID only
This call performs de-identification of the medical images only. It masks burned-in text in the image and writes the discovered text to a .csv file under the column header 'text'. The CSV file also contains the original file name (for crosswalk purposes) and, in the case of DICOMs, any other scrubbed attributes. Lastly, the image array(s) are written as either lossless JPEG(s) or DICOMs at the user's discretion.
deid.deid(directory_path = "/path/to/directory", output_directory_path = "/path/to/directory", rename_files = None, output_type = 'jpeg', rename_csv = None, use_dates=False, date_header = 'remove', threshold = 0)
directory_path : path to image files (DICOM, JPEG, PNG etc.)
output_directory_path : path where you want to save images
output_type: 'jpeg' (default), options: 'jpeg', 'DICOM'
rename_files: None (default), options: 'csv', 'random'. This determines how files are renamed when they are saved to the output directory. None will not change the filenames (so be cognizant of your own duplicates). 'random' sets each filename to a 10-character alphanumeric string and checks that nothing is duplicated. Otherwise, the user can specify 'csv', in which case attributes from the DICOM (PatientID and StudyDate or AcquisitionDateTime) are used to determine the saved name (see the rename_csv parameter below).
rename_csv: path to .csv file for renaming patient IDs/files, column headers in the corresponding csv should be named akin to: 'original_id', 'datetime', 'new_id', 'file_save_name'. This permits users to swap the PatientID in the DICOM header with a value of their choice. Note that if there isn't a csv specified then the PatientID is destroyed.
use_dates: Boolean, False (default), set to True if you require date information to further subdivide IDs and filenames. Then file renaming/ID swapping will take into account date information (in the event the same patient ID had multiple scans on the same date)
date_header: 'remove' (default), options: '0101year'. Setting this to 'remove' will scrub DICOM attributes ['StudyDate', 'InstanceCreationDate', 'ContentDate', 'OverlayDate', 'CurveDate', 'AcquisitionDateTime']; otherwise, '0101year' will retain the same year and shift the day and month to Jan 1st of that year. Of note, with '0101year' any time component in these datetimes is set to 00:00:00.
** For DICOM files, header attributes [PatientName, PatientBirthDate, IdentifyingComments, ReferringPhysician, InstitutionName, ReferringPhysicianAddress, PatientAddress, PrivateInformation, AccessionNumber] are all removed **
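As an illustration of the rename_csv parameter, the sketch below writes a small CSV with the four column headers named above. The rows are made-up placeholders, the `YYYYMMDD` form used in the 'datetime' column is an assumption (the expected date format is not specified here), and `write_rename_csv` is a hypothetical helper, not part of the package.

```python
import csv
import io

# Column names taken from the rename_csv description; row values are invented examples.
FIELDNAMES = ["original_id", "datetime", "new_id", "file_save_name"]

rows = [
    {"original_id": "PAT001", "datetime": "20200315",
     "new_id": "SUBJ_A", "file_save_name": "SUBJ_A_scan1"},
    {"original_id": "PAT002", "datetime": "20200316",
     "new_id": "SUBJ_B", "file_save_name": "SUBJ_B_scan1"},
]


def write_rename_csv(handle, rows):
    """Write the crosswalk rows with the expected header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer keeps the example self-contained; a real file handle works the same way.
buffer = io.StringIO()
write_rename_csv(buffer, rows)
csv_text = buffer.getvalue()
```

A file produced this way would then be supplied as rename_csv = "/path/to/file.csv" together with rename_files = 'csv'.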
De-ID and Ultrasound Cleaning for ML
This call performs de-identification of ultrasound medical images. It masks burned-in text in the image and writes the discovered text to a .csv file under the column header 'text'. The CSV file also contains the original file name (for crosswalk purposes) and, in the case of DICOMs, any other scrubbed attributes. Further geometric and morphological image processing is performed on the image array to match the ultrasound shape/contour and isolate the ROI (in preparation for machine learning experiments). The image frame(s) are then saved as lossless JPEG(s) or DICOMs at the user's discretion.
deid.deid_clean(directory_path = "/path/to/directory", output_directory_path = "/path/to/directory", rename_files = None, output_type = 'jpeg', rename_csv = None, use_dates=False, date_header = 'remove', threshold = 0)
directory_path : path to image files (DICOM, JPEG, PNG etc.)
output_directory_path : path where you want to save images
output_type: 'jpeg' (default), options: 'jpeg', 'DICOM'
rename_files: None (default), options: 'csv', 'random'. This determines how files are renamed when they are saved to the output directory. None will not change the filenames (so be cognizant of your own duplicates). 'random' sets each filename to a 10-character alphanumeric string and checks that nothing is duplicated. Otherwise, the user can specify 'csv', in which case attributes from the DICOM (PatientID and StudyDate or AcquisitionDateTime) are used to determine the saved name (see the rename_csv parameter below).
rename_csv: path to .csv file for renaming patient IDs/files, column headers in the corresponding csv should be named akin to: 'original_id', 'datetime', 'new_id', 'file_save_name'. This permits users to swap the PatientID in the DICOM header with a value of their choice. Note that if there isn't a csv specified then the PatientID is destroyed.
use_dates: Boolean, False (default), set to True if you require date information to further subdivide IDs and filenames. Then file renaming/ID swapping will take into account date information (in the event the same patient ID had multiple scans on the same date)
date_header: 'remove' (default), options: '0101year'. Setting this to 'remove' will scrub DICOM attributes ['StudyDate', 'InstanceCreationDate', 'ContentDate', 'OverlayDate', 'CurveDate', 'AcquisitionDateTime']; otherwise, '0101year' will retain the same year and shift the day and month to Jan 1st of that year. Of note, with '0101year' any time component in these datetimes is set to 00:00:00.
** For DICOM files, header attributes [PatientName, PatientBirthDate, IdentifyingComments, ReferringPhysician, InstitutionName, ReferringPhysicianAddress, PatientAddress, PrivateInformation, AccessionNumber] are all removed **
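The '0101year' behaviour of date_header can be sketched on plain strings. DICOM date (DA) values have the form `YYYYMMDD` and datetime (DT) values begin with `YYYYMMDDHHMMSS`; the function name `shift_to_jan_first` is hypothetical, and this is an illustration of the described effect rather than the package's own code.

```python
def shift_to_jan_first(value):
    """Shift a DICOM DA or DT string to Jan 1st of the same year.

    Date-only values come back as 'YYYY0101'; values carrying a time
    come back as 'YYYY0101000000' (time reset to 00:00:00).
    """
    year = value[:4]
    if len(year) != 4 or not year.isdigit():
        raise ValueError(f"cannot read a year from {value!r}")
    if len(value) <= 8:
        return year + "0101"         # date only
    return year + "0101" + "000000"  # datetime: drop the original time
```

For example, a StudyDate of `20200315` would become `20200101`, so the year is retained while the day and month no longer identify the visit.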
De-ID and Retention of Single Largest ROI
This call performs de-identification of medical images. It masks burned-in text in the image and writes the discovered text to a .csv file under the column header 'text'. The CSV file also contains the original file name (for crosswalk purposes) and, in the case of DICOMs, any other scrubbed attributes. Further image processing is performed on the image array to isolate the single largest ROI (in preparation for machine learning experiments). This is intended for general medical imaging data and will work for ultrasound; however, for ultrasound data we recommend the ultrasound-specific algorithm described above. The image frame(s) are then saved as lossless JPEG(s) or DICOMs at the user's discretion.
deid.deid_clean(directory_path = "/path/to/directory", output_directory_path = "/path/to/directory", rename_files = None, output_type = 'jpeg', rename_csv = None, use_dates=False, date_header = 'remove', threshold = 0)
directory_path : path to image files (DICOM, JPEG, PNG etc.)
output_directory_path : path where you want to save images
output_type: 'jpeg' (default), options: 'jpeg', 'DICOM'
rename_files: None (default), options: 'csv', 'random'. This determines how files are renamed when they are saved to the output directory. None will not change the filenames (so be cognizant of your own duplicates). 'random' sets each filename to a 10-character alphanumeric string and checks that nothing is duplicated. Otherwise, the user can specify 'csv', in which case attributes from the DICOM (PatientID and StudyDate or AcquisitionDateTime) are used to determine the saved name (see the rename_csv parameter below).
rename_csv: path to .csv file for renaming patient IDs/files, column headers in the corresponding csv should be named akin to: 'original_id', 'datetime', 'new_id', 'file_save_name'. This permits users to swap the PatientID in the DICOM header with a value of their choice. Note that if there isn't a csv specified then the PatientID is destroyed.
use_dates: Boolean, False (default), set to True if you require date information to further subdivide IDs and filenames. Then file renaming/ID swapping will take into account date information (in the event the same patient ID had multiple scans on the same date)
date_header: 'remove' (default), options: '0101year'. Setting this to 'remove' will scrub DICOM attributes ['StudyDate', 'InstanceCreationDate', 'ContentDate', 'OverlayDate', 'CurveDate', 'AcquisitionDateTime']; otherwise, '0101year' will retain the same year and shift the day and month to Jan 1st of that year. Of note, with '0101year' any time component in these datetimes is set to 00:00:00.
** For DICOM files, header attributes [PatientName, PatientBirthDate, IdentifyingComments, ReferringPhysician, InstitutionName, ReferringPhysicianAddress, PatientAddress, PrivateInformation, AccessionNumber] are all removed **
De-ID and Cleaning
This call performs de-identification of the medical images. It masks burned-in text in the image and writes the discovered text to a .csv file under the column header 'text'. The CSV file also contains the original file name (for crosswalk purposes) and, in the case of DICOMs, any other scrubbed attributes. Further image processing is performed on the image array to remove smaller extraneous features while retaining multiple larger entities (in preparation for machine learning experiments). This is intended for general medical imaging data. The image frame(s) are then saved as lossless JPEG(s) or DICOMs at the user's discretion.
deid.deid_clean(directory_path = "/path/to/directory", output_directory_path = "/path/to/directory", rename_files = None, output_type = 'jpeg', rename_csv = None, use_dates=False, date_header = 'remove', threshold = 0)
directory_path : path to image files (DICOM, JPEG, PNG etc.)
output_directory_path : path where you want to save images
output_type: 'jpeg' (default), options: 'jpeg', 'DICOM'
rename_files: None (default), options: 'csv', 'random'. This determines how files are renamed when they are saved to the output directory. None will not change the filenames (so be cognizant of your own duplicates). 'random' sets each filename to a 10-character alphanumeric string and checks that nothing is duplicated. Otherwise, the user can specify 'csv', in which case attributes from the DICOM (PatientID and StudyDate or AcquisitionDateTime) are used to determine the saved name (see the rename_csv parameter below).
rename_csv: path to .csv file for renaming patient IDs/files, column headers in the corresponding csv should be named akin to: 'original_id', 'datetime', 'new_id', 'file_save_name'. This permits users to swap the PatientID in the DICOM header with a value of their choice. Note that if there isn't a csv specified then the PatientID is destroyed.
use_dates: Boolean, False (default), set to True if you require date information to further subdivide IDs and filenames. Then file renaming/ID swapping will take into account date information (in the event the same patient ID had multiple scans on the same date)
date_header: 'remove' (default), options: '0101year'. Setting this to 'remove' will scrub DICOM attributes ['StudyDate', 'InstanceCreationDate', 'ContentDate', 'OverlayDate', 'CurveDate', 'AcquisitionDateTime']; otherwise, '0101year' will retain the same year and shift the day and month to Jan 1st of that year. Of note, with '0101year' any time component in these datetimes is set to 00:00:00.
** For DICOM files, header attributes [PatientName, PatientBirthDate, IdentifyingComments, ReferringPhysician, InstitutionName, ReferringPhysicianAddress, PatientAddress, PrivateInformation, AccessionNumber] are all removed **
Finding Text
This call detects text in the images found at the provided path and writes it out to a series of CSV files with the same names as the source files.
deid.find_txt(input_path = "path_to_files",output_path="path_to_save_files")
input_path : path to image files (DICOM, JPEG, PNG etc.)
output_path : path where you want to save the CSV files
Dice Calculation and Plotting
These functions compute the Sørensen–Dice coefficient (Dice score) between a predicted segmentation and a ground truth segmentation, and plot the two arrays together for visual comparison.
im_analysis.dice_score(pred_array, true_array, k=1)
pred_array : array of the predicted segmentation
true_array: array of the ground truth segmentation
k : value to perform matching on (default=1)
Returns: dice score (float)
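For reference, the Dice coefficient for a label value k is 2|A ∩ B| / (|A| + |B|), where A and B are the sets of pixels equal to k in the predicted and ground truth arrays. The sketch below is a stand-alone NumPy version under that definition; `dice_sketch` is a hypothetical name, it is not the source of `im_analysis.dice_score`, and the handling of two empty masks is an assumption.

```python
import numpy as np


def dice_sketch(pred_array, true_array, k=1):
    """Dice coefficient between the regions equal to k in two arrays."""
    pred = np.asarray(pred_array) == k
    true = np.asarray(true_array) == k
    total = pred.sum() + true.sum()
    if total == 0:
        return 1.0  # assumption: two empty masks are treated as a perfect match
    intersection = np.logical_and(pred, true).sum()
    return float(2.0 * intersection / total)
```

A score of 1.0 means the two regions coincide exactly and 0.0 means they share no pixels.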
im_analysis.imshowpair(pred_array, true_array, color1 = (124,252,0), color2=(255,0,252), show_fig=True)
pred_array : array of the predicted segmentation
true_array: array of the ground truth segmentation
k : value to perform matching on (default=1)
color1 : RGB value color (tuple)
color2: RGB value color (tuple)
Returns: dice score (float)