ORE Studio 0.0.4
Classes | |
| class | DownloadProgressReporter |
| class | EntityInfo |
| class | ProgressReporter |
Functions | |
| Path | get_lei_data_dir () |
| Tuple[Optional[str], Optional[str]] | try_gleif_api (str api_base, str date_str) |
| Optional[str] | try_direct_url (str base_url, int file_id, datetime date, str file_type) |
| Tuple[str, str] | discover_gleif_urls (Optional[datetime] target_date=None) |
| str | download_and_extract (str url, Path output_dir) |
| Tuple[str, str] | download_gleif_data (Path output_dir, Optional[str] lei_url=None, Optional[str] rr_url=None, Optional[datetime] target_date=None) |
| Set[str] | load_anchor_leis (Path data_dir) |
| int | count_lines (str filepath) |
| str | detect_sector (str name) |
| str | detect_fund_type (str name) |
| Tuple[str, str] | find_data_files (Path directory) |
| str | get_subset_filename (str original, str size) |
| Tuple[Dict[str, List[str]], Dict[str, str]] | build_relationship_map (str rr_file) |
| Dict[str, EntityInfo] | analyze_entities (str lei_file, Dict[str, List[str]] parent_to_children, Dict[str, str] child_to_parent) |
| Set[str] | select_subset (Dict[str, EntityInfo] entities, Dict[str, List[str]] parent_to_children, Dict[str, str] child_to_parent, Config config, Set[str] anchor_leis=None) |
| write_subset (str lei_file, str rr_file, Set[str] selected_leis, str output_lei, str output_rr) | |
| main () | |
Variables | |
| dict | SUBSET_SIZES |
| list | GLEIF_API_ENDPOINTS |
| str | GLEIF_DOWNLOAD_BASE = "https://leidata-preview.gleif.org/storage/golden-copy-files" |
| int | LEI_COL_ID = 0 |
| int | LEI_COL_NAME = 1 |
| int | LEI_COL_COUNTRY = 43 |
| int | LEI_COL_CATEGORY = 191 |
| int | LEI_COL_SUBCATEGORY = 192 |
| int | LEI_COL_LEGAL_FORM_CODE = 193 |
| int | LEI_COL_OTHER_LEGAL_FORM = 194 |
| int | LEI_COL_STATUS = 199 |
| int | RR_COL_START_NODE = 0 |
| int | RR_COL_END_NODE = 2 |
| int | RR_COL_RELATIONSHIP_TYPE = 4 |
| dict | SECTOR_KEYWORDS |
| dict | FINANCIAL_SECTORS |
| int | FINANCIAL_PRIORITY_MULTIPLIER = 3 |
| dict | FUND_TYPE_KEYWORDS |
Extract a diverse subset from the GLEIF LEI dataset.
This script creates a smaller, representative subset of the LEI data that includes:
- Entities from as many countries as possible
- All entity categories (GENERAL, FUND, SOLE_PROPRIETOR, etc.)
- Entities with different relationship depths (0, 1, 2, 3, 4, 5+ children)
- Diversity across detected sectors (banks, insurance, funds, energy, etc.)
- Various legal forms and fund types
Usage:
python lei_extract_subset.py --size small # Small subset for quick testing
python lei_extract_subset.py --size large # Larger subset for comprehensive testing
python lei_extract_subset.py --download # Download latest data first
Output files are written to external/lei/:
- <original_lei_file>-subset-<size>.csv
- <original_rr_file>-subset-<size>.csv
| Path get_lei_data_dir | ( | ) |
Get the path to external/lei directory.

| Tuple[Optional[str], Optional[str]] try_gleif_api | ( | str | api_base, |
| str | date_str | ||
| ) |
Try to get URLs from a GLEIF API endpoint.

| Optional[str] try_direct_url | ( | str | base_url, |
| int | file_id, | ||
| datetime | date, | ||
| str | file_type | ||
| ) |
Try to access a direct URL with a specific file ID.

| Tuple[str, str] discover_gleif_urls | ( | Optional[datetime] | target_date = None | ) |
Discover the download URLs for LEI and RR files from GLEIF.
Tries multiple strategies:
1. Query GLEIF API endpoints
2. Try direct URL construction with ID probing
URL pattern: https://leidata-preview.gleif.org/storage/golden-copy-files/YYYY/MM/DD/ID/filename.csv.zip
Args:
target_date: Date to download (defaults to today)
Returns:
Tuple of (lei_url, rr_url)
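
As an illustration of the URL pattern above, the following minimal sketch assembles one candidate URL from a base, a date, a file ID and a filename. The helper name `build_candidate_url`, the file ID and the filename are hypothetical placeholders; the page does not show how the script itself builds or probes URLs.

```python
from datetime import datetime

# Base URL as documented for GLEIF_DOWNLOAD_BASE on this page.
GLEIF_DOWNLOAD_BASE = "https://leidata-preview.gleif.org/storage/golden-copy-files"


def build_candidate_url(base_url: str, file_id: int, date: datetime, filename: str) -> str:
    """Assemble base/YYYY/MM/DD/ID/filename (hypothetical helper)."""
    return f"{base_url}/{date:%Y/%m/%d}/{file_id}/{filename}"


# The file ID and filename below are made-up placeholders, not real GLEIF values.
url = build_candidate_url(
    GLEIF_DOWNLOAD_BASE, 12345, datetime(2025, 1, 31), "example-lei2.csv.zip"
)
print(url)
```

ID probing would then amount to calling such a helper for a range of candidate IDs and checking which URL responds.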

| str download_and_extract | ( | str | url, |
| Path | output_dir | ||
| ) |
Download a zip file and extract the CSV.
Args:
url: URL to download
output_dir: Directory to save the extracted file
Returns:
Path to the extracted CSV file
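
A minimal sketch of the download-and-extract step, using only the standard library. The function names are hypothetical, and taking the first `.csv` member of the archive is an assumption; the real function may pick the file differently. Reading the whole archive into memory is a simplification, since the GLEIF files are large.

```python
import io
import urllib.request
import zipfile
from pathlib import Path


def extract_first_csv(zip_bytes: bytes, output_dir: Path) -> str:
    """Extract the first .csv member of an in-memory zip and return its path."""
    output_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("archive contains no CSV file")
        # ZipFile.extract returns the path of the file it created.
        extracted = archive.extract(csv_names[0], path=output_dir)
    return str(extracted)


def download_and_extract_sketch(url: str, output_dir: Path) -> str:
    """Download a zip archive into memory, then extract its CSV (simplified)."""
    with urllib.request.urlopen(url) as response:
        return extract_first_csv(response.read(), output_dir)
```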

| Tuple[str, str] download_gleif_data | ( | Path | output_dir, |
| Optional[str] | lei_url = None, | ||
| Optional[str] | rr_url = None, | ||
| Optional[datetime] | target_date = None | ||
| ) |
Download the latest GLEIF golden copy files.
Args:
output_dir: Directory to save files
lei_url: Optional explicit URL for LEI file
rr_url: Optional explicit URL for RR file
target_date: Target date for downloads (defaults to today)
Returns:
Tuple of (lei_file_path, rr_file_path)

| Set[str] load_anchor_leis | ( | Path | data_dir | ) |
Load anchor LEIs from anchor_leis.json. The JSON file is generated by lei_extract_anchor_leis.py from regulatory lists (ECB SSM, UK PRA) plus hand-curated G-SIB LEIs.

| int count_lines | ( | str | filepath | ) |
Count lines in a file efficiently.
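
One common way to count lines efficiently is to read the file in fixed-size binary chunks and count newline bytes, which avoids decoding or holding the file in memory. The sketch below shows that technique; whether the script uses this exact approach is an assumption, and the name `count_lines_sketch` is hypothetical.

```python
def count_lines_sketch(filepath: str, chunk_size: int = 1024 * 1024) -> int:
    """Count newline bytes by reading the file in fixed-size binary chunks."""
    total = 0
    with open(filepath, "rb") as handle:
        # read() returns b"" at end of file, which ends the loop.
        while chunk := handle.read(chunk_size):
            total += chunk.count(b"\n")
    return total
```

Because this counts newline bytes, a final line without a trailing newline is not included in the total.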

| str detect_sector | ( | str | name | ) |
Detect sector from entity name using keywords.
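
Keyword-based detection can be sketched as a case-insensitive substring match against a keyword table. The table below is a made-up example (the contents of SECTOR_KEYWORDS are not shown on this page), and both the first-match rule and the "other" fallback are assumptions.

```python
# Hypothetical keyword table; the real SECTOR_KEYWORDS contents are not documented here.
EXAMPLE_SECTOR_KEYWORDS = {
    "bank": ["bank", "banque", "sparkasse"],
    "insurance": ["insurance", "assurance"],
    "energy": ["energy", "petroleum"],
}


def detect_sector_sketch(name: str, keywords=EXAMPLE_SECTOR_KEYWORDS) -> str:
    """Return the first sector whose keyword occurs in the name, else 'other'."""
    lowered = name.lower()
    for sector, words in keywords.items():
        if any(word in lowered for word in words):
            return sector
    return "other"  # assumed fallback label


print(detect_sector_sketch("First National Bank Ltd"))
```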

| str detect_fund_type | ( | str | name | ) |
Detect fund type from fund entity name.

| Tuple[str, str] find_data_files | ( | Path | directory | ) |
Find the LEI and RR data files in the directory.

| str get_subset_filename | ( | str | original, |
| str | size | ||
| ) |
Generate subset filename by inserting '-subset-<size>' before .csv extension.
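
A minimal sketch of the renaming rule described above. The function name is hypothetical, and the handling of names that do not end in `.csv` is an assumption.

```python
def get_subset_filename_sketch(original: str, size: str) -> str:
    """Insert '-subset-<size>' before a trailing .csv extension."""
    suffix = f"-subset-{size}"
    if original.lower().endswith(".csv"):
        # Keep the extension exactly as written in the original name.
        return f"{original[:-4]}{suffix}{original[-4:]}"
    # Assumed fallback: append the suffix when there is no .csv extension.
    return original + suffix


print(get_subset_filename_sketch("lei-data.csv", "small"))
```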

| Tuple[Dict[str, List[str]], Dict[str, str]] build_relationship_map | ( | str | rr_file | ) |
Build the relationship maps: parent LEI -> list of child LEIs, and child LEI -> parent LEI.
Uses IS_DIRECTLY_CONSOLIDATED_BY relationships where:
- StartNode (column 0) is the child
- EndNode (column 2) is the parent
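
The mapping can be sketched with the standard `csv` module, using the RR column indices listed under Variables. This is an illustration only: the function name is hypothetical, and the header row, the UTF-8 encoding and the "last parent wins" behaviour for a child with several parents are assumptions.

```python
import csv
from collections import defaultdict
from typing import Dict, List, Tuple

# Column indices as documented on this page.
RR_COL_START_NODE = 0
RR_COL_END_NODE = 2
RR_COL_RELATIONSHIP_TYPE = 4


def build_relationship_map_sketch(rr_file: str) -> Tuple[Dict[str, List[str]], Dict[str, str]]:
    """Return (parent_to_children, child_to_parent) for direct consolidation rows."""
    parent_to_children: Dict[str, List[str]] = defaultdict(list)
    child_to_parent: Dict[str, str] = {}
    with open(rr_file, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # assumes the first row is a header
        for row in reader:
            if len(row) <= RR_COL_RELATIONSHIP_TYPE:
                continue  # skip short or malformed rows
            if row[RR_COL_RELATIONSHIP_TYPE] != "IS_DIRECTLY_CONSOLIDATED_BY":
                continue
            child, parent = row[RR_COL_START_NODE], row[RR_COL_END_NODE]
            parent_to_children[parent].append(child)
            child_to_parent[child] = parent
    return dict(parent_to_children), child_to_parent
```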

| Dict[str, EntityInfo] analyze_entities | ( | str | lei_file, |
| Dict[str, List[str]] | parent_to_children, | ||
| Dict[str, str] | child_to_parent | ||
| ) |
Read and analyze all entities, collecting classification info.

| Set[str] select_subset | ( | Dict[str, EntityInfo] | entities, |
| Dict[str, List[str]] | parent_to_children, | ||
| Dict[str, str] | child_to_parent, | ||
| Config | config, | ||
| Set[str] | anchor_leis = None | ||
| ) |
Select a diverse subset of LEIs based on multiple dimensions.

| write_subset | ( | str | lei_file, |
| str | rr_file, | ||
| Set[str] | selected_leis, | ||
| str | output_lei, | ||
| str | output_rr | ||
| ) |
Write the subset files.
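
A rough sketch of how the two subset files could be written by filtering the source CSVs row by row. The helper names are hypothetical. Keeping a relationship row only when both of its endpoints are selected is an assumption, as are the header row and the UTF-8 encoding; the page does not state the actual rules.

```python
import csv
from typing import Callable, List, Set


def filter_csv(src: str, dst: str, keep: Callable[[List[str]], bool]) -> int:
    """Copy the header plus every row for which keep(row) is true; return rows written."""
    written = 0
    with open(src, newline="", encoding="utf-8") as fin, \
            open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        header = next(reader, None)  # assumes the first row is a header
        if header is not None:
            writer.writerow(header)
        for row in reader:
            if keep(row):
                writer.writerow(row)
                written += 1
    return written


def write_subset_sketch(lei_file: str, rr_file: str, selected_leis: Set[str],
                        output_lei: str, output_rr: str) -> None:
    # LEI rows: keep when the LEI in column 0 (LEI_COL_ID) is selected.
    filter_csv(lei_file, output_lei, lambda r: bool(r) and r[0] in selected_leis)
    # RR rows: keep when both StartNode (column 0) and EndNode (column 2) are
    # selected. This rule is an assumption.
    filter_csv(rr_file, output_rr,
               lambda r: len(r) > 2 and r[0] in selected_leis and r[2] in selected_leis)
```

Streaming row by row keeps memory use flat regardless of the size of the source files.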

| dict SUBSET_SIZES |
| list GLEIF_API_ENDPOINTS |
| dict FINANCIAL_SECTORS |
| dict FUND_TYPE_KEYWORDS |