ORE Studio 0.0.4
src.lei_extract_subset Namespace Reference

Classes

class  DownloadProgressReporter
 
class  EntityInfo
 
class  ProgressReporter
 

Functions

Path get_lei_data_dir ()
 
Tuple[Optional[str], Optional[str]] try_gleif_api (str api_base, str date_str)
 
Optional[str] try_direct_url (str base_url, int file_id, datetime date, str file_type)
 
Tuple[str, str] discover_gleif_urls (Optional[datetime] target_date=None)
 
str download_and_extract (str url, Path output_dir)
 
Tuple[str, str] download_gleif_data (Path output_dir, Optional[str] lei_url=None, Optional[str] rr_url=None, Optional[datetime] target_date=None)
 
Set[str] load_anchor_leis (Path data_dir)
 
int count_lines (str filepath)
 
str detect_sector (str name)
 
str detect_fund_type (str name)
 
Tuple[str, str] find_data_files (Path directory)
 
str get_subset_filename (str original, str size)
 
Tuple[Dict[str, List[str]], Dict[str, str]] build_relationship_map (str rr_file)
 
Dict[str, EntityInfo] analyze_entities (str lei_file, Dict[str, List[str]] parent_to_children, Dict[str, str] child_to_parent)
 
Set[str] select_subset (Dict[str, EntityInfo] entities, Dict[str, List[str]] parent_to_children, Dict[str, str] child_to_parent, Config config, Set[str] anchor_leis=None)
 
 write_subset (str lei_file, str rr_file, Set[str] selected_leis, str output_lei, str output_rr)
 
 main ()
 

Variables

dict SUBSET_SIZES
 
list GLEIF_API_ENDPOINTS
 
str GLEIF_DOWNLOAD_BASE = "https://leidata-preview.gleif.org/storage/golden-copy-files"
 
int LEI_COL_ID = 0
 
int LEI_COL_NAME = 1
 
int LEI_COL_COUNTRY = 43
 
int LEI_COL_CATEGORY = 191
 
int LEI_COL_SUBCATEGORY = 192
 
int LEI_COL_LEGAL_FORM_CODE = 193
 
int LEI_COL_OTHER_LEGAL_FORM = 194
 
int LEI_COL_STATUS = 199
 
int RR_COL_START_NODE = 0
 
int RR_COL_END_NODE = 2
 
int RR_COL_RELATIONSHIP_TYPE = 4
 
dict SECTOR_KEYWORDS
 
set FINANCIAL_SECTORS
 
int FINANCIAL_PRIORITY_MULTIPLIER = 3
 
dict FUND_TYPE_KEYWORDS
 

Detailed Description

Extract a diverse subset from the GLEIF LEI dataset.

This script creates a smaller, representative subset of the LEI data that includes:
- Entities from as many countries as possible
- All entity categories (GENERAL, FUND, SOLE_PROPRIETOR, etc.)
- Entities with different relationship depths (0, 1, 2, 3, 4, 5+ children)
- Diversity across detected sectors (banks, insurance, funds, energy, etc.)
- Various legal forms and fund types

Usage:
    python lei_extract_subset.py --size small   # Small subset for quick testing
    python lei_extract_subset.py --size large   # Larger subset for comprehensive testing
    python lei_extract_subset.py --download     # Download latest data first
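
The two options shown above could be parsed roughly as follows. This is a minimal sketch, not the script's actual parser: the function name build_parser_sketch, the default of "small", and the help strings are assumptions, and the real script may accept further options.

```python
import argparse


def build_parser_sketch() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the options listed under Usage."""
    parser = argparse.ArgumentParser(
        description="Extract a diverse subset from the GLEIF LEI dataset."
    )
    # Assumption: 'small' is the default; the documentation does not say.
    parser.add_argument(
        "--size",
        choices=["small", "large"],
        default="small",
        help="subset size to generate",
    )
    # Boolean flag: present means download the latest data first.
    parser.add_argument(
        "--download",
        action="store_true",
        help="download the latest GLEIF data before extracting",
    )
    return parser


if __name__ == "__main__":
    args = build_parser_sketch().parse_args()
    print(args.size, args.download)
```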

Output files are written to external/lei/:
    - <original_lei_file>-subset-<size>.csv
    - <original_rr_file>-subset-<size>.csv

Function Documentation

◆ get_lei_data_dir()

Path get_lei_data_dir ( )
Get the path to external/lei directory.

◆ try_gleif_api()

Tuple[Optional[str], Optional[str]] try_gleif_api ( str  api_base,
str  date_str 
)
Try to get URLs from a GLEIF API endpoint.

◆ try_direct_url()

Optional[str] try_direct_url ( str  base_url,
int  file_id,
datetime  date,
str  file_type 
)
Try to access a direct URL with a specific file ID.

◆ discover_gleif_urls()

Tuple[str, str] discover_gleif_urls ( Optional[datetime]   target_date = None)
Discover the download URLs for LEI and RR files from GLEIF.

Tries multiple strategies:
1. Query GLEIF API endpoints
2. Try direct URL construction with ID probing

URL pattern: https://leidata-preview.gleif.org/storage/golden-copy-files/YYYY/MM/DD/ID/filename.csv.zip

Args:
    target_date: Date to download (defaults to today)

Returns:
    Tuple of (lei_url, rr_url)
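
The URL pattern above can be illustrated with a small helper. This is a sketch only: build_candidate_url is a hypothetical name, and the file ID and filename passed in the usage line are placeholders, since real values have to come from the GLEIF API or from ID probing.

```python
from datetime import datetime

GLEIF_DOWNLOAD_BASE = "https://leidata-preview.gleif.org/storage/golden-copy-files"


def build_candidate_url(base_url: str, file_id: int, date: datetime, filename: str) -> str:
    """Fill in the documented pattern base/YYYY/MM/DD/ID/filename."""
    # strftime-style format spec gives zero-padded month and day.
    return f"{base_url}/{date:%Y/%m/%d}/{file_id}/{filename}"


if __name__ == "__main__":
    # Placeholder ID and filename, for illustration only.
    print(build_candidate_url(GLEIF_DOWNLOAD_BASE, 1, datetime(2024, 1, 5), "example.csv.zip"))
```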

◆ download_and_extract()

str download_and_extract ( str  url,
Path  output_dir 
)
Download a zip file and extract the CSV.

Args:
    url: URL to download
    output_dir: Directory to save the extracted file

Returns:
    Path to the extracted CSV file
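
The extraction half of this step might look like the sketch below. It works on bytes already in memory so that it can run without network access; the name extract_first_csv and the choice to take the first CSV member of the archive are assumptions, not the script's confirmed behaviour.

```python
import io
import shutil
import zipfile
from pathlib import Path


def extract_first_csv(zip_bytes: bytes, output_dir: Path) -> str:
    """Write the first .csv member of a zip archive into output_dir."""
    output_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("archive contains no CSV member")
        member = csv_names[0]
        # Drop any directory prefix stored inside the archive.
        target = output_dir / Path(member).name
        with archive.open(member) as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return str(target)
```

The bytes could be obtained with urllib.request.urlopen(url).read(); for multi-gigabyte archives, streaming the download to a temporary file first would be the safer choice.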

◆ download_gleif_data()

Tuple[str, str] download_gleif_data ( Path  output_dir,
Optional[str]   lei_url = None,
Optional[str]   rr_url = None,
Optional[datetime]   target_date = None 
)
Download the latest GLEIF golden copy files.

Args:
    output_dir: Directory to save files
    lei_url: Optional explicit URL for LEI file
    rr_url: Optional explicit URL for RR file
    target_date: Target date for downloads (defaults to today)

Returns:
    Tuple of (lei_file_path, rr_file_path)

◆ load_anchor_leis()

Set[str] load_anchor_leis ( Path  data_dir)
Load anchor LEIs from anchor_leis.json.

The JSON file is generated by lei_extract_anchor_leis.py from regulatory
lists (ECB SSM, UK PRA) plus hand-curated G-SIB LEIs.

◆ count_lines()

int count_lines ( str  filepath)
Count lines in a file efficiently.
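
One common way to count lines without loading a large file into memory is to read it in binary chunks and count newline bytes. The sketch below assumes that approach; count_lines_sketch and its chunk_size parameter are hypothetical, and the script may do this differently.

```python
def count_lines_sketch(filepath: str, chunk_size: int = 1024 * 1024) -> int:
    """Count newline bytes by reading the file in fixed-size binary chunks."""
    count = 0
    with open(filepath, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    # Note: a final line without a trailing newline is not counted.
    return count
```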

◆ detect_sector()

str detect_sector ( str  name)
Detect sector from entity name using keywords.

◆ detect_fund_type()

str detect_fund_type ( str  name)
Detect fund type from fund entity name.
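
Keyword-based detection of this kind can be sketched as a case-insensitive substring search over a keyword table. The excerpt below copies three entries from FUND_TYPE_KEYWORDS; the first-match-wins order, the substring matching, and the "OTHER" fallback are assumptions for illustration.

```python
from typing import Dict, List

# Excerpt of FUND_TYPE_KEYWORDS as documented on this page.
FUND_TYPE_KEYWORDS_EXCERPT: Dict[str, List[str]] = {
    'ETF': ['ETF', 'EXCHANGE TRADED', 'EXCHANGE-TRADED'],
    'MONEY_MARKET': ['MONEY MARKET', 'LIQUIDITY', 'CASH FUND'],
    'BOND': ['BOND', 'FIXED INCOME', 'DEBT', 'CREDIT', 'HIGH YIELD', 'TREASURY'],
}


def detect_fund_type_sketch(
    name: str,
    keywords: Dict[str, List[str]] = FUND_TYPE_KEYWORDS_EXCERPT,
    default: str = "OTHER",  # hypothetical fallback label
) -> str:
    """Return the first fund type whose keyword occurs in the upper-cased name."""
    upper = name.upper()
    for fund_type, words in keywords.items():
        if any(word in upper for word in words):
            return fund_type
    return default
```

Plain substring matching is simple but can misfire on short keywords embedded in longer words, so a word-boundary match may be preferable in practice.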

◆ find_data_files()

Tuple[str, str] find_data_files ( Path  directory)
Find the LEI and RR data files in the directory.

◆ get_subset_filename()

str get_subset_filename ( str  original,
str  size 
)
Generate subset filename by inserting '-subset-<size>' before .csv extension.
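
A minimal sketch of this naming rule, assuming names that do not end in .csv simply get the suffix appended (the documentation does not cover that case):

```python
def subset_filename_sketch(original: str, size: str) -> str:
    """Insert '-subset-<size>' before a trailing .csv extension."""
    suffix = ".csv"
    if original.lower().endswith(suffix):
        stem = original[: -len(suffix)]
        # Preserve the extension exactly as written (e.g. '.CSV').
        return f"{stem}-subset-{size}{original[-len(suffix):]}"
    # Assumption: without a .csv extension, append the suffix at the end.
    return f"{original}-subset-{size}"
```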

◆ build_relationship_map()

Tuple[Dict[str, List[str]], Dict[str, str]] build_relationship_map ( str  rr_file)
Build two maps: parent LEI -> list of child LEIs, and child LEI -> parent LEI.

Uses IS_DIRECTLY_CONSOLIDATED_BY relationships where:
- StartNode (column 0) is the child
- EndNode (column 2) is the parent
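
The mapping described above can be sketched as follows. The function takes already-parsed rows rather than a file path so that it is easy to test; relationship_maps_sketch is a hypothetical name, and the handling of short rows and of a child with several parents (last one wins) is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

RR_COL_START_NODE = 0
RR_COL_END_NODE = 2
RR_COL_RELATIONSHIP_TYPE = 4


def relationship_maps_sketch(
    rows: Iterable[List[str]],
) -> Tuple[Dict[str, List[str]], Dict[str, str]]:
    """Return (parent_to_children, child_to_parent) for direct consolidation."""
    parent_to_children: Dict[str, List[str]] = defaultdict(list)
    child_to_parent: Dict[str, str] = {}
    for row in rows:
        if len(row) <= RR_COL_RELATIONSHIP_TYPE:
            continue  # skip short or malformed rows
        if row[RR_COL_RELATIONSHIP_TYPE] != "IS_DIRECTLY_CONSOLIDATED_BY":
            continue  # also skips a header row, whose text will not match
        child = row[RR_COL_START_NODE]
        parent = row[RR_COL_END_NODE]
        parent_to_children[parent].append(child)
        child_to_parent[child] = parent
    return dict(parent_to_children), child_to_parent
```

With a real file, the rows could be supplied by csv.reader over a handle opened with newline="" and an explicit encoding.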

◆ analyze_entities()

Dict[str, EntityInfo] analyze_entities ( str  lei_file,
Dict[str, List[str]]  parent_to_children,
Dict[str, str]  child_to_parent 
)
Read and analyze all entities, collecting classification info.
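
The column index variables documented on this page suggest how a CSV row is turned into classification fields. The sketch below shows that lookup for a single row; row_to_fields_sketch and the dictionary keys are hypothetical, and EntityInfo itself is not reproduced because its fields are not shown here.

```python
from typing import Dict, List, Optional

# Column indices as documented for the LEI file.
LEI_COL_ID = 0
LEI_COL_NAME = 1
LEI_COL_COUNTRY = 43
LEI_COL_CATEGORY = 191
LEI_COL_STATUS = 199


def row_to_fields_sketch(row: List[str]) -> Optional[Dict[str, str]]:
    """Pick the classification columns out of one LEI CSV row."""
    if len(row) <= LEI_COL_STATUS:
        return None  # row too short to contain every indexed column
    return {
        "lei": row[LEI_COL_ID],
        "name": row[LEI_COL_NAME],
        "country": row[LEI_COL_COUNTRY],
        "category": row[LEI_COL_CATEGORY],
        "status": row[LEI_COL_STATUS],
    }
```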

◆ select_subset()

Set[str] select_subset ( Dict[str, EntityInfo]  entities,
Dict[str, List[str]]  parent_to_children,
Dict[str, str]  child_to_parent,
Config  config,
Set[str]   anchor_leis = None 
)
Select a diverse subset of LEIs based on multiple dimensions.
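
One way to read "multiple dimensions" is a per-group quota applied independently along each dimension, with the results unioned. The sketch below illustrates that idea only: entities are plain dictionaries rather than EntityInfo objects, the quota mapping from attribute name to limit is hypothetical, and the real function may rank, prioritise, or expand selections differently.

```python
from collections import defaultdict
from typing import Dict, Optional, Set


def pick_per_group(entities: Dict[str, Dict[str, str]], key: str, limit: int) -> Set[str]:
    """Take at most `limit` LEIs for each distinct value of entities[lei][key]."""
    taken: Dict[str, int] = defaultdict(int)
    selected: Set[str] = set()
    for lei in sorted(entities):  # sorted for a deterministic result
        group = entities[lei].get(key)
        if group is None:
            continue
        if taken[group] < limit:
            taken[group] += 1
            selected.add(lei)
    return selected


def select_subset_sketch(
    entities: Dict[str, Dict[str, str]],
    quotas: Dict[str, int],
    anchor_leis: Optional[Set[str]] = None,
) -> Set[str]:
    """Union of per-dimension picks, plus any anchor LEIs present in the data."""
    # Assumption: anchors missing from the dataset are silently ignored.
    selected = set(anchor_leis or ()) & set(entities)
    for key, limit in quotas.items():
        selected |= pick_per_group(entities, key, limit)
    return selected
```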

◆ write_subset()

write_subset ( str  lei_file,
str  rr_file,
Set[str]  selected_leis,
str  output_lei,
str   output_rr 
)
Write the subset files.

Variable Documentation

◆ SUBSET_SIZES

dict SUBSET_SIZES
Initial value:
= {
    'small': {
        'per_country': 20,
        'per_depth': 10,
        'per_sector': 15,
        'per_category': 20,
        'per_fund_type': 10,
        'per_legal_form': 5,
    },
    'large': {
        'per_country': 60,
        'per_depth': 30,
        'per_sector': 40,
        'per_category': 60,
        'per_fund_type': 30,
        'per_legal_form': 10,
    },
}

◆ GLEIF_API_ENDPOINTS

list GLEIF_API_ENDPOINTS
Initial value:
= [
    "https://api.gleif.org/api/v1/golden-copies/publishes",
    "https://goldencopy.gleif.org/api/v2/golden-copies/publishes",
]

◆ FINANCIAL_SECTORS

set FINANCIAL_SECTORS
Initial value:
= {
    'CENTRAL_BANK', 'BANK', 'INSURANCE', 'INVESTMENT_FUND', 'ETF',
    'HEDGE_FUND', 'PRIVATE_EQUITY', 'PENSION', 'ASSET_MANAGEMENT',
    'BROKER_DEALER', 'CUSTODY_CLEARING', 'PAYMENTS_FINTECH',
    'MORTGAGE_LENDING', 'TRUST_FIDUCIARY', 'CAPITAL_MARKETS',
}

◆ FUND_TYPE_KEYWORDS

dict FUND_TYPE_KEYWORDS
Initial value:
= {
    'ETF': ['ETF', 'EXCHANGE TRADED', 'EXCHANGE-TRADED'],
    'MONEY_MARKET': ['MONEY MARKET', 'LIQUIDITY', 'CASH FUND'],
    'BOND': ['BOND', 'FIXED INCOME', 'DEBT', 'CREDIT', 'HIGH YIELD', 'TREASURY'],
    'EQUITY': ['EQUITY', 'STOCK', 'SHARES', 'GROWTH FUND', 'VALUE FUND'],
    'INDEX': ['INDEX', 'TRACKER', 'PASSIVE'],
    'BALANCED': ['BALANCED', 'MIXED', 'MULTI-ASSET'],
    'REAL_ESTATE': ['REAL ESTATE', 'REIT', 'PROPERTY'],
    'COMMODITY': ['COMMODITY', 'GOLD', 'PRECIOUS METALS'],
    'EMERGING_MARKETS': ['EMERGING', 'FRONTIER', 'DEVELOPING'],
    'GLOBAL': ['GLOBAL', 'WORLD', 'INTERNATIONAL'],
}