ORE Studio: src.lei_extract_subset Namespace Reference

ORE Studio 0.0.4

Classes
class	DownloadProgressReporter

class	EntityInfo

class	ProgressReporter

Functions
Path	get_lei_data_dir ()

Tuple[Optional[str], Optional[str]]	try_gleif_api (str api_base, str date_str)

Optional[str]	try_direct_url (str base_url, int file_id, datetime date, str file_type)

Tuple[str, str]	discover_gleif_urls (Optional[datetime] target_date=None)

str	download_and_extract (str url, Path output_dir)

Tuple[str, str]	download_gleif_data (Path output_dir, Optional[str] lei_url=None, Optional[str] rr_url=None, Optional[datetime] target_date=None)

Set[str]	load_anchor_leis (Path data_dir)

int	count_lines (str filepath)

str	detect_sector (str name)

str	detect_fund_type (str name)

Tuple[str, str]	find_data_files (Path directory)

str	get_subset_filename (str original, str size)

Tuple[Dict[str, List[str]], Dict[str, str]]	build_relationship_map (str rr_file)

Dict[str, EntityInfo]	analyze_entities (str lei_file, Dict[str, List[str]] parent_to_children, Dict[str, str] child_to_parent)

Set[str]	select_subset (Dict[str, EntityInfo] entities, Dict[str, List[str]] parent_to_children, Dict[str, str] child_to_parent, Config config, Set[str] anchor_leis=None)

	write_subset (str lei_file, str rr_file, Set[str] selected_leis, str output_lei, str output_rr)

	main ()

Variables
dict	SUBSET_SIZES

list	GLEIF_API_ENDPOINTS

str	GLEIF_DOWNLOAD_BASE = "https://leidata-preview.gleif.org/storage/golden-copy-files"

int	LEI_COL_ID = 0

int	LEI_COL_NAME = 1

int	LEI_COL_COUNTRY = 43

int	LEI_COL_CATEGORY = 191

int	LEI_COL_SUBCATEGORY = 192

int	LEI_COL_LEGAL_FORM_CODE = 193

int	LEI_COL_OTHER_LEGAL_FORM = 194

int	LEI_COL_STATUS = 199

int	RR_COL_START_NODE = 0

int	RR_COL_END_NODE = 2

int	RR_COL_RELATIONSHIP_TYPE = 4

dict	SECTOR_KEYWORDS

set	FINANCIAL_SECTORS

int	FINANCIAL_PRIORITY_MULTIPLIER = 3

dict	FUND_TYPE_KEYWORDS

Detailed Description

Extract a diverse subset from the GLEIF LEI dataset.

This script creates a smaller, representative subset of the LEI data that includes:
- Entities from as many countries as possible
- All entity categories (GENERAL, FUND, SOLE_PROPRIETOR, etc.)
- Entities with different relationship depths (0, 1, 2, 3, 4, 5+ children)
- Diversity across detected sectors (banks, insurance, funds, energy, etc.)
- Various legal forms and fund types

Usage:
    python lei_extract_subset.py --size small   # Small subset for quick testing
    python lei_extract_subset.py --size large   # Larger subset for comprehensive testing
    python lei_extract_subset.py --download     # Download latest data first

Output files are written to external/lei/:
    - <original_lei_file>-subset-<size>.csv
    - <original_rr_file>-subset-<size>.csv
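The command lines above could be parsed along the lines of the following sketch. Only the `--size` and `--download` flags and the `small`/`large` values come from the usage text; the function name `build_parser`, the default size, and the help strings are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the usage text."""
    parser = argparse.ArgumentParser(
        description="Extract a diverse subset from the GLEIF LEI dataset."
    )
    # 'small' and 'large' are the two sizes listed in SUBSET_SIZES.
    parser.add_argument(
        "--size",
        choices=["small", "large"],
        default="small",  # assumption: the script's real default is not documented
        help="Subset size to generate.",
    )
    # Boolean switch: present means download the data before extracting.
    parser.add_argument(
        "--download",
        action="store_true",
        help="Download the latest data first.",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.size, args.download)
```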

Function Documentation

◆ get_lei_data_dir()

Path get_lei_data_dir ( )

Get the path to the external/lei directory.

◆ try_gleif_api()

Tuple[Optional[str], Optional[str]] try_gleif_api	(	str	api_base,
		str	date_str
	)

Try to get URLs from a GLEIF API endpoint.

◆ try_direct_url()

Optional[str] try_direct_url	(	str	base_url,
		int	file_id,
		datetime	date,
		str	file_type
	)

Try to access a direct URL with a specific file ID.

◆ discover_gleif_urls()

Tuple[str, str] discover_gleif_urls ( Optional[datetime] target_date = None )

Discover the download URLs for LEI and RR files from GLEIF.

Tries multiple strategies:
1. Query GLEIF API endpoints
2. Try direct URL construction with ID probing

URL pattern: https://leidata-preview.gleif.org/storage/golden-copy-files/YYYY/MM/DD/ID/filename.csv.zip

Args:
    target_date: Date to download (defaults to today)

Returns:
    Tuple of (lei_url, rr_url)
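The URL pattern above can be illustrated with a small sketch of how candidate URLs for ID probing might be built. The helper names `build_candidate_url` and `candidate_urls` are hypothetical, and the range of IDs to probe is an assumption; the page does not say which IDs the script actually tries.

```python
from datetime import datetime
from typing import Iterable, Iterator


def build_candidate_url(base_url: str, date: datetime, file_id: int, filename: str) -> str:
    """Fill in the documented pattern: base/YYYY/MM/DD/ID/filename."""
    return f"{base_url.rstrip('/')}/{date:%Y/%m/%d}/{file_id}/{filename}"


def candidate_urls(
    base_url: str, date: datetime, file_ids: Iterable[int], filename: str
) -> Iterator[str]:
    """Yield one candidate URL per file ID (assumption: IDs are probed in order)."""
    for file_id in file_ids:
        yield build_candidate_url(base_url, date, file_id, filename)


if __name__ == "__main__":
    base = "https://leidata-preview.gleif.org/storage/golden-copy-files"
    # 'filename.csv.zip' is the placeholder from the pattern, not a real file name.
    for url in candidate_urls(base, datetime(2024, 1, 5), range(1, 4), "filename.csv.zip"):
        print(url)
```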

◆ download_and_extract()

str download_and_extract	(	str	url,
		Path	output_dir
	)

Download a zip file and extract the CSV.

Args:
    url: URL to download
    output_dir: Directory to save the extracted file

Returns:
    Path to the extracted CSV file
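The extraction half of this step might look like the sketch below, which takes the already-downloaded archive as bytes so that it runs without network access. The name `extract_first_csv` and the choice to take the first `.csv` member are assumptions; the actual function may select the member differently.

```python
import io
import zipfile
from pathlib import Path


def extract_first_csv(zip_bytes: bytes, output_dir: Path) -> str:
    """Extract the first .csv member of a zip archive and return its path."""
    output_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("archive contains no CSV file")
        # ZipFile.extract returns the path of the file it created.
        extracted = archive.extract(csv_names[0], path=output_dir)
    return str(extracted)
```

The bytes could come from any HTTP client, for example `urllib.request.urlopen(url).read()`.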

◆ download_gleif_data()

Tuple[str, str] download_gleif_data	(	Path	output_dir,
		Optional[str]	lei_url = None,
		Optional[str]	rr_url = None,
		Optional[datetime]	target_date = None
	)

Download the latest GLEIF golden copy files.

Args:
    output_dir: Directory to save files
    lei_url: Optional explicit URL for LEI file
    rr_url: Optional explicit URL for RR file
    target_date: Target date for downloads (defaults to today)

Returns:
    Tuple of (lei_file_path, rr_file_path)

◆ load_anchor_leis()

Set[str] load_anchor_leis ( Path data_dir )

Load anchor LEIs from anchor_leis.json.

The JSON file is generated by lei_extract_anchor_leis.py from regulatory
lists (ECB SSM, UK PRA) plus hand-curated G-SIB LEIs.

◆ count_lines()

int count_lines ( str filepath )

Count lines in a file efficiently.
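One common way to count lines without loading the whole file is to read fixed-size binary chunks and count newline bytes. The sketch below shows that technique; the name `count_lines_chunked` and the chunk size are assumptions, and the real `count_lines` may work differently.

```python
def count_lines_chunked(filepath: str, chunk_size: int = 1024 * 1024) -> int:
    """Count newline bytes by streaming the file in binary chunks."""
    count = 0
    with open(filepath, "rb") as handle:
        # read() returns b"" at end of file, which ends the loop.
        while chunk := handle.read(chunk_size):
            count += chunk.count(b"\n")
    return count
```

Note that this counts newline characters, so a final line without a trailing newline is not included.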

◆ detect_sector()

str detect_sector ( str name )

Detect sector from entity name using keywords.

◆ detect_fund_type()

str detect_fund_type ( str name )

Detect fund type from fund entity name.
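A keyword-based classifier of this kind can be sketched as follows. The three keyword lists are copied from FUND_TYPE_KEYWORDS; the function name `classify_fund_name`, the case-insensitive substring matching, the first-match-wins order, and the `"OTHER"` fallback are all assumptions, since the page does not document how `detect_fund_type` resolves matches.

```python
from typing import Dict, List

# Excerpt of FUND_TYPE_KEYWORDS as listed in the variable documentation.
FUND_TYPE_KEYWORDS_SAMPLE: Dict[str, List[str]] = {
    "ETF": ["ETF", "EXCHANGE TRADED", "EXCHANGE-TRADED"],
    "MONEY_MARKET": ["MONEY MARKET", "LIQUIDITY", "CASH FUND"],
    "BOND": ["BOND", "FIXED INCOME", "DEBT", "CREDIT", "HIGH YIELD", "TREASURY"],
}


def classify_fund_name(
    name: str,
    keywords: Dict[str, List[str]] = FUND_TYPE_KEYWORDS_SAMPLE,
    default: str = "OTHER",  # assumption: the real fallback value is not documented
) -> str:
    """Return the first fund type whose keyword occurs in the upper-cased name."""
    upper = name.upper()
    for fund_type, words in keywords.items():
        if any(word in upper for word in words):
            return fund_type
    return default
```

Plain substring matching is simple but coarse: a short keyword such as `ETF` also matches inside longer words, so the order of the dictionary entries affects the result.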

◆ find_data_files()

Tuple[str, str] find_data_files ( Path directory )

Find the LEI and RR data files in the directory.

◆ get_subset_filename()

str get_subset_filename	(	str	original,
		str	size
	)

Generate subset filename by inserting '-subset-<size>' before .csv extension.
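The naming rule can be sketched in a few lines. The name `subset_filename` is hypothetical, and appending the suffix when the input has no `.csv` extension is an assumption about an undocumented case.

```python
def subset_filename(original: str, size: str) -> str:
    """Insert '-subset-<size>' before a trailing .csv extension."""
    suffix = f"-subset-{size}"
    if original.lower().endswith(".csv"):
        # Split off the 4-character extension and re-attach it after the suffix.
        return f"{original[:-4]}{suffix}{original[-4:]}"
    # Assumption: without a .csv extension, the suffix is simply appended.
    return original + suffix
```

For example, `subset_filename("lei-records.csv", "small")` returns `"lei-records-subset-small.csv"`.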

◆ build_relationship_map()

Tuple[Dict[str, List[str]], Dict[str, str]] build_relationship_map ( str rr_file )

Build two maps: parent LEI -> list of child LEIs, and child LEI -> parent LEI.

Uses IS_DIRECTLY_CONSOLIDATED_BY relationships where:
- StartNode (column 0) is the child
- EndNode (column 2) is the parent
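The mapping described above might be built as in the following sketch. The column indices match the RR_COL_* constants on this page; the function name `build_maps`, the assumption that the file has a header row, and the rule that a later row overwrites an earlier parent for the same child are illustrative choices, not documented behavior.

```python
import csv
from collections import defaultdict
from typing import Dict, List, Tuple

RR_COL_START_NODE = 0         # child LEI
RR_COL_END_NODE = 2           # parent LEI
RR_COL_RELATIONSHIP_TYPE = 4


def build_maps(rr_file: str) -> Tuple[Dict[str, List[str]], Dict[str, str]]:
    """Return (parent_to_children, child_to_parent) for direct consolidation rows."""
    parent_to_children: Dict[str, List[str]] = defaultdict(list)
    child_to_parent: Dict[str, str] = {}
    with open(rr_file, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # assumption: the first row is a header
        for row in reader:
            if len(row) <= RR_COL_RELATIONSHIP_TYPE:
                continue  # skip short or malformed rows
            if row[RR_COL_RELATIONSHIP_TYPE] != "IS_DIRECTLY_CONSOLIDATED_BY":
                continue
            child, parent = row[RR_COL_START_NODE], row[RR_COL_END_NODE]
            parent_to_children[parent].append(child)
            child_to_parent[child] = parent
    return dict(parent_to_children), child_to_parent
```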

◆ analyze_entities()

Dict[str, EntityInfo] analyze_entities	(	str	lei_file,
		Dict[str, List[str]]	parent_to_children,
		Dict[str, str]	child_to_parent
	)

Read and analyze all entities, collecting classification info.

◆ select_subset()

Set[str] select_subset	(	Dict[str, EntityInfo]	entities,
		Dict[str, List[str]]	parent_to_children,
		Dict[str, str]	child_to_parent,
		Config	config,
		Set[str]	anchor_leis = None
	)

Select a diverse subset of LEIs based on multiple dimensions.
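One way to select across several dimensions is to apply a per-bucket quota for each dimension and take the union of the results. The sketch below illustrates that idea only. The helper `pick_per_bucket`, the plain-dict entity records, and the field names `country` and `category` are hypothetical, and the real function's ordering, anchor handling, and use of FINANCIAL_PRIORITY_MULTIPLIER are not shown on this page.

```python
from collections import defaultdict
from typing import Callable, Dict, Set


def pick_per_bucket(
    entities: Dict[str, dict], key: Callable[[dict], str], limit: int
) -> Set[str]:
    """Keep at most `limit` LEIs for each distinct value returned by `key`."""
    counts: Dict[str, int] = defaultdict(int)
    selected: Set[str] = set()
    for lei in sorted(entities):  # sorted so the result is deterministic
        bucket = key(entities[lei])
        if counts[bucket] < limit:
            counts[bucket] += 1
            selected.add(lei)
    return selected


if __name__ == "__main__":
    # Hypothetical records; the real code uses EntityInfo objects.
    sample = {
        "A": {"country": "US", "category": "FUND"},
        "B": {"country": "US", "category": "GENERAL"},
        "C": {"country": "US", "category": "GENERAL"},
        "D": {"country": "DE", "category": "GENERAL"},
    }
    by_country = pick_per_bucket(sample, lambda e: e["country"], limit=1)
    by_category = pick_per_bucket(sample, lambda e: e["category"], limit=1)
    print(sorted(by_country | by_category))
```

The limits would presumably come from the `per_country`, `per_category`, and similar entries in SUBSET_SIZES.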

◆ write_subset()

write_subset	(	str	lei_file,
		str	rr_file,
		Set[str]	selected_leis,
		str	output_lei,
		str	output_rr
	)

Write the subset files.
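Writing a subset amounts to copying the header and the matching rows of each input file. A minimal sketch, assuming both files have a header row, is shown below. The helper `copy_matching_rows` is hypothetical, and the rule for relationship rows (keep a row only when both its start and end nodes are selected) is an assumption rather than documented behavior.

```python
import csv
from typing import Callable, List


def copy_matching_rows(src: str, dst: str, keep: Callable[[List[str]], bool]) -> int:
    """Copy the header plus every row for which keep(row) is true; return rows kept."""
    kept = 0
    with open(src, newline="", encoding="utf-8") as fin, \
         open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        header = next(reader, None)  # assumption: the first row is a header
        if header is not None:
            writer.writerow(header)
        for row in reader:
            if row and keep(row):
                writer.writerow(row)
                kept += 1
    return kept
```

With a set `selected_leis`, the LEI file could be filtered with `lambda row: row[0] in selected_leis` and the RR file with `lambda row: row[0] in selected_leis and row[2] in selected_leis`, using the column indices from LEI_COL_ID, RR_COL_START_NODE, and RR_COL_END_NODE.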

Variable Documentation

◆ SUBSET_SIZES

dict SUBSET_SIZES

Initial value:

=  {
    'small': {
        'per_country': 20,
        'per_depth': 10,
        'per_sector': 15,
        'per_category': 20,
        'per_fund_type': 10,
        'per_legal_form': 5,
    },
    'large': {
        'per_country': 60,
        'per_depth': 30,
        'per_sector': 40,
        'per_category': 60,
        'per_fund_type': 30,
        'per_legal_form': 10,
    },
}

◆ GLEIF_API_ENDPOINTS

list GLEIF_API_ENDPOINTS

Initial value:

=  [
    "https://api.gleif.org/api/v1/golden-copies/publishes",
    "https://goldencopy.gleif.org/api/v2/golden-copies/publishes",
]

◆ FINANCIAL_SECTORS

set FINANCIAL_SECTORS

Initial value:

=  {
    'CENTRAL_BANK', 'BANK', 'INSURANCE', 'INVESTMENT_FUND', 'ETF',
    'HEDGE_FUND', 'PRIVATE_EQUITY', 'PENSION', 'ASSET_MANAGEMENT',
    'BROKER_DEALER', 'CUSTODY_CLEARING', 'PAYMENTS_FINTECH',
    'MORTGAGE_LENDING', 'TRUST_FIDUCIARY', 'CAPITAL_MARKETS',
}

◆ FUND_TYPE_KEYWORDS

dict FUND_TYPE_KEYWORDS

Initial value:

=  {
    'ETF': ['ETF', 'EXCHANGE TRADED', 'EXCHANGE-TRADED'],
    'MONEY_MARKET': ['MONEY MARKET', 'LIQUIDITY', 'CASH FUND'],
    'BOND': ['BOND', 'FIXED INCOME', 'DEBT', 'CREDIT', 'HIGH YIELD', 'TREASURY'],
    'EQUITY': ['EQUITY', 'STOCK', 'SHARES', 'GROWTH FUND', 'VALUE FUND'],
    'INDEX': ['INDEX', 'TRACKER', 'PASSIVE'],
    'BALANCED': ['BALANCED', 'MIXED', 'MULTI-ASSET'],
    'REAL_ESTATE': ['REAL ESTATE', 'REIT', 'PROPERTY'],
    'COMMODITY': ['COMMODITY', 'GOLD', 'PRECIOUS METALS'],
    'EMERGING_MARKETS': ['EMERGING', 'FRONTIER', 'DEVELOPING'],
    'GLOBAL': ['GLOBAL', 'WORLD', 'INTERNATIONAL'],
}