# Legacy System Archival — Do's and Don'ts
A practitioner's guide for vendors and project teams.
## 1 Planning & Scoping
### Do
- Inventory the data source before starting. Count tables, views, stored procedures, and total size. For file-based systems (FoxPro DBF, Firebird FDB), count files, total size, and identify memo/index files. This sets expectations and helps estimate effort (see the inventory sketch after this list).
- Identify the system type and database technology early. Each technology has different extraction tools and challenges:
  - SQL Server (SEAGHA, CHARISMA, SAGE): use standard SQL tools, pyodbc, SSMS
  - Firebird (FINAWIN): may require specialized drivers or raw binary parsing if custom collations block SQL
  - Visual FoxPro DBF (TAAVI): use dbfread (Python), handle CP1257/CP1250 Baltic/CEE encodings
  - Vendor-managed platforms (MICROSCOP/Mikrocop): no direct database access — must engage vendor
- Identify the record types in the system (e.g., Import/Export/Transit declarations, payroll/HR/contracts, invoices). Each type may have its own table structure, views, and output format.
- Get sample exports early. Obtain Excel, CSV, or PDF files that represent the expected output format. These are your specification — the extraction must reproduce them exactly. For payroll systems, get sample payslips; for customs, get sample declaration exports.
- Identify the date field used for filtering. In customs systems this is typically the acceptance date (`DatAanv`), not the creation or modification date. In payroll systems, it is the pay period. In HR systems, it is the contract date or employment start date.
- Document data volumes per type and per year upfront. This determines whether you can export everything at once or need to work in chunks.
- Check if the database has existing views or stored procedures that produce report formats. These are your building blocks — do not reinvent them. SEAGHA had 164 views that were essential; FINAWIN had 650 stored procedures; CHARISMA had complex salary calculation procedures.
- Identify character encoding requirements immediately. Legacy systems in Europe frequently use non-standard encodings:
  - Latvian: UNICODE_LV custom collation (FINAWIN)
  - Estonian: CP1257 Baltic Windows codepage (TAAVI)
  - Slovenian: EBCDIC mainframe encoding (MICROSCOP)
  - Romanian: Windows-1250 Central European (CHARISMA)
  - Always verify that diacritics and special characters survive extraction.
- Map the complete data flow. Understand where data originates, where it is processed, and where it ends up. Some data may only exist at a vendor (MICROSCOP: Mikrocop held the full 10-year archive, not FedEx).
- Identify regulatory retention periods by country before starting:
  - Belgium: customs declarations — 7 years
  - Estonia: employment contracts — employment + 10 years; payroll — 7 years
  - Latvia: payroll and personnel records — 75 years for some categories
  - Slovenia: invoices — 10 years
  - Romania: payroll records — 50 years for pension-relevant data
  - Always confirm with local legal/compliance — do not guess.
- Engage the business stakeholders early. They know which data matters, which reports they actually use, and what questions auditors ask. They are the specification, not the database schema.
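The inventory step above can be scripted against the system catalog. A minimal sketch for a SQL Server source, assuming `pyodbc` and placeholder server and database names; file-based systems (DBF, FDB) would use a directory walk instead:

```python
# Inventory sketch for a SQL Server source (DSN values are assumptions; adapt per system).
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=staging-server;DATABASE=LEGACY_DB;Trusted_Connection=yes;"
)
cur = conn.cursor()

# Total database size
cur.execute("EXEC sp_spaceused")
print(cur.fetchone())  # (database_name, database_size, unallocated_space)

# Object counts: U = user table, V = view, P = stored procedure
cur.execute("""
    SELECT type_desc, COUNT(*) FROM sys.objects
    WHERE type IN ('U', 'V', 'P') GROUP BY type_desc
""")
for type_desc, n in cur.fetchall():
    print(f"{type_desc}: {n}")

# Row counts per table, largest first
cur.execute("""
    SELECT t.name, SUM(p.rows) AS row_count
    FROM sys.tables t
    JOIN sys.partitions p ON p.object_id = t.object_id AND p.index_id IN (0, 1)
    GROUP BY t.name ORDER BY row_count DESC
""")
for name, rows in cur.fetchall():
    print(f"{name}: {rows:,} rows")
```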
### Don't
- Don't start writing extraction code without a sample output file. You will waste time guessing column order, data transformations, and join logic.
- Don't assume the database schema documentation is complete or current. Always verify against the actual DDL (`CREATE TABLE`/`CREATE VIEW` statements) or the live database structure.
- Don't underestimate the size. A 20 GB SQL Server database can produce 5+ GB of text output. A 2 GB Firebird database with BLOBs can produce hundreds of extracted files. Plan for storage and transfer.
- Don't plan to export the entire database in one query or one pass. Break it down by type, by sheet/report, and optionally by date range (see the chunking sketch after this list).
- Don't assume all data is in-house. Vendor-managed systems (MICROSCOP) may hold the authoritative archive externally. Discover this in planning, not during extraction.
- Don't assume someone on the team knows the system. Legacy system knowledge often resides with departed employees. Document everything from the first conversation.
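To illustrate the chunked approach from the Don'ts above, a minimal sketch; the view names, date column (`DatAanv`), connection details, and year range are placeholders:

```python
# Chunked export sketch: per record type, per year, never in one pass.
import pandas as pd
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=staging-server;DATABASE=LEGACY_DB;Trusted_Connection=yes;"
)

# Hypothetical per-type queries; real ones would come from the existing views
RECORD_TYPES = {
    "Import": "SELECT * FROM ImportMainView WHERE DatAanv >= ? AND DatAanv < ?",
    "Export": "SELECT * FROM ExportMainView WHERE DatAanv >= ? AND DatAanv < ?",
}

for year in range(2003, 2023):
    bounds = (f"{year}-01-01", f"{year + 1}-01-01")
    for rec_type, sql in RECORD_TYPES.items():
        df = pd.read_sql(sql, conn, params=bounds)
        if not df.empty:
            # Tab-delimited text as the portable intermediate format
            df.to_csv(f"{rec_type}_{year}.txt", sep="\t", index=False, encoding="utf-8")
```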
## 2 Database Access & Setup
### Do
- Restore the database backup on a separate server (e.g., DWH/staging), never on the production application server.
- Verify the data range immediately after restore. The backup may not contain the most recent years. The SEAGHA backup only had data through 2020 — supplementary Excel exports were needed for 2021–2022. Plan to fill gaps from other sources.
- Script the full schema (tables, views, stored procedures) to a `.sql` file early. This becomes your primary reference for understanding the data model. For FINAWIN, this meant extracting 650 stored procedures, 60 functions, and 331 triggers.
- Use read-only access. Archival is extraction, not modification.
- Use `WITH (NOLOCK)` hints for large read-only queries on SQL Server to avoid blocking.
- For non-SQL databases, verify driver compatibility before committing to an approach (see the verification sketch after this list):
  - Firebird: test that the `fdb`/`firebirdsql` Python drivers can read all text columns. Custom collations (UNICODE_LV) can cause all text to return NULL.
  - FoxPro: test that `dbfread` handles the codepage correctly. Check for memo files (`.fpt`) that contain supplementary data.
  - If standard drivers fail, be prepared for alternative extraction methods (raw binary parsing worked for FINAWIN when SQL failed).
- Preserve the original data files. Keep the original `.bak`, `.fdb`, `.dbf` files as-is alongside any extracted output. These are your insurance policy.
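A minimal verification sketch for the Firebird case, assuming the `fdb` driver; the database path, credentials, and the table and column names are placeholders. The point is to prove that text actually comes back before building the full pipeline:

```python
# Driver sanity check (fdb installed; table/column names are hypothetical).
import fdb

con = fdb.connect(
    dsn="/data/FINAWIN.FDB", user="SYSDBA", password="masterkey",
    charset="UTF8",  # if a custom collation blocks this, try charset="NONE"
)
cur = con.cursor()
cur.execute("SELECT FIRST 20 CLIENT_NAME FROM CLIENTS")  # placeholder names
values = [row[0] for row in cur.fetchall()]

nulls = sum(v is None for v in values)
print(f"{nulls}/{len(values)} values came back NULL")
for v in values[:5]:
    print(repr(v))  # eyeball the diacritics: ā, č, ē, ģ, ī ...
```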
### Don't
- Don't use `INSERT`, `UPDATE`, `ALTER`, or `DELETE` statements. This is a `SELECT`/extraction project, not data loading.
- Don't modify the source database. Treat it as read-only.
- Don't skip the schema export. Without it, you're navigating blind when the application is decommissioned.
- Don't discard original database files after extraction. Keep them archived alongside the output.
- Don't assume a driver that connects successfully is reading data correctly. Verify text fields contain actual characters, not NULLs or garbled encoding.
## 3 Extraction Development
### Do
- Use the existing database views as building blocks. They contain the correct join logic, data transformations, and column aliases that match the report format.
- Preserve the exact column order to match the expected output template. Column position matters — the business users know their data by column position.
- Build parameterized queries with `@StartDate`/`@EndDate` and an `@AllRecords` flag. This gives flexibility for testing (small range) and full extraction.
- Add a `@TestMode` with a few known record IDs for verification against sample data.
- Handle data type conversions carefully: `ISNUMERIC()` checks before `CAST`, date year checks (≤1970 = NULL), string cleanup (`REPLACE`, `RTRIM`).
- Handle character encoding explicitly in your extraction scripts (a decode sketch follows this list):
  - Specify encoding when opening files: `open(file, encoding='utf-8')` or `encoding='cp1257'`
  - Use a fallback chain: try UTF-8 → CP1257 → CP1250 → Latin-1
  - Verify diacritics manually: check that names like "Jansone-Birsnece" (Latvian) or "Poljakova" (Estonian) appear correctly
  - For Firebird with custom collations: consider raw binary page-level parsing as a fallback
- Extract BLOBs and binary content separately. Employee photos, report templates, PDF payslips, XML files — these need special handling (a sketch follows at the end of this section):
  - Store BLOBs as individual files in a structured directory (`blobs/table_name/row_id.ext`)
  - Create a metadata table linking BLOB files back to their source rows
  - Identify BLOB types: images, PDFs, XML templates, RTF documents
- Document each query/script: what output it produces, how many columns/rows expected, which source tables it reads, what the output filename should be.
- Test with a small set of known records first. Compare output column-by-column against the sample export file.
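A minimal sketch of the fallback chain described above. One caveat: Latin-1 decodes any byte sequence, so the chain cannot fail outright and can silently mis-decode; the manual diacritic check stays mandatory:

```python
# Encoding fallback chain: a heuristic, not a guarantee.
def read_with_fallback(path, encodings=("utf-8", "cp1257", "cp1250", "latin-1")):
    raw = open(path, "rb").read()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} decoded {path}")

text, used = read_with_fallback("Import_Main.txt")  # placeholder filename
print(f"decoded with {used}")

# Manual verification step: known names must render with correct diacritics
for name in ("Jansone-Birsnece", "Poljakova"):
    print(name, "found" if name in text else "NOT FOUND")
```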
### Don't
- Don't hardcode record ID ranges. Always use date-based filtering through the appropriate date field.
- Don't skip columns, even if they appear empty. The column structure is the contract with downstream consumers.
- Don't confuse column aliases in views with actual database column names (e.g., `StatVal` in Excel may be `StatAdju` in the table).
- Don't ignore version/revision logic. Many systems keep multiple versions of a record — views often select `MAX(SubID) WHERE Function='9'` to get the latest.
- Don't mix header-level and detail-level data. Documents, parties, and financial info may exist at both levels with different join keys.
- Don't ignore credential/password tables. Note their existence but flag them for review — they may contain hashed passwords (SHA-512 in TAAVI) that should be excluded from the public archive.
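The BLOB-handling pattern from the Do list above as a minimal sketch. The table (`EMPLOYEES`), column (`PHOTO`), the open cursor `cur`, and the assumption that the driver returns bytes-like BLOBs are all placeholders to adapt:

```python
# BLOB extraction sketch: files on disk plus a metadata catalog (names are hypothetical).
import csv, os

# `cur` is assumed to be an open read-only cursor on the source database.
os.makedirs("blobs/EMPLOYEES", exist_ok=True)
with open("blobs/metadata.csv", "w", newline="", encoding="utf-8") as meta:
    out = csv.writer(meta)
    out.writerow(["table", "row_id", "column", "file", "size_bytes"])
    cur.execute("SELECT EMP_ID, PHOTO FROM EMPLOYEES WHERE PHOTO IS NOT NULL")
    for emp_id, blob in cur.fetchall():
        data = bytes(blob)  # assumes the driver materializes BLOBs as bytes
        # Sniff common formats from magic bytes; fall back to .bin
        if data[:3] == b"\xff\xd8\xff":
            ext = ".jpg"
        elif data[:4] == b"\x89PNG":
            ext = ".png"
        elif data[:4] == b"%PDF":
            ext = ".pdf"
        else:
            ext = ".bin"
        path = f"blobs/EMPLOYEES/{emp_id}{ext}"
        with open(path, "wb") as f:
            f.write(data)
        out.writerow(["EMPLOYEES", emp_id, "PHOTO", path, len(data)])
```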
## 4 Data Export & Output Formats
### Do
- Export to multiple formats based on the audience (a conversion sketch follows this list):
  - Tab-delimited text (`.txt`): most portable intermediate format, works with any tool
  - Parquet: compressed columnar format for large datasets (SEAGHA: 22 GB → 650 MB Parquet)
  - SQLite: preserves relational structure, queryable, self-contained (FINAWIN, TAAVI)
  - CSV: Excel-ready for non-technical users (one CSV per table)
  - Excel (`.xlsx`): final delivery format with formatting, auto-filters, freeze panes
- Use consistent, descriptive filenames: `Type_SheetName.txt` (e.g., `Import_Main.txt`, `Export_Detail.txt`) or `TableName.csv`.
- Separate output per sheet/report type: Main, Detail, Parties, Documents, etc. Each becomes a sheet in the final Excel or a separate CSV.
- Provide two Excel views when applicable:
  - `per_year`: all sheets for one year (good for audits of a specific period)
  - `per_type`: all years for one sheet type (good for trend analysis)
  - Users need both, depending on their question.
- Create a `Master_Index.xlsx` or `README.md` with clickable links to all files and a description of what each contains.
- Extract and preserve non-database content:
  - PDF payslips (TAAVI: 1,676 salary slip PDFs)
  - SEPA XML payment files (TAAVI: 43 files)
  - Tax declaration files (TAAVI: 104 files)
  - Report templates (FINAWIN: 225 FastReport XML templates)
  - Stored procedure source code (FINAWIN: 650 procedures)
- Build a web-based viewer for non-technical users. A simple Node.js + Express + better-sqlite3 application has been proven across multiple projects:
  - Searchable table list, paginated data grids
  - Row detail panel, BLOB viewer (inline images, PDFs)
  - SQL query runner with preset queries
  - No build step, runs on localhost
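The format fan-out might look like this: a minimal sketch assuming `pandas` with `pyarrow` installed and the `Type_SheetName.txt` naming convention, with placeholder filenames:

```python
# From tab-delimited intermediate to Parquet, SQLite, and CSV.
import sqlite3
import pandas as pd

# dtype=str keeps IDs and reference numbers intact (no lost leading zeros)
df = pd.read_csv("Import_Main.txt", sep="\t", dtype=str, encoding="utf-8")

# Parquet: compressed columnar copy for large datasets (requires pyarrow)
df.to_parquet("Import_Main.parquet", compression="snappy")

# SQLite: self-contained, queryable archive for the web viewer
with sqlite3.connect("archive.db") as db:
    df.to_sql("Import_Main", db, if_exists="replace", index=False)

# CSV with BOM so Excel opens UTF-8 content correctly
df.to_csv("Import_Main.csv", index=False, encoding="utf-8-sig")
```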
### Don't
- Don't try to export the entire database at once. Work by type, by year, and by chunk.
- Don't use Excel as the intermediate format. It has row limits (1M rows), is slow to write, and loses precision on large numbers.
- Don't forget to handle NULL values. NULLs should appear as empty cells, not as the string `'NULL'`.
- Don't neglect number formatting. Use `#,##0.00` for amounts and plain integer format for IDs (no thousands separator). A formatting sketch follows this list.
- Don't deliver only one format. SQLite for queries + CSV for Excel users + web viewer for browsing = maximum accessibility.
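A minimal sketch of the NULL and number-format handling with `pandas` and `xlsxwriter`; which columns hold IDs versus amounts is schema-specific, so the column letters below are placeholders:

```python
# Excel delivery sketch: empty cells for NULLs, proper number formats.
import pandas as pd

df = pd.read_csv("Import_Main.txt", sep="\t", encoding="utf-8")

with pd.ExcelWriter("Import_Main.xlsx", engine="xlsxwriter") as writer:
    # na_rep="" writes NULLs as empty cells, never the string 'NULL'
    df.to_excel(writer, sheet_name="Main", index=False,
                na_rep="", freeze_panes=(1, 0))
    book, sheet = writer.book, writer.sheets["Main"]
    # IDs: plain integers, no thousands separator (column A is an assumption)
    sheet.set_column("A:A", 12, book.add_format({"num_format": "0"}))
    # Amounts: #,##0.00 (column F is an assumption)
    sheet.set_column("F:F", 14, book.add_format({"num_format": "#,##0.00"}))
    sheet.autofilter(0, 0, len(df), len(df.columns) - 1)
```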
## 5 Validation & Quality Assurance
### Do
- Verify row counts: the number of rows exported must match the source query count. Document counts per table (see the count-check sketch after this list).
- Spot-check specific records: pick 5 known records and verify every column against the original application or sample export.
- Verify column count: the output must have exactly the expected number of columns (e.g., 140 for Import Main in SEAGHA).
- Verify character encoding: check that special characters (Latvian: ā, č, ē, ģ, ī, ķ, ļ, ņ, š, ū, ž; Estonian: ä, ö, ü, õ; Slovenian: č, š, ž; Romanian: ă, â, î, ș, ț) appear correctly in the output.
- Check for unknown/orphan records: rows with NULL dates go into `*_Unknown` files — verify these are expected.
- Document any data gaps: if a backup only covers 2003–2020 and supplementary exports cover 2022, note the gap (2021). Be explicit about what is missing.
- Keep the extraction scripts, schema DDL, and pipeline code as part of the archival package. The data is useless without the context of how it was extracted.
- Verify BLOB extraction: spot-check that extracted images open correctly, PDFs are readable, XML templates parse.
- Test the web viewer with a non-technical user. Can they find an employee record? A specific declaration? A payslip?
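A minimal count-and-encoding check, assuming the SQLite archive and tab-delimited intermediates built earlier; table and file names are placeholders:

```python
# Validation sketch: row counts plus a diacritic heuristic.
import sqlite3
import pandas as pd

db = sqlite3.connect("archive.db")
for table, txt_file in [("Import_Main", "Import_Main.txt")]:
    src = db.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    out = len(pd.read_csv(txt_file, sep="\t", dtype=str))
    print(f"{table}: source={src} export={out} {'OK' if src == out else 'MISMATCH'}")

# Heuristic encoding check: garbled exports usually lose these characters entirely.
# (Absence may also just mean no such names in this extract, so verify by eye too.)
sample = open("Import_Main.txt", encoding="utf-8").read()
missing = [ch for ch in "āčēģīķļņšūž" if ch not in sample]
print("diacritics missing from sample:", missing or "none")
```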
### Don't
- Don't assume the extraction is correct just because the query runs without errors. Always validate against sample data.
- Don't delete the intermediate files (TXT, Parquet) after generating Excel. They may be needed for re-extraction or verification.
- Don't lose the schema DDL file. It is the only documentation of the database structure once the server is decommissioned.
- Don't sign off on extraction without encoding verification. A database that looks correct in row counts but has garbled text is a failed archival.
## 6 Vendor & Stakeholder Engagement
### Do
- Identify all external vendors early who hold data or provide services. In MICROSCOP, the vendor (Mikrocop) held the authoritative 10-year archive — not FedEx.
- Establish vendor contact through the business, not through IT alone. Business contacts know the vendor relationship; IT may not have the right access or context.
- Request vendor data export capabilities and pricing upfront. Some vendors charge per-record; others provide bulk exports. Know before you budget.
- Get access to vendor frontends/portals for self-service verification. You need to confirm that the exported data matches what the vendor shows.
- Document the vendor relationship: contract terms, contact persons, data formats, SLAs, and any retention commitments.
- Brief the business on what data will be archived and what will be lost. They must sign off on scope — you cannot make compliance decisions for them.
- Schedule regular check-ins with business stakeholders during extraction. Show them sample output early. Their feedback is the specification.
- Document knowledge transfer explicitly. When the person who knows the system leaves, the archival becomes exponentially harder. Record everything in the first meeting.
### Don't
- Don't assume a vendor will cooperate on your timeline. Vendor engagement can stall for months (MICROSCOP: 2+ months waiting for contact).
- Don't skip the business stakeholder. They know which queries auditors run, which reports are legally required, and which data is actually used.
- Don't rely on a single person's knowledge. If only one person knows the system, document what they know immediately — they may leave before archival is complete.
- Don't assume active vendor processes can be retired immediately. MICROSCOP was still actively receiving daily invoice files — you cannot archive and retire simultaneously without a transition plan.
## 7 Delivery & Handoff
### Do
- Upload all files to a shared location (SharePoint, network drive) with a clear folder structure.
- Provide a README or Master Index explaining the folder structure, file naming convention, and how to find specific data.
- Include a technical reference document with: database technology, server details, schema statistics, data volumes, extraction scripts, and all queries used.
- Deliver extraction scripts as standalone files that can be re-executed if the database is still available.
- Note the data retention period and any legal/compliance requirements that drove the archival, including country-specific regulations.
- Include the web viewer application with setup instructions (`npm install`, `npm start`). This makes the archive immediately usable without SQL skills.
- Provide preset queries for the most common lookups (a sketch follows this list):
  - "Find employee by name"
  - "Show all declarations for reference number X"
  - "List payroll entries for period Y"
  - "Export contract details for employee Z"
- Document the archival date and scope: what was archived, from which system, which date range, and by whom. This metadata is essential for future compliance inquiries.
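Preset queries can ship as a simple mapping alongside the SQLite archive. A minimal sketch with placeholder table and column names:

```python
# Preset query sketch (schema names are hypothetical, adapt to the real archive).
import sqlite3

PRESET_QUERIES = {
    "Find employee by name":
        "SELECT * FROM employees WHERE name LIKE '%' || :name || '%'",
    "Show all declarations for reference number":
        "SELECT * FROM declarations WHERE ref_no = :ref",
    "List payroll entries for period":
        "SELECT * FROM payroll WHERE period = :period",
}

db = sqlite3.connect("archive.db")
rows = db.execute(PRESET_QUERIES["Find employee by name"],
                  {"name": "Jansone"}).fetchall()
print(f"{len(rows)} match(es)")
```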
### Don't
- Don't deliver data without documentation. A folder of Excel files without context is barely better than no archive at all.
- Don't assume the recipient knows the database schema. Provide enough context for someone with SQL skills but no domain knowledge to navigate the data.
- Don't keep the archive in only one location. Ensure there is a backup of the backup.
- Don't forget to include the original database backup/files alongside the extracted output. Future needs may require re-extraction with different parameters.
## 8 Common Pitfalls — Lessons from Real Projects
| Pitfall | Project | What Happened | How to Avoid |
|---|---|---|---|
| Custom database collation | FINAWIN | Latvian UNICODE_LV collation caused all text columns to return NULL via standard SQL | Test text extraction immediately; have a fallback plan (raw binary parsing) |
| Incomplete backup | SEAGHA_A | Database backup only contained data through 2020, missing 2021–2022 | Verify date range immediately after restore; locate supplementary exports |
| Data held by vendor | MICROSCOP | Mikrocop (external vendor) held the full 10-year invoice archive, not FedEx | Map the complete data flow in planning phase; identify external data holders |
| Knowledge departure | MICROSCOP | Business contact moved to a different department; knowledge chain became fragile | Document everything in the first meeting; don't defer |
| Encoding loss | TAAVI | FoxPro files used CP1257 Baltic encoding; default UTF-8 readers produced garbled Estonian names | Always detect and specify encoding explicitly; verify diacritics |
| Active process retirement | MICROSCOP | The invoice process was still actively running — couldn't just archive and shut down | Understand full lifecycle before planning retirement |
| BLOB content ignored | FINAWIN | Employee photos, report templates, and audit logs were stored as BLOBs — would have been lost without explicit extraction | Always check for BLOB/binary columns; extract and catalog them |
| Credential exposure | TAAVI | User tables contained SHA-512 hashed passwords | Flag credential tables; exclude from public-facing outputs |
## 9 Archival Project Checklist
Use this checklist to track progress on each legacy system archival.
### Discovery
- Data source located (database backup, file system, vendor platform)
- System type and technology identified (SQL Server, Firebird, FoxPro, vendor-managed)
- Schema/structure exported (tables, views, stored procedures, file listings)
- Table/view/SP counts documented
- Data size documented (original + estimated output)
- Record types identified (declarations, employees, payroll, invoices, etc.)
- Sample export files obtained from business or application
- Date field for filtering identified
- Data range verified (which years are present)
- Data gaps identified and documented
- Character encoding identified and tested
- External vendor dependencies identified
- Regulatory retention periods confirmed with legal/compliance
- Business stakeholders identified and briefed
- Knowledge holders documented (names, roles, what they know)
### Development
- Extraction scripts/queries written for each data type
- Scripts handle character encoding correctly
- Scripts parameterized for testing vs. full extraction
- Scripts tested against sample records
- Column order verified against sample exports
- BLOB/binary content extraction planned
- Output filenames and folder structure documented
### Extraction
- All queries/scripts executed for full extraction
- Output files saved with headers
- Row counts verified against expected volumes
- Character encoding verified (diacritics, special characters)
- BLOB content extracted and cataloged
- Intermediate formats created (Parquet, SQLite)
- Final delivery formats created (Excel, CSV)
- Master Index or README created
- Web viewer built and tested (if applicable)
### Validation
- 5+ records spot-checked against original system/sample
- Row counts match source
- Column counts match specification
- Encoding verified for local-language content
- BLOBs verified (images open, PDFs readable)
- Data gaps documented explicitly
- Web viewer tested with non-technical user
### Delivery
- Files uploaded to SharePoint/shared location
- README/documentation included
- Technical reference document included (DB details, scripts, schema)
- Original database files/backups preserved alongside output
- Extraction scripts included and documented
- Web viewer included with setup instructions
- Preset queries included for common lookups
- Email notification sent to stakeholders
- Data retention period and compliance requirements documented
- Vendor contacts and relationships documented (if applicable)
- Archival date, scope, and responsible person documented
## 10 Technology Quick Reference
| Database Type | Extraction Tool | Encoding Watch | Output Format |
|---|---|---|---|
| SQL Server | pyodbc, SSMS, sqlcmd | Windows-1252, UTF-8 | TXT → Parquet → Excel |
| Firebird | fdb driver, raw binary parser | Custom collations (ICU) | SQLite + CSV |
| Visual FoxPro (DBF) | dbfread (Python) | CP1257, CP1250 | SQLite + CSV |
| Vendor platform | Vendor API/export | EBCDIC, platform-specific | PDF, CSV (vendor-dependent) |
### Proven Tech Stack
- Extraction: Python (`fdb`, `dbfread`, `pyodbc`, `pandas`)
- Transformation: Python (`sqlite3`, `pyarrow`/Parquet, `openpyxl`/`xlsxwriter`)
- Storage: SQLite, Parquet, CSV, Excel (`.xlsx`)
- Viewer: Node.js + Express + better-sqlite3
- Frontend: Vanilla JS (no framework — simple, no build step)
- Documentation: Markdown + PDF export
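As a closing illustration of the stack above: a minimal DBF-to-SQLite pipeline, with file name, table name, and codepage as placeholders to adapt per system:

```python
# Proven-stack sketch: FoxPro DBF -> pandas -> SQLite (names are hypothetical).
import sqlite3
import pandas as pd
from dbfread import DBF

# The encoding must match the legacy codepage (CP1257 for Baltic-region systems)
table = DBF("PERSONS.DBF", encoding="cp1257", char_decode_errors="replace")
df = pd.DataFrame(iter(table))

with sqlite3.connect("archive.db") as db:
    df.to_sql("PERSONS", db, if_exists="replace", index=False)
print(f"archived {len(df)} rows from PERSONS.DBF")
```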