pclubiitk/EventsCatalogue

Events Catalogue Listener Service

This service listens to an IMAP inbox, filters incoming mailing-list emails, extracts event details using an LLM pipeline, and stores extracted events in a SQL database.

The core runtime file is listen.py, and extraction/cleaning logic is in extract.py.

What This Service Does

  1. Connects to an IMAP server over STARTTLS (port 143 + TLS upgrade).
  2. Tracks mailbox progress using IMAP UID values.
  3. Keeps only emails whose To header contains at least one configured target address; all other emails are skipped.
  4. Extracts message body text (prefers text/plain, then text/html, ignores attachments).
  5. Cleans text with clean_text.
  6. Calls extract_event to parse event metadata.
  7. Stores extracted event fields in SQL.
  8. Persists the last processed UID in SQL so restarts continue from the last checkpoint.

High-Level Flow

  1. Start process.
  2. Load environment variables from .env.dev.
  3. Open IMAP connection and select mailbox.
  4. Read last_processed_uid from DB state table.
  5. If no state exists (the stored value reads as 0):
    • Read current mailbox max UID.
    • Save that UID as initial state.
    • Do not process historical messages.
  6. Wait for new mail via IMAP IDLE.
  7. For each UID newer than state:
    • Fetch message.
    • Log sender/recipient/subject.
    • Filter by To addresses.
    • Extract and clean body.
    • Send extraction request to model.
    • Store extracted event if present.
    • Update DB checkpoint (last_processed_uid).
  8. On failures, reconnect after delay.

Components

Listener (listen.py)

Responsibilities:

  • IMAP connection and mailbox selection.
  • IDLE waiting (built-in path if available, manual fallback otherwise).
  • New UID discovery and processing loop.
  • Event persistence and checkpoint persistence.
  • Runtime logging.

Extraction (extract.py)

Responsibilities:

  • Convert HTML-ish text into normalized textual content via html2text and regex cleanup.
  • Send prompt + email body to model (qwen3.5:9b).
  • Parse the model output as strict JSON; return None when no event is found or the output cannot be parsed.

Model path:

  • Local path by default (ollama.generate).
  • Cloud path if started with --cloud (an ollama Client pointed at https://ollama.com, with the API key sent as an Authorization: Bearer header).

Database Schema

Tables are auto-created by SQLAlchemy on startup.

extracted_events

  • id (int, primary key, autoincrement)
  • name (string, nullable)
  • venue (string, nullable)
  • date (string, nullable)
  • time (string, nullable)
  • created_at (datetime, UTC default)

processing_state

  • id (int, primary key, autoincrement)
  • last_processed_uid (int)
  • updated_at (datetime, UTC default)

Notes:

  • Current code stores one logical state row and updates it.
  • last_processed_uid is updated for every processed-or-skipped UID to avoid reprocessing on restart.
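
To make the two tables and the single-row checkpoint concrete, here is a dependency-free sketch. It deliberately uses the standard-library sqlite3 module instead of SQLAlchemy, so it illustrates the schema rather than reproducing the service's models. The helper names open_db, get_last_uid, and set_last_uid are hypothetical.

```python
import sqlite3

# SQLite's CURRENT_TIMESTAMP is expressed in UTC.
SCHEMA = """
CREATE TABLE IF NOT EXISTS extracted_events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT,
    venue TEXT,
    date TEXT,
    time TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS processing_state (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    last_processed_uid INTEGER NOT NULL,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create both tables if they are missing."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def get_last_uid(conn: sqlite3.Connection) -> int:
    """Return the stored checkpoint, or 0 when no state row exists yet."""
    row = conn.execute(
        "SELECT last_processed_uid FROM processing_state ORDER BY id LIMIT 1"
    ).fetchone()
    return row[0] if row else 0


def set_last_uid(conn: sqlite3.Connection, uid: int) -> None:
    """Insert the state row on first use, then update that same row."""
    row = conn.execute("SELECT id FROM processing_state ORDER BY id LIMIT 1").fetchone()
    if row is None:
        conn.execute("INSERT INTO processing_state (last_processed_uid) VALUES (?)", (uid,))
    else:
        conn.execute(
            "UPDATE processing_state SET last_processed_uid = ?, "
            "updated_at = CURRENT_TIMESTAMP WHERE id = ?",
            (uid, row[0]),
        )
    conn.commit()
```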

Configuration

Environment variables are loaded from .env.dev because listen.py calls:

  • load_dotenv('.env.dev')

Required

  • IMAP_EMAIL: IMAP login username.
  • IMAP_PASSWORD: IMAP login password.

Optional (with defaults)

  • IMAP_HOST (default: newmailhost.cc.iitk.ac.in)
  • IMAP_PORT (default: 143)
  • IMAP_MAILBOX (default: INBOX)
  • DATABASE_URL (default: sqlite:///events.db)
  • RECONNECT_DELAY_SECONDS (default: 10)
  • IDLE_TIMEOUT_SECONDS (default: 300)
  • TARGET_TO_ADDRESSES (comma-separated target addresses; if unset, falls back to TARGET_ADDRESSES, then to the built-in list)
  • TARGET_ADDRESSES (consulted only as the fallback when TARGET_TO_ADDRESSES is unset)
  • OLLAMA_API_KEY (required only when using --cloud)

Important Address Variable Note

Current code defines:

  • DEFAULT_TARGET_TO_ADDRESSES = os.getenv('TARGET_ADDRESSES', 'students@list.iitk.ac.in,all@lists.iitk.ac.in')
  • TARGET_TO_ADDRESSES from os.getenv('TARGET_TO_ADDRESSES', DEFAULT_TARGET_TO_ADDRESSES)

So either of these can affect behavior:

  1. TARGET_TO_ADDRESSES (primary)
  2. TARGET_ADDRESSES (fallback source)

Prefer setting TARGET_TO_ADDRESSES explicitly to avoid confusion.

Example .env.dev

IMAP_HOST=newmailhost.cc.iitk.ac.in
IMAP_PORT=143
IMAP_EMAIL=your_username_or_email
IMAP_PASSWORD=your_password
IMAP_MAILBOX=INBOX

TARGET_TO_ADDRESSES=students@list.iitk.ac.in,all@lists.iitk.ac.in

DATABASE_URL=sqlite:///events.db
RECONNECT_DELAY_SECONDS=10
IDLE_TIMEOUT_SECONDS=300

# Required only for --cloud mode
OLLAMA_API_KEY=your_ollama_api_key

Installation

Python 3.11+ is recommended.

python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install python-dotenv sqlalchemy ollama html2text pandas

If you use local Ollama, make sure the model exists locally and the Ollama server is available.

Running

Local extraction path:

python3 listen.py

Cloud extraction path:

python3 listen.py --cloud
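
A minimal sketch of how the --cloud switch might be read, using argparse. Failing fast when OLLAMA_API_KEY is missing is an assumption of this example, not confirmed behavior of listen.py, and choose_extraction_mode is a hypothetical name.

```python
import argparse
from collections.abc import Mapping


def choose_extraction_mode(argv: list[str], env: Mapping[str, str]) -> str:
    """Return "cloud" or "local" based on the command line and environment."""
    parser = argparse.ArgumentParser(description="Events Catalogue listener")
    parser.add_argument(
        "--cloud",
        action="store_true",
        help="use the hosted Ollama API instead of a local Ollama server",
    )
    args = parser.parse_args(argv)
    # Assumed behavior: refuse to start in cloud mode without an API key.
    if args.cloud and not env.get("OLLAMA_API_KEY"):
        raise SystemExit("OLLAMA_API_KEY must be set when using --cloud")
    return "cloud" if args.cloud else "local"
```

In a real entry point this would be called as choose_extraction_mode(sys.argv[1:], os.environ).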

Logging You Will See

Typical log sequence for one message:

  1. Connection startup
  2. Listener state initialization
  3. New mail metadata log (UID, From, To, Subject)
  4. Body extraction start and source (multipart/text/plain, etc.)
  5. Extraction request sent to model
  6. Either:
    • No event extracted
    • Event stored

Examples:

  • New email received UID ... | From: ... | To: ... | Subject: ...
  • Extracting body for UID ...
  • Body extracted for UID ... from multipart/text/plain
  • Sending extract request for UID ...
  • Stored event for UID ...: {...}

IMAP IDLE Behavior

wait_for_new_mail has two paths:

  1. Built-in IDLE path (imap.idle) if the Python runtime provides it; imaplib gained IMAP4.idle() in Python 3.14.
  2. Manual IMAP protocol fallback:
    • Send IDLE
    • Wait on socket with select
    • Send DONE

Both paths block until the server pushes a notification or the idle timeout expires, so the listener does not need a frequent polling loop.
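
The manual fallback (send IDLE, wait with select, send DONE) can be sketched against a plain socket. This is only an illustration of the protocol exchange: the real listener works through imaplib on a TLS-wrapped connection, and the names idle_wait and _read_line are hypothetical.

```python
import select
import socket


def _read_line(sock: socket.socket, buffer: bytearray) -> bytes:
    """Read one CRLF-terminated line, keeping any extra bytes in `buffer`."""
    while b"\r\n" not in buffer:
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("server closed the connection")
        buffer.extend(chunk)
    line, _, rest = bytes(buffer).partition(b"\r\n")
    buffer[:] = rest
    return line


def idle_wait(sock: socket.socket, tag: bytes, timeout: float) -> bool:
    """Enter IDLE, wait up to `timeout` seconds, then leave IDLE with DONE.

    Returns True if the server sent anything while the client was idling.
    """
    buffer = bytearray()
    sock.sendall(tag + b" IDLE\r\n")
    # The server acknowledges IDLE with a continuation line such as "+ idling".
    line = _read_line(sock, buffer)
    if not line.startswith(b"+"):
        raise RuntimeError(f"unexpected reply to IDLE: {line!r}")
    # Data may already be buffered; otherwise block in select until the
    # socket becomes readable or the timeout expires.
    got_data = bool(buffer)
    if not got_data:
        readable, _, _ = select.select([sock], [], [], timeout)
        got_data = bool(readable)
    # DONE ends the IDLE command; the server then sends the tagged completion.
    sock.sendall(b"DONE\r\n")
    while True:
        line = _read_line(sock, buffer)
        if line.startswith(tag + b" "):
            break
    return got_data
```

After idle_wait returns True, the caller would search for UIDs above the checkpoint. After a timeout it would simply re-enter IDLE.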

Restart and Checkpoint Behavior

  • On first startup with empty state:
    • The listener sets checkpoint to current max UID.
    • Historical mail is not processed.
  • On subsequent restarts:
    • Reads stored checkpoint.
    • Processes only UIDs greater than checkpoint.

This prevents duplicate processing across restarts.
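
The first-start and restart rules reduce to a small pure function, sketched here with the hypothetical name plan_processing. The explicit uid > stored_uid filter matters in practice: an IMAP search for the range N:* returns the highest-numbered message even when its UID is below N, so search results should not be trusted without it.

```python
def plan_processing(stored_uid: int, mailbox_uids: list[int]) -> tuple[int, list[int]]:
    """Return (checkpoint_to_store, uids_to_process) for a startup or wake-up.

    A stored_uid of 0 means "no state yet", as in the flow described above.
    """
    if stored_uid == 0:
        # First start: baseline at the current maximum and skip all history.
        return max(mailbox_uids, default=0), []
    # Restart or new mail: process only UIDs strictly above the checkpoint,
    # in ascending order so the checkpoint can advance after each message.
    pending = sorted(uid for uid in mailbox_uids if uid > stored_uid)
    return stored_uid, pending
```

For example, plan_processing(9, [5, 9, 12, 15]) yields the checkpoint 9 and the pending UIDs 12 and 15.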

Switching to Another SQL Database

Set DATABASE_URL to a SQLAlchemy-compatible URL.

Examples:

# SQLite
DATABASE_URL=sqlite:///events.db

# PostgreSQL
DATABASE_URL=postgresql+psycopg2://user:password@host:5432/dbname

# MySQL
DATABASE_URL=mysql+pymysql://user:password@host:3306/dbname

Install the matching DB driver package when moving away from SQLite.

Troubleshooting

1) Auth failure on IMAP login

  • Verify IMAP_EMAIL and IMAP_PASSWORD.
  • Some servers accept local-part usernames, others need full email.

2) Listener reconnect loops

  • Check network reachability to IMAP host and port.
  • Verify that the STARTTLS upgrade on the IMAP port is allowed and not blocked along the network path.
  • Increase RECONNECT_DELAY_SECONDS if needed.

3) No events extracted

  • Confirm message To contains configured target addresses.
  • Inspect logs for body source and extraction call.
  • Verify model availability and model name in extract.py.

4) JSON parse warnings from extractor

  • Model output may deviate from strict JSON.
  • The current code logs a warning and skips that message.

5) Cloud mode errors

  • Ensure OLLAMA_API_KEY is set.
  • Confirm outbound access to https://ollama.com.

Operational Notes

  • Attachments are ignored during body extraction.
  • If both plain and HTML body exist, plain text is preferred.
  • clean_text may trim forwarded headers and quote prefixes.
  • Process runs continuously until interrupted.
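
As an illustration of the kind of trimming clean_text may perform, the sketch below strips quote prefixes and a forwarded-message header block with regular expressions. The patterns and the function name strip_quotes_and_forward_headers are assumptions for this example. The real clean_text also runs html2text, which is not reproduced here.

```python
import re

# Illustrative patterns only; the real cleanup rules may differ.
FORWARD_MARKER = re.compile(r"^-{2,}\s*Forwarded message\s*-{2,}$", re.IGNORECASE)
HEADER_LINE = re.compile(r"^(From|To|Cc|Date|Sent|Subject):\s", re.IGNORECASE)
QUOTE_PREFIX = re.compile(r"^(\s*>)+\s?")


def strip_quotes_and_forward_headers(text: str) -> str:
    """Drop '>' quote prefixes and forwarded-message header blocks."""
    cleaned = []
    in_forward_block = False
    for line in text.splitlines():
        # Remove one or more leading '>' quote markers.
        line = QUOTE_PREFIX.sub("", line).rstrip()
        if FORWARD_MARKER.match(line.strip()):
            in_forward_block = True
            continue
        if in_forward_block:
            # Header lines directly after the marker belong to the forward block.
            if HEADER_LINE.match(line):
                continue
            in_forward_block = False
            # The blank line that ends the header block is dropped as well.
            if not line.strip():
                continue
        cleaned.append(line)
    # Collapse runs of blank lines left behind by the removals.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(cleaned)).strip()
```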

Security Notes

  • Never commit real credentials in .env.dev.
  • Rotate IMAP and API credentials if leaked.
  • Consider least-privileged mailbox credentials where possible.

About

A script that listens for new mail, decides whether the mail announces an event and, if so, extracts the event details (name, venue, date, time) from the mail body and pushes them to a database.
