tududi/backend/docs/telegram-duplicate-prevention.md
2025-06-20 12:04:36 +03:00

256 lines
No EOL
7.7 KiB
Markdown

# Telegram Duplicate Prevention System
## Overview
This document describes the comprehensive duplicate prevention system implemented for Telegram message processing in the Tududi application. The system prevents duplicate inbox items from being created when the same message is processed multiple times due to network issues, processing errors, or other edge cases.
## Problem Statement
When integrating with Telegram's polling API, several scenarios can cause duplicate messages:
- Network timeouts causing message reprocessing
- Application restarts during message processing
- Race conditions in polling cycles
- Telegram API returning the same update multiple times
- Processing failures that lead to retry attempts
## Solution Architecture
The duplicate prevention system implements multiple layers of protection:
### 1. **Update ID Tracking (Application Level)**
- **Purpose**: Prevent reprocessing of the same Telegram update
- **Implementation**: In-memory Set tracking processed update IDs
- **Key Format**: `${userId}-${updateId}`
- **Memory Management**: Automatic cleanup (keeps last 1000 entries)
```javascript
// Example usage
const processedUpdates = new Set();
const updateKey = `${user.id}-${update.update_id}`;
if (!processedUpdates.has(updateKey)) {
// Process the update
processedUpdates.add(updateKey);
}
```
### 2. **Content-Based Duplicate Detection (Database Level)**
- **Purpose**: Prevent duplicate inbox items with identical content
- **Implementation**: Database query checking recent items (30-second window)
- **Scope**: Per-user, per-source (telegram)
```javascript
// Check for existing item within time window
const existingItem = await InboxItem.findOne({
where: {
content: messageContent,
user_id: userId,
source: 'telegram',
created_at: {
[Op.gte]: new Date(Date.now() - 30000) // 30 seconds
}
}
});
```
### 3. **Telegram API Offset Management**
- **Purpose**: Prevent re-fetching already processed updates
- **Implementation**: Track highest processed update ID per user
- **Persistence**: Maintained in poller state
## Code Structure
### Core Files
- **`services/telegramPoller.js`**: Main polling logic with duplicate prevention
- **`services/telegramInitializer.js`**: Initialization and user management
- **`tests/unit/services/telegramPoller.test.js`**: Unit tests for core functions
- **`tests/integration/telegram-duplicates.test.js`**: Integration tests for database interactions
- **`tests/integration/telegram-duplicate-scenario.test.js`**: Real-world scenario tests
### Key Functions
#### `processUpdates(user, updates)`
- Filters out already processed updates
- Updates user status with highest update ID
- Processes each new update individually
- Implements cleanup for memory management
#### `createInboxItem(content, userId, messageId)`
- Checks for recent duplicates before creation
- Creates inbox item with metadata
- Returns existing item if duplicate found
#### `processMessage(user, update)`
- Extracts message data
- Creates inbox item (with duplicate check)
- Sends confirmation back to Telegram
- Handles errors gracefully
## Testing Strategy
### Unit Tests (12 tests)
- Update ID tracking logic
- User list management
- Message parameter creation
- URL generation
- State management
### Integration Tests (12 tests)
- Database-level duplicate prevention
- Poller state management
- Update processing logic
- Error handling scenarios
### Scenario Tests (7 tests)
- Network issue simulations
- Rapid consecutive messages
- Update ID tracking in practice
- Poller restart scenarios
- Memory management verification
- Edge cases and time-based logic
### Running Tests
```bash
# Run all Telegram duplicate tests
npm run test:telegram-duplicates
# Run specific test suites
npm test -- --testPathPatterns="telegramPoller\.test\.js"
npm test -- --testPathPatterns="telegram-duplicates\.test\.js"
npm test -- --testPathPatterns="telegram-duplicate-scenario\.test\.js"
```
## Configuration
### Time Windows
- **Duplicate Detection**: 30 seconds (configurable)
- **Memory Cleanup**: 1000 entries (configurable)
- **Polling Interval**: 5 seconds (configurable)
### Memory Management
- Automatic cleanup when processed updates exceed 1000 entries
- Removes oldest 100 entries during cleanup
- Prevents memory leaks in long-running processes
## Monitoring and Debugging
### Logging
The system includes comprehensive logging for debugging:
```javascript
console.log(`Processing ${updates.length} updates for user ${user.id}`);
console.log(`Duplicate inbox item detected for user ${userId}`);
console.log(`Successfully processed message ${messageId} for user ${user.id}`);
```
### Status Monitoring
```javascript
const status = telegramPoller.getStatus();
// Returns: { running, usersCount, pollInterval, userStatus }
```
## Error Handling
### Network Errors
- Graceful handling of Telegram API timeouts
- Continuation of polling for other users if one fails
- Detailed error logging for debugging
### Database Errors
- Fallback behavior when duplicate check fails
- Transaction safety for inbox item creation
- Proper error responses to users
### Processing Errors
- Individual update error handling
- Continuation of processing for remaining updates
- Error reporting via Telegram messages
## Performance Considerations
### Memory Usage
- Bounded memory growth through automatic cleanup
- Efficient Set operations for duplicate checking
- Minimal memory footprint per user
### Database Performance
- Indexed queries for duplicate detection
- Time-limited searches (30-second window)
- Efficient user-scoped queries
### Network Efficiency
- Optimized Telegram API calls
- Proper offset management
- Reasonable polling intervals
## Security Considerations
### Data Protection
- User-scoped duplicate checking
- Secure token handling
- No sensitive data in logs
### Rate Limiting
- Respectful Telegram API usage
- Configurable polling intervals
- Graceful handling of API limits
## Future Enhancements
### Possible Improvements
1. **Persistent State**: Store processed update IDs in database
2. **Configurable Windows**: Make time windows user-configurable
3. **Metrics Collection**: Add detailed metrics for monitoring
4. **Retry Logic**: Implement exponential backoff for failures
5. **Batch Processing**: Process multiple updates in batches
### Migration Considerations
- Current system maintains backward compatibility
- New features can be added incrementally
- Existing data remains unaffected
## Troubleshooting
### Common Issues
1. **Duplicates Still Occurring**
- Check if time window is appropriate for your use case
- Verify update ID tracking is working
- Review logs for processing errors
2. **Memory Usage Growing**
- Verify cleanup logic is running
- Check if cleanup threshold needs adjustment
- Monitor processed updates Set size
3. **Performance Issues**
- Review database query performance
- Check polling interval settings
- Monitor network latency to Telegram API
### Debug Commands
```javascript
// Check poller status
const status = telegramPoller.getStatus();
// Manual cleanup (for testing)
if (processedUpdates.size > 1000) {
// Cleanup logic
}
// Verify duplicate detection
const existing = await InboxItem.findOne({ /* query */ });
```
## Conclusion
The Telegram duplicate prevention system provides robust protection against message duplication through multiple complementary mechanisms. The comprehensive test suite ensures reliability, while the monitoring and debugging features facilitate maintenance and troubleshooting.
The system is designed to be:
- **Reliable**: Multiple layers of protection
- **Efficient**: Minimal performance impact
- **Maintainable**: Well-tested and documented
- **Scalable**: Bounded memory usage and efficient queries
- **Debuggable**: Comprehensive logging and monitoring