A platform for digitizing, analysing and making available journalism archives using artificial intelligence
  • aiPages specializes in digitizing newspapers through scanning and indexing.
  • The use of AI technology guarantees the accuracy and reliability of the extracted content.
  • It offers the convenience of accessing both the extracted content and the original scanned pages.
  • It is a valuable tool for accessing and analyzing historical newspaper articles.
  • aiPages contributes to preserving and providing public access to historical information in multiple formats.
Problem statement

Organizations that still rely on paper-based documents face several challenges:

  • Limited accessibility: Paper documents are stored in physical locations and are difficult to access remotely.
  • High storage costs: Storing paper documents is expensive, as it requires physical storage and regular maintenance.
  • Security risks: Paper documents can be lost or damaged, which can lead to the loss of sensitive information.
  • Inefficiency: Paper-based processes are slow and depend on manual handling, which leads to errors and delays.
  • Limited collaboration: Paper documents can be difficult to share and collaborate on between departments or teams.
  • Inability to leverage data: Non-digitized documents are difficult to analyze, so it is hard to extract the insights needed to make informed decisions.

To address these challenges, organizations need to convert paper documents into digital formats that can be easily accessed, searched, and shared. Doing so helps them improve efficiency, reduce costs, enhance security, and improve collaboration between departments.

OUR SERVICE

aiPages Features

Block identification

aiPages can identify the following content blocks on any newspaper page:

  • Articles
  • Titles in articles
  • Authors of articles
  • Images in articles
  • Columns in articles
  • Advertisements

Text Classification

  • aiPages classifies text according to its content, e.g. art, sports, politics.
  • aiPages uses a topic classification derived from that of the IPTC (International Press Telecommunications Council).
  • aiPages can identify 72 different categories in Arabic, which is critical for serious data analysis and indexing.

Metadata Extraction

  • Title of article
  • Author of article
  • Images in the content block
  • Named entities: each extracted named entity can be a place, a person, or an organization, and is linked to a public database.

Text Extraction

  • Text is extracted whether it is laid out in a single column or across several columns.
  • aiPages can handle poorly scanned images by pre-processing the image before recognition and by applying a final NLP pass to the extracted content to correct letters or words that the OCR misidentified.
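The post-OCR correction step can be illustrated with a small sketch. This is not aiPages' actual pipeline: it assumes a tiny hypothetical vocabulary (`VOCABULARY`) and uses Python's standard `difflib` fuzzy matching in place of a real NLP model, only to show how misrecognized words might be snapped back to known ones.

```python
import difflib
import re

# Hypothetical vocabulary for illustration only; a real system would rely on
# a much larger lexicon or a language model for the archive's language.
VOCABULARY = ["government", "minister", "newspaper", "economy", "president"]


def correct_token(token: str, vocabulary: list[str], cutoff: float = 0.8) -> str:
    """Return the closest vocabulary word, or the token unchanged."""
    if token.lower() in vocabulary:
        return token
    matches = difflib.get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_text(text: str, vocabulary: list[str] = VOCABULARY) -> str:
    """Correct each alphanumeric token, leaving punctuation and spacing intact."""
    return re.sub(
        r"[A-Za-z0-9]+",
        lambda m: correct_token(m.group(0), vocabulary),
        text,
    )


print(correct_text("The g0vernment and the minlster"))
```

The `cutoff` value is an assumption: a higher value corrects fewer words but reduces the risk of replacing a valid word that is simply missing from the vocabulary. Corrected words come back in the vocabulary's lowercase form.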

Text Summarization

  • aiPages is able to identify the most interesting parts of extracted text and stitch them into a meaningful summary.
  • For very long articles, the summary can be as short as 20-30% of the extracted text.
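As a rough illustration of selecting the most informative parts of a text and stitching them together, the sketch below implements a simple frequency-based extractive summarizer. It is an assumption made for illustration, not the model aiPages uses; the function name `summarize` and the `ratio` parameter are hypothetical.

```python
import re
from collections import Counter


def summarize(text: str, ratio: float = 0.3) -> str:
    """Keep the highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return ""

    # Word frequencies over the whole text; very short words are skipped
    # as a crude stand-in for stop-word removal.
    words = re.findall(r"\w+", text.lower())
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence: str) -> float:
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Keep roughly `ratio` of the sentences, but always at least one.
    keep = max(1, round(len(sentences) * ratio))
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:keep])  # restore reading order
    return " ".join(sentences[i] for i in chosen)
```

With `ratio=0.3`, a ten-sentence article is reduced to three sentences, which corresponds to the upper end of the 20-30% range mentioned above.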

User access management

  • User access management in aiPages involves creating and managing user accounts, defining access levels, and enforcing security policies so that users can access only the parts of the system and the data that they need to perform their jobs.

Dashboard

  • The Dashboard Monitor is a robust visualization tool aimed at providing users with comprehensive insights into the processing status of Documents, Pages, and Articles. With an interactive interface, users can filter results by date, choose between different content type views, and get real-time updates, all in one unified dashboard.


Search Capabilities

  • aiPages employs a cutting-edge search engine harnessing Semantic Search capabilities. This search type prioritizes understanding the context and meaning of queries over mere keyword matching. By utilizing natural language processing (NLP) and machine learning, it analyzes both the query and the document content, delivering highly relevant and accurate results.
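The idea of ranking by meaning rather than by keyword overlap can be sketched with vector similarity. In the example below, the three-dimensional embeddings in `DOCUMENTS` are hand-written, hypothetical values; a real system would obtain them from a trained language model, and nothing here describes the actual aiPages search engine.

```python
import math

# Hypothetical 3-dimensional embeddings, written by hand for illustration.
DOCUMENTS = {
    "Parliament passes new budget": [0.9, 0.1, 0.0],
    "Local team wins the cup final": [0.0, 0.2, 0.9],
    "Finance minister announces tax reform": [0.8, 0.3, 0.1],
}


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector: list[float], documents: dict[str, list[float]], top_k: int = 2) -> list[str]:
    """Return the titles of the top_k documents most similar to the query."""
    ranked = sorted(
        documents,
        key=lambda title: cosine(query_vector, documents[title]),
        reverse=True,
    )
    return ranked[:top_k]


# A query vector that, under these toy embeddings, points toward "politics/economy".
print(search([1.0, 0.2, 0.0], DOCUMENTS))
```

Under these toy embeddings, the query matches the two budget and tax articles even though it shares no keywords with them, which is the behavior semantic search aims for.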

Operational Efficiency

  • Because the solution is composed of multiple microservices and components, some components may be under heavier load than others. The operator agent handles this by scaling each component up or down according to its own load, without scaling the other components, for maximum operational efficiency.
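A per-component scaling decision of this kind could look like the sketch below. It is an assumption for illustration, not the operator agent's real logic: it applies a proportional rule (similar in spirit to the one used by the Kubernetes Horizontal Pod Autoscaler) to each component independently. `ComponentLoad`, the 60% target, and the replica limits are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ComponentLoad:
    name: str
    replicas: int          # replicas currently running
    utilization_pct: int   # average utilization per replica, in percent


def desired_replicas(load: ComponentLoad, target_pct: int = 60,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Scale so that average utilization moves toward target_pct."""
    # Integer ceiling of replicas * utilization / target, avoiding float rounding.
    raw = (load.replicas * load.utilization_pct + target_pct - 1) // target_pct
    return max(min_replicas, min(max_replicas, raw))


def plan(loads: list[ComponentLoad], target_pct: int = 60) -> dict[str, int]:
    """Compute a replica count for each component, independently of the others."""
    return {c.name: desired_replicas(c, target_pct) for c in loads}


print(plan([
    ComponentLoad("ocr", replicas=2, utilization_pct=90),
    ComponentLoad("block-classifier", replicas=3, utilization_pct=20),
]))
```

In this example the busy OCR component grows from 2 to 3 replicas while the lightly loaded block classifier shrinks from 3 to 1, so neither decision affects the other.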

AI usage

Image segmentation

  • Image segmentation is the technique that allows a computer to distinguish the different parts of an image.
  • It is the core of our system's ability to differentiate content blocks from blocks that are not of interest, such as advertisements, images, and classified ads.
  • It can identify content blocks even when they do not have a regular geometric shape (e.g. a square or rectangle), which is very common in modern newspapers.



Named Entity Recognition (NER)

  • NER identifies the named entities mentioned in the extracted text, such as places, people, and organizations, so that each one can be linked to a public database.



    OCR

    • Convolutional neural networks (CNNs): These are particularly effective for image recognition tasks. They work by processing small regions of an image and using learned features to classify the content.
    • Long short-term memory (LSTM) networks: These are a type of neural network that is effective for processing sequential data and that improves recognition accuracy. They work by maintaining a memory of previous inputs to inform future predictions.
    • Hidden Markov Models (HMMs): Tesseract uses HMMs to model the probability distribution of characters within an image.

    Benefits of using the aiPages platform

    Availability and ease of access to information: The platform allows researchers and anyone interested in journalism, publications, and media to access this archival information easily and effectively.

    Preserving historical records: Digitizing historical publications preserves them for future generations, so that they are not affected by damage to, or loss of, the hard copies.

    Data analysis: The platform can analyze and correlate data and identify patterns, allowing information to be inferred that may not be noticeable when reading individual articles.

    Search: Users can search by words and sentences and reach specific information easily, which is of great value, especially for researchers and academics.

    OUR SERVICE

    High-level Architecture

    • The diagram describes the sequence of activities performed by the solution and the components of the system.
    • The REST API is the single point of contact with the solution. It is used to upload content onto a pre-designated Kubernetes persistent volume or object storage.
    • Uploading content queues the uploaded material for processing.
    • The operator agent is responsible for queueing, batching, routing, and managing load for the various tasks across the components of the system.
    • The operator agent forwards tasks to the various components while managing and optimizing the load.
    • Multiple servers are deployed for the specialized components, i.e. the block classifier and OCR. The operator agent manages the number of these servers according to their load.
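The queueing, batching, and routing responsibilities described above can be sketched in a few lines. This is a simplified, in-memory illustration under stated assumptions, not the real operator agent: `Task`, `OperatorQueue`, the component names, and the batch size are all hypothetical.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Task:
    page_id: str
    component: str  # hypothetical component name, e.g. "ocr"


class OperatorQueue:
    """Holds uploaded work per component and hands it out in batches."""

    def __init__(self, batch_size: int = 2):
        self.batch_size = batch_size
        self.queues: dict[str, deque] = defaultdict(deque)

    def submit(self, task: Task) -> None:
        # Routing: each task goes to the queue of the component that handles it.
        self.queues[task.component].append(task)

    def next_batch(self, component: str) -> list[Task]:
        # Batching: take up to batch_size tasks in arrival order.
        queue = self.queues[component]
        count = min(self.batch_size, len(queue))
        return [queue.popleft() for _ in range(count)]

    def backlog(self) -> dict[str, int]:
        # Pending work per component, which a scaler could use as a load signal.
        return {name: len(q) for name, q in self.queues.items()}


operator = OperatorQueue(batch_size=2)
for page in ("p1", "p2", "p3"):
    operator.submit(Task(page, "ocr"))
operator.submit(Task("p1", "block-classifier"))

print([t.page_id for t in operator.next_batch("ocr")])
print(operator.backlog())
```

Keeping one queue per component is a design choice made for this sketch: it lets the backlog of each component be measured separately, which is what allows the number of servers for one component to change without touching the others.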

    Figure: aiPages architecture