Product Framework

Cryptoracle: Privately-Owned Data Assetization — Transforming Private Data from "Fragmented Information" into "Analyzable, Predictable Decision-Making Assets" (of data, by data, for data). Moving beyond the limitations of traditional financial databases—such as relational databases and market data platforms—we focus on converting non-standardized content from private crypto asset social networks into measurable, aggregatable, and model-ready data assets. These serve various scenarios including market monitoring, strategy development, and sentiment analysis.

Core Goal:

  • Centered on the private ecosystem within the crypto asset space, leveraging large language models’ semantic parsing capabilities to deeply deconstruct non-standardized content from Discord, Telegram, private KOL channels, niche community discussions, and more. This forms a multi-dimensional structured data system encompassing user identity tags, semantic sentiment tendencies, event correlations, and propagation path characteristics.

  • Utilizing large language models’ semantic understanding, NLP parsing, and dynamic tagging engines to transform fragmented chat records into measurable metrics.

  • Employing entity correlation algorithms to aggregate high-noise redundant information into traceable relationship networks.

  • Applying time-series data modeling to solidify highly dynamic real-time content into reusable data assets.


1. Foundation Layer

Objective: Address issues of data sourcing, storage, and compliance to provide high-quality structured data for upper layers.

1.1 Multi-Source Integration Module

  • Supports full-channel access to private social networks: Discord/Telegram community messages, private chat records, KOL-exclusive channel content, etc., covering text, images (OCR parsing), and interactive behaviors (likes, shares, @mentions).

  • Provides standardized integration interfaces: API connections, SDK embedding, manual uploads, etc., adaptable to various private scenarios (closed communities, one-on-one chats, exclusive channels).

1.2 Data Compliance & Governance Module

  • Privacy Protection: Automatic desensitization (e.g., hiding phone numbers, nicknames, and other sensitive information), configurable data retention periods based on compliance requirements.

  • Data Quality Governance: Deduplication, cleaning of outliers (e.g., garbled text, repetitive spam), ensuring usability of raw data.

  • Metadata Management: Builds a unified metadata warehouse documenting data assets’ "source, format, field meaning, update frequency, and relationships," supporting metadata retrieval and visualization to help users quickly understand data context.

1.3 Storage & Computation Engine

  • Structured Storage:

    • Relational databases (e.g., MySQL/PostgreSQL) store tagged data (user tags, event tags, etc.), ensuring high availability through master-slave architecture.

    • Time-series databases (e.g., InfluxDB, TimescaleDB) store real-time data streams (e.g., minute-level heat values, sentiment indices), supporting fast time-range queries.

    • Columnar databases (e.g., ClickHouse) store high-cardinality structured data (e.g., full community message records), optimizing aggregation query performance (e.g., "message count by token + date").

  • Distributed Storage Expansion: Introduces HDFS or object storage (e.g., S3, MinIO) for massive unstructured/semi-structured data (e.g., raw chat records, PDF documents), supporting PB-level scalability with replica mechanisms (default 3 replicas) for data reliability.

  • NoSQL Storage Adaptation: Uses document databases (e.g., MongoDB) for unstructured "complex entity data" (e.g., KOL personal information + historical message features) and graph databases (e.g., Neo4j) for "entity relationship networks" (e.g., user-token-community mesh relationships), enabling deep correlation analysis.

  • Hybrid Computation:

    • Real-time computation engines (e.g., Flink) process streaming data (real-time updates of heat/sentiment indices), supporting state management (e.g., cumulative 24-hour mention count for a token).

    • Batch computation engines (e.g., Spark) handle historical data analysis (e.g., weekly trends, strategy backtesting), with resource scheduling via YARN/K8s and dynamic scaling.

    • A "batch-stream integration layer" (e.g., Flink + Hive synergy) ensures unified computation standards for real-time and historical data, avoiding inconsistencies.


2. Data Mining Layer

Objective: Transform fragmented data into structured assets and analyzable dimensions, serving as the core capability carrier.

Cryptoracle’s underlying logic unit is a "five-dimensional structured data unit," standardizing basic elements of social content in a structured manner:

"Who (user) — Where (community) — Said What (content) — About Which Event (event) — Affected Which Token (asset)."

| Dimension | Content Description | Example |
| --- | --- | --- |
| Who (User) | Who is the speaker? Identity tags, activity behavior, etc. | @CryptoWhale (KOL) |
| Where (Community) | Which platform/group did the speech occur in? How active is it? | Telegram Group A |
| What (Content) | Speech content and its linguistic sentiment features and propagation attributes | "SOL is pumping!! 🚀🚀" |
| About What (Event) | Was a specific event mentioned? What type of event is it? | "Binance announces SOL listing" |
| Which Project/Token Affected | Which tokens/projects are primarily involved in the speech? | SOL, BNB |

This five-dimensional structure is the foundation for metric calculation and tag system construction.
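As a sketch, the five-dimensional unit can be represented as a simple record type. The field names below are illustrative assumptions, not Cryptoracle's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

# Minimal sketch of the "five-dimensional structured data unit".
# Field names are illustrative, not Cryptoracle's real schema.
@dataclass
class SocialDataUnit:
    user: str          # Who: speaker identity tag
    community: str     # Where: platform/group the message appeared in
    content: str       # What: raw message text
    event: str         # About which event the message refers to
    assets: List[str] = field(default_factory=list)  # Affected tokens

unit = SocialDataUnit(
    user="@CryptoWhale (KOL)",
    community="Telegram Group A",
    content="SOL is pumping!! 🚀🚀",
    event="Binance announces SOL listing",
    assets=["SOL", "BNB"],
)
```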

2.1 Multi-Dimensional Tagging System: From Semantic Recognition to Entity Attribution

The tagging system identifies and categorizes key entities in raw private text, providing behavioral context, attribute attribution, and semantic background. It is the upstream source for all metric construction.

2.1.1 Token Tags (CO-B-01)

Identify tokens mentioned in content, supporting tasks like project attribution, ecosystem structure modeling, and asset classification. Token tags include ecosystem information and technical attribute dimensions.

| Label Dimension | Example Labels |
| --- | --- |
| Primary Tag (Ecological Attribute) | Public Chain / L2 / DeFi / Stablecoin / AI Sector |
| Secondary Tag (Application Scenario) | NFT / DEX / Derivatives / RWA |
| Risk Attribute | High Volatility / Small Market Cap / Trading Activity |

2.1.2 User Tags (CO-B-02)

Categorize speakers into types such as regular users, KOLs, project teams, bot accounts, etc., to characterize behavioral traits, influence levels, and content credibility.

| Label Dimension | Example Labels |
| --- | --- |
| Identity Type | KOL / Retail Investor / Promotional Account / Project Team |
| Activity Trait | High Frequency Active / Lurker Type / Suspected Bot |
| Preference Type | Futures Preference / NFT Player / MEME Holder |
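As an illustration of how such user tags might be assigned, here is a minimal rule-based sketch. All thresholds and rules are invented assumptions; the production system would combine LLM analysis with historical behavior:

```python
# Illustrative rule-based user tagging. Thresholds are invented assumptions.
def tag_user(msg_count_24h: int, unique_texts: int, follower_count: int) -> list:
    tags = []
    # high volume with highly repetitive content suggests automation
    if msg_count_24h > 200 and unique_texts / max(msg_count_24h, 1) < 0.2:
        tags.append("Suspected Bot")
    elif msg_count_24h > 50:
        tags.append("High Frequency Active")
    else:
        tags.append("Lurker Type")
    # influence level from audience size (simplistic proxy)
    tags.append("KOL" if follower_count > 10_000 else "Retail Investor")
    return tags

print(tag_user(300, 30, 500))     # → ['Suspected Bot', 'Retail Investor']
print(tag_user(10, 10, 50_000))   # → ['Lurker Type', 'KOL']
```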

2.1.3 Community Tags (CO-B-03)

Describe the platform and channel attributes where discussions occur, including platform type (Discord/Telegram/X, etc.), group category, activity level, focus areas, etc., used for building propagation networks and information source profiles.

| Label Dimension | Example Labels |
| --- | --- |
| Activity Level | High Activity / Medium Activity / Zombie Group |
| Platform | Telegram / Discord / Twitter |
| Timezone Preference | Asia / Europe & Americas / Global Mixed |

2.1.4 Event Tags (CO-B-04)

Identify event-related content and provide structured annotations for event type (e.g., exchange listing, airdrop, attack), source, heat, and propagation scope. These are crucial for event-driven strategy modeling.

| Label Dimension | Example Labels |
| --- | --- |
| Event Type | Breaking Event / Announcement / Rumor / Hot Discussion |
| Source Attribute | Official Announcement / KOL Disclosure / Media Repost |
| Sentiment Direction | Positive / Negative / Neutral |

2.2 Metric System: From Text to Modelable Features

Based on the tagging system, Cryptoracle builds a systematic private metric system to quantify and model behavioral, emotional, and structural information hidden in raw text. All metrics have clear data sources, construction logic, and interpretability.

2.2.1 Three Structured Metric Perspectives:

| Dimension | Definition | Example Keywords/Indicators |
| --- | --- | --- |
| Content | What is the speech about? Topics, viewpoints, token references. | Token mentions, hashtags, technical term recognition |
| Style | Does the speaking style show emotion, exaggeration, extremity, or aggression? | Sentiment polarity, intensity, subjectivity, extreme expressions |
| Spread | What is the propagation path of this content? Who is the source? Has it been disseminated? | First mention, KOL chats, co-occurrence count, interaction frequency |

2.2.2 Metric Categories & Examples

To better support diverse business needs, metrics are subdivided into multiple categories, each targeting different dimensions of text features and market focus.

1. Activity Metrics (CO-A-01)

Purpose: Measure the discussion heat and user participation for tokens or events, reflecting market attention and information liquidity.

Key Examples:

| Metric Name | Description |
| --- | --- |
| Total Community Messages | Number of valid messages related to the token in associated communities |
| Number of Active Users | Count of unique users mentioning the token within a specified time period |
| Number of Mentioning Communities | Number of active groups or channels discussing the token |
| Total User Interactions | Total number of interactive behaviors including likes, replies, and shares |

Application Example:

  • Comparing "total community message volume" and "number of mentioning communities" helps distinguish whether a token’s popularity is localized or widespread, aiding event propagation assessment.

  • "Number of active users" helps detect influx of new users or retail activity.
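A sketch of how these activity metrics could be computed from raw message records. The record fields and sample data are illustrative assumptions, not Cryptoracle's actual schema:

```python
# Sketch: computing activity metrics from message records.
# Record fields ("token", "user", "group", "interactions") are assumptions.
messages = [
    {"token": "SOL", "user": "u1", "group": "g1", "interactions": 3},
    {"token": "SOL", "user": "u2", "group": "g2", "interactions": 1},
    {"token": "SOL", "user": "u1", "group": "g1", "interactions": 0},
]

def activity_metrics(msgs, token):
    related = [m for m in msgs if m["token"] == token]
    return {
        "total_messages": len(related),
        "active_users": len({m["user"] for m in related}),
        "mentioning_communities": len({m["group"] for m in related}),
        "total_interactions": sum(m["interactions"] for m in related),
    }

print(activity_metrics(messages, "SOL"))
# → {'total_messages': 3, 'active_users': 2, 'mentioning_communities': 2, 'total_interactions': 4}
```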

2. Sentiment Metrics (CO-A-02)

Purpose: Quantify user emotional attitudes toward tokens or events and their dynamics, helping predict market sentiment trends and potential risks.

Key Examples:

| Metric Name | Description |
| --- | --- |
| Positive Sentiment Ratio | Proportion of positively weighted sentiment expressions based on influencer impact |
| Negative Sentiment Ratio | Proportion of negatively weighted sentiment expressions based on influencer impact |
| Sentiment Momentum | Difference between positive and negative ratios, reflecting sentiment direction and intensity |
| Extreme Sentiment Intensity | Weighted frequency proportion of extreme sentiment vocabulary |
| Sentiment Consistency Score | Degree of consensus in sentiment expression among users |
| Sentiment Reversal Alert | Identifies critical shifts in sentiment from bullish to bearish or vice versa |

Application Example:

  • High "sentiment consistency score" indicates unified market views and concentrated risk; low score may signal diverging opinions and potential volatility.

  • "Extreme sentiment intensity" often captures bubble or panic sentiments, guiding risk management.
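A minimal sketch of sentiment momentum and a reversal alert, assuming unweighted message counts (the metrics above weight by influencer impact, which is omitted here; the reversal threshold is also an assumption):

```python
# Sketch: sentiment momentum and a simple reversal alert.
# Influencer weighting is omitted; the threshold value is an assumption.
def sentiment_momentum(pos_count: int, neg_count: int, total: int) -> float:
    # positive ratio minus negative ratio, in [-1, 1]
    return (pos_count - neg_count) / total if total else 0.0

def reversal_alert(prev_m: float, curr_m: float, threshold: float = 0.2) -> bool:
    # flag a sign flip whose magnitude exceeds the threshold
    return prev_m * curr_m < 0 and abs(curr_m - prev_m) > threshold

m1 = sentiment_momentum(2300, 1800, 4321)   # ≈ 0.12, mildly bullish
m2 = sentiment_momentum(1200, 1350, 2750)   # ≈ -0.05, mildly bearish
print(reversal_alert(m1, m2))   # → False (sign flip present, but below threshold)
print(reversal_alert(0.3, -0.3))   # → True
```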

3. Event Metrics (CO-A-03)

Purpose: Measure the propagation, impact scope, and discussion depth of specific events in communities.

Key Examples:

| Metric Name | Description |
| --- | --- |
| Event Keyword Anomaly | Abnormal fluctuation in keyword discussion frequency exceeding historical thresholds within a short time period |
| Event Discussion Heat | Total volume of discussions related to a specific event within a designated cycle |
| Event Discussion Breadth | Number of active communities mentioning the event |
| Event Discussion Depth | Average volume of event discussions within a single community |
| New Topic Discussion Count | Number of newly emerging topics with heat exceeding a threshold value |

Application Example:

  • Combining "event discussion breadth" and "discussion depth" determines whether an event is spreading broadly or concentrated in a few communities.

  • Monitoring "event keyword anomalies" enables rapid early warning and public opinion monitoring.
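One common way to implement an "event keyword anomaly" check is a z-score test against a historical baseline. The sketch below assumes hourly mention counts and an invented threshold of 3:

```python
import statistics

# Sketch: flag an "event keyword anomaly" when the current window's keyword
# frequency deviates sharply from its historical baseline.
# The z-score threshold of 3 is an assumed parameter.
def keyword_anomaly(history: list, current: int, z_threshold: float = 3.0) -> bool:
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1.0  # avoid division by zero on flat history
    return (current - mean) / stdev > z_threshold

hourly_mentions = [12, 9, 15, 11, 13, 10, 14, 12]   # historical per-hour counts
print(keyword_anomaly(hourly_mentions, 85))   # → True: sudden spike
print(keyword_anomaly(hourly_mentions, 16))   # → False: within normal range
```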

4. Attribution Metrics (CO-A-04)

Purpose: Extract trading logic and decision factors behind bullish/bearish views, providing basis for strategy design.

Key Examples:

| Metric Name | Description |
| --- | --- |
| Bullish Trading Logic Factors | Decision factors and keywords extracted from bullish sentiment texts |
| Bearish Trading Logic Factors | Decision factors and keywords extracted from bearish sentiment texts |
| Bullish Factor Heat Evolution | Heat trend changes of bullish factors over time |
| Bearish Factor Heat Evolution | Heat trend changes of bearish factors over time |
| Bullish Factor Consistency | Consensus level of bullish factors across different user groups |
| Bearish Factor Consistency | Consensus level of bearish factors across different user groups |

Application Example:

  • Using "factor heat evolution" captures changes in market sentiment and logic, aiding dynamic strategy adjustments.

  • "Factor consistency" measures the dispersion or unity of market views.
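A minimal sketch of factor extraction, assuming a hand-built keyword lexicon. The lexicon and matching rule are illustrative; production extraction would rely on the LLM semantic parsing described earlier:

```python
from collections import Counter

# Sketch: extract candidate "trading logic factors" from sentiment-labeled
# texts by counting hits against a keyword lexicon (an assumption).
FACTOR_LEXICON = {"etf", "listing", "upgrade", "hack", "unlock", "lawsuit"}

def extract_factors(texts: list) -> Counter:
    counts = Counter()
    for text in texts:
        for word in text.lower().split():
            if word in FACTOR_LEXICON:
                counts[word] += 1
    return counts

bullish_texts = ["ETF approval soon", "mainnet upgrade shipping", "another etf filing"]
print(extract_factors(bullish_texts).most_common(2))
# → [('etf', 2), ('upgrade', 1)]
```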

5. Confidence Metrics (CO-A-05)

Purpose: Evaluate the quality and reliability of data sources and metrics themselves, ensuring model robustness.

Key Examples:

| Metric Name | Description |
| --- | --- |
| Core User Dependency | Proportion of content contributed by top 10% users |
| Active User Ratio | Percentage of active posting users in total community users |
| Data Coverage Completeness | Coverage scope of metrics across time and communities |
| Noise Filtering Rate | Proportion of valid data after filtering bot/invalid content |

Application Example:

  • High "core user dependency" may indicate metric volatility relies on a few users, requiring cautious interpretation.

  • Low "active user ratio" suggests sparse data or high volatility.
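For instance, "core user dependency" can be sketched as the message share of the top 10% most active users. The rounding behavior for small user counts is an implementation assumption:

```python
# Sketch: "core user dependency" = share of messages contributed by the
# top 10% most active users. Rounding for small user counts is assumed.
def core_user_dependency(msg_counts_per_user: list) -> float:
    ranked = sorted(msg_counts_per_user, reverse=True)
    top_n = max(1, len(ranked) // 10)   # top 10%, at least one user
    return sum(ranked[:top_n]) / sum(ranked)

# 10 users; the most active one posts 90 of 180 messages
counts = [90, 20, 15, 12, 10, 9, 8, 7, 5, 4]
print(core_user_dependency(counts))   # → 0.5
```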

2.3 Metric Construction Methodology: Tag × Metric Synergy

Building high-quality metrics requires following these key steps:

2.3.1 Define Modeling Objectives

The first step is to clarify the application scenario, which determines what information to extract from the vast text data. Common objectives include:

  • Token heat monitoring: Identify "which assets are rapidly gaining attention."

  • Event outbreak tracking: Detect "whether event propagation is accelerating."

  • Sentiment opinion research: Determine "if KOL attitudes toward a project are shifting."

  • Factor model development: Provide "private behavior factors" for quantitative models.

Metric design must serve these goals and accommodate strategy usage (timeliness vs. stability, full-market vs. subset).

2.3.2 Use Tags to Define Context Boundaries

Tags provide "information boundaries": Which users/communities should be included in calculations? What content is noise vs. signal? Does an extreme metric originate from credible sources?

| Type | Role Description |
| --- | --- |
| Data Source Definition | Determine which communities/users' content should be included: e.g., official announcement groups vs. speculative discussion groups vs. meme culture groups |
| Entity Recognition | Extract key content from messages: tokens, protocols, events, KOLs, etc. |
| Propagation Tracing | Identify event origins and diffusion paths, and determine key nodes: first mentioners, opinion leaders, forwarding chains, etc. |

Example: How Community Tags Help Understand Data Sources

A Telegram group tagged as:

  • Group type: General market discussion

  • Token coverage tags: [BTC, ETH, SOL, ARB]

  • Preference tags: Mainstream assets, occasional hot topic coverage

👉 This means the group naturally pays less attention to "small-cap tokens." If a small token suddenly sees high discussion frequency here, it could signal unusual significance; equally, it could be sporadic noise. Tags help distinguish "signal" from "noise" when constructing metrics.

2.3.3 Metric Construction Methods

Within the context defined by tags, metrics extract quantifiable, modelable features from text data, such as:

  • Heat metrics: Message volume, mention frequency, interaction count, number of speakers.

  • Structural metrics: Propagation path depth, KOL ratio, first mention time.

  • Sentiment metrics: Sentiment polarity distribution, sentiment entropy, extreme sentiment intensity.

Construction relies on tags:

  • Community tags determine which groups are included in a metric’s calculation.

  • Token tags define the entity boundary for metric statistics.

  • User tags decide whether to apply KOL weights or filter out bot noise.

  • Event tags help extract temporal paths, propagation windows, etc.
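The bullets above can be sketched as a tag-driven filter applied before metric calculation. Field names and tag values are illustrative assumptions:

```python
# Sketch of tag-driven metric construction: tags define which messages
# enter a metric. Field names and tag values are illustrative assumptions.
messages = [
    {"token": "SOL", "user_tags": ["KOL"], "group_tags": ["High Activity"]},
    {"token": "SOL", "user_tags": ["Suspected Bot"], "group_tags": ["High Activity"]},
    {"token": "ETH", "user_tags": ["Retail Investor"], "group_tags": ["Zombie Group"]},
]

def filtered_volume(msgs, token, exclude_user_tags=frozenset({"Suspected Bot"})):
    # token tags bound the entity; user tags filter out bot noise
    return sum(
        1 for m in msgs
        if m["token"] == token and not (exclude_user_tags & set(m["user_tags"]))
    )

print(filtered_volume(messages, "SOL"))   # → 1 (the bot message is excluded)
```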

2.3.4 Tag × Metric Synergy Value

Tags provide boundaries and context for metrics; metrics give tags quantitative expression and modeling power. Together, they form a "traceable, interpretable, modelable" data asset.

| What Tags Provide... | What Metrics Solve... |
| --- | --- |
| Context, Structure, Semantics | Features, Trends, Signals |
| Data Interpretability | Data Modelability |
| Filtering and Categorization Capabilities | Quantification and Comparison Capabilities |

👉 Only by viewing metrics within the correct tag context can we make reasonable judgments. For example:

  • A small token repeatedly mentioned in "speculative groups" may indicate opportunities.

  • Frequent project team messages in "official groups" do not necessarily imply rising community heat.

2.4 Structured Data Process: Traceable Metric Construction Pipeline

Cryptoracle’s private data system aims to build an interpretable, traceable, and modelable private behavior metric system. Its construction follows a five-step pipeline:

Tag Identification → Data Attribution → Feature Extraction → Metric Construction → Model/Strategy Application
↑--------------------------------- (Interpretability Feedback) ---------------------------------↓

This ensures every metric can trace back to its source context, belonging entity, construction logic, and strategy performance, forming a self-consistent explanation chain.

2.4.1 Structured Path for Raw Content

Raw social content (unstructured text)
↓
Entity recognition (user / community / token / event)
↓
Content tagging
  • Influence tags (speaker level, historical behavior)
  • Token attribution (mentioned entity mapping)
  • Sentiment tags (polarity/intensity)
  • Event tags (type, location, related objects)
↓
Metric calculation
  • Heat metrics (message volume, heat change rate)
  • Sentiment metrics (extreme sentiment intensity, consistency index)
  • Structural metrics (divergence rate, drift rate)
↓
Structured output (DataFrame: standard metric format)
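The structured path above can be sketched end to end. The entity-recognition and sentiment steps are stubbed with trivial rules purely for illustration; the real system uses LLM-based parsing:

```python
# Sketch of the structured path: raw text → entities → tags → metric row.
# The NLP steps are stubbed with trivial string rules for illustration only.
def structure_message(raw: dict) -> dict:
    text = raw["text"]
    # entity recognition stub: first known ticker found in the text
    token = next((t for t in ("SOL", "ETH", "BTC") if t in text), None)
    # sentiment tagging stub: crude bullish-signal check
    polarity = 1 if "🚀" in text or "pump" in text.lower() else 0
    return {
        "coin": token,
        "user": raw["user"],
        "group": raw["group"],
        "sentiment_polarity": polarity,
    }

row = structure_message(
    {"text": "SOL is pumping!! 🚀🚀", "user": "@CryptoWhale", "group": "Telegram Group A"}
)
print(row["coin"], row["sentiment_polarity"])   # → SOL 1
```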

2.4.2 Metric Output Example

| timestamp | coin | community_volume | sentiment_positive | sentiment_negative | sentiment_extreme | emotion_consistency | emotion_divergence | dominant_emotion | speaker_num | group_num |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2025/7/1 14:00 | ETH | 4321 | 2300 | 1800 | 490 | 0.78 | 0.21 | positive | 912 | 37 |
| 2025/7/1 14:00 | SOL | 2750 | 1200 | 1350 | 290 | 0.49 | 0.52 | negative | 610 | 22 |

Field Explanation:

  • community_volume: Total community messages related to the token.

  • sentiment_positive/negative: Number of positive/negative messages.

  • sentiment_extreme: Number of extreme sentiment messages (e.g., |intensity| > 0.9).

  • emotion_consistency: Consistency index (e.g., stability measure of skew distribution).

  • emotion_divergence: Divergence index (degree of polarity difference).

  • dominant_emotion: Dominant sentiment direction.

  • speaker_num / group_num: Number of active users / active groups.
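A minimal sketch of how message-level sentiment scores might be aggregated into these output fields. The scoring scale is an assumption; the extreme cutoff follows the field explanation above:

```python
# Sketch: aggregate message-level sentiment scores (assumed range [-1, 1])
# into output fields. The |0.9| extreme cutoff follows the field explanation.
def aggregate_sentiment(scores: list) -> dict:
    pos = sum(1 for s in scores if s > 0)
    neg = sum(1 for s in scores if s < 0)
    extreme = sum(1 for s in scores if abs(s) > 0.9)
    return {
        "sentiment_positive": pos,
        "sentiment_negative": neg,
        "sentiment_extreme": extreme,
        "dominant_emotion": "positive" if pos >= neg else "negative",
    }

print(aggregate_sentiment([0.95, 0.4, -0.2, 0.6, -0.99]))
# → {'sentiment_positive': 3, 'sentiment_negative': 2, 'sentiment_extreme': 2, 'dominant_emotion': 'positive'}
```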


3. Service & Application Layer

Objective: Deliver data value to users efficiently through diverse carriers, catering to different roles.

3.1 Visual Analytics Platform

  • Drill-down interaction: Clicking chart elements (e.g., heat peaks) drills down to raw data (e.g., corresponding time’s message records, related users).

  • Customizable dashboards: Users can drag-and-drop components (trend charts, radar charts, timelines) to generate dedicated dashboards based on tag combinations (e.g., "BTC + Twitter + positive events" dashboard).

3.2 API Service Platform

  • Standardized interfaces: Provide tag data queries (e.g., "get list of tokens mentioned by 'top KOLs' in the last 24 hours") and real-time metric subscriptions (e.g., "trigger a callback when 'negative tag density' exceeds a threshold").

  • Documentation & debugging: Includes API documentation and debugging tools, and supports example code generation for common scenarios (e.g., a Python call to the heat query interface).
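A hedged usage sketch of such an interface in Python. The endpoint URL, path, query parameters, and auth scheme below are invented for illustration only and do not describe Cryptoracle's real API:

```python
import json
import urllib.request

# Hypothetical API usage sketch. The URL, parameters, and auth scheme are
# invented for illustration; consult the real API documentation.
def build_heat_url(token: str, hours: int = 24) -> str:
    return f"https://api.example.com/v1/heat?token={token}&window={hours}h"

def query_heat(token: str, api_key: str, hours: int = 24) -> dict:
    req = urllib.request.Request(
        build_heat_url(token, hours),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:   # network call; needs a live endpoint
        return json.load(resp)

# Example (not executed here): query_heat("SOL", "<API_KEY>")
```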
